All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH net-next 0/8] ethtool: Add ability to control transceiver modules
@ 2021-08-09 10:21 Ido Schimmel
  2021-08-09 10:21 ` [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode Ido Schimmel
                   ` (7 more replies)
  0 siblings, 8 replies; 39+ messages in thread
From: Ido Schimmel @ 2021-08-09 10:21 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, andrew, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

This patchset extends the ethtool netlink API to allow user space to
control transceiver modules. Two specific APIs are added, but the plan
is to extend the interface with more APIs in the future (see "Future
plans").

This submission is a complete rework of a previous submission [1] that
tried to achieve the same goal by allowing user space to write to the
EEPROMs of these modules. It was rejected as it could have enabled user
space binary blob drivers.

However, the main issue is that by directly writing to some pages of
these EEPROMs, we are interfering with the entity that is controlling
the modules (kernel / device firmware). In addition, some functionality
cannot be implemented solely by writing to the EEPROM, as it requires
the assertion / de-assertion of hardware signals (e.g., "ResetL" pin in
SFF-8636).

Motivation
==========

The kernel can currently dump the contents of module EEPROMs to user
space via the ethtool legacy ioctl API or the new netlink API. These
dumps can then be parsed by ethtool(8) according to the specification
that defines the memory map of the EEPROM. For example, SFF-8636 [2] for
QSFP and CMIS [3] for QSFP-DD.

In addition to read-only elements, these specifications also define
writeable elements that can be used to control the behavior of the
module. For example, controlling whether the module is put in low or
high power mode to limit its power consumption.

The CMIS specification even defines a message exchange mechanism (CDB,
Command Data Block) on top of the module's memory map. This allows the
host to send various commands to the module. For example, to update its
firmware.

Implementation
==============

The ethtool netlink API is extended with two new messages,
'ETHTOOL_MSG_MODULE_SET' and 'ETHTOOL_MSG_MODULE_GET', that allow user
space to set and get transceiver module parameters. Specifically, the
'ETHTOOL_A_MODULE_LOW_POWER_ENABLED' attribute allows user space to
control the low power mode of the module in order to limit its power
consumption. See detailed description in patch #1.

In addition, patch #2 adds the 'ETHTOOL_MSG_MODULE_RESET_ACT' message
(with a corresponding notification) that allows user space to reset the
module in order to get out of fault state.

The user API is designed to be generic enough so that it could be used
for modules with different memory maps (e.g., SFF-8636, CMIS).

The only implementation of the device driver API in this series is for a
MAC driver (mlxsw) where the module is controlled by the device's
firmware, but it is designed to be generic enough so that it could also
be used by implementations where the module is controlled by the kernel.

Testing and introspection
=========================

See detailed description in patches #1 and #2.

Patchset overview
=================

Patch #1 adds the initial infrastructure in ethtool along with the
ability to control transceiver modules' low power mode.

Patch #2 adds the ability to reset transceiver modules.

Patches #3-#6 add required device registers in mlxsw.

Patch #7 implements in mlxsw the ethtool operations added in patch #1.

Patch #8 implements in mlxsw the ethtool operation added in patch #2.

Future plans
============

* Extend 'ETHTOOL_MSG_MODULE_SET' to control Tx output among other
attributes.

* Add new ethtool message(s) to update firmware on transceiver modules.

* Extend ethtool(8) to parse more diagnostic information from CMIS
modules. No kernel changes required.

[1] https://lore.kernel.org/netdev/20210623075925.2610908-1-idosch@idosch.org/
[2] https://members.snia.org/document/dl/26418
[3] http://www.qsfp-dd.com/wp-content/uploads/2021/05/CMIS5p0.pdf

Ido Schimmel (8):
  ethtool: Add ability to control transceiver modules' low power mode
  ethtool: Add ability to reset transceiver modules
  mlxsw: reg: Add fields to PMAOS register
  mlxsw: Make PMAOS pack function more generic
  mlxsw: reg: Add Port Module Memory Map Properties register
  mlxsw: reg: Add Management Cable IO and Notifications register
  mlxsw: Add ability to control transceiver modules' low power mode
  mlxsw: Add ability to reset transceiver modules

 Documentation/networking/ethtool-netlink.rst  |  83 +++++-
 .../net/ethernet/mellanox/mlxsw/core_env.c    | 232 ++++++++++++++-
 .../net/ethernet/mellanox/mlxsw/core_env.h    |  11 +
 drivers/net/ethernet/mellanox/mlxsw/minimal.c |  34 +++
 drivers/net/ethernet/mellanox/mlxsw/reg.h     | 140 +++++++++-
 .../mellanox/mlxsw/spectrum_ethtool.c         |  68 +++++
 include/linux/ethtool.h                       |  12 +
 include/uapi/linux/ethtool_netlink.h          |  18 ++
 net/ethtool/Makefile                          |   2 +-
 net/ethtool/module.c                          | 264 ++++++++++++++++++
 net/ethtool/netlink.c                         |  26 ++
 net/ethtool/netlink.h                         |   6 +
 12 files changed, 887 insertions(+), 9 deletions(-)
 create mode 100644 net/ethtool/module.c

-- 
2.31.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-09 10:21 [RFC PATCH net-next 0/8] ethtool: Add ability to control transceiver modules Ido Schimmel
@ 2021-08-09 10:21 ` Ido Schimmel
  2021-08-09 14:28   ` Andrew Lunn
  2021-08-09 10:21 ` [RFC PATCH net-next 2/8] ethtool: Add ability to reset transceiver modules Ido Schimmel
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: Ido Schimmel @ 2021-08-09 10:21 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, andrew, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and
'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver
modules parameters and retrieve their status.

The first parameter to control is the low power mode of the module. It
is only relevant for paged memory modules, as flat memory modules always
operate in low power mode.

When a paged memory module is in low power mode, its power consumption
is reduced to the minimum, the management interface towards the host is
available and the data path is deactivated.

User space can choose to put modules that are not currently in use in
low power mode and transition them to high power mode before putting the
associated ports administratively up.

Transitioning into low power mode means loss of carrier, so error is
returned when the netdev is administratively up.

The user API is designed to be generic enough so that it could be used
for modules with different memory maps (e.g., SFF-8636, CMIS).

The only implementation of the device driver API in this series is for a
MAC driver (mlxsw) where the module is controlled by the device's
firmware, but it is designed to be generic enough so that it could also
be used by implementations where the module is controlled by the CPU.

CMIS testing
============

 # ethtool -m swp11
 Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
 ...
 Module State                              : 0x03 (ModuleReady)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : Off

The module is not in low power mode, as it is not forced by hardware
(LowPwrAllowRequestHW is off) or by software (LowPwrRequestSW is off).

The low power mode can be queried from the kernel. In case
LowPwrAllowRequestHW was on, the kernel would need to take into account
the state of the LowPwrRequestHW signal, which is not visible to user
space.

 $ ethtool --show-module swp11
 Module parameters for swp11:
 low-power false

Turn on low power mode:

 # ethtool --set-module swp11 low-power on

Query low power mode again:

 $ ethtool --show-module swp11
 Module parameters for swp11:
 low-power true

Verify with the data read from the EEPROM:

 # ethtool -m swp11
 Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
 ...
 Module State                              : 0x01 (ModuleLowPwr)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : On

Allow the module to transition out of low power mode:

 # ethtool --set-module swp11 low-power off

Query low power mode again:

 $ ethtool --show-module swp11
 Module parameters for swp11:
 low-power false

Verify with the data read from the EEPROM:

 # ethtool -m swp11
 Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
 ...
 Module State                              : 0x03 (ModuleReady)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : Off

SFF-8636 testing
================

 # ethtool -m swp13
 Identifier                                : 0x11 (QSFP28)
 ...
 Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) enabled
 Power set                                 : Off
 Power override                            : On
 ...
 Transmit avg optical power (Channel 1)    : 0.7733 mW / -1.12 dBm
 Transmit avg optical power (Channel 2)    : 0.7649 mW / -1.16 dBm
 Transmit avg optical power (Channel 3)    : 0.7743 mW / -1.11 dBm
 Transmit avg optical power (Channel 4)    : 0.7837 mW / -1.06 dBm
 Rcvr signal avg optical power(Channel 1)  : 0.9186 mW / -0.37 dBm
 Rcvr signal avg optical power(Channel 2)  : 0.9136 mW / -0.39 dBm
 Rcvr signal avg optical power(Channel 3)  : 0.8986 mW / -0.46 dBm
 Rcvr signal avg optical power(Channel 4)  : 0.8701 mW / -0.60 dBm

The module is not in low power mode, as it is not forced by hardware
(Power override is on) or by software (Power set is off).

The low power mode can be queried from the kernel. In case Power
override was off, the kernel would need to take into account the state
of the LPMode signal, which is not visible to user space.

 $ ethtool --show-module swp13
 Module parameters for swp13:
 low-power false

Turn on low power mode:

 # ethtool --set-module swp13 low-power on

Query low power mode again:

 $ ethtool --show-module swp13
 Module parameters for swp13:
 low-power true

Verify with the data read from the EEPROM:

 # ethtool -m swp13
 Identifier                                : 0x11 (QSFP28)
 ...
 Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) not enabled
 Power set                                 : On
 Power override                            : On
 ...
 Transmit avg optical power (Channel 1)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 2)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 3)    : 0.0000 mW / -inf dBm
 Transmit avg optical power (Channel 4)    : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 1)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 2)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 3)  : 0.0000 mW / -inf dBm
 Rcvr signal avg optical power(Channel 4)  : 0.0000 mW / -inf dBm

Allow the module to transition out of low power mode:

 # ethtool --set-module swp13 low-power off

Query low power mode again:

 $ ethtool --show-module swp13
 Module parameters for swp13:
 low-power false

Verify with the data read from the EEPROM:

 # ethtool -m swp13
 Identifier                                : 0x11 (QSFP28)
 ...
 Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) enabled
 Power set                                 : Off
 Power override                            : On
 ...
 Transmit avg optical power (Channel 1)    : 0.7783 mW / -1.09 dBm
 Transmit avg optical power (Channel 2)    : 0.7806 mW / -1.08 dBm
 Transmit avg optical power (Channel 3)    : 0.7885 mW / -1.03 dBm
 Transmit avg optical power (Channel 4)    : 0.7985 mW / -0.98 dBm
 Rcvr signal avg optical power(Channel 1)  : 0.9124 mW / -0.40 dBm
 Rcvr signal avg optical power(Channel 2)  : 0.9071 mW / -0.42 dBm
 Rcvr signal avg optical power(Channel 3)  : 0.8993 mW / -0.46 dBm
 Rcvr signal avg optical power(Channel 4)  : 0.8644 mW / -0.63 dBm

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 Documentation/networking/ethtool-netlink.rst |  55 +++++-
 include/linux/ethtool.h                      |   9 +
 include/uapi/linux/ethtool_netlink.h         |  16 ++
 net/ethtool/Makefile                         |   2 +-
 net/ethtool/module.c                         | 184 +++++++++++++++++++
 net/ethtool/netlink.c                        |  19 ++
 net/ethtool/netlink.h                        |   4 +
 7 files changed, 286 insertions(+), 3 deletions(-)
 create mode 100644 net/ethtool/module.c

diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index c86628e6a235..07eac5bc9cfc 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -179,7 +179,7 @@ according to message purpose:
 
 Userspace to kernel:
 
-  ===================================== ================================
+  ===================================== =================================
   ``ETHTOOL_MSG_STRSET_GET``            get string set
   ``ETHTOOL_MSG_LINKINFO_GET``          get link settings
   ``ETHTOOL_MSG_LINKINFO_SET``          set link settings
@@ -213,7 +213,9 @@ Userspace to kernel:
   ``ETHTOOL_MSG_MODULE_EEPROM_GET``     read SFP module EEPROM
   ``ETHTOOL_MSG_STATS_GET``             get standard statistics
   ``ETHTOOL_MSG_PHC_VCLOCKS_GET``       get PHC virtual clocks info
-  ===================================== ================================
+  ``ETHTOOL_MSG_MODULE_SET``            set transceiver module parameters
+  ``ETHTOOL_MSG_MODULE_GET``            get transceiver module parameters
+  ===================================== =================================
 
 Kernel to userspace:
 
@@ -252,6 +254,7 @@ Kernel to userspace:
   ``ETHTOOL_MSG_MODULE_EEPROM_GET_REPLY``  read SFP module EEPROM
   ``ETHTOOL_MSG_STATS_GET_REPLY``          standard statistics
   ``ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY``    PHC virtual clocks info
+  ``ETHTOOL_MSG_MODULE_GET_REPLY``         transceiver module parameters
   ======================================== =================================
 
 ``GET`` requests are sent by userspace applications to retrieve device
@@ -1498,6 +1501,52 @@ Kernel response contents:
   ``ETHTOOL_A_PHC_VCLOCKS_INDEX``       s32     PHC index array
   ====================================  ======  ==========================
 
+MODULE_GET
+==========
+
+Gets transceiver module parameters.
+
+Request contents:
+
+  =====================================  ======  ==========================
+  ``ETHTOOL_A_MODULE_HEADER``            nested  request header
+  =====================================  ======  ==========================
+
+Kernel response contents:
+
+  ======================================  ======  ==========================
+  ``ETHTOOL_A_MODULE_HEADER``             nested  reply header
+  ``ETHTOOL_A_MODULE_LOW_POWER_ENABLED``  bool    low power mode is enabled
+  ======================================  ======  ==========================
+
+The optional ``ETHTOOL_A_MODULE_LOW_POWER_ENABLED`` attribute encodes whether
+low power mode is forced by the host.
+
+MODULE_SET
+==========
+
+Sets transceiver module parameters.
+
+Request contents:
+
+  ======================================  ======  ==========================
+  ``ETHTOOL_A_MODULE_HEADER``             nested  request header
+  ``ETHTOOL_A_MODULE_LOW_POWER_ENABLED``  bool    low power mode is enabled
+  ======================================  ======  ==========================
+
+When set, the optional ``ETHTOOL_A_MODULE_LOW_POWER_ENABLED`` attribute is used
+to force the module to stay in low power mode or force it back into low power
+mode. When cleared, capable modules can transition to high power mode.
+
+To avoid changes to the operational state of the device, low power mode can
+only be set when the device is administratively down.
+
+For SFF-8636 modules, low power mode is forced by the host according to table
+6-10 in revision 2.10a of the specification.
+
+For CMIS modules, low power mode is forced by the host according to table 6-12
+in revision 5.0 of the specification.
+
 Request translation
 ===================
 
@@ -1597,4 +1646,6 @@ are netlink only.
   n/a                                 ``ETHTOOL_MSG_CABLE_TEST_TDR_ACT``
   n/a                                 ``ETHTOOL_MSG_TUNNEL_INFO_GET``
   n/a                                 ``ETHTOOL_MSG_PHC_VCLOCKS_GET``
+  n/a                                 ``ETHTOOL_MSG_MODULE_GET``
+  n/a                                 ``ETHTOOL_MSG_MODULE_SET``
   =================================== =====================================
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 4711b96dae0c..04286debdcdc 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -570,6 +570,10 @@ struct ethtool_module_eeprom {
  * @get_eth_ctrl_stats: Query some of the IEEE 802.3 MAC Ctrl statistics.
  * @get_rmon_stats: Query some of the RMON (RFC 2819) statistics.
  *	Set %ranges to a pointer to zero-terminated array of byte ranges.
+ * @get_module_low_power: Get the low power mode status of the plug-in module
+ *	used by the network device.
+ * @set_module_low_power: Set the low power mode status of the plug-in module
+ *	used by the network device.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
@@ -689,6 +693,11 @@ struct ethtool_ops {
 	void	(*get_rmon_stats)(struct net_device *dev,
 				  struct ethtool_rmon_stats *rmon_stats,
 				  const struct ethtool_rmon_hist_range **ranges);
+	int	(*get_module_low_power)(struct net_device *dev,
+					bool *p_low_power,
+					struct netlink_ext_ack *extack);
+	int	(*set_module_low_power)(struct net_device *dev, bool low_power,
+					struct netlink_ext_ack *extack);
 };
 
 int ethtool_check_ops(const struct ethtool_ops *ops);
diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h
index b3b93710eff7..72fb821f3928 100644
--- a/include/uapi/linux/ethtool_netlink.h
+++ b/include/uapi/linux/ethtool_netlink.h
@@ -47,6 +47,8 @@ enum {
 	ETHTOOL_MSG_MODULE_EEPROM_GET,
 	ETHTOOL_MSG_STATS_GET,
 	ETHTOOL_MSG_PHC_VCLOCKS_GET,
+	ETHTOOL_MSG_MODULE_GET,
+	ETHTOOL_MSG_MODULE_SET,
 
 	/* add new constants above here */
 	__ETHTOOL_MSG_USER_CNT,
@@ -90,6 +92,8 @@ enum {
 	ETHTOOL_MSG_MODULE_EEPROM_GET_REPLY,
 	ETHTOOL_MSG_STATS_GET_REPLY,
 	ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY,
+	ETHTOOL_MSG_MODULE_GET_REPLY,
+	ETHTOOL_MSG_MODULE_NTF,
 
 	/* add new constants above here */
 	__ETHTOOL_MSG_KERNEL_CNT,
@@ -831,6 +835,18 @@ enum {
 	ETHTOOL_A_STATS_RMON_MAX = (__ETHTOOL_A_STATS_RMON_CNT - 1)
 };
 
+/* MODULE */
+
+enum {
+	ETHTOOL_A_MODULE_UNSPEC,
+	ETHTOOL_A_MODULE_HEADER,		/* nest - _A_HEADER_* */
+	ETHTOOL_A_MODULE_LOW_POWER_ENABLED,	/* u8 */
+
+	/* add new constants above here */
+	__ETHTOOL_A_MODULE_CNT,
+	ETHTOOL_A_MODULE_MAX = (__ETHTOOL_A_MODULE_CNT - 1)
+};
+
 /* generic netlink info */
 #define ETHTOOL_GENL_NAME "ethtool"
 #define ETHTOOL_GENL_VERSION 1
diff --git a/net/ethtool/Makefile b/net/ethtool/Makefile
index 0a19470efbfb..b76432e70e6b 100644
--- a/net/ethtool/Makefile
+++ b/net/ethtool/Makefile
@@ -7,4 +7,4 @@ obj-$(CONFIG_ETHTOOL_NETLINK)	+= ethtool_nl.o
 ethtool_nl-y	:= netlink.o bitset.o strset.o linkinfo.o linkmodes.o \
 		   linkstate.o debug.o wol.o features.o privflags.o rings.o \
 		   channels.o coalesce.o pause.o eee.o tsinfo.o cabletest.o \
-		   tunnels.o fec.o eeprom.o stats.o phc_vclocks.o
+		   tunnels.o fec.o eeprom.o stats.o phc_vclocks.o module.o
diff --git a/net/ethtool/module.c b/net/ethtool/module.c
new file mode 100644
index 000000000000..947f2188d725
--- /dev/null
+++ b/net/ethtool/module.c
@@ -0,0 +1,184 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/ethtool.h>
+
+#include "netlink.h"
+#include "common.h"
+#include "bitset.h"
+
+struct module_req_info {
+	struct ethnl_req_info		base;
+};
+
+struct module_reply_data {
+	struct ethnl_reply_data		base;
+	u8 low_power:1,
+	   low_power_valid:1;
+};
+
+#define MODULE_REPDATA(__reply_base) \
+	container_of(__reply_base, struct module_reply_data, base)
+
+/* MODULE_GET */
+
+const struct nla_policy ethnl_module_get_policy[ETHTOOL_A_MODULE_HEADER + 1] = {
+	[ETHTOOL_A_MODULE_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy),
+};
+
+static int module_get_low_power(struct net_device *dev,
+				struct module_reply_data *data,
+				struct netlink_ext_ack *extack)
+{
+	const struct ethtool_ops *ops = dev->ethtool_ops;
+	bool low_power;
+	int ret;
+
+	if (!ops->get_module_low_power)
+		return 0;
+
+	ret = ops->get_module_low_power(dev, &low_power, extack);
+	if (ret < 0)
+		return ret;
+
+	data->low_power = low_power;
+	data->low_power_valid = true;
+
+	return 0;
+}
+
+static int module_prepare_data(const struct ethnl_req_info *req_base,
+			       struct ethnl_reply_data *reply_base,
+			       struct genl_info *info)
+{
+	struct module_reply_data *data = MODULE_REPDATA(reply_base);
+	struct netlink_ext_ack *extack = info ? info->extack : NULL;
+	struct net_device *dev = reply_base->dev;
+	int ret;
+
+	ret = ethnl_ops_begin(dev);
+	if (ret < 0)
+		return ret;
+
+	ret = module_get_low_power(dev, data, extack);
+	if (ret < 0)
+		goto out_complete;
+
+out_complete:
+	ethnl_ops_complete(dev);
+	return ret;
+}
+
+static int module_reply_size(const struct ethnl_req_info *req_base,
+			     const struct ethnl_reply_data *reply_base)
+{
+	struct module_reply_data *data = MODULE_REPDATA(reply_base);
+	int len = 0;
+
+	if (data->low_power_valid)
+		len += nla_total_size(sizeof(u8)); /* _MODULE_LOW_POWER_ENABLED */
+
+	return len;
+}
+
+static int module_fill_reply(struct sk_buff *skb,
+			     const struct ethnl_req_info *req_base,
+			     const struct ethnl_reply_data *reply_base)
+{
+	const struct module_reply_data *data = MODULE_REPDATA(reply_base);
+
+	if (data->low_power_valid &&
+	    nla_put_u8(skb, ETHTOOL_A_MODULE_LOW_POWER_ENABLED,
+		       data->low_power))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+const struct ethnl_request_ops ethnl_module_request_ops = {
+	.request_cmd		= ETHTOOL_MSG_MODULE_GET,
+	.reply_cmd		= ETHTOOL_MSG_MODULE_GET_REPLY,
+	.hdr_attr		= ETHTOOL_A_MODULE_HEADER,
+	.req_info_size		= sizeof(struct module_req_info),
+	.reply_data_size	= sizeof(struct module_reply_data),
+
+	.prepare_data		= module_prepare_data,
+	.reply_size		= module_reply_size,
+	.fill_reply		= module_fill_reply,
+};
+
+/* MODULE_SET */
+
+const struct nla_policy ethnl_module_set_policy[ETHTOOL_A_MODULE_LOW_POWER_ENABLED + 1] = {
+	[ETHTOOL_A_MODULE_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy),
+	[ETHTOOL_A_MODULE_LOW_POWER_ENABLED] = NLA_POLICY_MAX(NLA_U8, 1),
+};
+
+static int module_set_low_power(struct net_device *dev, struct nlattr **tb,
+				bool *p_mod, struct netlink_ext_ack *extack)
+{
+	const struct ethtool_ops *ops = dev->ethtool_ops;
+	bool low_power_new, low_power;
+	int ret;
+
+	if (!tb[ETHTOOL_A_MODULE_LOW_POWER_ENABLED])
+		return 0;
+
+	if (!ops->get_module_low_power || !ops->set_module_low_power) {
+		NL_SET_ERR_MSG_ATTR(extack,
+				    tb[ETHTOOL_A_MODULE_LOW_POWER_ENABLED],
+				    "Setting low power mode is not supported by this device");
+		return -EOPNOTSUPP;
+	}
+
+	if (netif_running(dev)) {
+		NL_SET_ERR_MSG_ATTR(extack,
+				    tb[ETHTOOL_A_MODULE_LOW_POWER_ENABLED],
+				    "Cannot set low power mode when port is administratively up");
+		return -EINVAL;
+	}
+
+	low_power_new = !!nla_get_u8(tb[ETHTOOL_A_MODULE_LOW_POWER_ENABLED]);
+	ret = ops->get_module_low_power(dev, &low_power, extack);
+	if (ret < 0)
+		return ret;
+	*p_mod = low_power_new != low_power;
+
+	return ops->set_module_low_power(dev, low_power_new, extack);
+}
+
+int ethnl_set_module(struct sk_buff *skb, struct genl_info *info)
+{
+	struct ethnl_req_info req_info = {};
+	struct nlattr **tb = info->attrs;
+	struct net_device *dev;
+	bool mod = false;
+	int ret;
+
+	ret = ethnl_parse_header_dev_get(&req_info, tb[ETHTOOL_A_MODULE_HEADER],
+					 genl_info_net(info), info->extack,
+					 true);
+	if (ret < 0)
+		return ret;
+	dev = req_info.dev;
+
+	rtnl_lock();
+	ret = ethnl_ops_begin(dev);
+	if (ret < 0)
+		goto out_rtnl;
+
+	ret = module_set_low_power(dev, tb, &mod, info->extack);
+	if (ret < 0)
+		goto out_ops;
+
+	if (!mod)
+		goto out_ops;
+
+	ethtool_notify(dev, ETHTOOL_MSG_MODULE_NTF, NULL);
+
+out_ops:
+	ethnl_ops_complete(dev);
+out_rtnl:
+	rtnl_unlock();
+	dev_put(dev);
+	return ret;
+}
diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c
index 1797a0a90019..38b44c0291b1 100644
--- a/net/ethtool/netlink.c
+++ b/net/ethtool/netlink.c
@@ -282,6 +282,7 @@ ethnl_default_requests[__ETHTOOL_MSG_USER_CNT] = {
 	[ETHTOOL_MSG_MODULE_EEPROM_GET]	= &ethnl_module_eeprom_request_ops,
 	[ETHTOOL_MSG_STATS_GET]		= &ethnl_stats_request_ops,
 	[ETHTOOL_MSG_PHC_VCLOCKS_GET]	= &ethnl_phc_vclocks_request_ops,
+	[ETHTOOL_MSG_MODULE_GET]	= &ethnl_module_request_ops,
 };
 
 static struct ethnl_dump_ctx *ethnl_dump_context(struct netlink_callback *cb)
@@ -593,6 +594,7 @@ ethnl_default_notify_ops[ETHTOOL_MSG_KERNEL_MAX + 1] = {
 	[ETHTOOL_MSG_PAUSE_NTF]		= &ethnl_pause_request_ops,
 	[ETHTOOL_MSG_EEE_NTF]		= &ethnl_eee_request_ops,
 	[ETHTOOL_MSG_FEC_NTF]		= &ethnl_fec_request_ops,
+	[ETHTOOL_MSG_MODULE_NTF]	= &ethnl_module_request_ops,
 };
 
 /* default notification handler */
@@ -686,6 +688,7 @@ static const ethnl_notify_handler_t ethnl_notify_handlers[] = {
 	[ETHTOOL_MSG_PAUSE_NTF]		= ethnl_default_notify,
 	[ETHTOOL_MSG_EEE_NTF]		= ethnl_default_notify,
 	[ETHTOOL_MSG_FEC_NTF]		= ethnl_default_notify,
+	[ETHTOOL_MSG_MODULE_NTF]	= ethnl_default_notify,
 };
 
 void ethtool_notify(struct net_device *dev, unsigned int cmd, const void *data)
@@ -999,6 +1002,22 @@ static const struct genl_ops ethtool_genl_ops[] = {
 		.policy = ethnl_phc_vclocks_get_policy,
 		.maxattr = ARRAY_SIZE(ethnl_phc_vclocks_get_policy) - 1,
 	},
+	{
+		.cmd	= ETHTOOL_MSG_MODULE_GET,
+		.doit	= ethnl_default_doit,
+		.start	= ethnl_default_start,
+		.dumpit	= ethnl_default_dumpit,
+		.done	= ethnl_default_done,
+		.policy = ethnl_module_get_policy,
+		.maxattr = ARRAY_SIZE(ethnl_module_get_policy) - 1,
+	},
+	{
+		.cmd	= ETHTOOL_MSG_MODULE_SET,
+		.flags	= GENL_UNS_ADMIN_PERM,
+		.doit	= ethnl_set_module,
+		.policy = ethnl_module_set_policy,
+		.maxattr = ARRAY_SIZE(ethnl_module_set_policy) - 1,
+	},
 };
 
 static const struct genl_multicast_group ethtool_nl_mcgrps[] = {
diff --git a/net/ethtool/netlink.h b/net/ethtool/netlink.h
index 077aac3929a8..cf0fcbfe3c5c 100644
--- a/net/ethtool/netlink.h
+++ b/net/ethtool/netlink.h
@@ -337,6 +337,7 @@ extern const struct ethnl_request_ops ethnl_fec_request_ops;
 extern const struct ethnl_request_ops ethnl_module_eeprom_request_ops;
 extern const struct ethnl_request_ops ethnl_stats_request_ops;
 extern const struct ethnl_request_ops ethnl_phc_vclocks_request_ops;
+extern const struct ethnl_request_ops ethnl_module_request_ops;
 
 extern const struct nla_policy ethnl_header_policy[ETHTOOL_A_HEADER_FLAGS + 1];
 extern const struct nla_policy ethnl_header_policy_stats[ETHTOOL_A_HEADER_FLAGS + 1];
@@ -373,6 +374,8 @@ extern const struct nla_policy ethnl_fec_set_policy[ETHTOOL_A_FEC_AUTO + 1];
 extern const struct nla_policy ethnl_module_eeprom_get_policy[ETHTOOL_A_MODULE_EEPROM_I2C_ADDRESS + 1];
 extern const struct nla_policy ethnl_stats_get_policy[ETHTOOL_A_STATS_GROUPS + 1];
 extern const struct nla_policy ethnl_phc_vclocks_get_policy[ETHTOOL_A_PHC_VCLOCKS_HEADER + 1];
+extern const struct nla_policy ethnl_module_get_policy[ETHTOOL_A_MODULE_HEADER + 1];
+extern const struct nla_policy ethnl_module_set_policy[ETHTOOL_A_MODULE_LOW_POWER_ENABLED + 1];
 
 int ethnl_set_linkinfo(struct sk_buff *skb, struct genl_info *info);
 int ethnl_set_linkmodes(struct sk_buff *skb, struct genl_info *info);
@@ -391,6 +394,7 @@ int ethnl_tunnel_info_doit(struct sk_buff *skb, struct genl_info *info);
 int ethnl_tunnel_info_start(struct netlink_callback *cb);
 int ethnl_tunnel_info_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
 int ethnl_set_fec(struct sk_buff *skb, struct genl_info *info);
+int ethnl_set_module(struct sk_buff *skb, struct genl_info *info);
 
 extern const char stats_std_names[__ETHTOOL_STATS_CNT][ETH_GSTRING_LEN];
 extern const char stats_eth_phy_names[__ETHTOOL_A_STATS_ETH_PHY_CNT][ETH_GSTRING_LEN];
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH net-next 2/8] ethtool: Add ability to reset transceiver modules
  2021-08-09 10:21 [RFC PATCH net-next 0/8] ethtool: Add ability to control transceiver modules Ido Schimmel
  2021-08-09 10:21 ` [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode Ido Schimmel
@ 2021-08-09 10:21 ` Ido Schimmel
  2021-08-09 19:13   ` Andrew Lunn
  2021-08-09 10:21 ` [RFC PATCH net-next 3/8] mlxsw: reg: Add fields to PMAOS register Ido Schimmel
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: Ido Schimmel @ 2021-08-09 10:21 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, andrew, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

Add a new ethtool message, 'ETHTOOL_MSG_MODULE_RESET_ACT', which allows
user space to request a reset of transceiver modules. A successful reset
results in a notification being emitted to user space in the form of a
'ETHTOOL_MSG_MODULE_RESET_NTF' message.

Reset can be performed by either asserting the relevant hardware signal
("Reset" in CMIS / "ResetL" in SFF-8636) or by writing to the relevant
reset bit in the module's EEPROM (page 00h, byte 26, bit 3 in CMIS /
page 00h, byte 93, bit 7 in SFF-8636).

Reset is useful in order to allow a module to transition out of a fault
state. From section 6.3.2.12 in CMIS 5.0: "Except for a power cycle, the
only exit path from the ModuleFault state is to perform a module reset
by taking an action that causes the ResetS transition signal to become
TRUE (see Table 6-11)".

To avoid changes to the operational state of the device, reset can only
be performed when the device is administratively down.

Example usage:

 # ethtool --reset-module swp11
 netlink error: Cannot reset module when port is administratively up
 netlink error: Invalid argument

 # ip link set dev swp11 down

 # ethtool --reset-module swp11

Monitor notifications:

 $ ethtool --monitor
 listening...

 Module reset done for swp11

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 Documentation/networking/ethtool-netlink.rst | 28 +++++++
 include/linux/ethtool.h                      |  3 +
 include/uapi/linux/ethtool_netlink.h         |  2 +
 net/ethtool/module.c                         | 80 ++++++++++++++++++++
 net/ethtool/netlink.c                        |  7 ++
 net/ethtool/netlink.h                        |  2 +
 6 files changed, 122 insertions(+)

diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index 07eac5bc9cfc..4e4d0c6a943e 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -215,6 +215,7 @@ Userspace to kernel:
   ``ETHTOOL_MSG_PHC_VCLOCKS_GET``       get PHC virtual clocks info
   ``ETHTOOL_MSG_MODULE_SET``            set transceiver module parameters
   ``ETHTOOL_MSG_MODULE_GET``            get transceiver module parameters
+  ``ETHTOOL_MSG_MODULE_RESET_ACT``      action reset transceiver module
   ===================================== =================================
 
 Kernel to userspace:
@@ -255,6 +256,7 @@ Kernel to userspace:
   ``ETHTOOL_MSG_STATS_GET_REPLY``          standard statistics
   ``ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY``    PHC virtual clocks info
   ``ETHTOOL_MSG_MODULE_GET_REPLY``         transceiver module parameters
+  ``ETHTOOL_MSG_MODULE_RESET_NTF``         transceiver module reset
   ======================================== =================================
 
 ``GET`` requests are sent by userspace applications to retrieve device
@@ -1547,6 +1549,31 @@ For SFF-8636 modules, low power mode is forced by the host according to table
 For CMIS modules, low power mode is forced by the host according to table 6-12
 in revision 5.0 of the specification.
 
+MODULE_RESET_ACT
+================
+
+Resets the transceiver module to its initial state, as if it was just
+plugged-in. The Module State Machine (MSM) is reset to the "Reset" steady state
+and module's registers are reset to their default values.
+
+Action contents:
+
+  ======================================  ======  ==========================
+  ``ETHTOOL_A_MODULE_HEADER``             nested  request header
+  ======================================  ======  ==========================
+
+Upon a successful reset, a ``ETHTOOL_MSG_MODULE_RESET_NTF`` notification is
+sent to user space.
+
+To avoid changes to the operational state of the device, reset can only be
+performed when the device is administratively down.
+
+For SFF-8636 modules, reset can be implemented according to section 4.4.3 in
+revision 2.10a of the specification.
+
+For CMIS modules, reset can be implemented according to table 6-11 in revision
+5.0 of the specification.
+
 Request translation
 ===================
 
@@ -1648,4 +1675,5 @@ are netlink only.
   n/a                                 ``ETHTOOL_MSG_PHC_VCLOCKS_GET``
   n/a                                 ``ETHTOOL_MSG_MODULE_GET``
   n/a                                 ``ETHTOOL_MSG_MODULE_SET``
+  n/a                                 ``ETHTOOL_MSG_MODULE_RESET_ACT``
   =================================== =====================================
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 04286debdcdc..ab67b061be32 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -574,6 +574,7 @@ struct ethtool_module_eeprom {
  *	used by the network device.
  * @set_module_low_power: Set the low power mode status of the plug-in module
  *	used by the network device.
+ * @reset_module: Reset the plug-in module used by the network device.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
@@ -698,6 +699,8 @@ struct ethtool_ops {
 					struct netlink_ext_ack *extack);
 	int	(*set_module_low_power)(struct net_device *dev, bool low_power,
 					struct netlink_ext_ack *extack);
+	int	(*reset_module)(struct net_device *dev,
+				struct netlink_ext_ack *extack);
 };
 
 int ethtool_check_ops(const struct ethtool_ops *ops);
diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h
index 72fb821f3928..4e1c1baad250 100644
--- a/include/uapi/linux/ethtool_netlink.h
+++ b/include/uapi/linux/ethtool_netlink.h
@@ -49,6 +49,7 @@ enum {
 	ETHTOOL_MSG_PHC_VCLOCKS_GET,
 	ETHTOOL_MSG_MODULE_GET,
 	ETHTOOL_MSG_MODULE_SET,
+	ETHTOOL_MSG_MODULE_RESET_ACT,
 
 	/* add new constants above here */
 	__ETHTOOL_MSG_USER_CNT,
@@ -94,6 +95,7 @@ enum {
 	ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY,
 	ETHTOOL_MSG_MODULE_GET_REPLY,
 	ETHTOOL_MSG_MODULE_NTF,
+	ETHTOOL_MSG_MODULE_RESET_NTF,
 
 	/* add new constants above here */
 	__ETHTOOL_MSG_KERNEL_CNT,
diff --git a/net/ethtool/module.c b/net/ethtool/module.c
index 947f2188d725..f5b730eb0645 100644
--- a/net/ethtool/module.c
+++ b/net/ethtool/module.c
@@ -182,3 +182,83 @@ int ethnl_set_module(struct sk_buff *skb, struct genl_info *info)
 	dev_put(dev);
 	return ret;
 }
+
+/* MODULE_RESET_ACT */
+
+const struct nla_policy ethnl_module_reset_act_policy[ETHTOOL_A_MODULE_HEADER + 1] = {
+	[ETHTOOL_A_MODULE_HEADER] = NLA_POLICY_NESTED(ethnl_header_policy),
+};
+
+static void ethnl_module_reset_done(struct net_device *dev)
+{
+	struct sk_buff *skb;
+	void *ehdr;
+	int ret;
+
+	skb = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!skb)
+		return;
+
+	ehdr = ethnl_bcastmsg_put(skb, ETHTOOL_MSG_MODULE_RESET_NTF);
+	if (!ehdr)
+		goto out;
+
+	ret = ethnl_fill_reply_header(skb, dev, ETHTOOL_A_MODULE_HEADER);
+	if (ret < 0)
+		goto out;
+
+	genlmsg_end(skb, ehdr);
+	ethnl_multicast(skb, dev);
+	return;
+
+out:
+	nlmsg_free(skb);
+}
+
+int ethnl_act_module_reset(struct sk_buff *skb, struct genl_info *info)
+{
+	struct ethnl_req_info req_info = {};
+	struct nlattr **tb = info->attrs;
+	const struct ethtool_ops *ops;
+	struct net_device *dev;
+	int ret;
+
+	ret = ethnl_parse_header_dev_get(&req_info,
+					 tb[ETHTOOL_A_MODULE_HEADER],
+					 genl_info_net(info), info->extack,
+					 true);
+	if (ret < 0)
+		return ret;
+
+	dev = req_info.dev;
+
+	rtnl_lock();
+	ops = dev->ethtool_ops;
+	if (!ops->reset_module) {
+		ret = -EOPNOTSUPP;
+		goto out_rtnl;
+	}
+
+	if (netif_running(dev)) {
+		NL_SET_ERR_MSG(info->extack,
+			       "Cannot reset module when port is administratively up");
+		ret = -EINVAL;
+		goto out_rtnl;
+	}
+
+	ret = ethnl_ops_begin(dev);
+	if (ret < 0)
+		goto out_rtnl;
+
+	ret = ops->reset_module(dev, info->extack);
+
+	ethnl_ops_complete(dev);
+
+	if (!ret)
+		ethnl_module_reset_done(dev);
+
+out_rtnl:
+	rtnl_unlock();
+	dev_put(dev);
+	return ret;
+}
diff --git a/net/ethtool/netlink.c b/net/ethtool/netlink.c
index 38b44c0291b1..8558caa1a963 100644
--- a/net/ethtool/netlink.c
+++ b/net/ethtool/netlink.c
@@ -1018,6 +1018,13 @@ static const struct genl_ops ethtool_genl_ops[] = {
 		.policy = ethnl_module_set_policy,
 		.maxattr = ARRAY_SIZE(ethnl_module_set_policy) - 1,
 	},
+	{
+		.cmd	= ETHTOOL_MSG_MODULE_RESET_ACT,
+		.flags	= GENL_UNS_ADMIN_PERM,
+		.doit	= ethnl_act_module_reset,
+		.policy = ethnl_module_reset_act_policy,
+		.maxattr = ARRAY_SIZE(ethnl_module_reset_act_policy) - 1,
+	},
 };
 
 static const struct genl_multicast_group ethtool_nl_mcgrps[] = {
diff --git a/net/ethtool/netlink.h b/net/ethtool/netlink.h
index cf0fcbfe3c5c..7087cd20c4d0 100644
--- a/net/ethtool/netlink.h
+++ b/net/ethtool/netlink.h
@@ -376,6 +376,7 @@ extern const struct nla_policy ethnl_stats_get_policy[ETHTOOL_A_STATS_GROUPS + 1
 extern const struct nla_policy ethnl_phc_vclocks_get_policy[ETHTOOL_A_PHC_VCLOCKS_HEADER + 1];
 extern const struct nla_policy ethnl_module_get_policy[ETHTOOL_A_MODULE_HEADER + 1];
 extern const struct nla_policy ethnl_module_set_policy[ETHTOOL_A_MODULE_LOW_POWER_ENABLED + 1];
+extern const struct nla_policy ethnl_module_reset_act_policy[ETHTOOL_A_MODULE_HEADER + 1];
 
 int ethnl_set_linkinfo(struct sk_buff *skb, struct genl_info *info);
 int ethnl_set_linkmodes(struct sk_buff *skb, struct genl_info *info);
@@ -395,6 +396,7 @@ int ethnl_tunnel_info_start(struct netlink_callback *cb);
 int ethnl_tunnel_info_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
 int ethnl_set_fec(struct sk_buff *skb, struct genl_info *info);
 int ethnl_set_module(struct sk_buff *skb, struct genl_info *info);
+int ethnl_act_module_reset(struct sk_buff *skb, struct genl_info *info);
 
 extern const char stats_std_names[__ETHTOOL_STATS_CNT][ETH_GSTRING_LEN];
 extern const char stats_eth_phy_names[__ETHTOOL_A_STATS_ETH_PHY_CNT][ETH_GSTRING_LEN];
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH net-next 3/8] mlxsw: reg: Add fields to PMAOS register
  2021-08-09 10:21 [RFC PATCH net-next 0/8] ethtool: Add ability to control transceiver modules Ido Schimmel
  2021-08-09 10:21 ` [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode Ido Schimmel
  2021-08-09 10:21 ` [RFC PATCH net-next 2/8] ethtool: Add ability to reset transceiver modules Ido Schimmel
@ 2021-08-09 10:21 ` Ido Schimmel
  2021-08-09 10:21 ` [RFC PATCH net-next 4/8] mlxsw: Make PMAOS pack function more generic Ido Schimmel
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: Ido Schimmel @ 2021-08-09 10:21 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, andrew, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

The Ports Module Administrative and Operational Status (PMAOS) register
configures and retrieves the per-module status. Extend it with fields
required to support various module settings such as reset and low power
mode.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 58 +++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 6fbda6ebd590..b2c55259f333 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -5681,6 +5681,14 @@ static inline void mlxsw_reg_pspa_pack(char *payload, u8 swid, u8 local_port)
 
 MLXSW_REG_DEFINE(pmaos, MLXSW_REG_PMAOS_ID, MLXSW_REG_PMAOS_LEN);
 
+/* reg_pmaos_rst
+ * Module reset toggle.
+ * Note: Setting reset while module is plugged-in will result in transition to
+ * "initializing" operational state.
+ * Access: OP
+ */
+MLXSW_ITEM32(reg, pmaos, rst, 0x00, 31, 1);
+
 /* reg_pmaos_slot_index
  * Slot index.
  * Access: Index
@@ -5693,6 +5701,38 @@ MLXSW_ITEM32(reg, pmaos, slot_index, 0x00, 24, 4);
  */
 MLXSW_ITEM32(reg, pmaos, module, 0x00, 16, 8);
 
+enum mlxsw_reg_pmaos_admin_status {
+	MLXSW_REG_PMAOS_ADMIN_STATUS_ENABLED = 1,
+	MLXSW_REG_PMAOS_ADMIN_STATUS_DISABLED = 2,
+	/* If the module is active and then unplugged, or experienced an error
+	 * event, the operational status should go to "disabled" and can only
+	 * be enabled upon explicit enable command.
+	 */
+	MLXSW_REG_PMAOS_ADMIN_STATUS_ENABLED_ONCE = 3,
+};
+
+/* reg_pmaos_admin_status
+ * Module administrative state (the desired state of the module).
+ * Note: To disable a module, all ports associated with the port must be
+ * administatively down first.
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, pmaos, admin_status, 0x00, 8, 4);
+
+enum mlxsw_reg_pmaos_oper_status {
+	MLXSW_REG_PMAOS_OPER_STATUS_INITIALIZING,
+	MLXSW_REG_PMAOS_OPER_STATUS_PLUGGED_ENABLED,
+	MLXSW_REG_PMAOS_OPER_STATUS_UNPLUGGED,
+	/* Error code can be read from PMAOS.error_type */
+	MLXSW_REG_PMAOS_OPER_STATUS_PLUGGED_ERROR,
+};
+
+/* reg_pmaos_oper_status
+ * Module state. Reserved while administrative state is disabled.
+ * Access: RO
+ */
+MLXSW_ITEM32(reg, pmaos, oper_status, 0x00, 0, 4);
+
 /* reg_pmaos_ase
  * Admin state update enable.
  * If this bit is set, admin state will be updated based on admin_state field.
@@ -5709,6 +5749,24 @@ MLXSW_ITEM32(reg, pmaos, ase, 0x04, 31, 1);
  */
 MLXSW_ITEM32(reg, pmaos, ee, 0x04, 30, 1);
 
+enum mlxsw_reg_pmaos_error_type {
+	MLXSW_REG_PMAOS_ERROR_TYPE_POWER_BUDGET_EXCEEDED = 0,
+	/* I2C data or clock shorted */
+	MLXSW_REG_PMAOS_ERROR_TYPE_BUS_STUCK = 2,
+	MLXSW_REG_PMAOS_ERROR_TYPE_BAD_UNSUPPORTED_EEPROM = 3,
+	MLXSW_REG_PMAOS_ERROR_TYPE_UNSUPPORTED_CABLE = 5,
+	MLXSW_REG_PMAOS_ERROR_TYPE_HIGH_TEMP = 6,
+	/* Module / cable is shorted */
+	MLXSW_REG_PMAOS_ERROR_TYPE_BAD_CABLE = 7,
+};
+
+/* reg_pmaos_error_type
+ * Module error details. Only valid when operational status is "plugged with
+ * error".
+ * Access: RO
+ */
+MLXSW_ITEM32(reg, pmaos, error_type, 0x04, 8, 4);
+
 enum mlxsw_reg_pmaos_e {
 	MLXSW_REG_PMAOS_E_DO_NOT_GENERATE_EVENT,
 	MLXSW_REG_PMAOS_E_GENERATE_EVENT,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH net-next 4/8] mlxsw: Make PMAOS pack function more generic
  2021-08-09 10:21 [RFC PATCH net-next 0/8] ethtool: Add ability to control transceiver modules Ido Schimmel
                   ` (2 preceding siblings ...)
  2021-08-09 10:21 ` [RFC PATCH net-next 3/8] mlxsw: reg: Add fields to PMAOS register Ido Schimmel
@ 2021-08-09 10:21 ` Ido Schimmel
  2021-08-09 10:21 ` [RFC PATCH net-next 5/8] mlxsw: reg: Add Port Module Memory Map Properties register Ido Schimmel
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: Ido Schimmel @ 2021-08-09 10:21 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, andrew, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

The PMAOS register has enable bits (e.g., PMAOS.ee) that allow changing
only a subset of the fields, which is exactly what subsequent patches
will need to do. Instead of passing multiple arguments to its pack
function, only pass the module index and let the rest be set by the
different callers.

No functional changes intended.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core_env.c | 6 ++++--
 drivers/net/ethernet/mellanox/mlxsw/reg.h      | 5 +----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_env.c b/drivers/net/ethernet/mellanox/mlxsw/core_env.c
index 3713c45cfa1e..32554910506e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_env.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_env.c
@@ -652,8 +652,10 @@ mlxsw_env_module_oper_state_event_enable(struct mlxsw_core *mlxsw_core,
 	for (i = 0; i < module_count; i++) {
 		char pmaos_pl[MLXSW_REG_PMAOS_LEN];
 
-		mlxsw_reg_pmaos_pack(pmaos_pl, i,
-				     MLXSW_REG_PMAOS_E_GENERATE_EVENT);
+		mlxsw_reg_pmaos_pack(pmaos_pl, i);
+		mlxsw_reg_pmaos_e_set(pmaos_pl,
+				      MLXSW_REG_PMAOS_E_GENERATE_EVENT);
+		mlxsw_reg_pmaos_ee_set(pmaos_pl, true);
 		err = mlxsw_reg_write(mlxsw_core, MLXSW_REG(pmaos), pmaos_pl);
 		if (err)
 			return err;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index b2c55259f333..d0361f60d70d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -5779,13 +5779,10 @@ enum mlxsw_reg_pmaos_e {
  */
 MLXSW_ITEM32(reg, pmaos, e, 0x04, 0, 2);
 
-static inline void mlxsw_reg_pmaos_pack(char *payload, u8 module,
-					enum mlxsw_reg_pmaos_e e)
+static inline void mlxsw_reg_pmaos_pack(char *payload, u8 module)
 {
 	MLXSW_REG_ZERO(pmaos, payload);
 	mlxsw_reg_pmaos_module_set(payload, module);
-	mlxsw_reg_pmaos_e_set(payload, e);
-	mlxsw_reg_pmaos_ee_set(payload, true);
 }
 
 /* PPLR - Port Physical Loopback Register
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH net-next 5/8] mlxsw: reg: Add Port Module Memory Map Properties register
  2021-08-09 10:21 [RFC PATCH net-next 0/8] ethtool: Add ability to control transceiver modules Ido Schimmel
                   ` (3 preceding siblings ...)
  2021-08-09 10:21 ` [RFC PATCH net-next 4/8] mlxsw: Make PMAOS pack function more generic Ido Schimmel
@ 2021-08-09 10:21 ` Ido Schimmel
  2021-08-09 10:21 ` [RFC PATCH net-next 6/8] mlxsw: reg: Add Management Cable IO and Notifications register Ido Schimmel
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: Ido Schimmel @ 2021-08-09 10:21 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, andrew, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

Add the Port Module Memory Map Properties register. It will be used to
force a module into low power mode in subsequent patches.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 44 +++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index d0361f60d70d..7808b308e7af 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -5915,6 +5915,49 @@ static inline void mlxsw_reg_pddr_pack(char *payload, u8 local_port,
 	mlxsw_reg_pddr_page_select_set(payload, page_select);
 }
 
+/* PMMP - Port Module Memory Map Properties Register
+ * -------------------------------------------------
+ * The PMMP register allows to override the module memory map advertisement.
+ * The register can only be set when the module is disabled by PMAOS register.
+ */
+#define MLXSW_REG_PMMP_ID 0x5044
+#define MLXSW_REG_PMMP_LEN 0x2C
+
+MLXSW_REG_DEFINE(pmmp, MLXSW_REG_PMMP_ID, MLXSW_REG_PMMP_LEN);
+
+/* reg_pmmp_module
+ * Module number.
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, pmmp, module, 0x00, 16, 8);
+
+/* reg_pmmp_eeprom_override_mask
+ * Write mask bit (negative polarity).
+ * 0 - Allow write
+ * 1 - Ignore write
+ * On write, indicates which of the bits from eeprom_override field are
+ * updated.
+ * Access: WO
+ */
+MLXSW_ITEM32(reg, pmmp, eeprom_override_mask, 0x04, 16, 16);
+
+enum {
+	/* Set module to low power mode */
+	MLXSW_REG_PMMP_EEPROM_OVERRIDE_LOW_POWER_MASK = BIT(8),
+};
+
+/* reg_pmmp_eeprom_override
+ * Override / ignore EEPROM advertisement properties bitmask
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, pmmp, eeprom_override, 0x04, 0, 16);
+
+static inline void mlxsw_reg_pmmp_pack(char *payload, u8 module)
+{
+	MLXSW_REG_ZERO(pmmp, payload);
+	mlxsw_reg_pmmp_module_set(payload, module);
+}
+
 /* PMTM - Port Module Type Mapping Register
  * ----------------------------------------
  * The PMTM allows query or configuration of module types.
@@ -12257,6 +12300,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
 	MLXSW_REG(pplr),
 	MLXSW_REG(pmpe),
 	MLXSW_REG(pddr),
+	MLXSW_REG(pmmp),
 	MLXSW_REG(pmtm),
 	MLXSW_REG(htgt),
 	MLXSW_REG(hpkt),
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH net-next 6/8] mlxsw: reg: Add Management Cable IO and Notifications register
  2021-08-09 10:21 [RFC PATCH net-next 0/8] ethtool: Add ability to control transceiver modules Ido Schimmel
                   ` (4 preceding siblings ...)
  2021-08-09 10:21 ` [RFC PATCH net-next 5/8] mlxsw: reg: Add Port Module Memory Map Properties register Ido Schimmel
@ 2021-08-09 10:21 ` Ido Schimmel
  2021-08-09 10:21 ` [RFC PATCH net-next 7/8] mlxsw: Add ability to control transceiver modules' low power mode Ido Schimmel
  2021-08-09 10:21 ` [RFC PATCH net-next 8/8] mlxsw: Add ability to reset transceiver modules Ido Schimmel
  7 siblings, 0 replies; 39+ messages in thread
From: Ido Schimmel @ 2021-08-09 10:21 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, andrew, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

Add the Management Cable IO and Notifications register. It will be used
to retrieve the low power mode status of a module in subsequent patches.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 33 +++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 7808b308e7af..d25ca5f714f4 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -10306,6 +10306,38 @@ static inline void mlxsw_reg_mlcr_pack(char *payload, u8 local_port,
 					   MLXSW_REG_MLCR_DURATION_MAX : 0);
 }
 
+/* MCION - Management Cable IO and Notifications Register
+ * ------------------------------------------------------
+ * The MCION register is used to query transceiver modules' IO pins and other
+ * notifications.
+ */
+#define MLXSW_REG_MCION_ID 0x9052
+#define MLXSW_REG_MCION_LEN 0x18
+
+MLXSW_REG_DEFINE(mcion, MLXSW_REG_MCION_ID, MLXSW_REG_MCION_LEN);
+
+/* reg_mcion_module
+ * Module number.
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, mcion, module, 0x00, 16, 8);
+
+enum {
+	MLXSW_REG_MCION_MODULE_STATUS_BITS_LOW_POWER_MASK = BIT(8),
+};
+
+/* reg_mcion_module_status_bits
+ * Module IO status as defined by SFF.
+ * Access: RO
+ */
+MLXSW_ITEM32(reg, mcion, module_status_bits, 0x04, 0, 16);
+
+static inline void mlxsw_reg_mcion_pack(char *payload, u8 module)
+{
+	MLXSW_REG_ZERO(mcion, payload);
+	mlxsw_reg_mcion_module_set(payload, module);
+}
+
 /* MTPPS - Management Pulse Per Second Register
  * --------------------------------------------
  * This register provides the device PPS capabilities, configure the PPS in and
@@ -12348,6 +12380,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
 	MLXSW_REG(mgir),
 	MLXSW_REG(mrsr),
 	MLXSW_REG(mlcr),
+	MLXSW_REG(mcion),
 	MLXSW_REG(mtpps),
 	MLXSW_REG(mtutc),
 	MLXSW_REG(mpsc),
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH net-next 7/8] mlxsw: Add ability to control transceiver modules' low power mode
  2021-08-09 10:21 [RFC PATCH net-next 0/8] ethtool: Add ability to control transceiver modules Ido Schimmel
                   ` (5 preceding siblings ...)
  2021-08-09 10:21 ` [RFC PATCH net-next 6/8] mlxsw: reg: Add Management Cable IO and Notifications register Ido Schimmel
@ 2021-08-09 10:21 ` Ido Schimmel
  2021-08-09 10:21 ` [RFC PATCH net-next 8/8] mlxsw: Add ability to reset transceiver modules Ido Schimmel
  7 siblings, 0 replies; 39+ messages in thread
From: Ido Schimmel @ 2021-08-09 10:21 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, andrew, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

Implement support for ethtool_ops::.get_module_low_power and
ethtool_ops::set_module_low_power.

The get operation is implemented using the Management Cable IO and
Notifications (MCION) register that reports the low power mode status of
the module.

The set operation is implemented using the Port Module Memory Map
Properties (PMMP) register. The register instructs the device's firmware
to transition a plugged-in module to / out of low power mode by writing
to its memory map.

Before using the PMMP register, the module must be disabled by the PMAOS
register. All the ports mapped to the module are iterated to ensure they
are administratively down, so that their operational state will not
change during the operation.

After the operation is performed, the module is re-enabled and its
operational state is polled to ensure the module transitioned back to a
valid state. If not, an error is reported to user space via extack.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 .../net/ethernet/mellanox/mlxsw/core_env.c    | 198 ++++++++++++++++++
 .../net/ethernet/mellanox/mlxsw/core_env.h    |   8 +
 drivers/net/ethernet/mellanox/mlxsw/minimal.c |  24 +++
 .../mellanox/mlxsw/spectrum_ethtool.c         |  49 +++++
 4 files changed, 279 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_env.c b/drivers/net/ethernet/mellanox/mlxsw/core_env.c
index 32554910506e..1ae06730d374 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_env.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_env.c
@@ -5,6 +5,7 @@
 #include <linux/err.h>
 #include <linux/ethtool.h>
 #include <linux/sfp.h>
+#include <linux/jiffies.h>
 
 #include "core.h"
 #include "core_env.h"
@@ -389,6 +390,203 @@ mlxsw_env_get_module_eeprom_by_page(struct mlxsw_core *mlxsw_core, u8 module,
 }
 EXPORT_SYMBOL(mlxsw_env_get_module_eeprom_by_page);
 
+int mlxsw_env_get_module_low_power(struct mlxsw_core *mlxsw_core, u8 module,
+				   bool *p_low_power,
+				   struct netlink_ext_ack *extack)
+{
+	char mcion_pl[MLXSW_REG_MCION_LEN];
+	u16 status_bits;
+	int err;
+
+	mlxsw_reg_mcion_pack(mcion_pl, module);
+
+	err = mlxsw_reg_query(mlxsw_core, MLXSW_REG(mcion), mcion_pl);
+	if (err) {
+		NL_SET_ERR_MSG_MOD(extack, "Failed to retrieve module's low power mode");
+		return err;
+	}
+
+	status_bits = mlxsw_reg_mcion_module_status_bits_get(mcion_pl);
+	*p_low_power =
+		status_bits & MLXSW_REG_MCION_MODULE_STATUS_BITS_LOW_POWER_MASK;
+
+	return 0;
+}
+EXPORT_SYMBOL(mlxsw_env_get_module_low_power);
+
+static int mlxsw_env_module_enable_set(struct mlxsw_core *mlxsw_core, u8 module,
+				       bool enable)
+{
+	enum mlxsw_reg_pmaos_admin_status admin_status;
+	char pmaos_pl[MLXSW_REG_PMAOS_LEN];
+
+	mlxsw_reg_pmaos_pack(pmaos_pl, module);
+	admin_status = enable ? MLXSW_REG_PMAOS_ADMIN_STATUS_ENABLED :
+				MLXSW_REG_PMAOS_ADMIN_STATUS_DISABLED;
+	mlxsw_reg_pmaos_admin_status_set(pmaos_pl, admin_status);
+	mlxsw_reg_pmaos_ase_set(pmaos_pl, true);
+
+	return mlxsw_reg_write(mlxsw_core, MLXSW_REG(pmaos), pmaos_pl);
+}
+
+static int mlxsw_env_module_low_power_set(struct mlxsw_core *mlxsw_core,
+					  u8 module, bool low_power)
+{
+	u16 eeprom_override_mask, eeprom_override;
+	char pmmp_pl[MLXSW_REG_PMMP_LEN];
+
+	mlxsw_reg_pmmp_pack(pmmp_pl, module);
+	/* Mask all the bits except low power mode. */
+	eeprom_override_mask = ~MLXSW_REG_PMMP_EEPROM_OVERRIDE_LOW_POWER_MASK;
+	mlxsw_reg_pmmp_eeprom_override_mask_set(pmmp_pl, eeprom_override_mask);
+	eeprom_override = low_power ? MLXSW_REG_PMMP_EEPROM_OVERRIDE_LOW_POWER_MASK :
+				      0;
+	mlxsw_reg_pmmp_eeprom_override_set(pmmp_pl, eeprom_override);
+
+	return mlxsw_reg_write(mlxsw_core, MLXSW_REG(pmmp), pmmp_pl);
+}
+
+static int mlxsw_env_module_error_process(const char *pmaos_pl,
+					  struct netlink_ext_ack *extack)
+{
+	enum mlxsw_reg_pmaos_error_type error_type;
+
+	error_type = mlxsw_reg_pmaos_error_type_get(pmaos_pl);
+	switch (error_type) {
+	case MLXSW_REG_PMAOS_ERROR_TYPE_POWER_BUDGET_EXCEEDED:
+		NL_SET_ERR_MSG_MOD(extack, "Module's power budget exceeded");
+		return -EINVAL;
+	case MLXSW_REG_PMAOS_ERROR_TYPE_BUS_STUCK:
+		NL_SET_ERR_MSG_MOD(extack, "Module's I2C bus is stuck. Data or clock shorted");
+		return -EIO;
+	case MLXSW_REG_PMAOS_ERROR_TYPE_BAD_UNSUPPORTED_EEPROM:
+		NL_SET_ERR_MSG_MOD(extack, "Bad or unsupported module EEPROM");
+		return -EOPNOTSUPP;
+	case MLXSW_REG_PMAOS_ERROR_TYPE_UNSUPPORTED_CABLE:
+		NL_SET_ERR_MSG_MOD(extack, "Unsupported cable");
+		return -EOPNOTSUPP;
+	case MLXSW_REG_PMAOS_ERROR_TYPE_HIGH_TEMP:
+		NL_SET_ERR_MSG_MOD(extack, "Module's temperature is too high");
+		return -EINVAL;
+	case MLXSW_REG_PMAOS_ERROR_TYPE_BAD_CABLE:
+		NL_SET_ERR_MSG_MOD(extack, "Bad module. Module / cable is shorted");
+		return -EINVAL;
+	default:
+		NL_SET_ERR_MSG_MOD(extack, "Encountered unknown module error type");
+		return -EINVAL;
+	}
+}
+
+static bool mlxsw_env_module_oper_should_wait(const char *pmaos_pl)
+{
+	enum mlxsw_reg_pmaos_oper_status oper_status;
+
+	oper_status = mlxsw_reg_pmaos_oper_status_get(pmaos_pl);
+	switch (oper_status) {
+	case MLXSW_REG_PMAOS_OPER_STATUS_PLUGGED_ENABLED:
+	case MLXSW_REG_PMAOS_OPER_STATUS_UNPLUGGED:
+		return false;
+	default:
+		/* Module might not be accessible just after its re-enablement,
+		 * so ignore errors or unknown states during this time period.
+		 */
+		return true;
+	}
+}
+
+static int mlxsw_env_module_oper_status_process(const char *pmaos_pl,
+						struct netlink_ext_ack *extack)
+{
+	enum mlxsw_reg_pmaos_oper_status oper_status;
+
+	oper_status = mlxsw_reg_pmaos_oper_status_get(pmaos_pl);
+	switch (oper_status) {
+	case MLXSW_REG_PMAOS_OPER_STATUS_INITIALIZING:
+		NL_SET_ERR_MSG_MOD(extack, "Module is still initializing");
+		return -EBUSY;
+	case MLXSW_REG_PMAOS_OPER_STATUS_PLUGGED_ENABLED:
+	case MLXSW_REG_PMAOS_OPER_STATUS_UNPLUGGED:
+		return 0;
+	case MLXSW_REG_PMAOS_OPER_STATUS_PLUGGED_ERROR:
+		return mlxsw_env_module_error_process(pmaos_pl, extack);
+	default:
+		NL_SET_ERR_MSG_MOD(extack, "Encountered unknown module operational status");
+		return -EINVAL;
+	}
+}
+
+/* Firmware should take about 2-3 seconds to initialize the module. */
+#define MLXSW_ENV_MODULE_TIMEOUT_MSECS		5000
+#define MLXSW_ENV_MODULE_WAIT_INTERVAL_MSECS	100
+
+static int mlxsw_env_module_oper_wait(struct mlxsw_core *mlxsw_core, u8 module,
+				      struct netlink_ext_ack *extack)
+{
+	char pmaos_pl[MLXSW_REG_PMAOS_LEN];
+	unsigned long end;
+
+	end = jiffies + msecs_to_jiffies(MLXSW_ENV_MODULE_TIMEOUT_MSECS);
+	do {
+		int err;
+
+		mlxsw_reg_pmaos_pack(pmaos_pl, module);
+		err = mlxsw_reg_query(mlxsw_core, MLXSW_REG(pmaos), pmaos_pl);
+		if (err) {
+			NL_SET_ERR_MSG_MOD(extack, "Failed to query module's operational status");
+			return err;
+		}
+
+		if (!mlxsw_env_module_oper_should_wait(pmaos_pl))
+			break;
+		msleep(MLXSW_ENV_MODULE_WAIT_INTERVAL_MSECS);
+	} while (time_before(jiffies, end));
+
+	return mlxsw_env_module_oper_status_process(pmaos_pl, extack);
+}
+
+int mlxsw_env_set_module_low_power(struct mlxsw_core *mlxsw_core, u8 module,
+				   bool low_power,
+				   struct netlink_ext_ack *extack)
+{
+	int err;
+
+	err = mlxsw_env_module_enable_set(mlxsw_core, module, false);
+	if (err) {
+		NL_SET_ERR_MSG_MOD(extack, "Failed to disable module");
+		return err;
+	}
+
+	err = mlxsw_env_module_low_power_set(mlxsw_core, module, low_power);
+	if (err) {
+		NL_SET_ERR_MSG_MOD(extack, "Failed to set module's low power mode");
+		goto err_module_low_power_set;
+	}
+
+	err = mlxsw_env_module_enable_set(mlxsw_core, module, true);
+	if (err) {
+		NL_SET_ERR_MSG_MOD(extack, "Failed to enable module");
+		goto err_module_enable_set;
+	}
+
+	/* Wait for the module to reach a valid operational state following its
+	 * re-enablement.
+	 */
+	err = mlxsw_env_module_oper_wait(mlxsw_core, module, extack);
+	if (err)
+		goto err_module_oper_wait;
+
+	return 0;
+
+err_module_oper_wait:
+	mlxsw_env_module_enable_set(mlxsw_core, module, false);
+err_module_enable_set:
+	mlxsw_env_module_low_power_set(mlxsw_core, module, !low_power);
+err_module_low_power_set:
+	mlxsw_env_module_enable_set(mlxsw_core, module, true);
+	return err;
+}
+EXPORT_SYMBOL(mlxsw_env_set_module_low_power);
+
 static int mlxsw_env_module_has_temp_sensor(struct mlxsw_core *mlxsw_core,
 					    u8 module,
 					    bool *p_has_temp_sensor)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_env.h b/drivers/net/ethernet/mellanox/mlxsw/core_env.h
index 0bf5bd0f8a7e..32960de96674 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_env.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_env.h
@@ -24,6 +24,14 @@ mlxsw_env_get_module_eeprom_by_page(struct mlxsw_core *mlxsw_core, u8 module,
 				    const struct ethtool_module_eeprom *page,
 				    struct netlink_ext_ack *extack);
 
+int mlxsw_env_get_module_low_power(struct mlxsw_core *mlxsw_core, u8 module,
+				   bool *p_low_power,
+				   struct netlink_ext_ack *extack);
+
+int mlxsw_env_set_module_low_power(struct mlxsw_core *mlxsw_core, u8 module,
+				   bool low_power,
+				   struct netlink_ext_ack *extack);
+
 int
 mlxsw_env_module_overheat_counter_get(struct mlxsw_core *mlxsw_core, u8 module,
 				      u64 *p_counter);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/minimal.c b/drivers/net/ethernet/mellanox/mlxsw/minimal.c
index d9d56c44e994..6fb8204c4d8a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/minimal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/minimal.c
@@ -124,11 +124,35 @@ mlxsw_m_get_module_eeprom_by_page(struct net_device *netdev,
 						   page, extack);
 }
 
+static int mlxsw_m_get_module_low_power(struct net_device *netdev,
+					bool *p_low_power,
+					struct netlink_ext_ack *extack)
+{
+	struct mlxsw_m_port *mlxsw_m_port = netdev_priv(netdev);
+	struct mlxsw_core *core = mlxsw_m_port->mlxsw_m->core;
+
+	return mlxsw_env_get_module_low_power(core, mlxsw_m_port->module,
+					      p_low_power, extack);
+}
+
+static int mlxsw_m_set_module_low_power(struct net_device *netdev,
+					bool low_power,
+					struct netlink_ext_ack *extack)
+{
+	struct mlxsw_m_port *mlxsw_m_port = netdev_priv(netdev);
+	struct mlxsw_core *core = mlxsw_m_port->mlxsw_m->core;
+
+	return mlxsw_env_set_module_low_power(core, mlxsw_m_port->module,
+					      low_power, extack);
+}
+
 static const struct ethtool_ops mlxsw_m_port_ethtool_ops = {
 	.get_drvinfo		= mlxsw_m_module_get_drvinfo,
 	.get_module_info	= mlxsw_m_get_module_info,
 	.get_module_eeprom	= mlxsw_m_get_module_eeprom,
 	.get_module_eeprom_by_page = mlxsw_m_get_module_eeprom_by_page,
+	.get_module_low_power	= mlxsw_m_get_module_low_power,
+	.set_module_low_power	= mlxsw_m_set_module_low_power,
 };
 
 static int
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c
index 267590a0eee7..fb6256f16c50 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c
@@ -1197,6 +1197,53 @@ mlxsw_sp_get_rmon_stats(struct net_device *dev,
 	*ranges = mlxsw_rmon_ranges;
 }
 
+static int mlxsw_sp_get_module_low_power(struct net_device *dev,
+					 bool *p_low_power,
+					 struct netlink_ext_ack *extack)
+{
+	struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
+	struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+	u8 module = mlxsw_sp_port->mapping.module;
+
+	return mlxsw_env_get_module_low_power(mlxsw_sp->core, module,
+					      p_low_power, extack);
+}
+
+static bool mlxsw_sp_module_ports_up_check(struct mlxsw_sp *mlxsw_sp, u8 module)
+{
+	int i;
+
+	for (i = 1; i < mlxsw_core_max_ports(mlxsw_sp->core); i++) {
+		if (!mlxsw_sp->ports[i])
+			continue;
+		if (mlxsw_sp->ports[i]->mapping.module != module)
+			continue;
+		if (netif_running(mlxsw_sp->ports[i]->dev))
+			return true;
+	}
+
+	return false;
+}
+
+static int mlxsw_sp_set_module_low_power(struct net_device *dev, bool low_power,
+					 struct netlink_ext_ack *extack)
+{
+	struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
+	struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+	u8 module = mlxsw_sp_port->mapping.module;
+
+	/* We are going to take the module down, so no port using it can be
+	 * administratively up.
+	 */
+	if (mlxsw_sp_module_ports_up_check(mlxsw_sp, module)) {
+		NL_SET_ERR_MSG_MOD(extack, "Cannot set low power mode when ports using the module are administratively up");
+		return -EINVAL;
+	}
+
+	return mlxsw_env_set_module_low_power(mlxsw_sp->core, module, low_power,
+					      extack);
+}
+
 const struct ethtool_ops mlxsw_sp_port_ethtool_ops = {
 	.cap_link_lanes_supported	= true,
 	.get_drvinfo			= mlxsw_sp_port_get_drvinfo,
@@ -1218,6 +1265,8 @@ const struct ethtool_ops mlxsw_sp_port_ethtool_ops = {
 	.get_eth_mac_stats		= mlxsw_sp_get_eth_mac_stats,
 	.get_eth_ctrl_stats		= mlxsw_sp_get_eth_ctrl_stats,
 	.get_rmon_stats			= mlxsw_sp_get_rmon_stats,
+	.get_module_low_power		= mlxsw_sp_get_module_low_power,
+	.set_module_low_power		= mlxsw_sp_set_module_low_power,
 };
 
 struct mlxsw_sp1_port_link_mode {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH net-next 8/8] mlxsw: Add ability to reset transceiver modules
  2021-08-09 10:21 [RFC PATCH net-next 0/8] ethtool: Add ability to control transceiver modules Ido Schimmel
                   ` (6 preceding siblings ...)
  2021-08-09 10:21 ` [RFC PATCH net-next 7/8] mlxsw: Add ability to control transceiver modules' low power mode Ido Schimmel
@ 2021-08-09 10:21 ` Ido Schimmel
  7 siblings, 0 replies; 39+ messages in thread
From: Ido Schimmel @ 2021-08-09 10:21 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, andrew, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

Implement support for ethtool_ops::reset_module. This is done by writing
to the "rst" bit of the PMAOS register and waiting for the module to
reach a valid operational state. If the module does not transition to a
valid state, an error is reported to user space via extack.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 .../net/ethernet/mellanox/mlxsw/core_env.c    | 28 +++++++++++++++++++
 .../net/ethernet/mellanox/mlxsw/core_env.h    |  3 ++
 drivers/net/ethernet/mellanox/mlxsw/minimal.c | 10 +++++++
 .../mellanox/mlxsw/spectrum_ethtool.c         | 19 +++++++++++++
 4 files changed, 60 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_env.c b/drivers/net/ethernet/mellanox/mlxsw/core_env.c
index 1ae06730d374..df578ef3319c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_env.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_env.c
@@ -587,6 +587,34 @@ int mlxsw_env_set_module_low_power(struct mlxsw_core *mlxsw_core, u8 module,
 }
 EXPORT_SYMBOL(mlxsw_env_set_module_low_power);
 
+static int mlxsw_env_module_reset(struct mlxsw_core *mlxsw_core, u8 module)
+{
+	char pmaos_pl[MLXSW_REG_PMAOS_LEN];
+
+	mlxsw_reg_pmaos_pack(pmaos_pl, module);
+	mlxsw_reg_pmaos_rst_set(pmaos_pl, true);
+
+	return mlxsw_reg_write(mlxsw_core, MLXSW_REG(pmaos), pmaos_pl);
+}
+
+int mlxsw_env_reset_module(struct mlxsw_core *mlxsw_core, u8 module,
+			   struct netlink_ext_ack *extack)
+{
+	int err;
+
+	err = mlxsw_env_module_reset(mlxsw_core, module);
+	if (err) {
+		NL_SET_ERR_MSG_MOD(extack, "Failed to reset module");
+		return err;
+	}
+
+	/* Wait for the module to reach a valid operational state following its
+	 * reset.
+	 */
+	return mlxsw_env_module_oper_wait(mlxsw_core, module, extack);
+}
+EXPORT_SYMBOL(mlxsw_env_reset_module);
+
 static int mlxsw_env_module_has_temp_sensor(struct mlxsw_core *mlxsw_core,
 					    u8 module,
 					    bool *p_has_temp_sensor)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_env.h b/drivers/net/ethernet/mellanox/mlxsw/core_env.h
index 32960de96674..465a095e6a3e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_env.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_env.h
@@ -32,6 +32,9 @@ int mlxsw_env_set_module_low_power(struct mlxsw_core *mlxsw_core, u8 module,
 				   bool low_power,
 				   struct netlink_ext_ack *extack);
 
+int mlxsw_env_reset_module(struct mlxsw_core *mlxsw_core, u8 module,
+			   struct netlink_ext_ack *extack);
+
 int
 mlxsw_env_module_overheat_counter_get(struct mlxsw_core *mlxsw_core, u8 module,
 				      u64 *p_counter);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/minimal.c b/drivers/net/ethernet/mellanox/mlxsw/minimal.c
index 6fb8204c4d8a..d206442270df 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/minimal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/minimal.c
@@ -146,6 +146,15 @@ static int mlxsw_m_set_module_low_power(struct net_device *netdev,
 					      low_power, extack);
 }
 
+static int mlxsw_m_reset_module(struct net_device *netdev,
+				struct netlink_ext_ack *extack)
+{
+	struct mlxsw_m_port *mlxsw_m_port = netdev_priv(netdev);
+	struct mlxsw_core *core = mlxsw_m_port->mlxsw_m->core;
+
+	return mlxsw_env_reset_module(core, mlxsw_m_port->module, extack);
+}
+
 static const struct ethtool_ops mlxsw_m_port_ethtool_ops = {
 	.get_drvinfo		= mlxsw_m_module_get_drvinfo,
 	.get_module_info	= mlxsw_m_get_module_info,
@@ -153,6 +162,7 @@ static const struct ethtool_ops mlxsw_m_port_ethtool_ops = {
 	.get_module_eeprom_by_page = mlxsw_m_get_module_eeprom_by_page,
 	.get_module_low_power	= mlxsw_m_get_module_low_power,
 	.set_module_low_power	= mlxsw_m_set_module_low_power,
+	.reset_module		= mlxsw_m_reset_module,
 };
 
 static int
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c
index fb6256f16c50..9526ef71e513 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ethtool.c
@@ -1244,6 +1244,24 @@ static int mlxsw_sp_set_module_low_power(struct net_device *dev, bool low_power,
 					      extack);
 }
 
+static int mlxsw_sp_reset_module(struct net_device *dev,
+				 struct netlink_ext_ack *extack)
+{
+	struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
+	struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+	u8 module = mlxsw_sp_port->mapping.module;
+
+	/* We are going to take the module down, so no port using it can be
+	 * administratively up.
+	 */
+	if (mlxsw_sp_module_ports_up_check(mlxsw_sp, module)) {
+		NL_SET_ERR_MSG_MOD(extack, "Cannot reset module when ports using it are administratively up");
+		return -EINVAL;
+	}
+
+	return mlxsw_env_reset_module(mlxsw_sp->core, module, extack);
+}
+
 const struct ethtool_ops mlxsw_sp_port_ethtool_ops = {
 	.cap_link_lanes_supported	= true,
 	.get_drvinfo			= mlxsw_sp_port_get_drvinfo,
@@ -1267,6 +1285,7 @@ const struct ethtool_ops mlxsw_sp_port_ethtool_ops = {
 	.get_rmon_stats			= mlxsw_sp_get_rmon_stats,
 	.get_module_low_power		= mlxsw_sp_get_module_low_power,
 	.set_module_low_power		= mlxsw_sp_set_module_low_power,
+	.reset_module			= mlxsw_sp_reset_module,
 };
 
 struct mlxsw_sp1_port_link_mode {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-09 10:21 ` [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode Ido Schimmel
@ 2021-08-09 14:28   ` Andrew Lunn
  2021-08-10  7:26     ` Ido Schimmel
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Lunn @ 2021-08-09 14:28 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, kuba, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

On Mon, Aug 09, 2021 at 01:21:45PM +0300, Ido Schimmel wrote:
> From: Ido Schimmel <idosch@nvidia.com>
> 
> Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and
> 'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver
> modules parameters and retrieve their status.

Hi Ido

I've not read all the patchset yet, but i like the general direction.

> The first parameter to control is the low power mode of the module. It
> is only relevant for paged memory modules, as flat memory modules always
> operate in low power mode.
> 
> When a paged memory module is in low power mode, its power consumption
> is reduced to the minimum, the management interface towards the host is
> available and the data path is deactivated.
> 
> User space can choose to put modules that are not currently in use in
> low power mode and transition them to high power mode before putting the
> associated ports administratively up.
> 
> Transitioning into low power mode means loss of carrier, so error is
> returned when the netdev is administratively up.

However, i don't get this use case. With copper PHYs, putting the link
administratively down results in a call into phylib and into the
driver to down the link. This effectively puts the PHY into a low
power mode. The management interface, as defined by C22 and C45 remain
available, but the data path is disabled. For a 1G PHY, this can save
a few watts.

For SFPs managed by phylink and the kernal SFP driver, the exact same
happens. The TX_ENABLE pin of the SFP is set to false. The I2C bus
still works, but the data path is disabled.

So i would expect a driver using firmware, not Linux code to manage
SFPs, to just do this on link down. Why do we need user space
involved?

    Andrew


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 2/8] ethtool: Add ability to reset transceiver modules
  2021-08-09 10:21 ` [RFC PATCH net-next 2/8] ethtool: Add ability to reset transceiver modules Ido Schimmel
@ 2021-08-09 19:13   ` Andrew Lunn
  2021-08-10 13:05     ` Ido Schimmel
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Lunn @ 2021-08-09 19:13 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, kuba, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

On Mon, Aug 09, 2021 at 01:21:46PM +0300, Ido Schimmel wrote:
> From: Ido Schimmel <idosch@nvidia.com>
> 
> Add a new ethtool message, 'ETHTOOL_MSG_MODULE_RESET_ACT', which allows
> user space to request a reset of transceiver modules. A successful reset
> results in a notification being emitted to user space in the form of a
> 'ETHTOOL_MSG_MODULE_RESET_NTF' message.
> 
> Reset can be performed by either asserting the relevant hardware signal
> ("Reset" in CMIS / "ResetL" in SFF-8636) or by writing to the relevant
> reset bit in the module's EEPROM (page 00h, byte 26, bit 3 in CMIS /
> page 00h, byte 93, bit 7 in SFF-8636).
> 
> Reset is useful in order to allow a module to transition out of a fault
> state. From section 6.3.2.12 in CMIS 5.0: "Except for a power cycle, the
> only exit path from the ModuleFault state is to perform a module reset
> by taking an action that causes the ResetS transition signal to become
> TRUE (see Table 6-11)".
> 
> To avoid changes to the operational state of the device, reset can only
> be performed when the device is administratively down.
> 
> Example usage:
> 
>  # ethtool --reset-module swp11
>  netlink error: Cannot reset module when port is administratively up
>  netlink error: Invalid argument
> 
>  # ip link set dev swp11 down
> 
>  # ethtool --reset-module swp11
> 
> Monitor notifications:
> 
>  $ ethtool --monitor
>  listening...
> 
>  Module reset done for swp11

Again, i'm wondering, why is user space doing the reset? Can you think
of any other piece of hardware where Linux relies on user space
performing a reset before the kernel can properly use it?

How long does a reset take? Table 10-1 says the reset pulse must be
10uS and table 10-2 says the reset should not take longer than
2000ms. So maybe reset it on ifup if it is in a bad state?

I assume the driver/firmware is monitoring the SFP and if it goes into
a state which requires a reset it indicates carrier down? Wasn't there
some patches which added link down reasons? It would make sense to add
enum ethtool_link_ext_substate_sfp_fault? You can then use ethtool to
see what state the module is in, and a down/ip should reset it?

    Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-09 14:28   ` Andrew Lunn
@ 2021-08-10  7:26     ` Ido Schimmel
  2021-08-10 13:52       ` Andrew Lunn
  0 siblings, 1 reply; 39+ messages in thread
From: Ido Schimmel @ 2021-08-10  7:26 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, davem, kuba, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

On Mon, Aug 09, 2021 at 04:28:32PM +0200, Andrew Lunn wrote:
> On Mon, Aug 09, 2021 at 01:21:45PM +0300, Ido Schimmel wrote:
> > From: Ido Schimmel <idosch@nvidia.com>
> > 
> > Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and
> > 'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver
> > modules parameters and retrieve their status.
> 
> Hi Ido
> 
> I've not read all the patchset yet, but i like the general direction.
> 
> > The first parameter to control is the low power mode of the module. It
> > is only relevant for paged memory modules, as flat memory modules always
> > operate in low power mode.
> > 
> > When a paged memory module is in low power mode, its power consumption
> > is reduced to the minimum, the management interface towards the host is
> > available and the data path is deactivated.
> > 
> > User space can choose to put modules that are not currently in use in
> > low power mode and transition them to high power mode before putting the
> > associated ports administratively up.
> > 
> > Transitioning into low power mode means loss of carrier, so error is
> > returned when the netdev is administratively up.
> 
> However, i don't get this use case. With copper PHYs, putting the link
> administratively down results in a call into phylib and into the
> driver to down the link. This effectively puts the PHY into a low
> power mode. The management interface, as defined by C22 and C45 remain
> available, but the data path is disabled. For a 1G PHY, this can save
> a few watts.
> 
> For SFPs managed by phylink and the kernal SFP driver, the exact same
> happens. The TX_ENABLE pin of the SFP is set to false. The I2C bus
> still works, but the data path is disabled.
> 
> So i would expect a driver using firmware, not Linux code to manage
> SFPs, to just do this on link down. Why do we need user space
> involved?

The transition from low power to high power can take a few seconds with
QSFP/QSFP-DD and it's likely to only get longer with future / more
complex modules. Therefore, to reduce link-up time, the firmware
automatically transitions modules to high power mode.

There is obviously a trade-off here between power consumption and
link-up time. My understanding is that Mellanox is not the only vendor
favoring shorter link-up times as users have the ability to control the
low power mode of the modules in other implementations.

Regarding "why do we need user space involved?", by default, it does not
need to be involved (the system works without this API), but if it wants
to reduce the power consumption by setting unused modules to low power
mode, then it will need to use this API.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 2/8] ethtool: Add ability to reset transceiver modules
  2021-08-09 19:13   ` Andrew Lunn
@ 2021-08-10 13:05     ` Ido Schimmel
  2021-08-10 13:54       ` Jakub Kicinski
  0 siblings, 1 reply; 39+ messages in thread
From: Ido Schimmel @ 2021-08-10 13:05 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, davem, kuba, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

On Mon, Aug 09, 2021 at 09:13:47PM +0200, Andrew Lunn wrote:
> On Mon, Aug 09, 2021 at 01:21:46PM +0300, Ido Schimmel wrote:
> > From: Ido Schimmel <idosch@nvidia.com>
> > 
> > Add a new ethtool message, 'ETHTOOL_MSG_MODULE_RESET_ACT', which allows
> > user space to request a reset of transceiver modules. A successful reset
> > results in a notification being emitted to user space in the form of a
> > 'ETHTOOL_MSG_MODULE_RESET_NTF' message.
> > 
> > Reset can be performed by either asserting the relevant hardware signal
> > ("Reset" in CMIS / "ResetL" in SFF-8636) or by writing to the relevant
> > reset bit in the module's EEPROM (page 00h, byte 26, bit 3 in CMIS /
> > page 00h, byte 93, bit 7 in SFF-8636).
> > 
> > Reset is useful in order to allow a module to transition out of a fault
> > state. From section 6.3.2.12 in CMIS 5.0: "Except for a power cycle, the
> > only exit path from the ModuleFault state is to perform a module reset
> > by taking an action that causes the ResetS transition signal to become
> > TRUE (see Table 6-11)".
> > 
> > To avoid changes to the operational state of the device, reset can only
> > be performed when the device is administratively down.
> > 
> > Example usage:
> > 
> >  # ethtool --reset-module swp11
> >  netlink error: Cannot reset module when port is administratively up
> >  netlink error: Invalid argument
> > 
> >  # ip link set dev swp11 down
> > 
> >  # ethtool --reset-module swp11
> > 
> > Monitor notifications:
> > 
> >  $ ethtool --monitor
> >  listening...
> > 
> >  Module reset done for swp11
> 
> Again, i'm wondering, why is user space doing the reset? Can you think
> of any other piece of hardware where Linux relies on user space
> performing a reset before the kernel can properly use it?
> 
> How long does a reset take? Table 10-1 says the reset pulse must be
> 10uS and table 10-2 says the reset should not take longer than
> 2000ms.

Takes about 1.5ms to get an ACK on the reset request and another few
seconds to ensure module is in a valid operational state (will remove
RTNL in next version).

> So maybe reset it on ifup if it is in a bad state?

We can have multiple ports (split) using the same module and in CMIS
each data path is controlled by a different state machine. Given the
complexity of these modules and possible faults, it is possible to
imagine a situation in which a few ports are fine and the rest are
unable to obtain a carrier.

Resetting the module on ifup of swp1s0 is not intuitive and it shouldn't
affect other split ports (e.g., swp1s1). With the dedicated reset
command we have the ability to enforce all the required restrictions
from the start instead of changing the behavior of existing commands.

> I assume the driver/firmware is monitoring the SFP and if it goes into
> a state which requires a reset it indicates carrier down? Wasn't there
> some patches which added link down reasons? It would make sense to add
> enum ethtool_link_ext_substate_sfp_fault? You can then use ethtool to
> see what state the module is in, and a down/ip should reset it?

I will look into extending the interface with more reasons and parse the
CMIS ModuleFaultCause (see table 8-15) in ethtool(8).

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-10  7:26     ` Ido Schimmel
@ 2021-08-10 13:52       ` Andrew Lunn
  2021-08-10 13:59         ` Jakub Kicinski
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Lunn @ 2021-08-10 13:52 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, kuba, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

> The transition from low power to high power can take a few seconds with
> QSFP/QSFP-DD and it's likely to only get longer with future / more
> complex modules. Therefore, to reduce link-up time, the firmware
> automatically transitions modules to high power mode.
> 
> There is obviously a trade-off here between power consumption and
> link-up time. My understanding is that Mellanox is not the only vendor
> favoring shorter link-up times as users have the ability to control the
> low power mode of the modules in other implementations.
> 
> Regarding "why do we need user space involved?", by default, it does not
> need to be involved (the system works without this API), but if it wants
> to reduce the power consumption by setting unused modules to low power
> mode, then it will need to use this API.

O.K. Thanks for the better explanation. Some of this should go into
the commit message.

I suggest it gets a different name and semantics, to avoid
confusion. I think we should consider this the default power mode for
when the link is administratively down, rather than direct control
over the modules power mode. The driver should transition the module
to this setting on link down, be it high power or low power. That
saves a lot of complexity, since i assume you currently need a udev
script or something which sets it to low power mode on link down,
where as you can avoid this be configuring the default and let the
driver do it.

I also wonder if a hierarchy is needed? You can set the default for
the switch, and then override is per module? I _guess_ most users will
decide at a switch level they want to save power and pay the penalty
over longer link up times. But then we have the question, is it an
ethtool option, or a devlink parameter?

	Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 2/8] ethtool: Add ability to reset transceiver modules
  2021-08-10 13:05     ` Ido Schimmel
@ 2021-08-10 13:54       ` Jakub Kicinski
  2021-08-10 18:15         ` Ido Schimmel
  0 siblings, 1 reply; 39+ messages in thread
From: Jakub Kicinski @ 2021-08-10 13:54 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Andrew Lunn, netdev, davem, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

On Tue, 10 Aug 2021 16:05:07 +0300 Ido Schimmel wrote:
> > Again, i'm wondering, why is user space doing the reset? Can you think
> > of any other piece of hardware where Linux relies on user space
> > performing a reset before the kernel can properly use it?
> > 
> > How long does a reset take? Table 10-1 says the reset pulse must be
> > 10uS and table 10-2 says the reset should not take longer than
> > 2000ms.  
> 
> Takes about 1.5ms to get an ACK on the reset request and another few
> seconds to ensure module is in a valid operational state (will remove
> RTNL in next version).

Hm. RTNL-lock-less ethtool ops are a little concerning. The devlink
locking was much complicated by the unclear locking rules. Can we keep
ethtool simple and put this functionality in a different API or make
the reset async?

> > So maybe reset it on ifup if it is in a bad state?  
> 
> We can have multiple ports (split) using the same module and in CMIS
> each data path is controlled by a different state machine. Given the
> complexity of these modules and possible faults, it is possible to
> imagine a situation in which a few ports are fine and the rest are
> unable to obtain a carrier.
> 
> Resetting the module on ifup of swp1s0 is not intuitive and it shouldn't
> affect other split ports (e.g., swp1s1). With the dedicated reset
> command we have the ability to enforce all the required restrictions
> from the start instead of changing the behavior of existing commands.

Sounds similar to what ethtool --reset was trying to address (upper
16 bits). Could we reuse that? Whether its a SFP or other part of the
port being reset may not be entirely important to the user, so perhaps
it's not a bad idea to abstract that away and stick to "reset levels"?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-10 13:52       ` Andrew Lunn
@ 2021-08-10 13:59         ` Jakub Kicinski
  2021-08-10 20:46           ` Ido Schimmel
  2021-08-10 21:38           ` Keller, Jacob E
  0 siblings, 2 replies; 39+ messages in thread
From: Jakub Kicinski @ 2021-08-10 13:59 UTC (permalink / raw)
  To: Andrew Lunn, Keller, Jacob E
  Cc: Ido Schimmel, netdev, davem, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:
> > The transition from low power to high power can take a few seconds with
> > QSFP/QSFP-DD and it's likely to only get longer with future / more
> > complex modules. Therefore, to reduce link-up time, the firmware
> > automatically transitions modules to high power mode.
> > 
> > There is obviously a trade-off here between power consumption and
> > link-up time. My understanding is that Mellanox is not the only vendor
> > favoring shorter link-up times as users have the ability to control the
> > low power mode of the modules in other implementations.
> > 
> > Regarding "why do we need user space involved?", by default, it does not
> > need to be involved (the system works without this API), but if it wants
> > to reduce the power consumption by setting unused modules to low power
> > mode, then it will need to use this API.  
> 
> O.K. Thanks for the better explanation. Some of this should go into
> the commit message.
> 
> I suggest it gets a different name and semantics, to avoid
> confusion. I think we should consider this the default power mode for
> when the link is administratively down, rather than direct control
> over the modules power mode. The driver should transition the module
> to this setting on link down, be it high power or low power. That
> saves a lot of complexity, since i assume you currently need a udev
> script or something which sets it to low power mode on link down,
> where as you can avoid this be configuring the default and let the
> driver do it.

Good point. And actually NICs have similar knobs, exposed via ethtool
priv flags today. Intel NICs for example. Maybe we should create a
"really power the port down policy" API?

Jake do you know what the use cases for Intel are? Are they SFP, MAC,
or NC-SI related?

> I also wonder if a hierarchy is needed? You can set the default for
> the switch, and then override is per module? I _guess_ most users will
> decide at a switch level they want to save power and pay the penalty
> over longer link up times. But then we have the question, is it an
> ethtool option, or a devlink parameter?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 2/8] ethtool: Add ability to reset transceiver modules
  2021-08-10 13:54       ` Jakub Kicinski
@ 2021-08-10 18:15         ` Ido Schimmel
  2021-08-10 18:58           ` Andrew Lunn
  2021-08-10 19:00           ` Jakub Kicinski
  0 siblings, 2 replies; 39+ messages in thread
From: Ido Schimmel @ 2021-08-10 18:15 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, netdev, davem, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

On Tue, Aug 10, 2021 at 06:54:23AM -0700, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 16:05:07 +0300 Ido Schimmel wrote:
> > > Again, i'm wondering, why is user space doing the reset? Can you think
> > > of any other piece of hardware where Linux relies on user space
> > > performing a reset before the kernel can properly use it?
> > > 
> > > How long does a reset take? Table 10-1 says the reset pulse must be
> > > 10uS and table 10-2 says the reset should not take longer than
> > > 2000ms.  
> > 
> > Takes about 1.5ms to get an ACK on the reset request and another few
> > seconds to ensure module is in a valid operational state (will remove
> > RTNL in next version).
> 
> Hm. RTNL-lock-less ethtool ops are a little concerning. The devlink
> locking was much complicated by the unclear locking rules. Can we keep
> ethtool simple and put this functionality in a different API or make
> the reset async?

I thought there are already RTNL-lock-less ethtool ops, but maybe I
imagined it. Given that we also want to support firmware update on
modules and that user space might want to update several modules
simultaneously, do you have a suggestion on how to handle it from
locking perspective? The ethtool netlink backend has parallel ops, but
RTNL is a problem. Firmware flashing is currently synchronous in both
ethtool and devlink, but the latter does not hold RTNL.

> 
> > > So maybe reset it on ifup if it is in a bad state?  
> > 
> > We can have multiple ports (split) using the same module and in CMIS
> > each data path is controlled by a different state machine. Given the
> > complexity of these modules and possible faults, it is possible to
> > imagine a situation in which a few ports are fine and the rest are
> > unable to obtain a carrier.
> > 
> > Resetting the module on ifup of swp1s0 is not intuitive and it shouldn't
> > affect other split ports (e.g., swp1s1). With the dedicated reset
> > command we have the ability to enforce all the required restrictions
> > from the start instead of changing the behavior of existing commands.
> 
> Sounds similar to what ethtool --reset was trying to address (upper
> 16 bits). Could we reuse that? Whether its a SFP or other part of the
> port being reset may not be entirely important to the user, so perhaps
> it's not a bad idea to abstract that away and stick to "reset levels"?

Wasn't aware of this API. Looks like it is only implemented by a few
drivers, but man page says "phy    Transceiver/PHY", so I think we can
reuse it.

What do you mean by "reset levels"? The split between shared /
dedicated?

Just to make sure I understand, you suggest the following semantics?

# ethtool --reset swp1s0 phy
Error since module is used by several ports (split)

# ethtool --reset swp1s0 phy-shared
OK

# ethtool --reset swp1 phy
OK (no split)

# ethtool --reset swp1 phy-shared
OK

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 2/8] ethtool: Add ability to reset transceiver modules
  2021-08-10 18:15         ` Ido Schimmel
@ 2021-08-10 18:58           ` Andrew Lunn
  2021-08-10 19:00           ` Jakub Kicinski
  1 sibling, 0 replies; 39+ messages in thread
From: Andrew Lunn @ 2021-08-10 18:58 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jakub Kicinski, netdev, davem, mkubecek, pali, vadimp, mlxsw,
	Ido Schimmel

> I thought there are already RTNL-lock-less ethtool ops, but maybe I
> imagined it. Given that we also want to support firmware update on
> modules and that user space might want to update several modules
> simultaneously, do you have a suggestion on how to handle it from
> locking perspective?

I had a similar problem for Ethernet cable testing. It takes a few
seconds to perform a cable test, and you don't want to hold the RTNL
for that, if you can help it. I made changes to the PHY state machine,
so it has an additional state, indicating a cable test is in
operation. The ethtool netlink op simply starts the cable test and
then returns. Once the cable test is complete, it async reports the
results to user space.

So for module upgrade, you probably need to add a per module state
machine. Changes to the state need to hold RTNL, but you can release
it between state changes.

   Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 2/8] ethtool: Add ability to reset transceiver modules
  2021-08-10 18:15         ` Ido Schimmel
  2021-08-10 18:58           ` Andrew Lunn
@ 2021-08-10 19:00           ` Jakub Kicinski
  2021-08-10 19:28             ` Andrew Lunn
  1 sibling, 1 reply; 39+ messages in thread
From: Jakub Kicinski @ 2021-08-10 19:00 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Andrew Lunn, netdev, davem, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

On Tue, 10 Aug 2021 21:15:45 +0300 Ido Schimmel wrote:
> On Tue, Aug 10, 2021 at 06:54:23AM -0700, Jakub Kicinski wrote:
> > On Tue, 10 Aug 2021 16:05:07 +0300 Ido Schimmel wrote:  
> > > Takes about 1.5ms to get an ACK on the reset request and another few
> > > seconds to ensure module is in a valid operational state (will remove
> > > RTNL in next version).  
> > 
> > Hm. RTNL-lock-less ethtool ops are a little concerning. The devlink
> > locking was much complicated by the unclear locking rules. Can we keep
> > ethtool simple and put this functionality in a different API or make
> > the reset async?  
> 
> I thought there are already RTNL-lock-less ethtool ops, but maybe I
> imagined it. Given that we also want to support firmware update on
> modules and that user space might want to update several modules
> simultaneously, do you have a suggestion on how to handle it from
> locking perspective?

Hm, flashing is harder than reset. We can't unbind the driver while
it's in progress. I don't have any ready solution in mind, but I'd 
like to make sure the locking is clear and hard to get wrong. Maybe 
we could have a mix of ops, one called for "preparing" the flashing
called under rtnl and another for "commit" with "unlocked" in the name.
Drivers which don't want to deal with dropping rtnl lock can just do
everything in the first stage? Perhaps Andrew has better ideas, I'm
just spit-balling. Presumably there are already locks at play, locks
we would have to take in the case where Linux manages the PHY. Maybe
they dictate an architecture?

> The ethtool netlink backend has parallel ops, but
> RTNL is a problem. Firmware flashing is currently synchronous in both
> ethtool and devlink, but the latter does not hold RTNL.

Yeah, drivers dropping rtnl_lock mid way thru the ethtool flashing op
was one of my main motivations for moving it into devlink.

> > > We can have multiple ports (split) using the same module and in CMIS
> > > each data path is controlled by a different state machine. Given the
> > > complexity of these modules and possible faults, it is possible to
> > > imagine a situation in which a few ports are fine and the rest are
> > > unable to obtain a carrier.
> > > 
> > > Resetting the module on ifup of swp1s0 is not intuitive and it shouldn't
> > > affect other split ports (e.g., swp1s1). With the dedicated reset
> > > command we have the ability to enforce all the required restrictions
> > > from the start instead of changing the behavior of existing commands.  
> > 
> > Sounds similar to what ethtool --reset was trying to address (upper
> > 16 bits). Could we reuse that? Whether its a SFP or other part of the
> > port being reset may not be entirely important to the user, so perhaps
> > it's not a bad idea to abstract that away and stick to "reset levels"?  
> 
> Wasn't aware of this API. Looks like it is only implemented by a few
> drivers, but man page says "phy    Transceiver/PHY", so I think we can
> reuse it.
> 
> What do you mean by "reset levels"? The split between shared /
> dedicated?

Indeed.

> Just to make sure I understand, you suggest the following semantics?
> 
> # ethtool --reset swp1s0 phy
> Error since module is used by several ports (split)
> 
> # ethtool --reset swp1s0 phy-shared
> OK
> 
> # ethtool --reset swp1 phy
> OK (no split)
> 
> # ethtool --reset swp1 phy-shared
> OK

Exactly.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 2/8] ethtool: Add ability to reset transceiver modules
  2021-08-10 19:00           ` Jakub Kicinski
@ 2021-08-10 19:28             ` Andrew Lunn
  2021-08-10 20:50               ` Ido Schimmel
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Lunn @ 2021-08-10 19:28 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Ido Schimmel, netdev, davem, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

> Hm, flashing is harder than reset. We can't unbind the driver while
> it's in progress. I don't have any ready solution in mind, but I'd 
> like to make sure the locking is clear and hard to get wrong. Maybe 
> we could have a mix of ops, one called for "preparing" the flashing
> called under rtnl and another for "commit" with "unlocked" in the name.
> Drivers which don't want to deal with dropping rtnl lock can just do
> everything in the first stage? Perhaps Andrew has better ideas, I'm
> just spit-balling. Presumably there are already locks at play, locks
> we would have to take in the case where Linux manages the PHY. Maybe
> they dictate an architecture?

I don't think the way linux manages PHYs dictates the
architecture. PHY cable test requires that the link is
administratively up, so the PHY state machine is in play. It
transitions into a testing state when cable test is started, and when
the test is finished, it resets the PHY to put it back into running
state. If you down the interface while the cable test is running, it
aborts the cable test, and then downs the PHY.

Flashing firmware is a bit different. You need to ensure the interface
is down. And i guess that gets interesting with split modules. You
really should not abort an upgrade because the user wants to up the
interface. So -EBUSY to open() seems like the best option, based on
the state of the SFP state machine.

I suspect you are going to need a kernel thread to do the real
work. So your "prepare" netlink op would pass the name of the firmware
file. Some basic validation would be performed, that all the needed
interfaces are down etc, and then the netlink OP would return. The
thread then uses request_firmware() to get access to the firmware, and
program it. Once complete, or on error, it can async notify user space
that it is sorry, your module is toast, or firmware upgrade was
successful.

This is just throwing out ideas...

     Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-10 13:59         ` Jakub Kicinski
@ 2021-08-10 20:46           ` Ido Schimmel
  2021-08-10 22:00             ` Keller, Jacob E
  2021-08-10 22:05             ` Jakub Kicinski
  2021-08-10 21:38           ` Keller, Jacob E
  1 sibling, 2 replies; 39+ messages in thread
From: Ido Schimmel @ 2021-08-10 20:46 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, Keller, Jacob E, netdev, davem, mkubecek, pali,
	vadimp, mlxsw, Ido Schimmel

On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:
> > > The transition from low power to high power can take a few seconds with
> > > QSFP/QSFP-DD and it's likely to only get longer with future / more
> > > complex modules. Therefore, to reduce link-up time, the firmware
> > > automatically transitions modules to high power mode.
> > > 
> > > There is obviously a trade-off here between power consumption and
> > > link-up time. My understanding is that Mellanox is not the only vendor
> > > favoring shorter link-up times as users have the ability to control the
> > > low power mode of the modules in other implementations.
> > > 
> > > Regarding "why do we need user space involved?", by default, it does not
> > > need to be involved (the system works without this API), but if it wants
> > > to reduce the power consumption by setting unused modules to low power
> > > mode, then it will need to use this API.  
> > 
> > O.K. Thanks for the better explanation. Some of this should go into
> > the commit message.
> > 
> > I suggest it gets a different name and semantics, to avoid
> > confusion. I think we should consider this the default power mode for
> > when the link is administratively down, rather than direct control
> > over the modules power mode. The driver should transition the module
> > to this setting on link down, be it high power or low power. That
> > saves a lot of complexity, since i assume you currently need a udev
> > script or something which sets it to low power mode on link down,
> > where as you can avoid this be configuring the default and let the
> > driver do it.
> 
> Good point. And actually NICs have similar knobs, exposed via ethtool
> priv flags today. Intel NICs for example. Maybe we should create a
> "really power the port down policy" API?

See below about Intel. I'm not sure it's the same thing...

I'm against adding a vague "really power the port down policy" API. The
API proposed in the patch is well-defined, its implementation is
documented in standards, its implications are clear and we offer APIs
that give user space full observability into its operation.

A vague API means that it is going to be abused and user space will get
different results over different implementations. After reading the
*commit messages* about the private flags, I'm not sure what the flags
really do, what is their true motivation, implications or how do I get
observability into their operation. I'm not too hopeful about the user
documentation.

Also, like I mentioned in the cover letter, given the complexity of
these modules and as they become more common, it is likely that we will
need to extend the API to control more parameters and expose more
diagnostic information. I would really like to keep it clean and
contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over
different APIs.

> 
> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
> or NC-SI related?

I went through all the Intel drivers that implement these operations and
I believe you are talking about these commits:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1

There isn't too much information about the motivation, but maybe it has
something to do with multi-host controllers where you want to prevent
one host from taking the physical link down for all the other hosts
sharing it? I remember such issues with mlx5.

> 
> > I also wonder if a hierarchy is needed? You can set the default for
> > the switch, and then override is per module? I _guess_ most users will
> > decide at a switch level they want to save power and pay the penalty
> > over longer link up times. But then we have the question, is it an
> > ethtool option, or a devlink parameter?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 2/8] ethtool: Add ability to reset transceiver modules
  2021-08-10 19:28             ` Andrew Lunn
@ 2021-08-10 20:50               ` Ido Schimmel
  0 siblings, 0 replies; 39+ messages in thread
From: Ido Schimmel @ 2021-08-10 20:50 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, netdev, davem, mkubecek, pali, vadimp, mlxsw,
	Ido Schimmel

On Tue, Aug 10, 2021 at 09:28:08PM +0200, Andrew Lunn wrote:
> > Hm, flashing is harder than reset. We can't unbind the driver while
> > it's in progress. I don't have any ready solution in mind, but I'd 
> > like to make sure the locking is clear and hard to get wrong. Maybe 
> > we could have a mix of ops, one called for "preparing" the flashing
> > called under rtnl and another for "commit" with "unlocked" in the name.
> > Drivers which don't want to deal with dropping rtnl lock can just do
> > everything in the first stage? Perhaps Andrew has better ideas, I'm
> > just spit-balling. Presumably there are already locks at play, locks
> > we would have to take in the case where Linux manages the PHY. Maybe
> > they dictate an architecture?
> 
> I don't think the way linux manages PHYs dictates the
> architecture. PHY cable test requires that the link is
> administratively up, so the PHY state machine is in play. It
> transitions into a testing state when cable test is started, and when
> the test is finished, it resets the PHY to put it back into running
> state. If you down the interface while the cable test is running, it
> aborts the cable test, and then downs the PHY.
> 
> Flashing firmware is a bit different. You need to ensure the interface
> is down. And i guess that gets interesting with split modules. You
> really should not abort an upgrade because the user wants to up the
> interface. So -EBUSY to open() seems like the best option, based on
> the state of the SFP state machine.
> 
> I suspect you are going to need a kernel thread to do the real
> work. So your "prepare" netlink op would pass the name of the firmware
> file. Some basic validation would be performed, that all the needed
> interfaces are down etc, and then the netlink OP would return. The
> thread then uses request_firmware() to get access to the firmware, and
> program it. Once complete, or on error, it can async notify user space
> that it is sorry, your module is toast, or firmware upgrade was
> successful.
> 
> This is just throwing out ideas...

Thanks Andrew and Jakub. I will look into these suggestions more closely
when I start working on modules firmware update.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-10 13:59         ` Jakub Kicinski
  2021-08-10 20:46           ` Ido Schimmel
@ 2021-08-10 21:38           ` Keller, Jacob E
  1 sibling, 0 replies; 39+ messages in thread
From: Keller, Jacob E @ 2021-08-10 21:38 UTC (permalink / raw)
  To: Jakub Kicinski, Andrew Lunn
  Cc: Ido Schimmel, netdev, davem, mkubecek, pali, vadimp, mlxsw, Ido Schimmel



> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Tuesday, August 10, 2021 7:00 AM
> To: Andrew Lunn <andrew@lunn.ch>; Keller, Jacob E <jacob.e.keller@intel.com>
> Cc: Ido Schimmel <idosch@idosch.org>; netdev@vger.kernel.org;
> davem@davemloft.net; mkubecek@suse.cz; pali@kernel.org;
> vadimp@nvidia.com; mlxsw@nvidia.com; Ido Schimmel <idosch@nvidia.com>
> Subject: Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver
> modules' low power mode
> 
> On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:
> > > The transition from low power to high power can take a few seconds with
> > > QSFP/QSFP-DD and it's likely to only get longer with future / more
> > > complex modules. Therefore, to reduce link-up time, the firmware
> > > automatically transitions modules to high power mode.
> > >
> > > There is obviously a trade-off here between power consumption and
> > > link-up time. My understanding is that Mellanox is not the only vendor
> > > favoring shorter link-up times as users have the ability to control the
> > > low power mode of the modules in other implementations.
> > >
> > > Regarding "why do we need user space involved?", by default, it does not
> > > need to be involved (the system works without this API), but if it wants
> > > to reduce the power consumption by setting unused modules to low power
> > > mode, then it will need to use this API.
> >
> > O.K. Thanks for the better explanation. Some of this should go into
> > the commit message.
> >
> > I suggest it gets a different name and semantics, to avoid
> > confusion. I think we should consider this the default power mode for
> > when the link is administratively down, rather than direct control
> > over the modules power mode. The driver should transition the module
> > to this setting on link down, be it high power or low power. That
> > saves a lot of complexity, since i assume you currently need a udev
> > script or something which sets it to low power mode on link down,
> > where as you can avoid this be configuring the default and let the
> > driver do it.
> 
> Good point. And actually NICs have similar knobs, exposed via ethtool
> priv flags today. Intel NICs for example. Maybe we should create a
> "really power the port down policy" API?
> 
> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
> or NC-SI related?


Offhand I don't know. I think we have some requirements documents I can look up. I'll try to get back to you soon if I can find any further information. (Yes, I wish the commit messages gave stronger motivation too...)

Thanks,
Jake

> 
> > I also wonder if a hierarchy is needed? You can set the default for
> > the switch, and then override is per module? I _guess_ most users will
> > decide at a switch level they want to save power and pay the penalty
> > over longer link up times. But then we have the question, is it an
> > ethtool option, or a devlink parameter?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-10 20:46           ` Ido Schimmel
@ 2021-08-10 22:00             ` Keller, Jacob E
  2021-08-10 22:06               ` Jakub Kicinski
  2021-08-10 22:05             ` Jakub Kicinski
  1 sibling, 1 reply; 39+ messages in thread
From: Keller, Jacob E @ 2021-08-10 22:00 UTC (permalink / raw)
  To: Ido Schimmel, Jakub Kicinski
  Cc: Andrew Lunn, netdev, davem, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

On 8/10/2021 1:46 PM, Ido Schimmel wrote:
> On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote:
>> On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:
>>>> The transition from low power to high power can take a few seconds with
>>>> QSFP/QSFP-DD and it's likely to only get longer with future / more
>>>> complex modules. Therefore, to reduce link-up time, the firmware
>>>> automatically transitions modules to high power mode.
>>>>
>>>> There is obviously a trade-off here between power consumption and
>>>> link-up time. My understanding is that Mellanox is not the only vendor
>>>> favoring shorter link-up times as users have the ability to control the
>>>> low power mode of the modules in other implementations.
>>>>
>>>> Regarding "why do we need user space involved?", by default, it does not
>>>> need to be involved (the system works without this API), but if it wants
>>>> to reduce the power consumption by setting unused modules to low power
>>>> mode, then it will need to use this API.  
>>>
>>> O.K. Thanks for the better explanation. Some of this should go into
>>> the commit message.
>>>
>>> I suggest it gets a different name and semantics, to avoid
>>> confusion. I think we should consider this the default power mode for
>>> when the link is administratively down, rather than direct control
>>> over the modules power mode. The driver should transition the module
>>> to this setting on link down, be it high power or low power. That
>>> saves a lot of complexity, since i assume you currently need a udev
>>> script or something which sets it to low power mode on link down,
>>> where as you can avoid this be configuring the default and let the
>>> driver do it.
>>
>> Good point. And actually NICs have similar knobs, exposed via ethtool
>> priv flags today. Intel NICs for example. Maybe we should create a
>> "really power the port down policy" API?
> 
> See below about Intel. I'm not sure it's the same thing...
> 
> I'm against adding a vague "really power the port down policy" API. The
> API proposed in the patch is well-defined, its implementation is
> documented in standards, its implications are clear and we offer APIs
> that give user space full observability into its operation.
> 
> A vague API means that it is going to be abused and user space will get
> different results over different implementations. After reading the
> *commit messages* about the private flags, I'm not sure what the flags
> really do, what is their true motivation, implications or how do I get
> observability into their operation. I'm not too hopeful about the user
> documentation.
> 
> Also, like I mentioned in the cover letter, given the complexity of
> these modules and as they become more common, it is likely that we will
> need to extend the API to control more parameters and expose more
> diagnostic information. I would really like to keep it clean and
> contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over
> different APIs.
> 
>>
>> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
>> or NC-SI related?
> 
> I went through all the Intel drivers that implement these operations and
> I believe you are talking about these commits:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1
> 
> There isn't too much information about the motivation, but maybe it has
> something to do with multi-host controllers where you want to prevent
> one host from taking the physical link down for all the other hosts
> sharing it? I remember such issues with mlx5.
> 

Ok, I found some more information here. The primary motivation of the
changes in the i40e and ice drivers is from customer requests asking to
have the link go down when the port is administratively disabled. This
is because if the link is down then the switch on the other side will
see the port not having link and will stop trying to send traffic to it.

As far as I can tell, the reason its a flag is because some users wanted
the behavior the other way.

I'm not sure it's really related to the behavior here.

For what it's worth, I'm in favor of containing things like this into
ethtool as well.

Thanks,
Jake

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-10 20:46           ` Ido Schimmel
  2021-08-10 22:00             ` Keller, Jacob E
@ 2021-08-10 22:05             ` Jakub Kicinski
  2021-08-10 22:51               ` Andrew Lunn
  2021-08-11 11:33               ` Ido Schimmel
  1 sibling, 2 replies; 39+ messages in thread
From: Jakub Kicinski @ 2021-08-10 22:05 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Andrew Lunn, Keller, Jacob E, netdev, davem, mkubecek, pali,
	vadimp, mlxsw, Ido Schimmel

On Tue, 10 Aug 2021 23:46:28 +0300 Ido Schimmel wrote:
> On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote:
> > On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:  
> > > O.K. Thanks for the better explanation. Some of this should go into
> > > the commit message.
> > > 
> > > I suggest it gets a different name and semantics, to avoid
> > > confusion. I think we should consider this the default power mode for
> > > when the link is administratively down, rather than direct control
> > > over the modules power mode. The driver should transition the module
> > > to this setting on link down, be it high power or low power. That
> > > saves a lot of complexity, since i assume you currently need a udev
> > > script or something which sets it to low power mode on link down,
> > > where as you can avoid this be configuring the default and let the
> > > driver do it.  
> > 
> > Good point. And actually NICs have similar knobs, exposed via ethtool
> > priv flags today. Intel NICs for example. Maybe we should create a
> > "really power the port down policy" API?  
> 
> See below about Intel. I'm not sure it's the same thing...
> 
> I'm against adding a vague "really power the port down policy" API. The
> API proposed in the patch is well-defined, its implementation is
> documented in standards, its implications are clear and we offer APIs
> that give user space full observability into its operation.
> 
> A vague API means that it is going to be abused and user space will get
> different results over different implementations. After reading the
> *commit messages* about the private flags, I'm not sure what the flags
> really do, what is their true motivation, implications or how do I get
> observability into their operation. I'm not too hopeful about the user
> documentation.
> 
> Also, like I mentioned in the cover letter, given the complexity of
> these modules and as they become more common, it is likely that we will
> need to extend the API to control more parameters and expose more
> diagnostic information. I would really like to keep it clean and
> contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over
> different APIs.

The patch is well defined but it doesn't provide user with the answer
to the question "why is the SFP still up if I asked it to be down?"
It's good to match specs closely but Linux may need to reconcile
multiple policies.

IIUC if Intel decides to keep the SFP up for "other" reasons the
situation will look like this:

 $ ethtool --show-module eth0
 Module parameters for eth0:
 low-power true

 # ethtool -m eth0
 Module State                              : 0x03 (ModuleReady)
 LowPwrAllowRequestHW                      : Off
 LowPwrRequestSW                           : Off


IOW the low-power mode is a way for user to express preference to
shut down/keep up the SFP, but it's not necessarily going to be
the only "policy" that matters. If other policies (e.g. NC-SI) express
preference to keep the interface up it will stay up, right?

The LowPwrRequestSW is not directly controlled by low-power && IF_UP.

What I had in mind was something along the lines of a bitmap of reasons
which are allowed to keep the interface up, and for your use case
the reason would be something like SFP_ALWAYS_ON, but other reasons
(well defined) may also keep the ifc up.

That's just to explain what I meant, if it's going to be clear to
everyone that low-power != LowPwrRequestSW I'm fine either way.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-10 22:00             ` Keller, Jacob E
@ 2021-08-10 22:06               ` Jakub Kicinski
  2021-08-10 22:18                 ` Keller, Jacob E
                                   ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Jakub Kicinski @ 2021-08-10 22:06 UTC (permalink / raw)
  To: Keller, Jacob E
  Cc: Ido Schimmel, Andrew Lunn, netdev, davem, mkubecek, pali, vadimp,
	mlxsw, Ido Schimmel

On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:
> >> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
> >> or NC-SI related?  
> > 
> > I went through all the Intel drivers that implement these operations and
> > I believe you are talking about these commits:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1
> > 
> > There isn't too much information about the motivation, but maybe it has
> > something to do with multi-host controllers where you want to prevent
> > one host from taking the physical link down for all the other hosts
> > sharing it? I remember such issues with mlx5.
> >   
> 
> Ok, I found some more information here. The primary motivation of the
> changes in the i40e and ice drivers is from customer requests asking to
> have the link go down when the port is administratively disabled. This
> is because if the link is down then the switch on the other side will
> see the port not having link and will stop trying to send traffic to it.
> 
> As far as I can tell, the reason its a flag is because some users wanted
> the behavior the other way.
> 
> I'm not sure it's really related to the behavior here.
> 
> For what it's worth, I'm in favor of containing things like this into
> ethtool as well.

I think the question was the inverse - why not always shut down the
port if the interface is brought down?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-10 22:06               ` Jakub Kicinski
@ 2021-08-10 22:18                 ` Keller, Jacob E
  2021-08-10 22:24                 ` Keller, Jacob E
  2021-08-10 22:31                 ` Andrew Lunn
  2 siblings, 0 replies; 39+ messages in thread
From: Keller, Jacob E @ 2021-08-10 22:18 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Ido Schimmel, Andrew Lunn, netdev, davem, mkubecek, pali, vadimp,
	mlxsw, Ido Schimmel

On 8/10/2021 3:06 PM, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:
>>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
>>>> or NC-SI related?  
>>>
>>> I went through all the Intel drivers that implement these operations and
>>> I believe you are talking about these commits:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1
>>>
>>> There isn't too much information about the motivation, but maybe it has
>>> something to do with multi-host controllers where you want to prevent
>>> one host from taking the physical link down for all the other hosts
>>> sharing it? I remember such issues with mlx5.
>>>   
>>
>> Ok, I found some more information here. The primary motivation of the
>> changes in the i40e and ice drivers is from customer requests asking to
>> have the link go down when the port is administratively disabled. This
>> is because if the link is down then the switch on the other side will
>> see the port not having link and will stop trying to send traffic to it.
>>
>> As far as I can tell, the reason its a flag is because some users wanted
>> the behavior the other way.
>>
>> I'm not sure it's really related to the behavior here.
>>
>> For what it's worth, I'm in favor of containing things like this into
>> ethtool as well.
> 
> I think the question was the inverse - why not always shut down the
> port if the interface is brought down?
> 

That... is a better question yes. Unfortunately so far I haven't found
any argument for not doing this. Only a bit about many requests to have
this behavior. It might just be inertia to maintain current behavior by
default...

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-10 22:06               ` Jakub Kicinski
  2021-08-10 22:18                 ` Keller, Jacob E
@ 2021-08-10 22:24                 ` Keller, Jacob E
  2021-08-10 22:31                 ` Andrew Lunn
  2 siblings, 0 replies; 39+ messages in thread
From: Keller, Jacob E @ 2021-08-10 22:24 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Ido Schimmel, Andrew Lunn, netdev, davem, mkubecek, pali, vadimp,
	mlxsw, Ido Schimmel

On 8/10/2021 3:06 PM, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:
>>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
>>>> or NC-SI related?  
>>>
>>> I went through all the Intel drivers that implement these operations and
>>> I believe you are talking about these commits:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1
>>>
>>> There isn't too much information about the motivation, but maybe it has
>>> something to do with multi-host controllers where you want to prevent
>>> one host from taking the physical link down for all the other hosts
>>> sharing it? I remember such issues with mlx5.
>>>   
>>
>> Ok, I found some more information here. The primary motivation of the
>> changes in the i40e and ice drivers is from customer requests asking to
>> have the link go down when the port is administratively disabled. This
>> is because if the link is down then the switch on the other side will
>> see the port not having link and will stop trying to send traffic to it.
>>
>> As far as I can tell, the reason its a flag is because some users wanted
>> the behavior the other way.
>>
>> I'm not sure it's really related to the behavior here.
>>
>> For what it's worth, I'm in favor of containing things like this into
>> ethtool as well.
> 
> I think the question was the inverse - why not always shut down the
> port if the interface is brought down?
> 

So far the best I've found after digging is that forcing total shutdown
causes manageability and VF traffic to stop. Other customers want this
traffic to continue even when the PF port is brought down.

Thanks,
Jake

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-10 22:06               ` Jakub Kicinski
  2021-08-10 22:18                 ` Keller, Jacob E
  2021-08-10 22:24                 ` Keller, Jacob E
@ 2021-08-10 22:31                 ` Andrew Lunn
  2021-08-11  0:38                   ` Keller, Jacob E
  2 siblings, 1 reply; 39+ messages in thread
From: Andrew Lunn @ 2021-08-10 22:31 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Keller, Jacob E, Ido Schimmel, netdev, davem, mkubecek, pali,
	vadimp, mlxsw, Ido Schimmel

On Tue, Aug 10, 2021 at 03:06:36PM -0700, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:
> > >> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
> > >> or NC-SI related?  
> > > 
> > > I went through all the Intel drivers that implement these operations and
> > > I believe you are talking about these commits:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1
> > > 
> > > There isn't too much information about the motivation, but maybe it has
> > > something to do with multi-host controllers where you want to prevent
> > > one host from taking the physical link down for all the other hosts
> > > sharing it? I remember such issues with mlx5.
> > >   
> > 
> > Ok, I found some more information here. The primary motivation of the
> > changes in the i40e and ice drivers is from customer requests asking to
> > have the link go down when the port is administratively disabled. This
> > is because if the link is down then the switch on the other side will
> > see the port not having link and will stop trying to send traffic to it.
> > 
> > As far as I can tell, the reason its a flag is because some users wanted
> > the behavior the other way.
> > 
> > I'm not sure it's really related to the behavior here.
> 
> I think the question was the inverse - why not always shut down the
> port if the interface is brought down?

Humm. Something does not seem right here. I would assume that when you
administratively configure the link down, the SERDES in the MAC would
stop sending anything. So the module has nothing to send. The link
peer SERDES then looses sync, and reports that upwards as carrier
lost.

Does the i40e and ice leave its SERDES running when the link is
configured down? Or is the switch FUBAR and does not consider SERDES
loss of sync as carrier down?

Ido's use case does seem to be different. The link is down. Do we want
to leave the module active, probably sending a bit stream of all 0,
maybe noise, or do we want to power the module down, so it does not
send anything at all.

     Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-10 22:05             ` Jakub Kicinski
@ 2021-08-10 22:51               ` Andrew Lunn
  2021-08-11 11:33               ` Ido Schimmel
  1 sibling, 0 replies; 39+ messages in thread
From: Andrew Lunn @ 2021-08-10 22:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Ido Schimmel, Keller, Jacob E, netdev, davem, mkubecek, pali,
	vadimp, mlxsw, Ido Schimmel

> The patch is well defined but it doesn't provide user with the answer
> to the question "why is the SFP still up if I asked it to be down?"

Can the user even see that the SFP is still up? Does ip link show
give:

3: eth42: <BROADCAST,MULTICAST,LOWER_UP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000

i.e. LOWER_UP despite DOWN?

Roopa added:

    rtnetlink: add support for protodown reason

Maybe we need the opposite as well?

      Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-10 22:31                 ` Andrew Lunn
@ 2021-08-11  0:38                   ` Keller, Jacob E
  0 siblings, 0 replies; 39+ messages in thread
From: Keller, Jacob E @ 2021-08-11  0:38 UTC (permalink / raw)
  To: Andrew Lunn, Jakub Kicinski
  Cc: Ido Schimmel, netdev, davem, mkubecek, pali, vadimp, mlxsw, Ido Schimmel

On 8/10/2021 3:31 PM, Andrew Lunn wrote:
> On Tue, Aug 10, 2021 at 03:06:36PM -0700, Jakub Kicinski wrote:
>> On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote:
>>>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC,
>>>>> or NC-SI related?  
>>>>
>>>> I went through all the Intel drivers that implement these operations and
>>>> I believe you are talking about these commits:
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1
>>>>
>>>> There isn't too much information about the motivation, but maybe it has
>>>> something to do with multi-host controllers where you want to prevent
>>>> one host from taking the physical link down for all the other hosts
>>>> sharing it? I remember such issues with mlx5.
>>>>   
>>>
>>> Ok, I found some more information here. The primary motivation of the
>>> changes in the i40e and ice drivers is from customer requests asking to
>>> have the link go down when the port is administratively disabled. This
>>> is because if the link is down then the switch on the other side will
>>> see the port not having link and will stop trying to send traffic to it.
>>>
>>> As far as I can tell, the reason its a flag is because some users wanted
>>> the behavior the other way.
>>>
>>> I'm not sure it's really related to the behavior here.
>>
>> I think the question was the inverse - why not always shut down the
>> port if the interface is brought down?
> 
> Humm. Something does not seem right here. I would assume that when you
> administratively configure the link down, the SERDES in the MAC would
> stop sending anything. So the module has nothing to send. The link
> peer SERDES then looses sync, and reports that upwards as carrier
> lost.
> 

Right....

> Does the i40e and ice leave its SERDES running when the link is
> configured down? Or is the switch FUBAR and does not consider SERDES
> loss of sync as carrier down?
>


It's not clear to me. I tried to test with the driver, and it looks like
upstream doesn't yet have the link-down-on-close merged into net-next,
so I grabbed our out-of-tree driver.

Interestingly, both ip link show and ethtool do not report link as up
when the device is closed (ip link set enp175f0s0 down).....

So... whatever difference link-down-on-close makes we're definitely not
reporting things up.


I don't have a setup to confirm anything else right now unfortunately,
but I suspect something is wrong with the implementation of
link-down-on-close (at the very least it seems like we should still be
reporting LOWER_UP.... no?)

Unless there's some other weirdness like with QSFP or other
multi-port-single-cable setups?

I even tried adding some VFs and I see that regardless of whether
link-down-on-close is set, I can see link up on the VF....

Hmmmmm.

> Ido's use case does seem to be different. The link is down. Do we want
> to leave the module active, probably sending a bit stream of all 0,
> maybe noise, or do we want to power the module down, so it does not
> send anything at all.
> 
>      Andrew
> 

Right, I think these two cases are very different.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-10 22:05             ` Jakub Kicinski
  2021-08-10 22:51               ` Andrew Lunn
@ 2021-08-11 11:33               ` Ido Schimmel
  2021-08-11 13:03                 ` Jakub Kicinski
  1 sibling, 1 reply; 39+ messages in thread
From: Ido Schimmel @ 2021-08-11 11:33 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, Keller, Jacob E, netdev, davem, mkubecek, pali,
	vadimp, mlxsw, Ido Schimmel

On Tue, Aug 10, 2021 at 03:05:44PM -0700, Jakub Kicinski wrote:
> On Tue, 10 Aug 2021 23:46:28 +0300 Ido Schimmel wrote:
> > On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote:
> > > On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote:  
> > > > O.K. Thanks for the better explanation. Some of this should go into
> > > > the commit message.
> > > > 
> > > > I suggest it gets a different name and semantics, to avoid
> > > > confusion. I think we should consider this the default power mode for
> > > > when the link is administratively down, rather than direct control
> > > > over the modules power mode. The driver should transition the module
> > > > to this setting on link down, be it high power or low power. That
> > > > saves a lot of complexity, since i assume you currently need a udev
> > > > script or something which sets it to low power mode on link down,
> > > > where as you can avoid this be configuring the default and let the
> > > > driver do it.  
> > > 
> > > Good point. And actually NICs have similar knobs, exposed via ethtool
> > > priv flags today. Intel NICs for example. Maybe we should create a
> > > "really power the port down policy" API?  
> > 
> > See below about Intel. I'm not sure it's the same thing...
> > 
> > I'm against adding a vague "really power the port down policy" API. The
> > API proposed in the patch is well-defined, its implementation is
> > documented in standards, its implications are clear and we offer APIs
> > that give user space full observability into its operation.
> > 
> > A vague API means that it is going to be abused and user space will get
> > different results over different implementations. After reading the
> > *commit messages* about the private flags, I'm not sure what the flags
> > really do, what is their true motivation, implications or how do I get
> > observability into their operation. I'm not too hopeful about the user
> > documentation.
> > 
> > Also, like I mentioned in the cover letter, given the complexity of
> > these modules and as they become more common, it is likely that we will
> > need to extend the API to control more parameters and expose more
> > diagnostic information. I would really like to keep it clean and
> > contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over
> > different APIs.
> 
> The patch is well defined but it doesn't provide user with the answer
> to the question "why is the SFP still up if I asked it to be down?"

But you didn't ask the module to be "down", you asked the MAC. See more
below.

> It's good to match specs closely but Linux may need to reconcile
> multiple policies.
> 
> IIUC if Intel decides to keep the SFP up for "other" reasons the
> situation will look like this:

Intel did not decide to keep the module "up", it decided to keep both
the MAC and the module up. If one of them was down, the peer would
notice it, but it didn't (according to Jake's reply). This is a very
problematic behavior as you are telling your peer that everything is
fine, but in practice you are dropping all of its packets. I hit the
exact same issue with mlx5 a few years ago and when I investigated the
reason for this behavior I was referred to multi-host systems where you
don't want one host to take down the shared link for all the rest. See:

https://www.mellanox.com/sites/default/files/doc-2020/sb-externally-connected-multi-host.pdf

> 
>  $ ethtool --show-module eth0
>  Module parameters for eth0:
>  low-power true
> 
>  # ethtool -m eth0
>  Module State                              : 0x03 (ModuleReady)
>  LowPwrAllowRequestHW                      : Off
>  LowPwrRequestSW                           : Off

This output is wrong. In Intel's case "ip link show" will report the
link as down (according to Jake's reply) despite the MAC being up. On
the other hand, the output of "ethtool --show-module eth0" will show
that the module is not in low power mode and it will be correct.

> 
> 
> IOW the low-power mode is a way for user to express preference to
> shut down/keep up the SFP,

Yes, it controls the module, not the MAC. If you want to get a carrier,
both the module and the MAC need to be operational. See following
example where swp13 and swp14 are connected to each other:

$ ip link show dev swp13
127: swp13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff

$ ip link show dev swp14
128: swp14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff

# ip link set dev swp13 down

# ethtool --set-module swp13 low-power on

$ ethtool --show-module swp13
Module parameters for swp13:
low-power true

# ip link set dev swp13 up

$ ip link show dev swp13
127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff

$ ip link show dev swp14
128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff

# ip link set dev swp13 down

# ethtool --set-module swp13 low-power off

$ ethtool --show-module swp13
Module parameters for swp13:
low-power false

# ip link set dev swp13 up

$ ip link show dev swp13
127: swp13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff

$ ip link show dev swp14
128: swp14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff

> but it's not necessarily going to be the only "policy" that matters.
> If other policies (e.g. NC-SI) express preference to keep the
> interface up it will stay up, right?
> 
> The LowPwrRequestSW is not directly controlled by low-power && IF_UP.
> 
> What I had in mind was something along the lines of a bitmap of reasons
> which are allowed to keep the interface up, and for your use case
> the reason would be something like SFP_ALWAYS_ON, but other reasons
> (well defined) may also keep the ifc up.
> 
> That's just to explain what I meant, if it's going to be clear to
> everyone that low-power != LowPwrRequestSW I'm fine either way.

I think we mixed two different use cases here. The first is a way to
make sure the link is fully operational (both MAC and module). In this
case, contrary to the expected behavior, "ip link set dev eth0 down"
will not take the MAC down and the peer will not lose its carrier. This
is most likely motivated by special hardware designs or exotic use cases
like I mentioned above.

The use case for this patch is completely different. Today, when you do
"ip link set dev eth0 up" you expect to gain a carrier which means that
both the MAC and the module are operational. The latter can be made
operational following the user request (e.g., SFP driver) or as soon as
it was plugged-in, by the device's firmware.

When you do "ip link set dev eth0 down" you expect the reverse to happen
and this is what happens today. In implementations where the module is
always operational, it stays in high power mode and continues to get
warm and consume unnecessary amount of power.

Some users might not be OK with that and would like more control, which
is exactly what this patch is doing. This patch does not change existing
behavior, the API has clear semantics and implications and user space
has full observability into its operation.

If in the future someone sees the need to add 'protoup', it can be done:

# ip link set dev eth0 up  # MAC and module are operational
# ip link set dev eth0 protoup on
# ip link set dev eth0 down # returns an error / ignore
# ethtool --set-module eth0 low-power on # returns an error because of admin up

You can obviously engineer situations that do not make any sense. Like
this:

# ethtool --set-module eth0 low-power on
# ip link set dev eth0 up
# ip link set dev eth0 protoup on

The MAC is operational and you can't take it down, but you will never
get a carrier because you explicitly asked the module to stay in low
power mode.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-11 11:33               ` Ido Schimmel
@ 2021-08-11 13:03                 ` Jakub Kicinski
  2021-08-11 14:36                   ` Andrew Lunn
  0 siblings, 1 reply; 39+ messages in thread
From: Jakub Kicinski @ 2021-08-11 13:03 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Andrew Lunn, Keller, Jacob E, netdev, davem, mkubecek, pali,
	vadimp, mlxsw, Ido Schimmel

On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote:
> # ethtool --set-module swp13 low-power on
> 
> $ ethtool --show-module swp13
> Module parameters for swp13:
> low-power true
> 
> # ip link set dev swp13 up
> 
> $ ip link show dev swp13
> 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
>     link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff
> 
> $ ip link show dev swp14
> 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
>     link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff

Oh, so if we set low-power true the carrier will never show up?
I thought Andrew suggested the setting is only taken into account 
when netdev is down. That made so much sense to me I assumed we'll
just go with that. I must have misunderstood.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-11 13:03                 ` Jakub Kicinski
@ 2021-08-11 14:36                   ` Andrew Lunn
  2021-08-11 19:37                     ` Ido Schimmel
  0 siblings, 1 reply; 39+ messages in thread
From: Andrew Lunn @ 2021-08-11 14:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Ido Schimmel, Keller, Jacob E, netdev, davem, mkubecek, pali,
	vadimp, mlxsw, Ido Schimmel

On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote:
> On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote:
> > # ethtool --set-module swp13 low-power on
> > 
> > $ ethtool --show-module swp13
> > Module parameters for swp13:
> > low-power true
> > 
> > # ip link set dev swp13 up
> > 
> > $ ip link show dev swp13
> > 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
> >     link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff
> > 
> > $ ip link show dev swp14
> > 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
> >     link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff
> 
> Oh, so if we set low-power true the carrier will never show up?
> I thought Andrew suggested the setting is only taken into account 
> when netdev is down.

Yes, that was my intention. If this low power mode also applies when
the interface is admin up, it sounds like a foot gun. ip link show
gives you no idea why the carrier is down, and people will assume the
cable or peer is broken. We at least need a new flag, LOWER_DISABLED
or similar to give the poor user some chance to figure out what is
going on.

To me, this setting should only apply when the link is admin down.

	Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-11 14:36                   ` Andrew Lunn
@ 2021-08-11 19:37                     ` Ido Schimmel
  2021-08-11 20:30                       ` Jakub Kicinski
  2021-08-11 20:42                       ` Andrew Lunn
  0 siblings, 2 replies; 39+ messages in thread
From: Ido Schimmel @ 2021-08-11 19:37 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Jakub Kicinski, Keller, Jacob E, netdev, davem, mkubecek, pali,
	vadimp, mlxsw, Ido Schimmel

On Wed, Aug 11, 2021 at 04:36:13PM +0200, Andrew Lunn wrote:
> On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote:
> > On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote:
> > > # ethtool --set-module swp13 low-power on
> > > 
> > > $ ethtool --show-module swp13
> > > Module parameters for swp13:
> > > low-power true
> > > 
> > > # ip link set dev swp13 up
> > > 
> > > $ ip link show dev swp13
> > > 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
> > >     link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff
> > > 
> > > $ ip link show dev swp14
> > > 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
> > >     link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff
> > 
> > Oh, so if we set low-power true the carrier will never show up?
> > I thought Andrew suggested the setting is only taken into account 
> > when netdev is down.
> 
> Yes, that was my intention. If this low power mode also applies when
> the interface is admin up, it sounds like a foot gun. ip link show
> gives you no idea why the carrier is down, and people will assume the
> cable or peer is broken. We at least need a new flag, LOWER_DISABLED
> or similar to give the poor user some chance to figure out what is
> going on.

The canonical way to report such errors is via LINKSTATE_GET and I will
add an extended sub-state describing the problem.

> 
> To me, this setting should only apply when the link is admin down.

I don't want to bake such an assumption into the kernel, but I have a
suggestion that resolves the issue.

We currently have a single attribute that encodes the desired state on
SET messages and the operational state on GET_REPLY messages
(ETHTOOL_A_MODULE_LOW_POWER_ENABLED):

$ ethtool --show-module swp11
Module parameters for swp11:
low-power true

It is not defined very well when a module is not connected despite being
a very interesting use case. We really need to have two attributes. The
first one describing the power mode policy and the second one describing
the operational power mode which is only reported when a module is
plugged in.

For the policy we can have these values:

1. low: Always transition the module to low power mode
2. high: Always transition the module to high power mode
3. high-on-up: Transition the module to high power mode when a port
using it is administratively up. Otherwise, low

A different policy for up/down seems like an overkill for me.

See example usage below.

No module connected:

$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high

Like I mentioned before, this is the default on Mellanox systems so this
new attribute allows user space to query the default policy.

Change to a different policy:

# ethtool --set-module swp11 power-mode-policy high-on-up

$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high-on-up

After a module was connected:

$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high-on-up
power-mode low

# ip link set dev swp11 up

$ ethtool --show-module swp11
Module parameters for swp11:
power-mode-policy high-on-up
low-power high

# ip link set dev swp11 down

# ethtool --set-module swp11 power-mode-policy low

# ip link set dev swp11 up

$ ethtool swp11
...
Link detected: no (Cable issue, Module is in low power mode)

I'm quite happy with the above. Might change a few things as I implement
it, but you get the gist. WDYT?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-11 19:37                     ` Ido Schimmel
@ 2021-08-11 20:30                       ` Jakub Kicinski
  2021-08-11 20:57                         ` Andrew Lunn
  2021-08-11 21:04                         ` Ido Schimmel
  2021-08-11 20:42                       ` Andrew Lunn
  1 sibling, 2 replies; 39+ messages in thread
From: Jakub Kicinski @ 2021-08-11 20:30 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Andrew Lunn, Keller, Jacob E, netdev, davem, mkubecek, pali,
	vadimp, mlxsw, Ido Schimmel

On Wed, 11 Aug 2021 22:37:53 +0300 Ido Schimmel wrote:
> On Wed, Aug 11, 2021 at 04:36:13PM +0200, Andrew Lunn wrote:
> > On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote:  
> > > Oh, so if we set low-power true the carrier will never show up?
> > > I thought Andrew suggested the setting is only taken into account 
> > > when netdev is down.  
> > 
> > Yes, that was my intention. If this low power mode also applies when
> > the interface is admin up, it sounds like a foot gun. ip link show
> > gives you no idea why the carrier is down, and people will assume the
> > cable or peer is broken. We at least need a new flag, LOWER_DISABLED
> > or similar to give the poor user some chance to figure out what is
> > going on.  
> 
> The canonical way to report such errors is via LINKSTATE_GET and I will
> add an extended sub-state describing the problem.
>  
> > To me, this setting should only apply when the link is admin down.  
> 
> I don't want to bake such an assumption into the kernel, but I have a
> suggestion that resolves the issue.
> 
> We currently have a single attribute that encodes the desired state on
> SET messages and the operational state on GET_REPLY messages
> (ETHTOOL_A_MODULE_LOW_POWER_ENABLED):
> 
> $ ethtool --show-module swp11
> Module parameters for swp11:
> low-power true
> 
> It is not defined very well when a module is not connected despite being
> a very interesting use case. We really need to have two attributes. The
> first one describing the power mode policy and the second one describing
> the operational power mode which is only reported when a module is
> plugged in.
> 
> For the policy we can have these values:
> 
> 1. low: Always transition the module to low power mode
> 2. high: Always transition the module to high power mode
> 3. high-on-up: Transition the module to high power mode when a port
> using it is administratively up. Otherwise, low
> 
> A different policy for up/down seems like an overkill for me.
> 
> See example usage below.
> 
> No module connected:
> 
> $ ethtool --show-module swp11
> Module parameters for swp11:
> power-mode-policy high
> 
> Like I mentioned before, this is the default on Mellanox systems so this
> new attribute allows user space to query the default policy.
> 
> Change to a different policy:
> 
> # ethtool --set-module swp11 power-mode-policy high-on-up
> 
> $ ethtool --show-module swp11
> Module parameters for swp11:
> power-mode-policy high-on-up
> 
> After a module was connected:
> 
> $ ethtool --show-module swp11
> Module parameters for swp11:
> power-mode-policy high-on-up
> power-mode low
> 
> # ip link set dev swp11 up
> 
> $ ethtool --show-module swp11
> Module parameters for swp11:
> power-mode-policy high-on-up
> low-power high
> 
> # ip link set dev swp11 down
> 
> # ethtool --set-module swp11 power-mode-policy low
> 
> # ip link set dev swp11 up
> 
> $ ethtool swp11
> ...
> Link detected: no (Cable issue, Module is in low power mode)
> 
> I'm quite happy with the above. Might change a few things as I implement
> it, but you get the gist. WDYT?

Isn't the "low-power" attr just duplicating the relevant bits from -m?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-11 19:37                     ` Ido Schimmel
  2021-08-11 20:30                       ` Jakub Kicinski
@ 2021-08-11 20:42                       ` Andrew Lunn
  1 sibling, 0 replies; 39+ messages in thread
From: Andrew Lunn @ 2021-08-11 20:42 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jakub Kicinski, Keller, Jacob E, netdev, davem, mkubecek, pali,
	vadimp, mlxsw, Ido Schimmel

> For the policy we can have these values:
> 
> 1. low: Always transition the module to low power mode
> 2. high: Always transition the module to high power mode
> 3. high-on-up: Transition the module to high power mode when a port
> using it is administratively up. Otherwise, low
> 
> A different policy for up/down seems like an overkill for me.

O.K. The current kernel SFP driver is going to default to high-on-up,
which is what kernel driven copper PHYs also do.

> After a module was connected:
> 
> $ ethtool --show-module swp11
> Module parameters for swp11:
> power-mode-policy high-on-up
> power-mode low
> 
> # ip link set dev swp11 up
> 
> $ ethtool --show-module swp11
> Module parameters for swp11:
> power-mode-policy high-on-up
> low-power high
> 
> # ip link set dev swp11 down

You missed a show here. I expect it to be:

> $ ethtool --show-module swp11
> Module parameters for swp11:
> power-mode-policy high-on-up
> power-mode low

since it is now down.

I suppose we should consider the bigger picture. Is this feature
limited to just SFP modules, or does it make sense to any other sort
of network technology? CAN, bluetooth, 5G, IPoAC?

   Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-11 20:30                       ` Jakub Kicinski
@ 2021-08-11 20:57                         ` Andrew Lunn
  2021-08-11 21:04                         ` Ido Schimmel
  1 sibling, 0 replies; 39+ messages in thread
From: Andrew Lunn @ 2021-08-11 20:57 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Ido Schimmel, Keller, Jacob E, netdev, davem, mkubecek, pali,
	vadimp, mlxsw, Ido Schimmel

> Isn't the "low-power" attr just duplicating the relevant bits from -m?

Do all SFPs report it in the dump? I'm thinking GPON, 1G modules with
a TX_ENABLE pin? INF-8074 does not specify a bit in the 'EEPROM' data
to indicate the status. So you need to know the state of the GPIO
driving the TX_ENABLE pin.

   Andrew

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode
  2021-08-11 20:30                       ` Jakub Kicinski
  2021-08-11 20:57                         ` Andrew Lunn
@ 2021-08-11 21:04                         ` Ido Schimmel
  1 sibling, 0 replies; 39+ messages in thread
From: Ido Schimmel @ 2021-08-11 21:04 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Andrew Lunn, Keller, Jacob E, netdev, davem, mkubecek, pali,
	vadimp, mlxsw, Ido Schimmel

On Wed, Aug 11, 2021 at 01:30:06PM -0700, Jakub Kicinski wrote:
> Isn't the "low-power" attr just duplicating the relevant bits from -m?

If low power is forced by hardware, then it depends on the assertion /
de-assertion of the hardware signal which is obviously not part of the
EEPROM dump.

I knew this would come up, so I mentioned it in the commit message:

"The low power mode can be queried from the kernel. In case
LowPwrAllowRequestHW was on, the kernel would need to take into account
the state of the LowPwrRequestHW signal, which is not visible to user
space."

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2021-08-11 21:04 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-09 10:21 [RFC PATCH net-next 0/8] ethtool: Add ability to control transceiver modules Ido Schimmel
2021-08-09 10:21 ` [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver modules' low power mode Ido Schimmel
2021-08-09 14:28   ` Andrew Lunn
2021-08-10  7:26     ` Ido Schimmel
2021-08-10 13:52       ` Andrew Lunn
2021-08-10 13:59         ` Jakub Kicinski
2021-08-10 20:46           ` Ido Schimmel
2021-08-10 22:00             ` Keller, Jacob E
2021-08-10 22:06               ` Jakub Kicinski
2021-08-10 22:18                 ` Keller, Jacob E
2021-08-10 22:24                 ` Keller, Jacob E
2021-08-10 22:31                 ` Andrew Lunn
2021-08-11  0:38                   ` Keller, Jacob E
2021-08-10 22:05             ` Jakub Kicinski
2021-08-10 22:51               ` Andrew Lunn
2021-08-11 11:33               ` Ido Schimmel
2021-08-11 13:03                 ` Jakub Kicinski
2021-08-11 14:36                   ` Andrew Lunn
2021-08-11 19:37                     ` Ido Schimmel
2021-08-11 20:30                       ` Jakub Kicinski
2021-08-11 20:57                         ` Andrew Lunn
2021-08-11 21:04                         ` Ido Schimmel
2021-08-11 20:42                       ` Andrew Lunn
2021-08-10 21:38           ` Keller, Jacob E
2021-08-09 10:21 ` [RFC PATCH net-next 2/8] ethtool: Add ability to reset transceiver modules Ido Schimmel
2021-08-09 19:13   ` Andrew Lunn
2021-08-10 13:05     ` Ido Schimmel
2021-08-10 13:54       ` Jakub Kicinski
2021-08-10 18:15         ` Ido Schimmel
2021-08-10 18:58           ` Andrew Lunn
2021-08-10 19:00           ` Jakub Kicinski
2021-08-10 19:28             ` Andrew Lunn
2021-08-10 20:50               ` Ido Schimmel
2021-08-09 10:21 ` [RFC PATCH net-next 3/8] mlxsw: reg: Add fields to PMAOS register Ido Schimmel
2021-08-09 10:21 ` [RFC PATCH net-next 4/8] mlxsw: Make PMAOS pack function more generic Ido Schimmel
2021-08-09 10:21 ` [RFC PATCH net-next 5/8] mlxsw: reg: Add Port Module Memory Map Properties register Ido Schimmel
2021-08-09 10:21 ` [RFC PATCH net-next 6/8] mlxsw: reg: Add Management Cable IO and Notifications register Ido Schimmel
2021-08-09 10:21 ` [RFC PATCH net-next 7/8] mlxsw: Add ability to control transceiver modules' low power mode Ido Schimmel
2021-08-09 10:21 ` [RFC PATCH net-next 8/8] mlxsw: Add ability to reset transceiver modules Ido Schimmel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.