netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs
@ 2019-11-03  8:35 Ido Schimmel
  2019-11-03  8:35 ` [PATCH net-next 1/6] mlxsw: core: Parse TLVs' offsets of incoming EMADs Ido Schimmel
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Ido Schimmel @ 2019-11-03  8:35 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, shalomt, mlxsw, Ido Schimmel

From: Ido Schimmel <idosch@mellanox.com>

Ethernet Management Datagrams (EMADs) are Ethernet packets sent between
the driver and device's firmware. They are used to pass various
configurations to the device, but also to get events (e.g., port up)
from it. After the Ethernet header, these packets are built in a TLV
format.

Up until now, whenever the driver issued an erroneous register access it
only got an error code indicating a bad parameter was used. This patch
set from Shalom adds a new TLV (string TLV) that can be used by the
firmware to encode a 128 character string describing the error. The new
TLV is allocated by the driver and set to zeros. In case of error, the
driver will check the length of the string in the response and print it
to the kernel log.

Example output:

mlxsw_spectrum 0000:03:00.0: EMAD reg access failed (tid=a9719f9700001306,reg_id=8018(rauhtd),type=query,status=7(bad parameter))
mlxsw_spectrum 0000:03:00.0: Firmware error (tid=a9719f9700001306,emad_err_string=inside er_rauhtd_write_query(), num_rec=32 is over the maximum number of records supported)

Patch #1 parses the offsets of the different TLVs in incoming EMADs and
stores them in the skb's control block. This makes it easier to later
add new TLVs.

Patches #2-#3 remove deprecated TLVs and add string TLV definition.

Patches #4-#6 gradually add support for the new string TLV.

Shalom Toledo (6):
  mlxsw: core: Parse TLVs' offsets of incoming EMADs
  mlxsw: emad: Remove deprecated EMAD TLVs
  mlxsw: core: Add EMAD string TLV
  mlxsw: core: Add support for EMAD string TLV parsing
  mlxsw: core: Add support for using EMAD string TLV
  mlxsw: spectrum: Enable EMAD string TLV

 drivers/net/ethernet/mellanox/mlxsw/core.c    | 154 ++++++++++++++++--
 drivers/net/ethernet/mellanox/mlxsw/core.h    |   2 +
 drivers/net/ethernet/mellanox/mlxsw/emad.h    |   7 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.c    |   2 +
 4 files changed, 147 insertions(+), 18 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH net-next 1/6] mlxsw: core: Parse TLVs' offsets of incoming EMADs
  2019-11-03  8:35 [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs Ido Schimmel
@ 2019-11-03  8:35 ` Ido Schimmel
  2019-11-03  8:35 ` [PATCH net-next 2/6] mlxsw: emad: Remove deprecated EMAD TLVs Ido Schimmel
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Ido Schimmel @ 2019-11-03  8:35 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, shalomt, mlxsw, Ido Schimmel

From: Shalom Toledo <shalomt@mellanox.com>

Until now the code assumed a fixed structure which makes it difficult to
support EMADs with and without new TLVs.

Make it more generic by parsing the TLVs when the EMADs are received and
store the offset to the different TLVs in the control block. Use these
offsets to extract information from the EMADs without relying on a
specific structure.

Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 53 +++++++++++++++++-----
 1 file changed, 42 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index e1a90f5bddd0..3d92956047d5 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -361,20 +361,45 @@ static void mlxsw_emad_construct(struct sk_buff *skb,
 	mlxsw_emad_construct_eth_hdr(skb);
 }
 
+struct mlxsw_emad_tlv_offsets {
+	u16 op_tlv;
+	u16 reg_tlv;
+};
+
+static void mlxsw_emad_tlv_parse(struct sk_buff *skb)
+{
+	struct mlxsw_emad_tlv_offsets *offsets =
+		(struct mlxsw_emad_tlv_offsets *) skb->cb;
+
+	offsets->op_tlv = MLXSW_EMAD_ETH_HDR_LEN;
+	offsets->reg_tlv = MLXSW_EMAD_ETH_HDR_LEN +
+			   MLXSW_EMAD_OP_TLV_LEN * sizeof(u32);
+}
+
 static char *mlxsw_emad_op_tlv(const struct sk_buff *skb)
 {
-	return ((char *) (skb->data + MLXSW_EMAD_ETH_HDR_LEN));
+	struct mlxsw_emad_tlv_offsets *offsets =
+		(struct mlxsw_emad_tlv_offsets *) skb->cb;
+
+	return ((char *) (skb->data + offsets->op_tlv));
 }
 
 static char *mlxsw_emad_reg_tlv(const struct sk_buff *skb)
 {
-	return ((char *) (skb->data + MLXSW_EMAD_ETH_HDR_LEN +
-				      MLXSW_EMAD_OP_TLV_LEN * sizeof(u32)));
+	struct mlxsw_emad_tlv_offsets *offsets =
+		(struct mlxsw_emad_tlv_offsets *) skb->cb;
+
+	return ((char *) (skb->data + offsets->reg_tlv));
 }
 
-static char *mlxsw_emad_reg_payload(const char *op_tlv)
+static char *mlxsw_emad_reg_payload(const char *reg_tlv)
 {
-	return ((char *) (op_tlv + (MLXSW_EMAD_OP_TLV_LEN + 1) * sizeof(u32)));
+	return ((char *) (reg_tlv + sizeof(u32)));
+}
+
+static char *mlxsw_emad_reg_payload_cmd(const char *mbox)
+{
+	return ((char *) (mbox + (MLXSW_EMAD_OP_TLV_LEN + 1) * sizeof(u32)));
 }
 
 static u64 mlxsw_emad_get_tid(const struct sk_buff *skb)
@@ -535,11 +560,11 @@ static void mlxsw_emad_process_response(struct mlxsw_core *mlxsw_core,
 		mlxsw_emad_transmit_retry(mlxsw_core, trans);
 	} else {
 		if (err == 0) {
-			char *op_tlv = mlxsw_emad_op_tlv(skb);
+			char *reg_tlv = mlxsw_emad_reg_tlv(skb);
 
 			if (trans->cb)
 				trans->cb(mlxsw_core,
-					  mlxsw_emad_reg_payload(op_tlv),
+					  mlxsw_emad_reg_payload(reg_tlv),
 					  trans->reg->len, trans->cb_priv);
 		}
 		mlxsw_emad_trans_finish(trans, err);
@@ -556,6 +581,8 @@ static void mlxsw_emad_rx_listener_func(struct sk_buff *skb, u8 local_port,
 	trace_devlink_hwmsg(priv_to_devlink(mlxsw_core), true, 0,
 			    skb->data, skb->len);
 
+	mlxsw_emad_tlv_parse(skb);
+
 	if (!mlxsw_emad_is_resp(skb))
 		goto free_skb;
 
@@ -1390,12 +1417,16 @@ static void mlxsw_core_event_listener_func(struct sk_buff *skb, u8 local_port,
 	struct mlxsw_event_listener_item *event_listener_item = priv;
 	struct mlxsw_reg_info reg;
 	char *payload;
-	char *op_tlv = mlxsw_emad_op_tlv(skb);
-	char *reg_tlv = mlxsw_emad_reg_tlv(skb);
+	char *reg_tlv;
+	char *op_tlv;
+
+	mlxsw_emad_tlv_parse(skb);
+	op_tlv = mlxsw_emad_op_tlv(skb);
+	reg_tlv = mlxsw_emad_reg_tlv(skb);
 
 	reg.id = mlxsw_emad_op_tlv_register_id_get(op_tlv);
 	reg.len = (mlxsw_emad_reg_tlv_len_get(reg_tlv) - 1) * sizeof(u32);
-	payload = mlxsw_emad_reg_payload(op_tlv);
+	payload = mlxsw_emad_reg_payload(reg_tlv);
 	event_listener_item->el.func(&reg, payload, event_listener_item->priv);
 	dev_kfree_skb(skb);
 }
@@ -1708,7 +1739,7 @@ static int mlxsw_core_reg_access_cmd(struct mlxsw_core *mlxsw_core,
 	}
 
 	if (!err)
-		memcpy(payload, mlxsw_emad_reg_payload(out_mbox),
+		memcpy(payload, mlxsw_emad_reg_payload_cmd(out_mbox),
 		       reg->len);
 
 	mlxsw_cmd_mbox_free(out_mbox);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH net-next 2/6] mlxsw: emad: Remove deprecated EMAD TLVs
  2019-11-03  8:35 [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs Ido Schimmel
  2019-11-03  8:35 ` [PATCH net-next 1/6] mlxsw: core: Parse TLVs' offsets of incoming EMADs Ido Schimmel
@ 2019-11-03  8:35 ` Ido Schimmel
  2019-11-03  8:35 ` [PATCH net-next 3/6] mlxsw: core: Add EMAD string TLV Ido Schimmel
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Ido Schimmel @ 2019-11-03  8:35 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, shalomt, mlxsw, Ido Schimmel

From: Shalom Toledo <shalomt@mellanox.com>

Remove deprecated EMAD TLVs.

Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/emad.h | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/emad.h b/drivers/net/ethernet/mellanox/mlxsw/emad.h
index a33b896f4bb8..5d7c78419fa7 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/emad.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/emad.h
@@ -19,10 +19,7 @@
 enum {
 	MLXSW_EMAD_TLV_TYPE_END,
 	MLXSW_EMAD_TLV_TYPE_OP,
-	MLXSW_EMAD_TLV_TYPE_DR,
-	MLXSW_EMAD_TLV_TYPE_REG,
-	MLXSW_EMAD_TLV_TYPE_USERDATA,
-	MLXSW_EMAD_TLV_TYPE_OOBETH,
+	MLXSW_EMAD_TLV_TYPE_REG = 0x3,
 };
 
 /* OP TLV */
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH net-next 3/6] mlxsw: core: Add EMAD string TLV
  2019-11-03  8:35 [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs Ido Schimmel
  2019-11-03  8:35 ` [PATCH net-next 1/6] mlxsw: core: Parse TLVs' offsets of incoming EMADs Ido Schimmel
  2019-11-03  8:35 ` [PATCH net-next 2/6] mlxsw: emad: Remove deprecated EMAD TLVs Ido Schimmel
@ 2019-11-03  8:35 ` Ido Schimmel
  2019-11-03  8:35 ` [PATCH net-next 4/6] mlxsw: core: Add support for EMAD string TLV parsing Ido Schimmel
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Ido Schimmel @ 2019-11-03  8:35 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, shalomt, mlxsw, Ido Schimmel

From: Shalom Toledo <shalomt@mellanox.com>

Add EMAD string TLV, an ASCII string the driver can receive from the
firmware in case of an error.

Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 19 +++++++++++++++++++
 drivers/net/ethernet/mellanox/mlxsw/emad.h |  6 +++++-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 3d92956047d5..1803b246d9b4 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -249,6 +249,25 @@ MLXSW_ITEM32(emad, op_tlv, class, 0x04, 0, 8);
  */
 MLXSW_ITEM64(emad, op_tlv, tid, 0x08, 0, 64);
 
+/* emad_string_tlv_type
+ * Type of the TLV.
+ * Must be set to 0x2 (string TLV).
+ */
+MLXSW_ITEM32(emad, string_tlv, type, 0x00, 27, 5);
+
+/* emad_string_tlv_len
+ * Length of the string TLV in u32.
+ */
+MLXSW_ITEM32(emad, string_tlv, len, 0x00, 16, 11);
+
+#define MLXSW_EMAD_STRING_TLV_STRING_LEN 128
+
+/* emad_string_tlv_string
+ * String provided by the device's firmware in case of erroneous register access
+ */
+MLXSW_ITEM_BUF(emad, string_tlv, string, 0x04,
+	       MLXSW_EMAD_STRING_TLV_STRING_LEN);
+
 /* emad_reg_tlv_type
  * Type of the TLV.
  * Must be set to 0x3 (register TLV).
diff --git a/drivers/net/ethernet/mellanox/mlxsw/emad.h b/drivers/net/ethernet/mellanox/mlxsw/emad.h
index 5d7c78419fa7..acfbbec52424 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/emad.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/emad.h
@@ -19,7 +19,8 @@
 enum {
 	MLXSW_EMAD_TLV_TYPE_END,
 	MLXSW_EMAD_TLV_TYPE_OP,
-	MLXSW_EMAD_TLV_TYPE_REG = 0x3,
+	MLXSW_EMAD_TLV_TYPE_STRING,
+	MLXSW_EMAD_TLV_TYPE_REG,
 };
 
 /* OP TLV */
@@ -86,6 +87,9 @@ enum {
 	MLXSW_EMAD_OP_TLV_METHOD_EVENT = 5,
 };
 
+/* STRING TLV */
+#define MLXSW_EMAD_STRING_TLV_LEN 33	/* Length in u32 */
+
 /* END TLV */
 #define MLXSW_EMAD_END_TLV_LEN 1	/* Length in u32 */
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH net-next 4/6] mlxsw: core: Add support for EMAD string TLV parsing
  2019-11-03  8:35 [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs Ido Schimmel
                   ` (2 preceding siblings ...)
  2019-11-03  8:35 ` [PATCH net-next 3/6] mlxsw: core: Add EMAD string TLV Ido Schimmel
@ 2019-11-03  8:35 ` Ido Schimmel
  2019-11-03  8:35 ` [PATCH net-next 5/6] mlxsw: core: Add support for using EMAD string TLV Ido Schimmel
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: Ido Schimmel @ 2019-11-03  8:35 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, shalomt, mlxsw, Ido Schimmel

From: Shalom Toledo <shalomt@mellanox.com>

During parsing of incoming EMADs, fill the string TLV's offset when it is
used.

Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 1803b246d9b4..e166ce745150 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -382,17 +382,32 @@ static void mlxsw_emad_construct(struct sk_buff *skb,
 
 struct mlxsw_emad_tlv_offsets {
 	u16 op_tlv;
+	u16 string_tlv;
 	u16 reg_tlv;
 };
 
+static bool mlxsw_emad_tlv_is_string_tlv(const char *tlv)
+{
+	u8 tlv_type = mlxsw_emad_string_tlv_type_get(tlv);
+
+	return tlv_type == MLXSW_EMAD_TLV_TYPE_STRING;
+}
+
 static void mlxsw_emad_tlv_parse(struct sk_buff *skb)
 {
 	struct mlxsw_emad_tlv_offsets *offsets =
 		(struct mlxsw_emad_tlv_offsets *) skb->cb;
 
 	offsets->op_tlv = MLXSW_EMAD_ETH_HDR_LEN;
+	offsets->string_tlv = 0;
 	offsets->reg_tlv = MLXSW_EMAD_ETH_HDR_LEN +
 			   MLXSW_EMAD_OP_TLV_LEN * sizeof(u32);
+
+	/* If string TLV is present, it must come after the operation TLV. */
+	if (mlxsw_emad_tlv_is_string_tlv(skb->data + offsets->reg_tlv)) {
+		offsets->string_tlv = offsets->reg_tlv;
+		offsets->reg_tlv += MLXSW_EMAD_STRING_TLV_LEN * sizeof(u32);
+	}
 }
 
 static char *mlxsw_emad_op_tlv(const struct sk_buff *skb)
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH net-next 5/6] mlxsw: core: Add support for using EMAD string TLV
  2019-11-03  8:35 [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs Ido Schimmel
                   ` (3 preceding siblings ...)
  2019-11-03  8:35 ` [PATCH net-next 4/6] mlxsw: core: Add support for EMAD string TLV parsing Ido Schimmel
@ 2019-11-03  8:35 ` Ido Schimmel
  2019-11-03  8:35 ` [PATCH net-next 6/6] mlxsw: spectrum: Enable " Ido Schimmel
  2019-11-04 20:39 ` [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs Jakub Kicinski
  6 siblings, 0 replies; 20+ messages in thread
From: Ido Schimmel @ 2019-11-03  8:35 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, shalomt, mlxsw, Ido Schimmel

From: Shalom Toledo <shalomt@mellanox.com>

In case the firmware had an error while processing EMADs, it can send back
an ASCII string with the reason using EMAD string TLV.

This patch adds the support for using EMAD string TLV and prints the string
reason in case of an error.

Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 67 ++++++++++++++++++++--
 drivers/net/ethernet/mellanox/mlxsw/core.h |  2 +
 2 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index e166ce745150..96b2657789d2 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -71,6 +71,7 @@ struct mlxsw_core {
 		struct list_head trans_list;
 		spinlock_t trans_list_lock; /* protects trans_list writes */
 		bool use_emad;
+		bool enable_string_tlv;
 	} emad;
 	struct {
 		u8 *mapping; /* lag_id+port_index to local_port mapping */
@@ -323,6 +324,12 @@ static void mlxsw_emad_pack_reg_tlv(char *reg_tlv,
 	memcpy(reg_tlv + sizeof(u32), payload, reg->len);
 }
 
+static void mlxsw_emad_pack_string_tlv(char *string_tlv)
+{
+	mlxsw_emad_string_tlv_type_set(string_tlv, MLXSW_EMAD_TLV_TYPE_STRING);
+	mlxsw_emad_string_tlv_len_set(string_tlv, MLXSW_EMAD_STRING_TLV_LEN);
+}
+
 static void mlxsw_emad_pack_op_tlv(char *op_tlv,
 				   const struct mlxsw_reg_info *reg,
 				   enum mlxsw_core_reg_access_type type,
@@ -364,7 +371,7 @@ static void mlxsw_emad_construct(struct sk_buff *skb,
 				 const struct mlxsw_reg_info *reg,
 				 char *payload,
 				 enum mlxsw_core_reg_access_type type,
-				 u64 tid)
+				 u64 tid, bool enable_string_tlv)
 {
 	char *buf;
 
@@ -374,6 +381,11 @@ static void mlxsw_emad_construct(struct sk_buff *skb,
 	buf = skb_push(skb, reg->len + sizeof(u32));
 	mlxsw_emad_pack_reg_tlv(buf, reg, payload);
 
+	if (enable_string_tlv) {
+		buf = skb_push(skb, MLXSW_EMAD_STRING_TLV_LEN * sizeof(u32));
+		mlxsw_emad_pack_string_tlv(buf);
+	}
+
 	buf = skb_push(skb, MLXSW_EMAD_OP_TLV_LEN * sizeof(u32));
 	mlxsw_emad_pack_op_tlv(buf, reg, type, tid);
 
@@ -418,6 +430,17 @@ static char *mlxsw_emad_op_tlv(const struct sk_buff *skb)
 	return ((char *) (skb->data + offsets->op_tlv));
 }
 
+static char *mlxsw_emad_string_tlv(const struct sk_buff *skb)
+{
+	struct mlxsw_emad_tlv_offsets *offsets =
+		(struct mlxsw_emad_tlv_offsets *) skb->cb;
+
+	if (!offsets->string_tlv)
+		return NULL;
+
+	return ((char *) (skb->data + offsets->string_tlv));
+}
+
 static char *mlxsw_emad_reg_tlv(const struct sk_buff *skb)
 {
 	struct mlxsw_emad_tlv_offsets *offsets =
@@ -499,10 +522,26 @@ struct mlxsw_reg_trans {
 	const struct mlxsw_reg_info *reg;
 	enum mlxsw_core_reg_access_type type;
 	int err;
+	char emad_err_string[MLXSW_EMAD_STRING_TLV_STRING_LEN];
 	enum mlxsw_emad_op_tlv_status emad_status;
 	struct rcu_head rcu;
 };
 
+static void mlxsw_emad_process_string_tlv(const struct sk_buff *skb,
+					  struct mlxsw_reg_trans *trans)
+{
+	char *string_tlv;
+	char *string;
+
+	string_tlv = mlxsw_emad_string_tlv(skb);
+	if (!string_tlv)
+		return;
+
+	string = mlxsw_emad_string_tlv_string_data(string_tlv);
+	strlcpy(trans->emad_err_string, string,
+		MLXSW_EMAD_STRING_TLV_STRING_LEN);
+}
+
 #define MLXSW_EMAD_TIMEOUT_DURING_FW_FLASH_MS	3000
 #define MLXSW_EMAD_TIMEOUT_MS			200
 
@@ -600,6 +639,8 @@ static void mlxsw_emad_process_response(struct mlxsw_core *mlxsw_core,
 				trans->cb(mlxsw_core,
 					  mlxsw_emad_reg_payload(reg_tlv),
 					  trans->reg->len, trans->cb_priv);
+		} else {
+			mlxsw_emad_process_string_tlv(skb, trans);
 		}
 		mlxsw_emad_trans_finish(trans, err);
 	}
@@ -692,7 +733,7 @@ static void mlxsw_emad_fini(struct mlxsw_core *mlxsw_core)
 }
 
 static struct sk_buff *mlxsw_emad_alloc(const struct mlxsw_core *mlxsw_core,
-					u16 reg_len)
+					u16 reg_len, bool enable_string_tlv)
 {
 	struct sk_buff *skb;
 	u16 emad_len;
@@ -700,6 +741,8 @@ static struct sk_buff *mlxsw_emad_alloc(const struct mlxsw_core *mlxsw_core,
 	emad_len = (reg_len + sizeof(u32) + MLXSW_EMAD_ETH_HDR_LEN +
 		    (MLXSW_EMAD_OP_TLV_LEN + MLXSW_EMAD_END_TLV_LEN) *
 		    sizeof(u32) + mlxsw_core->driver->txhdr_len);
+	if (enable_string_tlv)
+		emad_len += MLXSW_EMAD_STRING_TLV_LEN * sizeof(u32);
 	if (emad_len > MLXSW_EMAD_MAX_FRAME_LEN)
 		return NULL;
 
@@ -721,6 +764,7 @@ static int mlxsw_emad_reg_access(struct mlxsw_core *mlxsw_core,
 				 mlxsw_reg_trans_cb_t *cb,
 				 unsigned long cb_priv, u64 tid)
 {
+	bool enable_string_tlv;
 	struct sk_buff *skb;
 	int err;
 
@@ -728,7 +772,12 @@ static int mlxsw_emad_reg_access(struct mlxsw_core *mlxsw_core,
 		tid, reg->id, mlxsw_reg_id_str(reg->id),
 		mlxsw_core_reg_access_type_str(type));
 
-	skb = mlxsw_emad_alloc(mlxsw_core, reg->len);
+	/* Since this can be changed during emad_reg_access, read it once and
+	 * use the value all the way.
+	 */
+	enable_string_tlv = mlxsw_core->emad.enable_string_tlv;
+
+	skb = mlxsw_emad_alloc(mlxsw_core, reg->len, enable_string_tlv);
 	if (!skb)
 		return -ENOMEM;
 
@@ -745,7 +794,8 @@ static int mlxsw_emad_reg_access(struct mlxsw_core *mlxsw_core,
 	trans->reg = reg;
 	trans->type = type;
 
-	mlxsw_emad_construct(skb, reg, payload, type, trans->tid);
+	mlxsw_emad_construct(skb, reg, payload, type, trans->tid,
+			     enable_string_tlv);
 	mlxsw_core->driver->txhdr_construct(skb, &trans->tx_info);
 
 	spin_lock_bh(&mlxsw_core->emad.trans_list_lock);
@@ -1697,6 +1747,9 @@ static int mlxsw_reg_trans_wait(struct mlxsw_reg_trans *trans)
 			mlxsw_core_reg_access_type_str(trans->type),
 			trans->emad_status,
 			mlxsw_emad_op_tlv_status_str(trans->emad_status));
+		if (strlen(trans->emad_err_string))
+			dev_err(mlxsw_core->bus_info->dev, "Firmware error (tid=%llx,emad_err_string=%s)\n",
+				trans->tid, trans->emad_err_string);
 		trace_devlink_hwerr(priv_to_devlink(mlxsw_core),
 				    trans->emad_status,
 				    mlxsw_emad_op_tlv_status_str(trans->emad_status));
@@ -2270,6 +2323,12 @@ u32 mlxsw_core_read_frc_l(struct mlxsw_core *mlxsw_core)
 }
 EXPORT_SYMBOL(mlxsw_core_read_frc_l);
 
+void mlxsw_core_emad_string_tlv_enable(struct mlxsw_core *mlxsw_core)
+{
+	mlxsw_core->emad.enable_string_tlv = true;
+}
+EXPORT_SYMBOL(mlxsw_core_emad_string_tlv_enable);
+
 static int __init mlxsw_core_module_init(void)
 {
 	int err;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h b/drivers/net/ethernet/mellanox/mlxsw/core.h
index 0d18bee6d140..543476a2e503 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -347,6 +347,8 @@ void mlxsw_core_fw_flash_end(struct mlxsw_core *mlxsw_core);
 u32 mlxsw_core_read_frc_h(struct mlxsw_core *mlxsw_core);
 u32 mlxsw_core_read_frc_l(struct mlxsw_core *mlxsw_core);
 
+void mlxsw_core_emad_string_tlv_enable(struct mlxsw_core *mlxsw_core);
+
 bool mlxsw_core_res_valid(struct mlxsw_core *mlxsw_core,
 			  enum mlxsw_res_id res_id);
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH net-next 6/6] mlxsw: spectrum: Enable EMAD string TLV
  2019-11-03  8:35 [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs Ido Schimmel
                   ` (4 preceding siblings ...)
  2019-11-03  8:35 ` [PATCH net-next 5/6] mlxsw: core: Add support for using EMAD string TLV Ido Schimmel
@ 2019-11-03  8:35 ` Ido Schimmel
  2019-11-04 20:39 ` [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs Jakub Kicinski
  6 siblings, 0 replies; 20+ messages in thread
From: Ido Schimmel @ 2019-11-03  8:35 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, shalomt, mlxsw, Ido Schimmel

From: Shalom Toledo <shalomt@mellanox.com>

Make sure to enable EMAD string TLV only after using the required firmware
version.

Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index ea4cc2aa99e0..5f250b4cc662 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -4897,6 +4897,8 @@ static int mlxsw_sp_init(struct mlxsw_core *mlxsw_core,
 	if (err)
 		return err;
 
+	mlxsw_core_emad_string_tlv_enable(mlxsw_core);
+
 	err = mlxsw_sp_base_mac_get(mlxsw_sp);
 	if (err) {
 		dev_err(mlxsw_sp->bus_info->dev, "Failed to get base mac\n");
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs
  2019-11-03  8:35 [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs Ido Schimmel
                   ` (5 preceding siblings ...)
  2019-11-03  8:35 ` [PATCH net-next 6/6] mlxsw: spectrum: Enable " Ido Schimmel
@ 2019-11-04 20:39 ` Jakub Kicinski
  2019-11-04 21:04   ` Ido Schimmel
  6 siblings, 1 reply; 20+ messages in thread
From: Jakub Kicinski @ 2019-11-04 20:39 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, davem, jiri, shalomt, mlxsw, Ido Schimmel

On Sun,  3 Nov 2019 10:35:48 +0200, Ido Schimmel wrote:
> From: Ido Schimmel <idosch@mellanox.com>
> 
> Ethernet Management Datagrams (EMADs) are Ethernet packets sent between
> the driver and device's firmware. They are used to pass various
> configurations to the device, but also to get events (e.g., port up)
> from it. After the Ethernet header, these packets are built in a TLV
> format.
> 
> Up until now, whenever the driver issued an erroneous register access it
> only got an error code indicating a bad parameter was used. This patch
> set from Shalom adds a new TLV (string TLV) that can be used by the
> firmware to encode a 128 character string describing the error. The new
> TLV is allocated by the driver and set to zeros. In case of error, the
> driver will check the length of the string in the response and print it
> to the kernel log.
> 
> Example output:
> 
> mlxsw_spectrum 0000:03:00.0: EMAD reg access failed (tid=a9719f9700001306,reg_id=8018(rauhtd),type=query,status=7(bad parameter))
> mlxsw_spectrum 0000:03:00.0: Firmware error (tid=a9719f9700001306,emad_err_string=inside er_rauhtd_write_query(), num_rec=32 is over the maximum number of records supported)

Personally I'm not a big fan of passing unstructured data between user
and firmware. Not having access to the errors makes it harder to create
common interfaces by inspecting driver code.

Is there any precedent in the tree for printing FW errors into the logs
like this?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs
  2019-11-04 20:39 ` [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs Jakub Kicinski
@ 2019-11-04 21:04   ` Ido Schimmel
  2019-11-04 22:44     ` Jakub Kicinski
  0 siblings, 1 reply; 20+ messages in thread
From: Ido Schimmel @ 2019-11-04 21:04 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, jiri, shalomt, mlxsw, Ido Schimmel

On Mon, Nov 04, 2019 at 12:39:54PM -0800, Jakub Kicinski wrote:
> On Sun,  3 Nov 2019 10:35:48 +0200, Ido Schimmel wrote:
> > From: Ido Schimmel <idosch@mellanox.com>
> > 
> > Ethernet Management Datagrams (EMADs) are Ethernet packets sent between
> > the driver and device's firmware. They are used to pass various
> > configurations to the device, but also to get events (e.g., port up)
> > from it. After the Ethernet header, these packets are built in a TLV
> > format.
> > 
> > Up until now, whenever the driver issued an erroneous register access it
> > only got an error code indicating a bad parameter was used. This patch
> > set from Shalom adds a new TLV (string TLV) that can be used by the
> > firmware to encode a 128 character string describing the error. The new
> > TLV is allocated by the driver and set to zeros. In case of error, the
> > driver will check the length of the string in the response and print it
> > to the kernel log.
> > 
> > Example output:
> > 
> > mlxsw_spectrum 0000:03:00.0: EMAD reg access failed (tid=a9719f9700001306,reg_id=8018(rauhtd),type=query,status=7(bad parameter))
> > mlxsw_spectrum 0000:03:00.0: Firmware error (tid=a9719f9700001306,emad_err_string=inside er_rauhtd_write_query(), num_rec=32 is over the maximum number of records supported)
> 
> Personally I'm not a big fan of passing unstructured data between user
> and firmware. Not having access to the errors makes it harder to create
> common interfaces by inspecting driver code.

I don't understand the problem. If we get an error from firmware today,
we have no clue what the actual problem is. With this we can actually
understand what went wrong. How is it different from kernel passing a
string ("unstructured data") to user space in response to an erroneous
netlink request? Obviously it's much better than an "-EINVAL".

Also, in case it was not clear, this is a read-only interface and only
from firmware to kernel. No hidden knobs or something fishy like that.

> Is there any precedent in the tree for printing FW errors into the logs
> like this?

The mlx5 driver prints a unique number for each firmware error. We tried
to do the same in switch firmware, but it lacked the infrastructure so
we decided on this solution instead. It achieves the same goal, but in a
different way.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs
  2019-11-04 21:04   ` Ido Schimmel
@ 2019-11-04 22:44     ` Jakub Kicinski
  2019-11-04 23:20       ` Ido Schimmel
  0 siblings, 1 reply; 20+ messages in thread
From: Jakub Kicinski @ 2019-11-04 22:44 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, davem, jiri, shalomt, mlxsw, Ido Schimmel

On Mon, 4 Nov 2019 23:04:50 +0200, Ido Schimmel wrote:
> On Mon, Nov 04, 2019 at 12:39:54PM -0800, Jakub Kicinski wrote:
> > On Sun,  3 Nov 2019 10:35:48 +0200, Ido Schimmel wrote:  
> > > From: Ido Schimmel <idosch@mellanox.com>
> > > 
> > > Ethernet Management Datagrams (EMADs) are Ethernet packets sent between
> > > the driver and device's firmware. They are used to pass various
> > > configurations to the device, but also to get events (e.g., port up)
> > > from it. After the Ethernet header, these packets are built in a TLV
> > > format.
> > > 
> > > Up until now, whenever the driver issued an erroneous register access it
> > > only got an error code indicating a bad parameter was used. This patch
> > > set from Shalom adds a new TLV (string TLV) that can be used by the
> > > firmware to encode a 128 character string describing the error. The new
> > > TLV is allocated by the driver and set to zeros. In case of error, the
> > > driver will check the length of the string in the response and print it
> > > to the kernel log.
> > > 
> > > Example output:
> > > 
> > > mlxsw_spectrum 0000:03:00.0: EMAD reg access failed (tid=a9719f9700001306,reg_id=8018(rauhtd),type=query,status=7(bad parameter))
> > > mlxsw_spectrum 0000:03:00.0: Firmware error (tid=a9719f9700001306,emad_err_string=inside er_rauhtd_write_query(), num_rec=32 is over the maximum number of records supported)  
> > 
> > Personally I'm not a big fan of passing unstructured data between user
> > and firmware. Not having access to the errors makes it harder to create
> > common interfaces by inspecting driver code.  
> 
> I don't understand the problem. If we get an error from firmware today,
> we have no clue what the actual problem is. With this we can actually
> understand what went wrong. How is it different from kernel passing a
> string ("unstructured data") to user space in response to an erroneous
> netlink request? Obviously it's much better than an "-EINVAL".

The difference is obviously that I can look at the code in the kernel
and understand it. FW code is a black box. Kernel should abstract its
black boxiness away.

> Also, in case it was not clear, this is a read-only interface and only
> from firmware to kernel. No hidden knobs or something fishy like that.

I'm not saying it's fishy, I'm saying it's way harder to refactor code
if some of the user-visible outputs are not accessible (i.e. hidden in
a binary blob).

> > Is there any precedent in the tree for printing FW errors into the logs
> > like this?  
> 
> The mlx5 driver prints a unique number for each firmware error. We tried
> to do the same in switch firmware, but it lacked the infrastructure so
> we decided on this solution instead. It achieves the same goal, but in a
> different way.

FWIW nfp FW also passes error numbers to the driver and based on that
driver makes decisions and prints errors of its own choosing. The big
difference being you can see all the relevant errors by looking at
driver code.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs
  2019-11-04 22:44     ` Jakub Kicinski
@ 2019-11-04 23:20       ` Ido Schimmel
  2019-11-04 23:33         ` Jakub Kicinski
  0 siblings, 1 reply; 20+ messages in thread
From: Ido Schimmel @ 2019-11-04 23:20 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, jiri, shalomt, mlxsw, Ido Schimmel

On Mon, Nov 04, 2019 at 02:44:19PM -0800, Jakub Kicinski wrote:
> On Mon, 4 Nov 2019 23:04:50 +0200, Ido Schimmel wrote:
> > On Mon, Nov 04, 2019 at 12:39:54PM -0800, Jakub Kicinski wrote:
> > > On Sun,  3 Nov 2019 10:35:48 +0200, Ido Schimmel wrote:  
> > > > From: Ido Schimmel <idosch@mellanox.com>
> > > > 
> > > > Ethernet Management Datagrams (EMADs) are Ethernet packets sent between
> > > > the driver and device's firmware. They are used to pass various
> > > > configurations to the device, but also to get events (e.g., port up)
> > > > from it. After the Ethernet header, these packets are built in a TLV
> > > > format.
> > > > 
> > > > Up until now, whenever the driver issued an erroneous register access it
> > > > only got an error code indicating a bad parameter was used. This patch
> > > > set from Shalom adds a new TLV (string TLV) that can be used by the
> > > > firmware to encode a 128 character string describing the error. The new
> > > > TLV is allocated by the driver and set to zeros. In case of error, the
> > > > driver will check the length of the string in the response and print it
> > > > to the kernel log.
> > > > 
> > > > Example output:
> > > > 
> > > > mlxsw_spectrum 0000:03:00.0: EMAD reg access failed (tid=a9719f9700001306,reg_id=8018(rauhtd),type=query,status=7(bad parameter))
> > > > mlxsw_spectrum 0000:03:00.0: Firmware error (tid=a9719f9700001306,emad_err_string=inside er_rauhtd_write_query(), num_rec=32 is over the maximum number of records supported)  
> > > 
> > > Personally I'm not a big fan of passing unstructured data between user
> > > and firmware. Not having access to the errors makes it harder to create
> > > common interfaces by inspecting driver code.  
> > 
> > I don't understand the problem. If we get an error from firmware today,
> > we have no clue what the actual problem is. With this we can actually
> > understand what went wrong. How is it different from kernel passing a
> > string ("unstructured data") to user space in response to an erroneous
> > netlink request? Obviously it's much better than an "-EINVAL".
> 
> The difference is obviously that I can look at the code in the kernel
> and understand it. FW code is a black box. Kernel should abstract its
> black boxiness away.

But FW code is still code and it needs to be able to report errors in a
way that will aid us in debugging when problems occur. I want meaningful
errors from applications regardless if I can read their code or not.

> 
> > Also, in case it was not clear, this is a read-only interface and only
> > from firmware to kernel. No hidden knobs or something fishy like that.
> 
> I'm not saying it's fishy, I'm saying it's way harder to refactor code
> if some of the user-visible outputs are not accessible (i.e. hidden in
> a binary blob).

Not sure I understand which code you're referring to? The error print
statement?

>
> > > Is there any precedent in the tree for printing FW errors into the logs
> > > like this?  
> > 
> > The mlx5 driver prints a unique number for each firmware error. We tried
> > to do the same in switch firmware, but it lacked the infrastructure so
> > we decided on this solution instead. It achieves the same goal, but in a
> > different way.
> 
> FWIW nfp FW also passes error numbers to the driver and based on that
> driver makes decisions and prints errors of its own choosing. The big
> difference being you can see all the relevant errors by looking at
> driver code.

This is done by mlxsw as well. See:
$ vi drivers/net/ethernet/mellanox/mlxsw/emad.h +50

But by far the most common error is "bad parameter" in which case we
would like to know exactly what is "bad" and why.

Anyway, it's already tomorrow here, so I'll be back in a few hours (IOW:
expect some delay).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs
  2019-11-04 23:20       ` Ido Schimmel
@ 2019-11-04 23:33         ` Jakub Kicinski
  2019-11-05  7:46           ` Ido Schimmel
  0 siblings, 1 reply; 20+ messages in thread
From: Jakub Kicinski @ 2019-11-04 23:33 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, davem, jiri, shalomt, mlxsw, Ido Schimmel

On Tue, 5 Nov 2019 01:20:36 +0200, Ido Schimmel wrote:
> On Mon, Nov 04, 2019 at 02:44:19PM -0800, Jakub Kicinski wrote:
> > On Mon, 4 Nov 2019 23:04:50 +0200, Ido Schimmel wrote:  
> > > I don't understand the problem. If we get an error from firmware today,
> > > we have no clue what the actual problem is. With this we can actually
> > > understand what went wrong. How is it different from kernel passing a
> > > string ("unstructured data") to user space in response to an erroneous
> > > netlink request? Obviously it's much better than an "-EINVAL".  
> > 
> > The difference is obviously that I can look at the code in the kernel
> > and understand it. FW code is a black box. Kernel should abstract its
> > black boxiness away.  
> 
> But FW code is still code and it needs to be able to report errors in a
> way that will aid us in debugging when problems occur. I want meaningful
> errors from applications regardless if I can read their code or not.

And the usual way accessing FW logs is through ethtool dumps.

> > > Also, in case it was not clear, this is a read-only interface and only
> > > from firmware to kernel. No hidden knobs or something fishy like that.  
> > 
> > I'm not saying it's fishy, I'm saying it's way harder to refactor code
> > if some of the user-visible outputs are not accessible (i.e. hidden in
> > a binary blob).  
> 
> Not sure I understand which code you're referring to? The error print
> statement?

The main form of refactoring I'm talking about is pulling out common
driver code into the core so we can make the drivers less gargantuan.

> > > > Is there any precedent in the tree for printing FW errors into the logs
> > > > like this?    
> > > 
> > > The mlx5 driver prints a unique number for each firmware error. We tried
> > > to do the same in switch firmware, but it lacked the infrastructure so
> > > we decided on this solution instead. It achieves the same goal, but in a
> > > different way.  
> > 
> > FWIW nfp FW also passes error numbers to the driver and based on that
> > driver makes decisions and prints errors of its own choosing. The big
> > difference being you can see all the relevant errors by looking at
> > driver code.  
> 
> This is done by mlxsw as well. See:
> $ vi drivers/net/ethernet/mellanox/mlxsw/emad.h +50
> 
> But by far the most common error is "bad parameter" in which case we
> would like to know exactly what is "bad" and why.

Right, when intelligence is placed in the FW, and the kernel doesn't
know which parameters may be bad then there is a need for FW log
reporting.. I am painfully familiar. FW engineers like to use terms
like "forward compatible" and "future proof".

> Anyway, it's already tomorrow here, so I'll be back in a few hours
> (IOW: expect some delay).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs
  2019-11-04 23:33         ` Jakub Kicinski
@ 2019-11-05  7:46           ` Ido Schimmel
  2019-11-05 17:54             ` Jakub Kicinski
  0 siblings, 1 reply; 20+ messages in thread
From: Ido Schimmel @ 2019-11-05  7:46 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, jiri, shalomt, mlxsw, Ido Schimmel

On Mon, Nov 04, 2019 at 03:33:42PM -0800, Jakub Kicinski wrote:
> On Tue, 5 Nov 2019 01:20:36 +0200, Ido Schimmel wrote:
> > On Mon, Nov 04, 2019 at 02:44:19PM -0800, Jakub Kicinski wrote:
> > > On Mon, 4 Nov 2019 23:04:50 +0200, Ido Schimmel wrote:  
> > > > I don't understand the problem. If we get an error from firmware today,
> > > > we have no clue what the actual problem is. With this we can actually
> > > > understand what went wrong. How is it different from kernel passing a
> > > > string ("unstructured data") to user space in response to an erroneous
> > > > netlink request? Obviously it's much better than an "-EINVAL".  
> > > 
> > > The difference is obviously that I can look at the code in the kernel
> > > and understand it. FW code is a black box. Kernel should abstract its
> > > black boxiness away.  
> > 
> > But FW code is still code and it needs to be able to report errors in a
> > way that will aid us in debugging when problems occur. I want meaningful
> > errors from applications regardless if I can read their code or not.
> 
> And the usual way accessing FW logs is through ethtool dumps.

I assume you're referring to set_dump() / get_dump_flag() /
get_dump_data() callbacks?

In our case it's not really a dump. These are errors that are reported
inline to the driver for a specific erroneous operation. We currently
can't retrieve them from firmware later on. Using ethtool means that we
need to store these errors in the driver and then push them to user
space upon get operation. Seems like a stretch to me. Especially when
we're already reporting the error code today and this set merely
augments it with more data to make the error more specific.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs
  2019-11-05  7:46           ` Ido Schimmel
@ 2019-11-05 17:54             ` Jakub Kicinski
  2019-11-05 20:48               ` Ido Schimmel
  2019-11-06  1:58               ` David Miller
  0 siblings, 2 replies; 20+ messages in thread
From: Jakub Kicinski @ 2019-11-05 17:54 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, davem, jiri, shalomt, mlxsw, Ido Schimmel

On Tue, 5 Nov 2019 09:46:50 +0200, Ido Schimmel wrote:
> On Mon, Nov 04, 2019 at 03:33:42PM -0800, Jakub Kicinski wrote:
> > On Tue, 5 Nov 2019 01:20:36 +0200, Ido Schimmel wrote:  
> > > On Mon, Nov 04, 2019 at 02:44:19PM -0800, Jakub Kicinski wrote:  
> > > > On Mon, 4 Nov 2019 23:04:50 +0200, Ido Schimmel wrote:    
> > > > > I don't understand the problem. If we get an error from firmware today,
> > > > > we have no clue what the actual problem is. With this we can actually
> > > > > understand what went wrong. How is it different from kernel passing a
> > > > > string ("unstructured data") to user space in response to an erroneous
> > > > > netlink request? Obviously it's much better than an "-EINVAL".    
> > > > 
> > > > The difference is obviously that I can look at the code in the kernel
> > > > and understand it. FW code is a black box. Kernel should abstract its
> > > > black boxiness away.    
> > > 
> > > But FW code is still code and it needs to be able to report errors in a
> > > way that will aid us in debugging when problems occur. I want meaningful
> > > errors from applications regardless if I can read their code or not.  
> > 
> > And the usual way accessing FW logs is through ethtool dumps.  
> 
> I assume you're referring to set_dump() / get_dump_flag() /
> get_dump_data() callbacks?
> 
> In our case it's not really a dump. These are errors that are reported
> inline to the driver for a specific erroneous operation. We currently
> can't retrieve them from firmware later on. Using ethtool means that we
> need to store these errors in the driver and then push them to user
> space upon get operation. Seems like a stretch to me. Especially when
> we're already reporting the error code today and this set merely
> augments it with more data to make the error more specific.

Hm, the firmware has no log that it keeps? Surely FW runs a lot of
periodic jobs etc which may encounter some error conditions, how do 
you deal with those?

Bottom line is I don't like when data from FW is just blindly passed
to user space. Printing to the logs is perhaps the smallest of this
sort of infractions but nonetheless if there is no precedent for doing
this today I'd consider not opening this box.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs
  2019-11-05 17:54             ` Jakub Kicinski
@ 2019-11-05 20:48               ` Ido Schimmel
  2019-11-05 21:48                 ` Jakub Kicinski
  2019-11-06  1:58               ` David Miller
  1 sibling, 1 reply; 20+ messages in thread
From: Ido Schimmel @ 2019-11-05 20:48 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, davem, jiri, shalomt, mlxsw, Ido Schimmel

On Tue, Nov 05, 2019 at 09:54:48AM -0800, Jakub Kicinski wrote:
> On Tue, 5 Nov 2019 09:46:50 +0200, Ido Schimmel wrote:
> > On Mon, Nov 04, 2019 at 03:33:42PM -0800, Jakub Kicinski wrote:
> > > On Tue, 5 Nov 2019 01:20:36 +0200, Ido Schimmel wrote:  
> > > > On Mon, Nov 04, 2019 at 02:44:19PM -0800, Jakub Kicinski wrote:  
> > > > > On Mon, 4 Nov 2019 23:04:50 +0200, Ido Schimmel wrote:    
> > > > > > I don't understand the problem. If we get an error from firmware today,
> > > > > > we have no clue what the actual problem is. With this we can actually
> > > > > > understand what went wrong. How is it different from kernel passing a
> > > > > > string ("unstructured data") to user space in response to an erroneous
> > > > > > netlink request? Obviously it's much better than an "-EINVAL".    
> > > > > 
> > > > > The difference is obviously that I can look at the code in the kernel
> > > > > and understand it. FW code is a black box. Kernel should abstract its
> > > > > black boxiness away.    
> > > > 
> > > > But FW code is still code and it needs to be able to report errors in a
> > > > way that will aid us in debugging when problems occur. I want meaningful
> > > > errors from applications regardless if I can read their code or not.  
> > > 
> > > And the usual way accessing FW logs is through ethtool dumps.  
> > 
> > I assume you're referring to set_dump() / get_dump_flag() /
> > get_dump_data() callbacks?
> > 
> > In our case it's not really a dump. These are errors that are reported
> > inline to the driver for a specific erroneous operation. We currently
> > can't retrieve them from firmware later on. Using ethtool means that we
> > need to store these errors in the driver and then push them to user
> > space upon get operation. Seems like a stretch to me. Especially when
> > we're already reporting the error code today and this set merely
> > augments it with more data to make the error more specific.
> 
> Hm, the firmware has no log that it keeps? Surely FW runs a lot of
> periodic jobs etc which may encounter some error conditions, how do 
> you deal with those?

There are intrusive out-of-tree modules that can get this information.
It's currently not possible to retrieve this information from the
driver. We try to move away from such methods, but it can't happen
overnight. This set and the work done in the firmware team to add this
new TLV is one step towards that goal.

> Bottom line is I don't like when data from FW is just blindly passed
> to user space.

The same information will be passed to user space regardless if you use
ethtool / devlink / printk.

> Printing to the logs is perhaps the smallest of this sort of
> infractions but nonetheless if there is no precedent for doing this
> today I'd consider not opening this box.

The mlx5 driver prints a 32-bit number that represents a unique error
code from firmware. As a user it tells you nothing, but internally
engineers can correlate it to a specific error.

I think it would be unfortunate to give up on this set due to personal
preferences alone. Just last week it proved its usefulness twice when I
tried to utilize a new firmware API and got it wrong.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs
  2019-11-05 20:48               ` Ido Schimmel
@ 2019-11-05 21:48                 ` Jakub Kicinski
  2019-11-06  8:20                   ` Jiri Pirko
  0 siblings, 1 reply; 20+ messages in thread
From: Jakub Kicinski @ 2019-11-05 21:48 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, davem, jiri, shalomt, mlxsw, Ido Schimmel

On Tue, 5 Nov 2019 22:48:26 +0200, Ido Schimmel wrote:
> On Tue, Nov 05, 2019 at 09:54:48AM -0800, Jakub Kicinski wrote:
> > Hm, the firmware has no log that it keeps? Surely FW runs a lot of
> > periodic jobs etc which may encounter some error conditions, how do 
> > you deal with those?  
> 
> There are intrusive out-of-tree modules that can get this information.
> It's currently not possible to retrieve this information from the
> driver. We try to move away from such methods, but it can't happen
> overnight. This set and the work done in the firmware team to add this
> new TLV is one step towards that goal.
> 
> > Bottom line is I don't like when data from FW is just blindly passed
> > to user space.  
> 
> The same information will be passed to user space regardless if you use
> ethtool / devlink / printk.

Sure, but the additional hoop to jump through makes it clear that this
is discouraged and it keeps clear separation between the Linux
interfaces and proprietary custom FW.. "stuff".

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs
  2019-11-05 17:54             ` Jakub Kicinski
  2019-11-05 20:48               ` Ido Schimmel
@ 2019-11-06  1:58               ` David Miller
  1 sibling, 0 replies; 20+ messages in thread
From: David Miller @ 2019-11-06  1:58 UTC (permalink / raw)
  To: jakub.kicinski; +Cc: idosch, netdev, jiri, shalomt, mlxsw, idosch

From: Jakub Kicinski <jakub.kicinski@netronome.com>
Date: Tue, 5 Nov 2019 09:54:48 -0800

> Bottom line is I don't like when data from FW is just blindly passed
> to user space. Printing to the logs is perhaps the smallest of this
> sort of infractions but nonetheless if there is no precedent for doing
> this today I'd consider not opening this box.

I agree with Jakub and we should set this kind of precedence.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs
  2019-11-05 21:48                 ` Jakub Kicinski
@ 2019-11-06  8:20                   ` Jiri Pirko
  2019-11-06 17:26                     ` Jakub Kicinski
  0 siblings, 1 reply; 20+ messages in thread
From: Jiri Pirko @ 2019-11-06  8:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Ido Schimmel, netdev, davem, jiri, shalomt, mlxsw, Ido Schimmel

Tue, Nov 05, 2019 at 10:48:58PM CET, jakub.kicinski@netronome.com wrote:
>On Tue, 5 Nov 2019 22:48:26 +0200, Ido Schimmel wrote:
>> On Tue, Nov 05, 2019 at 09:54:48AM -0800, Jakub Kicinski wrote:
>> > Hm, the firmware has no log that it keeps? Surely FW runs a lot of
>> > periodic jobs etc which may encounter some error conditions, how do 
>> > you deal with those?  
>> 
>> There are intrusive out-of-tree modules that can get this information.
>> It's currently not possible to retrieve this information from the
>> driver. We try to move away from such methods, but it can't happen
>> overnight. This set and the work done in the firmware team to add this
>> new TLV is one step towards that goal.
>> 
>> > Bottom line is I don't like when data from FW is just blindly passed
>> > to user space.  
>> 
>> The same information will be passed to user space regardless if you use
>> ethtool / devlink / printk.
>
>Sure, but the additional hoop to jump through makes it clear that this
>is discouraged and it keeps clear separation between the Linux
>interfaces and proprietary custom FW.. "stuff".

Hmm, let me try to understand. So you basically have problem with
passing random FW generated data and putting it to user (via dmesg,
extack). However, ethtool dump is fine. Devlink health reporter is also
fine.

That is completely sufficient for async events/errors.
However in this case, we have MSG sent to fw which generates an ERROR
and this error is sent from FW back, as a reaction to this particular
message.

What do you suggest we should use in order to maintain the MSG-ERROR
pairing? Perhaps a separate devlink health reporter just for this?

What do you think?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs
  2019-11-06  8:20                   ` Jiri Pirko
@ 2019-11-06 17:26                     ` Jakub Kicinski
  2019-11-07  9:42                       ` Jiri Pirko
  0 siblings, 1 reply; 20+ messages in thread
From: Jakub Kicinski @ 2019-11-06 17:26 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Ido Schimmel, netdev, davem, jiri, shalomt, mlxsw, Ido Schimmel

On Wed, 6 Nov 2019 09:20:39 +0100, Jiri Pirko wrote:
> Tue, Nov 05, 2019 at 10:48:58PM CET, jakub.kicinski@netronome.com wrote:
> >On Tue, 5 Nov 2019 22:48:26 +0200, Ido Schimmel wrote:  
> >> On Tue, Nov 05, 2019 at 09:54:48AM -0800, Jakub Kicinski wrote:  
> >> > Hm, the firmware has no log that it keeps? Surely FW runs a lot of
> >> > periodic jobs etc which may encounter some error conditions, how do 
> >> > you deal with those?    
> >> 
> >> There are intrusive out-of-tree modules that can get this information.
> >> It's currently not possible to retrieve this information from the
> >> driver. We try to move away from such methods, but it can't happen
> >> overnight. This set and the work done in the firmware team to add this
> >> new TLV is one step towards that goal.
> >>   
> >> > Bottom line is I don't like when data from FW is just blindly passed
> >> > to user space.    
> >> 
> >> The same information will be passed to user space regardless if you use
> >> ethtool / devlink / printk.  
> >
> >Sure, but the additional hoop to jump through makes it clear that this
> >is discouraged and it keeps clear separation between the Linux
> >interfaces and proprietary custom FW.. "stuff".  
> 
> Hmm, let me try to understand. So you basically have problem with
> passing random FW generated data and putting it to user (via dmesg,
> extack). However, ethtool dump is fine. Devlink health reporter is also
> fine.

Yup.

> That is completely sufficient for async events/errors.
> However in this case, we have MSG sent to fw which generates an ERROR
> and this error is sent from FW back, as a reaction to this particular
> message.

Well, outputting to dmesg is not more synchronous than putting it in
some other device specific facility.

> What do you suggest we should use in order to maintain the MSG-ERROR
> pairing? Perhaps a separate devlink health reporter just for this?

Again, to be clear - that's future work, right? Kernel logs as
implemented here do not maintain MSG-ERROR pairing.

> What do you think?

In all honesty it's hard to tell for sure, because we don't see the FW
and what it needs. That's kind of the point, it's a black box.

I prefer the driver to mediate all the information in a meaningful way.
If you want to report an error regarding a parameter the FW could
communicate some identification of the field and the reason and the
driver can control the output, for example format a string to print to
logs?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs
  2019-11-06 17:26                     ` Jakub Kicinski
@ 2019-11-07  9:42                       ` Jiri Pirko
  0 siblings, 0 replies; 20+ messages in thread
From: Jiri Pirko @ 2019-11-07  9:42 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Ido Schimmel, netdev, davem, jiri, shalomt, mlxsw, Ido Schimmel

Wed, Nov 06, 2019 at 06:26:47PM CET, jakub.kicinski@netronome.com wrote:
>On Wed, 6 Nov 2019 09:20:39 +0100, Jiri Pirko wrote:
>> Tue, Nov 05, 2019 at 10:48:58PM CET, jakub.kicinski@netronome.com wrote:
>> >On Tue, 5 Nov 2019 22:48:26 +0200, Ido Schimmel wrote:  
>> >> On Tue, Nov 05, 2019 at 09:54:48AM -0800, Jakub Kicinski wrote:  
>> >> > Hm, the firmware has no log that it keeps? Surely FW runs a lot of
>> >> > periodic jobs etc which may encounter some error conditions, how do 
>> >> > you deal with those?    
>> >> 
>> >> There are intrusive out-of-tree modules that can get this information.
>> >> It's currently not possible to retrieve this information from the
>> >> driver. We try to move away from such methods, but it can't happen
>> >> overnight. This set and the work done in the firmware team to add this
>> >> new TLV is one step towards that goal.
>> >>   
>> >> > Bottom line is I don't like when data from FW is just blindly passed
>> >> > to user space.    
>> >> 
>> >> The same information will be passed to user space regardless if you use
>> >> ethtool / devlink / printk.  
>> >
>> >Sure, but the additional hoop to jump through makes it clear that this
>> >is discouraged and it keeps clear separation between the Linux
>> >interfaces and proprietary custom FW.. "stuff".  
>> 
>> Hmm, let me try to understand. So you basically have problem with
>> passing random FW generated data and putting it to user (via dmesg,
>> extack). However, ethtool dump is fine. Devlink health reporter is also
>> fine.
>
>Yup.
>
>> That is completely sufficient for async events/errors.
>> However in this case, we have MSG sent to fw which generates an ERROR
>> and this error is sent from FW back, as a reaction to this particular
>> message.
>
>Well, outputting to dmesg is not more synchronous than putting it in
>some other device specific facility.

Well, not really. In dmesg, you see not only the fw msg, you see it along
with the tid (sequence number) and emad register name.


>
>> What do you suggest we should use in order to maintain the MSG-ERROR
>> pairing? Perhaps a separate devlink health reporter just for this?
>
>Again, to be clear - that's future work, right? Kernel logs as
>implemented here do not maintain MSG-ERROR pairing.

As I described above, they actually do.


>
>> What do you think?
>
>In all honesty it's hard to tell for sure, because we don't see the FW
>and what it needs. That's kind of the point, it's a black box.
>
>I prefer the driver to mediate all the information in a meaningful way.
>If you want to report an error regarding a parameter the FW could
>communicate some identification of the field and the reason and the
>driver can control the output, for example format a string to print to
>logs?

You are right, it is a blackbox. But the blackbox is sending texts about
what is went wrong with some particular operation. And these texts are
quite handy to figure out what is going on there. Either we ignore them,
or we show them somehow. It is very valuable to show them.

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2019-11-07  9:42 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-03  8:35 [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs Ido Schimmel
2019-11-03  8:35 ` [PATCH net-next 1/6] mlxsw: core: Parse TLVs' offsets of incoming EMADs Ido Schimmel
2019-11-03  8:35 ` [PATCH net-next 2/6] mlxsw: emad: Remove deprecated EMAD TLVs Ido Schimmel
2019-11-03  8:35 ` [PATCH net-next 3/6] mlxsw: core: Add EMAD string TLV Ido Schimmel
2019-11-03  8:35 ` [PATCH net-next 4/6] mlxsw: core: Add support for EMAD string TLV parsing Ido Schimmel
2019-11-03  8:35 ` [PATCH net-next 5/6] mlxsw: core: Add support for using EMAD string TLV Ido Schimmel
2019-11-03  8:35 ` [PATCH net-next 6/6] mlxsw: spectrum: Enable " Ido Schimmel
2019-11-04 20:39 ` [PATCH net-next 0/6] mlxsw: Add extended ACK for EMADs Jakub Kicinski
2019-11-04 21:04   ` Ido Schimmel
2019-11-04 22:44     ` Jakub Kicinski
2019-11-04 23:20       ` Ido Schimmel
2019-11-04 23:33         ` Jakub Kicinski
2019-11-05  7:46           ` Ido Schimmel
2019-11-05 17:54             ` Jakub Kicinski
2019-11-05 20:48               ` Ido Schimmel
2019-11-05 21:48                 ` Jakub Kicinski
2019-11-06  8:20                   ` Jiri Pirko
2019-11-06 17:26                     ` Jakub Kicinski
2019-11-07  9:42                       ` Jiri Pirko
2019-11-06  1:58               ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).