* [RFC PATCH net-next 0/6] devlink: Add device metric support
@ 2020-08-17 12:50 Ido Schimmel
  2020-08-17 12:50 ` [RFC PATCH net-next 1/6] devlink: Add device metric infrastructure Ido Schimmel
                   ` (6 more replies)
  0 siblings, 7 replies; 29+ messages in thread
From: Ido Schimmel @ 2020-08-17 12:50 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, jiri, amcohen, danieller, mlxsw, roopa, dsahern,
	andrew, f.fainelli, vivien.didelot, saeedm, tariqt, ayal, eranbe,
	mkubecek, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

This patch set extends devlink to allow device drivers to expose device
metrics to user space in a standard and extensible fashion, as opposed
to the driver-specific debugfs approach.

This is joint work with Amit Cohen and Danielle Ratson, done during a
two-day company hackathon.

Motivation
==========

Certain devices have metrics (e.g., diagnostic counters, histograms)
that are useful to expose to user space for testing and debugging
purposes.

Currently, there is no standardized interface through which these
metrics can be exposed and most drivers resort to debugfs, which is not
very welcome in the networking subsystem, for good reasons. For one, it
is not a stable interface on which users can rely. Secondly, it results
in duplicated code and inconsistent interfaces when drivers implement
similar functionality.

While Ethernet drivers can expose per-port counters to user space via
ethtool, they cannot expose device-wide metrics or configurable metrics
such as counters that can be enabled / disabled or histogram agents.

Solution overview
=================

Currently, the only supported metric type is a counter, but histograms
will be added in the future. The proposed interface is:

devlink dev metric show [ DEV metric METRIC | group GROUP ]
devlink dev metric set DEV metric METRIC [ group GROUP ]

Device drivers can dynamically register or unregister their metrics with
devlink by calling devlink_metric_counter_create() /
devlink_metric_destroy().
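
A minimal driver-side sketch of the flow (modeled on the netdevsim
patch later in this series; the foo_* names are placeholders):

static int foo_counter_get(struct devlink_metric *metric, u64 *p_val)
{
	struct foo_priv *foo = devlink_metric_priv(metric);

	/* foo_hw_read_counter() stands in for a driver-specific read of
	 * the counter from the device.
	 */
	*p_val = foo_hw_read_counter(foo);

	return 0;
}

static const struct devlink_metric_ops foo_counter_ops = {
	.counter_get = foo_counter_get,
};

	/* e.g., during driver initialization: */
	metric = devlink_metric_counter_create(devlink, "foo_counter",
					       &foo_counter_ops, foo);
	if (IS_ERR(metric))
		return PTR_ERR(metric);

	/* and during teardown: */
	devlink_metric_destroy(devlink, metric);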

Grouping allows user space to associate certain metrics with a group so
that they can be queried from the kernel with a single request and
retrieved in a single response filtered by the kernel (i.e., the kernel
sets NLM_F_DUMP_FILTERED).

Example
=======

Instantiate two netdevsim devices:

# echo "10 1" > /sys/bus/netdevsim/new_device
# echo "20 1" > /sys/bus/netdevsim/new_device

Dump all available metrics:

# devlink -s dev metric show
netdevsim/netdevsim10:
   metric dummy_counter type counter group 0 value 2
netdevsim/netdevsim20:
   metric dummy_counter type counter group 0 value 2

Dump a specific metric:

# devlink -s dev metric show netdevsim/netdevsim10 metric dummy_counter
netdevsim/netdevsim10:
   metric dummy_counter type counter group 0 value 3

Set a metric to a different group:

# devlink dev metric set netdevsim/netdevsim10 metric dummy_counter group 10

Dump all metrics in a specific group:

# devlink -s dev metric show group 10
netdevsim/netdevsim10:
   metric dummy_counter type counter group 10 value 4

Future extensions
=================

1. Enablement and disablement of metrics. This is useful when a metric
adds latency while enabled or consumes limited resources (e.g.,
counters or histogram agents). It is up to the device driver to decide
whether a metric is enabled by default. Proposed interface:

devlink dev metric set DEV metric METRIC [ group GROUP ]
	[ enable { true | false } ]

2. Histogram metrics. Some devices have the ability to calculate
histograms in hardware by sampling a specific parameter multiple times
per second. For example, the transmission queue depth of a port. This
enables the debugging of microbursts which would otherwise be invisible.
While this can be achieved in software using BPF, it is not applicable
when the data plane is offloaded as the CPU does not see the traffic.
Proposed interface:

devlink dev metric set DEV metric METRIC [ group GROUP ]
	[ enable { true | false } ] [ hist_type { linear | exp } ]
	[ hist_sample_interval SAMPLE ] [ hist_min MIN ] [ hist_max MAX ]
	[ hist_buckets BUCKETS ]

3. Per-port metrics. While all metrics could be exposed as global ones
and namespaced per port by naming them accordingly, there is value in
allowing user space to dump all the metrics related to a certain port.
Proposed interface:

devlink port metric set DEV/PORT_INDEX metric METRIC [ group GROUP ]
	[ enable { true | false } ] [ hist_type { linear | exp } ]
	[ hist_sample_interval SAMPLE ] [ hist_min MIN ] [ hist_max MAX ]
	[ hist_buckets BUCKETS ]
devlink port metric show [ DEV/PORT_INDEX metric METRIC | group GROUP ]

To avoid duplicating ethtool functionality, we can decide to expose via
this interface only:

1. Configurable metrics
2. Metrics that are not specific to Ethernet ports

TODO
====

1. Add devlink-metric man page
2. Add selftests over mlxsw

Ido Schimmel (6):
  devlink: Add device metric infrastructure
  netdevsim: Add devlink metric support
  selftests: netdevsim: Add devlink metric tests
  mlxsw: reg: Add Tunneling NVE Counters Register
  mlxsw: reg: Add Tunneling NVE Counters Register Version 2
  mlxsw: spectrum_nve: Expose VXLAN counters via devlink-metric

 .../networking/devlink/devlink-metric.rst     |  37 ++
 Documentation/networking/devlink/index.rst    |   1 +
 Documentation/networking/devlink/mlxsw.rst    |  36 ++
 drivers/net/ethernet/mellanox/mlxsw/reg.h     | 104 ++++++
 .../ethernet/mellanox/mlxsw/spectrum_nve.h    |  10 +
 .../mellanox/mlxsw/spectrum_nve_vxlan.c       | 285 +++++++++++++++
 drivers/net/netdevsim/dev.c                   |  92 ++++-
 drivers/net/netdevsim/netdevsim.h             |   1 +
 include/net/devlink.h                         |  18 +
 include/uapi/linux/devlink.h                  |  19 +
 net/core/devlink.c                            | 346 ++++++++++++++++++
 .../drivers/net/netdevsim/devlink.sh          |  49 ++-
 12 files changed, 995 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/networking/devlink/devlink-metric.rst

-- 
2.26.2


* [RFC PATCH net-next 1/6] devlink: Add device metric infrastructure
  2020-08-17 12:50 [RFC PATCH net-next 0/6] devlink: Add device metric support Ido Schimmel
@ 2020-08-17 12:50 ` Ido Schimmel
  2020-08-17 14:12   ` Andrew Lunn
  2020-08-17 12:50 ` [RFC PATCH net-next 2/6] netdevsim: Add devlink metric support Ido Schimmel
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 29+ messages in thread
From: Ido Schimmel @ 2020-08-17 12:50 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, jiri, amcohen, danieller, mlxsw, roopa, dsahern,
	andrew, f.fainelli, vivien.didelot, saeedm, tariqt, ayal, eranbe,
	mkubecek, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

Add an infrastructure that allows device drivers to dynamically register
and unregister their supported metrics with devlink. The metrics and
their values are exposed to user space, which can decide to group certain
metrics together. This allows user space to request a filtered dump of
only the metrics that are members of the provided group.

Currently, the only supported metric type is a counter, but histograms
will be added in the future for devices that implement histogram agents
in hardware.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 .../networking/devlink/devlink-metric.rst     |  26 ++
 Documentation/networking/devlink/index.rst    |   1 +
 include/net/devlink.h                         |  18 +
 include/uapi/linux/devlink.h                  |  19 +
 net/core/devlink.c                            | 346 ++++++++++++++++++
 5 files changed, 410 insertions(+)
 create mode 100644 Documentation/networking/devlink/devlink-metric.rst

diff --git a/Documentation/networking/devlink/devlink-metric.rst b/Documentation/networking/devlink/devlink-metric.rst
new file mode 100644
index 000000000000..cf5c5b4e4077
--- /dev/null
+++ b/Documentation/networking/devlink/devlink-metric.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============
+Devlink Metric
+==============
+
+The ``devlink-metric`` mechanism allows device drivers to expose device metrics
+to user space in a standard and extensible fashion. It provides an alternative
+to the driver-specific debugfs interface.
+
+Metric Types
+============
+
+The ``devlink-metric`` mechanism supports the following metric types:
+
+  * ``counter``: Monotonically increasing. Cannot be reset.
+
+Metrics Documentation
+=====================
+
+All the metrics exposed by a device driver must be clearly documented in the
+driver-specific ``devlink`` documentation under
+``Documentation/networking/devlink/``.
+
+When possible, a selftest (under ``tools/testing/selftests/drivers/``) should
+also be provided to ensure the metrics are updated under the right conditions.
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index 7684ae5c4a4a..b6f353384968 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -21,6 +21,7 @@ general.
    devlink-region
    devlink-resource
    devlink-trap
+   devlink-metric
 
 Driver-specific documentation
 -----------------------------
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 8f3c8a443238..f4754075dc43 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -36,6 +36,7 @@ struct devlink {
 	struct list_head trap_list;
 	struct list_head trap_group_list;
 	struct list_head trap_policer_list;
+	struct list_head metric_list;
 	const struct devlink_ops *ops;
 	struct xarray snapshot_ids;
 	struct device *dev;
@@ -990,6 +991,16 @@ enum devlink_trap_group_generic_id {
 		.min_burst = _min_burst,				      \
 	}
 
+struct devlink_metric;
+
+/**
+ * struct devlink_metric_ops - Metric operations.
+ * @counter_get: Get the counter value. Must not be NULL for counter metrics.
+ */
+struct devlink_metric_ops {
+	int (*counter_get)(struct devlink_metric *metric, u64 *p_val);
+};
+
 struct devlink_ops {
 	int (*reload_down)(struct devlink *devlink, bool netns_change,
 			   struct netlink_ext_ack *extack);
@@ -1405,6 +1416,13 @@ devlink_trap_policers_unregister(struct devlink *devlink,
 				 const struct devlink_trap_policer *policers,
 				 size_t policers_count);
 
+void *devlink_metric_priv(struct devlink_metric *metric);
+struct devlink_metric *
+devlink_metric_counter_create(struct devlink *devlink, const char *name,
+			      const struct devlink_metric_ops *ops, void *priv);
+void devlink_metric_destroy(struct devlink *devlink,
+			    struct devlink_metric *metric);
+
 #if IS_ENABLED(CONFIG_NET_DEVLINK)
 
 void devlink_compat_running_version(struct net_device *dev,
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index cfef4245ea5a..ebb555cb7cf7 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -122,6 +122,11 @@ enum devlink_command {
 	DEVLINK_CMD_TRAP_POLICER_NEW,
 	DEVLINK_CMD_TRAP_POLICER_DEL,
 
+	DEVLINK_CMD_METRIC_GET,		/* can dump */
+	DEVLINK_CMD_METRIC_SET,
+	DEVLINK_CMD_METRIC_NEW,
+	DEVLINK_CMD_METRIC_DEL,
+
 	/* add new commands above here */
 	__DEVLINK_CMD_MAX,
 	DEVLINK_CMD_MAX = __DEVLINK_CMD_MAX - 1
@@ -272,6 +277,14 @@ enum {
 	DEVLINK_ATTR_TRAP_METADATA_TYPE_FA_COOKIE,
 };
 
+/**
+ * enum devlink_metric_type - Metric type.
+ * @DEVLINK_METRIC_TYPE_COUNTER: Counter. Monotonically increasing.
+ */
+enum devlink_metric_type {
+	DEVLINK_METRIC_TYPE_COUNTER,
+};
+
 enum devlink_attr {
 	/* don't change the order or add anything between, this is ABI! */
 	DEVLINK_ATTR_UNSPEC,
@@ -458,6 +471,12 @@ enum devlink_attr {
 	DEVLINK_ATTR_PORT_LANES,			/* u32 */
 	DEVLINK_ATTR_PORT_SPLITTABLE,			/* u8 */
 
+	DEVLINK_ATTR_METRIC_NAME,		/* string */
+	/* enum devlink_metric_type */
+	DEVLINK_ATTR_METRIC_TYPE,		/* u8 */
+	DEVLINK_ATTR_METRIC_COUNTER_VALUE,	/* u64 */
+	DEVLINK_ATTR_METRIC_GROUP,		/* u32 */
+
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index e674f0f46dc2..94c0a1e09242 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -6994,6 +6994,218 @@ static int devlink_nl_cmd_trap_policer_set_doit(struct sk_buff *skb,
 	return devlink_trap_policer_set(devlink, policer_item, info);
 }
 
+/**
+ * struct devlink_metric - Metric attributes.
+ * @name: Metric name.
+ * @ops: Metric operations.
+ * @list: Member of 'metric_list'
+ * @type: Metric type.
+ * @group: Group number. '0' is the default group number.
+ * @priv: Metric private information.
+ */
+struct devlink_metric {
+	const char *name;
+	const struct devlink_metric_ops *ops;
+	struct list_head list;
+	enum devlink_metric_type type;
+	u32 group;
+	void *priv;
+};
+
+static struct devlink_metric *
+devlink_metric_lookup(struct devlink *devlink, const char *name)
+{
+	struct devlink_metric *metric;
+
+	list_for_each_entry(metric, &devlink->metric_list, list) {
+		if (!strcmp(metric->name, name))
+			return metric;
+	}
+
+	return NULL;
+}
+
+static struct devlink_metric *
+devlink_metric_get_from_info(struct devlink *devlink, struct genl_info *info)
+{
+	struct nlattr *attr;
+
+	if (!info->attrs[DEVLINK_ATTR_METRIC_NAME])
+		return NULL;
+	attr = info->attrs[DEVLINK_ATTR_METRIC_NAME];
+
+	return devlink_metric_lookup(devlink, nla_data(attr));
+}
+
+static int devlink_nl_metric_counter_fill(struct sk_buff *msg,
+					  struct devlink_metric *metric)
+{
+	u64 val;
+	int err;
+
+	err = metric->ops->counter_get(metric, &val);
+	if (err)
+		return err;
+
+	if (nla_put_u64_64bit(msg, DEVLINK_ATTR_METRIC_COUNTER_VALUE, val,
+			      DEVLINK_ATTR_PAD))
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+static int
+devlink_nl_metric_fill(struct sk_buff *msg, struct devlink *devlink,
+		       struct devlink_metric *metric, enum devlink_command cmd,
+		       u32 portid, u32 seq, int flags)
+{
+	void *hdr;
+
+	hdr = genlmsg_put(msg, portid, seq, &devlink_nl_family, flags, cmd);
+	if (!hdr)
+		return -EMSGSIZE;
+
+	if (devlink_nl_put_handle(msg, devlink))
+		goto nla_put_failure;
+
+	if (nla_put_string(msg, DEVLINK_ATTR_METRIC_NAME, metric->name))
+		goto nla_put_failure;
+
+	if (nla_put_u8(msg, DEVLINK_ATTR_METRIC_TYPE, metric->type))
+		goto nla_put_failure;
+
+	if (metric->type == DEVLINK_METRIC_TYPE_COUNTER &&
+	    devlink_nl_metric_counter_fill(msg, metric))
+		goto nla_put_failure;
+
+	if (nla_put_u32(msg, DEVLINK_ATTR_METRIC_GROUP, metric->group))
+		goto nla_put_failure;
+
+	genlmsg_end(msg, hdr);
+
+	return 0;
+
+nla_put_failure:
+	genlmsg_cancel(msg, hdr);
+	return -EMSGSIZE;
+}
+
+static int devlink_nl_cmd_metric_get_doit(struct sk_buff *skb,
+					  struct genl_info *info)
+{
+	struct netlink_ext_ack *extack = info->extack;
+	struct devlink *devlink = info->user_ptr[0];
+	struct devlink_metric *metric;
+	struct sk_buff *msg;
+	int err;
+
+	if (list_empty(&devlink->metric_list))
+		return -EOPNOTSUPP;
+
+	metric = devlink_metric_get_from_info(devlink, info);
+	if (!metric) {
+		NL_SET_ERR_MSG_MOD(extack, "Device did not register this metric");
+		return -ENOENT;
+	}
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	err = devlink_nl_metric_fill(msg, devlink, metric,
+				     DEVLINK_CMD_METRIC_NEW, info->snd_portid,
+				     info->snd_seq, 0);
+	if (err)
+		goto err_metric_fill;
+
+	return genlmsg_reply(msg, info);
+
+err_metric_fill:
+	nlmsg_free(msg);
+	return err;
+}
+
+static int devlink_nl_cmd_metric_get_dumpit(struct sk_buff *msg,
+					    struct netlink_callback *cb)
+{
+	const struct genl_dumpit_info *info = genl_dumpit_info(cb);
+	enum devlink_command cmd = DEVLINK_CMD_METRIC_NEW;
+	u32 portid = NETLINK_CB(cb->skb).portid;
+	struct devlink_metric *metric;
+	struct devlink *devlink;
+	int start = cb->args[0];
+	int flags = NLM_F_MULTI;
+	u32 group = 0;
+	int idx = 0;
+	int err;
+
+	if (info->attrs[DEVLINK_ATTR_METRIC_GROUP]) {
+		group = nla_get_u32(info->attrs[DEVLINK_ATTR_METRIC_GROUP]);
+		flags |= NLM_F_DUMP_FILTERED;
+	}
+
+	mutex_lock(&devlink_mutex);
+	list_for_each_entry(devlink, &devlink_list, list) {
+		if (!net_eq(devlink_net(devlink), sock_net(msg->sk)))
+			continue;
+		mutex_lock(&devlink->lock);
+		list_for_each_entry(metric, &devlink->metric_list, list) {
+			if (idx < start) {
+				idx++;
+				continue;
+			}
+			if (group && metric->group != group) {
+				idx++;
+				continue;
+			}
+			err = devlink_nl_metric_fill(msg, devlink, metric, cmd,
+						     portid, cb->nlh->nlmsg_seq,
+						     flags);
+			if (err) {
+				mutex_unlock(&devlink->lock);
+				goto out;
+			}
+			idx++;
+		}
+		mutex_unlock(&devlink->lock);
+	}
+out:
+	mutex_unlock(&devlink_mutex);
+
+	cb->args[0] = idx;
+	return msg->len;
+}
+
+static void devlink_metric_group_set(struct devlink_metric *metric,
+				     struct genl_info *info)
+{
+	if (!info->attrs[DEVLINK_ATTR_METRIC_GROUP])
+		return;
+
+	metric->group = nla_get_u32(info->attrs[DEVLINK_ATTR_METRIC_GROUP]);
+}
+
+static int devlink_nl_cmd_metric_set_doit(struct sk_buff *skb,
+					  struct genl_info *info)
+{
+	struct netlink_ext_ack *extack = info->extack;
+	struct devlink *devlink = info->user_ptr[0];
+	struct devlink_metric *metric;
+
+	if (list_empty(&devlink->metric_list))
+		return -EOPNOTSUPP;
+
+	metric = devlink_metric_get_from_info(devlink, info);
+	if (!metric) {
+		NL_SET_ERR_MSG_MOD(extack, "Device did not register this metric");
+		return -ENOENT;
+	}
+
+	devlink_metric_group_set(metric, info);
+
+	return 0;
+}
+
 static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 	[DEVLINK_ATTR_UNSPEC] = { .strict_start_type =
 		DEVLINK_ATTR_TRAP_POLICER_ID },
@@ -7039,6 +7251,9 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 	[DEVLINK_ATTR_TRAP_POLICER_RATE] = { .type = NLA_U64 },
 	[DEVLINK_ATTR_TRAP_POLICER_BURST] = { .type = NLA_U64 },
 	[DEVLINK_ATTR_PORT_FUNCTION] = { .type = NLA_NESTED },
+	[DEVLINK_ATTR_METRIC_NAME] = { .type = NLA_NUL_STRING },
+	[DEVLINK_ATTR_METRIC_TYPE] = { .type = NLA_U8 },
+	[DEVLINK_ATTR_METRIC_GROUP] = { .type = NLA_U32 },
 };
 
 static const struct genl_ops devlink_nl_ops[] = {
@@ -7347,6 +7562,17 @@ static const struct genl_ops devlink_nl_ops[] = {
 		.doit = devlink_nl_cmd_trap_policer_set_doit,
 		.flags = GENL_ADMIN_PERM,
 	},
+	{
+		.cmd = DEVLINK_CMD_METRIC_GET,
+		.doit = devlink_nl_cmd_metric_get_doit,
+		.dumpit = devlink_nl_cmd_metric_get_dumpit,
+		/* can be retrieved by unprivileged users */
+	},
+	{
+		.cmd = DEVLINK_CMD_METRIC_SET,
+		.doit = devlink_nl_cmd_metric_set_doit,
+		.flags = GENL_ADMIN_PERM,
+	},
 };
 
 static struct genl_family devlink_nl_family __ro_after_init = {
@@ -7396,6 +7622,7 @@ struct devlink *devlink_alloc(const struct devlink_ops *ops, size_t priv_size)
 	INIT_LIST_HEAD(&devlink->trap_list);
 	INIT_LIST_HEAD(&devlink->trap_group_list);
 	INIT_LIST_HEAD(&devlink->trap_policer_list);
+	INIT_LIST_HEAD(&devlink->metric_list);
 	mutex_init(&devlink->lock);
 	mutex_init(&devlink->reporters_lock);
 	return devlink;
@@ -7480,6 +7707,7 @@ void devlink_free(struct devlink *devlink)
 {
 	mutex_destroy(&devlink->reporters_lock);
 	mutex_destroy(&devlink->lock);
+	WARN_ON(!list_empty(&devlink->metric_list));
 	WARN_ON(!list_empty(&devlink->trap_policer_list));
 	WARN_ON(!list_empty(&devlink->trap_group_list));
 	WARN_ON(!list_empty(&devlink->trap_list));
@@ -9484,6 +9712,124 @@ devlink_trap_policers_unregister(struct devlink *devlink,
 }
 EXPORT_SYMBOL_GPL(devlink_trap_policers_unregister);
 
+/**
+ * devlink_metric_priv - Return metric private information.
+ * @metric: Metric.
+ *
+ * Return: Metric private information that was passed from device-driver
+ * during metric creation.
+ */
+void *devlink_metric_priv(struct devlink_metric *metric)
+{
+	return metric->priv;
+}
+EXPORT_SYMBOL_GPL(devlink_metric_priv);
+
+static void devlink_metric_notify(struct devlink *devlink,
+				  struct devlink_metric *metric,
+				  enum devlink_command cmd)
+{
+	struct sk_buff *msg;
+	int err;
+
+	WARN_ON_ONCE(cmd != DEVLINK_CMD_METRIC_NEW &&
+		     cmd != DEVLINK_CMD_METRIC_DEL);
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		return;
+
+	err = devlink_nl_metric_fill(msg, devlink, metric, cmd, 0, 0, 0);
+	if (err) {
+		nlmsg_free(msg);
+		return;
+	}
+
+	genlmsg_multicast_netns(&devlink_nl_family, devlink_net(devlink),
+				msg, 0, DEVLINK_MCGRP_CONFIG, GFP_KERNEL);
+}
+
+/**
+ * devlink_metric_counter_create - Create metric of counter type.
+ * @devlink: devlink.
+ * @name: Metric name.
+ * @ops: Metric operations.
+ * @priv: Metric private information.
+ *
+ * All metrics must be documented in the per-device documentation under
+ * Documentation/networking/devlink/.
+ *
+ * Return: Error pointer on failure.
+ */
+struct devlink_metric *
+devlink_metric_counter_create(struct devlink *devlink, const char *name,
+			      const struct devlink_metric_ops *ops, void *priv)
+{
+	struct devlink_metric *metric;
+	int err;
+
+	if (!ops || !ops->counter_get || !name)
+		return ERR_PTR(-EINVAL);
+
+	mutex_lock(&devlink->lock);
+
+	if (devlink_metric_lookup(devlink, name)) {
+		err = -EEXIST;
+		goto err_exists;
+	}
+
+	metric = kzalloc(sizeof(*metric), GFP_KERNEL);
+	if (!metric) {
+		err = -ENOMEM;
+		goto err_alloc_metric;
+	}
+
+	metric->name = kstrdup(name, GFP_KERNEL);
+	if (!metric->name) {
+		err = -ENOMEM;
+		goto err_alloc_metric_name;
+	}
+
+	metric->ops = ops;
+	metric->type = DEVLINK_METRIC_TYPE_COUNTER;
+	metric->group = 0;
+	metric->priv = priv;
+
+	list_add_tail(&metric->list, &devlink->metric_list);
+	devlink_metric_notify(devlink, metric, DEVLINK_CMD_METRIC_NEW);
+
+	mutex_unlock(&devlink->lock);
+
+	return metric;
+
+err_alloc_metric_name:
+	kfree(metric);
+err_alloc_metric:
+err_exists:
+	mutex_unlock(&devlink->lock);
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(devlink_metric_counter_create);
+
+/**
+ * devlink_metric_destroy - Destroy metric.
+ * @devlink: devlink.
+ * @metric: Metric.
+ */
+void devlink_metric_destroy(struct devlink *devlink,
+			    struct devlink_metric *metric)
+{
+	mutex_lock(&devlink->lock);
+
+	devlink_metric_notify(devlink, metric, DEVLINK_CMD_METRIC_DEL);
+	list_del(&metric->list);
+	kfree(metric->name);
+	kfree(metric);
+
+	mutex_unlock(&devlink->lock);
+}
+EXPORT_SYMBOL_GPL(devlink_metric_destroy);
+
 static void __devlink_compat_running_version(struct devlink *devlink,
 					     char *buf, size_t len)
 {
-- 
2.26.2


* [RFC PATCH net-next 2/6] netdevsim: Add devlink metric support
  2020-08-17 12:50 [RFC PATCH net-next 0/6] devlink: Add device metric support Ido Schimmel
  2020-08-17 12:50 ` [RFC PATCH net-next 1/6] devlink: Add device metric infrastructure Ido Schimmel
@ 2020-08-17 12:50 ` Ido Schimmel
  2020-08-17 12:50 ` [RFC PATCH net-next 3/6] selftests: netdevsim: Add devlink metric tests Ido Schimmel
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Ido Schimmel @ 2020-08-17 12:50 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, jiri, amcohen, danieller, mlxsw, roopa, dsahern,
	andrew, f.fainelli, vivien.didelot, saeedm, tariqt, ayal, eranbe,
	mkubecek, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

Register a dummy counter with devlink that is incremented by one
whenever queried.

Allow the query to fail by writing to a file in debugfs so that error
paths in the core infrastructure can be exercised.
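
For example (using the debugfs knob added here together with the devlink
command from the cover letter; $DEBUGFS_DIR is the netdevsim device's
debugfs directory, as used by the selftest added later in this series):

# echo "y" > $DEBUGFS_DIR/metric/fail_counter_get
# devlink -s dev metric show netdevsim/netdevsim10 metric dummy_counter
  (expected to fail while the knob is set)
# echo "n" > $DEBUGFS_DIR/metric/fail_counter_get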

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/netdevsim/dev.c       | 92 ++++++++++++++++++++++++++++++-
 drivers/net/netdevsim/netdevsim.h |  1 +
 2 files changed, 91 insertions(+), 2 deletions(-)

diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
index 32f339fedb21..075d2d4e22a5 100644
--- a/drivers/net/netdevsim/dev.c
+++ b/drivers/net/netdevsim/dev.c
@@ -692,6 +692,81 @@ static void nsim_dev_traps_exit(struct devlink *devlink)
 	kfree(nsim_dev->trap_data);
 }
 
+struct nsim_metric_data {
+	struct devlink_metric *dummy_counter;
+	struct dentry *ddir;
+	u64 dummy_counter_value;
+	bool fail_counter_get;
+};
+
+static int nsim_dev_dummy_counter_get(struct devlink_metric *metric, u64 *p_val)
+{
+	struct nsim_dev *nsim_dev = devlink_metric_priv(metric);
+	u64 *cnt;
+
+	if (nsim_dev->metric_data->fail_counter_get)
+		return -EINVAL;
+
+	cnt = &nsim_dev->metric_data->dummy_counter_value;
+	*p_val = (*cnt)++;
+
+	return 0;
+}
+
+static const struct devlink_metric_ops nsim_dev_dummy_counter_ops = {
+	.counter_get = nsim_dev_dummy_counter_get,
+};
+
+static int nsim_dev_metric_init(struct nsim_dev *nsim_dev)
+{
+	struct devlink *devlink = priv_to_devlink(nsim_dev);
+	struct nsim_metric_data *nsim_metric_data;
+	struct devlink_metric *dummy_counter;
+	int err;
+
+	nsim_metric_data = kzalloc(sizeof(*nsim_metric_data), GFP_KERNEL);
+	if (!nsim_metric_data)
+		return -ENOMEM;
+	nsim_dev->metric_data = nsim_metric_data;
+
+	dummy_counter = devlink_metric_counter_create(devlink, "dummy_counter",
+						      &nsim_dev_dummy_counter_ops,
+						      nsim_dev);
+	if (IS_ERR(dummy_counter)) {
+		err = PTR_ERR(dummy_counter);
+		goto err_free_metric_data;
+	}
+	nsim_metric_data->dummy_counter = dummy_counter;
+
+	nsim_metric_data->ddir = debugfs_create_dir("metric", nsim_dev->ddir);
+	if (IS_ERR(nsim_metric_data->ddir)) {
+		err = PTR_ERR(nsim_metric_data->ddir);
+		goto err_dummy_counter_destroy;
+	}
+
+	nsim_metric_data->fail_counter_get = false;
+	debugfs_create_bool("fail_counter_get", 0600, nsim_metric_data->ddir,
+			    &nsim_metric_data->fail_counter_get);
+
+	return 0;
+
+err_dummy_counter_destroy:
+	devlink_metric_destroy(devlink, dummy_counter);
+err_free_metric_data:
+	kfree(nsim_metric_data);
+	return err;
+}
+
+static void nsim_dev_metric_exit(struct nsim_dev *nsim_dev)
+{
+	struct nsim_metric_data *nsim_metric_data = nsim_dev->metric_data;
+	struct devlink *devlink = priv_to_devlink(nsim_dev);
+
+	debugfs_remove_recursive(nsim_metric_data->ddir);
+	devlink_metric_destroy(devlink, nsim_metric_data->dummy_counter);
+	kfree(nsim_metric_data);
+}
+
 static int nsim_dev_reload_create(struct nsim_dev *nsim_dev,
 				  struct netlink_ext_ack *extack);
 static void nsim_dev_reload_destroy(struct nsim_dev *nsim_dev);
@@ -1008,10 +1083,14 @@ static int nsim_dev_reload_create(struct nsim_dev *nsim_dev,
 	if (err)
 		goto err_traps_exit;
 
-	err = nsim_dev_port_add_all(nsim_dev, nsim_bus_dev->port_count);
+	err = nsim_dev_metric_init(nsim_dev);
 	if (err)
 		goto err_health_exit;
 
+	err = nsim_dev_port_add_all(nsim_dev, nsim_bus_dev->port_count);
+	if (err)
+		goto err_metric_exit;
+
 	nsim_dev->take_snapshot = debugfs_create_file("take_snapshot",
 						      0200,
 						      nsim_dev->ddir,
@@ -1019,6 +1098,8 @@ static int nsim_dev_reload_create(struct nsim_dev *nsim_dev,
 						&nsim_dev_take_snapshot_fops);
 	return 0;
 
+err_metric_exit:
+	nsim_dev_metric_exit(nsim_dev);
 err_health_exit:
 	nsim_dev_health_exit(nsim_dev);
 err_traps_exit:
@@ -1089,10 +1170,14 @@ int nsim_dev_probe(struct nsim_bus_dev *nsim_bus_dev)
 	if (err)
 		goto err_debugfs_exit;
 
-	err = nsim_bpf_dev_init(nsim_dev);
+	err = nsim_dev_metric_init(nsim_dev);
 	if (err)
 		goto err_health_exit;
 
+	err = nsim_bpf_dev_init(nsim_dev);
+	if (err)
+		goto err_metric_exit;
+
 	err = nsim_dev_port_add_all(nsim_dev, nsim_bus_dev->port_count);
 	if (err)
 		goto err_bpf_dev_exit;
@@ -1103,6 +1188,8 @@ int nsim_dev_probe(struct nsim_bus_dev *nsim_bus_dev)
 
 err_bpf_dev_exit:
 	nsim_bpf_dev_exit(nsim_dev);
+err_metric_exit:
+	nsim_dev_metric_exit(nsim_dev);
 err_health_exit:
 	nsim_dev_health_exit(nsim_dev);
 err_debugfs_exit:
@@ -1133,6 +1220,7 @@ static void nsim_dev_reload_destroy(struct nsim_dev *nsim_dev)
 		return;
 	debugfs_remove(nsim_dev->take_snapshot);
 	nsim_dev_port_del_all(nsim_dev);
+	nsim_dev_metric_exit(nsim_dev);
 	nsim_dev_health_exit(nsim_dev);
 	nsim_dev_traps_exit(devlink);
 	nsim_dev_dummy_region_exit(nsim_dev);
diff --git a/drivers/net/netdevsim/netdevsim.h b/drivers/net/netdevsim/netdevsim.h
index 284f7092241d..5f9a99bc4022 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -171,6 +171,7 @@ struct nsim_dev {
 	struct nsim_bus_dev *nsim_bus_dev;
 	struct nsim_fib_data *fib_data;
 	struct nsim_trap_data *trap_data;
+	struct nsim_metric_data *metric_data;
 	struct dentry *ddir;
 	struct dentry *ports_ddir;
 	struct dentry *take_snapshot;
-- 
2.26.2


* [RFC PATCH net-next 3/6] selftests: netdevsim: Add devlink metric tests
  2020-08-17 12:50 [RFC PATCH net-next 0/6] devlink: Add device metric support Ido Schimmel
  2020-08-17 12:50 ` [RFC PATCH net-next 1/6] devlink: Add device metric infrastructure Ido Schimmel
  2020-08-17 12:50 ` [RFC PATCH net-next 2/6] netdevsim: Add devlink metric support Ido Schimmel
@ 2020-08-17 12:50 ` Ido Schimmel
  2020-08-17 12:50 ` [RFC PATCH net-next 4/6] mlxsw: reg: Add Tunneling NVE Counters Register Ido Schimmel
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Ido Schimmel @ 2020-08-17 12:50 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, jiri, amcohen, danieller, mlxsw, roopa, dsahern,
	andrew, f.fainelli, vivien.didelot, saeedm, tariqt, ayal, eranbe,
	mkubecek, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

Test the existing functionality of the devlink metric infrastructure.
Tests will be added for any new functionality.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 .../networking/devlink/devlink-metric.rst     | 11 +++++
 .../drivers/net/netdevsim/devlink.sh          | 49 ++++++++++++++++++-
 2 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/devlink/devlink-metric.rst b/Documentation/networking/devlink/devlink-metric.rst
index cf5c5b4e4077..8a4515df1bc0 100644
--- a/Documentation/networking/devlink/devlink-metric.rst
+++ b/Documentation/networking/devlink/devlink-metric.rst
@@ -24,3 +24,14 @@ driver-specific ``devlink`` documentation under
 
 When possible, a selftest (under ``tools/testing/selftests/drivers/``) should
 also be provided to ensure the metrics are updated under the right conditions.
+
+Testing
+=======
+
+See ``tools/testing/selftests/drivers/net/netdevsim/devlink.sh`` for a
+test covering the core infrastructure. Test cases should be added for any new
+functionality.
+
+Device drivers should focus their tests on device-specific functionality, such
+as making sure the exposed metrics are correctly incremented and read from the
+device.
diff --git a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
index de4b32fc4223..4ca345e227bc 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/devlink.sh
@@ -5,7 +5,8 @@ lib_dir=$(dirname $0)/../../../net/forwarding
 
 ALL_TESTS="fw_flash_test params_test regions_test reload_test \
 	   netns_reload_test resource_test dev_info_test \
-	   empty_reporter_test dummy_reporter_test"
+	   empty_reporter_test dummy_reporter_test \
+	   metric_counter_test"
 NUM_NETIFS=0
 source $lib_dir/lib.sh
 
@@ -486,6 +487,52 @@ dummy_reporter_test()
 	log_test "dummy reporter test"
 }
 
+metric_counter_value_get()
+{
+	local metric=$1; shift
+
+	cmd_jq "devlink -jps dev metric show $DL_HANDLE metric $metric" \
+		'.[][][]["value"]'
+}
+
+metric_group_get()
+{
+	local metric=$1; shift
+
+	cmd_jq "devlink -jp dev metric show $DL_HANDLE metric $metric" \
+		'.[][][]["group"]'
+}
+
+metric_counter_test()
+{
+	RET=0
+
+	local val_t0=$(metric_counter_value_get dummy_counter)
+	local val_t1=$(metric_counter_value_get dummy_counter)
+	(( val_t0 < val_t1 ))
+	check_err $? "Expected to read a higher value in second read"
+
+	echo "y" > $DEBUGFS_DIR/metric/fail_counter_get
+	metric_counter_value_get dummy_counter
+	check_fail $? "Unexpected success of counter get"
+	echo "n" > $DEBUGFS_DIR/metric/fail_counter_get
+
+	devlink dev metric set $DL_HANDLE metric dummy_counter group 10
+
+	(( 10 == $(metric_group_get dummy_counter) ))
+	check_err $? "Expected \"dummy_counter\" to be in group 10"
+
+	devlink dev metric show group 10 | grep -q "dummy_counter"
+	check_err $? "Expected \"dummy_counter\" to be dumped"
+
+	devlink dev metric show group 20 | grep -q "dummy_counter"
+	check_fail $? "Did not expect to see \"dummy_counter\" in group 20"
+
+	devlink dev metric set $DL_HANDLE metric dummy_counter group 0
+
+	log_test "metric counter test"
+}
+
 setup_prepare()
 {
 	modprobe netdevsim
-- 
2.26.2


* [RFC PATCH net-next 4/6] mlxsw: reg: Add Tunneling NVE Counters Register
  2020-08-17 12:50 [RFC PATCH net-next 0/6] devlink: Add device metric support Ido Schimmel
                   ` (2 preceding siblings ...)
  2020-08-17 12:50 ` [RFC PATCH net-next 3/6] selftests: netdevsim: Add devlink metric tests Ido Schimmel
@ 2020-08-17 12:50 ` Ido Schimmel
  2020-08-17 12:50 ` [RFC PATCH net-next 5/6] mlxsw: reg: Add Tunneling NVE Counters Register Version 2 Ido Schimmel
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 29+ messages in thread
From: Ido Schimmel @ 2020-08-17 12:50 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, jiri, amcohen, danieller, mlxsw, roopa, dsahern,
	andrew, f.fainelli, vivien.didelot, saeedm, tariqt, ayal, eranbe,
	mkubecek, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

The TNCR register exposes counters of NVE encapsulation and
decapsulation on Spectrum-1.
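
For reference, reading the counters through this register looks roughly
as follows (this mirrors how the register is used later in this series):

	char tncr_pl[MLXSW_REG_TNCR_LEN];
	u64 encap_packets;
	int err;

	/* Pack a query that does not clear the counters. */
	mlxsw_reg_tncr_pack(tncr_pl, false);
	err = mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(tncr), tncr_pl);
	if (err)
		return err;
	encap_packets = mlxsw_reg_tncr_count_encap_get(tncr_pl);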

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 51 +++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 079b080de7f7..9f19127caf83 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -10070,6 +10070,56 @@ static inline void mlxsw_reg_tngcr_pack(char *payload,
 	mlxsw_reg_tngcr_nve_group_size_flood_set(payload, 1);
 }
 
+/* TNCR - Tunneling NVE Counters Register
+ * --------------------------------------
+ * The TNCR register exposes counters of NVE encapsulation and decapsulation.
+ *
+ * Note: Not supported by Spectrum-2 onwards.
+ */
+#define MLXSW_REG_TNCR_ID 0xA002
+#define MLXSW_REG_TNCR_LEN 0x30
+
+MLXSW_REG_DEFINE(tncr, MLXSW_REG_TNCR_ID, MLXSW_REG_TNCR_LEN);
+
+/* reg_tncr_clear_counters
+ * Clear counters.
+ * Access: OP
+ */
+MLXSW_ITEM32(reg, tncr, clear_counters, 0x00, 31, 1);
+
+/* reg_tncr_count_encap
+ * Count number of packets which did encapsulation to an NVE tunnel.
+ * Access: RO
+ *
+ * Note: Multicast packets which are encapsulated multiple times are counted
+ * multiple times.
+ */
+MLXSW_ITEM64(reg, tncr, count_encap, 0x10, 0, 64);
+
+/* reg_tncr_count_decap
+ * Count number of packets which did decapsulation from an NVE tunnel.
+ * Access: RO
+ */
+MLXSW_ITEM64(reg, tncr, count_decap, 0x18, 0, 64);
+
+/* reg_tncr_count_decap_errors
+ * Count number of packets which had decapsulation errors from an NVE tunnel.
+ * Access: RO
+ */
+MLXSW_ITEM64(reg, tncr, count_decap_errors, 0x20, 0, 64);
+
+/* reg_tncr_count_decap_discards
+ * Count number of packets which had decapsulation discards from an NVE tunnel.
+ * Access: RO
+ */
+MLXSW_ITEM64(reg, tncr, count_decap_discards, 0x28, 0, 64);
+
+static inline void mlxsw_reg_tncr_pack(char *payload, bool clear_counters)
+{
+	MLXSW_REG_ZERO(tncr, payload);
+	mlxsw_reg_tncr_clear_counters_set(payload, clear_counters);
+}
+
 /* TNUMT - Tunneling NVE Underlay Multicast Table Register
  * -------------------------------------------------------
  * The TNUMT register is for building the underlay MC table. It is used
@@ -11001,6 +11051,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
 	MLXSW_REG(mtptpt),
 	MLXSW_REG(mgpir),
 	MLXSW_REG(tngcr),
+	MLXSW_REG(tncr),
 	MLXSW_REG(tnumt),
 	MLXSW_REG(tnqcr),
 	MLXSW_REG(tnqdr),
-- 
2.26.2


* [RFC PATCH net-next 5/6] mlxsw: reg: Add Tunneling NVE Counters Register Version 2
  2020-08-17 12:50 [RFC PATCH net-next 0/6] devlink: Add device metric support Ido Schimmel
                   ` (3 preceding siblings ...)
  2020-08-17 12:50 ` [RFC PATCH net-next 4/6] mlxsw: reg: Add Tunneling NVE Counters Register Ido Schimmel
@ 2020-08-17 12:50 ` Ido Schimmel
  2020-08-17 12:50 ` [RFC PATCH net-next 6/6] mlxsw: spectrum_nve: Expose VXLAN counters via devlink-metric Ido Schimmel
  2020-08-19  0:24 ` [RFC PATCH net-next 0/6] devlink: Add device metric support Jakub Kicinski
  6 siblings, 0 replies; 29+ messages in thread
From: Ido Schimmel @ 2020-08-17 12:50 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, jiri, amcohen, danieller, mlxsw, roopa, dsahern,
	andrew, f.fainelli, vivien.didelot, saeedm, tariqt, ayal, eranbe,
	mkubecek, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

The TNCR-V2 register exposes counters of NVE encapsulation and
decapsulation on Spectrum-2 onwards.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 53 +++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 9f19127caf83..c891fc590ddd 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -10210,6 +10210,58 @@ static inline void mlxsw_reg_tnumt_pack(char *payload,
 	mlxsw_reg_tnumt_record_size_set(payload, record_size);
 }
 
+/* TNCR-V2 - Tunneling NVE Counters Register Version 2
+ * ---------------------------------------------------
+ * The TNCR-V2 register exposes counters of NVE encapsulation and
+ * decapsulation.
+ *
+ * Note: Not supported by Spectrum-1.
+ */
+#define MLXSW_REG_TNCR2_ID 0xA004
+#define MLXSW_REG_TNCR2_LEN 0x38
+
+MLXSW_REG_DEFINE(tncr2, MLXSW_REG_TNCR2_ID, MLXSW_REG_TNCR2_LEN);
+
+/* reg_tncr2_clear_counters
+ * Clear counters.
+ * Access: OP
+ */
+MLXSW_ITEM32(reg, tncr2, clear_counters, 0x00, 31, 1);
+
+enum mlxsw_reg_tncr2_tunnel_port {
+	MLXSW_REG_TNCR2_TUNNEL_PORT_NVE,
+	MLXSW_REG_TNCR2_TUNNEL_PORT_VPLS,
+	MLXSW_REG_TNCR2_TUNNEL_FLEX_TUNNEL0,
+	MLXSW_REG_TNCR2_TUNNEL_FLEX_TUNNEL1,
+};
+
+/* reg_tncr2_tunnel_port
+ * Tunnel port.
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, tncr2, tunnel_port, 0x00, 0, 4);
+
+/* reg_tncr2_count_decap_discards
+ * Count number of packets which had decapsulation discards from an NVE tunnel.
+ * Access: RO
+ */
+MLXSW_ITEM64(reg, tncr2, count_decap_discards, 0x28, 0, 64);
+
+/* reg_tncr2_count_encap_discards
+ * Count number of packets which had encapsulation discards to an NVE tunnel.
+ * Access: RO
+ */
+MLXSW_ITEM64(reg, tncr2, count_encap_discards, 0x30, 0, 64);
+
+static inline void mlxsw_reg_tncr2_pack(char *payload,
+					enum mlxsw_reg_tncr2_tunnel_port tport,
+					bool clear_counters)
+{
+	MLXSW_REG_ZERO(tncr2, payload);
+	mlxsw_reg_tncr2_clear_counters_set(payload, clear_counters);
+	mlxsw_reg_tncr2_tunnel_port_set(payload, tport);
+}
+
 /* TNQCR - Tunneling NVE QoS Configuration Register
  * ------------------------------------------------
  * The TNQCR register configures how QoS is set in encapsulation into the
@@ -11053,6 +11105,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
 	MLXSW_REG(tngcr),
 	MLXSW_REG(tncr),
 	MLXSW_REG(tnumt),
+	MLXSW_REG(tncr2),
 	MLXSW_REG(tnqcr),
 	MLXSW_REG(tnqdr),
 	MLXSW_REG(tneem),
-- 
2.26.2


* [RFC PATCH net-next 6/6] mlxsw: spectrum_nve: Expose VXLAN counters via devlink-metric
  2020-08-17 12:50 [RFC PATCH net-next 0/6] devlink: Add device metric support Ido Schimmel
                   ` (4 preceding siblings ...)
  2020-08-17 12:50 ` [RFC PATCH net-next 5/6] mlxsw: reg: Add Tunneling NVE Counters Register Version 2 Ido Schimmel
@ 2020-08-17 12:50 ` Ido Schimmel
  2020-08-17 14:29   ` Andrew Lunn
  2020-08-19  0:24 ` [RFC PATCH net-next 0/6] devlink: Add device metric support Jakub Kicinski
  6 siblings, 1 reply; 29+ messages in thread
From: Ido Schimmel @ 2020-08-17 12:50 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, jiri, amcohen, danieller, mlxsw, roopa, dsahern,
	andrew, f.fainelli, vivien.didelot, saeedm, tariqt, ayal, eranbe,
	mkubecek, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

The Spectrum ASICs have a single hardware VTEP that is able to perform
VXLAN encapsulation and decapsulation. The VTEP is logically mapped by
mlxsw to the multiple VXLAN netdevs that are using it. Exposing the
counters of this VTEP via each of these netdevs would be both
inaccurate and confusing for users.

Instead, expose the counters of the VTEP via devlink-metric. Note that
Spectrum-1 supports a different set of counters compared to newer ASICs
in the Spectrum family.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 Documentation/networking/devlink/mlxsw.rst    |  36 +++
 .../ethernet/mellanox/mlxsw/spectrum_nve.h    |  10 +
 .../mellanox/mlxsw/spectrum_nve_vxlan.c       | 285 ++++++++++++++++++
 3 files changed, 331 insertions(+)

diff --git a/Documentation/networking/devlink/mlxsw.rst b/Documentation/networking/devlink/mlxsw.rst
index cf857cb4ba8f..5d95056a571a 100644
--- a/Documentation/networking/devlink/mlxsw.rst
+++ b/Documentation/networking/devlink/mlxsw.rst
@@ -79,3 +79,39 @@ Driver-specific Traps
        routed through a disabled router interface (RIF). This can happen during
        RIF dismantle, when the RIF is first disabled before being removed
        completely
+
+Metrics
+=======
+
+.. list-table:: List of metrics registered by ``mlxsw``
+   :widths: 5 5 20 70
+
+   * - Name
+     - Type
+     - Supported platforms
+     - Description
+   * - ``nve_vxlan_encap``
+     - ``counter``
+     - Spectrum-1 only
+     - Counts number of packets that were VXLAN encapsulated by the device. A
+       packet sent to multiple VTEPs is counted multiple times
+   * - ``nve_vxlan_decap``
+     - ``counter``
+     - Spectrum-1 only
+     - Counts number of VXLAN packets that were decapsulated (successfully or
+       otherwise) by the device
+   * - ``nve_vxlan_decap_errors``
+     - ``counter``
+     - Spectrum-1 only
+     - Counts number of VXLAN packets that encountered decapsulation errors.
+       This includes overlay packets with a VLAN tag, ECN mismatch between
+       overlay and underlay, multicast overlay source MAC, overlay source MAC
+       equals overlay destination MAC and packets too short to decapsulate
+   * - ``nve_vxlan_decap_discards``
+     - ``counter``
+     - All
+     - Counts number of VXLAN packets that were discarded during decapsulation.
+       In Spectrum-1 this includes packets that had to be VXLAN decapsulated
+       when VXLAN decapsulation is disabled and fragmented overlay packets. In
+       Spectrum-2 this includes ``nve_vxlan_decap_errors`` errors and a missing
+       mapping between VNI and filtering identifier (FID)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve.h
index 12f664f42f21..249adea4d547 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve.h
@@ -6,6 +6,7 @@
 
 #include <linux/netlink.h>
 #include <linux/rhashtable.h>
+#include <net/devlink.h>
 
 #include "spectrum.h"
 
@@ -20,10 +21,19 @@ struct mlxsw_sp_nve_config {
 	union mlxsw_sp_l3addr ul_sip;
 };
 
+struct mlxsw_sp_nve_metrics {
+	struct devlink_metric *counter_encap;
+	struct devlink_metric *counter_decap;
+	struct devlink_metric *counter_decap_errors;
+	struct devlink_metric *counter_decap_discards;
+	struct devlink_metric *counter_encap_discards;
+};
+
 struct mlxsw_sp_nve {
 	struct mlxsw_sp_nve_config config;
 	struct rhashtable mc_list_ht;
 	struct mlxsw_sp *mlxsw_sp;
+	struct mlxsw_sp_nve_metrics metrics;
 	const struct mlxsw_sp_nve_ops **nve_ops_arr;
 	unsigned int num_nve_tunnels;	/* Protected by RTNL */
 	unsigned int num_max_mc_entries[MLXSW_SP_L3_PROTO_MAX];
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve_vxlan.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve_vxlan.c
index 05517c7feaa5..7b71fecb3b96 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve_vxlan.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_nve_vxlan.c
@@ -4,6 +4,7 @@
 #include <linux/netdevice.h>
 #include <linux/netlink.h>
 #include <linux/random.h>
+#include <net/devlink.h>
 #include <net/vxlan.h>
 
 #include "reg.h"
@@ -220,6 +221,173 @@ static int mlxsw_sp1_nve_vxlan_rtdp_set(struct mlxsw_sp *mlxsw_sp,
 	return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(rtdp), rtdp_pl);
 }
 
+static int
+mlxsw_sp1_nve_vxlan_common_counter_get(struct devlink_metric *metric,
+				       char *tncr_pl)
+{
+	struct mlxsw_sp *mlxsw_sp = devlink_metric_priv(metric);
+
+	mlxsw_reg_tncr_pack(tncr_pl, false);
+
+	return mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(tncr), tncr_pl);
+}
+
+static int
+mlxsw_sp1_nve_vxlan_encap_counter_get(struct devlink_metric *metric,
+				      u64 *p_val)
+{
+	char tncr_pl[MLXSW_REG_TNCR_LEN];
+	int err;
+
+	err = mlxsw_sp1_nve_vxlan_common_counter_get(metric, tncr_pl);
+	if (err)
+		return err;
+
+	*p_val = mlxsw_reg_tncr_count_encap_get(tncr_pl);
+
+	return 0;
+}
+
+static const struct devlink_metric_ops mlxsw_sp1_nve_vxlan_encap_ops = {
+	.counter_get = mlxsw_sp1_nve_vxlan_encap_counter_get,
+};
+
+static int
+mlxsw_sp1_nve_vxlan_decap_counter_get(struct devlink_metric *metric,
+				      u64 *p_val)
+{
+	char tncr_pl[MLXSW_REG_TNCR_LEN];
+	int err;
+
+	err = mlxsw_sp1_nve_vxlan_common_counter_get(metric, tncr_pl);
+	if (err)
+		return err;
+
+	*p_val = mlxsw_reg_tncr_count_decap_get(tncr_pl);
+
+	return 0;
+}
+
+static const struct devlink_metric_ops mlxsw_sp1_nve_vxlan_decap_ops = {
+	.counter_get = mlxsw_sp1_nve_vxlan_decap_counter_get,
+};
+
+static int
+mlxsw_sp1_nve_vxlan_decap_errors_counter_get(struct devlink_metric *metric,
+					     u64 *p_val)
+{
+	char tncr_pl[MLXSW_REG_TNCR_LEN];
+	int err;
+
+	err = mlxsw_sp1_nve_vxlan_common_counter_get(metric, tncr_pl);
+	if (err)
+		return err;
+
+	*p_val = mlxsw_reg_tncr_count_decap_errors_get(tncr_pl);
+
+	return 0;
+}
+
+static const struct devlink_metric_ops mlxsw_sp1_nve_vxlan_decap_errors_ops = {
+	.counter_get = mlxsw_sp1_nve_vxlan_decap_errors_counter_get,
+};
+
+static int
+mlxsw_sp1_nve_vxlan_decap_discards_counter_get(struct devlink_metric *metric,
+					       u64 *p_val)
+{
+	char tncr_pl[MLXSW_REG_TNCR_LEN];
+	int err;
+
+	err = mlxsw_sp1_nve_vxlan_common_counter_get(metric, tncr_pl);
+	if (err)
+		return err;
+
+	*p_val = mlxsw_reg_tncr_count_decap_discards_get(tncr_pl);
+
+	return 0;
+}
+
+static const struct devlink_metric_ops mlxsw_sp1_nve_vxlan_decap_discards_ops = {
+	.counter_get = mlxsw_sp1_nve_vxlan_decap_discards_counter_get,
+};
+
+static int mlxsw_sp1_nve_vxlan_counters_clear(struct mlxsw_sp *mlxsw_sp)
+{
+	char tncr_pl[MLXSW_REG_TNCR_LEN];
+
+	mlxsw_reg_tncr_pack(tncr_pl, true);
+
+	/* Clear operation is implemented on query. */
+	return mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(tncr), tncr_pl);
+}
+
+static int mlxsw_sp1_nve_vxlan_metrics_init(struct mlxsw_sp *mlxsw_sp)
+{
+	struct mlxsw_sp_nve_metrics *metrics = &mlxsw_sp->nve->metrics;
+	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
+	int err;
+
+	err = mlxsw_sp1_nve_vxlan_counters_clear(mlxsw_sp);
+	if (err)
+		return err;
+
+	metrics->counter_encap =
+		devlink_metric_counter_create(devlink, "nve_vxlan_encap",
+					      &mlxsw_sp1_nve_vxlan_encap_ops,
+					      mlxsw_sp);
+	if (IS_ERR(metrics->counter_encap))
+		return PTR_ERR(metrics->counter_encap);
+
+	metrics->counter_decap =
+		devlink_metric_counter_create(devlink, "nve_vxlan_decap",
+					      &mlxsw_sp1_nve_vxlan_decap_ops,
+					      mlxsw_sp);
+	if (IS_ERR(metrics->counter_decap)) {
+		err = PTR_ERR(metrics->counter_decap);
+		goto err_counter_decap;
+	}
+
+	metrics->counter_decap_errors =
+		devlink_metric_counter_create(devlink, "nve_vxlan_decap_errors",
+					      &mlxsw_sp1_nve_vxlan_decap_errors_ops,
+					      mlxsw_sp);
+	if (IS_ERR(metrics->counter_decap_errors)) {
+		err = PTR_ERR(metrics->counter_decap_errors);
+		goto err_counter_decap_errors;
+	}
+
+	metrics->counter_decap_discards =
+		devlink_metric_counter_create(devlink, "nve_vxlan_decap_discards",
+					      &mlxsw_sp1_nve_vxlan_decap_discards_ops,
+					      mlxsw_sp);
+	if (IS_ERR(metrics->counter_decap_discards)) {
+		err = PTR_ERR(metrics->counter_decap_discards);
+		goto err_counter_decap_discards;
+	}
+
+	return 0;
+
+err_counter_decap_discards:
+	devlink_metric_destroy(devlink, metrics->counter_decap_errors);
+err_counter_decap_errors:
+	devlink_metric_destroy(devlink, metrics->counter_decap);
+err_counter_decap:
+	devlink_metric_destroy(devlink, metrics->counter_encap);
+	return err;
+}
+
+static void mlxsw_sp1_nve_vxlan_metrics_fini(struct mlxsw_sp *mlxsw_sp)
+{
+	struct mlxsw_sp_nve_metrics *metrics = &mlxsw_sp->nve->metrics;
+	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
+
+	devlink_metric_destroy(devlink, metrics->counter_decap_discards);
+	devlink_metric_destroy(devlink, metrics->counter_decap_errors);
+	devlink_metric_destroy(devlink, metrics->counter_decap);
+	devlink_metric_destroy(devlink, metrics->counter_encap);
+}
+
 static int mlxsw_sp1_nve_vxlan_init(struct mlxsw_sp_nve *nve,
 				    const struct mlxsw_sp_nve_config *config)
 {
@@ -238,6 +406,10 @@ static int mlxsw_sp1_nve_vxlan_init(struct mlxsw_sp_nve *nve,
 	if (err)
 		goto err_rtdp_set;
 
+	err = mlxsw_sp1_nve_vxlan_metrics_init(mlxsw_sp);
+	if (err)
+		goto err_metrics_init;
+
 	err = mlxsw_sp_router_nve_promote_decap(mlxsw_sp, config->ul_tb_id,
 						config->ul_proto,
 						&config->ul_sip,
@@ -248,6 +420,8 @@ static int mlxsw_sp1_nve_vxlan_init(struct mlxsw_sp_nve *nve,
 	return 0;
 
 err_promote_decap:
+	mlxsw_sp1_nve_vxlan_metrics_fini(mlxsw_sp);
+err_metrics_init:
 err_rtdp_set:
 	mlxsw_sp1_nve_vxlan_config_clear(mlxsw_sp);
 err_config_set:
@@ -262,6 +436,7 @@ static void mlxsw_sp1_nve_vxlan_fini(struct mlxsw_sp_nve *nve)
 
 	mlxsw_sp_router_nve_demote_decap(mlxsw_sp, config->ul_tb_id,
 					 config->ul_proto, &config->ul_sip);
+	mlxsw_sp1_nve_vxlan_metrics_fini(mlxsw_sp);
 	mlxsw_sp1_nve_vxlan_config_clear(mlxsw_sp);
 	__mlxsw_sp_nve_inc_parsing_depth_put(mlxsw_sp, 0);
 }
@@ -360,6 +535,109 @@ static int mlxsw_sp2_nve_vxlan_rtdp_set(struct mlxsw_sp *mlxsw_sp,
 	return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(rtdp), rtdp_pl);
 }
 
+static int
+mlxsw_sp2_nve_vxlan_common_counter_get(struct devlink_metric *metric,
+				       char *tncr2_pl)
+{
+	struct mlxsw_sp *mlxsw_sp = devlink_metric_priv(metric);
+
+	mlxsw_reg_tncr2_pack(tncr2_pl, MLXSW_REG_TNCR2_TUNNEL_PORT_NVE, false);
+
+	return mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(tncr2), tncr2_pl);
+}
+
+static int
+mlxsw_sp2_nve_vxlan_decap_discards_counter_get(struct devlink_metric *metric,
+					       u64 *p_val)
+{
+	char tncr2_pl[MLXSW_REG_TNCR2_LEN];
+	int err;
+
+	err = mlxsw_sp2_nve_vxlan_common_counter_get(metric, tncr2_pl);
+	if (err)
+		return err;
+
+	*p_val = mlxsw_reg_tncr2_count_decap_discards_get(tncr2_pl);
+
+	return 0;
+}
+
+static const struct devlink_metric_ops mlxsw_sp2_nve_vxlan_decap_discards_ops = {
+	.counter_get = mlxsw_sp2_nve_vxlan_decap_discards_counter_get,
+};
+
+static int
+mlxsw_sp2_nve_vxlan_encap_discards_counter_get(struct devlink_metric *metric,
+					       u64 *p_val)
+{
+	char tncr2_pl[MLXSW_REG_TNCR2_LEN];
+	int err;
+
+	err = mlxsw_sp2_nve_vxlan_common_counter_get(metric, tncr2_pl);
+	if (err)
+		return err;
+
+	*p_val = mlxsw_reg_tncr2_count_encap_discards_get(tncr2_pl);
+
+	return 0;
+}
+
+static const struct devlink_metric_ops mlxsw_sp2_nve_vxlan_encap_discards_ops = {
+	.counter_get = mlxsw_sp2_nve_vxlan_encap_discards_counter_get,
+};
+
+static int mlxsw_sp2_nve_vxlan_counters_clear(struct mlxsw_sp *mlxsw_sp)
+{
+	char tncr2_pl[MLXSW_REG_TNCR2_LEN];
+
+	mlxsw_reg_tncr2_pack(tncr2_pl, MLXSW_REG_TNCR2_TUNNEL_PORT_NVE, true);
+
+	/* Clear operation is implemented on query. */
+	return mlxsw_reg_query(mlxsw_sp->core, MLXSW_REG(tncr2), tncr2_pl);
+}
+
+static int mlxsw_sp2_nve_vxlan_metrics_init(struct mlxsw_sp *mlxsw_sp)
+{
+	struct mlxsw_sp_nve_metrics *metrics = &mlxsw_sp->nve->metrics;
+	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
+	int err;
+
+	err = mlxsw_sp2_nve_vxlan_counters_clear(mlxsw_sp);
+	if (err)
+		return err;
+
+	metrics->counter_decap_discards =
+		devlink_metric_counter_create(devlink, "nve_vxlan_decap_discards",
+					      &mlxsw_sp2_nve_vxlan_decap_discards_ops,
+					      mlxsw_sp);
+	if (IS_ERR(metrics->counter_decap_discards))
+		return PTR_ERR(metrics->counter_decap_discards);
+
+	metrics->counter_encap_discards =
+		devlink_metric_counter_create(devlink, "nve_vxlan_encap_discards",
+					      &mlxsw_sp2_nve_vxlan_encap_discards_ops,
+					      mlxsw_sp);
+	if (IS_ERR(metrics->counter_encap_discards)) {
+		err = PTR_ERR(metrics->counter_encap_discards);
+		goto err_counter_encap_discards;
+	}
+
+	return 0;
+
+err_counter_encap_discards:
+	devlink_metric_destroy(devlink, metrics->counter_decap_discards);
+	return err;
+}
+
+static void mlxsw_sp2_nve_vxlan_metrics_fini(struct mlxsw_sp *mlxsw_sp)
+{
+	struct mlxsw_sp_nve_metrics *metrics = &mlxsw_sp->nve->metrics;
+	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
+
+	devlink_metric_destroy(devlink, metrics->counter_encap_discards);
+	devlink_metric_destroy(devlink, metrics->counter_decap_discards);
+}
+
 static int mlxsw_sp2_nve_vxlan_init(struct mlxsw_sp_nve *nve,
 				    const struct mlxsw_sp_nve_config *config)
 {
@@ -379,6 +657,10 @@ static int mlxsw_sp2_nve_vxlan_init(struct mlxsw_sp_nve *nve,
 	if (err)
 		goto err_rtdp_set;
 
+	err = mlxsw_sp2_nve_vxlan_metrics_init(mlxsw_sp);
+	if (err)
+		goto err_metrics_init;
+
 	err = mlxsw_sp_router_nve_promote_decap(mlxsw_sp, config->ul_tb_id,
 						config->ul_proto,
 						&config->ul_sip,
@@ -389,6 +671,8 @@ static int mlxsw_sp2_nve_vxlan_init(struct mlxsw_sp_nve *nve,
 	return 0;
 
 err_promote_decap:
+	mlxsw_sp2_nve_vxlan_metrics_fini(mlxsw_sp);
+err_metrics_init:
 err_rtdp_set:
 	mlxsw_sp2_nve_vxlan_config_clear(mlxsw_sp);
 err_config_set:
@@ -403,6 +687,7 @@ static void mlxsw_sp2_nve_vxlan_fini(struct mlxsw_sp_nve *nve)
 
 	mlxsw_sp_router_nve_demote_decap(mlxsw_sp, config->ul_tb_id,
 					 config->ul_proto, &config->ul_sip);
+	mlxsw_sp2_nve_vxlan_metrics_fini(mlxsw_sp);
 	mlxsw_sp2_nve_vxlan_config_clear(mlxsw_sp);
 	__mlxsw_sp_nve_inc_parsing_depth_put(mlxsw_sp, 0);
 }
-- 
2.26.2


* Re: [RFC PATCH net-next 1/6] devlink: Add device metric infrastructure
  2020-08-17 12:50 ` [RFC PATCH net-next 1/6] devlink: Add device metric infrastructure Ido Schimmel
@ 2020-08-17 14:12   ` Andrew Lunn
  0 siblings, 0 replies; 29+ messages in thread
From: Andrew Lunn @ 2020-08-17 14:12 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, kuba, jiri, amcohen, danieller, mlxsw, roopa,
	dsahern, f.fainelli, vivien.didelot, saeedm, tariqt, ayal,
	eranbe, mkubecek, Ido Schimmel

On Mon, Aug 17, 2020 at 03:50:54PM +0300, Ido Schimmel wrote:
> From: Ido Schimmel <idosch@nvidia.com>
> 
> Add an infrastructure that allows device drivers to dynamically register
> and unregister their supported metrics with devlink. The metrics and
> their values are exposed to user space, which can decide to group certain
> metrics together. This allows user space to request a filtered dump of
> only the metrics that are members of the provided group.
> 
> Currently, the only supported metric type is a counter, but histograms
> will be added in the future for devices that implement histogram agents
> in hardware.

Hi Ido, Amit

Some initial thoughts.

I said this during netdevconf: I think we need some way to group
metrics together. The example you gave was supporting the counters for
a TCAM and VXLAN offload. I expect users will want to get just the
TCAM counters, or just the VXLAN counters.

Maybe one way to support this is to allow the group to be passed to
the create function, rather than defaulting it to 0? The driver can
then split the metrics up, if it wants to. Otherwise, provide some
other sort of identifier which can be used, maybe a hardware block name?
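
Something like this, as a very rough sketch (the group_id parameter,
the MLXSW_SP_METRIC_GROUP_VXLAN define and the struct type names are
all made up here, use whatever the series ends up calling them):

	struct devlink_metric *
	devlink_metric_counter_create(struct devlink *devlink, const char *name,
				      u32 group_id,
				      const struct devlink_metric_ops *ops,
				      void *priv);

	/* e.g. keep the TCAM and the VXLAN counters in separate groups */
	metrics->counter_encap =
		devlink_metric_counter_create(devlink, "nve_vxlan_encap",
					      MLXSW_SP_METRIC_GROUP_VXLAN,
					      &mlxsw_sp1_nve_vxlan_encap_ops,
					      mlxsw_sp);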

One big difference between this API and normal netlink statistics is
that each devlink counter is totally independent of every other
devlink counter. You cannot compare counters, because they are not
read atomically. Most hardware I come across supports snapshots of the
counters. So with the current ethtool counters, you snapshot them,
read them all into one buffer, and then return them to user space. The
rtnl lock prevents two snapshots from running at the same time.
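
i.e. the usual driver pattern looks roughly like this (the foo_*
helpers, struct foo_priv and FOO_N_STATS are made-up placeholders):

	static void foo_get_ethtool_stats(struct net_device *dev,
					  struct ethtool_stats *stats,
					  u64 *data)
	{
		struct foo_priv *priv = netdev_priv(dev);
		int i;

		/* RTNL is held here, so two snapshots cannot race */
		foo_hw_snapshot_counters(priv);
		for (i = 0; i < FOO_N_STATS; i++)
			data[i] = foo_hw_read_counter(priv, i);
	}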

	Andrew


* Re: [RFC PATCH net-next 6/6] mlxsw: spectrum_nve: Expose VXLAN counters via devlink-metric
  2020-08-17 12:50 ` [RFC PATCH net-next 6/6] mlxsw: spectrum_nve: Expose VXLAN counters via devlink-metric Ido Schimmel
@ 2020-08-17 14:29   ` Andrew Lunn
  2020-08-18  6:59     ` Ido Schimmel
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Lunn @ 2020-08-17 14:29 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, kuba, jiri, amcohen, danieller, mlxsw, roopa,
	dsahern, f.fainelli, vivien.didelot, saeedm, tariqt, ayal,
	eranbe, mkubecek, Ido Schimmel

> +static int mlxsw_sp1_nve_vxlan_metrics_init(struct mlxsw_sp *mlxsw_sp)
> +{
> +	struct mlxsw_sp_nve_metrics *metrics = &mlxsw_sp->nve->metrics;
> +	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
> +	int err;
> +
> +	err = mlxsw_sp1_nve_vxlan_counters_clear(mlxsw_sp);
> +	if (err)
> +		return err;
> +
> +	metrics->counter_encap =
> +		devlink_metric_counter_create(devlink, "nve_vxlan_encap",
> +					      &mlxsw_sp1_nve_vxlan_encap_ops,
> +					      mlxsw_sp);
> +	if (IS_ERR(metrics->counter_encap))
> +		return PTR_ERR(metrics->counter_encap);
> +
> +	metrics->counter_decap =
> +		devlink_metric_counter_create(devlink, "nve_vxlan_decap",
> +					      &mlxsw_sp1_nve_vxlan_decap_ops,
> +					      mlxsw_sp);
> +	if (IS_ERR(metrics->counter_decap)) {
> +		err = PTR_ERR(metrics->counter_decap);
> +		goto err_counter_decap;
> +	}
> +
> +	metrics->counter_decap_errors =
> +		devlink_metric_counter_create(devlink, "nve_vxlan_decap_errors",
> +					      &mlxsw_sp1_nve_vxlan_decap_errors_ops,
> +					      mlxsw_sp);
> +	if (IS_ERR(metrics->counter_decap_errors)) {
> +		err = PTR_ERR(metrics->counter_decap_errors);
> +		goto err_counter_decap_errors;
> +	}
> +
> +	metrics->counter_decap_discards =
> +		devlink_metric_counter_create(devlink, "nve_vxlan_decap_discards",
> +					      &mlxsw_sp1_nve_vxlan_decap_discards_ops,
> +					      mlxsw_sp);
> +	if (IS_ERR(metrics->counter_decap_discards)) {
> +		err = PTR_ERR(metrics->counter_decap_discards);
> +		goto err_counter_decap_discards;
> +	}
> +
> +	return 0;

Looking at this, I wonder about the scalability of this API. With just
4 counters it looks pretty ugly. What about 50 counters?

Maybe move the name into the ops structure. Then add a call
devlink_metric_counters_create() where you can pass an array of ops
structures and its size? There are plenty of other examples in the
kernel, e.g. sysfs groups, hwmon, etc. where you register a large
bunch of things with the core with a single call.
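
Roughly (just a sketch; the ops structure layout, the value_get
callback and the mlxsw_sp1_nve_vxlan_metric_ops array are invented
here):

	struct devlink_metric_ops {
		const char *name;
		int (*counter_value_get)(struct devlink_metric *metric,
					 u64 *p_value);
	};

	int devlink_metric_counters_create(struct devlink *devlink,
					   const struct devlink_metric_ops *ops,
					   unsigned int count, void *priv);

	err = devlink_metric_counters_create(devlink,
					     mlxsw_sp1_nve_vxlan_metric_ops,
					     ARRAY_SIZE(mlxsw_sp1_nve_vxlan_metric_ops),
					     mlxsw_sp);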

> +static void mlxsw_sp1_nve_vxlan_metrics_fini(struct mlxsw_sp *mlxsw_sp)
> +{
> +	struct mlxsw_sp_nve_metrics *metrics = &mlxsw_sp->nve->metrics;
> +	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
> +
> +	devlink_metric_destroy(devlink, metrics->counter_decap_discards);
> +	devlink_metric_destroy(devlink, metrics->counter_decap_errors);
> +	devlink_metric_destroy(devlink, metrics->counter_decap);
> +	devlink_metric_destroy(devlink, metrics->counter_encap);
> +}

I guess the most frequent use case is to remove all counters,
e.g. driver unload, or when probe fails. So maybe provide a
devlink_metric_destroy_all(devlink) ?
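
i.e. just:

	void devlink_metric_destroy_all(struct devlink *devlink);

and let the devlink core clean up whatever is still registered.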

    Andrew


* Re: [RFC PATCH net-next 6/6] mlxsw: spectrum_nve: Expose VXLAN counters via devlink-metric
  2020-08-17 14:29   ` Andrew Lunn
@ 2020-08-18  6:59     ` Ido Schimmel
  0 siblings, 0 replies; 29+ messages in thread
From: Ido Schimmel @ 2020-08-18  6:59 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, davem, kuba, jiri, amcohen, danieller, mlxsw, roopa,
	dsahern, f.fainelli, vivien.didelot, saeedm, tariqt, ayal,
	eranbe, mkubecek, Ido Schimmel

On Mon, Aug 17, 2020 at 04:29:52PM +0200, Andrew Lunn wrote:
> > +static int mlxsw_sp1_nve_vxlan_metrics_init(struct mlxsw_sp *mlxsw_sp)
> > +{
> > +	struct mlxsw_sp_nve_metrics *metrics = &mlxsw_sp->nve->metrics;
> > +	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
> > +	int err;
> > +
> > +	err = mlxsw_sp1_nve_vxlan_counters_clear(mlxsw_sp);
> > +	if (err)
> > +		return err;
> > +
> > +	metrics->counter_encap =
> > +		devlink_metric_counter_create(devlink, "nve_vxlan_encap",
> > +					      &mlxsw_sp1_nve_vxlan_encap_ops,
> > +					      mlxsw_sp);
> > +	if (IS_ERR(metrics->counter_encap))
> > +		return PTR_ERR(metrics->counter_encap);
> > +
> > +	metrics->counter_decap =
> > +		devlink_metric_counter_create(devlink, "nve_vxlan_decap",
> > +					      &mlxsw_sp1_nve_vxlan_decap_ops,
> > +					      mlxsw_sp);
> > +	if (IS_ERR(metrics->counter_decap)) {
> > +		err = PTR_ERR(metrics->counter_decap);
> > +		goto err_counter_decap;
> > +	}
> > +
> > +	metrics->counter_decap_errors =
> > +		devlink_metric_counter_create(devlink, "nve_vxlan_decap_errors",
> > +					      &mlxsw_sp1_nve_vxlan_decap_errors_ops,
> > +					      mlxsw_sp);
> > +	if (IS_ERR(metrics->counter_decap_errors)) {
> > +		err = PTR_ERR(metrics->counter_decap_errors);
> > +		goto err_counter_decap_errors;
> > +	}
> > +
> > +	metrics->counter_decap_discards =
> > +		devlink_metric_counter_create(devlink, "nve_vxlan_decap_discards",
> > +					      &mlxsw_sp1_nve_vxlan_decap_discards_ops,
> > +					      mlxsw_sp);
> > +	if (IS_ERR(metrics->counter_decap_discards)) {
> > +		err = PTR_ERR(metrics->counter_decap_discards);
> > +		goto err_counter_decap_discards;
> > +	}
> > +
> > +	return 0;
> 
> Looking at this, I wonder about the scalability of this API. With just
> 4 counters it looks pretty ugly. What about 50 counters?
> 
> Maybe move the name into the ops structure. Then add a call
> devlink_metric_counters_create() where you can pass an array of ops
> structures and its size? There are plenty of other examples in the
> kernel, e.g. sysfs groups, hwmon, etc. where you register a large
> bunch of things with the core with a single call.

Yes, good suggestion. Will add the ability to register multiple metrics
at once.

> 
> > +static void mlxsw_sp1_nve_vxlan_metrics_fini(struct mlxsw_sp *mlxsw_sp)
> > +{
> > +	struct mlxsw_sp_nve_metrics *metrics = &mlxsw_sp->nve->metrics;
> > +	struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
> > +
> > +	devlink_metric_destroy(devlink, metrics->counter_decap_discards);
> > +	devlink_metric_destroy(devlink, metrics->counter_decap_errors);
> > +	devlink_metric_destroy(devlink, metrics->counter_decap);
> > +	devlink_metric_destroy(devlink, metrics->counter_encap);
> > +}
> 
> I guess the most frequent use case is to remove all counters,
> e.g. driver unload, or when probe fails. So maybe provide a
> devlink_metric_destroy_all(devlink) ?

If we are going to add something like devlink_metric_counters_create(),
then we can also add devlink_metrics_destroy(), which will remove all
the provided metrics in one call. I prefer it over _all() because then
it is symmetric with the _create() operation.
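
i.e. roughly (exact signature TBD, just to show the symmetry with the
array-based create):

	void devlink_metrics_destroy(struct devlink *devlink,
				     const struct devlink_metric_ops *ops,
				     unsigned int count);

With that, mlxsw_sp1_nve_vxlan_metrics_fini() becomes a single call
instead of four devlink_metric_destroy() calls.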

Thanks!


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-17 12:50 [RFC PATCH net-next 0/6] devlink: Add device metric support Ido Schimmel
                   ` (5 preceding siblings ...)
  2020-08-17 12:50 ` [RFC PATCH net-next 6/6] mlxsw: spectrum_nve: Expose VXLAN counters via devlink-metric Ido Schimmel
@ 2020-08-19  0:24 ` Jakub Kicinski
  2020-08-19  2:43   ` David Ahern
  6 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2020-08-19  0:24 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: netdev, davem, jiri, amcohen, danieller, mlxsw, roopa, dsahern,
	andrew, f.fainelli, vivien.didelot, saeedm, tariqt, ayal, eranbe,
	mkubecek, Ido Schimmel

On Mon, 17 Aug 2020 15:50:53 +0300 Ido Schimmel wrote:
> From: Ido Schimmel <idosch@nvidia.com>
> 
> This patch set extends devlink to allow device drivers to expose device
> metrics to user space in a standard and extensible fashion, as opposed
> to the driver-specific debugfs approach.

I feel like all those loose hardware interfaces are a huge maintenance
burden. I don't know what the solution is, but the status quo is not
great.

I spend way too much time patrolling ethtool -S outputs already.

I've done my absolute best to make sure that something as simple as
updating device firmware can be done in a vendor-agnostic fashion, 
and I'm not sure I've succeeded. Every single vendor comes up with
their own twists.

Long story short, I'm extremely unexcited about another interface where
drivers expose random strings of their own picking. Maybe the udp_tunnel
module could have "global" stats?

This would be a good topic for netconf, or LPC hallway discussion :(


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-19  0:24 ` [RFC PATCH net-next 0/6] devlink: Add device metric support Jakub Kicinski
@ 2020-08-19  2:43   ` David Ahern
  2020-08-19  3:35     ` Jakub Kicinski
  0 siblings, 1 reply; 29+ messages in thread
From: David Ahern @ 2020-08-19  2:43 UTC (permalink / raw)
  To: Jakub Kicinski, Ido Schimmel
  Cc: netdev, davem, jiri, amcohen, danieller, mlxsw, roopa, andrew,
	f.fainelli, vivien.didelot, saeedm, tariqt, ayal, eranbe,
	mkubecek, Ido Schimmel

On 8/18/20 6:24 PM, Jakub Kicinski wrote:
> On Mon, 17 Aug 2020 15:50:53 +0300 Ido Schimmel wrote:
>> From: Ido Schimmel <idosch@nvidia.com>
>>
>> This patch set extends devlink to allow device drivers to expose device
>> metrics to user space in a standard and extensible fashion, as opposed
>> to the driver-specific debugfs approach.
> 
> I feel like all those loose hardware interfaces are a huge maintenance
> burden. I don't know what the solution is, but the status quo is not
> great.

I don't agree with the 'loose' characterization. Ido and team are
pushing what is arguably a modern version of `ethtool -S`, so it
provides a better API for retrieving data.

> 
> I spend way too much time patrolling ethtool -S outputs already.
> 

But that's the nature of detailed stats which are often essential to
ensuring the system is operating as expected or debugging some problem.
Commonality is certainly desired in names when relevant to be able to
build tooling around the stats. As an example, per-queue stats have been
essential to me for recent investigations. ethq has been really helpful
in crossing NIC vendors and viewing those stats as it handles the
per-vendor naming differences, but it requires changes to show anything
else - errors per queue, xdp stats, drops, etc. This part could be simpler.

As for this set, I believe the metrics exposed here are more unique to
switch ASICs. At least one company I know of has built a business model
around exposing detailed telemetry of switch ASICs, so clearly some find
them quite valuable.


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-19  2:43   ` David Ahern
@ 2020-08-19  3:35     ` Jakub Kicinski
  2020-08-19  4:30       ` Florian Fainelli
  0 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2020-08-19  3:35 UTC (permalink / raw)
  To: David Ahern
  Cc: Ido Schimmel, netdev, davem, jiri, amcohen, danieller, mlxsw,
	roopa, andrew, f.fainelli, vivien.didelot, tariqt, ayal,
	mkubecek, Ido Schimmel

On Tue, 18 Aug 2020 20:43:11 -0600 David Ahern wrote:
> On 8/18/20 6:24 PM, Jakub Kicinski wrote:
> > On Mon, 17 Aug 2020 15:50:53 +0300 Ido Schimmel wrote:  
> >> From: Ido Schimmel <idosch@nvidia.com>
> >>
> >> This patch set extends devlink to allow device drivers to expose device
> >> metrics to user space in a standard and extensible fashion, as opposed
> >> to the driver-specific debugfs approach.  
> > 
> > I feel like all those loose hardware interfaces are a huge maintenance
> > burden. I don't know what the solution is, but the status quo is not
> > great.  
> 
> I don't agree with the 'loose' characterization.

Loose as in not bound by any standard or best practices.

> Ido and team are pushing what is arguably a modern version of
> `ethtool -S`, so it provides a better API for retrieving data.

ethtool -S is absolutely terrible. Everybody comes up with their own
names for IEEE stats, and dumps stats which clearly have corresponding
fields in rtnl_link_stats64 there. We don't need a modern ethtool -S,
we need to get away from that mess.

> > I spend way too much time patrolling ethtool -S outputs already.
> 
> But that's the nature of detailed stats which are often essential to
> ensuring the system is operating as expected or debugging some problem.
> Commonality is certainly desired in names when relevant to be able to
> build tooling around the stats.

There are stats which are clearly detailed and device specific, 
but what ends up happening is that people expose very much not
implementation specific stats through the free form interfaces, 
because it's the easiest. 

And users are left picking up the pieces, having to ask vendors what
each stat means, and trying to create abstractions in their user space
glue.

> As an example, per-queue stats have been
> essential to me for recent investigations. ethq has been really helpful
> in crossing NIC vendors and viewing those stats as it handles the
> per-vendor naming differences, but it requires changes to show anything
> else - errors per queue, xdp stats, drops, etc. This part could be simpler.

Sounds like you're agreeing with me?

> As for this set, I believe the metrics exposed here are more unique to
> switch ASICs.

This is the list from patch 6:

   * - ``nve_vxlan_encap``
   * - ``nve_vxlan_decap``
   * - ``nve_vxlan_decap_errors``
   * - ``nve_vxlan_decap_discards``

What's so unique?

> At least one company I know of has built a business model
> around exposing detailed telemetry of switch ASICs, so clearly some find
> them quite valuable.

It's a question of interface, not the value of exposed data.

If I have to download vendor documentation and tooling, or adapt my own
scripts for every new vendor, I could have as well downloaded an SDK.


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-19  3:35     ` Jakub Kicinski
@ 2020-08-19  4:30       ` Florian Fainelli
  2020-08-19 16:18         ` Jakub Kicinski
  0 siblings, 1 reply; 29+ messages in thread
From: Florian Fainelli @ 2020-08-19  4:30 UTC (permalink / raw)
  To: Jakub Kicinski, David Ahern
  Cc: Ido Schimmel, netdev, davem, jiri, amcohen, danieller, mlxsw,
	roopa, andrew, vivien.didelot, tariqt, ayal, mkubecek,
	Ido Schimmel



On 8/18/2020 8:35 PM, Jakub Kicinski wrote:
> On Tue, 18 Aug 2020 20:43:11 -0600 David Ahern wrote:
>> On 8/18/20 6:24 PM, Jakub Kicinski wrote:
>>> On Mon, 17 Aug 2020 15:50:53 +0300 Ido Schimmel wrote:
>>>> From: Ido Schimmel <idosch@nvidia.com>
>>>>
>>>> This patch set extends devlink to allow device drivers to expose device
>>>> metrics to user space in a standard and extensible fashion, as opposed
>>>> to the driver-specific debugfs approach.
>>>
>>> I feel like all those loose hardware interfaces are a huge maintenance
>>> burden. I don't know what the solution is, but the status quo is not
>>> great.
>>
>> I don't agree with the 'loose' characterization.
> 
> Loose as in not bound by any standard or best practices.
> 
>> Ido and team are pushing what is arguably a modern version of
>> `ethtool -S`, so it provides a better API for retrieving data.
> 
> ethtool -S is absolutely terrible. Everybody comes up with their own
> names for IEEE stats, and dumps stats which clearly have corresponding
> fields in rtnl_link_stats64 there. We don't need a modern ethtool -S,
> we need to get away from that mess.
> 
>>> I spend way too much time patrolling ethtool -S outputs already.
>>
>> But that's the nature of detailed stats which are often essential to
>> ensuring the system is operating as expected or debugging some problem.
>> Commonality is certainly desired in names when relevant to be able to
>> build tooling around the stats.
> 
> There are stats which are clearly detailed and device specific,
> but what ends up happening is that people expose very much not
> implementation specific stats through the free form interfaces,
> because it's the easiest.
> 
> And users are left picking up the pieces, having to ask vendors what
> each stat means, and trying to create abstractions in their user space
> glue.

Should we require vendors to either provide a Documentation/ entry for 
each statistic they have (and be guaranteed that it will be outdated 
unless someone notices), or would you rather have the statistics 
description be part of the devlink interface itself? Should we define 
namespaces such that standard metrics should be under the standard 
namespace and the vendor standard is the wild west?

> 
>> As an example, per-queue stats have been
>> essential to me for recent investigations. ethq has been really helpful
>> in crossing NIC vendors and viewing those stats as it handles the
>> per-vendor naming differences, but it requires changes to show anything
>> else - errors per queue, xdp stats, drops, etc. This part could be simpler.
> 
> Sounds like you're agreeing with me?
> 
>> As for this set, I believe the metrics exposed here are more unique to
>> switch ASICs.
> 
> This is the list from patch 6:
> 
>     * - ``nve_vxlan_encap``
>     * - ``nve_vxlan_decap``
>     * - ``nve_vxlan_decap_errors``
>     * - ``nve_vxlan_decap_discards``
> 
> What's so unique?
> 
>> At least one company I know of has built a business model
>> around exposing detailed telemetry of switch ASICs, so clearly some find
>> them quite valuable.
> 
> It's a question of interface, not the value of exposed data.
> 
> If I have to download vendor documentation and tooling, or adapt my own
> scripts for every new vendor, I could have as well downloaded an SDK.

Are not you being a bit over dramatic here with your example? At least 
you can run the same command to obtain the stats regardless of the 
driver and vendor, so from that perspective Linux continues to be the 
abstraction and that is not broken.
-- 
Florian


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-19  4:30       ` Florian Fainelli
@ 2020-08-19 16:18         ` Jakub Kicinski
  2020-08-19 17:20           ` Florian Fainelli
  0 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2020-08-19 16:18 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: David Ahern, Ido Schimmel, netdev, davem, jiri, amcohen,
	danieller, mlxsw, roopa, andrew, vivien.didelot, tariqt, ayal,
	mkubecek, Ido Schimmel

On Tue, 18 Aug 2020 21:30:16 -0700 Florian Fainelli wrote:
> >>> I spend way too much time patrolling ethtool -S outputs already.  
> >>
> >> But that's the nature of detailed stats which are often essential to
> >> ensuring the system is operating as expected or debugging some problem.
> >> Commonality is certainly desired in names when relevant to be able to
> >> build tooling around the stats.  
> > 
> > There are stats which are clearly detailed and device specific,
> > but what ends up happening is that people expose very much not
> > implementation specific stats through the free form interfaces,
> > because it's the easiest.
> > 
> > And users are left picking up the pieces, having to ask vendors what
> > each stat means, and trying to create abstractions in their user space
> > glue.  
> 
> Should we require vendors to either provide a Documentation/ entry for 
> each statistic they have (and be guaranteed that it will be outdated 
> unless someone notices), or would you rather have the statistics 
> description be part of the devlink interface itself? Should we define 
> namespaces such that standard metrics should be under the standard 
> namespace and the vendor standard is the wild west?

I'm trying to find a solution which will not require a policeman to
constantly monitor the compliance. Please see my effort to ensure
drivers document and use the same ethtool -S stats in the TLS offload
implementations. I've been trying to improve this situation for a long
time, and it's getting old.

Please focus on the stats this set adds, instead of fantasizing of what
could be. These are absolutely not implementation specific!

> > If I have to download vendor documentation and tooling, or adapt my own
> > scripts for every new vendor, I could have as well downloaded an SDK.  
> 
> Are not you being a bit over dramatic here with your example? 

I hope not. It's very hard/impossible today to run a fleet of Linux
machines without resorting to vendor tooling.

> At least  you can run the same command to obtain the stats regardless
> of the driver and vendor, so from that perspective Linux continues to
> be the abstraction and that is not broken.

Format of the data is no abstraction.


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-19 16:18         ` Jakub Kicinski
@ 2020-08-19 17:20           ` Florian Fainelli
  2020-08-19 18:07             ` Jakub Kicinski
  0 siblings, 1 reply; 29+ messages in thread
From: Florian Fainelli @ 2020-08-19 17:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Ahern, Ido Schimmel, netdev, davem, jiri, amcohen,
	danieller, mlxsw, roopa, andrew, vivien.didelot, tariqt, ayal,
	mkubecek, Ido Schimmel

On 8/19/20 9:18 AM, Jakub Kicinski wrote:
> On Tue, 18 Aug 2020 21:30:16 -0700 Florian Fainelli wrote:
>>>>> I spend way too much time patrolling ethtool -S outputs already.  
>>>>
>>>> But that's the nature of detailed stats which are often essential to
>>>> ensuring the system is operating as expected or debugging some problem.
>>>> Commonality is certainly desired in names when relevant to be able to
>>>> build tooling around the stats.  
>>>
>>> There are stats which are clearly detailed and device specific,
>>> but what ends up happening is that people expose very much not
>>> implementation specific stats through the free form interfaces,
>>> because it's the easiest.
>>>
>>> And users are left picking up the pieces, having to ask vendors what
>>> each stat means, and trying to create abstractions in their user space
>>> glue.  
>>
>> Should we require vendors to either provide a Documentation/ entry for 
>> each statistic they have (and be guaranteed that it will be outdated 
>> unless someone notices), or would you rather have the statistics 
>> description be part of the devlink interface itself? Should we define 
>> namespaces such that standard metrics should be under the standard 
>> namespace and the vendor standard is the wild west?
> 
> I'm trying to find a solution which will not require a policeman to
> constantly monitor the compliance. Please see my effort to ensure
> drivers document and use the same ethtool -S stats in the TLS offload
> implementations. I've been trying to improve this situation for a long
> time, and it's getting old.

Which is why I am asking genuinely what do you think should be done
besides doing more code reviews? It does not seem to me that there is an
easy way to catch new stats being added with tools/scripts/whatever and
then determine what they are about, right?

> 
> Please focus on the stats this set adds, instead of fantasizing of what
> could be. These are absolutely not implementation specific!

Not sure if fantasizing is quite what I would use. I am just pointing
out that given the inability to standardize on statistics maybe we
should have namespaces and try our best to have everything fit into the
standard namespace along with a standard set of names, and push back
whenever we see vendor stats being added (or more pragmatically, ask
what they are). But maybe this very idea is moot.

> 
>>> If I have to download vendor documentation and tooling, or adapt my own
>>> scripts for every new vendor, I could have as well downloaded an SDK.  
>>
>> Are not you being a bit over dramatic here with your example? 
> 
> I hope not. It's very hard/impossible today to run a fleet of Linux
> machines without resorting to vendor tooling.

Your argument was putting on the same level resorting to vendor tooling
to extract meaningful statistics/counters versus using a SDK to operate
the hardware (this is how I understood it), and I do not believe this is
fair.

> 
>> At least  you can run the same command to obtain the stats regardless
>> of the driver and vendor, so from that perspective Linux continues to
>> be the abstraction and that is not broken.
> 
> Format of the data is no abstraction.
> 
-- 
Florian


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-19 17:20           ` Florian Fainelli
@ 2020-08-19 18:07             ` Jakub Kicinski
  2020-08-20 14:35               ` David Ahern
  0 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2020-08-19 18:07 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: David Ahern, Ido Schimmel, netdev, davem, jiri, amcohen,
	danieller, mlxsw, roopa, andrew, vivien.didelot, tariqt, ayal,
	mkubecek, Ido Schimmel

On Wed, 19 Aug 2020 10:20:08 -0700 Florian Fainelli wrote:
> > I'm trying to find a solution which will not require a policeman to
> > constantly monitor the compliance. Please see my effort to ensure
> > drivers document and use the same ethtool -S stats in the TLS offload
> > implementations. I've been trying to improve this situation for a long
> > time, and it's getting old.  
> 
> Which is why I am asking genuinely what do you think should be done
> besides doing more code reviews? It does not seem to me that there is an
> easy way to catch new stats being added with tools/scripts/whatever and
> then determine what they are about, right?

I don't have a great way forward in mind, sadly. All I can think of is
that we should try to create more well defined interfaces and steer
away from free-form ones.

Example, here if the stats are vxlan decap/encap/error - we should
expose that from the vxlan module. That way vxlan module defines one
set of stats for everyone.

In general unless we attach stats to the object they relate to, we will
end up building parallel structures for exposing statistics from the
drivers. I posted a set once which was implementing hierarchical stats,
but I've abandoned it for this reason.

> > Please focus on the stats this set adds, instead of fantasizing of what
> > could be. These are absolutely not implementation specific!  
> 
> Not sure if fantasizing is quite what I would use. I am just pointing
> out that given the inability to standardize on statistics maybe we
> should have namespaces and try our best to have everything fit into the
> standard namespace along with a standard set of names, and push back
> whenever we see vendor stats being added (or more pragmatically, ask
> what they are). But maybe this very idea is moot.

IDK. I just don't feel like this is going to fly, see how many names
people invented for the CRC error statistic in ethtool -S, even tho
there is a standard stat for that! And users are actually parsing the
output of ethtool -S to get CRC stats because (a) it became the go-to
place for NIC stats and (b) some drivers forget to report in the
standard place.

The cover letter says this set replaces the bad debugfs with a good,
standard API. It may look good and standard for _vendors_ because they
will know where to dump their counters, but it makes very little
difference for _users_. If I have to parse names for every vendor I use,
I can as well add a per-vendor debugfs path to my script.

The bar for implementation-specific driver stats has to be high.

> >>> If I have to download vendor documentation and tooling, or adapt my own
> >>> scripts for every new vendor, I could have as well downloaded an SDK.    
> >>
> >> Are not you being a bit over dramatic here with your example?   
> > 
> > I hope not. It's very hard/impossible today to run a fleet of Linux
> > machines without resorting to vendor tooling.  
> 
> Your argument was putting on the same level resorting to vendor tooling
> to extract meaningful statistics/counters versus using a SDK to operate
> the hardware (this is how I understood it), and I do not believe this is
> fair.

Okay, fair. I just think that in datacenter deployments we are way
closer to the SDK model than people may want to admit.


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-19 18:07             ` Jakub Kicinski
@ 2020-08-20 14:35               ` David Ahern
  2020-08-20 16:09                 ` Jakub Kicinski
  0 siblings, 1 reply; 29+ messages in thread
From: David Ahern @ 2020-08-20 14:35 UTC (permalink / raw)
  To: Jakub Kicinski, Florian Fainelli
  Cc: Ido Schimmel, netdev, davem, jiri, amcohen, danieller, mlxsw,
	roopa, andrew, vivien.didelot, tariqt, ayal, mkubecek,
	Ido Schimmel

On 8/19/20 12:07 PM, Jakub Kicinski wrote:
> On Wed, 19 Aug 2020 10:20:08 -0700 Florian Fainelli wrote:
>>> I'm trying to find a solution which will not require a policeman to
>>> constantly monitor the compliance. Please see my effort to ensure
>>> drivers document and use the same ethtool -S stats in the TLS offload
>>> implementations. I've been trying to improve this situation for a long
>>> time, and it's getting old.  
>>
>> Which is why I am asking genuinely what do you think should be done
>> besides doing more code reviews? It does not seem to me that there is an
>> easy way to catch new stats being added with tools/scripts/whatever and
>> then determine what they are about, right?
> 
> I don't have a great way forward in mind, sadly. All I can think of is
> that we should try to create more well defined interfaces and steer
> away from free-form ones.

There is a lot of value in free-form too.

> 
> Example, here if the stats are vxlan decap/encap/error - we should
> expose that from the vxlan module. That way vxlan module defines one
> set of stats for everyone.
> 
> In general unless we attach stats to the object they relate to, we will
> end up building parallel structures for exposing statistics from the
> drivers. I posted a set once which was implementing hierarchical stats,
> but I've abandoned it for this reason.
> 
>>> Please focus on the stats this set adds, instead of fantasizing of what
>>> could be. These are absolutely not implementation specific!  
>>
>> Not sure if fantasizing is quite what I would use. I am just pointing
>> out that given the inability to standardize on statistics maybe we
>> should have namespaces and try our best to have everything fit into the
>> standard namespace along with a standard set of names, and push back
>> whenever we see vendor stats being added (or more pragmatically, ask
>> what they are). But maybe this very idea is moot.
> 
> IDK. I just don't feel like this is going to fly, see how many names
> people invented for the CRC error statistic in ethtool -S, even tho
> there is a standard stat for that! And users are actually parsing the
> output of ethtool -S to get CRC stats because (a) it became the go-to
> place for NIC stats and (b) some drivers forget to report in the
> standard place.
> 
> The cover letter says this set replaces the bad debugfs with a good,
> standard API. It may look good and standard for _vendors_ because they
> will know where to dump their counters, but it makes very little
> difference for _users_. If I have to parse names for every vendor I use,
> I can as well add a per-vendor debugfs path to my script.
> 
> The bar for implementation-specific driver stats has to be high.

My take away from this is you do not like the names - the strings side
of it.

Do you object to the netlink API? The netlink API via devlink?

'perf' has json files to describe and document counters
(tools/perf/pmu-events). Would something like that be acceptable as a
form of in-tree documentation of counters? (vs Documentation/networking
or URLs like
https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)

> 
>>>>> If I have to download vendor documentation and tooling, or adapt my own
>>>>> scripts for every new vendor, I could have as well downloaded an SDK.    
>>>>
>>>> Are not you being a bit over dramatic here with your example?   
>>>
>>> I hope not. It's very hard/impossible today to run a fleet of Linux
>>> machines without resorting to vendor tooling.  
>>
>> Your argument was putting on the same level resorting to vendor tooling
>> to extract meaningful statistics/counters versus using a SDK to operate
>> the hardware (this is how I understood it), and I do not believe this is
>> fair.
> 
> Okay, fair. I just think that in datacenter deployments we are way
> closer to the SDK model than people may want to admit.
> 

I do not agree with that; the SDK model means you *must* use vendor code
to make something work. Your argument here is about labels for stats and
an understanding of their meaning.


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-20 14:35               ` David Ahern
@ 2020-08-20 16:09                 ` Jakub Kicinski
  2020-08-21 10:30                   ` Ido Schimmel
  0 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2020-08-20 16:09 UTC (permalink / raw)
  To: David Ahern
  Cc: Florian Fainelli, Ido Schimmel, netdev, davem, jiri, amcohen,
	danieller, mlxsw, roopa, andrew, vivien.didelot, tariqt, ayal,
	mkubecek, Ido Schimmel

On Thu, 20 Aug 2020 08:35:25 -0600 David Ahern wrote:
> On 8/19/20 12:07 PM, Jakub Kicinski wrote:
> > I don't have a great way forward in mind, sadly. All I can think of is
> > that we should try to create more well defined interfaces and steer
> > away from free-form ones.  
> 
> There is a lot of value in free-form too.

On Tue, 18 Aug 2020 20:35:01 -0700 Jakub Kicinski wrote:
> It's a question of interface, not the value of exposed data.

> > Example, here if the stats are vxlan decap/encap/error - we should
> > expose that from the vxlan module. That way vxlan module defines one
> > set of stats for everyone.
> > 
> > In general unless we attach stats to the object they relate to, we will
> > end up building parallel structures for exposing statistics from the
> > drivers. I posted a set once which was implementing hierarchical stats,
> > but I've abandoned it for this reason.
> > > [...]
> > 
> > IDK. I just don't feel like this is going to fly, see how many names
> > people invented for the CRC error statistic in ethtool -S, even tho
> > there is a standard stat for that! And users are actually parsing the
> > output of ethtool -S to get CRC stats because (a) it became the go-to
> > place for NIC stats and (b) some drivers forget to report in the
> > standard place.
> > 
> > The cover letter says this set replaces the bad debugfs with a good,
> > standard API. It may look good and standard for _vendors_ because they
> > will know where to dump their counters, but it makes very little
> > difference for _users_. If I have to parse names for every vendor I use,
> > I can as well add a per-vendor debugfs path to my script.
> > 
> > The bar for implementation-specific driver stats has to be high.  
> 
> My take away from this is you do not like the names - the strings side
> of it.
> 
> Do you object to the netlink API? The netlink API via devlink?
> 
> 'perf' has json files to describe and document counters
> (tools/perf/pmu-events). Would something like that be acceptable as a
> form of in-tree documentation of counters? (vs Documentation/networking
> or URLs like
> https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)

Please refer to what I said twice now about the definition of the stats
exposed here belonging with the VxLAN code, not the driver.

> > Okay, fair. I just think that in datacenter deployments we are way
> > closer to the SDK model than people may want to admit.
> 
> I do not agree with that; the SDK model means you *must* use vendor code
> to make something work. Your argument here is about labels for stats and
> an understanding of their meaning.

Sure, no "must" for passing packets, but you "must" use vendor tooling
to operate a fleet.

Since everybody already has vendor tools what value does this API add?
I still need per vendor logic. Let's try to build APIs which will
actually make user's life easier, which users will want to switch to.


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-20 16:09                 ` Jakub Kicinski
@ 2020-08-21 10:30                   ` Ido Schimmel
  2020-08-21 16:53                     ` Jakub Kicinski
  0 siblings, 1 reply; 29+ messages in thread
From: Ido Schimmel @ 2020-08-21 10:30 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Ahern, Florian Fainelli, netdev, davem, jiri, amcohen,
	danieller, mlxsw, roopa, andrew, vivien.didelot, tariqt, ayal,
	mkubecek, Ido Schimmel

On Thu, Aug 20, 2020 at 09:09:42AM -0700, Jakub Kicinski wrote:
> On Thu, 20 Aug 2020 08:35:25 -0600 David Ahern wrote:
> > On 8/19/20 12:07 PM, Jakub Kicinski wrote:
> > > I don't have a great way forward in mind, sadly. All I can think of is
> > > that we should try to create more well defined interfaces and steer
> > > away from free-form ones.  
> > 
> > There is a lot of value in free-form too.
> 
> On Tue, 18 Aug 2020 20:35:01 -0700 Jakub Kicinski wrote:
> > It's a question of interface, not the value of exposed data.
> 
> > > Example, here if the stats are vxlan decap/encap/error - we should
> > > expose that from the vxlan module. That way vxlan module defines one
> > > set of stats for everyone.
> > > 
> > > In general unless we attach stats to the object they relate to, we will
> > > end up building parallel structures for exposing statistics from the
> > > drivers. I posted a set once which was implementing hierarchical stats,
> > > but I've abandoned it for this reason.
> > > > [...]
> > > 
> > > IDK. I just don't feel like this is going to fly, see how many names
> > > people invented for the CRC error statistic in ethtool -S, even tho
> > > there is a standard stat for that! And users are actually parsing the
> > > output of ethtool -S to get CRC stats because (a) it became the go-to
> > > place for NIC stats and (b) some drivers forget to report in the
> > > standard place.
> > > 
> > > The cover letter says this set replaces the bad debugfs with a good,
> > > standard API. It may look good and standard for _vendors_ because they
> > > will know where to dump their counters, but it makes very little
> > > difference for _users_. If I have to parse names for every vendor I use,
> > > I can as well add a per-vendor debugfs path to my script.
> > > 
> > > The bar for implementation-specific driver stats has to be high.  
> > 
> > My take away from this is you do not like the names - the strings side
> > of it.
> > 
> > Do you object to the netlink API? The netlink API via devlink?
> > 
> > 'perf' has json files to describe and document counters
> > (tools/perf/pmu-events). Would something like that be acceptable as a
> > form of in-tree documentation of counters? (vs Documentation/networking
> > or URLs like
> > https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)
> 
> Please refer to what I said twice now about the definition of the stats
> exposed here belonging with the VxLAN code, not the driver.

Please refer to the changelog:

"The Spectrum ASICs have a single hardware VTEP that is able to perform
VXLAN encapsulation and decapsulation. The VTEP is logically mapped by
mlxsw to the multiple VXLAN netdevs that are using it. Exposing the
counters of this VTEP via the multiple VXLAN netdevs that are using it
would be both inaccurate and confusing for users.
    
Instead, expose the counters of the VTEP via devlink-metric. Note that
Spectrum-1 supports a different set of counters compared to newer ASICs
in the Spectrum family."

Hardware implementations will rarely fit 1:1 to the nice and discrete
software implementations that they try to accelerate. The purpose of
this API is exposing metrics specific to these hardware implementations.
This results in better visibility which can be leveraged for faster
debugging and more thorough testing.

The reason I came up with this interface is not the specific VXLAN
metrics that bother you, but a new platform we are working on. It uses
the ASIC as a cache that refers lookups to an external device in case of
cache misses. It is completely transparent to user space (you get better
scale), but the driver is very much aware of this stuff as it needs to
insert objects (e.g., routes) in a way that will minimize cache misses.
Just checking that ping works is hardly enough. We must be able to read
the cache counters to ensure we do not see cache misses when we do not
expect them.
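
With the interface proposed here that is just one more counter
registration in the driver, e.g. (sketch only, the metric name and the
ops name are made up for illustration):

	metrics->counter_cache_misses =
		devlink_metric_counter_create(devlink, "route_cache_misses",
					      &mlxsw_sp_route_cache_misses_ops,
					      mlxsw_sp);
	if (IS_ERR(metrics->counter_cache_misses))
		return PTR_ERR(metrics->counter_cache_misses);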

As another example, consider the algorithmic TCAM implementation we have
in Spectrum-2 for ACLs [1]. While a user simply adds / deletes filters,
the driver needs to jump through multiple hoops in order to program them
in a way that will result in a better scale and reduced latency. We
currently do not have an interface through which we can expose metrics
related to this specific implementation.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=756cd36626f773e9a72a39c1dd12da4deacfacdf

> 
> > > Okay, fair. I just think that in datacenter deployments we are way
> > > closer to the SDK model than people may want to admit.
> > 
> > I do not agree with that; the SDK model means you *must* use vendor code
> > to make something work. Your argument here is about labels for stats and
> > an understanding of their meaning.
> 
> Sure, no "must" for passing packets, but you "must" use vendor tooling
> to operate a fleet.
> 
> Since everybody already has vendor tools what value does this API add?

We don't have any "vendor tools" to get this information. Our team is
doing everything it possibly can in order to move away from such an
approach.

> I still need per vendor logic. Let's try to build APIs which will
> actually make user's life easier, which users will want to switch to.

Developers are also users and they should be able to read whatever
information they need from the device in order to help them do their
work. You have a multitude of tools (e.g., kprobes, tracepoints) to get
better visibility into the software data path. Commonality is not a
reason to be blind as a bat when looking into the hardware data path.


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-21 10:30                   ` Ido Schimmel
@ 2020-08-21 16:53                     ` Jakub Kicinski
  2020-08-21 19:12                       ` David Ahern
  0 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2020-08-21 16:53 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: David Ahern, Florian Fainelli, netdev, davem, jiri, amcohen,
	danieller, mlxsw, roopa, andrew, vivien.didelot, tariqt, ayal,
	mkubecek, Ido Schimmel

On Fri, 21 Aug 2020 13:30:21 +0300 Ido Schimmel wrote:
> On Thu, Aug 20, 2020 at 09:09:42AM -0700, Jakub Kicinski wrote:
> > On Thu, 20 Aug 2020 08:35:25 -0600 David Ahern wrote:  
> > > On 8/19/20 12:07 PM, Jakub Kicinski wrote:  
> > > > I don't have a great way forward in mind, sadly. All I can think of is
> > > > that we should try to create more well defined interfaces and steer
> > > > away from free-form ones.    
> > > 
> > > There is a lot of value in free-form too.  
> > 
> > On Tue, 18 Aug 2020 20:35:01 -0700 Jakub Kicinski wrote:  
> > > It's a question of interface, not the value of exposed data.  
> >   
> > > > Example, here if the stats are vxlan decap/encap/error - we should
> > > > expose that from the vxlan module. That way vxlan module defines one
> > > > set of stats for everyone.
> > > > 
> > > > In general unless we attach stats to the object they relate to, we will
> > > > end up building parallel structures for exposing statistics from the
> > > > drivers. I posted a set once which was implementing hierarchical stats,
> > > > but I've abandoned it for this reason.  
> > > > > [...]  
> > > > 
> > > > IDK. I just don't feel like this is going to fly, see how many names
> > > > people invented for the CRC error statistic in ethtool -S, even tho
> > > > there is a standard stat for that! And users are actually parsing the
> > > > output of ethtool -S to get CRC stats because (a) it became the go-to
> > > > place for NIC stats and (b) some drivers forget to report in the
> > > > standard place.
> > > > 
> > > > The cover letter says this set replaces the bad debugfs with a good,
> > > > standard API. It may look good and standard for _vendors_ because they
> > > > will know where to dump their counters, but it makes very little
> > > > difference for _users_. If I have to parse names for every vendor I use,
> > > > I can as well add a per-vendor debugfs path to my script.
> > > > 
> > > > The bar for implementation-specific driver stats has to be high.    
> > > 
> > > My take away from this is you do not like the names - the strings side
> > > of it.
> > > 
> > > Do you object to the netlink API? The netlink API via devlink?
> > > 
> > > 'perf' has json files to describe and document counters
> > > (tools/perf/pmu-events). Would something like that be acceptable as a
> > > form of in-tree documentation of counters? (vs Documentation/networking
> > > or URLs like
> > > https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)  
> > 
> > Please refer to what I said twice now about the definition of the stats
> > exposed here belonging with the VxLAN code, not the driver.  
> 
> Please refer to the changelog:
> 
> "The Spectrum ASICs have a single hardware VTEP that is able to perform
> VXLAN encapsulation and decapsulation. The VTEP is logically mapped by
> mlxsw to the multiple VXLAN netdevs that are using it. Exposing the
> counters of this VTEP via the multiple VXLAN netdevs that are using it
> would be both inaccurate and confusing for users.
>     
> Instead, expose the counters of the VTEP via devlink-metric. Note that
> Spectrum-1 supports a different set of counters compared to newer ASICs
> in the Spectrum family."

I read your cover letter; I didn't say the metrics have to be reported
on particular netdevs.

> Hardware implementations will rarely fit 1:1 to the nice and discrete
> software implementations that they try to accelerate. The purpose of
> this API is exposing metrics specific to these hardware implementations.
> This results in better visibility which can be leveraged for faster
> debugging and more thorough testing.
> 
> The reason I came up with this interface is not the specific VXLAN
> metrics that bother you, but a new platform we are working on. It uses
> the ASIC as a cache that refers lookups to an external device in case of
> cache misses. It is completely transparent to user space (you get better
> scale), but the driver is very much aware of this stuff as it needs to
> insert objects (e.g., routes) in a way that will minimize cache misses.
> Just checking that ping works is hardly enough. We must be able to read
> the cache counters to ensure we do not see cache misses when we do not
> expect them.
> 
> As another example, consider the algorithmic TCAM implementation we have
> in Spectrum-2 for ACLs [1]. While a user simply adds / deletes filters,
> the driver needs to jump through multiple hoops in order to program them
> in a way that will result in a better scale and reduced latency. We
> currently do not have an interface through which we can expose metrics
> related to this specific implementation.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=756cd36626f773e9a72a39c1dd12da4deacfacdf
> 
> >   
> > > > Okay, fair. I just think that in datacenter deployments we are way
> > > > closer to the SDK model than people may want to admit.  
> > > 
> > > I do not agree with that; the SDK model means you *must* use vendor code
> > > to make something work. Your argument here is about labels for stats and
> > > an understanding of their meaning.  
> > 
> > Sure, no "must" for passing packets, but you "must" use vendor tooling
> > to operate a fleet.
> > 
> > Since everybody already has vendor tools what value does this API add?  
> 
> We don't have any "vendor tools" to get this information. Our team is
> doing everything it possibly can in order to move away from such an
> approach.

For clarity I wasn't speaking about your switches, sadly I have no
experience with those.

> > I still need per vendor logic. Let's try to build APIs which will
> > actually make user's life easier, which users will want to switch to.  
> 
> Developers are also users and they should be able to read whatever
> information they need from the device in order to help them do their
> work. You have a multitude of tools (e.g., kprobes, tracepoints) to get
> better visibility into the software data path. Commonality is not a
> reason to be blind as a bat when looking into the hardware data path.

How many times do I have to say that I'm not arguing against the value
of the data? 

If you open up this interface either someone will police it, or it will
become a dumpster.


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-21 16:53                     ` Jakub Kicinski
@ 2020-08-21 19:12                       ` David Ahern
  2020-08-21 23:50                         ` Jakub Kicinski
  0 siblings, 1 reply; 29+ messages in thread
From: David Ahern @ 2020-08-21 19:12 UTC (permalink / raw)
  To: Jakub Kicinski, Ido Schimmel
  Cc: Florian Fainelli, netdev, davem, jiri, amcohen, danieller, mlxsw,
	roopa, andrew, vivien.didelot, tariqt, ayal, mkubecek,
	Ido Schimmel

On 8/21/20 10:53 AM, Jakub Kicinski wrote:
> How many times do I have to say that I'm not arguing against the value
> of the data? 
> 
> If you open up this interface either someone will police it, or it will
> become a dumpster.

I am not following what you are proposing as a solution. You do not like
Ido's idea of stats going through devlink, but you are not being clear
on what you think is a better way.

You say vxlan stats belong in the vxlan driver, but the stats do not
have to be reported on particular netdevs. How then do h/w stats get
exposed via vxlan code?


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-21 19:12                       ` David Ahern
@ 2020-08-21 23:50                         ` Jakub Kicinski
  2020-08-21 23:59                           ` David Ahern
  0 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2020-08-21 23:50 UTC (permalink / raw)
  To: David Ahern
  Cc: Ido Schimmel, Florian Fainelli, netdev, davem, jiri, amcohen,
	danieller, mlxsw, roopa, andrew, vivien.didelot, tariqt, ayal,
	mkubecek, Ido Schimmel

On Fri, 21 Aug 2020 13:12:59 -0600 David Ahern wrote:
> On 8/21/20 10:53 AM, Jakub Kicinski wrote:
> > How many times do I have to say that I'm not arguing against the value
> > of the data? 
> > 
> > If you open up this interface either someone will police it, or it will
> > become a dumpster.  
> 
> I am not following what you are proposing as a solution. You do not like
> Ido's idea of stats going through devlink, but you are not being clear
> on what you think is a better way.
> 
> You say vxlan stats belong in the vxlan driver, but the stats do not
> have to be reported on particular netdevs. How then do h/w stats get
> exposed via vxlan code?

No strong preference, for TLS I've done:

# cat /proc/net/tls_stat 
TlsCurrTxSw                     	0
TlsCurrRxSw                     	0
TlsCurrTxDevice                 	0
TlsCurrRxDevice                 	0
TlsTxSw                         	0
TlsRxSw                         	0
TlsTxDevice                     	0
TlsRxDevice                     	0
TlsDecryptError                 	0
TlsRxDeviceResync               	0

We can add something over netlink, I opted for simplicity since global
stats don't have to scale with number of interfaces. 


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-21 23:50                         ` Jakub Kicinski
@ 2020-08-21 23:59                           ` David Ahern
  2020-08-22  0:37                             ` Jakub Kicinski
  0 siblings, 1 reply; 29+ messages in thread
From: David Ahern @ 2020-08-21 23:59 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Ido Schimmel, Florian Fainelli, netdev, davem, jiri, amcohen,
	danieller, mlxsw, roopa, andrew, vivien.didelot, tariqt, ayal,
	mkubecek, Ido Schimmel

On 8/21/20 5:50 PM, Jakub Kicinski wrote:
> On Fri, 21 Aug 2020 13:12:59 -0600 David Ahern wrote:
>> On 8/21/20 10:53 AM, Jakub Kicinski wrote:
>>> How many times do I have to say that I'm not arguing against the value
>>> of the data? 
>>>
>>> If you open up this interface either someone will police it, or it will
>>> become a dumpster.  
>>
>> I am not following what you are proposing as a solution. You do not like
>> Ido's idea of stats going through devlink, but you are not being clear
>> on what you think is a better way.
>>
>> You say vxlan stats belong in the vxlan driver, but the stats do not
>> have to be reported on particular netdevs. How then do h/w stats get
>> exposed via vxlan code?
> 
> No strong preference, for TLS I've done:

But you clearly *do* have a strong preference.

> 
> # cat /proc/net/tls_stat 

I do not agree with adding files under /proc/net for this.

> TlsCurrTxSw                     	0
> TlsCurrRxSw                     	0
> TlsCurrTxDevice                 	0
> TlsCurrRxDevice                 	0
> TlsTxSw                         	0
> TlsRxSw                         	0
> TlsTxDevice                     	0
> TlsRxDevice                     	0
> TlsDecryptError                 	0
> TlsRxDeviceResync               	0
> 
> We can add something over netlink, I opted for simplicity since global
> stats don't have to scale with number of interfaces. 
> 

IMHO, netlink is the right "channel" to move data from kernel to
userspace, and opting in to *specific* stats is a must have feature.

I think devlink is the right framework given that the stats are device
based but not specific to any particular netdev instance. Further, this
allows hardware stats to be distinguished from software stats; if they
were tied to vxlan as a protocol and somehow pulled from the vxlan driver,
those would be combined into one (at least that is how my mind is thinking of this).

####

Let's say the direction is for these specific stats (as opposed to the
general problem that Ido and others are considering) to be pulled from
the vxlan driver. How does that driver get access to hardware stats?
vxlan is a protocol and not tied to devices. How should the connection
be made?


* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-21 23:59                           ` David Ahern
@ 2020-08-22  0:37                             ` Jakub Kicinski
  2020-08-22  1:18                               ` David Ahern
  0 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2020-08-22  0:37 UTC (permalink / raw)
  To: David Ahern
  Cc: Ido Schimmel, Florian Fainelli, netdev, davem, jiri, amcohen,
	danieller, mlxsw, roopa, andrew, vivien.didelot, tariqt, ayal,
	mkubecek, Ido Schimmel

On Fri, 21 Aug 2020 17:59:57 -0600 David Ahern wrote:
> On 8/21/20 5:50 PM, Jakub Kicinski wrote:
> > On Fri, 21 Aug 2020 13:12:59 -0600 David Ahern wrote:  
> >> I am not following what you are proposing as a solution. You do not like
> >> Ido's idea of stats going through devlink, but you are not being clear
> >> on what you think is a better way.
> >>
> >> You say vxlan stats belong in the vxlan driver, but the stats do not
> >> have to be reported on particular netdevs. How then do h/w stats get
> >> exposed via vxlan code?  
> > 
> > No strong preference, for TLS I've done:  
> 
> But you clearly *do* have a strong preference.

I'm answering your question.

The question is "How then do h/w stats get exposed via vxlan code?"

Please note that the question includes "via vxlan code".

So no, I have no preference as long as it's "via vxlan code", and not
directly from the driver with a vendor-invented name.

> > # cat /proc/net/tls_stat   
> 
> I do not agree with adding files under /proc/net for this.

Yeah it's not the best, with higher LoC a better solution should be
within reach.

> > TlsCurrTxSw                     	0
> > TlsCurrRxSw                     	0
> > TlsCurrTxDevice                 	0
> > TlsCurrRxDevice                 	0
> > TlsTxSw                         	0
> > TlsRxSw                         	0
> > TlsTxDevice                     	0
> > TlsRxDevice                     	0
> > TlsDecryptError                 	0
> > TlsRxDeviceResync               	0
> > 
> > We can add something over netlink, I opted for simplicity since global
> > stats don't have to scale with number of interfaces. 
> 
> IMHO, netlink is the right "channel" to move data from kernel to
> userspace, and opting in to *specific* stats is a must-have feature.
> 
> I think devlink is the right framework given that the stats are device
> based but not specific to any particular netdev instance. 

I'd be careful with the "not specific to any particular netdev
instance". A perfect API would be flexible when it comes to scoping :)

> Further, this
> allows hardware stats to be kept distinct from software stats; if they
> were tied to vxlan as a protocol and somehow pulled from the vxlan
> driver, the two would be combined into one (at least that is how my
> mind is thinking of this).

Right, for tls the stats which have "Device" in the name are hardware.
But netlink will have better ways of separating the two.
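For instance (attribute names invented here just to show the split;
this is not an existing uAPI), a dump could carry the software and
device counters in separate nests:

/* Hypothetical netlink attribute layout; not an existing uAPI. */
enum {
        TLS_STATS_ATTR_UNSPEC,
        TLS_STATS_ATTR_SW,      /* nest: counters maintained by the stack */
        TLS_STATS_ATTR_DEVICE,  /* nest: counters read back from the device */
        __TLS_STATS_ATTR_MAX,
};
#define TLS_STATS_ATTR_MAX (__TLS_STATS_ATTR_MAX - 1)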

> ####
> 
> Let's say the direction is for these specific stats (as opposed to the
> general problem that Ido and others are considering) to be pulled from
> the vxlan driver. How does that driver get access to hardware stats?
> vxlan is a protocol and not tied to devices. How should the connection
> be made?

Drivers which offload VxLAN already have a dependency on it, right?
They can just register with it and get queried on dump. Or if we want
scoping we can piggyback on whatever object the stats are scoped to.
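Something roughly like the below is what I mean (completely made up,
the vxlan module has no such hook today):

#include <linux/list.h>
#include <linux/types.h>

/* Completely made up: a registration hook in the vxlan module through
 * which an offloading driver reports its hardware counters, so that
 * the vxlan code itself can fold them into whatever stats it exposes. */
struct vxlan_hw_stats {
        u64 rx_decap_packets;
        u64 rx_decap_errors;
        u64 tx_encap_packets;
};

struct vxlan_hw_stats_cb {
        struct list_head list;  /* linked into a list owned by vxlan */
        void (*get_stats)(void *priv, struct vxlan_hw_stats *stats);
        void *priv;
};

int vxlan_register_hw_stats_cb(struct vxlan_hw_stats_cb *cb);
void vxlan_unregister_hw_stats_cb(struct vxlan_hw_stats_cb *cb);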

*If* we scope on HW objects do we need to worry about some user some
day wanting to have stats per vxlan netdev and per HW instance?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-22  0:37                             ` Jakub Kicinski
@ 2020-08-22  1:18                               ` David Ahern
  2020-08-22 16:27                                 ` Jakub Kicinski
  0 siblings, 1 reply; 29+ messages in thread
From: David Ahern @ 2020-08-22  1:18 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Ido Schimmel, Florian Fainelli, netdev, davem, jiri, amcohen,
	danieller, mlxsw, roopa, andrew, vivien.didelot, tariqt, ayal,
	mkubecek, Ido Schimmel

On 8/21/20 6:37 PM, Jakub Kicinski wrote:
>>> # cat /proc/net/tls_stat   
>>
>> I do not agree with adding files under /proc/net for this.
> 
> Yeah it's not the best, with higher LoC a better solution should be
> within reach.

The duplicity here is mind-boggling. TLS stats from hardware are on par
with Ido's *example* of vxlan stats from an ASIC. You agree that
/proc/net files are wrong, but you did it anyway and now you want the
next person to solve the problem you did not want to tackle but have
strong opinions on.

Ido has a history of thinking through problems and solutions in a proper
Linux Way. netlink is the right API, and devlink was created for
'device' stuff versus 'netdev' stuff. Hence, I agree with this
*framework* for extracting asic stats.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-22  1:18                               ` David Ahern
@ 2020-08-22 16:27                                 ` Jakub Kicinski
  2020-08-23  7:04                                   ` Ido Schimmel
  0 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2020-08-22 16:27 UTC (permalink / raw)
  To: David Ahern
  Cc: Ido Schimmel, Florian Fainelli, netdev, davem, jiri, amcohen,
	danieller, mlxsw, roopa, andrew, vivien.didelot, tariqt, ayal,
	mkubecek, Ido Schimmel

On Fri, 21 Aug 2020 19:18:37 -0600 David Ahern wrote:
> On 8/21/20 6:37 PM, Jakub Kicinski wrote:
> >>> # cat /proc/net/tls_stat     
> >>
> >> I do not agree with adding files under /proc/net for this.  
> > 
> > Yeah it's not the best, with higher LoC a better solution should be
> > within reach.  
> 
> The duplicity here is mind-boggling. TLS stats from hardware are on par
> with Ido's *example* of vxlan stats from an ASIC. You agree that
> /proc/net files are wrong,

I didn't say /proc/net was wrong, I'm just trying to be agreeable.
Maybe I need to improve my command of the English language.

AFAIK /proc/net is where protocol stats are.

> but you did it anyway and now you want the
> next person to solve the problem you did not want to tackle but have
> strong opinions on.

I have no need or interest in vxlan stats.

> Ido has a history of thinking through problems and solutions in a proper
> Linux Way. netlink is the right API, and devlink was created for
> 'device' stuff versus 'netdev' stuff. Hence, I agree with this
> *framework* for extracting asic stats.

You seem to focus on less relevant points. I primarily care about the
statistics being defined and identified by Linux, not every vendor for
themselves.

No question about Ido's ability and contributions, but then again 
(from the cover letter):

> This a joint work [...] during a two-day company hackathon.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-22 16:27                                 ` Jakub Kicinski
@ 2020-08-23  7:04                                   ` Ido Schimmel
  2020-08-24 19:11                                     ` Jakub Kicinski
  0 siblings, 1 reply; 29+ messages in thread
From: Ido Schimmel @ 2020-08-23  7:04 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Ahern, Florian Fainelli, netdev, davem, jiri, amcohen,
	danieller, mlxsw, roopa, andrew, vivien.didelot, tariqt, ayal,
	mkubecek, Ido Schimmel

On Sat, Aug 22, 2020 at 09:27:39AM -0700, Jakub Kicinski wrote:
> On Fri, 21 Aug 2020 19:18:37 -0600 David Ahern wrote:
> > On 8/21/20 6:37 PM, Jakub Kicinski wrote:
> > >>> # cat /proc/net/tls_stat     
> > >>
> > >> I do not agree with adding files under /proc/net for this.  
> > > 
> > > Yeah it's not the best, with higher LoC a better solution should be
> > > within reach.  
> > 
> > The duplicity here is mind-boggling. TLS stats from hardware are on par
> > with Ido's *example* of vxlan stats from an ASIC. You agree that
> > /proc/net files are wrong,
> 
> I didn't say /proc/net was wrong, I'm just trying to be agreeable.
> Maybe I need to improve my command of the English language.
> 
> AFAIK /proc/net is where protocol stats are.
> 
> > but you did it anyway and now you want the
> > next person to solve the problem you did not want to tackle but have
> > strong opinions on.
> 
> I have no need or interest in vxlan stats.
> 
> > Ido has a history of thinking through problems and solutions in a proper
> > Linux Way. netlink is the right API, and devlink was created for
> > 'device' stuff versus 'netdev' stuff. Hence, I agree with this
> > *framework* for extracting asic stats.
> 
> You seem to focus on less relevant points. I primarily care about the
> statistics being defined and identified by Linux, not every vendor for
> themselves.

I'm trying to understand how we can move this forward. The issue is with
the specific VXLAN metrics, but do you generally agree with the need for
the framework? See my two other examples: cache counters and algorithmic
TCAM counters.

> No question about Ido's ability and contributions, but then again 
> (from the cover letter):
> 
> > This a joint work [...] during a two-day company hackathon.

The *implementation* was done during the hackathon. The *design* was
done before. I sent the team three design proposals (CLI / netlink
API / device drivers API) before we landed on this. I started thinking
about this idea a few months ago, and the hackathon was simply a good
opportunity to implement and showcase it.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 0/6] devlink: Add device metric support
  2020-08-23  7:04                                   ` Ido Schimmel
@ 2020-08-24 19:11                                     ` Jakub Kicinski
  0 siblings, 0 replies; 29+ messages in thread
From: Jakub Kicinski @ 2020-08-24 19:11 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: David Ahern, Florian Fainelli, netdev, davem, jiri, amcohen,
	danieller, mlxsw, roopa, andrew, vivien.didelot, tariqt, ayal,
	mkubecek, Ido Schimmel

On Sun, 23 Aug 2020 10:04:34 +0300 Ido Schimmel wrote:
> > You seem to focus on less relevant points. I primarily care about the
> > statistics being defined and identified by Linux, not every vendor for
> > themselves.  
> 
> I'm trying to understand how we can move this forward. The issue is with
> the specific VXLAN metrics, but do you generally agree with the need for
> the framework? See my two other examples: cache counters and algorithmic
> TCAM counters.

Yes, we will likely need a way to report design-specific performance
counters no matter what. That said, I would prefer to pave the way for
exposing standardized stats first, so the reviewers (e.g. myself) have
a clear place to point folks to. 

My last attempt was to just try to standardize the strings for the
per-netdev TLS offload stats (those are in addition to the /proc stats),
and document them in Documentation/. It turned out to have quite a
high review overhead, and the convergence is not satisfactory.

The only strong use I have right now is FEC stats, and I'm planning to
add IEEE-based counters to devlink ports. The scoping of MAC/PHY
counters to dl-port is, I hope, reasonable, although it remains to be
seen what phy folks think about it.

As I previously said - I think that protocol stats are best exported
from the protocol driver, otherwise the API may need to grow parallel
hierarchies. E.g. the semantics of per-queue NIC counters get confusing
unless they are reported with the information about the queues - sadly
no API for that exists. In particular, the lifetime of objects is hard
to match with the lifetime of statistics. The same goes for
low-granularity counters related to traffic classification.

Long story short, it's a complicated topic, and IDK how much of it I can
expect you to tackle. At a minimum I'd like it if we had a clear
separation between Linux/standard stats that drivers should share,
and justifiably implementation-specific values.

Neither the DEVLINK_..GENERIC identifiers nor trying to standardize on
strings works for me as a reviewer, or as an infrastructure engineer.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2020-08-24 19:11 UTC | newest]

Thread overview: 29+ messages
2020-08-17 12:50 [RFC PATCH net-next 0/6] devlink: Add device metric support Ido Schimmel
2020-08-17 12:50 ` [RFC PATCH net-next 1/6] devlink: Add device metric infrastructure Ido Schimmel
2020-08-17 14:12   ` Andrew Lunn
2020-08-17 12:50 ` [RFC PATCH net-next 2/6] netdevsim: Add devlink metric support Ido Schimmel
2020-08-17 12:50 ` [RFC PATCH net-next 3/6] selftests: netdevsim: Add devlink metric tests Ido Schimmel
2020-08-17 12:50 ` [RFC PATCH net-next 4/6] mlxsw: reg: Add Tunneling NVE Counters Register Ido Schimmel
2020-08-17 12:50 ` [RFC PATCH net-next 5/6] mlxsw: reg: Add Tunneling NVE Counters Register Version 2 Ido Schimmel
2020-08-17 12:50 ` [RFC PATCH net-next 6/6] mlxsw: spectrum_nve: Expose VXLAN counters via devlink-metric Ido Schimmel
2020-08-17 14:29   ` Andrew Lunn
2020-08-18  6:59     ` Ido Schimmel
2020-08-19  0:24 ` [RFC PATCH net-next 0/6] devlink: Add device metric support Jakub Kicinski
2020-08-19  2:43   ` David Ahern
2020-08-19  3:35     ` Jakub Kicinski
2020-08-19  4:30       ` Florian Fainelli
2020-08-19 16:18         ` Jakub Kicinski
2020-08-19 17:20           ` Florian Fainelli
2020-08-19 18:07             ` Jakub Kicinski
2020-08-20 14:35               ` David Ahern
2020-08-20 16:09                 ` Jakub Kicinski
2020-08-21 10:30                   ` Ido Schimmel
2020-08-21 16:53                     ` Jakub Kicinski
2020-08-21 19:12                       ` David Ahern
2020-08-21 23:50                         ` Jakub Kicinski
2020-08-21 23:59                           ` David Ahern
2020-08-22  0:37                             ` Jakub Kicinski
2020-08-22  1:18                               ` David Ahern
2020-08-22 16:27                                 ` Jakub Kicinski
2020-08-23  7:04                                   ` Ido Schimmel
2020-08-24 19:11                                     ` Jakub Kicinski
