* [PATCH net-next 0/3] netdev: add per-queue statistics
@ 2024-02-26 21:10 Jakub Kicinski
  2024-02-26 21:10 ` [PATCH net-next 1/3] " Jakub Kicinski
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Jakub Kicinski @ 2024-02-26 21:10 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, amritha.nambiar, danielj, mst,
	michael.chan, sdf, vadim.fedorenko, Jakub Kicinski

Hi!

Per-queue stats keep coming up, so it's about time someone laid
the foundation. This series adds the uAPI, a handful of stats
and sample support for bnxt. It's not very comprehensive in
terms of stat types or driver support; the expectation is that
the support will grow organically. Once the basic pieces are in
place it will be easy for reviewers to request new stats, or ask
that the API be used in place of ethtool -S.

See patch 3 for sample output.

v1:
 - rename projection -> scope
 - turn projection/scope into flags
rfc: https://lore.kernel.org/all/20240222223629.158254-1-kuba@kernel.org/

Jakub Kicinski (3):
  netdev: add per-queue statistics
  netdev: add queue stat for alloc failures
  eth: bnxt: support per-queue statistics

 Documentation/netlink/specs/netdev.yaml   |  91 +++++++++
 Documentation/networking/statistics.rst   |  17 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  63 +++++++
 include/linux/netdevice.h                 |   3 +
 include/net/netdev_queues.h               |  56 ++++++
 include/uapi/linux/netdev.h               |  20 ++
 net/core/netdev-genl-gen.c                |  12 ++
 net/core/netdev-genl-gen.h                |   2 +
 net/core/netdev-genl.c                    | 218 ++++++++++++++++++++++
 tools/include/uapi/linux/netdev.h         |  20 ++
 10 files changed, 501 insertions(+), 1 deletion(-)

-- 
2.43.2



* [PATCH net-next 1/3] netdev: add per-queue statistics
  2024-02-26 21:10 [PATCH net-next 0/3] netdev: add per-queue statistics Jakub Kicinski
@ 2024-02-26 21:10 ` Jakub Kicinski
  2024-02-26 21:35   ` Stanislav Fomichev
  2024-02-27 10:29   ` Przemek Kitszel
  2024-02-26 21:10 ` [PATCH net-next 2/3] netdev: add queue stat for alloc failures Jakub Kicinski
  2024-02-26 21:10 ` [PATCH net-next 3/3] eth: bnxt: support per-queue statistics Jakub Kicinski
  2 siblings, 2 replies; 14+ messages in thread
From: Jakub Kicinski @ 2024-02-26 21:10 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, amritha.nambiar, danielj, mst,
	michael.chan, sdf, vadim.fedorenko, Jakub Kicinski

The ethtool-nl family does a good job exposing various
protocol-related and IEEE/IETF statistics which used to get dumped
under ethtool -S, with creative names. Queue stats don't have
a netlink API yet, and remain the lion's share of ethtool -S output
for new drivers. Not only is that bad because the names differ from
driver to driver, it's also bug-prone. Intuitively, drivers try to
report only the stats for active queues, but querying ethtool stats
involves multiple system calls, and the number of stats is read
separately from the stats themselves. Worse still, when user space
asks for the values of the stats, it doesn't inform the kernel how
big its buffer is. If the number of stats increases in the meantime,
the kernel will overflow the user buffer.

Add a netlink API for dumping queue stats. Queue information is
exposed via the netdev-genl family, so add the stats there.
Support per-queue and sum-for-device dumps. The latter will be
useful when subsequent patches add more interesting common stats
than just bytes and packets.

The API does not currently distinguish between HW and SW stats.
The expectation is that the source of the stats will either not
matter much (good packets) or be obvious (skb alloc errors).
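
For illustration, the driver-side hookup could look roughly like the
sketch below (hypothetical driver "foo" and private fields, not part
of this patch; patch 3 adds the real bnxt implementation):

static void foo_get_queue_stats_rx(struct net_device *dev, int idx,
				   struct netdev_queue_stats_rx *stats)
{
	struct foo_priv *fp = netdev_priv(dev);

	/* only assign what the driver really tracks */
	stats->packets = fp->rx_ring[idx].packets;
	stats->bytes = fp->rx_ring[idx].bytes;
}

static void foo_get_queue_stats_tx(struct net_device *dev, int idx,
				   struct netdev_queue_stats_tx *stats)
{
	struct foo_priv *fp = netdev_priv(dev);

	stats->packets = fp->tx_ring[idx].packets;
	stats->bytes = fp->tx_ring[idx].bytes;
}

static void foo_get_base_stats(struct net_device *dev,
			       struct netdev_queue_stats_rx *rx,
			       struct netdev_queue_stats_tx *tx)
{
	/* zero is meaningful here: it marks the counters as valid for
	 * the whole device even though no pre-reconfig history is kept
	 */
	rx->packets = 0;
	rx->bytes = 0;
	tx->packets = 0;
	tx->bytes = 0;
}

static const struct netdev_stat_ops foo_stat_ops = {
	.get_queue_stats_rx	= foo_get_queue_stats_rx,
	.get_queue_stats_tx	= foo_get_queue_stats_tx,
	.get_base_stats		= foo_get_base_stats,
};

/* in probe: dev->stat_ops = &foo_stat_ops; */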

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 Documentation/netlink/specs/netdev.yaml |  84 +++++++++
 Documentation/networking/statistics.rst |  17 +-
 include/linux/netdevice.h               |   3 +
 include/net/netdev_queues.h             |  54 ++++++
 include/uapi/linux/netdev.h             |  19 +++
 net/core/netdev-genl-gen.c              |  12 ++
 net/core/netdev-genl-gen.h              |   2 +
 net/core/netdev-genl.c                  | 217 ++++++++++++++++++++++++
 tools/include/uapi/linux/netdev.h       |  19 +++
 9 files changed, 426 insertions(+), 1 deletion(-)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 3addac970680..2570cc371fc8 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -74,6 +74,10 @@ name: netdev
     name: queue-type
     type: enum
     entries: [ rx, tx ]
+  -
+    name: stats-scope
+    type: flags
+    entries: [ queue ]
 
 attribute-sets:
   -
@@ -265,6 +269,66 @@ name: netdev
         doc: ID of the NAPI instance which services this queue.
         type: u32
 
+  -
+    name: stats
+    doc: |
+      Get device statistics, scoped to a device or a queue.
+      These statistics extend (and partially duplicate) statistics available
+      in struct rtnl_link_stats64.
+      Value of the `scope` attribute determines how statistics are
+      aggregated. When aggregated for the entire device the statistics
+      represent the total number of events since last explicit reset of
+      the device (i.e. not a reconfiguration like changing queue count).
+      When reported per-queue, however, the statistics may not add
+      up to the total number of events, will only be reported for currently
+      active objects, and will likely report the number of events since last
+      reconfiguration.
+    attributes:
+      -
+        name: ifindex
+        doc: ifindex of the netdevice to which stats belong.
+        type: u32
+        checks:
+          min: 1
+      -
+        name: queue-type
+        doc: Queue type as rx, tx, for queue-id.
+        type: u32
+        enum: queue-type
+      -
+        name: queue-id
+        doc: Queue ID, if stats are scoped to a single queue instance.
+        type: u32
+      -
+        name: scope
+        doc: |
+          What object type should be used to iterate over the stats.
+        type: uint
+        enum: stats-scope
+      -
+        name: rx-packets
+        doc: |
+          Number of wire packets successfully received and passed to the stack.
+          For drivers supporting XDP, XDP is considered the first layer
+          of the stack, so packets consumed by XDP are still counted here.
+        type: uint
+        value: 8 # reserve some attr ids in case we need more metadata later
+      -
+        name: rx-bytes
+        doc: Successfully received bytes, see `rx-packets`.
+        type: uint
+      -
+        name: tx-packets
+        doc: |
+          Number of wire packets successfully sent. Packet is considered to be
+          successfully sent once it is in device memory (usually this means
+          the device has issued a DMA completion for the packet).
+        type: uint
+      -
+        name: tx-bytes
+        doc: Successfully sent bytes, see `tx-packets`.
+        type: uint
+
 operations:
   list:
     -
@@ -405,6 +469,26 @@ name: netdev
           attributes:
             - ifindex
         reply: *napi-get-op
+    -
+      name: stats-get
+      doc: |
+        Get / dump fine grained statistics. Which statistics are reported
+        depends on the device and the driver, and whether the driver stores
+        software counters per-queue.
+      attribute-set: stats
+      dump:
+        request:
+          attributes:
+            - scope
+        reply:
+          attributes:
+            - ifindex
+            - queue-type
+            - queue-id
+            - rx-packets
+            - rx-bytes
+            - tx-packets
+            - tx-bytes
 
 mcast-groups:
   list:
diff --git a/Documentation/networking/statistics.rst b/Documentation/networking/statistics.rst
index 551b3cc29a41..8a4d166af3c0 100644
--- a/Documentation/networking/statistics.rst
+++ b/Documentation/networking/statistics.rst
@@ -41,6 +41,15 @@ If `-s` is specified once the detailed errors won't be shown.
 
 `ip` supports JSON formatting via the `-j` option.
 
+Queue statistics
+~~~~~~~~~~~~~~~~
+
+Queue statistics are accessible via the netdev netlink family.
+
+Currently no widely distributed CLI exists to access those statistics.
+Kernel development tools (ynl) can be used to experiment with them,
+see :ref:`Documentation/userspace-api/netlink/intro-specs.rst`.
+
 Protocol-specific statistics
 ----------------------------
 
@@ -134,7 +143,7 @@ reading multiple stats as it internally performs a full dump of
 and reports only the stat corresponding to the accessed file.
 
 Sysfs files are documented in
-`Documentation/ABI/testing/sysfs-class-net-statistics`.
+:ref:`Documentation/ABI/testing/sysfs-class-net-statistics`.
 
 
 netlink
@@ -147,6 +156,12 @@ Statistics are reported both in the responses to link information
 requests (`RTM_GETLINK`) and statistic requests (`RTM_GETSTATS`,
 when `IFLA_STATS_LINK_64` bit is set in the `.filter_mask` of the request).
 
+netdev (netlink)
+~~~~~~~~~~~~~~~~
+
+`netdev` generic netlink family allows accessing page pool and per queue
+statistics.
+
 ethtool
 -------
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 09023e44db4e..9e4dc83a92ab 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2039,6 +2039,7 @@ enum netdev_reg_state {
  *
  *	@sysfs_rx_queue_group:	Space for optional per-rx queue attributes
  *	@rtnl_link_ops:	Rtnl_link_ops
+ *	@stat_ops:	Optional ops for queue-aware statistics
  *
  *	@gso_max_size:	Maximum size of generic segmentation offload
  *	@tso_max_size:	Device (as in HW) limit on the max TSO request size
@@ -2419,6 +2420,8 @@ struct net_device {
 
 	const struct rtnl_link_ops *rtnl_link_ops;
 
+	const struct netdev_stat_ops *stat_ops;
+
 	/* for setting kernel sock attribute on TCP connection setup */
 #define GSO_MAX_SEGS		65535u
 #define GSO_LEGACY_MAX_SIZE	65536u
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index 8b8ed4e13d74..d633347eeda5 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -4,6 +4,60 @@
 
 #include <linux/netdevice.h>
 
+struct netdev_queue_stats_rx {
+	u64 bytes;
+	u64 packets;
+};
+
+struct netdev_queue_stats_tx {
+	u64 bytes;
+	u64 packets;
+};
+
+/**
+ * struct netdev_stat_ops - netdev ops for fine grained stats
+ * @get_queue_stats_rx:	get stats for a given Rx queue
+ * @get_queue_stats_tx:	get stats for a given Tx queue
+ * @get_base_stats:	get base stats (not belonging to any live instance)
+ *
+ * Query stats for a given object. The values of the statistics are undefined
+ * on entry (specifically they are *not* zero-initialized). Drivers should
+ * assign values only to the statistics they collect. Statistics which are not
+ * collected must be left undefined.
+ *
+ * Queue objects are not necessarily persistent, and only currently active
+ * queues are queried by the per-queue callbacks. This means that per-queue
+ * statistics will not generally add up to the total number of events for
+ * the device. The @get_base_stats callback allows filling in the delta
+ * between events for currently live queues and overall device history.
+ * When the statistics for the entire device are queried, first @get_base_stats
+ * is issued to collect the delta, and then a series of per-queue callbacks.
+ * Only statistics which are set in @get_base_stats will be reported
+ * at the device level, meaning that unlike in queue callbacks, setting
+ * a statistic to zero in @get_base_stats is a legitimate thing to do.
+ * This is because @get_base_stats has a second function of designating which
+ * statistics are in fact correct for the entire device (e.g. when history
+ * for some of the events is not maintained, and reliable "total" cannot
+ * be provided).
+ *
+ * Device drivers can assume that when collecting total device stats,
+ * the @get_base_stats and subsequent per-queue calls are performed
+ * "atomically" (without releasing the rtnl_lock).
+ *
+ * Device drivers are encouraged to reset the per-queue statistics when
+ * number of queues change. This is because the primary use case for
+ * per-queue statistics is currently to detect traffic imbalance.
+ */
+struct netdev_stat_ops {
+	void (*get_queue_stats_rx)(struct net_device *dev, int idx,
+				   struct netdev_queue_stats_rx *stats);
+	void (*get_queue_stats_tx)(struct net_device *dev, int idx,
+				   struct netdev_queue_stats_tx *stats);
+	void (*get_base_stats)(struct net_device *dev,
+			       struct netdev_queue_stats_rx *rx,
+			       struct netdev_queue_stats_tx *tx);
+};
+
 /**
  * DOC: Lockless queue stopping / waking helpers.
  *
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 93cb411adf72..f282634e031d 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -70,6 +70,10 @@ enum netdev_queue_type {
 	NETDEV_QUEUE_TYPE_TX,
 };
 
+enum netdev_stats_scope {
+	NETDEV_STATS_SCOPE_QUEUE = 1,
+};
+
 enum {
 	NETDEV_A_DEV_IFINDEX = 1,
 	NETDEV_A_DEV_PAD,
@@ -132,6 +136,20 @@ enum {
 	NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
 };
 
+enum {
+	NETDEV_A_STATS_IFINDEX = 1,
+	NETDEV_A_STATS_QUEUE_TYPE,
+	NETDEV_A_STATS_QUEUE_ID,
+	NETDEV_A_STATS_SCOPE,
+	NETDEV_A_STATS_RX_PACKETS = 8,
+	NETDEV_A_STATS_RX_BYTES,
+	NETDEV_A_STATS_TX_PACKETS,
+	NETDEV_A_STATS_TX_BYTES,
+
+	__NETDEV_A_STATS_MAX,
+	NETDEV_A_STATS_MAX = (__NETDEV_A_STATS_MAX - 1)
+};
+
 enum {
 	NETDEV_CMD_DEV_GET = 1,
 	NETDEV_CMD_DEV_ADD_NTF,
@@ -144,6 +162,7 @@ enum {
 	NETDEV_CMD_PAGE_POOL_STATS_GET,
 	NETDEV_CMD_QUEUE_GET,
 	NETDEV_CMD_NAPI_GET,
+	NETDEV_CMD_STATS_GET,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c
index be7f2ebd61b2..76566ea5025f 100644
--- a/net/core/netdev-genl-gen.c
+++ b/net/core/netdev-genl-gen.c
@@ -68,6 +68,11 @@ static const struct nla_policy netdev_napi_get_dump_nl_policy[NETDEV_A_NAPI_IFIN
 	[NETDEV_A_NAPI_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1),
 };
 
+/* NETDEV_CMD_STATS_GET - dump */
+static const struct nla_policy netdev_stats_get_nl_policy[NETDEV_A_STATS_SCOPE + 1] = {
+	[NETDEV_A_STATS_SCOPE] = NLA_POLICY_MASK(NLA_UINT, 0x1),
+};
+
 /* Ops table for netdev */
 static const struct genl_split_ops netdev_nl_ops[] = {
 	{
@@ -138,6 +143,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
 		.maxattr	= NETDEV_A_NAPI_IFINDEX,
 		.flags		= GENL_CMD_CAP_DUMP,
 	},
+	{
+		.cmd		= NETDEV_CMD_STATS_GET,
+		.dumpit		= netdev_nl_stats_get_dumpit,
+		.policy		= netdev_stats_get_nl_policy,
+		.maxattr	= NETDEV_A_STATS_SCOPE,
+		.flags		= GENL_CMD_CAP_DUMP,
+	},
 };
 
 static const struct genl_multicast_group netdev_nl_mcgrps[] = {
diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h
index a47f2bcbe4fa..de878ba2bad7 100644
--- a/net/core/netdev-genl-gen.h
+++ b/net/core/netdev-genl-gen.h
@@ -28,6 +28,8 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb,
 			       struct netlink_callback *cb);
 int netdev_nl_napi_get_doit(struct sk_buff *skb, struct genl_info *info);
 int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
+int netdev_nl_stats_get_dumpit(struct sk_buff *skb,
+			       struct netlink_callback *cb);
 
 enum {
 	NETDEV_NLGRP_MGMT,
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index fd98936da3ae..0fbd666f2b79 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -8,6 +8,7 @@
 #include <net/xdp.h>
 #include <net/xdp_sock.h>
 #include <net/netdev_rx_queue.h>
+#include <net/netdev_queues.h>
 #include <net/busy_poll.h>
 
 #include "netdev-genl-gen.h"
@@ -469,6 +470,222 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
+#define NETDEV_STAT_NOT_SET		(~0ULL)
+
+static void
+netdev_nl_stats_add(void *_sum, const void *_add, size_t size)
+{
+	const u64 *add = _add;
+	u64 *sum = _sum;
+
+	while (size) {
+		if (*add != NETDEV_STAT_NOT_SET && *sum != NETDEV_STAT_NOT_SET)
+			*sum += *add;
+		sum++;
+		add++;
+		size -= 8;
+	}
+}
+
+static int netdev_stat_put(struct sk_buff *rsp, unsigned int attr_id, u64 value)
+{
+	if (value == NETDEV_STAT_NOT_SET)
+		return 0;
+	return nla_put_uint(rsp, attr_id, value);
+}
+
+static int
+netdev_nl_stats_write_rx(struct sk_buff *rsp, struct netdev_queue_stats_rx *rx)
+{
+	if (netdev_stat_put(rsp, NETDEV_A_STATS_RX_PACKETS, rx->packets) ||
+	    netdev_stat_put(rsp, NETDEV_A_STATS_RX_BYTES, rx->bytes))
+		return -EMSGSIZE;
+	return 0;
+}
+
+static int
+netdev_nl_stats_write_tx(struct sk_buff *rsp, struct netdev_queue_stats_tx *tx)
+{
+	if (netdev_stat_put(rsp, NETDEV_A_STATS_TX_PACKETS, tx->packets) ||
+	    netdev_stat_put(rsp, NETDEV_A_STATS_TX_BYTES, tx->bytes))
+		return -EMSGSIZE;
+	return 0;
+}
+
+static int
+netdev_nl_stats_queue(struct net_device *netdev, struct sk_buff *rsp,
+		      u32 q_type, int i, const struct genl_info *info)
+{
+	const struct netdev_stat_ops *ops = netdev->stat_ops;
+	struct netdev_queue_stats_rx rx;
+	struct netdev_queue_stats_tx tx;
+	void *hdr;
+
+	hdr = genlmsg_iput(rsp, info);
+	if (!hdr)
+		return -EMSGSIZE;
+	if (nla_put_u32(rsp, NETDEV_A_STATS_IFINDEX, netdev->ifindex) ||
+	    nla_put_u32(rsp, NETDEV_A_STATS_QUEUE_TYPE, q_type) ||
+	    nla_put_u32(rsp, NETDEV_A_STATS_QUEUE_ID, i))
+		goto nla_put_failure;
+
+	switch (q_type) {
+	case NETDEV_QUEUE_TYPE_RX:
+		memset(&rx, 0xff, sizeof(rx));
+		ops->get_queue_stats_rx(netdev, i, &rx);
+		if (!memchr_inv(&rx, 0xff, sizeof(rx)))
+			goto nla_cancel;
+		if (netdev_nl_stats_write_rx(rsp, &rx))
+			goto nla_put_failure;
+		break;
+	case NETDEV_QUEUE_TYPE_TX:
+		memset(&tx, 0xff, sizeof(tx));
+		ops->get_queue_stats_tx(netdev, i, &tx);
+		if (!memchr_inv(&tx, 0xff, sizeof(tx)))
+			goto nla_cancel;
+		if (netdev_nl_stats_write_tx(rsp, &tx))
+			goto nla_put_failure;
+		break;
+	}
+
+	genlmsg_end(rsp, hdr);
+	return 0;
+
+nla_cancel:
+	genlmsg_cancel(rsp, hdr);
+	return 0;
+nla_put_failure:
+	genlmsg_cancel(rsp, hdr);
+	return -EMSGSIZE;
+}
+
+static int
+netdev_nl_stats_by_queue(struct net_device *netdev, struct sk_buff *rsp,
+			 const struct genl_info *info,
+			 struct netdev_nl_dump_ctx *ctx)
+{
+	const struct netdev_stat_ops *ops = netdev->stat_ops;
+	int i, err;
+
+	if (!(netdev->flags & IFF_UP))
+		return 0;
+
+	i = ctx->rxq_idx;
+	while (ops->get_queue_stats_rx && i < netdev->real_num_rx_queues) {
+		err = netdev_nl_stats_queue(netdev, rsp, NETDEV_QUEUE_TYPE_RX,
+					    i, info);
+		if (err)
+			return err;
+		ctx->rxq_idx = i++;
+	}
+	i = ctx->txq_idx;
+	while (ops->get_queue_stats_tx && i < netdev->real_num_tx_queues) {
+		err = netdev_nl_stats_queue(netdev, rsp, NETDEV_QUEUE_TYPE_TX,
+					    i, info);
+		if (err)
+			return err;
+		ctx->txq_idx = i++;
+	}
+
+	ctx->rxq_idx = 0;
+	ctx->txq_idx = 0;
+	return 0;
+}
+
+static int
+netdev_nl_stats_by_netdev(struct net_device *netdev, struct sk_buff *rsp,
+			  const struct genl_info *info)
+{
+	struct netdev_queue_stats_rx rx_sum, rx;
+	struct netdev_queue_stats_tx tx_sum, tx;
+	const struct netdev_stat_ops *ops;
+	void *hdr;
+	int i;
+
+	ops = netdev->stat_ops;
+	/* Netdev can't guarantee any complete counters */
+	if (!ops->get_base_stats)
+		return 0;
+
+	memset(&rx_sum, 0xff, sizeof(rx_sum));
+	memset(&tx_sum, 0xff, sizeof(tx_sum));
+
+	ops->get_base_stats(netdev, &rx_sum, &tx_sum);
+
+	/* The op was there, but nothing reported, don't bother */
+	if (!memchr_inv(&rx_sum, 0xff, sizeof(rx_sum)) &&
+	    !memchr_inv(&tx_sum, 0xff, sizeof(tx_sum)))
+		return 0;
+
+	hdr = genlmsg_iput(rsp, info);
+	if (!hdr)
+		return -EMSGSIZE;
+	if (nla_put_u32(rsp, NETDEV_A_STATS_IFINDEX, netdev->ifindex))
+		goto nla_put_failure;
+
+	for (i = 0; i < netdev->real_num_rx_queues; i++) {
+		memset(&rx, 0xff, sizeof(rx));
+		if (ops->get_queue_stats_rx)
+			ops->get_queue_stats_rx(netdev, i, &rx);
+		netdev_nl_stats_add(&rx_sum, &rx, sizeof(rx));
+	}
+	for (i = 0; i < netdev->real_num_tx_queues; i++) {
+		memset(&tx, 0xff, sizeof(tx));
+		if (ops->get_queue_stats_tx)
+			ops->get_queue_stats_tx(netdev, i, &tx);
+		netdev_nl_stats_add(&tx_sum, &tx, sizeof(tx));
+	}
+
+	if (netdev_nl_stats_write_rx(rsp, &rx_sum) ||
+	    netdev_nl_stats_write_tx(rsp, &tx_sum))
+		goto nla_put_failure;
+
+	genlmsg_end(rsp, hdr);
+	return 0;
+
+nla_put_failure:
+	genlmsg_cancel(rsp, hdr);
+	return -EMSGSIZE;
+}
+
+int netdev_nl_stats_get_dumpit(struct sk_buff *skb,
+			       struct netlink_callback *cb)
+{
+	struct netdev_nl_dump_ctx *ctx = netdev_dump_ctx(cb);
+	const struct genl_info *info = genl_info_dump(cb);
+	struct net *net = sock_net(skb->sk);
+	struct net_device *netdev;
+	unsigned int scope;
+	int err = 0;
+
+	scope = 0;
+	if (info->attrs[NETDEV_A_STATS_SCOPE])
+		scope = nla_get_uint(info->attrs[NETDEV_A_STATS_SCOPE]);
+
+	rtnl_lock();
+	for_each_netdev_dump(net, netdev, ctx->ifindex) {
+		if (!netdev->stat_ops)
+			continue;
+
+		switch (scope) {
+		case 0:
+			err = netdev_nl_stats_by_netdev(netdev, skb, info);
+			break;
+		case NETDEV_STATS_SCOPE_QUEUE:
+			err = netdev_nl_stats_by_queue(netdev, skb, info, ctx);
+			break;
+		}
+		if (err < 0)
+			break;
+	}
+	rtnl_unlock();
+
+	if (err != -EMSGSIZE)
+		return err;
+
+	return skb->len;
+}
+
 static int netdev_genl_netdevice_event(struct notifier_block *nb,
 				       unsigned long event, void *ptr)
 {
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 93cb411adf72..f282634e031d 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -70,6 +70,10 @@ enum netdev_queue_type {
 	NETDEV_QUEUE_TYPE_TX,
 };
 
+enum netdev_stats_scope {
+	NETDEV_STATS_SCOPE_QUEUE = 1,
+};
+
 enum {
 	NETDEV_A_DEV_IFINDEX = 1,
 	NETDEV_A_DEV_PAD,
@@ -132,6 +136,20 @@ enum {
 	NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)
 };
 
+enum {
+	NETDEV_A_STATS_IFINDEX = 1,
+	NETDEV_A_STATS_QUEUE_TYPE,
+	NETDEV_A_STATS_QUEUE_ID,
+	NETDEV_A_STATS_SCOPE,
+	NETDEV_A_STATS_RX_PACKETS = 8,
+	NETDEV_A_STATS_RX_BYTES,
+	NETDEV_A_STATS_TX_PACKETS,
+	NETDEV_A_STATS_TX_BYTES,
+
+	__NETDEV_A_STATS_MAX,
+	NETDEV_A_STATS_MAX = (__NETDEV_A_STATS_MAX - 1)
+};
+
 enum {
 	NETDEV_CMD_DEV_GET = 1,
 	NETDEV_CMD_DEV_ADD_NTF,
@@ -144,6 +162,7 @@ enum {
 	NETDEV_CMD_PAGE_POOL_STATS_GET,
 	NETDEV_CMD_QUEUE_GET,
 	NETDEV_CMD_NAPI_GET,
+	NETDEV_CMD_STATS_GET,
 
 	__NETDEV_CMD_MAX,
 	NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)
-- 
2.43.2



* [PATCH net-next 2/3] netdev: add queue stat for alloc failures
  2024-02-26 21:10 [PATCH net-next 0/3] netdev: add per-queue statistics Jakub Kicinski
  2024-02-26 21:10 ` [PATCH net-next 1/3] " Jakub Kicinski
@ 2024-02-26 21:10 ` Jakub Kicinski
  2024-02-26 21:10 ` [PATCH net-next 3/3] eth: bnxt: support per-queue statistics Jakub Kicinski
  2 siblings, 0 replies; 14+ messages in thread
From: Jakub Kicinski @ 2024-02-26 21:10 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, amritha.nambiar, danielj, mst,
	michael.chan, sdf, vadim.fedorenko, Jakub Kicinski

Rx alloc failures are commonly counted by drivers.
Support reporting those via netdev-genl queue stats.
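
For illustration only -- a rough sketch with a hypothetical driver and
counter names (patch 3 wires this up to bnxt's real rx_oom_discards
counter):

/* Rx refill path: remember the failure in a per-ring SW counter */
static bool foo_refill_rx_buffer(struct foo_rx_ring *ring)
{
	struct page *page = dev_alloc_page();

	if (!page) {
		ring->alloc_fail++;
		return false;
	}
	/* ... map the page and post it to the ring ... */
	return true;
}

/* ... and export it through the per-queue stats callback */
static void foo_get_queue_stats_rx(struct net_device *dev, int idx,
				   struct netdev_queue_stats_rx *stats)
{
	struct foo_priv *fp = netdev_priv(dev);

	stats->alloc_fail = fp->rx_ring[idx].alloc_fail;
	/* packets/bytes assigned as in the previous patch */
}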

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 Documentation/netlink/specs/netdev.yaml | 7 +++++++
 include/net/netdev_queues.h             | 2 ++
 include/uapi/linux/netdev.h             | 1 +
 net/core/netdev-genl.c                  | 3 ++-
 tools/include/uapi/linux/netdev.h       | 1 +
 5 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 2570cc371fc8..382a98383b6e 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -328,6 +328,13 @@ name: netdev
         name: tx-bytes
         doc: Successfully sent bytes, see `tx-packets`.
         type: uint
+      -
+        name: rx-alloc-fail
+        doc: |
+          Number of times skb or buffer allocation failed on the Rx datapath.
+          Allocation failure may, or may not result in a packet drop, depending
+          on driver implementation and whether system recovers quickly.
+        type: uint
 
 operations:
   list:
diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
index d633347eeda5..1ec408585373 100644
--- a/include/net/netdev_queues.h
+++ b/include/net/netdev_queues.h
@@ -4,9 +4,11 @@
 
 #include <linux/netdevice.h>
 
+/* See the netdev.yaml spec for definition of each statistic */
 struct netdev_queue_stats_rx {
 	u64 bytes;
 	u64 packets;
+	u64 alloc_fail;
 };
 
 struct netdev_queue_stats_tx {
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index f282634e031d..6e7a0e74ccd6 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -145,6 +145,7 @@ enum {
 	NETDEV_A_STATS_RX_BYTES,
 	NETDEV_A_STATS_TX_PACKETS,
 	NETDEV_A_STATS_TX_BYTES,
+	NETDEV_A_STATS_RX_ALLOC_FAIL,
 
 	__NETDEV_A_STATS_MAX,
 	NETDEV_A_STATS_MAX = (__NETDEV_A_STATS_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 0fbd666f2b79..db72c4801d5c 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -498,7 +498,8 @@ static int
 netdev_nl_stats_write_rx(struct sk_buff *rsp, struct netdev_queue_stats_rx *rx)
 {
 	if (netdev_stat_put(rsp, NETDEV_A_STATS_RX_PACKETS, rx->packets) ||
-	    netdev_stat_put(rsp, NETDEV_A_STATS_RX_BYTES, rx->bytes))
+	    netdev_stat_put(rsp, NETDEV_A_STATS_RX_BYTES, rx->bytes) ||
+	    netdev_stat_put(rsp, NETDEV_A_STATS_RX_ALLOC_FAIL, rx->alloc_fail))
 		return -EMSGSIZE;
 	return 0;
 }
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index f282634e031d..6e7a0e74ccd6 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -145,6 +145,7 @@ enum {
 	NETDEV_A_STATS_RX_BYTES,
 	NETDEV_A_STATS_TX_PACKETS,
 	NETDEV_A_STATS_TX_BYTES,
+	NETDEV_A_STATS_RX_ALLOC_FAIL,
 
 	__NETDEV_A_STATS_MAX,
 	NETDEV_A_STATS_MAX = (__NETDEV_A_STATS_MAX - 1)
-- 
2.43.2



* [PATCH net-next 3/3] eth: bnxt: support per-queue statistics
  2024-02-26 21:10 [PATCH net-next 0/3] netdev: add per-queue statistics Jakub Kicinski
  2024-02-26 21:10 ` [PATCH net-next 1/3] " Jakub Kicinski
  2024-02-26 21:10 ` [PATCH net-next 2/3] netdev: add queue stat for alloc failures Jakub Kicinski
@ 2024-02-26 21:10 ` Jakub Kicinski
  2 siblings, 0 replies; 14+ messages in thread
From: Jakub Kicinski @ 2024-02-26 21:10 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, amritha.nambiar, danielj, mst,
	michael.chan, sdf, vadim.fedorenko, Jakub Kicinski

Support per-queue statistics API in bnxt.

$ ethtool -S eth0
NIC statistics:
     [0]: rx_ucast_packets: 1418
     [0]: rx_mcast_packets: 178
     [0]: rx_bcast_packets: 0
     [0]: rx_discards: 0
     [0]: rx_errors: 0
     [0]: rx_ucast_bytes: 1141815
     [0]: rx_mcast_bytes: 16766
     [0]: rx_bcast_bytes: 0
     [0]: tx_ucast_packets: 1734
...

$ ./cli.py --spec netlink/specs/netdev.yaml \
   --dump stats-get --json '{"scope": "queue"}'
[{'ifindex': 2,
  'queue-id': 0,
  'queue-type': 'rx',
  'rx-alloc-fail': 0,
  'rx-bytes': 1164931,
  'rx-packets': 1641},
...
 {'ifindex': 2,
  'queue-id': 0,
  'queue-type': 'tx',
  'tx-bytes': 631494,
  'tx-packets': 1771},
...

Reset the per queue counters:
$ ethtool -L eth0 combined 4

Inspect again:

$ ./cli.py --spec netlink/specs/netdev.yaml \
   --dump stats-get --json '{"scope": "queue"}'
[{'ifindex': 2,
  'queue-id': 0,
  'queue-type': 'rx',
  'rx-alloc-fail': 0,
  'rx-bytes': 32397,
  'rx-packets': 145},
...
 {'ifindex': 2,
  'queue-id': 0,
  'queue-type': 'tx',
  'tx-bytes': 37481,
  'tx-packets': 196},
...

$ ethtool -S eth0 | head
NIC statistics:
     [0]: rx_ucast_packets: 174
     [0]: rx_mcast_packets: 3
     [0]: rx_bcast_packets: 0
     [0]: rx_discards: 0
     [0]: rx_errors: 0
     [0]: rx_ucast_bytes: 37151
     [0]: rx_mcast_bytes: 267
     [0]: rx_bcast_bytes: 0
     [0]: tx_ucast_packets: 267
...

Totals are still correct:

$ ./cli.py --spec netlink/specs/netdev.yaml --dump stats-get
[{'ifindex': 2,
  'rx-alloc-fail': 0,
  'rx-bytes': 281949995,
  'rx-packets': 216524,
  'tx-bytes': 52694905,
  'tx-packets': 75546}]
$ ip -s link show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 14:23:f2:61:05:40 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast
     282519546  218100      0       0       0     516
    TX:  bytes packets errors dropped carrier collsns
      53323054   77674      0       0       0       0

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 63 +++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index a15e6d31fc22..97abde27d5fe 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -14523,6 +14523,68 @@ static const struct net_device_ops bnxt_netdev_ops = {
 	.ndo_bridge_setlink	= bnxt_bridge_setlink,
 };
 
+static void bnxt_get_queue_stats_rx(struct net_device *dev, int i,
+				    struct netdev_queue_stats_rx *stats)
+{
+	struct bnxt *bp = netdev_priv(dev);
+	struct bnxt_cp_ring_info *cpr;
+	u64 *sw;
+
+	cpr = &bp->bnapi[i]->cp_ring;
+	sw = cpr->stats.sw_stats;
+
+	stats->packets = 0;
+	stats->packets += BNXT_GET_RING_STATS64(sw, rx_ucast_pkts);
+	stats->packets += BNXT_GET_RING_STATS64(sw, rx_mcast_pkts);
+	stats->packets += BNXT_GET_RING_STATS64(sw, rx_bcast_pkts);
+
+	stats->bytes = 0;
+	stats->bytes += BNXT_GET_RING_STATS64(sw, rx_ucast_bytes);
+	stats->bytes += BNXT_GET_RING_STATS64(sw, rx_mcast_bytes);
+	stats->bytes += BNXT_GET_RING_STATS64(sw, rx_bcast_bytes);
+
+	stats->alloc_fail = cpr->sw_stats.rx.rx_oom_discards;
+}
+
+static void bnxt_get_queue_stats_tx(struct net_device *dev, int i,
+				    struct netdev_queue_stats_tx *stats)
+{
+	struct bnxt *bp = netdev_priv(dev);
+	u64 *sw;
+
+	sw = bp->bnapi[i]->cp_ring.stats.sw_stats;
+
+	stats->packets = 0;
+	stats->packets += BNXT_GET_RING_STATS64(sw, tx_ucast_pkts);
+	stats->packets += BNXT_GET_RING_STATS64(sw, tx_mcast_pkts);
+	stats->packets += BNXT_GET_RING_STATS64(sw, tx_bcast_pkts);
+
+	stats->bytes = 0;
+	stats->bytes += BNXT_GET_RING_STATS64(sw, tx_ucast_bytes);
+	stats->bytes += BNXT_GET_RING_STATS64(sw, tx_mcast_bytes);
+	stats->bytes += BNXT_GET_RING_STATS64(sw, tx_bcast_bytes);
+}
+
+static void bnxt_get_base_stats(struct net_device *dev,
+				struct netdev_queue_stats_rx *rx,
+				struct netdev_queue_stats_tx *tx)
+{
+	struct bnxt *bp = netdev_priv(dev);
+
+	rx->packets = bp->net_stats_prev.rx_packets;
+	rx->bytes = bp->net_stats_prev.rx_bytes;
+	rx->alloc_fail = bp->ring_err_stats_prev.rx_total_oom_discards;
+
+	tx->packets = bp->net_stats_prev.tx_packets;
+	tx->bytes = bp->net_stats_prev.tx_bytes;
+}
+
+static const struct netdev_stat_ops bnxt_stat_ops = {
+	.get_queue_stats_rx	= bnxt_get_queue_stats_rx,
+	.get_queue_stats_tx	= bnxt_get_queue_stats_tx,
+	.get_base_stats		= bnxt_get_base_stats,
+};
+
 static void bnxt_remove_one(struct pci_dev *pdev)
 {
 	struct net_device *dev = pci_get_drvdata(pdev);
@@ -14970,6 +15032,7 @@ static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 		goto init_err_free;
 
 	dev->netdev_ops = &bnxt_netdev_ops;
+	dev->stat_ops = &bnxt_stat_ops;
 	dev->watchdog_timeo = BNXT_TX_TIMEOUT;
 	dev->ethtool_ops = &bnxt_ethtool_ops;
 	pci_set_drvdata(pdev, dev);
-- 
2.43.2



* Re: [PATCH net-next 1/3] netdev: add per-queue statistics
  2024-02-26 21:10 ` [PATCH net-next 1/3] " Jakub Kicinski
@ 2024-02-26 21:35   ` Stanislav Fomichev
  2024-02-26 22:19     ` Jakub Kicinski
  2024-02-27 10:29   ` Przemek Kitszel
  1 sibling, 1 reply; 14+ messages in thread
From: Stanislav Fomichev @ 2024-02-26 21:35 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, amritha.nambiar, danielj, mst,
	michael.chan, vadim.fedorenko

On 02/26, Jakub Kicinski wrote:
> The ethtool-nl family does a good job exposing various protocol
> related and IEEE/IETF statistics which used to get dumped under
> ethtool -S, with creative names. Queue stats don't have a netlink
> API, yet, and remain a lion's share of ethtool -S output for new
> drivers. Not only is that bad because the names differ driver to
> driver but it's also bug-prone. Intuitively drivers try to report
> only the stats for active queues, but querying ethtool stats
> involves multiple system calls, and the number of stats is
> read separately from the stats themselves. Worse still when user
> space asks for values of the stats, it doesn't inform the kernel
> how big the buffer is. If number of stats increases in the meantime
> kernel will overflow user buffer.
> 
> Add a netlink API for dumping queue stats. Queue information is
> exposed via the netdev-genl family, so add the stats there.
> Support per-queue and sum-for-device dumps. Latter will be useful
> when subsequent patches add more interesting common stats than
> just bytes and packets.
> 
> The API does not currently distinguish between HW and SW stats.
> The expectation is that the source of the stats will either not
> matter much (good packets) or be obvious (skb alloc errors).
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  Documentation/netlink/specs/netdev.yaml |  84 +++++++++
>  Documentation/networking/statistics.rst |  17 +-
>  include/linux/netdevice.h               |   3 +
>  include/net/netdev_queues.h             |  54 ++++++
>  include/uapi/linux/netdev.h             |  19 +++
>  net/core/netdev-genl-gen.c              |  12 ++
>  net/core/netdev-genl-gen.h              |   2 +
>  net/core/netdev-genl.c                  | 217 ++++++++++++++++++++++++
>  tools/include/uapi/linux/netdev.h       |  19 +++
>  9 files changed, 426 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index 3addac970680..2570cc371fc8 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -74,6 +74,10 @@ name: netdev
>      name: queue-type
>      type: enum
>      entries: [ rx, tx ]
> +  -
> +    name: stats-scope
> +    type: flags
> +    entries: [ queue ]

IIUC, the way to get netdev-scoped stats in v1 (vs rfc) is to not set
stats-scope, right? Any reason we dropped the explicit netdev entry?
It seems more robust with a separate entry and removes the ambiguity about
which stats we're querying.


* Re: [PATCH net-next 1/3] netdev: add per-queue statistics
  2024-02-26 21:35   ` Stanislav Fomichev
@ 2024-02-26 22:19     ` Jakub Kicinski
  2024-02-27  3:37       ` Stanislav Fomichev
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2024-02-26 22:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: davem, netdev, edumazet, pabeni, amritha.nambiar, danielj, mst,
	michael.chan, vadim.fedorenko

On Mon, 26 Feb 2024 13:35:34 -0800 Stanislav Fomichev wrote:
> > +  -
> > +    name: stats-scope
> > +    type: flags
> > +    entries: [ queue ]  
> 
> IIUC, in order to get netdev-scoped stats in v1 (vs rfc) is to not set
> stats-scope, right? Any reason we dropped the explicit netdev entry?
> It seems more robust with a separate entry and removes the ambiguity about
> which stats we're querying.

The change is because I switched from enum to flags.

I'm not 100% sure which one is going to cause fewer issues down
the line. It's a question of whether the next scope we add will 
be disjoint with or subdividing previous scopes.

I think only subdividing previous scopes makes sense. If we were 
to add "stats per NAPI" (bad example) or "per buffer pool" or IDK what
other thing -- we should expose that as a new netlink command, not mix 
it with the queues.

The expectation is that scopes will be extended with hw vs sw, or
per-CPU (e.g. page pool recycling). In which case we'll want flags,
so that we can combine them -- ask for HW stats for a queue or hw
stats for the entire netdev.
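
(Sketch of how the flags would compose -- the HW bit below is purely
hypothetical, only QUEUE exists in this series:)

enum netdev_stats_scope {
	NETDEV_STATS_SCOPE_QUEUE	= 1,	/* in this series */
	NETDEV_STATS_SCOPE_HW		= 2,	/* hypothetical future flag */
};

/* scope == 0           -> totals for the whole netdev
 * scope == QUEUE       -> per-queue breakdown
 * scope == QUEUE | HW  -> per-queue, device-maintained counters only
 */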

Perhaps I should rename stats -> queue-stats to make this more explicit?

The initial version I wrote could iterate both over NAPIs and
queues. This could be helpful to some drivers - but I realized that it
would lead to rather painful user experience (does the driver maintain
stats per NAPI or per queue?) and tricky implementation of the device
level sum (device stats = Sum(queue) or Sum(queue) + Sum(NAPI)??)


* Re: [PATCH net-next 1/3] netdev: add per-queue statistics
  2024-02-26 22:19     ` Jakub Kicinski
@ 2024-02-27  3:37       ` Stanislav Fomichev
  2024-02-27 15:24         ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: Stanislav Fomichev @ 2024-02-27  3:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, amritha.nambiar, danielj, mst,
	michael.chan, vadim.fedorenko

On 02/26, Jakub Kicinski wrote:
> On Mon, 26 Feb 2024 13:35:34 -0800 Stanislav Fomichev wrote:
> > > +  -
> > > +    name: stats-scope
> > > +    type: flags
> > > +    entries: [ queue ]  
> > 
> > IIUC, in order to get netdev-scoped stats in v1 (vs rfc) is to not set
> > stats-scope, right? Any reason we dropped the explicit netdev entry?
> > It seems more robust with a separate entry and removes the ambiguity about
> > which stats we're querying.
> 
> The change is because I switched from enum to flags.
> 
> I'm not 100% sure which one is going to cause fewer issues down
> the line. It's a question of whether the next scope we add will 
> be disjoint with or subdividing previous scopes.
> 
> I think only subdividing previous scopes makes sense. If we were 
> to add "stats per NAPI" (bad example) or "per buffer pool" or IDK what
> other thing -- we should expose that as a new netlink command, not mix 
> it with the queues.
> 
> The expectation is that scopes will be extended with hw vs sw, or
> per-CPU (e.g. page pool recycling). In which case we'll want flags,
> so that we can combine them -- ask for HW stats for a queue or hw
> stats for the entire netdev.
> 
> Perhaps I should rename stats -> queue-stats to make this more explicit?
> 
> The initial version I wrote could iterate both over NAPIs and
> queues. This could be helpful to some drivers - but I realized that it
> would lead to rather painful user experience (does the driver maintain
> stats per NAPI or per queue?) and tricky implementation of the device
> level sum (device stats = Sum(queue) or Sum(queue) + Sum(NAPI)??)

Yeah, same, not sure. The flags may be more flexible but a bit harder
wrt discoverability. Assuming a somewhat ignorant spec reader/user,
it might be hard to say which flags make sense to combine and which don't.
Or, I guess, we can try to document it?

For HW vs SW, do you think it makes sense to expose it as a scope?
Why not have something like 'rx-packets' and 'hw-rx-packets'?

Maybe, as you're suggesting, we should rename stats to queue-stats
and drop the scope for now? When the time comes to add hw counters,
we can revisit. For total netdev stats, we can ask the user to aggregate
the per-queue ones?


* Re: [PATCH net-next 1/3] netdev: add per-queue statistics
  2024-02-26 21:10 ` [PATCH net-next 1/3] " Jakub Kicinski
  2024-02-26 21:35   ` Stanislav Fomichev
@ 2024-02-27 10:29   ` Przemek Kitszel
  2024-02-27 15:00     ` Jakub Kicinski
  1 sibling, 1 reply; 14+ messages in thread
From: Przemek Kitszel @ 2024-02-27 10:29 UTC (permalink / raw)
  To: Jakub Kicinski, davem
  Cc: netdev, edumazet, pabeni, amritha.nambiar, danielj, mst,
	michael.chan, sdf, vadim.fedorenko

On 2/26/24 22:10, Jakub Kicinski wrote:
> The ethtool-nl family does a good job exposing various protocol
> related and IEEE/IETF statistics which used to get dumped under
> ethtool -S, with creative names. Queue stats don't have a netlink
> API, yet, and remain a lion's share of ethtool -S output for new
> drivers. Not only is that bad because the names differ driver to
> driver but it's also bug-prone. Intuitively drivers try to report
> only the stats for active queues, but querying ethtool stats
> involves multiple system calls, and the number of stats is
> read separately from the stats themselves. Worse still when user
> space asks for values of the stats, it doesn't inform the kernel
> how big the buffer is. If number of stats increases in the meantime
> kernel will overflow user buffer.
> 
> Add a netlink API for dumping queue stats. Queue information is
> exposed via the netdev-genl family, so add the stats there.
> Support per-queue and sum-for-device dumps. Latter will be useful
> when subsequent patches add more interesting common stats than
> just bytes and packets.
> 
> The API does not currently distinguish between HW and SW stats.
> The expectation is that the source of the stats will either not
> matter much (good packets) or be obvious (skb alloc errors).
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>   Documentation/netlink/specs/netdev.yaml |  84 +++++++++
>   Documentation/networking/statistics.rst |  17 +-
>   include/linux/netdevice.h               |   3 +
>   include/net/netdev_queues.h             |  54 ++++++
>   include/uapi/linux/netdev.h             |  19 +++
>   net/core/netdev-genl-gen.c              |  12 ++
>   net/core/netdev-genl-gen.h              |   2 +
>   net/core/netdev-genl.c                  | 217 ++++++++++++++++++++++++
>   tools/include/uapi/linux/netdev.h       |  19 +++
>   9 files changed, 426 insertions(+), 1 deletion(-)
> 

I like the series, thank you very much!

[...]

> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> index 8b8ed4e13d74..d633347eeda5 100644
> --- a/include/net/netdev_queues.h
> +++ b/include/net/netdev_queues.h
> @@ -4,6 +4,60 @@
>   
>   #include <linux/netdevice.h>
>   
> +struct netdev_queue_stats_rx {
> +	u64 bytes;
> +	u64 packets;
> +};
> +
> +struct netdev_queue_stats_tx {
> +	u64 bytes;
> +	u64 packets;
> +};
> +
> +/**
> + * struct netdev_stat_ops - netdev ops for fine grained stats
> + * @get_queue_stats_rx:	get stats for a given Rx queue
> + * @get_queue_stats_tx:	get stats for a given Tx queue
> + * @get_base_stats:	get base stats (not belonging to any live instance)
> + *
> + * Query stats for a given object. The values of the statistics are undefined
> + * on entry (specifically they are *not* zero-initialized). Drivers should
> + * assign values only to the statistics they collect. Statistics which are not
> + * collected must be left undefined.
> + *
> + * Queue objects are not necessarily persistent, and only currently active
> + * queues are queried by the per-queue callbacks. This means that per-queue
> + * statistics will not generally add up to the total number of events for
> + * the device. The @get_base_stats callback allows filling in the delta
> + * between events for currently live queues and overall device history.
> + * When the statistics for the entire device are queried, first @get_base_stats
> + * is issued to collect the delta, and then a series of per-queue callbacks.
> + * Only statistics which are set in @get_base_stats will be reported
> + * at the device level, meaning that unlike in queue callbacks, setting
> + * a statistic to zero in @get_base_stats is a legitimate thing to do.
> + * This is because @get_base_stats has a second function of designating which
> + * statistics are in fact correct for the entire device (e.g. when history
> + * for some of the events is not maintained, and reliable "total" cannot
> + * be provided).
> + *
> + * Device drivers can assume that when collecting total device stats,
> + * the @get_base_stats and subsequent per-queue calls are performed
> + * "atomically" (without releasing the rtnl_lock).
> + *
> + * Device drivers are encouraged to reset the per-queue statistics when
> + * number of queues change. This is because the primary use case for
> + * per-queue statistics is currently to detect traffic imbalance.

I get it, but encouraging users to reset those on queue-count-change
seems to cover that case too. I'm fine though :P

> + */
> +struct netdev_stat_ops {
> +	void (*get_queue_stats_rx)(struct net_device *dev, int idx,
> +				   struct netdev_queue_stats_rx *stats);
> +	void (*get_queue_stats_tx)(struct net_device *dev, int idx,
> +				   struct netdev_queue_stats_tx *stats);
> +	void (*get_base_stats)(struct net_device *dev,
> +			       struct netdev_queue_stats_rx *rx,
> +			       struct netdev_queue_stats_tx *tx);
> +};
> +
>   /**
>    * DOC: Lockless queue stopping / waking helpers.
>    *

[...]

> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index fd98936da3ae..0fbd666f2b79 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -8,6 +8,7 @@
>   #include <net/xdp.h>
>   #include <net/xdp_sock.h>
>   #include <net/netdev_rx_queue.h>
> +#include <net/netdev_queues.h>
>   #include <net/busy_poll.h>
>   
>   #include "netdev-genl-gen.h"
> @@ -469,6 +470,222 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
>   	return skb->len;
>   }
>   
> +#define NETDEV_STAT_NOT_SET		(~0ULL)
> +
> +static void
> +netdev_nl_stats_add(void *_sum, const void *_add, size_t size)

nit: this declaration fits in one line

> +{
> +	const u64 *add = _add;
> +	u64 *sum = _sum;
> +
> +	while (size) {
> +		if (*add != NETDEV_STAT_NOT_SET && *sum != NETDEV_STAT_NOT_SET)
> +			*sum += *add;
> +		sum++;
> +		add++;
> +		size -= 8;
> +	}
> +}
> +



* Re: [PATCH net-next 1/3] netdev: add per-queue statistics
  2024-02-27 10:29   ` Przemek Kitszel
@ 2024-02-27 15:00     ` Jakub Kicinski
  2024-02-27 16:17       ` Przemek Kitszel
  0 siblings, 1 reply; 14+ messages in thread
From: Jakub Kicinski @ 2024-02-27 15:00 UTC (permalink / raw)
  To: Przemek Kitszel
  Cc: davem, netdev, edumazet, pabeni, amritha.nambiar, danielj, mst,
	michael.chan, sdf, vadim.fedorenko

On Tue, 27 Feb 2024 11:29:02 +0100 Przemek Kitszel wrote:
> > + * Device drivers are encouraged to reset the per-queue statistics when
> > + * number of queues change. This is because the primary use case for
> > + * per-queue statistics is currently to detect traffic imbalance.  
> 
> I get it, but encouraging users to reset those on queue-count-change
> seems to cover that case too. I'm fine though :P

What do you mean? Did I encourage the users somewhere?


* Re: [PATCH net-next 1/3] netdev: add per-queue statistics
  2024-02-27  3:37       ` Stanislav Fomichev
@ 2024-02-27 15:24         ` Jakub Kicinski
  2024-02-27 18:09           ` Stanislav Fomichev
  2024-02-27 19:49           ` Nambiar, Amritha
  0 siblings, 2 replies; 14+ messages in thread
From: Jakub Kicinski @ 2024-02-27 15:24 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: davem, netdev, edumazet, pabeni, amritha.nambiar, danielj, mst,
	michael.chan, vadim.fedorenko

On Mon, 26 Feb 2024 19:37:04 -0800 Stanislav Fomichev wrote:
> On 02/26, Jakub Kicinski wrote:
> > On Mon, 26 Feb 2024 13:35:34 -0800 Stanislav Fomichev wrote:  
> > > IIUC, in order to get netdev-scoped stats in v1 (vs rfc) is to not set
> > > stats-scope, right? Any reason we dropped the explicit netdev entry?
> > > It seems more robust with a separate entry and removes the ambiguity about
> > > which stats we're querying.  
> > 
> > The change is because I switched from enum to flags.
> > 
> > I'm not 100% sure which one is going to cause fewer issues down
> > the line. It's a question of whether the next scope we add will 
> > be disjoint with or subdividing previous scopes.
> > 
> > I think only subdividing previous scopes makes sense. If we were 
> > to add "stats per NAPI" (bad example) or "per buffer pool" or IDK what
> > other thing -- we should expose that as a new netlink command, not mix 
> > it with the queues.
> > 
> > The expectation is that scopes will be extended with hw vs sw, or
> > per-CPU (e.g. page pool recycling). In which case we'll want flags,
> > so that we can combine them -- ask for HW stats for a queue or hw
> > stats for the entire netdev.
> > 
> > Perhaps I should rename stats -> queue-stats to make this more explicit?
> > 
> > The initial version I wrote could iterate both over NAPIs and
> > queues. This could be helpful to some drivers - but I realized that it
> > would lead to rather painful user experience (does the driver maintain
> > stats per NAPI or per queue?) and tricky implementation of the device
> > level sum (device stats = Sum(queue) or Sum(queue) + Sum(NAPI)??)  
> 
> Yeah, same, not sure. The flags may be more flexible but a bit harder
> wrt discoverability. Assuming a somewhat ignorant spec reader/user,
> it might be hard to say which flags makes sense to combine and which isn't.
> Or, I guess, we can try to document it?

We're talking about driver API here, so document and enforce in code
review :) But fundamentally, I don't think we should be turning this op
into a mux for all sort of stats. We can have 64k ops in the family.

> For HW vs SW, do you think it makes sense to expose it as a scope?
> Why not have something like 'rx-packets' and 'hw-rx-packets'?

I had that in one of the WIP versions but (a) a lot of the stats can 
be maintained by either device or the driver, so we'd end up with a hw-
flavor for most of the entries, and (b) 90% of the time the user will
not care whether it's the HW or SW that counted the bytes, or GSO
segments. Similarly to how most of the users will not care about
per-queue breakdown, TBH, which made me think that from user
perspective both queue and hw vs sw are just a form of detailed
breakdown. Majority will dump the combined sw|hw stats for the device.

I could be wrong.

> Maybe, as you're suggesting, we should rename stats to queue-states
> and drop the score for now? When the time comes to add hw counters,
> we can revisit. For total netdev stats, we can ask the user to aggregate
> the per-queue ones?

I'd keep the scope, and ability to show the device level aggregation.
There are drivers (bnxt, off the top of my head, but I feel like there's
more) which stash the counters when queues get freed. Without the device
level aggregation we'd need to expose that as "no queue" or "history"
or "delta" etc stats. I think that's uglier that showing the sum, which
is what user will care about 99% of the time.

It'd be a pure rename.
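
(For illustration, the "stash on queue teardown" pattern described
above, with hypothetical driver fields -- roughly what bnxt does with
net_stats_prev:)

/* On reconfiguration, fold the counters of the queues being freed
 * into device-lifetime totals kept in the driver private struct.
 */
static void foo_stash_queue_stats(struct foo_priv *fp)
{
	int i;

	for (i = 0; i < fp->num_rx_rings; i++) {
		fp->base_rx.packets += fp->rx_ring[i].packets;
		fp->base_rx.bytes += fp->rx_ring[i].bytes;
	}
}

/* The stash is then reported as the base, so base + Sum(live queues)
 * stays correct across queue count changes.
 */
static void foo_get_base_stats(struct net_device *dev,
			       struct netdev_queue_stats_rx *rx,
			       struct netdev_queue_stats_tx *tx)
{
	struct foo_priv *fp = netdev_priv(dev);

	*rx = fp->base_rx;
	*tx = fp->base_tx;
}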


* Re: [PATCH net-next 1/3] netdev: add per-queue statistics
  2024-02-27 15:00     ` Jakub Kicinski
@ 2024-02-27 16:17       ` Przemek Kitszel
  2024-02-27 23:01         ` Jakub Kicinski
  0 siblings, 1 reply; 14+ messages in thread
From: Przemek Kitszel @ 2024-02-27 16:17 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, amritha.nambiar, danielj, mst,
	michael.chan, sdf, vadim.fedorenko

On 2/27/24 16:00, Jakub Kicinski wrote:
> On Tue, 27 Feb 2024 11:29:02 +0100 Przemek Kitszel wrote:
>>> + * Device drivers are encouraged to reset the per-queue statistics when
>>> + * number of queues change. This is because the primary use case for
>>> + * per-queue statistics is currently to detect traffic imbalance.
>>
>> I get it, but encouraging users to reset those on queue-count-change
>> seems to cover that case too. I'm fine though :P
> 
> What do you mean? Did I encourage the users somewhere?

I mean that instead of 'driver should reset on q num change' we could
have 'user should reset stats if they want them zeroed' :)

but this is not a strong opinion


* Re: [PATCH net-next 1/3] netdev: add per-queue statistics
  2024-02-27 15:24         ` Jakub Kicinski
@ 2024-02-27 18:09           ` Stanislav Fomichev
  2024-02-27 19:49           ` Nambiar, Amritha
  1 sibling, 0 replies; 14+ messages in thread
From: Stanislav Fomichev @ 2024-02-27 18:09 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, amritha.nambiar, danielj, mst,
	michael.chan, vadim.fedorenko

On 02/27, Jakub Kicinski wrote:
> On Mon, 26 Feb 2024 19:37:04 -0800 Stanislav Fomichev wrote:
> > On 02/26, Jakub Kicinski wrote:
> > > On Mon, 26 Feb 2024 13:35:34 -0800 Stanislav Fomichev wrote:  
> > > > IIUC, in order to get netdev-scoped stats in v1 (vs rfc) is to not set
> > > > stats-scope, right? Any reason we dropped the explicit netdev entry?
> > > > It seems more robust with a separate entry and removes the ambiguity about
> > > > which stats we're querying.  
> > > 
> > > The change is because I switched from enum to flags.
> > > 
> > > I'm not 100% sure which one is going to cause fewer issues down
> > > the line. It's a question of whether the next scope we add will 
> > > be disjoint with or subdividing previous scopes.
> > > 
> > > I think only subdividing previous scopes makes sense. If we were 
> > > to add "stats per NAPI" (bad example) or "per buffer pool" or IDK what
> > > other thing -- we should expose that as a new netlink command, not mix 
> > > it with the queues.
> > > 
> > > The expectation is that scopes will be extended with hw vs sw, or
> > > per-CPU (e.g. page pool recycling). In which case we'll want flags,
> > > so that we can combine them -- ask for HW stats for a queue or hw
> > > stats for the entire netdev.
> > > 
> > > Perhaps I should rename stats -> queue-stats to make this more explicit?
> > > 
> > > The initial version I wrote could iterate both over NAPIs and
> > > queues. This could be helpful to some drivers - but I realized that it
> > > would lead to rather painful user experience (does the driver maintain
> > > stats per NAPI or per queue?) and tricky implementation of the device
> > > level sum (device stats = Sum(queue) or Sum(queue) + Sum(NAPI)??)  
> > 
> > Yeah, same, not sure. The flags may be more flexible but a bit harder
> > wrt discoverability. Assuming a somewhat ignorant spec reader/user,
> > it might be hard to say which flags makes sense to combine and which isn't.
> > Or, I guess, we can try to document it?
> 
> We're talking about driver API here, so document and enforce in code
> review :) But fundamentally, I don't think we should be turning this op
> into a mux for all sort of stats. We can have 64k ops in the family.
> 
> > For HW vs SW, do you think it makes sense to expose it as a scope?
> > Why not have something like 'rx-packets' and 'hw-rx-packets'?
> 
> I had that in one of the WIP versions but (a) a lot of the stats can 
> be maintained by either device or the driver, so we'd end up with a hw-
> flavor for most of the entries, and (b) 90% of the time the user will
> not care whether it's the HW or SW that counted the bytes, or GSO
> segments. Similarly to how most of the users will not care about
> per-queue breakdown, TBH, which made me think that from user
> perspective both queue and hw vs sw are just a form of detailed
> breakdown. Majority will dump the combined sw|hw stats for the device.
> 
> I could be wrong.
> 
> > Maybe, as you're suggesting, we should rename stats to queue-states
> > and drop the score for now? When the time comes to add hw counters,
> > we can revisit. For total netdev stats, we can ask the user to aggregate
> > the per-queue ones?
> 
> I'd keep the scope, and ability to show the device level aggregation.
> There are drivers (bnxt, off the top of my head, but I feel like there's
> more) which stash the counters when queues get freed. Without the device
> level aggregation we'd need to expose that as "no queue" or "history"
> or "delta" etc stats. I think that's uglier that showing the sum, which
> is what user will care about 99% of the time.
> 
> It'd be a pure rename.

Ack, sounds fair!


* Re: [PATCH net-next 1/3] netdev: add per-queue statistics
  2024-02-27 15:24         ` Jakub Kicinski
  2024-02-27 18:09           ` Stanislav Fomichev
@ 2024-02-27 19:49           ` Nambiar, Amritha
  1 sibling, 0 replies; 14+ messages in thread
From: Nambiar, Amritha @ 2024-02-27 19:49 UTC (permalink / raw)
  To: Jakub Kicinski, Stanislav Fomichev
  Cc: davem, netdev, edumazet, pabeni, danielj, mst, michael.chan,
	vadim.fedorenko

On 2/27/2024 7:24 AM, Jakub Kicinski wrote:
> On Mon, 26 Feb 2024 19:37:04 -0800 Stanislav Fomichev wrote:
>> On 02/26, Jakub Kicinski wrote:
>>> On Mon, 26 Feb 2024 13:35:34 -0800 Stanislav Fomichev wrote:
>>>> IIUC, in order to get netdev-scoped stats in v1 (vs rfc) is to not set
>>>> stats-scope, right? Any reason we dropped the explicit netdev entry?
>>>> It seems more robust with a separate entry and removes the ambiguity about
>>>> which stats we're querying.
>>>
>>> The change is because I switched from enum to flags.
>>>
>>> I'm not 100% sure which one is going to cause fewer issues down
>>> the line. It's a question of whether the next scope we add will
>>> be disjoint with or subdividing previous scopes.
>>>
>>> I think only subdividing previous scopes makes sense. If we were
>>> to add "stats per NAPI" (bad example) or "per buffer pool" or IDK what
>>> other thing -- we should expose that as a new netlink command, not mix
>>> it with the queues.
>>>
>>> The expectation is that scopes will be extended with hw vs sw, or
>>> per-CPU (e.g. page pool recycling). In which case we'll want flags,
>>> so that we can combine them -- ask for HW stats for a queue or hw
>>> stats for the entire netdev.
>>>
>>> Perhaps I should rename stats -> queue-stats to make this more explicit?
>>>
>>> The initial version I wrote could iterate both over NAPIs and
>>> queues. This could be helpful to some drivers - but I realized that it
>>> would lead to rather painful user experience (does the driver maintain
>>> stats per NAPI or per queue?) and tricky implementation of the device
>>> level sum (device stats = Sum(queue) or Sum(queue) + Sum(NAPI)??)
>>
>> Yeah, same, not sure. The flags may be more flexible but a bit harder
>> wrt discoverability. Assuming a somewhat ignorant spec reader/user,
>> it might be hard to say which flags makes sense to combine and which isn't.
>> Or, I guess, we can try to document it?
> 
> We're talking about driver API here, so document and enforce in code
> review :) But fundamentally, I don't think we should be turning this op
> into a mux for all sort of stats. We can have 64k ops in the family.
> 
>> For HW vs SW, do you think it makes sense to expose it as a scope?
>> Why not have something like 'rx-packets' and 'hw-rx-packets'?
> 
> I had that in one of the WIP versions but (a) a lot of the stats can
> be maintained by either device or the driver, so we'd end up with a hw-
> flavor for most of the entries, and (b) 90% of the time the user will
> not care whether it's the HW or SW that counted the bytes, or GSO
> segments. Similarly to how most of the users will not care about
> per-queue breakdown, TBH, which made me think that from user
> perspective both queue and hw vs sw are just a form of detailed
> breakdown. Majority will dump the combined sw|hw stats for the device.
> 
> I could be wrong.
> 

I think the per-queue breakdown would be useful as well, especially in
use cases where there are filters directing traffic to different queues.

>> Maybe, as you're suggesting, we should rename stats to queue-states
>> and drop the score for now? When the time comes to add hw counters,
>> we can revisit. For total netdev stats, we can ask the user to aggregate
>> the per-queue ones?
> 
> I'd keep the scope, and ability to show the device level aggregation.
> There are drivers (bnxt, off the top of my head, but I feel like there's
> more) which stash the counters when queues get freed. Without the device
> level aggregation we'd need to expose that as "no queue" or "history"
> or "delta" etc stats. I think that's uglier that showing the sum, which
> is what user will care about 99% of the time.
> 

+1 to retaining the scope and device level aggregation to reduce ambiguity.

> It'd be a pure rename.


* Re: [PATCH net-next 1/3] netdev: add per-queue statistics
  2024-02-27 16:17       ` Przemek Kitszel
@ 2024-02-27 23:01         ` Jakub Kicinski
  0 siblings, 0 replies; 14+ messages in thread
From: Jakub Kicinski @ 2024-02-27 23:01 UTC (permalink / raw)
  To: Przemek Kitszel
  Cc: davem, netdev, edumazet, pabeni, amritha.nambiar, danielj, mst,
	michael.chan, sdf, vadim.fedorenko

On Tue, 27 Feb 2024 17:17:38 +0100 Przemek Kitszel wrote:
> >> I get it, but encouraging users to reset those on queue-count-change
> >> seems to cover that case too. I'm fine though :P  
> > 
> > What do you mean? Did I encourage the users somewhere?  
> 
> I mean that instead of 'driver should reset on q num change' we could
> have 'user should reset stats if wants them zeroed' :)
> 
> but this is not a strong opinion

Let's revisit the recommendation once we actually have that API for
resetting? :)


