linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH rdma-next 00/10] Optional counter statistics support
@ 2021-08-18 11:24 Mark Zhang
  2021-08-18 11:24 ` [PATCH rdma-next 01/10] net/mlx5: Add support in bth_opcode as a match criteria Mark Zhang
                   ` (10 more replies)
  0 siblings, 11 replies; 23+ messages in thread
From: Mark Zhang @ 2021-08-18 11:24 UTC (permalink / raw)
  To: jgg, dledford, saeedm
  Cc: linux-rdma, netdev, aharonl, netao, leonro, Mark Zhang

Hi,

This series from Aharon and Neta provides an extension to the rdma
statistics tool that allows to add and remove optional counters
dynamically, using new netlink commands.

The idea of having optional counters is to provide to the users the
ability to get statistics of counters that hurts performance.

Once an optional counter was added, its statistics will be presented
along with all the counters, using the show command.

Binding objects to the optional counters is currently not supported,
neither in auto mode nor in manual mode.

To get the list of optional counters that are supported on this device,
use "rdma statistic mode supported". To see which counters are currently
enabled, use "rdma statistic mode".

$ rdma statistic mode supported
link rocep8s0f0/1
    Optional-set: cc_rx_ce_pkts cc_rx_cnp_pkts cc_tx_cnp_pkts
link rocep8s0f1/1
    Optional-set: cc_rx_ce_pkts cc_rx_cnp_pkts cc_tx_cnp_pkts

$ sudo rdma statistic add link rocep8s0f0/1 optional-set cc_rx_ce_pkts
$ rdma statistic mode
link rocep8s0f0/1
    Optional-set: cc_rx_ce_pkts
$ sudo rdma statistic add link rocep8s0f0/1 optional-set cc_tx_cnp_pkts
$ rdma statistic mode
link rocep8s0f0/1
    Optional-set: cc_rx_ce_pkts cc_tx_cnp_pkts

$ rdma statistic show link rocep8s0f0/1
link rocep8s0f0/1 rx_write_requests 0 rx_read_requests 0 rx_atomic_requests 0 out_of_buffer 0
out_of_sequence 0 duplicate_request 0 rnr_nak_retry_err 0 packet_seq_err 0 implied_nak_seq_err 0
local_ack_timeout_err 0 resp_local_length_error 0 resp_cqe_error 0 req_cqe_error 0
req_remote_invalid_request 0 req_remote_access_errors 0 resp_remote_access_errors 0
resp_cqe_flush_error 0 req_cqe_flush_error 0 roce_adp_retrans 0 roce_adp_retrans_to 0
roce_slow_restart 0 roce_slow_restart_cnps 0 roce_slow_restart_trans 0 rp_cnp_ignored 0
rp_cnp_handled 0 np_ecn_marked_roce_packets 0 np_cnp_sent 0 rx_icrc_encapsulated 0
    Optional-set: cc_rx_ce_pkts 0 cc_tx_cnp_pkts 0

$ sudo rdma statistic remove link rocep8s0f0/1 optional-set cc_rx_ce_pkts
$ sudo rdma statistic remove link rocep8s0f0/1 optional-set cc_tx_cnp_pkts

Thanks

Aharon Landau (9):
  net/mlx5: Add support in bth_opcode as a match criteria
  net/mlx5: Add priorities for counters in RDMA namespaces
  RDMA/counters: Support to allocate per-port optional counter
    statistics
  RDMA/mlx5: Add alloc_op_port_stats() support
  RDMA/mlx5: Add steering support in optional flow counters
  RDMA/nldev: Add support to add and remove optional counters
  RDMA/mlx5: Add add_op_stat() and remove_op_stat() support
  RDMA/mlx5: Add get_op_stats() support
  RDMA/nldev: Add support to get current enabled optional counters

Neta Ostrovsky (1):
  RDMA/nldev: Add support to get optional counters statistics

 drivers/infiniband/core/counters.c            |  86 +++++
 drivers/infiniband/core/device.c              |   4 +
 drivers/infiniband/core/nldev.c               | 297 ++++++++++++++++--
 drivers/infiniband/hw/mlx5/counters.c         | 157 ++++++++-
 drivers/infiniband/hw/mlx5/fs.c               | 111 +++++++
 drivers/infiniband/hw/mlx5/mlx5_ib.h          |  21 ++
 .../net/ethernet/mellanox/mlx5/core/fs_core.c |  54 +++-
 include/linux/mlx5/device.h                   |   2 +
 include/linux/mlx5/fs.h                       |   2 +
 include/linux/mlx5/mlx5_ifc.h                 |  20 +-
 include/rdma/ib_hdrs.h                        |   1 +
 include/rdma/ib_verbs.h                       |  36 +++
 include/rdma/rdma_counter.h                   |   8 +
 include/rdma/rdma_netlink.h                   |   1 +
 include/uapi/rdma/rdma_netlink.h              |  11 +
 15 files changed, 775 insertions(+), 36 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH rdma-next 01/10] net/mlx5: Add support in bth_opcode as a match criteria
  2021-08-18 11:24 [PATCH rdma-next 00/10] Optional counter statistics support Mark Zhang
@ 2021-08-18 11:24 ` Mark Zhang
  2021-08-18 11:24 ` [PATCH rdma-next 02/10] net/mlx5: Add priorities for counters in RDMA namespaces Mark Zhang
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Mark Zhang @ 2021-08-18 11:24 UTC (permalink / raw)
  To: jgg, dledford, saeedm
  Cc: linux-rdma, netdev, aharonl, netao, leonro, Mark Zhang

From: Aharon Landau <aharonl@nvidia.com>

Adding bth_opcode field and the relevant bits.
This field will be used to capture and count cnp packets.

Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
---
 include/linux/mlx5/mlx5_ifc.h | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 3dd6641e942c..2828d596af49 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -393,6 +393,14 @@ struct mlx5_ifc_flow_table_fields_supported_bits {
 	u8         metadata_reg_c_0[0x1];
 };
 
+struct mlx5_ifc_flow_table_fields_supported_2_bits {
+	u8         reserved_at_0[0xe];
+	u8         bth_opcode[0x1];
+	u8         reserved_at_f[0x11];
+
+	u8         reserved_at_20[0x60];
+};
+
 struct mlx5_ifc_flow_table_prop_layout_bits {
 	u8         ft_support[0x1];
 	u8         reserved_at_1[0x1];
@@ -539,7 +547,7 @@ struct mlx5_ifc_fte_match_set_misc_bits {
 	union mlx5_ifc_gre_key_bits gre_key;
 
 	u8         vxlan_vni[0x18];
-	u8         reserved_at_b8[0x8];
+	u8         bth_opcode[0x8];
 
 	u8         geneve_vni[0x18];
 	u8         reserved_at_d8[0x7];
@@ -756,7 +764,15 @@ struct mlx5_ifc_flow_table_nic_cap_bits {
 
 	struct mlx5_ifc_flow_table_prop_layout_bits flow_table_properties_nic_transmit_sniffer;
 
-	u8         reserved_at_e00[0x1200];
+	u8         reserved_at_e00[0x700];
+
+	struct mlx5_ifc_flow_table_fields_supported_2_bits ft_field_support_2_nic_receive_rdma;
+
+	u8         reserved_at_1580[0x280];
+
+	struct mlx5_ifc_flow_table_fields_supported_2_bits ft_field_support_2_nic_transmit_rdma;
+
+	u8         reserved_at_1880[0x780];
 
 	u8         sw_steering_nic_rx_action_drop_icm_address[0x40];
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH rdma-next 02/10] net/mlx5: Add priorities for counters in RDMA namespaces
  2021-08-18 11:24 [PATCH rdma-next 00/10] Optional counter statistics support Mark Zhang
  2021-08-18 11:24 ` [PATCH rdma-next 01/10] net/mlx5: Add support in bth_opcode as a match criteria Mark Zhang
@ 2021-08-18 11:24 ` Mark Zhang
  2021-08-18 11:24 ` [PATCH rdma-next 03/10] RDMA/counters: Support to allocate per-port optional counter statistics Mark Zhang
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Mark Zhang @ 2021-08-18 11:24 UTC (permalink / raw)
  To: jgg, dledford, saeedm
  Cc: linux-rdma, netdev, aharonl, netao, leonro, Mark Zhang

From: Aharon Landau <aharonl@nvidia.com>

Add additional flow steering priorities in the RDMA namespace.
This allows adding flow counters to count filtered RDMA traffic and then
continue processing in the regular RDMA steering flow.

Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
---
 .../net/ethernet/mellanox/mlx5/core/fs_core.c | 54 ++++++++++++++++---
 include/linux/mlx5/device.h                   |  2 +
 include/linux/mlx5/fs.h                       |  2 +
 3 files changed, 50 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index d7bf0a3e4a52..f9064661f2ec 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -99,6 +99,9 @@
 #define LEFTOVERS_NUM_LEVELS 1
 #define LEFTOVERS_NUM_PRIOS 1
 
+#define RDMA_RX_COUNTERS_PRIO_NUM_LEVELS 1
+#define RDMA_TX_COUNTERS_PRIO_NUM_LEVELS 1
+
 #define BY_PASS_PRIO_NUM_LEVELS 1
 #define BY_PASS_MIN_LEVEL (ETHTOOL_MIN_LEVEL + MLX5_BY_PASS_NUM_PRIOS +\
 			   LEFTOVERS_NUM_PRIOS)
@@ -206,34 +209,63 @@ static struct init_tree_node egress_root_fs = {
 	}
 };
 
-#define RDMA_RX_BYPASS_PRIO 0
-#define RDMA_RX_KERNEL_PRIO 1
+enum {
+	RDMA_RX_COUNTERS_PRIO,
+	RDMA_RX_BYPASS_PRIO,
+	RDMA_RX_KERNEL_PRIO,
+};
+
+#define RDMA_RX_BYPASS_MIN_LEVEL MLX5_BY_PASS_NUM_REGULAR_PRIOS
+#define RDMA_RX_KERNEL_MIN_LEVEL (RDMA_RX_BYPASS_MIN_LEVEL + 1)
+#define RDMA_RX_COUNTERS_MIN_LEVEL (RDMA_RX_KERNEL_MIN_LEVEL + 2)
+
 static struct init_tree_node rdma_rx_root_fs = {
 	.type = FS_TYPE_NAMESPACE,
-	.ar_size = 2,
+	.ar_size = 3,
 	.children = (struct init_tree_node[]) {
+		[RDMA_RX_COUNTERS_PRIO] =
+		ADD_PRIO(0, RDMA_RX_COUNTERS_MIN_LEVEL, 0,
+			 FS_CHAINING_CAPS,
+			 ADD_NS(MLX5_FLOW_TABLE_MISS_ACTION_DEF,
+				ADD_MULTIPLE_PRIO(MLX5_RDMA_RX_NUM_COUNTERS_PRIOS,
+						  RDMA_RX_COUNTERS_PRIO_NUM_LEVELS))),
 		[RDMA_RX_BYPASS_PRIO] =
-		ADD_PRIO(0, MLX5_BY_PASS_NUM_REGULAR_PRIOS, 0,
+		ADD_PRIO(0, RDMA_RX_BYPASS_MIN_LEVEL, 0,
 			 FS_CHAINING_CAPS,
 			 ADD_NS(MLX5_FLOW_TABLE_MISS_ACTION_DEF,
 				ADD_MULTIPLE_PRIO(MLX5_BY_PASS_NUM_REGULAR_PRIOS,
 						  BY_PASS_PRIO_NUM_LEVELS))),
 		[RDMA_RX_KERNEL_PRIO] =
-		ADD_PRIO(0, MLX5_BY_PASS_NUM_REGULAR_PRIOS + 1, 0,
+		ADD_PRIO(0, RDMA_RX_KERNEL_MIN_LEVEL, 0,
 			 FS_CHAINING_CAPS,
 			 ADD_NS(MLX5_FLOW_TABLE_MISS_ACTION_SWITCH_DOMAIN,
 				ADD_MULTIPLE_PRIO(1, 1))),
 	}
 };
 
+enum {
+	RDMA_TX_COUNTERS_PRIO,
+	RDMA_TX_BYPASS_PRIO,
+};
+
+#define RDMA_TX_BYPASS_MIN_LEVEL MLX5_BY_PASS_NUM_PRIOS
+#define RDMA_TX_COUNTERS_MIN_LEVEL (RDMA_TX_BYPASS_MIN_LEVEL + 1)
+
 static struct init_tree_node rdma_tx_root_fs = {
 	.type = FS_TYPE_NAMESPACE,
-	.ar_size = 1,
+	.ar_size = 2,
 	.children = (struct init_tree_node[]) {
-		ADD_PRIO(0, MLX5_BY_PASS_NUM_PRIOS, 0,
+		[RDMA_TX_COUNTERS_PRIO] =
+		ADD_PRIO(0, RDMA_TX_COUNTERS_MIN_LEVEL, 0,
+			 FS_CHAINING_CAPS,
+			 ADD_NS(MLX5_FLOW_TABLE_MISS_ACTION_DEF,
+				ADD_MULTIPLE_PRIO(MLX5_RDMA_TX_NUM_COUNTERS_PRIOS,
+						  RDMA_TX_COUNTERS_PRIO_NUM_LEVELS))),
+		[RDMA_TX_BYPASS_PRIO] =
+		ADD_PRIO(0, RDMA_TX_BYPASS_MIN_LEVEL, 0,
 			 FS_CHAINING_CAPS_RDMA_TX,
 			 ADD_NS(MLX5_FLOW_TABLE_MISS_ACTION_DEF,
-				ADD_MULTIPLE_PRIO(MLX5_BY_PASS_NUM_PRIOS,
+				ADD_MULTIPLE_PRIO(RDMA_TX_BYPASS_MIN_LEVEL,
 						  BY_PASS_PRIO_NUM_LEVELS))),
 	}
 };
@@ -2212,6 +2244,12 @@ struct mlx5_flow_namespace *mlx5_get_flow_namespace(struct mlx5_core_dev *dev,
 	} else if (type == MLX5_FLOW_NAMESPACE_RDMA_RX_KERNEL) {
 		root_ns = steering->rdma_rx_root_ns;
 		prio = RDMA_RX_KERNEL_PRIO;
+	} else if (type == MLX5_FLOW_NAMESPACE_RDMA_RX_COUNTERS) {
+		root_ns = steering->rdma_rx_root_ns;
+		prio = RDMA_RX_COUNTERS_PRIO;
+	} else if (type == MLX5_FLOW_NAMESPACE_RDMA_TX_COUNTERS) {
+		root_ns = steering->rdma_tx_root_ns;
+		prio = RDMA_TX_COUNTERS_PRIO;
 	} else if (type == MLX5_FLOW_NAMESPACE_RDMA_TX) {
 		root_ns = steering->rdma_tx_root_ns;
 	} else { /* Must be NIC RX */
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 0025913505ab..c2c0380fd608 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -1453,6 +1453,8 @@ static inline u16 mlx5_to_sw_pkey_sz(int pkey_sz)
 	return MLX5_MIN_PKEY_TABLE_SIZE << pkey_sz;
 }
 
+#define MLX5_RDMA_RX_NUM_COUNTERS_PRIOS 2
+#define MLX5_RDMA_TX_NUM_COUNTERS_PRIOS 1
 #define MLX5_BY_PASS_NUM_REGULAR_PRIOS 16
 #define MLX5_BY_PASS_NUM_DONT_TRAP_PRIOS 16
 #define MLX5_BY_PASS_NUM_MULTICAST_PRIOS 1
diff --git a/include/linux/mlx5/fs.h b/include/linux/mlx5/fs.h
index 77746f7e35b8..39722897e5c7 100644
--- a/include/linux/mlx5/fs.h
+++ b/include/linux/mlx5/fs.h
@@ -81,6 +81,8 @@ enum mlx5_flow_namespace_type {
 	MLX5_FLOW_NAMESPACE_RDMA_RX,
 	MLX5_FLOW_NAMESPACE_RDMA_RX_KERNEL,
 	MLX5_FLOW_NAMESPACE_RDMA_TX,
+	MLX5_FLOW_NAMESPACE_RDMA_RX_COUNTERS,
+	MLX5_FLOW_NAMESPACE_RDMA_TX_COUNTERS,
 };
 
 enum {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH rdma-next 03/10] RDMA/counters: Support to allocate per-port optional counter statistics
  2021-08-18 11:24 [PATCH rdma-next 00/10] Optional counter statistics support Mark Zhang
  2021-08-18 11:24 ` [PATCH rdma-next 01/10] net/mlx5: Add support in bth_opcode as a match criteria Mark Zhang
  2021-08-18 11:24 ` [PATCH rdma-next 02/10] net/mlx5: Add priorities for counters in RDMA namespaces Mark Zhang
@ 2021-08-18 11:24 ` Mark Zhang
  2021-08-23 19:30   ` Jason Gunthorpe
  2021-08-18 11:24 ` [PATCH rdma-next 04/10] RDMA/mlx5: Add alloc_op_port_stats() support Mark Zhang
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Mark Zhang @ 2021-08-18 11:24 UTC (permalink / raw)
  To: jgg, dledford, saeedm
  Cc: linux-rdma, netdev, aharonl, netao, leonro, Mark Zhang

From: Aharon Landau <aharonl@nvidia.com>

Add an alloc_op_port_stats() API, as well as related structures, to support
per-port op_stats allocation during counter module initialization.

Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
---
 drivers/infiniband/core/counters.c | 18 ++++++++++++++++++
 drivers/infiniband/core/device.c   |  1 +
 include/rdma/ib_verbs.h            | 24 ++++++++++++++++++++++++
 include/rdma/rdma_counter.h        |  1 +
 4 files changed, 44 insertions(+)

diff --git a/drivers/infiniband/core/counters.c b/drivers/infiniband/core/counters.c
index df9e6c5e4ddf..b8b6db98bfdf 100644
--- a/drivers/infiniband/core/counters.c
+++ b/drivers/infiniband/core/counters.c
@@ -611,6 +611,15 @@ void rdma_counter_init(struct ib_device *dev)
 		port_counter->hstats = dev->ops.alloc_hw_port_stats(dev, port);
 		if (!port_counter->hstats)
 			goto fail;
+
+		if (dev->ops.alloc_op_port_stats) {
+			port_counter->opstats =
+				dev->ops.alloc_op_port_stats(dev, port);
+			if (!port_counter->opstats)
+				goto fail;
+
+			mutex_init(&port_counter->opstats->lock);
+		}
 	}
 
 	return;
@@ -618,6 +627,11 @@ void rdma_counter_init(struct ib_device *dev)
 fail:
 	for (i = port; i >= rdma_start_port(dev); i--) {
 		port_counter = &dev->port_data[port].port_counter;
+		if (port_counter && port_counter->opstats) {
+			mutex_destroy(&port_counter->opstats->lock);
+			kfree(port_counter->opstats);
+			port_counter->opstats = NULL;
+		}
 		kfree(port_counter->hstats);
 		port_counter->hstats = NULL;
 		mutex_destroy(&port_counter->lock);
@@ -631,6 +645,10 @@ void rdma_counter_release(struct ib_device *dev)
 
 	rdma_for_each_port(dev, port) {
 		port_counter = &dev->port_data[port].port_counter;
+		if (port_counter && port_counter->opstats) {
+			mutex_destroy(&port_counter->opstats->lock);
+			kfree(port_counter->opstats);
+		}
 		kfree(port_counter->hstats);
 		mutex_destroy(&port_counter->lock);
 	}
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index f4814bb7f082..23e1ae50b2e4 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -2597,6 +2597,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
 	SET_DEVICE_OP(dev_ops, alloc_mr);
 	SET_DEVICE_OP(dev_ops, alloc_mr_integrity);
 	SET_DEVICE_OP(dev_ops, alloc_mw);
+	SET_DEVICE_OP(dev_ops, alloc_op_port_stats);
 	SET_DEVICE_OP(dev_ops, alloc_pd);
 	SET_DEVICE_OP(dev_ops, alloc_rdma_netdev);
 	SET_DEVICE_OP(dev_ops, alloc_ucontext);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index aa7806335cba..e8763d336df1 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -598,6 +598,28 @@ static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct(
 	return stats;
 }
 
+/**
+ * struct rdma_op_counter
+ * @name - The name of the counter
+ * @value - The value of the counter
+ */
+struct rdma_op_counter {
+	const char *name;
+	u64 value;
+};
+
+/**
+ * struct rdma_op_stats
+ * @lock - Mutex to protect parallel write access to opstats of counters
+ * @num_opcounters - How many optional counters there are
+ * @opcounters - Array of optional counters that are filled in by the drivers
+ */
+struct rdma_op_stats {
+	/* Hold this mutex when accessing optional counter */
+	struct mutex lock;
+	int num_opcounters;
+	struct rdma_op_counter opcounters[];
+};
 
 /* Define bits for the various functionality this port needs to be supported by
  * the core.
@@ -2568,6 +2590,8 @@ struct ib_device_ops {
 	 */
 	int (*get_hw_stats)(struct ib_device *device,
 			    struct rdma_hw_stats *stats, u32 port, int index);
+	struct rdma_op_stats *(*alloc_op_port_stats)(struct ib_device *device,
+						     u32 port_num);
 
 	/**
 	 * Allows rdma drivers to add their own restrack attributes.
diff --git a/include/rdma/rdma_counter.h b/include/rdma/rdma_counter.h
index 0295b22cd1cd..3531c5061718 100644
--- a/include/rdma/rdma_counter.h
+++ b/include/rdma/rdma_counter.h
@@ -29,6 +29,7 @@ struct rdma_port_counter {
 	struct rdma_counter_mode mode;
 	struct rdma_hw_stats *hstats;
 	unsigned int num_counters;
+	struct rdma_op_stats *opstats;
 	struct mutex lock;
 };
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH rdma-next 04/10] RDMA/mlx5: Add alloc_op_port_stats() support
  2021-08-18 11:24 [PATCH rdma-next 00/10] Optional counter statistics support Mark Zhang
                   ` (2 preceding siblings ...)
  2021-08-18 11:24 ` [PATCH rdma-next 03/10] RDMA/counters: Support to allocate per-port optional counter statistics Mark Zhang
@ 2021-08-18 11:24 ` Mark Zhang
  2021-08-23 19:19   ` Jason Gunthorpe
  2021-08-18 11:24 ` [PATCH rdma-next 05/10] RDMA/mlx5: Add steering support in optional flow counters Mark Zhang
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Mark Zhang @ 2021-08-18 11:24 UTC (permalink / raw)
  To: jgg, dledford, saeedm
  Cc: linux-rdma, netdev, aharonl, netao, leonro, Mark Zhang

From: Aharon Landau <aharonl@nvidia.com>

Add support for ib callback alloc_op_port_stats().

Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
---
 drivers/infiniband/hw/mlx5/counters.c | 71 ++++++++++++++++++++++++++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |  8 +++
 include/rdma/ib_verbs.h               |  2 +
 3 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/counters.c b/drivers/infiniband/hw/mlx5/counters.c
index 224ba36f2946..5b4b446727d1 100644
--- a/drivers/infiniband/hw/mlx5/counters.c
+++ b/drivers/infiniband/hw/mlx5/counters.c
@@ -187,6 +187,72 @@ mlx5_ib_alloc_hw_port_stats(struct ib_device *ibdev, u32 port_num)
 					  RDMA_HW_STATS_DEFAULT_LIFESPAN);
 }
 
+struct mlx5_ib_opcounter {
+	const char *name;
+	enum mlx5_ib_optional_counter_type type;
+};
+
+static const struct mlx5_ib_opcounter basic_op_cnts[] = {
+	{.name = "cc_rx_ce_pkts", .type = MLX5_IB_OPCOUNTER_CC_RX_CE_PKTS},
+};
+
+static const struct mlx5_ib_opcounter rdmarx_cnp_op_cnts[] = {
+	{.name = "cc_rx_cnp_pkts", .type = MLX5_IB_OPCOUNTER_CC_RX_CNP_PKTS},
+};
+
+static const struct mlx5_ib_opcounter rdmatx_cnp_op_cnts[] = {
+	{.name = "cc_tx_cnp_pkts", .type = MLX5_IB_OPCOUNTER_CC_TX_CNP_PKTS},
+};
+
+static struct rdma_op_stats *
+mlx5_ib_alloc_op_port_stats(struct ib_device *ibdev, u32 port_num)
+{
+	struct rdma_op_stats *opstats;
+	struct mlx5_ib_dev *dev = to_mdev(ibdev);
+	int num_opcounters, i, j = 0;
+
+	num_opcounters = ARRAY_SIZE(basic_op_cnts);
+
+	if (MLX5_CAP_FLOWTABLE(dev->mdev,
+			       ft_field_support_2_nic_receive_rdma.bth_opcode))
+		num_opcounters += ARRAY_SIZE(rdmarx_cnp_op_cnts);
+
+	if (MLX5_CAP_FLOWTABLE(dev->mdev,
+			       ft_field_support_2_nic_transmit_rdma.bth_opcode))
+		num_opcounters += ARRAY_SIZE(rdmatx_cnp_op_cnts);
+
+	opstats = kzalloc(sizeof(*opstats) +
+			  num_opcounters * sizeof(struct rdma_op_counter),
+			  GFP_KERNEL);
+	if (!opstats)
+		return NULL;
+
+	for (i = 0; i < ARRAY_SIZE(basic_op_cnts); i++, j++) {
+		opstats->opcounters[j].name = basic_op_cnts[i].name;
+		opstats->opcounters[j].type = basic_op_cnts[i].type;
+	}
+
+	if (MLX5_CAP_FLOWTABLE(dev->mdev,
+			       ft_field_support_2_nic_receive_rdma.bth_opcode)) {
+		for (i = 0; i < ARRAY_SIZE(rdmarx_cnp_op_cnts); i++, j++) {
+			opstats->opcounters[j].name = rdmarx_cnp_op_cnts[i].name;
+			opstats->opcounters[j].type = rdmarx_cnp_op_cnts[i].type;
+		}
+	}
+
+	if (MLX5_CAP_FLOWTABLE(dev->mdev,
+			       ft_field_support_2_nic_transmit_rdma.bth_opcode)) {
+		for (i = 0; i < ARRAY_SIZE(rdmatx_cnp_op_cnts); i++, j++) {
+			opstats->opcounters[j].name = rdmatx_cnp_op_cnts[i].name;
+			opstats->opcounters[j].type = rdmatx_cnp_op_cnts[i].type;
+		}
+	}
+
+	opstats->num_opcounters = num_opcounters;
+
+	return opstats;
+}
+
 static int mlx5_ib_query_q_counters(struct mlx5_core_dev *mdev,
 				    const struct mlx5_ib_counters *cnts,
 				    struct rdma_hw_stats *stats,
@@ -672,8 +738,9 @@ void mlx5_ib_counters_clear_description(struct ib_counters *counters)
 	mutex_unlock(&mcounters->mcntrs_mutex);
 }
 
-static const struct ib_device_ops hw_stats_ops = {
+static const struct ib_device_ops stats_ops = {
 	.alloc_hw_port_stats = mlx5_ib_alloc_hw_port_stats,
+	.alloc_op_port_stats = mlx5_ib_alloc_op_port_stats,
 	.get_hw_stats = mlx5_ib_get_hw_stats,
 	.counter_bind_qp = mlx5_ib_counter_bind_qp,
 	.counter_unbind_qp = mlx5_ib_counter_unbind_qp,
@@ -710,7 +777,7 @@ int mlx5_ib_counters_init(struct mlx5_ib_dev *dev)
 	if (is_mdev_switchdev_mode(dev->mdev))
 		ib_set_device_ops(&dev->ib_dev, &hw_switchdev_stats_ops);
 	else
-		ib_set_device_ops(&dev->ib_dev, &hw_stats_ops);
+		ib_set_device_ops(&dev->ib_dev, &stats_ops);
 	return mlx5_ib_alloc_counters(dev);
 }
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index bf20a388eabe..2ba352702294 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -797,6 +797,14 @@ struct mlx5_ib_resources {
 	struct mlx5_ib_port_resources ports[2];
 };
 
+enum mlx5_ib_optional_counter_type {
+	MLX5_IB_OPCOUNTER_CC_RX_CE_PKTS,
+	MLX5_IB_OPCOUNTER_CC_RX_CNP_PKTS,
+	MLX5_IB_OPCOUNTER_CC_TX_CNP_PKTS,
+
+	MLX5_IB_OPCOUNTER_MAX,
+};
+
 struct mlx5_ib_counters {
 	const char **names;
 	size_t *offsets;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index e8763d336df1..40b0f7825975 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -600,10 +600,12 @@ static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct(
 
 /**
  * struct rdma_op_counter
+ * @type - The vendor-specific type of the counter
  * @name - The name of the counter
  * @value - The value of the counter
  */
 struct rdma_op_counter {
+	int type;
 	const char *name;
 	u64 value;
 };
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH rdma-next 05/10] RDMA/mlx5: Add steering support in optional flow counters
  2021-08-18 11:24 [PATCH rdma-next 00/10] Optional counter statistics support Mark Zhang
                   ` (3 preceding siblings ...)
  2021-08-18 11:24 ` [PATCH rdma-next 04/10] RDMA/mlx5: Add alloc_op_port_stats() support Mark Zhang
@ 2021-08-18 11:24 ` Mark Zhang
  2021-08-18 11:24 ` [PATCH rdma-next 06/10] RDMA/nldev: Add support to add and remove optional counters Mark Zhang
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Mark Zhang @ 2021-08-18 11:24 UTC (permalink / raw)
  To: jgg, dledford, saeedm
  Cc: linux-rdma, netdev, aharonl, netao, leonro, Mark Zhang

From: Aharon Landau <aharonl@nvidia.com>

Adding steering infrastructure for adding and removing optional counter.
This allows to add and remove the counters dynamically in order not to
hurt performance.

Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
---
 drivers/infiniband/hw/mlx5/fs.c      | 111 +++++++++++++++++++++++++++
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  12 +++
 include/rdma/ib_hdrs.h               |   1 +
 3 files changed, 124 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/fs.c b/drivers/infiniband/hw/mlx5/fs.c
index 5fbc0a8454b9..be6a00969ddb 100644
--- a/drivers/infiniband/hw/mlx5/fs.c
+++ b/drivers/infiniband/hw/mlx5/fs.c
@@ -10,12 +10,14 @@
 #include <rdma/uverbs_std_types.h>
 #include <rdma/mlx5_user_ioctl_cmds.h>
 #include <rdma/mlx5_user_ioctl_verbs.h>
+#include <rdma/ib_hdrs.h>
 #include <rdma/ib_umem.h>
 #include <linux/mlx5/driver.h>
 #include <linux/mlx5/fs.h>
 #include <linux/mlx5/fs_helpers.h>
 #include <linux/mlx5/accel.h>
 #include <linux/mlx5/eswitch.h>
+#include <net/inet_ecn.h>
 #include "mlx5_ib.h"
 #include "counters.h"
 #include "devx.h"
@@ -847,6 +849,115 @@ static struct mlx5_ib_flow_prio *get_flow_table(struct mlx5_ib_dev *dev,
 	return prio;
 }
 
+enum {
+	RDMA_RX_ECN_OPCOUNTER_PRIO,
+	RDMA_RX_CNP_OPCOUNTER_PRIO,
+};
+
+enum {
+	RDMA_TX_CNP_OPCOUNTER_PRIO,
+};
+
+int mlx5_ib_fs_add_op_fc(struct mlx5_ib_dev *dev, struct mlx5_ib_op_fc *opfc,
+			 enum mlx5_ib_optional_counter_type type)
+{
+	enum mlx5_flow_namespace_type fn_type;
+	struct mlx5_flow_act flow_act = {};
+	struct mlx5_flow_destination dst;
+	struct mlx5_flow_namespace *ns;
+	struct mlx5_ib_flow_prio *prio;
+	struct mlx5_flow_spec *spec;
+	int priority, err = 0;
+
+	spec = kvzalloc(sizeof(*spec), GFP_KERNEL);
+	if (!spec)
+		return -ENOMEM;
+
+	switch (type) {
+	case MLX5_IB_OPCOUNTER_CC_RX_CE_PKTS:
+		fn_type = MLX5_FLOW_NAMESPACE_RDMA_RX_COUNTERS;
+		priority = RDMA_RX_ECN_OPCOUNTER_PRIO;
+		spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS;
+		MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria,
+				 outer_headers.ip_ecn);
+		MLX5_SET(fte_match_param, spec->match_value,
+			 outer_headers.ip_ecn, INET_ECN_CE);
+		break;
+
+	case MLX5_IB_OPCOUNTER_CC_RX_CNP_PKTS:
+		if (!MLX5_CAP_FLOWTABLE(dev->mdev,
+					ft_field_support_2_nic_receive_rdma.bth_opcode)) {
+			err = -EOPNOTSUPP;
+			goto free;
+		}
+		fn_type = MLX5_FLOW_NAMESPACE_RDMA_RX_COUNTERS;
+		priority = RDMA_RX_CNP_OPCOUNTER_PRIO;
+		spec->match_criteria_enable = MLX5_MATCH_MISC_PARAMETERS;
+		MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria,
+				 misc_parameters.bth_opcode);
+		MLX5_SET(fte_match_param, spec->match_value,
+			 misc_parameters.bth_opcode, IB_BTH_OPCODE_CNP);
+		break;
+
+	case MLX5_IB_OPCOUNTER_CC_TX_CNP_PKTS:
+		if (!MLX5_CAP_FLOWTABLE(dev->mdev,
+					ft_field_support_2_nic_transmit_rdma.bth_opcode)) {
+			err = -EOPNOTSUPP;
+			goto free;
+		}
+		fn_type = MLX5_FLOW_NAMESPACE_RDMA_TX_COUNTERS;
+		priority = RDMA_TX_CNP_OPCOUNTER_PRIO;
+		spec->match_criteria_enable = MLX5_MATCH_MISC_PARAMETERS;
+		MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria,
+				 misc_parameters.bth_opcode);
+		MLX5_SET(fte_match_param, spec->match_value,
+			 misc_parameters.bth_opcode, IB_BTH_OPCODE_CNP);
+		break;
+
+	default:
+		err = -EOPNOTSUPP;
+		goto free;
+	}
+
+	ns = mlx5_get_flow_namespace(dev->mdev, fn_type);
+	if (!ns) {
+		err = -EOPNOTSUPP;
+		goto free;
+	}
+	prio = _get_prio(ns, &opfc->prio, priority, 1, 1, 0);
+	if (IS_ERR(prio)) {
+		err = PTR_ERR(prio);
+		goto free;
+	}
+
+	dst.type = MLX5_FLOW_DESTINATION_TYPE_COUNTER;
+	dst.counter_id = mlx5_fc_id(opfc->fc);
+
+	flow_act.action =
+		MLX5_FLOW_CONTEXT_ACTION_COUNT | MLX5_FLOW_CONTEXT_ACTION_ALLOW;
+
+	opfc->rule =
+		mlx5_add_flow_rules(prio->flow_table, spec, &flow_act, &dst, 1);
+	if (IS_ERR(opfc->rule)) {
+		put_flow_table(dev, prio, false);
+		err = PTR_ERR(opfc->rule);
+		goto free;
+	}
+	prio->refcount++;
+
+free:
+	kfree(spec);
+	return err;
+}
+
+void mlx5_ib_fs_remove_op_fc(struct mlx5_ib_dev *dev,
+			     struct mlx5_ib_op_fc *opfc)
+{
+	mlx5_del_flow_rules(opfc->rule);
+	put_flow_table(dev, &opfc->prio, true);
+	WARN_ON(opfc->prio.flow_table);
+}
+
 static void set_underlay_qp(struct mlx5_ib_dev *dev,
 			    struct mlx5_flow_spec *spec,
 			    u32 underlay_qpn)
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 2ba352702294..130b2ed79ba2 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -805,6 +805,12 @@ enum mlx5_ib_optional_counter_type {
 	MLX5_IB_OPCOUNTER_MAX,
 };
 
+struct mlx5_ib_op_fc {
+	struct mlx5_fc *fc;
+	struct mlx5_ib_flow_prio prio;
+	struct mlx5_flow_handle *rule;
+};
+
 struct mlx5_ib_counters {
 	const char **names;
 	size_t *offsets;
@@ -814,6 +820,12 @@ struct mlx5_ib_counters {
 	u16 set_id;
 };
 
+int mlx5_ib_fs_add_op_fc(struct mlx5_ib_dev *dev, struct mlx5_ib_op_fc *opfc,
+			 enum mlx5_ib_optional_counter_type type);
+
+void mlx5_ib_fs_remove_op_fc(struct mlx5_ib_dev *dev,
+			     struct mlx5_ib_op_fc *opfc);
+
 struct mlx5_ib_multiport_info;
 
 struct mlx5_ib_multiport {
diff --git a/include/rdma/ib_hdrs.h b/include/rdma/ib_hdrs.h
index 7e542205861c..8ae07c0ecdf7 100644
--- a/include/rdma/ib_hdrs.h
+++ b/include/rdma/ib_hdrs.h
@@ -232,6 +232,7 @@ static inline u32 ib_get_sqpn(struct ib_other_headers *ohdr)
 #define IB_BTH_SE_SHIFT	23
 #define IB_BTH_TVER_MASK	0xf
 #define IB_BTH_TVER_SHIFT	16
+#define IB_BTH_OPCODE_CNP	0x81
 
 static inline u8 ib_bth_get_pad(struct ib_other_headers *ohdr)
 {
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH rdma-next 06/10] RDMA/nldev: Add support to add and remove optional counters
  2021-08-18 11:24 [PATCH rdma-next 00/10] Optional counter statistics support Mark Zhang
                   ` (4 preceding siblings ...)
  2021-08-18 11:24 ` [PATCH rdma-next 05/10] RDMA/mlx5: Add steering support in optional flow counters Mark Zhang
@ 2021-08-18 11:24 ` Mark Zhang
  2021-08-23 19:42   ` Jason Gunthorpe
  2021-08-18 11:24 ` [PATCH rdma-next 07/10] RDMA/mlx5: Add add_op_stat() and remove_op_stat() support Mark Zhang
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Mark Zhang @ 2021-08-18 11:24 UTC (permalink / raw)
  To: jgg, dledford, saeedm
  Cc: linux-rdma, netdev, aharonl, netao, leonro, Mark Zhang

From: Aharon Landau <aharonl@nvidia.com>

This patch adds the ability to add/remove optional counter to a link
through RDMA netlink. Limit it to users with ADMIN capability only.

Examples:
$ sudo rdma statistic add link rocep8s0f0/1 optional-set cc_rx_ce_pkts
$ sudo rdma statistic remove link rocep8s0f0/1 optional-set cc_rx_ce_pkts

Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
---
 drivers/infiniband/core/counters.c | 50 ++++++++++++++++
 drivers/infiniband/core/device.c   |  2 +
 drivers/infiniband/core/nldev.c    | 93 ++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h            |  7 +++
 include/rdma/rdma_counter.h        |  4 ++
 include/rdma/rdma_netlink.h        |  1 +
 include/uapi/rdma/rdma_netlink.h   |  9 +++
 7 files changed, 166 insertions(+)

diff --git a/drivers/infiniband/core/counters.c b/drivers/infiniband/core/counters.c
index b8b6db98bfdf..fa04178aa0eb 100644
--- a/drivers/infiniband/core/counters.c
+++ b/drivers/infiniband/core/counters.c
@@ -106,6 +106,56 @@ static int __rdma_counter_bind_qp(struct rdma_counter *counter,
 	return ret;
 }
 
+static struct rdma_op_counter *get_opcounter(struct rdma_op_stats *opstats,
+					     const char *name)
+{
+	int i;
+
+	for (i = 0; i < opstats->num_opcounters; i++)
+		if (!strcmp(opstats->opcounters[i].name, name))
+			return opstats->opcounters + i;
+
+	return NULL;
+}
+
+static int rdma_opcounter_set(struct ib_device *dev, u32 port,
+			      const char *name, bool is_add)
+{
+	struct rdma_port_counter *port_counter;
+	struct rdma_op_counter *opc;
+	int ret;
+
+	if (!dev->ops.add_op_stat || !dev->ops.remove_op_stat)
+		return -EOPNOTSUPP;
+
+	port_counter = &dev->port_data[port].port_counter;
+	opc = get_opcounter(port_counter->opstats, name);
+	if (!opc)
+		return -EINVAL;
+
+	mutex_lock(&port_counter->opstats->lock);
+	ret = is_add ? dev->ops.add_op_stat(dev, port, opc->type) :
+		dev->ops.remove_op_stat(dev, port, opc->type);
+	if (ret)
+		goto end;
+
+	opc->enabled = is_add;
+end:
+	mutex_unlock(&port_counter->opstats->lock);
+	return ret;
+}
+
+int rdma_opcounter_add(struct ib_device *dev, u32 port, const char *name)
+{
+	return rdma_opcounter_set(dev, port, name, true);
+}
+
+int rdma_opcounter_remove(struct ib_device *dev, u32 port,
+			  const char *name)
+{
+	return rdma_opcounter_set(dev, port, name, false);
+}
+
 static struct rdma_counter *alloc_and_bind(struct ib_device *dev, u32 port,
 					   struct ib_qp *qp,
 					   enum rdma_nl_counter_mode mode)
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 23e1ae50b2e4..b9138f20f9a8 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -2590,6 +2590,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
 		ops->uverbs_no_driver_id_binding;
 
 	SET_DEVICE_OP(dev_ops, add_gid);
+	SET_DEVICE_OP(dev_ops, add_op_stat);
 	SET_DEVICE_OP(dev_ops, advise_mr);
 	SET_DEVICE_OP(dev_ops, alloc_dm);
 	SET_DEVICE_OP(dev_ops, alloc_hw_device_stats);
@@ -2701,6 +2702,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
 	SET_DEVICE_OP(dev_ops, reg_dm_mr);
 	SET_DEVICE_OP(dev_ops, reg_user_mr);
 	SET_DEVICE_OP(dev_ops, reg_user_mr_dmabuf);
+	SET_DEVICE_OP(dev_ops, remove_op_stat);
 	SET_DEVICE_OP(dev_ops, req_notify_cq);
 	SET_DEVICE_OP(dev_ops, rereg_user_mr);
 	SET_DEVICE_OP(dev_ops, resize_cq);
diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index e9b4b2cccaa0..17d55d89f11c 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -154,6 +154,11 @@ static const struct nla_policy nldev_policy[RDMA_NLDEV_ATTR_MAX] = {
 	[RDMA_NLDEV_NET_NS_FD]			= { .type = NLA_U32 },
 	[RDMA_NLDEV_SYS_ATTR_NETNS_MODE]	= { .type = NLA_U8 },
 	[RDMA_NLDEV_SYS_ATTR_COPY_ON_FORK]	= { .type = NLA_U8 },
+	[RDMA_NLDEV_ATTR_STAT_OPCOUNTERS]       = { .type = NLA_NESTED },
+	[RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY]  = { .type = NLA_NESTED },
+	[RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_NAME] = { .type = NLA_NUL_STRING,
+				  .len = RDMA_NLDEV_ATTR_OPCOUNTER_NAME_SIZE },
+	[RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_VALUE] = { .type = NLA_U64 },
 };
 
 static int put_driver_name_print_type(struct sk_buff *msg, const char *name,
@@ -1888,6 +1893,86 @@ static int nldev_set_sys_set_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 	return err;
 }
 
+static int nldev_stat_set_op_stat(struct sk_buff *skb,
+				  struct nlmsghdr *nlh,
+				  struct netlink_ext_ack *extack,
+				  bool cmd_add)
+{
+	char opcounter[RDMA_NLDEV_ATTR_OPCOUNTER_NAME_SIZE] = {};
+	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX];
+	struct ib_device *device;
+	struct sk_buff *msg;
+	u32 index, port;
+	int ret;
+
+	ret = nlmsg_parse(nlh, 0, tb, RDMA_NLDEV_ATTR_MAX - 1,
+			  nldev_policy, extack);
+
+	if (ret || !tb[RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_NAME] ||
+	    !tb[RDMA_NLDEV_ATTR_DEV_INDEX] ||
+	    !tb[RDMA_NLDEV_ATTR_PORT_INDEX])
+		return -EINVAL;
+
+	index = nla_get_u32(tb[RDMA_NLDEV_ATTR_DEV_INDEX]);
+	device = ib_device_get_by_index(sock_net(skb->sk), index);
+	if (!device)
+		return -EINVAL;
+
+	port = nla_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]);
+	if (!rdma_is_port_valid(device, port)) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	nla_strscpy(opcounter, tb[RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_NAME],
+		    sizeof(opcounter));
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	nlh = nlmsg_put(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq,
+			RDMA_NL_GET_TYPE(RDMA_NL_NLDEV,
+					 (cmd_add ?
+					  RDMA_NLDEV_CMD_STAT_ADD_OPCOUNTER :
+					  RDMA_NLDEV_CMD_STAT_REMOVE_OPCOUNTER)),
+			0, 0);
+
+	if (cmd_add)
+		ret = rdma_opcounter_add(device, port, opcounter);
+	else
+		ret = rdma_opcounter_remove(device, port, opcounter);
+	if (ret)
+		goto err_msg;
+
+	nlmsg_end(msg, nlh);
+	ib_device_put(device);
+	return rdma_nl_unicast(sock_net(skb->sk), msg,
+			       NETLINK_CB(skb).portid);
+
+err_msg:
+	nlmsg_free(msg);
+err:
+	ib_device_put(device);
+	return ret;
+}
+
+static int nldev_stat_add_op_stat_doit(struct sk_buff *skb,
+				       struct nlmsghdr *nlh,
+				       struct netlink_ext_ack *extack)
+{
+	return nldev_stat_set_op_stat(skb, nlh, extack, true);
+}
+
+static int nldev_stat_remove_op_stat_doit(struct sk_buff *skb,
+					  struct nlmsghdr *nlh,
+					  struct netlink_ext_ack *extack)
+{
+	return nldev_stat_set_op_stat(skb, nlh, extack, false);
+}
+
 static int nldev_stat_set_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 			       struct netlink_ext_ack *extack)
 {
@@ -2342,6 +2427,14 @@ static const struct rdma_nl_cbs nldev_cb_table[RDMA_NLDEV_NUM_OPS] = {
 		.dump = nldev_res_get_mr_raw_dumpit,
 		.flags = RDMA_NL_ADMIN_PERM,
 	},
+	[RDMA_NLDEV_CMD_STAT_ADD_OPCOUNTER] = {
+		.doit = nldev_stat_add_op_stat_doit,
+		.flags = RDMA_NL_ADMIN_PERM,
+	},
+	[RDMA_NLDEV_CMD_STAT_REMOVE_OPCOUNTER] = {
+		.doit = nldev_stat_remove_op_stat_doit,
+		.flags = RDMA_NL_ADMIN_PERM,
+	},
 };
 
 void __init nldev_init(void)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 40b0f7825975..fa9e668b9b14 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -600,11 +600,14 @@ static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct(
 
 /**
  * struct rdma_op_counter
+ * @enabled - To indicate if this counter is currently enabled (as optional
+ *    counters can be dynamically enabled/disabled)
  * @type - The vendor-specific type of the counter
  * @name - The name of the counter
  * @value - The value of the counter
  */
 struct rdma_op_counter {
+	bool enabled;
 	int type;
 	const char *name;
 	u64 value;
@@ -2595,6 +2598,10 @@ struct ib_device_ops {
 	struct rdma_op_stats *(*alloc_op_port_stats)(struct ib_device *device,
 						     u32 port_num);
 
+	int (*add_op_stat)(struct ib_device *device, u32 port,
+			   int optional_stat);
+	int (*remove_op_stat)(struct ib_device *device, u32 port,
+			      int optional_stat);
 	/**
 	 * Allows rdma drivers to add their own restrack attributes.
 	 */
diff --git a/include/rdma/rdma_counter.h b/include/rdma/rdma_counter.h
index 3531c5061718..48086a7248ac 100644
--- a/include/rdma/rdma_counter.h
+++ b/include/rdma/rdma_counter.h
@@ -63,5 +63,9 @@ int rdma_counter_unbind_qpn(struct ib_device *dev, u32 port,
 int rdma_counter_get_mode(struct ib_device *dev, u32 port,
 			  enum rdma_nl_counter_mode *mode,
 			  enum rdma_nl_counter_mask *mask);
+int rdma_opcounter_add(struct ib_device *dev, u32 port,
+		       const char *name);
+int rdma_opcounter_remove(struct ib_device *dev, u32 port,
+			  const char *name);
 
 #endif /* _RDMA_COUNTER_H_ */
diff --git a/include/rdma/rdma_netlink.h b/include/rdma/rdma_netlink.h
index 2758d9df71ee..ac47a0cc0508 100644
--- a/include/rdma/rdma_netlink.h
+++ b/include/rdma/rdma_netlink.h
@@ -10,6 +10,7 @@ enum {
 	RDMA_NLDEV_ATTR_EMPTY_STRING = 1,
 	RDMA_NLDEV_ATTR_ENTRY_STRLEN = 16,
 	RDMA_NLDEV_ATTR_CHARDEV_TYPE_SIZE = 32,
+	RDMA_NLDEV_ATTR_OPCOUNTER_NAME_SIZE = 64,
 };
 
 struct rdma_nl_cbs {
diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index 75a1ae2311d8..79e6ca87d2e0 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -297,6 +297,10 @@ enum rdma_nldev_command {
 
 	RDMA_NLDEV_CMD_RES_SRQ_GET, /* can dump */
 
+	RDMA_NLDEV_CMD_STAT_ADD_OPCOUNTER,
+
+	RDMA_NLDEV_CMD_STAT_REMOVE_OPCOUNTER,
+
 	RDMA_NLDEV_NUM_OPS
 };
 
@@ -549,6 +553,11 @@ enum rdma_nldev_attr {
 
 	RDMA_NLDEV_SYS_ATTR_COPY_ON_FORK,	/* u8 */
 
+	RDMA_NLDEV_ATTR_STAT_OPCOUNTERS,	/* nested table */
+	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY,	/* nested table */
+	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_NAME,	/* string */
+	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_VALUE,	/* u64 */
+
 	/*
 	 * Always the end
 	 */
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH rdma-next 07/10] RDMA/mlx5: Add add_op_stat() and remove_op_stat() support
  2021-08-18 11:24 [PATCH rdma-next 00/10] Optional counter statistics support Mark Zhang
                   ` (5 preceding siblings ...)
  2021-08-18 11:24 ` [PATCH rdma-next 06/10] RDMA/nldev: Add support to add and remove optional counters Mark Zhang
@ 2021-08-18 11:24 ` Mark Zhang
  2021-08-18 11:24 ` [PATCH rdma-next 08/10] RDMA/nldev: Add support to get optional counters statistics Mark Zhang
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Mark Zhang @ 2021-08-18 11:24 UTC (permalink / raw)
  To: jgg, dledford, saeedm
  Cc: linux-rdma, netdev, aharonl, netao, leonro, Mark Zhang

From: Aharon Landau <aharonl@nvidia.com>

Add support for ib callback add_op_stat() and remove_op_stat().
When adding an optional counter statistic, a steering flow table is
created with a rule that catches and counts all the matching packets.
When removing this counter, the table and flow counter are being
destroyed.

Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
---
 drivers/infiniband/hw/mlx5/counters.c | 63 ++++++++++++++++++++++++++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |  1 +
 2 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/counters.c b/drivers/infiniband/hw/mlx5/counters.c
index 5b4b446727d1..5bd1e5a5dffa 100644
--- a/drivers/infiniband/hw/mlx5/counters.c
+++ b/drivers/infiniband/hw/mlx5/counters.c
@@ -544,7 +544,7 @@ static void mlx5_ib_dealloc_counters(struct mlx5_ib_dev *dev)
 {
 	u32 in[MLX5_ST_SZ_DW(dealloc_q_counter_in)] = {};
 	int num_cnt_ports;
-	int i;
+	int i, j;
 
 	num_cnt_ports = is_mdev_switchdev_mode(dev->mdev) ? 1 : dev->num_ports;
 
@@ -559,6 +559,16 @@ static void mlx5_ib_dealloc_counters(struct mlx5_ib_dev *dev)
 		}
 		kfree(dev->port[i].cnts.names);
 		kfree(dev->port[i].cnts.offsets);
+
+		for (j = 0; j < MLX5_IB_OPCOUNTER_MAX; j++) {
+			if (dev->port[i].cnts.opfcs[j].fc) {
+				mlx5_ib_fs_remove_op_fc(dev,
+							&dev->port[i].cnts.opfcs[j]);
+				mlx5_fc_destroy(dev->mdev,
+						dev->port[i].cnts.opfcs[j].fc);
+				dev->port[i].cnts.opfcs[j].fc = NULL;
+			}
+		}
 	}
 }
 
@@ -738,9 +748,60 @@ void mlx5_ib_counters_clear_description(struct ib_counters *counters)
 	mutex_unlock(&mcounters->mcntrs_mutex);
 }
 
+static int mlx5_ib_add_op_stat(struct ib_device *device, u32 port, int type)
+{
+	struct mlx5_ib_dev *dev = to_mdev(device);
+	struct mlx5_ib_op_fc *opfc;
+	int ret;
+
+	if (mlx5_core_mp_enabled(dev->mdev))
+		return -EOPNOTSUPP;
+
+	if (type >= MLX5_IB_OPCOUNTER_MAX)
+		return -EINVAL;
+
+	opfc = &dev->port[port - 1].cnts.opfcs[type];
+	if (opfc->fc)
+		return -EEXIST;
+
+	opfc->fc = mlx5_fc_create(dev->mdev, false);
+	if (IS_ERR(opfc->fc))
+		return PTR_ERR(opfc->fc);
+
+	ret = mlx5_ib_fs_add_op_fc(dev, opfc, type);
+	if (ret) {
+		mlx5_fc_destroy(dev->mdev, opfc->fc);
+		opfc->fc = NULL;
+		return ret;
+	}
+
+	return ret;
+}
+
+static int mlx5_ib_remove_op_stat(struct ib_device *device, u32 port, int type)
+{
+	struct mlx5_ib_dev *dev = to_mdev(device);
+	struct mlx5_ib_op_fc *opfc;
+
+	if (type >= MLX5_IB_OPCOUNTER_MAX)
+		return -EINVAL;
+
+	opfc = &dev->port[port - 1].cnts.opfcs[type];
+	if (!opfc->fc)
+		return -EINVAL;
+
+	mlx5_ib_fs_remove_op_fc(dev, opfc);
+	mlx5_fc_destroy(dev->mdev, opfc->fc);
+	opfc->fc = NULL;
+
+	return 0;
+}
+
 static const struct ib_device_ops stats_ops = {
 	.alloc_hw_port_stats = mlx5_ib_alloc_hw_port_stats,
 	.alloc_op_port_stats = mlx5_ib_alloc_op_port_stats,
+	.add_op_stat = mlx5_ib_add_op_stat,
+	.remove_op_stat = mlx5_ib_remove_op_stat,
 	.get_hw_stats = mlx5_ib_get_hw_stats,
 	.counter_bind_qp = mlx5_ib_counter_bind_qp,
 	.counter_unbind_qp = mlx5_ib_counter_unbind_qp,
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 130b2ed79ba2..4dd4e43f118e 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -818,6 +818,7 @@ struct mlx5_ib_counters {
 	u32 num_cong_counters;
 	u32 num_ext_ppcnt_counters;
 	u16 set_id;
+	struct mlx5_ib_op_fc opfcs[MLX5_IB_OPCOUNTER_MAX];
 };
 
 int mlx5_ib_fs_add_op_fc(struct mlx5_ib_dev *dev, struct mlx5_ib_op_fc *opfc,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH rdma-next 08/10] RDMA/nldev: Add support to get optional counters statistics
  2021-08-18 11:24 [PATCH rdma-next 00/10] Optional counter statistics support Mark Zhang
                   ` (6 preceding siblings ...)
  2021-08-18 11:24 ` [PATCH rdma-next 07/10] RDMA/mlx5: Add add_op_stat() and remove_op_stat() support Mark Zhang
@ 2021-08-18 11:24 ` Mark Zhang
  2021-08-18 11:24 ` [PATCH rdma-next 09/10] RDMA/mlx5: Add get_op_stats() support Mark Zhang
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 23+ messages in thread
From: Mark Zhang @ 2021-08-18 11:24 UTC (permalink / raw)
  To: jgg, dledford, saeedm
  Cc: linux-rdma, netdev, aharonl, netao, leonro, Mark Zhang

From: Neta Ostrovsky <netao@nvidia.com>

This patch adds the ability to return per-port optional counter
statistisc through RDMA netlink.

Examples:

$ rdma statistic show link rocep8s0f0/1
link rocep8s0f0/1 rx_write_requests 0 rx_read_requests 0 rx_atomic_requests 0 out_of_buffer 0
out_of_sequence 0 duplicate_request 0 rnr_nak_retry_err 0 packet_seq_err 0 implied_nak_seq_err 0
local_ack_timeout_err 0 resp_local_length_error 0 resp_cqe_error 0 req_cqe_error 0
req_remote_invalid_request 0 req_remote_access_errors 0 resp_remote_access_errors 0
resp_cqe_flush_error 0 req_cqe_flush_error 0 roce_adp_retrans 0 roce_adp_retrans_to 0
roce_slow_restart 0 roce_slow_restart_cnps 0 roce_slow_restart_trans 0 rp_cnp_ignored 0
rp_cnp_handled 0 np_ecn_marked_roce_packets 0 np_cnp_sent 0 rx_icrc_encapsulated 0
    Optional-set: cc_rx_ce_pkts 0

Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
---
 drivers/infiniband/core/counters.c |  18 +++++
 drivers/infiniband/core/device.c   |   1 +
 drivers/infiniband/core/nldev.c    | 117 +++++++++++++++++++++++------
 include/rdma/ib_verbs.h            |   3 +
 include/rdma/rdma_counter.h        |   3 +
 5 files changed, 119 insertions(+), 23 deletions(-)

diff --git a/drivers/infiniband/core/counters.c b/drivers/infiniband/core/counters.c
index fa04178aa0eb..5f7a12b8f1bb 100644
--- a/drivers/infiniband/core/counters.c
+++ b/drivers/infiniband/core/counters.c
@@ -459,6 +459,24 @@ u64 rdma_counter_get_hwstat_value(struct ib_device *dev, u32 port, u32 index)
 	return sum;
 }
 
+/*
+ * rdma_opcounter_query_stats - Query the per-port optional counter values
+ */
+int rdma_opcounter_query_stats(struct rdma_op_stats *opstats,
+			       struct ib_device *dev, u32 port)
+{
+	int ret = 0;
+
+	if (!dev->ops.get_op_stats)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&opstats->lock);
+	ret = dev->ops.get_op_stats(dev, port, opstats);
+	mutex_unlock(&opstats->lock);
+
+	return ret;
+}
+
 static struct ib_qp *rdma_counter_get_qp(struct ib_device *dev, u32 qp_num)
 {
 	struct rdma_restrack_entry *res = NULL;
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index b9138f20f9a8..efd4b75b7752 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -2657,6 +2657,7 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
 	SET_DEVICE_OP(dev_ops, get_link_layer);
 	SET_DEVICE_OP(dev_ops, get_netdev);
 	SET_DEVICE_OP(dev_ops, get_numa_node);
+	SET_DEVICE_OP(dev_ops, get_op_stats);
 	SET_DEVICE_OP(dev_ops, get_port_immutable);
 	SET_DEVICE_OP(dev_ops, get_vector_affinity);
 	SET_DEVICE_OP(dev_ops, get_vf_config);
diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index 17d55d89f11c..b665651dfb1d 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -945,6 +945,30 @@ int rdma_nl_stat_hwcounter_entry(struct sk_buff *msg, const char *name,
 }
 EXPORT_SYMBOL(rdma_nl_stat_hwcounter_entry);
 
+static int rdma_nl_stat_opcounter_entry(struct sk_buff *msg, const char *name,
+					u64 value)
+{
+	struct nlattr *entry_attr;
+
+	entry_attr = nla_nest_start(msg, RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY);
+	if (!entry_attr)
+		return -EMSGSIZE;
+
+	if (nla_put_string(msg, RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_NAME,
+			   name))
+		goto err;
+	if (nla_put_u64_64bit(msg, RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_VALUE,
+			      value, RDMA_NLDEV_ATTR_PAD))
+		goto err;
+
+	nla_nest_end(msg, entry_attr);
+	return 0;
+
+err:
+	nla_nest_cancel(msg, entry_attr);
+	return -EMSGSIZE;
+}
+
 static int fill_stat_mr_entry(struct sk_buff *msg, bool has_cap_net_admin,
 			      struct rdma_restrack_entry *res, uint32_t port)
 {
@@ -2124,15 +2148,52 @@ static int nldev_stat_del_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 	return ret;
 }
 
+static int stat_get_optional_counter(struct sk_buff *msg,
+				     struct ib_device *device, u32 port)
+{
+	struct rdma_op_stats *opstats;
+	struct nlattr *opstats_table;
+	int i, ret = 0;
+
+	opstats = device->port_data[port].port_counter.opstats;
+	if (!opstats)
+		return 0;
+
+	ret = rdma_opcounter_query_stats(opstats, device, port);
+	if (ret)
+		return ret;
+
+	opstats_table = nla_nest_start(msg, RDMA_NLDEV_ATTR_STAT_OPCOUNTERS);
+	if (!opstats_table)
+		return -EMSGSIZE;
+
+	for (i = 0; i < opstats->num_opcounters; i++) {
+		if (!(opstats->opcounters[i].enabled))
+			continue;
+		ret = rdma_nl_stat_opcounter_entry(msg,
+						   opstats->opcounters[i].name,
+						   opstats->opcounters[i].value);
+		if (ret)
+			goto err;
+	}
+	nla_nest_end(msg, opstats_table);
+
+	return 0;
+
+err:
+	nla_nest_cancel(msg, opstats_table);
+	return ret;
+}
+
 static int stat_get_doit_default_counter(struct sk_buff *skb,
 					 struct nlmsghdr *nlh,
 					 struct netlink_ext_ack *extack,
 					 struct nlattr *tb[])
 {
-	struct rdma_hw_stats *stats;
-	struct nlattr *table_attr;
+	struct rdma_hw_stats *hwstats;
+	struct nlattr *hwstats_table;
 	struct ib_device *device;
-	int ret, num_cnts, i;
+	int ret, num_hwstats, i;
 	struct sk_buff *msg;
 	u32 index, port;
 	u64 v;
@@ -2145,14 +2206,19 @@ static int stat_get_doit_default_counter(struct sk_buff *skb,
 	if (!device)
 		return -EINVAL;
 
+	port = nla_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]);
+	if (!rdma_is_port_valid(device, port)) {
+		ret = -EINVAL;
+		goto err;
+	}
+
 	if (!device->ops.alloc_hw_port_stats || !device->ops.get_hw_stats) {
 		ret = -EINVAL;
 		goto err;
 	}
 
-	port = nla_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]);
-	stats = ib_get_hw_stats_port(device, port);
-	if (!stats) {
+	hwstats = ib_get_hw_stats_port(device, port);
+	if (!hwstats) {
 		ret = -EINVAL;
 		goto err;
 	}
@@ -2174,38 +2240,43 @@ static int stat_get_doit_default_counter(struct sk_buff *skb,
 		goto err_msg;
 	}
 
-	mutex_lock(&stats->lock);
+	mutex_lock(&hwstats->lock);
 
-	num_cnts = device->ops.get_hw_stats(device, stats, port, 0);
-	if (num_cnts < 0) {
+	num_hwstats = device->ops.get_hw_stats(device, hwstats, port, 0);
+	if (num_hwstats < 0) {
 		ret = -EINVAL;
-		goto err_stats;
+		goto err_hwstats;
 	}
 
-	table_attr = nla_nest_start(msg, RDMA_NLDEV_ATTR_STAT_HWCOUNTERS);
-	if (!table_attr) {
+	hwstats_table = nla_nest_start(msg, RDMA_NLDEV_ATTR_STAT_HWCOUNTERS);
+	if (!hwstats_table) {
 		ret = -EMSGSIZE;
-		goto err_stats;
+		goto err_hwstats;
 	}
-	for (i = 0; i < num_cnts; i++) {
-		v = stats->value[i] +
+	for (i = 0; i < num_hwstats; i++) {
+		v = hwstats->value[i] +
 			rdma_counter_get_hwstat_value(device, port, i);
-		if (rdma_nl_stat_hwcounter_entry(msg, stats->names[i], v)) {
+		if (rdma_nl_stat_hwcounter_entry(msg, hwstats->names[i], v)) {
 			ret = -EMSGSIZE;
-			goto err_table;
+			goto err_hwstats_table;
 		}
 	}
-	nla_nest_end(msg, table_attr);
+	nla_nest_end(msg, hwstats_table);
+
+	mutex_unlock(&hwstats->lock);
+
+	ret = stat_get_optional_counter(msg, device, port);
+	if (ret)
+		goto err_msg;
 
-	mutex_unlock(&stats->lock);
 	nlmsg_end(msg, nlh);
 	ib_device_put(device);
 	return rdma_nl_unicast(sock_net(skb->sk), msg, NETLINK_CB(skb).portid);
 
-err_table:
-	nla_nest_cancel(msg, table_attr);
-err_stats:
-	mutex_unlock(&stats->lock);
+err_hwstats_table:
+	nla_nest_cancel(msg, hwstats_table);
+err_hwstats:
+	mutex_unlock(&hwstats->lock);
 err_msg:
 	nlmsg_free(msg);
 err:
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index fa9e668b9b14..d85f2e842a1d 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2602,6 +2602,9 @@ struct ib_device_ops {
 			   int optional_stat);
 	int (*remove_op_stat)(struct ib_device *device, u32 port,
 			      int optional_stat);
+	int (*get_op_stats)(struct ib_device *device, u32 port,
+			    struct rdma_op_stats *stats);
+
 	/**
 	 * Allows rdma drivers to add their own restrack attributes.
 	 */
diff --git a/include/rdma/rdma_counter.h b/include/rdma/rdma_counter.h
index 48086a7248ac..31686a234c77 100644
--- a/include/rdma/rdma_counter.h
+++ b/include/rdma/rdma_counter.h
@@ -67,5 +67,8 @@ int rdma_opcounter_add(struct ib_device *dev, u32 port,
 		       const char *name);
 int rdma_opcounter_remove(struct ib_device *dev, u32 port,
 			  const char *name);
+int rdma_opcounter_query_stats(struct rdma_op_stats *opstats,
+			       struct ib_device *dev, u32 port);
+
 
 #endif /* _RDMA_COUNTER_H_ */
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH rdma-next 09/10] RDMA/mlx5: Add get_op_stats() support
  2021-08-18 11:24 [PATCH rdma-next 00/10] Optional counter statistics support Mark Zhang
                   ` (7 preceding siblings ...)
  2021-08-18 11:24 ` [PATCH rdma-next 08/10] RDMA/nldev: Add support to get optional counters statistics Mark Zhang
@ 2021-08-18 11:24 ` Mark Zhang
  2021-08-18 11:24 ` [PATCH rdma-next 10/10] RDMA/nldev: Add support to get current enabled optional counters Mark Zhang
  2021-08-23 19:33 ` [PATCH rdma-next 00/10] Optional counter statistics support Jason Gunthorpe
  10 siblings, 0 replies; 23+ messages in thread
From: Mark Zhang @ 2021-08-18 11:24 UTC (permalink / raw)
  To: jgg, dledford, saeedm
  Cc: linux-rdma, netdev, aharonl, netao, leonro, Mark Zhang

From: Aharon Landau <aharonl@nvidia.com>

Add support for ib callback alloc_op_port_stats(), to get optional
counter statistics.

Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
---
 drivers/infiniband/hw/mlx5/counters.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/counters.c b/drivers/infiniband/hw/mlx5/counters.c
index 5bd1e5a5dffa..5e0cc5af9761 100644
--- a/drivers/infiniband/hw/mlx5/counters.c
+++ b/drivers/infiniband/hw/mlx5/counters.c
@@ -797,12 +797,35 @@ static int mlx5_ib_remove_op_stat(struct ib_device *device, u32 port, int type)
 	return 0;
 }
 
+static int mlx5_ib_get_op_stats(struct ib_device *device, u32 port,
+				struct rdma_op_stats *stats)
+{
+	struct mlx5_ib_dev *dev = to_mdev(device);
+	struct mlx5_ib_op_fc *opfcs = dev->port[port - 1].cnts.opfcs;
+	u64 packets, bytes;
+	int i, ret, type;
+
+	for (i = 0; i < stats->num_opcounters; i++) {
+		type = stats->opcounters[i].type;
+		if (opfcs[type].fc) {
+			ret = mlx5_fc_query(dev->mdev, opfcs[type].fc,
+					    &packets, &bytes);
+			if (ret)
+				return ret;
+			stats->opcounters[i].value = packets;
+		}
+	}
+
+	return 0;
+}
+
 static const struct ib_device_ops stats_ops = {
 	.alloc_hw_port_stats = mlx5_ib_alloc_hw_port_stats,
 	.alloc_op_port_stats = mlx5_ib_alloc_op_port_stats,
 	.add_op_stat = mlx5_ib_add_op_stat,
 	.remove_op_stat = mlx5_ib_remove_op_stat,
 	.get_hw_stats = mlx5_ib_get_hw_stats,
+	.get_op_stats = mlx5_ib_get_op_stats,
 	.counter_bind_qp = mlx5_ib_counter_bind_qp,
 	.counter_unbind_qp = mlx5_ib_counter_unbind_qp,
 	.counter_dealloc = mlx5_ib_counter_dealloc,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH rdma-next 10/10] RDMA/nldev: Add support to get current enabled optional counters
  2021-08-18 11:24 [PATCH rdma-next 00/10] Optional counter statistics support Mark Zhang
                   ` (8 preceding siblings ...)
  2021-08-18 11:24 ` [PATCH rdma-next 09/10] RDMA/mlx5: Add get_op_stats() support Mark Zhang
@ 2021-08-18 11:24 ` Mark Zhang
  2021-08-23 19:44   ` Jason Gunthorpe
  2021-08-23 19:33 ` [PATCH rdma-next 00/10] Optional counter statistics support Jason Gunthorpe
  10 siblings, 1 reply; 23+ messages in thread
From: Mark Zhang @ 2021-08-18 11:24 UTC (permalink / raw)
  To: jgg, dledford, saeedm
  Cc: linux-rdma, netdev, aharonl, netao, leonro, Mark Zhang

From: Aharon Landau <aharonl@nvidia.com>

This patch adds the ability to show the added and supported optional
counters for each link through RDMA netlink.

Examples:

$ rdma statistic mode
link rocep8s0f0/1
    Optional-set: cc_rx_ce_pkts

$ rdma statistic mode supported
link rocep8s0f0/1
    Optional-set: cc_rx_ce_pkts cc_rx_cnp_pkts cc_tx_cnp_pkts
link rocep8s0f1/1
    Optional-set: cc_rx_ce_pkts cc_rx_cnp_pkts cc_tx_cnp_pkts

Signed-off-by: Aharon Landau <aharonl@nvidia.com>
Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
---
 drivers/infiniband/core/nldev.c  | 91 +++++++++++++++++++++++++++++++-
 include/uapi/rdma/rdma_netlink.h |  2 +
 2 files changed, 91 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index b665651dfb1d..611e4fe6a244 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -159,6 +159,8 @@ static const struct nla_policy nldev_policy[RDMA_NLDEV_ATTR_MAX] = {
 	[RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_NAME] = { .type = NLA_NUL_STRING,
 				  .len = RDMA_NLDEV_ATTR_OPCOUNTER_NAME_SIZE },
 	[RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_VALUE] = { .type = NLA_U64 },
+	[RDMA_NLDEV_ATTR_STAT_OP_MODE_LIST]	= { .type = NLA_U8 },
+	[RDMA_NLDEV_ATTR_STAT_OP_MODE_LIST_SUPPORTED] = { .type = NLA_U8 },
 };
 
 static int put_driver_name_print_type(struct sk_buff *msg, const char *name,
@@ -946,7 +948,7 @@ int rdma_nl_stat_hwcounter_entry(struct sk_buff *msg, const char *name,
 EXPORT_SYMBOL(rdma_nl_stat_hwcounter_entry);
 
 static int rdma_nl_stat_opcounter_entry(struct sk_buff *msg, const char *name,
-					u64 value)
+					u64 value, bool send_value)
 {
 	struct nlattr *entry_attr;
 
@@ -960,6 +962,12 @@ static int rdma_nl_stat_opcounter_entry(struct sk_buff *msg, const char *name,
 	if (nla_put_u64_64bit(msg, RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_VALUE,
 			      value, RDMA_NLDEV_ATTR_PAD))
 		goto err;
+	if (send_value) {
+		if (nla_put_u64_64bit(msg,
+				      RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_VALUE,
+				      value, RDMA_NLDEV_ATTR_PAD))
+			goto err;
+	}
 
 	nla_nest_end(msg, entry_attr);
 	return 0;
@@ -2148,6 +2156,79 @@ static int nldev_stat_del_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
 	return ret;
 }
 
+static int stat_get_doit_list_opstats(struct sk_buff *skb,
+				      struct nlmsghdr *nlh,
+				      struct netlink_ext_ack *extack,
+				      struct nlattr *tb[], u32 index,
+				      struct ib_device *device, u32 port)
+{
+	struct rdma_op_stats *opstats;
+	struct nlattr *opstats_list;
+	bool list_supported = false;
+	struct sk_buff *msg;
+	int i, ret;
+
+	if (tb[RDMA_NLDEV_ATTR_STAT_OP_MODE_LIST_SUPPORTED])
+		list_supported = true;
+
+	opstats = device->port_data[port].port_counter.opstats;
+	if (!opstats)
+		return -EOPNOTSUPP;
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	nlh = nlmsg_put(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq,
+			RDMA_NL_GET_TYPE(RDMA_NL_NLDEV,
+					 RDMA_NLDEV_CMD_STAT_GET),
+			0, 0);
+
+	if (fill_nldev_handle(msg, device) ||
+	    nla_put_u32(msg, RDMA_NLDEV_ATTR_PORT_INDEX, port)) {
+		ret = -EMSGSIZE;
+		goto err_msg;
+	}
+
+	opstats_list =
+		nla_nest_start(msg, RDMA_NLDEV_ATTR_STAT_OPCOUNTERS);
+
+	if (!opstats_list) {
+		ret = -EMSGSIZE;
+		goto err_msg;
+	}
+
+	mutex_lock(&opstats->lock);
+
+	for (i = 0; i < opstats->num_opcounters; i++) {
+		if (!opstats->opcounters[i].enabled && !list_supported)
+			continue;
+		ret = rdma_nl_stat_opcounter_entry(msg,
+						   opstats->opcounters[i].name,
+						   0, false);
+		if (ret)
+			goto err_opstats_msg;
+	}
+	nla_nest_end(msg, opstats_list);
+
+	mutex_unlock(&opstats->lock);
+
+	nlmsg_end(msg, nlh);
+	ib_device_put(device);
+	return rdma_nl_unicast(sock_net(skb->sk), msg, NETLINK_CB(skb).portid);
+
+err_opstats_msg:
+	nla_nest_cancel(msg, opstats_list);
+	mutex_unlock(&opstats->lock);
+err_msg:
+	nlmsg_free(msg);
+err:
+	ib_device_put(device);
+	return ret;
+}
+
 static int stat_get_optional_counter(struct sk_buff *msg,
 				     struct ib_device *device, u32 port)
 {
@@ -2172,7 +2253,8 @@ static int stat_get_optional_counter(struct sk_buff *msg,
 			continue;
 		ret = rdma_nl_stat_opcounter_entry(msg,
 						   opstats->opcounters[i].name,
-						   opstats->opcounters[i].value);
+						   opstats->opcounters[i].value,
+						   true);
 		if (ret)
 			goto err;
 	}
@@ -2212,6 +2294,11 @@ static int stat_get_doit_default_counter(struct sk_buff *skb,
 		goto err;
 	}
 
+	if (tb[RDMA_NLDEV_ATTR_STAT_OP_MODE_LIST] ||
+	    tb[RDMA_NLDEV_ATTR_STAT_OP_MODE_LIST_SUPPORTED])
+		return stat_get_doit_list_opstats(skb, nlh, extack, tb,
+						  index, device, port);
+
 	if (!device->ops.alloc_hw_port_stats || !device->ops.get_hw_stats) {
 		ret = -EINVAL;
 		goto err;
diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index 79e6ca87d2e0..57f39d8fe434 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -557,6 +557,8 @@ enum rdma_nldev_attr {
 	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY,	/* nested table */
 	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_NAME,	/* string */
 	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_VALUE,	/* u64 */
+	RDMA_NLDEV_ATTR_STAT_OP_MODE_LIST,		/* u8 */
+	RDMA_NLDEV_ATTR_STAT_OP_MODE_LIST_SUPPORTED,	/* u8 */
 
 	/*
 	 * Always the end
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH rdma-next 04/10] RDMA/mlx5: Add alloc_op_port_stats() support
  2021-08-18 11:24 ` [PATCH rdma-next 04/10] RDMA/mlx5: Add alloc_op_port_stats() support Mark Zhang
@ 2021-08-23 19:19   ` Jason Gunthorpe
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2021-08-23 19:19 UTC (permalink / raw)
  To: Mark Zhang; +Cc: dledford, saeedm, linux-rdma, netdev, aharonl, netao, leonro

On Wed, Aug 18, 2021 at 02:24:22PM +0300, Mark Zhang wrote:

> +static struct rdma_op_stats *
> +mlx5_ib_alloc_op_port_stats(struct ib_device *ibdev, u32 port_num)
> +{
> +	struct rdma_op_stats *opstats;
> +	struct mlx5_ib_dev *dev = to_mdev(ibdev);
> +	int num_opcounters, i, j = 0;
> +
> +	num_opcounters = ARRAY_SIZE(basic_op_cnts);
> +
> +	if (MLX5_CAP_FLOWTABLE(dev->mdev,
> +			       ft_field_support_2_nic_receive_rdma.bth_opcode))
> +		num_opcounters += ARRAY_SIZE(rdmarx_cnp_op_cnts);
> +
> +	if (MLX5_CAP_FLOWTABLE(dev->mdev,
> +			       ft_field_support_2_nic_transmit_rdma.bth_opcode))
> +		num_opcounters += ARRAY_SIZE(rdmatx_cnp_op_cnts);
> +
> +	opstats = kzalloc(sizeof(*opstats) +
> +			  num_opcounters * sizeof(struct rdma_op_counter),
> +			  GFP_KERNEL);

This should use struct_size

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH rdma-next 03/10] RDMA/counters: Support to allocate per-port optional counter statistics
  2021-08-18 11:24 ` [PATCH rdma-next 03/10] RDMA/counters: Support to allocate per-port optional counter statistics Mark Zhang
@ 2021-08-23 19:30   ` Jason Gunthorpe
  2021-08-24  6:22     ` Mark Zhang
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2021-08-23 19:30 UTC (permalink / raw)
  To: Mark Zhang; +Cc: dledford, saeedm, linux-rdma, netdev, aharonl, netao, leonro

On Wed, Aug 18, 2021 at 02:24:21PM +0300, Mark Zhang wrote:
> From: Aharon Landau <aharonl@nvidia.com>
> 
> Add an alloc_op_port_stats() API, as well as related structures, to support
> per-port op_stats allocation during counter module initialization.
> 
> Signed-off-by: Aharon Landau <aharonl@nvidia.com>
> Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
> Signed-off-by: Mark Zhang <markzhang@nvidia.com>
>  drivers/infiniband/core/counters.c | 18 ++++++++++++++++++
>  drivers/infiniband/core/device.c   |  1 +
>  include/rdma/ib_verbs.h            | 24 ++++++++++++++++++++++++
>  include/rdma/rdma_counter.h        |  1 +
>  4 files changed, 44 insertions(+)
> 
> diff --git a/drivers/infiniband/core/counters.c b/drivers/infiniband/core/counters.c
> index df9e6c5e4ddf..b8b6db98bfdf 100644
> +++ b/drivers/infiniband/core/counters.c
> @@ -611,6 +611,15 @@ void rdma_counter_init(struct ib_device *dev)
>  		port_counter->hstats = dev->ops.alloc_hw_port_stats(dev, port);
>  		if (!port_counter->hstats)
>  			goto fail;
> +
> +		if (dev->ops.alloc_op_port_stats) {
> +			port_counter->opstats =
> +				dev->ops.alloc_op_port_stats(dev, port);
> +			if (!port_counter->opstats)
> +				goto fail;

It would be nicer to change the normal stats to have more detailed
meta information instead of adding an entire parallel interface like
this.

struct rdma_hw_stats {
	struct mutex	lock;
	unsigned long	timestamp;
	unsigned long	lifespan;
	const char * const *names;

Change the names to a struct

 const struct rdma_hw_stat_desc *descs;

struct rdma_hw_stat_desc {
   const char *name;
   unsigned int flags;
   unsigned int private;
}

and then define a FLAG_OPTIONAL.

Then alot of this oddness goes away.

You might also need a small allocated bitmap to store the
enabled/disabled state

Then the series basically boils down to adding some 'modify counter'
driver op that flips the enable/disable flag

And the netlink patches to expose the additional information.

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH rdma-next 00/10] Optional counter statistics support
  2021-08-18 11:24 [PATCH rdma-next 00/10] Optional counter statistics support Mark Zhang
                   ` (9 preceding siblings ...)
  2021-08-18 11:24 ` [PATCH rdma-next 10/10] RDMA/nldev: Add support to get current enabled optional counters Mark Zhang
@ 2021-08-23 19:33 ` Jason Gunthorpe
  2021-08-24  1:44   ` Mark Zhang
  10 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2021-08-23 19:33 UTC (permalink / raw)
  To: Mark Zhang; +Cc: dledford, saeedm, linux-rdma, netdev, aharonl, netao, leonro

On Wed, Aug 18, 2021 at 02:24:18PM +0300, Mark Zhang wrote:
> Hi,
> 
> This series from Aharon and Neta provides an extension to the rdma
> statistics tool that allows to add and remove optional counters
> dynamically, using new netlink commands.
> 
> The idea of having optional counters is to provide to the users the
> ability to get statistics of counters that hurts performance.
> 
> Once an optional counter was added, its statistics will be presented
> along with all the counters, using the show command.
> 
> Binding objects to the optional counters is currently not supported,
> neither in auto mode nor in manual mode.
> 
> To get the list of optional counters that are supported on this device,
> use "rdma statistic mode supported". To see which counters are currently
> enabled, use "rdma statistic mode".
> 
> $ rdma statistic mode supported
> link rocep8s0f0/1
>     Optional-set: cc_rx_ce_pkts cc_rx_cnp_pkts cc_tx_cnp_pkts
> link rocep8s0f1/1
>     Optional-set: cc_rx_ce_pkts cc_rx_cnp_pkts cc_tx_cnp_pkts
> 
> $ sudo rdma statistic add link rocep8s0f0/1 optional-set cc_rx_ce_pkts
> $ rdma statistic mode
> link rocep8s0f0/1
>     Optional-set: cc_rx_ce_pkts
> $ sudo rdma statistic add link rocep8s0f0/1 optional-set cc_tx_cnp_pkts
> $ rdma statistic mode
> link rocep8s0f0/1
>     Optional-set: cc_rx_ce_pkts cc_tx_cnp_pkts

This doesn't look like the right output to iproute to me, the two
command should not be using the same tag and the output of iproute
should always be formed to be valid input to iproute


> $ rdma statistic show link rocep8s0f0/1
> link rocep8s0f0/1 rx_write_requests 0 rx_read_requests 0 rx_atomic_requests 0 out_of_buffer 0
> out_of_sequence 0 duplicate_request 0 rnr_nak_retry_err 0 packet_seq_err 0 implied_nak_seq_err 0
> local_ack_timeout_err 0 resp_local_length_error 0 resp_cqe_error 0 req_cqe_error 0
> req_remote_invalid_request 0 req_remote_access_errors 0 resp_remote_access_errors 0
> resp_cqe_flush_error 0 req_cqe_flush_error 0 roce_adp_retrans 0 roce_adp_retrans_to 0
> roce_slow_restart 0 roce_slow_restart_cnps 0 roce_slow_restart_trans 0 rp_cnp_ignored 0
> rp_cnp_handled 0 np_ecn_marked_roce_packets 0 np_cnp_sent 0 rx_icrc_encapsulated 0
>     Optional-set: cc_rx_ce_pkts 0 cc_tx_cnp_pkts 0

Also looks bad, optional counters should not be marked specially at
this point.

> Aharon Landau (9):
>   net/mlx5: Add support in bth_opcode as a match criteria
>   net/mlx5: Add priorities for counters in RDMA namespaces
>   RDMA/counters: Support to allocate per-port optional counter
>     statistics
>   RDMA/mlx5: Add alloc_op_port_stats() support
>   RDMA/mlx5: Add steering support in optional flow counters
>   RDMA/nldev: Add support to add and remove optional counters
>   RDMA/mlx5: Add add_op_stat() and remove_op_stat() support
>   RDMA/mlx5: Add get_op_stats() support
>   RDMA/nldev: Add support to get current enabled optional counters
> 
> Neta Ostrovsky (1):
>   RDMA/nldev: Add support to get optional counters statistics

This series is in a poor order, all the core update should come first
and the commit messages should explain what is going on when building
out the new APIs.

The RDMA/mlx5 patches can go last

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH rdma-next 06/10] RDMA/nldev: Add support to add and remove optional counters
  2021-08-18 11:24 ` [PATCH rdma-next 06/10] RDMA/nldev: Add support to add and remove optional counters Mark Zhang
@ 2021-08-23 19:42   ` Jason Gunthorpe
  2021-08-24  2:09     ` Mark Zhang
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2021-08-23 19:42 UTC (permalink / raw)
  To: Mark Zhang; +Cc: dledford, saeedm, linux-rdma, netdev, aharonl, netao, leonro

On Wed, Aug 18, 2021 at 02:24:24PM +0300, Mark Zhang wrote:
> From: Aharon Landau <aharonl@nvidia.com>
> 
> This patch adds the ability to add/remove optional counter to a link
> through RDMA netlink. Limit it to users with ADMIN capability only.
> 
> Examples:
> $ sudo rdma statistic add link rocep8s0f0/1 optional-set cc_rx_ce_pkts
> $ sudo rdma statistic remove link rocep8s0f0/1 optional-set cc_rx_ce_pkts
> 
> Signed-off-by: Aharon Landau <aharonl@nvidia.com>
> Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
> Signed-off-by: Mark Zhang <markzhang@nvidia.com>
>  drivers/infiniband/core/counters.c | 50 ++++++++++++++++
>  drivers/infiniband/core/device.c   |  2 +
>  drivers/infiniband/core/nldev.c    | 93 ++++++++++++++++++++++++++++++
>  include/rdma/ib_verbs.h            |  7 +++
>  include/rdma/rdma_counter.h        |  4 ++
>  include/rdma/rdma_netlink.h        |  1 +
>  include/uapi/rdma/rdma_netlink.h   |  9 +++
>  7 files changed, 166 insertions(+)
> 
> diff --git a/drivers/infiniband/core/counters.c b/drivers/infiniband/core/counters.c
> index b8b6db98bfdf..fa04178aa0eb 100644
> +++ b/drivers/infiniband/core/counters.c
> @@ -106,6 +106,56 @@ static int __rdma_counter_bind_qp(struct rdma_counter *counter,
>  	return ret;
>  }
>  
> +static struct rdma_op_counter *get_opcounter(struct rdma_op_stats *opstats,
> +					     const char *name)
> +{
> +	int i;
> +
> +	for (i = 0; i < opstats->num_opcounters; i++)
> +		if (!strcmp(opstats->opcounters[i].name, name))
> +			return opstats->opcounters + i;
> +
> +	return NULL;
> +}

Export this and have the netlink code call it instead of working with
strings.

> +static int rdma_opcounter_set(struct ib_device *dev, u32 port,
> +			      const char *name, bool is_add)
> +{
> +	struct rdma_port_counter *port_counter;
> +	struct rdma_op_counter *opc;
> +	int ret;
> +
> +	if (!dev->ops.add_op_stat || !dev->ops.remove_op_stat)
> +		return -EOPNOTSUPP;
> +
> +	port_counter = &dev->port_data[port].port_counter;
> +	opc = get_opcounter(port_counter->opstats, name);
> +	if (!opc)
> +		return -EINVAL;
> +
> +	mutex_lock(&port_counter->opstats->lock);
> +	ret = is_add ? dev->ops.add_op_stat(dev, port, opc->type) :
> +		dev->ops.remove_op_stat(dev, port, opc->type);

Drivers should work by indexes not types, that is how the counter API
is designed

> +int rdma_opcounter_add(struct ib_device *dev, u32 port, const char *name)
> +{
> +	return rdma_opcounter_set(dev, port, name, true);
> +}
> +
> +int rdma_opcounter_remove(struct ib_device *dev, u32 port,
> +			  const char *name)
> +{
> +	return rdma_opcounter_set(dev, port, name, false);
> +}

Just pass in the add/remove flag - all this switching between wrappers
adding the flag is ugly. Do it all the way to the driver.

> +static int nldev_stat_set_op_stat(struct sk_buff *skb,
> +				  struct nlmsghdr *nlh,
> +				  struct netlink_ext_ack *extack,
> +				  bool cmd_add)
> +{
> +	char opcounter[RDMA_NLDEV_ATTR_OPCOUNTER_NAME_SIZE] = {};
> +	struct nlattr *tb[RDMA_NLDEV_ATTR_MAX];
> +	struct ib_device *device;
> +	struct sk_buff *msg;
> +	u32 index, port;
> +	int ret;
> +
> +	ret = nlmsg_parse(nlh, 0, tb, RDMA_NLDEV_ATTR_MAX - 1,
> +			  nldev_policy, extack);
> +
> +	if (ret || !tb[RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_NAME] ||
> +	    !tb[RDMA_NLDEV_ATTR_DEV_INDEX] ||
> +	    !tb[RDMA_NLDEV_ATTR_PORT_INDEX])
> +		return -EINVAL;
> +
> +	index = nla_get_u32(tb[RDMA_NLDEV_ATTR_DEV_INDEX]);
> +	device = ib_device_get_by_index(sock_net(skb->sk), index);
> +	if (!device)
> +		return -EINVAL;
> +
> +	port = nla_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]);
> +	if (!rdma_is_port_valid(device, port)) {
> +		ret = -EINVAL;
> +		goto err;
> +	}
> +
> +	nla_strscpy(opcounter, tb[RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_NAME],
> +		    sizeof(opcounter));
> +
> +	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
> +	if (!msg) {
> +		ret = -ENOMEM;
> +		goto err;
> +	}
> +
> +	nlh = nlmsg_put(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq,
> +			RDMA_NL_GET_TYPE(RDMA_NL_NLDEV,
> +					 (cmd_add ?
> +					  RDMA_NLDEV_CMD_STAT_ADD_OPCOUNTER :
> +					  RDMA_NLDEV_CMD_STAT_REMOVE_OPCOUNTER)),
> +			0, 0);
> +
> +	if (cmd_add)
> +		ret = rdma_opcounter_add(device, port, opcounter);
> +	else
> +		ret = rdma_opcounter_remove(device, port, opcounter);
> +	if (ret)
> +		goto err_msg;
> +
> +	nlmsg_end(msg, nlh);

Shouldn't the netlink message for a 'set' always return the current
value of the thing being set on return? Eg the same output that GET
would generate?

> +static int nldev_stat_add_op_stat_doit(struct sk_buff *skb,
> +				       struct nlmsghdr *nlh,
> +				       struct netlink_ext_ack *extack)
> +{
> +	return nldev_stat_set_op_stat(skb, nlh, extack, true);
> +}
> +
> +static int nldev_stat_remove_op_stat_doit(struct sk_buff *skb,
> +					  struct nlmsghdr *nlh,
> +					  struct netlink_ext_ack *extack)
> +{
> +	return nldev_stat_set_op_stat(skb, nlh, extack, false);
> +}
> +
>  static int nldev_stat_set_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
>  			       struct netlink_ext_ack *extack)
>  {
> @@ -2342,6 +2427,14 @@ static const struct rdma_nl_cbs nldev_cb_table[RDMA_NLDEV_NUM_OPS] = {
>  		.dump = nldev_res_get_mr_raw_dumpit,
>  		.flags = RDMA_NL_ADMIN_PERM,
>  	},
> +	[RDMA_NLDEV_CMD_STAT_ADD_OPCOUNTER] = {
> +		.doit = nldev_stat_add_op_stat_doit,
> +		.flags = RDMA_NL_ADMIN_PERM,
> +	},
> +	[RDMA_NLDEV_CMD_STAT_REMOVE_OPCOUNTER] = {
> +		.doit = nldev_stat_remove_op_stat_doit,
> +		.flags = RDMA_NL_ADMIN_PERM,
> +	},
>  };

And here I wonder if this is the cannonical way to manipulate lists of
strings in netlink? I'm trying to think of another case like this, did
you reference something?

Are you sure this shouldn't be done via some set on some counter
object?

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH rdma-next 10/10] RDMA/nldev: Add support to get current enabled optional counters
  2021-08-18 11:24 ` [PATCH rdma-next 10/10] RDMA/nldev: Add support to get current enabled optional counters Mark Zhang
@ 2021-08-23 19:44   ` Jason Gunthorpe
  2021-08-24  2:13     ` Mark Zhang
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2021-08-23 19:44 UTC (permalink / raw)
  To: Mark Zhang; +Cc: dledford, saeedm, linux-rdma, netdev, aharonl, netao, leonro

On Wed, Aug 18, 2021 at 02:24:28PM +0300, Mark Zhang wrote:

> diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
> index 79e6ca87d2e0..57f39d8fe434 100644
> +++ b/include/uapi/rdma/rdma_netlink.h
> @@ -557,6 +557,8 @@ enum rdma_nldev_attr {
>  	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY,	/* nested table */
>  	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_NAME,	/* string */
>  	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_VALUE,	/* u64 */
> +	RDMA_NLDEV_ATTR_STAT_OP_MODE_LIST,		/* u8 */
> +	RDMA_NLDEV_ATTR_STAT_OP_MODE_LIST_SUPPORTED,	/* u8 */

See, here - shouldn't manipulation of MODE_LIST be done by a normal
RDMA_NLDEV_CMD_STAT_SET with the new MODE_LIST array? This doesn't seem
netlinky at all..

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH rdma-next 00/10] Optional counter statistics support
  2021-08-23 19:33 ` [PATCH rdma-next 00/10] Optional counter statistics support Jason Gunthorpe
@ 2021-08-24  1:44   ` Mark Zhang
  2021-08-24 13:11     ` Jason Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Zhang @ 2021-08-24  1:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dledford, saeedm, linux-rdma, netdev, aharonl, netao, leonro

On 8/24/2021 3:33 AM, Jason Gunthorpe wrote:
> On Wed, Aug 18, 2021 at 02:24:18PM +0300, Mark Zhang wrote:
>> Hi,
>>
>> This series from Aharon and Neta provides an extension to the rdma
>> statistics tool that allows to add and remove optional counters
>> dynamically, using new netlink commands.
>>
>> The idea of having optional counters is to provide to the users the
>> ability to get statistics of counters that hurts performance.
>>
>> Once an optional counter was added, its statistics will be presented
>> along with all the counters, using the show command.
>>
>> Binding objects to the optional counters is currently not supported,
>> neither in auto mode nor in manual mode.
>>
>> To get the list of optional counters that are supported on this device,
>> use "rdma statistic mode supported". To see which counters are currently
>> enabled, use "rdma statistic mode".
>>
>> $ rdma statistic mode supported
>> link rocep8s0f0/1
>>      Optional-set: cc_rx_ce_pkts cc_rx_cnp_pkts cc_tx_cnp_pkts
>> link rocep8s0f1/1
>>      Optional-set: cc_rx_ce_pkts cc_rx_cnp_pkts cc_tx_cnp_pkts
>>
>> $ sudo rdma statistic add link rocep8s0f0/1 optional-set cc_rx_ce_pkts
>> $ rdma statistic mode
>> link rocep8s0f0/1
>>      Optional-set: cc_rx_ce_pkts
>> $ sudo rdma statistic add link rocep8s0f0/1 optional-set cc_tx_cnp_pkts
>> $ rdma statistic mode
>> link rocep8s0f0/1
>>      Optional-set: cc_rx_ce_pkts cc_tx_cnp_pkts
> 
> This doesn't look like the right output to iproute to me, the two
> command should not be using the same tag and the output of iproute
> should always be formed to be valid input to iproute

So it should be like this:

$ rdma statistic mode supported
link rocep8s0f0/1 optional-set cc_rx_ce_pkts cc_rx_cnp_pkts  cc_tx_cnp_pkts
link rocep8s0f1/1 optional-set cc_rx_ce_pkts cc_rx_cnp_pkts cc_tx_cnp_pkts

$ sudo rdma statistic add link rocep8s0f0/1 optional-set cc_rx_ce_pkts
$ rdma statistic mode
link rocep8s0f0/1 optional-set cc_rx_ce_pkts
$ sudo rdma statistic add link rocep8s0f0/1 optional-set cc_tx_cnp_pkts
$ rdma statistic mode
link rocep8s0f0/1 optional-set cc_rx_ce_pkts cc_tx_cnp_pkts

> 
>> $ rdma statistic show link rocep8s0f0/1
>> link rocep8s0f0/1 rx_write_requests 0 rx_read_requests 0 rx_atomic_requests 0 out_of_buffer 0
>> out_of_sequence 0 duplicate_request 0 rnr_nak_retry_err 0 packet_seq_err 0 implied_nak_seq_err 0
>> local_ack_timeout_err 0 resp_local_length_error 0 resp_cqe_error 0 req_cqe_error 0
>> req_remote_invalid_request 0 req_remote_access_errors 0 resp_remote_access_errors 0
>> resp_cqe_flush_error 0 req_cqe_flush_error 0 roce_adp_retrans 0 roce_adp_retrans_to 0
>> roce_slow_restart 0 roce_slow_restart_cnps 0 roce_slow_restart_trans 0 rp_cnp_ignored 0
>> rp_cnp_handled 0 np_ecn_marked_roce_packets 0 np_cnp_sent 0 rx_icrc_encapsulated 0
>>      Optional-set: cc_rx_ce_pkts 0 cc_tx_cnp_pkts 0
> 
> Also looks bad, optional counters should not be marked specially at
> this point.

Will put optional counters in the last, like this:

$ rdma statistic show link rocep8s0f0/1
link rocep8s0f0/1 rx_write_requests 0 rx_read_requests 0 
rx_atomic_requests 0 out_of_buffer 0
out_of_sequence 0 duplicate_request 0 rnr_nak_retry_err 0 packet_seq_err 
0 implied_nak_seq_err 0
local_ack_timeout_err 0 resp_local_length_error 0 resp_cqe_error 0 
req_cqe_error 0
req_remote_invalid_request 0 req_remote_access_errors 0 
resp_remote_access_errors 0
resp_cqe_flush_error 0 req_cqe_flush_error 0 roce_adp_retrans 0 
roce_adp_retrans_to 0
roce_slow_restart 0 roce_slow_restart_cnps 0 roce_slow_restart_trans 0 
rp_cnp_ignored 0
rp_cnp_handled 0 np_ecn_marked_roce_packets 0 np_cnp_sent 0 
rx_icrc_encapsulated 0 cc_rx_ce_pkts 0 cc_tx_cnp_pkts 0

>> Aharon Landau (9):
>>    net/mlx5: Add support in bth_opcode as a match criteria
>>    net/mlx5: Add priorities for counters in RDMA namespaces
>>    RDMA/counters: Support to allocate per-port optional counter
>>      statistics
>>    RDMA/mlx5: Add alloc_op_port_stats() support
>>    RDMA/mlx5: Add steering support in optional flow counters
>>    RDMA/nldev: Add support to add and remove optional counters
>>    RDMA/mlx5: Add add_op_stat() and remove_op_stat() support
>>    RDMA/mlx5: Add get_op_stats() support
>>    RDMA/nldev: Add support to get current enabled optional counters
>>
>> Neta Ostrovsky (1):
>>    RDMA/nldev: Add support to get optional counters statistics
> 
> This series is in a poor order, all the core update should come first
> and the commit messages should explain what is going on when building
> out the new APIs.
> 
> The RDMA/mlx5 patches can go last

Will fix it, thanks Jason.

> Jason
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH rdma-next 06/10] RDMA/nldev: Add support to add and remove optional counters
  2021-08-23 19:42   ` Jason Gunthorpe
@ 2021-08-24  2:09     ` Mark Zhang
  0 siblings, 0 replies; 23+ messages in thread
From: Mark Zhang @ 2021-08-24  2:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dledford, saeedm, linux-rdma, netdev, aharonl, netao, leonro

On 8/24/2021 3:42 AM, Jason Gunthorpe wrote:
> On Wed, Aug 18, 2021 at 02:24:24PM +0300, Mark Zhang wrote:
>> From: Aharon Landau <aharonl@nvidia.com>
>>
>> This patch adds the ability to add/remove optional counter to a link
>> through RDMA netlink. Limit it to users with ADMIN capability only.
>>
>> Examples:
>> $ sudo rdma statistic add link rocep8s0f0/1 optional-set cc_rx_ce_pkts
>> $ sudo rdma statistic remove link rocep8s0f0/1 optional-set cc_rx_ce_pkts
>>
>> Signed-off-by: Aharon Landau <aharonl@nvidia.com>
>> Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
>> Signed-off-by: Mark Zhang <markzhang@nvidia.com>
>>   drivers/infiniband/core/counters.c | 50 ++++++++++++++++
>>   drivers/infiniband/core/device.c   |  2 +
>>   drivers/infiniband/core/nldev.c    | 93 ++++++++++++++++++++++++++++++
>>   include/rdma/ib_verbs.h            |  7 +++
>>   include/rdma/rdma_counter.h        |  4 ++
>>   include/rdma/rdma_netlink.h        |  1 +
>>   include/uapi/rdma/rdma_netlink.h   |  9 +++
>>   7 files changed, 166 insertions(+)
>>
>> diff --git a/drivers/infiniband/core/counters.c b/drivers/infiniband/core/counters.c

...

>> +static int nldev_stat_set_op_stat(struct sk_buff *skb,
>> +				  struct nlmsghdr *nlh,
>> +				  struct netlink_ext_ack *extack,
>> +				  bool cmd_add)
>> +{

...

>> +
>> +	nlh = nlmsg_put(msg, NETLINK_CB(skb).portid, nlh->nlmsg_seq,
>> +			RDMA_NL_GET_TYPE(RDMA_NL_NLDEV,
>> +					 (cmd_add ?
>> +					  RDMA_NLDEV_CMD_STAT_ADD_OPCOUNTER :
>> +					  RDMA_NLDEV_CMD_STAT_REMOVE_OPCOUNTER)),
>> +			0, 0);
>> +
>> +	if (cmd_add)
>> +		ret = rdma_opcounter_add(device, port, opcounter);
>> +	else
>> +		ret = rdma_opcounter_remove(device, port, opcounter);
>> +	if (ret)
>> +		goto err_msg;
>> +
>> +	nlmsg_end(msg, nlh);
> 
> Shouldn't the netlink message for a 'set' always return the current
> value of the thing being set on return? Eg the same output that GET
> would generate?

May I ask why can't just return an error code?

>> +static int nldev_stat_add_op_stat_doit(struct sk_buff *skb,
>> +				       struct nlmsghdr *nlh,
>> +				       struct netlink_ext_ack *extack)
>> +{
>> +	return nldev_stat_set_op_stat(skb, nlh, extack, true);
>> +}
>> +
>> +static int nldev_stat_remove_op_stat_doit(struct sk_buff *skb,
>> +					  struct nlmsghdr *nlh,
>> +					  struct netlink_ext_ack *extack)
>> +{
>> +	return nldev_stat_set_op_stat(skb, nlh, extack, false);
>> +}
>> +
>>   static int nldev_stat_set_doit(struct sk_buff *skb, struct nlmsghdr *nlh,
>>   			       struct netlink_ext_ack *extack)
>>   {
>> @@ -2342,6 +2427,14 @@ static const struct rdma_nl_cbs nldev_cb_table[RDMA_NLDEV_NUM_OPS] = {
>>   		.dump = nldev_res_get_mr_raw_dumpit,
>>   		.flags = RDMA_NL_ADMIN_PERM,
>>   	},
>> +	[RDMA_NLDEV_CMD_STAT_ADD_OPCOUNTER] = {
>> +		.doit = nldev_stat_add_op_stat_doit,
>> +		.flags = RDMA_NL_ADMIN_PERM,
>> +	},
>> +	[RDMA_NLDEV_CMD_STAT_REMOVE_OPCOUNTER] = {
>> +		.doit = nldev_stat_remove_op_stat_doit,
>> +		.flags = RDMA_NL_ADMIN_PERM,
>> +	},
>>   };
> 
> And here I wonder if this is the cannonical way to manipulate lists of
> strings in netlink? I'm trying to think of another case like this, did
> you reference something?

For add/remove, we only support one op-counter at one time (for 
simplicity), so it's just a string, not a list of string.

This is supported:
#  rdma stat add link mlx5_0/1 optional-set cc_rx_ce_pkts

This is not supported:
# rdma stat add link mlx5_0/1 optional-set cc_rx_ce_pkts cc_tx_cnp_pkts

> Are you sure this shouldn't be done via some set on some counter
> object?

Currently we don't support do on a counter object, just per-port.

> Jason
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH rdma-next 10/10] RDMA/nldev: Add support to get current enabled optional counters
  2021-08-23 19:44   ` Jason Gunthorpe
@ 2021-08-24  2:13     ` Mark Zhang
  2021-08-24 13:13       ` Jason Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Zhang @ 2021-08-24  2:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dledford, saeedm, linux-rdma, netdev, aharonl, netao, leonro

On 8/24/2021 3:44 AM, Jason Gunthorpe wrote:
> On Wed, Aug 18, 2021 at 02:24:28PM +0300, Mark Zhang wrote:
> 
>> diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
>> index 79e6ca87d2e0..57f39d8fe434 100644
>> +++ b/include/uapi/rdma/rdma_netlink.h
>> @@ -557,6 +557,8 @@ enum rdma_nldev_attr {
>>   	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY,	/* nested table */
>>   	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_NAME,	/* string */
>>   	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_VALUE,	/* u64 */
>> +	RDMA_NLDEV_ATTR_STAT_OP_MODE_LIST,		/* u8 */
>> +	RDMA_NLDEV_ATTR_STAT_OP_MODE_LIST_SUPPORTED,	/* u8 */
> 
> See, here - shouldn't manipulation of MODE_LIST be done by a normal
> RDMA_NLDEV_CMD_STAT_SET with the new MODE_LIST array? This doesn't seem
> netlinky at all..

Both of them are flags and this is a "get" operation; "MODE_LIST" asks 
kernel to return currently enabled op-counters, "MODE_LIST_SUPPORTED" 
asks kernel to return supported op-counters. Maybe the macro name are 
not good?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH rdma-next 03/10] RDMA/counters: Support to allocate per-port optional counter statistics
  2021-08-23 19:30   ` Jason Gunthorpe
@ 2021-08-24  6:22     ` Mark Zhang
  2021-08-24 13:14       ` Jason Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Zhang @ 2021-08-24  6:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dledford, saeedm, linux-rdma, netdev, aharonl, netao, leonro

On 8/24/2021 3:30 AM, Jason Gunthorpe wrote:
> On Wed, Aug 18, 2021 at 02:24:21PM +0300, Mark Zhang wrote:
>> From: Aharon Landau <aharonl@nvidia.com>
>>
>> Add an alloc_op_port_stats() API, as well as related structures, to support
>> per-port op_stats allocation during counter module initialization.
>>
>> Signed-off-by: Aharon Landau <aharonl@nvidia.com>
>> Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
>> Signed-off-by: Mark Zhang <markzhang@nvidia.com>
>>   drivers/infiniband/core/counters.c | 18 ++++++++++++++++++
>>   drivers/infiniband/core/device.c   |  1 +
>>   include/rdma/ib_verbs.h            | 24 ++++++++++++++++++++++++
>>   include/rdma/rdma_counter.h        |  1 +
>>   4 files changed, 44 insertions(+)
>>
>> diff --git a/drivers/infiniband/core/counters.c b/drivers/infiniband/core/counters.c
>> index df9e6c5e4ddf..b8b6db98bfdf 100644
>> +++ b/drivers/infiniband/core/counters.c
>> @@ -611,6 +611,15 @@ void rdma_counter_init(struct ib_device *dev)
>>   		port_counter->hstats = dev->ops.alloc_hw_port_stats(dev, port);
>>   		if (!port_counter->hstats)
>>   			goto fail;
>> +
>> +		if (dev->ops.alloc_op_port_stats) {
>> +			port_counter->opstats =
>> +				dev->ops.alloc_op_port_stats(dev, port);
>> +			if (!port_counter->opstats)
>> +				goto fail;
> 
> It would be nicer to change the normal stats to have more detailed
> meta information instead of adding an entire parallel interface like
> this.
> 
> struct rdma_hw_stats {
> 	struct mutex	lock;
> 	unsigned long	timestamp;
> 	unsigned long	lifespan;
> 	const char * const *names;
> 
> Change the names to a struct
> 
>   const struct rdma_hw_stat_desc *descs;
> 
> struct rdma_hw_stat_desc {
>     const char *name;
>     unsigned int flags;
>     unsigned int private;
> }
> 
> and then define a FLAG_OPTIONAL.
> 
> Then alot of this oddness goes away.
> 
> You might also need a small allocated bitmap to store the
> enabled/disabled state
> 
> Then the series basically boils down to adding some 'modify counter'
> driver op that flips the enable/disable flag
> 
> And the netlink patches to expose the additional information.

Maybe it can be defined like this:

struct rdma_stat_desc {
         bool enabled;
         const char *name;
         u64 value;
};

struct rdma_hw_stats {
         struct mutex    lock; /* Protect lifespan and values[] */
         unsigned long   timestamp;
         unsigned long   lifespan;
         int             num_counters;
         unsigned int    private;  // ?
         u64             flags;    // 0 or FLAG_OPTIONAL
         struct rdma_stat_desc descs[];
         //const char * const *names;
         //u64           value[]; 
 

};

What does the "private" field mean? Driver-specific? Aren't all counters 
driver-specific?

Thanks.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH rdma-next 00/10] Optional counter statistics support
  2021-08-24  1:44   ` Mark Zhang
@ 2021-08-24 13:11     ` Jason Gunthorpe
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2021-08-24 13:11 UTC (permalink / raw)
  To: Mark Zhang; +Cc: dledford, saeedm, linux-rdma, netdev, aharonl, netao, leonro

On Tue, Aug 24, 2021 at 09:44:26AM +0800, Mark Zhang wrote:
> On 8/24/2021 3:33 AM, Jason Gunthorpe wrote:
> > On Wed, Aug 18, 2021 at 02:24:18PM +0300, Mark Zhang wrote:
> > > Hi,
> > > 
> > > This series from Aharon and Neta provides an extension to the rdma
> > > statistics tool that allows to add and remove optional counters
> > > dynamically, using new netlink commands.
> > > 
> > > The idea of having optional counters is to provide to the users the
> > > ability to get statistics of counters that hurts performance.
> > > 
> > > Once an optional counter was added, its statistics will be presented
> > > along with all the counters, using the show command.
> > > 
> > > Binding objects to the optional counters is currently not supported,
> > > neither in auto mode nor in manual mode.
> > > 
> > > To get the list of optional counters that are supported on this device,
> > > use "rdma statistic mode supported". To see which counters are currently
> > > enabled, use "rdma statistic mode".
> > > 
> > > $ rdma statistic mode supported
> > > link rocep8s0f0/1
> > >      Optional-set: cc_rx_ce_pkts cc_rx_cnp_pkts cc_tx_cnp_pkts
> > > link rocep8s0f1/1
> > >      Optional-set: cc_rx_ce_pkts cc_rx_cnp_pkts cc_tx_cnp_pkts
> > > 
> > > $ sudo rdma statistic add link rocep8s0f0/1 optional-set cc_rx_ce_pkts
> > > $ rdma statistic mode
> > > link rocep8s0f0/1
> > >      Optional-set: cc_rx_ce_pkts
> > > $ sudo rdma statistic add link rocep8s0f0/1 optional-set cc_tx_cnp_pkts
> > > $ rdma statistic mode
> > > link rocep8s0f0/1
> > >      Optional-set: cc_rx_ce_pkts cc_tx_cnp_pkts
> > 
> > This doesn't look like the right output to iproute to me, the two
> > command should not be using the same tag and the output of iproute
> > should always be formed to be valid input to iproute
> 
> So it should be like this:
> 
> $ rdma statistic mode supported
> link rocep8s0f0/1 optional-set cc_rx_ce_pkts cc_rx_cnp_pkts  cc_tx_cnp_pkts
> link rocep8s0f1/1 optional-set cc_rx_ce_pkts cc_rx_cnp_pkts cc_tx_cnp_pkts

Each netlink tag in the protocol should have a unique string in the
output. So you need strings that mean "optional set supported" and
"optional set currently enabled"

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH rdma-next 10/10] RDMA/nldev: Add support to get current enabled optional counters
  2021-08-24  2:13     ` Mark Zhang
@ 2021-08-24 13:13       ` Jason Gunthorpe
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2021-08-24 13:13 UTC (permalink / raw)
  To: Mark Zhang; +Cc: dledford, saeedm, linux-rdma, netdev, aharonl, netao, leonro

On Tue, Aug 24, 2021 at 10:13:34AM +0800, Mark Zhang wrote:
> On 8/24/2021 3:44 AM, Jason Gunthorpe wrote:
> > On Wed, Aug 18, 2021 at 02:24:28PM +0300, Mark Zhang wrote:
> > 
> > > diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
> > > index 79e6ca87d2e0..57f39d8fe434 100644
> > > +++ b/include/uapi/rdma/rdma_netlink.h
> > > @@ -557,6 +557,8 @@ enum rdma_nldev_attr {
> > >   	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY,	/* nested table */
> > >   	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_NAME,	/* string */
> > >   	RDMA_NLDEV_ATTR_STAT_OPCOUNTER_ENTRY_VALUE,	/* u64 */
> > > +	RDMA_NLDEV_ATTR_STAT_OP_MODE_LIST,		/* u8 */
> > > +	RDMA_NLDEV_ATTR_STAT_OP_MODE_LIST_SUPPORTED,	/* u8 */
> > 
> > See, here - shouldn't manipulation of MODE_LIST be done by a normal
> > RDMA_NLDEV_CMD_STAT_SET with the new MODE_LIST array? This doesn't seem
> > netlinky at all..
> 
> Both of them are flags and this is a "get" operation; "MODE_LIST" asks
> kernel to return currently enabled op-counters, "MODE_LIST_SUPPORTED" asks
> kernel to return supported op-counters. Maybe the macro name are not good?

The marcors are fine, the protocol is just a bit wonky. The ADD/REMOVE
idea should only be used on top level objects, but this is a nested
sub so you should be using SET to manipulate it and it should provide
the entire current list, not a add/remove type operation.

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH rdma-next 03/10] RDMA/counters: Support to allocate per-port optional counter statistics
  2021-08-24  6:22     ` Mark Zhang
@ 2021-08-24 13:14       ` Jason Gunthorpe
  0 siblings, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2021-08-24 13:14 UTC (permalink / raw)
  To: Mark Zhang; +Cc: dledford, saeedm, linux-rdma, netdev, aharonl, netao, leonro

On Tue, Aug 24, 2021 at 02:22:52PM +0800, Mark Zhang wrote:
> On 8/24/2021 3:30 AM, Jason Gunthorpe wrote:
> > On Wed, Aug 18, 2021 at 02:24:21PM +0300, Mark Zhang wrote:
> > > From: Aharon Landau <aharonl@nvidia.com>
> > > 
> > > Add an alloc_op_port_stats() API, as well as related structures, to support
> > > per-port op_stats allocation during counter module initialization.
> > > 
> > > Signed-off-by: Aharon Landau <aharonl@nvidia.com>
> > > Signed-off-by: Neta Ostrovsky <netao@nvidia.com>
> > > Signed-off-by: Mark Zhang <markzhang@nvidia.com>
> > >   drivers/infiniband/core/counters.c | 18 ++++++++++++++++++
> > >   drivers/infiniband/core/device.c   |  1 +
> > >   include/rdma/ib_verbs.h            | 24 ++++++++++++++++++++++++
> > >   include/rdma/rdma_counter.h        |  1 +
> > >   4 files changed, 44 insertions(+)
> > > 
> > > diff --git a/drivers/infiniband/core/counters.c b/drivers/infiniband/core/counters.c
> > > index df9e6c5e4ddf..b8b6db98bfdf 100644
> > > +++ b/drivers/infiniband/core/counters.c
> > > @@ -611,6 +611,15 @@ void rdma_counter_init(struct ib_device *dev)
> > >   		port_counter->hstats = dev->ops.alloc_hw_port_stats(dev, port);
> > >   		if (!port_counter->hstats)
> > >   			goto fail;
> > > +
> > > +		if (dev->ops.alloc_op_port_stats) {
> > > +			port_counter->opstats =
> > > +				dev->ops.alloc_op_port_stats(dev, port);
> > > +			if (!port_counter->opstats)
> > > +				goto fail;
> > 
> > It would be nicer to change the normal stats to have more detailed
> > meta information instead of adding an entire parallel interface like
> > this.
> > 
> > struct rdma_hw_stats {
> > 	struct mutex	lock;
> > 	unsigned long	timestamp;
> > 	unsigned long	lifespan;
> > 	const char * const *names;
> > 
> > Change the names to a struct
> > 
> >   const struct rdma_hw_stat_desc *descs;
> > 
> > struct rdma_hw_stat_desc {
> >     const char *name;
> >     unsigned int flags;
> >     unsigned int private;
> > }
> > 
> > and then define a FLAG_OPTIONAL.
> > 
> > Then alot of this oddness goes away.
> > 
> > You might also need a small allocated bitmap to store the
> > enabled/disabled state
> > 
> > Then the series basically boils down to adding some 'modify counter'
> > driver op that flips the enable/disable flag
> > 
> > And the netlink patches to expose the additional information.
> 
> Maybe it can be defined like this:
> 
> struct rdma_stat_desc {
>         bool enabled;

This isn't quite a nice because the definition array can't be setup as
a static const inside the driver code. You'd have to memory copy to
set it up.

> What does the "private" field mean? Driver-specific? Aren't all counters
> driver-specific?

The other patch had used it to identify the counter

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-08-24 13:14 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-18 11:24 [PATCH rdma-next 00/10] Optional counter statistics support Mark Zhang
2021-08-18 11:24 ` [PATCH rdma-next 01/10] net/mlx5: Add support in bth_opcode as a match criteria Mark Zhang
2021-08-18 11:24 ` [PATCH rdma-next 02/10] net/mlx5: Add priorities for counters in RDMA namespaces Mark Zhang
2021-08-18 11:24 ` [PATCH rdma-next 03/10] RDMA/counters: Support to allocate per-port optional counter statistics Mark Zhang
2021-08-23 19:30   ` Jason Gunthorpe
2021-08-24  6:22     ` Mark Zhang
2021-08-24 13:14       ` Jason Gunthorpe
2021-08-18 11:24 ` [PATCH rdma-next 04/10] RDMA/mlx5: Add alloc_op_port_stats() support Mark Zhang
2021-08-23 19:19   ` Jason Gunthorpe
2021-08-18 11:24 ` [PATCH rdma-next 05/10] RDMA/mlx5: Add steering support in optional flow counters Mark Zhang
2021-08-18 11:24 ` [PATCH rdma-next 06/10] RDMA/nldev: Add support to add and remove optional counters Mark Zhang
2021-08-23 19:42   ` Jason Gunthorpe
2021-08-24  2:09     ` Mark Zhang
2021-08-18 11:24 ` [PATCH rdma-next 07/10] RDMA/mlx5: Add add_op_stat() and remove_op_stat() support Mark Zhang
2021-08-18 11:24 ` [PATCH rdma-next 08/10] RDMA/nldev: Add support to get optional counters statistics Mark Zhang
2021-08-18 11:24 ` [PATCH rdma-next 09/10] RDMA/mlx5: Add get_op_stats() support Mark Zhang
2021-08-18 11:24 ` [PATCH rdma-next 10/10] RDMA/nldev: Add support to get current enabled optional counters Mark Zhang
2021-08-23 19:44   ` Jason Gunthorpe
2021-08-24  2:13     ` Mark Zhang
2021-08-24 13:13       ` Jason Gunthorpe
2021-08-23 19:33 ` [PATCH rdma-next 00/10] Optional counter statistics support Jason Gunthorpe
2021-08-24  1:44   ` Mark Zhang
2021-08-24 13:11     ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).