* [RFC PATCH net-next v4 0/6] Implement devlink-rate API and extend it
@ 2022-09-15 13:42 Michal Wilczynski
  2022-09-15 13:42 ` [RFC PATCH net-next v4 1/6] ice: Add function for move/reconfigure TxQ AQ command Michal Wilczynski
                   ` (6 more replies)
  0 siblings, 7 replies; 35+ messages in thread
From: Michal Wilczynski @ 2022-09-15 13:42 UTC (permalink / raw)
  To: netdev
  Cc: alexandr.lobakin, dchumak, maximmi, jiri, simon.horman,
	jacob.e.keller, jesse.brandeburg, przemyslaw.kitszel,
	Michal Wilczynski

This patch series implements devlink-rate for the ice driver.
Unfortunately the current API isn't flexible enough for our use case,
so it needs to be extended. A new object type, 'queue', is introduced,
and several functions have been made non-static to enable the driver
to export its current Tx scheduling configuration.

This patch series is a follow up for this thread:
https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/T/#u

V4:
- replaced the static counter variable with a per-port IDA to
  uniquely identify nodes

V3:
- removed shift macros, since FIELD_PREP is used
- added static_assert for struct
- removed unnecessary functions
- used tab instead of space in define

V2:
- addressed Alexandr's review comments
- refactored code to fix checkpatch issues
- added mutual exclusion for RDMA and DCB


Ben Shelton (1):
  ice: Add function for move/reconfigure TxQ AQ command

Michal Wilczynski (5):
  devlink: Extend devlink-rate api with queues and new parameters
  ice: Introduce new parameters in ice_sched_node
  ice: Implement devlink-rate API
  ice: Export Tx scheduler configuration to devlink-rate
  ice: Prevent ADQ, DCB and RDMA coexistence with Custom Tx scheduler

 .../net/ethernet/intel/ice/ice_adminq_cmd.h   |  41 +-
 drivers/net/ethernet/intel/ice/ice_common.c   |  75 ++-
 drivers/net/ethernet/intel/ice/ice_common.h   |   8 +
 drivers/net/ethernet/intel/ice/ice_dcb.c      |   2 +-
 drivers/net/ethernet/intel/ice/ice_dcb_lib.c  |   4 +
 drivers/net/ethernet/intel/ice/ice_devlink.c  | 598 ++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_devlink.h  |   2 +
 drivers/net/ethernet/intel/ice/ice_idc.c      |   5 +
 drivers/net/ethernet/intel/ice/ice_main.c     |   2 +
 drivers/net/ethernet/intel/ice/ice_sched.c    |  81 ++-
 drivers/net/ethernet/intel/ice/ice_sched.h    |  27 +-
 drivers/net/ethernet/intel/ice/ice_type.h     |   7 +
 drivers/net/ethernet/intel/ice/ice_virtchnl.c |  10 +
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   6 +-
 .../mellanox/mlx5/core/esw/devlink_port.c     |   8 +-
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c |  12 +-
 .../net/ethernet/mellanox/mlx5/core/esw/qos.h |  10 +-
 drivers/net/netdevsim/dev.c                   |  32 +-
 include/net/devlink.h                         |  56 +-
 include/uapi/linux/devlink.h                  |   8 +-
 net/core/devlink.c                            | 407 ++++++++++--
 21 files changed, 1284 insertions(+), 117 deletions(-)

-- 
2.35.1



* [RFC PATCH net-next v4 1/6] ice: Add function for move/reconfigure TxQ AQ command
  2022-09-15 13:42 [RFC PATCH net-next v4 0/6] Implement devlink-rate API and extend it Michal Wilczynski
@ 2022-09-15 13:42 ` Michal Wilczynski
  2022-09-15 13:42 ` [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters Michal Wilczynski
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Michal Wilczynski @ 2022-09-15 13:42 UTC (permalink / raw)
  To: netdev
  Cc: alexandr.lobakin, dchumak, maximmi, jiri, simon.horman,
	jacob.e.keller, jesse.brandeburg, przemyslaw.kitszel,
	Ben Shelton, Michal Wilczynski

From: Ben Shelton <benjamin.h.shelton@intel.com>

Currently there is no way to reconfigure queues in the ice driver's
Tx scheduler topology. Add a function that allows us to do so. This
will let the user change the topology manually through the
devlink-rate interface.

Signed-off-by: Ben Shelton <benjamin.h.shelton@intel.com>
Co-developed-by: Michal Wilczynski <michal.wilczynski@intel.com>
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
---
 .../net/ethernet/intel/ice/ice_adminq_cmd.h   | 37 ++++++++++
 drivers/net/ethernet/intel/ice/ice_common.c   | 70 +++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_common.h   |  8 +++
 drivers/net/ethernet/intel/ice/ice_main.c     |  2 +
 4 files changed, 117 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 1bdc70aa979d..7574267f014e 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -1910,6 +1910,40 @@ struct ice_aqc_dis_txq_item {
 	__le16 q_id[];
 } __packed;
 
+struct ice_aqc_move_txqs {
+	u8 cmd_type;
+#define ICE_AQC_Q_CMD_TYPE_M		GENMASK(1, 0)
+#define ICE_AQC_Q_CMD_TYPE_MOVE		1
+#define ICE_AQC_Q_CMD_TYPE_TC_CHANGE	2
+#define ICE_AQC_Q_CMD_TYPE_MOVE_AND_TC	3
+#define ICE_AQC_Q_CMD_SUBSEQ_CALL	BIT(2)
+#define ICE_AQC_Q_CMD_FLUSH_PIPE	BIT(3)
+
+	u8 num_qs;
+	u8 rsvd;
+	u8 timeout;
+#define ICE_AQC_Q_CMD_TIMEOUT_M		GENMASK(5, 2)
+
+	__le32 blocked_cgds;
+	__le32 addr_high;
+	__le32 addr_low;
+};
+
+static_assert(sizeof(struct ice_aqc_move_txqs) == 16);
+
+struct ice_aqc_move_txqs_elem {
+	__le16 txq_id;
+	u8 q_cgd;
+	u8 rsvd;
+	__le32 q_teid;
+};
+
+struct ice_aqc_move_txqs_data {
+	__le32 src_teid;
+	__le32 dest_teid;
+	struct ice_aqc_move_txqs_elem txqs[];
+};
+
 /* Add Tx RDMA Queue Set (indirect 0x0C33) */
 struct ice_aqc_add_rdma_qset {
 	u8 num_qset_grps;
@@ -2148,6 +2182,7 @@ struct ice_aq_desc {
 		struct ice_aqc_get_topo get_topo;
 		struct ice_aqc_sched_elem_cmd sched_elem_cmd;
 		struct ice_aqc_query_txsched_res query_sched_res;
+		struct ice_aqc_move_txqs move_txqs;
 		struct ice_aqc_query_port_ets port_ets;
 		struct ice_aqc_rl_profile rl_profile;
 		struct ice_aqc_nvm nvm;
@@ -2207,6 +2242,7 @@ enum ice_aq_err {
 	ICE_AQ_RC_OK		= 0,  /* Success */
 	ICE_AQ_RC_EPERM		= 1,  /* Operation not permitted */
 	ICE_AQ_RC_ENOENT	= 2,  /* No such element */
+	ICE_AQ_RC_EAGAIN	= 8,  /* Try again */
 	ICE_AQ_RC_ENOMEM	= 9,  /* Out of memory */
 	ICE_AQ_RC_EBUSY		= 12, /* Device or resource busy */
 	ICE_AQ_RC_EEXIST	= 13, /* Object already exists */
@@ -2342,6 +2378,7 @@ enum ice_adminq_opc {
 	/* Tx queue handling commands/events */
 	ice_aqc_opc_add_txqs				= 0x0C30,
 	ice_aqc_opc_dis_txqs				= 0x0C31,
+	ice_aqc_opc_move_recfg_txqs			= 0x0C32,
 	ice_aqc_opc_add_rdma_qset			= 0x0C33,
 
 	/* package commands */
diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c
index bec770e34f39..39769141c8e8 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -7,6 +7,7 @@
 #include "ice_flow.h"
 
 #define ICE_PF_RESET_WAIT_COUNT	300
+#define ICE_LAN_TXQ_MOVE_TIMEOUT_MAX	50
 
 static const char * const ice_link_mode_str_low[] = {
 	[0] = "100BASE_TX",
@@ -4189,6 +4190,75 @@ ice_aq_dis_lan_txq(struct ice_hw *hw, u8 num_qgrps,
 	return status;
 }
 
+/**
+ * ice_aq_move_recfg_lan_txq
+ * @hw: pointer to the hardware structure
+ * @num_qs: number of queues to move/reconfigure
+ * @is_move: true if this operation involves node movement
+ * @is_tc_change: true if this operation involves a TC change
+ * @subseq_call: true if this operation is a subsequent call
+ * @flush_pipe: on timeout, true to flush pipe, false to return EAGAIN
+ * @timeout: timeout in units of 100 usec (valid values 0-50)
+ * @blocked_cgds: out param, bitmap of CGDs that timed out if returning EAGAIN
+ * @buf: struct containing src/dest TEID and per-queue info
+ * @buf_size: size of buffer for indirect command
+ * @txqs_moved: out param, number of queues successfully moved
+ * @cd: pointer to command details structure or NULL
+ *
+ * Move / Reconfigure Tx LAN queues (0x0C32)
+ */
+int
+ice_aq_move_recfg_lan_txq(struct ice_hw *hw, u8 num_qs, bool is_move,
+			  bool is_tc_change, bool subseq_call, bool flush_pipe,
+			  u8 timeout, u32 *blocked_cgds,
+			  struct ice_aqc_move_txqs_data *buf, u16 buf_size,
+			  u8 *txqs_moved, struct ice_sq_cd *cd)
+{
+	struct ice_aqc_move_txqs *cmd;
+	struct ice_aq_desc desc;
+	int status;
+
+	cmd = &desc.params.move_txqs;
+	ice_fill_dflt_direct_cmd_desc(&desc, ice_aqc_opc_move_recfg_txqs);
+
+	if (timeout > ICE_LAN_TXQ_MOVE_TIMEOUT_MAX)
+		return -EINVAL;
+
+	if (is_tc_change && !flush_pipe && !blocked_cgds)
+		return -EINVAL;
+
+	if (!is_move && !is_tc_change)
+		return -EINVAL;
+
+	desc.flags |= cpu_to_le16(ICE_AQ_FLAG_RD);
+
+	if (is_move)
+		cmd->cmd_type |= FIELD_PREP(ICE_AQC_Q_CMD_TYPE_M, ICE_AQC_Q_CMD_TYPE_MOVE);
+
+	if (is_tc_change)
+		cmd->cmd_type |= FIELD_PREP(ICE_AQC_Q_CMD_TYPE_M, ICE_AQC_Q_CMD_TYPE_TC_CHANGE);
+
+	if (subseq_call)
+		cmd->cmd_type |= ICE_AQC_Q_CMD_SUBSEQ_CALL;
+
+	if (flush_pipe)
+		cmd->cmd_type |= ICE_AQC_Q_CMD_FLUSH_PIPE;
+
+	cmd->num_qs = num_qs;
+	cmd->timeout = FIELD_PREP(ICE_AQC_Q_CMD_TIMEOUT_M, timeout);
+
+	status = ice_aq_send_cmd(hw, &desc, buf, buf_size, cd);
+
+	if (!status && txqs_moved)
+		*txqs_moved = cmd->num_qs;
+
+	if (hw->adminq.sq_last_status == ICE_AQ_RC_EAGAIN &&
+	    is_tc_change && !flush_pipe)
+		*blocked_cgds = le32_to_cpu(cmd->blocked_cgds);
+
+	return status;
+}
+
 /**
  * ice_aq_add_rdma_qsets
  * @hw: pointer to the hardware structure
diff --git a/drivers/net/ethernet/intel/ice/ice_common.h b/drivers/net/ethernet/intel/ice/ice_common.h
index 8b6712b92e84..996dc7d81f62 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.h
+++ b/drivers/net/ethernet/intel/ice/ice_common.h
@@ -185,6 +185,14 @@ int
 ice_ena_vsi_txq(struct ice_port_info *pi, u16 vsi_handle, u8 tc, u16 q_handle,
 		u8 num_qgrps, struct ice_aqc_add_tx_qgrp *buf, u16 buf_size,
 		struct ice_sq_cd *cd);
+
+int
+ice_aq_move_recfg_lan_txq(struct ice_hw *hw, u8 num_qs, bool is_move,
+			  bool is_tc_change, bool subseq_call, bool flush_pipe,
+			  u8 timeout, u32 *blocked_cgds,
+			  struct ice_aqc_move_txqs_data *buf, u16 buf_size,
+			  u8 *txqs_moved, struct ice_sq_cd *cd);
+
 int ice_replay_vsi(struct ice_hw *hw, u16 vsi_handle);
 void ice_replay_post(struct ice_hw *hw);
 void ice_output_fw_log(struct ice_hw *hw, struct ice_aq_desc *desc, void *buf);
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 7f59050e4122..20f8bb08e479 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -7377,6 +7377,8 @@ const char *ice_aq_str(enum ice_aq_err aq_err)
 	switch (aq_err) {
 	case ICE_AQ_RC_OK:
 		return "OK";
+	case ICE_AQ_RC_EAGAIN:
+		return "ICE_AQ_RC_EAGAIN";
 	case ICE_AQ_RC_EPERM:
 		return "ICE_AQ_RC_EPERM";
 	case ICE_AQ_RC_ENOENT:
-- 
2.37.2



* [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-15 13:42 [RFC PATCH net-next v4 0/6] Implement devlink-rate API and extend it Michal Wilczynski
  2022-09-15 13:42 ` [RFC PATCH net-next v4 1/6] ice: Add function for move/reconfigure TxQ AQ command Michal Wilczynski
@ 2022-09-15 13:42 ` Michal Wilczynski
  2022-09-15 15:31   ` Edward Cree
  2022-09-15 13:42 ` [RFC PATCH net-next v4 3/6] ice: Introduce new parameters in ice_sched_node Michal Wilczynski
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Michal Wilczynski @ 2022-09-15 13:42 UTC (permalink / raw)
  To: netdev
  Cc: alexandr.lobakin, dchumak, maximmi, jiri, simon.horman,
	jacob.e.keller, jesse.brandeburg, przemyslaw.kitszel,
	Michal Wilczynski

Currently devlink-rate has only two types of objects: nodes and leafs.
This interface needs to be extended to account for a third type of
scheduling element - queues. In our use case the customer sends a
different type of traffic on each queue, which requires the ability
to assign rate parameters to individual queues.

Also, besides the two basic parameters currently provided, tx_share
and tx_max, we need two additional parameters to utilize the WFQ
(Weighted Fair Queueing) algorithm:

tx_priority - priority among siblings (0-7)
tx_weight - weight for the WFQ algorithm (1-200)

The whole lifecycle of a queue is managed by the driver. Using the
'devlink' utility, the user can reparent a queue to a different node
or change its parameters.

Example:
devlink port function rate set pci/0000:4b:00.0/queue/91 parent node_custom
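Under the proposed interface the new parameters would be set the same
way; the session below is a hypothetical sketch (the device address,
node name, queue id and values are illustrative, not taken from the
series):

```shell
# Create a custom node, then reparent a queue under it
devlink port function rate add pci/0000:4b:00.0/node_custom
devlink port function rate set pci/0000:4b:00.0/queue/91 parent node_custom

# New WFQ parameters proposed by this series
devlink port function rate set pci/0000:4b:00.0/queue/91 tx_priority 5
devlink port function rate set pci/0000:4b:00.0/queue/91 tx_weight 100

# Existing rate limits apply to queues as well
devlink port function rate set pci/0000:4b:00.0/queue/91 tx_max 500mbit
```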

Rename the current 'leaf' object to 'vport', since the old name no
longer makes sense and the only use case for leaf nodes so far seems
to be a 'vport' or 'VF'.

Allow creation of node elements from the driver. This is needed since
in our use case it makes sense for the driver to first export its
initial Tx topology.

Rearrange the members of the devlink_rate struct to remove holes.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
---
 .../net/ethernet/mellanox/mlx5/core/devlink.c |   6 +-
 .../mellanox/mlx5/core/esw/devlink_port.c     |   8 +-
 .../net/ethernet/mellanox/mlx5/core/esw/qos.c |  12 +-
 .../net/ethernet/mellanox/mlx5/core/esw/qos.h |  10 +-
 drivers/net/netdevsim/dev.c                   |  32 +-
 include/net/devlink.h                         |  56 ++-
 include/uapi/linux/devlink.h                  |   8 +-
 net/core/devlink.c                            | 407 +++++++++++++++---
 8 files changed, 436 insertions(+), 103 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
index 66c6a7017695..754318ea8cf6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/devlink.c
@@ -311,13 +311,13 @@ static const struct devlink_ops mlx5_devlink_ops = {
 	.eswitch_encap_mode_get = mlx5_devlink_eswitch_encap_mode_get,
 	.port_function_hw_addr_get = mlx5_devlink_port_function_hw_addr_get,
 	.port_function_hw_addr_set = mlx5_devlink_port_function_hw_addr_set,
-	.rate_leaf_tx_share_set = mlx5_esw_devlink_rate_leaf_tx_share_set,
-	.rate_leaf_tx_max_set = mlx5_esw_devlink_rate_leaf_tx_max_set,
+	.rate_vport_tx_share_set = mlx5_esw_devlink_rate_vport_tx_share_set,
+	.rate_vport_tx_max_set = mlx5_esw_devlink_rate_vport_tx_max_set,
 	.rate_node_tx_share_set = mlx5_esw_devlink_rate_node_tx_share_set,
 	.rate_node_tx_max_set = mlx5_esw_devlink_rate_node_tx_max_set,
 	.rate_node_new = mlx5_esw_devlink_rate_node_new,
 	.rate_node_del = mlx5_esw_devlink_rate_node_del,
-	.rate_leaf_parent_set = mlx5_esw_devlink_rate_parent_set,
+	.rate_vport_parent_set = mlx5_esw_devlink_rate_parent_set,
 #endif
 #ifdef CONFIG_MLX5_SF_MANAGER
 	.port_new = mlx5_devlink_sf_port_new,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
index 9bc7be95db54..cafd3bdcabc9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/devlink_port.c
@@ -91,7 +91,7 @@ int mlx5_esw_offloads_devlink_port_register(struct mlx5_eswitch *esw, u16 vport_
 	if (err)
 		goto reg_err;
 
-	err = devl_rate_leaf_create(dl_port, vport);
+	err = devl_rate_vport_create(dl_port, vport, NULL);
 	if (err)
 		goto rate_err;
 
@@ -118,7 +118,7 @@ void mlx5_esw_offloads_devlink_port_unregister(struct mlx5_eswitch *esw, u16 vpo
 
 	if (vport->dl_port->devlink_rate) {
 		mlx5_esw_qos_vport_update_group(esw, vport, NULL, NULL);
-		devl_rate_leaf_destroy(vport->dl_port);
+		devl_rate_vport_destroy(vport->dl_port);
 	}
 
 	devl_port_unregister(vport->dl_port);
@@ -160,7 +160,7 @@ int mlx5_esw_devlink_sf_port_register(struct mlx5_eswitch *esw, struct devlink_p
 	if (err)
 		return err;
 
-	err = devl_rate_leaf_create(dl_port, vport);
+	err = devl_rate_vport_create(dl_port, vport, NULL);
 	if (err)
 		goto rate_err;
 
@@ -182,7 +182,7 @@ void mlx5_esw_devlink_sf_port_unregister(struct mlx5_eswitch *esw, u16 vport_num
 
 	if (vport->dl_port->devlink_rate) {
 		mlx5_esw_qos_vport_update_group(esw, vport, NULL, NULL);
-		devl_rate_leaf_destroy(vport->dl_port);
+		devl_rate_vport_destroy(vport->dl_port);
 	}
 
 	devl_port_unregister(vport->dl_port);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
index 694c54066955..9af0101ea7b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.c
@@ -783,8 +783,8 @@ static int esw_qos_devlink_rate_to_mbps(struct mlx5_core_dev *mdev, const char *
 
 /* Eswitch devlink rate API */
 
-int mlx5_esw_devlink_rate_leaf_tx_share_set(struct devlink_rate *rate_leaf, void *priv,
-					    u64 tx_share, struct netlink_ext_ack *extack)
+int mlx5_esw_devlink_rate_vport_tx_share_set(struct devlink_rate *rate_leaf, void *priv,
+					     u64 tx_share, struct netlink_ext_ack *extack)
 {
 	struct mlx5_vport *vport = priv;
 	struct mlx5_eswitch *esw;
@@ -809,8 +809,8 @@ int mlx5_esw_devlink_rate_leaf_tx_share_set(struct devlink_rate *rate_leaf, void
 	return err;
 }
 
-int mlx5_esw_devlink_rate_leaf_tx_max_set(struct devlink_rate *rate_leaf, void *priv,
-					  u64 tx_max, struct netlink_ext_ack *extack)
+int mlx5_esw_devlink_rate_vport_tx_max_set(struct devlink_rate *rate_leaf, void *priv,
+					   u64 tx_max, struct netlink_ext_ack *extack)
 {
 	struct mlx5_vport *vport = priv;
 	struct mlx5_eswitch *esw;
@@ -936,11 +936,11 @@ int mlx5_esw_qos_vport_update_group(struct mlx5_eswitch *esw,
 
 int mlx5_esw_devlink_rate_parent_set(struct devlink_rate *devlink_rate,
 				     struct devlink_rate *parent,
-				     void *priv, void *parent_priv,
+				     void **priv, void *parent_priv,
 				     struct netlink_ext_ack *extack)
 {
 	struct mlx5_esw_rate_group *group;
-	struct mlx5_vport *vport = priv;
+	struct mlx5_vport *vport = *priv;
 
 	if (!parent)
 		return mlx5_esw_qos_vport_update_group(vport->dev->priv.eswitch,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.h b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.h
index 0141e9d52037..61359455a962 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/qos.h
@@ -10,10 +10,10 @@ int mlx5_esw_qos_set_vport_rate(struct mlx5_eswitch *esw, struct mlx5_vport *evp
 				u32 max_rate, u32 min_rate);
 void mlx5_esw_qos_vport_disable(struct mlx5_eswitch *esw, struct mlx5_vport *vport);
 
-int mlx5_esw_devlink_rate_leaf_tx_share_set(struct devlink_rate *rate_leaf, void *priv,
-					    u64 tx_share, struct netlink_ext_ack *extack);
-int mlx5_esw_devlink_rate_leaf_tx_max_set(struct devlink_rate *rate_leaf, void *priv,
-					  u64 tx_max, struct netlink_ext_ack *extack);
+int mlx5_esw_devlink_rate_vport_tx_share_set(struct devlink_rate *rate_leaf, void *priv,
+					     u64 tx_share, struct netlink_ext_ack *extack);
+int mlx5_esw_devlink_rate_vport_tx_max_set(struct devlink_rate *rate_leaf, void *priv,
+					   u64 tx_max, struct netlink_ext_ack *extack);
 int mlx5_esw_devlink_rate_node_tx_share_set(struct devlink_rate *rate_node, void *priv,
 					    u64 tx_share, struct netlink_ext_ack *extack);
 int mlx5_esw_devlink_rate_node_tx_max_set(struct devlink_rate *rate_node, void *priv,
@@ -24,7 +24,7 @@ int mlx5_esw_devlink_rate_node_del(struct devlink_rate *rate_node, void *priv,
 				   struct netlink_ext_ack *extack);
 int mlx5_esw_devlink_rate_parent_set(struct devlink_rate *devlink_rate,
 				     struct devlink_rate *parent,
-				     void *priv, void *parent_priv,
+				     void **priv, void *parent_priv,
 				     struct netlink_ext_ack *extack);
 #endif
 
diff --git a/drivers/net/netdevsim/dev.c b/drivers/net/netdevsim/dev.c
index 794fc0cc73b8..17073abc4af5 100644
--- a/drivers/net/netdevsim/dev.c
+++ b/drivers/net/netdevsim/dev.c
@@ -1166,8 +1166,8 @@ static int nsim_rate_bytes_to_units(char *name, u64 *rate, struct netlink_ext_ac
 	return 0;
 }
 
-static int nsim_leaf_tx_share_set(struct devlink_rate *devlink_rate, void *priv,
-				  u64 tx_share, struct netlink_ext_ack *extack)
+static int nsim_vport_tx_share_set(struct devlink_rate *devlink_rate, void *priv,
+				   u64 tx_share, struct netlink_ext_ack *extack)
 {
 	struct nsim_dev_port *nsim_dev_port = priv;
 	struct nsim_dev *nsim_dev = nsim_dev_port->ns->nsim_dev;
@@ -1182,8 +1182,8 @@ static int nsim_leaf_tx_share_set(struct devlink_rate *devlink_rate, void *priv,
 	return 0;
 }
 
-static int nsim_leaf_tx_max_set(struct devlink_rate *devlink_rate, void *priv,
-				u64 tx_max, struct netlink_ext_ack *extack)
+static int nsim_vport_tx_max_set(struct devlink_rate *devlink_rate, void *priv,
+				 u64 tx_max, struct netlink_ext_ack *extack)
 {
 	struct nsim_dev_port *nsim_dev_port = priv;
 	struct nsim_dev *nsim_dev = nsim_dev_port->ns->nsim_dev;
@@ -1273,10 +1273,10 @@ static int nsim_rate_node_del(struct devlink_rate *node, void *priv,
 	return 0;
 }
 
-static int nsim_rate_leaf_parent_set(struct devlink_rate *child,
-				     struct devlink_rate *parent,
-				     void *priv_child, void *priv_parent,
-				     struct netlink_ext_ack *extack)
+static int nsim_rate_vport_set(struct devlink_rate *child,
+			       struct devlink_rate *parent,
+			       void *priv_child, void *priv_parent,
+			       struct netlink_ext_ack *extack)
 {
 	struct nsim_dev_port *nsim_dev_port = priv_child;
 
@@ -1289,10 +1289,10 @@ static int nsim_rate_leaf_parent_set(struct devlink_rate *child,
 
 static int nsim_rate_node_parent_set(struct devlink_rate *child,
 				     struct devlink_rate *parent,
-				     void *priv_child, void *priv_parent,
+				     void **priv_child, void *priv_parent,
 				     struct netlink_ext_ack *extack)
 {
-	struct nsim_rate_node *nsim_node = priv_child;
+	struct nsim_rate_node *nsim_node = *priv_child;
 
 	if (parent)
 		nsim_node->parent_name = parent->name;
@@ -1332,13 +1332,13 @@ static const struct devlink_ops nsim_dev_devlink_ops = {
 	.trap_group_set = nsim_dev_devlink_trap_group_set,
 	.trap_policer_set = nsim_dev_devlink_trap_policer_set,
 	.trap_policer_counter_get = nsim_dev_devlink_trap_policer_counter_get,
-	.rate_leaf_tx_share_set = nsim_leaf_tx_share_set,
-	.rate_leaf_tx_max_set = nsim_leaf_tx_max_set,
+	.rate_vport_tx_share_set = nsim_vport_tx_share_set,
+	.rate_vport_tx_max_set = nsim_vport_tx_max_set,
 	.rate_node_tx_share_set = nsim_node_tx_share_set,
 	.rate_node_tx_max_set = nsim_node_tx_max_set,
 	.rate_node_new = nsim_rate_node_new,
 	.rate_node_del = nsim_rate_node_del,
-	.rate_leaf_parent_set = nsim_rate_leaf_parent_set,
+	.rate_vport_parent_set = nsim_rate_node_parent_set,
 	.rate_node_parent_set = nsim_rate_node_parent_set,
 	.trap_drop_counter_get = nsim_dev_devlink_trap_drop_counter_get,
 };
@@ -1391,8 +1391,8 @@ static int __nsim_dev_port_add(struct nsim_dev *nsim_dev, enum nsim_dev_port_typ
 	}
 
 	if (nsim_dev_port_is_vf(nsim_dev_port)) {
-		err = devl_rate_leaf_create(&nsim_dev_port->devlink_port,
-					    nsim_dev_port);
+		err = devl_rate_vport_create(&nsim_dev_port->devlink_port,
+					     nsim_dev_port, NULL);
 		if (err)
 			goto err_nsim_destroy;
 	}
@@ -1419,7 +1419,7 @@ static void __nsim_dev_port_del(struct nsim_dev_port *nsim_dev_port)
 
 	list_del(&nsim_dev_port->list);
 	if (nsim_dev_port_is_vf(nsim_dev_port))
-		devl_rate_leaf_destroy(&nsim_dev_port->devlink_port);
+		devl_rate_vport_destroy(&nsim_dev_port->devlink_port);
 	devlink_port_type_clear(devlink_port);
 	nsim_destroy(nsim_dev_port->ns);
 	nsim_dev_port_debugfs_exit(nsim_dev_port);
diff --git a/include/net/devlink.h b/include/net/devlink.h
index 264aa98e6da6..885e1999eb87 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -98,22 +98,22 @@ struct devlink_port_attrs {
 	};
 };
 
+#define DEVLINK_RATE_NAME_MAX_LEN 30
+
 struct devlink_rate {
 	struct list_head list;
-	enum devlink_rate_type type;
 	struct devlink *devlink;
 	void *priv;
 	u64 tx_share;
 	u64 tx_max;
-
 	struct devlink_rate *parent;
-	union {
-		struct devlink_port *devlink_port;
-		struct {
-			char *name;
-			refcount_t refcnt;
-		};
-	};
+	struct devlink_port *devlink_port;
+	char *name;
+	refcount_t refcnt;
+	enum devlink_rate_type type;
+	u16 queue_id;
+	u16 tx_priority;
+	u16 tx_weight;
 };
 
 struct devlink_port {
@@ -1487,26 +1487,46 @@ struct devlink_ops {
 	/**
 	 * Rate control callbacks.
 	 */
-	int (*rate_leaf_tx_share_set)(struct devlink_rate *devlink_rate, void *priv,
+	int (*rate_vport_tx_share_set)(struct devlink_rate *devlink_rate, void *priv,
 				      u64 tx_share, struct netlink_ext_ack *extack);
-	int (*rate_leaf_tx_max_set)(struct devlink_rate *devlink_rate, void *priv,
+	int (*rate_vport_tx_max_set)(struct devlink_rate *devlink_rate, void *priv,
 				    u64 tx_max, struct netlink_ext_ack *extack);
+	int (*rate_vport_tx_priority_set)(struct devlink_rate *devlink_rate, void *priv,
+				    u64 tx_priority, struct netlink_ext_ack *extack);
+	int (*rate_vport_tx_weight_set)(struct devlink_rate *devlink_rate, void *priv,
+				    u64 tx_weight, struct netlink_ext_ack *extack);
 	int (*rate_node_tx_share_set)(struct devlink_rate *devlink_rate, void *priv,
 				      u64 tx_share, struct netlink_ext_ack *extack);
 	int (*rate_node_tx_max_set)(struct devlink_rate *devlink_rate, void *priv,
 				    u64 tx_max, struct netlink_ext_ack *extack);
+	int (*rate_node_tx_priority_set)(struct devlink_rate *devlink_rate, void *priv,
+				    u64 tx_priority, struct netlink_ext_ack *extack);
+	int (*rate_node_tx_weight_set)(struct devlink_rate *devlink_rate, void *priv,
+				    u64 tx_weight, struct netlink_ext_ack *extack);
+	int (*rate_queue_tx_max_set)(struct devlink_rate *devlink_rate, void *priv,
+				    u64 tx_max, struct netlink_ext_ack *extack);
+	int (*rate_queue_tx_share_set)(struct devlink_rate *devlink_rate, void *priv,
+				    u64 tx_share, struct netlink_ext_ack *extack);
+	int (*rate_queue_tx_priority_set)(struct devlink_rate *devlink_rate, void *priv,
+				    u64 tx_priority, struct netlink_ext_ack *extack);
+	int (*rate_queue_tx_weight_set)(struct devlink_rate *devlink_rate, void *priv,
+				    u64 tx_weight, struct netlink_ext_ack *extack);
 	int (*rate_node_new)(struct devlink_rate *rate_node, void **priv,
 			     struct netlink_ext_ack *extack);
 	int (*rate_node_del)(struct devlink_rate *rate_node, void *priv,
 			     struct netlink_ext_ack *extack);
-	int (*rate_leaf_parent_set)(struct devlink_rate *child,
+	int (*rate_vport_parent_set)(struct devlink_rate *child,
 				    struct devlink_rate *parent,
-				    void *priv_child, void *priv_parent,
+				    void **priv_child, void *priv_parent,
 				    struct netlink_ext_ack *extack);
 	int (*rate_node_parent_set)(struct devlink_rate *child,
 				    struct devlink_rate *parent,
-				    void *priv_child, void *priv_parent,
+				    void **priv_child, void *priv_parent,
 				    struct netlink_ext_ack *extack);
+	int (*rate_queue_parent_set)(struct devlink_rate *child,
+				     struct devlink_rate *parent,
+				     void **priv_child, void *priv_parent,
+				     struct netlink_ext_ack *extack);
 	/**
 	 * selftests_check() - queries if selftest is supported
 	 * @devlink: devlink instance
@@ -1584,9 +1604,13 @@ void devlink_port_attrs_pci_vf_set(struct devlink_port *devlink_port, u32 contro
 void devlink_port_attrs_pci_sf_set(struct devlink_port *devlink_port,
 				   u32 controller, u16 pf, u32 sf,
 				   bool external);
-int devl_rate_leaf_create(struct devlink_port *port, void *priv);
-void devl_rate_leaf_destroy(struct devlink_port *devlink_port);
+int devl_rate_node_create(struct devlink *devlink, void *priv,  char *node_name,
+			  char *parent_name);
+int devl_rate_vport_create(struct devlink_port *port, void *priv, char *parent_name);
+int devl_rate_queue_create(struct devlink *devlink, char *parent_name, u16 queue_id, void *priv);
+void devl_rate_vport_destroy(struct devlink_port *devlink_port);
 void devl_rate_nodes_destroy(struct devlink *devlink);
+void devl_rate_objects_destroy(struct devlink *devlink);
 void devlink_port_linecard_set(struct devlink_port *devlink_port,
 			       struct devlink_linecard *linecard);
 struct devlink_linecard *
diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
index 2f24b53a87a5..2eb5b8dbfe59 100644
--- a/include/uapi/linux/devlink.h
+++ b/include/uapi/linux/devlink.h
@@ -220,8 +220,10 @@ enum devlink_port_flavour {
 };
 
 enum devlink_rate_type {
-	DEVLINK_RATE_TYPE_LEAF,
+	DEVLINK_RATE_TYPE_LEAF, /* deprecated, leaving this for backward compatibility */
 	DEVLINK_RATE_TYPE_NODE,
+	DEVLINK_RATE_TYPE_QUEUE,
+	DEVLINK_RATE_TYPE_VPORT,
 };
 
 enum devlink_param_cmode {
@@ -607,6 +609,10 @@ enum devlink_attr {
 
 	DEVLINK_ATTR_SELFTESTS,			/* nested */
 
+	DEVLINK_ATTR_RATE_QUEUE,		/* u16 */
+	DEVLINK_ATTR_RATE_TX_PRIORITY,		/* u16 */
+	DEVLINK_ATTR_RATE_TX_WEIGHT,		/* u16 */
+
 	/* add new attributes above here, update the policy in devlink.c */
 
 	__DEVLINK_ATTR_MAX,
diff --git a/net/core/devlink.c b/net/core/devlink.c
index 7776dc82f88d..930a3b6b4fcd 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -411,9 +411,10 @@ static struct devlink_port *devlink_port_get_from_info(struct devlink *devlink,
 }
 
 static inline bool
-devlink_rate_is_leaf(struct devlink_rate *devlink_rate)
+devlink_rate_is_vport(struct devlink_rate *devlink_rate)
 {
-	return devlink_rate->type == DEVLINK_RATE_TYPE_LEAF;
+	return devlink_rate->type == DEVLINK_RATE_TYPE_VPORT ||
+	       devlink_rate->type == DEVLINK_RATE_TYPE_LEAF;
 }
 
 static inline bool
@@ -422,8 +423,14 @@ devlink_rate_is_node(struct devlink_rate *devlink_rate)
 	return devlink_rate->type == DEVLINK_RATE_TYPE_NODE;
 }
 
+static inline bool
+devlink_rate_is_queue(struct devlink_rate *devlink_rate)
+{
+	return devlink_rate->type == DEVLINK_RATE_TYPE_QUEUE;
+}
+
 static struct devlink_rate *
-devlink_rate_leaf_get_from_info(struct devlink *devlink, struct genl_info *info)
+devlink_rate_vport_get_from_info(struct devlink *devlink, struct genl_info *info)
 {
 	struct devlink_rate *devlink_rate;
 	struct devlink_port *devlink_port;
@@ -441,8 +448,22 @@ devlink_rate_node_get_by_name(struct devlink *devlink, const char *node_name)
 	static struct devlink_rate *devlink_rate;
 
 	list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
-		if (devlink_rate_is_node(devlink_rate) &&
-		    !strcmp(node_name, devlink_rate->name))
+		if ((devlink_rate_is_node(devlink_rate) ||
+		     devlink_rate_is_vport(devlink_rate)) &&
+		     !strcmp(node_name, devlink_rate->name))
+			return devlink_rate;
+	}
+	return ERR_PTR(-ENODEV);
+}
+
+static struct devlink_rate *
+devlink_rate_queue_get_by_id(struct devlink *devlink, u16 queue_id)
+{
+	static struct devlink_rate *devlink_rate;
+
+	list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
+		if (devlink_rate_is_queue(devlink_rate) &&
+		    devlink_rate->queue_id == queue_id)
 			return devlink_rate;
 	}
 	return ERR_PTR(-ENODEV);
@@ -465,6 +486,30 @@ devlink_rate_node_get_from_attrs(struct devlink *devlink, struct nlattr **attrs)
 	return devlink_rate_node_get_by_name(devlink, rate_node_name);
 }
 
+static struct devlink_rate *
+devlink_rate_queue_get_from_attrs(struct devlink *devlink, struct nlattr **attrs)
+{
+	struct devlink_rate *devlink_rate;
+	u16 queue_id;
+
+	if (!attrs[DEVLINK_ATTR_RATE_QUEUE])
+		return ERR_PTR(-EINVAL);
+
+	queue_id = nla_get_u16(attrs[DEVLINK_ATTR_RATE_QUEUE]);
+
+	devlink_rate = devlink_rate_queue_get_by_id(devlink, queue_id);
+	if (!devlink_rate)
+		return ERR_PTR(-ENODEV);
+
+	return devlink_rate;
+}
+
+static struct devlink_rate *
+devlink_rate_queue_get_from_info(struct devlink *devlink, struct genl_info *info)
+{
+	return devlink_rate_queue_get_from_attrs(devlink, info->attrs);
+}
+
 static struct devlink_rate *
 devlink_rate_node_get_from_info(struct devlink *devlink, struct genl_info *info)
 {
@@ -477,9 +522,11 @@ devlink_rate_get_from_info(struct devlink *devlink, struct genl_info *info)
 	struct nlattr **attrs = info->attrs;
 
 	if (attrs[DEVLINK_ATTR_PORT_INDEX])
-		return devlink_rate_leaf_get_from_info(devlink, info);
+		return devlink_rate_vport_get_from_info(devlink, info);
 	else if (attrs[DEVLINK_ATTR_RATE_NODE_NAME])
 		return devlink_rate_node_get_from_info(devlink, info);
+	else if (attrs[DEVLINK_ATTR_RATE_QUEUE])
+		return devlink_rate_queue_get_from_info(devlink, info);
 	else
 		return ERR_PTR(-EINVAL);
 }
@@ -1159,7 +1206,7 @@ static int devlink_nl_rate_fill(struct sk_buff *msg,
 	if (nla_put_u16(msg, DEVLINK_ATTR_RATE_TYPE, devlink_rate->type))
 		goto nla_put_failure;
 
-	if (devlink_rate_is_leaf(devlink_rate)) {
+	if (devlink_rate_is_vport(devlink_rate)) {
 		if (nla_put_u32(msg, DEVLINK_ATTR_PORT_INDEX,
 				devlink_rate->devlink_port->index))
 			goto nla_put_failure;
@@ -1167,6 +1214,10 @@ static int devlink_nl_rate_fill(struct sk_buff *msg,
 		if (nla_put_string(msg, DEVLINK_ATTR_RATE_NODE_NAME,
 				   devlink_rate->name))
 			goto nla_put_failure;
+	} else if (devlink_rate_is_queue(devlink_rate)) {
+		if (nla_put_u16(msg, DEVLINK_ATTR_RATE_QUEUE,
+				devlink_rate->queue_id))
+			goto nla_put_failure;
 	}
 
 	if (nla_put_u64_64bit(msg, DEVLINK_ATTR_RATE_TX_SHARE,
@@ -1177,10 +1228,19 @@ static int devlink_nl_rate_fill(struct sk_buff *msg,
 			      devlink_rate->tx_max, DEVLINK_ATTR_PAD))
 		goto nla_put_failure;
 
-	if (devlink_rate->parent)
+	if (nla_put_u16(msg, DEVLINK_ATTR_RATE_TX_PRIORITY,
+			devlink_rate->tx_priority))
+		goto nla_put_failure;
+
+	if (nla_put_u16(msg, DEVLINK_ATTR_RATE_TX_WEIGHT,
+			devlink_rate->tx_weight))
+		goto nla_put_failure;
+
+	if (devlink_rate->parent && devlink_rate->parent->name) {
 		if (nla_put_string(msg, DEVLINK_ATTR_RATE_PARENT_NODE_NAME,
 				   devlink_rate->parent->name))
 			goto nla_put_failure;
+	}
 
 	genlmsg_end(msg, hdr);
 	return 0;
@@ -1860,25 +1920,27 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
 	int err = -EOPNOTSUPP;
 
 	parent = devlink_rate->parent;
-	if (parent && len) {
-		NL_SET_ERR_MSG_MOD(info->extack, "Rate object already has parent.");
-		return -EBUSY;
-	} else if (parent && !len) {
-		if (devlink_rate_is_leaf(devlink_rate))
-			err = ops->rate_leaf_parent_set(devlink_rate, NULL,
-							devlink_rate->priv, NULL,
-							info->extack);
+	if (parent && !len) {
+		if (devlink_rate_is_vport(devlink_rate))
+			err = ops->rate_vport_parent_set(devlink_rate, NULL,
+							 &devlink_rate->priv, NULL,
+							 info->extack);
 		else if (devlink_rate_is_node(devlink_rate))
 			err = ops->rate_node_parent_set(devlink_rate, NULL,
-							devlink_rate->priv, NULL,
+							&devlink_rate->priv, NULL,
 							info->extack);
+		else if (devlink_rate_is_queue(devlink_rate))
+			err = ops->rate_queue_parent_set(devlink_rate, NULL,
+							 &devlink_rate->priv, NULL,
+							 info->extack);
 		if (err)
 			return err;
 
 		refcount_dec(&parent->refcnt);
 		devlink_rate->parent = NULL;
-	} else if (!parent && len) {
+	} else if (len) {
 		parent = devlink_rate_node_get_by_name(devlink, parent_name);
 		if (IS_ERR(parent))
 			return -ENODEV;
 
@@ -1893,17 +1955,25 @@ devlink_nl_rate_parent_node_set(struct devlink_rate *devlink_rate,
 			return -EEXIST;
 		}
 
-		if (devlink_rate_is_leaf(devlink_rate))
-			err = ops->rate_leaf_parent_set(devlink_rate, parent,
-							devlink_rate->priv, parent->priv,
-							info->extack);
+		if (devlink_rate_is_vport(devlink_rate))
+			err = ops->rate_vport_parent_set(devlink_rate, parent,
+							 &devlink_rate->priv, parent->priv,
+							 info->extack);
 		else if (devlink_rate_is_node(devlink_rate))
 			err = ops->rate_node_parent_set(devlink_rate, parent,
-							devlink_rate->priv, parent->priv,
+							&devlink_rate->priv, parent->priv,
 							info->extack);
+		else if (devlink_rate_is_queue(devlink_rate))
+			err = ops->rate_queue_parent_set(devlink_rate, parent,
+							 &devlink_rate->priv, parent->priv,
+							 info->extack);
 		if (err)
 			return err;
 
+		if (devlink_rate->parent)
+			/* we're reassigning to other parent in this case */
+			refcount_dec(&devlink_rate->parent->refcnt);
+
 		refcount_inc(&parent->refcnt);
 		devlink_rate->parent = parent;
 	}
@@ -1917,16 +1987,21 @@ static int devlink_nl_rate_set(struct devlink_rate *devlink_rate,
 {
 	struct nlattr *nla_parent, **attrs = info->attrs;
 	int err = -EOPNOTSUPP;
+	u16 priority;
+	u16 weight;
 	u64 rate;
 
 	if (attrs[DEVLINK_ATTR_RATE_TX_SHARE]) {
 		rate = nla_get_u64(attrs[DEVLINK_ATTR_RATE_TX_SHARE]);
-		if (devlink_rate_is_leaf(devlink_rate))
-			err = ops->rate_leaf_tx_share_set(devlink_rate, devlink_rate->priv,
-							  rate, info->extack);
+		if (devlink_rate_is_vport(devlink_rate))
+			err = ops->rate_vport_tx_share_set(devlink_rate, devlink_rate->priv,
+							   rate, info->extack);
 		else if (devlink_rate_is_node(devlink_rate))
 			err = ops->rate_node_tx_share_set(devlink_rate, devlink_rate->priv,
 							  rate, info->extack);
+		else if (devlink_rate_is_queue(devlink_rate))
+			err = ops->rate_queue_tx_share_set(devlink_rate, devlink_rate->priv,
+							   rate, info->extack);
 		if (err)
 			return err;
 		devlink_rate->tx_share = rate;
@@ -1934,17 +2009,52 @@ static int devlink_nl_rate_set(struct devlink_rate *devlink_rate,
 
 	if (attrs[DEVLINK_ATTR_RATE_TX_MAX]) {
 		rate = nla_get_u64(attrs[DEVLINK_ATTR_RATE_TX_MAX]);
-		if (devlink_rate_is_leaf(devlink_rate))
-			err = ops->rate_leaf_tx_max_set(devlink_rate, devlink_rate->priv,
-							rate, info->extack);
+		if (devlink_rate_is_vport(devlink_rate))
+			err = ops->rate_vport_tx_max_set(devlink_rate, devlink_rate->priv,
+							 rate, info->extack);
 		else if (devlink_rate_is_node(devlink_rate))
 			err = ops->rate_node_tx_max_set(devlink_rate, devlink_rate->priv,
 							rate, info->extack);
+		else if (devlink_rate_is_queue(devlink_rate))
+			err = ops->rate_queue_tx_max_set(devlink_rate, devlink_rate->priv,
+							 rate, info->extack);
 		if (err)
 			return err;
 		devlink_rate->tx_max = rate;
 	}
 
+	if (attrs[DEVLINK_ATTR_RATE_TX_PRIORITY]) {
+		priority = nla_get_u16(attrs[DEVLINK_ATTR_RATE_TX_PRIORITY]);
+		if (devlink_rate_is_vport(devlink_rate))
+			err = ops->rate_vport_tx_priority_set(devlink_rate, devlink_rate->priv,
+							      priority, info->extack);
+		else if (devlink_rate_is_node(devlink_rate))
+			err = ops->rate_node_tx_priority_set(devlink_rate, devlink_rate->priv,
+							     priority, info->extack);
+		else if (devlink_rate_is_queue(devlink_rate))
+			err = ops->rate_queue_tx_priority_set(devlink_rate, devlink_rate->priv,
+							      priority, info->extack);
+		if (err)
+			return err;
+		devlink_rate->tx_priority = priority;
+	}
+
+	if (attrs[DEVLINK_ATTR_RATE_TX_WEIGHT]) {
+		weight = nla_get_u16(attrs[DEVLINK_ATTR_RATE_TX_WEIGHT]);
+		if (devlink_rate_is_vport(devlink_rate))
+			err = ops->rate_vport_tx_weight_set(devlink_rate, devlink_rate->priv,
+							    weight, info->extack);
+		else if (devlink_rate_is_node(devlink_rate))
+			err = ops->rate_node_tx_weight_set(devlink_rate, devlink_rate->priv,
+							   weight, info->extack);
+		else if (devlink_rate_is_queue(devlink_rate))
+			err = ops->rate_queue_tx_weight_set(devlink_rate, devlink_rate->priv,
+							    weight, info->extack);
+		if (err)
+			return err;
+		devlink_rate->tx_weight = weight;
+	}
+
 	nla_parent = attrs[DEVLINK_ATTR_RATE_PARENT_NODE_NAME];
 	if (nla_parent) {
 		err = devlink_nl_rate_parent_node_set(devlink_rate, info,
@@ -1962,32 +2072,85 @@ static bool devlink_rate_set_ops_supported(const struct devlink_ops *ops,
 {
 	struct nlattr **attrs = info->attrs;
 
-	if (type == DEVLINK_RATE_TYPE_LEAF) {
-		if (attrs[DEVLINK_ATTR_RATE_TX_SHARE] && !ops->rate_leaf_tx_share_set) {
-			NL_SET_ERR_MSG_MOD(info->extack, "TX share set isn't supported for the leafs");
+	if (type == DEVLINK_RATE_TYPE_VPORT || type == DEVLINK_RATE_TYPE_LEAF) {
+		if (attrs[DEVLINK_ATTR_RATE_TX_SHARE] && !ops->rate_vport_tx_share_set) {
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "TX share set isn't supported for the vports");
 			return false;
 		}
-		if (attrs[DEVLINK_ATTR_RATE_TX_MAX] && !ops->rate_leaf_tx_max_set) {
-			NL_SET_ERR_MSG_MOD(info->extack, "TX max set isn't supported for the leafs");
+		if (attrs[DEVLINK_ATTR_RATE_TX_MAX] && !ops->rate_vport_tx_max_set) {
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "TX max set isn't supported for the vports");
+			return false;
+		}
+		if (attrs[DEVLINK_ATTR_RATE_TX_PRIORITY] && !ops->rate_vport_tx_priority_set) {
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "TX priority set isn't supported for the vports");
+			return false;
+		}
+		if (attrs[DEVLINK_ATTR_RATE_TX_WEIGHT] && !ops->rate_vport_tx_weight_set) {
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "TX weight set isn't supported for the vports");
 			return false;
 		}
 		if (attrs[DEVLINK_ATTR_RATE_PARENT_NODE_NAME] &&
-		    !ops->rate_leaf_parent_set) {
-			NL_SET_ERR_MSG_MOD(info->extack, "Parent set isn't supported for the leafs");
+		    !ops->rate_vport_parent_set) {
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "Parent set isn't supported for the vports");
 			return false;
 		}
 	} else if (type == DEVLINK_RATE_TYPE_NODE) {
 		if (attrs[DEVLINK_ATTR_RATE_TX_SHARE] && !ops->rate_node_tx_share_set) {
-			NL_SET_ERR_MSG_MOD(info->extack, "TX share set isn't supported for the nodes");
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "TX share set isn't supported for the nodes");
 			return false;
 		}
 		if (attrs[DEVLINK_ATTR_RATE_TX_MAX] && !ops->rate_node_tx_max_set) {
-			NL_SET_ERR_MSG_MOD(info->extack, "TX max set isn't supported for the nodes");
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "TX max set isn't supported for the nodes");
+			return false;
+		}
+		if (attrs[DEVLINK_ATTR_RATE_TX_PRIORITY] && !ops->rate_node_tx_priority_set) {
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "TX priority set isn't supported for the nodes");
+			return false;
+		}
+		if (attrs[DEVLINK_ATTR_RATE_TX_WEIGHT] && !ops->rate_node_tx_weight_set) {
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "TX weight set isn't supported for the nodes");
 			return false;
 		}
 		if (attrs[DEVLINK_ATTR_RATE_PARENT_NODE_NAME] &&
 		    !ops->rate_node_parent_set) {
-			NL_SET_ERR_MSG_MOD(info->extack, "Parent set isn't supported for the nodes");
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "Parent set isn't supported for the nodes");
+			return false;
+		}
+	} else if (type == DEVLINK_RATE_TYPE_QUEUE) {
+		if (attrs[DEVLINK_ATTR_RATE_TX_SHARE] && !ops->rate_queue_tx_share_set) {
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "TX share set isn't supported for the queues");
+			return false;
+		}
+		if (attrs[DEVLINK_ATTR_RATE_TX_MAX] && !ops->rate_queue_tx_max_set) {
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "TX max set isn't supported for the queues");
+			return false;
+		}
+		if (attrs[DEVLINK_ATTR_RATE_TX_PRIORITY] && !ops->rate_queue_tx_priority_set) {
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "TX priority set isn't supported for the queues");
+			return false;
+		}
+		if (attrs[DEVLINK_ATTR_RATE_TX_WEIGHT] && !ops->rate_queue_tx_weight_set) {
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "TX weight set isn't supported for the queues");
+			return false;
+		}
+		if (attrs[DEVLINK_ATTR_RATE_PARENT_NODE_NAME] &&
+		    !ops->rate_queue_parent_set) {
+			NL_SET_ERR_MSG_MOD(info->extack,
+					   "Parent set isn't supported for the queues");
 			return false;
 		}
 	} else {
@@ -9165,6 +9328,9 @@ static const struct nla_policy devlink_nl_policy[DEVLINK_ATTR_MAX + 1] = {
 	[DEVLINK_ATTR_LINECARD_INDEX] = { .type = NLA_U32 },
 	[DEVLINK_ATTR_LINECARD_TYPE] = { .type = NLA_NUL_STRING },
 	[DEVLINK_ATTR_SELFTESTS] = { .type = NLA_NESTED },
+	[DEVLINK_ATTR_RATE_QUEUE] = { .type = NLA_U16 },
+	[DEVLINK_ATTR_RATE_TX_PRIORITY] = { .type = NLA_U16 },
+	[DEVLINK_ATTR_RATE_TX_WEIGHT] = { .type = NLA_U16 },
 };
 
 static const struct genl_small_ops devlink_nl_ops[] = {
@@ -10165,16 +10331,105 @@ void devlink_port_attrs_pci_sf_set(struct devlink_port *devlink_port, u32 contro
 EXPORT_SYMBOL_GPL(devlink_port_attrs_pci_sf_set);
 
 /**
- * devl_rate_leaf_create - create devlink rate leaf
+ * devl_rate_queue_create - create devlink rate queue
+ * @devlink: devlink instance
+ * @parent_name: name of the parent node of the new queue
+ * @queue_id: identifier of the new queue
+ * @priv: driver private data
+ *
+ * Create devlink rate object of type queue
+ */
+int devl_rate_queue_create(struct devlink *devlink, char *parent_name, u16 queue_id, void *priv)
+{
+	struct devlink_rate *devlink_rate;
+
+	devl_assert_locked(devlink);
+
+	devlink_rate = devlink_rate_queue_get_by_id(devlink, queue_id);
+	if (!IS_ERR(devlink_rate))
+		return -EEXIST;
+
+	devlink_rate = kzalloc(sizeof(*devlink_rate), GFP_KERNEL);
+	if (!devlink_rate)
+		return -ENOMEM;
+
+	devlink_rate->parent = devlink_rate_node_get_by_name(devlink, parent_name);
+	if (IS_ERR(devlink_rate->parent))
+		devlink_rate->parent = NULL;
+
+	if (devlink_rate->parent)
+		refcount_inc(&devlink_rate->parent->refcnt);
+
+	devlink_rate->type = DEVLINK_RATE_TYPE_QUEUE;
+	devlink_rate->devlink = devlink;
+	devlink_rate->queue_id = queue_id;
+	devlink_rate->priv = priv;
+	list_add_tail(&devlink_rate->list, &devlink->rate_list);
+	devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_NEW);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(devl_rate_queue_create);
+
+/**
+ * devl_rate_node_create - create devlink rate node
+ * @devlink: devlink instance
+ * @priv: driver private data
+ * @node_name: name of the resulting node
+ * @parent_name: name of the parent of the resulting node
+ *
+ * Create devlink rate object of type node
+ */
+int devl_rate_node_create(struct devlink *devlink, void *priv, char *node_name, char *parent_name)
+{
+	struct devlink_rate *rate_node;
+	struct devlink_rate *parent;
+
+	rate_node = devlink_rate_node_get_by_name(devlink, node_name);
+	if (!IS_ERR(rate_node))
+		return -EEXIST;
+
+	rate_node = kzalloc(sizeof(*rate_node), GFP_KERNEL);
+	if (!rate_node)
+		return -ENOMEM;
+
+	if (parent_name) {
+		parent = devlink_rate_node_get_by_name(devlink, parent_name);
+		if (IS_ERR(parent)) {
+			kfree(rate_node);
+			return -ENODEV;
+		}
+		rate_node->parent = parent;
+		refcount_inc(&rate_node->parent->refcnt);
+	}
+
+	rate_node->devlink = devlink;
+	rate_node->type = DEVLINK_RATE_TYPE_NODE;
+	rate_node->priv = priv;
+
+	rate_node->name = kzalloc(DEVLINK_RATE_NAME_MAX_LEN, GFP_KERNEL);
+	if (!rate_node->name) {
+		if (rate_node->parent)
+			refcount_dec(&rate_node->parent->refcnt);
+		kfree(rate_node);
+		return -ENOMEM;
+	}
+
+	strncpy(rate_node->name, node_name, DEVLINK_RATE_NAME_MAX_LEN);
+
+	refcount_set(&rate_node->refcnt, 1);
+	list_add(&rate_node->list, &devlink->rate_list);
+	devlink_rate_notify(rate_node, DEVLINK_CMD_RATE_NEW);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(devl_rate_node_create);
+
+/**
+ * devl_rate_vport_create - create devlink rate vport
  * @devlink_port: devlink port object to create rate object on
  * @priv: driver private data
+ * @parent_name: name of the parent node, or NULL for no parent
  *
- * Create devlink rate object of type leaf on provided @devlink_port.
+ * Create devlink rate object of type vport on provided @devlink_port.
  */
-int devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv)
+int devl_rate_vport_create(struct devlink_port *devlink_port, void *priv, char *parent_name)
 {
 	struct devlink *devlink = devlink_port->devlink;
 	struct devlink_rate *devlink_rate;
+	struct devlink_rate *parent;
 
 	devl_assert_locked(devlink_port->devlink);
 
@@ -10185,26 +10440,42 @@ int devl_rate_leaf_create(struct devlink_port *devlink_port, void *priv)
 	if (!devlink_rate)
 		return -ENOMEM;
 
-	devlink_rate->type = DEVLINK_RATE_TYPE_LEAF;
+	if (parent_name) {
+		parent = devlink_rate_node_get_by_name(devlink, parent_name);
+		if (IS_ERR(parent)) {
+			kfree(devlink_rate);
+			return -ENODEV;
+		}
+		devlink_rate->parent = parent;
+		refcount_inc(&devlink_rate->parent->refcnt);
+	}
+
+	devlink_rate->name = kzalloc(DEVLINK_RATE_NAME_MAX_LEN, GFP_KERNEL);
+	if (!devlink_rate->name) {
+		if (devlink_rate->parent)
+			refcount_dec(&devlink_rate->parent->refcnt);
+		kfree(devlink_rate);
+		return -ENOMEM;
+	}
+
+	snprintf(devlink_rate->name, DEVLINK_RATE_NAME_MAX_LEN, "vport_%u",
+		 devlink_port->index);
+
+	devlink_rate->type = DEVLINK_RATE_TYPE_VPORT;
 	devlink_rate->devlink = devlink;
 	devlink_rate->devlink_port = devlink_port;
 	devlink_rate->priv = priv;
+	refcount_set(&devlink_rate->refcnt, 1);
 	list_add_tail(&devlink_rate->list, &devlink->rate_list);
 	devlink_port->devlink_rate = devlink_rate;
 	devlink_rate_notify(devlink_rate, DEVLINK_CMD_RATE_NEW);
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(devl_rate_leaf_create);
+EXPORT_SYMBOL_GPL(devl_rate_vport_create);
 
 /**
- * devl_rate_leaf_destroy - destroy devlink rate leaf
+ * devl_rate_vport_destroy - destroy devlink rate vport
  *
  * @devlink_port: devlink port linked to the rate object
  *
- * Destroy the devlink rate object of type leaf on provided @devlink_port.
+ * Destroy the devlink rate object of type vport on provided @devlink_port.
  */
-void devl_rate_leaf_destroy(struct devlink_port *devlink_port)
+void devl_rate_vport_destroy(struct devlink_port *devlink_port)
 {
 	struct devlink_rate *devlink_rate = devlink_port->devlink_rate;
 
@@ -10217,9 +10488,32 @@ void devl_rate_leaf_destroy(struct devlink_port *devlink_port)
 		refcount_dec(&devlink_rate->parent->refcnt);
 	list_del(&devlink_rate->list);
 	devlink_port->devlink_rate = NULL;
+	kfree(devlink_rate->name);
 	kfree(devlink_rate);
 }
-EXPORT_SYMBOL_GPL(devl_rate_leaf_destroy);
+EXPORT_SYMBOL_GPL(devl_rate_vport_destroy);
+
+void devl_rate_objects_destroy(struct devlink *devlink)
+{
+	struct devlink_rate *devlink_rate, *tmp;
+
+	devl_assert_locked(devlink);
+
+	list_for_each_entry(devlink_rate, &devlink->rate_list, list) {
+		if (devlink_rate->parent)
+			refcount_dec(&devlink_rate->parent->refcnt);
+	}
+	list_for_each_entry_safe(devlink_rate, tmp, &devlink->rate_list, list) {
+		kfree(devlink_rate->name);
+		if (devlink_rate->devlink_port) {
+			if (devlink_rate->devlink_port->devlink_rate)
+				devlink_rate->devlink_port->devlink_rate = NULL;
+		}
+		list_del(&devlink_rate->list);
+		kfree(devlink_rate);
+	}
+}
+EXPORT_SYMBOL_GPL(devl_rate_objects_destroy);
 
 /**
  * devl_rate_nodes_destroy - destroy all devlink rate nodes on device
@@ -10240,12 +10534,18 @@ void devl_rate_nodes_destroy(struct devlink *devlink)
 			continue;
 
 		refcount_dec(&devlink_rate->parent->refcnt);
-		if (devlink_rate_is_leaf(devlink_rate))
-			ops->rate_leaf_parent_set(devlink_rate, NULL, devlink_rate->priv,
-						  NULL, NULL);
+		if (devlink_rate_is_vport(devlink_rate))
+			ops->rate_vport_parent_set(devlink_rate, NULL,
+						   &devlink_rate->priv,
+						   NULL, NULL);
 		else if (devlink_rate_is_node(devlink_rate))
-			ops->rate_node_parent_set(devlink_rate, NULL, devlink_rate->priv,
+			ops->rate_node_parent_set(devlink_rate, NULL,
+						  &devlink_rate->priv,
 						  NULL, NULL);
+		else if (devlink_rate_is_queue(devlink_rate))
+			ops->rate_queue_parent_set(devlink_rate, NULL,
+						   &devlink_rate->priv,
+						   NULL, NULL);
 	}
 	list_for_each_entry_safe(devlink_rate, tmp, &devlink->rate_list, list) {
 		if (devlink_rate_is_node(devlink_rate)) {
@@ -10253,6 +10553,9 @@ void devl_rate_nodes_destroy(struct devlink *devlink)
 			list_del(&devlink_rate->list);
 			kfree(devlink_rate->name);
 			kfree(devlink_rate);
+		} else if (devlink_rate_is_queue(devlink_rate)) {
+			list_del(&devlink_rate->list);
+			kfree(devlink_rate);
 		}
 	}
 }
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH net-next v4 3/6] ice: Introduce new parameters in ice_sched_node
  2022-09-15 13:42 [RFC PATCH net-next v4 0/6] Implement devlink-rate API and extend it Michal Wilczynski
  2022-09-15 13:42 ` [RFC PATCH net-next v4 1/6] ice: Add function for move/reconfigure TxQ AQ command Michal Wilczynski
  2022-09-15 13:42 ` [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters Michal Wilczynski
@ 2022-09-15 13:42 ` Michal Wilczynski
  2022-09-15 13:42 ` [RFC PATCH net-next v4 4/6] ice: Implement devlink-rate API Michal Wilczynski
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 35+ messages in thread
From: Michal Wilczynski @ 2022-09-15 13:42 UTC (permalink / raw)
  To: netdev
  Cc: alexandr.lobakin, dchumak, maximmi, jiri, simon.horman,
	jacob.e.keller, jesse.brandeburg, przemyslaw.kitszel,
	Michal Wilczynski

To support the new devlink-rate API, the ice_sched_node struct needs to store
a number of additional parameters. This includes txq_id, tx_max, tx_share,
tx_priority, and tx_weight. New functions also need to be added to configure
the hardware with the new parameters.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
---
 .../net/ethernet/intel/ice/ice_adminq_cmd.h   |  4 +-
 drivers/net/ethernet/intel/ice/ice_common.c   |  8 +-
 drivers/net/ethernet/intel/ice/ice_dcb.c      |  2 +-
 drivers/net/ethernet/intel/ice/ice_sched.c    | 82 +++++++++++++++++--
 drivers/net/ethernet/intel/ice/ice_sched.h    | 27 +++++-
 drivers/net/ethernet/intel/ice/ice_type.h     |  8 ++
 6 files changed, 117 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 7574267f014e..37f892ff4e02 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -848,9 +848,9 @@ struct ice_aqc_txsched_elem {
 	u8 generic;
 #define ICE_AQC_ELEM_GENERIC_MODE_M		0x1
 #define ICE_AQC_ELEM_GENERIC_PRIO_S		0x1
-#define ICE_AQC_ELEM_GENERIC_PRIO_M	(0x7 << ICE_AQC_ELEM_GENERIC_PRIO_S)
+#define ICE_AQC_ELEM_GENERIC_PRIO_M	        GENMASK(3, 1)
 #define ICE_AQC_ELEM_GENERIC_SP_S		0x4
-#define ICE_AQC_ELEM_GENERIC_SP_M	(0x1 << ICE_AQC_ELEM_GENERIC_SP_S)
+#define ICE_AQC_ELEM_GENERIC_SP_M	        GENMASK(4, 4)
 #define ICE_AQC_ELEM_GENERIC_ADJUST_VAL_S	0x5
 #define ICE_AQC_ELEM_GENERIC_ADJUST_VAL_M	\
 	(0x3 << ICE_AQC_ELEM_GENERIC_ADJUST_VAL_S)
diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c
index 39769141c8e8..fda20dd204c6 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -1092,6 +1092,9 @@ int ice_init_hw(struct ice_hw *hw)
 
 	hw->evb_veb = true;
 
+	/* init IDA for identifying scheduling nodes uniquely */
+	ida_init(&hw->port_info->sched_node_ids);
+
 	/* Query the allocated resources for Tx scheduler */
 	status = ice_sched_query_res_alloc(hw);
 	if (status) {
@@ -4652,7 +4655,8 @@ ice_ena_vsi_txq(struct ice_port_info *pi, u16 vsi_handle, u8 tc, u16 q_handle,
 	q_ctx->q_teid = le32_to_cpu(node.node_teid);
 
 	/* add a leaf node into scheduler tree queue layer */
-	status = ice_sched_add_node(pi, hw->num_tx_sched_layers - 1, &node);
+	status = ice_sched_add_node(pi, hw->num_tx_sched_layers - 1, &node,
+				    buf->txqs[0].txq_id);
 	if (!status)
 		status = ice_sched_replay_q_bw(pi, q_ctx);
 
@@ -4887,7 +4891,7 @@ ice_ena_vsi_rdma_qset(struct ice_port_info *pi, u16 vsi_handle, u8 tc,
 	for (i = 0; i < num_qsets; i++) {
 		node.node_teid = buf->rdma_qsets[i].qset_teid;
 		ret = ice_sched_add_node(pi, hw->num_tx_sched_layers - 1,
-					 &node);
+					 &node, 0);
 		if (ret)
 			break;
 		qset_teid[i] = le32_to_cpu(node.node_teid);
diff --git a/drivers/net/ethernet/intel/ice/ice_dcb.c b/drivers/net/ethernet/intel/ice/ice_dcb.c
index 0b146a0d4205..1b0dcd4c0323 100644
--- a/drivers/net/ethernet/intel/ice/ice_dcb.c
+++ b/drivers/net/ethernet/intel/ice/ice_dcb.c
@@ -1580,7 +1580,7 @@ ice_update_port_tc_tree_cfg(struct ice_port_info *pi,
 		/* new TC */
 		status = ice_sched_query_elem(pi->hw, teid2, &elem);
 		if (!status)
-			status = ice_sched_add_node(pi, 1, &elem);
+			status = ice_sched_add_node(pi, 1, &elem, 0);
 		if (status)
 			break;
 		/* update the TC number */
diff --git a/drivers/net/ethernet/intel/ice/ice_sched.c b/drivers/net/ethernet/intel/ice/ice_sched.c
index 118595763bba..f29c30c1b58b 100644
--- a/drivers/net/ethernet/intel/ice/ice_sched.c
+++ b/drivers/net/ethernet/intel/ice/ice_sched.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2018, Intel Corporation. */
 
+#include <net/devlink.h>
 #include "ice_sched.h"
 
 /**
@@ -142,12 +143,13 @@ ice_aq_query_sched_elems(struct ice_hw *hw, u16 elems_req,
  * @pi: port information structure
  * @layer: Scheduler layer of the node
  * @info: Scheduler element information from firmware
+ * @txq_id: id of the Tx queue, if this node represents a queue
  *
  * This function inserts a scheduler node to the SW DB.
  */
 int
 ice_sched_add_node(struct ice_port_info *pi, u8 layer,
-		   struct ice_aqc_txsched_elem_data *info)
+		   struct ice_aqc_txsched_elem_data *info, u16 txq_id)
 {
 	struct ice_aqc_txsched_elem_data elem;
 	struct ice_sched_node *parent;
@@ -190,6 +192,9 @@ ice_sched_add_node(struct ice_port_info *pi, u8 layer,
 		}
 	}
 
+	if (info->data.elem_type == ICE_AQC_ELEM_TYPE_LEAF)
+		node->tx_queue_id = txq_id;
+
 	node->in_use = true;
 	node->parent = parent;
 	node->tx_sched_layer = layer;
@@ -355,6 +360,9 @@ void ice_free_sched_node(struct ice_port_info *pi, struct ice_sched_node *node)
 	/* leaf nodes have no children */
 	if (node->children)
 		devm_kfree(ice_hw_to_dev(hw), node->children);
+
+	kfree(node->name);
+	ida_free(&pi->sched_node_ids, node->id);
 	devm_kfree(ice_hw_to_dev(hw), node);
 }
 
@@ -875,7 +883,7 @@ void ice_sched_cleanup_all(struct ice_hw *hw)
  *
  * This function add nodes to HW as well as to SW DB for a given layer
  */
-static int
+int
 ice_sched_add_elems(struct ice_port_info *pi, struct ice_sched_node *tc_node,
 		    struct ice_sched_node *parent, u8 layer, u16 num_nodes,
 		    u16 *num_nodes_added, u32 *first_node_teid)
@@ -924,7 +932,7 @@ ice_sched_add_elems(struct ice_port_info *pi, struct ice_sched_node *tc_node,
 	*num_nodes_added = num_nodes;
 	/* add nodes to the SW DB */
 	for (i = 0; i < num_nodes; i++) {
-		status = ice_sched_add_node(pi, layer, &buf->generic[i]);
+		status = ice_sched_add_node(pi, layer, &buf->generic[i], 0);
 		if (status) {
 			ice_debug(hw, ICE_DBG_SCHED, "add nodes in SW DB failed status =%d\n",
 				  status);
@@ -940,6 +948,15 @@ ice_sched_add_elems(struct ice_port_info *pi, struct ice_sched_node *tc_node,
 
 		new_node->sibling = NULL;
 		new_node->tc_num = tc_node->tc_num;
+		new_node->tx_weight = ICE_SCHED_DFLT_BW_WT;
+		new_node->tx_share = ICE_SCHED_DFLT_BW;
+		new_node->tx_max = ICE_SCHED_DFLT_BW;
+		new_node->name = kzalloc(DEVLINK_RATE_NAME_MAX_LEN, GFP_KERNEL);
+		if (!new_node->name)
+			return -ENOMEM;
+
+		new_node->id = ida_alloc(&pi->sched_node_ids, GFP_KERNEL);
+		snprintf(new_node->name, DEVLINK_RATE_NAME_MAX_LEN, "node_%u", new_node->id);
 
 		/* add it to previous node sibling pointer */
 		/* Note: siblings are not linked across branches */
@@ -1268,7 +1285,7 @@ int ice_sched_init_port(struct ice_port_info *pi)
 			    ICE_AQC_ELEM_TYPE_ENTRY_POINT)
 				hw->sw_entry_point_layer = j;
 
-			status = ice_sched_add_node(pi, j, &buf[i].generic[j]);
+			status = ice_sched_add_node(pi, j, &buf[i].generic[j], 0);
 			if (status)
 				goto err_init_port;
 		}
@@ -2154,7 +2171,7 @@ ice_sched_get_free_vsi_parent(struct ice_hw *hw, struct ice_sched_node *node,
  * This function removes the child from the old parent and adds it to a new
  * parent
  */
-static void
+void
 ice_sched_update_parent(struct ice_sched_node *new_parent,
 			struct ice_sched_node *node)
 {
@@ -2188,7 +2205,7 @@ ice_sched_update_parent(struct ice_sched_node *new_parent,
  *
  * This function move the child nodes to a given parent.
  */
-static int
+int
 ice_sched_move_nodes(struct ice_port_info *pi, struct ice_sched_node *parent,
 		     u16 num_items, u32 *list)
 {
@@ -3560,7 +3577,7 @@ ice_sched_set_eir_srl_excl(struct ice_port_info *pi,
  * node's RL profile ID of type CIR, EIR, or SRL, and removes old profile
  * ID from local database. The caller needs to hold scheduler lock.
  */
-static int
+int
 ice_sched_set_node_bw(struct ice_port_info *pi, struct ice_sched_node *node,
 		      enum ice_rl_type rl_type, u32 bw, u8 layer_num)
 {
@@ -3596,6 +3613,55 @@ ice_sched_set_node_bw(struct ice_port_info *pi, struct ice_sched_node *node,
 				       ICE_AQC_RL_PROFILE_TYPE_M, old_id);
 }
 
+/**
+ * ice_sched_set_node_priority - set node's priority
+ * @pi: port information structure
+ * @node: tree node
+ * @priority: number 0-7 representing priority among siblings
+ *
+ * This function sets the priority of a node among its siblings.
+ */
+int
+ice_sched_set_node_priority(struct ice_port_info *pi, struct ice_sched_node *node,
+			    u16 priority)
+{
+	struct ice_aqc_txsched_elem_data buf;
+	struct ice_aqc_txsched_elem *data;
+
+	buf = node->info;
+	data = &buf.data;
+
+	data->valid_sections |= ICE_AQC_ELEM_VALID_GENERIC;
+	data->generic |= FIELD_PREP(ICE_AQC_ELEM_GENERIC_MODE_M, 0x1);
+	data->generic |= FIELD_PREP(ICE_AQC_ELEM_GENERIC_PRIO_M, priority);
+
+	return ice_sched_update_elem(pi->hw, node, &buf);
+}
+
+/**
+ * ice_sched_set_node_weight - set node's weight
+ * @pi: port information structure
+ * @node: tree node
+ * @weight: number 1-200 representing weight for WFQ
+ *
+ * This function sets weight of the node for WFQ algorithm.
+ */
+int
+ice_sched_set_node_weight(struct ice_port_info *pi, struct ice_sched_node *node, u16 weight)
+{
+	struct ice_aqc_txsched_elem_data buf;
+	struct ice_aqc_txsched_elem *data;
+
+	buf = node->info;
+	data = &buf.data;
+
+	data->valid_sections = ICE_AQC_ELEM_VALID_CIR | ICE_AQC_ELEM_VALID_EIR;
+	data->cir_bw.bw_alloc = cpu_to_le16(weight);
+	data->eir_bw.bw_alloc = cpu_to_le16(weight);
+
+	return ice_sched_update_elem(pi->hw, node, &buf);
+}
+
 /**
  * ice_sched_set_node_bw_lmt - set node's BW limit
  * @pi: port information structure
@@ -3606,7 +3672,7 @@ ice_sched_set_node_bw(struct ice_port_info *pi, struct ice_sched_node *node,
  * It updates node's BW limit parameters like BW RL profile ID of type CIR,
  * EIR, or SRL. The caller needs to hold scheduler lock.
  */
-static int
+int
 ice_sched_set_node_bw_lmt(struct ice_port_info *pi, struct ice_sched_node *node,
 			  enum ice_rl_type rl_type, u32 bw)
 {
diff --git a/drivers/net/ethernet/intel/ice/ice_sched.h b/drivers/net/ethernet/intel/ice/ice_sched.h
index 4f91577fed56..7958a3805b7f 100644
--- a/drivers/net/ethernet/intel/ice/ice_sched.h
+++ b/drivers/net/ethernet/intel/ice/ice_sched.h
@@ -69,6 +69,28 @@ int
 ice_aq_query_sched_elems(struct ice_hw *hw, u16 elems_req,
 			 struct ice_aqc_txsched_elem_data *buf, u16 buf_size,
 			 u16 *elems_ret, struct ice_sq_cd *cd);
+
+int
+ice_sched_set_node_bw_lmt(struct ice_port_info *pi, struct ice_sched_node *node,
+			  enum ice_rl_type rl_type, u32 bw);
+
+int
+ice_sched_set_node_bw(struct ice_port_info *pi, struct ice_sched_node *node,
+		      enum ice_rl_type rl_type, u32 bw, u8 layer_num);
+
+int
+ice_sched_add_elems(struct ice_port_info *pi, struct ice_sched_node *tc_node,
+		    struct ice_sched_node *parent, u8 layer, u16 num_nodes,
+		    u16 *num_nodes_added, u32 *first_node_teid);
+
+int
+ice_sched_move_nodes(struct ice_port_info *pi, struct ice_sched_node *parent,
+		     u16 num_items, u32 *list);
+
+int ice_sched_set_node_priority(struct ice_port_info *pi, struct ice_sched_node *node,
+				u16 priority);
+int ice_sched_set_node_weight(struct ice_port_info *pi, struct ice_sched_node *node, u16 weight);
+
 int ice_sched_init_port(struct ice_port_info *pi);
 int ice_sched_query_res_alloc(struct ice_hw *hw);
 void ice_sched_get_psm_clk_freq(struct ice_hw *hw);
@@ -81,7 +103,10 @@ struct ice_sched_node *
 ice_sched_find_node_by_teid(struct ice_sched_node *start_node, u32 teid);
 int
 ice_sched_add_node(struct ice_port_info *pi, u8 layer,
-		   struct ice_aqc_txsched_elem_data *info);
+		   struct ice_aqc_txsched_elem_data *info, u16 txq_id);
+void
+ice_sched_update_parent(struct ice_sched_node *new_parent,
+			struct ice_sched_node *node);
 void ice_free_sched_node(struct ice_port_info *pi, struct ice_sched_node *node);
 struct ice_sched_node *ice_sched_get_tc_node(struct ice_port_info *pi, u8 tc);
 struct ice_sched_node *
diff --git a/drivers/net/ethernet/intel/ice/ice_type.h b/drivers/net/ethernet/intel/ice/ice_type.h
index 6ea54a3fad2c..dc3a675c988f 100644
--- a/drivers/net/ethernet/intel/ice/ice_type.h
+++ b/drivers/net/ethernet/intel/ice/ice_type.h
@@ -522,8 +522,15 @@ struct ice_sched_node {
 	struct ice_sched_node *sibling; /* next sibling in the same layer */
 	struct ice_sched_node **children;
 	struct ice_aqc_txsched_elem_data info;
+	char *name;
+	u64 tx_max;
+	u64 tx_share;
 	u32 agg_id;			/* aggregator group ID */
+	u32 id;
+	u16 tx_queue_id;
 	u16 vsi_handle;
+	u16 tx_priority;
+	u16 tx_weight;
 	u8 in_use;			/* suspended or in use */
 	u8 tx_sched_layer;		/* Logical Layer (1-9) */
 	u8 num_children;
@@ -704,6 +711,7 @@ struct ice_port_info {
 	/* List contain profile ID(s) and other params per layer */
 	struct list_head rl_prof_list[ICE_AQC_TOPO_MAX_LEVEL_NUM];
 	struct ice_qos_cfg qos_cfg;
+	struct ida sched_node_ids;
 	u8 is_vf:1;
 };
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH net-next v4 4/6] ice: Implement devlink-rate API
  2022-09-15 13:42 [RFC PATCH net-next v4 0/6] Implement devlink-rate API and extend it Michal Wilczynski
                   ` (2 preceding siblings ...)
  2022-09-15 13:42 ` [RFC PATCH net-next v4 3/6] ice: Introduce new parameters in ice_sched_node Michal Wilczynski
@ 2022-09-15 13:42 ` Michal Wilczynski
  2022-09-22 13:08   ` Przemek Kitszel
  2022-09-15 13:42 ` [RFC PATCH net-next v4 5/6] ice: Export Tx scheduler configuration to devlink-rate Michal Wilczynski
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Michal Wilczynski @ 2022-09-15 13:42 UTC (permalink / raw)
  To: netdev
  Cc: alexandr.lobakin, dchumak, maximmi, jiri, simon.horman,
	jacob.e.keller, jesse.brandeburg, przemyslaw.kitszel,
	Michal Wilczynski

There is a need to support modification of the Tx scheduler topology in
the ice driver. This allows the user to control the Tx settings of each
node in the internal hierarchy of nodes. Four parameters are supported
per node: tx_max, tx_share, tx_priority and tx_weight.
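The export of the internal node hierarchy relies on a simple depth-first recursion over the scheduler tree. A standalone sketch of that pattern (struct sched_node and export_subtree() are illustrative stand-ins here, not the driver's types):

```c
#include <assert.h>

/* Simplified stand-in for struct ice_sched_node: each node knows its
 * parent and children, mirroring the internal Tx scheduler hierarchy. */
struct sched_node {
	struct sched_node *parent;
	struct sched_node *children[4];
	int num_children;
	int exported;	/* set once a devlink-rate object exists for it */
};

/* Walk the subtree depth-first and "export" every node - the same
 * shape of recursion used when registering nodes with devlink-rate.
 * Returns the number of nodes visited. */
static int export_subtree(struct sched_node *node)
{
	int count = 1;
	int i;

	node->exported = 1;
	for (i = 0; i < node->num_children; i++)
		count += export_subtree(node->children[i]);
	return count;
}
```

Per-node parameters such as tx_max or tx_share would then be applied to each exported node individually.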

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_devlink.c | 511 +++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_devlink.h |   2 +
 2 files changed, 513 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_devlink.c b/drivers/net/ethernet/intel/ice/ice_devlink.c
index e6ec20079ced..925283605b59 100644
--- a/drivers/net/ethernet/intel/ice/ice_devlink.c
+++ b/drivers/net/ethernet/intel/ice/ice_devlink.c
@@ -713,6 +713,490 @@ ice_devlink_port_unsplit(struct devlink *devlink, struct devlink_port *port,
 	return ice_devlink_port_split(devlink, port, 1, extack);
 }
 
+/**
+ * ice_traverse_tx_tree - traverse Tx scheduler tree
+ * @devlink: devlink struct
+ * @node: current node, used for recursion
+ * @tc_node: tc_node struct, that is treated as a root
+ * @pf: pf struct
+ *
+ * This function traverses the Tx scheduler tree and exports
+ * the entire structure to devlink-rate.
+ */
+static void ice_traverse_tx_tree(struct devlink *devlink, struct ice_sched_node *node,
+				 struct ice_sched_node *tc_node, struct ice_pf *pf)
+{
+	struct ice_vf *vf;
+	int i;
+
+	devl_lock(devlink);
+
+	if (node->parent == tc_node) {
+		/* create root node */
+		devl_rate_node_create(devlink, node, node->name, NULL);
+	} else if (node->info.data.elem_type == ICE_AQC_ELEM_TYPE_LEAF &&
+		   node->parent->name) {
+		devl_rate_queue_create(devlink, node->parent->name, node->tx_queue_id, node);
+	} else if (node->vsi_handle &&
+		   pf->vsi[node->vsi_handle]->vf) {
+		vf = pf->vsi[node->vsi_handle]->vf;
+		snprintf(node->name, DEVLINK_RATE_NAME_MAX_LEN, "vport_%u", vf->devlink_port.index);
+		if (!vf->devlink_port.devlink_rate)
+			devl_rate_vport_create(&vf->devlink_port, node, node->parent->name);
+	} else {
+		devl_rate_node_create(devlink, node, node->name, node->parent->name);
+	}
+
+	devl_unlock(devlink);
+
+	for (i = 0; i < node->num_children; i++)
+		ice_traverse_tx_tree(devlink, node->children[i], tc_node, pf);
+}
+
+/**
+ * ice_devlink_rate_init_tx_topology - export Tx scheduler tree to devlink rate
+ * @devlink: devlink struct
+ * @vsi: main vsi struct
+ *
+ * This function finds the root node, then calls ice_traverse_tx_tree(),
+ * which walks the tree and exports its contents to devlink-rate.
+ */
+int ice_devlink_rate_init_tx_topology(struct devlink *devlink, struct ice_vsi *vsi)
+{
+	struct ice_port_info *pi = vsi->port_info;
+	struct ice_sched_node *tc_node;
+	struct ice_pf *pf = vsi->back;
+	int i;
+
+	tc_node = pi->root->children[0];
+	mutex_lock(&pi->sched_lock);
+	for (i = 0; i < tc_node->num_children; i++)
+		ice_traverse_tx_tree(devlink, tc_node->children[i], tc_node, pf);
+	mutex_unlock(&pi->sched_lock);
+
+	return 0;
+}
+
+/**
+ * ice_set_object_tx_share - sets node scheduling parameter
+ * @pi: devlink struct instance
+ * @node: node struct instance
+ * @extack: extended netdev ack structure
+ *
+ * This function sets ICE_MIN_BW scheduling BW limit.
+ */
+static int ice_set_object_tx_share(struct ice_port_info *pi, struct ice_sched_node *node,
+				   struct netlink_ext_ack *extack)
+{
+	int status;
+
+	mutex_lock(&pi->sched_lock);
+	status = ice_sched_set_node_bw_lmt(pi, node, ICE_MIN_BW, node->tx_share);
+	mutex_unlock(&pi->sched_lock);
+
+	if (status)
+		NL_SET_ERR_MSG_MOD(extack, "Can't set scheduling node tx_share");
+
+	return status;
+}
+
+/**
+ * ice_set_object_tx_max - sets node scheduling parameter
+ * @pi: devlink struct instance
+ * @node: node struct instance
+ * @extack: extended netdev ack structure
+ *
+ * This function sets ICE_MAX_BW scheduling BW limit.
+ */
+static int ice_set_object_tx_max(struct ice_port_info *pi, struct ice_sched_node *node,
+				 struct netlink_ext_ack *extack)
+{
+	int status;
+
+	mutex_lock(&pi->sched_lock);
+	status = ice_sched_set_node_bw_lmt(pi, node, ICE_MAX_BW, node->tx_max);
+	mutex_unlock(&pi->sched_lock);
+
+	if (status)
+		NL_SET_ERR_MSG_MOD(extack, "Can't set scheduling node tx_max");
+
+	return status;
+}
+
+/**
+ * ice_set_object_tx_priority - sets node scheduling parameter
+ * @pi: devlink struct instance
+ * @node: node struct instance
+ * @extack: extended netdev ack structure
+ *
+ * This function sets priority of node among siblings.
+ */
+static int ice_set_object_tx_priority(struct ice_port_info *pi, struct ice_sched_node *node,
+				      struct netlink_ext_ack *extack)
+{
+	int status;
+
+	mutex_lock(&pi->sched_lock);
+	status = ice_sched_set_node_priority(pi, node, node->tx_priority);
+	mutex_unlock(&pi->sched_lock);
+
+	if (status)
+		NL_SET_ERR_MSG_MOD(extack, "Can't set scheduling node tx_priority");
+
+	return status;
+}
+
+/**
+ * ice_set_object_tx_weight - sets node scheduling parameter
+ * @pi: devlink struct instance
+ * @node: node struct instance
+ * @extack: extended netdev ack structure
+ *
+ * This function sets node weight for WFQ algorithm.
+ */
+static int ice_set_object_tx_weight(struct ice_port_info *pi, struct ice_sched_node *node,
+				    struct netlink_ext_ack *extack)
+{
+	int status;
+
+	mutex_lock(&pi->sched_lock);
+	status = ice_sched_set_node_weight(pi, node, node->tx_weight);
+	mutex_unlock(&pi->sched_lock);
+
+	if (status)
+		NL_SET_ERR_MSG_MOD(extack, "Can't set scheduling node tx_weight");
+
+	return status;
+}
+
+/**
+ * ice_get_pi_from_dev_rate - get port info from devlink_rate
+ * @rate_node: devlink_rate struct instance
+ *
+ * This function returns the corresponding port_info struct of a devlink_rate node
+ */
+static struct ice_port_info *ice_get_pi_from_dev_rate(struct devlink_rate *rate_node)
+{
+	struct ice_pf *pf = devlink_priv(rate_node->devlink);
+
+	return ice_get_main_vsi(pf)->port_info;
+}
+
+static int ice_devlink_rate_node_new(struct devlink_rate *rate_node, void **priv,
+				     struct netlink_ext_ack *extack)
+{
+	return 0;
+}
+
+static int ice_devlink_rate_node_del(struct devlink_rate *rate_node, void *priv,
+				     struct netlink_ext_ack *extack)
+{
+	struct ice_sched_node *node, *tc_node;
+	struct ice_port_info *pi;
+
+	pi = ice_get_pi_from_dev_rate(rate_node);
+	tc_node = pi->root->children[0];
+	node = priv;
+
+	if (!rate_node->parent || !node || tc_node == node)
+		return 0;
+
+	if (node->info.data.elem_type == ICE_AQC_ELEM_TYPE_LEAF) {
+		NL_SET_ERR_MSG_MOD(extack, "Queue can't be deleted");
+		return -EINVAL;
+	}
+	mutex_lock(&pi->sched_lock);
+	ice_free_sched_node(pi, node);
+	mutex_unlock(&pi->sched_lock);
+
+	return 0;
+}
+
+static int ice_devlink_rate_vport_tx_max_set(struct devlink_rate *rate_vport, void *priv,
+					     u64 tx_max, struct netlink_ext_ack *extack)
+{
+	struct ice_sched_node *node = priv;
+
+	if (!node)
+		return 0;
+
+	node->tx_max = tx_max / 10;
+
+	return ice_set_object_tx_max(ice_get_pi_from_dev_rate(rate_vport), node, extack);
+}
+
+static int ice_devlink_rate_vport_tx_share_set(struct devlink_rate *rate_vport, void *priv,
+					       u64 tx_share, struct netlink_ext_ack *extack)
+{
+	struct ice_sched_node *node = priv;
+
+	if (!node)
+		return 0;
+
+	node->tx_share = tx_share / 10;
+
+	return ice_set_object_tx_share(ice_get_pi_from_dev_rate(rate_vport), node, extack);
+}
+
+static int ice_devlink_rate_vport_tx_priority_set(struct devlink_rate *rate_vport, void *priv,
+						  u64 tx_priority, struct netlink_ext_ack *extack)
+{
+	struct ice_sched_node *node = priv;
+
+	if (!node)
+		return 0;
+
+	node->tx_priority = tx_priority;
+
+	return ice_set_object_tx_priority(ice_get_pi_from_dev_rate(rate_vport), node, extack);
+}
+
+static int ice_devlink_rate_vport_tx_weight_set(struct devlink_rate *rate_vport, void *priv,
+						u64 tx_weight, struct netlink_ext_ack *extack)
+{
+	struct ice_sched_node *node = priv;
+
+	if (!node)
+		return 0;
+
+	node->tx_weight = tx_weight;
+
+	return ice_set_object_tx_weight(ice_get_pi_from_dev_rate(rate_vport), node, extack);
+}
+
+static int ice_devlink_rate_node_tx_max_set(struct devlink_rate *rate_node, void *priv,
+					    u64 tx_max, struct netlink_ext_ack *extack)
+{
+	struct ice_sched_node *node = priv;
+
+	if (!node)
+		return 0;
+
+	node->tx_max = tx_max / 10;
+
+	return ice_set_object_tx_max(ice_get_pi_from_dev_rate(rate_node), node, extack);
+}
+
+static int ice_devlink_rate_node_tx_share_set(struct devlink_rate *rate_node, void *priv,
+					      u64 tx_share, struct netlink_ext_ack *extack)
+{
+	struct ice_sched_node *node = priv;
+
+	if (!node)
+		return 0;
+
+	node->tx_share = tx_share / 10;
+
+	return ice_set_object_tx_share(ice_get_pi_from_dev_rate(rate_node), node, extack);
+}
+
+static int ice_devlink_rate_node_tx_priority_set(struct devlink_rate *rate_node, void *priv,
+						 u64 tx_priority, struct netlink_ext_ack *extack)
+{
+	struct ice_sched_node *node = priv;
+
+	if (!node)
+		return 0;
+
+	node->tx_priority = tx_priority;
+
+	return ice_set_object_tx_priority(ice_get_pi_from_dev_rate(rate_node), node, extack);
+}
+
+static int ice_devlink_rate_node_tx_weight_set(struct devlink_rate *rate_node, void *priv,
+					       u64 tx_weight, struct netlink_ext_ack *extack)
+{
+	struct ice_sched_node *node = priv;
+
+	if (!node)
+		return 0;
+
+	node->tx_weight = tx_weight;
+
+	return ice_set_object_tx_weight(ice_get_pi_from_dev_rate(rate_node), node, extack);
+}
+
+static int ice_devlink_rate_queue_tx_max_set(struct devlink_rate *rate_queue, void *priv,
+					     u64 tx_max, struct netlink_ext_ack *extack)
+{
+	struct ice_sched_node *node = priv;
+
+	if (!node)
+		return 0;
+
+	node->tx_max = tx_max / 10;
+
+	return ice_set_object_tx_max(ice_get_pi_from_dev_rate(rate_queue), node, extack);
+}
+
+static int ice_devlink_rate_queue_tx_share_set(struct devlink_rate *rate_queue, void *priv,
+					       u64 tx_share, struct netlink_ext_ack *extack)
+{
+	struct ice_sched_node *node = priv;
+
+	if (!node)
+		return 0;
+
+	node->tx_share = tx_share / 10;
+
+	return ice_set_object_tx_share(ice_get_pi_from_dev_rate(rate_queue), node, extack);
+}
+
+static int ice_devlink_rate_queue_tx_priority_set(struct devlink_rate *rate_queue, void *priv,
+						  u64 tx_priority, struct netlink_ext_ack *extack)
+{
+	struct ice_sched_node *node = priv;
+
+	if (!node)
+		return 0;
+
+	node->tx_priority = tx_priority;
+
+	return ice_set_object_tx_priority(ice_get_pi_from_dev_rate(rate_queue), node, extack);
+}
+
+static int ice_devlink_rate_queue_tx_weight_set(struct devlink_rate *rate_queue, void *priv,
+						u64 tx_weight, struct netlink_ext_ack *extack)
+{
+	struct ice_sched_node *node = priv;
+
+	if (!node)
+		return 0;
+
+	node->tx_weight = tx_weight;
+
+	return ice_set_object_tx_weight(ice_get_pi_from_dev_rate(rate_queue), node, extack);
+}
+
+static int ice_devlink_set_parent(struct devlink_rate *devlink_rate,
+				  struct devlink_rate *parent,
+				  void **priv, void *parent_priv,
+				  struct netlink_ext_ack *extack)
+{
+	struct ice_port_info *pi = ice_get_pi_from_dev_rate(devlink_rate);
+	struct ice_sched_node *tc_node, *node, *parent_node;
+	u16 num_nodes_added;
+	u32 first_node_teid;
+	u32 node_teid;
+	int status;
+
+	tc_node = pi->root->children[0];
+	node = *priv;
+
+	if (!parent) {
+		if (!node || tc_node == node ||
+		    node->info.data.elem_type == ICE_AQC_ELEM_TYPE_LEAF) {
+			return -EINVAL;
+		}
+
+		mutex_lock(&pi->sched_lock);
+		ice_free_sched_node(pi, node);
+		mutex_unlock(&pi->sched_lock);
+
+		return 0;
+	}
+
+	parent_node = parent_priv;
+
+	/* if the node doesn't exist, create it */
+	if (!node) {
+		mutex_lock(&pi->sched_lock);
+
+		status = ice_sched_add_elems(pi, tc_node, parent_node,
+					     parent_node->tx_sched_layer + 1,
+					     1, &num_nodes_added, &first_node_teid);
+
+		mutex_unlock(&pi->sched_lock);
+
+		if (status) {
+			NL_SET_ERR_MSG_MOD(extack, "Can't add a new node");
+			return status;
+		}
+
+		node = ice_sched_find_node_by_teid(parent_node, first_node_teid);
+		*priv = node;
+
+		if (devlink_rate->tx_share) {
+			node->tx_share = devlink_rate->tx_share;
+			ice_set_object_tx_share(pi, node, extack);
+		}
+		if (devlink_rate->tx_max) {
+			node->tx_max = devlink_rate->tx_max;
+			ice_set_object_tx_max(pi, node, extack);
+		}
+		if (devlink_rate->tx_priority) {
+			node->tx_priority = devlink_rate->tx_priority;
+			ice_set_object_tx_priority(pi, node, extack);
+		}
+		if (devlink_rate->tx_weight) {
+			node->tx_weight = devlink_rate->tx_weight;
+			ice_set_object_tx_weight(pi, node, extack);
+		}
+	} else {
+		node_teid = le32_to_cpu(node->info.node_teid);
+		mutex_lock(&pi->sched_lock);
+		status = ice_sched_move_nodes(pi, parent_node, 1, &node_teid);
+		mutex_unlock(&pi->sched_lock);
+
+		if (status)
+			NL_SET_ERR_MSG_MOD(extack, "Can't move existing node to a new parent");
+	}
+
+	return status;
+}
+
+static int
+ice_devlink_reassign_queue(struct ice_port_info *pi, struct ice_sched_node *queue_node,
+			   struct ice_sched_node *src_node, struct ice_sched_node *dst_node)
+{
+	struct ice_aqc_move_txqs_data *buf;
+	struct ice_hw *hw = pi->hw;
+	u32 blocked_cgds;
+	u8 txqs_moved;
+	u16 buf_size;
+	int status;
+
+	buf_size = struct_size(buf, txqs, 1);
+	buf = kzalloc(buf_size, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	buf->src_teid = src_node->info.node_teid;
+	buf->dest_teid = dst_node->info.node_teid;
+	buf->txqs[0].txq_id = queue_node->tx_queue_id;
+	buf->txqs[0].q_cgd = 0;
+	buf->txqs[0].q_teid = queue_node->info.node_teid;
+
+	status = ice_aq_move_recfg_lan_txq(hw, 1, true, false, false, false, 50,
+					   &blocked_cgds, buf, buf_size, &txqs_moved, NULL);
+	if (!status)
+		ice_sched_update_parent(dst_node, queue_node);
+
+	kfree(buf);
+
+	return status;
+}
+
+static int ice_devlink_rate_queue_parent_set(struct devlink_rate *devlink_rate,
+					     struct devlink_rate *parent,
+					     void **priv, void *parent_priv,
+					     struct netlink_ext_ack *extack)
+{
+	struct ice_sched_node *node, *prev_parent, *next_parent;
+	struct ice_port_info *pi;
+
+	if (!parent)
+		return -EINVAL;
+
+	pi = ice_get_pi_from_dev_rate(devlink_rate);
+
+	node = *priv;
+	next_parent = parent_priv;
+	prev_parent = node->parent;
+
+	return ice_devlink_reassign_queue(pi, node, prev_parent, next_parent);
+}
+
 static const struct devlink_ops ice_devlink_ops = {
 	.supported_flash_update_params = DEVLINK_SUPPORT_FLASH_UPDATE_OVERWRITE_MASK,
 	.reload_actions = BIT(DEVLINK_RELOAD_ACTION_FW_ACTIVATE),
@@ -725,6 +1209,28 @@ static const struct devlink_ops ice_devlink_ops = {
 	.eswitch_mode_set = ice_eswitch_mode_set,
 	.info_get = ice_devlink_info_get,
 	.flash_update = ice_devlink_flash_update,
+
+	.rate_node_new = ice_devlink_rate_node_new,
+	.rate_node_del = ice_devlink_rate_node_del,
+
+	.rate_vport_tx_max_set = ice_devlink_rate_vport_tx_max_set,
+	.rate_vport_tx_share_set = ice_devlink_rate_vport_tx_share_set,
+	.rate_vport_tx_priority_set = ice_devlink_rate_vport_tx_priority_set,
+	.rate_vport_tx_weight_set = ice_devlink_rate_vport_tx_weight_set,
+
+	.rate_node_tx_max_set = ice_devlink_rate_node_tx_max_set,
+	.rate_node_tx_share_set = ice_devlink_rate_node_tx_share_set,
+	.rate_node_tx_priority_set = ice_devlink_rate_node_tx_priority_set,
+	.rate_node_tx_weight_set = ice_devlink_rate_node_tx_weight_set,
+
+	.rate_queue_tx_max_set = ice_devlink_rate_queue_tx_max_set,
+	.rate_queue_tx_share_set = ice_devlink_rate_queue_tx_share_set,
+	.rate_queue_tx_priority_set = ice_devlink_rate_queue_tx_priority_set,
+	.rate_queue_tx_weight_set = ice_devlink_rate_queue_tx_weight_set,
+
+	.rate_vport_parent_set = ice_devlink_set_parent,
+	.rate_node_parent_set = ice_devlink_set_parent,
+	.rate_queue_parent_set = ice_devlink_rate_queue_parent_set,
 };
 
 static int
@@ -893,6 +1399,9 @@ void ice_devlink_register(struct ice_pf *pf)
  */
 void ice_devlink_unregister(struct ice_pf *pf)
 {
+	devl_lock(priv_to_devlink(pf));
+	devl_rate_objects_destroy(priv_to_devlink(pf));
+	devl_unlock(priv_to_devlink(pf));
 	devlink_unregister(priv_to_devlink(pf));
 }
 
@@ -1098,6 +1607,8 @@ void ice_devlink_destroy_vf_port(struct ice_vf *vf)
 
 	devlink_port = &vf->devlink_port;
 
+	devl_rate_vport_destroy(devlink_port);
+
 	devlink_port_type_clear(devlink_port);
 	devlink_port_unregister(devlink_port);
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_devlink.h b/drivers/net/ethernet/intel/ice/ice_devlink.h
index fe006d9946f8..8bfed9ee2c4c 100644
--- a/drivers/net/ethernet/intel/ice/ice_devlink.h
+++ b/drivers/net/ethernet/intel/ice/ice_devlink.h
@@ -18,4 +18,6 @@ void ice_devlink_destroy_vf_port(struct ice_vf *vf);
 void ice_devlink_init_regions(struct ice_pf *pf);
 void ice_devlink_destroy_regions(struct ice_pf *pf);
 
+int ice_devlink_rate_init_tx_topology(struct devlink *devlink, struct ice_vsi *vsi);
+
 #endif /* _ICE_DEVLINK_H_ */
-- 
2.37.2



* [RFC PATCH net-next v4 5/6] ice: Export Tx scheduler configuration to devlink-rate
  2022-09-15 13:42 [RFC PATCH net-next v4 0/6] Implement devlink-rate API and extend it Michal Wilczynski
                   ` (3 preceding siblings ...)
  2022-09-15 13:42 ` [RFC PATCH net-next v4 4/6] ice: Implement devlink-rate API Michal Wilczynski
@ 2022-09-15 13:42 ` Michal Wilczynski
  2022-09-15 13:42 ` [RFC PATCH net-next v4 6/6] ice: Prevent ADQ, DCB and RDMA coexistence with Custom Tx scheduler Michal Wilczynski
  2022-09-15 13:57 ` [RFC PATCH net-next v4 0/6] Implement devlink-rate API and extend it Wilczynski, Michal
  6 siblings, 0 replies; 35+ messages in thread
From: Michal Wilczynski @ 2022-09-15 13:42 UTC (permalink / raw)
  To: netdev
  Cc: alexandr.lobakin, dchumak, maximmi, jiri, simon.horman,
	jacob.e.keller, jesse.brandeburg, przemyslaw.kitszel,
	Michal Wilczynski

There is a need to export the Tx scheduler configuration to the
devlink-rate kernel mechanism, together with a complete list of queues
in the scheduling topology. Unfortunately, when a reset happens, the
ice_sched_node objects that represent queues are re-created. This forces
us to re-initialize the devlink-rate representation of the driver's Tx
scheduler tree.
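The destroy-then-rebuild ordering matters: stale devlink-rate objects must be torn down before the topology is exported again. A minimal sketch of that lifecycle (rate_repr, rate_objects_destroy() and rate_init_topology() are hypothetical names, not the driver's API):

```c
#include <assert.h>

/* Illustrative state for the devlink-rate representation. On reset the
 * old objects are destroyed before the tree is re-exported, since the
 * underlying scheduler nodes have been re-created. */
enum rate_state { RATE_EMPTY, RATE_EXPORTED };

struct rate_repr {
	enum rate_state state;
	int generation;		/* bumped on every successful export */
};

static void rate_objects_destroy(struct rate_repr *r)
{
	r->state = RATE_EMPTY;
}

static int rate_init_topology(struct rate_repr *r)
{
	/* refuse to export on top of stale objects */
	if (r->state != RATE_EMPTY)
		return -1;
	r->state = RATE_EXPORTED;
	r->generation++;
	return 0;
}

/* Reset path: always destroy first, then rebuild. */
static int handle_reset(struct rate_repr *r)
{
	rate_objects_destroy(r);
	return rate_init_topology(r);
}
```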

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_virtchnl.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_virtchnl.c b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
index 2b4c791b6cba..0f0a03b7725e 100644
--- a/drivers/net/ethernet/intel/ice/ice_virtchnl.c
+++ b/drivers/net/ethernet/intel/ice/ice_virtchnl.c
@@ -12,6 +12,7 @@
 #include "ice_vlan.h"
 #include "ice_flex_pipe.h"
 #include "ice_dcb_lib.h"
+#include "ice_devlink.h"
 
 #define FIELD_SELECTOR(proto_hdr_field) \
 		BIT((proto_hdr_field) & PROTO_HDR_FIELD_MASK)
@@ -1597,6 +1598,7 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg)
 	    (struct virtchnl_vsi_queue_config_info *)msg;
 	struct virtchnl_queue_pair_info *qpi;
 	struct ice_pf *pf = vf->pf;
+	struct devlink *devlink;
 	struct ice_vsi *vsi;
 	int i = -1, q_idx;
 
@@ -1655,6 +1657,14 @@ static int ice_vc_cfg_qs_msg(struct ice_vf *vf, u8 *msg)
 			}
 		}
 
+		devlink = priv_to_devlink(pf);
+
+		devl_lock(devlink);
+		devl_rate_objects_destroy(devlink);
+		devl_unlock(devlink);
+
+		ice_devlink_rate_init_tx_topology(devlink, ice_get_main_vsi(pf));
+
 		/* copy Rx queue info from VF into VSI */
 		if (qpi->rxq.ring_len > 0) {
 			u16 max_frame_size = ice_vc_get_max_frame_size(vf);
-- 
2.37.2



* [RFC PATCH net-next v4 6/6] ice: Prevent ADQ, DCB and RDMA coexistence with Custom Tx scheduler
  2022-09-15 13:42 [RFC PATCH net-next v4 0/6] Implement devlink-rate API and extend it Michal Wilczynski
                   ` (4 preceding siblings ...)
  2022-09-15 13:42 ` [RFC PATCH net-next v4 5/6] ice: Export Tx scheduler configuration to devlink-rate Michal Wilczynski
@ 2022-09-15 13:42 ` Michal Wilczynski
  2022-09-15 13:57 ` [RFC PATCH net-next v4 0/6] Implement devlink-rate API and extend it Wilczynski, Michal
  6 siblings, 0 replies; 35+ messages in thread
From: Michal Wilczynski @ 2022-09-15 13:42 UTC (permalink / raw)
  To: netdev
  Cc: alexandr.lobakin, dchumak, maximmi, jiri, simon.horman,
	jacob.e.keller, jesse.brandeburg, przemyslaw.kitszel,
	Michal Wilczynski

ADQ, DCB and RDMA might interfere with custom Tx scheduler changes
that the user might introduce using the devlink-rate API.

Check whether ADQ, DCB or RDMA is active when the user tries to change
any setting in the exported Tx scheduler tree.
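The mutual exclusion is enforced in both directions: enabling the custom Tx scheduler is refused while a conflicting feature is active, and the conflicting feature is refused once the custom scheduler is enabled. A standalone sketch of that gate (feat_state, enable_custom_tx() and enable_rdma() are illustrative, and EBUSY_ERR stands in for the kernel's EBUSY):

```c
#include <assert.h>

#define EBUSY_ERR 16	/* stand-in for the kernel's EBUSY errno */

/* Hypothetical feature-gate state: the custom Tx scheduler can only be
 * enabled while ADQ, DCB and RDMA are all inactive. */
struct feat_state {
	int adq_active;
	int dcb_active;
	int rdma_active;
	int custom_tx_enabled;
};

static int enable_custom_tx(struct feat_state *f)
{
	if (f->custom_tx_enabled)
		return 0;	/* already enabled */
	if (f->adq_active || f->dcb_active || f->rdma_active)
		return -EBUSY_ERR;
	f->custom_tx_enabled = 1;
	return 0;
}

/* The reverse direction, mirroring the check added to the RDMA
 * aux-device plug path. */
static int enable_rdma(struct feat_state *f)
{
	if (f->custom_tx_enabled)
		return -EBUSY_ERR;
	f->rdma_active = 1;
	return 0;
}
```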

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_dcb_lib.c |  4 +
 drivers/net/ethernet/intel/ice/ice_devlink.c | 87 ++++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_idc.c     |  5 ++
 drivers/net/ethernet/intel/ice/ice_type.h    |  1 +
 4 files changed, 97 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
index add90e75f05c..8d7fc76f49af 100644
--- a/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_dcb_lib.c
@@ -364,6 +364,10 @@ int ice_pf_dcb_cfg(struct ice_pf *pf, struct ice_dcbx_cfg *new_cfg, bool locked)
 	/* Enable DCB tagging only when more than one TC */
 	if (ice_dcb_get_num_tc(new_cfg) > 1) {
 		dev_dbg(dev, "DCB tagging enabled (num TC > 1)\n");
+		if (pf->hw.port_info->is_custom_tx_enabled) {
+			dev_err(dev, "Custom Tx scheduler feature enabled, can't configure DCB\n");
+			return -EBUSY;
+		}
 		set_bit(ICE_FLAG_DCB_ENA, pf->flags);
 	} else {
 		dev_dbg(dev, "DCB tagging disabled (num TC = 1)\n");
diff --git a/drivers/net/ethernet/intel/ice/ice_devlink.c b/drivers/net/ethernet/intel/ice/ice_devlink.c
index 925283605b59..5530d8809a42 100644
--- a/drivers/net/ethernet/intel/ice/ice_devlink.c
+++ b/drivers/net/ethernet/intel/ice/ice_devlink.c
@@ -8,6 +8,7 @@
 #include "ice_devlink.h"
 #include "ice_eswitch.h"
 #include "ice_fw_update.h"
+#include "ice_dcb_lib.h"
 
 static int ice_active_port_option = -1;
 
@@ -713,6 +714,44 @@ ice_devlink_port_unsplit(struct devlink *devlink, struct devlink_port *port,
 	return ice_devlink_port_split(devlink, port, 1, extack);
 }
 
+/**
+ * ice_enable_custom_tx - try to enable custom Tx feature
+ * @pf: pf struct
+ * @extack: extended netdev ack structure
+ *
+ * This function tries to enable the custom Tx feature; it is not
+ * possible to enable it while ADQ, DCB or RDMA is active.
+ */
+static bool ice_enable_custom_tx(struct ice_pf *pf, struct netlink_ext_ack *extack)
+{
+	struct ice_port_info *pi = ice_get_main_vsi(pf)->port_info;
+	struct device *dev = ice_pf_to_dev(pf);
+
+	if (pi->is_custom_tx_enabled)
+		/* already enabled, return true */
+		return true;
+
+	if (ice_is_adq_active(pf)) {
+		dev_err(dev, "ADQ active, can't modify Tx scheduler tree\n");
+		return false;
+	}
+
+	if (ice_is_dcb_active(pf)) {
+		dev_err(dev, "DCB active, can't modify Tx scheduler tree\n");
+		return false;
+	}
+
+	/* check if auxiliary bus is plugged */
+	if (pf->adev) {
+		dev_err(dev, "RDMA active, can't modify Tx scheduler tree\n");
+		return false;
+	}
+
+	pi->is_custom_tx_enabled = true;
+
+	return true;
+}
+
 /**
  * ice_traverse_tx_tree - traverse Tx scheduler tree
  * @devlink: devlink struct
@@ -885,6 +924,9 @@ static struct ice_port_info *ice_get_pi_from_dev_rate(struct devlink_rate *rate_
 static int ice_devlink_rate_node_new(struct devlink_rate *rate_node, void **priv,
 				     struct netlink_ext_ack *extack)
 {
+	if (!ice_enable_custom_tx(devlink_priv(rate_node->devlink), extack))
+		return -EBUSY;
+
 	return 0;
 }
 
@@ -898,6 +940,9 @@ static int ice_devlink_rate_node_del(struct devlink_rate *rate_node, void *priv,
 	tc_node = pi->root->children[0];
 	node = priv;
 
+	if (!ice_enable_custom_tx(devlink_priv(rate_node->devlink), extack))
+		return -EBUSY;
+
 	if (!rate_node->parent || !node || tc_node == node)
 		return 0;
 
@@ -917,6 +962,9 @@ static int ice_devlink_rate_vport_tx_max_set(struct devlink_rate *rate_vport, vo
 {
 	struct ice_sched_node *node = priv;
 
+	if (!ice_enable_custom_tx(devlink_priv(rate_vport->devlink), extack))
+		return -EBUSY;
+
 	if (!node)
 		return 0;
 
@@ -930,6 +978,9 @@ static int ice_devlink_rate_vport_tx_share_set(struct devlink_rate *rate_vport,
 {
 	struct ice_sched_node *node = priv;
 
+	if (!ice_enable_custom_tx(devlink_priv(rate_vport->devlink), extack))
+		return -EBUSY;
+
 	if (!node)
 		return 0;
 
@@ -943,6 +994,9 @@ static int ice_devlink_rate_vport_tx_priority_set(struct devlink_rate *rate_vpor
 {
 	struct ice_sched_node *node = priv;
 
+	if (!ice_enable_custom_tx(devlink_priv(rate_vport->devlink), extack))
+		return -EBUSY;
+
 	if (!node)
 		return 0;
 
@@ -956,6 +1010,9 @@ static int ice_devlink_rate_vport_tx_weight_set(struct devlink_rate *rate_vport,
 {
 	struct ice_sched_node *node = priv;
 
+	if (!ice_enable_custom_tx(devlink_priv(rate_vport->devlink), extack))
+		return -EBUSY;
+
 	if (!node)
 		return 0;
 
@@ -969,6 +1026,9 @@ static int ice_devlink_rate_node_tx_max_set(struct devlink_rate *rate_node, void
 {
 	struct ice_sched_node *node = priv;
 
+	if (!ice_enable_custom_tx(devlink_priv(rate_node->devlink), extack))
+		return -EBUSY;
+
 	if (!node)
 		return 0;
 
@@ -982,6 +1042,9 @@ static int ice_devlink_rate_node_tx_share_set(struct devlink_rate *rate_node, vo
 {
 	struct ice_sched_node *node = priv;
 
+	if (!ice_enable_custom_tx(devlink_priv(rate_node->devlink), extack))
+		return -EBUSY;
+
 	if (!node)
 		return 0;
 
@@ -995,6 +1058,9 @@ static int ice_devlink_rate_node_tx_priority_set(struct devlink_rate *rate_node,
 {
 	struct ice_sched_node *node = priv;
 
+	if (!ice_enable_custom_tx(devlink_priv(rate_node->devlink), extack))
+		return -EBUSY;
+
 	if (!node)
 		return 0;
 
@@ -1008,6 +1074,9 @@ static int ice_devlink_rate_node_tx_weight_set(struct devlink_rate *rate_node, v
 {
 	struct ice_sched_node *node = priv;
 
+	if (!ice_enable_custom_tx(devlink_priv(rate_node->devlink), extack))
+		return -EBUSY;
+
 	if (!node)
 		return 0;
 
@@ -1021,6 +1090,9 @@ static int ice_devlink_rate_queue_tx_max_set(struct devlink_rate *rate_queue, vo
 {
 	struct ice_sched_node *node = priv;
 
+	if (!ice_enable_custom_tx(devlink_priv(rate_queue->devlink), extack))
+		return -EBUSY;
+
 	if (!node)
 		return 0;
 
@@ -1034,6 +1106,9 @@ static int ice_devlink_rate_queue_tx_share_set(struct devlink_rate *rate_queue,
 {
 	struct ice_sched_node *node = priv;
 
+	if (!ice_enable_custom_tx(devlink_priv(rate_queue->devlink), extack))
+		return -EBUSY;
+
 	if (!node)
 		return 0;
 
@@ -1047,6 +1122,9 @@ static int ice_devlink_rate_queue_tx_priority_set(struct devlink_rate *rate_queu
 {
 	struct ice_sched_node *node = priv;
 
+	if (!ice_enable_custom_tx(devlink_priv(rate_queue->devlink), extack))
+		return -EBUSY;
+
 	if (!node)
 		return 0;
 
@@ -1060,6 +1138,9 @@ static int ice_devlink_rate_queue_tx_weight_set(struct devlink_rate *rate_queue,
 {
 	struct ice_sched_node *node = priv;
 
+	if (!ice_enable_custom_tx(devlink_priv(rate_queue->devlink), extack))
+		return -EBUSY;
+
 	if (!node)
 		return 0;
 
@@ -1083,6 +1164,9 @@ static int ice_devlink_set_parent(struct devlink_rate *devlink_rate,
 	tc_node = pi->root->children[0];
 	node = *priv;
 
+	if (!ice_enable_custom_tx(devlink_priv(devlink_rate->devlink), extack))
+		return -EBUSY;
+
 	if (!parent) {
 		if (!node || tc_node == node ||
 		    node->info.data.elem_type == ICE_AQC_ELEM_TYPE_LEAF) {
@@ -1185,6 +1269,9 @@ static int ice_devlink_rate_queue_parent_set(struct devlink_rate *devlink_rate,
 	struct ice_sched_node *node, *prev_parent, *next_parent;
 	struct ice_port_info *pi;
 
+	if (!ice_enable_custom_tx(devlink_priv(devlink_rate->devlink), extack))
+		return -EBUSY;
+
 	if (!parent)
 		return -EINVAL;
 
diff --git a/drivers/net/ethernet/intel/ice/ice_idc.c b/drivers/net/ethernet/intel/ice/ice_idc.c
index 895c32bcc8b5..f702bd5272f2 100644
--- a/drivers/net/ethernet/intel/ice/ice_idc.c
+++ b/drivers/net/ethernet/intel/ice/ice_idc.c
@@ -273,6 +273,11 @@ int ice_plug_aux_dev(struct ice_pf *pf)
 	if (!ice_is_rdma_ena(pf))
 		return 0;
 
+	if (pf->hw.port_info->is_custom_tx_enabled) {
+		dev_err(ice_pf_to_dev(pf), "Custom Tx scheduler enabled, it's mutually exclusive with RDMA\n");
+		return -EBUSY;
+	}
+
 	iadev = kzalloc(sizeof(*iadev), GFP_KERNEL);
 	if (!iadev)
 		return -ENOMEM;
diff --git a/drivers/net/ethernet/intel/ice/ice_type.h b/drivers/net/ethernet/intel/ice/ice_type.h
index dc3a675c988f..1a45bc51480c 100644
--- a/drivers/net/ethernet/intel/ice/ice_type.h
+++ b/drivers/net/ethernet/intel/ice/ice_type.h
@@ -713,6 +713,7 @@ struct ice_port_info {
 	struct ice_qos_cfg qos_cfg;
 	struct ida sched_node_ids;
 	u8 is_vf:1;
+	u8 is_custom_tx_enabled:1;
 };
 
 struct ice_switch_info {
-- 
2.37.2



* Re: [RFC PATCH net-next v4 0/6] Implement devlink-rate API and extend it
  2022-09-15 13:42 [RFC PATCH net-next v4 0/6] Implement devlink-rate API and extend it Michal Wilczynski
                   ` (5 preceding siblings ...)
  2022-09-15 13:42 ` [RFC PATCH net-next v4 6/6] ice: Prevent ADQ, DCB and RDMA coexistence with Custom Tx scheduler Michal Wilczynski
@ 2022-09-15 13:57 ` Wilczynski, Michal
  6 siblings, 0 replies; 35+ messages in thread
From: Wilczynski, Michal @ 2022-09-15 13:57 UTC (permalink / raw)
  To: netdev
  Cc: alexandr.lobakin, dchumak, maximmi, jiri, simon.horman,
	jacob.e.keller, jesse.brandeburg, przemyslaw.kitszel



On 9/15/2022 3:42 PM, Michal Wilczynski wrote:
> This patch series implements devlink-rate for ice driver. Unfortunately
> current API isn't flexible enough for our use case, so there is a need to
> extend it. New object type 'queue' is being introduced, and more functions
> has been changed to non-static, to enable the driver to export current
> Tx scheduling configuration.
>
> This patch series is a follow up for this thread:
> https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/T/#u
>
> V4:
> - changed static variable counter to per port IDA to
>    uniquely identify nodes
>
> V3:
> - removed shift macros, since FIELD_PREP is used
> - added static_assert for struct
> - removed unnecessary functions
> - used tab instead of space in define
>
> V2:
> - fixed Alexandr comments
> - refactored code to fix checkpatch issues
> - added mutual exclusion for RDMA, DCB

I just realized that I haven't removed the versioning from internal review,
please ignore this.
BR,
Michał

>
>
> Ben Shelton (1):
>    ice: Add function for move/reconfigure TxQ AQ command
>
> Michal Wilczynski (5):
>    devlink: Extend devlink-rate api with queues and new parameters
>    ice: Introduce new parameters in ice_sched_node
>    ice: Implement devlink-rate API
>    ice: Export Tx scheduler configuration to devlink-rate
>    ice: Prevent ADQ, DCB and RDMA coexistence with Custom Tx scheduler
>
>   .../net/ethernet/intel/ice/ice_adminq_cmd.h   |  41 +-
>   drivers/net/ethernet/intel/ice/ice_common.c   |  75 ++-
>   drivers/net/ethernet/intel/ice/ice_common.h   |   8 +
>   drivers/net/ethernet/intel/ice/ice_dcb.c      |   2 +-
>   drivers/net/ethernet/intel/ice/ice_dcb_lib.c  |   4 +
>   drivers/net/ethernet/intel/ice/ice_devlink.c  | 598 ++++++++++++++++++
>   drivers/net/ethernet/intel/ice/ice_devlink.h  |   2 +
>   drivers/net/ethernet/intel/ice/ice_idc.c      |   5 +
>   drivers/net/ethernet/intel/ice/ice_main.c     |   2 +
>   drivers/net/ethernet/intel/ice/ice_sched.c    |  81 ++-
>   drivers/net/ethernet/intel/ice/ice_sched.h    |  27 +-
>   drivers/net/ethernet/intel/ice/ice_type.h     |   7 +
>   drivers/net/ethernet/intel/ice/ice_virtchnl.c |  10 +
>   .../net/ethernet/mellanox/mlx5/core/devlink.c |   6 +-
>   .../mellanox/mlx5/core/esw/devlink_port.c     |   8 +-
>   .../net/ethernet/mellanox/mlx5/core/esw/qos.c |  12 +-
>   .../net/ethernet/mellanox/mlx5/core/esw/qos.h |  10 +-
>   drivers/net/netdevsim/dev.c                   |  32 +-
>   include/net/devlink.h                         |  56 +-
>   include/uapi/linux/devlink.h                  |   8 +-
>   net/core/devlink.c                            | 407 ++++++++++--
>   21 files changed, 1284 insertions(+), 117 deletions(-)
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-15 13:42 ` [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters Michal Wilczynski
@ 2022-09-15 15:31   ` Edward Cree
  2022-09-15 18:41     ` Wilczynski, Michal
  0 siblings, 1 reply; 35+ messages in thread
From: Edward Cree @ 2022-09-15 15:31 UTC (permalink / raw)
  To: Michal Wilczynski, netdev
  Cc: alexandr.lobakin, dchumak, maximmi, jiri, simon.horman,
	jacob.e.keller, jesse.brandeburg, przemyslaw.kitszel

On 15/09/2022 14:42, Michal Wilczynski wrote:
> Currently devlink-rate only has two types of objects: nodes and leaves.
> There is a need to extend this interface to account for a third type of
> scheduling element - queues. In our use case, a customer is sending
> different types of traffic on each queue, which requires the ability to
> assign rate parameters to individual queues.

Is there a use-case for this queue scheduling in the absence of a netdevice?
If not, then I don't see how this belongs in devlink; the configuration
 should instead be done in two parts: devlink-rate to schedule between
 different netdevices (e.g. VFs) and tc qdiscs (or some other netdev-level
 API) to schedule different queues within each single netdevice.
Please explain why this existing separation does not support your use-case.

Also I would like to see some documentation as part of this patch.  It looks
 like there's no kernel document for devlink-rate unlike most other devlink
 objects; perhaps you could add one?

-ed

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-15 15:31   ` Edward Cree
@ 2022-09-15 18:41     ` Wilczynski, Michal
  2022-09-15 21:01       ` Edward Cree
                         ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Wilczynski, Michal @ 2022-09-15 18:41 UTC (permalink / raw)
  To: Edward Cree, netdev
  Cc: alexandr.lobakin, dchumak, maximmi, jiri, simon.horman,
	jacob.e.keller, jesse.brandeburg, przemyslaw.kitszel



On 9/15/2022 5:31 PM, Edward Cree wrote:
> On 15/09/2022 14:42, Michal Wilczynski wrote:
>> Currently devlink-rate only has two types of objects: nodes and leaves.
>> There is a need to extend this interface to account for a third type of
>> scheduling element - queues. In our use case, a customer is sending
>> different types of traffic on each queue, which requires the ability to
>> assign rate parameters to individual queues.
> Is there a use-case for this queue scheduling in the absence of a netdevice?
> If not, then I don't see how this belongs in devlink; the configuration
>   should instead be done in two parts: devlink-rate to schedule between
>   different netdevices (e.g. VFs) and tc qdiscs (or some other netdev-level
>   API) to schedule different queues within each single netdevice.
> Please explain why this existing separation does not support your use-case.
>
> Also I would like to see some documentation as part of this patch.  It looks
>   like there's no kernel document for devlink-rate unlike most other devlink
>   objects; perhaps you could add one?
>
> -ed

Hi,
Previously we discussed adding queues to devlink-rate in this thread:
https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/T/#u
In our use case we are trying to find a way to expose the hardware Tx
scheduler tree, which is defined per port, to the user. Obviously, if the
tree is defined per physical port, all the scheduling nodes will reside
on the same tree.

Our customer is trying to send different types of traffic that require
different QoS levels on the same VM, but on different queues. This
requires a completely different rate setup for each queue - in the
implementation that you're mentioning we wouldn't be able to arbitrarily
reassign a queue to any node. Those queues would still need to share a
single parent - their netdev. This wouldn't allow us to fully take
advantage of the HQoS and would introduce arbitrary limitations.

Also, since there is only one vendor implementing this particular
devlink-rate API, I would think there is some room for flexibility.

Regarding the documentation - sure. I just wanted to get all the feedback
from the mailing list and arrive at the final solution before writing the
docs.

BTW, I'm going to be out of the office tomorrow, so I will respond in
this thread on Monday.
BR,
Michał
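
To make the argument above concrete: the point is that a queue's scheduling
node should be reparentable under an arbitrary node of the per-port tree,
rather than being forced to stay inside its netdev's subtree. A minimal,
self-contained sketch of that idea follows; all names here are illustrative
stand-ins, not driver or devlink API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative scheduler-tree node; only the parent link matters here. */
struct sched_node {
	struct sched_node *parent;
};

/* Move a node (e.g. a queue) under an arbitrary scheduling node. */
static void reparent(struct sched_node *n, struct sched_node *new_parent)
{
	n->parent = new_parent;
}

/* Walk up the parent chain to check ancestry. */
static bool is_descendant(const struct sched_node *n,
			  const struct sched_node *anc)
{
	for (; n; n = n->parent)
		if (n == anc)
			return true;
	return false;
}
```

With this model, a queue initially under a VF's subtree can be moved under a
different scheduling node while staying on the same per-port tree, which a
strictly netdev-rooted hierarchy would not allow.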



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-15 18:41     ` Wilczynski, Michal
@ 2022-09-15 21:01       ` Edward Cree
  2022-09-19 13:12         ` Wilczynski, Michal
  2022-09-21 23:33       ` Jakub Kicinski
  2022-09-26 11:51       ` Jiri Pirko
  2 siblings, 1 reply; 35+ messages in thread
From: Edward Cree @ 2022-09-15 21:01 UTC (permalink / raw)
  To: Wilczynski, Michal, netdev
  Cc: alexandr.lobakin, dchumak, maximmi, jiri, simon.horman,
	jacob.e.keller, jesse.brandeburg, przemyslaw.kitszel

On 15/09/2022 19:41, Wilczynski, Michal wrote:
> Hi,
> Previously we discussed adding queues to devlink-rate in this thread:
> https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/T/#u
> In our use case we are trying to find a way to expose the hardware Tx scheduler tree, which is defined
> per port, to the user. Obviously, if the tree is defined per physical port, all the scheduling nodes will reside
> on the same tree.
> 
> Our customer is trying to send different types of traffic that require different QoS levels on the same
> VM, but on different queues. This requires a completely different rate setup for each queue - in the
> implementation that you're mentioning we wouldn't be able to arbitrarily reassign a queue to any node.

I'm not sure I 100% understand what you're describing, but I get the
 impression it's maybe a layering violation — the hypervisor should only
 be responsible for shaping the VM's overall traffic, it should be up to
 the VM to decide how to distribute that bandwidth between traffic types.
But if it's what your customer needs then presumably there's some reason
 for it that I'm not seeing.  I'm not a QoS expert by any means — I just
 get antsy that every time I look at devlink it's gotten bigger and keeps
 escaping further out of the "device-wide configuration" concept it was
 originally sold as :(

> Those queues would still need to share a single parent - their netdev. This wouldn't allow us to fully take
> advantage of the HQoS and would introduce arbitrary limitations.

Oh, so you need a hierarchy within which the VF's queues don't form a
 clade (subtree)?  That sounds like something worth calling out in the
 commit message as the reason why you've designed it this way.

> Regarding the documentation,  sure. I just wanted to get all the feedback from the mailing list and arrive at the final
> solution before writing the docs.

Fair.  But you might get better feedback on the code if people have the
 docs to better understand the intent; just a suggestion.

-ed

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-15 13:42 ` [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters Michal Wilczynski
@ 2022-09-19  9:22 ` Dan Carpenter
  0 siblings, 0 replies; 35+ messages in thread
From: kernel test robot @ 2022-09-19  7:22 UTC (permalink / raw)
  To: kbuild

[-- Attachment #1: Type: text/plain, Size: 9092 bytes --]

BCC: lkp(a)intel.com
CC: kbuild-all(a)lists.01.org
In-Reply-To: <20220915134239.1935604-3-michal.wilczynski@intel.com>
References: <20220915134239.1935604-3-michal.wilczynski@intel.com>
TO: Michal Wilczynski <michal.wilczynski@intel.com>

Hi Michal,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on net-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Michal-Wilczynski/Implement-devlink-rate-API-and-extend-it/20220915-214539
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 8a26a9dee5e3679637edc6f8caf4beb5f3100dde
:::::: branch date: 4 days ago
:::::: commit date: 4 days ago
config: mips-randconfig-m031-20220918 (https://download.01.org/0day-ci/archive/20220919/202209191557.PkFshuDN-lkp(a)intel.com/config)
compiler: mips64el-linux-gcc (GCC) 12.1.0

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

New smatch warnings:
net/core/devlink.c:2039 devlink_nl_rate_set() error: uninitialized symbol 'rate'.
net/core/devlink.c:10399 devl_rate_node_create() warn: possible memory leak of 'rate_node'
net/core/devlink.c:10446 devl_rate_vport_create() warn: possible memory leak of 'devlink_rate'

Old smatch warnings:
net/core/devlink.c:7425 devlink_fmsg_prepare_skb() error: uninitialized symbol 'err'.
net/core/devlink.c:10410 devl_rate_node_create() warn: possible memory leak of 'rate_node'
net/core/devlink.c:10453 devl_rate_vport_create() warn: possible memory leak of 'devlink_rate'

vim +/rate +2039 net/core/devlink.c

d7555984507822 Dmytro Linkin     2021-06-02  1983  
1897db2ec3109e Dmytro Linkin     2021-06-02  1984  static int devlink_nl_rate_set(struct devlink_rate *devlink_rate,
1897db2ec3109e Dmytro Linkin     2021-06-02  1985  			       const struct devlink_ops *ops,
1897db2ec3109e Dmytro Linkin     2021-06-02  1986  			       struct genl_info *info)
1897db2ec3109e Dmytro Linkin     2021-06-02  1987  {
d7555984507822 Dmytro Linkin     2021-06-02  1988  	struct nlattr *nla_parent, **attrs = info->attrs;
a8ecb93ef03de4 Dmytro Linkin     2021-06-02  1989  	int err = -EOPNOTSUPP;
3400deb7c35335 Michal Wilczynski 2022-09-15  1990  	u16 priority;
3400deb7c35335 Michal Wilczynski 2022-09-15  1991  	u16 weight;
1897db2ec3109e Dmytro Linkin     2021-06-02  1992  	u64 rate;
1897db2ec3109e Dmytro Linkin     2021-06-02  1993  
1897db2ec3109e Dmytro Linkin     2021-06-02  1994  	if (attrs[DEVLINK_ATTR_RATE_TX_SHARE]) {
1897db2ec3109e Dmytro Linkin     2021-06-02  1995  		rate = nla_get_u64(attrs[DEVLINK_ATTR_RATE_TX_SHARE]);
3400deb7c35335 Michal Wilczynski 2022-09-15  1996  		if (devlink_rate_is_vport(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  1997  			err = ops->rate_vport_tx_share_set(devlink_rate, devlink_rate->priv,
1897db2ec3109e Dmytro Linkin     2021-06-02  1998  							   rate, info->extack);
a8ecb93ef03de4 Dmytro Linkin     2021-06-02  1999  		else if (devlink_rate_is_node(devlink_rate))
a8ecb93ef03de4 Dmytro Linkin     2021-06-02  2000  			err = ops->rate_node_tx_share_set(devlink_rate, devlink_rate->priv,
a8ecb93ef03de4 Dmytro Linkin     2021-06-02  2001  							  rate, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2002  		else if (devlink_rate_is_queue(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2003  			err = ops->rate_queue_tx_share_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2004  							   rate, info->extack);
1897db2ec3109e Dmytro Linkin     2021-06-02  2005  		if (err)
1897db2ec3109e Dmytro Linkin     2021-06-02  2006  			return err;
1897db2ec3109e Dmytro Linkin     2021-06-02  2007  		devlink_rate->tx_share = rate;
1897db2ec3109e Dmytro Linkin     2021-06-02  2008  	}
1897db2ec3109e Dmytro Linkin     2021-06-02  2009  
1897db2ec3109e Dmytro Linkin     2021-06-02  2010  	if (attrs[DEVLINK_ATTR_RATE_TX_MAX]) {
1897db2ec3109e Dmytro Linkin     2021-06-02  2011  		rate = nla_get_u64(attrs[DEVLINK_ATTR_RATE_TX_MAX]);
3400deb7c35335 Michal Wilczynski 2022-09-15  2012  		if (devlink_rate_is_vport(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2013  			err = ops->rate_vport_tx_max_set(devlink_rate, devlink_rate->priv,
1897db2ec3109e Dmytro Linkin     2021-06-02  2014  							 rate, info->extack);
a8ecb93ef03de4 Dmytro Linkin     2021-06-02  2015  		else if (devlink_rate_is_node(devlink_rate))
a8ecb93ef03de4 Dmytro Linkin     2021-06-02  2016  			err = ops->rate_node_tx_max_set(devlink_rate, devlink_rate->priv,
a8ecb93ef03de4 Dmytro Linkin     2021-06-02  2017  							rate, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2018  		else if (devlink_rate_is_queue(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2019  			err = ops->rate_queue_tx_max_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2020  							rate, info->extack);
1897db2ec3109e Dmytro Linkin     2021-06-02  2021  		if (err)
1897db2ec3109e Dmytro Linkin     2021-06-02  2022  			return err;
1897db2ec3109e Dmytro Linkin     2021-06-02  2023  		devlink_rate->tx_max = rate;
1897db2ec3109e Dmytro Linkin     2021-06-02  2024  	}
1897db2ec3109e Dmytro Linkin     2021-06-02  2025  
3400deb7c35335 Michal Wilczynski 2022-09-15  2026  	if (attrs[DEVLINK_ATTR_RATE_TX_PRIORITY]) {
3400deb7c35335 Michal Wilczynski 2022-09-15  2027  		priority = nla_get_u16(attrs[DEVLINK_ATTR_RATE_TX_PRIORITY]);
3400deb7c35335 Michal Wilczynski 2022-09-15  2028  		if (devlink_rate_is_vport(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2029  			err = ops->rate_vport_tx_priority_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2030  							priority, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2031  		else if (devlink_rate_is_node(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2032  			err = ops->rate_node_tx_priority_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2033  							priority, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2034  		else if (devlink_rate_is_queue(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2035  			err = ops->rate_queue_tx_priority_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2036  							priority, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2037  		if (err)
3400deb7c35335 Michal Wilczynski 2022-09-15  2038  			return err;
3400deb7c35335 Michal Wilczynski 2022-09-15 @2039  		devlink_rate->tx_priority = rate;
3400deb7c35335 Michal Wilczynski 2022-09-15  2040  	}
3400deb7c35335 Michal Wilczynski 2022-09-15  2041  
3400deb7c35335 Michal Wilczynski 2022-09-15  2042  	if (attrs[DEVLINK_ATTR_RATE_TX_WEIGHT]) {
3400deb7c35335 Michal Wilczynski 2022-09-15  2043  		weight = nla_get_u16(attrs[DEVLINK_ATTR_RATE_TX_WEIGHT]);
3400deb7c35335 Michal Wilczynski 2022-09-15  2044  		if (devlink_rate_is_vport(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2045  			err = ops->rate_vport_tx_weight_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2046  							 weight, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2047  		else if (devlink_rate_is_node(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2048  			err = ops->rate_node_tx_weight_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2049  							weight, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2050  		else if (devlink_rate_is_queue(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2051  			err = ops->rate_queue_tx_weight_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2052  							weight, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2053  		if (err)
3400deb7c35335 Michal Wilczynski 2022-09-15  2054  			return err;
3400deb7c35335 Michal Wilczynski 2022-09-15  2055  		devlink_rate->tx_weight = weight;
3400deb7c35335 Michal Wilczynski 2022-09-15  2056  	}
3400deb7c35335 Michal Wilczynski 2022-09-15  2057  
d7555984507822 Dmytro Linkin     2021-06-02  2058  	nla_parent = attrs[DEVLINK_ATTR_RATE_PARENT_NODE_NAME];
d7555984507822 Dmytro Linkin     2021-06-02  2059  	if (nla_parent) {
d7555984507822 Dmytro Linkin     2021-06-02  2060  		err = devlink_nl_rate_parent_node_set(devlink_rate, info,
d7555984507822 Dmytro Linkin     2021-06-02  2061  						      nla_parent);
d7555984507822 Dmytro Linkin     2021-06-02  2062  		if (err)
d7555984507822 Dmytro Linkin     2021-06-02  2063  			return err;
d7555984507822 Dmytro Linkin     2021-06-02  2064  	}
d7555984507822 Dmytro Linkin     2021-06-02  2065  
1897db2ec3109e Dmytro Linkin     2021-06-02  2066  	return 0;
1897db2ec3109e Dmytro Linkin     2021-06-02  2067  }
1897db2ec3109e Dmytro Linkin     2021-06-02  2068  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
@ 2022-09-19  9:22 ` Dan Carpenter
  0 siblings, 0 replies; 35+ messages in thread
From: Dan Carpenter @ 2022-09-19  9:22 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 8320 bytes --]

Hi Michal,

url:    https://github.com/intel-lab-lkp/linux/commits/Michal-Wilczynski/Implement-devlink-rate-API-and-extend-it/20220915-214539
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 8a26a9dee5e3679637edc6f8caf4beb5f3100dde
config: mips-randconfig-m031-20220918 (https://download.01.org/0day-ci/archive/20220919/202209191557.PkFshuDN-lkp(a)intel.com/config)
compiler: mips64el-linux-gcc (GCC) 12.1.0

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>

New smatch warnings:
net/core/devlink.c:2039 devlink_nl_rate_set() error: uninitialized symbol 'rate'.
net/core/devlink.c:10399 devl_rate_node_create() warn: possible memory leak of 'rate_node'
net/core/devlink.c:10446 devl_rate_vport_create() warn: possible memory leak of 'devlink_rate'

vim +/rate +2039 net/core/devlink.c

1897db2ec3109e Dmytro Linkin     2021-06-02  1984  static int devlink_nl_rate_set(struct devlink_rate *devlink_rate,
1897db2ec3109e Dmytro Linkin     2021-06-02  1985  			       const struct devlink_ops *ops,
1897db2ec3109e Dmytro Linkin     2021-06-02  1986  			       struct genl_info *info)
1897db2ec3109e Dmytro Linkin     2021-06-02  1987  {
d7555984507822 Dmytro Linkin     2021-06-02  1988  	struct nlattr *nla_parent, **attrs = info->attrs;
a8ecb93ef03de4 Dmytro Linkin     2021-06-02  1989  	int err = -EOPNOTSUPP;
3400deb7c35335 Michal Wilczynski 2022-09-15  1990  	u16 priority;
3400deb7c35335 Michal Wilczynski 2022-09-15  1991  	u16 weight;
1897db2ec3109e Dmytro Linkin     2021-06-02  1992  	u64 rate;
1897db2ec3109e Dmytro Linkin     2021-06-02  1993  
1897db2ec3109e Dmytro Linkin     2021-06-02  1994  	if (attrs[DEVLINK_ATTR_RATE_TX_SHARE]) {
1897db2ec3109e Dmytro Linkin     2021-06-02  1995  		rate = nla_get_u64(attrs[DEVLINK_ATTR_RATE_TX_SHARE]);
3400deb7c35335 Michal Wilczynski 2022-09-15  1996  		if (devlink_rate_is_vport(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  1997  			err = ops->rate_vport_tx_share_set(devlink_rate, devlink_rate->priv,
1897db2ec3109e Dmytro Linkin     2021-06-02  1998  							   rate, info->extack);
a8ecb93ef03de4 Dmytro Linkin     2021-06-02  1999  		else if (devlink_rate_is_node(devlink_rate))
a8ecb93ef03de4 Dmytro Linkin     2021-06-02  2000  			err = ops->rate_node_tx_share_set(devlink_rate, devlink_rate->priv,
a8ecb93ef03de4 Dmytro Linkin     2021-06-02  2001  							  rate, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2002  		else if (devlink_rate_is_queue(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2003  			err = ops->rate_queue_tx_share_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2004  							   rate, info->extack);
1897db2ec3109e Dmytro Linkin     2021-06-02  2005  		if (err)
1897db2ec3109e Dmytro Linkin     2021-06-02  2006  			return err;
1897db2ec3109e Dmytro Linkin     2021-06-02  2007  		devlink_rate->tx_share = rate;
1897db2ec3109e Dmytro Linkin     2021-06-02  2008  	}
1897db2ec3109e Dmytro Linkin     2021-06-02  2009  
1897db2ec3109e Dmytro Linkin     2021-06-02  2010  	if (attrs[DEVLINK_ATTR_RATE_TX_MAX]) {
1897db2ec3109e Dmytro Linkin     2021-06-02  2011  		rate = nla_get_u64(attrs[DEVLINK_ATTR_RATE_TX_MAX]);
3400deb7c35335 Michal Wilczynski 2022-09-15  2012  		if (devlink_rate_is_vport(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2013  			err = ops->rate_vport_tx_max_set(devlink_rate, devlink_rate->priv,
1897db2ec3109e Dmytro Linkin     2021-06-02  2014  							 rate, info->extack);
a8ecb93ef03de4 Dmytro Linkin     2021-06-02  2015  		else if (devlink_rate_is_node(devlink_rate))
a8ecb93ef03de4 Dmytro Linkin     2021-06-02  2016  			err = ops->rate_node_tx_max_set(devlink_rate, devlink_rate->priv,
a8ecb93ef03de4 Dmytro Linkin     2021-06-02  2017  							rate, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2018  		else if (devlink_rate_is_queue(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2019  			err = ops->rate_queue_tx_max_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2020  							rate, info->extack);
1897db2ec3109e Dmytro Linkin     2021-06-02  2021  		if (err)
1897db2ec3109e Dmytro Linkin     2021-06-02  2022  			return err;
1897db2ec3109e Dmytro Linkin     2021-06-02  2023  		devlink_rate->tx_max = rate;
1897db2ec3109e Dmytro Linkin     2021-06-02  2024  	}
1897db2ec3109e Dmytro Linkin     2021-06-02  2025  
3400deb7c35335 Michal Wilczynski 2022-09-15  2026  	if (attrs[DEVLINK_ATTR_RATE_TX_PRIORITY]) {
3400deb7c35335 Michal Wilczynski 2022-09-15  2027  		priority = nla_get_u16(attrs[DEVLINK_ATTR_RATE_TX_PRIORITY]);
3400deb7c35335 Michal Wilczynski 2022-09-15  2028  		if (devlink_rate_is_vport(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2029  			err = ops->rate_vport_tx_priority_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2030  							priority, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2031  		else if (devlink_rate_is_node(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2032  			err = ops->rate_node_tx_priority_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2033  							priority, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2034  		else if (devlink_rate_is_queue(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2035  			err = ops->rate_queue_tx_priority_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2036  							priority, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2037  		if (err)
3400deb7c35335 Michal Wilczynski 2022-09-15  2038  			return err;
3400deb7c35335 Michal Wilczynski 2022-09-15 @2039  		devlink_rate->tx_priority = rate;

Copy and paste bug.  s/rate/priority/

3400deb7c35335 Michal Wilczynski 2022-09-15  2040  	}
3400deb7c35335 Michal Wilczynski 2022-09-15  2041  
3400deb7c35335 Michal Wilczynski 2022-09-15  2042  	if (attrs[DEVLINK_ATTR_RATE_TX_WEIGHT]) {
3400deb7c35335 Michal Wilczynski 2022-09-15  2043  		weight = nla_get_u16(attrs[DEVLINK_ATTR_RATE_TX_WEIGHT]);
3400deb7c35335 Michal Wilczynski 2022-09-15  2044  		if (devlink_rate_is_vport(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2045  			err = ops->rate_vport_tx_weight_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2046  							 weight, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2047  		else if (devlink_rate_is_node(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2048  			err = ops->rate_node_tx_weight_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2049  							weight, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2050  		else if (devlink_rate_is_queue(devlink_rate))
3400deb7c35335 Michal Wilczynski 2022-09-15  2051  			err = ops->rate_queue_tx_weight_set(devlink_rate, devlink_rate->priv,
3400deb7c35335 Michal Wilczynski 2022-09-15  2052  							weight, info->extack);
3400deb7c35335 Michal Wilczynski 2022-09-15  2053  		if (err)
3400deb7c35335 Michal Wilczynski 2022-09-15  2054  			return err;
3400deb7c35335 Michal Wilczynski 2022-09-15  2055  		devlink_rate->tx_weight = weight;
3400deb7c35335 Michal Wilczynski 2022-09-15  2056  	}
3400deb7c35335 Michal Wilczynski 2022-09-15  2057  
d7555984507822 Dmytro Linkin     2021-06-02  2058  	nla_parent = attrs[DEVLINK_ATTR_RATE_PARENT_NODE_NAME];
d7555984507822 Dmytro Linkin     2021-06-02  2059  	if (nla_parent) {
d7555984507822 Dmytro Linkin     2021-06-02  2060  		err = devlink_nl_rate_parent_node_set(devlink_rate, info,
d7555984507822 Dmytro Linkin     2021-06-02  2061  						      nla_parent);
d7555984507822 Dmytro Linkin     2021-06-02  2062  		if (err)
d7555984507822 Dmytro Linkin     2021-06-02  2063  			return err;
d7555984507822 Dmytro Linkin     2021-06-02  2064  	}
d7555984507822 Dmytro Linkin     2021-06-02  2065  
1897db2ec3109e Dmytro Linkin     2021-06-02  2066  	return 0;
1897db2ec3109e Dmytro Linkin     2021-06-02  2067  }

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 35+ messages in thread
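
The two findings above can be reproduced in miniature. In the TX_PRIORITY
branch the patch stored 'rate' into tx_priority, but 'rate' is only ever
assigned inside the TX_SHARE/TX_MAX branches, so it may be used while
uninitialized; the fix is Dan's s/rate/priority/. A self-contained sketch
(simplified stand-in types, not the kernel code) with the fix applied:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct devlink_rate's rate parameters. */
struct rate_obj {
	uint64_t tx_share;
	uint16_t tx_priority;
};

/* Each attribute, when present, is parsed into its own local and stored
 * into the matching field. The buggy version wrote 'rate' into
 * tx_priority - wrong variable, and possibly never initialized. */
static void rate_set(struct rate_obj *r, const uint64_t *tx_share,
		     const uint16_t *tx_priority)
{
	uint64_t rate;
	uint16_t priority;

	if (tx_share) {
		rate = *tx_share;
		r->tx_share = rate;
	}
	if (tx_priority) {
		priority = *tx_priority;
		r->tx_priority = priority;	/* fixed: was 'rate' */
	}
}
```

Calling this with only the priority attribute set exercises exactly the
path smatch flagged: no TX_SHARE branch runs, so 'rate' is never assigned.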

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-15 21:01       ` Edward Cree
@ 2022-09-19 13:12         ` Wilczynski, Michal
  2022-09-20 11:09           ` Edward Cree
  0 siblings, 1 reply; 35+ messages in thread
From: Wilczynski, Michal @ 2022-09-19 13:12 UTC (permalink / raw)
  To: Edward Cree, netdev
  Cc: alexandr.lobakin, dchumak, maximmi, jiri, simon.horman,
	jacob.e.keller, jesse.brandeburg, przemyslaw.kitszel



On 9/15/2022 11:01 PM, Edward Cree wrote:
> On 15/09/2022 19:41, Wilczynski, Michal wrote:
>> Hi,
>> Previously we discussed adding queues to devlink-rate in this thread:
>> https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/T/#u
>> In our use case we are trying to find a way to expose the hardware Tx scheduler tree, which is defined
>> per port, to the user. Obviously, if the tree is defined per physical port, all the scheduling nodes will reside
>> on the same tree.
>>
>> Our customer is trying to send different types of traffic that require different QoS levels on the same
>> VM, but on different queues. This requires a completely different rate setup for each queue - in the
>> implementation that you're mentioning we wouldn't be able to arbitrarily reassign a queue to any node.
> I'm not sure I 100% understand what you're describing, but I get the
>   impression it's maybe a layering violation — the hypervisor should only
>   be responsible for shaping the VM's overall traffic, it should be up to
>   the VM to decide how to distribute that bandwidth between traffic types.

Maybe a switchdev case would be a good parallel here. When you enable
switchdev, you get port representors on the host for each VF that is
already attached to a VM - something that gives the host the power to
configure a netdev that it doesn't 'own'. So it seems to me that giving
the user more power to configure things from the host is acceptable.


> But if it's what your customer needs then presumably there's some reason
>   for it that I'm not seeing.  I'm not a QoS expert by any means — I just
>   get antsy that every time I look at devlink it's gotten bigger and keeps
>   escaping further out of the "device-wide configuration" concept it was
>   originally sold as :(

I understand the concern, and sympathize with the desire to keep things
small, but this is the least evil method I've found that would enable the
customer to achieve an optimal configuration. I experimented with tc-htb
in the previous thread, but there are multiple problems with that
approach - I tried to describe them there.

In my mind this is a device-wide configuration, since the ice driver
registers each port as a separate PCI device, and each of these devices
has its own hardware Tx scheduler tree that is global to that port. The
queues we're discussing are actual hardware queues, identified by a
hardware-assigned txq_id.

The use case is basically enabling the user to fully utilize hardware
Hierarchical QoS, accounting for every queue in the system. The current
kernel interfaces don't allow us to do so, so we figured that the least
amount of duplication would be to teach devlink about queues and let the
user configure the desired tree using devlink-rate.

>
>> Those queues would still need to share a single parent - their netdev. This wouldn't allow us to fully take
>> advantage of the HQoS and would introduce arbitrary limitations.
> Oh, so you need a hierarchy within which the VF's queues don't form a
>   clade (subtree)?  That sounds like something worth calling out in the
>   commit message as the reason why you've designed it this way.

This is one of the supported scenarios. I'll include it in the commit
message, thanks for the tip.

>
>> Regarding the documentation,  sure. I just wanted to get all the feedback from the mailing list and arrive at the final
>> solution before writing the docs.
> Fair.  But you might get better feedback on the code if people have the
>   docs to better understand the intent; just a suggestion.

Thanks for the advice :)


BR,
Michał

>
> -ed


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-19 13:12         ` Wilczynski, Michal
@ 2022-09-20 11:09           ` Edward Cree
  2022-09-26 11:58             ` Jiri Pirko
  0 siblings, 1 reply; 35+ messages in thread
From: Edward Cree @ 2022-09-20 11:09 UTC (permalink / raw)
  To: Wilczynski, Michal, netdev
  Cc: alexandr.lobakin, dchumak, maximmi, jiri, simon.horman,
	jacob.e.keller, jesse.brandeburg, przemyslaw.kitszel

On 19/09/2022 14:12, Wilczynski, Michal wrote:
> Maybe a switchdev case would be a good parallel here. When you enable switchdev, you get port representors on
> the host for each VF that is already attached to the VM. Something that gives the host power to configure
> netdev that it doesn't 'own'. So it seems to me like giving user more power to configure things from the host
> is acceptable.

Right, that's the thing though: I instinctively Want this to be done
 through representors somehow, because it _looks_ like it ought to
 be scoped to a single netdev; but that forces the hierarchy to
 respect netdev boundaries which, as we've discussed, is an unwelcome
 limitation.

> In my mind this is a device-wide configuration, since the ice driver registers each port as a separate PCI device.
> Each of these devices has its own hardware Tx scheduler tree, global to that port. The queues we're
> discussing are actual hardware queues, identified by a hardware-assigned txq_id.

In general, hardware being a single unit at the device level does
 not necessarily mean its configuration should be device-wide.
For instance, in many NICs each port has a single hardware v-switch,
 but we do not have some kind of "devlink filter" API to program it
 directly.  Instead we attach TC rules to _many_ netdevs, and driver
 code transforms and combines these to program the unitary device.
"device-wide configuration" originally meant things like firmware
 version or operating mode (legacy vs. switchdev) that do not relate
 directly to netdevs.

But I agree with you that your approach is the "least evil method";
 if properly explained and documented then I don't have any
 remaining objection to your patch, despite that I'm continuing to
 take the opportunity to proselytise for "reprs >> devlink" ;)

-ed


* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-15 18:41     ` Wilczynski, Michal
  2022-09-15 21:01       ` Edward Cree
@ 2022-09-21 23:33       ` Jakub Kicinski
  2022-09-22 11:44         ` Wilczynski, Michal
  2022-09-26 11:51       ` Jiri Pirko
  2 siblings, 1 reply; 35+ messages in thread
From: Jakub Kicinski @ 2022-09-21 23:33 UTC (permalink / raw)
  To: Wilczynski, Michal
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi, jiri,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel

On Thu, 15 Sep 2022 20:41:52 +0200 Wilczynski, Michal wrote:
> In our use case we are trying to find a way to expose the hardware Tx
> scheduler tree, which is defined per port, to the user. Obviously if
> the tree is defined per physical port, all the scheduling nodes will
> reside on the same tree.

Can you give some examples of what the resulting hierarchy would look
like?


* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-21 23:33       ` Jakub Kicinski
@ 2022-09-22 11:44         ` Wilczynski, Michal
  2022-09-22 12:50           ` Jakub Kicinski
  0 siblings, 1 reply; 35+ messages in thread
From: Wilczynski, Michal @ 2022-09-22 11:44 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi, jiri,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel



On 9/22/2022 1:33 AM, Jakub Kicinski wrote:
> On Thu, 15 Sep 2022 20:41:52 +0200 Wilczynski, Michal wrote:
>> In our use case we are trying to find a way to expose the hardware Tx
>> scheduler tree, which is defined per port, to the user. Obviously if
>> the tree is defined per physical port, all the scheduling nodes will
>> reside on the same tree.
> Can you give some examples of what the resulting hierarchy would look
> like?

Hi,
Below I'll paste the output showing how the topology initially looks
for our hardware.
If the devlink_port objects are present (as in switchdev mode), there
should also be vport nodes represented. It is NOT a requirement for
a queue to have a vport as its ancestor.

In this example we have two VFs created with 16 queues per VF,
and 72 PF queues.

The user is free to reassign queues to other parents and delete nodes,
for example:
# Add new node
devlink port function rate add pci/0000:4b:00.0/node_custom parent node_199
# Reassign queue
devlink port function rate set pci/0000:4b:00.0/queue/100 parent node_custom


node_0 represents a port and doesn't have any parent.

[root@moradin iproute2-next]# devlink port function rate show
pci/0000:4b:00.0/node_200: type node parent node_198
pci/0000:4b:00.0/node_199: type node parent node_198
pci/0000:4b:00.0/node_198: type node parent vport_2
pci/0000:4b:00.0/node_196: type node parent node_194
pci/0000:4b:00.0/node_195: type node parent node_194
pci/0000:4b:00.0/node_194: type node parent vport_1
pci/0000:4b:00.0/node_185: type node parent node_184
pci/0000:4b:00.0/node_184: type node parent node_0
pci/0000:4b:00.0/node_192: type node parent node_191
pci/0000:4b:00.0/node_191: type node parent node_190
pci/0000:4b:00.0/node_190: type node parent node_16
pci/0000:4b:00.0/node_14: type node parent node_5
pci/0000:4b:00.0/node_5: type node parent node_3
pci/0000:4b:00.0/node_13: type node parent node_4
pci/0000:4b:00.0/node_12: type node parent node_4
pci/0000:4b:00.0/node_11: type node parent node_4
pci/0000:4b:00.0/node_10: type node parent node_4
pci/0000:4b:00.0/node_9: type node parent node_4
pci/0000:4b:00.0/node_8: type node parent node_4
pci/0000:4b:00.0/node_7: type node parent node_4
pci/0000:4b:00.0/node_6: type node parent node_4
pci/0000:4b:00.0/node_4: type node parent node_3
pci/0000:4b:00.0/node_3: type node parent node_16
pci/0000:4b:00.0/node_16: type node parent node_15
pci/0000:4b:00.0/node_15: type node parent node_0
pci/0000:4b:00.0/node_2: type node parent node_1
pci/0000:4b:00.0/node_1: type node parent node_0
pci/0000:4b:00.0/node_0: type node
pci/0000:4b:00.0/queue/0: type queue parent node_6
pci/0000:4b:00.0/queue/9: type queue parent node_6
pci/0000:4b:00.0/queue/18: type queue parent node_6
pci/0000:4b:00.0/queue/27: type queue parent node_6
pci/0000:4b:00.0/queue/36: type queue parent node_6
pci/0000:4b:00.0/queue/45: type queue parent node_6
pci/0000:4b:00.0/queue/54: type queue parent node_6
pci/0000:4b:00.0/queue/63: type queue parent node_6
pci/0000:4b:00.0/queue/1: type queue parent node_7
pci/0000:4b:00.0/queue/10: type queue parent node_7
pci/0000:4b:00.0/queue/19: type queue parent node_7
pci/0000:4b:00.0/queue/28: type queue parent node_7
pci/0000:4b:00.0/queue/37: type queue parent node_7
pci/0000:4b:00.0/queue/46: type queue parent node_7
pci/0000:4b:00.0/queue/55: type queue parent node_7
pci/0000:4b:00.0/queue/64: type queue parent node_7
pci/0000:4b:00.0/queue/2: type queue parent node_8
pci/0000:4b:00.0/queue/11: type queue parent node_8
pci/0000:4b:00.0/queue/20: type queue parent node_8
pci/0000:4b:00.0/queue/29: type queue parent node_8
pci/0000:4b:00.0/queue/38: type queue parent node_8
pci/0000:4b:00.0/queue/47: type queue parent node_8
pci/0000:4b:00.0/queue/56: type queue parent node_8
pci/0000:4b:00.0/queue/65: type queue parent node_8
pci/0000:4b:00.0/queue/3: type queue parent node_9
pci/0000:4b:00.0/queue/12: type queue parent node_9
pci/0000:4b:00.0/queue/21: type queue parent node_9
pci/0000:4b:00.0/queue/30: type queue parent node_9
pci/0000:4b:00.0/queue/39: type queue parent node_9
pci/0000:4b:00.0/queue/48: type queue parent node_9
pci/0000:4b:00.0/queue/57: type queue parent node_9
pci/0000:4b:00.0/queue/66: type queue parent node_9
pci/0000:4b:00.0/queue/4: type queue parent node_10
pci/0000:4b:00.0/queue/13: type queue parent node_10
pci/0000:4b:00.0/queue/22: type queue parent node_10
pci/0000:4b:00.0/queue/31: type queue parent node_10
pci/0000:4b:00.0/queue/40: type queue parent node_10
pci/0000:4b:00.0/queue/49: type queue parent node_10
pci/0000:4b:00.0/queue/58: type queue parent node_10
pci/0000:4b:00.0/queue/67: type queue parent node_10
pci/0000:4b:00.0/queue/5: type queue parent node_11
pci/0000:4b:00.0/queue/14: type queue parent node_11
pci/0000:4b:00.0/queue/23: type queue parent node_11
pci/0000:4b:00.0/queue/32: type queue parent node_11
pci/0000:4b:00.0/queue/41: type queue parent node_11
pci/0000:4b:00.0/queue/50: type queue parent node_11
pci/0000:4b:00.0/queue/59: type queue parent node_11
pci/0000:4b:00.0/queue/68: type queue parent node_11
pci/0000:4b:00.0/queue/6: type queue parent node_12
pci/0000:4b:00.0/queue/15: type queue parent node_12
pci/0000:4b:00.0/queue/24: type queue parent node_12
pci/0000:4b:00.0/queue/33: type queue parent node_12
pci/0000:4b:00.0/queue/42: type queue parent node_12
pci/0000:4b:00.0/queue/51: type queue parent node_12
pci/0000:4b:00.0/queue/60: type queue parent node_12
pci/0000:4b:00.0/queue/69: type queue parent node_12
pci/0000:4b:00.0/queue/7: type queue parent node_13
pci/0000:4b:00.0/queue/16: type queue parent node_13
pci/0000:4b:00.0/queue/25: type queue parent node_13
pci/0000:4b:00.0/queue/34: type queue parent node_13
pci/0000:4b:00.0/queue/43: type queue parent node_13
pci/0000:4b:00.0/queue/52: type queue parent node_13
pci/0000:4b:00.0/queue/61: type queue parent node_13
pci/0000:4b:00.0/queue/70: type queue parent node_13
pci/0000:4b:00.0/queue/8: type queue parent node_14
pci/0000:4b:00.0/queue/17: type queue parent node_14
pci/0000:4b:00.0/queue/26: type queue parent node_14
pci/0000:4b:00.0/queue/35: type queue parent node_14
pci/0000:4b:00.0/queue/44: type queue parent node_14
pci/0000:4b:00.0/queue/53: type queue parent node_14
pci/0000:4b:00.0/queue/62: type queue parent node_14
pci/0000:4b:00.0/queue/71: type queue parent node_14
pci/0000:4b:00.0/queue/104: type queue parent node_192
pci/0000:4b:00.0/queue/105: type queue parent node_192
pci/0000:4b:00.0/1: type vport parent node_185
pci/0000:4b:00.0/queue/72: type queue parent node_195
pci/0000:4b:00.0/queue/74: type queue parent node_195
pci/0000:4b:00.0/queue/76: type queue parent node_195
pci/0000:4b:00.0/queue/78: type queue parent node_195
pci/0000:4b:00.0/queue/80: type queue parent node_195
pci/0000:4b:00.0/queue/82: type queue parent node_195
pci/0000:4b:00.0/queue/84: type queue parent node_195
pci/0000:4b:00.0/queue/86: type queue parent node_195
pci/0000:4b:00.0/queue/73: type queue parent node_196
pci/0000:4b:00.0/queue/75: type queue parent node_196
pci/0000:4b:00.0/queue/77: type queue parent node_196
pci/0000:4b:00.0/queue/79: type queue parent node_196
pci/0000:4b:00.0/queue/81: type queue parent node_196
pci/0000:4b:00.0/queue/83: type queue parent node_196
pci/0000:4b:00.0/queue/85: type queue parent node_196
pci/0000:4b:00.0/queue/87: type queue parent node_196
pci/0000:4b:00.0/2: type vport parent node_185
pci/0000:4b:00.0/queue/88: type queue parent node_199
pci/0000:4b:00.0/queue/90: type queue parent node_199
pci/0000:4b:00.0/queue/92: type queue parent node_199
pci/0000:4b:00.0/queue/94: type queue parent node_199
pci/0000:4b:00.0/queue/96: type queue parent node_199
pci/0000:4b:00.0/queue/98: type queue parent node_199
pci/0000:4b:00.0/queue/100: type queue parent node_199
pci/0000:4b:00.0/queue/102: type queue parent node_199
pci/0000:4b:00.0/queue/89: type queue parent node_200
pci/0000:4b:00.0/queue/91: type queue parent node_200
pci/0000:4b:00.0/queue/93: type queue parent node_200
pci/0000:4b:00.0/queue/95: type queue parent node_200
pci/0000:4b:00.0/queue/97: type queue parent node_200
pci/0000:4b:00.0/queue/99: type queue parent node_200
pci/0000:4b:00.0/queue/101: type queue parent node_200
pci/0000:4b:00.0/queue/103: type queue parent node_200



BR,
Michał



* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-22 11:44         ` Wilczynski, Michal
@ 2022-09-22 12:50           ` Jakub Kicinski
  2022-09-22 13:45             ` Wilczynski, Michal
  0 siblings, 1 reply; 35+ messages in thread
From: Jakub Kicinski @ 2022-09-22 12:50 UTC (permalink / raw)
  To: Wilczynski, Michal
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi, jiri,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel

On Thu, 22 Sep 2022 13:44:14 +0200 Wilczynski, Michal wrote:
> Below I'll paste the output showing how the topology initially looks for
> our hardware.
> If the devlink_port objects are present (as in switchdev mode), there
> should also be vport nodes represented. It is NOT a requirement for
> a queue to have a vport as its ancestor.

Thanks! How did you know that my preferred method of reading 
hierarchies is looking at a linear output!? 😕

Anyway. My gut feeling is that this is cutting a corner. Seems 
most natural for the VF/PF level to be controlled by the admin
and the queue level by whoever owns the queue. The hypervisor
driver/FW should reconcile the two and compile the full hierarchy.


* Re: [RFC PATCH net-next v4 4/6] ice: Implement devlink-rate API
  2022-09-15 13:42 ` [RFC PATCH net-next v4 4/6] ice: Implement devlink-rate API Michal Wilczynski
@ 2022-09-22 13:08   ` Przemek Kitszel
  0 siblings, 0 replies; 35+ messages in thread
From: Przemek Kitszel @ 2022-09-22 13:08 UTC (permalink / raw)
  To: Michal Wilczynski
  Cc: Przemek Kitszel, netdev, alexandr.lobakin, dchumak, maximmi,
	jiri, simon.horman, jacob.e.keller, jesse.brandeburg

From:   Michal Wilczynski <michal.wilczynski@intel.com>
Date:   Thu, 15 Sep 2022 15:42:37 +0200

> There is a need to support modification of the Tx scheduler topology
> in the ice driver. This will allow the user to control Tx settings of
> each node in the internal hierarchy of nodes. A number of parameters
> are supported per node: tx_max, tx_share, tx_priority and tx_weight.
> 
> Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_devlink.c | 511 +++++++++++++++++++
>  drivers/net/ethernet/intel/ice/ice_devlink.h |   2 +
>  2 files changed, 513 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_devlink.c b/drivers/net/ethernet/intel/ice/ice_devlink.c
> index e6ec20079ced..925283605b59 100644
> --- a/drivers/net/ethernet/intel/ice/ice_devlink.c
> +++ b/drivers/net/ethernet/intel/ice/ice_devlink.c
> @@ -713,6 +713,490 @@ ice_devlink_port_unsplit(struct devlink *devlink, struct devlink_port *port,
>  	return ice_devlink_port_split(devlink, port, 1, extack);
>  }
>  
> +/**
> + * ice_traverse_tx_tree - traverse Tx scheduler tree
> + * @devlink: devlink struct
> + * @node: current node, used for recursion
> + * @tc_node: tc_node struct, that is treated as a root
> + * @pf: pf struct
> + *
> + * This function traverses Tx scheduler tree and exports
> + * entire structure to the devlink-rate.
> + */
> +static void ice_traverse_tx_tree(struct devlink *devlink, struct ice_sched_node *node,
> +				 struct ice_sched_node *tc_node, struct ice_pf *pf)
> +{
> +	struct ice_vf *vf;
> +	int i;
> +
> +	devl_lock(devlink);
> +
> +	if (node->parent == tc_node) {
> +		/* create root node */
> +		devl_rate_node_create(devlink, node, node->name, NULL);
> +	} else if (node->info.data.elem_type == ICE_AQC_ELEM_TYPE_LEAF &&
> +		   node->parent->name) {
> +		devl_rate_queue_create(devlink, node->parent->name, node->tx_queue_id, node);
> +	} else if (node->vsi_handle &&
> +		   pf->vsi[node->vsi_handle]->vf) {
> +		vf = pf->vsi[node->vsi_handle]->vf;
> +		snprintf(node->name, DEVLINK_RATE_NAME_MAX_LEN, "vport_%u", vf->devlink_port.index);
> +		if (!vf->devlink_port.devlink_rate)
> +			devl_rate_vport_create(&vf->devlink_port, node, node->parent->name);
> +	} else {
> +		devl_rate_node_create(devlink, node, node->name, node->parent->name);
> +	}
> +
> +	devl_unlock(devlink);

I would move devlink locking into ice_devlink_rate_init_tx_topology(),
so it would be locked only once for the whole tree walking.

> +
> +	for (i = 0; i < node->num_children; i++)
> +		ice_traverse_tx_tree(devlink, node->children[i], tc_node, pf);
> +}
> +
> +/**
> + * ice_devlink_rate_init_tx_topology - export Tx scheduler tree to devlink rate
> + * @devlink: devlink struct
> + * @vsi: main vsi struct
> + *
> + * This function finds a root node, then calls ice_traverse_tx_tree, which
> + * traverses the tree and exports its contents to devlink rate.
> + */
> +int ice_devlink_rate_init_tx_topology(struct devlink *devlink, struct ice_vsi *vsi)
> +{
> +	struct ice_port_info *pi = vsi->port_info;
> +	struct ice_sched_node *tc_node;
> +	struct ice_pf *pf = vsi->back;
> +	int i;
> +
> +	tc_node = pi->root->children[0];
> +	mutex_lock(&pi->sched_lock);
> +	for (i = 0; i < tc_node->num_children; i++)
> +		ice_traverse_tx_tree(devlink, tc_node->children[i], tc_node, pf);
> +	mutex_unlock(&pi->sched_lock);
> +
> +	return 0;
> +}

// snip

>  static int
> @@ -893,6 +1399,9 @@ void ice_devlink_register(struct ice_pf *pf)
>   */
>  void ice_devlink_unregister(struct ice_pf *pf)
>  {
> +	devl_lock(priv_to_devlink(pf));
> +	devl_rate_objects_destroy(priv_to_devlink(pf));
> +	devl_unlock(priv_to_devlink(pf));
>  	devlink_unregister(priv_to_devlink(pf));
>  }
>

Maybe it's now worth a variable for priv_to_devlink(pf)?

[...]

--Przemek


* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-22 12:50           ` Jakub Kicinski
@ 2022-09-22 13:45             ` Wilczynski, Michal
  2022-09-22 20:29               ` Jakub Kicinski
  0 siblings, 1 reply; 35+ messages in thread
From: Wilczynski, Michal @ 2022-09-22 13:45 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi, jiri,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel



On 9/22/2022 2:50 PM, Jakub Kicinski wrote:
> On Thu, 22 Sep 2022 13:44:14 +0200 Wilczynski, Michal wrote:
>> Below I'll paste the output showing how the topology initially looks for
>> our hardware.
>> If the devlink_port objects are present (as in switchdev mode), there
>> should also be vport nodes represented. It is NOT a requirement for
>> a queue to have a vport as its ancestor.
> Thanks! How did you know that my preferred method of reading
> hierarchies is looking at a linear output!? 😕

Haha :D, sorry about that - that's the real output from the devlink
tool that I modified, so it's already like that for devlink-rate. At
least I don't know of any better way to represent this, short of
drawing the hierarchy by hand. And it's quite a big hierarchy.

>
> Anyway. My gut feeling is that this is cutting a corner. Seems
> most natural for the VF/PF level to be controlled by the admin
> and the queue level by whoever owns the queue. The hypervisor
> driver/FW should reconcile the two and compile the full hierarchy.

We already tried tc-htb, and it doesn't work for a couple of reasons,
even in this potential hybrid with devlink-rate. One of the problems
with the tc-htb offload is that it forces you to allocate a new queue;
it doesn't allow reassigning an existing queue to another scheduling
node. This is our main use case.

Here's a discussion about tc-htb: 
https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/

So either I would have to invent a new offload type (?) for tc, or
completely rewrite and probably break the tc-htb offload that Mellanox
implemented.
Also, in our use case it's possible to create completely new branches
from the root and reassign queues there. This wouldn't be possible with
the method you're proposing.

So the existing interface doesn't allow us to do what is required.

BR,
Michał




* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-22 13:45             ` Wilczynski, Michal
@ 2022-09-22 20:29               ` Jakub Kicinski
  2022-09-23 12:11                 ` Wilczynski, Michal
  0 siblings, 1 reply; 35+ messages in thread
From: Jakub Kicinski @ 2022-09-22 20:29 UTC (permalink / raw)
  To: Wilczynski, Michal
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi, jiri,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel

On Thu, 22 Sep 2022 15:45:55 +0200 Wilczynski, Michal wrote:
> On 9/22/2022 2:50 PM, Jakub Kicinski wrote:
> > Anyway. My gut feeling is that this is cutting a corner. Seems
> > most natural for the VF/PF level to be controlled by the admin
> > and the queue level by whoever owns the queue. The hypervisor
> > driver/FW should reconcile the two and compile the full hierarchy.  
> 
> We already tried tc-htb, and it doesn't work for a couple of reasons,
> even in this potential hybrid with devlink-rate. One of the problems
> with the tc-htb offload is that it forces you to allocate a new queue;
> it doesn't allow reassigning an existing queue to another scheduling
> node. This is our main use case.
> 
> Here's a discussion about tc-htb: 
> https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/

This is a problem only for "SR-IOV case" or also for just the PF?

> So either I would have to invent a new offload type (?) for tc, or
> completely rewrite and probably break the tc-htb offload that Mellanox
> implemented.
> Also, in our use case it's possible to create completely new branches
> from the root and reassign queues there. This wouldn't be possible
> with the method you're proposing.
> 
> So the existing interface doesn't allow us to do what is required.

For some definition of "what is required" which was not really
disclosed clearly. Or I'm too slow to grasp.


* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-22 20:29               ` Jakub Kicinski
@ 2022-09-23 12:11                 ` Wilczynski, Michal
  2022-09-23 13:16                   ` Jakub Kicinski
  0 siblings, 1 reply; 35+ messages in thread
From: Wilczynski, Michal @ 2022-09-23 12:11 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi, jiri,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel

[-- Attachment #1: Type: text/plain, Size: 3780 bytes --]



On 9/22/2022 10:29 PM, Jakub Kicinski wrote:
> On Thu, 22 Sep 2022 15:45:55 +0200 Wilczynski, Michal wrote:
>> On 9/22/2022 2:50 PM, Jakub Kicinski wrote:
>>> Anyway. My gut feeling is that this is cutting a corner. Seems
>>> most natural for the VF/PF level to be controlled by the admin
>>> and the queue level by whoever owns the queue. The hypervisor
>>> driver/FW should reconcile the two and compile the full hierarchy.

I'm not sure whether this is allowed on the mailing list, but I'm
attaching a text file with an ASCII drawing representing the tree I
sent previously in linear form. Hope you'll find it easier to read.

>> We tried already tc-htb, and it doesn't work for a couple of reasons,
>> even in this potential hybrid with devlink-rate. One of the problems
>> with tc-htb offload is that it forces you to allocate a new
>> queue, it doesn't allow for reassigning an existing queue to another
>> scheduling node. This is our main use case.
>>
>> Here's a discussion about tc-htb:
>> https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/
> This is a problem only for "SR-IOV case" or also for just the PF?

The way tc-htb is coded, it's NOT possible to reassign queues from one
scheduling node to another; this is a generic problem with the
implementation, regardless of SR-IOV or PF. So even if we wanted to
reassign queues only for PFs, this wouldn't be possible.
I feel like an example would help. So let's say I do this:

tc qdisc replace dev ens785 root handle 1: htb offload
tc class add dev ens785 parent 1: classid 1:2 htb rate 1000 ceil 2000
tc class add dev ens785 parent 1:2 classid 1:3 htb rate 1000 ceil 2000
tc class add dev ens785 parent 1:2 classid 1:4 htb rate 1000 ceil 2000
tc class add dev ens785 parent 1:3 classid 1:5 htb rate 1000 ceil 2000
tc class add dev ens785 parent 1:4 classid 1:6 htb rate 1000 ceil 2000

                   1:    <-- root qdisc
                   |
                  1:2
                  / \
                 /   \
               1:3   1:4
                |     |
                |     |
               1:5   1:6
                |     |
               QID   QID   <---- here we'll have PFIFO qdiscs


At this point I would have two additional queues in the system, and the
kernel would enqueue packets to those new queues according to the
'tc flower' configuration. So theoretically we should create a new
queue in hardware and put it in a privileged position in the scheduling
tree. And I would happily write it this way, but this is NOT what our
customer wants. He doesn't want any extra queues in the system; he just
wants to make existing queues more privileged. And not just PF queues -
he's mostly interested in VF queues.
I'm not sure how to state the use case more clearly.


>
>> So either I would have to invent a new offload type (?) for tc, or
>> completely rewrite and
>> probably break tc-htb that mellanox implemented.
>> Also in our use case it's possible to create completely new branches
>> from the root and
>> reassigning queues there. This wouldn't be possible with the method
>> you're proposing.
>>
>> So existing interface doesn't allow us to do what is required.
> For some definition of "what is required" which was not really
> disclosed clearly. Or I'm to slow to grasp.

In the most basic variant, what we want is a way to make hardware
queues more privileged and to modify the hierarchy of nodes/queues
freely. We don't want to create new queues, as required by the tc-htb
implementation. This is the main reason the tc-htb and devlink-rate
hybrid doesn't work for us.

BR,
Michał


[-- Attachment #2: tx_tree.txt --]
[-- Type: text/plain, Size: 3353 bytes --]

                                                      +--------+
                                                      | node_0 |
                                              +-------+-----+--+--------------------+
                                              |             |                       |
                                           +--v---+     +---v---+              +----v---+
                                           |node_1|     |node_15|              |node_184|
                                     +-----+------+     +-------+              +----+---+
                                     |                  |                           |
                                 +---v--+           +---v---+                  +----v---+
                                 |node_2|    +------+node_16|                  |node_185|
                                 +------+    |      +---+---+             +----+--------+--+
                                             |          |                 |                |
          +----------------------------------+          |                 |                |
          |                                             |                 |                |
     +----v---+                                     +---v--+          +---v---+        +---v---+
     |node_190|  +----------------------------------+node_3|          |vport_1|        |vport_2+---------------------------------+
     +-+------+  |                                  +---+--+          +---+---+        +-------+                                 |
       |         |                                      |                 |                            +--------+           +----v---+
  +----v---+  +--v---+                              +---v--+              +---------------------------->node_194+-------+   |node_198|
  |node_191|  |node_5|                              |node_4|                                           +-------++       |   +--------++
  +--+-----+  +-+----+    +----------+---------+----+----+-+--------+---------+----------+----------+          |        |             +-------------+
     |          |         |          |         |         |          |         |          |          |          |        |             |             |
+----v---+  +---v---+ +---v--+   +---v--+  +---v--+   +--v---+  +---v---+ +---v---+  +---v---+  +---v---+  +---v----+  +v-------+  +--v-----+   +---v----+
|node_192|  |node_14| |node_6|   |node_7|  |node_8|   |node_9|  |node_10| |node_11|  |node_12|  |node_13|  |node_195|  |node_196|  |node_199|   |node_200|
+----+---+  +---+---+ +---+--+   +---+--+  +---+--+   +--+---+  +---+---+ +---+---+  +---+---+  +---+---+  +---+----+  +----+---+  +----+---+   +----+---+
     |          |         |          |         |         |          |         |          |          |          |            |           |            |
 +---v--+   +---v--+  +---v--+   +---v--+  +---v--+   +--v---+  +---v--+  +---v--+    +--v---+   +--v---+   +--v---+    +---v--+    +---v--+     +---v--+
 |Queues|   |Queues|  |Queues|   |Queues|  |Queues|   |Queues|  |Queues|  |Queues|    |Queues|   |Queues|   |Queues|    |Queues|    |Queues|     |Queues|
 +------+   +------+  +------+   +------+  +------+   +------+  +------+  +------+    +------+   +------+   +------+    +------+    +------+     +------+


* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-23 12:11                 ` Wilczynski, Michal
@ 2022-09-23 13:16                   ` Jakub Kicinski
  2022-09-23 15:46                     ` Wilczynski, Michal
  0 siblings, 1 reply; 35+ messages in thread
From: Jakub Kicinski @ 2022-09-23 13:16 UTC (permalink / raw)
  To: Wilczynski, Michal
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi, jiri,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel

On Fri, 23 Sep 2022 14:11:08 +0200 Wilczynski, Michal wrote:
> On 9/22/2022 10:29 PM, Jakub Kicinski wrote:
> > On Thu, 22 Sep 2022 15:45:55 +0200 Wilczynski, Michal wrote:  
> >> On 9/22/2022 2:50 PM, Jakub Kicinski wrote:  
>
> I'm not sure whether this is allowed on the mailing list, but I'm
> attaching a text file with an ASCII drawing representing the tree I
> sent previously in linear form. Hope you'll find it easier to read.

That helps, thanks! So what I was saying was that anything under the
vport layer should be configured by the policy local to the owner of
the function.

> >> We already tried tc-htb, and it doesn't work for a couple of
> >> reasons, even in this potential hybrid with devlink-rate. One of
> >> the problems with the tc-htb offload is that it forces you to
> >> allocate a new queue; it doesn't allow reassigning an existing
> >> queue to another scheduling node. This is our main use case.
> >>
> >> Here's a discussion about tc-htb:
> >> https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/  
> > This is a problem only for "SR-IOV case" or also for just the PF?  
> 
> The way tc-htb is coded, it's NOT possible to reassign queues from one
> scheduling node to another; this is a generic problem with the
> implementation, regardless of SR-IOV or PF. So even if we wanted to
> reassign queues only for PFs, this wouldn't be possible.
> I feel like an example would help. So let's say I do this:
> 
> tc qdisc replace dev ens785 root handle 1: htb offload
> tc class add dev ens785 parent 1: classid 1:2 htb rate 1000 ceil 2000
> tc class add dev ens785 parent 1:2 classid 1:3 htb rate 1000 ceil 2000
> tc class add dev ens785 parent 1:2 classid 1:4 htb rate 1000 ceil 2000
> tc class add dev ens785 parent 1:3 classid 1:5 htb rate 1000 ceil 2000
> tc class add dev ens785 parent 1:4 classid 1:6 htb rate 1000 ceil 2000
> 
>                    1:    <-- root qdisc
>                    |
>                   1:2
>                   / \
>                  /   \
>                1:3   1:4
>                 |     |
>                 |     |
>                1:5   1:6
>                 |     |
>                QID   QID   <---- here we'll have PFIFO qdiscs
> 
> 
> At this point I would have two additional queues in the system, and
> the kernel would enqueue packets to those new queues according to 'tc
> flower' configuration. 

TBH I don't know what you mean by "reassign queues from one
scheduling node to the other"; sorry, I don't know this code well,
neither the offload nor HTB itself.

My uneducated anticipation of how HTB offload would work is that 
queue 0 of the NIC is a catch all for leftovers and all other queues
get assigned leaf nodes.

> So theoretically we should create a new queue in hardware and put it
> in a privileged position in the scheduling tree. And I would happily
> write it this way, but this is NOT what our customer wants. He doesn't
> want any extra queues in the system, he just wants to make existing
> queues more privileged. And not just PF queues; he's mostly interested
> in VF queues. I'm not sure how to state the use case more clearly.

The VF means controlling queue scheduling of another function
via the PF, right? Let's leave that out of the picture for now
so we don't have to worry about "architectural" concerns.

> >> So either I would have to invent a new offload type (?) for tc, or
> >> completely rewrite and
> >> probably break tc-htb that mellanox implemented.
> >> Also in our use case it's possible to create completely new
> >> branches from the root and
> >> reassigning queues there. This wouldn't be possible with the method
> >> you're proposing.
> >>
> >> So existing interface doesn't allow us to do what is required.  
> > For some definition of "what is required" which was not really
> > disclosed clearly. Or I'm too slow to grasp.  
> 
> In the most basic variant, what we want is a way to make hardware
> queues more privileged and to modify the hierarchy of nodes/queues
> freely. We don't want to create new queues, as required by the tc-htb
> implementation. This is the main reason why the tc-htb and devlink-rate
> hybrid doesn't work for us.

Hm, when you say "privileged" do you mean higher quota or priority?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-23 13:16                   ` Jakub Kicinski
@ 2022-09-23 15:46                     ` Wilczynski, Michal
  2022-09-27  0:16                       ` Jakub Kicinski
  0 siblings, 1 reply; 35+ messages in thread
From: Wilczynski, Michal @ 2022-09-23 15:46 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi, jiri,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel



On 9/23/2022 3:16 PM, Jakub Kicinski wrote:
> On Fri, 23 Sep 2022 14:11:08 +0200 Wilczynski, Michal wrote:
>> On 9/22/2022 10:29 PM, Jakub Kicinski wrote:
>>> On Thu, 22 Sep 2022 15:45:55 +0200 Wilczynski, Michal wrote:
>>>> On 9/22/2022 2:50 PM, Jakub Kicinski wrote:
>> I'm not sure whether this is allowed on the mailing list, but I'm attaching
>> a text file with an ASCII drawing representing a tree I've sent
>> previously as linear. Hope you'll find this easier to read.
> That helps, thanks! So what I was saying was anything under the vport
> layer should be configured by the policy local to the owner of the
> function.

My main concern is that there are no interfaces to do so. tc-htb in its
current state just doesn't work for us; as I noted before, their whole
implementation is about creating new queues.
https://legacy.netdevconf.info/0x14/pub/papers/44/0x14-paper44-talk-paper.pdf

Also, reconfiguration from the VM would need to be handled by the VF
driver, i.e. iavf. So the solution would get much more complex, since we
would need to implement communication between ice and iavf, through
virtchnl I guess.


>
>>>> We tried already tc-htb, and it doesn't work for a couple of reasons,
>>>> even in this potential hybrid with devlink-rate. One of the problems
>>>> with tc-htb offload is that it forces you to allocate a new
>>>> queue, it doesn't allow for reassigning an existing queue to another
>>>> scheduling node. This is our main use case.
>>>>
>>>> Here's a discussion about tc-htb:
>>>> https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/
>>> This is a problem only for "SR-IOV case" or also for just the PF?
>> The way tc-htb is coded it's NOT possible to reassign queues from one
>> scheduling node to the other, this is a generic problem with this
>> implementation, regardless of SR-IOV or PF. So even if we
>> wanted to reassign queues only for PF's this wouldn't be possible.
>> I feel like an example would help. So let's say I do this:
>>
>> tc qdisc replace dev ens785 root handle 1: htb offload
>> tc class add dev ens785 parent 1: classid 1:2 htb rate 1000 ceil 2000
>> tc class add dev ens785 parent 1:2 classid 1:3 htb rate 1000 ceil 2000
>> tc class add dev ens785 parent 1:2 classid 1:4 htb rate 1000 ceil 2000
>> tc class add dev ens785 parent 1:3 classid 1:5 htb rate 1000 ceil 2000
>> tc class add dev ens785 parent 1:4 classid 1:6 htb rate 1000 ceil 2000
>>
>>                     1:    <-- root qdisc
>>                     |
>>                    1:2
>>                    / \
>>                   /   \
>>                 1:3   1:4
>>                  |     |
>>                  |     |
>>                 1:5   1:6
>>                  |     |
>>                 QID   QID   <---- here we'll have PFIFO qdiscs
>>
>>
>> At this point I would have two additional queues in the system, and
>> the kernel would enqueue packets to those new queues according to 'tc
>> flower' configuration.
> TBH I don't know what you mean by "reassign queues from one
> scheduling node to the other", sorry I don't know this code well.
> Neither the offload nor HTB itself.

So, using parts of the drawing I made as an example, imagine you have a
queue like this:
pci/0000:4b:00.0/queue/103: type queue parent node_200
It has txq_id 103, which is assigned by hardware and uniquely identifies it.

Then I run commands like this:
devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent 
vport_2 tx_share 100Mbps tx_max 500Mbps priority 5
devlink port function rate add pci/0000:4b:00.0/node_custom_2 parent 
node_custom_1
devlink port function rate add pci/0000:4b:00.0/node_custom_3 parent 
node_custom_2

And here I reassign the queue:
devlink port function rate set pci/0000:4b:00.0/queue/103 parent 
node_custom_3

So now the queue has a completely different parent, and the packets
from that queue are scheduled using completely different parameters.
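For completeness, a sketch of how one could verify the move (hedged:
the 'queue' object type and these handles are part of this RFC, not an
accepted devlink API, so the exact output is illustrative only):

```shell
# Reassign hardware queue 103 to the custom node, then dump the object
# to confirm the new parent (handles taken from the example above):
devlink port function rate set pci/0000:4b:00.0/queue/103 parent node_custom_3
devlink port function rate show pci/0000:4b:00.0/queue/103
# Illustrative shape of the expected dump:
#   pci/0000:4b:00.0/queue/103: type queue parent node_custom_3
```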


>
> My uneducated anticipation of how HTB offload would work is that
> queue 0 of the NIC is a catch all for leftovers and all other queues
> get assigned leaf nodes.

So enabling htb offload leaves all the existing queues in place, and in
case the kernel can't classify a packet to one of the newly created
queues, the driver is supposed to select one of the 'older' queues to do
the job (the driver receives a TC_HTB_LEAF_QUERY_QUEUE event).
So basically after running
tc qdisc replace dev ens785 root handle 1: htb offload
you don't get any new queues yet, but if you create a new class:
tc class add dev ens785 parent 1: classid 1:2 htb rate 1000 ceil 2000
you'll get a TC_HTB_LEAF_ALLOC_QUEUE event in the driver, which means
you're supposed to allocate a new queue in the driver.
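The command-to-driver-event mapping described above can be summarized as
follows (a sketch assuming an htb-offload-capable driver behind ens785;
the TC_HTB_* names are the offload commands from the kernel's HTB
offload interface as I understand them):

```shell
tc qdisc replace dev ens785 root handle 1: htb offload
#   -> driver receives TC_HTB_CREATE; no new queues are allocated yet
tc class add dev ens785 parent 1: classid 1:2 htb rate 1000 ceil 2000
#   -> driver receives TC_HTB_LEAF_ALLOC_QUEUE and must allocate a new
#      hardware queue for the leaf
tc class del dev ens785 classid 1:2
#   -> driver receives TC_HTB_LEAF_DEL and frees that queue
# For traffic the kernel can't map to an offloaded class, the driver
# receives TC_HTB_LEAF_QUERY_QUEUE and must pick one of the 'older' queues.
```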

This just doesn't work for our case. Also, our scheduling tree is
rather rigid; I can't just remove the whole subtree because the user
decided to enable htb offload. So, as you can see in the current
devlink-rate implementation, I'm exporting the whole tree and just
allowing the user to modify it. There are some constraints, like the
inability to remove nodes with queues, or any children really.



>
>> So theoretically we should create a new queue
>> in a hardware and put it in a privileged position in the scheduling
>> tree. And I would happily write it this
>> way, but this is NOT what our customer want. He doesn't want any
>> extra queues in the system, he just
>> wants to make existing queues more privileged. And not just PF queues
>> - he's mostly interested in VF queues.
>> I'm not sure how to state use case more clearly.
> The VF means controlling queue scheduling of another function
> via the PF, right? Let's leave that out of the picture for now
> so we don't have to worry about "architectural" concerns.

Devlink operates more on PCI devices, but I guess you can still call it the PF.

>>>> So either I would have to invent a new offload type (?) for tc, or
>>>> completely rewrite and
>>>> probably break tc-htb that mellanox implemented.
>>>> Also in our use case it's possible to create completely new
>>>> branches from the root and
>>>> reassigning queues there. This wouldn't be possible with the method
>>>> you're proposing.
>>>>
>>>> So existing interface doesn't allow us to do what is required.
>>> For some definition of "what is required" which was not really
>>> disclosed clearly. Or I'm to slow to grasp.
>> In most basic variant what we want is a way to make hardware queues
>> more privileged, and modify hierarchy of nodes/queues freely. We
>> don't want to create new queues, as required by tc-htb
>> implementation. This is main reason why tc-htb and devlink-rate
>> hybrid doesn't work for us.
> Hm, when you say "privileged" do you mean higher quota or priority?

Maybe the meaning wasn't clear: a queue's privilege is really
determined by everything, i.e. its position in the tree, the parameters
assigned to its ancestors (tx_share, tx_max, priority, weight), and its
own assigned parameters.
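Concretely, under the proposed interface all of those knobs could be set
on a single queue object in one call (a sketch; the parameter values are
arbitrary examples, and priority/weight on queues are part of this RFC's
extension, not the existing devlink-rate API):

```shell
# tx_share/tx_max exist in devlink-rate today; priority and weight are
# the new parameters this series proposes:
devlink port function rate set pci/0000:4b:00.0/queue/103 \
        tx_share 100Mbps tx_max 500Mbps priority 5 weight 10
```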

BR,
Michał


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-15 18:41     ` Wilczynski, Michal
  2022-09-15 21:01       ` Edward Cree
  2022-09-21 23:33       ` Jakub Kicinski
@ 2022-09-26 11:51       ` Jiri Pirko
  2022-09-28 11:47         ` Wilczynski, Michal
  2 siblings, 1 reply; 35+ messages in thread
From: Jiri Pirko @ 2022-09-26 11:51 UTC (permalink / raw)
  To: Wilczynski, Michal
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel

Thu, Sep 15, 2022 at 08:41:52PM CEST, michal.wilczynski@intel.com wrote:
>
>
>On 9/15/2022 5:31 PM, Edward Cree wrote:
>> On 15/09/2022 14:42, Michal Wilczynski wrote:
>> > Currently devlink-rate only have two types of objects: nodes and leafs.
>> > There is a need to extend this interface to account for a third type of
>> > scheduling elements - queues. In our use case customer is sending
>> > different types of traffic on each queue, which requires an ability to
>> > assign rate parameters to individual queues.
>> Is there a use-case for this queue scheduling in the absence of a netdevice?
>> If not, then I don't see how this belongs in devlink; the configuration
>>   should instead be done in two parts: devlink-rate to schedule between
>>   different netdevices (e.g. VFs) and tc qdiscs (or some other netdev-level
>>   API) to schedule different queues within each single netdevice.
>> Please explain why this existing separation does not support your use-case.
>> 
>> Also I would like to see some documentation as part of this patch.  It looks
>>   like there's no kernel document for devlink-rate unlike most other devlink
>>   objects; perhaps you could add one?
>> 
>> -ed
>
>Hi,
>Previously we discussed adding queues to devlink-rate in this thread:
>https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/T/#u
>In our use case we are trying to find a way to expose hardware Tx scheduler
>tree that is defined
>per port to user. Obviously if the tree is defined per physical port, all the
>scheduling nodes will reside
>on the same tree.
>
>Our customer is trying to send different types of traffic that require
>different QoS levels on the same

Do I understand correctly that you are assigning traffic to queues in
the VM, and you rate-limit the queues on the hypervisor? Is that the goal?


>VM, but on a different queues. This requires completely different rate setups
>for that queue - in the
>implementation that you're mentioning we wouldn't be able to arbitrarily
>reassign the queue to any node.
>Those queues would still need to share a single parent - their netdev. This

So that replies to Edward's note about having the queues maintained
within the single netdev/vport, correct?


>wouldn't allow us to fully take
>advantage of the HQoS and would introduce arbitrary limitations.
>
>Also I would think that since there is only one vendor implementing this
>particular devlink-rate API, there is
>some room for flexibility.
>
>Regarding the documentation,  sure. I just wanted to get all the feedback
>from the mailing list and arrive at the final
>solution before writing the docs.
>
>BTW, I'm going to be out of office tomorrow, so will respond in this thread
>on Monday.
>BR,
>Michał
>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-20 11:09           ` Edward Cree
@ 2022-09-26 11:58             ` Jiri Pirko
  2022-09-28 11:53               ` Wilczynski, Michal
  0 siblings, 1 reply; 35+ messages in thread
From: Jiri Pirko @ 2022-09-26 11:58 UTC (permalink / raw)
  To: Edward Cree
  Cc: Wilczynski, Michal, netdev, alexandr.lobakin, dchumak, maximmi,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel

Tue, Sep 20, 2022 at 01:09:04PM CEST, ecree.xilinx@gmail.com wrote:
>On 19/09/2022 14:12, Wilczynski, Michal wrote:
>> Maybe a switchdev case would be a good parallel here. When you enable switchdev, you get port representors on
>> the host for each VF that is already attached to the VM. Something that gives the host power to configure
>> netdev that it doesn't 'own'. So it seems to me like giving user more power to configure things from the host

Well, not really. It gives the user on the hypervisor the possibility
to configure the eswitch vport side. The other side of the wire, which
is in the VM, is autonomous.


>> is acceptable.
>
>Right that's the thing though: I instinctively Want this to be done
> through representors somehow, because it _looks_ like it ought to
> be scoped to a single netdev; but that forces the hierarchy to
> respect netdev boundaries which as we've discussed is an unwelcome
> limitation.

Why exactly? Do you want to share a single queue between multiple
vports? Or what exactly would be the use case where you hit the limitation?


>
>> In my mind this is a device-wide configuration, since the ice driver registers each port as a separate PCI device.
>> And each of these devices has its own hardware Tx Scheduler tree, global to that port. The queues that we're
>> discussing are actually hardware queues, and are identified by a hardware-assigned txq_id.
>
>In general, hardware being a single unit at the device level does
> not necessarily mean its configuration should be device-wide.
>For instance, in many NICs each port has a single hardware v-switch,
> but we do not have some kind of "devlink filter" API to program it
> directly.  Instead we attach TC rules to _many_ netdevs, and driver
> code transforms and combines these to program the unitary device.
>"device-wide configuration" originally meant things like firmware
> version or operating mode (legacy vs. switchdev) that do not relate
> directly to netdevs.
>
>But I agree with you that your approach is the "least evil method";
> if properly explained and documented then I don't have any
> remaining objection to your patch, despite that I'm continuing to
> take the opportunity to proselytise for "reprs >> devlink" ;)
>
>-ed

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-23 15:46                     ` Wilczynski, Michal
@ 2022-09-27  0:16                       ` Jakub Kicinski
  2022-09-28 12:02                         ` Wilczynski, Michal
  0 siblings, 1 reply; 35+ messages in thread
From: Jakub Kicinski @ 2022-09-27  0:16 UTC (permalink / raw)
  To: Wilczynski, Michal
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi, jiri,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel

On Fri, 23 Sep 2022 17:46:35 +0200 Wilczynski, Michal wrote:
> Also reconfiguration from the VM, would need to be handled by the VF 
> driver i.e iavf.
> So the solution would get much more complex I guess, since we would need 
> to implement communication between ice-iavf, through virtchnl I guess.

Yup, but it's the correct way to solve your problem AFAICT.

AFAIU you only want to cater to simple cases where the VF and PF 
are in the same control domain, which is not normal outside of 
running DPDK apps. Sooner or later someone will ask for queuing 
control from the VFs and you'll have to redesign the whole thing.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-26 11:51       ` Jiri Pirko
@ 2022-09-28 11:47         ` Wilczynski, Michal
  2022-09-29  7:12           ` Jiri Pirko
  0 siblings, 1 reply; 35+ messages in thread
From: Wilczynski, Michal @ 2022-09-28 11:47 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel



On 9/26/2022 1:51 PM, Jiri Pirko wrote:
> Thu, Sep 15, 2022 at 08:41:52PM CEST, michal.wilczynski@intel.com wrote:
>>
>> On 9/15/2022 5:31 PM, Edward Cree wrote:
>>> On 15/09/2022 14:42, Michal Wilczynski wrote:
>>>> Currently devlink-rate only have two types of objects: nodes and leafs.
>>>> There is a need to extend this interface to account for a third type of
>>>> scheduling elements - queues. In our use case customer is sending
>>>> different types of traffic on each queue, which requires an ability to
>>>> assign rate parameters to individual queues.
>>> Is there a use-case for this queue scheduling in the absence of a netdevice?
>>> If not, then I don't see how this belongs in devlink; the configuration
>>>    should instead be done in two parts: devlink-rate to schedule between
>>>    different netdevices (e.g. VFs) and tc qdiscs (or some other netdev-level
>>>    API) to schedule different queues within each single netdevice.
>>> Please explain why this existing separation does not support your use-case.
>>>
>>> Also I would like to see some documentation as part of this patch.  It looks
>>>    like there's no kernel document for devlink-rate unlike most other devlink
>>>    objects; perhaps you could add one?
>>>
>>> -ed
>> Hi,
>> Previously we discussed adding queues to devlink-rate in this thread:
>> https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/T/#u
>> In our use case we are trying to find a way to expose hardware Tx scheduler
>> tree that is defined
>> per port to user. Obviously if the tree is defined per physical port, all the
>> scheduling nodes will reside
>> on the same tree.
>>
>> Our customer is trying to send different types of traffic that require
>> different QoS levels on the same
> Do I understand that correctly, that you are assigning traffic to queues
> in VM, and you rate the queues on hypervisor? Is that the goal?

Yes.

>
>
>> VM, but on a different queues. This requires completely different rate setups
>> for that queue - in the
>> implementation that you're mentioning we wouldn't be able to arbitrarily
>> reassign the queue to any node.
>> Those queues would still need to share a single parent - their netdev. This
> So that replies to Edward's note about having the queues maintained
> within the single netdev/vport, correct?

  Correct ;)

>
>
>> wouldn't allow us to fully take
>> advantage of the HQoS and would introduce arbitrary limitations.
>>
>> Also I would think that since there is only one vendor implementing this
>> particular devlink-rate API, there is
>> some room for flexibility.
>>
>> Regarding the documentation,  sure. I just wanted to get all the feedback
>> from the mailing list and arrive at the final
>> solution before writing the docs.
>>
>> BTW, I'm going to be out of office tomorrow, so will respond in this thread
>> on Monday.
>> BR,
>> Michał
>>
>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-26 11:58             ` Jiri Pirko
@ 2022-09-28 11:53               ` Wilczynski, Michal
  2022-09-29  7:08                 ` Jiri Pirko
  0 siblings, 1 reply; 35+ messages in thread
From: Wilczynski, Michal @ 2022-09-28 11:53 UTC (permalink / raw)
  To: Jiri Pirko, Edward Cree
  Cc: netdev, alexandr.lobakin, dchumak, maximmi, simon.horman,
	jacob.e.keller, jesse.brandeburg, przemyslaw.kitszel



On 9/26/2022 1:58 PM, Jiri Pirko wrote:
> Tue, Sep 20, 2022 at 01:09:04PM CEST, ecree.xilinx@gmail.com wrote:
>> On 19/09/2022 14:12, Wilczynski, Michal wrote:
>>> Maybe a switchdev case would be a good parallel here. When you enable switchdev, you get port representors on
>>> the host for each VF that is already attached to the VM. Something that gives the host power to configure
>>> netdev that it doesn't 'own'. So it seems to me like giving user more power to configure things from the host
> Well, not really. It gives the user on hypervisor possibility
> to configure the eswitch vport side. The other side of the wire, which
> is in VM, is autonomous.

Frankly speaking, the VM is still free to assign traffic to queues as
before; I guess the networking card's scheduling algorithm will just
drain those queues at a different pace.

>
>
>>> is acceptable.
>> Right that's the thing though: I instinctively Want this to be done
>> through representors somehow, because it _looks_ like it ought to
>> be scoped to a single netdev; but that forces the hierarchy to
>> respect netdev boundaries which as we've discussed is an unwelcome
>> limitation.
> Why exacly? Do you want to share a single queue between multiple vport?
> Or what exactly would the the usecase where you hit the limitation?

Like you've noticed in a previous comment, traffic is assigned from
inside the VM; this tree simply represents the scheduling algorithm in
the HW, i.e. how fast the card will drain each queue. So if you have one
queue carrying real-time data and the rest carrying bulk, you might want
to prioritize the real-time data, i.e. put it on a completely different
branch of the scheduling tree.
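Using the commands proposed in this series, that prioritization could
look like the sketch below (node names and handles are hypothetical
examples, and the 'queue' object is part of this RFC, not an accepted
API):

```shell
# Dedicated high-priority branch for the real-time queue:
devlink port function rate add pci/0000:4b:00.0/node_rt parent vport_2 priority 7
devlink port function rate set pci/0000:4b:00.0/queue/103 parent node_rt
# The bulk queues keep their original, lower-priority parent node.
```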

BR,
Michał

>
>
>>> In my mind this is a device-wide configuration, since the ice driver registers each port as a separate pci device.
>>> And each of this devices have their own hardware Tx Scheduler tree global to that port. Queues that we're
>>> discussing are actually hardware queues, and are identified by hardware assigned txq_id.
>> In general, hardware being a single unit at the device level does
>> not necessarily mean its configuration should be device-wide.
>> For instance, in many NICs each port has a single hardware v-switch,
>> but we do not have some kind of "devlink filter" API to program it
>> directly.  Instead we attach TC rules to _many_ netdevs, and driver
>> code transforms and combines these to program the unitary device.
>> "device-wide configuration" originally meant things like firmware
>> version or operating mode (legacy vs. switchdev) that do not relate
>> directly to netdevs.
>>
>> But I agree with you that your approach is the "least evil method";
>> if properly explained and documented then I don't have any
>> remaining objection to your patch, despite that I'm continuing to
>> take the opportunity to proselytise for "reprs >> devlink" ;)
>>
>> -ed


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-27  0:16                       ` Jakub Kicinski
@ 2022-09-28 12:02                         ` Wilczynski, Michal
  2022-09-28 17:39                           ` Jakub Kicinski
  0 siblings, 1 reply; 35+ messages in thread
From: Wilczynski, Michal @ 2022-09-28 12:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi, jiri,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel



On 9/27/2022 2:16 AM, Jakub Kicinski wrote:
> On Fri, 23 Sep 2022 17:46:35 +0200 Wilczynski, Michal wrote:
>> Also reconfiguration from the VM, would need to be handled by the VF
>> driver i.e iavf.
>> So the solution would get much more complex I guess, since we would need
>> to implement communication between ice-iavf, through virtchnl I guess.
> Yup, but it's the correct way to solve your problem AFAICT.
>
> AFAIU you only want to cater to simple cases where the VF and PF
> are in the same control domain, which is not normal, outside of
> running DPDK apps. Sooner or later someone will ask for queuing
> control from the VFs and you'll have to redesign the whole thing.

Hmm, so I guess the queue part of this patch is not well liked.
I wonder if I should re-send this patch with just the implementation
of devlink-rate, and minor changes in devlink, like exposing functions,
so the driver can export initial configurations. This still brings some
value, because the user would still be able to modify at least the
upper part of the tree.

We can still discuss how the final solution should look, but I'm out
of ideas when it comes to an inside-VF interface (as we discussed,
tc-htb in its current form doesn't really work for us).

Thanks,
Michał




^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-28 12:02                         ` Wilczynski, Michal
@ 2022-09-28 17:39                           ` Jakub Kicinski
  0 siblings, 0 replies; 35+ messages in thread
From: Jakub Kicinski @ 2022-09-28 17:39 UTC (permalink / raw)
  To: Wilczynski, Michal
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi, jiri,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel

On Wed, 28 Sep 2022 14:02:57 +0200 Wilczynski, Michal wrote:
> > AFAIU you only want to cater to simple cases where the VF and PF
> > are in the same control domain, which is not normal, outside of
> > running DPDK apps. Sooner or later someone will ask for queuing
> > control from the VFs and you'll have to redesign the whole thing.  
> 
> Hmm, so I guess the queue part of this patch is not well liked.
> I wonder if I should re-send this patch with just the implementation
> of devlink-rate, and minor changes in devlink, like exposing functions,
> so the driver can export initial configurations. This still brings some 
> value,
> cause the user would still be able to modify at least the upper part of the
> tree.

Sounds good to me. I don't think we ever had a general discussion
about what should the driver do in terms of exposing the initial
configuration. I can't think of a reason why we wouldn't do that
so please repost and let's see what others think!

> We can still discuss how the final solution should look like, but i'm 
> out of ideas when it comes for a inside VF interface, (like we
> discussed tc-htb in current form doesn't really work for us).

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-28 11:53               ` Wilczynski, Michal
@ 2022-09-29  7:08                 ` Jiri Pirko
  0 siblings, 0 replies; 35+ messages in thread
From: Jiri Pirko @ 2022-09-29  7:08 UTC (permalink / raw)
  To: Wilczynski, Michal
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel

Wed, Sep 28, 2022 at 01:53:24PM CEST, michal.wilczynski@intel.com wrote:
>
>
>On 9/26/2022 1:58 PM, Jiri Pirko wrote:
>> Tue, Sep 20, 2022 at 01:09:04PM CEST, ecree.xilinx@gmail.com wrote:
>> > On 19/09/2022 14:12, Wilczynski, Michal wrote:
>> > > Maybe a switchdev case would be a good parallel here. When you enable switchdev, you get port representors on
>> > > the host for each VF that is already attached to the VM. Something that gives the host power to configure
>> > > netdev that it doesn't 'own'. So it seems to me like giving user more power to configure things from the host
>> Well, not really. It gives the user on hypervisor possibility
>> to configure the eswitch vport side. The other side of the wire, which
>> is in VM, is autonomous.
>
>Frankly speaking the VM is still free to assign traffic to queues as before,
>I guess the networking card scheduling algorithm will just drain those
>queues at different pace.

That was not my point. My point is that with per-queue shaping you are
basically configuring the other side of the wire (the VF), when this
config is out of the domain of the hypervisor.

>
>> 
>> 
>> > > is acceptable.
>> > Right that's the thing though: I instinctively Want this to be done
>> > through representors somehow, because it _looks_ like it ought to
>> > be scoped to a single netdev; but that forces the hierarchy to
>> > respect netdev boundaries which as we've discussed is an unwelcome
>> > limitation.
>> Why exacly? Do you want to share a single queue between multiple vport?
>> Or what exactly would the the usecase where you hit the limitation?
>
>Like you've noticed in previous comment traffic is assigned from inside the
>VM,
>this tree simply represents scheduling algorithm in the HW i.e how fast the
>card
>will drain from each queue. So if you have a queue carrying real-time data,
>and the rest carrying bulk, you might want to prioritze real-time data
>it i.e put it on a completely different branch on the scheduling tree.

Yep, so, if you forget about how this is implemented in HW/FW, this is
the VM-side config, correct?


>
>BR,
>Michał
>
>> 
>> 
>> > > In my mind this is a device-wide configuration, since the ice driver registers each port as a separate pci device.
>> > > And each of this devices have their own hardware Tx Scheduler tree global to that port. Queues that we're
>> > > discussing are actually hardware queues, and are identified by hardware assigned txq_id.
>> > In general, hardware being a single unit at the device level does
>> > not necessarily mean its configuration should be device-wide.
>> > For instance, in many NICs each port has a single hardware v-switch,
>> > but we do not have some kind of "devlink filter" API to program it
>> > directly.  Instead we attach TC rules to _many_ netdevs, and driver
>> > code transforms and combines these to program the unitary device.
>> > "device-wide configuration" originally meant things like firmware
>> > version or operating mode (legacy vs. switchdev) that do not relate
>> > directly to netdevs.
>> > 
>> > But I agree with you that your approach is the "least evil method";
>> > if properly explained and documented then I don't have any
>> > remaining objection to your patch, despite that I'm continuing to
>> > take the opportunity to proselytise for "reprs >> devlink" ;)
>> > 
>> > -ed
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-28 11:47         ` Wilczynski, Michal
@ 2022-09-29  7:12           ` Jiri Pirko
  2022-10-11 13:28             ` Wilczynski, Michal
  0 siblings, 1 reply; 35+ messages in thread
From: Jiri Pirko @ 2022-09-29  7:12 UTC (permalink / raw)
  To: Wilczynski, Michal
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel

Wed, Sep 28, 2022 at 01:47:03PM CEST, michal.wilczynski@intel.com wrote:
>
>
>On 9/26/2022 1:51 PM, Jiri Pirko wrote:
>> Thu, Sep 15, 2022 at 08:41:52PM CEST, michal.wilczynski@intel.com wrote:
>> > 
>> > On 9/15/2022 5:31 PM, Edward Cree wrote:
>> > > On 15/09/2022 14:42, Michal Wilczynski wrote:
>> > > > Currently devlink-rate only has two types of objects: nodes and leafs.
>> > > > There is a need to extend this interface to account for a third type of
>> > > > scheduling element - queues. In our use case the customer is sending
>> > > > different types of traffic on each queue, which requires the ability to
>> > > > assign rate parameters to individual queues.
>> > > Is there a use-case for this queue scheduling in the absence of a netdevice?
>> > > If not, then I don't see how this belongs in devlink; the configuration
>> > >    should instead be done in two parts: devlink-rate to schedule between
>> > >    different netdevices (e.g. VFs) and tc qdiscs (or some other netdev-level
>> > >    API) to schedule different queues within each single netdevice.
>> > > Please explain why this existing separation does not support your use-case.
>> > > 
>> > > Also I would like to see some documentation as part of this patch.  It looks
>> > >    like there's no kernel document for devlink-rate unlike most other devlink
>> > >    objects; perhaps you could add one?
>> > > 
>> > > -ed
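For reference, the two-part split Edward suggests might look roughly like this with iproute2. This is a sketch only: the PCI address, port index, group name, device name, and rates are hypothetical, and parameter support varies by driver (devlink-rate for VF ports typically also requires switchdev mode):

```shell
# Part 1: devlink-rate schedules between netdevices (e.g. VF ports) on
# the host. Leaf rate objects are created by the driver per port; the
# grouping node is user-created.
devlink port function rate add pci/0000:03:00.0/vf_group tx_max 5gbit
devlink port function rate set pci/0000:03:00.0/1 parent vf_group
devlink port function rate set pci/0000:03:00.0/1 tx_share 1gbit

# Part 2: scheduling between queues inside a single netdevice is done
# with tc qdiscs on that netdev, not via devlink.
tc qdisc replace dev eth0 root handle 1: mq
```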
>> > Hi,
>> > Previously we discussed adding queues to devlink-rate in this thread:
>> > https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/T/#u
>> > In our use case we are trying to find a way to expose the hardware Tx
>> > scheduler tree, which is defined per port, to the user. Obviously, if the
>> > tree is defined per physical port, all the scheduling nodes will reside
>> > on the same tree.
>> > 
>> > Our customer is trying to send different types of traffic that require
>> > different QoS levels on the same
>> Do I understand that correctly, that you are assigning traffic to queues
>> in VM, and you rate the queues on hypervisor? Is that the goal?
>
>Yes.

Why do you have this mismatch? It forces the hypervisor and VM admin to
somehow sync up on the configuration. That does not sound correct to me.


>
>> 
>> 
>> > VM, but on different queues. This requires completely different rate setups
>> > for that queue - in the
>> > implementation that you're mentioning we wouldn't be able to arbitrarily
>> > reassign the queue to any node.
>> > Those queues would still need to share a single parent - their netdev. This
>> So that replies to Edward's note about having the queues maintained
>> within the single netdev/vport, correct?
>
> Correct ;)

Okay. So you don't really need any kind of sharing devlink might be able
to provide.

From what you say and how I see this, it's clear. You should handle the
per-queue shaping on the VM, on netdevice level, most probably by
offloading some of the TC qdisc.
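Jiri's suggestion of offloaded per-queue shaping inside the VM could, for instance, take the shape of an offloaded HTB hierarchy. This is a sketch under the assumption that the VM's driver supports `htb offload`; the device name and rates are hypothetical:

```shell
# Offloaded HTB: the class hierarchy is programmed into hardware queues
# and rate limiters rather than enforced in software.
tc qdisc replace dev eth0 root handle 1: htb offload
# Real-time traffic class: guaranteed 1 Gbit/s, may borrow up to 2 Gbit/s.
tc class add dev eth0 parent 1: classid 1:10 htb rate 1gbit ceil 2gbit
# Bulk traffic class: lower guarantee, best-effort up to 1 Gbit/s.
tc class add dev eth0 parent 1: classid 1:20 htb rate 100mbit ceil 1gbit
# Traffic is then steered into the classes, e.g. with tc filters or
# socket priority.
```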


>
>> 
>> 
>> > wouldn't allow us to fully take
>> > advantage of the HQoS and would introduce arbitrary limitations.
>> > 
>> > Also I would think that since there is only one vendor implementing this
>> > particular devlink-rate API, there is
>> > some room for flexibility.
>> > 
>> > Regarding the documentation, sure. I just wanted to get all the feedback
>> > from the mailing list and arrive at the final
>> > solution before writing the docs.
>> > 
>> > BTW, I'm going to be out of office tomorrow, so will respond in this thread
>> > on Monday.
>> > BR,
>> > Michał
>> > 
>> > 
>


* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-09-29  7:12           ` Jiri Pirko
@ 2022-10-11 13:28             ` Wilczynski, Michal
  2022-10-11 14:17               ` Jiri Pirko
  0 siblings, 1 reply; 35+ messages in thread
From: Wilczynski, Michal @ 2022-10-11 13:28 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel, Jakub Kicinski



On 9/29/2022 9:12 AM, Jiri Pirko wrote:
> Wed, Sep 28, 2022 at 01:47:03PM CEST, michal.wilczynski@intel.com wrote:
>>
>> On 9/26/2022 1:51 PM, Jiri Pirko wrote:
>>> Thu, Sep 15, 2022 at 08:41:52PM CEST, michal.wilczynski@intel.com wrote:
>>>> On 9/15/2022 5:31 PM, Edward Cree wrote:
>>>>> On 15/09/2022 14:42, Michal Wilczynski wrote:
>>>>>> Currently devlink-rate only has two types of objects: nodes and leafs.
>>>>>> There is a need to extend this interface to account for a third type of
>>>>>> scheduling element - queues. In our use case the customer is sending
>>>>>> different types of traffic on each queue, which requires the ability to
>>>>>> assign rate parameters to individual queues.
>>>>> Is there a use-case for this queue scheduling in the absence of a netdevice?
>>>>> If not, then I don't see how this belongs in devlink; the configuration
>>>>>     should instead be done in two parts: devlink-rate to schedule between
>>>>>     different netdevices (e.g. VFs) and tc qdiscs (or some other netdev-level
>>>>>     API) to schedule different queues within each single netdevice.
>>>>> Please explain why this existing separation does not support your use-case.
>>>>>
>>>>> Also I would like to see some documentation as part of this patch.  It looks
>>>>>     like there's no kernel document for devlink-rate unlike most other devlink
>>>>>     objects; perhaps you could add one?
>>>>>
>>>>> -ed
>>>> Hi,
>>>> Previously we discussed adding queues to devlink-rate in this thread:
>>>> https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/T/#u
>>>> In our use case we are trying to find a way to expose the hardware Tx
>>>> scheduler tree, which is defined per port, to the user. Obviously, if the
>>>> tree is defined per physical port, all the scheduling nodes will reside
>>>> on the same tree.
>>>>
>>>> Our customer is trying to send different types of traffic that require
>>>> different QoS levels on the same
>>> Do I understand that correctly, that you are assigning traffic to queues
>>> in VM, and you rate the queues on hypervisor? Is that the goal?
>> Yes.
> Why do you have this mismatch? It forces the hypervisor and VM admin to
> somehow sync up on the configuration. That does not sound correct to me.

Thanks for the feedback, this is going to be changed.

>
>
>>>
>>>> VM, but on different queues. This requires completely different rate setups
>>>> for that queue - in the
>>>> implementation that you're mentioning we wouldn't be able to arbitrarily
>>>> reassign the queue to any node.
>>>> Those queues would still need to share a single parent - their netdev. This
>>> So that replies to Edward's note about having the queues maintained
>>> within the single netdev/vport, correct?
>>   Correct ;)
> Okay. So you don't really need any kind of sharing devlink might be able
> to provide.
>
>  From what you say and how I see this, it's clear. You should handle the
> per-queue shaping on the VM, on netdevice level, most probably by
> offloading some of the TC qdisc.

I talked with the architect, and this is how the solution will end up
looking. However, I'm not sure whether creating a hardware-only qdisc is
allowed?



Btw, thanks everyone for the valuable feedback. I've resent the patch
without the queue support:
https://lore.kernel.org/netdev/20221011090113.445485-1-michal.wilczynski@intel.com/


BR,
Michał
>
>>>
>>>> wouldn't allow us to fully take
>>>> advantage of the HQoS and would introduce arbitrary limitations.
>>>>
>>>> Also I would think that since there is only one vendor implementing this
>>>> particular devlink-rate API, there is
>>>> some room for flexibility.
>>>>
>>>> Regarding the documentation, sure. I just wanted to get all the feedback
>>>> from the mailing list and arrive at the final
>>>> solution before writing the docs.
>>>>
>>>> BTW, I'm going to be out of office tomorrow, so will respond in this thread
>>>> on Monday.
>>>> BR,
>>>> Michał
>>>>
>>>>



* Re: [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters
  2022-10-11 13:28             ` Wilczynski, Michal
@ 2022-10-11 14:17               ` Jiri Pirko
  0 siblings, 0 replies; 35+ messages in thread
From: Jiri Pirko @ 2022-10-11 14:17 UTC (permalink / raw)
  To: Wilczynski, Michal
  Cc: Edward Cree, netdev, alexandr.lobakin, dchumak, maximmi,
	simon.horman, jacob.e.keller, jesse.brandeburg,
	przemyslaw.kitszel, Jakub Kicinski

Tue, Oct 11, 2022 at 03:28:38PM CEST, michal.wilczynski@intel.com wrote:
>
>
>On 9/29/2022 9:12 AM, Jiri Pirko wrote:
>> Wed, Sep 28, 2022 at 01:47:03PM CEST, michal.wilczynski@intel.com wrote:
>> > 
>> > On 9/26/2022 1:51 PM, Jiri Pirko wrote:
>> > > Thu, Sep 15, 2022 at 08:41:52PM CEST, michal.wilczynski@intel.com wrote:
>> > > > On 9/15/2022 5:31 PM, Edward Cree wrote:
>> > > > > On 15/09/2022 14:42, Michal Wilczynski wrote:
>> > > > > > Currently devlink-rate only has two types of objects: nodes and leafs.
>> > > > > > There is a need to extend this interface to account for a third type of
>> > > > > > scheduling element - queues. In our use case the customer is sending
>> > > > > > different types of traffic on each queue, which requires the ability to
>> > > > > > assign rate parameters to individual queues.
>> > > > > Is there a use-case for this queue scheduling in the absence of a netdevice?
>> > > > > If not, then I don't see how this belongs in devlink; the configuration
>> > > > >     should instead be done in two parts: devlink-rate to schedule between
>> > > > >     different netdevices (e.g. VFs) and tc qdiscs (or some other netdev-level
>> > > > >     API) to schedule different queues within each single netdevice.
>> > > > > Please explain why this existing separation does not support your use-case.
>> > > > > 
>> > > > > Also I would like to see some documentation as part of this patch.  It looks
>> > > > >     like there's no kernel document for devlink-rate unlike most other devlink
>> > > > >     objects; perhaps you could add one?
>> > > > > 
>> > > > > -ed
>> > > > Hi,
>> > > > Previously we discussed adding queues to devlink-rate in this thread:
>> > > > https://lore.kernel.org/netdev/20220704114513.2958937-1-michal.wilczynski@intel.com/T/#u
>> > > > In our use case we are trying to find a way to expose the hardware Tx
>> > > > scheduler tree, which is defined per port, to the user. Obviously, if the
>> > > > tree is defined per physical port, all the scheduling nodes will reside
>> > > > on the same tree.
>> > > > 
>> > > > Our customer is trying to send different types of traffic that require
>> > > > different QoS levels on the same
>> > > Do I understand that correctly, that you are assigning traffic to queues
>> > > in VM, and you rate the queues on hypervisor? Is that the goal?
>> > Yes.
>> Why do you have this mismatch? It forces the hypervisor and VM admin to
>> somehow sync up on the configuration. That does not sound correct to me.
>
>Thanks for the feedback, this is going to be changed.
>
>> 
>> 
>> > > 
>> > > > VM, but on different queues. This requires completely different rate setups
>> > > > for that queue - in the
>> > > > implementation that you're mentioning we wouldn't be able to arbitrarily
>> > > > reassign the queue to any node.
>> > > > Those queues would still need to share a single parent - their netdev. This
>> > > So that replies to Edward's note about having the queues maintained
>> > > within the single netdev/vport, correct?
>> >   Correct ;)
>> Okay. So you don't really need any kind of sharing devlink might be able
>> to provide.
>> 
>>  From what you say and how I see this, it's clear. You should handle the
>> per-queue shaping on the VM, on netdevice level, most probably by
>> offloading some of the TC qdisc.
>
>I talked with the architect, and this is how the solution will end up
>looking. However, I'm not sure whether creating a hardware-only qdisc is
>allowed?

Nope.


>
>
>
>Btw, thanks everyone for the valuable feedback. I've resent the patch
>without the queue support:
>https://lore.kernel.org/netdev/20221011090113.445485-1-michal.wilczynski@intel.com/
>
>
>BR,
>Michał
>> 
>> > > 
>> > > > wouldn't allow us to fully take
>> > > > advantage of the HQoS and would introduce arbitrary limitations.
>> > > > 
>> > > > Also I would think that since there is only one vendor implementing this
>> > > > particular devlink-rate API, there is
>> > > > some room for flexibility.
>> > > > 
>> > > > Regarding the documentation, sure. I just wanted to get all the feedback
>> > > > from the mailing list and arrive at the final
>> > > > solution before writing the docs.
>> > > > 
>> > > > BTW, I'm going to be out of office tomorrow, so will respond in this thread
>> > > > on Monday.
>> > > > BR,
>> > > > Michał
>> > > > 
>> > > > 
>


end of thread, other threads:[~2022-10-11 14:18 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-15 13:42 [RFC PATCH net-next v4 0/6] Implement devlink-rate API and extend it Michal Wilczynski
2022-09-15 13:42 ` [RFC PATCH net-next v4 1/6] ice: Add function for move/reconfigure TxQ AQ command Michal Wilczynski
2022-09-15 13:42 ` [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters Michal Wilczynski
2022-09-15 15:31   ` Edward Cree
2022-09-15 18:41     ` Wilczynski, Michal
2022-09-15 21:01       ` Edward Cree
2022-09-19 13:12         ` Wilczynski, Michal
2022-09-20 11:09           ` Edward Cree
2022-09-26 11:58             ` Jiri Pirko
2022-09-28 11:53               ` Wilczynski, Michal
2022-09-29  7:08                 ` Jiri Pirko
2022-09-21 23:33       ` Jakub Kicinski
2022-09-22 11:44         ` Wilczynski, Michal
2022-09-22 12:50           ` Jakub Kicinski
2022-09-22 13:45             ` Wilczynski, Michal
2022-09-22 20:29               ` Jakub Kicinski
2022-09-23 12:11                 ` Wilczynski, Michal
2022-09-23 13:16                   ` Jakub Kicinski
2022-09-23 15:46                     ` Wilczynski, Michal
2022-09-27  0:16                       ` Jakub Kicinski
2022-09-28 12:02                         ` Wilczynski, Michal
2022-09-28 17:39                           ` Jakub Kicinski
2022-09-26 11:51       ` Jiri Pirko
2022-09-28 11:47         ` Wilczynski, Michal
2022-09-29  7:12           ` Jiri Pirko
2022-10-11 13:28             ` Wilczynski, Michal
2022-10-11 14:17               ` Jiri Pirko
2022-09-15 13:42 ` [RFC PATCH net-next v4 3/6] ice: Introduce new parameters in ice_sched_node Michal Wilczynski
2022-09-15 13:42 ` [RFC PATCH net-next v4 4/6] ice: Implement devlink-rate API Michal Wilczynski
2022-09-22 13:08   ` Przemek Kitszel
2022-09-15 13:42 ` [RFC PATCH net-next v4 5/6] ice: Export Tx scheduler configuration to devlink-rate Michal Wilczynski
2022-09-15 13:42 ` [RFC PATCH net-next v4 6/6] ice: Prevent ADQ, DCB and RDMA coexistence with Custom Tx scheduler Michal Wilczynski
2022-09-15 13:57 ` [RFC PATCH net-next v4 0/6] Implement devlink-rate API and extend it Wilczynski, Michal
2022-09-19  7:22 [RFC PATCH net-next v4 2/6] devlink: Extend devlink-rate api with queues and new parameters kernel test robot
2022-09-19  9:22 ` Dan Carpenter
