All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] ice: Reconfigure tx scheduling for SR-IOV
@ 2022-07-04 11:45 Michal Wilczynski
  2022-07-05 22:15 ` Jakub Kicinski
  0 siblings, 1 reply; 4+ messages in thread
From: Michal Wilczynski @ 2022-07-04 11:45 UTC (permalink / raw)
  To: netdev; +Cc: Michal Wilczynski

We're trying to find a way to reconfigure tx scheduling in hardware using
linux tc-htb tool. To accomplish that use of tc-htb offload mechanism is
proposed. It was introduced in this commit to the linux kernel:
commit d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")

Example configuration:

tc qdisc replace dev ens785 root handle 1: htb offload
tc class add dev ens785 parent 1: classid 1:2 htb rate 1000 ceil 2000
tc class add dev ens785 parent 1:2 classid 1:3 htb rate 1000 ceil 2000
tc class add dev ens785 parent 1:2 classid 1:4 htb rate 1000 ceil 2000
tc class add dev ens785 parent 1:3 classid 1:5 htb rate 1000 ceil 2000
tc class add dev ens785 parent 1:4 classid 1:6 htb rate 1000 ceil 2000
tc qdisc add dev ens785 parent 1:6 handle 9: pfifo

After each tc command ice_setup_tc() is called. This way parameters can be
received from user space.

Kernel thinks that this is supposed to create a following tree:

                  1:    <-- root qdisc
                  |
                 1:2
                 / \
                /   \
              1:3   1:4
               |     |
               |     |
              1:5   1:6
               |     |
              QID   QID   <---- here we'll have PFIFO qdiscs

For each node rate and ceil parameters are set. Thanks to that algorithm
knows how to prioritize traffic. As shown on the picture on the leaf
nodes there are queues present.
If we were to follow normal flow, we would now use tc-filter family of
commands to direct types of interesting traffic to the correct nodes.
That is NOT the case in this implementation. In this POC, meaningful
classid number identifies scheduling node. Number of qdisc handle is a
queue number in a PF space. Reason for this - we want to support ALL
queues on the card including SR-IOV ones that are assigned to VF
netdevs.

Unfortunately this leaves us with a situation where we have to provide
queues to the kernel, which will never be used.

So the tree in hardware would look like this:

                 Root            Layer 0
                  ||
                  TC             Layer 1
                  ||
                 1:2             Layer 2
                 / \
              1:3   1:4          Layer 3
               |     |
               |     |
              1:5   1:6          Layer 4
                     |
                 Queue nr 9

We can have up to 5, or 9 layers. According to HAS there is no
obligation to create all 5 or 9 layers. Extra pass-through layers will
be created in the firmware. Also single TC node is left, cause we can't
really remove this in the software. It is treated as our root node.

So usual flow for this implementations goes like this:

1. tc-htb command replace root qdisc
   a) Queue metadata is saved to an array (which I will replace as a
list in real implementation). Reason for this - AQ 0x0C32 (Move /
Reconfigure TX queue) requires this to successfully reconfigure queue
later.

   b) Whole tree is deleted from the software. Queues still remain in
the firmware, but they are in the orphaned state. (I wonder if I should
suspend them through AQ command).

2. tc-htb adds new class
   a) New scheduling node is added through AQ 0x0401 (Add Scheduling
Elements). Also node is added to SW DB. New BW profiles are created
through AQ 0x0410 (Add RL profiles) and attached to the scheduling node.
   b) Most controversially, as was stated before there is a need to
provide linux kernel with a correct queue id. The queue also has to be
brand new without qdisc attached to it, cause HTB would like to create it's
own PFIFO qdisc. I had to work around this by resizing number of queues
using netif_set_real_num_tx_queues() and providing htb with qid of newly
allocated queue. If this is accepted, it needs some further work to
align it with our vsi/queue rebuilds mechanism in the driver.

3. tc-htb adds qdisc to the leaf class
   a) AQ 0x0C32 (Move /Reconfigure TX queue) is executed to change queue
position. Metadata for the AQ call is taken from orphaned array.

There are still a lot of holes in the implementation. What works for now
as ususally for POC's is a single 'happy path' that aligns with my
testing. Almost no error handling is implemented yet, there will be also
some further changes need to data structures.

I'm mostly hoping for some feedback regarding general direction of
changes.

TL;DR

There is a mellanox implementation of tc-htb offload mechanism,
unfortunately it doesn't support SR-IOV queues, only PF queues. We're
wondering how we could enable SR-IOV queues for the tc-htb.
Alternatively we're looking for other possible ways of handling this
problem, any feedback will be greatly appreciated.

Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
---
 drivers/net/ethernet/intel/ice/ice.h          |   2 +
 .../net/ethernet/intel/ice/ice_adminq_cmd.h   |  38 ++
 drivers/net/ethernet/intel/ice/ice_common.c   |  62 +++-
 drivers/net/ethernet/intel/ice/ice_common.h   |   8 +
 drivers/net/ethernet/intel/ice/ice_dcb.c      |   2 +-
 drivers/net/ethernet/intel/ice/ice_main.c     | 332 ++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_sched.c    |  15 +-
 drivers/net/ethernet/intel/ice/ice_sched.h    |  16 +-
 drivers/net/ethernet/intel/ice/ice_type.h     |   3 +
 9 files changed, 468 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 60453b3b8d23..fb00ad364b2a 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -319,6 +319,8 @@ struct ice_vsi {
 
 	u64 tx_linearize;
 	DECLARE_BITMAP(state, ICE_VSI_STATE_NBITS);
+	DECLARE_BITMAP(avail_qos_qids, 1024);
+	u32 num_qos_tx;
 	unsigned int current_netdev_flags;
 	u32 tx_restart;
 	u32 tx_busy;
diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 05cb9dd7035a..8f5c1b9d72f2 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -1854,6 +1854,41 @@ struct ice_aqc_dis_txq_item {
 	__le16 q_id[];
 } __packed;
 
+struct ice_aqc_move_txqs {
+	u8 cmd_type;
+#define ICE_AQC_Q_CMD_TYPE_S		0
+#define ICE_AQC_Q_CMD_TYPE_M		(0x3 << ICE_AQC_Q_CMD_TYPE_S)
+#define ICE_AQC_Q_CMD_TYPE_MOVE		1
+#define ICE_AQC_Q_CMD_TYPE_TC_CHANGE	2
+#define ICE_AQC_Q_CMD_TYPE_MOVE_AND_TC	3
+#define ICE_AQC_Q_CMD_SUBSEQ_CALL	BIT(2)
+#define ICE_AQC_Q_CMD_FLUSH_PIPE	BIT(3)
+	u8 num_qs;
+	u8 rsvd;
+	u8 timeout;
+#define ICE_AQC_Q_CMD_TIMEOUT_S		2
+#define ICE_AQC_Q_CMD_TIMEOUT_M		(0x3F << ICE_AQC_Q_CMD_TIMEOUT_S)
+	__le32 blocked_cgds;
+	__le32 addr_high;
+	__le32 addr_low;
+};
+
+struct ice_aqc_move_txqs_elem {
+	__le16 txq_id;
+	u8 q_cgd;
+	u8 rsvd;
+#ifndef EXTERNAL_RELEASE
+	/* EAS1.1 has 2 byte TEID, but it is 4-bytes. Moved for alignment */
+#endif /* EXTERNAL_RELEASE */
+	__le32 q_teid;
+};
+
+struct ice_aqc_move_txqs_data {
+	__le32 src_teid;
+	__le32 dest_teid;
+	struct ice_aqc_move_txqs_elem txqs[1];
+};
+
 /* Add Tx RDMA Queue Set (indirect 0x0C33) */
 struct ice_aqc_add_rdma_qset {
 	u8 num_qset_grps;
@@ -2090,6 +2125,7 @@ struct ice_aq_desc {
 		struct ice_aqc_get_topo get_topo;
 		struct ice_aqc_sched_elem_cmd sched_elem_cmd;
 		struct ice_aqc_query_txsched_res query_sched_res;
+		struct ice_aqc_move_txqs move_txqs;
 		struct ice_aqc_query_port_ets port_ets;
 		struct ice_aqc_rl_profile rl_profile;
 		struct ice_aqc_nvm nvm;
@@ -2149,6 +2185,7 @@ enum ice_aq_err {
 	ICE_AQ_RC_OK		= 0,  /* Success */
 	ICE_AQ_RC_EPERM		= 1,  /* Operation not permitted */
 	ICE_AQ_RC_ENOENT	= 2,  /* No such element */
+	ICE_AQ_RC_EAGAIN	= 8,  /* Try again */
 	ICE_AQ_RC_ENOMEM	= 9,  /* Out of memory */
 	ICE_AQ_RC_EBUSY		= 12, /* Device or resource busy */
 	ICE_AQ_RC_EEXIST	= 13, /* Object already exists */
@@ -2281,6 +2318,7 @@ enum ice_adminq_opc {
 	/* Tx queue handling commands/events */
 	ice_aqc_opc_add_txqs				= 0x0C30,
 	ice_aqc_opc_dis_txqs				= 0x0C31,
+	ice_aqc_opc_move_recfg_txqs			= 0x0C32,
 	ice_aqc_opc_add_rdma_qset			= 0x0C33,
 
 	/* package commands */
diff --git a/drivers/net/ethernet/intel/ice/ice_common.c b/drivers/net/ethernet/intel/ice/ice_common.c
index 9619bdb9e49a..138e05c7a06b 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -3932,6 +3932,63 @@ ice_aq_dis_lan_txq(struct ice_hw *hw, u8 num_qgrps,
 	return status;
 }
 
+int
+ice_aq_move_recfg_lan_txq(struct ice_hw *hw, u8 num_qs, bool is_move,
+			  bool is_tc_change, bool subseq_call, bool flush_pipe,
+			  u8 timeout, u32 *blocked_cgds,
+			  struct ice_aqc_move_txqs_data *buf, u16 buf_size,
+			  u8 *txqs_moved, struct ice_sq_cd *cd)
+{
+	struct ice_aqc_move_txqs *cmd;
+	struct ice_aq_desc desc;
+	int status;
+
+	cmd = &desc.params.move_txqs;
+	ice_fill_dflt_direct_cmd_desc(&desc, ice_aqc_opc_move_recfg_txqs);
+
+#ifndef EXTERNAL_RELEASE
+	/* from CPK HAS 1.5 table 10-40 */
+#endif /* !EXTERNAL_RELEASE */
+#define ICE_LAN_TXQ_MOVE_TIMEOUT_MAX 50
+	if (timeout > ICE_LAN_TXQ_MOVE_TIMEOUT_MAX)
+		return -EINVAL;
+
+	if (is_tc_change && !flush_pipe && !blocked_cgds)
+		return -EINVAL;
+
+	if (!is_move && !is_tc_change)
+		return -EINVAL;
+
+	desc.flags |= cpu_to_le16(ICE_AQ_FLAG_RD);
+
+	if (is_move)
+		cmd->cmd_type |= ICE_AQC_Q_CMD_TYPE_MOVE;
+
+	if (is_tc_change)
+		cmd->cmd_type |= ICE_AQC_Q_CMD_TYPE_TC_CHANGE;
+
+	if (subseq_call)
+		cmd->cmd_type |= ICE_AQC_Q_CMD_SUBSEQ_CALL;
+
+	if (flush_pipe)
+		cmd->cmd_type |= ICE_AQC_Q_CMD_FLUSH_PIPE;
+
+	cmd->num_qs = num_qs;
+	cmd->timeout = ((timeout << ICE_AQC_Q_CMD_TIMEOUT_S) &
+			ICE_AQC_Q_CMD_TIMEOUT_M);
+
+	status = ice_aq_send_cmd(hw, &desc, buf, buf_size, cd);
+
+	if (!status && txqs_moved)
+		*txqs_moved = cmd->num_qs;
+
+	if (hw->adminq.sq_last_status == ICE_AQ_RC_EAGAIN &&
+	    is_tc_change && !flush_pipe)
+		*blocked_cgds = le32_to_cpu(cmd->blocked_cgds);
+
+	return status;
+}
+
 /**
  * ice_aq_add_rdma_qsets
  * @hw: pointer to the hardware structure
@@ -4325,7 +4382,8 @@ ice_ena_vsi_txq(struct ice_port_info *pi, u16 vsi_handle, u8 tc, u16 q_handle,
 	q_ctx->q_teid = le32_to_cpu(node.node_teid);
 
 	/* add a leaf node into scheduler tree queue layer */
-	status = ice_sched_add_node(pi, hw->num_tx_sched_layers - 1, &node);
+	status = ice_sched_add_node(pi, hw->num_tx_sched_layers - 1, &node,
+				    buf->txqs[0].txq_id);
 	if (!status)
 		status = ice_sched_replay_q_bw(pi, q_ctx);
 
@@ -4560,7 +4618,7 @@ ice_ena_vsi_rdma_qset(struct ice_port_info *pi, u16 vsi_handle, u8 tc,
 	for (i = 0; i < num_qsets; i++) {
 		node.node_teid = buf->rdma_qsets[i].qset_teid;
 		ret = ice_sched_add_node(pi, hw->num_tx_sched_layers - 1,
-					 &node);
+					 &node, 0);
 		if (ret)
 			break;
 		qset_teid[i] = le32_to_cpu(node.node_teid);
diff --git a/drivers/net/ethernet/intel/ice/ice_common.h b/drivers/net/ethernet/intel/ice/ice_common.h
index 872ea7d2332d..d5bc3316b258 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.h
+++ b/drivers/net/ethernet/intel/ice/ice_common.h
@@ -176,6 +176,14 @@ int
 ice_ena_vsi_txq(struct ice_port_info *pi, u16 vsi_handle, u8 tc, u16 q_handle,
 		u8 num_qgrps, struct ice_aqc_add_tx_qgrp *buf, u16 buf_size,
 		struct ice_sq_cd *cd);
+
+int
+ice_aq_move_recfg_lan_txq(struct ice_hw *hw, u8 num_qs, bool is_move,
+			  bool is_tc_change, bool subseq_call, bool flush_pipe,
+			  u8 timeout, u32 *blocked_cgds,
+			  struct ice_aqc_move_txqs_data *buf, u16 buf_size,
+			  u8 *txqs_moved, struct ice_sq_cd *cd);
+
 int ice_replay_vsi(struct ice_hw *hw, u16 vsi_handle);
 void ice_replay_post(struct ice_hw *hw);
 void ice_output_fw_log(struct ice_hw *hw, struct ice_aq_desc *desc, void *buf);
diff --git a/drivers/net/ethernet/intel/ice/ice_dcb.c b/drivers/net/ethernet/intel/ice/ice_dcb.c
index 0b146a0d4205..1b0dcd4c0323 100644
--- a/drivers/net/ethernet/intel/ice/ice_dcb.c
+++ b/drivers/net/ethernet/intel/ice/ice_dcb.c
@@ -1580,7 +1580,7 @@ ice_update_port_tc_tree_cfg(struct ice_port_info *pi,
 		/* new TC */
 		status = ice_sched_query_elem(pi->hw, teid2, &elem);
 		if (!status)
-			status = ice_sched_add_node(pi, 1, &elem);
+			status = ice_sched_add_node(pi, 1, &elem, 0);
 		if (status)
 			break;
 		/* update the TC number */
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index c1ac2f746714..ef78256009b9 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -7251,6 +7251,8 @@ const char *ice_aq_str(enum ice_aq_err aq_err)
 	switch (aq_err) {
 	case ICE_AQ_RC_OK:
 		return "OK";
+	case ICE_AQ_RC_EAGAIN:
+		return "ICE_AQ_RC_EAGAIN";
 	case ICE_AQ_RC_EPERM:
 		return "ICE_AQ_RC_EPERM";
 	case ICE_AQ_RC_ENOENT:
@@ -8588,6 +8590,332 @@ static int ice_setup_tc_mqprio_qdisc(struct net_device *netdev, void *type_data)
 	return ret;
 }
 
+#define INTEL_QOS_QID_INNER 0xffff
+#define INTEL_HTB_CLASSID_ROOT 0xffffffff
+
+/* this will become a list in final implementation */
+static struct ice_sched_node orphaned_leaf_nodes[1024];
+static u32 nr_of_orphaned_leaf_nodes;
+
+static u16 ice_get_free_qid(struct net_device *netdev, struct ice_vsi *vsi)
+{
+	int i;
+
+	for (i = 0; i < vsi->num_qos_tx; i++) {
+		if (!test_bit(i, vsi->avail_qos_qids)) {
+			/* currently freeing qid's is not implemented yet */
+			set_bit(i, vsi->avail_qos_qids);
+			return vsi->num_qos_tx + i;
+		}
+	}
+	set_bit(vsi->num_qos_tx, vsi->avail_qos_qids);
+	vsi->num_qos_tx++;
+
+	netif_set_real_num_tx_queues(netdev, vsi->num_txq + vsi->num_qos_tx);
+	return vsi->num_txq + vsi->num_qos_tx - 1;
+}
+
+static struct ice_sched_node *ice_look_for_classid(struct ice_sched_node *node, u16 classid)
+{
+	struct ice_sched_node *tmp;
+	int i;
+
+	if (node->classid == classid)
+		return node;
+
+	for (i = 0; i < node->num_children; i++) {
+		tmp = ice_look_for_classid(node->children[i], classid);
+		if (tmp)
+			return tmp;
+	}
+
+	return NULL;
+}
+
+static struct ice_sched_node *ice_look_for_pf_queue_id(struct ice_sched_node *node, u16 pf_queue_id)
+{
+	struct ice_sched_node *tmp;
+	int i;
+
+	if (node->info.data.elem_type == ICE_AQC_ELEM_TYPE_LEAF) {
+		if (node->pf_queue_id == pf_queue_id)
+			return node;
+	}
+
+	for (i = 0; i < node->num_children; i++) {
+		tmp = ice_look_for_pf_queue_id(node->children[i], pf_queue_id);
+		if (tmp)
+			return tmp;
+	}
+
+	return NULL;
+}
+
+static struct ice_sched_node *ice_look_for_orphaned_queue_id(u16 pf_queue_id)
+{
+	int i;
+
+	for (i = 0; i < nr_of_orphaned_leaf_nodes; i++) {
+		if (pf_queue_id == orphaned_leaf_nodes[i].pf_queue_id)
+			return &orphaned_leaf_nodes[i];
+	}
+
+	return NULL;
+}
+
+static int ice_reassign_leaf_node(struct ice_port_info *pi, struct ice_hw *hw,
+				  struct ice_sched_node *src_node,
+				  struct ice_sched_node *dst_node)
+{
+	struct ice_aqc_move_txqs_data *buf;
+	u32 blocked_cgds;
+	u8 txqs_moved;
+	u16 buf_size;
+	int status;
+
+	buf_size = struct_size(buf, txqs, 1);
+	buf = kzalloc(buf_size, GFP_KERNEL);
+
+	buf->src_teid = src_node->info.parent_teid;
+	buf->dest_teid = dst_node->info.node_teid;
+	buf->txqs[0].txq_id = src_node->pf_queue_id;
+	buf->txqs[0].q_cgd = 0;
+	buf->txqs[0].q_teid = src_node->info.node_teid;
+
+	status = ice_aq_move_recfg_lan_txq(hw, 1, true, false, false, false, 50,
+					   &blocked_cgds, buf, buf_size, &txqs_moved, NULL);
+
+	if (!status) {
+		struct ice_aqc_txsched_elem_data info;
+
+		info = src_node->info;
+		info.parent_teid = dst_node->info.node_teid;
+
+		status = ice_sched_add_node(pi, dst_node->tx_sched_layer + 1, &info,
+					    buf->txqs[0].txq_id);
+	}
+
+	kfree(buf);
+
+	return status;
+}
+
+int ice_add_node_to_topology(struct ice_vsi *vsi, struct ice_sched_node *tc_node,
+			     struct ice_port_info *pi, u32 parent_classid, u16 classid,
+			     u64 rate, u64 ceil, bool alloc_new_qid)
+{
+	struct net_device *netdev = vsi->netdev;
+	struct ice_sched_node *parent_node;
+	struct ice_sched_node *new_node;
+	u16 num_nodes_added;
+	u32 first_node_teid;
+	int status;
+
+	rate = rate * 8;
+	ceil = ceil * 8;
+
+	if (rate < ICE_SCHED_MIN_BW || rate > ICE_SCHED_MAX_BW)
+		return -EINVAL;
+
+	if (ceil < ICE_SCHED_MIN_BW || ceil > ICE_SCHED_MAX_BW)
+		return -EINVAL;
+
+	parent_node = ice_look_for_classid(tc_node, parent_classid);
+	if (!parent_node)
+		return -EINVAL;
+
+	status = ice_sched_add_elems(pi, tc_node, parent_node, parent_node->tx_sched_layer + 1,
+				     1, &num_nodes_added, &first_node_teid);
+
+	if (status) {
+		netdev_err(netdev, "Can't insert element to topology status %d\n", status);
+		return -ENOENT;
+	}
+
+	new_node = ice_sched_find_node_by_teid(parent_node, first_node_teid);
+
+	if (alloc_new_qid) {
+		new_node->qid = ice_get_free_qid(netdev, vsi);
+	} else {
+		new_node->qid = parent_node->qid;
+		parent_node->qid = INTEL_QOS_QID_INNER;
+	}
+
+	new_node->classid = classid;
+
+	mutex_lock(&pi->sched_lock);
+	status = ice_sched_set_node_bw_lmt(pi, new_node, ICE_MIN_BW, rate);
+
+	if (status) {
+		netdev_err(netdev, "Can't set scheduling node rate, status %d\n", status);
+		mutex_unlock(&pi->sched_lock);
+		return -ENOENT;
+	}
+
+	status = ice_sched_set_node_bw_lmt(pi, new_node, ICE_MAX_BW, ceil);
+
+	if (status) {
+		netdev_err(netdev, "Can't set scheduling node ceil, status %d\n", status);
+		mutex_unlock(&pi->sched_lock);
+		return -ENOENT;
+	}
+
+	mutex_unlock(&pi->sched_lock);
+
+	return 0;
+}
+
+static void ice_save_orphaned_nodes(struct ice_sched_node *node)
+{
+	int i;
+
+	if (node->info.data.elem_type == ICE_AQC_ELEM_TYPE_LEAF) {
+		memcpy(&orphaned_leaf_nodes[nr_of_orphaned_leaf_nodes],
+		       node, sizeof(struct ice_sched_node));
+		nr_of_orphaned_leaf_nodes++;
+		return;
+	}
+
+	for (i = 0; i < node->num_children; i++)
+		ice_save_orphaned_nodes(node->children[i]);
+}
+
+static int ice_setup_tc_fifo(struct ice_netdev_priv *np, struct tc_fifo_qopt_offload *fifo)
+{
+	struct ice_port_info *pi = np->vsi->port_info;
+	struct net_device *netdev = np->vsi->netdev;
+	struct ice_sched_node *src_node;
+	struct ice_sched_node *dst_node;
+	struct ice_sched_node *tc_node;
+	struct ice_hw *hw  = pi->hw;
+	u32 handle, parent;
+	int status;
+
+	tc_node = pi->root->children[0];
+
+	/* we only care about major number so let's just shift minor left */
+	handle = fifo->handle >> 16;
+
+	/* we care about minor here so take first 16 bits only */
+	parent = fifo->parent & 0xFFFF;
+
+	switch (fifo->command) {
+	case TC_FIFO_REPLACE:
+		dst_node = ice_look_for_classid(tc_node, parent);
+
+		if (!dst_node)
+			return -EINVAL;
+
+		/* first look among orphaned queues */
+		src_node = ice_look_for_orphaned_queue_id(handle);
+
+		/* normally we would also look among non-orphaned nodes
+		 * but we can skip that for the purpose of the RFC
+		 *
+		 * if (src_node == NULL) {
+		 *
+		 * src_node = ice_look_for_pf_queue_id(tc_node, handle);
+		 * }
+		 */
+		if (!src_node)
+			return -EINVAL;
+
+		status = ice_reassign_leaf_node(pi, hw, src_node, dst_node);
+
+		if (status) {
+			netdev_err(netdev, "Failed to reassign queue\n");
+			return -ENOENT;
+		}
+
+		break;
+
+	default:
+		return -ENOENT;
+	}
+	return 0;
+}
+
+static int ice_setup_tc_htb(struct ice_netdev_priv *np, struct tc_htb_qopt_offload *htb)
+{
+	struct ice_port_info *pi = np->vsi->port_info;
+	struct net_device *netdev = np->vsi->netdev;
+	struct ice_pf *pf = np->vsi->back;
+	struct ice_sched_node *tc_node;
+	struct ice_sched_node *node;
+	struct ice_hw *hw  = pi->hw;
+	int err = 0;
+	int i;
+
+	tc_node = pi->root->children[0];
+
+	switch (htb->command) {
+	case TC_HTB_CREATE:
+		if (pi->root->num_children != 1) {
+			netdev_err(netdev, "Tree should contain only one TC before configuring HTB\n");
+			return -EINVAL;
+		}
+
+		if (ice_is_adq_active(pf)) {
+			netdev_err(netdev, "HTB is mutally exclusive with ADQ\n");
+			return -EINVAL;
+		}
+
+		tc_node = pi->root->children[0];
+		tc_node->classid = INTEL_HTB_CLASSID_ROOT;
+		tc_node->qid = INTEL_QOS_QID_INNER;
+
+		for (i = 0; i < tc_node->num_children; i++) {
+			ice_save_orphaned_nodes(tc_node->children[i]);
+			ice_free_sched_node(pi, tc_node->children[i]);
+		}
+
+	break;
+
+	case TC_HTB_DESTROY:
+	/* in this case we need to restore tree structure from the NVM */
+	/* try to schedule a CORER from here */
+		ice_schedule_reset(np->vsi->back, ICE_RESET_CORER);
+	break;
+
+	case TC_HTB_LEAF_TO_INNER:
+		return ice_add_node_to_topology(np->vsi, tc_node, pi, htb->parent_classid,
+						htb->classid, htb->rate, htb->ceil, false);
+	break;
+
+	case TC_HTB_LEAF_ALLOC_QUEUE:
+		tc_node = pi->root->children[0];
+
+		err = ice_add_node_to_topology(np->vsi, tc_node, pi, htb->parent_classid,
+					       htb->classid, htb->rate, htb->ceil, true);
+		if (err)
+			return err;
+
+		node = ice_look_for_classid(tc_node, htb->classid);
+		htb->qid = node->qid;
+	break;
+
+	case TC_HTB_LEAF_DEL:
+		node = ice_look_for_classid(tc_node, htb->classid);
+		ice_free_sched_node(pi, node);
+	break;
+
+	case TC_HTB_LEAF_DEL_LAST:
+		node = ice_look_for_classid(tc_node, htb->classid);
+		ice_free_sched_node(pi, node);
+	break;
+
+	case TC_HTB_LEAF_QUERY_QUEUE:
+		node = ice_look_for_classid(tc_node, htb->classid);
+		htb->qid = node->qid;
+	break;
+
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
 static LIST_HEAD(ice_block_cb_list);
 
 static int
@@ -8610,6 +8938,10 @@ ice_setup_tc(struct net_device *netdev, enum tc_setup_type type,
 		err = ice_setup_tc_mqprio_qdisc(netdev, type_data);
 		mutex_unlock(&pf->tc_mutex);
 		return err;
+	case TC_SETUP_QDISC_HTB:
+		return ice_setup_tc_htb(np, type_data);
+	case TC_SETUP_QDISC_FIFO:
+		return ice_setup_tc_fifo(np, type_data);
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/drivers/net/ethernet/intel/ice/ice_sched.c b/drivers/net/ethernet/intel/ice/ice_sched.c
index 7947223536e3..ff5d0e1af13c 100644
--- a/drivers/net/ethernet/intel/ice/ice_sched.c
+++ b/drivers/net/ethernet/intel/ice/ice_sched.c
@@ -147,7 +147,7 @@ ice_aq_query_sched_elems(struct ice_hw *hw, u16 elems_req,
  */
 int
 ice_sched_add_node(struct ice_port_info *pi, u8 layer,
-		   struct ice_aqc_txsched_elem_data *info)
+		   struct ice_aqc_txsched_elem_data *info, u16 txq_id)
 {
 	struct ice_aqc_txsched_elem_data elem;
 	struct ice_sched_node *parent;
@@ -190,6 +190,9 @@ ice_sched_add_node(struct ice_port_info *pi, u8 layer,
 		}
 	}
 
+	if (info->data.elem_type == ICE_AQC_ELEM_TYPE_LEAF)
+		node->pf_queue_id = txq_id;
+
 	node->in_use = true;
 	node->parent = parent;
 	node->tx_sched_layer = layer;
@@ -875,7 +878,7 @@ void ice_sched_cleanup_all(struct ice_hw *hw)
  *
  * This function add nodes to HW as well as to SW DB for a given layer
  */
-static int
+int
 ice_sched_add_elems(struct ice_port_info *pi, struct ice_sched_node *tc_node,
 		    struct ice_sched_node *parent, u8 layer, u16 num_nodes,
 		    u16 *num_nodes_added, u32 *first_node_teid)
@@ -924,7 +927,7 @@ ice_sched_add_elems(struct ice_port_info *pi, struct ice_sched_node *tc_node,
 	*num_nodes_added = num_nodes;
 	/* add nodes to the SW DB */
 	for (i = 0; i < num_nodes; i++) {
-		status = ice_sched_add_node(pi, layer, &buf->generic[i]);
+		status = ice_sched_add_node(pi, layer, &buf->generic[i], 0);
 		if (status) {
 			ice_debug(hw, ICE_DBG_SCHED, "add nodes in SW DB failed status =%d\n",
 				  status);
@@ -1268,7 +1271,7 @@ int ice_sched_init_port(struct ice_port_info *pi)
 			    ICE_AQC_ELEM_TYPE_ENTRY_POINT)
 				hw->sw_entry_point_layer = j;
 
-			status = ice_sched_add_node(pi, j, &buf[i].generic[j]);
+			status = ice_sched_add_node(pi, j, &buf[i].generic[j], 0);
 			if (status)
 				goto err_init_port;
 		}
@@ -3560,7 +3563,7 @@ ice_sched_set_eir_srl_excl(struct ice_port_info *pi,
  * node's RL profile ID of type CIR, EIR, or SRL, and removes old profile
  * ID from local database. The caller needs to hold scheduler lock.
  */
-static int
+int
 ice_sched_set_node_bw(struct ice_port_info *pi, struct ice_sched_node *node,
 		      enum ice_rl_type rl_type, u32 bw, u8 layer_num)
 {
@@ -3606,7 +3609,7 @@ ice_sched_set_node_bw(struct ice_port_info *pi, struct ice_sched_node *node,
  * It updates node's BW limit parameters like BW RL profile ID of type CIR,
  * EIR, or SRL. The caller needs to hold scheduler lock.
  */
-static int
+int
 ice_sched_set_node_bw_lmt(struct ice_port_info *pi, struct ice_sched_node *node,
 			  enum ice_rl_type rl_type, u32 bw)
 {
diff --git a/drivers/net/ethernet/intel/ice/ice_sched.h b/drivers/net/ethernet/intel/ice/ice_sched.h
index 4f91577fed56..b906ec1b8563 100644
--- a/drivers/net/ethernet/intel/ice/ice_sched.h
+++ b/drivers/net/ethernet/intel/ice/ice_sched.h
@@ -69,6 +69,20 @@ int
 ice_aq_query_sched_elems(struct ice_hw *hw, u16 elems_req,
 			 struct ice_aqc_txsched_elem_data *buf, u16 buf_size,
 			 u16 *elems_ret, struct ice_sq_cd *cd);
+
+int
+ice_sched_set_node_bw_lmt(struct ice_port_info *pi, struct ice_sched_node *node,
+			  enum ice_rl_type rl_type, u32 bw);
+
+int
+ice_sched_set_node_bw(struct ice_port_info *pi, struct ice_sched_node *node,
+		      enum ice_rl_type rl_type, u32 bw, u8 layer_num);
+
+int
+ice_sched_add_elems(struct ice_port_info *pi, struct ice_sched_node *tc_node,
+		    struct ice_sched_node *parent, u8 layer, u16 num_nodes,
+		    u16 *num_nodes_added, u32 *first_node_teid);
+
 int ice_sched_init_port(struct ice_port_info *pi);
 int ice_sched_query_res_alloc(struct ice_hw *hw);
 void ice_sched_get_psm_clk_freq(struct ice_hw *hw);
@@ -81,7 +95,7 @@ struct ice_sched_node *
 ice_sched_find_node_by_teid(struct ice_sched_node *start_node, u32 teid);
 int
 ice_sched_add_node(struct ice_port_info *pi, u8 layer,
-		   struct ice_aqc_txsched_elem_data *info);
+		   struct ice_aqc_txsched_elem_data *info, u16 txq_id);
 void ice_free_sched_node(struct ice_port_info *pi, struct ice_sched_node *node);
 struct ice_sched_node *ice_sched_get_tc_node(struct ice_port_info *pi, u8 tc);
 struct ice_sched_node *
diff --git a/drivers/net/ethernet/intel/ice/ice_type.h b/drivers/net/ethernet/intel/ice/ice_type.h
index f2a518a1fd94..d6191b343dc0 100644
--- a/drivers/net/ethernet/intel/ice/ice_type.h
+++ b/drivers/net/ethernet/intel/ice/ice_type.h
@@ -523,7 +523,10 @@ struct ice_sched_node {
 	struct ice_sched_node **children;
 	struct ice_aqc_txsched_elem_data info;
 	u32 agg_id;			/* aggregator group ID */
+	u32 pf_queue_id;
 	u16 vsi_handle;
+	u32 classid;
+	u16 qid;
 	u8 in_use;			/* suspended or in use */
 	u8 tx_sched_layer;		/* Logical Layer (1-9) */
 	u8 num_children;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [RFC] ice: Reconfigure tx scheduling for SR-IOV
  2022-07-04 11:45 [RFC] ice: Reconfigure tx scheduling for SR-IOV Michal Wilczynski
@ 2022-07-05 22:15 ` Jakub Kicinski
  2022-07-06 10:54   ` Wilczynski, Michal
  0 siblings, 1 reply; 4+ messages in thread
From: Jakub Kicinski @ 2022-07-05 22:15 UTC (permalink / raw)
  To: Michal Wilczynski; +Cc: netdev, Dima Chumak, Maxim Mikityanskiy

On Mon,  4 Jul 2022 13:45:13 +0200 Michal Wilczynski wrote:
> If we were to follow normal flow, we would now use tc-filter family of
> commands to direct types of interesting traffic to the correct nodes.
> That is NOT the case in this implementation. In this POC, meaningful
> classid number identifies scheduling node. Number of qdisc handle is a
> queue number in a PF space. Reason for this - we want to support ALL
> queues on the card including SR-IOV ones that are assigned to VF
> netdevs.

Have you looked at the devlink rate API? It should do what you need.
Dima has been working on extending that API recently you may want to
compare notes with him as well: 

https://lore.kernel.org/all/20220620152647.2498927-1-dchumak@nvidia.com/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC] ice: Reconfigure tx scheduling for SR-IOV
  2022-07-05 22:15 ` Jakub Kicinski
@ 2022-07-06 10:54   ` Wilczynski, Michal
  2022-07-06 19:56     ` Jakub Kicinski
  0 siblings, 1 reply; 4+ messages in thread
From: Wilczynski, Michal @ 2022-07-06 10:54 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, Dima Chumak, Maxim Mikityanskiy, Knitter, Konrad

Hi,

Thank you for your e-mail.

I considered using devlink-rate, and it seems like a good fit. However 
we would also

need support for rate-limiting for individual queues on the VF. 
Currently we have

two types of rate objects in devlink-rate: leaf and node. Would adding a 
third one - queue be accepted ?

Also we might want to add some other object rate parameters to currently 
existing ones, for

example 'priority'.


If this sounds acceptable I will work on the patch and submit it as 
soon, as it's ready.


Thanks,

Michał


On 7/6/2022 12:15 AM, Jakub Kicinski wrote:
> On Mon,  4 Jul 2022 13:45:13 +0200 Michal Wilczynski wrote:
>> If we were to follow normal flow, we would now use tc-filter family of
>> commands to direct types of interesting traffic to the correct nodes.
>> That is NOT the case in this implementation. In this POC, meaningful
>> classid number identifies scheduling node. Number of qdisc handle is a
>> queue number in a PF space. Reason for this - we want to support ALL
>> queues on the card including SR-IOV ones that are assigned to VF
>> netdevs.
> Have you looked at the devlink rate API? It should do what you need.
> Dima has been working on extending that API recently you may want to
> compare notes with him as well:
>
> https://lore.kernel.org/all/20220620152647.2498927-1-dchumak@nvidia.com/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC] ice: Reconfigure tx scheduling for SR-IOV
  2022-07-06 10:54   ` Wilczynski, Michal
@ 2022-07-06 19:56     ` Jakub Kicinski
  0 siblings, 0 replies; 4+ messages in thread
From: Jakub Kicinski @ 2022-07-06 19:56 UTC (permalink / raw)
  To: Wilczynski, Michal
  Cc: netdev, Dima Chumak, Maxim Mikityanskiy, Knitter, Konrad,
	Jiri Pirko, Simon Horman

Reminder: please don't top post on the Linux lists.

On Wed, 6 Jul 2022 12:54:12 +0200 Wilczynski, Michal wrote:
> Hi,
> 
> Thank you for your e-mail.
> 
> I considered using devlink-rate, and it seems like a good fit. However 
> we would also
> 
> need support for rate-limiting for individual queues on the VF. 
> Currently we have
> 
> two types of rate objects in devlink-rate: leaf and node. Would adding a 
> third one - queue be accepted ?

Something along those lines. IIUC htb offload as admission control for
VF representors is not a thing today, so since devlink rate exists the
lowest amount of duplication would be teaching it about queues.

> Also we might want to add some other object rate parameters to currently 
> existing ones, for example 'priority'.

Presumably you can't admission control at a granularity higher than 
a queue, so grouping queues should cover all use cases.

> If this sounds acceptable I will work on the patch and submit it as 
> soon, as it's ready.

I'd be curious to hear from nVidia and Corigine folks as well.

We can revive the switchdev call if talking over VC helps with
the alignment between vendors.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2022-07-06 19:56 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-04 11:45 [RFC] ice: Reconfigure tx scheduling for SR-IOV Michal Wilczynski
2022-07-05 22:15 ` Jakub Kicinski
2022-07-06 10:54   ` Wilczynski, Michal
2022-07-06 19:56     ` Jakub Kicinski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.