[PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC
@ 2012-05-06  7:05 Amir Vadai
  2012-05-06  7:05 ` [PATCH net-next 1/2] net_sched/mqprio: add support for different pgroup types Amir Vadai
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Amir Vadai @ 2012-05-06  7:05 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, John Fastabend, Oren Duer, Liran Liss, Amir Vadai

This series comes to revive the discussion initiated on the thread "net:
support tx_ring per UP in HW based QoS mechanism" (see
http://marc.info/?t=133165957200004&r=1&w=2) with the major issue to be address
is - how should sk_prio<=>  TC be done, for both, tagged and untagged traffic.
Following is a staged description addressing the background, problem
description, current situation, suggestion for the change and implementation of
it.

Background
----------
Egress traffic has 3 layers of management to configure QoS attributes:
* Application - sets sk_prio
  * setsockopt() - application may set sk_prio using SO_PRIORITY or IP_TOS
* Host admin - sets sk_prio <=> UP
  * net_prio cgroup
  * Egress map for tagged traffic
* Net admin - sets UP <=> TC + TC QoS attributes
  * lldpad
Commit 4f57c087de9 "net: implement mechanism for HW based QOS" introduced a
mechanism for lower layer devices to steer traffic using skb->priority to tx
queues.

Problem
-------
How should sk_prio <=> TC be done, for both, tagged and untagged traffic?

Current situation
-----------------
* The network priority cgroup infrastructure commit 5bc1421e, introduced implicit
  assumption that sk_prio == UP.
* tc tool is used to map UP <=> TC for both tagged and untagged traffic.
* egress map and lldptool and ignored when tc tool is being used.
* HW queue is per TC.

Suggestion
----------
* sk_prio is an attribute controlled by the Application or cgroup.
  As used to be in tagged traffic
* tc tool is used by the Host admin and sets sk_prio <=> UP for untagged
  traffic. The rest of the chain is UP <=> TC mapped by the Net admin (using
  DCBx netlink).
  To keep backward compatibility, will have an option to set tc tool to
  compatabilty mode, in which, the old sk_prio <=> TC will be kept.
* Depending on HW, queue selection is by UP or by TC.

Implementation
--------------
Extended mqprio hw attribute:
* Bit 1: is queue offset/count owned by HW
* Bits 2-7: HW queueing type. 
  * 0 - by ETS TC
  * 1 - by UP

__skb_tx_hash() is now aware to the HW queuing type (pg_type): for pg_type
being ETS TC, traffic is distributed as it was before - tagged and untagged
packets are distributed by netdev_get_prio_tc_map. For pg_type being UP, tagged
and untagged packets are distributed by UP (taken from egress map for tagged
traffic, or netdev_get_prio_tc_map for untagged).

Amir Vadai (2):
  net_sched/mqprio: add support for different pgroup types
  net/mlx4_en: num cores tx rings for every UP

 drivers/net/ethernet/mellanox/mlx4/en_main.c   |    6 ++-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |   42 ++++++++++++++++++-----
 drivers/net/ethernet/mellanox/mlx4/en_tx.c     |   12 -------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |    9 ++---
 include/linux/netdevice.h                      |   27 +++++++++++++++
 include/linux/pkt_sched.h                      |    3 +-
 net/core/dev.c                                 |   12 +++++--
 net/sched/sch_mqprio.c                         |   11 +++++-
 8 files changed, 88 insertions(+), 34 deletions(-)

-- 
1.7.8.2

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH net-next 1/2] net_sched/mqprio: add support for different pgroup types
  2012-05-06  7:05 [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC Amir Vadai
@ 2012-05-06  7:05 ` Amir Vadai
  2012-05-09  6:36   ` John Fastabend
  2012-05-06  7:05 ` [PATCH net-next 2/2] net/mlx4_en: num cores tx rings for every UP Amir Vadai
  2012-05-08  0:54 ` [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC John Fastabend
  2 siblings, 1 reply; 9+ messages in thread
From: Amir Vadai @ 2012-05-06  7:05 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, John Fastabend, Oren Duer, Liran Liss, Amir Vadai

Currently, HW based QoS mechanisms use the framework and means introduced in
commits 4f57c087d "net: implement mechanism for HW based QOS" and b8970f0bfc
"net_sched: implement a root container qdisc sch_mqprio".

The approach present in these patches is strongly orientated to the extended
transmission selection (ETS) algorithm traffic classes (TC).

This patch enhances the current scheme to allow for these mechanisms to be used
also with hardware who has queues per UP - user priority (Linux has well
established mechanisms to set UP for both tagged and untagged traffic).

Now, __skb_tx_hash() will direct a flow to a tx ring from a range of tx rings.
This range is defined by the admin through the mqprio scheduler for the
specific HW. For TC based queues, the range is by TC number and for UP based
queues, the range is by UP.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
---
 include/linux/netdevice.h |   27 +++++++++++++++++++++++++++
 include/linux/pkt_sched.h |    3 ++-
 net/core/dev.c            |   12 +++++++++---
 net/sched/sch_mqprio.c    |   11 +++++++++--
 4 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7f377fb..ecdd953 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -835,6 +835,9 @@ struct netdev_fcoe_hbainfo {
  * 	is always called from the stack with the rtnl lock held and netif tx
  * 	queues stopped. This allows the netdevice to perform queue management
  * 	safely.
+ * int (*ndo_set_pg_type)(struct net_device *dev, u8 pg_type)
+ *	Called to setup 'tc' type. According to this type, traffic is
+ *	distributed across tx rings. If not set, ETS TC is in use.
  *
  *	Fiber Channel over Ethernet (FCoE) offload functions.
  * int (*ndo_fcoe_enable)(struct net_device *dev);
@@ -973,6 +976,8 @@ struct net_device_ops {
 	int			(*ndo_get_vf_port)(struct net_device *dev,
 						   int vf, struct sk_buff *skb);
 	int			(*ndo_setup_tc)(struct net_device *dev, u8 tc);
+	int			(*ndo_set_pg_type)(struct net_device *dev,
+						   u8 pg_type);
 #if IS_ENABLED(CONFIG_FCOE)
 	int			(*ndo_fcoe_enable)(struct net_device *dev);
 	int			(*ndo_fcoe_disable)(struct net_device *dev);
@@ -1307,6 +1312,11 @@ struct net_device {
 	/* Data Center Bridging netlink ops */
 	const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
+	enum {
+		PGROUP_TC,
+		PGROUP_UP,
+		PGROUP_MAX,
+	} pg_type:8;
 	u8 num_tc;
 	struct netdev_tc_txq tc_to_txq[TC_MAX_QUEUE];
 	u8 prio_tc_map[TC_BITMASK + 1];
@@ -1329,6 +1339,23 @@ struct net_device {
 #define	NETDEV_ALIGN		32
 
 static inline
+int netdev_get_pg_type(const struct net_device *dev)
+{
+	return dev->pg_type;
+}
+
+static inline
+int netdev_set_pg_type(struct net_device *dev, u8 pg_type)
+{
+	if (pg_type >= PGROUP_MAX)
+		return -EINVAL;
+
+	dev->pg_type = pg_type;
+
+	return 0;
+}
+
+static inline
 int netdev_get_prio_tc_map(const struct net_device *dev, u32 prio)
 {
 	return dev->prio_tc_map[prio & TC_BITMASK];
diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index ffe975c..1ae7d3c 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -596,7 +596,8 @@ struct tc_drr_stats {
 struct tc_mqprio_qopt {
 	__u8	num_tc;
 	__u8	prio_tc_map[TC_QOPT_BITMASK + 1];
-	__u8	hw;
+	__u8	hw;	/* bit 0: hw owned, bits 1-7: hw queuing type.
+			 * valid types: 0 - ETS TC, 1 - UP */
 	__u16	count[TC_QOPT_MAX_QUEUE];
 	__u16	offset[TC_QOPT_MAX_QUEUE];
 };
diff --git a/net/core/dev.c b/net/core/dev.c
index 09024fd..72ac4bf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2325,9 +2325,15 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 	}
 
 	if (dev->num_tc) {
-		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
-		qoffset = dev->tc_to_txq[tc].offset;
-		qcount = dev->tc_to_txq[tc].count;
+		u8 pgroup;
+
+		if (dev->pg_type == PGROUP_TC || !vlan_tx_tag_present(skb))
+			pgroup = netdev_get_prio_tc_map(dev, skb->priority);
+		else
+			pgroup = (vlan_tx_tag_get(skb) >> 13);
+
+		qoffset = dev->tc_to_txq[pgroup].offset;
+		qcount = dev->tc_to_txq[pgroup].count;
 	}
 
 	if (skb->sk && skb->sk->sk_hash)
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index d1831ca..2149cbb 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -134,11 +134,18 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
 		priv->qdiscs[i] = qdisc;
 	}
 
+	if (dev->netdev_ops->ndo_set_pg_type)
+		err = dev->netdev_ops->ndo_set_pg_type(dev, qopt->hw >> 1);
+	else
+		err = netdev_set_pg_type(dev, PGROUP_TC);
+	if (err)
+		goto err;
+
 	/* If the mqprio options indicate that hardware should own
 	 * the queue mapping then run ndo_setup_tc otherwise use the
 	 * supplied and verified mapping
 	 */
-	if (qopt->hw) {
+	if (qopt->hw & 1) {
 		priv->hw_owned = 1;
 		err = dev->netdev_ops->ndo_setup_tc(dev, qopt->num_tc);
 		if (err)
@@ -240,7 +247,7 @@ static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
 
 	opt.num_tc = netdev_get_num_tc(dev);
 	memcpy(opt.prio_tc_map, dev->prio_tc_map, sizeof(opt.prio_tc_map));
-	opt.hw = priv->hw_owned;
+	opt.hw = (!!priv->hw_owned & 1) | (netdev_get_pg_type(dev) << 1);
 
 	for (i = 0; i < netdev_get_num_tc(dev); i++) {
 		opt.count[i] = dev->tc_to_txq[i].count;
-- 
1.7.8.2

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH net-next 2/2] net/mlx4_en: num cores tx rings for every UP
  2012-05-06  7:05 [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC Amir Vadai
  2012-05-06  7:05 ` [PATCH net-next 1/2] net_sched/mqprio: add support for different pgroup types Amir Vadai
@ 2012-05-06  7:05 ` Amir Vadai
  2012-05-08  0:54 ` [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC John Fastabend
  2 siblings, 0 replies; 9+ messages in thread
From: Amir Vadai @ 2012-05-06  7:05 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, John Fastabend, Oren Duer, Liran Liss, Amir Vadai, Amir Vadai

Instead of having num cores tx rings for untagged traffic and only 1 tx ring
per UP, allocate num cores * num UP's tx rings - There is no reason why all
cores will share the same tx_ring when using tagged traffic.

Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
---
 drivers/net/ethernet/mellanox/mlx4/en_main.c   |    6 ++-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |   42 ++++++++++++++++++-----
 drivers/net/ethernet/mellanox/mlx4/en_tx.c     |   12 -------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |    9 ++---
 4 files changed, 41 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_main.c b/drivers/net/ethernet/mellanox/mlx4/en_main.c
index 346fdb2..988b242 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_main.c
@@ -101,6 +101,8 @@ static int mlx4_en_get_profile(struct mlx4_en_dev *mdev)
 	int i;
 
 	params->udp_rss = udp_rss;
+	params->num_tx_rings_p_up = min_t(int, num_online_cpus(),
+			MLX4_EN_MAX_TX_RING_P_UP);
 	if (params->udp_rss && !(mdev->dev->caps.flags
 					& MLX4_DEV_CAP_FLAG_UDP_RSS)) {
 		mlx4_warn(mdev, "UDP RSS is not supported on this device.\n");
@@ -113,8 +115,8 @@ static int mlx4_en_get_profile(struct mlx4_en_dev *mdev)
 		params->prof[i].tx_ppp = pfctx;
 		params->prof[i].tx_ring_size = MLX4_EN_DEF_TX_RING_SIZE;
 		params->prof[i].rx_ring_size = MLX4_EN_DEF_RX_RING_SIZE;
-		params->prof[i].tx_ring_num = MLX4_EN_NUM_TX_RINGS +
-			MLX4_EN_NUM_PPP_RINGS;
+		params->prof[i].tx_ring_num = params->num_tx_rings_p_up *
+			MLX4_EN_NUM_UP;
 		params->prof[i].rss_rings = 0;
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index eaa8fad..9cf85ba 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -47,12 +47,20 @@
 
 static int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 {
-	if (up != MLX4_EN_NUM_UP)
+	if (up && up != MLX4_EN_NUM_UP)
 		return -EINVAL;
 
 	return 0;
 }
 
+static int mlx4_en_set_pg_type(struct net_device *dev, u8 pg_type)
+{
+	if (pg_type != PGROUP_UP)
+		return -EINVAL;
+
+	return netdev_set_pg_type(dev, pg_type);
+}
+
 static int mlx4_en_vlan_rx_add_vid(struct net_device *dev, unsigned short vid)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -661,7 +669,7 @@ int mlx4_en_start_port(struct net_device *dev)
 		/* Configure ring */
 		tx_ring = &priv->tx_ring[i];
 		err = mlx4_en_activate_tx_ring(priv, tx_ring, cq->mcq.cqn,
-				max(0, i - MLX4_EN_NUM_TX_RINGS));
+			i / priv->mdev->profile.num_tx_rings_p_up);
 		if (err) {
 			en_err(priv, "Failed allocating Tx ring\n");
 			mlx4_en_deactivate_cq(priv, cq);
@@ -986,6 +994,9 @@ void mlx4_en_destroy_netdev(struct net_device *dev)
 
 	mlx4_en_free_resources(priv);
 
+	kfree(priv->tx_ring);
+	kfree(priv->tx_cq);
+
 	free_netdev(dev);
 }
 
@@ -1043,7 +1054,6 @@ static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_open		= mlx4_en_open,
 	.ndo_stop		= mlx4_en_close,
 	.ndo_start_xmit		= mlx4_en_xmit,
-	.ndo_select_queue	= mlx4_en_select_queue,
 	.ndo_get_stats		= mlx4_en_get_stats,
 	.ndo_set_rx_mode	= mlx4_en_set_multicast,
 	.ndo_set_mac_address	= mlx4_en_set_mac,
@@ -1057,6 +1067,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
 #endif
 	.ndo_set_features	= mlx4_en_set_features,
 	.ndo_setup_tc		= mlx4_en_setup_tc,
+	.ndo_set_pg_type	= mlx4_en_set_pg_type,
 };
 
 int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
@@ -1066,6 +1077,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	struct mlx4_en_priv *priv;
 	int i;
 	int err;
+	unsigned int q, offset = 0;
 
 	dev = alloc_etherdev_mqs(sizeof(struct mlx4_en_priv),
 	    prof->tx_ring_num, prof->rx_ring_num);
@@ -1091,6 +1103,18 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	priv->ctrl_flags = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE |
 			MLX4_WQE_CTRL_SOLICITED);
 	priv->tx_ring_num = prof->tx_ring_num;
+	priv->tx_ring = kzalloc(sizeof(struct mlx4_en_tx_ring) *
+			priv->tx_ring_num, GFP_KERNEL);
+	if (!priv->tx_ring) {
+		err = -ENOMEM;
+		goto out;
+	}
+	priv->tx_cq = kzalloc(sizeof(struct mlx4_en_cq) * priv->tx_ring_num,
+			GFP_KERNEL);
+	if (!priv->tx_cq) {
+		err = -ENOMEM;
+		goto out;
+	}
 	priv->rx_ring_num = prof->rx_ring_num;
 	priv->mac_index = -1;
 	priv->msg_enable = MLX4_EN_MSG_LEVEL;
@@ -1140,12 +1164,12 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 
 	netdev_set_num_tc(dev, MLX4_EN_NUM_UP);
 
-	/* First 9 rings are for UP 0 */
-	netdev_set_tc_queue(dev, 0, MLX4_EN_NUM_TX_RINGS + 1, 0);
-
-	/* Partition Tx queues evenly amongst UP's 1-7 */
-	for (i = 1; i < MLX4_EN_NUM_UP; i++)
-		netdev_set_tc_queue(dev, i, 1, MLX4_EN_NUM_TX_RINGS + i);
+	/* Partition Tx queues evenly amongst UP's */
+	q = priv->tx_ring_num / MLX4_EN_NUM_UP;
+	for (i = 0; i < MLX4_EN_NUM_UP; i++) {
+		netdev_set_tc_queue(dev, i, q, offset);
+		offset += q;
+	}
 
 	SET_ETHTOOL_OPS(dev, &mlx4_en_ethtool_ops);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 9a38483..ac735d2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -523,18 +523,6 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
 	tx_desc->ctrl.fence_size = (real_size / 16) & 0x3f;
 }
 
-u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
-{
-	u16 vlan_tag = 0;
-
-	if (vlan_tx_tag_present(skb)) {
-		vlan_tag = vlan_tx_tag_get(skb);
-		return MLX4_EN_NUM_TX_RINGS + (vlan_tag >> 13);
-	}
-
-	return skb_tx_hash(dev, skb);
-}
-
 static void mlx4_bf_copy(void __iomem *dst, unsigned long *src, unsigned bytecnt)
 {
 	__iowrite64_copy(dst, src, bytecnt / 8);
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 5d87637..6ae3509 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -111,9 +111,7 @@ enum {
 #define MLX4_EN_MIN_TX_SIZE	(4096 / TXBB_SIZE)
 
 #define MLX4_EN_SMALL_PKT_SIZE		64
-#define MLX4_EN_NUM_TX_RINGS		8
-#define MLX4_EN_NUM_PPP_RINGS		8
-#define MAX_TX_RINGS			(MLX4_EN_NUM_TX_RINGS + MLX4_EN_NUM_PPP_RINGS)
+#define MLX4_EN_MAX_TX_RING_P_UP	32
 #define MLX4_EN_NUM_UP			8
 #define MLX4_EN_DEF_TX_RING_SIZE	512
 #define MLX4_EN_DEF_RX_RING_SIZE  	1024
@@ -339,6 +337,7 @@ struct mlx4_en_profile {
 	u32 active_ports;
 	u32 small_pkt_int;
 	u8 no_reset;
+	u8 num_tx_rings_p_up;
 	struct mlx4_en_port_profile prof[MLX4_MAX_PORTS + 1];
 };
 
@@ -477,9 +476,9 @@ struct mlx4_en_priv {
 	u16 num_frags;
 	u16 log_rx_info;
 
-	struct mlx4_en_tx_ring tx_ring[MAX_TX_RINGS];
+	struct mlx4_en_tx_ring *tx_ring;
 	struct mlx4_en_rx_ring rx_ring[MAX_RX_RINGS];
-	struct mlx4_en_cq tx_cq[MAX_TX_RINGS];
+	struct mlx4_en_cq *tx_cq;
 	struct mlx4_en_cq rx_cq[MAX_RX_RINGS];
 	struct work_struct mcast_task;
 	struct work_struct mac_task;
-- 
1.7.8.2

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC
  2012-05-06  7:05 [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC Amir Vadai
  2012-05-06  7:05 ` [PATCH net-next 1/2] net_sched/mqprio: add support for different pgroup types Amir Vadai
  2012-05-06  7:05 ` [PATCH net-next 2/2] net/mlx4_en: num cores tx rings for every UP Amir Vadai
@ 2012-05-08  0:54 ` John Fastabend
  2012-05-08 13:56   ` Amir Vadai
  2 siblings, 1 reply; 9+ messages in thread
From: John Fastabend @ 2012-05-08  0:54 UTC (permalink / raw)
  To: Amir Vadai; +Cc: David S. Miller, netdev, Oren Duer, Liran Liss

On 5/6/2012 12:05 AM, Amir Vadai wrote:
> This series comes to revive the discussion initiated on the thread "net:
> support tx_ring per UP in HW based QoS mechanism" (see
> http://marc.info/?t=133165957200004&r=1&w=2) with the major issue to be address
> is - how should sk_prio<=>  TC be done, for both, tagged and untagged traffic.
> Following is a staged description addressing the background, problem
> description, current situation, suggestion for the change and implementation of
> it.

OK but the mqprio qdisc is only concerned with mapping skb->priority to
queue sets I perhaps unfortunately called the queue sets tc's. Try not
to get hung up on my perhaps limiting naming of variables and functions.

mqprio is used outside of 802.1Q as well in these cases a traffic class
is not usually even defined.

> 
> Background
> ----------
> Egress traffic has 3 layers of management to configure QoS attributes:
> * Application - sets sk_prio
>   * setsockopt() - application may set sk_prio using SO_PRIORITY or IP_TOS
> * Host admin - sets sk_prio <=> UP
>   * net_prio cgroup

net_prio cgroup is about setting skb->priority not really UP.

>   * Egress map for tagged traffic
> * Net admin - sets UP <=> TC + TC QoS attributes
>   * lldpad
> Commit 4f57c087de9 "net: implement mechanism for HW based QOS" introduced a
> mechanism for lower layer devices to steer traffic using skb->priority to tx
> queues.

yep that was my first comment.

> 
> Problem
> -------
> How should sk_prio <=> TC be done, for both, tagged and untagged traffic?
> 
> Current situation
> -----------------
> * The network priority cgroup infrastructure commit 5bc1421e, introduced implicit
>   assumption that sk_prio == UP.

I don't think this is true at least this is not my perspective.

The cgroup infrastructure in that commit sets the skb->priority. What you
do with the priority and if you map it to a UP or qdisc classes or some
other thing is up to the lower layers. In the DCBX context it eventually
gets mapped to some queue sets but that is only one usage case.

> * tc tool is used to map UP <=> TC for both tagged and untagged traffic.

Yes. But this is just a display issue. If I was being more general I
would have named this 'priority to queue set'. A bit short sighted
on my part.

> * egress map and lldptool and ignored when tc tool is being used.

Agreed. And this creates a bit of a nuisance but I believe it can
be resolved in user space. AFAIK all the interfaces needed are available.

> * HW queue is per TC.
> 
> Suggestion
> ----------
> * sk_prio is an attribute controlled by the Application or cgroup.
>   As used to be in tagged traffic

This is the current usage correct?

> * tc tool is used by the Host admin and sets sk_prio <=> UP for untagged
>   traffic. The rest of the chain is UP <=> TC mapped by the Net admin (using
>   DCBx netlink).

This seems fine. But, what exactly is broken today? If you use the current
mqprio qdisc and rename tc to queue_set then this works right?

>   To keep backward compatibility, will have an option to set tc tool to
>   compatabilty mode, in which, the old sk_prio <=> TC will be kept.
> * Depending on HW, queue selection is by UP or by TC.
> 
> Implementation
> --------------
> Extended mqprio hw attribute:
> * Bit 1: is queue offset/count owned by HW
> * Bits 2-7: HW queueing type. 
>   * 0 - by ETS TC
>   * 1 - by UP
> 
> __skb_tx_hash() is now aware to the HW queuing type (pg_type): for pg_type
> being ETS TC, traffic is distributed as it was before - tagged and untagged
> packets are distributed by netdev_get_prio_tc_map. For pg_type being UP, tagged
> and untagged packets are distributed by UP (taken from egress map for tagged
> traffic, or netdev_get_prio_tc_map for untagged).

I guess I don't see why we need this. If you keep the mqprio priority to
queue set mapping set to 1:1. Then modify the egress map accordingly then
this should work right?

For example:

If we want to map 8 802.1Q priority code points onto 4 traffic classes this
should work,

egress map: 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3      <-- vlan layer inserts correct tag
mqprio up2tc: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7    <-- mqprio with unfortunate 'tc' name maps priority to queues
dcbnl up2tc: 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3     <-- dcbx pushes up2tc mapping correctly

We may need to fixup userspace tools some but I think the kernel mechanisms
are in place.

BTW I did think about this while implementing the existing code and decided
that rather than create more if/else branches the case you are describing
could be handled by independently controlling the priority to queue set mappings
in mqprio and the egress map.

Feel free to let me know where I went wrong or why this doesn't work on
your hardware. I agree though we may need to fixup lldpad and maybe even
'tc' some to get this to look correct in user space and get automagically
setup correctly.

Thanks,
.John

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC
  2012-05-08  0:54 ` [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC John Fastabend
@ 2012-05-08 13:56   ` Amir Vadai
  2012-05-09  6:22     ` John Fastabend
  0 siblings, 1 reply; 9+ messages in thread
From: Amir Vadai @ 2012-05-08 13:56 UTC (permalink / raw)
  To: John Fastabend; +Cc: David S. Miller, netdev, Oren Duer, Liran Liss

On 05/08/2012 03:54 AM, John Fastabend wrote:
> On 5/6/2012 12:05 AM, Amir Vadai wrote:
>> This series comes to revive the discussion initiated on the thread "net:
>> support tx_ring per UP in HW based QoS mechanism" (see
>> http://marc.info/?t=133165957200004&r=1&w=2) with the major issue to be address
>> is - how should sk_prio<=>   TC be done, for both, tagged and untagged traffic.
>> Following is a staged description addressing the background, problem
>> description, current situation, suggestion for the change and implementation of
>> it.
>
> OK but the mqprio qdisc is only concerned with mapping skb->priority to
> queue sets I perhaps unfortunately called the queue sets tc's. Try not
> to get hung up on my perhaps limiting naming of variables and functions.
I understand that.

>
> mqprio is used outside of 802.1Q as well in these cases a traffic class
> is not usually even defined.
>
>>
>> Background
>> ----------
>> Egress traffic has 3 layers of management to configure QoS attributes:
>> * Application - sets sk_prio
>>    * setsockopt() - application may set sk_prio using SO_PRIORITY or IP_TOS
>> * Host admin - sets sk_prio<=>  UP
>>    * net_prio cgroup
>
> net_prio cgroup is about setting skb->priority not really UP.
Right.
>
>>    * Egress map for tagged traffic
>> * Net admin - sets UP<=>  TC + TC QoS attributes
>>    * lldpad
>> Commit 4f57c087de9 "net: implement mechanism for HW based QOS" introduced a
>> mechanism for lower layer devices to steer traffic using skb->priority to tx
>> queues.
>
> yep that was my first comment.
>
>>
>> Problem
>> -------
>> How should sk_prio<=>  TC be done, for both, tagged and untagged traffic?
>>
>> Current situation
>> -----------------
>> * The network priority cgroup infrastructure commit 5bc1421e, introduced implicit
>>    assumption that sk_prio == UP.
>
> I don't think this is true at least this is not my perspective.
>
> The cgroup infrastructure in that commit sets the skb->priority. What you
> do with the priority and if you map it to a UP or qdisc classes or some
> other thing is up to the lower layers. In the DCBX context it eventually
> gets mapped to some queue sets but that is only one usage case.
>
>> * tc tool is used to map UP<=>  TC for both tagged and untagged traffic.
>
> Yes. But this is just a display issue. If I was being more general I
> would have named this 'priority to queue set'. A bit short sighted
> on my part.
>
>> * egress map and lldptool and ignored when tc tool is being used.
>
> Agreed. And this creates a bit of a nuisance but I believe it can
> be resolved in user space. AFAIK all the interfaces needed are available.
>
>> * HW queue is per TC.
>>
>> Suggestion
>> ----------
>> * sk_prio is an attribute controlled by the Application or cgroup.
>>    As used to be in tagged traffic
>
> This is the current usage correct?
Yes
>
>> * tc tool is used by the Host admin and sets sk_prio<=>  UP for untagged
>>    traffic. The rest of the chain is UP<=>  TC mapped by the Net admin (using
>>    DCBx netlink).
>
> This seems fine. But, what exactly is broken today? If you use the current
> mqprio qdisc and rename tc to queue_set then this works right?
For untagged traffic it is perfectly ok. This will cause problems with
tagged traffic (below).
>
>
>>    To keep backward compatibility, will have an option to set tc tool to
>>    compatabilty mode, in which, the old sk_prio<=>  TC will be kept.
>> * Depending on HW, queue selection is by UP or by TC.
>>
>> Implementation
>> --------------
>> Extended mqprio hw attribute:
>> * Bit 1: is queue offset/count owned by HW
>> * Bits 2-7: HW queueing type.
>>    * 0 - by ETS TC
>>    * 1 - by UP
>>
>> __skb_tx_hash() is now aware to the HW queuing type (pg_type): for pg_type
>> being ETS TC, traffic is distributed as it was before - tagged and untagged
>> packets are distributed by netdev_get_prio_tc_map. For pg_type being UP, tagged
>> and untagged packets are distributed by UP (taken from egress map for tagged
>> traffic, or netdev_get_prio_tc_map for untagged).
>
> I guess I don't see why we need this. If you keep the mqprio priority to
> queue set mapping set to 1:1. Then modify the egress map accordingly then
> this should work right?
>
> For example:
>
> If we want to map 8 802.1Q priority code points onto 4 traffic classes this
> should work,
>
> egress map: 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- vlan layer inserts correct tag
> mqprio up2tc: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7<-- mqprio with unfortunate 'tc' name maps priority to queues
But mqprio is mapping sk_priority to queue set, which is different from 
UP to queue set.
The term UP which we use, is the 8021q PCP in tagged traffic, and a 
priority mapped by the host admin for untagged traffic.
What you wrote, is actually, that the application will set the priority 
without enabling the host admin to control it.
> dcbnl up2tc: 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- dcbx pushes up2tc mapping correctly

For example, lets take an application that calls setsockopt(SO_PRIORITY, 2):
according to egress map: 8021q PCP field in vlan tag should be set to 1 
(=UP)
according to mqprio: a tx ring belonging to UP 2 will be selected.
according to dcbnl: traffic will have ETS attributes of TC 1

Except some conceptual problems, it won't work:
8021q PCP field is set by HW according to the tx ring (UP=2). which
is different from the one that the user configured in the egress_map
(UP=1).

>
> We may need to fixup userspace tools some but I think the kernel mechanisms
> are in place.
>
> BTW I did think about this while implementing the existing code and decided
> that rather than create more if/else branches the case you are describing
> could be handled by independently controlling the priority to queue set mappings
> in mqprio and the egress map.
>
> Feel free to let me know where I went wrong or why this doesn't work on
> your hardware. I agree though we may need to fixup lldpad and maybe even
> 'tc' some to get this to look correct in user space and get automagically
> setup correctly.
>
> Thanks,
> .John

I think I understand your stand, that mqprio mapping should be generic
and should not be TC nor UP oriented.

The problem is that our HW needs the queues to be selected by UP. ETS
TC mapping is configured before traffic starts, and therefore ETS
attributes are enforced by HW according to the UP, which the tx ring
belongs to. Same thing for vlan tag in tagged traffic. 8021q PCP is
placed by HW according to the tx ring.

For backward compatibility, tagged traffic should be steered to a tx
ring by UP taken from egress map - skb_tx_hash() as it is today, can't
do it. Even if we implement ndo_select_queue(), we will need to
duplicate most of the code from skb_tx_hash(), because there are more
than one tx ring per UP, and we need skb_tx_hash() to be able to select
a ring from a range of rings belonging to a UP.

Also, there are some configurations that can't be done by mqprio and
egress map. For example when having two vlans with different egress
maps.

And in the conceptual level, I think that kernel should not accept bad
configurations and rely on user space to protect it.

- Amir

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC
  2012-05-08 13:56   ` Amir Vadai
@ 2012-05-09  6:22     ` John Fastabend
  2012-05-14 19:24       ` Amir Vadai
  0 siblings, 1 reply; 9+ messages in thread
From: John Fastabend @ 2012-05-09  6:22 UTC (permalink / raw)
  To: Amir Vadai
  Cc: David S. Miller, netdev, Oren Duer, Liran Liss, Jamal Hadi Salim

On 5/8/2012 6:56 AM, Amir Vadai wrote:
> On 05/08/2012 03:54 AM, John Fastabend wrote:
>> On 5/6/2012 12:05 AM, Amir Vadai wrote:
>>> This series comes to revive the discussion initiated on the thread "net:
>>> support tx_ring per UP in HW based QoS mechanism" (see
>>> http://marc.info/?t=133165957200004&r=1&w=2) with the major issue to be address
>>> is - how should sk_prio<=>   TC be done, for both, tagged and untagged traffic.
>>> Following is a staged description addressing the background, problem
>>> description, current situation, suggestion for the change and implementation of
>>> it.
>>
>> OK but the mqprio qdisc is only concerned with mapping skb->priority to
>> queue sets I perhaps unfortunately called the queue sets tc's. Try not
>> to get hung up on my perhaps limiting naming of variables and functions.
> I understand that.

I figured you did. Just wanted to state it again.

> 
>>
>> mqprio is used outside of 802.1Q as well in these cases a traffic class
>> is not usually even defined.
>>
>>>

[...]

>>> Implementation
>>> --------------
>>> Extended mqprio hw attribute:
>>> * Bit 1: is queue offset/count owned by HW
>>> * Bits 2-7: HW queueing type.
>>>    * 0 - by ETS TC
>>>    * 1 - by UP
>>>
>>> __skb_tx_hash() is now aware to the HW queuing type (pg_type): for pg_type
>>> being ETS TC, traffic is distributed as it was before - tagged and untagged
>>> packets are distributed by netdev_get_prio_tc_map. For pg_type being UP, tagged
>>> and untagged packets are distributed by UP (taken from egress map for tagged
>>> traffic, or netdev_get_prio_tc_map for untagged).
>>
>> I guess I don't see why we need this. If you keep the mqprio priority to
>> queue set mapping set to 1:1. Then modify the egress map accordingly then
>> this should work right?
>>
>> For example:
>>
>> If we want to map 8 802.1Q priority code points onto 4 traffic classes this
>> should work,
>>
>> egress map: 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- vlan layer inserts correct tag
>> mqprio up2tc: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7<-- mqprio with unfortunate 'tc' name maps priority to queues
> But mqprio is mapping sk_priority to queue set, which is different from UP to queue set.
> The term UP which we use, is the 8021q PCP in tagged traffic, and a priority mapped by the host admin for untagged traffic.
> What you wrote, is actually, that the application will set the priority without enabling the host admin to control it.
>> dcbnl up2tc: 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- dcbx pushes up2tc mapping correctly
> 
> For example, lets take an application that calls setsockopt(SO_PRIORITY, 2):
> according to egress map: 8021q PCP field in vlan tag should be set to 1 (=UP)
> according to mqprio: a tx ring belonging to UP 2 will be selected.
> according to dcbnl: traffic will have ETS attributes of TC 1
> 
> Except some conceptual problems, it won't work:
> 8021q PCP field is set by HW according to the tx ring (UP=2). which
> is different from the one that the user configured in the egress_map
> (UP=1).

sorry I set it up wrong above but it _can_ be made to work as best I
can tell (for a single egress_map at least)

egress map: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7     <-- ignored
mqprio    : 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7     <-- map priority to up queue sets
up2tc     : 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3     <-- dcbx negotiated up2tc map

With your example application does setsockopt(SO_PRIORITY, 2):
	according to egress map : insert PCP 2
	according to mqprio	: tx ring belonging to UP2 is selected
	accroding to dcbnl	: traffic will have ETS attributes of TC1

The point is either you use the skb->priority to PCP map and then a 1:1
mqprio map or you use the mqprio map directly. Agree?

One more example translating the case with these patches onto a case
without these patches as I understand them.

first with these patches mapping:

up2tc     : 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3     <-- dcbnl up2tc map 
egress map: 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3     <-- sets pcp bits
mqprio    : 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7     <-- with PGROUP_UP set

equivalent mapping without PGROUP_UP set

up2tc     : 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3     <-- dcbnl negotiated up2tc map 
egress map: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0     <-- default unset egress map
mqprio    : 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3     <-- with PGROUP_UP set

Application sets skb->priority to 2, in the first case egress map
sets the PCP to 1 and mqprio maps it to queues for UP1 based on PCP bits.

In case two egress map does nothing but skb->priority is still 2 so the
mqprio maps these to queues associated with UP1 so everything works
still.

I apologize if I'm missing something here but it seems correct to me. Spell
it out for me if I am still wrong.

> 
>>
>> We may need to fixup userspace tools some but I think the kernel mechanisms
>> are in place.
>>
>> BTW I did think about this while implementing the existing code and decided
>> that rather than create more if/else branches the case you are describing
>> could be handled by independently controlling the priority to queue set mappings
>> in mqprio and the egress map.
>>
>> Feel free to let me know where I went wrong or why this doesn't work on
>> your hardware. I agree though we may need to fixup lldpad and maybe even
>> 'tc' some to get this to look correct in user space and get automagically
>> setup correctly.
>>
>> Thanks,
>> .John
> 
> I think I understand your stand, that mqprio mapping should be generic
> and should not be TC nor UP oriented.
> 

Right. Also I want to avoid user space having to somehow "know" which
mode to put the qdisc in. Checking if the hardware is a mellanox card
does not scale.

> The problem is that our HW needs the queues to be selected by UP. ETS
> TC mapping is configured before traffic starts, and therefore ETS
> attributes are enforced by HW according to the UP, which the tx ring
> belongs to. Same thing for vlan tag in tagged traffic. 8021q PCP is
> placed by HW according to the tx ring.

OK. Can you agree though in the case where we restrict the egress_map
to be the same for all vlan's on a real_dev the existing mqprio map
_can_ work?

> 
> For backward compatibility, tagged traffic should be steered to a tx
> ring by UP taken from egress map - skb_tx_hash() as it is today, can't
> do it. Even if we implement ndo_select_queue(), we will need to
> duplicate most of the code from skb_tx_hash(), because there are more
> than one tx ring per UP, and we need skb_tx_hash() to be able to select
> a ring from a range of rings belonging to a UP.

agreed implementing this in select_queue() is no good.

> 
> Also, there are some configurations that can't be done by mqprio and
> egress map. For example when having two vlans with different egress
> maps.

This is a good example thanks. This currently doesn't work for hardware
that maps tx_rings to traffic classes either. So if we need/want to
solve this case then I guess we need something like your patches. Note
I think the patch you submitted should work regardless if you map
the PCP bits to UP queues or PCP bits to TC queues. We could use this
in both cases which helps my user space concerns.

> 
> And in the conceptual level, I think that kernel should not accept bad
> configurations and rely on user space to protect it.

I disagree policy should be managed in user space. Also I see no reason
to create a strong coupling between egress maps and mqprio maps. I can
imagine a case where they could be used independently. DCB is just one
usage model for mqprio. Using a bit like you did seems fine though.

> 
> - Amir

Assuming we want the multiple different egress map case to work I guess
we will have to do something like this. At least right now I can't think
of anything better but I'll think on it tomorrow some more.

The other thought would be to provide a qdisc hook before queue selection
to attach 'tc filters' and provide a new action to hash a skb across
queue sets.

Thanks,
John

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 1/2] net_sched/mqprio: add support for different pgroup types
  2012-05-06  7:05 ` [PATCH net-next 1/2] net_sched/mqprio: add support for different pgroup types Amir Vadai
@ 2012-05-09  6:36   ` John Fastabend
  0 siblings, 0 replies; 9+ messages in thread
From: John Fastabend @ 2012-05-09  6:36 UTC (permalink / raw)
  To: Amir Vadai; +Cc: David S. Miller, netdev, Oren Duer, Liran Liss

On 5/6/2012 12:05 AM, Amir Vadai wrote:
> Currently, HW based QoS mechanisms use the framework and means introduced in
> commits 4f57c087d "net: implement mechanism for HW based QOS" and b8970f0bfc
> "net_sched: implement a root container qdisc sch_mqprio".
> 
> The approach present in these patches is strongly orientated to the extended
> transmission selection (ETS) algorithm traffic classes (TC).
> 

I would argue the strongly orientated part per our other thread [0/2]

> This patch enhances the current scheme to allow for these mechanisms to be used
> also with hardware who has queues per UP - user priority (Linux has well
> established mechanisms to set UP for both tagged and untagged traffic).
> 

also per other thread I think this is only needed if you have many different
egress map configurations on vlans.

> Now, __skb_tx_hash() will direct a flow to a tx ring from a range of tx rings.
> This range is defined by the admin through the mqprio scheduler for the
> specific HW. For TC based queues, the range is by TC number and for UP based
> queues, the range is by UP.
> 
> Signed-off-by: Amir Vadai <amirv@mellanox.com>
> ---
>  include/linux/netdevice.h |   27 +++++++++++++++++++++++++++
>  include/linux/pkt_sched.h |    3 ++-
>  net/core/dev.c            |   12 +++++++++---
>  net/sched/sch_mqprio.c    |   11 +++++++++--
>  4 files changed, 47 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 7f377fb..ecdd953 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -835,6 +835,9 @@ struct netdev_fcoe_hbainfo {
>   * 	is always called from the stack with the rtnl lock held and netif tx
>   * 	queues stopped. This allows the netdevice to perform queue management
>   * 	safely.
> + * int (*ndo_set_pg_type)(struct net_device *dev, u8 pg_type)
> + *	Called to setup 'tc' type. According to this type, traffic is
> + *	distributed across tx rings. If not set, ETS TC is in use.
>   *
>   *	Fiber Channel over Ethernet (FCoE) offload functions.
>   * int (*ndo_fcoe_enable)(struct net_device *dev);
> @@ -973,6 +976,8 @@ struct net_device_ops {
>  	int			(*ndo_get_vf_port)(struct net_device *dev,
>  						   int vf, struct sk_buff *skb);
>  	int			(*ndo_setup_tc)(struct net_device *dev, u8 tc);
> +	int			(*ndo_set_pg_type)(struct net_device *dev,
> +						   u8 pg_type);

expand ndo_setup_tc() to take either another parameter 'pg_type' or just
start passing in the entire tc_mqprio_opt that way we get the number of
queues as well. I would prefer passing tc_mpqrio_opt to adding more parameters.

This avoids adding another ndo op.


>  #if IS_ENABLED(CONFIG_FCOE)
>  	int			(*ndo_fcoe_enable)(struct net_device *dev);
>  	int			(*ndo_fcoe_disable)(struct net_device *dev);
> @@ -1307,6 +1312,11 @@ struct net_device {
>  	/* Data Center Bridging netlink ops */
>  	const struct dcbnl_rtnl_ops *dcbnl_ops;
>  #endif
> +	enum {
> +		PGROUP_TC,
> +		PGROUP_UP,
> +		PGROUP_MAX,
> +	} pg_type:8;
>  	u8 num_tc;
>  	struct netdev_tc_txq tc_to_txq[TC_MAX_QUEUE];
>  	u8 prio_tc_map[TC_BITMASK + 1];
> @@ -1329,6 +1339,23 @@ struct net_device {
>  #define	NETDEV_ALIGN		32
>  
>  static inline
> +int netdev_get_pg_type(const struct net_device *dev)
> +{
> +	return dev->pg_type;
> +}
> +
> +static inline
> +int netdev_set_pg_type(struct net_device *dev, u8 pg_type)
> +{
> +	if (pg_type >= PGROUP_MAX)
> +		return -EINVAL;
> +
> +	dev->pg_type = pg_type;
> +
> +	return 0;
> +}
> +
> +static inline
>  int netdev_get_prio_tc_map(const struct net_device *dev, u32 prio)
>  {
>  	return dev->prio_tc_map[prio & TC_BITMASK];
> diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
> index ffe975c..1ae7d3c 100644
> --- a/include/linux/pkt_sched.h
> +++ b/include/linux/pkt_sched.h
> @@ -596,7 +596,8 @@ struct tc_drr_stats {
>  struct tc_mqprio_qopt {
>  	__u8	num_tc;
>  	__u8	prio_tc_map[TC_QOPT_BITMASK + 1];
> -	__u8	hw;
> +	__u8	hw;	/* bit 0: hw owned, bits 1-7: hw queuing type.
> +			 * valid types: 0 - ETS TC, 1 - UP */
>  	__u16	count[TC_QOPT_MAX_QUEUE];
>  	__u16	offset[TC_QOPT_MAX_QUEUE];
>  };
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 09024fd..72ac4bf 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2325,9 +2325,15 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
>  	}
>  
>  	if (dev->num_tc) {
> -		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
> -		qoffset = dev->tc_to_txq[tc].offset;
> -		qcount = dev->tc_to_txq[tc].count;
> +		u8 pgroup;
> +
> +		if (dev->pg_type == PGROUP_TC || !vlan_tx_tag_present(skb))
> +			pgroup = netdev_get_prio_tc_map(dev, skb->priority);
> +		else
> +			pgroup = (vlan_tx_tag_get(skb) >> 13);
> +
> +		qoffset = dev->tc_to_txq[pgroup].offset;
> +		qcount = dev->tc_to_txq[pgroup].count;
>  	}
>  
>  	if (skb->sk && skb->sk->sk_hash)
> diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
> index d1831ca..2149cbb 100644
> --- a/net/sched/sch_mqprio.c
> +++ b/net/sched/sch_mqprio.c
> @@ -134,11 +134,18 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
>  		priv->qdiscs[i] = qdisc;
>  	}
>  
> +	if (dev->netdev_ops->ndo_set_pg_type)
> +		err = dev->netdev_ops->ndo_set_pg_type(dev, qopt->hw >> 1);
> +	else
> +		err = netdev_set_pg_type(dev, PGROUP_TC);

Software should still be allowed to set PGROUP_UP even though hardware may
not support it.

> +	if (err)
> +		goto err;
> +
>  	/* If the mqprio options indicate that hardware should own
>  	 * the queue mapping then run ndo_setup_tc otherwise use the
>  	 * supplied and verified mapping
>  	 */
> -	if (qopt->hw) {
> +	if (qopt->hw & 1) {
>  		priv->hw_owned = 1;
>  		err = dev->netdev_ops->ndo_setup_tc(dev, qopt->num_tc);
>  		if (err)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC
  2012-05-09  6:22     ` John Fastabend
@ 2012-05-14 19:24       ` Amir Vadai
  2012-05-15 16:44         ` John Fastabend
  0 siblings, 1 reply; 9+ messages in thread
From: Amir Vadai @ 2012-05-14 19:24 UTC (permalink / raw)
  To: John Fastabend
  Cc: David S. Miller, netdev, Oren Duer, Liran Liss, Jamal Hadi Salim,
	Diego Crupnicoff, Or Gerlitz

On 05/09/2012 09:22 AM, John Fastabend wrote:
> On 5/8/2012 6:56 AM, Amir Vadai wrote:
>> On 05/08/2012 03:54 AM, John Fastabend wrote:
>>> On 5/6/2012 12:05 AM, Amir Vadai wrote:
>>>> This series comes to revive the discussion initiated on the thread "net:
>>>> support tx_ring per UP in HW based QoS mechanism" (see
>>>> http://marc.info/?t=133165957200004&r=1&w=2) with the major issue to be address
>>>> is - how should sk_prio<=>    TC be done, for both, tagged and untagged traffic.
>>>> Following is a staged description addressing the background, problem
>>>> description, current situation, suggestion for the change and implementation of
>>>> it.
>>>
>>> OK but the mqprio qdisc is only concerned with mapping skb->priority to
>>> queue sets I perhaps unfortunately called the queue sets tc's. Try not
>>> to get hung up on my perhaps limiting naming of variables and functions.
>> I understand that.
>
> I figured you did. Just wanted to state it again.
>
>>
>>>
>>> mqprio is used outside of 802.1Q as well in these cases a traffic class
>>> is not usually even defined.
>>>
>>>>
>
> [...]
>
>>>> Implementation
>>>> --------------
>>>> Extended mqprio hw attribute:
>>>> * Bit 1: is queue offset/count owned by HW
>>>> * Bits 2-7: HW queueing type.
>>>>     * 0 - by ETS TC
>>>>     * 1 - by UP
>>>>
>>>> __skb_tx_hash() is now aware to the HW queuing type (pg_type): for pg_type
>>>> being ETS TC, traffic is distributed as it was before - tagged and untagged
>>>> packets are distributed by netdev_get_prio_tc_map. For pg_type being UP, tagged
>>>> and untagged packets are distributed by UP (taken from egress map for tagged
>>>> traffic, or netdev_get_prio_tc_map for untagged).
>>>
>>> I guess I don't see why we need this. If you keep the mqprio priority to
>>> queue set mapping set to 1:1. Then modify the egress map accordingly then
>>> this should work right?
>>>
>>> For example:
>>>
>>> If we want to map 8 802.1Q priority code points onto 4 traffic classes this
>>> should work,
>>>
>>> egress map: 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- vlan layer inserts correct tag
>>> mqprio up2tc: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7<-- mqprio with unfortunate 'tc' name maps priority to queues
>> But mqprio is mapping sk_priority to queue set, which is different from UP to queue set.
>> The term UP which we use, is the 8021q PCP in tagged traffic, and a priority mapped by the host admin for untagged traffic.
>> What you wrote, is actually, that the application will set the priority without enabling the host admin to control it.
>>> dcbnl up2tc: 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- dcbx pushes up2tc mapping correctly
>>
>> For example, lets take an application that calls setsockopt(SO_PRIORITY, 2):
>> according to egress map: 8021q PCP field in vlan tag should be set to 1 (=UP)
>> according to mqprio: a tx ring belonging to UP 2 will be selected.
>> according to dcbnl: traffic will have ETS attributes of TC 1
>>
>> Except some conceptual problems, it won't work:
>> 8021q PCP field is set by HW according to the tx ring (UP=2). which
>> is different from the one that the user configured in the egress_map
>> (UP=1).
>
> sorry I set it up wrong above but it _can_ be made to work as best I
> can tell (for a single egress_map at least)
>
> egress map: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7<-- ignored
> mqprio    : 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7<-- map priority to up queue sets
> up2tc     : 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- dcbx negotiated up2tc map
>
> With your example application does setsockopt(SO_PRIORITY, 2):
> 	according to egress map : insert PCP 2
> 	according to mqprio	: tx ring belonging to UP2 is selected
> 	accroding to dcbnl	: traffic will have ETS attributes of TC1
>
> The point is either you use the skb->priority to PCP map and then a 1:1
> mqprio map or you use the mqprio map directly. Agree?
>
> One more example translating the case with these patches onto a case
> without these patches as I understand them.
>
> first with these patches mapping:
>
> up2tc     : 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- dcbnl up2tc map
> egress map: 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- sets pcp bits
> mqprio    : 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7<-- with PGROUP_UP set
>
> equivalent mapping without PGROUP_UP set
>
> up2tc     : 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- dcbnl negotiated up2tc map
> egress map: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0<-- default unset egress map
> mqprio    : 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- with PGROUP_UP set
>
> Application sets skb->priority to 2, in the first case egress map
> sets the PCP to 1 and mqprio maps it to queues for UP1 based on PCP bits.
>
> In case two egress map does nothing but skb->priority is still 2 so the
> mqprio maps these to queues associated with UP1 so everything works
> still.
>
> I apologize if I'm missing something here but it seems correct to me. Spell
> it out for me if I am still wrong.
>
>>
>>>
>>> We may need to fixup userspace tools some but I think the kernel mechanisms
>>> are in place.
>>>
>>> BTW I did think about this while implementing the existing code and decided
>>> that rather than create more if/else branches the case you are describing
>>> could be handled by independently controlling the priority to queue set mappings
>>> in mqprio and the egress map.
>>>
>>> Feel free to let me know where I went wrong or why this doesn't work on
>>> your hardware. I agree though we may need to fixup lldpad and maybe even
>>> 'tc' some to get this to look correct in user space and get automagically
>>> setup correctly.
>>>
>>> Thanks,
>>> .John
>>
>> I think I understand your stand, that mqprio mapping should be generic
>> and should not be TC nor UP oriented.
>>
>
> Right. Also I want to avoid user space having to somehow "know" which
> mode to put the qdisc in. Checking if the hardware is a mellanox card
> does not scale.
>
>> The problem is that our HW needs the queues to be selected by UP. ETS
>> TC mapping is configured before traffic starts, and therefore ETS
>> attributes are enforced by HW according to the UP, which the tx ring
>> belongs to. Same thing for vlan tag in tagged traffic. 8021q PCP is
>> placed by HW according to the tx ring.
>
> OK. Can you agree though in the case where we restrict the egress_map
> to be the same for all vlan's on a real_dev the existing mqprio map
> _can_ work?
>
>>
>> For backward compatibility, tagged traffic should be steered to a tx
>> ring by UP taken from egress map - skb_tx_hash() as it is today, can't
>> do it. Even if we implement ndo_select_queue(), we will need to
>> duplicate most of the code from skb_tx_hash(), because there are more
>> than one tx ring per UP, and we need skb_tx_hash() to be able to select
>> a ring from a range of rings belonging to a UP.
>
> agreed implementing this in select_queue() is no good.
>
>>
>> Also, there are some configurations that can't be done by mqprio and
>> egress map. For example when having two vlans with different egress
>> maps.
>
> This is a good example thanks. This currently doesn't work for hardware
> that maps tx_rings to traffic classes either. So if we need/want to
> solve this case then I guess we need something like your patches. Note
> I think the patch you submitted should work regardless if you map
> the PCP bits to UP queues or PCP bits to TC queues. We could use this
> in both cases which helps my user space concerns.
>
>>
>> And in the conceptual level, I think that kernel should not accept bad
>> configurations and rely on user space to protect it.
>
> I disagree policy should be managed in user space. Also I see no reason
> to create a strong coupling between egress maps and mqprio maps. I can
> imagine a case where they could be used independently. DCB is just one
> usage model for mqprio. Using a bit like you did seems fine though.
>
>>
>> - Amir
>
> Assuming we want the multiple different egress map case to work I guess
> we will have to do something like this. At least right now I can't think
> of anything better but I'll think on it tomorrow some more.
>
> The other thought would be to provide a qdisc hook before queue selection
> to attach 'tc filters' and provide a new action to hash a skb across
> queue sets.
>
> Thanks,
> John

John Hi,

After some internal discussions, it was agreed to line up with your 
approach, to leave mqprio an abstract skb->priority <=> queue set 
mapping and to ignore egress_map if mqprio is enabled.

It would be very nice, if the term 'tc' in kernel code would be replaced 
to queue set, since it is very misleading.

There still might be some small issues with skb_tx_hash for tagged 
traffic, which I will work on tomorrow, and hopefully will send a new 
patch set with the solution.

Thanks,
Amir

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC
  2012-05-14 19:24       ` Amir Vadai
@ 2012-05-15 16:44         ` John Fastabend
  0 siblings, 0 replies; 9+ messages in thread
From: John Fastabend @ 2012-05-15 16:44 UTC (permalink / raw)
  To: Amir Vadai
  Cc: David S. Miller, netdev, Oren Duer, Liran Liss, Jamal Hadi Salim,
	Diego Crupnicoff, Or Gerlitz

On 5/14/2012 12:24 PM, Amir Vadai wrote:
>>>> On 5/6/2012 12:05 AM, Amir Vadai wrote:
>>>>> This series comes to revive the discussion initiated on the thread "net:
>>>>> support tx_ring per UP in HW based QoS mechanism" (see
>>>>> http://marc.info/?t=133165957200004&r=1&w=2) with the major issue to be address
>>>>> is - how should sk_prio<=>    TC be done, for both, tagged and untagged traffic.
>>>>> Following is a staged description addressing the background, problem
>>>>> description, current situation, suggestion for the change and implementation of
>>>>> it.

[...]

> John Hi,
> 
> After some internal discussions, it was agreed to line up with your
> approach, to leave mqprio an abstract skb->priority <=> queue set
> mapping and to ignore egress_map if mqprio is enabled.
> 

OK sounds good.

> It would be very nice, if the term 'tc' in kernel code would be
> replaced to queue set, since it is very misleading.
> 

Go ahead and write up a patch. Just be careful not to break existing
user visible API. I agree it is confusing.

> There still might be some small issues with skb_tx_hash for tagged
> traffic, which I will work on tomorrow, and hopefully will send a new
> patch set with the solution.
> 

What are the issues? Lets see a patch.

Thanks,
John

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2012-05-15 16:44 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-06  7:05 [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC Amir Vadai
2012-05-06  7:05 ` [PATCH net-next 1/2] net_sched/mqprio: add support for different pgroup types Amir Vadai
2012-05-09  6:36   ` John Fastabend
2012-05-06  7:05 ` [PATCH net-next 2/2] net/mlx4_en: num cores tx rings for every UP Amir Vadai
2012-05-08  0:54 ` [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC John Fastabend
2012-05-08 13:56   ` Amir Vadai
2012-05-09  6:22     ` John Fastabend
2012-05-14 19:24       ` Amir Vadai
2012-05-15 16:44         ` John Fastabend

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.