netdev.vger.kernel.org archive mirror
* [PATCH 0/8] net/mlx4_en: DCB QoS support
@ 2012-03-13 17:21 Amir Vadai
  2012-03-13 17:21 ` [PATCH 1/8] net/mlx4_en: Force user priority by QP attribute Amir Vadai
                   ` (8 more replies)
  0 siblings, 9 replies; 20+ messages in thread
From: Amir Vadai @ 2012-03-13 17:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Roland Dreier, Oren Duer, Amir Vadai

DCBX version IEEE 802.1Qaz is supported.
User Priority (UP) is set in the QP context instead of in the WQE (Work Queue
Element), which means that all traffic from a queue has the same UP.
UP is also set for untagged traffic, so that such traffic can be classified too.

Mapping from sk_prio to User Priority is done by the sch_mqprio mapping.
Confusingly, sch_mqprio maps sk_prio to something it calls TC; this is not
related to DCBX's TC, and the mlx4_en driver interprets it as UP.

The current HW-based QoS mechanism, introduced in commit 4f57c087de9
("net: implement mechanism for HW based QOS"), is oriented toward ETS traffic
classes. Patch 7/8 introduces an approach that allows this mechanism to be used
also with hardware that has queues per user priority (UP). After the change,
__skb_tx_hash() directs a flow to a tx ring chosen from a range of tx rings.
The range is defined by the caller according to the specific HW: for TC-based
queues the range is selected by TC number, and for UP-based queues by UP.

Amir Vadai (8):
  net/mlx4_en: Force user priority by QP attribute
  net/mlx4_core: set port QoS attributes
  net/mlx4_en: DCB QoS support
  net/mlx4_en: Set max rate-limit for a TC
  net/mlx4_en: sk_prio <=> UP for untagged traffic
  IB/rdma_cm: TOS <=> UP mapping for IBoE
  net: support tx_ring per UP in HW based QoS mechanism
  net/mlx4_en: num cores tx rings for every UP

 drivers/infiniband/core/cma.c                     |   35 ++++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |   11 +-
 drivers/net/ethernet/mellanox/mlx4/Kconfig        |   12 ++
 drivers/net/ethernet/mellanox/mlx4/Makefile       |    1 +
 drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c    |  215 +++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_main.c      |    6 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c    |   64 ++++++-
 drivers/net/ethernet/mellanox/mlx4/en_port.h      |    2 +
 drivers/net/ethernet/mellanox/mlx4/en_resources.c |    6 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c        |    4 +-
 drivers/net/ethernet/mellanox/mlx4/en_sysfs.c     |  120 ++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c        |   20 +-
 drivers/net/ethernet/mellanox/mlx4/mlx4.h         |   20 ++
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h      |   38 +++-
 drivers/net/ethernet/mellanox/mlx4/port.c         |   62 ++++++
 include/linux/mlx4/cmd.h                          |    4 +
 include/linux/mlx4/device.h                       |    3 +
 include/linux/mlx4/qp.h                           |    3 +-
 include/linux/netdevice.h                         |   12 +-
 include/linux/skbuff.h                            |    3 +-
 net/core/dev.c                                    |   10 +-
 21 files changed, 615 insertions(+), 36 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx4/en_sysfs.c

-- 
1.7.8.2


* [PATCH 1/8] net/mlx4_en: Force user priority by QP attribute
  2012-03-13 17:21 [PATCH 0/8] net/mlx4_en: DCB QoS support Amir Vadai
@ 2012-03-13 17:21 ` Amir Vadai
  2012-03-13 17:21 ` [PATCH 2/8] net/mlx4_core: set port QoS attributes Amir Vadai
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Amir Vadai @ 2012-03-13 17:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Roland Dreier, Oren Duer, Amir Vadai

From: Amir Vadai <amirv@mellanox.co.il>

Instead of relying on HW to select the schedule queue by UP, the schedule
queue is fixed per tx_ring, and the UP in the WQE is ignored for this purpose.
This resolves two issues with untagged traffic:
1. Untagged traffic has no UP in the packet, which is needed for QoS. The change
   above allows setting the schedule queue (and by that the UP) of such a stream.
2. BlueFlame uses the same field that is used by the vlan tag. Forcing UP from
   the QPC therefore allows using BF for untagged but prioritized traffic.

With old firmware that does not support forcing UP, untagged traffic will not
be subject to QoS.

Because UP is set per QP, a tx ring per UP is always needed, even if the pfcrx
module parameter is false.
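
As an illustration of the bit layout used by the diff in this patch, here is
a small userspace sketch. The expressions mirror mlx4_en_fill_qp_context()
and the ring-to-UP argument passed in mlx4_en_start_port(); struct path_bits,
fill_path_bits() and ring_to_up() are hypothetical names, and the meaning of
the individual bits is taken from the diff only, not from a hardware manual.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_TX_RINGS 8	/* stands in for MLX4_EN_NUM_TX_RINGS */

/* Hypothetical container for the two path fields touched by the patch. */
struct path_bits {
	uint8_t sched_queue;
	uint8_t feup;
};

/* Rings below NUM_TX_RINGS get UP 0, later rings get UP 0..7 in order. */
static int ring_to_up(int ring)
{
	return ring > NUM_TX_RINGS ? ring - NUM_TX_RINGS : 0;
}

/* A negative user_prio means "do not force a UP" (used for rx/RSS QPs). */
static struct path_bits fill_path_bits(int port, int user_prio)
{
	struct path_bits p = { 0, 0 };

	assert(port == 1 || port == 2);
	assert(user_prio < 8);

	p.sched_queue = (uint8_t)(0x83 | ((port - 1) << 6));
	if (user_prio >= 0) {
		p.sched_queue |= (uint8_t)(user_prio << 3);
		p.feup = 1 << 6;	/* force-UP bit, as set in the diff */
	}
	return p;
}
```

For example, port 2 with UP 5 gives sched_queue 0xeb and feup 0x40, while an
rx QP on port 1 (user_prio -1) keeps sched_queue at 0x83 and feup at 0.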

Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
---
 drivers/net/ethernet/mellanox/mlx4/en_main.c      |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c    |    3 ++-
 drivers/net/ethernet/mellanox/mlx4/en_resources.c |    6 +++++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c        |    4 ++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c        |    8 ++++----
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h      |    6 +++---
 include/linux/mlx4/qp.h                           |    3 ++-
 7 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_main.c b/drivers/net/ethernet/mellanox/mlx4/en_main.c
index 2097a7d..346fdb2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_main.c
@@ -114,7 +114,7 @@ static int mlx4_en_get_profile(struct mlx4_en_dev *mdev)
 		params->prof[i].tx_ring_size = MLX4_EN_DEF_TX_RING_SIZE;
 		params->prof[i].rx_ring_size = MLX4_EN_DEF_RX_RING_SIZE;
 		params->prof[i].tx_ring_num = MLX4_EN_NUM_TX_RINGS +
-			(!!pfcrx) * MLX4_EN_NUM_PPP_RINGS;
+			MLX4_EN_NUM_PPP_RINGS;
 		params->prof[i].rss_rings = 0;
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 31b455a..2322622 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -650,7 +650,8 @@ int mlx4_en_start_port(struct net_device *dev)
 
 		/* Configure ring */
 		tx_ring = &priv->tx_ring[i];
-		err = mlx4_en_activate_tx_ring(priv, tx_ring, cq->mcq.cqn);
+		err = mlx4_en_activate_tx_ring(priv, tx_ring, cq->mcq.cqn,
+				max(0, i - MLX4_EN_NUM_TX_RINGS));
 		if (err) {
 			en_err(priv, "Failed allocating Tx ring\n");
 			mlx4_en_deactivate_cq(priv, cq);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_resources.c b/drivers/net/ethernet/mellanox/mlx4/en_resources.c
index bcbc54c..10c24c7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_resources.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_resources.c
@@ -39,7 +39,7 @@
 
 void mlx4_en_fill_qp_context(struct mlx4_en_priv *priv, int size, int stride,
 			     int is_tx, int rss, int qpn, int cqn,
-			     struct mlx4_qp_context *context)
+			     int user_prio, struct mlx4_qp_context *context)
 {
 	struct mlx4_en_dev *mdev = priv->mdev;
 
@@ -57,6 +57,10 @@ void mlx4_en_fill_qp_context(struct mlx4_en_priv *priv, int size, int stride,
 	context->local_qpn = cpu_to_be32(qpn);
 	context->pri_path.ackto = 1 & 0x07;
 	context->pri_path.sched_queue = 0x83 | (priv->port - 1) << 6;
+	if (user_prio >= 0) {
+		context->pri_path.sched_queue |= user_prio << 3;
+		context->pri_path.feup = 1 << 6;
+	}
 	context->pri_path.counter_index = 0xff;
 	context->cqn_send = cpu_to_be32(cqn);
 	context->cqn_recv = cpu_to_be32(cqn);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 9adbd53..d49a7ac 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -823,7 +823,7 @@ static int mlx4_en_config_rss_qp(struct mlx4_en_priv *priv, int qpn,
 
 	memset(context, 0, sizeof *context);
 	mlx4_en_fill_qp_context(priv, ring->actual_size, ring->stride, 0, 0,
-				qpn, ring->cqn, context);
+				qpn, ring->cqn, -1, context);
 	context->db_rec_addr = cpu_to_be64(ring->wqres.db.dma);
 
 	/* Cancel FCS removal if FW allows */
@@ -890,7 +890,7 @@ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv)
 	}
 	rss_map->indir_qp.event = mlx4_en_sqp_event;
 	mlx4_en_fill_qp_context(priv, 0, 0, 0, 1, priv->base_qpn,
-				priv->rx_ring[0].cqn, &context);
+				priv->rx_ring[0].cqn, -1, &context);
 
 	if (!priv->prof->rss_rings || priv->prof->rss_rings > priv->rx_ring_num)
 		rss_rings = priv->rx_ring_num;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 1796824..9787539 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -156,7 +156,7 @@ void mlx4_en_destroy_tx_ring(struct mlx4_en_priv *priv,
 
 int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
 			     struct mlx4_en_tx_ring *ring,
-			     int cq)
+			     int cq, int user_prio)
 {
 	struct mlx4_en_dev *mdev = priv->mdev;
 	int err;
@@ -174,7 +174,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
 	ring->doorbell_qpn = ring->qp.qpn << 8;
 
 	mlx4_en_fill_qp_context(priv, ring->size, ring->stride, 1, 0, ring->qpn,
-				ring->cqn, &ring->context);
+				ring->cqn, user_prio, &ring->context);
 	if (ring->bf_enabled)
 		ring->context.usr_page = cpu_to_be32(ring->bf.uar->index);
 
@@ -576,12 +576,12 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
 	/* If we support per priority flow control and the packet contains
 	 * a vlan tag, send the packet to the TX ring assigned to that priority
 	 */
-	if (priv->prof->rx_ppp && vlan_tx_tag_present(skb)) {
+	if (vlan_tx_tag_present(skb)) {
 		vlan_tag = vlan_tx_tag_get(skb);
 		return MLX4_EN_NUM_TX_RINGS + (vlan_tag >> 13);
 	}
 
-	return skb_tx_hash(dev, skb);
+	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS);
 }
 
 static void mlx4_bf_copy(void __iomem *dst, unsigned long *src, unsigned bytecnt)
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 9e2b911..5bd7c2a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -521,7 +521,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv, struct mlx4_en_tx_ring *ri
 void mlx4_en_destroy_tx_ring(struct mlx4_en_priv *priv, struct mlx4_en_tx_ring *ring);
 int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
 			     struct mlx4_en_tx_ring *ring,
-			     int cq);
+			     int cq, int user_prio);
 void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
 				struct mlx4_en_tx_ring *ring);
 
@@ -539,8 +539,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev,
 			  int budget);
 int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget);
 void mlx4_en_fill_qp_context(struct mlx4_en_priv *priv, int size, int stride,
-			     int is_tx, int rss, int qpn, int cqn,
-			     struct mlx4_qp_context *context);
+		int is_tx, int rss, int qpn, int cqn, int user_prio,
+		struct mlx4_qp_context *context);
 void mlx4_en_sqp_event(struct mlx4_qp *qp, enum mlx4_event event);
 int mlx4_en_map_buffer(struct mlx4_buf *buf);
 void mlx4_en_unmap_buffer(struct mlx4_buf *buf);
diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h
index 091f9e7..96005d7 100644
--- a/include/linux/mlx4/qp.h
+++ b/include/linux/mlx4/qp.h
@@ -139,7 +139,8 @@ struct mlx4_qp_path {
 	u8			rgid[16];
 	u8			sched_queue;
 	u8			vlan_index;
-	u8			reserved3[2];
+	u8			feup;
+	u8			reserved3;
 	u8			reserved4[2];
 	u8			dmac[6];
 };
-- 
1.7.8.2


* [PATCH 2/8] net/mlx4_core: set port QoS attributes
  2012-03-13 17:21 [PATCH 0/8] net/mlx4_en: DCB QoS support Amir Vadai
  2012-03-13 17:21 ` [PATCH 1/8] net/mlx4_en: Force user priority by QP attribute Amir Vadai
@ 2012-03-13 17:21 ` Amir Vadai
  2012-03-13 17:21 ` [PATCH 3/8] net/mlx4_en: DCB QoS support Amir Vadai
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Amir Vadai @ 2012-03-13 17:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Roland Dreier, Oren Duer, Amir Vadai

From: Amir Vadai <amirv@mellanox.co.il>

Add QoS firmware commands:
- mlx4_SET_PORT_PRIO2TC - set the UP <=> TC mapping
- mlx4_SET_PORT_SCHEDULER - set promised BW, max BW and PG number
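
The PRIO2TC mailbox packs two 4-bit TC numbers into each byte. The following
userspace sketch reproduces only that packing loop from the diff;
pack_prio2tc() and NUM_UP are hypothetical names, and no firmware command is
issued.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_UP 8	/* stands in for MLX4_NUM_UP */

/*
 * Pack the TC of UP i into the high nibble and the TC of UP i + 1 into
 * the low nibble of out[i / 2], for every even i.
 */
static void pack_prio2tc(const uint8_t prio2tc[NUM_UP],
			 uint8_t out[NUM_UP / 2])
{
	int i;

	for (i = 0; i < NUM_UP; i += 2) {
		/* each TC must fit in a nibble */
		assert(prio2tc[i] < 16 && prio2tc[i + 1] < 16);
		out[i >> 1] = (uint8_t)(prio2tc[i] << 4 | prio2tc[i + 1]);
	}
}
```

With the identity mapping (UP n to TC n) the four mailbox bytes come out as
0x01, 0x23, 0x45 and 0x67.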

Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
---
 drivers/net/ethernet/mellanox/mlx4/en_port.h |    2 +
 drivers/net/ethernet/mellanox/mlx4/mlx4.h    |   20 ++++++++
 drivers/net/ethernet/mellanox/mlx4/port.c    |   62 ++++++++++++++++++++++++++
 include/linux/mlx4/cmd.h                     |    4 ++
 include/linux/mlx4/device.h                  |    3 +
 5 files changed, 91 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.h b/drivers/net/ethernet/mellanox/mlx4/en_port.h
index 6934fd7..745090b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_port.h
+++ b/drivers/net/ethernet/mellanox/mlx4/en_port.h
@@ -39,6 +39,8 @@
 #define SET_PORT_PROMISC_SHIFT	31
 #define SET_PORT_MC_PROMISC_SHIFT	30
 
+#define MLX4_EN_NUM_TC		8
+
 #define VLAN_FLTR_SIZE	128
 struct mlx4_set_vlan_fltr_mbox {
 	__be32 entry[VLAN_FLTR_SIZE];
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4.h b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
index 5da51b9..33bf39d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4.h
@@ -53,6 +53,26 @@
 #define DRV_VERSION	"1.1"
 #define DRV_RELDATE	"Dec, 2011"
 
+#define MLX4_NUM_UP		8
+#define MLX4_NUM_TC		8
+#define MLX4_RATELIMIT_UNITS 3 /* 100 Mbps */
+#define MLX4_RATELIMIT_DEFAULT 0xffff
+
+struct mlx4_set_port_prio2tc_context {
+	u8 prio2tc[4];
+};
+
+struct mlx4_port_scheduler_tc_cfg_be {
+	__be16 pg;
+	__be16 bw_precentage;
+	__be16 max_bw_units; /* 3-100Mbps, 4-1Gbps, other values - reserved */
+	__be16 max_bw_value;
+};
+
+struct mlx4_set_port_scheduler_context {
+	struct mlx4_port_scheduler_tc_cfg_be tc[MLX4_NUM_TC];
+};
+
 enum {
 	MLX4_HCR_BASE		= 0x80680,
 	MLX4_HCR_SIZE		= 0x0001c,
diff --git a/drivers/net/ethernet/mellanox/mlx4/port.c b/drivers/net/ethernet/mellanox/mlx4/port.c
index 98e7762..762bbda 100644
--- a/drivers/net/ethernet/mellanox/mlx4/port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/port.c
@@ -858,6 +858,68 @@ int mlx4_SET_PORT_qpn_calc(struct mlx4_dev *dev, u8 port, u32 base_qpn,
 }
 EXPORT_SYMBOL(mlx4_SET_PORT_qpn_calc);
 
+int mlx4_SET_PORT_PRIO2TC(struct mlx4_dev *dev, u8 port, u8 *prio2tc)
+{
+	struct mlx4_cmd_mailbox *mailbox;
+	struct mlx4_set_port_prio2tc_context *context;
+	int err;
+	u32 in_mod;
+	int i;
+
+	mailbox = mlx4_alloc_cmd_mailbox(dev);
+	if (IS_ERR(mailbox))
+		return PTR_ERR(mailbox);
+	context = mailbox->buf;
+	memset(context, 0, sizeof *context);
+
+	for (i = 0; i < MLX4_NUM_UP; i += 2)
+		context->prio2tc[i >> 1] = prio2tc[i] << 4 | prio2tc[i + 1];
+
+	in_mod = MLX4_SET_PORT_PRIO2TC << 8 | port;
+	err = mlx4_cmd(dev, mailbox->dma, in_mod, 1, MLX4_CMD_SET_PORT,
+		       MLX4_CMD_TIME_CLASS_B, MLX4_CMD_NATIVE);
+
+	mlx4_free_cmd_mailbox(dev, mailbox);
+	return err;
+}
+EXPORT_SYMBOL(mlx4_SET_PORT_PRIO2TC);
+
+int mlx4_SET_PORT_SCHEDULER(struct mlx4_dev *dev, u8 port, u8 *tc_tx_bw,
+		u8 *pg, u16 *ratelimit)
+{
+	struct mlx4_cmd_mailbox *mailbox;
+	struct mlx4_set_port_scheduler_context *context;
+	int err;
+	u32 in_mod;
+	int i;
+
+	mailbox = mlx4_alloc_cmd_mailbox(dev);
+	if (IS_ERR(mailbox))
+		return PTR_ERR(mailbox);
+	context = mailbox->buf;
+	memset(context, 0, sizeof *context);
+
+	for (i = 0; i < MLX4_NUM_TC; i++) {
+		struct mlx4_port_scheduler_tc_cfg_be *tc = &context->tc[i];
+		u16 r = ratelimit && ratelimit[i] ? ratelimit[i] :
+			MLX4_RATELIMIT_DEFAULT;
+
+		tc->pg = htons(pg[i]);
+		tc->bw_precentage = htons(tc_tx_bw[i]);
+
+		tc->max_bw_units = htons(MLX4_RATELIMIT_UNITS);
+		tc->max_bw_value = htons(r);
+	}
+
+	in_mod = MLX4_SET_PORT_SCHEDULER << 8 | port;
+	err = mlx4_cmd(dev, mailbox->dma, in_mod, 1, MLX4_CMD_SET_PORT,
+		       MLX4_CMD_TIME_CLASS_B, MLX4_CMD_NATIVE);
+
+	mlx4_free_cmd_mailbox(dev, mailbox);
+	return err;
+}
+EXPORT_SYMBOL(mlx4_SET_PORT_SCHEDULER);
+
 int mlx4_SET_MCAST_FLTR_wrapper(struct mlx4_dev *dev, int slave,
 				struct mlx4_vhcr *vhcr,
 				struct mlx4_cmd_mailbox *inbox,
diff --git a/include/linux/mlx4/cmd.h b/include/linux/mlx4/cmd.h
index 9958ff2..1f3860a 100644
--- a/include/linux/mlx4/cmd.h
+++ b/include/linux/mlx4/cmd.h
@@ -150,6 +150,10 @@ enum {
 	/* statistics commands */
 	MLX4_CMD_QUERY_IF_STAT	 = 0X54,
 	MLX4_CMD_SET_IF_STAT	 = 0X55,
+
+	/* set port opcode modifiers */
+	MLX4_SET_PORT_PRIO2TC = 0x8,
+	MLX4_SET_PORT_SCHEDULER  = 0x9,
 };
 
 enum {
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 44d8144..20c706f 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -626,6 +626,9 @@ int mlx4_SET_PORT_general(struct mlx4_dev *dev, u8 port, int mtu,
 			  u8 pptx, u8 pfctx, u8 pprx, u8 pfcrx);
 int mlx4_SET_PORT_qpn_calc(struct mlx4_dev *dev, u8 port, u32 base_qpn,
 			   u8 promisc);
+int mlx4_SET_PORT_PRIO2TC(struct mlx4_dev *dev, u8 port, u8 *prio2tc);
+int mlx4_SET_PORT_SCHEDULER(struct mlx4_dev *dev, u8 port, u8 *tc_tx_bw,
+		u8 *pg, u16 *ratelimit);
 int mlx4_find_cached_vlan(struct mlx4_dev *dev, u8 port, u16 vid, int *idx);
 int mlx4_register_vlan(struct mlx4_dev *dev, u8 port, u16 vlan, int *index);
 void mlx4_unregister_vlan(struct mlx4_dev *dev, u8 port, int index);
-- 
1.7.8.2


* [PATCH 3/8] net/mlx4_en: DCB QoS support
  2012-03-13 17:21 [PATCH 0/8] net/mlx4_en: DCB QoS support Amir Vadai
  2012-03-13 17:21 ` [PATCH 1/8] net/mlx4_en: Force user priority by QP attribute Amir Vadai
  2012-03-13 17:21 ` [PATCH 2/8] net/mlx4_core: set port QoS attributes Amir Vadai
@ 2012-03-13 17:21 ` Amir Vadai
  2012-03-13 17:21 ` [PATCH 4/8] net/mlx4_en: Set max rate-limit for a TC Amir Vadai
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Amir Vadai @ 2012-03-13 17:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Roland Dreier, Oren Duer, Amir Vadai

From: Amir Vadai <amirv@mellanox.co.il>

Set TSA, promised BW and PFC using IEEE 802.1qaz netlink commands.
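
To make the TSA handling easier to follow, here is a userspace sketch of how
the setets path in this patch turns per-TC TSA settings into priority groups
and bandwidth values. It follows the loop in mlx4_en_dcbnl_ieee_setets(), but
enum tsa, ets_to_sched() and the constant names are hypothetical stand-ins;
in particular the enum values are not the real IEEE_8021QAZ_TSA_* values.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TCS	8	/* stands in for IEEE_8021QAZ_MAX_TCS */
#define BW_MIN	1	/* 0 would block traffic on the TC */
#define BW_MAX	100
#define PG_ETS	7	/* all ETS TCs share the last priority group */

enum tsa { TSA_STRICT, TSA_ETS };

/*
 * Walk the TCs from highest to lowest. Each strict TC gets its own
 * priority group, numbered from 0 upward, so a higher TC ends up in a
 * lower-numbered (higher priority) group. ETS TCs all land in PG_ETS.
 */
static void ets_to_sched(const enum tsa tsa[MAX_TCS],
			 const uint8_t ets_bw[MAX_TCS],
			 uint8_t pg[MAX_TCS], uint8_t bw[MAX_TCS])
{
	int num_strict = 0;
	int i;

	for (i = MAX_TCS - 1; i >= 0; i--) {
		if (tsa[i] == TSA_STRICT) {
			pg[i] = (uint8_t)num_strict++;
			bw[i] = BW_MAX;
		} else {
			assert(tsa[i] == TSA_ETS);
			pg[i] = PG_ETS;
			bw[i] = ets_bw[i] ? ets_bw[i] : BW_MIN;
		}
	}
}
```

With TC6 and TC7 strict and the rest ETS, TC7 lands in group 0, TC6 in group
1, and every ETS TC in group 7; an ETS TC configured with 0% is raised to 1%.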

Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
---
 drivers/net/ethernet/mellanox/mlx4/Kconfig     |   12 ++
 drivers/net/ethernet/mellanox/mlx4/Makefile    |    1 +
 drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c |  203 ++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |   16 ++
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   21 +++
 5 files changed, 253 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c

diff --git a/drivers/net/ethernet/mellanox/mlx4/Kconfig b/drivers/net/ethernet/mellanox/mlx4/Kconfig
index 1bb9353..5f027f9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlx4/Kconfig
@@ -11,6 +11,18 @@ config MLX4_EN
 	  This driver supports Mellanox Technologies ConnectX Ethernet
 	  devices.
 
+config MLX4_EN_DCB
+	bool "Data Center Bridging (DCB) Support"
+	default y
+	depends on MLX4_EN && DCB
+	---help---
+	  Say Y here if you want to use Data Center Bridging (DCB) in the
+	  driver.
+	  If set to N, you will not be able to configure QoS and ratelimit attributes.
+	  This flag depends on the kernel's DCB support.
+
+	  If unsure, set to Y
+
 config MLX4_CORE
 	tristate
 	depends on PCI
diff --git a/drivers/net/ethernet/mellanox/mlx4/Makefile b/drivers/net/ethernet/mellanox/mlx4/Makefile
index 4a40ab9..293127d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx4/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_MLX4_EN)               += mlx4_en.o
 
 mlx4_en-y := 	en_main.o en_tx.o en_rx.o en_ethtool.o en_port.o en_cq.o \
 		en_resources.o en_netdev.o en_selftest.o
+mlx4_en-$(CONFIG_MLX4_EN_DCB) += en_dcb_nl.o
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c b/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
new file mode 100644
index 0000000..24ca87c
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
@@ -0,0 +1,203 @@
+/*
+ * Copyright (c) 2011 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/dcbnl.h>
+
+#include "mlx4_en.h"
+
+static int mlx4_en_dcbnl_ieee_getets(struct net_device *dev,
+				   struct ieee_ets *ets)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct ieee_ets *my_ets = priv->mlx4_en_ieee_ets;
+
+	/* No IEEE ETS settings available */
+	if (!my_ets)
+		return -EINVAL;
+
+	ets->ets_cap = IEEE_8021QAZ_MAX_TCS;
+	ets->cbs = my_ets->cbs;
+	memcpy(ets->tc_tx_bw, my_ets->tc_tx_bw, sizeof(ets->tc_tx_bw));
+	memcpy(ets->tc_tsa, my_ets->tc_tsa, sizeof(ets->tc_tsa));
+	memcpy(ets->prio_tc, my_ets->prio_tc, sizeof(ets->prio_tc));
+
+	return 0;
+}
+
+static int mlx4_en_ets_validate(struct mlx4_en_priv *priv, struct ieee_ets *ets)
+{
+	int i;
+	int total_ets_bw = 0;
+	int has_ets_tc = 0;
+
+	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
+		if (ets->prio_tc[i] >= MLX4_EN_NUM_UP) {
+			en_err(priv, "Bad priority in UP <=> TC mapping. "
+					"TC: %d, UP: %d\n", i, ets->prio_tc[i]);
+			return -EINVAL;
+		}
+
+		switch (ets->tc_tsa[i]) {
+		case IEEE_8021QAZ_TSA_STRICT:
+			break;
+		case IEEE_8021QAZ_TSA_ETS:
+			has_ets_tc = 1;
+			total_ets_bw += ets->tc_tx_bw[i];
+			break;
+		default:
+			en_err(priv, "TC[%d]: Not supported TSA: %d\n",
+					i, ets->tc_tsa[i]);
+			return -ENOTSUPP;
+		}
+	}
+
+	if (has_ets_tc && total_ets_bw != MLX4_EN_BW_MAX) {
+		en_err(priv, "Bad ETS BW sum: %d. Should be exactly 100%%\n",
+				total_ets_bw);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int
+mlx4_en_dcbnl_ieee_setets(struct net_device *dev, struct ieee_ets *ets)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct mlx4_en_dev *mdev = priv->mdev;
+	int num_strict = 0;
+	int i, err;
+	__u8 tc_tx_bw[IEEE_8021QAZ_MAX_TCS] = { 0 };
+	__u8 pg[IEEE_8021QAZ_MAX_TCS] = { 0 };
+
+	if (ets) {
+		err = mlx4_en_ets_validate(priv, ets);
+		if (err)
+			return err;
+	} else
+		ets = priv->mlx4_en_ieee_ets;
+
+	/* higher TC means higher priority => lower pg */
+	for (i = IEEE_8021QAZ_MAX_TCS - 1; i >= 0; i--) {
+		switch (ets->tc_tsa[i]) {
+		case IEEE_8021QAZ_TSA_STRICT:
+			pg[i] = num_strict++;
+			tc_tx_bw[i] = MLX4_EN_BW_MAX;
+			break;
+		case IEEE_8021QAZ_TSA_ETS:
+			pg[i] = MLX4_EN_TC_ETS;
+			tc_tx_bw[i] = ets->tc_tx_bw[i] ?: MLX4_EN_BW_MIN;
+			break;
+		}
+	}
+
+	err = mlx4_SET_PORT_PRIO2TC(mdev->dev, priv->port, ets->prio_tc);
+	if (err)
+		return err;
+
+	err = mlx4_SET_PORT_SCHEDULER(mdev->dev, priv->port, tc_tx_bw, pg,
+			NULL);
+	if (err)
+		return err;
+
+	if (ets != priv->mlx4_en_ieee_ets)
+		memcpy(priv->mlx4_en_ieee_ets, ets,
+				sizeof(*priv->mlx4_en_ieee_ets));
+
+	return 0;
+}
+
+static int mlx4_en_dcbnl_ieee_getpfc(struct net_device *dev,
+		struct ieee_pfc *pfc)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+
+	pfc->pfc_cap = IEEE_8021QAZ_MAX_TCS;
+	pfc->pfc_en = priv->prof->tx_ppp;
+
+	return 0;
+}
+
+static int mlx4_en_dcbnl_ieee_setpfc(struct net_device *dev,
+		struct ieee_pfc *pfc)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct mlx4_en_dev *mdev = priv->mdev;
+	int err;
+
+	en_dbg(DRV, priv, "cap: 0x%x en: 0x%x mbc: 0x%x delay: %d\n",
+			pfc->pfc_cap,
+			pfc->pfc_en,
+			pfc->mbc,
+			pfc->delay);
+
+	priv->prof->rx_pause = priv->prof->tx_pause = !!pfc->pfc_en;
+	priv->prof->rx_ppp = priv->prof->tx_ppp = pfc->pfc_en;
+
+	err = mlx4_SET_PORT_general(mdev->dev, priv->port,
+				    priv->rx_skb_size + ETH_FCS_LEN,
+				    priv->prof->tx_pause,
+				    priv->prof->tx_ppp,
+				    priv->prof->rx_pause,
+				    priv->prof->rx_ppp);
+	if (err)
+		en_err(priv, "Failed setting pause params\n");
+
+	return err;
+}
+
+static u8 mlx4_en_dcbnl_getdcbx(struct net_device *dev)
+{
+	return DCB_CAP_DCBX_VER_IEEE;
+}
+
+static u8 mlx4_en_dcbnl_setdcbx(struct net_device *dev, u8 mode)
+{
+	if ((mode & DCB_CAP_DCBX_LLD_MANAGED) ||
+	    (mode & DCB_CAP_DCBX_VER_CEE) ||
+	    !(mode & DCB_CAP_DCBX_VER_IEEE) ||
+	    !(mode & DCB_CAP_DCBX_HOST))
+		return 1;
+
+	return 0;
+}
+
+const struct dcbnl_rtnl_ops mlx4_en_dcbnl_ops = {
+	.ieee_getets	= mlx4_en_dcbnl_ieee_getets,
+	.ieee_setets	= mlx4_en_dcbnl_ieee_setets,
+	.ieee_getpfc	= mlx4_en_dcbnl_ieee_getpfc,
+	.ieee_setpfc	= mlx4_en_dcbnl_ieee_setpfc,
+
+	.getdcbx	= mlx4_en_dcbnl_getdcbx,
+	.setdcbx	= mlx4_en_dcbnl_setdcbx,
+};
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 2322622..9b456ae 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -967,6 +967,11 @@ void mlx4_en_destroy_netdev(struct net_device *dev)
 	mutex_unlock(&mdev->state_lock);
 
 	mlx4_en_free_resources(priv);
+
+#ifdef CONFIG_MLX4_EN_DCB
+	vfree(priv->mlx4_en_ieee_ets);
+#endif
+
 	free_netdev(dev);
 }
 
@@ -1080,6 +1085,17 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	INIT_WORK(&priv->watchdog_task, mlx4_en_restart);
 	INIT_WORK(&priv->linkstate_task, mlx4_en_linkstate);
 	INIT_DELAYED_WORK(&priv->stats_task, mlx4_en_do_get_stats);
+#ifdef CONFIG_MLX4_EN_DCB
+	if (!mlx4_is_slave(priv->mdev->dev)) {
+		dev->dcbnl_ops = &mlx4_en_dcbnl_ops;
+
+		priv->mlx4_en_ieee_ets = vzalloc(sizeof(struct ieee_ets));
+		if (!priv->mlx4_en_ieee_ets) {
+			err = -ENOMEM;
+			goto out;
+		}
+	}
+#endif
 
 	/* Query for default mac and max mtu */
 	priv->max_mtu = mdev->dev->caps.eth_mtu_cap[priv->port];
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 5bd7c2a..fa09792 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -40,6 +40,9 @@
 #include <linux/mutex.h>
 #include <linux/netdevice.h>
 #include <linux/if_vlan.h>
+#ifdef CONFIG_MLX4_EN_DCB
+#include <linux/dcbnl.h>
+#endif
 
 #include <linux/mlx4/device.h>
 #include <linux/mlx4/qp.h>
@@ -110,6 +113,7 @@ enum {
 #define MLX4_EN_NUM_TX_RINGS		8
 #define MLX4_EN_NUM_PPP_RINGS		8
 #define MAX_TX_RINGS			(MLX4_EN_NUM_TX_RINGS + MLX4_EN_NUM_PPP_RINGS)
+#define MLX4_EN_NUM_UP			8
 #define MLX4_EN_DEF_TX_RING_SIZE	512
 #define MLX4_EN_DEF_RX_RING_SIZE  	1024
 
@@ -410,6 +414,15 @@ struct mlx4_en_frag_info {
 
 };
 
+#ifdef CONFIG_MLX4_EN_DCB
+/* Minimal TC BW - setting to 0 will block traffic */
+#define MLX4_EN_BW_MIN 1
+#define MLX4_EN_BW_MAX 100 /* Utilize 100% of the line */
+
+#define MLX4_EN_TC_ETS 7
+
+#endif
+
 struct mlx4_en_priv {
 	struct mlx4_en_dev *mdev;
 	struct mlx4_en_port_profile *prof;
@@ -483,6 +496,10 @@ struct mlx4_en_priv {
 	int vids[128];
 	bool wol;
 	struct device *ddev;
+
+#ifdef CONFIG_MLX4_EN_DCB
+	struct ieee_ets *mlx4_en_ieee_ets;
+#endif
 };
 
 enum mlx4_en_wol {
@@ -557,6 +574,10 @@ int mlx4_SET_VLAN_FLTR(struct mlx4_dev *dev, struct mlx4_en_priv *priv);
 int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset);
 int mlx4_en_QUERY_PORT(struct mlx4_en_dev *mdev, u8 port);
 
+#ifdef CONFIG_MLX4_EN_DCB
+extern const struct dcbnl_rtnl_ops mlx4_en_dcbnl_ops;
+#endif
+
 #define MLX4_EN_NUM_SELF_TEST	5
 void mlx4_en_ex_selftest(struct net_device *dev, u32 *flags, u64 *buf);
 u64 mlx4_en_mac_to_u64(u8 *addr);
-- 
1.7.8.2


* [PATCH 4/8] net/mlx4_en: Set max rate-limit for a TC
  2012-03-13 17:21 [PATCH 0/8] net/mlx4_en: DCB QoS support Amir Vadai
                   ` (2 preceding siblings ...)
  2012-03-13 17:21 ` [PATCH 3/8] net/mlx4_en: DCB QoS support Amir Vadai
@ 2012-03-13 17:21 ` Amir Vadai
  2012-03-13 18:26   ` John Fastabend
  2012-03-13 19:16   ` Dave Taht
  2012-03-13 17:22 ` [PATCH 5/8] net/mlx4_en: sk_prio <=> UP for untagged traffic Amir Vadai
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 20+ messages in thread
From: Amir Vadai @ 2012-03-13 17:21 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Roland Dreier, Oren Duer, Amir Vadai

From: Amir Vadai <amirv@mellanox.co.il>

Set max rate-limit using sysfs file /sys/class/net/<interface>/qos/ratelimit

To set it, write a space-separated list of values, one per TC, in units of
100Mbps. For example, to set a ratelimit of 5G for TC0 and 10G for the rest on
eth2, issue:
echo 50 100 100 100 100 100 100 100 > /sys/class/net/eth2/qos/ratelimit
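
The input format can be illustrated with a userspace parser sketch. This is
not the en_sysfs.c code from this patch: parse_ratelimit() and
effective_ratelimit() are hypothetical names, and the sketch assumes exactly
one value per TC (eight in total, matching MLX4_NUM_TC). Treating 0 as "use
the default 0xffff" follows mlx4_SET_PORT_SCHEDULER() in patch 2/8.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_TC			8	/* stands in for MLX4_NUM_TC */
#define RATELIMIT_DEFAULT	0xffff	/* as MLX4_RATELIMIT_DEFAULT */

/*
 * Parse NUM_TC whitespace-separated decimal values, each in units of
 * 100Mbps. Returns 0 on success and -1 on malformed input, which
 * includes too few values, too many values, or a value above 0xffff.
 */
static int parse_ratelimit(const char *buf, uint16_t out[NUM_TC])
{
	int i;

	assert(buf && out);

	for (i = 0; i < NUM_TC; i++) {
		char *end;
		unsigned long v;

		errno = 0;
		v = strtoul(buf, &end, 10);
		if (end == buf || errno || v > 0xffff)
			return -1;
		out[i] = (uint16_t)v;
		buf = end;
	}

	/* allow trailing blanks and the newline that echo appends */
	while (*buf == ' ' || *buf == '\n')
		buf++;
	return *buf == '\0' ? 0 : -1;
}

/* A stored value of 0 means "no explicit limit": use the default. */
static uint16_t effective_ratelimit(uint16_t v)
{
	return v ? v : RATELIMIT_DEFAULT;
}
```

A value of 50 therefore stands for 50 x 100Mbps = 5Gbps, and 100 stands for
10Gbps.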

Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
---

We used sysfs since max bw isn't part of the ETS / DCBX NL support. We're
open to other suggestions for adding generic support for max bw, e.g. adding a
call to the DCBX NL API.

 drivers/net/ethernet/mellanox/mlx4/Makefile    |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c |   14 +++-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |   13 +++
 drivers/net/ethernet/mellanox/mlx4/en_sysfs.c  |  120 ++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |    5 +
 5 files changed, 152 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx4/en_sysfs.c

diff --git a/drivers/net/ethernet/mellanox/mlx4/Makefile b/drivers/net/ethernet/mellanox/mlx4/Makefile
index 293127d..8e5bd88 100644
--- a/drivers/net/ethernet/mellanox/mlx4/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx4/Makefile
@@ -7,4 +7,4 @@ obj-$(CONFIG_MLX4_EN)               += mlx4_en.o
 
 mlx4_en-y := 	en_main.o en_tx.o en_rx.o en_ethtool.o en_port.o en_cq.o \
 		en_resources.o en_netdev.o en_selftest.o
-mlx4_en-$(CONFIG_MLX4_EN_DCB) += en_dcb_nl.o
+mlx4_en-$(CONFIG_MLX4_EN_DCB) += en_dcb_nl.o en_sysfs.o
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c b/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
index 24ca87c..eb0964f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
@@ -126,7 +126,7 @@ mlx4_en_dcbnl_ieee_setets(struct net_device *dev, struct ieee_ets *ets)
 		return err;
 
 	err = mlx4_SET_PORT_SCHEDULER(mdev->dev, priv->port, tc_tx_bw, pg,
-			NULL);
+			priv->ratelimit);
 	if (err)
 		return err;
 
@@ -192,6 +192,18 @@ static u8 mlx4_en_dcbnl_setdcbx(struct net_device *dev, u8 mode)
 	return 0;
 }
 
+int mlx4_en_dcbnl_set_ratelimit(struct mlx4_en_priv *priv, u16 *ratelimit)
+{
+	memcpy(priv->ratelimit, ratelimit, sizeof(priv->ratelimit));
+
+	if (mlx4_en_dcbnl_ieee_setets(priv->dev, NULL)) {
+		en_err(priv, "Error setting ets in HW\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 const struct dcbnl_rtnl_ops mlx4_en_dcbnl_ops = {
 	.ieee_getets	= mlx4_en_dcbnl_ieee_getets,
 	.ieee_setets	= mlx4_en_dcbnl_ieee_setets,
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 9b456ae..f45d544 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -810,6 +810,18 @@ static void mlx4_en_restart(struct work_struct *work)
 	mutex_unlock(&mdev->state_lock);
 }
 
+static int mlx4_en_init(struct net_device *dev)
+{
+#ifdef CONFIG_MLX4_EN_DCB
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+
+	if (!mlx4_is_slave(priv->mdev->dev))
+		mlx4_en_prepare_sysfs_group(priv);
+#endif
+
+	return 0;
+}
+
 static void mlx4_en_clear_stats(struct net_device *dev)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -1026,6 +1038,7 @@ static int mlx4_en_set_features(struct net_device *netdev,
 }
 
 static const struct net_device_ops mlx4_netdev_ops = {
+	.ndo_init		= mlx4_en_init,
 	.ndo_open		= mlx4_en_open,
 	.ndo_stop		= mlx4_en_close,
 	.ndo_start_xmit		= mlx4_en_xmit,
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_sysfs.c b/drivers/net/ethernet/mellanox/mlx4/en_sysfs.c
new file mode 100644
index 0000000..fdada20
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlx4/en_sysfs.c
@@ -0,0 +1,120 @@
+/*
+ * Copyright (c) 2011 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/netdevice.h>
+
+#include "mlx4_en.h"
+
+#define to_en_priv(cd)	((struct mlx4_en_priv *)(netdev_priv(to_net_dev(cd))))
+
+static ssize_t mlx4_en_show_ratelimit(struct device *d,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	struct mlx4_en_priv *priv = to_en_priv(d);
+	int i;
+	int len = 0;
+
+	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++)
+		len += sprintf(buf + len,  "%d ", priv->ratelimit[i]);
+	len += sprintf(buf + len, "\n");
+
+	return len;
+}
+
+static ssize_t mlx4_en_store_ratelimit(struct device *d,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count)
+{
+	int ret = count;
+	struct mlx4_en_priv *priv = to_en_priv(d);
+	char save;
+	int i = 0;
+	u16 ratelimit[IEEE_8021QAZ_MAX_TCS] = { 0 };
+
+	do {
+		int len;
+		int new_value;
+
+		if (i >= IEEE_8021QAZ_MAX_TCS)
+			goto bad_elem_count;
+
+		len = strcspn(buf, " ");
+
+		/* nul-terminate and parse */
+		save = buf[len];
+		((char *)buf)[len] = '\0';
+
+		if (sscanf(buf, "%d", &new_value) != 1 ||
+				new_value > 100 || new_value < 0) {
+			en_err(priv, "bad ratelimit value: '%s'\n", buf);
+			ret = -EINVAL;
+			goto out;
+		}
+		ratelimit[i] = new_value;
+
+		buf += len+1;
+		i++;
+	} while (save == ' ');
+
+	if (i != IEEE_8021QAZ_MAX_TCS)
+		goto bad_elem_count;
+
+	mlx4_en_dcbnl_set_ratelimit(priv, &ratelimit[0]);
+
+out:
+	return ret;
+
+bad_elem_count:
+	en_err(priv, "bad number of elements in ratelimit array\n");
+	return -EINVAL;
+}
+
+static DEVICE_ATTR(ratelimit, S_IRUGO | S_IWUSR,
+		   mlx4_en_show_ratelimit, mlx4_en_store_ratelimit);
+
+static struct attribute *mlx4_en_qos_attrs[] = {
+	&dev_attr_ratelimit.attr,
+	NULL,
+};
+
+static struct attribute_group qos_group = {
+	.name = "qos",
+	.attrs = mlx4_en_qos_attrs,
+};
+
+void mlx4_en_prepare_sysfs_group(struct mlx4_en_priv *priv)
+{
+	priv->dev->sysfs_groups[0] = &qos_group;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index fa09792..32b447d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -499,6 +499,7 @@ struct mlx4_en_priv {
 
 #ifdef CONFIG_MLX4_EN_DCB
 	struct ieee_ets *mlx4_en_ieee_ets;
+	u16 ratelimit[IEEE_8021QAZ_MAX_TCS];
 #endif
 };
 
@@ -576,6 +577,10 @@ int mlx4_en_QUERY_PORT(struct mlx4_en_dev *mdev, u8 port);
 
 #ifdef CONFIG_MLX4_EN_DCB
 extern const struct dcbnl_rtnl_ops mlx4_en_dcbnl_ops;
+
+void mlx4_en_prepare_sysfs_group(struct mlx4_en_priv *priv);
+
+int mlx4_en_dcbnl_set_ratelimit(struct mlx4_en_priv *priv, u16 *ratelimit);
 #endif
 
 #define MLX4_EN_NUM_SELF_TEST	5
-- 
1.7.8.2

* [PATCH 5/8] net/mlx4_en: sk_prio <=> UP for untagged traffic
  2012-03-13 17:21 [PATCH 0/8] net/mlx4_en: DCB QoS support Amir Vadai
                   ` (3 preceding siblings ...)
  2012-03-13 17:21 ` [PATCH 4/8] net/mlx4_en: Set max rate-limit for a TC Amir Vadai
@ 2012-03-13 17:22 ` Amir Vadai
  2012-03-13 17:22 ` [PATCH 6/8] IB/rdma_cm: TOS <=> UP mapping for IBoE Amir Vadai
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Amir Vadai @ 2012-03-13 17:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Roland Dreier, Oren Duer, Amir Vadai

From: Amir Vadai <amirv@mellanox.co.il>

Since the vlan egress map applies only to tagged traffic, untagged traffic
needs a different mapping.
For that, the driver uses the sch_mqprio mapping, which can be set with the
tc tool from the iproute2 package.
The mapped UP is used by the HW for QoS purposes, but does not go out on the
wire.
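
For illustration only (not part of this patch), the resulting ring selection
can be sketched in plain userspace C. The struct and function names below are
hypothetical stand-ins for the sk_buff and net_device fields, and the 16-entry
priority table is an assumption modelled on the mqprio prio-to-tc map:

```c
#include <stdint.h>

#define NUM_TX_RINGS	8	/* mirrors MLX4_EN_NUM_TX_RINGS in this patch */
#define PRIO_MASK	15	/* assumption: 16-entry sk_prio table */

/* Hypothetical stand-in for the sk_buff fields the driver looks at */
struct pkt {
	int has_vlan_tag;	/* non-zero when the packet carries a vlan tag */
	uint16_t vlan_tci;	/* PCP (the UP) lives in the top 3 bits */
	uint32_t priority;	/* sk_prio of the sending socket */
};

/* Tagged traffic takes its UP from the tag, untagged from the sk_prio map */
static uint16_t select_ring(const struct pkt *p, const uint8_t prio_to_up[16])
{
	unsigned int up;

	if (p->has_vlan_tag)
		up = p->vlan_tci >> 13;
	else
		up = prio_to_up[p->priority & PRIO_MASK];

	/* one ring per UP, placed after the rings used for hashing */
	return (uint16_t)(NUM_TX_RINGS + up);
}
```

With a 1:1 map, an untagged packet with sk_prio 3 lands on ring 8 + 3 = 11,
while a tagged packet with PCP 5 lands on ring 13 regardless of its sk_prio.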

Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |   15 +++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c     |   16 ++++++++--------
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index f45d544..22c4da6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -45,6 +45,14 @@
 #include "mlx4_en.h"
 #include "en_port.h"
 
+static int mlx4_en_setup_tc(struct net_device *dev, u8 up)
+{
+	if (up != MLX4_EN_NUM_UP)
+		return -EINVAL;
+
+	return 0;
+}
+
 static int mlx4_en_vlan_rx_add_vid(struct net_device *dev, unsigned short vid)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -1055,6 +1063,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_poll_controller	= mlx4_en_netpoll,
 #endif
 	.ndo_set_features	= mlx4_en_set_features,
+	.ndo_setup_tc		= mlx4_en_setup_tc,
 };
 
 int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
@@ -1143,6 +1152,12 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	netif_set_real_num_tx_queues(dev, priv->tx_ring_num);
 	netif_set_real_num_rx_queues(dev, priv->rx_ring_num);
 
+	netdev_set_num_tc(dev, MLX4_EN_NUM_UP);
+
+	/* Partition Tx queues evenly amongst UP's */
+	for (i = 0; i < MLX4_EN_NUM_UP; i++)
+		netdev_set_tc_queue(dev, i, 1, MLX4_EN_NUM_TX_RINGS + i);
+
 	SET_ETHTOOL_OPS(dev, &mlx4_en_ethtool_ops);
 
 	/* Set defualt MAC */
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 9787539..7a49830 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -571,15 +571,15 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
 u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	u16 vlan_tag = 0;
+	int up = -1;
 
-	/* If we support per priority flow control and the packet contains
-	 * a vlan tag, send the packet to the TX ring assigned to that priority
-	 */
-	if (vlan_tx_tag_present(skb)) {
-		vlan_tag = vlan_tx_tag_get(skb);
-		return MLX4_EN_NUM_TX_RINGS + (vlan_tag >> 13);
-	}
+	if (vlan_tx_tag_present(skb))
+		up = (vlan_tx_tag_get(skb) >> 13);
+	else if (dev->num_tc)
+		up = netdev_get_prio_tc_map(dev, skb->priority);
+
+	if (up >= 0)
+		return MLX4_EN_NUM_TX_RINGS + up;
 
 	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS);
 }
-- 
1.7.8.2

* [PATCH 6/8] IB/rdma_cm: TOS <=> UP mapping for IBoE
  2012-03-13 17:21 [PATCH 0/8] net/mlx4_en: DCB QoS support Amir Vadai
                   ` (4 preceding siblings ...)
  2012-03-13 17:22 ` [PATCH 5/8] net/mlx4_en: sk_prio <=> UP for untagged traffic Amir Vadai
@ 2012-03-13 17:22 ` Amir Vadai
  2012-03-13 17:22 ` [PATCH 7/8] net: support tx_ring per UP in HW based QoS mechanism Amir Vadai
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: Amir Vadai @ 2012-03-13 17:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Roland Dreier, Oren Duer, Amir Vadai, Sean Hefty

From: Amir Vadai <amirv@mellanox.co.il>

Both tagged and untagged traffic use the mapping set by the tc tool.
Treat the RDMA TOS the same as the IP TOS when mapping to SL.

Since the IP TOS to priority mapping is not exported, the code had to be
borrowed from net/ipv4/route.c.
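
For illustration only (not part of this patch), the lookup can be tried out in
userspace. The identifiers below are hypothetical; the numeric values are the
TC_PRIO_* and IPTOS_TOS_MASK values from the kernel headers, and the table has
the same layout as the one added to cma.c:

```c
#include <stdint.h>

/* Values of the corresponding TC_PRIO_* constants in linux/pkt_sched.h */
enum {
	PRIO_BESTEFFORT		= 0,
	PRIO_BULK		= 2,
	PRIO_INTERACTIVE_BULK	= 4,
	PRIO_INTERACTIVE	= 6,
};

#define TOS_MASK	0x1E	/* same bits as IPTOS_TOS_MASK */

/* Same layout as the tos2prio[] table in the patch */
static const uint8_t tos_to_prio[16] = {
	PRIO_BESTEFFORT, PRIO_BESTEFFORT, PRIO_BESTEFFORT, PRIO_BESTEFFORT,
	PRIO_BULK, PRIO_BULK, PRIO_BULK, PRIO_BULK,
	PRIO_INTERACTIVE, PRIO_INTERACTIVE, PRIO_INTERACTIVE, PRIO_INTERACTIVE,
	PRIO_INTERACTIVE_BULK, PRIO_INTERACTIVE_BULK,
	PRIO_INTERACTIVE_BULK, PRIO_INTERACTIVE_BULK,
};

/* New behaviour: TOS bits 1..4 index the table, as IP does for sk_prio */
static uint8_t tos_to_priority(uint8_t tos)
{
	return tos_to_prio[(tos & TOS_MASK) >> 1];
}

/* Old behaviour: SL taken directly from the top three TOS bits */
static uint8_t old_sl(uint8_t tos)
{
	return tos >> 5;
}
```

Note that the value returned by tos_to_priority() is not yet the SL; in the
patch it is still passed through netdev_get_prio_tc_map() to obtain it.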

Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
CC: Sean Hefty <sean.hefty@intel.com>

---

We can export the IP TOS to priority mapping if the networking maintainers
say where they want it.
---
 drivers/infiniband/core/cma.c |   35 ++++++++++++++++++++++++++++++++++-
 1 files changed, 34 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index e3e470f..c0eeb2c 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -42,6 +42,7 @@
 #include <linux/inetdevice.h>
 #include <linux/slab.h>
 #include <linux/module.h>
+#include <linux/pkt_sched.h>
 
 #include <net/tcp.h>
 #include <net/ipv6.h>
@@ -1781,6 +1782,32 @@ static int cma_resolve_iw_route(struct rdma_id_private *id_priv, int timeout_ms)
 	return 0;
 }
 
+#define ECN_OR_COST(class)	TC_PRIO_##class
+
+static const __u8 tos2prio[16] = {
+	TC_PRIO_BESTEFFORT,
+	ECN_OR_COST(BESTEFFORT),
+	TC_PRIO_BESTEFFORT,
+	ECN_OR_COST(BESTEFFORT),
+	TC_PRIO_BULK,
+	ECN_OR_COST(BULK),
+	TC_PRIO_BULK,
+	ECN_OR_COST(BULK),
+	TC_PRIO_INTERACTIVE,
+	ECN_OR_COST(INTERACTIVE),
+	TC_PRIO_INTERACTIVE,
+	ECN_OR_COST(INTERACTIVE),
+	TC_PRIO_INTERACTIVE_BULK,
+	ECN_OR_COST(INTERACTIVE_BULK),
+	TC_PRIO_INTERACTIVE_BULK,
+	ECN_OR_COST(INTERACTIVE_BULK)
+};
+
+static inline char cma_tos2priority(u8 tos)
+{
+	return tos2prio[IPTOS_TOS(tos)>>1];
+}
+
 static int cma_resolve_iboe_route(struct rdma_id_private *id_priv)
 {
 	struct rdma_route *route = &id_priv->id.route;
@@ -1826,7 +1853,13 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv)
 	route->path_rec->reversible = 1;
 	route->path_rec->pkey = cpu_to_be16(0xffff);
 	route->path_rec->mtu_selector = IB_SA_EQ;
-	route->path_rec->sl = id_priv->tos >> 5;
+	if (ndev->priv_flags & IFF_802_1Q_VLAN)
+		route->path_rec->sl =
+			netdev_get_prio_tc_map(vlan_dev_real_dev(ndev),
+					cma_tos2priority(id_priv->tos));
+	else
+		route->path_rec->sl = netdev_get_prio_tc_map(ndev,
+				cma_tos2priority(id_priv->tos));
 
 	route->path_rec->mtu = iboe_get_mtu(ndev->mtu);
 	route->path_rec->rate_selector = IB_SA_EQ;
-- 
1.7.8.2

* [PATCH 7/8] net: support tx_ring per UP in HW based QoS mechanism
  2012-03-13 17:21 [PATCH 0/8] net/mlx4_en: DCB QoS support Amir Vadai
                   ` (5 preceding siblings ...)
  2012-03-13 17:22 ` [PATCH 6/8] IB/rdma_cm: TOS <=> UP mapping for IBoE Amir Vadai
@ 2012-03-13 17:22 ` Amir Vadai
  2012-03-13 18:23   ` John Fastabend
  2012-03-13 17:22 ` [PATCH 8/8] net/mlx4_en: num cores tx rings for every UP Amir Vadai
  2012-03-20 11:29 ` [PATCH 0/8] net/mlx4_en: DCB QoS support Amir Vadai
  8 siblings, 1 reply; 20+ messages in thread
From: Amir Vadai @ 2012-03-13 17:22 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Roland Dreier, Oren Duer, Amir Vadai, John Fastabend,
	Jeff Kirsher, Eilon Greenstein

From: Amir Vadai <amirv@mellanox.co.il>

The current HW based QoS mechanism, which was introduced in commit 4f57c087de9
"net: implement mechanism for HW based QOS", is oriented towards ETS traffic
classes. This patch introduces an approach which allows this mechanism to be
used also with hardware that has queues per user priority (UP). After the
change, __skb_tx_hash() will direct a flow to a tx ring from a range of tx
rings. This range is defined by the caller according to the specific HW: for
TC based queues the range is selected by TC number, and for UP based queues
the range is selected by UP.
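
For illustration only (not part of this patch), the way a flow hash is folded
into a caller-supplied range can be sketched as below. The function name is
hypothetical, and the multiply-and-shift scaling is written from memory of
net/core/dev.c, so it should be checked against the tree:

```c
#include <stdint.h>

/*
 * Pick a ring in [qoffset, qoffset + qcount) from a 32-bit flow hash.
 * Scaling by qcount and keeping the high 32 bits yields 0..qcount-1.
 */
static uint16_t ring_from_hash(uint32_t hash, uint16_t qoffset, uint16_t qcount)
{
	return (uint16_t)((((uint64_t)hash * qcount) >> 32) + qoffset);
}
```

With qcount = 1, as mlx4_en passes after this patch, the hash has no effect
and the result is always qoffset, i.e. the single ring of the selected UP.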

CC: John Fastabend <john.r.fastabend@intel.com>
CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
CC: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |   11 ++++++++++-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      |    9 +++------
 include/linux/netdevice.h                       |   12 +++++++++++-
 include/linux/skbuff.h                          |    3 ++-
 net/core/dev.c                                  |   10 +---------
 5 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index c11e50d..614d0b2 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -1422,6 +1422,8 @@ void bnx2x_netif_stop(struct bnx2x *bp, int disable_hw)
 u16 bnx2x_select_queue(struct net_device *dev, struct sk_buff *skb)
 {
 	struct bnx2x *bp = netdev_priv(dev);
+	u16 qoffset = 0;
+	u16 qcount = BNX2X_NUM_ETH_QUEUES(bp);
 
 #ifdef BCM_CNIC
 	if (!NO_FCOE(bp)) {
@@ -1441,8 +1443,15 @@ u16 bnx2x_select_queue(struct net_device *dev, struct sk_buff *skb)
 			return bnx2x_fcoe_tx(bp, txq_index);
 	}
 #endif
+	if (dev->num_tc) {
+		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+		qoffset = dev->tc_to_txq[tc].offset;
+		qcount = dev->tc_to_txq[tc].count;
+	}
+
 	/* select a non-FCoE queue */
-	return __skb_tx_hash(dev, skb, BNX2X_NUM_ETH_QUEUES(bp));
+	return __skb_tx_hash(dev, skb, BNX2X_NUM_ETH_QUEUES(bp), qoffset,
+			qcount);
 }
 
 void bnx2x_set_num_queues(struct bnx2x *bp)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 7a49830..d0d96e3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -570,18 +570,15 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
 
 u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
 {
-	struct mlx4_en_priv *priv = netdev_priv(dev);
-	int up = -1;
+	int up = 0;
 
 	if (vlan_tx_tag_present(skb))
 		up = (vlan_tx_tag_get(skb) >> 13);
 	else if (dev->num_tc)
 		up = netdev_get_prio_tc_map(dev, skb->priority);
 
-	if (up >= 0)
-		return MLX4_EN_NUM_TX_RINGS + up;
-
-	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS);
+	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS,
+			MLX4_EN_NUM_TX_RINGS + up, 1);
 }
 
 static void mlx4_bf_copy(void __iomem *dst, unsigned long *src, unsigned bytecnt)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4535a4e..952dde3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2061,7 +2061,17 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
 static inline u16 skb_tx_hash(const struct net_device *dev,
 			      const struct sk_buff *skb)
 {
-	return __skb_tx_hash(dev, skb, dev->real_num_tx_queues);
+	u16 qoffset = 0;
+	u16 qcount = dev->real_num_tx_queues;
+
+	if (dev->num_tc) {
+		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+		qoffset = dev->tc_to_txq[tc].offset;
+		qcount = dev->tc_to_txq[tc].count;
+	}
+
+	return __skb_tx_hash(dev, skb, dev->real_num_tx_queues, qoffset,
+			qcount);
 }
 
 /**
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8dc8257..14fa201 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2455,7 +2455,8 @@ static inline bool skb_rx_queue_recorded(const struct sk_buff *skb)
 
 extern u16 __skb_tx_hash(const struct net_device *dev,
 			 const struct sk_buff *skb,
-			 unsigned int num_tx_queues);
+			 unsigned int num_tx_queues,
+			 u16 qoffset, u16 qcount);
 
 #ifdef CONFIG_XFRM
 static inline struct sec_path *skb_sec_path(struct sk_buff *skb)
diff --git a/net/core/dev.c b/net/core/dev.c
index 0090809..ecbf5c1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2290,11 +2290,9 @@ static u32 hashrnd __read_mostly;
  * to be used as a distribution range.
  */
 u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
-		  unsigned int num_tx_queues)
+		  unsigned int num_tx_queues, u16 qoffset, u16 qcount)
 {
 	u32 hash;
-	u16 qoffset = 0;
-	u16 qcount = num_tx_queues;
 
 	if (skb_rx_queue_recorded(skb)) {
 		hash = skb_get_rx_queue(skb);
@@ -2303,12 +2301,6 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 		return hash;
 	}
 
-	if (dev->num_tc) {
-		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
-		qoffset = dev->tc_to_txq[tc].offset;
-		qcount = dev->tc_to_txq[tc].count;
-	}
-
 	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
 	else
-- 
1.7.8.2

* [PATCH 8/8] net/mlx4_en: num cores tx rings for every UP
  2012-03-13 17:21 [PATCH 0/8] net/mlx4_en: DCB QoS support Amir Vadai
                   ` (6 preceding siblings ...)
  2012-03-13 17:22 ` [PATCH 7/8] net: support tx_ring per UP in HW based QoS mechanism Amir Vadai
@ 2012-03-13 17:22 ` Amir Vadai
  2012-03-20 11:29 ` [PATCH 0/8] net/mlx4_en: DCB QoS support Amir Vadai
  8 siblings, 0 replies; 20+ messages in thread
From: Amir Vadai @ 2012-03-13 17:22 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Roland Dreier, Oren Duer, Amir Vadai

From: Amir Vadai <amirv@mellanox.co.il>

Instead of having num cores tx rings for untagged traffic and only 1 tx ring
per UP, allocate num cores * num UPs tx rings. There is no reason for all
cores to share the same tx ring when using tagged traffic.
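
For illustration only (not part of this patch), the ring layout arithmetic can
be sketched as below. The helper names are hypothetical; the two constants
mirror MLX4_EN_NUM_UP and MLX4_EN_MAX_TX_RING_P_UP from the diff:

```c
#define NUM_UP			8	/* mirrors MLX4_EN_NUM_UP */
#define MAX_TX_RING_P_UP	32	/* mirrors MLX4_EN_MAX_TX_RING_P_UP */

/* Rings per UP: one per online core, capped at the per-UP maximum */
static unsigned int rings_per_up(unsigned int online_cpus)
{
	return online_cpus < MAX_TX_RING_P_UP ? online_cpus : MAX_TX_RING_P_UP;
}

/* Total number of tx rings allocated for a port */
static unsigned int total_tx_rings(unsigned int online_cpus)
{
	return rings_per_up(online_cpus) * NUM_UP;
}

/* UP a ring is activated with (ring index / rings per UP) */
static unsigned int up_of_ring(unsigned int ring, unsigned int per_up)
{
	return ring / per_up;
}

/* First ring of the contiguous block that belongs to a UP */
static unsigned int first_ring_of_up(unsigned int up, unsigned int per_up)
{
	return up * per_up;
}
```

On a 4 core machine this gives 32 rings; rings 12..15 belong to UP 3.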

Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
---
 drivers/net/ethernet/mellanox/mlx4/en_main.c   |    6 ++++--
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |   23 ++++++++++++++++++++---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c     |    5 +++--
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |    6 ++++--
 4 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_main.c b/drivers/net/ethernet/mellanox/mlx4/en_main.c
index 346fdb2..988b242 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_main.c
@@ -101,6 +101,8 @@ static int mlx4_en_get_profile(struct mlx4_en_dev *mdev)
 	int i;
 
 	params->udp_rss = udp_rss;
+	params->num_tx_rings_p_up = min_t(int, num_online_cpus(),
+			MLX4_EN_MAX_TX_RING_P_UP);
 	if (params->udp_rss && !(mdev->dev->caps.flags
 					& MLX4_DEV_CAP_FLAG_UDP_RSS)) {
 		mlx4_warn(mdev, "UDP RSS is not supported on this device.\n");
@@ -113,8 +115,8 @@ static int mlx4_en_get_profile(struct mlx4_en_dev *mdev)
 		params->prof[i].tx_ppp = pfctx;
 		params->prof[i].tx_ring_size = MLX4_EN_DEF_TX_RING_SIZE;
 		params->prof[i].rx_ring_size = MLX4_EN_DEF_RX_RING_SIZE;
-		params->prof[i].tx_ring_num = MLX4_EN_NUM_TX_RINGS +
-			MLX4_EN_NUM_PPP_RINGS;
+		params->prof[i].tx_ring_num = params->num_tx_rings_p_up *
+			MLX4_EN_NUM_UP;
 		params->prof[i].rss_rings = 0;
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 22c4da6..993e148 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -659,7 +659,7 @@ int mlx4_en_start_port(struct net_device *dev)
 		/* Configure ring */
 		tx_ring = &priv->tx_ring[i];
 		err = mlx4_en_activate_tx_ring(priv, tx_ring, cq->mcq.cqn,
-				max(0, i - MLX4_EN_NUM_TX_RINGS));
+			i / priv->mdev->profile.num_tx_rings_p_up);
 		if (err) {
 			en_err(priv, "Failed allocating Tx ring\n");
 			mlx4_en_deactivate_cq(priv, cq);
@@ -991,6 +991,8 @@ void mlx4_en_destroy_netdev(struct net_device *dev)
 #ifdef CONFIG_MLX4_EN_DCB
 	vfree(priv->mlx4_en_ieee_ets);
 #endif
+	vfree(priv->tx_ring);
+	vfree(priv->tx_cq);
 
 	free_netdev(dev);
 }
@@ -1073,6 +1075,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	struct mlx4_en_priv *priv;
 	int i;
 	int err;
+	unsigned int q, offset = 0;
 
 	dev = alloc_etherdev_mqs(sizeof(struct mlx4_en_priv),
 	    prof->tx_ring_num, prof->rx_ring_num);
@@ -1098,6 +1101,17 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	priv->ctrl_flags = cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE |
 			MLX4_WQE_CTRL_SOLICITED);
 	priv->tx_ring_num = prof->tx_ring_num;
+	priv->tx_ring = vzalloc(sizeof(struct mlx4_en_tx_ring) *
+			priv->tx_ring_num);
+	if (!priv->tx_ring) {
+		err = -ENOMEM;
+		goto out;
+	}
+	priv->tx_cq = vzalloc(sizeof(struct mlx4_en_cq) * priv->tx_ring_num);
+	if (!priv->tx_cq) {
+		err = -ENOMEM;
+		goto out;
+	}
 	priv->rx_ring_num = prof->rx_ring_num;
 	priv->mac_index = -1;
 	priv->msg_enable = MLX4_EN_MSG_LEVEL;
@@ -1155,8 +1169,11 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 	netdev_set_num_tc(dev, MLX4_EN_NUM_UP);
 
 	/* Partition Tx queues evenly amongst UP's */
-	for (i = 0; i < MLX4_EN_NUM_UP; i++)
-		netdev_set_tc_queue(dev, i, 1, MLX4_EN_NUM_TX_RINGS + i);
+	q = priv->tx_ring_num / MLX4_EN_NUM_UP;
+	for (i = 0; i < MLX4_EN_NUM_UP; i++) {
+		netdev_set_tc_queue(dev, i, q, offset);
+		offset += q;
+	}
 
 	SET_ETHTOOL_OPS(dev, &mlx4_en_ethtool_ops);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index d0d96e3..c9b9e84 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -570,6 +570,8 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
 
 u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
 {
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	u16 qcount = priv->mdev->profile.num_tx_rings_p_up;
 	int up = 0;
 
 	if (vlan_tx_tag_present(skb))
@@ -577,8 +579,7 @@ u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
 	else if (dev->num_tc)
 		up = netdev_get_prio_tc_map(dev, skb->priority);
 
-	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS,
-			MLX4_EN_NUM_TX_RINGS + up, 1);
+	return __skb_tx_hash(dev, skb, qcount, qcount * up, qcount);
 }
 
 static void mlx4_bf_copy(void __iomem *dst, unsigned long *src, unsigned bytecnt)
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 32b447d..a739d3e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -113,6 +113,7 @@ enum {
 #define MLX4_EN_NUM_TX_RINGS		8
 #define MLX4_EN_NUM_PPP_RINGS		8
 #define MAX_TX_RINGS			(MLX4_EN_NUM_TX_RINGS + MLX4_EN_NUM_PPP_RINGS)
+#define MLX4_EN_MAX_TX_RING_P_UP	32
 #define MLX4_EN_NUM_UP			8
 #define MLX4_EN_DEF_TX_RING_SIZE	512
 #define MLX4_EN_DEF_RX_RING_SIZE  	1024
@@ -339,6 +340,7 @@ struct mlx4_en_profile {
 	u32 active_ports;
 	u32 small_pkt_int;
 	u8 no_reset;
+	u8 num_tx_rings_p_up;
 	struct mlx4_en_port_profile prof[MLX4_MAX_PORTS + 1];
 };
 
@@ -477,9 +479,9 @@ struct mlx4_en_priv {
 	u16 num_frags;
 	u16 log_rx_info;
 
-	struct mlx4_en_tx_ring tx_ring[MAX_TX_RINGS];
+	struct mlx4_en_tx_ring *tx_ring;
 	struct mlx4_en_rx_ring rx_ring[MAX_RX_RINGS];
-	struct mlx4_en_cq tx_cq[MAX_TX_RINGS];
+	struct mlx4_en_cq *tx_cq;
 	struct mlx4_en_cq rx_cq[MAX_RX_RINGS];
 	struct work_struct mcast_task;
 	struct work_struct mac_task;
-- 
1.7.8.2

* Re: [PATCH 7/8] net: support tx_ring per UP in HW based QoS mechanism
  2012-03-13 17:22 ` [PATCH 7/8] net: support tx_ring per UP in HW based QoS mechanism Amir Vadai
@ 2012-03-13 18:23   ` John Fastabend
  2012-03-14 10:09     ` Amir Vadai
  0 siblings, 1 reply; 20+ messages in thread
From: John Fastabend @ 2012-03-13 18:23 UTC (permalink / raw)
  To: Amir Vadai
  Cc: David S. Miller, netdev, Roland Dreier, Oren Duer, Amir Vadai,
	Jeff Kirsher, Eilon Greenstein

On 3/13/2012 10:22 AM, Amir Vadai wrote:
> From: Amir Vadai <amirv@mellanox.co.il>
> 
> The current HW based QoS mechanism, which was introduced in commit 4f57c087de9
> "net: implement mechanism for HW based QOS", is oriented towards ETS traffic
> classes. This patch introduces an approach which allows this mechanism to be
> used also with hardware that has queues per user priority (UP). After the
> change, __skb_tx_hash() will direct a flow to a tx ring from a range of tx
> rings. This range is defined by the caller according to the specific HW: for
> TC based queues the range is selected by TC number, and for UP based queues
> the range is selected by UP.
> 

ETS is one specific use case for mqprio; it can easily be used with other
hardware transmission selection algorithms, 802.1Q standard based or otherwise.

The mapping is really just an skb->priority to group of queues (qoffset and
qcount). I happened to call the queue grouping a traffic class because that
aligns with 802.1Q but it _really_ is just a queue grouping.

In your case, what would it mean to change the map and num_tc? See 'tc':

[root@jf-dev1-dcblab netperf]# tc qdisc add dev eth3 root mqprio help
Usage: ... mqprio [num_tc NUMBER] [map P0 P1 ...]
                  [queues count1@offset1 count2@offset2 ...] [hw 1|0]

For example, setting 'num_tc 8 map 0 1 2 3 0 1 2 3' looks like it might
not work correctly. Would you end up with an skb tagged with priority
4,5,6,7 being sent out UP queues 0,1,2,3? My guess is that won't work
correctly with PFC or your ETS.
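
To make the question concrete, here is a userspace sketch (hypothetical names,
not kernel code) of the lookup from skb->priority to a queue grouping. It
assumes the layout from patch 5/8, one ring per class at offset 8 + class,
and the 'map' from the example above:

```c
#include <stdint.h>

/* A queue grouping: 'count' rings starting at ring 'offset' */
struct txq_group {
	uint16_t count;
	uint16_t offset;
};

/* 'map 0 1 2 3 0 1 2 3'; the remaining priorities default to class 0 */
static const uint8_t prio_to_tc[16] = { 0, 1, 2, 3, 0, 1, 2, 3 };

/* Assumed layout from patch 5/8: a single ring per class, after 8 hash rings */
static struct txq_group group_for_class(uint8_t tc)
{
	struct txq_group g = { 1, (uint16_t)(8 + tc) };

	return g;
}

static struct txq_group group_for_priority(uint32_t priority)
{
	return group_for_class(prio_to_tc[priority & 15]);
}
```

Under these assumptions priorities 1 and 5 both resolve to the ring at offset
9, which is the sharing the question is about.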

In the canonical iSCSI case we put iscsid in a net_prio cgroup to get the
priority set then use the priority to steer the skb to the correct queue
groupings. In your case I think you can just fail any num_tc != 8 and keep
the default 1:1 map, and then this should work. What did I miss? It looks
like you already fail the num_tc != 8 case, so why do we need this change?

At most maybe we need a flag to indicate the mqprio map can't be changed in
some cases.


> CC: John Fastabend <john.r.fastabend@intel.com>
> CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> CC: Eilon Greenstein <eilong@broadcom.com>
> Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |   11 ++++++++++-
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c      |    9 +++------
>  include/linux/netdevice.h                       |   12 +++++++++++-
>  include/linux/skbuff.h                          |    3 ++-
>  net/core/dev.c                                  |   10 +---------
>  5 files changed, 27 insertions(+), 18 deletions(-)
> 

[...]

>  
>  void bnx2x_set_num_queues(struct bnx2x *bp)
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 7a49830..d0d96e3 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -570,18 +570,15 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
>  
>  u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
>  {
> -	struct mlx4_en_priv *priv = netdev_priv(dev);
> -	int up = -1;
> +	int up = 0;
>  
>  	if (vlan_tx_tag_present(skb))
>  		up = (vlan_tx_tag_get(skb) >> 13);

I was trying to avoid logic like this in select_queue().

Can we get the same behavior by keeping the egress map and mqprio
map in sync?

>  	else if (dev->num_tc)
>  		up = netdev_get_prio_tc_map(dev, skb->priority);
>  
> -	if (up >= 0)
> -		return MLX4_EN_NUM_TX_RINGS + up;
> -
> -	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS);
> +	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS,
> +			MLX4_EN_NUM_TX_RINGS + up, 1);
>  }
>  

Thanks,
John

* Re: [PATCH 4/8] net/mlx4_en: Set max rate-limit for a TC
  2012-03-13 17:21 ` [PATCH 4/8] net/mlx4_en: Set max rate-limit for a TC Amir Vadai
@ 2012-03-13 18:26   ` John Fastabend
  2012-03-14 10:31     ` Amir Vadai
  2012-03-13 19:16   ` Dave Taht
  1 sibling, 1 reply; 20+ messages in thread
From: John Fastabend @ 2012-03-13 18:26 UTC (permalink / raw)
  To: Amir Vadai; +Cc: David S. Miller, netdev, Roland Dreier, Oren Duer, Amir Vadai

On 3/13/2012 10:21 AM, Amir Vadai wrote:
> From: Amir Vadai <amirv@mellanox.co.il>
> 
> Set max rate-limit using sysfs file /sys/class/net/<interface>/qos/ratelimit
> 
> To set, enter a space separated list of values in units of 100Mbps.  For
> example, to set a ratelimit of 5G for TC0 and 10G for the rest on eth2, issue:
> echo 50 100 100 100 100 100 100 100 > /sys/class/net/eth2/qos/ratelimit
> 
> Signed-off-by: Amir Vadai <amirv@mellanox.co.il>
> ---
> 
> We used sysfs since max bw isn't part of the ETS / DCBX NL support, and we're
> open to other suggestions to add generic support for max bw, e.g add call to
> the DCBX NL API.
> 

It's not really part of DCB, so adding it to DCBnl seems a bit forced. But how
about adding it as an attribute of the mqprio which "knows" about the queue
groupings?

Does the rate limiter take into account the user priority to tc mapping or
is it really just a group of queues with a rate limit? The code makes it
look like it really is per 802.1Q traffic class.

.John

* Re: [PATCH 4/8] net/mlx4_en: Set max rate-limit for a TC
  2012-03-13 17:21 ` [PATCH 4/8] net/mlx4_en: Set max rate-limit for a TC Amir Vadai
  2012-03-13 18:26   ` John Fastabend
@ 2012-03-13 19:16   ` Dave Taht
  2012-03-14 10:42     ` Amir Vadai
  1 sibling, 1 reply; 20+ messages in thread
From: Dave Taht @ 2012-03-13 19:16 UTC (permalink / raw)
  To: Amir Vadai; +Cc: David S. Miller, netdev, Roland Dreier, Oren Duer, Amir Vadai

On Tue, Mar 13, 2012 at 10:21 AM, Amir Vadai <amirv@mellanox.com> wrote:
> From: Amir Vadai <amirv@mellanox.co.il>
>
> Set max rate-limit using sysfs file /sys/class/net/<interface>/qos/ratelimit
>
> To set, enter a space separated list of values in units of 100Mbps.  For
> example, to set a ratelimit of 5G for TC0 and 10G for the rest on eth2, issue:
> echo 50 100 100 100 100 100 100 100 > /sys/class/net/eth2/qos/ratelimit

At least in my world, rates go down as low as 128k.

If this 'qos' sys value is intended to be a generic mechanism to be used
instead of (say) htb, for more than one type of ethernet device, I would
argue that the units need to be specified.
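
To make the granularity concern concrete, here is a small sketch of what the
proposed format can express. The helper name and the choice to reject inexact
rates are assumptions; the 100Mbps unit and the 0..100 range come from the
commit message and the store handler in the patch:

```c
#include <stdint.h>

#define RATELIMIT_UNIT_KBPS	100000u	/* 100 Mbps expressed in kbit/s */
#define RATELIMIT_MAX_UNITS	100u	/* the store handler rejects values > 100 */

/*
 * Convert a non-zero rate in kbit/s to a ratelimit value.
 * Returns -1 when the rate cannot be expressed exactly in this format.
 */
static int kbps_to_ratelimit(uint32_t kbps)
{
	uint32_t units = kbps / RATELIMIT_UNIT_KBPS;

	if (units == 0 || units > RATELIMIT_MAX_UNITS)
		return -1;
	if (kbps % RATELIMIT_UNIT_KBPS)
		return -1;

	return (int)units;
}
```

Under these assumptions 5 Gbit/s maps to 50, while 128 kbit/s has no
representation at all.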

Similarly what else is intended to live below the 'qos' subdir?

* Re: [PATCH 7/8] net: support tx_ring per UP in HW based QoS mechanism
  2012-03-13 18:23   ` John Fastabend
@ 2012-03-14 10:09     ` Amir Vadai
  2012-03-14 21:36       ` John Fastabend
  0 siblings, 1 reply; 20+ messages in thread
From: Amir Vadai @ 2012-03-14 10:09 UTC (permalink / raw)
  To: John Fastabend
  Cc: David S. Miller, netdev, Roland Dreier, Oren Duer, Amir Vadai,
	Jeff Kirsher, Eilon Greenstein

On 03/13/2012 08:23 PM, John Fastabend wrote:
> On 3/13/2012 10:22 AM, Amir Vadai wrote:
>> From: Amir Vadai<amirv@mellanox.co.il>
>>
>> The Current HW based QoS mechanism which was introduced in commit 4f57c087de9
>> "net: implement mechanism for HW based QOS" is in orientation to ETS traffic
>> class. This patch introduces an approach which allow to use this mechanism also
>> with hardware who has queues per user priority (UP). After the change,
>> __skb_tx_hash() will direct a flow to a tx ring from a range of tx rings. This
>> range is defined by the caller function by the specific HW. If TC based queues,
>> the range is by TC number and for UP based queues, the range is by UP.
>>
> ETS is one specific use case for mqprio it can easily be used with other
> hardware transmission selection algorithms 802.1Q std based or otherwise.
>
> The mapping is really just an skb->priority to group of queues (qoffset and
> qcount). I happened to call the queue grouping a traffic class because that
> aligns with 802.1Q but it _really_ is just a queue grouping.
This is good for untagged traffic, but for tagged traffic I can see 2
problems with this approach:
1. mqprio mapping could contradict egress map or 802.1Qaz mapping (UP <=>
TC). This could be solved (not very elegantly) by forcing the mappings
to be synced.
2. egress map is per vlan, and mqprio mapping is one global mapping.
>
> In your case what would it mean to change the map and num_tc see 'tc':
>
> [root@jf-dev1-dcblab netperf]# tc qdisc add dev eth3 root mqprio help
> Usage: ... mqprio [num_tc NUMBER] [map P0 P1 ...]
>                    [queues count1@offset1 count2@offset2 ...] [hw 1|0]
>
> For example setting 'num_tc 8 map 0 1 2 3 0 1 2 3' looks like it might
> not work correctly. Would you end up with an skb tagged with priority
> 4,5,6,7 being sent out UP queues 0,1,2,3? My guess is that won't work
> with PFC or your ETS correctly.
I don't see a problem here. For example, an skb tagged with priority 5 is
mapped to UP 1 and sent through one of the tx rings of UP 1. All the
rings of UP 1 share the same transmission queue (schedule queue), which
is controlled by PFC and ETS in the HW. What is the problem here?
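
To make the example concrete, here is a minimal userspace sketch of the
priority lookup for the map quoted above. The table and the helper name
prio_to_up() are hypothetical stand-ins, not driver code. The 16-entry
size and the masking are assumed to mirror netdev_get_prio_tc_map(), and
entries the map does not list are assumed to default to 0.

```c
#include <assert.h>

#define NUM_PRIO 16	/* assumed: 16 skb priorities are tracked */

/* Hypothetical copy of 'map 0 1 2 3 0 1 2 3'; unlisted entries stay 0. */
static const unsigned char prio_map[NUM_PRIO] = { 0, 1, 2, 3, 0, 1, 2, 3 };

/* Look up the UP (the mqprio "tc") that an skb priority is steered to. */
static unsigned int prio_to_up(unsigned int skb_priority)
{
	unsigned int up = prio_map[skb_priority & (NUM_PRIO - 1)];

	assert(up < 8);		/* only 8 user priorities exist */
	return up;
}
```

With this table prio_to_up(5) yields 1, so skb priorities 1 and 5 end up
on the rings of UP 1.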
>
> In the canonical iSCSI case we put iscsid in a net_prio cgroup to get the
> priority set then use the priority to steer the skb to the correct queue
> groupings. In your case I think you can just fail any num_tc != 8 and keep
> the dflt map 1:1 then this should work. What did I miss? It looks like you
> already fail the num_tc != 8 case so why do we need this change?
>
> At most maybe we need a flag to indicate the mqprio map can't be changed in
> some cases.
Are you suggesting that the priority in the net_prio cgroup will be the
User Priority, and not just the skb priority?
And also, for tagged traffic, how could it be forced to stay in sync with
the egress map?
>
>
>> CC: John Fastabend<john.r.fastabend@intel.com>
>> CC: Jeff Kirsher<jeffrey.t.kirsher@intel.com>
>> CC: Eilon Greenstein<eilong@broadcom.com>
>> Signed-off-by: Amir Vadai<amirv@mellanox.co.il>
>> ---
>>   drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |   11 ++++++++++-
>>   drivers/net/ethernet/mellanox/mlx4/en_tx.c      |    9 +++------
>>   include/linux/netdevice.h                       |   12 +++++++++++-
>>   include/linux/skbuff.h                          |    3 ++-
>>   net/core/dev.c                                  |   10 +---------
>>   5 files changed, 27 insertions(+), 18 deletions(-)
>>
> [...]
>
>>
>>   void bnx2x_set_num_queues(struct bnx2x *bp)
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> index 7a49830..d0d96e3 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>> @@ -570,18 +570,15 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
>>
>>   u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
>>   {
>> -	struct mlx4_en_priv *priv = netdev_priv(dev);
>> -	int up = -1;
>> +	int up = 0;
>>
>>   	if (vlan_tx_tag_present(skb))
>>   		up = (vlan_tx_tag_get(skb)>>  13);
> I was trying to avoid logic like this in select_queue().
Why?
>
> Can we get the same behavior by keeping the egress map and mqprio
> map in sync?
As I said above, if we force egress map to be synced to mqprio mapping,
we lose its power - mqprio is global, and egress map is per vlan.
>
>>   	else if (dev->num_tc)
>>   		up = netdev_get_prio_tc_map(dev, skb->priority);
>>
>> -	if (up>= 0)
>> -		return MLX4_EN_NUM_TX_RINGS + up;
>> -
>> -	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS);
>> +	return __skb_tx_hash(dev, skb, MLX4_EN_NUM_TX_RINGS,
>> +			MLX4_EN_NUM_TX_RINGS + up, 1);
>>   }
>>
> Thanks,
> John
Thanks,
Amir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/8] net/mlx4_en: Set max rate-limit for a TC
  2012-03-13 18:26   ` John Fastabend
@ 2012-03-14 10:31     ` Amir Vadai
  0 siblings, 0 replies; 20+ messages in thread
From: Amir Vadai @ 2012-03-14 10:31 UTC (permalink / raw)
  To: John Fastabend
  Cc: David S. Miller, netdev, Roland Dreier, Oren Duer, Amir Vadai

On 03/13/2012 08:26 PM, John Fastabend wrote:
> On 3/13/2012 10:21 AM, Amir Vadai wrote:
>> From: Amir Vadai<amirv@mellanox.co.il>
>>
>> Set max rate-limit using sysfs file /sys/class/net/<interface>/qos/ratelimit
>>
>> To set, enter a space separated list of values in units of 100Mbps.  For
>> example to set ratelimit of 5G to TC0 and 10G for the rest on eth2 issue:
>> echo 50 100 100 100 100 100 100 100 100 > /sys/class/net/eth2/qos/ratelimit
>>
>> Signed-off-by: Amir Vadai<amirv@mellanox.co.il>
>> ---
>>
>> We used sysfs since max bw isn't part of the ETS / DCBX NL support, and we're
>> open to other suggestions to add generic support for max bw, e.g add call to
>> the DCBX NL API.
>>
>
> Its not really part of DCB so adding it to DCBnl seems a bit forced. But how
> about adding it as an attribute of the mqprio which "knows" about the queue
> groupings?
See my answer below
>
> Does the rate limiter take into account the user priority to tc mapping or
> is it really just a group of queues with a rate limit? The code makes it
> look like it really is per 802.1Q traffic class.
That is true: the rate limiter is per 802.1Q traffic class, and the HW
uses the UP to TC mapping when enforcing it.
That's why it can't be added to mqprio, which describes a group of queues
and not an 802.1Q traffic class.
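
As an illustration only, the following sketch shows how a limit set per
TC ends up applying to a UP through the UP to TC mapping. Both tables
and the helper up_max_rate_mbps() are hypothetical; the limits use the
100Mbps units of the sysfs file and the values are made up.

```c
#include <assert.h>

#define NUM_UP 8
#define NUM_TC 8

/* Hypothetical UP -> TC table, as it might be set through DCB netlink. */
static const unsigned char up_to_tc[NUM_UP] = { 0, 0, 1, 1, 2, 2, 3, 3 };

/* Hypothetical per-TC limits in units of 100Mbps (50 == 5G). */
static const unsigned int tc_ratelimit[NUM_TC] = {
	50, 100, 100, 100, 100, 100, 100, 100
};

/* Return the max rate in Mbps that applies to traffic sent with 'up'. */
static unsigned int up_max_rate_mbps(unsigned int up)
{
	assert(up < NUM_UP);
	return tc_ratelimit[up_to_tc[up]] * 100;
}
```

Here UP 0 and UP 1 both map to TC 0 and therefore share the 5G limit.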
>
> .John

Thanks,
Amir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/8] net/mlx4_en: Set max rate-limit for a TC
  2012-03-13 19:16   ` Dave Taht
@ 2012-03-14 10:42     ` Amir Vadai
  0 siblings, 0 replies; 20+ messages in thread
From: Amir Vadai @ 2012-03-14 10:42 UTC (permalink / raw)
  To: Dave Taht; +Cc: David S. Miller, netdev, Roland Dreier, Oren Duer, Amir Vadai

On 03/13/2012 09:16 PM, Dave Taht wrote:
> On Tue, Mar 13, 2012 at 10:21 AM, Amir Vadai<amirv@mellanox.com>  wrote:
>> From: Amir Vadai<amirv@mellanox.co.il>
>>
>> Set max rate-limit using sysfs file /sys/class/net/<interface>/qos/ratelimit
>>
>> To set, enter a space separated list of values in units of 100Mbps.  For
>> example to set ratelimit of 5G to TC0 and 10G for the rest on eth2 issue:
>> echo 50 100 100 100 100 100 100 100 100 > /sys/class/net/eth2/qos/ratelimit
>
> At least in my world, rates go down as low as 128k.
>
> If this 'qos' sys value is intended to be a generic mechanism to be
> used instead of (say) htb,
> for more than one type of ethernet device.... I would argue that the
> units need to be specified.
>
> Similarly what else is intended to live below the 'qos' subdir?

This sys value is specific to our driver, and will be shown only on
mlx4_en netdevs.

We're open to suggestions to make it a generic sysfs entry. If it
really becomes generic, we will certainly use well-known units (for
example KB) and do the translation in our driver.
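
A rough sketch of what such a translation could look like, assuming the
generic unit were Kbps (an assumption, nothing is decided here) and the
HW keeps its 100Mbps granularity. The helper name kbps_to_hw_units() is
hypothetical, and rounding up is just one possible policy.

```c
#define KBPS_PER_HW_UNIT 100000u	/* one HW unit is 100Mbps */

/* Convert a rate in Kbps to HW units, rounding up to the next unit. */
static unsigned int kbps_to_hw_units(unsigned int kbps)
{
	return kbps / KBPS_PER_HW_UNIT + (kbps % KBPS_PER_HW_UNIT != 0);
}
```

A request for 128 Kbps becomes 1 unit, i.e. 100Mbps, which shows how
coarse this granularity is for low rates.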

Thanks,
Amir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 7/8] net: support tx_ring per UP in HW based QoS mechanism
  2012-03-14 10:09     ` Amir Vadai
@ 2012-03-14 21:36       ` John Fastabend
  2012-03-15 10:05         ` Amir Vadai
  0 siblings, 1 reply; 20+ messages in thread
From: John Fastabend @ 2012-03-14 21:36 UTC (permalink / raw)
  To: Amir Vadai
  Cc: David S. Miller, netdev, Roland Dreier, Oren Duer, Amir Vadai,
	Jeff Kirsher, Eilon Greenstein

On 3/14/2012 3:09 AM, Amir Vadai wrote:
> On 03/13/2012 08:23 PM, John Fastabend wrote:
>> On 3/13/2012 10:22 AM, Amir Vadai wrote:
>>> From: Amir Vadai<amirv@mellanox.co.il>
>>>
>>> The Current HW based QoS mechanism which was introduced in commit 4f57c087de9
>>> "net: implement mechanism for HW based QOS" is in orientation to ETS traffic
>>> class. This patch introduces an approach which allow to use this mechanism also
>>> with hardware who has queues per user priority (UP). After the change,
>>> __skb_tx_hash() will direct a flow to a tx ring from a range of tx rings. This
>>> range is defined by the caller function by the specific HW. If TC based queues,
>>> the range is by TC number and for UP based queues, the range is by UP.
>>>
>> ETS is one specific use case for mqprio it can easily be used with other
>> hardware transmission selection algorithms 802.1Q std based or otherwise.
>>
>> The mapping is really just an skb->priority to group of queues (qoffset and
>> qcount). I happened to call the queue grouping a traffic class because that
>> aligns with 802.1Q but it _really_ is just a queue grouping.

> This is good for untagged traffic, but for tagged traffic I can see 2 problems with this approach:
> 1. mqprio mapping could contradict egress map or 802.1Qaz mapping (UP
> <=> TC). This could be solved (not very elegantly) by forcing the
> mappings to be synced. 

OK. We've just been keeping them in-sync.

> 2. egress map is per vlan, and mqprio mapping
> is one global mapping.

So it only matters when you want the egress map per vlan? The problem
I see with this now is that the mellanox driver does DCB differently than
the existing drivers.

For example, if I put a task in a net_prio cgroup and assign the vlan
a priority, this won't actually steer the packet correctly on mlx. I
also have to create an egress map. Existing drivers will ignore the
egress_map and steer the skb as they always have.

At minimum skbs need to be steered the same on all drivers. We can't
expect user space to "know" what hardware is underneath.

>>
>> In your case what would it mean to change the map and num_tc see 'tc':
>>
>> [root@jf-dev1-dcblab netperf]# tc qdisc add dev eth3 root mqprio help
>> Usage: ... mqprio [num_tc NUMBER] [map P0 P1 ...]
>>                    [queues count1@offset1 count2@offset2 ...] [hw 1|0]
>>
>> For example setting 'num_tc 8 map 0 1 2 3 0 1 2 3' looks like it might
>> not work correctly. Would you end up with an skb tagged with priority
>> 4,5,6,7 being sent out UP queues 0,1,2,3? My guess is that won't work
>> with PFC or your ETS correctly.
> I don't see a problem here. For example, skb tagged with priority 5
> is mapped to UP 1. And sent through one of the tx rings of UP 1. All
> the rings of UP 1 share the same transmission queue (schedule queue)
> which is controlled by PFC and ETS by the HW. What is the problem
> here?

I was concerned about the actual tag that gets added. In ixgbe we've been
adding a tag based on skb->priority in the untagged pkt case. In your
driver, after looking at the code, either you're not adding a tag or the
hardware is adding the correct user priority to the priority tagged pkts.

We use the skb->priority in ixgbe because we can have multiple user
priorities (PCPs) on a single tx_ring with the above map. We have
no other way to know what the priority should be in the untagged
case.

>>
>> In the canonical iSCSI case we put iscsid in a net_prio cgroup to get the
>> priority set then use the priority to steer the skb to the correct queue
>> groupings. In your case I think you can just fail any num_tc != 8 and keep
>> the dflt map 1:1 then this should work. What did I miss? It looks like you
>> already fail the num_tc != 8 case so why do we need this change?
>>
>> At most maybe we need a flag to indicate the mqprio map can't be changed in
>> some cases.
> What you suggest is that the priority in net_prio cgroup will be the User Priority, and not just the skb priority?

That is how we are using it today, yes. Which creates the somewhat
unfortunate case (I guess) that the egress map has to be aligned
with the qdisc map. This hasn't caused any problems in practice for us.

> And also, for tagged traffic, how could it be forced to be synced with egress map?

There is a priority in net_prio.ifpriomap group for each vlan as well as
real device, so we just set up the mapping for the vlan.

This is how we do things like assign a vlan a default priority.


>> [...]
>>
>>>
>>>   void bnx2x_set_num_queues(struct bnx2x *bp)
>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>> index 7a49830..d0d96e3 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>> @@ -570,18 +570,15 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
>>>
>>>   u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
>>>   {
>>> -    struct mlx4_en_priv *priv = netdev_priv(dev);
>>> -    int up = -1;
>>> +    int up = 0;
>>>
>>>       if (vlan_tx_tag_present(skb))
>>>           up = (vlan_tx_tag_get(skb)>>  13);
>> I was trying to avoid logic like this in select_queue().
> Why?

Because this makes your driver potentially behave differently than
other drivers. DCB should look the same from the user side
regardless of the driver.

>>
>> Can we get the same behavior by keeping the egress map and mqprio
>> map in sync?
> As I said above, if we force egress map to be synced to mqprio mapping, we lose its power - mqprio is global, and egress map is per vlan.

Using net_prio cgroups per vlan allows per vlan priority mappings.
I agree this is a bit awkward right now, and it seems reasonable
to expect that setting the egress_map causes the skb steering to work
correctly.

The crux of the issue here is that ixgbe and bnx2x are modeling
the qdisc tc as a traffic class but your hardware is based on
a model that exposes user priorities. We need these to look the
same from the user perspective. We need to figure out how to
make this correct for both models. Any suggestions?

.John

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 7/8] net: support tx_ring per UP in HW based QoS mechanism
  2012-03-14 21:36       ` John Fastabend
@ 2012-03-15 10:05         ` Amir Vadai
  2012-03-16  7:16           ` John Fastabend
  0 siblings, 1 reply; 20+ messages in thread
From: Amir Vadai @ 2012-03-15 10:05 UTC (permalink / raw)
  To: John Fastabend
  Cc: David S. Miller, netdev, Roland Dreier, Oren Duer, Amir Vadai,
	Jeff Kirsher, Eilon Greenstein, Liran Liss, Yevgeny Petrilin,
	Or Gerlitz

On 03/14/2012 11:36 PM, John Fastabend wrote:
> On 3/14/2012 3:09 AM, Amir Vadai wrote:
>> On 03/13/2012 08:23 PM, John Fastabend wrote:
>>> On 3/13/2012 10:22 AM, Amir Vadai wrote:
>>>> From: Amir Vadai<amirv@mellanox.co.il>
>>>>
>>>> The Current HW based QoS mechanism which was introduced in commit 4f57c087de9
>>>> "net: implement mechanism for HW based QOS" is in orientation to ETS traffic
>>>> class. This patch introduces an approach which allow to use this mechanism also
>>>> with hardware who has queues per user priority (UP). After the change,
>>>> __skb_tx_hash() will direct a flow to a tx ring from a range of tx rings. This
>>>> range is defined by the caller function by the specific HW. If TC based queues,
>>>> the range is by TC number and for UP based queues, the range is by UP.
>>>>
>>> ETS is one specific use case for mqprio it can easily be used with other
>>> hardware transmission selection algorithms 802.1Q std based or otherwise.
>>>
>>> The mapping is really just an skb->priority to group of queues (qoffset and
>>> qcount). I happened to call the queue grouping a traffic class because that
>>> aligns with 802.1Q but it _really_ is just a queue grouping.
>
>> This is good for untagged traffic, but for tagged traffic I can see 2 problems with this approach:
>> 1. mqprio mapping could contradict egress map or 802.1Qaz mapping (UP
>> <=>  TC). This could be solved (not very elegantly) by forcing the
>> mappings to be synced.
>
> OK. We've just been keeping them in-sync.
>
>> 2. egress map is per vlan, and mqprio mapping
>> is one global mapping.
>
> So it only matters when you want the egress map per vlan? The problem
> I see with this now is that the mellanox driver does DCB differently than
> the existing drivers.
>
> For example, if I put a task in a net_prio cgroup and assign the vlan
> a priority, this won't actually steer the packet correctly on mlx. I
> also have to create an egress map. Existing drivers will ignore the
> egress_map and steer the skb as they always have.
But if you don't create an egress map for tagged traffic, what will be
in the PCP field of the vlan tag (= User Priority)?
>
> At minimum skbs need to be steered the same on all drivers. We can't
> expect user space to "know" what hardware is underneath.
>
>>>
>>> In your case what would it mean to change the map and num_tc see 'tc':
>>>
>>> [root@jf-dev1-dcblab netperf]# tc qdisc add dev eth3 root mqprio help
>>> Usage: ... mqprio [num_tc NUMBER] [map P0 P1 ...]
>>>                     [queues count1@offset1 count2@offset2 ...] [hw 1|0]
>>>
>>> For example setting 'num_tc 8 map 0 1 2 3 0 1 2 3' looks like it might
>>> not work correctly. Would you end up with an skb tagged with priority
>>> 4,5,6,7 being sent out UP queues 0,1,2,3? My guess is that won't work
>>> with PFC or your ETS correctly.
>> I don't see a problem here. For example, skb tagged with priority 5
>> is mapped to UP 1. And sent through one of the tx rings of UP 1. All
>> the rings of UP 1 share the same transmission queue (schedule queue)
>> which is controlled by PFC and ETS by the HW. What is the problem
>> here?
>
> I was concerned about the actual tag that gets added. In ixgbe we've been
> adding a tag based on skb->priority in the untagged pkt case. In your
> driver, after looking at the code, either you're not adding a tag or the
> hardware is adding the correct user priority to the priority tagged pkts.
It is added by the HW according to the tx_ring.
>
> We use the skb->priority in ixgbe because we can have multiple user
> priorities (PCPs) on a single tx_ring with the above map. We have
> no other way to know what the priority should be in the untagged
> case.
Instead of attaching a tx_ring to an ETS TC like your driver does, in our
driver a tx_ring is attached to a single user priority (UP).
With this UP and the UP <=> TC mapping configured in the HW through DCB
netlink, the HW can enforce the 802.1Qaz attributes of the mapped TC.
For tagged traffic this UP is also used in the PCP field of the vlan tag.
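
For reference, a small sketch of how the UP sits in the top three bits of
the 16-bit vlan TCI, which is what the '>> 13' in select_queue() relies
on. The helper names tci_build() and tci_to_up() are hypothetical; the
shift and mask follow the standard 802.1Q tag layout, and the DEI/CFI
bit is simply left at 0 here.

```c
#include <stdint.h>

#define VLAN_PRIO_SHIFT	13	/* PCP occupies bits 15..13 of the TCI */
#define VLAN_VID_MASK	0x0fff	/* VLAN ID occupies bits 11..0 */

/* Build a TCI from a user priority (0..7) and a VLAN ID (0..4095). */
static uint16_t tci_build(unsigned int up, unsigned int vid)
{
	return (uint16_t)(((up & 0x7) << VLAN_PRIO_SHIFT) |
			  (vid & VLAN_VID_MASK));
}

/* Recover the user priority from a TCI. */
static unsigned int tci_to_up(uint16_t tci)
{
	return tci >> VLAN_PRIO_SHIFT;
}
```

For example, tci_build(5, 100) gives 0xa064, and tci_to_up() on that
value returns 5 again.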
>
>>>
>>> In the canonical iSCSI case we put iscsid in a net_prio cgroup to get the
>>> priority set then use the priority to steer the skb to the correct queue
>>> groupings. In your case I think you can just fail any num_tc != 8 and keep
>>> the dflt map 1:1 then this should work. What did I miss? It looks like you
>>> already fail the num_tc != 8 case so why do we need this change?
>>>
>>> At most maybe we need a flag to indicate the mqprio map can't be changed in
>>> some cases.
>> What you suggest is that the priority in net_prio cgroup will be the User Priority, and not just the skb priority?
>
> That is how we are using it today, yes. Which creates the somewhat
> unfortunate case (I guess) that the egress map has to be aligned
> with the qdisc map. This hasn't caused any problems in practice for us.
>
>> And also, for tagged traffic, how could it be forced to be synced with egress map?
>
> there is a priority in net_prio.ifpriomap group for each vlan as well as
> real device so we just setup the mapping for the vlan.
>
> This is how we do things like assign a vlan a default priority.
>
>
>>> [...]
>>>
>>>>
>>>>    void bnx2x_set_num_queues(struct bnx2x *bp)
>>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>>> index 7a49830..d0d96e3 100644
>>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>>> @@ -570,18 +570,15 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
>>>>
>>>>    u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
>>>>    {
>>>> -    struct mlx4_en_priv *priv = netdev_priv(dev);
>>>> -    int up = -1;
>>>> +    int up = 0;
>>>>
>>>>        if (vlan_tx_tag_present(skb))
>>>>            up = (vlan_tx_tag_get(skb)>>   13);
>>> I was trying to avoid logic like this in select_queue().
>> Why?
>
> Because this makes your driver potentially behave differently than
> other drivers. DCB should look the same from the user side
> regardless of the driver.
I agree - it should look the same.
>
>>>
>>> Can we get the same behavior by keeping the egress map and mqprio
>>> map in sync?
>> As I said above, if we force egress map to be synced to mqprio mapping, we lose its power - mqprio is global, and egress map is per vlan.
>
> using net_prio cgroups per vlan allows per vlan priority mappings.
> I agree this is a bit awkward right now and it seems reasonable
> to expect setting the egress_map causes the skb steering to work
> correctly.
>
> The crux of the issue here is that ixgbe and bnx2x are modeling
> the qdisc tc as a traffic class but your hardware is based on
> a model that exposes user priorities. We need these to look the
> same from the user perspective. We need to figure out how to
> make this correct for both models. Any suggestions?
Before suggesting, I need to make sure I understand the current model:

Assumptions
-----------
a. If tagged traffic is involved, egress map is configured 1:1 and
    therefore, skb priority = User Priority (UP)
b. mqprio traffic class is ETS TC

Mappings in use
---------------
1. net_prio cgroup: netdev + task <=> skb priority
2. SO_PRIORITY/IP_TOS: skb priority
3. mqprio: skb priority <=> traffic class
4. DCB netlink: UP <=> ETS TC

Untagged traffic
----------------
User is using [1] or [2] to tag a flow with priority.
Driver is using [3] to steer traffic according to 802.1Qaz ETS attributes.

Tagged traffic
--------------
User is using [1] or [2] to tag a flow with priority
Driver is setting PCP bits in vlan header using skb priority (1:1 in 
egress map).
Traffic is steered using [3].

Mapping [4] must be synced with [3].
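
The sync requirement can be sketched as a simple check. This is only an
illustration under assumption (a), i.e. skb priority == UP for
priorities 0..7; the function name maps_in_sync() and both table
arguments are hypothetical and do not correspond to an existing kernel
helper.

```c
#include <stdbool.h>

#define NUM_UP 8

/*
 * Return true when mapping [3] (skb priority -> mqprio tc) and
 * mapping [4] (UP -> ETS TC) agree for every user priority.
 */
static bool maps_in_sync(const unsigned char *prio_tc_map,
			 const unsigned char *up_to_ets_tc)
{
	unsigned int up;

	for (up = 0; up < NUM_UP; up++)
		if (prio_tc_map[up] != up_to_ets_tc[up])
			return false;
	return true;
}
```

User space (or a driver) could run such a check after either mapping
changes and refuse or fix up a configuration where the two disagree.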

>
> .John


- Amir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 7/8] net: support tx_ring per UP in HW based QoS mechanism
  2012-03-15 10:05         ` Amir Vadai
@ 2012-03-16  7:16           ` John Fastabend
  0 siblings, 0 replies; 20+ messages in thread
From: John Fastabend @ 2012-03-16  7:16 UTC (permalink / raw)
  To: Amir Vadai
  Cc: David S. Miller, netdev, Roland Dreier, Oren Duer, Amir Vadai,
	Jeff Kirsher, Eilon Greenstein, Liran Liss, Yevgeny Petrilin,
	Or Gerlitz

On 3/15/2012 3:05 AM, Amir Vadai wrote:
> On 03/14/2012 11:36 PM, John Fastabend wrote:
>> On 3/14/2012 3:09 AM, Amir Vadai wrote:
>>> On 03/13/2012 08:23 PM, John Fastabend wrote:
>>>> On 3/13/2012 10:22 AM, Amir Vadai wrote:
>>>>> From: Amir Vadai<amirv@mellanox.co.il>
>>>>>
>>>>> The Current HW based QoS mechanism which was introduced in commit 4f57c087de9
>>>>> "net: implement mechanism for HW based QOS" is in orientation to ETS traffic
>>>>> class. This patch introduces an approach which allow to use this mechanism also
>>>>> with hardware who has queues per user priority (UP). After the change,
>>>>> __skb_tx_hash() will direct a flow to a tx ring from a range of tx rings. This
>>>>> range is defined by the caller function by the specific HW. If TC based queues,
>>>>> the range is by TC number and for UP based queues, the range is by UP.
>>>>>
>>>> ETS is one specific use case for mqprio it can easily be used with other
>>>> hardware transmission selection algorithms 802.1Q std based or otherwise.
>>>>
>>>> The mapping is really just an skb->priority to group of queues (qoffset and
>>>> qcount). I happened to call the queue grouping a traffic class because that
>>>> aligns with 802.1Q but it _really_ is just a queue grouping.
>>
>>> This is good for untagged traffic, but for tagged traffic I can see 2 problems with this approach:
>>> 1. mqprio mapping could contradict egress map or 802.1Qaz mapping (UP
>>> <=>  TC). This could be solved (not very elegantly) by forcing the
>>> mappings to be synced.
>>
>> OK. We've just been keeping them in-sync.
>>
>>> 2. egress map is per vlan, and mqprio mapping
>>> is one global mapping.
>>
>> So it only matters when you want the egress map per vlan? The problem
>> I see with this now is that the mellanox driver does DCB differently than
>> the existing drivers.
>>
>> For example, if I put a task in a net_prio cgroup and assign the vlan
>> a priority, this won't actually steer the packet correctly on mlx. I
>> also have to create an egress map. Existing drivers will ignore the
>> egress_map and steer the skb as they always have.
> But if you don't create an egress map for tagged traffic, what will be in the PCP field of the vlan tag (= User Priority)?
>>
>> At minimum skbs need to be steered the same on all drivers. We can't
>> expect user space to "know" what hardware is underneath.
>>
>>>>
>>>> In your case what would it mean to change the map and num_tc see 'tc':
>>>>
>>>> [root@jf-dev1-dcblab netperf]# tc qdisc add dev eth3 root mqprio help
>>>> Usage: ... mqprio [num_tc NUMBER] [map P0 P1 ...]
>>>>                     [queues count1@offset1 count2@offset2 ...] [hw 1|0]
>>>>
>>>> For example setting 'num_tc 8 map 0 1 2 3 0 1 2 3' looks like it might
>>>> not work correctly. Would you end up with an skb tagged with priority
>>>> 4,5,6,7 being sent out UP queues 0,1,2,3? My guess is that won't work
>>>> with PFC or your ETS correctly.
>>> I don't see a problem here. For example, skb tagged with priority 5
>>> is mapped to UP 1. And sent through one of the tx rings of UP 1. All
>>> the rings of UP 1 share the same transmission queue (schedule queue)
>>> which is controlled by PFC and ETS by the HW. What is the problem
>>> here?
>>
>> I was concerned about the actual tag that gets added. In ixgbe we've been
>> adding a tag based on skb->priority in the untagged pkt case. In your
>> driver, after looking at the code, either you're not adding a tag or the
>> hardware is adding the correct user priority to the priority tagged pkts.
> It is added by the HW according to the tx_ring.
>>
>> We use the skb->priority in ixgbe because we can have multiple user
>> priorities (PCPs) on a single tx_ring with the above map. We have
>> no other way to know what the priority should be in the untagged
>> case.
> Instead of attaching a tx_ring to ETS TC like your driver does, in our driver a tx_ring is attached to a single user priority (UP).
> With this UP and the mapping UP <=> TC configured by DCB netlink to the HW, the HW can enforce the 802.1Qaz attributes by the mapped TC.
> For tagged traffic this UP is also used in the PCP field in the vlan tag.
>>
>>>>
>>>> In the canonical iSCSI case we put iscsid in a net_prio cgroup to get the
>>>> priority set then use the priority to steer the skb to the correct queue
>>>> groupings. In your case I think you can just fail any num_tc != 8 and keep
>>>> the dflt map 1:1 then this should work. What did I miss? It looks like you
>>>> already fail the num_tc != 8 case so why do we need this change?
>>>>
>>>> At most maybe we need a flag to indicate the mqprio map can't be changed in
>>>> some cases.
>>> What you suggest is that the priority in net_prio cgroup will be the User Priority, and not just the skb priority?
>>
>> That is how we are using it today, yes. Which creates the somewhat
>> unfortunate case (I guess) that the egress map has to be aligned
>> with the qdisc map. This hasn't caused any problems in practice for us.
>>
>>> And also, for tagged traffic, how could it be forced to be synced with egress map?
>>
>> there is a priority in net_prio.ifpriomap group for each vlan as well as
>> real device so we just setup the mapping for the vlan.
>>
>> This is how we do things like assign a vlan a default priority.
>>
>>
>>>> [...]
>>>>
>>>>>
>>>>>    void bnx2x_set_num_queues(struct bnx2x *bp)
>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>>>> index 7a49830..d0d96e3 100644
>>>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>>>> @@ -570,18 +570,15 @@ static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc, struct sk_buff *sk
>>>>>
>>>>>    u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb)
>>>>>    {
>>>>> -    struct mlx4_en_priv *priv = netdev_priv(dev);
>>>>> -    int up = -1;
>>>>> +    int up = 0;
>>>>>
>>>>>        if (vlan_tx_tag_present(skb))
>>>>>            up = (vlan_tx_tag_get(skb)>>   13);
>>>> I was trying to avoid logic like this in select_queue().
>>> Why?
>>
>> Because this makes your driver potentially behave differently than
>> other drivers. DCB should look the same from the user side
>> regardless of the driver.
> I agree - it should look the same.
>>
>>>>
>>>> Can we get the same behavior by keeping the egress map and mqprio
>>>> map in sync?
>>> As I said above, if we force egress map to be synced to mqprio mapping, we lose its power - mqprio is global, and egress map is per vlan.
>>
>> using net_prio cgroups per vlan allows per vlan priority mappings.
>> I agree this is a bit awkward right now and it seems reasonable
>> to expect setting the egress_map causes the skb steering to work
>> correctly.
>>
>> The crux of the issue here is that ixgbe and bnx2x are modeling
>> the qdisc tc as a traffic class but your hardware is based on
>> a model that exposes user priorities. We need these to look the
>> same from the user perspective. We need to figure out how to
>> make this correct for both models. Any suggestions?
> Before suggesting, I need to make sure I understand the current model:
> 
> Assumptions
> -----------
> a. If tagged traffic is involved, egress map is configured 1:1 and
>    therefore, skb priority = User Priority (UP)

Right, this is how we currently do this.

> b. mqprio traffic class is ETS TC

For IEEE 802.1Qaz, yes, this is the model.

> 
> Mappings in use
> ---------------
> 1. net_prio cgroup: netdev + task <=> skb priority
> 2. SO_PRIORITY/IP_TOS: skb priority
> 3. mqprio: skb priority <=> traffic class
> 4. DCB netlink: UP <=> ETS TC
> 

Yep, with DCB this is how we currently do the mappings.

> Untagged traffic
> ----------------
> User is using [1] or [2] to tag a flow with priority.
> Driver is using [3] to steer traffic according to 802.1Qaz ETS attributes.
> 

correct

> Tagged traffic
> --------------
> User is using [1] or [2] to tag a flow with priority
> Driver is setting PCP bits in vlan header using skb priority (1:1 in egress map).
> Traffic is steered using [3].
> 
> Mapping [4] must be synced with [3].
> 

Correct, this is how it currently works. It might be worth thinking
about how to get the egress map to work correctly in these cases,
although right now user space can keep these in sync and manage
them.

>>
>> .John
> 
> 
> - Amir

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/8] net/mlx4_en: DCB QoS support
  2012-03-13 17:21 [PATCH 0/8] net/mlx4_en: DCB QoS support Amir Vadai
                   ` (7 preceding siblings ...)
  2012-03-13 17:22 ` [PATCH 8/8] net/mlx4_en: num cores tx rings for every UP Amir Vadai
@ 2012-03-20 11:29 ` Amir Vadai
  2012-03-20 19:58   ` David Miller
  8 siblings, 1 reply; 20+ messages in thread
From: Amir Vadai @ 2012-03-20 11:29 UTC (permalink / raw)
  To: David S. Miller, John Fastabend
  Cc: netdev, Roland Dreier, Oren Duer, Yevgeny Petrilin

On 03/13/2012 07:21 PM, Amir Vadai wrote:
> DCBX version 802.1qaz is supported.
> User Priority (UP) is set in QP context instead of in WQE (QP Work Queue
> Element), which means that all traffic from a queue will have the same UP.
> UP is also set for untagged traffic to be able to classify such traffic too.
>
> Mapping from sk_prio to User Priority is done by sch_mqprio mapping. Although
> confusingly sch_mqprio maps sk_prio to something called TC, it is not related
> to DCBX's TC, and is interpreted by mlx4_en driver as UP.
>
> The Current HW based QoS mechanism which was introduced in commit 4f57c087de9
> "net: implement mechanism for HW based QOS" is in orientation to ETS traffic
> class. Patch 7/8 introduces an approach which allow to use this mechanism also
> with hardware who has queues per user priority (UP). After the change,
> __skb_tx_hash() will direct a flow to a tx ring from a range of tx rings. This
> range is defined by the caller function by the specific HW. If TC based queues,
> the range is by TC number and for UP based queues, the range is by UP.
>
> Amir Vadai (8):
>    net/mlx4_en: Force user priority by QP attribute
>    net/mlx4_core: set port QoS attributes
>    net/mlx4_en: DCB QoS support
>    net/mlx4_en: Set max rate-limit for a TC
>    net/mlx4_en: sk_prio <=> UP for untagged traffic
>    IB/rdma_cm: TOS <=> UP mapping for IBoE
>    net: support tx_ring per UP in HW based QoS mechanism
>    net/mlx4_en: num cores tx rings for every UP
>
>   drivers/infiniband/core/cma.c                     |   35 ++++-
>   drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |   11 +-
>   drivers/net/ethernet/mellanox/mlx4/Kconfig        |   12 ++
>   drivers/net/ethernet/mellanox/mlx4/Makefile       |    1 +
>   drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c    |  215 +++++++++++++++++++++
>   drivers/net/ethernet/mellanox/mlx4/en_main.c      |    6 +-
>   drivers/net/ethernet/mellanox/mlx4/en_netdev.c    |   64 ++++++-
>   drivers/net/ethernet/mellanox/mlx4/en_port.h      |    2 +
>   drivers/net/ethernet/mellanox/mlx4/en_resources.c |    6 +-
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c        |    4 +-
>   drivers/net/ethernet/mellanox/mlx4/en_sysfs.c     |  120 ++++++++++++
>   drivers/net/ethernet/mellanox/mlx4/en_tx.c        |   20 +-
>   drivers/net/ethernet/mellanox/mlx4/mlx4.h         |   20 ++
>   drivers/net/ethernet/mellanox/mlx4/mlx4_en.h      |   38 +++-
>   drivers/net/ethernet/mellanox/mlx4/port.c         |   62 ++++++
>   include/linux/mlx4/cmd.h                          |    4 +
>   include/linux/mlx4/device.h                       |    3 +
>   include/linux/mlx4/qp.h                           |    3 +-
>   include/linux/netdevice.h                         |   12 +-
>   include/linux/skbuff.h                            |    3 +-
>   net/core/dev.c                                    |   10 +-
>   21 files changed, 615 insertions(+), 36 deletions(-)
>   create mode 100644 drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
>   create mode 100644 drivers/net/ethernet/mellanox/mlx4/en_sysfs.c
>


Hi Dave,

Patches 7-8, which deal with the interaction between the kernel HW QoS
constructs and the queue selection logic, are still under discussion with
John, and some changes might be needed there.
At this point, we ask that patches 1-6 be pulled in, and we will continue
the discussion from there.

Thanks,
Amir

* Re: [PATCH 0/8] net/mlx4_en: DCB QoS support
  2012-03-20 11:29 ` [PATCH 0/8] net/mlx4_en: DCB QoS support Amir Vadai
@ 2012-03-20 19:58   ` David Miller
  0 siblings, 0 replies; 20+ messages in thread
From: David Miller @ 2012-03-20 19:58 UTC (permalink / raw)
  To: amirv; +Cc: john.r.fastabend, netdev, roland, oren, yevgenyp

From: Amir Vadai <amirv@mellanox.com>
Date: Tue, 20 Mar 2012 13:29:11 +0200

> Patches 7-8, which deal with the interaction between the kernel HW QoS
> constructs and the queue selection logic, are still under discussion
> with John, and some changes might be needed there.
> At this point, we ask that patches 1-6 be pulled in, and we will
> continue the discussion from there.

That's not how this works.

I toss out an entire series when some of the patches need changes.

You thus must resubmit freshly any subset you want applied.

Thread overview: 20+ messages
2012-03-13 17:21 [PATCH 0/8] net/mlx4_en: DCB QoS support Amir Vadai
2012-03-13 17:21 ` [PATCH 1/8] net/mlx4_en: Force user priority by QP attribute Amir Vadai
2012-03-13 17:21 ` [PATCH 2/8] net/mlx4_core: set port QoS attributes Amir Vadai
2012-03-13 17:21 ` [PATCH 3/8] net/mlx4_en: DCB QoS support Amir Vadai
2012-03-13 17:21 ` [PATCH 4/8] net/mlx4_en: Set max rate-limit for a TC Amir Vadai
2012-03-13 18:26   ` John Fastabend
2012-03-14 10:31     ` Amir Vadai
2012-03-13 19:16   ` Dave Taht
2012-03-14 10:42     ` Amir Vadai
2012-03-13 17:22 ` [PATCH 5/8] net/mlx4_en: sk_prio <=> UP for untagged traffic Amir Vadai
2012-03-13 17:22 ` [PATCH 6/8] IB/rdma_cm: TOS <=> UP mapping for IBoE Amir Vadai
2012-03-13 17:22 ` [PATCH 7/8] net: support tx_ring per UP in HW based QoS mechanism Amir Vadai
2012-03-13 18:23   ` John Fastabend
2012-03-14 10:09     ` Amir Vadai
2012-03-14 21:36       ` John Fastabend
2012-03-15 10:05         ` Amir Vadai
2012-03-16  7:16           ` John Fastabend
2012-03-13 17:22 ` [PATCH 8/8] net/mlx4_en: num cores tx rings for every UP Amir Vadai
2012-03-20 11:29 ` [PATCH 0/8] net/mlx4_en: DCB QoS support Amir Vadai
2012-03-20 19:58   ` David Miller
