* [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup
From: Vladimir Oltean @ 2023-01-20 14:15 UTC (permalink / raw)
  To: netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Vinicius Costa Gomes, Alexander Duyck, Kurt Kanzenbach,
	Ferenc Fejes, Tony Nguyen, Jesse Brandeburg, Jacob Keller

I realize that this patch set will start a flame war, but there are
things about the mqprio qdisc that I simply don't understand, so in an
attempt to explain how I think things should be done, I've made some
patches to the code. I hope the reviewers will be patient enough with me :)

I need to touch mqprio because I'm preparing a patch set for Frame
Preemption (an IEEE 802.1Q feature). A disagreement started with
Vinicius here:
https://patchwork.kernel.org/project/netdevbpf/patch/20220816222920.1952936-3-vladimir.oltean@nxp.com/#24976672

regarding how TX packet prioritization should be handled. Vinicius said
that for some Intel NICs, prioritization at the egress scheduler stage
is fundamentally attached to TX queues rather than traffic classes.

In other words, in the "popular" mqprio configuration documented by him:

$ tc qdisc replace dev $IFACE parent root handle 100 mqprio \
      num_tc 3 \
      map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
      queues 1@0 1@1 2@2 \
      hw 0

there are 3 Linux traffic classes and 4 TX queues. The TX queues are
organized in strict priority fashion, like this: TXQ 0 has highest prio
(hardware dequeue precedence for TX scheduler), TXQ 3 has lowest prio.
Packets classified by Linux to TC 2 are hashed between TXQ 2 and TXQ 3,
but the hardware gives TXQ 2 higher precedence than TXQ 3, and Linux
doesn't know that.
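
For context, here is roughly how the stack spreads traffic within a TC
today. This is a simplified sketch along the lines of skb_tx_hash() from
net/core/dev.c (not the verbatim kernel code, and it ignores XPS): once
skb->priority is mapped to a TC, the TXQ within that TC's range is
picked by flow hash, with no notion of any per-TXQ hardware precedence.

/* Simplified sketch of skb_tx_hash()-style TXQ selection. The flow hash
 * spreads packets evenly across the TC's [offset, offset + count) TXQ
 * range, so TXQ 2 and TXQ 3 above look interchangeable to Linux.
 */
static u16 sketch_tx_hash(const struct net_device *dev, struct sk_buff *skb)
{
	u16 qoffset = 0, qcount = dev->real_num_tx_queues;

	if (dev->num_tc) {
		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);

		qoffset = dev->tc_to_txq[tc].offset;
		qcount = dev->tc_to_txq[tc].count;
	}

	return (u16)reciprocal_scale(skb_get_hash(skb), qcount) + qoffset;
}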

I am surprised by this fact, and this isn't how ENETC works at all.
For ENETC, we try to prioritize on TCs rather than TXQs, and TC 7 has
higher priority than TC 0. For us, groups of TXQs that map to the same
TC have the same egress scheduling priority. It is possible (and maybe
useful) to have 2 TXQs per TC (one TXQ per CPU). Patch 07/11 tries to
make that clearer.

Furthermore (and this is really the biggest point of contention),
Vinicius and I have a fundamental disagreement about whether the 802.1Qbv
(taprio) gate mask should be passed to the device driver per TXQ or per
TC. This is what patch 11/11 is about.

Again, I'm not *certain* that my opinion on this topic is correct
(and it sure is confusing to see such a different approach for Intel).
But I would appreciate any feedback.

Vladimir Oltean (11):
  net/sched: mqprio: refactor nlattr parsing to a separate function
  net/sched: mqprio: refactor offloading and unoffloading to dedicated
    functions
  net/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to
    pkt_sched.h
  net/sched: mqprio: allow offloading drivers to request queue count
    validation
  net/sched: mqprio: add extack messages for queue count validation
  net: enetc: request mqprio to validate the queue counts
  net: enetc: act upon the requested mqprio queue configuration
  net/sched: taprio: pass mqprio queue configuration to ndo_setup_tc()
  net: enetc: act upon mqprio queue config in taprio offload
  net/sched: taprio: validate that gate mask does not exceed number of
    TCs
  net/sched: taprio: only calculate gate mask per TXQ for igc

 drivers/net/ethernet/freescale/enetc/enetc.c  |  67 ++--
 .../net/ethernet/freescale/enetc/enetc_qos.c  |  27 +-
 drivers/net/ethernet/intel/igc/igc_main.c     |  17 +
 include/net/pkt_cls.h                         |  10 -
 include/net/pkt_sched.h                       |  16 +
 net/sched/sch_mqprio.c                        | 298 +++++++++++-------
 net/sched/sch_taprio.c                        |  57 ++--
 7 files changed, 310 insertions(+), 182 deletions(-)

-- 
2.34.1



* [RFC PATCH net-next 01/11] net/sched: mqprio: refactor nlattr parsing to a separate function
From: Vladimir Oltean @ 2023-01-20 14:15 UTC (permalink / raw)
  To: netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Vinicius Costa Gomes, Alexander Duyck, Kurt Kanzenbach,
	Ferenc Fejes, Tony Nguyen, Jesse Brandeburg, Jacob Keller

mqprio_init() is quite large and unwieldy to add more code to.
Split the netlink attribute parsing into a dedicated function.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_mqprio.c | 114 +++++++++++++++++++++++------------------
 1 file changed, 63 insertions(+), 51 deletions(-)

diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index 4c68abaa289b..d2d8a02ded05 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -130,6 +130,67 @@ static int parse_attr(struct nlattr *tb[], int maxtype, struct nlattr *nla,
 	return 0;
 }
 
+static int mqprio_parse_nlattr(struct Qdisc *sch, struct tc_mqprio_qopt *qopt,
+			       struct nlattr *opt)
+{
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	struct nlattr *tb[TCA_MQPRIO_MAX + 1];
+	struct nlattr *attr;
+	int i, rem, err;
+
+	err = parse_attr(tb, TCA_MQPRIO_MAX, opt, mqprio_policy,
+			 sizeof(*qopt));
+	if (err < 0)
+		return err;
+
+	if (!qopt->hw)
+		return -EINVAL;
+
+	if (tb[TCA_MQPRIO_MODE]) {
+		priv->flags |= TC_MQPRIO_F_MODE;
+		priv->mode = *(u16 *)nla_data(tb[TCA_MQPRIO_MODE]);
+	}
+
+	if (tb[TCA_MQPRIO_SHAPER]) {
+		priv->flags |= TC_MQPRIO_F_SHAPER;
+		priv->shaper = *(u16 *)nla_data(tb[TCA_MQPRIO_SHAPER]);
+	}
+
+	if (tb[TCA_MQPRIO_MIN_RATE64]) {
+		if (priv->shaper != TC_MQPRIO_SHAPER_BW_RATE)
+			return -EINVAL;
+		i = 0;
+		nla_for_each_nested(attr, tb[TCA_MQPRIO_MIN_RATE64],
+				    rem) {
+			if (nla_type(attr) != TCA_MQPRIO_MIN_RATE64)
+				return -EINVAL;
+			if (i >= qopt->num_tc)
+				break;
+			priv->min_rate[i] = *(u64 *)nla_data(attr);
+			i++;
+		}
+		priv->flags |= TC_MQPRIO_F_MIN_RATE;
+	}
+
+	if (tb[TCA_MQPRIO_MAX_RATE64]) {
+		if (priv->shaper != TC_MQPRIO_SHAPER_BW_RATE)
+			return -EINVAL;
+		i = 0;
+		nla_for_each_nested(attr, tb[TCA_MQPRIO_MAX_RATE64],
+				    rem) {
+			if (nla_type(attr) != TCA_MQPRIO_MAX_RATE64)
+				return -EINVAL;
+			if (i >= qopt->num_tc)
+				break;
+			priv->max_rate[i] = *(u64 *)nla_data(attr);
+			i++;
+		}
+		priv->flags |= TC_MQPRIO_F_MAX_RATE;
+	}
+
+	return 0;
+}
+
 static int mqprio_init(struct Qdisc *sch, struct nlattr *opt,
 		       struct netlink_ext_ack *extack)
 {
@@ -139,9 +200,6 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt,
 	struct Qdisc *qdisc;
 	int i, err = -EOPNOTSUPP;
 	struct tc_mqprio_qopt *qopt = NULL;
-	struct nlattr *tb[TCA_MQPRIO_MAX + 1];
-	struct nlattr *attr;
-	int rem;
 	int len;
 
 	BUILD_BUG_ON(TC_MAX_QUEUE != TC_QOPT_MAX_QUEUE);
@@ -166,55 +224,9 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt,
 
 	len = nla_len(opt) - NLA_ALIGN(sizeof(*qopt));
 	if (len > 0) {
-		err = parse_attr(tb, TCA_MQPRIO_MAX, opt, mqprio_policy,
-				 sizeof(*qopt));
-		if (err < 0)
+		err = mqprio_parse_nlattr(sch, qopt, opt);
+		if (err)
 			return err;
-
-		if (!qopt->hw)
-			return -EINVAL;
-
-		if (tb[TCA_MQPRIO_MODE]) {
-			priv->flags |= TC_MQPRIO_F_MODE;
-			priv->mode = *(u16 *)nla_data(tb[TCA_MQPRIO_MODE]);
-		}
-
-		if (tb[TCA_MQPRIO_SHAPER]) {
-			priv->flags |= TC_MQPRIO_F_SHAPER;
-			priv->shaper = *(u16 *)nla_data(tb[TCA_MQPRIO_SHAPER]);
-		}
-
-		if (tb[TCA_MQPRIO_MIN_RATE64]) {
-			if (priv->shaper != TC_MQPRIO_SHAPER_BW_RATE)
-				return -EINVAL;
-			i = 0;
-			nla_for_each_nested(attr, tb[TCA_MQPRIO_MIN_RATE64],
-					    rem) {
-				if (nla_type(attr) != TCA_MQPRIO_MIN_RATE64)
-					return -EINVAL;
-				if (i >= qopt->num_tc)
-					break;
-				priv->min_rate[i] = *(u64 *)nla_data(attr);
-				i++;
-			}
-			priv->flags |= TC_MQPRIO_F_MIN_RATE;
-		}
-
-		if (tb[TCA_MQPRIO_MAX_RATE64]) {
-			if (priv->shaper != TC_MQPRIO_SHAPER_BW_RATE)
-				return -EINVAL;
-			i = 0;
-			nla_for_each_nested(attr, tb[TCA_MQPRIO_MAX_RATE64],
-					    rem) {
-				if (nla_type(attr) != TCA_MQPRIO_MAX_RATE64)
-					return -EINVAL;
-				if (i >= qopt->num_tc)
-					break;
-				priv->max_rate[i] = *(u64 *)nla_data(attr);
-				i++;
-			}
-			priv->flags |= TC_MQPRIO_F_MAX_RATE;
-		}
 	}
 
 	/* pre-allocate qdisc, attachment can't fail */
-- 
2.34.1



* [RFC PATCH net-next 02/11] net/sched: mqprio: refactor offloading and unoffloading to dedicated functions
From: Vladimir Oltean @ 2023-01-20 14:15 UTC (permalink / raw)
  To: netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Vinicius Costa Gomes, Alexander Duyck, Kurt Kanzenbach,
	Ferenc Fejes, Tony Nguyen, Jesse Brandeburg, Jacob Keller

Some more logic will be added to mqprio offloading, so split that code
up from mqprio_init(), which is already large, and create a new
function, mqprio_enable_offload(), similar to taprio_enable_offload().
Also create the opposite function mqprio_disable_offload().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_mqprio.c | 102 ++++++++++++++++++++++++-----------------
 1 file changed, 59 insertions(+), 43 deletions(-)

diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index d2d8a02ded05..3579a64da06e 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -27,6 +27,61 @@ struct mqprio_sched {
 	u64 max_rate[TC_QOPT_MAX_QUEUE];
 };
 
+static int mqprio_enable_offload(struct Qdisc *sch,
+				 const struct tc_mqprio_qopt *qopt)
+{
+	struct tc_mqprio_qopt_offload mqprio = {.qopt = *qopt};
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	int err, i;
+
+	switch (priv->mode) {
+	case TC_MQPRIO_MODE_DCB:
+		if (priv->shaper != TC_MQPRIO_SHAPER_DCB)
+			return -EINVAL;
+		break;
+	case TC_MQPRIO_MODE_CHANNEL:
+		mqprio.flags = priv->flags;
+		if (priv->flags & TC_MQPRIO_F_MODE)
+			mqprio.mode = priv->mode;
+		if (priv->flags & TC_MQPRIO_F_SHAPER)
+			mqprio.shaper = priv->shaper;
+		if (priv->flags & TC_MQPRIO_F_MIN_RATE)
+			for (i = 0; i < mqprio.qopt.num_tc; i++)
+				mqprio.min_rate[i] = priv->min_rate[i];
+		if (priv->flags & TC_MQPRIO_F_MAX_RATE)
+			for (i = 0; i < mqprio.qopt.num_tc; i++)
+				mqprio.max_rate[i] = priv->max_rate[i];
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_MQPRIO,
+					    &mqprio);
+	if (err)
+		return err;
+
+	priv->hw_offload = mqprio.qopt.hw;
+
+	return 0;
+}
+
+static void mqprio_disable_offload(struct Qdisc *sch)
+{
+	struct tc_mqprio_qopt_offload mqprio = { { 0 } };
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+
+	switch (priv->mode) {
+	case TC_MQPRIO_MODE_DCB:
+	case TC_MQPRIO_MODE_CHANNEL:
+		dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_MQPRIO,
+					      &mqprio);
+		break;
+	}
+}
+
 static void mqprio_destroy(struct Qdisc *sch)
 {
 	struct net_device *dev = qdisc_dev(sch);
@@ -41,22 +96,10 @@ static void mqprio_destroy(struct Qdisc *sch)
 		kfree(priv->qdiscs);
 	}
 
-	if (priv->hw_offload && dev->netdev_ops->ndo_setup_tc) {
-		struct tc_mqprio_qopt_offload mqprio = { { 0 } };
-
-		switch (priv->mode) {
-		case TC_MQPRIO_MODE_DCB:
-		case TC_MQPRIO_MODE_CHANNEL:
-			dev->netdev_ops->ndo_setup_tc(dev,
-						      TC_SETUP_QDISC_MQPRIO,
-						      &mqprio);
-			break;
-		default:
-			return;
-		}
-	} else {
+	if (priv->hw_offload && dev->netdev_ops->ndo_setup_tc)
+		mqprio_disable_offload(sch);
+	else
 		netdev_set_num_tc(dev, 0);
-	}
 }
 
 static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt)
@@ -253,36 +296,9 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt,
 	 * supplied and verified mapping
 	 */
 	if (qopt->hw) {
-		struct tc_mqprio_qopt_offload mqprio = {.qopt = *qopt};
-
-		switch (priv->mode) {
-		case TC_MQPRIO_MODE_DCB:
-			if (priv->shaper != TC_MQPRIO_SHAPER_DCB)
-				return -EINVAL;
-			break;
-		case TC_MQPRIO_MODE_CHANNEL:
-			mqprio.flags = priv->flags;
-			if (priv->flags & TC_MQPRIO_F_MODE)
-				mqprio.mode = priv->mode;
-			if (priv->flags & TC_MQPRIO_F_SHAPER)
-				mqprio.shaper = priv->shaper;
-			if (priv->flags & TC_MQPRIO_F_MIN_RATE)
-				for (i = 0; i < mqprio.qopt.num_tc; i++)
-					mqprio.min_rate[i] = priv->min_rate[i];
-			if (priv->flags & TC_MQPRIO_F_MAX_RATE)
-				for (i = 0; i < mqprio.qopt.num_tc; i++)
-					mqprio.max_rate[i] = priv->max_rate[i];
-			break;
-		default:
-			return -EINVAL;
-		}
-		err = dev->netdev_ops->ndo_setup_tc(dev,
-						    TC_SETUP_QDISC_MQPRIO,
-						    &mqprio);
+		err = mqprio_enable_offload(sch, qopt);
 		if (err)
 			return err;
-
-		priv->hw_offload = mqprio.qopt.hw;
 	} else {
 		netdev_set_num_tc(dev, qopt->num_tc);
 		for (i = 0; i < qopt->num_tc; i++)
-- 
2.34.1



* [RFC PATCH net-next 03/11] net/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to pkt_sched.h
From: Vladimir Oltean @ 2023-01-20 14:15 UTC (permalink / raw)
  To: netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Vinicius Costa Gomes, Alexander Duyck, Kurt Kanzenbach,
	Ferenc Fejes, Tony Nguyen, Jesse Brandeburg, Jacob Keller

Since mqprio is a scheduler and not a classifier, move its offload
structure to pkt_sched.h, where struct tc_taprio_qopt_offload also lies.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/net/pkt_cls.h   | 10 ----------
 include/net/pkt_sched.h | 10 ++++++++++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 4cabb32a2ad9..cd410a87517b 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -788,16 +788,6 @@ struct tc_cls_bpf_offload {
 	bool exts_integrated;
 };
 
-struct tc_mqprio_qopt_offload {
-	/* struct tc_mqprio_qopt must always be the first element */
-	struct tc_mqprio_qopt qopt;
-	u16 mode;
-	u16 shaper;
-	u32 flags;
-	u64 min_rate[TC_QOPT_MAX_QUEUE];
-	u64 max_rate[TC_QOPT_MAX_QUEUE];
-};
-
 /* This structure holds cookie structure that is passed from user
  * to the kernel for actions and classifiers
  */
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 38207873eda6..6c5e64e0a0bb 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -160,6 +160,16 @@ struct tc_etf_qopt_offload {
 	s32 queue;
 };
 
+struct tc_mqprio_qopt_offload {
+	/* struct tc_mqprio_qopt must always be the first element */
+	struct tc_mqprio_qopt qopt;
+	u16 mode;
+	u16 shaper;
+	u32 flags;
+	u64 min_rate[TC_QOPT_MAX_QUEUE];
+	u64 max_rate[TC_QOPT_MAX_QUEUE];
+};
+
 struct tc_taprio_caps {
 	bool supports_queue_max_sdu:1;
 };
-- 
2.34.1



* [RFC PATCH net-next 04/11] net/sched: mqprio: allow offloading drivers to request queue count validation
From: Vladimir Oltean @ 2023-01-20 14:15 UTC (permalink / raw)
  To: netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Vinicius Costa Gomes, Alexander Duyck, Kurt Kanzenbach,
	Ferenc Fejes, Tony Nguyen, Jesse Brandeburg, Jacob Keller

mqprio_parse_opt() proudly has a comment:

	/* If hardware offload is requested we will leave it to the device
	 * to either populate the queue counts itself or to validate the
	 * provided queue counts.
	 */

Unfortunately some device drivers did not get this memo, and don't
validate the queue counts.

Introduce a tc capability, and make mqprio query it.
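
As a usage sketch, this is how a driver would opt in from its
ndo_setup_tc() handler (hypothetical "foo" driver; the real enetc hookup
follows in patch 06/11). Drivers that never handle TC_QUERY_CAPS receive
a zeroed caps structure from qdisc_offload_query_caps(), so their
behavior is unchanged:

/* Hypothetical driver sketch: opt in to mqprio queue count validation. */
static int foo_setup_tc(struct net_device *dev, enum tc_setup_type type,
			void *type_data)
{
	switch (type) {
	case TC_QUERY_CAPS: {
		struct tc_query_caps_base *base = type_data;

		if (base->type == TC_SETUP_QDISC_MQPRIO) {
			struct tc_mqprio_caps *caps = base->caps;

			caps->validate_queue_counts = true;
			return 0;
		}
		return -EOPNOTSUPP;
	}
	default:
		return -EOPNOTSUPP;
	}
}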

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/net/pkt_sched.h |  4 +++
 net/sched/sch_mqprio.c  | 58 +++++++++++++++++++++++++++--------------
 2 files changed, 42 insertions(+), 20 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 6c5e64e0a0bb..02e3ccfbc7d1 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -160,6 +160,10 @@ struct tc_etf_qopt_offload {
 	s32 queue;
 };
 
+struct tc_mqprio_caps {
+	bool validate_queue_counts:1;
+};
+
 struct tc_mqprio_qopt_offload {
 	/* struct tc_mqprio_qopt must always be the first element */
 	struct tc_mqprio_qopt qopt;
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index 3579a64da06e..5fdceab82ea1 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -27,14 +27,50 @@ struct mqprio_sched {
 	u64 max_rate[TC_QOPT_MAX_QUEUE];
 };
 
+static int mqprio_validate_queue_counts(struct net_device *dev,
+					const struct tc_mqprio_qopt *qopt)
+{
+	int i, j;
+
+	for (i = 0; i < qopt->num_tc; i++) {
+		unsigned int last = qopt->offset[i] + qopt->count[i];
+
+		/* Verify the queue count is in tx range being equal to the
+		 * real_num_tx_queues indicates the last queue is in use.
+		 */
+		if (qopt->offset[i] >= dev->real_num_tx_queues ||
+		    !qopt->count[i] ||
+		    last > dev->real_num_tx_queues)
+			return -EINVAL;
+
+		/* Verify that the offset and counts do not overlap */
+		for (j = i + 1; j < qopt->num_tc; j++) {
+			if (last > qopt->offset[j])
+				return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
 static int mqprio_enable_offload(struct Qdisc *sch,
 				 const struct tc_mqprio_qopt *qopt)
 {
 	struct tc_mqprio_qopt_offload mqprio = {.qopt = *qopt};
 	struct mqprio_sched *priv = qdisc_priv(sch);
 	struct net_device *dev = qdisc_dev(sch);
+	struct tc_mqprio_caps caps;
 	int err, i;
 
+	qdisc_offload_query_caps(dev, TC_SETUP_QDISC_MQPRIO,
+				 &caps, sizeof(caps));
+
+	if (caps.validate_queue_counts) {
+		err = mqprio_validate_queue_counts(dev, qopt);
+		if (err)
+			return err;
+	}
+
 	switch (priv->mode) {
 	case TC_MQPRIO_MODE_DCB:
 		if (priv->shaper != TC_MQPRIO_SHAPER_DCB)
@@ -104,7 +140,7 @@ static void mqprio_destroy(struct Qdisc *sch)
 
 static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt)
 {
-	int i, j;
+	int i;
 
 	/* Verify num_tc is not out of max range */
 	if (qopt->num_tc > TC_MAX_QUEUE)
@@ -131,25 +167,7 @@ static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt)
 	if (qopt->hw)
 		return dev->netdev_ops->ndo_setup_tc ? 0 : -EINVAL;
 
-	for (i = 0; i < qopt->num_tc; i++) {
-		unsigned int last = qopt->offset[i] + qopt->count[i];
-
-		/* Verify the queue count is in tx range being equal to the
-		 * real_num_tx_queues indicates the last queue is in use.
-		 */
-		if (qopt->offset[i] >= dev->real_num_tx_queues ||
-		    !qopt->count[i] ||
-		    last > dev->real_num_tx_queues)
-			return -EINVAL;
-
-		/* Verify that the offset and counts do not overlap */
-		for (j = i + 1; j < qopt->num_tc; j++) {
-			if (last > qopt->offset[j])
-				return -EINVAL;
-		}
-	}
-
-	return 0;
+	return mqprio_validate_queue_counts(dev, qopt);
 }
 
 static const struct nla_policy mqprio_policy[TCA_MQPRIO_MAX + 1] = {
-- 
2.34.1



* [RFC PATCH net-next 05/11] net/sched: mqprio: add extack messages for queue count validation
From: Vladimir Oltean @ 2023-01-20 14:15 UTC (permalink / raw)
  To: netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Vinicius Costa Gomes, Alexander Duyck, Kurt Kanzenbach,
	Ferenc Fejes, Tony Nguyen, Jesse Brandeburg, Jacob Keller

To make mqprio more user-friendly, create netlink extended ack messages
which say exactly what is wrong with the queue counts. This uses the
new support for printf-formatted extack messages.

Example:

$ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
	map 0 1 2 3 4 5 6 7 queues 3@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1
Error: sch_mqprio: Queues 1:1 for TC 1 overlap with last TX queue 3 for TC 0.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_mqprio.c | 40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
index 5fdceab82ea1..4cd6d47cc7a1 100644
--- a/net/sched/sch_mqprio.c
+++ b/net/sched/sch_mqprio.c
@@ -28,25 +28,42 @@ struct mqprio_sched {
 };
 
 static int mqprio_validate_queue_counts(struct net_device *dev,
-					const struct tc_mqprio_qopt *qopt)
+					const struct tc_mqprio_qopt *qopt,
+					struct netlink_ext_ack *extack)
 {
 	int i, j;
 
 	for (i = 0; i < qopt->num_tc; i++) {
 		unsigned int last = qopt->offset[i] + qopt->count[i];
 
+		if (!qopt->count[i]) {
+			NL_SET_ERR_MSG_FMT_MOD(extack, "No queues for TC %d",
+					       i);
+			return -EINVAL;
+		}
+
 		/* Verify the queue count is in tx range being equal to the
 		 * real_num_tx_queues indicates the last queue is in use.
 		 */
 		if (qopt->offset[i] >= dev->real_num_tx_queues ||
-		    !qopt->count[i] ||
-		    last > dev->real_num_tx_queues)
+		    last > dev->real_num_tx_queues) {
+			NL_SET_ERR_MSG_FMT_MOD(extack,
+					       "Queues %d:%d for TC %d exceed the %d TX queues available",
+					       qopt->count[i], qopt->offset[i],
+					       i, dev->real_num_tx_queues);
 			return -EINVAL;
+		}
 
 		/* Verify that the offset and counts do not overlap */
 		for (j = i + 1; j < qopt->num_tc; j++) {
-			if (last > qopt->offset[j])
+			if (last > qopt->offset[j]) {
+				NL_SET_ERR_MSG_FMT_MOD(extack,
+						       "Queues %d:%d for TC %d overlap with last TX queue %d for TC %d",
+						       qopt->count[j],
+						       qopt->offset[j],
+						       j, last, i);
 				return -EINVAL;
+			}
 		}
 	}
 
@@ -54,7 +71,8 @@ static int mqprio_validate_queue_counts(struct net_device *dev,
 }
 
 static int mqprio_enable_offload(struct Qdisc *sch,
-				 const struct tc_mqprio_qopt *qopt)
+				 const struct tc_mqprio_qopt *qopt,
+				 struct netlink_ext_ack *extack)
 {
 	struct tc_mqprio_qopt_offload mqprio = {.qopt = *qopt};
 	struct mqprio_sched *priv = qdisc_priv(sch);
@@ -66,7 +84,7 @@ static int mqprio_enable_offload(struct Qdisc *sch,
 				 &caps, sizeof(caps));
 
 	if (caps.validate_queue_counts) {
-		err = mqprio_validate_queue_counts(dev, qopt);
+		err = mqprio_validate_queue_counts(dev, qopt, extack);
 		if (err)
 			return err;
 	}
@@ -138,7 +156,9 @@ static void mqprio_destroy(struct Qdisc *sch)
 		netdev_set_num_tc(dev, 0);
 }
 
-static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt)
+static int mqprio_parse_opt(struct net_device *dev,
+			    struct tc_mqprio_qopt *qopt,
+			    struct netlink_ext_ack *extack)
 {
 	int i;
 
@@ -167,7 +187,7 @@ static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt)
 	if (qopt->hw)
 		return dev->netdev_ops->ndo_setup_tc ? 0 : -EINVAL;
 
-	return mqprio_validate_queue_counts(dev, qopt);
+	return mqprio_validate_queue_counts(dev, qopt, extack);
 }
 
 static const struct nla_policy mqprio_policy[TCA_MQPRIO_MAX + 1] = {
@@ -280,7 +300,7 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt,
 		return -EINVAL;
 
 	qopt = nla_data(opt);
-	if (mqprio_parse_opt(dev, qopt))
+	if (mqprio_parse_opt(dev, qopt, extack))
 		return -EINVAL;
 
 	len = nla_len(opt) - NLA_ALIGN(sizeof(*qopt));
@@ -314,7 +334,7 @@ static int mqprio_init(struct Qdisc *sch, struct nlattr *opt,
 	 * supplied and verified mapping
 	 */
 	if (qopt->hw) {
-		err = mqprio_enable_offload(sch, qopt);
+		err = mqprio_enable_offload(sch, qopt, extack);
 		if (err)
 			return err;
 	} else {
-- 
2.34.1



* [RFC PATCH net-next 06/11] net: enetc: request mqprio to validate the queue counts
From: Vladimir Oltean @ 2023-01-20 14:15 UTC (permalink / raw)
  To: netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Vinicius Costa Gomes, Alexander Duyck, Kurt Kanzenbach,
	Ferenc Fejes, Tony Nguyen, Jesse Brandeburg, Jacob Keller

The enetc driver does not validate the mqprio queue configuration, so it
currently allows things like this:

$ tc qdisc add dev swp0 root handle 1: mqprio num_tc 8 \
	map 0 1 2 3 4 5 6 7 queues 3@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1

By requesting validation via the mqprio capability structure, this is no
longer allowed, and needs no custom code in the driver.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/ethernet/freescale/enetc/enetc_qos.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/freescale/enetc/enetc_qos.c b/drivers/net/ethernet/freescale/enetc/enetc_qos.c
index fcebb54224c0..6e0b4dd91509 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc_qos.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc_qos.c
@@ -1611,6 +1611,13 @@ int enetc_qos_query_caps(struct net_device *ndev, void *type_data)
 	struct enetc_si *si = priv->si;
 
 	switch (base->type) {
+	case TC_SETUP_QDISC_MQPRIO: {
+		struct tc_mqprio_caps *caps = base->caps;
+
+		caps->validate_queue_counts = true;
+
+		return 0;
+	}
 	case TC_SETUP_QDISC_TAPRIO: {
 		struct tc_taprio_caps *caps = base->caps;
 
-- 
2.34.1



* [RFC PATCH net-next 07/11] net: enetc: act upon the requested mqprio queue configuration
From: Vladimir Oltean @ 2023-01-20 14:15 UTC (permalink / raw)
  To: netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Vinicius Costa Gomes, Alexander Duyck, Kurt Kanzenbach,
	Ferenc Fejes, Tony Nguyen, Jesse Brandeburg, Jacob Keller

Regardless of the requested queue count per traffic class, the enetc
driver allocates a number of TX rings equal to the number of TCs, and
hardcodes a queue configuration of "1@0 1@1 ... 1@max-tc". Other
configurations are silently ignored and treated the same.

Improve that by allowing what the user requests to be actually
fulfilled. This allows more than one TX ring per traffic class.
For example:

$ tc qdisc add dev eno0 root handle 1: mqprio num_tc 4 \
	map 0 0 1 1 2 2 3 3 queues 2@0 2@2 2@4 2@6
[  146.267648] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0
[  146.273451] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 0
[  146.283280] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 1
[  146.293987] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 1
[  146.300467] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 2
[  146.306866] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 2
[  146.313261] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 3
[  146.319622] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 3
$ tc qdisc del dev eno0 root
[  178.238418] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0
[  178.244369] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 0
[  178.251486] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 0
[  178.258006] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 0
[  178.265038] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 0
[  178.271557] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 0
[  178.277910] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 0
[  178.284281] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 0
$ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
	map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 1
[  186.113162] fsl_enetc 0000:00:00.0 eno0: TX ring 0 prio 0
[  186.118764] fsl_enetc 0000:00:00.0 eno0: TX ring 1 prio 1
[  186.124374] fsl_enetc 0000:00:00.0 eno0: TX ring 2 prio 2
[  186.130765] fsl_enetc 0000:00:00.0 eno0: TX ring 3 prio 3
[  186.136404] fsl_enetc 0000:00:00.0 eno0: TX ring 4 prio 4
[  186.142049] fsl_enetc 0000:00:00.0 eno0: TX ring 5 prio 5
[  186.147674] fsl_enetc 0000:00:00.0 eno0: TX ring 6 prio 6
[  186.153305] fsl_enetc 0000:00:00.0 eno0: TX ring 7 prio 7

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/ethernet/freescale/enetc/enetc.c | 67 +++++++++++++-------
 1 file changed, 45 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/freescale/enetc/enetc.c b/drivers/net/ethernet/freescale/enetc/enetc.c
index ef21d6baed24..062a0ba6c959 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc.c
@@ -2614,6 +2614,15 @@ static int enetc_reconfigure(struct enetc_ndev_priv *priv, bool extended,
 	return err;
 }
 
+static void enetc_debug_tx_ring_prios(struct enetc_ndev_priv *priv)
+{
+	int i;
+
+	for (i = 0; i < priv->num_tx_rings; i++)
+		netdev_dbg(priv->ndev, "TX ring %d prio %d\n", i,
+			   priv->tx_ring[i]->prio);
+}
+
 int enetc_setup_tc_mqprio(struct net_device *ndev, void *type_data)
 {
 	struct enetc_ndev_priv *priv = netdev_priv(ndev);
@@ -2621,8 +2630,10 @@ int enetc_setup_tc_mqprio(struct net_device *ndev, void *type_data)
 	struct enetc_hw *hw = &priv->si->hw;
 	struct enetc_bdr *tx_ring;
 	int num_stack_tx_queues;
+	int offset, count;
+	int tc, i, q;
 	u8 num_tc;
-	int i;
+	int err;
 
 	num_stack_tx_queues = enetc_num_stack_tx_queues(priv);
 	mqprio->hw = TC_MQPRIO_HW_OFFLOAD_TCS;
@@ -2639,34 +2650,46 @@ int enetc_setup_tc_mqprio(struct net_device *ndev, void *type_data)
 			enetc_set_bdr_prio(hw, tx_ring->index, tx_ring->prio);
 		}
 
+		enetc_debug_tx_ring_prios(priv);
+
 		return 0;
 	}
 
-	/* Check if we have enough BD rings available to accommodate all TCs */
-	if (num_tc > num_stack_tx_queues) {
-		netdev_err(ndev, "Max %d traffic classes supported\n",
-			   priv->num_tx_rings);
-		return -EINVAL;
-	}
+	err = netdev_set_num_tc(ndev, num_tc);
+	if (err)
+		return err;
 
-	/* For the moment, we use only one BD ring per TC.
-	 *
-	 * Configure num_tc BD rings with increasing priorities.
-	 */
-	for (i = 0; i < num_tc; i++) {
-		tx_ring = priv->tx_ring[i];
-		tx_ring->prio = i;
-		enetc_set_bdr_prio(hw, tx_ring->index, tx_ring->prio);
-	}
+	num_stack_tx_queues = 0;
+
+	for (tc = 0; tc < num_tc; tc++) {
+		offset = mqprio->offset[tc];
+		count = mqprio->count[tc];
 
-	/* Reset the number of netdev queues based on the TC count */
-	netif_set_real_num_tx_queues(ndev, num_tc);
+		err = netdev_set_tc_queue(ndev, tc, count, offset);
+		if (err)
+			return err;
+
+		for (q = offset; q < offset + count; q++) {
+			tx_ring = priv->tx_ring[q];
+			/* The prio_tc_map is skb_tx_hash()'s way of selecting
+			 * between TX queues based on skb->priority. As such,
+			 * there's nothing to offload based on it.
+			 * Make the mqprio "traffic class" be the priority of
+			 * this ring group, and leave the Tx IPV to traffic
+			 * class mapping as its default mapping value of 1:1.
+			 */
+			tx_ring->prio = tc;
+			enetc_set_bdr_prio(hw, tx_ring->index, tx_ring->prio);
 
-	netdev_set_num_tc(ndev, num_tc);
+			num_stack_tx_queues++;
+		}
+	}
+
+	err = netif_set_real_num_tx_queues(ndev, num_stack_tx_queues);
+	if (err)
+		return err;
 
-	/* Each TC is associated with one netdev queue */
-	for (i = 0; i < num_tc; i++)
-		netdev_set_tc_queue(ndev, i, 1, i);
+	enetc_debug_tx_ring_prios(priv);
 
 	return 0;
 }
-- 
2.34.1



* [RFC PATCH net-next 08/11] net/sched: taprio: pass mqprio queue configuration to ndo_setup_tc()
From: Vladimir Oltean @ 2023-01-20 14:15 UTC (permalink / raw)
  To: netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Vinicius Costa Gomes, Alexander Duyck, Kurt Kanzenbach,
	Ferenc Fejes, Tony Nguyen, Jesse Brandeburg, Jacob Keller

For some reason I cannot understand, the taprio offload does not pass
the mqprio queue configuration down to the offloading device driver.
So the driver cannot act upon the TX queue counts and prio->tc map.
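
With the mqprio configuration embedded in struct tc_taprio_qopt_offload,
a driver's taprio handler can read the validated queue layout directly.
A hypothetical sketch (patch 09/11 does the real thing for enetc):

/* Hypothetical sketch: inspect the TC count and per-TC TXQ ranges that
 * taprio now passes down alongside the schedule.
 */
static int foo_setup_taprio(struct net_device *dev,
			    struct tc_taprio_qopt_offload *offload)
{
	struct tc_mqprio_qopt *qopt = &offload->mqprio.qopt;
	int tc;

	for (tc = 0; tc < qopt->num_tc; tc++)
		netdev_dbg(dev, "TC %d: %d TXQs starting at %d\n",
			   tc, qopt->count[tc], qopt->offset[tc]);

	return 0;
}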

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/net/pkt_sched.h | 1 +
 net/sched/sch_taprio.c  | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 02e3ccfbc7d1..ace8be520fb0 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -187,6 +187,7 @@ struct tc_taprio_sched_entry {
 };
 
 struct tc_taprio_qopt_offload {
+	struct tc_mqprio_qopt_offload mqprio;
 	u8 enable;
 	ktime_t base_time;
 	u64 cycle_time;
diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 570389f6cdd7..8f832fa82745 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -1228,6 +1228,7 @@ static void taprio_sched_to_offload(struct net_device *dev,
 static int taprio_enable_offload(struct net_device *dev,
 				 struct taprio_sched *q,
 				 struct sched_gate_list *sched,
+				 const struct tc_mqprio_qopt *mqprio,
 				 struct netlink_ext_ack *extack)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
@@ -1261,6 +1262,8 @@ static int taprio_enable_offload(struct net_device *dev,
 		return -ENOMEM;
 	}
 	offload->enable = 1;
+	if (mqprio)
+		offload->mqprio.qopt = *mqprio;
 	taprio_sched_to_offload(dev, sched, offload);
 
 	for (tc = 0; tc < TC_MAX_QUEUE; tc++)
@@ -1617,7 +1620,7 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 	}
 
 	if (FULL_OFFLOAD_IS_ENABLED(q->flags))
-		err = taprio_enable_offload(dev, q, new_admin, extack);
+		err = taprio_enable_offload(dev, q, new_admin, mqprio, extack);
 	else
 		err = taprio_disable_offload(dev, q, extack);
 	if (err)
-- 
2.34.1



* [RFC PATCH net-next 09/11] net: enetc: act upon mqprio queue config in taprio offload
From: Vladimir Oltean @ 2023-01-20 14:15 UTC (permalink / raw)
  To: netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Vinicius Costa Gomes, Alexander Duyck, Kurt Kanzenbach,
	Ferenc Fejes, Tony Nguyen, Jesse Brandeburg, Jacob Keller

We assume that the mqprio queue configuration from taprio has a simple
1:1 mapping between prio and traffic class, and one TX queue per TC.
That might not be the case. Actually parse and act upon the mqprio
config.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 .../net/ethernet/freescale/enetc/enetc_qos.c  | 20 ++++++-------------
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/freescale/enetc/enetc_qos.c b/drivers/net/ethernet/freescale/enetc/enetc_qos.c
index 6e0b4dd91509..130ebf6853e6 100644
--- a/drivers/net/ethernet/freescale/enetc/enetc_qos.c
+++ b/drivers/net/ethernet/freescale/enetc/enetc_qos.c
@@ -136,29 +136,21 @@ int enetc_setup_tc_taprio(struct net_device *ndev, void *type_data)
 {
 	struct tc_taprio_qopt_offload *taprio = type_data;
 	struct enetc_ndev_priv *priv = netdev_priv(ndev);
-	struct enetc_hw *hw = &priv->si->hw;
-	struct enetc_bdr *tx_ring;
-	int err;
-	int i;
+	int err, i;
 
 	/* TSD and Qbv are mutually exclusive in hardware */
 	for (i = 0; i < priv->num_tx_rings; i++)
 		if (priv->tx_ring[i]->tsd_enable)
 			return -EBUSY;
 
-	for (i = 0; i < priv->num_tx_rings; i++) {
-		tx_ring = priv->tx_ring[i];
-		tx_ring->prio = taprio->enable ? i : 0;
-		enetc_set_bdr_prio(hw, tx_ring->index, tx_ring->prio);
-	}
+	err = enetc_setup_tc_mqprio(ndev, &taprio->mqprio);
+	if (err)
+		return err;
 
 	err = enetc_setup_taprio(ndev, taprio);
 	if (err) {
-		for (i = 0; i < priv->num_tx_rings; i++) {
-			tx_ring = priv->tx_ring[i];
-			tx_ring->prio = taprio->enable ? 0 : i;
-			enetc_set_bdr_prio(hw, tx_ring->index, tx_ring->prio);
-		}
+		taprio->mqprio.qopt.num_tc = 0;
+		enetc_setup_tc_mqprio(ndev, &taprio->mqprio);
 	}
 
 	return err;
-- 
2.34.1



* [RFC PATCH net-next 10/11] net/sched: taprio: validate that gate mask does not exceed number of TCs
From: Vladimir Oltean @ 2023-01-20 14:15 UTC (permalink / raw)
  To: netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Vinicius Costa Gomes, Alexander Duyck, Kurt Kanzenbach,
	Ferenc Fejes, Tony Nguyen, Jesse Brandeburg, Jacob Keller

"man tc-taprio" says:

| each gate state allows outgoing traffic for a subset (potentially
| empty) of traffic classes.

So it makes sense to not allow gate actions to have bits set for traffic
classes that exceed the number of TCs of the device (according to the
mqprio configuration). Validate precisely that.
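
For example, with num_tc 3, the only meaningful gate masks are 0x1
through 0x7. A schedule entry that opens "gate 3" now fails with an
extack message (hypothetical command on an imaginary interface):

$ tc qdisc replace dev eno0 parent root handle 100 taprio num_tc 3 \
	map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
	queues 1@0 1@1 2@2 base-time 0 \
	sched-entry S 0x8 300000
Error: Gate mask 0x8 contains bits for non-existent traffic classes (device has 3).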

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 41 +++++++++++++++++++++++++----------------
 1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 8f832fa82745..a3fa5debe513 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -789,15 +789,24 @@ static int fill_sched_entry(struct taprio_sched *q, struct nlattr **tb,
 			    struct netlink_ext_ack *extack)
 {
 	int min_duration = length_to_duration(q, ETH_ZLEN);
+	struct net_device *dev = qdisc_dev(q->root);
+	int num_tc = netdev_get_num_tc(dev);
 	u32 interval = 0;
 
 	if (tb[TCA_TAPRIO_SCHED_ENTRY_CMD])
 		entry->command = nla_get_u8(
 			tb[TCA_TAPRIO_SCHED_ENTRY_CMD]);
 
-	if (tb[TCA_TAPRIO_SCHED_ENTRY_GATE_MASK])
+	if (tb[TCA_TAPRIO_SCHED_ENTRY_GATE_MASK]) {
 		entry->gate_mask = nla_get_u32(
 			tb[TCA_TAPRIO_SCHED_ENTRY_GATE_MASK]);
+		if (!num_tc || (entry->gate_mask & ~GENMASK(num_tc - 1, 0))) {
+			NL_SET_ERR_MSG_FMT(extack,
+					   "Gate mask 0x%x contains bits for non-existent traffic classes (device has %d)",
+					   entry->gate_mask, num_tc);
+			return -EINVAL;
+		}
+	}
 
 	if (tb[TCA_TAPRIO_SCHED_ENTRY_INTERVAL])
 		interval = nla_get_u32(
@@ -1588,6 +1597,21 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 		goto free_sched;
 	}
 
+	if (mqprio) {
+		err = netdev_set_num_tc(dev, mqprio->num_tc);
+		if (err)
+			goto free_sched;
+		for (i = 0; i < mqprio->num_tc; i++)
+			netdev_set_tc_queue(dev, i,
+					    mqprio->count[i],
+					    mqprio->offset[i]);
+
+		/* Always use supplied priority mappings */
+		for (i = 0; i <= TC_BITMASK; i++)
+			netdev_set_prio_tc_map(dev, i,
+					       mqprio->prio_tc_map[i]);
+	}
+
 	err = parse_taprio_schedule(q, tb, new_admin, extack);
 	if (err < 0)
 		goto free_sched;
@@ -1604,21 +1628,6 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 
 	taprio_set_picos_per_byte(dev, q);
 
-	if (mqprio) {
-		err = netdev_set_num_tc(dev, mqprio->num_tc);
-		if (err)
-			goto free_sched;
-		for (i = 0; i < mqprio->num_tc; i++)
-			netdev_set_tc_queue(dev, i,
-					    mqprio->count[i],
-					    mqprio->offset[i]);
-
-		/* Always use supplied priority mappings */
-		for (i = 0; i <= TC_BITMASK; i++)
-			netdev_set_prio_tc_map(dev, i,
-					       mqprio->prio_tc_map[i]);
-	}
-
 	if (FULL_OFFLOAD_IS_ENABLED(q->flags))
 		err = taprio_enable_offload(dev, q, new_admin, mqprio, extack);
 	else
-- 
2.34.1



* [RFC PATCH net-next 11/11] net/sched: taprio: only calculate gate mask per TXQ for igc
From: Vladimir Oltean @ 2023-01-20 14:15 UTC (permalink / raw)
  To: netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Vinicius Costa Gomes, Alexander Duyck, Kurt Kanzenbach,
	Ferenc Fejes, Tony Nguyen, Jesse Brandeburg, Jacob Keller

Vinicius has repeated a couple of times in our discussion that it was a
mistake for the taprio UAPI to take as input the Qbv gate mask per TC
rather than per TXQ. In the Frame Preemption RFC thread:
https://patchwork.kernel.org/project/netdevbpf/patch/20220816222920.1952936-3-vladimir.oltean@nxp.com/#25011225

I had this unanswered question:

| > And even that it works out because taprio "translates" from traffic
| > classes to queues when it sends the offload information to the driver,
| > i.e. the driver knows the schedule of queues, not traffic classes.
|
| Which is incredibly strange to me, since the standard clearly defines
| Qbv gates to be per traffic class, and in ENETC, even if we have 2 TX
| queues for the same traffic class (one per CPU), the hardware schedule
| is still per traffic class and not per independent TX queue (BD ring).
|
| How does this work for i225/i226, if 2 queues are configured for the
| same dequeue priority? Do the taprio gates still take effect per queue?

I haven't gotten an answer, and some things are still unclear, but I
suspect that igc is the outlier, and all the other hardware actually has
the gate mask per TC and not per TXQ, just like the standard says.

For example, in ENETC up until now, we weren't passed the mqprio queue
configuration via struct tc_taprio_qopt_offload, and hence, we needed to
assume that the TC:TXQ mapping was 1:1. So "per TC" or "per TXQ" did not
make a practical difference. I suspect that other drivers are in the
same position.

Benefit from the TC_QUERY_CAPS feature that Jakub suggested we add:
query the device driver before calling the proper ndo_setup_tc(), and
figure out whether it expects the gate mask to be per TC or per TXQ.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/ethernet/intel/igc/igc_main.c | 17 +++++++++++++++++
 include/net/pkt_sched.h                   |  1 +
 net/sched/sch_taprio.c                    | 11 ++++++++---
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index e86b15efaeb8..9b6f2aaf78c2 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -6205,12 +6205,29 @@ static int igc_tsn_enable_cbs(struct igc_adapter *adapter,
 	return igc_tsn_offload_apply(adapter);
 }
 
+static int igc_tc_query_caps(struct tc_query_caps_base *base)
+{
+	switch (base->type) {
+	case TC_SETUP_QDISC_TAPRIO: {
+		struct tc_taprio_caps *caps = base->caps;
+
+		caps->gate_mask_per_txq = true;
+
+		return 0;
+	}
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
 static int igc_setup_tc(struct net_device *dev, enum tc_setup_type type,
 			void *type_data)
 {
 	struct igc_adapter *adapter = netdev_priv(dev);
 
 	switch (type) {
+	case TC_QUERY_CAPS:
+		return igc_tc_query_caps(type_data);
 	case TC_SETUP_QDISC_TAPRIO:
 		return igc_tsn_enable_qbv_scheduling(adapter, type_data);
 
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index ace8be520fb0..fd889fc4912b 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -176,6 +176,7 @@ struct tc_mqprio_qopt_offload {
 
 struct tc_taprio_caps {
 	bool supports_queue_max_sdu:1;
+	bool gate_mask_per_txq:1;
 };
 
 struct tc_taprio_sched_entry {
diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index a3fa5debe513..58efa982db65 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -1212,7 +1212,8 @@ static u32 tc_map_to_queue_mask(struct net_device *dev, u32 tc_mask)
 
 static void taprio_sched_to_offload(struct net_device *dev,
 				    struct sched_gate_list *sched,
-				    struct tc_taprio_qopt_offload *offload)
+				    struct tc_taprio_qopt_offload *offload,
+				    bool gate_mask_per_txq)
 {
 	struct sched_entry *entry;
 	int i = 0;
@@ -1226,7 +1227,11 @@ static void taprio_sched_to_offload(struct net_device *dev,
 
 		e->command = entry->command;
 		e->interval = entry->interval;
-		e->gate_mask = tc_map_to_queue_mask(dev, entry->gate_mask);
+		if (gate_mask_per_txq)
+			e->gate_mask = tc_map_to_queue_mask(dev,
+							    entry->gate_mask);
+		else
+			e->gate_mask = entry->gate_mask;
 
 		i++;
 	}
@@ -1273,7 +1278,7 @@ static int taprio_enable_offload(struct net_device *dev,
 	offload->enable = 1;
 	if (mqprio)
 		offload->mqprio.qopt = *mqprio;
-	taprio_sched_to_offload(dev, sched, offload);
+	taprio_sched_to_offload(dev, sched, offload, caps.gate_mask_per_txq);
 
 	for (tc = 0; tc < TC_MAX_QUEUE; tc++)
 		offload->max_sdu[tc] = q->max_sdu[tc];
-- 
2.34.1



* Re: [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup
From: Jacob Keller @ 2023-01-23 18:22 UTC (permalink / raw)
  To: Vladimir Oltean, netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Vinicius Costa Gomes, Alexander Duyck, Kurt Kanzenbach,
	Ferenc Fejes, Tony Nguyen, Jesse Brandeburg



On 1/20/2023 6:15 AM, Vladimir Oltean wrote:
> I realize that this patch set will start a flame war, but there are
> things about the mqprio qdisc that I simply don't understand, so in an
> attempt to explain how I think things should be done, I've made some
> patches to the code. I hope the reviewers will be patient enough with me :)
> 
> I need to touch mqprio because I'm preparing a patch set for Frame
> Preemption (an IEEE 802.1Q feature). A disagreement started with
> Vinicius here:
> https://patchwork.kernel.org/project/netdevbpf/patch/20220816222920.1952936-3-vladimir.oltean@nxp.com/#24976672
> 
> regarding how TX packet prioritization should be handled. Vinicius said
> that for some Intel NICs, prioritization at the egress scheduler stage
> is fundamentally attached to TX queues rather than traffic classes.
> 
> In other words, in the "popular" mqprio configuration documented by him:
> 
> $ tc qdisc replace dev $IFACE parent root handle 100 mqprio \
>       num_tc 3 \
>       map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
>       queues 1@0 1@1 2@2 \
>       hw 0
> 
> there are 3 Linux traffic classes and 4 TX queues. The TX queues are
> organized in strict priority fashion, like this: TXQ 0 has highest prio
> (hardware dequeue precedence for TX scheduler), TXQ 3 has lowest prio.
> Packets classified by Linux to TC 2 are hashed between TXQ 2 and TXQ 3,
> but the hardware gives TXQ 2 higher precedence than TXQ 3, and Linux
> doesn't know that.
> 
> I am surprised by this fact, and this isn't how ENETC works at all.
> For ENETC, we try to prioritize on TCs rather than TXQs, and TC 7 has
> higher priority than TC 0. For us, groups of TXQs that map to the same
> TC have the same egress scheduling priority. It is possible (and maybe
> useful) to have 2 TXQs per TC (one TXQ per CPU). Patch 07/11 tries to
> make that clearer.
> 
> Furthermore (and this is really the biggest point of contention),
> Vinicius and I have a fundamental disagreement about whether the 802.1Qbv
> (taprio) gate mask should be passed to the device driver per TXQ or per
> TC. This is what patch 11/11 is about.
> 
> Again, I'm not *certain* that my opinion on this topic is correct
> (and it sure is confusing to see such a different approach for Intel).
> But I would appreciate any feedback.
> 
> Vladimir Oltean (11):
>   net/sched: mqprio: refactor nlattr parsing to a separate function
>   net/sched: mqprio: refactor offloading and unoffloading to dedicated
>     functions
>   net/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to
>     pkt_sched.h
>   net/sched: mqprio: allow offloading drivers to request queue count
>     validation
>   net/sched: mqprio: add extack messages for queue count validation
>   net: enetc: request mqprio to validate the queue counts
>   net: enetc: act upon the requested mqprio queue configuration
>   net/sched: taprio: pass mqprio queue configuration to ndo_setup_tc()
>   net: enetc: act upon mqprio queue config in taprio offload
>   net/sched: taprio: validate that gate mask does not exceed number of
>     TCs
>   net/sched: taprio: only calculate gate mask per TXQ for igc
> 

I don't work on igc or the i225/i226 devices, so I can't speak for
those, but this series looks ok to me.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

>  drivers/net/ethernet/freescale/enetc/enetc.c  |  67 ++--
>  .../net/ethernet/freescale/enetc/enetc_qos.c  |  27 +-
>  drivers/net/ethernet/intel/igc/igc_main.c     |  17 +
>  include/net/pkt_cls.h                         |  10 -
>  include/net/pkt_sched.h                       |  16 +
>  net/sched/sch_mqprio.c                        | 298 +++++++++++-------
>  net/sched/sch_taprio.c                        |  57 ++--
>  7 files changed, 310 insertions(+), 182 deletions(-)
> 


* Re: [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup
From: Gerhard Engleder @ 2023-01-23 21:21 UTC (permalink / raw)
  To: Vladimir Oltean, netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang,
	Vinicius Costa Gomes, Alexander Duyck, Kurt Kanzenbach,
	Ferenc Fejes, Tony Nguyen, Jesse Brandeburg, Jacob Keller

On 20.01.23 15:15, Vladimir Oltean wrote:
> I realize that this patch set will start a flame war, but there are
> things about the mqprio qdisc that I simply don't understand, so in an
> attempt to explain how I think things should be done, I've made some
> patches to the code. I hope the reviewers will be patient enough with me :)
> 
> I need to touch mqprio because I'm preparing a patch set for Frame
> Preemption (an IEEE 802.1Q feature). A disagreement started with
> Vinicius here:
> https://patchwork.kernel.org/project/netdevbpf/patch/20220816222920.1952936-3-vladimir.oltean@nxp.com/#24976672
> 
> regarding how TX packet prioritization should be handled. Vinicius said
> that for some Intel NICs, prioritization at the egress scheduler stage
> is fundamentally attached to TX queues rather than traffic classes.
> 
> In other words, in the "popular" mqprio configuration documented by him:
> 
> $ tc qdisc replace dev $IFACE parent root handle 100 mqprio \
>        num_tc 3 \
>        map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
>        queues 1@0 1@1 2@2 \
>        hw 0
> 
> there are 3 Linux traffic classes and 4 TX queues. The TX queues are
> organized in strict priority fashion, like this: TXQ 0 has highest prio
> (hardware dequeue precedence for TX scheduler), TXQ 3 has lowest prio.
> Packets classified by Linux to TC 2 are hashed between TXQ 2 and TXQ 3,
> but the hardware has higher precedence for TXQ 2 over TXQ 3, and Linux
> doesn't know that.

For my tsnep IP core it is similar, but with reversed priority: TXQ 0 has
the lowest priority (to be used for non-real-time traffic), TXQ 1 has
priority over TXQ 0, TXQ 2 has priority over TXQ 1, and so on. The number
of TX queues is flexible and depends on the requirements of the real-time
application and the available resources within the FPGA. The priority is
hard-coded to save FPGA resources.
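
For a device like that, the mqprio mapping presumably has to be the
mirror image of the Intel example above, with the highest-priority TC
placed on the highest-numbered TXQ. A hypothetical 3-TC configuration
(illustrative only; the interface name, prio map and queue count are
made up, not a tested tsnep command):

$ tc qdisc replace dev $IFACE parent root handle 100 mqprio \
      num_tc 3 \
      map 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 \
      queues 1@0 1@1 1@2 \
      hw 0

Here best-effort traffic (TC 0) lands on TXQ 0, the lowest-priority
queue, and TC 2 lands on TXQ 2, the highest-priority one.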

> I am surprised by this fact, and this isn't how ENETC works at all.
> For ENETC, we try to prioritize on TCs rather than TXQs, and TC 7 has
> higher priority than TC 0. For us, groups of TXQs that map to the same
> TC have the same egress scheduling priority. It is possible (and maybe
> useful) to have 2 TXQs per TC (one TXQ per CPU). Patch 07/11 tries to
> make that more clear.
> 
> Furthermore (and this is really the biggest point of contention), myself
> and Vinicius have the fundamental disagreement whether the 802.1Qbv
> (taprio) gate mask should be passed to the device driver per TXQ or per
> TC. This is what patch 11/11 is about.

tsnep also expects the gate mask per TXQ; this simplifies the hardware
implementation. But it would be no problem if the gate mask were passed
per TC and the driver transformed it to per TXQ.
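
A driver-side transform like that is short; a minimal sketch in the
style of taprio's tc_map_to_queue_mask(), assuming the driver is handed
the mqprio queue configuration (the helper name is made up):

static u32 gate_mask_tc_to_txq(const struct tc_mqprio_qopt *qopt,
			       u32 tc_mask)
{
	u32 txq_mask = 0;
	int tc, i;

	for (tc = 0; tc < qopt->num_tc; tc++) {
		if (!(tc_mask & BIT(tc)))
			continue;

		/* open the gates of all TXQs belonging to this TC */
		for (i = 0; i < qopt->count[tc]; i++)
			txq_mask |= BIT(qopt->offset[tc] + i);
	}

	return txq_mask;
}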

> Again, I'm not *certain* that my opinion on this topic is correct
> (and it sure is confusing to see such a different approach for Intel).
> But I would appreciate any feedback.

In my opinion it makes sense to add the mqprio queue configuration to
taprio. This allows the driver to check whether the queue assignment and
prioritization make sense for its device. Currently, deep hardware
knowledge is needed to get this right.

Gerhard

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup
  2023-01-23 21:21 ` Gerhard Engleder
@ 2023-01-23 21:31   ` Vladimir Oltean
  2023-01-23 22:20     ` Gerhard Engleder
  0 siblings, 1 reply; 25+ messages in thread
From: Vladimir Oltean @ 2023-01-23 21:31 UTC (permalink / raw)
  To: Gerhard Engleder
  Cc: netdev, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Claudiu Manoil, Camelia Groza,
	Xiaoliang Yang, Vinicius Costa Gomes, Alexander Duyck,
	Kurt Kanzenbach, Ferenc Fejes, Tony Nguyen, Jesse Brandeburg,
	Jacob Keller

On Mon, Jan 23, 2023 at 10:21:33PM +0100, Gerhard Engleder wrote:
> For my tsnep IP core it is similar, but with reversed priority: TXQ 0 has
> the lowest priority (to be used for non-real-time traffic), TXQ 1 has
> priority over TXQ 0, TXQ 2 has priority over TXQ 1, and so on. The number
> of TX queues is flexible and depends on the requirements of the real-time
> application and the available resources within the FPGA. The priority is
> hard-coded to save FPGA resources.

But if there's no round robin between queues of equal priority, it means
you can never have more than 1 TXQ per traffic class with this design,
or i.o.w., your best-effort traffic will always be single queue, right?

> > Furthermore (and this is really the biggest point of contention), myself
> > and Vinicius have the fundamental disagreement whether the 802.1Qbv
> > (taprio) gate mask should be passed to the device driver per TXQ or per
> > TC. This is what patch 11/11 is about.
> 
> tsnep also expects the gate mask per TXQ; this simplifies the hardware
> implementation. But it would be no problem if the gate mask were passed
> per TC and the driver transformed it to per TXQ.

If tsnep can only have at most 1 TXQ per TC, then what's the difference
between gate mask per TXQ and gate mask per TC?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup
  2023-01-23 21:31   ` Vladimir Oltean
@ 2023-01-23 22:20     ` Gerhard Engleder
  0 siblings, 0 replies; 25+ messages in thread
From: Gerhard Engleder @ 2023-01-23 22:20 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Claudiu Manoil, Camelia Groza,
	Xiaoliang Yang, Vinicius Costa Gomes, Alexander Duyck,
	Kurt Kanzenbach, Ferenc Fejes, Tony Nguyen, Jesse Brandeburg,
	Jacob Keller

On 23.01.23 22:31, Vladimir Oltean wrote:
> On Mon, Jan 23, 2023 at 10:21:33PM +0100, Gerhard Engleder wrote:
>> For my tsnep IP core it is similar, but with reversed priority: TXQ 0 has
>> the lowest priority (to be used for non-real-time traffic), TXQ 1 has
>> priority over TXQ 0, TXQ 2 has priority over TXQ 1, and so on. The number
>> of TX queues is flexible and depends on the requirements of the real-time
>> application and the available resources within the FPGA. The priority is
>> hard-coded to save FPGA resources.
> 
> But if there's no round robin between queues of equal priority, it means
> you can never have more than 1 TXQ per traffic class with this design,
> or i.o.w., your best-effort traffic will always be single queue, right?

Yes, with the current design only 1 TXQ per traffic class is the goal.

>>> Furthermore (and this is really the biggest point of contention), myself
>>> and Vinicius have the fundamental disagreement whether the 802.1Qbv
>>> (taprio) gate mask should be passed to the device driver per TXQ or per
>>> TC. This is what patch 11/11 is about.
>>
>> tsnep also expects the gate mask per TXQ; this simplifies the hardware
>> implementation. But it would be no problem if the gate mask were passed
>> per TC and the driver transformed it to per TXQ.
> 
> If tsnep can only have at most 1 TXQ per TC, then what's the difference
> between gate mask per TXQ and gate mask per TC?

There can be fewer TCs than TXQs. If only the second TXQ were used, then
the gate mask per TC would be 0x1 and the gate mask per TXQ would be 0x2.
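
(Concretely, with a hypothetical "num_tc 1 ... queues 1@1" setup: bit 0
of the per-TC mask is TC 0, so an open gate reads 0x1, while bit 1 of
the per-TXQ mask is TXQ 1, so the same open gate reads 0x2; the driver
only has to shift the set bits by the TC-to-TXQ offset.)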

If the number of TCs and TXQs were identical, there would be no
difference. The overlap check enforces that TXQs are assigned to TCs in
strict order, so TXQs cannot be assigned to TCs in arbitrary order. At
least that was the result of a quick test; I don't know the reason for
this behavior.

Gerhard

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup
  2023-01-23 18:22 ` [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup Jacob Keller
@ 2023-01-24 14:26   ` Vladimir Oltean
  2023-01-24 22:30     ` Jacob Keller
  0 siblings, 1 reply; 25+ messages in thread
From: Vladimir Oltean @ 2023-01-24 14:26 UTC (permalink / raw)
  To: Jacob Keller
  Cc: netdev, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Claudiu Manoil, Camelia Groza,
	Xiaoliang Yang, Gerhard Engleder, Vinicius Costa Gomes,
	Alexander Duyck, Kurt Kanzenbach, Ferenc Fejes, Tony Nguyen,
	Jesse Brandeburg

Hi Jacob,

On Mon, Jan 23, 2023 at 10:22:08AM -0800, Jacob Keller wrote:
> I don't work on igc or the i225/i226 devices, so I can't speak for
> those, but this series looks ok to me.
> 
> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

For clarity, does this mean that I can put your review tag on all
patches in the next version?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup
  2023-01-24 14:26   ` Vladimir Oltean
@ 2023-01-24 22:30     ` Jacob Keller
  0 siblings, 0 replies; 25+ messages in thread
From: Jacob Keller @ 2023-01-24 22:30 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Claudiu Manoil, Camelia Groza,
	Xiaoliang Yang, Gerhard Engleder, Vinicius Costa Gomes,
	Alexander Duyck, Kurt Kanzenbach, Ferenc Fejes, Tony Nguyen,
	Jesse Brandeburg



On 1/24/2023 6:26 AM, Vladimir Oltean wrote:
> Hi Jacob,
> 
> On Mon, Jan 23, 2023 at 10:22:08AM -0800, Jacob Keller wrote:
>> I don't work on igc or the i225/i226 devices, so I can't speak for
>> those, but this series looks ok to me.
>>
>> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
> 
> For clarity, does this mean that I can put your review tag on all
> patches in the next version?

Yes.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup
  2023-01-20 14:15 [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup Vladimir Oltean
                   ` (12 preceding siblings ...)
  2023-01-23 21:21 ` Gerhard Engleder
@ 2023-01-25  1:11 ` Vinicius Costa Gomes
  2023-01-25 13:10   ` Vladimir Oltean
  13 siblings, 1 reply; 25+ messages in thread
From: Vinicius Costa Gomes @ 2023-01-25  1:11 UTC (permalink / raw)
  To: Vladimir Oltean, netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Alexander Duyck, Kurt Kanzenbach, Ferenc Fejes, Tony Nguyen,
	Jesse Brandeburg, Jacob Keller

Hi Vladimir,

Sorry for the delay. I had to sleep on this for a bit.

Vladimir Oltean <vladimir.oltean@nxp.com> writes:

> I realize that this patch set will start a flame war, but there are
> things about the mqprio qdisc that I simply don't understand, so in an
> attempt to explain how I think things should be done, I've made some
> patches to the code. I hope the reviewers will be patient enough with me :)
>
> I need to touch mqprio because I'm preparing a patch set for Frame
> Preemption (an IEEE 802.1Q feature). A disagreement started with
> Vinicius here:
> https://patchwork.kernel.org/project/netdevbpf/patch/20220816222920.1952936-3-vladimir.oltean@nxp.com/#24976672
>
> regarding how TX packet prioritization should be handled. Vinicius said
> that for some Intel NICs, prioritization at the egress scheduler stage
> is fundamentally attached to TX queues rather than traffic classes.
>
> In other words, in the "popular" mqprio configuration documented by him:
>
> $ tc qdisc replace dev $IFACE parent root handle 100 mqprio \
>       num_tc 3 \
>       map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
>       queues 1@0 1@1 2@2 \
>       hw 0
>
> there are 3 Linux traffic classes and 4 TX queues. The TX queues are
> organized in strict priority fashion, like this: TXQ 0 has highest prio
> (hardware dequeue precedence for TX scheduler), TXQ 3 has lowest prio.
> Packets classified by Linux to TC 2 are hashed between TXQ 2 and TXQ 3,
> but the hardware has higher precedence for TXQ 2 over TXQ 3, and Linux
> doesn't know that.
>
> I am surprised by this fact, and this isn't how ENETC works at all.
> For ENETC, we try to prioritize on TCs rather than TXQs, and TC 7 has
> higher priority than TC 0. For us, groups of TXQs that map to the same
> TC have the same egress scheduling priority. It is possible (and maybe
> useful) to have 2 TXQs per TC (one TXQ per CPU). Patch 07/11 tries to
> make that more clear.
>

That makes me think that making "queues" visible on mqprio/taprio was
perhaps a mistake. If we only had the "prio to tc" map and relied on
drivers implementing .ndo_select_queue(), that would be less
problematic. And for devices with tens or hundreds of queues, this "no
queues exposed to the user" approach sounds like a better model.
Anyway... just wondering.

Perhaps something to think about for mqprio/taprio/etc "the next generation" ;-)

> Furthermore (and this is really the biggest point of contention), myself
> and Vinicius have the fundamental disagreement whether the 802.1Qbv
> (taprio) gate mask should be passed to the device driver per TXQ or per
> TC. This is what patch 11/11 is about.
>

I think that I was being annoying because I believed that some
implementation detail of the netdev prio_tc_map and the way that the
netdev selects TX queues (the "core of how mqprio works") would leak,
and that it would be easier/more correct to make other vendors adapt
themselves to the "Intel"/"queues have priorities" model. But I stand
corrected, as you (and others) have proven.

In short, I am not opposed to this idea. This capability operation
really opens some possibilities. The patches look clean.

I'll play with the patches later in the week, quite swamped at this
point.

> Again, I'm not *certain* that my opinion on this topic is correct
> (and it sure is confusing to see such a different approach for Intel).
> But I would appreciate any feedback.

And that reminds me, I owe you a beverage of your choice. For all your
effort.

>
> Vladimir Oltean (11):
>   net/sched: mqprio: refactor nlattr parsing to a separate function
>   net/sched: mqprio: refactor offloading and unoffloading to dedicated
>     functions
>   net/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to
>     pkt_sched.h
>   net/sched: mqprio: allow offloading drivers to request queue count
>     validation
>   net/sched: mqprio: add extack messages for queue count validation
>   net: enetc: request mqprio to validate the queue counts
>   net: enetc: act upon the requested mqprio queue configuration
>   net/sched: taprio: pass mqprio queue configuration to ndo_setup_tc()
>   net: enetc: act upon mqprio queue config in taprio offload
>   net/sched: taprio: validate that gate mask does not exceed number of
>     TCs
>   net/sched: taprio: only calculate gate mask per TXQ for igc
>
>  drivers/net/ethernet/freescale/enetc/enetc.c  |  67 ++--
>  .../net/ethernet/freescale/enetc/enetc_qos.c  |  27 +-
>  drivers/net/ethernet/intel/igc/igc_main.c     |  17 +
>  include/net/pkt_cls.h                         |  10 -
>  include/net/pkt_sched.h                       |  16 +
>  net/sched/sch_mqprio.c                        | 298 +++++++++++-------
>  net/sched/sch_taprio.c                        |  57 ++--
>  7 files changed, 310 insertions(+), 182 deletions(-)
>
> -- 
> 2.34.1
>

Cheers,
-- 
Vinicius

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 11/11] net/sched: taprio: only calculate gate mask per TXQ for igc
  2023-01-20 14:15 ` [RFC PATCH net-next 11/11] net/sched: taprio: only calculate gate mask per TXQ for igc Vladimir Oltean
@ 2023-01-25  1:11   ` Vinicius Costa Gomes
  0 siblings, 0 replies; 25+ messages in thread
From: Vinicius Costa Gomes @ 2023-01-25  1:11 UTC (permalink / raw)
  To: Vladimir Oltean, netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Alexander Duyck, Kurt Kanzenbach, Ferenc Fejes, Tony Nguyen,
	Jesse Brandeburg, Jacob Keller

Vladimir Oltean <vladimir.oltean@nxp.com> writes:

> Vinicius has repeated a couple of times in our discussion that it was a
> mistake for the taprio UAPI to take as input the Qbv gate mask per TC
> rather than per TXQ. In the Frame Preemption RFC thread:
> https://patchwork.kernel.org/project/netdevbpf/patch/20220816222920.1952936-3-vladimir.oltean@nxp.com/#25011225
>
> I had this unanswered question:
>
> | > And even that it works out because taprio "translates" from traffic
> | > classes to queues when it sends the offload information to the driver,
> | > i.e. the driver knows the schedule of queues, not traffic classes.
> |
> | Which is incredibly strange to me, since the standard clearly defines
> | Qbv gates to be per traffic class, and in ENETC, even if we have 2 TX
> | queues for the same traffic class (one per CPU), the hardware schedule
> | is still per traffic class and not per independent TX queue (BD ring).
> |
> | How does this work for i225/i226, if 2 queues are configured for the
> | same dequeue priority? Do the taprio gates still take effect per
> | queue?

Sorry that I haven't answered this before.

Two things, for i225/i226:
  - The gates open/close registers are per-queue, i.e. I control
  explicitly when each gate is going to close/open inside each cycle
  (yes, this design does have limitations);
  - Looking at the datasheet there's also this: "Each queue must be
  assigned with a unique priority level". Not sure what happens if I set
  the same priority on two queues; I would expect the ordering to be
  undefined, but I never tested that.

>
> I haven't gotten an answer, and some things are still unclear, but I
> suspect that igc is the outlier, and all the other hardware actually has
> the gate mask per TC and not per TXQ, just like the standard says.
>
> For example, in ENETC up until now, we weren't passed the mqprio queue
> configuration via struct tc_taprio_qopt_offload, and hence, we needed to
> assume that the TC:TXQ mapping was 1:1. So "per TC" or "per TXQ" did not
> make a practical difference. I suspect that other drivers are in the
> same position.
>
> Benefit from the TC_QUERY_CAPS feature that Jakub suggested we add, and
> query the device driver before calling the proper ndo_setup_tc(), and
> figure out if it expects the gate mask to be per TC or per TXQ.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>  drivers/net/ethernet/intel/igc/igc_main.c | 17 +++++++++++++++++
>  include/net/pkt_sched.h                   |  1 +
>  net/sched/sch_taprio.c                    | 11 ++++++++---
>  3 files changed, 26 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
> index e86b15efaeb8..9b6f2aaf78c2 100644
> --- a/drivers/net/ethernet/intel/igc/igc_main.c
> +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> @@ -6205,12 +6205,29 @@ static int igc_tsn_enable_cbs(struct igc_adapter *adapter,
>  	return igc_tsn_offload_apply(adapter);
>  }
>  
> +static int igc_tc_query_caps(struct tc_query_caps_base *base)
> +{
> +	switch (base->type) {
> +	case TC_SETUP_QDISC_TAPRIO: {
> +		struct tc_taprio_caps *caps = base->caps;
> +
> +		caps->gate_mask_per_txq = true;
> +
> +		return 0;
> +	}
> +	default:
> +		return -EOPNOTSUPP;
> +	}
> +}
> +
>  static int igc_setup_tc(struct net_device *dev, enum tc_setup_type type,
>  			void *type_data)
>  {
>  	struct igc_adapter *adapter = netdev_priv(dev);
>  
>  	switch (type) {
> +	case TC_QUERY_CAPS:
> +		return igc_tc_query_caps(type_data);
>  	case TC_SETUP_QDISC_TAPRIO:
>  		return igc_tsn_enable_qbv_scheduling(adapter, type_data);
>  
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index ace8be520fb0..fd889fc4912b 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -176,6 +176,7 @@ struct tc_mqprio_qopt_offload {
>  
>  struct tc_taprio_caps {
>  	bool supports_queue_max_sdu:1;
> +	bool gate_mask_per_txq:1;
>  };
>  
>  struct tc_taprio_sched_entry {
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index a3fa5debe513..58efa982db65 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -1212,7 +1212,8 @@ static u32 tc_map_to_queue_mask(struct net_device *dev, u32 tc_mask)
>  
>  static void taprio_sched_to_offload(struct net_device *dev,
>  				    struct sched_gate_list *sched,
> -				    struct tc_taprio_qopt_offload *offload)
> +				    struct tc_taprio_qopt_offload *offload,
> +				    bool gate_mask_per_txq)
>  {
>  	struct sched_entry *entry;
>  	int i = 0;
> @@ -1226,7 +1227,11 @@ static void taprio_sched_to_offload(struct net_device *dev,
>  
>  		e->command = entry->command;
>  		e->interval = entry->interval;
> -		e->gate_mask = tc_map_to_queue_mask(dev, entry->gate_mask);
> +		if (gate_mask_per_txq)
> +			e->gate_mask = tc_map_to_queue_mask(dev,
> +							    entry->gate_mask);
> +		else
> +			e->gate_mask = entry->gate_mask;
>  
>  		i++;
>  	}
> @@ -1273,7 +1278,7 @@ static int taprio_enable_offload(struct net_device *dev,
>  	offload->enable = 1;
>  	if (mqprio)
>  		offload->mqprio.qopt = *mqprio;
> -	taprio_sched_to_offload(dev, sched, offload);
> +	taprio_sched_to_offload(dev, sched, offload, caps.gate_mask_per_txq);
>  
>  	for (tc = 0; tc < TC_MAX_QUEUE; tc++)
>  		offload->max_sdu[tc] = q->max_sdu[tc];
> -- 
> 2.34.1
>

-- 
Vinicius

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 03/11] net/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to pkt_sched.h
  2023-01-20 14:15 ` [RFC PATCH net-next 03/11] net/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to pkt_sched.h Vladimir Oltean
@ 2023-01-25 13:09   ` Kurt Kanzenbach
  2023-01-25 13:16     ` Vladimir Oltean
  0 siblings, 1 reply; 25+ messages in thread
From: Kurt Kanzenbach @ 2023-01-25 13:09 UTC (permalink / raw)
  To: Vladimir Oltean, netdev, John Fastabend
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Claudiu Manoil, Camelia Groza, Xiaoliang Yang, Gerhard Engleder,
	Vinicius Costa Gomes, Alexander Duyck, Ferenc Fejes, Tony Nguyen,
	Jesse Brandeburg, Jacob Keller

On Fri Jan 20 2023, Vladimir Oltean wrote:
> Since taprio is a scheduler and not a classifier, move its offload

s/taprio/mqprio/

Since mqprio is a .... ?

> structure to pkt_sched.h, where struct tc_taprio_qopt_offload also lies.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup
  2023-01-25  1:11 ` Vinicius Costa Gomes
@ 2023-01-25 13:10   ` Vladimir Oltean
  2023-01-25 22:47     ` Vinicius Costa Gomes
  0 siblings, 1 reply; 25+ messages in thread
From: Vladimir Oltean @ 2023-01-25 13:10 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: netdev, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Claudiu Manoil, Camelia Groza,
	Xiaoliang Yang, Gerhard Engleder, Alexander Duyck,
	Kurt Kanzenbach, Ferenc Fejes, Tony Nguyen, Jesse Brandeburg,
	Jacob Keller

On Tue, Jan 24, 2023 at 05:11:29PM -0800, Vinicius Costa Gomes wrote:
> Hi Vladimir,
> 
> Sorry for the delay. I had to sleep on this for a bit.

No problem, thanks for responding.

> > Vinicius said that for some Intel NICs, prioritization at the egress
> > scheduler stage is fundamentally attached to TX queues rather than
> > traffic classes.
> >
> > In other words, in the "popular" mqprio configuration documented by him:
> >
> > $ tc qdisc replace dev $IFACE parent root handle 100 mqprio \
> >       num_tc 3 \
> >       map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
> >       queues 1@0 1@1 2@2 \
> >       hw 0
> >
> > there are 3 Linux traffic classes and 4 TX queues. The TX queues are
> > organized in strict priority fashion, like this: TXQ 0 has highest prio
> > (hardware dequeue precedence for TX scheduler), TXQ 3 has lowest prio.
> > Packets classified by Linux to TC 2 are hashed between TXQ 2 and TXQ 3,
> > but the hardware has higher precedence for TXQ 2 over TXQ 3, and Linux
> > doesn't know that.
> >
> > I am surprised by this fact, and this isn't how ENETC works at all.
> > For ENETC, we try to prioritize on TCs rather than TXQs, and TC 7 has
> > higher priority than TC 0. For us, groups of TXQs that map to the same
> > TC have the same egress scheduling priority. It is possible (and maybe
> > useful) to have 2 TXQs per TC (one TXQ per CPU). Patch 07/11 tries to
> > make that more clear.
> 
> That makes me think that making "queues" visible on mqprio/taprio was
> perhaps a mistake. If we only had the "prio to tc" map and relied on
> drivers implementing .ndo_select_queue(), that would be less
> problematic. And for devices with tens or hundreds of queues, this "no
> queues exposed to the user" approach sounds like a better model.
> Anyway... just wondering.
> 
> Perhaps something to think about for mqprio/taprio/etc "the next generation" ;-)

Hmm, not sure I wanted to go there with my proposal. I think the fact
that taprio allows specifying how many TXQs are used per TC (and
starting with which TXQ offset) is a direct consequence of the fact that
mqprio had that in its UAPI. Today, mapping TXQs to TCs in a way that
makes software prioritization (netdev_core_pick_tx()) coincide with the
hardware prioritization scheme certainly requires hardware-level
knowledge. That requirement of prior knowledge makes a given
taprio/mqprio configuration less portable across systems/vendors, which
is the problem IMO.

But I wouldn't jump to your conclusion that it was a mistake to even
expose TXQs to user space. I would argue, maybe the problem is that not
*enough* information about TXQs is exposed to user space. I could
imagine it being useful for user space to be able to probe information
such as

- this netdev has strict prioritization (i.e. for this netdev, egress
  scheduling priority is attached directly to TXQs, and each TXQ is
  required to have a unique priority)

- this netdev has round robin egress scheduling between TXQs (or some
  other fairness scheme; which one?)
  - is the round robin scheduling weighted? what are the weights, and
    are they configurable? should skb_tx_hash() take the weights into
    consideration?

- this netdev has 2 layers of egress scheduling, first being strict
  priority and the other being round robin

Based on this kind of information, some kind of automation would become
possible to write an mqprio configuration that maps TCs to TXQs in a
portable way, and user space sockets are just concerned with the packet
priority API.

I guess different people would want the kernel to expose even more, or
slightly different, information for TXQs. I would be interested to know
what that is.
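
Nothing like this exists today, but as a sketch of the kind of
information I mean (all names are hypothetical):

enum netdev_txq_sched_policy {
	NETDEV_TXQ_SCHED_STRICT_PRIO,	/* each TXQ has a unique dequeue priority */
	NETDEV_TXQ_SCHED_ROUND_ROBIN,	/* equal-priority TXQs are served fairly */
	NETDEV_TXQ_SCHED_WRR,		/* weighted round robin */
};

struct netdev_txq_sched_info {
	enum netdev_txq_sched_policy policy;
	u32 prio;			/* dequeue priority, if strict prio */
	u32 weight;			/* if weighted round robin */
	bool weight_configurable;
};

User space (or some automation on top of it) could query this per TXQ
and derive a portable TC-to-TXQ mapping from it.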

> > Furthermore (and this is really the biggest point of contention), myself
> > and Vinicius have the fundamental disagreement whether the 802.1Qbv
> > (taprio) gate mask should be passed to the device driver per TXQ or per
> > TC. This is what patch 11/11 is about.
> 
> I think that I was being annoying because I believed that some
> implementation detail of the netdev prio_tc_map and the way that the
> netdev selects TX queues (the "core of how mqprio works") would leak,
> and that it would be easier/more correct to make other vendors adapt
> themselves to the "Intel"/"queues have priorities" model. But I stand
> corrected, as you (and others) have proven.

The problem with gates per TXQ is that it doesn't answer the obvious
question of how that works out when there is >1 TXQ per TC.
With the clarification that "gates per TXQ" requires that there is a
single TXQ per TC, this effectively becomes just a matter of changing
the indices of set bits in the gate mask (TC 3 may correspond to TXQ
offset 5), which is essentially what Gerhard seems to want to see with
tsnep. That is something I don't have a problem with.

But I may want, as a sanity measure, to enforce that the mqprio queue
count for each TC is no more than 1 ;) Otherwise, we fall into that
problem I keep repeating: skb_tx_hash() arbitrarily hashes between 2
TXQs, both have an open gate in software (allowing traffic to pass),
but in hardware, one TXQ has an open gate and the other has a closed gate.
So half the traffic goes into the bitbucket, because software doesn't
know what hardware does/expects.
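
To make the hazard concrete (hypothetical numbers): assume "queues 2@0",
i.e. TC 0 spans TXQ 0 and TXQ 1, and a schedule entry whose gate mask
the hardware interprets per TXQ:

	sched-entry S 0x1 300000	(bit 0 = TXQ 0 only)

During those 300 us, TXQ 0's gate is open and TXQ 1's is closed, yet
software sees TC 0 as open and skb_tx_hash() keeps spreading packets
across both queues, so the TXQ 1 half hits a closed gate.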

So please ACK this issue and my proposal to break your "popular" mqprio
configuration.

> In short, I am not opposed to this idea. This capability operation
> really opens some possibilities. The patches look clean.

Yeah, gotta thank Jakub for that.

> I'll play with the patches later in the week, quite swamped at this
> point.

Regarding the patches - I plan to send a v2 anyway, because patch 08/11
"net/sched: taprio: pass mqprio queue configuration to ndo_setup_tc()"
doesn't quite work how I'd hoped. Specifically, taprio must hold a
persistent struct tc_mqprio_qopt, rather than just juggling with what it
received last time via TCA_TAPRIO_ATTR_PRIOMAP. This is because taprio
supports some level of dynamic reconfiguration via taprio_change(), and
the TCA_TAPRIO_ATTR_PRIOMAP would be NULL when reconfigured (because the
priomap isn't what has changed). Currently this will result in passing a
NULL (actually disabled) mqprio configuration to ndo_setup_tc(), but
what I really want is to pass the *old* mqprio configuration.
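
Roughly, the fix I have in mind (a sketch, not the actual v2 patch)
amounts to one new member in taprio's private data:

	/* Hypothetical member of struct taprio_sched: the last priomap
	 * accepted via TCA_TAPRIO_ATTR_PRIOMAP, kept around so that a
	 * taprio_change() which doesn't touch the priomap can still pass
	 * a valid mqprio config to ndo_setup_tc() instead of a NULL one.
	 */
	struct tc_mqprio_qopt mqprio;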

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 03/11] net/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to pkt_sched.h
  2023-01-25 13:09   ` Kurt Kanzenbach
@ 2023-01-25 13:16     ` Vladimir Oltean
  0 siblings, 0 replies; 25+ messages in thread
From: Vladimir Oltean @ 2023-01-25 13:16 UTC (permalink / raw)
  To: Kurt Kanzenbach
  Cc: netdev, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Claudiu Manoil, Camelia Groza,
	Xiaoliang Yang, Gerhard Engleder, Vinicius Costa Gomes,
	Alexander Duyck, Ferenc Fejes, Tony Nguyen, Jesse Brandeburg,
	Jacob Keller

On Wed, Jan 25, 2023 at 02:09:27PM +0100, Kurt Kanzenbach wrote:
> On Fri Jan 20 2023, Vladimir Oltean wrote:
> > Since taprio is a scheduler and not a classifier, move its offload
> 
> s/taprio/mqprio/
> 
> Since mqprio is a .... ?

That would be the muscle memory, I suppose. Thanks for spotting this;
I've made the change in my local patch.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup
  2023-01-25 13:10   ` Vladimir Oltean
@ 2023-01-25 22:47     ` Vinicius Costa Gomes
  2023-01-26 20:39       ` Vladimir Oltean
  0 siblings, 1 reply; 25+ messages in thread
From: Vinicius Costa Gomes @ 2023-01-25 22:47 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Claudiu Manoil, Camelia Groza,
	Xiaoliang Yang, Gerhard Engleder, Alexander Duyck,
	Kurt Kanzenbach, Ferenc Fejes, Tony Nguyen, Jesse Brandeburg,
	Jacob Keller

Vladimir Oltean <vladimir.oltean@nxp.com> writes:

> On Tue, Jan 24, 2023 at 05:11:29PM -0800, Vinicius Costa Gomes wrote:
>> Hi Vladimir,
>> 
>> Sorry for the delay. I had to sleep on this for a bit.
>
> No problem, thanks for responding.
>
>> > Vinicius said that for some Intel NICs, prioritization at the egress
>> > scheduler stage is fundamentally attached to TX queues rather than
>> > traffic classes.
>> >
>> > In other words, in the "popular" mqprio configuration documented by him:
>> >
>> > $ tc qdisc replace dev $IFACE parent root handle 100 mqprio \
>> >       num_tc 3 \
>> >       map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
>> >       queues 1@0 1@1 2@2 \
>> >       hw 0
>> >
>> > there are 3 Linux traffic classes and 4 TX queues. The TX queues are
>> > organized in strict priority fashion, like this: TXQ 0 has highest prio
>> > (hardware dequeue precedence for TX scheduler), TXQ 3 has lowest prio.
>> > Packets classified by Linux to TC 2 are hashed between TXQ 2 and TXQ 3,
>> > but the hardware has higher precedence for TXQ 2 over TXQ 3, and Linux
>> > doesn't know that.
>> >
>> > I am surprised by this fact, and this isn't how ENETC works at all.
>> > For ENETC, we try to prioritize on TCs rather than TXQs, and TC 7 has
>> > higher priority than TC 0. For us, groups of TXQs that map to the same
>> > TC have the same egress scheduling priority. It is possible (and maybe
>> > useful) to have 2 TXQs per TC (one TXQ per CPU). Patch 07/11 tries to
>> > make that more clear.
>> 
>> That makes me think that making "queues" visible on mqprio/taprio was
>> perhaps a mistake. If we only had the "prio to tc" map and relied on
>> drivers implementing .ndo_select_queue(), that would be less
>> problematic. And for devices with tens or hundreds of queues, this "no
>> queues exposed to the user" approach sounds like a better model.
>> Anyway... just wondering.
>> 
>> Perhaps something to think about for mqprio/taprio/etc "the next generation" ;-)
>
> Hmm, not sure I wanted to go there with my proposal. I think the fact
> that taprio allows specifying how many TXQs are used per TC (and
> starting with which TXQ offset) is a direct consequence of the fact that
> mqprio had that in its UAPI. Today, mapping TXQs to TCs in a way that
> makes software prioritization (netdev_core_pick_tx()) coincide with the
> hardware prioritization scheme certainly requires hardware-level
> knowledge. That requirement of prior knowledge makes a given
> taprio/mqprio configuration less portable across systems/vendors, which
> is the problem IMO.
>
> But I wouldn't jump to your conclusion that it was a mistake to even
> expose TXQs to user space. I would argue, maybe the problem is that not
> *enough* information about TXQs is exposed to user space. I could
> imagine it being useful for user space to be able to probe information
> such as
>
> - this netdev has strict prioritization (i.e. for this netdev, egress
>   scheduling priority is attached directly to TXQs, and each TXQ is
>   required to have a unique priority)
>
> - this netdev has round robin egress scheduling between TXQs (or some
>   other fairness scheme; which one?)
>   - is the round robin scheduling weighted? what are the weights, and
>     are they configurable? should skb_tx_hash() take the weights into
>     consideration?
>
> - this netdev has 2 layers of egress scheduling, first being strict
>   priority and the other being round robin
>
> Based on this kind of information, some kind of automation would become
> possible to write an mqprio configuration that maps TCs to TXQs in a
> portable way, and user space sockets are just concerned with the packet
> priority API.
>
> I guess different people would want the kernel to expose even more, or
> slightly different, information for TXQs. I would be interested to know
> what that is.

Sounds interesting. Instead of "hiding" information from the user and
trusting the driver to do the right thing, we would expose enough
information for the user to configure it correctly. That could work.

>
>> > Furthermore (and this is really the biggest point of contention), myself
>> > and Vinicius have the fundamental disagreement whether the 802.1Qbv
>> > (taprio) gate mask should be passed to the device driver per TXQ or per
>> > TC. This is what patch 11/11 is about.
>> 
>> I think that I was being annoying because I believed that some
>> implementation detail of the netdev prio_tc_map and the way that the
>> netdev selects TX queues (the "core of how mqprio works") would leak,
>> and that it would be easier/more correct to make other vendors adapt
>> themselves to the "Intel"/"queues have priorities" model. But I stand
>> corrected, as you (and others) have proven.
>
> The problem with gates per TXQ is that it doesn't answer the obvious
> question of how that works out when there is >1 TXQ per TC.
> With the clarification that "gates per TXQ" requires that there is a
> single TXQ per TC, this effectively becomes just a matter of changing
> the indices of set bits in the gate mask (TC 3 may correspond to TXQ
> offset 5), which is essentially what Gerhard seems to want to see with
> tsnep. That is something I don't have a problem with.
>
> But I may want, as a sanity measure, to enforce that the mqprio queue
> count for each TC is no more than 1 ;) Otherwise, we fall into that
> problem I keep repeating: skb_tx_hash() arbitrarily hashes between 2
> TXQs, both have an open gate in software (allowing traffic to pass),
> but in hardware, one TXQ has an open gate and the other has a closed gate.
> So half the traffic goes into the bitbucket, because software doesn't
> know what hardware does/expects.
>
> So please ACK this issue and my proposal to break your "popular" mqprio
> configuration.

I am afraid that I cannot give my ACK for that; it is, by some
definition, a breaking change. A config that has been working for many
years would stop working.

I know that is not ideal; perhaps we could use the capabilities "trick"
to help minimize the breakage? I.e., add a capability indicating whether
the device supports (or whether it makes sense to have) multiple TXQs
handling a single TC?

Would it help?
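
In the spirit of the TC_QUERY_CAPS mechanism from patch 11/11, that
could be as simple as one more bit (the field name below is made up):

	struct tc_taprio_caps {
		bool supports_queue_max_sdu:1;
		bool gate_mask_per_txq:1;
		/* hypothetical: set by drivers whose TX scheduler cannot
		 * treat multiple TXQs within one TC equally, so that the
		 * "queues N@X with N > 1" case is rejected only for them
		 */
		bool single_txq_per_tc:1;
	};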

>
>> In short, I am not opposed to this idea. This capability operation
>> really opens some possibilities. The patches look clean.
>
> Yeah, gotta thank Jakub for that.
>
>> I'll play with the patches later in the week, quite swamped at this
>> point.
>
> Regarding the patches - I plan to send a v2 anyway, because patch 08/11
> "net/sched: taprio: pass mqprio queue configuration to ndo_setup_tc()"
> doesn't quite work how I'd hoped. Specifically, taprio must hold a
> persistent struct tc_mqprio_qopt, rather than just juggling with what it
> received last time via TCA_TAPRIO_ATTR_PRIOMAP. This is because taprio
> supports some level of dynamic reconfiguration via taprio_change(), and
> the TCA_TAPRIO_ATTR_PRIOMAP would be NULL when reconfigured (because the
> priomap isn't what has changed). Currently this will result in passing a
> NULL (actually disabled) mqprio configuration to ndo_setup_tc(), but
> what I really want is to pass the *old* mqprio configuration.

Cool. Will play with v2, then.


Cheers,
-- 
Vinicius

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup
  2023-01-25 22:47     ` Vinicius Costa Gomes
@ 2023-01-26 20:39       ` Vladimir Oltean
  0 siblings, 0 replies; 25+ messages in thread
From: Vladimir Oltean @ 2023-01-26 20:39 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: netdev, John Fastabend, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Claudiu Manoil, Camelia Groza,
	Xiaoliang Yang, Gerhard Engleder, Alexander Duyck,
	Kurt Kanzenbach, Ferenc Fejes, Tony Nguyen, Jesse Brandeburg,
	Jacob Keller

On Wed, Jan 25, 2023 at 02:47:28PM -0800, Vinicius Costa Gomes wrote:
> > The problem with gates per TXQ is that it doesn't answer the obvious
> > question of how that works out when there is >1 TXQ per TC.
> > With the clarification that "gates per TXQ" requires that there is a
> > single TXQ per TC, this effectively becomes just a matter of changing
> > the indices of set bits in the gate mask (TC 3 may correspond to TXQ
> > offset 5), which is essentially what Gerhard seems to want to see with
> > tsnep. That is something I don't have a problem with.
> >
> > But I may want, as a sanity measure, to enforce that the mqprio queue
> > count for each TC is no more than 1 ;) Otherwise, we fall into that
> > problem I keep repeating: skb_tx_hash() arbitrarily hashes between 2
> > TXQs, both have an open gate in software (allowing traffic to pass),
> > but in hardware, one TXQ has an open gate and the other has a closed gate.
> > So half the traffic goes into the bitbucket, because software doesn't
> > know what hardware does/expects.
> >
> > So please ACK this issue and my proposal to break your "popular" mqprio
> > configuration.
> 
> I am afraid that I cannot give my ACK for that; it is, by some
> definition, a breaking change. A config that has been working for many
> years would stop working.
> 
> I know that is not ideal; perhaps we could use the capabilities "trick"
> to help minimize the breakage? I.e., add a capability indicating whether
> the device supports (or whether it makes sense to have) multiple TXQs
> handling a single TC?
> 
> Would it help?

Not having multiple TXQs handling a single TC (that is fine), but having
multiple TXQs of different priorities handling a single TC...

So how does it work with igc? What exactly are we keeping alive?

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2023-01-26 20:40 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-20 14:15 [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup Vladimir Oltean
2023-01-20 14:15 ` [RFC PATCH net-next 01/11] net/sched: mqprio: refactor nlattr parsing to a separate function Vladimir Oltean
2023-01-20 14:15 ` [RFC PATCH net-next 02/11] net/sched: mqprio: refactor offloading and unoffloading to dedicated functions Vladimir Oltean
2023-01-20 14:15 ` [RFC PATCH net-next 03/11] net/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to pkt_sched.h Vladimir Oltean
2023-01-25 13:09   ` Kurt Kanzenbach
2023-01-25 13:16     ` Vladimir Oltean
2023-01-20 14:15 ` [RFC PATCH net-next 04/11] net/sched: mqprio: allow offloading drivers to request queue count validation Vladimir Oltean
2023-01-20 14:15 ` [RFC PATCH net-next 05/11] net/sched: mqprio: add extack messages for " Vladimir Oltean
2023-01-20 14:15 ` [RFC PATCH net-next 06/11] net: enetc: request mqprio to validate the queue counts Vladimir Oltean
2023-01-20 14:15 ` [RFC PATCH net-next 07/11] net: enetc: act upon the requested mqprio queue configuration Vladimir Oltean
2023-01-20 14:15 ` [RFC PATCH net-next 08/11] net/sched: taprio: pass mqprio queue configuration to ndo_setup_tc() Vladimir Oltean
2023-01-20 14:15 ` [RFC PATCH net-next 09/11] net: enetc: act upon mqprio queue config in taprio offload Vladimir Oltean
2023-01-20 14:15 ` [RFC PATCH net-next 10/11] net/sched: taprio: validate that gate mask does not exceed number of TCs Vladimir Oltean
2023-01-20 14:15 ` [RFC PATCH net-next 11/11] net/sched: taprio: only calculate gate mask per TXQ for igc Vladimir Oltean
2023-01-25  1:11   ` Vinicius Costa Gomes
2023-01-23 18:22 ` [RFC PATCH net-next 00/11] ENETC mqprio/taprio cleanup Jacob Keller
2023-01-24 14:26   ` Vladimir Oltean
2023-01-24 22:30     ` Jacob Keller
2023-01-23 21:21 ` Gerhard Engleder
2023-01-23 21:31   ` Vladimir Oltean
2023-01-23 22:20     ` Gerhard Engleder
2023-01-25  1:11 ` Vinicius Costa Gomes
2023-01-25 13:10   ` Vladimir Oltean
2023-01-25 22:47     ` Vinicius Costa Gomes
2023-01-26 20:39       ` Vladimir Oltean
