All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 00/10] Add a new Qdisc, ETS
@ 2019-11-20 13:05 Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 01/10] net: pkt_cls: Clarify a comment Petr Machata
                   ` (13 more replies)
  0 siblings, 14 replies; 19+ messages in thread
From: Petr Machata @ 2019-11-20 13:05 UTC (permalink / raw)
  To: netdev; +Cc: Petr Machata, Ido Schimmel, Roopa Prabhu

The IEEE standard 802.1Qaz (and 802.1Q-2014) specifies four principal
transmission selection algorithms: strict priority, credit-based shaper,
ETS (bandwidth sharing), and vendor-specific. All these have their
corresponding knobs in DCB. But DCB does not have interfaces to configure
RED and ECN, unlike Qdiscs.

In the Qdisc land, strict priority is implemented by PRIO. Credit-based
transmission selection algorithm can then be modeled by having e.g. TBF or
CBS Qdisc below some of the PRIO bands. ETS would then be modeled by
placing a DRR Qdisc under the last PRIO band.

The problem with this approach is that DRR on its own, as well as the
combination of PRIO and DRR, are tricky to configure and tricky to offload
to 802.1Qaz-compliant hardware. This is due to several reasons:

- As any classful Qdisc, DRR supports adding classifiers to decide in which
  class to enqueue packets. Unlike PRIO, there's however no fallback in the
  form of priomap. A way to achieve classification based on packet priority
  is e.g. like this:

    # tc filter add dev swp1 root handle 1: \
		basic match 'meta(priority eq 0)' flowid 1:10

  Expressing the priomap in this manner however forces drivers to deep dive
  into the classifier block to parse the individual rules.

  A possible solution would be to extend the classes with a "defmap" a la
  split / defmap mechanism of CBQ, and introduce this as a last resort
  classification. However, unlike priomap, this doesn't have the guarantee
  of covering all priorities. Traffic whose priority is not covered is
  dropped by DRR as unclassified. But ASICs tend to implement dropping in
  the ACL block, not in scheduling pipelines. The need to treat these
  configurations correctly (if only to decide to not offload at all)
  complicates a driver.

  It's not clear how to retrofit priomap with all its benefits to DRR
  without changing it beyond recognition.

- The interplay between PRIO and DRR is also causing problems. 802.1Qaz has
  all ETS TCs as a last resort. I believe switch ASICs that support ETS at
  all will handle ETS traffic likewise. However the Linux model is more
  generic, allowing the DRR block in any band. Drivers would need to be
  careful to handle this case correctly, otherwise the offloaded model
  might not match the slow-path one.

  In a similar vein, PRIO and DRR need to agree on the list of priorities
  assigned to DRR. This is doubly problematic--the user needs to take care
  to keep the two in sync, and the driver needs to watch for any holes in
  DRR coverage and treat the traffic correctly, as discussed above.

  Note that at the time that DRR Qdisc is added, it has no classes, and
  thus any priorities assigned to that PRIO band are not covered. Thus this
  case is surprisingly rather common, and needs to be handled gracefully by
  the driver.

- Similarly due to DRR flexibility, when a Qdisc (such as RED) is attached
  below it, it is not immediately clear which TC the class represents. This
  is unlike PRIO with its straightforward classid scheme. When DRR is
  combined with PRIO, the relationship between classes and TCs gets even
  more murky.

  This is a problem for users as well: the TC mapping is rather important
  for (devlink) shared buffer configuration and (ethtool) counters.

So instead, this patch set introduces a new Qdisc, which is based on
802.1Qaz wording. It is PRIO-like in how it is configured, meaning one
needs to specify how many bands there are, how many are strict and how many
are ETS, quanta for the latter, and priomap.

The new Qdisc operates like the PRIO / DRR combo would when configured as
per the standard. The strict classes, if any, are tried for traffic first.
When there's no traffic in any of the strict queues, the ETS ones (if any)
are treated in the same way as in DRR.

The chosen interface makes the overall system both reasonably easy to
configure, and reasonably easy to offload. The extra code to support ETS in
mlxsw (which already supports PRIO) is about 150 lines, of which perhaps 20
lines is bona fide new business logic.

Credit-based shaping transmission selection algorithm can be configured by
adding a CBS Qdisc under one of the strict bands (e.g. TBF can be used to a
similar effect as well). As a non-work-conserving Qdisc, CBS can't be
hooked under the ETS bands. This is detected and handled identically to DRR
Qdisc at runtime. Note that offloading CBS is not subject of this patchset.

The patchset proceeds in four stages:

- Patches #1-#3 are cleanups.
- Patches #4 and #5 contain the new Qdisc.
- Patches #6 and #7 update mlxsw to offload the new Qdisc.
- Patches #8-#10 add selftests for ETS.

Examples:

- Add a Qdisc with 6 bands, 3 strict and 3 ETS with 45%-30%-25% weights:

    # tc qdisc add dev swp1 root handle 1: \
	ets strict 3 quanta 4500 3000 2500 priomap 0 1 1 1 2 3 4 5
    # tc qdisc sh dev swp1
    qdisc ets 1: root refcnt 2 bands 6 strict 3 quanta 4500 3000 2500 priomap 0 1 1 1 2 3 4 5 0 0 0 0 0 0 0 0 

- Tweak quantum of one of the classes of the previous Qdisc:

    # tc class ch dev swp1 classid 1:4 ets quantum 1000
    # tc qdisc sh dev swp1
    qdisc ets 1: root refcnt 2 bands 6 strict 3 quanta 1000 3000 2500 priomap 0 1 1 1 2 3 4 5 0 0 0 0 0 0 0 0 
    # tc class ch dev swp1 classid 1:3 ets quantum 1000
    Error: Strict bands do not have a configurable quantum.

- Purely strict Qdisc with 1:1 mapping between priorities and TCs:

    # tc qdisc add dev swp1 root handle 1: \
	ets strict 8 priomap 7 6 5 4 3 2 1 0
    # tc qdisc sh dev swp1
    qdisc ets 1: root refcnt 2 bands 8 strict 8 priomap 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0 0 

- Use "bands" to specify number of bands explicitly. Underspecified bands
  are implicitly ETS and their quantum is taken from MTU. The following
  thus gives each band the same weight:

    # tc qdisc add dev swp1 root handle 1: \
	ets bands 8 priomap 7 6 5 4 3 2 1 0
    # tc qdisc sh dev swp1
    qdisc ets 1: root refcnt 2 bands 8 quanta 1514 1514 1514 1514 1514 1514 1514 1514 priomap 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0 0 

Petr Machata (10):
  net: pkt_cls: Clarify a comment
  mlxsw: spectrum_qdisc: Clarify a comment
  mlxsw: spectrum: Fix typos in MLXSW_REG_QEEC_HIERARCHY_* enumerators
  net: sch_ets: Add a new Qdisc
  net: sch_ets: Make the ETS qdisc offloadable
  mlxsw: spectrum_qdisc: Generalize PRIO offload to support ETS
  mlxsw: spectrum_qdisc: Support offloading of ETS Qdisc
  selftests: forwarding: Move start_/stop_traffic from mlxsw to lib.sh
  selftests: forwarding: sch_ets: Add test coverage for ETS Qdisc
  selftests: qdiscs: Add test coverage for ETS Qdisc

 drivers/net/ethernet/mellanox/mlxsw/reg.h     |  10 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.c    |  26 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.h    |   2 +
 .../ethernet/mellanox/mlxsw/spectrum_dcb.c    |  16 +-
 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 205 ++++-
 include/linux/netdevice.h                     |   1 +
 include/net/pkt_cls.h                         |  36 +-
 include/uapi/linux/pkt_sched.h                |  29 +
 net/sched/Kconfig                             |  11 +
 net/sched/Makefile                            |   1 +
 net/sched/sch_ets.c                           | 796 ++++++++++++++++++
 .../selftests/drivers/net/mlxsw/qos_lib.sh    |  46 +-
 .../selftests/drivers/net/mlxsw/sch_ets.sh    |  54 ++
 tools/testing/selftests/net/forwarding/lib.sh |  18 +
 .../selftests/net/forwarding/sch_ets.sh       |  30 +
 .../selftests/net/forwarding/sch_ets_core.sh  | 229 +++++
 .../selftests/net/forwarding/sch_ets_tests.sh | 230 +++++
 .../tc-testing/tc-tests/qdiscs/ets.json       | 709 ++++++++++++++++
 18 files changed, 2371 insertions(+), 78 deletions(-)
 create mode 100644 net/sched/sch_ets.c
 create mode 100755 tools/testing/selftests/drivers/net/mlxsw/sch_ets.sh
 create mode 100755 tools/testing/selftests/net/forwarding/sch_ets.sh
 create mode 100644 tools/testing/selftests/net/forwarding/sch_ets_core.sh
 create mode 100644 tools/testing/selftests/net/forwarding/sch_ets_tests.sh
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json

-- 
2.20.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH 01/10] net: pkt_cls: Clarify a comment
  2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
@ 2019-11-20 13:05 ` Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 02/10] mlxsw: spectrum_qdisc: " Petr Machata
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Petr Machata @ 2019-11-20 13:05 UTC (permalink / raw)
  To: netdev; +Cc: Petr Machata, Ido Schimmel, Roopa Prabhu

The bit about negating HW backlog left me scratching my head. Clarify the
comment.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 include/net/pkt_cls.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index e553fc80eb23..a7c5d492bc04 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -791,9 +791,8 @@ enum tc_prio_command {
 struct tc_prio_qopt_offload_params {
 	int bands;
 	u8 priomap[TC_PRIO_MAX + 1];
-	/* In case that a prio qdisc is offloaded and now is changed to a
-	 * non-offloadedable config, it needs to update the backlog & qlen
-	 * values to negate the HW backlog & qlen values (and only them).
+	/* At the point of un-offloading the Qdisc, the reported backlog and
+	 * qlen need to be reduced by the portion that is in HW.
 	 */
 	struct gnet_stats_queue *qstats;
 };
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 02/10] mlxsw: spectrum_qdisc: Clarify a comment
  2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 01/10] net: pkt_cls: Clarify a comment Petr Machata
@ 2019-11-20 13:05 ` Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 03/10] mlxsw: spectrum: Fix typos in MLXSW_REG_QEEC_HIERARCHY_* enumerators Petr Machata
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Petr Machata @ 2019-11-20 13:05 UTC (permalink / raw)
  To: netdev; +Cc: Petr Machata, Ido Schimmel, Roopa Prabhu

Expand the coment at mlxsw_sp_qdisc_prio_graft() to make the problem that
this function is trying to handle clearer.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index 68cc6737d45c..0457ff5f6942 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -631,10 +631,18 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_prio = {
 	.clean_stats = mlxsw_sp_setup_tc_qdisc_prio_clean_stats,
 };
 
-/* Grafting is not supported in mlxsw. It will result in un-offloading of the
- * grafted qdisc as well as the qdisc in the qdisc new location.
- * (However, if the graft is to the location where the qdisc is already at, it
- * will be ignored completely and won't cause un-offloading).
+/* Linux allows linking of Qdiscs to arbitrary classes (so long as the resulting
+ * graph is free of cycles). These operations do not change the parent handle
+ * though, which means it can be incomplete (if there is more than one class
+ * where the Qdisc in question is grafted) or outright wrong (if the Qdisc was
+ * linked to a different class and then removed from the original class).
+ *
+ * The notification for child Qdisc replace (e.g. TC_RED_REPLACE) comes before
+ * the notification for parent graft (e.g. TC_PRIO_GRAFT). We take the replace
+ * notification to offload the child Qdisc, based on its parent handle, and use
+ * the graft operation to validate that the class where the child is actually
+ * grafted corresponds to the parent handle. If the two don't match, we
+ * unoffload the child.
  */
 static int
 mlxsw_sp_qdisc_prio_graft(struct mlxsw_sp_port *mlxsw_sp_port,
@@ -644,9 +652,6 @@ mlxsw_sp_qdisc_prio_graft(struct mlxsw_sp_port *mlxsw_sp_port,
 	int tclass_num = MLXSW_SP_PRIO_BAND_TO_TCLASS(p->band);
 	struct mlxsw_sp_qdisc *old_qdisc;
 
-	/* Check if the grafted qdisc is already in its "new" location. If so -
-	 * nothing needs to be done.
-	 */
 	if (p->band < IEEE_8021QAZ_MAX_TCS &&
 	    mlxsw_sp_port->tclass_qdiscs[tclass_num].handle == p->child_handle)
 		return 0;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 03/10] mlxsw: spectrum: Fix typos in MLXSW_REG_QEEC_HIERARCHY_* enumerators
  2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 01/10] net: pkt_cls: Clarify a comment Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 02/10] mlxsw: spectrum_qdisc: " Petr Machata
@ 2019-11-20 13:05 ` Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 04/10] net: sch_ets: Add a new Qdisc Petr Machata
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Petr Machata @ 2019-11-20 13:05 UTC (permalink / raw)
  To: netdev; +Cc: Petr Machata, Ido Schimmel, Roopa Prabhu

The current spelling of MLXSW_REG_QEEC_HIERARCY_ complicates searching.
Fix it.

Adjusted some formatting to be 80-char clean in impacted blocks.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h     | 10 ++++----
 .../net/ethernet/mellanox/mlxsw/spectrum.c    | 24 +++++++++----------
 .../ethernet/mellanox/mlxsw/spectrum_dcb.c    | 16 ++++++-------
 3 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 5294a1622643..e43ae1654af1 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -3477,10 +3477,10 @@ MLXSW_REG_DEFINE(qeec, MLXSW_REG_QEEC_ID, MLXSW_REG_QEEC_LEN);
 MLXSW_ITEM32(reg, qeec, local_port, 0x00, 16, 8);
 
 enum mlxsw_reg_qeec_hr {
-	MLXSW_REG_QEEC_HIERARCY_PORT,
-	MLXSW_REG_QEEC_HIERARCY_GROUP,
-	MLXSW_REG_QEEC_HIERARCY_SUBGROUP,
-	MLXSW_REG_QEEC_HIERARCY_TC,
+	MLXSW_REG_QEEC_HIERARCHY_PORT,
+	MLXSW_REG_QEEC_HIERARCHY_GROUP,
+	MLXSW_REG_QEEC_HIERARCHY_SUBGROUP,
+	MLXSW_REG_QEEC_HIERARCHY_TC,
 };
 
 /* reg_qeec_element_hierarchy
@@ -3619,7 +3619,7 @@ static inline void mlxsw_reg_qeec_ptps_pack(char *payload, u8 local_port,
 	MLXSW_REG_ZERO(qeec, payload);
 	mlxsw_reg_qeec_local_port_set(payload, local_port);
 	mlxsw_reg_qeec_element_hierarchy_set(payload,
-					     MLXSW_REG_QEEC_HIERARCY_PORT);
+					     MLXSW_REG_QEEC_HIERARCHY_PORT);
 	mlxsw_reg_qeec_ptps_set(payload, ptps);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 556dca328bb5..cb24807c119d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -3602,26 +3602,26 @@ static int mlxsw_sp_port_ets_init(struct mlxsw_sp_port *mlxsw_sp_port)
 	 * one subgroup, which are all member in the same group.
 	 */
 	err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
-				    MLXSW_REG_QEEC_HIERARCY_GROUP, 0, 0, false,
+				    MLXSW_REG_QEEC_HIERARCHY_GROUP, 0, 0, false,
 				    0);
 	if (err)
 		return err;
 	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
 		err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
-					    MLXSW_REG_QEEC_HIERARCY_SUBGROUP, i,
-					    0, false, 0);
+					    MLXSW_REG_QEEC_HIERARCHY_SUBGROUP,
+					    i, 0, false, 0);
 		if (err)
 			return err;
 	}
 	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
 		err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
-					    MLXSW_REG_QEEC_HIERARCY_TC, i, i,
+					    MLXSW_REG_QEEC_HIERARCHY_TC, i, i,
 					    false, 0);
 		if (err)
 			return err;
 
 		err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
-					    MLXSW_REG_QEEC_HIERARCY_TC,
+					    MLXSW_REG_QEEC_HIERARCHY_TC,
 					    i + 8, i,
 					    true, 100);
 		if (err)
@@ -3633,28 +3633,28 @@ static int mlxsw_sp_port_ets_init(struct mlxsw_sp_port *mlxsw_sp_port)
 	 * for the initial configuration.
 	 */
 	err = mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port,
-					    MLXSW_REG_QEEC_HIERARCY_PORT, 0, 0,
+					    MLXSW_REG_QEEC_HIERARCHY_PORT, 0, 0,
 					    MLXSW_REG_QEEC_MAS_DIS);
 	if (err)
 		return err;
 	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
 		err = mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port,
-						    MLXSW_REG_QEEC_HIERARCY_SUBGROUP,
-						    i, 0,
-						    MLXSW_REG_QEEC_MAS_DIS);
+					    MLXSW_REG_QEEC_HIERARCHY_SUBGROUP,
+					    i, 0,
+					    MLXSW_REG_QEEC_MAS_DIS);
 		if (err)
 			return err;
 	}
 	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
 		err = mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port,
-						    MLXSW_REG_QEEC_HIERARCY_TC,
+						    MLXSW_REG_QEEC_HIERARCHY_TC,
 						    i, i,
 						    MLXSW_REG_QEEC_MAS_DIS);
 		if (err)
 			return err;
 
 		err = mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port,
-						    MLXSW_REG_QEEC_HIERARCY_TC,
+						    MLXSW_REG_QEEC_HIERARCHY_TC,
 						    i + 8, i,
 						    MLXSW_REG_QEEC_MAS_DIS);
 		if (err)
@@ -3664,7 +3664,7 @@ static int mlxsw_sp_port_ets_init(struct mlxsw_sp_port *mlxsw_sp_port)
 	/* Configure the min shaper for multicast TCs. */
 	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
 		err = mlxsw_sp_port_min_bw_set(mlxsw_sp_port,
-					       MLXSW_REG_QEEC_HIERARCY_TC,
+					       MLXSW_REG_QEEC_HIERARCHY_TC,
 					       i + 8, i,
 					       MLXSW_REG_QEEC_MIS_MIN);
 		if (err)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dcb.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dcb.c
index 21296fa7f7fb..4c335112dfbb 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dcb.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dcb.c
@@ -160,8 +160,8 @@ static int __mlxsw_sp_dcbnl_ieee_setets(struct mlxsw_sp_port *mlxsw_sp_port,
 		u8 weight = ets->tc_tx_bw[i];
 
 		err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
-					    MLXSW_REG_QEEC_HIERARCY_SUBGROUP, i,
-					    0, dwrr, weight);
+					    MLXSW_REG_QEEC_HIERARCHY_SUBGROUP,
+					    i, 0, dwrr, weight);
 		if (err) {
 			netdev_err(dev, "Failed to link subgroup ETS element %d to group\n",
 				   i);
@@ -198,8 +198,8 @@ static int __mlxsw_sp_dcbnl_ieee_setets(struct mlxsw_sp_port *mlxsw_sp_port,
 		u8 weight = my_ets->tc_tx_bw[i];
 
 		err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
-					    MLXSW_REG_QEEC_HIERARCY_SUBGROUP, i,
-					    0, dwrr, weight);
+					    MLXSW_REG_QEEC_HIERARCHY_SUBGROUP,
+					    i, 0, dwrr, weight);
 	}
 	return err;
 }
@@ -507,9 +507,9 @@ static int mlxsw_sp_dcbnl_ieee_setmaxrate(struct net_device *dev,
 
 	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
 		err = mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port,
-						    MLXSW_REG_QEEC_HIERARCY_SUBGROUP,
-						    i, 0,
-						    maxrate->tc_maxrate[i]);
+					    MLXSW_REG_QEEC_HIERARCHY_SUBGROUP,
+					    i, 0,
+					    maxrate->tc_maxrate[i]);
 		if (err) {
 			netdev_err(dev, "Failed to set maxrate for TC %d\n", i);
 			goto err_port_ets_maxrate_set;
@@ -523,7 +523,7 @@ static int mlxsw_sp_dcbnl_ieee_setmaxrate(struct net_device *dev,
 err_port_ets_maxrate_set:
 	for (i--; i >= 0; i--)
 		mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port,
-					      MLXSW_REG_QEEC_HIERARCY_SUBGROUP,
+					      MLXSW_REG_QEEC_HIERARCHY_SUBGROUP,
 					      i, 0, my_maxrate->tc_maxrate[i]);
 	return err;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 04/10] net: sch_ets: Add a new Qdisc
  2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (2 preceding siblings ...)
  2019-11-20 13:05 ` [RFC PATCH 03/10] mlxsw: spectrum: Fix typos in MLXSW_REG_QEEC_HIERARCHY_* enumerators Petr Machata
@ 2019-11-20 13:05 ` Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 05/10] net: sch_ets: Make the ETS qdisc offloadable Petr Machata
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Petr Machata @ 2019-11-20 13:05 UTC (permalink / raw)
  To: netdev; +Cc: Petr Machata, Ido Schimmel, Roopa Prabhu

Introduces a new Qdisc, which is based on 802.1Q-2014 wording. It is
PRIO-like in how it is configured, meaning one needs to specify how many
bands there are, how many are strict and how many are dwrr, quanta for the
latter, and priomap.

The new Qdisc operates like the PRIO / DRR combo would when configured as
per the standard. The strict classes, if any, are tried for traffic first.
When there's no traffic in any of the strict queues, the ETS ones (if any)
are treated in the same way as in DRR.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 include/uapi/linux/pkt_sched.h |  29 ++
 net/sched/Kconfig              |  11 +
 net/sched/Makefile             |   1 +
 net/sched/sch_ets.c            | 701 +++++++++++++++++++++++++++++++++
 4 files changed, 742 insertions(+)
 create mode 100644 net/sched/sch_ets.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 5011259b8f67..d7f1bd8fd136 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1181,4 +1181,33 @@ enum {
 
 #define TCA_TAPRIO_ATTR_MAX (__TCA_TAPRIO_ATTR_MAX - 1)
 
+/* ETS */
+
+#define TCQ_ETS_MAX_BANDS 16
+
+enum {
+	TCA_ETS_UNSPEC,
+	TCA_ETS_BANDS,	/* u8 */
+	TCA_ETS_STRICT,	/* u8 */
+	TCA_ETS_QUANTA,	/* nested TCA_ETS_BAND_QUANTUM */
+	TCA_ETS_PRIOMAP,	/* nested TCA_ETS_PMAP_BAND */
+	__TCA_ETS_MAX,
+};
+
+#define TCA_ETS_MAX (__TCA_ETS_MAX - 1)
+
+enum {
+	TCA_ETS_BAND_UNSPEC,
+	TCA_ETS_BAND_QUANTUM,	/* u32 */
+	__TCA_ETS_BAND_MAX,
+};
+#define TCA_ETS_BAND_MAX (__TCA_ETS_BAND_MAX - 1)
+
+enum {
+	TCA_ETS_PMAP_UNSPEC,
+	TCA_ETS_PMAP_BAND,	/* u8 */
+	__TCA_ETS_PMAP_MAX,
+};
+#define TCA_ETS_PMAP_MAX (__TCA_ETS_PMAP_MAX - 1)
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 2985509147a2..192878a1095e 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -409,6 +409,17 @@ config NET_SCH_PLUG
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_plug.
 
+config NET_SCH_ETS
+	tristate "Enhanced transmission selection scheduler (ETS)"
+	help
+	  Say Y here if you want to use the 802.1Qaz-compliant "enhanced
+	  transmission selection" packet scheduling algorithm.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_drr.
+
+	  If unsure, say N.
+
 menuconfig NET_SCH_DEFAULT
 	bool "Allow override default queue discipline"
 	---help---
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 415d1e1f237e..bc8856b865ff 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_NET_SCH_ATM)	+= sch_atm.o
 obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem.o
 obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
 obj-$(CONFIG_NET_SCH_PLUG)	+= sch_plug.o
+obj-$(CONFIG_NET_SCH_ETS)	+= sch_ets.o
 obj-$(CONFIG_NET_SCH_MQPRIO)	+= sch_mqprio.o
 obj-$(CONFIG_NET_SCH_SKBPRIO)	+= sch_skbprio.o
 obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
diff --git a/net/sched/sch_ets.c b/net/sched/sch_ets.c
new file mode 100644
index 000000000000..6a1751564c4b
--- /dev/null
+++ b/net/sched/sch_ets.c
@@ -0,0 +1,701 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * net/sched/sch_ets.c         Enhanced Transmission Selection scheduler
+ */
+
+#include <linux/module.h>
+#include <net/gen_stats.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/pkt_cls.h>
+#include <net/sch_generic.h>
+
+struct ets_class {
+	struct list_head alist; /* In struct ets_sched.active. */
+	struct Qdisc *qdisc;
+	u32 quantum;
+	u32 deficit;
+	struct gnet_stats_basic_packed bstats;
+	struct gnet_stats_queue qstats;
+};
+
+struct ets_sched {
+	struct list_head active;
+	struct tcf_proto __rcu *filter_list;
+	struct tcf_block *block;
+	unsigned int nbands;
+	unsigned int nstrict;
+	u8 prio2band[TC_PRIO_MAX + 1];
+	struct ets_class classes[TCQ_ETS_MAX_BANDS];
+};
+
+static const struct nla_policy ets_policy[TCA_ETS_MAX + 1] = {
+	[TCA_ETS_BANDS] = { .type = NLA_U8 },
+	[TCA_ETS_STRICT] = { .type = NLA_U8 },
+	[TCA_ETS_QUANTA] = { .type = NLA_NESTED },
+	[TCA_ETS_PRIOMAP] = { .type = NLA_NESTED },
+};
+
+static const struct nla_policy ets_pmap_policy[TCA_ETS_PMAP_MAX + 1] = {
+	[TCA_ETS_PMAP_BAND] = { .type = NLA_U8 },
+};
+
+static const struct nla_policy ets_band_policy[TCA_ETS_BAND_MAX + 1] = {
+	[TCA_ETS_BAND_QUANTUM]	= { .type = NLA_U32 },
+};
+
+static int ets_quantum_parse(struct Qdisc *sch, const struct nlattr *attr,
+			     unsigned int *pquantum,
+			     struct netlink_ext_ack *extack)
+{
+	unsigned int quantum;
+
+	if (attr) {
+		quantum = nla_get_u32(attr);
+		if (!quantum) {
+			NL_SET_ERR_MSG(extack, "ETS quantum cannot be zero");
+			return -EINVAL;
+		}
+	} else {
+		quantum = psched_mtu(qdisc_dev(sch));
+	}
+
+	*pquantum = quantum;
+	return 0;
+}
+
+static struct ets_class *
+ets_class_from_arg(struct Qdisc *sch, unsigned long arg)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+
+	return &q->classes[arg - 1];
+}
+
+static u32 ets_class_id(struct Qdisc *sch, struct ets_class *cl)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	int band = cl - q->classes;
+
+	return TC_H_MAKE(sch->handle, band + 1);
+}
+
+static bool ets_class_is_strict(struct ets_sched *q, struct ets_class *cl)
+{
+	unsigned int band = cl - q->classes;
+
+	return band < q->nstrict;
+}
+
+static int ets_class_change(struct Qdisc *sch, u32 classid, u32 parentid,
+			    struct nlattr **tca, unsigned long *arg,
+			    struct netlink_ext_ack *extack)
+{
+	struct ets_class *cl = ets_class_from_arg(sch, *arg);
+	struct ets_sched *q = qdisc_priv(sch);
+	struct nlattr *opt = tca[TCA_OPTIONS];
+	struct nlattr *tb[TCA_ETS_MAX + 1];
+	unsigned int quantum;
+	int err;
+
+	/* Classes can be added and removed only through Qdisc_ops.change
+	 * interface.
+	 */
+	if (!cl) {
+		NL_SET_ERR_MSG(extack, "Fine-grained class addition and removal is not supported");
+		return -EOPNOTSUPP;
+	}
+
+	if (!opt) {
+		NL_SET_ERR_MSG(extack, "ETS options are required for this operation");
+		return -EINVAL;
+	}
+
+	err = nla_parse_nested(tb, TCA_ETS_BAND_MAX, opt, ets_band_policy,
+			       extack);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_ETS_BAND_QUANTUM])
+		/* Nothing to configure. */
+		return 0;
+
+	if (ets_class_is_strict(q, cl)) {
+		NL_SET_ERR_MSG(extack, "Strict bands do not have a configurable quantum");
+		return -EINVAL;
+	}
+
+	err = ets_quantum_parse(sch, tb[TCA_ETS_BAND_QUANTUM], &quantum,
+				extack);
+	if (err)
+		return err;
+
+	sch_tree_lock(sch);
+	cl->quantum = quantum;
+	sch_tree_unlock(sch);
+	return 0;
+}
+
+static int ets_class_graft(struct Qdisc *sch, unsigned long arg,
+			   struct Qdisc *new, struct Qdisc **old,
+			   struct netlink_ext_ack *extack)
+{
+	struct ets_class *cl = ets_class_from_arg(sch, arg);
+
+	if (!new) {
+		new = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+					ets_class_id(sch, cl), NULL);
+		if (!new)
+			new = &noop_qdisc;
+	}
+
+	*old = qdisc_replace(sch, new, &cl->qdisc);
+	return 0;
+}
+
+static struct Qdisc *ets_class_leaf(struct Qdisc *sch, unsigned long arg)
+{
+	struct ets_class *cl = ets_class_from_arg(sch, arg);
+
+	return cl->qdisc;
+}
+
+static unsigned long ets_class_find(struct Qdisc *sch, u32 classid)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	unsigned long band = TC_H_MIN(classid);
+
+	if (band - 1 >= q->nbands)
+		return 0;
+	return band;
+}
+
+static void ets_class_qlen_notify(struct Qdisc *sch, unsigned long arg)
+{
+	struct ets_class *cl = ets_class_from_arg(sch, arg);
+	struct ets_sched *q = qdisc_priv(sch);
+
+	if (!ets_class_is_strict(q, cl))
+		list_del(&cl->alist);
+}
+
+static int ets_class_dump(struct Qdisc *sch, unsigned long arg,
+			  struct sk_buff *skb, struct tcmsg *tcm)
+{
+	struct ets_class *cl = ets_class_from_arg(sch, arg);
+	struct ets_sched *q = qdisc_priv(sch);
+	struct nlattr *nest;
+
+	tcm->tcm_parent = TC_H_ROOT;
+	tcm->tcm_handle = ets_class_id(sch, cl);
+	tcm->tcm_info = cl->qdisc->handle;
+
+	nest = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	if (!nest)
+		goto nla_put_failure;
+	if (!ets_class_is_strict(q, cl)) {
+		if (nla_put_u32(skb, TCA_ETS_BAND_QUANTUM, cl->quantum))
+			goto nla_put_failure;
+	}
+	return nla_nest_end(skb, nest);
+
+nla_put_failure:
+	nla_nest_cancel(skb, nest);
+	return -EMSGSIZE;
+}
+
+static int ets_class_dump_stats(struct Qdisc *sch, unsigned long arg,
+				struct gnet_dump *d)
+{
+	struct ets_class *cl = ets_class_from_arg(sch, arg);
+	struct Qdisc *cl_q = cl->qdisc;
+
+	if (gnet_stats_copy_basic(qdisc_root_sleeping_running(sch),
+				  d, NULL, &cl_q->bstats) < 0 ||
+	    qdisc_qstats_copy(d, cl_q) < 0)
+		return -1;
+
+	return 0;
+}
+
+static void ets_qdisc_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	int i;
+
+	if (arg->stop)
+		return;
+
+	for (i = 0; i < q->nbands; i++) {
+		if (arg->count < arg->skip) {
+			arg->count++;
+			continue;
+		}
+		if (arg->fn(sch, i + 1, arg) < 0) {
+			arg->stop = 1;
+			break;
+		}
+		arg->count++;
+	}
+}
+
+static struct tcf_block *
+ets_qdisc_tcf_block(struct Qdisc *sch, unsigned long cl,
+		    struct netlink_ext_ack *extack)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+
+	if (cl) {
+		NL_SET_ERR_MSG(extack, "ETS classid must be zero");
+		return NULL;
+	}
+
+	return q->block;
+}
+
+static unsigned long ets_qdisc_bind_tcf(struct Qdisc *sch, unsigned long parent,
+					u32 classid)
+{
+	return ets_class_find(sch, classid);
+}
+
+static void ets_qdisc_unbind_tcf(struct Qdisc *sch, unsigned long arg)
+{
+}
+
+static struct ets_class *ets_classify(struct sk_buff *skb, struct Qdisc *sch,
+				      int *qerr)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	u32 band = skb->priority;
+	struct tcf_result res;
+	struct tcf_proto *fl;
+	int err;
+
+	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+	if (TC_H_MAJ(skb->priority) != sch->handle) {
+		fl = rcu_dereference_bh(q->filter_list);
+		err = tcf_classify(skb, fl, &res, false);
+#ifdef CONFIG_NET_CLS_ACT
+		switch (err) {
+		case TC_ACT_STOLEN:
+		case TC_ACT_QUEUED:
+		case TC_ACT_TRAP:
+			*qerr = NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
+			/* fall through */
+		case TC_ACT_SHOT:
+			return NULL;
+		}
+#endif
+		if (!fl || err < 0) {
+			if (TC_H_MAJ(band))
+				band = 0;
+			return &q->classes[q->prio2band[band & TC_PRIO_MAX]];
+		}
+		band = res.classid;
+	}
+	band = TC_H_MIN(band) - 1;
+	if (band >= q->nbands)
+		return &q->classes[q->prio2band[0]];
+	return &q->classes[band];
+}
+
+static int ets_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			     struct sk_buff **to_free)
+{
+	unsigned int len = qdisc_pkt_len(skb);
+	struct ets_sched *q = qdisc_priv(sch);
+	struct ets_class *cl;
+	int err = 0;
+	bool first;
+
+	cl = ets_classify(skb, sch, &err);
+	if (!cl) {
+		if (err & __NET_XMIT_BYPASS)
+			qdisc_qstats_drop(sch);
+		__qdisc_drop(skb, to_free);
+		return err;
+	}
+
+	first = !cl->qdisc->q.qlen;
+	err = qdisc_enqueue(skb, cl->qdisc, to_free);
+	if (unlikely(err != NET_XMIT_SUCCESS)) {
+		if (net_xmit_drop_count(err)) {
+			cl->qstats.drops++;
+			qdisc_qstats_drop(sch);
+		}
+		return err;
+	}
+
+	if (first && !ets_class_is_strict(q, cl)) {
+		list_add_tail(&cl->alist, &q->active);
+		cl->deficit = cl->quantum;
+	}
+
+	sch->qstats.backlog += len;
+	sch->q.qlen++;
+	return err;
+}
+
+static struct sk_buff *
+ets_qdisc_dequeue_skb(struct Qdisc *sch, struct sk_buff *skb)
+{
+	qdisc_bstats_update(sch, skb);
+	qdisc_qstats_backlog_dec(sch, skb);
+	sch->q.qlen--;
+	return skb;
+}
+
+static struct sk_buff *ets_qdisc_dequeue(struct Qdisc *sch)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	struct ets_class *cl;
+	struct sk_buff *skb;
+	unsigned int band;
+	unsigned int len;
+
+	while (1) {
+		for (band = 0; band < q->nstrict; band++) {
+			cl = &q->classes[band];
+			skb = qdisc_dequeue_peeked(cl->qdisc);
+			if (skb)
+				return ets_qdisc_dequeue_skb(sch, skb);
+		}
+
+		if (list_empty(&q->active))
+			goto out;
+
+		cl = list_first_entry(&q->active, struct ets_class, alist);
+		skb = cl->qdisc->ops->peek(cl->qdisc);
+		if (!skb) {
+			qdisc_warn_nonwc(__func__, cl->qdisc);
+			goto out;
+		}
+
+		len = qdisc_pkt_len(skb);
+		if (len <= cl->deficit) {
+			cl->deficit -= len;
+			skb = qdisc_dequeue_peeked(cl->qdisc);
+			if (unlikely(!skb))
+				goto out;
+			if (cl->qdisc->q.qlen == 0)
+				list_del(&cl->alist);
+			return ets_qdisc_dequeue_skb(sch, skb);
+		}
+
+		cl->deficit += cl->quantum;
+		list_move_tail(&cl->alist, &q->active);
+	}
+out:
+	return NULL;
+}
+
+static int ets_qdisc_priomap_parse(struct nlattr *priomap_attr,
+				   unsigned int nbands,
+				   u8 priomap[TCA_ETS_MAX + 1],
+				   struct netlink_ext_ack *extack)
+{
+	const struct nlattr *attr;
+	int prio = 0;
+	u8 band;
+	int rem;
+	int err;
+
+	err = nl80211_validate_nested(priomap_attr, TCA_ETS_PMAP_MAX,
+				      ets_pmap_policy, extack);
+	if (err)
+		return err;
+
+	nla_for_each_nested(attr, priomap_attr, rem) {
+		switch (nla_type(attr)) {
+		case TCA_ETS_PMAP_BAND:
+			if (prio > TC_PRIO_MAX) {
+				NL_SET_ERR_MSG_MOD(extack, "Too many priorities in ETS priomap");
+				return -EINVAL;
+			}
+			band = nla_get_u8(attr);
+			if (band >= nbands) {
+				NL_SET_ERR_MSG_MOD(extack, "Invalid band number in ETS priomap");
+				return -EINVAL;
+			}
+			priomap[prio++] = band;
+			break;
+		default:
+			NL_SET_ERR_MSG_MOD(extack, "ETS priomap can contain only ETS_PMAP_BAND attributes");
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int ets_qdisc_quanta_parse(struct Qdisc *sch, struct nlattr *quanta_attr,
+				  unsigned int nbands,
+				  unsigned int nstrict,
+				  unsigned int quanta[TCQ_ETS_MAX_BANDS],
+				  struct netlink_ext_ack *extack)
+{
+	const struct nlattr *attr;
+	int band = nstrict;
+	int rem;
+	int err;
+
+	err = nl80211_validate_nested(quanta_attr, TCA_ETS_BAND_MAX,
+				      ets_band_policy, extack);
+	if (err < 0)
+		return err;
+
+	nla_for_each_nested(attr, quanta_attr, rem) {
+		switch (nla_type(attr)) {
+		case TCA_ETS_BAND_QUANTUM:
+			if (band >= nbands) {
+				NL_SET_ERR_MSG_MOD(extack, "ETS quanta has more values than bands");
+				return -EINVAL;
+			}
+			err = ets_quantum_parse(sch, attr, &quanta[band++],
+						extack);
+			if (err)
+				return err;
+			break;
+		default:
+			NL_SET_ERR_MSG_MOD(extack, "ETS quanta can contain only ETS_BAND_QUANTUM attributes");
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int ets_qdisc_change(struct Qdisc *sch, struct nlattr *opt,
+			    struct netlink_ext_ack *extack)
+{
+	unsigned int quanta[TCQ_ETS_MAX_BANDS] = {0};
+	struct Qdisc *queues[TCQ_ETS_MAX_BANDS];
+	struct ets_sched *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_ETS_MAX + 1];
+	u8 priomap[TC_PRIO_MAX + 1] = {0};
+	unsigned int oldbands = q->nbands;
+	unsigned int nstrict = 0;
+	unsigned int nbands;
+	unsigned int i;
+	int err;
+
+	if (!opt) {
+		NL_SET_ERR_MSG(extack, "ETS options are required for this operation");
+		return -EINVAL;
+	}
+
+	err = nla_parse_nested(tb, TCA_ETS_MAX, opt, ets_policy, extack);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_ETS_BANDS]) {
+		NL_SET_ERR_MSG_MOD(extack, "Number of bands is a required argument");
+		return -EINVAL;
+	}
+	nbands = nla_get_u8(tb[TCA_ETS_BANDS]);
+	if (nbands < 1 || nbands > TCQ_ETS_MAX_BANDS) {
+		NL_SET_ERR_MSG_MOD(extack, "Invalid number of bands");
+		return -EINVAL;
+	}
+
+	if (tb[TCA_ETS_STRICT]) {
+		nstrict = nla_get_u8(tb[TCA_ETS_STRICT]);
+		if (nstrict > nbands) {
+			NL_SET_ERR_MSG_MOD(extack, "Invalid number of strict bands");
+			return -EINVAL;
+		}
+	}
+
+	if (tb[TCA_ETS_PRIOMAP]) {
+		err = ets_qdisc_priomap_parse(tb[TCA_ETS_PRIOMAP],
+					      nbands, priomap, extack);
+		if (err)
+			return err;
+	}
+
+	if (tb[TCA_ETS_QUANTA]) {
+		err = ets_qdisc_quanta_parse(sch, tb[TCA_ETS_QUANTA],
+					     nbands, nstrict, quanta, extack);
+		if (err)
+			return err;
+	}
+	for (i = nstrict; i < nbands; i++) {
+		if (!quanta[i])
+			ets_quantum_parse(sch, NULL, &quanta[i], extack);
+	}
+
+	/* Before commit, make sure we can allocate all new qdiscs */
+	for (i = oldbands; i < nbands; i++) {
+		queues[i] = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+					      ets_class_id(sch, &q->classes[i]),
+					      extack);
+		if (!queues[i]) {
+			while (i > oldbands)
+				qdisc_put(queues[--i]);
+			return -ENOMEM;
+		}
+	}
+
+	sch_tree_lock(sch);
+
+	q->nbands = nbands;
+	q->nstrict = nstrict;
+	memcpy(q->prio2band, priomap, sizeof(priomap));
+
+	for (i = q->nbands; i < oldbands; i++)
+		qdisc_tree_flush_backlog(q->classes[i].qdisc);
+
+	for (i = 0; i < q->nbands; i++)
+		q->classes[i].quantum = quanta[i];
+
+	for (i = oldbands; i < q->nbands; i++) {
+		q->classes[i].qdisc = queues[i];
+		if (q->classes[i].qdisc != &noop_qdisc)
+			qdisc_hash_add(q->classes[i].qdisc, true);
+	}
+
+	sch_tree_unlock(sch);
+
+	for (i = q->nbands; i < oldbands; i++) {
+		qdisc_put(q->classes[i].qdisc);
+		memset(&q->classes[i], 0, sizeof(q->classes[i]));
+	}
+	return 0;
+}
+
+static int ets_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
+			  struct netlink_ext_ack *extack)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+
+	err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
+	if (err)
+		return err;
+
+	INIT_LIST_HEAD(&q->active);
+	return ets_qdisc_change(sch, opt, extack);
+}
+
+static void ets_qdisc_reset(struct Qdisc *sch)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	int band;
+
+	for (band = q->nstrict; band < q->nbands; band++) {
+		if (q->classes[band].qdisc->q.qlen)
+			list_del(&q->classes[band].alist);
+	}
+	for (band = 0; band < q->nbands; band++)
+		qdisc_reset(q->classes[band].qdisc);
+	sch->qstats.backlog = 0;
+	sch->q.qlen = 0;
+}
+
+static void ets_qdisc_destroy(struct Qdisc *sch)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	int band;
+
+	tcf_block_put(q->block);
+	for (band = 0; band < q->nbands; band++)
+		qdisc_put(q->classes[band].qdisc);
+}
+
+static int ets_qdisc_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	struct nlattr *opts;
+	struct nlattr *nest;
+	int band;
+	int prio;
+
+	opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	if (!opts)
+		goto nla_err;
+
+	if (nla_put_u8(skb, TCA_ETS_BANDS, q->nbands))
+		goto nla_err;
+
+	if (q->nstrict &&
+	    nla_put_u8(skb, TCA_ETS_STRICT, q->nstrict))
+		goto nla_err;
+
+	if (q->nbands > q->nstrict) {
+		nest = nla_nest_start(skb, TCA_ETS_QUANTA);
+		if (!nest)
+			goto nla_err;
+
+		for (band = q->nstrict; band < q->nbands; band++) {
+			if (nla_put_u32(skb, TCA_ETS_BAND_QUANTUM,
+					q->classes[band].quantum))
+				goto nla_err;
+		}
+
+		nla_nest_end(skb, nest);
+	}
+
+	nest = nla_nest_start(skb, TCA_ETS_PRIOMAP);
+	if (!nest)
+		goto nla_err;
+
+	for (prio = 0; prio <= TC_PRIO_MAX; prio++) {
+		if (nla_put_u8(skb, TCA_ETS_PMAP_BAND, q->prio2band[prio]))
+			goto nla_err;
+	}
+
+	nla_nest_end(skb, nest);
+
+	return nla_nest_end(skb, opts);
+
+nla_err:
+	nla_nest_cancel(skb, opts);
+	return -EMSGSIZE;
+}
+
+static const struct Qdisc_class_ops ets_class_ops = {
+	.change		= ets_class_change,
+	.graft		= ets_class_graft,
+	.leaf		= ets_class_leaf,
+	.find		= ets_class_find,
+	.qlen_notify	= ets_class_qlen_notify,
+	.dump		= ets_class_dump,
+	.dump_stats	= ets_class_dump_stats,
+	.walk		= ets_qdisc_walk,
+	.tcf_block	= ets_qdisc_tcf_block,
+	.bind_tcf	= ets_qdisc_bind_tcf,
+	.unbind_tcf	= ets_qdisc_unbind_tcf,
+};
+
+static struct Qdisc_ops ets_qdisc_ops __read_mostly = {
+	.cl_ops		= &ets_class_ops,
+	.id		= "ets",
+	.priv_size	= sizeof(struct ets_sched),
+	.enqueue	= ets_qdisc_enqueue,
+	.dequeue	= ets_qdisc_dequeue,
+	.peek		= qdisc_peek_dequeued,
+	.change		= ets_qdisc_change,
+	.init		= ets_qdisc_init,
+	.reset		= ets_qdisc_reset,
+	.destroy	= ets_qdisc_destroy,
+	.dump		= ets_qdisc_dump,
+	.owner		= THIS_MODULE,
+};
+
+static int __init ets_init(void)
+{
+	return register_qdisc(&ets_qdisc_ops);
+}
+
+static void __exit ets_exit(void)
+{
+	unregister_qdisc(&ets_qdisc_ops);
+}
+
+module_init(ets_init);
+module_exit(ets_exit);
+MODULE_LICENSE("GPL");
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 05/10] net: sch_ets: Make the ETS qdisc offloadable
  2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (3 preceding siblings ...)
  2019-11-20 13:05 ` [RFC PATCH 04/10] net: sch_ets: Add a new Qdisc Petr Machata
@ 2019-11-20 13:05 ` Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 06/10] mlxsw: spectrum_qdisc: Generalize PRIO offload to support ETS Petr Machata
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Petr Machata @ 2019-11-20 13:05 UTC (permalink / raw)
  To: netdev; +Cc: Petr Machata, Ido Schimmel, Roopa Prabhu

Add hooks at appropriate points to make it possible to offload the ETS
Qdisc.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 include/linux/netdevice.h |  1 +
 include/net/pkt_cls.h     | 31 ++++++++++++
 net/sched/sch_ets.c       | 99 ++++++++++++++++++++++++++++++++++++++-
 3 files changed, 129 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9e6fb8524d91..23ffcfcc9bc4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -849,6 +849,7 @@ enum tc_setup_type {
 	TC_SETUP_QDISC_GRED,
 	TC_SETUP_QDISC_TAPRIO,
 	TC_SETUP_FT,
+	TC_SETUP_QDISC_ETS,
 };
 
 /* These structures hold the attributes of bpf state that are being passed
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a7c5d492bc04..942edfb3c7a5 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -823,4 +823,35 @@ struct tc_root_qopt_offload {
 	bool ingress;
 };
 
+enum tc_ets_command {
+	TC_ETS_REPLACE,
+	TC_ETS_DESTROY,
+	TC_ETS_STATS,
+	TC_ETS_GRAFT,
+};
+
+struct tc_ets_qopt_offload_replace_params {
+	unsigned int bands;
+	u8 priomap[TC_PRIO_MAX + 1];
+	unsigned int quanta[TCQ_ETS_MAX_BANDS];
+	unsigned int weights[TCQ_ETS_MAX_BANDS];
+	struct gnet_stats_queue *qstats;
+};
+
+struct tc_ets_qopt_offload_graft_params {
+	u8 band;
+	u32 child_handle;
+};
+
+struct tc_ets_qopt_offload {
+	enum tc_ets_command command;
+	u32 handle;
+	u32 parent;
+	union {
+		struct tc_ets_qopt_offload_replace_params replace_params;
+		struct tc_qopt_offload_stats stats;
+		struct tc_ets_qopt_offload_graft_params graft_params;
+	};
+};
+
 #endif
diff --git a/net/sched/sch_ets.c b/net/sched/sch_ets.c
index 6a1751564c4b..7fdccb192a33 100644
--- a/net/sched/sch_ets.c
+++ b/net/sched/sch_ets.c
@@ -80,6 +80,90 @@ static u32 ets_class_id(struct Qdisc *sch, struct ets_class *cl)
 	return TC_H_MAKE(sch->handle, band + 1);
 }
 
+static void ets_offload_change(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct ets_sched *q = qdisc_priv(sch);
+	struct tc_ets_qopt_offload qopt;
+	unsigned int w_psum_prev = 0;
+	unsigned int q_psum = 0;
+	unsigned int q_sum = 0;
+	unsigned int quantum;
+	unsigned int w_psum;
+	unsigned int weight;
+	unsigned int i;
+
+	if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc)
+		return;
+
+	qopt.command = TC_ETS_REPLACE;
+	qopt.handle = sch->handle;
+	qopt.parent = sch->parent;
+	qopt.replace_params.bands = q->nbands;
+	memcpy(&qopt.replace_params.priomap,
+	       q->prio2band, sizeof(q->prio2band));
+
+	for (i = 0; i < q->nbands; i++)
+		q_sum += q->classes[i].quantum;
+
+	for (i = 0; i < q->nbands; i++) {
+		quantum = q->classes[i].quantum;
+		q_psum += quantum;
+		w_psum = quantum ? q_psum * 100 / q_sum : 0;
+		weight = w_psum - w_psum_prev;
+		w_psum_prev = w_psum;
+
+		qopt.replace_params.quanta[i] = quantum;
+		qopt.replace_params.weights[i] = weight;
+	}
+
+	dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_ETS, &qopt);
+}
+
+static void ets_offload_destroy(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct tc_ets_qopt_offload qopt;
+
+	if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc)
+		return;
+
+	qopt.command = TC_ETS_DESTROY;
+	qopt.handle = sch->handle;
+	qopt.parent = sch->parent;
+	dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_ETS, &qopt);
+}
+
+static void ets_offload_graft(struct Qdisc *sch, struct Qdisc *new,
+			      struct Qdisc *old, unsigned long arg,
+			      struct netlink_ext_ack *extack)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct tc_ets_qopt_offload qopt;
+
+	qopt.command = TC_ETS_GRAFT;
+	qopt.handle = sch->handle;
+	qopt.parent = sch->parent;
+	qopt.graft_params.band = arg - 1;
+	qopt.graft_params.child_handle = new->handle;
+
+	qdisc_offload_graft_helper(dev, sch, new, old, TC_SETUP_QDISC_ETS,
+				   &qopt, extack);
+}
+
+static int ets_offload_dump(struct Qdisc *sch)
+{
+	struct tc_ets_qopt_offload qopt;
+
+	qopt.command = TC_ETS_STATS;
+	qopt.handle = sch->handle;
+	qopt.parent = sch->parent;
+	qopt.stats.bstats = &sch->bstats;
+	qopt.stats.qstats = &sch->qstats;
+
+	return qdisc_offload_dump_helper(sch, TC_SETUP_QDISC_ETS, &qopt);
+}
+
 static bool ets_class_is_strict(struct ets_sched *q, struct ets_class *cl)
 {
 	unsigned int band = cl - q->classes;
@@ -133,6 +217,8 @@ static int ets_class_change(struct Qdisc *sch, u32 classid, u32 parentid,
 	sch_tree_lock(sch);
 	cl->quantum = quantum;
 	sch_tree_unlock(sch);
+
+	ets_offload_change(sch);
 	return 0;
 }
 
@@ -150,6 +236,7 @@ static int ets_class_graft(struct Qdisc *sch, unsigned long arg,
 	}
 
 	*old = qdisc_replace(sch, new, &cl->qdisc);
+	ets_offload_graft(sch, new, *old, arg, extack);
 	return 0;
 }
 
@@ -557,6 +644,7 @@ static int ets_qdisc_change(struct Qdisc *sch, struct nlattr *opt,
 
 	sch_tree_unlock(sch);
 
+	ets_offload_change(sch);
 	for (i = q->nbands; i < oldbands; i++) {
 		qdisc_put(q->classes[i].qdisc);
 		memset(&q->classes[i], 0, sizeof(q->classes[i]));
@@ -601,6 +689,7 @@ static void ets_qdisc_destroy(struct Qdisc *sch)
 	struct ets_sched *q = qdisc_priv(sch);
 	int band;
 
+	ets_offload_destroy(sch);
 	tcf_block_put(q->block);
 	for (band = 0; band < q->nbands; band++)
 		qdisc_put(q->classes[band].qdisc);
@@ -609,11 +698,17 @@ static void ets_qdisc_destroy(struct Qdisc *sch)
 static int ets_qdisc_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
 	struct ets_sched *q = qdisc_priv(sch);
-	struct nlattr *opts;
+	struct nlattr *opts = NULL;
 	struct nlattr *nest;
 	int band;
 	int prio;
+	int err;
 
+	err = ets_offload_dump(sch);
+	if (err)
+		goto nla_err;
+
+	err = -EMSGSIZE;
 	opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
 	if (!opts)
 		goto nla_err;
@@ -654,7 +749,7 @@ static int ets_qdisc_dump(struct Qdisc *sch, struct sk_buff *skb)
 
 nla_err:
 	nla_nest_cancel(skb, opts);
-	return -EMSGSIZE;
+	return err;
 }
 
 static const struct Qdisc_class_ops ets_class_ops = {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 06/10] mlxsw: spectrum_qdisc: Generalize PRIO offload to support ETS
  2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (4 preceding siblings ...)
  2019-11-20 13:05 ` [RFC PATCH 05/10] net: sch_ets: Make the ETS qdisc offloadable Petr Machata
@ 2019-11-20 13:05 ` Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 07/10] mlxsw: spectrum_qdisc: Support offloading of ETS Qdisc Petr Machata
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Petr Machata @ 2019-11-20 13:05 UTC (permalink / raw)
  To: netdev; +Cc: Petr Machata, Ido Schimmel, Roopa Prabhu

Thanks to the similarity between PRIO and ETS it is possible to simply
reuse most of the code for offloading PRIO Qdisc. Extract the common
functionality into separate functions, making the current PRIO handlers
thin API adapters.

Extend the new functions to pass quanta for individual bands, which allows
configuring a subset of bands as WRR. Invoke mlxsw_sp_port_ets_set() as
appropriate to de/configure WRR-ness and weight of individual bands.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 112 ++++++++++++++----
 1 file changed, 87 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index 0457ff5f6942..4cd2020bcbe0 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -471,14 +471,16 @@ int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port,
 }
 
 static int
-mlxsw_sp_qdisc_prio_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
-			    struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
+__mlxsw_sp_qdisc_ets_destroy(struct mlxsw_sp_port *mlxsw_sp_port)
 {
 	int i;
 
 	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
 		mlxsw_sp_port_prio_tc_set(mlxsw_sp_port, i,
 					  MLXSW_SP_PORT_DEFAULT_TCLASS);
+		mlxsw_sp_port_ets_set(mlxsw_sp_port,
+				      MLXSW_REG_QEEC_HIERARCHY_SUBGROUP,
+				      i, 0, false, 0);
 		mlxsw_sp_qdisc_destroy(mlxsw_sp_port,
 				       &mlxsw_sp_port->tclass_qdiscs[i]);
 		mlxsw_sp_port->tclass_qdiscs[i].prio_bitmap = 0;
@@ -487,6 +489,22 @@ mlxsw_sp_qdisc_prio_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
 	return 0;
 }
 
+static int
+mlxsw_sp_qdisc_prio_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
+			    struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
+{
+	return __mlxsw_sp_qdisc_ets_destroy(mlxsw_sp_port);
+}
+
+static int
+__mlxsw_sp_qdisc_ets_check_params(unsigned int nbands)
+{
+	if (nbands > IEEE_8021QAZ_MAX_TCS)
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
 static int
 mlxsw_sp_qdisc_prio_check_params(struct mlxsw_sp_port *mlxsw_sp_port,
 				 struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
@@ -494,30 +512,48 @@ mlxsw_sp_qdisc_prio_check_params(struct mlxsw_sp_port *mlxsw_sp_port,
 {
 	struct tc_prio_qopt_offload_params *p = params;
 
-	if (p->bands > IEEE_8021QAZ_MAX_TCS)
-		return -EOPNOTSUPP;
-
-	return 0;
+	return __mlxsw_sp_qdisc_ets_check_params(p->bands);
 }
 
 static int
-mlxsw_sp_qdisc_prio_replace(struct mlxsw_sp_port *mlxsw_sp_port,
-			    struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
-			    void *params)
+__mlxsw_sp_qdisc_ets_replace(struct mlxsw_sp_port *mlxsw_sp_port,
+			     unsigned int nbands,
+			     unsigned int quanta[TCQ_ETS_MAX_BANDS],
+			     u8 priomap[TC_PRIO_MAX + 1])
 {
-	struct tc_prio_qopt_offload_params *p = params;
 	struct mlxsw_sp_qdisc *child_qdisc;
 	int tclass, i, band, backlog;
+	unsigned int w_psum_prev = 0;
+	unsigned int q_psum = 0;
+	unsigned int w_psum = 0;
+	unsigned int weight = 0;
+	unsigned int q_sum = 0;
+	unsigned int quantum;
 	u8 old_priomap;
 	int err;
 
-	for (band = 0; band < p->bands; band++) {
+	for (band = 0; band < nbands; band++)
+		q_sum += quanta[band];
+
+	for (band = 0; band < nbands; band++) {
 		tclass = MLXSW_SP_PRIO_BAND_TO_TCLASS(band);
 		child_qdisc = &mlxsw_sp_port->tclass_qdiscs[tclass];
 		old_priomap = child_qdisc->prio_bitmap;
 		child_qdisc->prio_bitmap = 0;
+
+		quantum = quanta[band];
+		q_psum += quantum;
+		w_psum = quantum ? q_psum * 100 / q_sum : 0;
+		weight = w_psum - w_psum_prev;
+		w_psum_prev = w_psum;
+		err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
+					    MLXSW_REG_QEEC_HIERARCHY_SUBGROUP,
+					    tclass, 0, !!quantum, weight);
+		if (err)
+			return err;
+
 		for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
-			if (p->priomap[i] == band) {
+			if (priomap[i] == band) {
 				child_qdisc->prio_bitmap |= BIT(i);
 				if (BIT(i) & old_priomap)
 					continue;
@@ -540,21 +576,46 @@ mlxsw_sp_qdisc_prio_replace(struct mlxsw_sp_port *mlxsw_sp_port,
 		child_qdisc = &mlxsw_sp_port->tclass_qdiscs[tclass];
 		child_qdisc->prio_bitmap = 0;
 		mlxsw_sp_qdisc_destroy(mlxsw_sp_port, child_qdisc);
+		mlxsw_sp_port_ets_set(mlxsw_sp_port,
+				      MLXSW_REG_QEEC_HIERARCHY_SUBGROUP,
+				      tclass, 0, false, 0);
 	}
 	return 0;
 }
 
+static int
+mlxsw_sp_qdisc_prio_replace(struct mlxsw_sp_port *mlxsw_sp_port,
+			    struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+			    void *params)
+{
+	struct tc_prio_qopt_offload_params *p = params;
+	unsigned int quanta[TCQ_ETS_MAX_BANDS] = {0};
+
+	return __mlxsw_sp_qdisc_ets_replace(mlxsw_sp_port, p->bands,
+					    quanta, p->priomap);
+}
+
+static void
+__mlxsw_sp_qdisc_ets_unoffload(struct mlxsw_sp_port *mlxsw_sp_port,
+			       struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+			       struct gnet_stats_queue *qstats)
+{
+	u64 backlog;
+
+	backlog = mlxsw_sp_cells_bytes(mlxsw_sp_port->mlxsw_sp,
+				       mlxsw_sp_qdisc->stats_base.backlog);
+	qstats->backlog -= backlog;
+}
+
 static void
 mlxsw_sp_qdisc_prio_unoffload(struct mlxsw_sp_port *mlxsw_sp_port,
 			      struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
 			      void *params)
 {
 	struct tc_prio_qopt_offload_params *p = params;
-	u64 backlog;
 
-	backlog = mlxsw_sp_cells_bytes(mlxsw_sp_port->mlxsw_sp,
-				       mlxsw_sp_qdisc->stats_base.backlog);
-	p->qstats->backlog -= backlog;
+	__mlxsw_sp_qdisc_ets_unoffload(mlxsw_sp_port, mlxsw_sp_qdisc,
+				       p->qstats);
 }
 
 static int
@@ -645,22 +706,22 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_prio = {
  * unoffload the child.
  */
 static int
-mlxsw_sp_qdisc_prio_graft(struct mlxsw_sp_port *mlxsw_sp_port,
-			  struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
-			  struct tc_prio_qopt_offload_graft_params *p)
+__mlxsw_sp_qdisc_ets_graft(struct mlxsw_sp_port *mlxsw_sp_port,
+			   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+			   u8 band, u32 child_handle)
 {
-	int tclass_num = MLXSW_SP_PRIO_BAND_TO_TCLASS(p->band);
+	int tclass_num = MLXSW_SP_PRIO_BAND_TO_TCLASS(band);
 	struct mlxsw_sp_qdisc *old_qdisc;
 
-	if (p->band < IEEE_8021QAZ_MAX_TCS &&
-	    mlxsw_sp_port->tclass_qdiscs[tclass_num].handle == p->child_handle)
+	if (band < IEEE_8021QAZ_MAX_TCS &&
+	    mlxsw_sp_port->tclass_qdiscs[tclass_num].handle == child_handle)
 		return 0;
 
 	/* See if the grafted qdisc is already offloaded on any tclass. If so,
 	 * unoffload it.
 	 */
 	old_qdisc = mlxsw_sp_qdisc_find_by_handle(mlxsw_sp_port,
-						  p->child_handle);
+						  child_handle);
 	if (old_qdisc)
 		mlxsw_sp_qdisc_destroy(mlxsw_sp_port, old_qdisc);
 
@@ -695,8 +756,9 @@ int mlxsw_sp_setup_tc_prio(struct mlxsw_sp_port *mlxsw_sp_port,
 		return mlxsw_sp_qdisc_get_stats(mlxsw_sp_port, mlxsw_sp_qdisc,
 						&p->stats);
 	case TC_PRIO_GRAFT:
-		return mlxsw_sp_qdisc_prio_graft(mlxsw_sp_port, mlxsw_sp_qdisc,
-						 &p->graft_params);
+		return __mlxsw_sp_qdisc_ets_graft(mlxsw_sp_port, mlxsw_sp_qdisc,
+						  p->graft_params.band,
+						  p->graft_params.child_handle);
 	default:
 		return -EOPNOTSUPP;
 	}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 07/10] mlxsw: spectrum_qdisc: Support offloading of ETS Qdisc
  2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (5 preceding siblings ...)
  2019-11-20 13:05 ` [RFC PATCH 06/10] mlxsw: spectrum_qdisc: Generalize PRIO offload to support ETS Petr Machata
@ 2019-11-20 13:05 ` Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 08/10] selftests: forwarding: Move start_/stop_traffic from mlxsw to lib.sh Petr Machata
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Petr Machata @ 2019-11-20 13:05 UTC (permalink / raw)
  To: netdev; +Cc: Petr Machata, Ido Schimmel, Roopa Prabhu

Handle TC_SETUP_QDISC_ETS, add a new ops structure for the ETS Qdisc.
Invoke the extended prio handlers implemented in the previous patch. For
stats ops, invoke directly the prio callbacks, which are not sensitive to
differences between PRIO and ETS.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 .../net/ethernet/mellanox/mlxsw/spectrum.c    |   2 +
 .../net/ethernet/mellanox/mlxsw/spectrum.h    |   2 +
 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 108 +++++++++++++++---
 3 files changed, 95 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index cb24807c119d..48f971434eab 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1796,6 +1796,8 @@ static int mlxsw_sp_setup_tc(struct net_device *dev, enum tc_setup_type type,
 		return mlxsw_sp_setup_tc_red(mlxsw_sp_port, type_data);
 	case TC_SETUP_QDISC_PRIO:
 		return mlxsw_sp_setup_tc_prio(mlxsw_sp_port, type_data);
+	case TC_SETUP_QDISC_ETS:
+		return mlxsw_sp_setup_tc_ets(mlxsw_sp_port, type_data);
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index 347bec9d1ecf..948ef4720d40 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -852,6 +852,8 @@ int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port,
 			  struct tc_red_qopt_offload *p);
 int mlxsw_sp_setup_tc_prio(struct mlxsw_sp_port *mlxsw_sp_port,
 			   struct tc_prio_qopt_offload *p);
+int mlxsw_sp_setup_tc_ets(struct mlxsw_sp_port *mlxsw_sp_port,
+			  struct tc_ets_qopt_offload *p);
 
 /* spectrum_fid.c */
 bool mlxsw_sp_fid_is_dummy(struct mlxsw_sp *mlxsw_sp, u16 fid_index);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index 4cd2020bcbe0..a25a94da4995 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -18,6 +18,7 @@ enum mlxsw_sp_qdisc_type {
 	MLXSW_SP_QDISC_NO_QDISC,
 	MLXSW_SP_QDISC_RED,
 	MLXSW_SP_QDISC_PRIO,
+	MLXSW_SP_QDISC_ETS,
 };
 
 struct mlxsw_sp_qdisc_ops {
@@ -519,36 +520,24 @@ static int
 __mlxsw_sp_qdisc_ets_replace(struct mlxsw_sp_port *mlxsw_sp_port,
 			     unsigned int nbands,
 			     unsigned int quanta[TCQ_ETS_MAX_BANDS],
+			     unsigned int weights[TCQ_ETS_MAX_BANDS],
 			     u8 priomap[TC_PRIO_MAX + 1])
 {
 	struct mlxsw_sp_qdisc *child_qdisc;
 	int tclass, i, band, backlog;
-	unsigned int w_psum_prev = 0;
-	unsigned int q_psum = 0;
-	unsigned int w_psum = 0;
-	unsigned int weight = 0;
-	unsigned int q_sum = 0;
-	unsigned int quantum;
 	u8 old_priomap;
 	int err;
 
-	for (band = 0; band < nbands; band++)
-		q_sum += quanta[band];
-
 	for (band = 0; band < nbands; band++) {
 		tclass = MLXSW_SP_PRIO_BAND_TO_TCLASS(band);
 		child_qdisc = &mlxsw_sp_port->tclass_qdiscs[tclass];
 		old_priomap = child_qdisc->prio_bitmap;
 		child_qdisc->prio_bitmap = 0;
 
-		quantum = quanta[band];
-		q_psum += quantum;
-		w_psum = quantum ? q_psum * 100 / q_sum : 0;
-		weight = w_psum - w_psum_prev;
-		w_psum_prev = w_psum;
 		err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
 					    MLXSW_REG_QEEC_HIERARCHY_SUBGROUP,
-					    tclass, 0, !!quantum, weight);
+					    tclass, 0, !!quanta[band],
+					    weights[band]);
 		if (err)
 			return err;
 
@@ -589,10 +578,10 @@ mlxsw_sp_qdisc_prio_replace(struct mlxsw_sp_port *mlxsw_sp_port,
 			    void *params)
 {
 	struct tc_prio_qopt_offload_params *p = params;
-	unsigned int quanta[TCQ_ETS_MAX_BANDS] = {0};
+	unsigned int zeroes[TCQ_ETS_MAX_BANDS] = {0};
 
 	return __mlxsw_sp_qdisc_ets_replace(mlxsw_sp_port, p->bands,
-					    quanta, p->priomap);
+					    zeroes, zeroes, p->priomap);
 }
 
 static void
@@ -692,6 +681,55 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_prio = {
 	.clean_stats = mlxsw_sp_setup_tc_qdisc_prio_clean_stats,
 };
 
+static int
+mlxsw_sp_qdisc_ets_check_params(struct mlxsw_sp_port *mlxsw_sp_port,
+				struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+				void *params)
+{
+	struct tc_ets_qopt_offload_replace_params *p = params;
+
+	return __mlxsw_sp_qdisc_ets_check_params(p->bands);
+}
+
+static int
+mlxsw_sp_qdisc_ets_replace(struct mlxsw_sp_port *mlxsw_sp_port,
+			   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+			   void *params)
+{
+	struct tc_ets_qopt_offload_replace_params *p = params;
+
+	return __mlxsw_sp_qdisc_ets_replace(mlxsw_sp_port, p->bands,
+					    p->quanta, p->weights, p->priomap);
+}
+
+static void
+mlxsw_sp_qdisc_ets_unoffload(struct mlxsw_sp_port *mlxsw_sp_port,
+			     struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+			     void *params)
+{
+	struct tc_ets_qopt_offload_replace_params *p = params;
+
+	__mlxsw_sp_qdisc_ets_unoffload(mlxsw_sp_port, mlxsw_sp_qdisc,
+				       p->qstats);
+}
+
+static int
+mlxsw_sp_qdisc_ets_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
+			   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
+{
+	return __mlxsw_sp_qdisc_ets_destroy(mlxsw_sp_port);
+}
+
+static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_ets = {
+	.type = MLXSW_SP_QDISC_ETS,
+	.check_params = mlxsw_sp_qdisc_ets_check_params,
+	.replace = mlxsw_sp_qdisc_ets_replace,
+	.unoffload = mlxsw_sp_qdisc_ets_unoffload,
+	.destroy = mlxsw_sp_qdisc_ets_destroy,
+	.get_stats = mlxsw_sp_qdisc_get_prio_stats,
+	.clean_stats = mlxsw_sp_setup_tc_qdisc_prio_clean_stats,
+};
+
 /* Linux allows linking of Qdiscs to arbitrary classes (so long as the resulting
  * graph is free of cycles). These operations do not change the parent handle
  * though, which means it can be incomplete (if there is more than one class
@@ -764,6 +802,42 @@ int mlxsw_sp_setup_tc_prio(struct mlxsw_sp_port *mlxsw_sp_port,
 	}
 }
 
+int mlxsw_sp_setup_tc_ets(struct mlxsw_sp_port *mlxsw_sp_port,
+			  struct tc_ets_qopt_offload *p)
+{
+	struct mlxsw_sp_qdisc *mlxsw_sp_qdisc;
+
+	mlxsw_sp_qdisc = mlxsw_sp_qdisc_find(mlxsw_sp_port, p->parent, true);
+	if (!mlxsw_sp_qdisc)
+		return -EOPNOTSUPP;
+
+	if (p->command == TC_ETS_REPLACE)
+		return mlxsw_sp_qdisc_replace(mlxsw_sp_port, p->handle,
+					      mlxsw_sp_qdisc,
+					      &mlxsw_sp_qdisc_ops_ets,
+					      &p->replace_params);
+
+	if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle,
+				    MLXSW_SP_QDISC_ETS))
+		return -EOPNOTSUPP;
+
+	switch (p->command) {
+	case TC_ETS_DESTROY:
+		return mlxsw_sp_qdisc_destroy(mlxsw_sp_port, mlxsw_sp_qdisc);
+	case TC_ETS_STATS:
+		return mlxsw_sp_qdisc_get_stats(mlxsw_sp_port, mlxsw_sp_qdisc,
+						&p->stats);
+	case TC_ETS_GRAFT:
+		return __mlxsw_sp_qdisc_ets_graft(mlxsw_sp_port, mlxsw_sp_qdisc,
+						  p->graft_params.band,
+						  p->graft_params.child_handle);
+	case TC_ETS_REPLACE: /* Covered above. */
+		break;
+	}
+
+	return -EOPNOTSUPP;
+}
+
 int mlxsw_sp_tc_qdisc_init(struct mlxsw_sp_port *mlxsw_sp_port)
 {
 	struct mlxsw_sp_qdisc *mlxsw_sp_qdisc;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 08/10] selftests: forwarding: Move start_/stop_traffic from mlxsw to lib.sh
  2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (6 preceding siblings ...)
  2019-11-20 13:05 ` [RFC PATCH 07/10] mlxsw: spectrum_qdisc: Support offloading of ETS Qdisc Petr Machata
@ 2019-11-20 13:05 ` Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 09/10] selftests: forwarding: sch_ets: Add test coverage for ETS Qdisc Petr Machata
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Petr Machata @ 2019-11-20 13:05 UTC (permalink / raw)
  To: netdev; +Cc: Petr Machata, Ido Schimmel, Roopa Prabhu

These two functions are used for starting several streams of traffic, and
then stopping them later. They will be handy for the test coverage of ETS
Qdisc. Move them from mlxsw-specific qos_lib.sh to the generic lib.sh.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 .../selftests/drivers/net/mlxsw/qos_lib.sh     | 18 ------------------
 tools/testing/selftests/net/forwarding/lib.sh  | 18 ++++++++++++++++++
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh b/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh
index e80be65799ad..75a3fb3b5663 100644
--- a/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh
@@ -24,24 +24,6 @@ rate()
 	echo $((8 * (t1 - t0) / interval))
 }
 
-start_traffic()
-{
-	local h_in=$1; shift    # Where the traffic egresses the host
-	local sip=$1; shift
-	local dip=$1; shift
-	local dmac=$1; shift
-
-	$MZ $h_in -p 8000 -A $sip -B $dip -c 0 \
-		-a own -b $dmac -t udp -q &
-	sleep 1
-}
-
-stop_traffic()
-{
-	# Suppress noise from killing mausezahn.
-	{ kill %% && wait %%; } 2>/dev/null
-}
-
 check_rate()
 {
 	local rate=$1; shift
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 2f8cfea070c5..8ff242fa5f0d 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -1065,3 +1065,21 @@ flood_test()
 	flood_unicast_test $br_port $host1_if $host2_if
 	flood_multicast_test $br_port $host1_if $host2_if
 }
+
+start_traffic()
+{
+	local h_in=$1; shift    # Where the traffic egresses the host
+	local sip=$1; shift
+	local dip=$1; shift
+	local dmac=$1; shift
+
+	$MZ $h_in -p 8000 -A $sip -B $dip -c 0 \
+		-a own -b $dmac -t udp -q &
+	sleep 1
+}
+
+stop_traffic()
+{
+	# Suppress noise from killing mausezahn.
+	{ kill %% && wait %%; } 2>/dev/null
+}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 09/10] selftests: forwarding: sch_ets: Add test coverage for ETS Qdisc
  2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (7 preceding siblings ...)
  2019-11-20 13:05 ` [RFC PATCH 08/10] selftests: forwarding: Move start_/stop_traffic from mlxsw to lib.sh Petr Machata
@ 2019-11-20 13:05 ` Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 10/10] selftests: qdiscs: " Petr Machata
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Petr Machata @ 2019-11-20 13:05 UTC (permalink / raw)
  To: netdev; +Cc: Petr Machata, Ido Schimmel, Roopa Prabhu

This tests the newly-added ETS Qdisc. It runs two to three streams of
traffic, each with a different priority. ETS Qdisc is supposed to allocate
bandwidth according to the DRR algorithm and given weights. After running
the traffic for a while, counters are compared for each stream to check
that the expected ratio is in fact observed.

In order for the DRR process to kick in, a traffic bottleneck must exist in
the first place. In slow path, such bottleneck can be implemented by
wrapping the ETS Qdisc inside a TBF or other shaper. This might however
make the configuration unoffloadable. Instead, on HW datapath, the
bottleneck would be set up by lowering port speed and configuring shared
buffer suitably.

Therefore the test is structured as a core component that implements the
testing, with two wrapper scripts that implement the details of slow path
resp. fast path configuration.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 .../selftests/drivers/net/mlxsw/qos_lib.sh    |  28 +++
 .../selftests/drivers/net/mlxsw/sch_ets.sh    |  54 ++++
 .../selftests/net/forwarding/sch_ets.sh       |  30 +++
 .../selftests/net/forwarding/sch_ets_core.sh  | 229 +++++++++++++++++
 .../selftests/net/forwarding/sch_ets_tests.sh | 230 ++++++++++++++++++
 5 files changed, 571 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/mlxsw/sch_ets.sh
 create mode 100755 tools/testing/selftests/net/forwarding/sch_ets.sh
 create mode 100644 tools/testing/selftests/net/forwarding/sch_ets_core.sh
 create mode 100644 tools/testing/selftests/net/forwarding/sch_ets_tests.sh

diff --git a/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh b/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh
index 75a3fb3b5663..a5937069ac16 100644
--- a/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh
@@ -78,3 +78,31 @@ measure_rate()
 	echo $ir $er
 	return $ret
 }
+
+bail_on_lldpad()
+{
+	if systemctl is-active --quiet lldpad; then
+
+		cat >/dev/stderr <<-EOF
+		WARNING: lldpad is running
+
+			lldpad will likely configure DCB, and this test will
+			configure Qdiscs. mlxsw does not support both at the
+			same time, one of them is arbitrarily going to overwrite
+			the other. That will cause spurious failures (or,
+			unlikely, passes) of this test.
+		EOF
+
+		if [[ -z $ALLOW_LLDPAD ]]; then
+			cat >/dev/stderr <<-EOF
+
+				If you want to run the test anyway, please set
+				an environment variable ALLOW_LLDPAD to a
+				non-empty string.
+			EOF
+			exit 1
+		else
+			return
+		fi
+	fi
+}
diff --git a/tools/testing/selftests/drivers/net/mlxsw/sch_ets.sh b/tools/testing/selftests/drivers/net/mlxsw/sch_ets.sh
new file mode 100755
index 000000000000..1d70d89d2b1a
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/mlxsw/sch_ets.sh
@@ -0,0 +1,54 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# A driver for the ETS selftest that implements testing in offloaded datapath.
+lib_dir=$(dirname $0)/../../../net/forwarding
+source $lib_dir/sch_ets_core.sh
+source $lib_dir/devlink_lib.sh
+source qos_lib.sh
+
+switch_create()
+{
+	ets_switch_create
+
+	# Create a bottleneck so that the DWRR process can kick in.
+	ethtool -s $h2 speed 1000 autoneg off
+	ethtool -s $swp2 speed 1000 autoneg off
+
+	# Set the ingress quota high and use the three egress TCs to limit the
+	# amount of traffic that is admitted to the shared buffers. This makes
+	# sure that there is always enough traffic of all types to select from
+	# for the DWRR process.
+
+	devlink_port_pool_th_set $swp1 0 12
+	devlink_tc_bind_pool_th_set $swp1 0 ingress 0 12
+	devlink_port_pool_th_set $swp2 4 12
+	devlink_tc_bind_pool_th_set $swp2 7 egress 4 5
+	devlink_tc_bind_pool_th_set $swp2 6 egress 4 5
+	devlink_tc_bind_pool_th_set $swp2 5 egress 4 5
+}
+
+switch_destroy()
+{
+	devlink_tc_bind_pool_th_restore $swp2 5 egress
+	devlink_tc_bind_pool_th_restore $swp2 6 egress
+	devlink_tc_bind_pool_th_restore $swp2 7 egress
+	devlink_port_pool_th_restore $swp2 4
+	devlink_tc_bind_pool_th_restore $swp1 0 ingress
+	devlink_port_pool_th_restore $swp1 0
+
+	ethtool -s $swp2 autoneg on
+	ethtool -s $h2 autoneg on
+
+	ets_switch_destroy
+}
+
+get_stats()
+{
+	local band=$1; shift
+
+	ethtool_stats_get "$h2" rx_octets_prio_$band
+}
+
+bail_on_lldpad
+ets_run
diff --git a/tools/testing/selftests/net/forwarding/sch_ets.sh b/tools/testing/selftests/net/forwarding/sch_ets.sh
new file mode 100755
index 000000000000..25844854571e
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/sch_ets.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# A driver for the ETS selftest that implements testing in slowpath.
+lib_dir=.
+source sch_ets_core.sh
+
+switch_create()
+{
+	ets_switch_create
+
+	# Create a bottleneck so that the DWRR process can kick in.
+	tc qdisc add dev $swp2 root handle 1: tbf \
+	   rate 1Gbit burst 1Mbit latency 100ms
+	PARENT="parent 1:"
+}
+
+switch_destroy()
+{
+	ets_switch_destroy
+}
+
+get_stats()
+{
+	local stream=$1; shift
+
+	link_stats_get $h2.1$stream rx bytes
+}
+
+ets_run
diff --git a/tools/testing/selftests/net/forwarding/sch_ets_core.sh b/tools/testing/selftests/net/forwarding/sch_ets_core.sh
new file mode 100644
index 000000000000..a99dab5131bd
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/sch_ets_core.sh
@@ -0,0 +1,229 @@
+# SPDX-License-Identifier: GPL-2.0
+
+# This is a template for ETS Qdisc test.
+#
+# This test sends from H1 several traffic streams with 802.1p-tagged packets.
+# The tags are used at $swp1 to prioritize the traffic. Each stream is then
+# queued at a different ETS band according to the assigned priority. After
+# runnig for a while, counters at H2 are consulted to determine whether the
+# traffic scheduling was according to the ETS configuration.
+#
+# This template is supposed to be embedded by a test driver, which implements
+# statistics collection, any HW-specific stuff, and prominently configures the
+# system to assure that there is overcommitment at $swp2. That is necessary so
+# that the ETS traffic selection algorithm kicks in and has to schedule some
+# traffic at the expense of other.
+#
+# A driver for veth-based testing is in sch_ets.sh, an example of a driver for
+# an offloaded data path is in selftests/drivers/net/mlxsw/sch_ets.sh.
+#
+# +---------------------------------------------------------------------+
+# | H1                                                                  |
+# |     + $h1.10              + $h1.11              + $h1.12            |
+# |     | 192.0.2.1/28        | 192.0.2.17/28       | 192.0.2.33/28     |
+# |     | egress-qos-map      | egress-qos-map      | egress-qos-map    |
+# |     |  0:0                |  0:1                |  0:2              |
+# |     \____________________ | ____________________/                   |
+# |                          \|/                                        |
+# |                           + $h1                                     |
+# +---------------------------|-----------------------------------------+
+#                             |
+# +---------------------------|-----------------------------------------+
+# | SW                        + $swp1                                   |
+# |                           | >1Gbps                                  |
+# |      ____________________/|\____________________                    |
+# |     /                     |                     \                   |
+# |  +--|----------------+ +--|----------------+ +--|----------------+  |
+# |  |  + $swp1.10       | |  + $swp1.11       | |  + $swp1.12       |  |
+# |  |    ingress-qos-map| |    ingress-qos-map| |    ingress-qos-map|  |
+# |  |     0:0 1:1 2:2   | |     0:0 1:1 2:2   | |     0:0 1:1 2:2   |  |
+# |  |                   | |                   | |                   |  |
+# |  |    BR10           | |    BR11           | |    BR12           |  |
+# |  |                   | |                   | |                   |  |
+# |  |  + $swp2.10       | |  + $swp2.11       | |  + $swp2.12       |  |
+# |  +--|----------------+ +--|----------------+ +--|----------------+  |
+# |     \____________________ | ____________________/                   |
+# |                          \|/                                        |
+# |                           + $swp2                                   |
+# |                           | 1Gbps (ethtool or HTB qdisc)            |
+# |                           | qdisc ets quanta $W0 $W1 $W2            |
+# |                           |           priomap 0 1 2                 |
+# +---------------------------|-----------------------------------------+
+#                             |
+# +---------------------------|-----------------------------------------+
+# | H2                        + $h2                                     |
+# |      ____________________/|\____________________                    |
+# |     /                     |                     \                   |
+# |     + $h2.10              + $h2.11              + $h2.12            |
+# |       192.0.2.2/28          192.0.2.18/28         192.0.2.34/28     |
+# +---------------------------------------------------------------------+
+
+NUM_NETIFS=4
+CHECK_TC="yes"
+source $lib_dir/lib.sh
+source $lib_dir/sch_ets_tests.sh
+
+ALL_TESTS="
+	ping_ipv4
+	$ALL_TESTS_STRICT
+	$ALL_TESTS_MIXED
+	$ALL_TESTS_DWRR
+"
+
+PARENT=root
+
+sip()
+{
+	echo 192.0.2.$((16 * $1 + 1))
+}
+
+dip()
+{
+	echo 192.0.2.$((16 * $1 + 2))
+}
+
+ets_start_traffic()
+{
+	local dst_mac=$(mac_get $h2)
+	local i=$1; shift
+
+	start_traffic $h1.1$i $(sip $i) $(dip $i) $dst_mac
+}
+
+h1_create()
+{
+	local i;
+
+	simple_if_init $h1
+	mtu_set $h1 9900
+	for i in {0..2}; do
+		vlan_create $h1 1$i v$h1 $(sip $i)/28
+		ip link set dev $h1.1$i type vlan egress 0:$i
+	done
+}
+
+h1_destroy()
+{
+	local i
+
+	for i in {0..2}; do
+		vlan_destroy $h1 1$i
+	done
+	mtu_restore $h1
+	simple_if_fini $h1
+}
+
+h2_create()
+{
+	local i
+
+	simple_if_init $h2
+	mtu_set $h2 9900
+	for i in {0..2}; do
+		vlan_create $h2 1$i v$h2 $(dip $i)/28
+	done
+}
+
+h2_destroy()
+{
+	local i
+
+	for i in {0..2}; do
+		vlan_destroy $h2 1$i
+	done
+	mtu_restore $h2
+	simple_if_fini $h2
+}
+
+ets_switch_create()
+{
+	local i
+
+	ip link set dev $swp1 up
+	mtu_set $swp1 9900
+
+	ip link set dev $swp2 up
+	mtu_set $swp2 9900
+
+	for i in {0..2}; do
+		vlan_create $swp1 1$i
+		ip link set dev $swp1.1$i type vlan ingress 0:0 1:1 2:2
+
+		vlan_create $swp2 1$i
+
+		ip link add dev br1$i type bridge
+		ip link set dev $swp1.1$i master br1$i
+		ip link set dev $swp2.1$i master br1$i
+
+		ip link set dev br1$i up
+		ip link set dev $swp1.1$i up
+		ip link set dev $swp2.1$i up
+	done
+}
+
+ets_switch_destroy()
+{
+	local i
+
+	tc qdisc del dev $swp2 root
+
+	for i in {0..2}; do
+		ip link del dev br1$i
+		vlan_destroy $swp2 1$i
+		vlan_destroy $swp1 1$i
+	done
+
+	mtu_restore $swp2
+	ip link set dev $swp2 down
+
+	mtu_restore $swp1
+	ip link set dev $swp1 down
+}
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	put=$swp2
+	hut=$h2
+
+	vrf_prepare
+
+	h1_create
+	h2_create
+	switch_create
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	switch_destroy
+	h2_destroy
+	h1_destroy
+
+	vrf_cleanup
+}
+
+ping_ipv4()
+{
+	ping_test $h1.10 $(dip 0) " vlan 10"
+	ping_test $h1.11 $(dip 1) " vlan 11"
+	ping_test $h1.12 $(dip 2) " vlan 12"
+}
+
+ets_run()
+{
+	trap cleanup EXIT
+
+	setup_prepare
+	setup_wait
+
+	tests_run
+
+	exit $EXIT_STATUS
+}
diff --git a/tools/testing/selftests/net/forwarding/sch_ets_tests.sh b/tools/testing/selftests/net/forwarding/sch_ets_tests.sh
new file mode 100644
index 000000000000..4d666d4e1721
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/sch_ets_tests.sh
@@ -0,0 +1,230 @@
+# SPDX-License-Identifier: GPL-2.0
+
+# Global interface:
+#  $put -- port under test (e.g. $swp2)
+#  get_stats($band) -- A function to collect stats for band
+#  ets_start_traffic($band) -- Start traffic for this band
+
+ALL_TESTS_STRICT="
+	ets_set_strict
+	ets_dwrr_test_01
+	ets_dwrr_test_12
+"
+
+ALL_TESTS_MIXED="
+	ets_set_mixed
+	ets_dwrr_test_01
+	ets_dwrr_test_12
+"
+
+ALL_TESTS_DWRR="
+	ets_set_dwrr_uniform
+	ets_dwrr_test_012
+	ets_set_dwrr_varying
+	ets_dwrr_test_012
+	ets_change_quantum
+	ets_dwrr_test_012
+	ets_set_dwrr_two_bands
+	ets_dwrr_test_01
+"
+
+QDISC_DEV=
+
+# WS describes the Qdisc configuration. It has one value per band (so the
+# number of array elements indicates the number of bands). If the value is
+# 0, it is a strict band, otherwise the it's a DRR band and the value is
+# that band's quantum.
+declare -a WS
+
+qdisc_describe()
+{
+	local nbands=${#WS[@]}
+	local nstrict=0
+	local i
+
+	for ((i = 0; i < nbands; i++)); do
+		if ((!${WS[$i]})); then
+			: $((nstrict++))
+		fi
+	done
+
+	echo -n "ets bands $nbands"
+	if ((nstrict)); then
+		echo -n " strict $nstrict"
+	fi
+	if ((nstrict < nbands)); then
+		echo -n " quanta"
+		for ((i = nstrict; i < nbands; i++)); do
+			echo -n " ${WS[$i]}"
+		done
+	fi
+}
+
+__strict_eval()
+{
+	local desc=$1; shift
+	local d=$1; shift
+	local total=$1; shift
+	local above=$1; shift
+
+	RET=0
+
+	if ((! total)); then
+		check_err 1 "No traffic observed"
+		log_test "$desc"
+		return
+	fi
+
+	local ratio=$(echo "scale=2; 100 * $d / $total" | bc -l)
+	if ((above)); then
+		test $(echo "$ratio > 95.0" | bc -l) -eq 1
+		check_err $? "Not enough traffic"
+		log_test "$desc"
+		log_info "Expected ratio >95% Measured ratio $ratio"
+	else
+		test $(echo "$ratio < 5" | bc -l) -eq 1
+		check_err $? "Too much traffic"
+		log_test "$desc"
+		log_info "Expected ratio <5% Measured ratio $ratio"
+	fi
+}
+
+strict_eval()
+{
+	__strict_eval "$@" 1
+}
+
+notraf_eval()
+{
+	__strict_eval "$@" 0
+}
+
+__ets_dwrr_test()
+{
+	local -a streams=("$@")
+
+	local low_stream=${streams[0]}
+	local seen_strict=0
+	local -a t0 t1 d
+	local stream
+	local total
+	local i
+
+	echo "Testing $(qdisc_describe), streams ${streams[@]}"
+
+	for stream in ${streams[@]}; do
+		ets_start_traffic $stream
+	done
+
+	sleep 10
+
+	t0=($(for stream in ${streams[@]}; do
+		  get_stats $stream
+	      done))
+
+	sleep 10
+
+	t1=($(for stream in ${streams[@]}; do
+		  get_stats $stream
+	      done))
+	d=($(for ((i = 0; i < ${#streams[@]}; i++)); do
+		 echo $((${t1[$i]} - ${t0[$i]}))
+	     done))
+	total=$(echo ${d[@]} | sed 's/ /+/g' | bc)
+
+	for ((i = 0; i < ${#streams[@]}; i++)); do
+		local stream=${streams[$i]}
+		if ((seen_strict)); then
+			notraf_eval "band $stream" ${d[$i]} $total
+		elif ((${WS[$stream]} == 0)); then
+			strict_eval "band $stream" ${d[$i]} $total
+			seen_strict=1
+		elif ((stream == low_stream)); then
+			# Low stream is used as DWRR evaluation reference.
+			continue
+		else
+			multipath_eval "bands $low_stream:$stream" \
+				       ${WS[$low_stream]} ${WS[$stream]} \
+				       ${d[0]} ${d[$i]}
+		fi
+	done
+
+	for stream in ${streams[@]}; do
+		stop_traffic
+	done
+}
+
+ets_dwrr_test_012()
+{
+	__ets_dwrr_test 0 1 2
+}
+
+ets_dwrr_test_01()
+{
+	__ets_dwrr_test 0 1
+}
+
+ets_dwrr_test_12()
+{
+	__ets_dwrr_test 1 2
+}
+
+ets_qdisc_setup()
+{
+	local dev=$1; shift
+	local nstrict=$1; shift
+	local -a quanta=("$@")
+
+	local op=$(if [[ -n $QDISC_DEV ]]; then echo change; else echo add; fi)
+	local ndwrr=${#quanta[@]}
+	local nbands=$((nstrict + ndwrr))
+	local nstreams=$(if ((nbands > 3)); then echo 3; else echo $nbands; fi)
+	local priomap=$(seq 0 $((nstreams - 1)))
+	local i
+
+	WS=($(
+		for ((i = 0; i < nstrict; i++)); do
+			echo 0
+		done
+		for ((i = 0; i < ndwrr; i++)); do
+			echo ${quanta[$i]}
+		done
+	))
+
+	tc qdisc $op dev $dev $PARENT handle 10: ets			       \
+		$(if ((nstrict)); then echo strict $nstrict; fi)	       \
+		$(if ((${#quanta[@]})); then echo quanta ${quanta[@]}; fi)     \
+		priomap $priomap
+	QDISC_DEV=$dev
+}
+
+ets_set_dwrr_uniform()
+{
+	ets_qdisc_setup $put 0 3300 3300 3300
+}
+
+ets_set_dwrr_varying()
+{
+	ets_qdisc_setup $put 0 5000 3500 1500
+}
+
+ets_set_strict()
+{
+	ets_qdisc_setup $put 3
+}
+
+ets_set_mixed()
+{
+	ets_qdisc_setup $put 1 5000 2500 1500
+}
+
+ets_change_quantum()
+{
+	tc class change dev $put classid 10:2 ets quantum 8000
+	WS[1]=8000
+}
+
+ets_set_dwrr_two_bands()
+{
+	ets_qdisc_setup $put 0 5000 2500
+}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 10/10] selftests: qdiscs: Add test coverage for ETS Qdisc
  2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (8 preceding siblings ...)
  2019-11-20 13:05 ` [RFC PATCH 09/10] selftests: forwarding: sch_ets: Add test coverage for ETS Qdisc Petr Machata
@ 2019-11-20 13:05 ` Petr Machata
  2019-11-20 15:15   ` Roman Mashak
  2019-11-20 13:05 ` [RFC PATCH 1/3] libnetlink: parse_rtattr_nested should allow NLA_F_NESTED flag Petr Machata
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 19+ messages in thread
From: Petr Machata @ 2019-11-20 13:05 UTC (permalink / raw)
  To: netdev; +Cc: Petr Machata, Ido Schimmel, Roopa Prabhu

Add TDC coverage for the new ETS Qdisc.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 .../tc-testing/tc-tests/qdiscs/ets.json       | 709 ++++++++++++++++++
 1 file changed, 709 insertions(+)
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json

diff --git a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json
new file mode 100644
index 000000000000..6619d6bceb49
--- /dev/null
+++ b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json
@@ -0,0 +1,709 @@
+[
+    {
+        "id": "e90e",
+        "name": "Add ETS qdisc using bands",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 2",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .* bands 2",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "b059",
+        "name": "Add ETS qdisc using quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 1000 900 800 700",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 4 quanta 1000 900 800 700",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "e8e7",
+        "name": "Add ETS qdisc using strict",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 3",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 3 strict 3",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "233c",
+        "name": "Add ETS qdisc using bands + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 4 quanta 1000 900 800 700",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 4 quanta 1000 900 800 700 priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "3d35",
+        "name": "Add ETS qdisc using bands + strict",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 3 strict 3",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 3 strict 3 priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "7f3b",
+        "name": "Add ETS qdisc using strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 3 quanta 1500 750",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 5 strict 3 quanta 1500 750 priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "4593",
+        "name": "Add ETS qdisc using strict 0 + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 0 quanta 1500 750",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 2 quanta 1500 750 priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "8938",
+        "name": "Add ETS qdisc using bands + strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 5 strict 3 quanta 1500 750",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 5 .*strict 3 quanta 1500 750 priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "0782",
+        "name": "Add ETS qdisc with more bands than quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 2 quanta 1000",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 2 .*quanta 1000 [1-9][0-9]* priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "501b",
+        "name": "Add ETS qdisc with more bands than strict",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 3 strict 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 3 strict 1 quanta ([1-9][0-9]* ){2}priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "671a",
+        "name": "Add ETS qdisc with more bands than strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 3 strict 1 quanta 1000",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 3 strict 1 quanta 1000 [1-9][0-9]* priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "c696",
+        "name": "Add ETS qdisc with priomap",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 5 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "30c4",
+        "name": "Add ETS qdisc with quanta + priomap",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 1000 2000 3000 4000 5000 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*quanta 1000 2000 3000 4000 5000 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "e8ac",
+        "name": "Add ETS qdisc with strict + priomap",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 5 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 5 strict 5 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "5a7e",
+        "name": "Add ETS qdisc with quanta + strict + priomap",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 2 quanta 1000 2000 3000 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*strict 2 quanta 1000 2000 3000 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "cb8b",
+        "name": "Show ETS class :1",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 4000 3000 2000",
+        "expExitCode": "0",
+        "verifyCmd": "$TC class show dev $DUMMY classid 1:1",
+        "matchPattern": "class ets 1:1 root quantum 4000",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "1b4e",
+        "name": "Show ETS class :2",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 4000 3000 2000",
+        "expExitCode": "0",
+        "verifyCmd": "$TC class show dev $DUMMY classid 1:2",
+        "matchPattern": "class ets 1:2 root quantum 3000",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "f642",
+        "name": "Show ETS class :3",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 4000 3000 2000",
+        "expExitCode": "0",
+        "verifyCmd": "$TC class show dev $DUMMY classid 1:3",
+        "matchPattern": "class ets 1:3 root quantum 2000",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "0a5f",
+        "name": "Show ETS strict class",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 3",
+        "expExitCode": "0",
+        "verifyCmd": "$TC class show dev $DUMMY classid 1:1",
+        "matchPattern": "class ets 1:1 root $",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "f7c8",
+        "name": "Add ETS qdisc with too many quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 2 quanta 1000 2000 3000",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "2389",
+        "name": "Add ETS qdisc with too many strict",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 2 strict 3",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "fe3c",
+        "name": "Add ETS qdisc with too many strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 4 strict 2 quanta 1000 2000 3000",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "cb04",
+        "name": "Add ETS qdisc with excess priomap elements",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 5 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0 1 2",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "c32e",
+        "name": "Add ETS qdisc with priomap above bands",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 2 priomap 0 1 2",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "744c",
+        "name": "Add ETS qdisc with priomap above quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 1000 500 priomap 0 1 2",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "7b33",
+        "name": "Add ETS qdisc with priomap above strict",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 2 priomap 0 1 2",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "dbe6",
+        "name": "Add ETS qdisc with priomap above strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 1 quanta 1000 500 priomap 0 1 2 3",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "bdb2",
+        "name": "Add ETS qdisc with priomap within bands with strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 4 strict 1 quanta 1000 500 priomap 0 1 2 3",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "1",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "39a3",
+        "name": "Add ETS qdisc with priomap above bands with strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 4 strict 1 quanta 1000 500 priomap 0 1 2 3 4",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "39c4",
+        "name": "Add ETS qdisc with too few bands",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 0",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "930b",
+        "name": "Add ETS qdisc with too many bands",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 17",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "406a",
+        "name": "Add ETS qdisc without parameters",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "28c6",
+        "name": "Change ETS band quantum",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true",
+            "$TC qdisc add dev $DUMMY handle 1: root ets quanta 1000 2000 3000"
+        ],
+        "cmdUnderTest": "$TC class change dev $DUMMY classid 1:1 ets quantum 1500",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*quanta 1500 2000 3000 priomap ",
+        "matchCount": "1",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "4714",
+        "name": "Change ETS band without quantum",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true",
+            "$TC qdisc add dev $DUMMY handle 1: root ets quanta 1000 2000 3000"
+        ],
+        "cmdUnderTest": "$TC class change dev $DUMMY classid 1:1 ets",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*quanta 1000 2000 3000 priomap ",
+        "matchCount": "1",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "6979",
+        "name": "Change quantum of a strict ETS band",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true",
+            "$TC qdisc add dev $DUMMY handle 1: root ets strict 5"
+        ],
+        "cmdUnderTest": "$TC class change dev $DUMMY classid 1:2 ets quantum 1500",
+        "expExitCode": "2",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets .*bands 5 .*strict 5",
+        "matchCount": "1",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "9a7d",
+        "name": "Change ETS strict band without quantum",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true",
+            "$TC qdisc add dev $DUMMY handle 1: root ets strict 5"
+        ],
+        "cmdUnderTest": "$TC class change dev $DUMMY classid 1:2 ets",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets .*bands 5 .*strict 5",
+        "matchCount": "1",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    }
+]
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 1/3] libnetlink: parse_rtattr_nested should allow NLA_F_NESTED flag
  2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (9 preceding siblings ...)
  2019-11-20 13:05 ` [RFC PATCH 10/10] selftests: qdiscs: " Petr Machata
@ 2019-11-20 13:05 ` Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 2/3] uapi: Update for the ETS Qdisc Petr Machata
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Petr Machata @ 2019-11-20 13:05 UTC (permalink / raw)
  To: netdev; +Cc: Petr Machata, Ido Schimmel, Roopa Prabhu

In kernel commit 8cb081746c03 ("netlink: make validation more configurable
for future strictness"), Linux started implicitly flagging nests with
NLA_F_NESTED, unless the nest is created with nla_nest_start_noflag(). Have
libnetlink catch up by admitting the flag in the attribute.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 include/libnetlink.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/libnetlink.h b/include/libnetlink.h
index 8ebdc6d3..e27516f7 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -173,7 +173,8 @@ int rta_nest_end(struct rtattr *rta, struct rtattr *nest);
 				    RTA_ALIGN((rta)->rta_len)))
 
 #define parse_rtattr_nested(tb, max, rta) \
-	(parse_rtattr((tb), (max), RTA_DATA(rta), RTA_PAYLOAD(rta)))
+	(parse_rtattr_flags((tb), (max), RTA_DATA(rta), RTA_PAYLOAD(rta), \
+			    NLA_F_NESTED))
 
 #define parse_rtattr_one_nested(type, rta) \
 	(parse_rtattr_one(type, RTA_DATA(rta), RTA_PAYLOAD(rta)))
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 2/3] uapi: Update for the ETS Qdisc
  2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (10 preceding siblings ...)
  2019-11-20 13:05 ` [RFC PATCH 1/3] libnetlink: parse_rtattr_nested should allow NLA_F_NESTED flag Petr Machata
@ 2019-11-20 13:05 ` Petr Machata
  2019-11-20 13:05 ` [RFC PATCH 3/3] tc: Add support for " Petr Machata
  2019-11-20 23:25 ` [RFC PATCH 00/10] Add a new Qdisc, ETS Jakub Kicinski
  13 siblings, 0 replies; 19+ messages in thread
From: Petr Machata @ 2019-11-20 13:05 UTC (permalink / raw)
  To: netdev; +Cc: Petr Machata, Ido Schimmel, Roopa Prabhu

Update uAPI headers with the defines for the new ETS Qdisc.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 include/uapi/linux/pkt_sched.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 5011259b..c570dfe8 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1181,4 +1181,33 @@ enum {
 
 #define TCA_TAPRIO_ATTR_MAX (__TCA_TAPRIO_ATTR_MAX - 1)
 
+/* ETS */
+
+#define ETS_MAX_BANDS 16
+
+enum {
+	TCA_ETS_UNSPEC,
+	TCA_ETS_BANDS,	/* u8 */
+	TCA_ETS_STRICT,	/* u8 */
+	TCA_ETS_QUANTA,	/* nested TCA_ETS_BAND_QUANTUM */
+	TCA_ETS_PRIOMAP,	/* nested TCA_ETS_PMAP_BAND */
+	__TCA_ETS_MAX,
+};
+
+#define TCA_ETS_MAX (__TCA_ETS_MAX - 1)
+
+enum {
+	TCA_ETS_BAND_UNSPEC,
+	TCA_ETS_BAND_QUANTUM,	/* u32 */
+	__TCA_ETS_BAND_MAX,
+};
+#define TCA_ETS_BAND_MAX (__TCA_ETS_BAND_MAX - 1)
+
+enum {
+	TCA_ETS_PMAP_UNSPEC,
+	TCA_ETS_PMAP_BAND,	/* u8 */
+	__TCA_ETS_PMAP_MAX,
+};
+#define TCA_ETS_PMAP_MAX (__TCA_ETS_PMAP_MAX - 1)
+
 #endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 3/3] tc: Add support for ETS Qdisc
  2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (11 preceding siblings ...)
  2019-11-20 13:05 ` [RFC PATCH 2/3] uapi: Update for the ETS Qdisc Petr Machata
@ 2019-11-20 13:05 ` Petr Machata
  2019-11-20 23:25 ` [RFC PATCH 00/10] Add a new Qdisc, ETS Jakub Kicinski
  13 siblings, 0 replies; 19+ messages in thread
From: Petr Machata @ 2019-11-20 13:05 UTC (permalink / raw)
  To: netdev; +Cc: Petr Machata, Ido Schimmel, Roopa Prabhu

Add a new module to generate and parse options specific to the ETS Qdisc.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 man/man8/tc-ets.8 | 165 ++++++++++++++++++++++
 man/man8/tc.8     |   1 +
 tc/Makefile       |   1 +
 tc/q_ets.c        | 338 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 505 insertions(+)
 create mode 100644 man/man8/tc-ets.8
 create mode 100644 tc/q_ets.c

diff --git a/man/man8/tc-ets.8 b/man/man8/tc-ets.8
new file mode 100644
index 00000000..fe824106
--- /dev/null
+++ b/man/man8/tc-ets.8
@@ -0,0 +1,165 @@
+.TH TC 8 "October 2019" "iproute2" "Linux"
+.SH NAME
+ETS \- Enhanced Transmission Selection scheduler
+.SH SYNOPSIS
+.B tc qdisc ... ets [ bands
+number
+.B ] [ strict
+number
+.B ] [ quanta
+bytes bytes bytes...
+.B ] [ priomap
+band band band...
+.B ]
+
+.B tc class ... ets [ quantum
+bytes
+.B ]
+
+.SH DESCRIPTION
+
+The Enhanced Transmission Selection scheduler is a classful queuing
+discipline that merges the functionality of the PRIO and DRR qdiscs in one
+scheduler. ETS makes it easy to configure a set of strict and
+bandwidth-sharing bands to implement the transmission selection described
+in 802.1Qaz.
+
+On creation with 'tc qdisc add', a fixed number of bands is created. Each
+band is a class, although it is not possible to directly add and remove
+bands with 'tc class' commands. The number of bands to be created must
+instead be specified on the command line as the qdisc is added.
+
+The minor number of classid to use when referring to a band is the band
+number increased by one. Thus band 0 will have classid of major:1, band 1
+that of major:2, etc.
+
+ETS bands are of two types: some number may be in strict mode, the
+remaining ones are in bandwidth-sharing mode.
+
+.SH ALGORITHM
+When dequeueing, strict bands are tried first, if there are any. Band 0 is
+tried first. If it did not deliver a packet, band 1 is tried next, and so
+on until one of the bands delivers a packet, or the strict bands are
+exhausted.
+
+If no packet has been dequeued from any of the strict bands, if there are
+any bandwidth-sharing bands, the dequeing proceeds according to the DRR
+algorithm. Each bandwidth-sharing band is assigned a deficit counter,
+initialized to quantum assigned by a
+.B quanta
+element. ETS maintains an (internal) ''active'' list of bandwidth-sharing
+bands whose qdiscs are non-empty. This list is used for dequeuing. A packet
+is dequeued from the band at the head of the list if the packet size is
+smaller or equal to the deficit counter. If the counter is too small, it is
+increased by
+.B quantum
+and the scheduler moves on to the next band in the active list.
+
+Only qdiscs that own their queue should be added below the
+bandwidth-sharing bands. Attaching to them non-work-conserving qdiscs like
+TBF does not make sense \-\- other qdiscs in the active list will also
+become inactive until the dequeue operation succeeds. This limitation does
+not exist with the strict bands.
+
+.SH CLASSIFICATION
+The ETS qdisc allows three ways to decide which band to enqueue a packet
+to:
+
+- Packet priority can be directly set to a class handle, in which case that
+  is the queue where the packet will be put. For example, band number 2 of
+  a qdisc with handle of 11: will have classid 11:3. To mark a packet for
+  queuing to this band, the packet priority should be set to 0x110003.
+
+- A tc filter attached to the qdisc can put the packet to a band by using
+  the \fBflowid\fR keyword.
+
+- As a last resort, the ETS qdisc consults its priomap (see below), which
+  maps packets to bands based on packet priority.
+
+.SH PARAMETERS
+.TP
+strict
+The number of bands that should be created in strict mode. If not given,
+this value is 0.
+
+.TP
+quanta
+Each bandwidth-sharing band needs to know its quantum, which is the amount
+of bytes a flow is allowed to dequeue before the scheduler moves to the
+next bandwidth-sharing band. The
+.B quanta
+argument lists quanta for the individual bandwidth-sharing bands.
+The minimum value of each quantum is 1.
+
+.TP
+bands
+Number of bands given explicitly. This value has to be at least large
+enough to cover the strict bands specified through the
+.B strict
+keyword and bandwidth-sharing bands specified in
+.B quanta.
+If a larger value is given, any extra bands are in bandwidth-sharing mode,
+and their quanta are deduced from the interface MTU.
+
+.TP
+priomap
+The priomap maps the priority of a packet to a band. The argument is a list
+of numbers. The first number indicates which band the packets with priority
+0 should be put to, the second is for priority 1, and so on.
+
+.SH EXAMPLE & USAGE
+
+.P
+Add a qdisc with 8 bandwidth-sharing bands, using the interface MTU as
+their quanta. Since all quanta are the same, this will lead to equal
+distribution of bandwidth between the bands, each will get about 12.5% of
+the link. The low 8 priorities go to individual bands in a 1:1 fashion:
+
+.P
+# tc qdisc add dev eth0 root handle 1: ets bands 8 priomap 7 6 5 4 3 2 1 0
+.br
+# tc qdisc show dev eth0
+.br
+qdisc ets 1: root refcnt 2 bands 8 quanta 1514 1514 1514 1514 1514 1514 1514 1514 priomap 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0 0
+
+.P
+Tweak the first band of the above qdisc to give it a quantum of 2650, which
+will give it about 20% of the link (and about 11.5% to the remaining
+bands):
+
+.P
+# tc class change dev eth0 classid 1:1 ets quantum 2650
+.br
+# tc qdisc show dev eth0
+.br
+qdisc ets 1: root refcnt 2 bands 8 quanta 2650 1514 1514 1514 1514 1514 1514 1514 priomap 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0 0
+
+.P
+Create a purely strict Qdisc with 1:1 mapping between priorities and TCs:
+
+.P
+# tc qdisc add dev eth0 root handle 1: ets strict 8 priomap 7 6 5 4 3 2 1 0
+.br
+# tc qdisc sh dev eth0
+.br
+qdisc ets 1: root refcnt 2 bands 8 strict 8 priomap 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0 0
+
+.P
+Add a Qdisc with 6 bands, 3 strict and 3 ETS with 35%-30%-25% weights:
+.P
+# tc qdisc add dev eth0 root handle 1: ets strict 3 quanta 3500 3000 2500 priomap 0 1 1 1 2 3 4 5
+.br
+# tc qdisc sh dev eth0
+.br
+qdisc ets 1: root refcnt 2 bands 6 strict 3 quanta 3500 3000 2500 priomap 0 1 1 1 2 3 4 5 0 0 0 0 0 0 0 0
+
+.SH SEE ALSO
+.BR tc (8),
+.BR tc-prio (8),
+.BR tc-drr (8)
+
+.SH AUTHOR
+Parts of both this manual page and the code itself are taken from PRIO and
+DRR qdiscs.
+.br
+ETS qdisc itself was written by Petr Machata.
diff --git a/man/man8/tc.8 b/man/man8/tc.8
index b81a396f..c4564a06 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -843,6 +843,7 @@ was written by Alexey N. Kuznetsov and added in Linux 2.2.
 .BR tc-codel (8),
 .BR tc-drr (8),
 .BR tc-ematch (8),
+.BR tc-ets (8),
 .BR tc-flow (8),
 .BR tc-flower (8),
 .BR tc-fq (8),
diff --git a/tc/Makefile b/tc/Makefile
index 14171a28..bea5550f 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -79,6 +79,7 @@ TCMODULES += q_cbs.o
 TCMODULES += q_etf.o
 TCMODULES += q_taprio.o
 TCMODULES += q_plug.o
+TCMODULES += q_ets.o
 
 TCSO :=
 ifeq ($(TC_CONFIG_ATM),y)
diff --git a/tc/q_ets.c b/tc/q_ets.c
new file mode 100644
index 00000000..c7b12183
--- /dev/null
+++ b/tc/q_ets.c
@@ -0,0 +1,338 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+
+/*
+ * Enhanced Transmission Selection - 802.1Qaz-based Qdisc
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <string.h>
+
+#include "utils.h"
+#include "tc_util.h"
+
+static void explain(void)
+{
+	fprintf(stderr, "Usage: ... ets [bands NUMBER] [strict NUMBER] [quanta Q1 Q2...] [priomap P1 P2...]\n");
+}
+
+static void cexplain(void)
+{
+	fprintf(stderr, "Usage: ... ets [quantum Q1]\n");
+}
+
+static unsigned int parse_quantum(const char *arg)
+{
+	unsigned int quantum;
+
+	if (get_unsigned(&quantum, arg, 10)) {
+		fprintf(stderr, "Illegal \"quanta\" element\n");
+		return 0;
+	}
+	if (!quantum)
+		fprintf(stderr, "\"quanta\" must be > 0\n");
+	return quantum;
+}
+
+static int parse_nbands(const char *arg, __u8 *pnbands, const char *what)
+{
+	unsigned int tmp;
+
+	if (*pnbands) {
+		fprintf(stderr, "Duplicate \"%s\"\n", what);
+		return -1;
+	}
+	if (get_unsigned(&tmp, arg, 10)) {
+		fprintf(stderr, "Illegal \"%s\"\n", what);
+		return -1;
+	}
+	if (tmp > ETS_MAX_BANDS) {
+		fprintf(stderr, "The value of \"%s\" must be <= %d\n",
+			what, ETS_MAX_BANDS);
+		return -1;
+	}
+
+	*pnbands = tmp;
+	return 0;
+}
+
+static int ets_parse_opt(struct qdisc_util *qu, int argc, char **argv,
+			 struct nlmsghdr *n, const char *dev)
+{
+	__u8 nbands = 0;
+	__u8 nstrict = 0;
+	bool quanta_mode = false;
+	unsigned int nquanta = 0;
+	__u32 quanta[ETS_MAX_BANDS] = {0};
+	bool priomap_mode = false;
+	unsigned int nprio = 0;
+	__u8 priomap[TC_PRIO_MAX + 1] = {0};
+	unsigned int tmp;
+	struct rtattr *tail, *nest;
+
+	while (argc > 0) {
+		if (strcmp(*argv, "bands") == 0) {
+			NEXT_ARG();
+			if (parse_nbands(*argv, &nbands, "bands"))
+				return -1;
+			priomap_mode = quanta_mode = false;
+		} else if (strcmp(*argv, "strict") == 0) {
+			NEXT_ARG();
+			if (parse_nbands(*argv, &nstrict, "strict"))
+				return -1;
+			priomap_mode = quanta_mode = false;
+		} else if (strcmp(*argv, "quanta") == 0) {
+			if (nquanta) {
+				fprintf(stderr, "Duplicate \"quanta\"\n");
+				return -1;
+			}
+			NEXT_ARG();
+			priomap_mode = false;
+			quanta_mode = true;
+			goto parse_quantum;
+		} else if (strcmp(*argv, "priomap") == 0) {
+			if (nprio) {
+				fprintf(stderr, "Duplicate \"priomap\"\n");
+				return -1;
+			}
+			NEXT_ARG();
+			priomap_mode = true;
+			quanta_mode = false;
+			goto parse_priomap;
+		} else if (strcmp(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else if (quanta_mode) {
+			unsigned int quantum;
+
+parse_quantum:
+			quantum = parse_quantum(*argv);
+			if (!quantum)
+				return -1;
+			quanta[nquanta++] = quantum;
+		} else if (priomap_mode) {
+			unsigned int band;
+
+parse_priomap:
+			if (get_unsigned(&band, *argv, 10)) {
+				fprintf(stderr, "Illegal \"priomap\" element\n");
+				return -1;
+			}
+			if (nprio > TC_PRIO_MAX) {
+				fprintf(stderr, "\"priomap\" index > TC_PRIO_MAX=%u\n", TC_PRIO_MAX);
+				return -1;
+			}
+			priomap[nprio++] = band;
+		} else {
+			fprintf(stderr, "What is \"%s\"?\n", *argv);
+			explain();
+			return -1;
+		}
+		argc--; argv++;
+	}
+
+	if (!nbands)
+		nbands = nquanta + nstrict;
+	if (!nbands) {
+		fprintf(stderr, "One of \"bands\", \"quanta\" or \"strict\" needs to be specified\n");
+		explain();
+		return -1;
+	}
+	if (nbands < 1) {
+		fprintf(stderr, "The value of \"bands\" must be >= 1\n");
+		explain();
+		return -1;
+	}
+	if (nstrict + nquanta > nbands) {
+		fprintf(stderr, "Not enough total bands to cover all the strict bands and quanta.\n");
+		explain();
+		return -1;
+	}
+	for (tmp = 0; tmp < nprio; tmp++) {
+		if (priomap[tmp] >= nbands) {
+			fprintf(stderr, "\"priomap\" element is out of bounds\n");
+			return -1;
+		}
+	}
+
+	tail = addattr_nest(n, 1024, TCA_OPTIONS | NLA_F_NESTED);
+	addattr_l(n, 1024, TCA_ETS_BANDS, &nbands, sizeof(nbands));
+	if (nstrict)
+		addattr_l(n, 1024, TCA_ETS_STRICT, &nstrict, sizeof(nstrict));
+	if (nquanta) {
+		nest = addattr_nest(n, 1024, TCA_ETS_QUANTA | NLA_F_NESTED);
+		for (tmp = 0; tmp < nquanta; tmp++)
+			addattr_l(n, 1024, TCA_ETS_BAND_QUANTUM,
+				  &quanta[tmp], sizeof(quanta[0]));
+		addattr_nest_end(n, nest);
+	}
+	if (nprio) {
+		nest = addattr_nest(n, 1024, TCA_ETS_PRIOMAP | NLA_F_NESTED);
+		for (tmp = 0; tmp < nprio; tmp++)
+			addattr_l(n, 1024, TCA_ETS_PMAP_BAND,
+				  &priomap[tmp], sizeof(priomap[0]));
+		addattr_nest_end(n, nest);
+	}
+	addattr_nest_end(n, tail);
+
+	return 0;
+}
+
+static int ets_parse_copt(struct qdisc_util *qu, int argc, char **argv,
+			  struct nlmsghdr *n, const char *dev)
+{
+	unsigned int quantum = 0;
+	struct rtattr *tail;
+
+	while (argc > 0) {
+		if (strcmp(*argv, "quantum") == 0) {
+			if (quantum) {
+				fprintf(stderr, "Duplicate \"quanta\"\n");
+				return -1;
+			}
+			NEXT_ARG();
+			quantum = parse_quantum(*argv);
+			if (!quantum)
+				return -1;
+		} else if (strcmp(*argv, "help") == 0) {
+			cexplain();
+			return -1;
+		} else {
+			fprintf(stderr, "What is \"%s\"?\n", *argv);
+			cexplain();
+			return -1;
+		}
+		argc--; argv++;
+	}
+
+	tail = addattr_nest(n, 1024, TCA_OPTIONS | NLA_F_NESTED);
+	if (quantum)
+		addattr_l(n, 1024, TCA_ETS_BAND_QUANTUM, &quantum,
+			  sizeof(quantum));
+	addattr_nest_end(n, tail);
+
+	return 0;
+}
+
+static int ets_print_opt_quanta(struct rtattr *opt)
+{
+	int len = RTA_PAYLOAD(opt);
+	unsigned int offset;
+
+	open_json_array(PRINT_ANY, "quanta");
+	for (offset = 0; offset < len; ) {
+		struct rtattr *tb[TCA_ETS_BAND_MAX + 1] = {NULL};
+		struct rtattr *attr;
+		__u32 quantum;
+
+		attr = RTA_DATA(opt) + offset;
+		parse_rtattr(tb, TCA_ETS_BAND_MAX, attr, len - offset);
+		offset += RTA_LENGTH(RTA_PAYLOAD(attr));
+
+		if (!tb[TCA_ETS_BAND_QUANTUM]) {
+			fprintf(stderr, "ERROR: No ETS band quantum\n");
+			return -1;
+		}
+
+		quantum = rta_getattr_u32(tb[TCA_ETS_BAND_QUANTUM]);
+		print_uint(PRINT_ANY, NULL, " %u", quantum);
+
+	}
+	close_json_array(PRINT_ANY, " ");
+
+	return 0;
+}
+
+static int ets_print_opt_priomap(struct rtattr *opt)
+{
+	int len = RTA_PAYLOAD(opt);
+	unsigned int offset;
+
+	open_json_array(PRINT_ANY, "priomap");
+	for (offset = 0; offset < len; ) {
+		struct rtattr *tb[TCA_ETS_PMAP_MAX + 1] = {NULL};
+		struct rtattr *attr;
+		__u8 band;
+
+		attr = RTA_DATA(opt) + offset;
+		parse_rtattr(tb, TCA_ETS_PMAP_MAX, attr, len - offset);
+		offset += RTA_LENGTH(RTA_PAYLOAD(attr)) + 3 /*padding*/;
+
+		if (!tb[TCA_ETS_PMAP_BAND]) {
+			fprintf(stderr, "ERROR: No ETS priomap band\n");
+			return -1;
+		}
+
+		band = rta_getattr_u8(tb[TCA_ETS_PMAP_BAND]);
+		print_uint(PRINT_ANY, NULL, " %u", band);
+
+	}
+	close_json_array(PRINT_ANY, " ");
+
+	return 0;
+}
+
+static int ets_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
+{
+	struct rtattr *tb[TCA_ETS_MAX + 1];
+	__u8 nbands;
+	__u8 nstrict;
+	int err;
+
+	if (opt == NULL)
+		return 0;
+
+	parse_rtattr_nested(tb, TCA_ETS_MAX, opt);
+
+	if (!tb[TCA_ETS_BANDS] || !tb[TCA_ETS_PRIOMAP]) {
+		fprintf(stderr, "ERROR: Incomplete ETS options\n");
+		return -1;
+	}
+
+	nbands = rta_getattr_u8(tb[TCA_ETS_BANDS]);
+	print_uint(PRINT_ANY, "bands", "bands %u ", nbands);
+
+	if (tb[TCA_ETS_STRICT]) {
+		nstrict = rta_getattr_u8(tb[TCA_ETS_STRICT]);
+		print_uint(PRINT_ANY, "strict", "strict %u ", nstrict);
+	}
+
+	if (tb[TCA_ETS_QUANTA]) {
+		err = ets_print_opt_quanta(tb[TCA_ETS_QUANTA]);
+		if (err)
+			return err;
+	}
+
+	return ets_print_opt_priomap(tb[TCA_ETS_PRIOMAP]);
+}
+
+static int ets_print_copt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
+{
+	struct rtattr *tb[TCA_ETS_BAND_MAX + 1];
+	__u32 quantum;
+
+	if (opt == NULL)
+		return 0;
+
+	parse_rtattr_nested(tb, TCA_ETS_BAND_MAX, opt);
+
+	if (tb[TCA_ETS_BAND_QUANTUM]) {
+		quantum = rta_getattr_u32(tb[TCA_ETS_BAND_QUANTUM]);
+		print_uint(PRINT_ANY, "quantum", "quantum %u ", quantum);
+	}
+
+	return 0;
+}
+
+struct qdisc_util ets_qdisc_util = {
+	.id		= "ets",
+	.parse_qopt	= ets_parse_opt,
+	.parse_copt	= ets_parse_copt,
+	.print_qopt	= ets_print_opt,
+	.print_copt	= ets_print_copt,
+};
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 10/10] selftests: qdiscs: Add test coverage for ETS Qdisc
  2019-11-20 13:05 ` [RFC PATCH 10/10] selftests: qdiscs: " Petr Machata
@ 2019-11-20 15:15   ` Roman Mashak
  2019-11-20 15:42     ` Petr Machata
  0 siblings, 1 reply; 19+ messages in thread
From: Roman Mashak @ 2019-11-20 15:15 UTC (permalink / raw)
  To: Petr Machata; +Cc: netdev, Ido Schimmel, Roopa Prabhu

Petr Machata <petrm@mellanox.com> writes:

> Add TDC coverage for the new ETS Qdisc.
>

It would be good to have tests for upper bound limits of qdisc
parameters.

> Signed-off-by: Petr Machata <petrm@mellanox.com>
> ---
>  .../tc-testing/tc-tests/qdiscs/ets.json       | 709 ++++++++++++++++++
>  1 file changed, 709 insertions(+)
>  create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json
>
> diff --git a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json
> new file mode 100644
> index 000000000000..6619d6bceb49
> --- /dev/null
> +++ b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json
> @@ -0,0 +1,709 @@
> +[
> +    {
> +        "id": "e90e",
> +        "name": "Add ETS qdisc using bands",
> +        "category": [
> +            "qdisc",
> +            "ets"
> +        ],

[...]


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 10/10] selftests: qdiscs: Add test coverage for ETS Qdisc
  2019-11-20 15:15   ` Roman Mashak
@ 2019-11-20 15:42     ` Petr Machata
  2019-11-20 21:22       ` Roman Mashak
  0 siblings, 1 reply; 19+ messages in thread
From: Petr Machata @ 2019-11-20 15:42 UTC (permalink / raw)
  To: Roman Mashak; +Cc: netdev, Ido Schimmel, Roopa Prabhu


Roman Mashak <mrv@mojatatu.com> writes:

> Petr Machata <petrm@mellanox.com> writes:
>
>> Add TDC coverage for the new ETS Qdisc.
>>
>
> It would be good to have tests for upper bound limits of qdisc
> parameters.

All right. I can think of "bands 16", and then "quantum $(((1 << 32) -
1))", but I'm not sure the latter is very useful. Did you have anything
else in mind?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 10/10] selftests: qdiscs: Add test coverage for ETS Qdisc
  2019-11-20 15:42     ` Petr Machata
@ 2019-11-20 21:22       ` Roman Mashak
  0 siblings, 0 replies; 19+ messages in thread
From: Roman Mashak @ 2019-11-20 21:22 UTC (permalink / raw)
  To: Petr Machata; +Cc: netdev, Ido Schimmel, Roopa Prabhu

Petr Machata <petrm@mellanox.com> writes:

> Roman Mashak <mrv@mojatatu.com> writes:
>
>> Petr Machata <petrm@mellanox.com> writes:
>>
>>> Add TDC coverage for the new ETS Qdisc.
>>>
>>
>> It would be good to have tests for upper bound limits of qdisc
>> parameters.
>
> All right. I can think of "bands 16", and then "quantum $(((1 << 32) -
> 1))", but I'm not sure the latter is very useful. Did you have anything
> else in mind?

I think, any test that would validate the user's input of parameters
would suffice, e.g. min/max acceptable values, range of values and
such.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 00/10] Add a new Qdisc, ETS
  2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (12 preceding siblings ...)
  2019-11-20 13:05 ` [RFC PATCH 3/3] tc: Add support for " Petr Machata
@ 2019-11-20 23:25 ` Jakub Kicinski
  2019-11-21 12:43   ` Petr Machata
  13 siblings, 1 reply; 19+ messages in thread
From: Jakub Kicinski @ 2019-11-20 23:25 UTC (permalink / raw)
  To: Petr Machata; +Cc: netdev, Ido Schimmel, Roopa Prabhu

On Wed, 20 Nov 2019 13:05:08 +0000, Petr Machata wrote:
> The IEEE standard 802.1Qaz (and 802.1Q-2014) specifies four principal
> transmission selection algorithms: strict priority, credit-based shaper,
> ETS (bandwidth sharing), and vendor-specific. All these have their
> corresponding knobs in DCB. But DCB does not have interfaces to configure
> RED and ECN, unlike Qdiscs.
> 
> In the Qdisc land, strict priority is implemented by PRIO. Credit-based
> transmission selection algorithm can then be modeled by having e.g. TBF or
> CBS Qdisc below some of the PRIO bands. ETS would then be modeled by
> placing a DRR Qdisc under the last PRIO band.
> 
> The problem with this approach is that DRR on its own, as well as the
> combination of PRIO and DRR, are tricky to configure and tricky to offload
> to 802.1Qaz-compliant hardware. This is due to several reasons:
> 
> - As any classful Qdisc, DRR supports adding classifiers to decide in which
>   class to enqueue packets. Unlike PRIO, there's however no fallback in the
>   form of priomap. A way to achieve classification based on packet priority
>   is e.g. like this:
> 
>     # tc filter add dev swp1 root handle 1: \
> 		basic match 'meta(priority eq 0)' flowid 1:10
> 
>   Expressing the priomap in this manner however forces drivers to deep dive
>   into the classifier block to parse the individual rules.
> 
>   A possible solution would be to extend the classes with a "defmap" a la
>   split / defmap mechanism of CBQ, and introduce this as a last resort
>   classification. However, unlike priomap, this doesn't have the guarantee
>   of covering all priorities. Traffic whose priority is not covered is
>   dropped by DRR as unclassified. But ASICs tend to implement dropping in
>   the ACL block, not in scheduling pipelines. The need to treat these
>   configurations correctly (if only to decide to not offload at all)
>   complicates a driver.
> 
>   It's not clear how to retrofit priomap with all its benefits to DRR
>   without changing it beyond recognition.
> 
> - The interplay between PRIO and DRR is also causing problems. 802.1Qaz has
>   all ETS TCs as a last resort. I believe switch ASICs that support ETS at
>   all will handle ETS traffic likewise. However the Linux model is more
>   generic, allowing the DRR block in any band. Drivers would need to be
>   careful to handle this case correctly, otherwise the offloaded model
>   might not match the slow-path one.
> 
>   In a similar vein, PRIO and DRR need to agree on the list of priorities
>   assigned to DRR. This is doubly problematic--the user needs to take care
>   to keep the two in sync, and the driver needs to watch for any holes in
>   DRR coverage and treat the traffic correctly, as discussed above.
> 
>   Note that at the time that DRR Qdisc is added, it has no classes, and
>   thus any priorities assigned to that PRIO band are not covered. Thus this
>   case is surprisingly rather common, and needs to be handled gracefully by
>   the driver.
> 
> - Similarly due to DRR flexibility, when a Qdisc (such as RED) is attached
>   below it, it is not immediately clear which TC the class represents. This
>   is unlike PRIO with its straightforward classid scheme. When DRR is
>   combined with PRIO, the relationship between classes and TCs gets even
>   more murky.
> 
>   This is a problem for users as well: the TC mapping is rather important
>   for (devlink) shared buffer configuration and (ethtool) counters.

IMHO adding an API to simplify HW config is a double edged sword. 
I think everyone will appreciate the simplicity of the new interface..
until the HW gets a little more smart and then we'll all have to
go back to the full interface and offload both that and the simple one,
or keep growing the new interface (for all practical sense just for HW)
Qdisc.

Having written a MQ+GRED offload I sympathize with the complexity
concerns, also trying to explain how to set up such Qdiscs to users
results in a lot of blank stares.

Is there any chance at all we could simplify things by adding a better
user interface and a common translation layer in front of the drivers?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 00/10] Add a new Qdisc, ETS
  2019-11-20 23:25 ` [RFC PATCH 00/10] Add a new Qdisc, ETS Jakub Kicinski
@ 2019-11-21 12:43   ` Petr Machata
  0 siblings, 0 replies; 19+ messages in thread
From: Petr Machata @ 2019-11-21 12:43 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, Ido Schimmel, Roopa Prabhu


Jakub Kicinski <jakub.kicinski@netronome.com> writes:

> On Wed, 20 Nov 2019 13:05:08 +0000, Petr Machata wrote:
>> The IEEE standard 802.1Qaz (and 802.1Q-2014) specifies four principal
>> transmission selection algorithms: strict priority, credit-based shaper,
>> ETS (bandwidth sharing), and vendor-specific. All these have their
>> corresponding knobs in DCB. But DCB does not have interfaces to configure
>> RED and ECN, unlike Qdiscs.
>>
>> In the Qdisc land, strict priority is implemented by PRIO. Credit-based
>> transmission selection algorithm can then be modeled by having e.g. TBF or
>> CBS Qdisc below some of the PRIO bands. ETS would then be modeled by
>> placing a DRR Qdisc under the last PRIO band.
>>
>> The problem with this approach is that DRR on its own, as well as the
>> combination of PRIO and DRR, are tricky to configure and tricky to offload
>> to 802.1Qaz-compliant hardware. This is due to several reasons:
>>
>> - As any classful Qdisc, DRR supports adding classifiers to decide in which
>>   class to enqueue packets. Unlike PRIO, there's however no fallback in the
>>   form of priomap. A way to achieve classification based on packet priority
>>   is e.g. like this:
>>
>>     # tc filter add dev swp1 root handle 1: \
>> 		basic match 'meta(priority eq 0)' flowid 1:10
>>
>>   Expressing the priomap in this manner however forces drivers to deep dive
>>   into the classifier block to parse the individual rules.
>>
>>   A possible solution would be to extend the classes with a "defmap" a la
>>   split / defmap mechanism of CBQ, and introduce this as a last resort
>>   classification. However, unlike priomap, this doesn't have the guarantee
>>   of covering all priorities. Traffic whose priority is not covered is
>>   dropped by DRR as unclassified. But ASICs tend to implement dropping in
>>   the ACL block, not in scheduling pipelines. The need to treat these
>>   configurations correctly (if only to decide to not offload at all)
>>   complicates a driver.
>>
>>   It's not clear how to retrofit priomap with all its benefits to DRR
>>   without changing it beyond recognition.
>>
>> - The interplay between PRIO and DRR is also causing problems. 802.1Qaz has
>>   all ETS TCs as a last resort. I believe switch ASICs that support ETS at
>>   all will handle ETS traffic likewise. However the Linux model is more
>>   generic, allowing the DRR block in any band. Drivers would need to be
>>   careful to handle this case correctly, otherwise the offloaded model
>>   might not match the slow-path one.
>>
>>   In a similar vein, PRIO and DRR need to agree on the list of priorities
>>   assigned to DRR. This is doubly problematic--the user needs to take care
>>   to keep the two in sync, and the driver needs to watch for any holes in
>>   DRR coverage and treat the traffic correctly, as discussed above.
>>
>>   Note that at the time that DRR Qdisc is added, it has no classes, and
>>   thus any priorities assigned to that PRIO band are not covered. Thus this
>>   case is surprisingly rather common, and needs to be handled gracefully by
>>   the driver.
>>
>> - Similarly due to DRR flexibility, when a Qdisc (such as RED) is attached
>>   below it, it is not immediately clear which TC the class represents. This
>>   is unlike PRIO with its straightforward classid scheme. When DRR is
>>   combined with PRIO, the relationship between classes and TCs gets even
>>   more murky.
>>
>>   This is a problem for users as well: the TC mapping is rather important
>>   for (devlink) shared buffer configuration and (ethtool) counters.
>
> IMHO adding an API to simplify HW config is a double edged sword.
> I think everyone will appreciate the simplicity of the new interface..
> until the HW gets a little more smart and then we'll all have to

For reference, the Spectrum hardware already is more smart. We could
offload PRIO with several DRRs under different bands, the HW is
expressive enough to describe this. But nobody seems to need this: it
seems there are no customers needing anything more than what 802.1Qaz
describes. The DCB interface, which is pretty much married to HW
interfaces, is likewise very close to what 802.1Q specifies, and I don't
believe that's by chance.

> go back to the full interface and offload both that and the simple one,
> or keep growing the new interface (for all practical sense just for HW)
> Qdisc.

If 802.1Q introduces an algorithm that can't be expressed as a single
Qdisc, growing the ETS Qdisc is of course valid. E.g. the shaper
operation is restricted to a single band, so it makes sense to express
it as an independent unit. That's unlike the ETS algorithm, which needs
cooperation between several bands, so you can't easily attach "ETS'ness"
under individual PRIO bands.

> Having written a MQ+GRED offload I sympathize with the complexity
> concerns, also trying to explain how to set up such Qdiscs to users
> results in a lot of blank stares.
>
> Is there any chance at all we could simplify things by adding a better
> user interface and a common translation layer in front of the drivers?

One can imagine a library that handles these sorts of stuff. Drivers
would forward TC events to it, it would figure out what's what, and
somehow signal back to the driver. But packaging this as a Qdisc is such
an interface as well. And because Qdisc interface is well understood not
only by kernel hackers, but also by end users, a Qdisc takes care of
that part of the problem as well.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-11-21 12:43 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-20 13:05 [RFC PATCH 00/10] Add a new Qdisc, ETS Petr Machata
2019-11-20 13:05 ` [RFC PATCH 01/10] net: pkt_cls: Clarify a comment Petr Machata
2019-11-20 13:05 ` [RFC PATCH 02/10] mlxsw: spectrum_qdisc: " Petr Machata
2019-11-20 13:05 ` [RFC PATCH 03/10] mlxsw: spectrum: Fix typos in MLXSW_REG_QEEC_HIERARCHY_* enumerators Petr Machata
2019-11-20 13:05 ` [RFC PATCH 04/10] net: sch_ets: Add a new Qdisc Petr Machata
2019-11-20 13:05 ` [RFC PATCH 05/10] net: sch_ets: Make the ETS qdisc offloadable Petr Machata
2019-11-20 13:05 ` [RFC PATCH 06/10] mlxsw: spectrum_qdisc: Generalize PRIO offload to support ETS Petr Machata
2019-11-20 13:05 ` [RFC PATCH 07/10] mlxsw: spectrum_qdisc: Support offloading of ETS Qdisc Petr Machata
2019-11-20 13:05 ` [RFC PATCH 08/10] selftests: forwarding: Move start_/stop_traffic from mlxsw to lib.sh Petr Machata
2019-11-20 13:05 ` [RFC PATCH 09/10] selftests: forwarding: sch_ets: Add test coverage for ETS Qdisc Petr Machata
2019-11-20 13:05 ` [RFC PATCH 10/10] selftests: qdiscs: " Petr Machata
2019-11-20 15:15   ` Roman Mashak
2019-11-20 15:42     ` Petr Machata
2019-11-20 21:22       ` Roman Mashak
2019-11-20 13:05 ` [RFC PATCH 1/3] libnetlink: parse_rtattr_nested should allow NLA_F_NESTED flag Petr Machata
2019-11-20 13:05 ` [RFC PATCH 2/3] uapi: Update for the ETS Qdisc Petr Machata
2019-11-20 13:05 ` [RFC PATCH 3/3] tc: Add support for " Petr Machata
2019-11-20 23:25 ` [RFC PATCH 00/10] Add a new Qdisc, ETS Jakub Kicinski
2019-11-21 12:43   ` Petr Machata

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.