* [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS
@ 2019-12-18 14:55 Petr Machata
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 01/10] net: pkt_cls: Clarify a comment Petr Machata
                   ` (11 more replies)
  0 siblings, 12 replies; 17+ messages in thread
From: Petr Machata @ 2019-12-18 14:55 UTC (permalink / raw)
  To: netdev
  Cc: Petr Machata, David Miller, Roopa Prabhu, Jakub Kicinski,
	Roman Mashak, Ido Schimmel, Jiri Pirko

The IEEE standard 802.1Qaz (and 802.1Q-2014) specifies four principal
transmission selection algorithms: strict priority, credit-based shaper,
ETS (bandwidth sharing), and vendor-specific. All these have their
corresponding knobs in DCB. But DCB does not have interfaces to configure
RED and ECN, unlike Qdiscs.

In Qdisc land, strict priority is implemented by PRIO. The credit-based
transmission selection algorithm can then be modeled by placing e.g. a TBF
or CBS Qdisc below some of the PRIO bands. ETS would then be modeled by
placing a DRR Qdisc under the last PRIO band.

The problem with this approach is that DRR on its own, as well as the
combination of PRIO and DRR, are tricky to configure and tricky to offload
to 802.1Qaz-compliant hardware. This is due to several reasons:

- Like any classful Qdisc, DRR supports adding classifiers to decide in
  which class to enqueue packets. Unlike PRIO, however, it has no fallback
  in the form of a priomap. One way to achieve classification based on
  packet priority is e.g. like this:

    # tc filter add dev swp1 root handle 1: \
		basic match 'meta(priority eq 0)' flowid 1:10

  Expressing the priomap in this manner however forces drivers to deep dive
  into the classifier block to parse the individual rules.

  A possible solution would be to extend the classes with a "defmap" a la
  split / defmap mechanism of CBQ, and introduce this as a last resort
  classification. However, unlike priomap, this doesn't have the guarantee
  of covering all priorities. Traffic whose priority is not covered is
  dropped by DRR as unclassified. But ASICs tend to implement dropping in
  the ACL block, not in scheduling pipelines. The need to treat these
  configurations correctly (if only to decide to not offload at all)
  complicates a driver.

  It's not clear how to retrofit priomap with all its benefits to DRR
  without changing it beyond recognition.

- The interplay between PRIO and DRR is also causing problems. 802.1Qaz has
  all ETS TCs as a last resort. Switch ASICs that support ETS at all are
  likely to handle ETS traffic this way as well. However, the Linux model
  is more generic, allowing the DRR block in any band. Drivers would need
  to be careful to handle this case correctly, otherwise the offloaded
  model might not match the slow-path one.

  In a similar vein, PRIO and DRR need to agree on the list of priorities
  assigned to DRR. This is doubly problematic--the user needs to take care
  to keep the two in sync, and the driver needs to watch for any holes in
  DRR coverage and treat the traffic correctly, as discussed above.

  Note that at the time that DRR Qdisc is added, it has no classes, and
  thus any priorities assigned to that PRIO band are not covered. Thus this
  case is surprisingly rather common, and needs to be handled gracefully by
  the driver.

- Similarly due to DRR flexibility, when a Qdisc (such as RED) is attached
  below it, it is not immediately clear which TC the class represents. This
  is unlike PRIO with its straightforward classid scheme. When DRR is
  combined with PRIO, the relationship between classes and TCs gets even
  more murky.

  This is a problem for users as well: the TC mapping is rather important
  for (devlink) shared buffer configuration and (ethtool) counters.

So instead, this patch set introduces a new Qdisc, which is based on
802.1Qaz wording. It is PRIO-like in how it is configured, meaning one
needs to specify how many bands there are, how many are strict and how many
are ETS, quanta for the latter, and priomap.
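
To illustrate how these parameters fit together, here is a minimal
user-space sketch in C. It is not code from the patchset: the names
ets_sketch_resolve() and ets_sketch_share() are made up for this
illustration. It assumes, as the examples in this letter suggest, that an
unspecified band count is the number of strict bands plus the number of
quanta given, and that ETS bands without an explicit quantum get a default
one (passed in by the caller). Rejecting a zero quantum is a choice of the
sketch, made because zero marks a strict band in it.

```c
#include <assert.h>
#include <stddef.h>

#define ETS_SKETCH_MAX_BANDS 16	/* mirrors TCQ_ETS_MAX_BANDS */

/* Hypothetical helper: expand the user-visible parameters into one quantum
 * per band. Strict bands get 0. nbands == 0 means "not specified".
 * Returns the resolved number of bands, or -1 on an invalid combination.
 */
static int ets_sketch_resolve(unsigned int nbands, unsigned int nstrict,
			      const unsigned int *quanta, size_t nquanta,
			      unsigned int dflt_quantum,
			      unsigned int out[ETS_SKETCH_MAX_BANDS])
{
	unsigned int i;

	assert(out != NULL);

	if (!nbands)
		nbands = nstrict + (unsigned int)nquanta;
	if (!nbands || nbands > ETS_SKETCH_MAX_BANDS ||
	    nstrict + nquanta > nbands)
		return -1;

	for (i = 0; i < nbands; i++) {
		if (i < nstrict)
			out[i] = 0;			/* strict band */
		else if (i - nstrict < nquanta)
			out[i] = quanta[i - nstrict];	/* explicit quantum */
		else
			out[i] = dflt_quantum;		/* underspecified band */

		/* Sketch-only rule: an ETS band needs a non-zero quantum. */
		if (i >= nstrict && !out[i])
			return -1;
	}
	return (int)nbands;
}

/* Share of one band among all bandwidth-sharing bands, in whole percent. */
static unsigned int ets_sketch_share(const unsigned int *out,
				     unsigned int nbands, unsigned int band)
{
	unsigned long long total = 0;
	unsigned int i;

	for (i = 0; i < nbands; i++)
		total += out[i];
	return total ? (unsigned int)(100ULL * out[band] / total) : 0;
}
```

With 3 strict bands and quanta 4500, 3000 and 2500, the sketch resolves to
6 bands, and the three ETS bands get shares of 45, 30 and 25 percent.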

The new Qdisc operates like the PRIO / DRR combo would when configured as
per the standard. The strict classes, if any, are tried for traffic first.
When there's no traffic in any of the strict queues, the ETS ones (if any)
are treated in the same way as in DRR.
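
The dequeue order described above can be modeled roughly as follows. This
is a simplified user-space sketch, not the sch_ets.c implementation: all
struct and function names are hypothetical, queues hold only packet
lengths, and a round-robin cursor stands in for the list of active bands.
The caller is assumed to initialize each ETS band's deficit to its quantum.

```c
#include <assert.h>
#include <stddef.h>

#define ETS_SKETCH_MAX_BANDS 16	/* mirrors TCQ_ETS_MAX_BANDS */
#define ETS_SKETCH_QLEN 8	/* illustrative per-band queue depth */

struct ets_sketch_band {
	unsigned int quantum;	/* 0 for strict bands */
	unsigned int deficit;	/* bytes this band may still send */
	unsigned int pkts[ETS_SKETCH_QLEN];	/* FIFO of packet lengths */
	size_t head, count;
};

struct ets_sketch {
	size_t nbands, nstrict;
	size_t cursor;		/* ETS band currently being served */
	struct ets_sketch_band bands[ETS_SKETCH_MAX_BANDS];
};

static int ets_sketch_enqueue(struct ets_sketch_band *b, unsigned int len)
{
	if (b->count == ETS_SKETCH_QLEN)
		return -1;	/* queue full, drop */
	b->pkts[(b->head + b->count++) % ETS_SKETCH_QLEN] = len;
	return 0;
}

static unsigned int ets_sketch_pop(struct ets_sketch_band *b)
{
	unsigned int len = b->pkts[b->head];

	b->head = (b->head + 1) % ETS_SKETCH_QLEN;
	b->count--;
	return len;
}

/* Returns the band that was served and stores the packet length in *len,
 * or returns -1 if all bands are empty.
 */
static int ets_sketch_dequeue(struct ets_sketch *q, unsigned int *len)
{
	size_t i, pending = 0;

	/* Strict bands first, in order of band number. */
	for (i = 0; i < q->nstrict; i++) {
		if (q->bands[i].count) {
			*len = ets_sketch_pop(&q->bands[i]);
			return (int)i;
		}
	}

	for (i = q->nstrict; i < q->nbands; i++)
		pending += q->bands[i].count;
	if (!pending)
		return -1;

	if (q->cursor < q->nstrict || q->cursor >= q->nbands)
		q->cursor = q->nstrict;

	/* Deficit round robin among the bandwidth-sharing bands. */
	for (;;) {
		struct ets_sketch_band *b = &q->bands[q->cursor];

		if (b->count && b->pkts[b->head] <= b->deficit) {
			b->deficit -= b->pkts[b->head];
			*len = ets_sketch_pop(b);
			return (int)q->cursor;
		}
		if (b->count) {
			/* Head packet does not fit: top up and move on. */
			assert(b->quantum > 0);
			b->deficit += b->quantum;
		}
		if (++q->cursor == q->nbands)
			q->cursor = q->nstrict;
	}
}
```

Unlike a real DRR implementation, the sketch does not reset the deficit of
a band that went empty and later becomes active again.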

The chosen interface makes the overall system both reasonably easy to
configure, and reasonably easy to offload. The extra code to support ETS in
mlxsw (which already supports PRIO) is about 150 lines, of which perhaps 20
lines is bona fide new business logic.

The credit-based shaper transmission selection algorithm can be configured
by adding a CBS Qdisc under one of the strict bands (TBF can be used to
similar effect as well). As a non-work-conserving Qdisc, CBS can't be
hooked under the ETS bands. This is detected and handled at runtime, in the
same way as the DRR Qdisc does it. Note that offloading CBS is not the
subject of this patchset.

The patchset proceeds in four stages:

- Patches #1-#3 are cleanups.
- Patches #4 and #5 contain the new Qdisc.
- Patches #6 and #7 update mlxsw to offload the new Qdisc.
- Patches #8-#10 add selftests for ETS.

Examples:

- Add a Qdisc with 6 bands, 3 strict and 3 ETS with 45%-30%-25% weights:

    # tc qdisc add dev swp1 root handle 1: \
	ets strict 3 quanta 4500 3000 2500 priomap 0 1 1 1 2 3 4 5
    # tc qdisc sh dev swp1
    qdisc ets 1: root refcnt 2 bands 6 strict 3 quanta 4500 3000 2500 priomap 0 1 1 1 2 3 4 5 5 5 5 5 5 5 5 5 

- Tweak quantum of one of the classes of the previous Qdisc:

    # tc class ch dev swp1 classid 1:4 ets quantum 1000
    # tc qdisc sh dev swp1
    qdisc ets 1: root refcnt 2 bands 6 strict 3 quanta 1000 3000 2500 priomap 0 1 1 1 2 3 4 5 5 5 5 5 5 5 5 5 
    # tc class ch dev swp1 classid 1:3 ets quantum 1000
    Error: Strict bands do not have a configurable quantum.

- Purely strict Qdisc with 1:1 mapping between priorities and TCs:

    # tc qdisc add dev swp1 root handle 1: \
	ets strict 8 priomap 7 6 5 4 3 2 1 0
    # tc qdisc sh dev swp1
    qdisc ets 1: root refcnt 2 bands 8 strict 8 priomap 7 6 5 4 3 2 1 0 7 7 7 7 7 7 7 7 

- Use "bands" to specify number of bands explicitly. Underspecified bands
  are implicitly ETS and their quantum is taken from MTU. The following
  thus gives each band the same weight:

    # tc qdisc add dev swp1 root handle 1: \
	ets bands 8 priomap 7 6 5 4 3 2 1 0
    # tc qdisc sh dev swp1
    qdisc ets 1: root refcnt 2 bands 8 quanta 1514 1514 1514 1514 1514 1514 1514 1514 priomap 7 6 5 4 3 2 1 0 7 7 7 7 7 7 7 7 
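
The way the priomap is padded in the outputs above can be sketched as
follows. This is an illustration only, with a hypothetical helper name. It
assumes 16 priorities (TC_PRIO_MAX + 1) and that priorities not mentioned
on the command line map to the last band.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETS_SKETCH_PRIO_MAX 15	/* same value as TC_PRIO_MAX */

/* Hypothetical helper: expand a possibly short priomap to cover all
 * priorities. Returns 0 on success, -1 if an entry names a band that does
 * not exist or if too many entries are given.
 */
static int ets_sketch_priomap_fill(uint8_t out[ETS_SKETCH_PRIO_MAX + 1],
				   const uint8_t *given, size_t ngiven,
				   unsigned int nbands)
{
	size_t prio;

	assert(out != NULL);

	if (!nbands || nbands > 16 || ngiven > ETS_SKETCH_PRIO_MAX + 1)
		return -1;

	for (prio = 0; prio <= ETS_SKETCH_PRIO_MAX; prio++) {
		/* Unmentioned priorities fall back to the last band. */
		unsigned int band = prio < ngiven ? given[prio] : nbands - 1;

		if (band >= nbands)
			return -1;
		out[prio] = (uint8_t)band;
	}
	return 0;
}
```

For the first example, 8 entries are given and the Qdisc has 6 bands, so
priorities 8 through 15 all end up in band 5.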

v2:
- This addresses points raised by David Miller.
- Patch #4:
    - sch_ets.c: Add a comment with description of the Qdisc and the
      dequeuing algorithm.
    - Kconfig: Add a high-level description to the help blurb.

v1:
- No changes, first upstream submission after RFC.

v3 (internal):
- This addresses review from Jiri Pirko.
- Patch #3:
    - Rename to _HR_ instead of to _HIERARCHY_.
- Patch #4:
    - pkt_sched.h: Keep all the TCA_ETS_ constants in one enum.
    - pkt_sched.h: Rename TCA_ETS_BANDS to _NBANDS, _STRICT to _NSTRICT,
      _BAND_QUANTUM to _QUANTA_BAND and _PMAP_BAND to _PRIOMAP_BAND.
    - sch_ets.c: Update to reflect the above changes. Add a new policy,
      ets_class_policy, which is used when parsing class changes.
      Currently that policy is the same as the quanta policy, but that
      might change.
    - sch_ets.c: Move MTU handling from ets_quantum_parse() to the one
      caller that makes use of it.
    - sch_ets.c: ets_qdisc_priomap_parse(): WARN_ON_ONCE on invalid
      attribute instead of returning an extack.
- Patch #6:
    - __mlxsw_sp_qdisc_ets_replace(): Pass the weights argument to this
      function in this patch already. Drop the weight computation.
    - mlxsw_sp_qdisc_prio_replace(): Rename "quanta" to "zeroes" and
      pass for the abovementioned "weights".
    - mlxsw_sp_qdisc_prio_graft(): Convert to a wrapper around
      __mlxsw_sp_qdisc_ets_graft(), instead of invoking the latter
      directly from mlxsw_sp_setup_tc_prio().
    - Update to follow the _HIERARCHY_ -> _HR_ renaming.
- Patch #7:
    - __mlxsw_sp_qdisc_ets_replace(): The "weights" argument passing and
      weight computation removal are now done in a previous patch.
    - mlxsw_sp_setup_tc_ets(): Drop case TC_ETS_REPLACE, which is handled
      earlier in the function.
- Patch #3 (iproute2):
    - Add an example output to the commit message.
    - tc-ets.8: Fix output of two examples.
    - tc-ets.8: Describe default values of "bands", "quanta".
    - q_ets.c: A number of fixes in error messages.
    - q_ets.c: Comment formatting: /*padding*/ -> /* padding */
    - q_ets.c: parse_nbands: Move duplicate checking to callers.
    - q_ets.c: Don't accept both "quantum" and "quanta" as equivalent.

v2 (internal):
- This addresses review from Ido Schimmel and comments from Alexander
  Kushnarov.
- Patch #2:
    - s/coment/comment in the commit message.
- Patch #4:
    - sch_ets: ets_class_is_strict(), ets_class_id(): Constify an argument
    - ets_class_find(): RXTify
- Patch #3 (iproute2):
    - tc-ets.8: some spelling fixes
    - tc-ets.8: add another example
    - tc.8: add an ETS to "CLASSFUL QDISCS" section

v1 (internal):
- This addresses RFC reviews from Ido Schimmel and Roman Mashak, bugs found
  by Alexander Petrovskiy and myself, and other improvements.
- Patch #2:
    - Expand the explanation with an explicit example.
- Patch #4:
    - Kconfig: s/sch_drr/sch_ets/
    - sch_ets: Reorder includes to be in alphabetical order
    - sch_ets: ets_quantum_parse(): Rename the return-pointer argument
      from pquantum to quantum, and use it directly, not going through a
      local temporary.
    - sch_ets: ets_qdisc_quanta_parse(): Convert syntax of function
      argument "quanta" from an array to a pointer.
    - sch_ets: ets_qdisc_priomap_parse(): Likewise with "priomap".
    - sch_ets: ets_qdisc_quanta_parse(), ets_qdisc_priomap_parse(): Invoke
      __nla_validate_nested directly instead of nl80211_validate_nested().
    - sch_ets: ets_qdisc_quanta_parse(): WARN_ON_ONCE on invalid attribute
      instead of returning an extack.
    - sch_ets: ets_qdisc_change(): Make the last band the default one for
      unmentioned priomap priorities.
    - sch_ets: Fix a panic when an offloaded child in a bandwidth-sharing
      band notified its ETS parent.
    - sch_ets: When ungrafting, add the newly-created invisible FIFO to
      the Qdisc hash
- Patch #5:
    - pkt_cls.h: Note that quantum=0 signifies a strict band.
    - Fix error path handling when ets_offload_dump() fails.
- Patch #6:
    - __mlxsw_sp_qdisc_ets_replace(): Convert syntax of function arguments
      "quanta" and "priomap" from arrays to pointers.
- Patch #7:
    - __mlxsw_sp_qdisc_ets_replace(): Convert syntax of function argument
      "weights" from an array to a pointer.
- Patch #9:
    - mlxsw/sch_ets.sh: Add a comment explaining packet prioritization.
    - Adjust the whole suite to allow testing of traffic classifiers
      in addition to testing priomap.
- Patch #10:
    - Add a number of new tests to test default priomap band, overlarge
      number of bands, zeroes in quanta, and altogether missing quanta.
- Patch #1 (iproute2):
    - State motivation for inclusion of this patch in the patchset in the
      commit message.
- Patch #3 (iproute2):
    - tc-ets.8: it is now December
    - tc-ets.8: explain inactivity WRT using non-WC Qdiscs under ETS band
    - tc-ets.8: s/flow/band in explanation of quantum
    - tc-ets.8: explain what happens with priorities not covered by priomap
    - tc-ets.8: default priomap band is now the last one
    - q_ets.c: ets_parse_opt(): Remove unnecessary initialization of
      priomap and quanta.

Petr Machata (10):
  net: pkt_cls: Clarify a comment
  mlxsw: spectrum_qdisc: Clarify a comment
  mlxsw: spectrum: Rename MLXSW_REG_QEEC_HIERARCY_* enumerators
  net: sch_ets: Add a new Qdisc
  net: sch_ets: Make the ETS qdisc offloadable
  mlxsw: spectrum_qdisc: Generalize PRIO offload to support ETS
  mlxsw: spectrum_qdisc: Support offloading of ETS Qdisc
  selftests: forwarding: Move start_/stop_traffic from mlxsw to lib.sh
  selftests: forwarding: sch_ets: Add test coverage for ETS Qdisc
  selftests: qdiscs: Add test coverage for ETS Qdisc

 drivers/net/ethernet/mellanox/mlxsw/reg.h     |  11 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.c    |  21 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.h    |   2 +
 .../ethernet/mellanox/mlxsw/spectrum_dcb.c    |   8 +-
 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 219 +++-
 include/linux/netdevice.h                     |   1 +
 include/net/pkt_cls.h                         |  36 +-
 include/uapi/linux/pkt_sched.h                |  17 +
 net/sched/Kconfig                             |  17 +
 net/sched/Makefile                            |   1 +
 net/sched/sch_ets.c                           | 828 +++++++++++++++
 .../selftests/drivers/net/mlxsw/qos_lib.sh    |  46 +-
 .../selftests/drivers/net/mlxsw/sch_ets.sh    |  67 ++
 tools/testing/selftests/net/forwarding/lib.sh |  18 +
 .../selftests/net/forwarding/sch_ets.sh       |  44 +
 .../selftests/net/forwarding/sch_ets_core.sh  | 300 ++++++
 .../selftests/net/forwarding/sch_ets_tests.sh | 227 +++++
 .../tc-testing/tc-tests/qdiscs/ets.json       | 940 ++++++++++++++++++
 18 files changed, 2732 insertions(+), 71 deletions(-)
 create mode 100644 net/sched/sch_ets.c
 create mode 100755 tools/testing/selftests/drivers/net/mlxsw/sch_ets.sh
 create mode 100755 tools/testing/selftests/net/forwarding/sch_ets.sh
 create mode 100644 tools/testing/selftests/net/forwarding/sch_ets_core.sh
 create mode 100644 tools/testing/selftests/net/forwarding/sch_ets_tests.sh
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json

-- 
2.20.1



* [PATCH net-next mlxsw v2 01/10] net: pkt_cls: Clarify a comment
  2019-12-18 14:55 [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS Petr Machata
@ 2019-12-18 14:55 ` Petr Machata
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 02/10] mlxsw: spectrum_qdisc: " Petr Machata
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Petr Machata @ 2019-12-18 14:55 UTC (permalink / raw)
  To: netdev
  Cc: Petr Machata, David Miller, Roopa Prabhu, Jakub Kicinski,
	Roman Mashak, Ido Schimmel, Jiri Pirko, Jiri Pirko

The bit about negating HW backlog left me scratching my head. Clarify the
comment.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/pkt_cls.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index e553fc80eb23..a7c5d492bc04 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -791,9 +791,8 @@ enum tc_prio_command {
 struct tc_prio_qopt_offload_params {
 	int bands;
 	u8 priomap[TC_PRIO_MAX + 1];
-	/* In case that a prio qdisc is offloaded and now is changed to a
-	 * non-offloadedable config, it needs to update the backlog & qlen
-	 * values to negate the HW backlog & qlen values (and only them).
+	/* At the point of un-offloading the Qdisc, the reported backlog and
+	 * qlen need to be reduced by the portion that is in HW.
 	 */
 	struct gnet_stats_queue *qstats;
 };
-- 
2.20.1



* [PATCH net-next mlxsw v2 02/10] mlxsw: spectrum_qdisc: Clarify a comment
  2019-12-18 14:55 [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS Petr Machata
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 01/10] net: pkt_cls: Clarify a comment Petr Machata
@ 2019-12-18 14:55 ` Petr Machata
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 03/10] mlxsw: spectrum: Rename MLXSW_REG_QEEC_HIERARCY_* enumerators Petr Machata
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Petr Machata @ 2019-12-18 14:55 UTC (permalink / raw)
  To: netdev
  Cc: Petr Machata, David Miller, Roopa Prabhu, Jakub Kicinski,
	Roman Mashak, Ido Schimmel, Jiri Pirko, Jiri Pirko

Expand the comment at mlxsw_sp_qdisc_prio_graft() to make the problem that
this function is trying to handle clearer.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---

Notes:
    v2 (internal):
    - s/coment/comment in the commit message.
    
    v1 (internal):
    - Expand the explanation with an explicit example.

 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 31 ++++++++++++++-----
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index 68cc6737d45c..135fef6c54b1 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -631,10 +631,30 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_prio = {
 	.clean_stats = mlxsw_sp_setup_tc_qdisc_prio_clean_stats,
 };
 
-/* Grafting is not supported in mlxsw. It will result in un-offloading of the
- * grafted qdisc as well as the qdisc in the qdisc new location.
- * (However, if the graft is to the location where the qdisc is already at, it
- * will be ignored completely and won't cause un-offloading).
+/* Linux allows linking of Qdiscs to arbitrary classes (so long as the resulting
+ * graph is free of cycles). These operations do not change the parent handle
+ * though, which means it can be incomplete (if there is more than one class
+ * where the Qdisc in question is grafted) or outright wrong (if the Qdisc was
+ * linked to a different class and then removed from the original class).
+ *
+ * E.g. consider this sequence of operations:
+ *
+ *  # tc qdisc add dev swp1 root handle 1: prio
+ *  # tc qdisc add dev swp1 parent 1:3 handle 13: red limit 1000000 avpkt 10000
+ *  RED: set bandwidth to 10Mbit
+ *  # tc qdisc link dev swp1 handle 13: parent 1:2
+ *
+ * At this point, both 1:2 and 1:3 have the same RED Qdisc instance as their
+ * child. But RED will still only claim that 1:3 is its parent. If it's removed
+ * from that band, its only parent will be 1:2, but it will continue to claim
+ * that it is in fact 1:3.
+ *
+ * The notification for child Qdisc replace (e.g. TC_RED_REPLACE) comes before
+ * the notification for parent graft (e.g. TC_PRIO_GRAFT). We take the replace
+ * notification to offload the child Qdisc, based on its parent handle, and use
+ * the graft operation to validate that the class where the child is actually
+ * grafted corresponds to the parent handle. If the two don't match, we
+ * unoffload the child.
  */
 static int
 mlxsw_sp_qdisc_prio_graft(struct mlxsw_sp_port *mlxsw_sp_port,
@@ -644,9 +664,6 @@ mlxsw_sp_qdisc_prio_graft(struct mlxsw_sp_port *mlxsw_sp_port,
 	int tclass_num = MLXSW_SP_PRIO_BAND_TO_TCLASS(p->band);
 	struct mlxsw_sp_qdisc *old_qdisc;
 
-	/* Check if the grafted qdisc is already in its "new" location. If so -
-	 * nothing needs to be done.
-	 */
 	if (p->band < IEEE_8021QAZ_MAX_TCS &&
 	    mlxsw_sp_port->tclass_qdiscs[tclass_num].handle == p->child_handle)
 		return 0;
-- 
2.20.1



* [PATCH net-next mlxsw v2 03/10] mlxsw: spectrum: Rename MLXSW_REG_QEEC_HIERARCY_* enumerators
  2019-12-18 14:55 [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS Petr Machata
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 01/10] net: pkt_cls: Clarify a comment Petr Machata
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 02/10] mlxsw: spectrum_qdisc: " Petr Machata
@ 2019-12-18 14:55 ` Petr Machata
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 04/10] net: sch_ets: Add a new Qdisc Petr Machata
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Petr Machata @ 2019-12-18 14:55 UTC (permalink / raw)
  To: netdev
  Cc: Petr Machata, David Miller, Roopa Prabhu, Jakub Kicinski,
	Roman Mashak, Ido Schimmel, Jiri Pirko, Jiri Pirko

These enums want to be named MLXSW_REG_QEEC_HIERARCHY_, but due to a typo
lack the second H. That is confusing and complicates searching.

But actually the enumerators should be named _HR_, because that is how
their enum type is called. So rename them as appropriate.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
---

Notes:
    v3 (internal):
    - Rename to _HR_ instead of to _HIERARCHY_.

 drivers/net/ethernet/mellanox/mlxsw/reg.h     | 11 +++++------
 .../net/ethernet/mellanox/mlxsw/spectrum.c    | 19 +++++++++----------
 .../ethernet/mellanox/mlxsw/spectrum_dcb.c    |  8 ++++----
 3 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 5294a1622643..86a2d575ae73 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -3477,10 +3477,10 @@ MLXSW_REG_DEFINE(qeec, MLXSW_REG_QEEC_ID, MLXSW_REG_QEEC_LEN);
 MLXSW_ITEM32(reg, qeec, local_port, 0x00, 16, 8);
 
 enum mlxsw_reg_qeec_hr {
-	MLXSW_REG_QEEC_HIERARCY_PORT,
-	MLXSW_REG_QEEC_HIERARCY_GROUP,
-	MLXSW_REG_QEEC_HIERARCY_SUBGROUP,
-	MLXSW_REG_QEEC_HIERARCY_TC,
+	MLXSW_REG_QEEC_HR_PORT,
+	MLXSW_REG_QEEC_HR_GROUP,
+	MLXSW_REG_QEEC_HR_SUBGROUP,
+	MLXSW_REG_QEEC_HR_TC,
 };
 
 /* reg_qeec_element_hierarchy
@@ -3618,8 +3618,7 @@ static inline void mlxsw_reg_qeec_ptps_pack(char *payload, u8 local_port,
 {
 	MLXSW_REG_ZERO(qeec, payload);
 	mlxsw_reg_qeec_local_port_set(payload, local_port);
-	mlxsw_reg_qeec_element_hierarchy_set(payload,
-					     MLXSW_REG_QEEC_HIERARCY_PORT);
+	mlxsw_reg_qeec_element_hierarchy_set(payload, MLXSW_REG_QEEC_HR_PORT);
 	mlxsw_reg_qeec_ptps_set(payload, ptps);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 556dca328bb5..0d8fce749248 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -3602,26 +3602,25 @@ static int mlxsw_sp_port_ets_init(struct mlxsw_sp_port *mlxsw_sp_port)
 	 * one subgroup, which are all member in the same group.
 	 */
 	err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
-				    MLXSW_REG_QEEC_HIERARCY_GROUP, 0, 0, false,
-				    0);
+				    MLXSW_REG_QEEC_HR_GROUP, 0, 0, false, 0);
 	if (err)
 		return err;
 	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
 		err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
-					    MLXSW_REG_QEEC_HIERARCY_SUBGROUP, i,
+					    MLXSW_REG_QEEC_HR_SUBGROUP, i,
 					    0, false, 0);
 		if (err)
 			return err;
 	}
 	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
 		err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
-					    MLXSW_REG_QEEC_HIERARCY_TC, i, i,
+					    MLXSW_REG_QEEC_HR_TC, i, i,
 					    false, 0);
 		if (err)
 			return err;
 
 		err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
-					    MLXSW_REG_QEEC_HIERARCY_TC,
+					    MLXSW_REG_QEEC_HR_TC,
 					    i + 8, i,
 					    true, 100);
 		if (err)
@@ -3633,13 +3632,13 @@ static int mlxsw_sp_port_ets_init(struct mlxsw_sp_port *mlxsw_sp_port)
 	 * for the initial configuration.
 	 */
 	err = mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port,
-					    MLXSW_REG_QEEC_HIERARCY_PORT, 0, 0,
+					    MLXSW_REG_QEEC_HR_PORT, 0, 0,
 					    MLXSW_REG_QEEC_MAS_DIS);
 	if (err)
 		return err;
 	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
 		err = mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port,
-						    MLXSW_REG_QEEC_HIERARCY_SUBGROUP,
+						    MLXSW_REG_QEEC_HR_SUBGROUP,
 						    i, 0,
 						    MLXSW_REG_QEEC_MAS_DIS);
 		if (err)
@@ -3647,14 +3646,14 @@ static int mlxsw_sp_port_ets_init(struct mlxsw_sp_port *mlxsw_sp_port)
 	}
 	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
 		err = mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port,
-						    MLXSW_REG_QEEC_HIERARCY_TC,
+						    MLXSW_REG_QEEC_HR_TC,
 						    i, i,
 						    MLXSW_REG_QEEC_MAS_DIS);
 		if (err)
 			return err;
 
 		err = mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port,
-						    MLXSW_REG_QEEC_HIERARCY_TC,
+						    MLXSW_REG_QEEC_HR_TC,
 						    i + 8, i,
 						    MLXSW_REG_QEEC_MAS_DIS);
 		if (err)
@@ -3664,7 +3663,7 @@ static int mlxsw_sp_port_ets_init(struct mlxsw_sp_port *mlxsw_sp_port)
 	/* Configure the min shaper for multicast TCs. */
 	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
 		err = mlxsw_sp_port_min_bw_set(mlxsw_sp_port,
-					       MLXSW_REG_QEEC_HIERARCY_TC,
+					       MLXSW_REG_QEEC_HR_TC,
 					       i + 8, i,
 					       MLXSW_REG_QEEC_MIS_MIN);
 		if (err)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dcb.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dcb.c
index 21296fa7f7fb..fe3bbba90659 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_dcb.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_dcb.c
@@ -160,7 +160,7 @@ static int __mlxsw_sp_dcbnl_ieee_setets(struct mlxsw_sp_port *mlxsw_sp_port,
 		u8 weight = ets->tc_tx_bw[i];
 
 		err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
-					    MLXSW_REG_QEEC_HIERARCY_SUBGROUP, i,
+					    MLXSW_REG_QEEC_HR_SUBGROUP, i,
 					    0, dwrr, weight);
 		if (err) {
 			netdev_err(dev, "Failed to link subgroup ETS element %d to group\n",
@@ -198,7 +198,7 @@ static int __mlxsw_sp_dcbnl_ieee_setets(struct mlxsw_sp_port *mlxsw_sp_port,
 		u8 weight = my_ets->tc_tx_bw[i];
 
 		err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
-					    MLXSW_REG_QEEC_HIERARCY_SUBGROUP, i,
+					    MLXSW_REG_QEEC_HR_SUBGROUP, i,
 					    0, dwrr, weight);
 	}
 	return err;
@@ -507,7 +507,7 @@ static int mlxsw_sp_dcbnl_ieee_setmaxrate(struct net_device *dev,
 
 	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
 		err = mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port,
-						    MLXSW_REG_QEEC_HIERARCY_SUBGROUP,
+						    MLXSW_REG_QEEC_HR_SUBGROUP,
 						    i, 0,
 						    maxrate->tc_maxrate[i]);
 		if (err) {
@@ -523,7 +523,7 @@ static int mlxsw_sp_dcbnl_ieee_setmaxrate(struct net_device *dev,
 err_port_ets_maxrate_set:
 	for (i--; i >= 0; i--)
 		mlxsw_sp_port_ets_maxrate_set(mlxsw_sp_port,
-					      MLXSW_REG_QEEC_HIERARCY_SUBGROUP,
+					      MLXSW_REG_QEEC_HR_SUBGROUP,
 					      i, 0, my_maxrate->tc_maxrate[i]);
 	return err;
 }
-- 
2.20.1



* [PATCH net-next mlxsw v2 04/10] net: sch_ets: Add a new Qdisc
  2019-12-18 14:55 [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (2 preceding siblings ...)
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 03/10] mlxsw: spectrum: Rename MLXSW_REG_QEEC_HIERARCY_* enumerators Petr Machata
@ 2019-12-18 14:55 ` Petr Machata
  2019-12-18 15:02   ` Jiri Pirko
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 05/10] net: sch_ets: Make the ETS qdisc offloadable Petr Machata
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 17+ messages in thread
From: Petr Machata @ 2019-12-18 14:55 UTC (permalink / raw)
  To: netdev
  Cc: Petr Machata, David Miller, Roopa Prabhu, Jakub Kicinski,
	Roman Mashak, Ido Schimmel, Jiri Pirko

Introduces a new Qdisc, which is based on 802.1Q-2014 wording. It is
PRIO-like in how it is configured, meaning one needs to specify how many
bands there are, how many are strict and how many are dwrr, quanta for the
latter, and priomap.

The new Qdisc operates like the PRIO / DRR combo would when configured as
per the standard. The strict classes, if any, are tried for traffic first.
When there's no traffic in any of the strict queues, the ETS ones (if any)
are treated in the same way as in DRR.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---

Notes:
    v2 (upstream):
    - sch_ets.c: Add a comment with description of the Qdisc and the
      dequeuing algorithm.
    - Kconfig: Add a high-level description to the help blurb.
    
    v3 (internal):
    - pkt_sched.h: Keep all the TCA_ETS_ constants in one enum.
    - pkt_sched.h: Rename TCA_ETS_BANDS to _NBANDS, _STRICT to _NSTRICT,
      _BAND_QUANTUM to _QUANTA_BAND and _PMAP_BAND to _PRIOMAP_BAND.
    - sch_ets.c: Update to reflect the above changes. Add a new policy,
      ets_class_policy, which is used when parsing class changes.
      Currently that policy is the same as the quanta policy, but that
      might change.
    - sch_ets.c: Move MTU handling from ets_quantum_parse() to the one
      caller that makes use of it.
    - sch_ets.c: ets_qdisc_priomap_parse(): WARN_ON_ONCE on invalid
      attribute instead of returning an extack.
    
    v2 (internal):
    - sch_ets: ets_class_is_strict(), ets_class_id(): Constify an argument
    - ets_class_find(): RXTify
    
    v1 (internal):
    - Kconfig: s/sch_drr/sch_ets/ in description
    - sch_ets: Reorder includes to be in alphabetical order
    - sch_ets: ets_quantum_parse(): Rename the return-pointer argument
      from pquantum to quantum, and use it directly, not going through a
      local temporary.
    - sch_ets: ets_qdisc_quanta_parse(): Convert syntax of function
      argument "quanta" from an array to a pointer.
    - sch_ets: ets_qdisc_priomap_parse(): Likewise with "priomap".
    - sch_ets: ets_qdisc_quanta_parse(), ets_qdisc_priomap_parse(): Invoke
      __nla_validate_nested directly instead of nl80211_validate_nested().
    - sch_ets: ets_qdisc_quanta_parse(): WARN_ON_ONCE on invalid attribute
      instead of returning an extack.
    - sch_ets: ets_qdisc_change(): Make the last band the default one for
      unmentioned priomap priorities.
    - sch_ets: Fix a panic when an offloaded child in a bandwidth-sharing
      band notified its ETS parent. (Reported by Alexander Petrovskiy.)
    - sch_ets: When ungrafting, add the newly-created invisible FIFO to
      the Qdisc hash

 include/uapi/linux/pkt_sched.h |  17 +
 net/sched/Kconfig              |  17 +
 net/sched/Makefile             |   1 +
 net/sched/sch_ets.c            | 733 +++++++++++++++++++++++++++++++++
 4 files changed, 768 insertions(+)
 create mode 100644 net/sched/sch_ets.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 9f1a72876212..bf5a5b1dfb0b 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1187,4 +1187,21 @@ enum {
 
 #define TCA_TAPRIO_ATTR_MAX (__TCA_TAPRIO_ATTR_MAX - 1)
 
+/* ETS */
+
+#define TCQ_ETS_MAX_BANDS 16
+
+enum {
+	TCA_ETS_UNSPEC,
+	TCA_ETS_NBANDS,		/* u8 */
+	TCA_ETS_NSTRICT,	/* u8 */
+	TCA_ETS_QUANTA,		/* nested TCA_ETS_QUANTA_BAND */
+	TCA_ETS_QUANTA_BAND,	/* u32 */
+	TCA_ETS_PRIOMAP,	/* nested TCA_ETS_PRIOMAP_BAND */
+	TCA_ETS_PRIOMAP_BAND,	/* u8 */
+	__TCA_ETS_MAX,
+};
+
+#define TCA_ETS_MAX (__TCA_ETS_MAX - 1)
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 2985509147a2..b1e7ec726958 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -409,6 +409,23 @@ config NET_SCH_PLUG
 	  To compile this code as a module, choose M here: the
 	  module will be called sch_plug.
 
+config NET_SCH_ETS
+	tristate "Enhanced transmission selection scheduler (ETS)"
+	help
+          The Enhanced Transmission Selection scheduler is a classful
+          queuing discipline that merges functionality of PRIO and DRR
+          qdiscs in one scheduler. ETS makes it easy to configure a set of
+          strict and bandwidth-sharing bands to implement the transmission
+          selection described in 802.1Qaz.
+
+	  Say Y here if you want to use the ETS packet scheduling
+	  algorithm.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_ets.
+
+	  If unsure, say N.
+
 menuconfig NET_SCH_DEFAULT
 	bool "Allow override default queue discipline"
 	---help---
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 415d1e1f237e..bc8856b865ff 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_NET_SCH_ATM)	+= sch_atm.o
 obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem.o
 obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
 obj-$(CONFIG_NET_SCH_PLUG)	+= sch_plug.o
+obj-$(CONFIG_NET_SCH_ETS)	+= sch_ets.o
 obj-$(CONFIG_NET_SCH_MQPRIO)	+= sch_mqprio.o
 obj-$(CONFIG_NET_SCH_SKBPRIO)	+= sch_skbprio.o
 obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
diff --git a/net/sched/sch_ets.c b/net/sched/sch_ets.c
new file mode 100644
index 000000000000..e6194b23e9b0
--- /dev/null
+++ b/net/sched/sch_ets.c
@@ -0,0 +1,733 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * net/sched/sch_ets.c         Enhanced Transmission Selection scheduler
+ *
+ * Description
+ * -----------
+ *
+ * The Enhanced Transmission Selection scheduler is a classful queuing
+ * discipline that merges functionality of PRIO and DRR qdiscs in one scheduler.
+ * ETS makes it easy to configure a set of strict and bandwidth-sharing bands to
+ * implement the transmission selection described in 802.1Qaz.
+ *
+ * Although ETS is technically classful, it's not possible to add and remove
+ * classes at will. Instead one specifies the number of classes, how many are
+ * PRIO-like and how many DRR-like, and the quanta for the latter.
+ *
+ * Algorithm
+ * ---------
+ *
+ * The strict classes, if any, are tried for traffic first: first band 0, if it
+ * has no traffic then band 1, etc.
+ *
+ * When there is no traffic in any of the strict queues, the bandwidth-sharing
+ * ones are tried next. Each band is assigned a deficit counter, initialized to
+ * "quantum" of that band. ETS maintains a list of active bandwidth-sharing
+ * bands whose qdiscs are non-empty. A packet is dequeued from the band at the
+ * head of the list if the packet size is smaller than or equal to the deficit
+ * counter. If the counter is too small, it is increased by "quantum" and the
+ * scheduler moves on to the next band in the active list.
+ */
+
+#include <linux/module.h>
+#include <net/gen_stats.h>
+#include <net/netlink.h>
+#include <net/pkt_cls.h>
+#include <net/pkt_sched.h>
+#include <net/sch_generic.h>
+
+struct ets_class {
+	struct list_head alist; /* In struct ets_sched.active. */
+	struct Qdisc *qdisc;
+	u32 quantum;
+	u32 deficit;
+	struct gnet_stats_basic_packed bstats;
+	struct gnet_stats_queue qstats;
+};
+
+struct ets_sched {
+	struct list_head active;
+	struct tcf_proto __rcu *filter_list;
+	struct tcf_block *block;
+	unsigned int nbands;
+	unsigned int nstrict;
+	u8 prio2band[TC_PRIO_MAX + 1];
+	struct ets_class classes[TCQ_ETS_MAX_BANDS];
+};
+
+static const struct nla_policy ets_policy[TCA_ETS_MAX + 1] = {
+	[TCA_ETS_NBANDS] = { .type = NLA_U8 },
+	[TCA_ETS_NSTRICT] = { .type = NLA_U8 },
+	[TCA_ETS_QUANTA] = { .type = NLA_NESTED },
+	[TCA_ETS_PRIOMAP] = { .type = NLA_NESTED },
+};
+
+static const struct nla_policy ets_priomap_policy[TCA_ETS_MAX + 1] = {
+	[TCA_ETS_PRIOMAP_BAND] = { .type = NLA_U8 },
+};
+
+static const struct nla_policy ets_quanta_policy[TCA_ETS_MAX + 1] = {
+	[TCA_ETS_QUANTA_BAND] = { .type = NLA_U32 },
+};
+
+static const struct nla_policy ets_class_policy[TCA_ETS_MAX + 1] = {
+	[TCA_ETS_QUANTA_BAND] = { .type = NLA_U32 },
+};
+
+static int ets_quantum_parse(struct Qdisc *sch, const struct nlattr *attr,
+			     unsigned int *quantum,
+			     struct netlink_ext_ack *extack)
+{
+	*quantum = nla_get_u32(attr);
+	if (!*quantum) {
+		NL_SET_ERR_MSG(extack, "ETS quantum cannot be zero");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static struct ets_class *
+ets_class_from_arg(struct Qdisc *sch, unsigned long arg)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+
+	return &q->classes[arg - 1];
+}
+
+static u32 ets_class_id(struct Qdisc *sch, const struct ets_class *cl)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	int band = cl - q->classes;
+
+	return TC_H_MAKE(sch->handle, band + 1);
+}
+
+static bool ets_class_is_strict(struct ets_sched *q, const struct ets_class *cl)
+{
+	unsigned int band = cl - q->classes;
+
+	return band < q->nstrict;
+}
+
+static int ets_class_change(struct Qdisc *sch, u32 classid, u32 parentid,
+			    struct nlattr **tca, unsigned long *arg,
+			    struct netlink_ext_ack *extack)
+{
+	struct ets_class *cl = ets_class_from_arg(sch, *arg);
+	struct ets_sched *q = qdisc_priv(sch);
+	struct nlattr *opt = tca[TCA_OPTIONS];
+	struct nlattr *tb[TCA_ETS_MAX + 1];
+	unsigned int quantum;
+	int err;
+
+	/* Classes can be added and removed only through Qdisc_ops.change
+	 * interface.
+	 */
+	if (!cl) {
+		NL_SET_ERR_MSG(extack, "Fine-grained class addition and removal is not supported");
+		return -EOPNOTSUPP;
+	}
+
+	if (!opt) {
+		NL_SET_ERR_MSG(extack, "ETS options are required for this operation");
+		return -EINVAL;
+	}
+
+	err = nla_parse_nested(tb, TCA_ETS_MAX, opt, ets_class_policy, extack);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_ETS_QUANTA_BAND])
+		/* Nothing to configure. */
+		return 0;
+
+	if (ets_class_is_strict(q, cl)) {
+		NL_SET_ERR_MSG(extack, "Strict bands do not have a configurable quantum");
+		return -EINVAL;
+	}
+
+	err = ets_quantum_parse(sch, tb[TCA_ETS_QUANTA_BAND], &quantum,
+				extack);
+	if (err)
+		return err;
+
+	sch_tree_lock(sch);
+	cl->quantum = quantum;
+	sch_tree_unlock(sch);
+	return 0;
+}
+
+static int ets_class_graft(struct Qdisc *sch, unsigned long arg,
+			   struct Qdisc *new, struct Qdisc **old,
+			   struct netlink_ext_ack *extack)
+{
+	struct ets_class *cl = ets_class_from_arg(sch, arg);
+
+	if (!new) {
+		new = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+					ets_class_id(sch, cl), NULL);
+		if (!new)
+			new = &noop_qdisc;
+		else
+			qdisc_hash_add(new, true);
+	}
+
+	*old = qdisc_replace(sch, new, &cl->qdisc);
+	return 0;
+}
+
+static struct Qdisc *ets_class_leaf(struct Qdisc *sch, unsigned long arg)
+{
+	struct ets_class *cl = ets_class_from_arg(sch, arg);
+
+	return cl->qdisc;
+}
+
+static unsigned long ets_class_find(struct Qdisc *sch, u32 classid)
+{
+	unsigned long band = TC_H_MIN(classid);
+	struct ets_sched *q = qdisc_priv(sch);
+
+	if (band - 1 >= q->nbands)
+		return 0;
+	return band;
+}
+
+static void ets_class_qlen_notify(struct Qdisc *sch, unsigned long arg)
+{
+	struct ets_class *cl = ets_class_from_arg(sch, arg);
+	struct ets_sched *q = qdisc_priv(sch);
+
+	/* We get notified about zero-length child Qdiscs as well if they are
+	 * offloaded. Those aren't on the active list though, so don't attempt
+	 * to remove them.
+	 */
+	if (!ets_class_is_strict(q, cl) && sch->q.qlen)
+		list_del(&cl->alist);
+}
+
+static int ets_class_dump(struct Qdisc *sch, unsigned long arg,
+			  struct sk_buff *skb, struct tcmsg *tcm)
+{
+	struct ets_class *cl = ets_class_from_arg(sch, arg);
+	struct ets_sched *q = qdisc_priv(sch);
+	struct nlattr *nest;
+
+	tcm->tcm_parent = TC_H_ROOT;
+	tcm->tcm_handle = ets_class_id(sch, cl);
+	tcm->tcm_info = cl->qdisc->handle;
+
+	nest = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	if (!nest)
+		goto nla_put_failure;
+	if (!ets_class_is_strict(q, cl)) {
+		if (nla_put_u32(skb, TCA_ETS_QUANTA_BAND, cl->quantum))
+			goto nla_put_failure;
+	}
+	return nla_nest_end(skb, nest);
+
+nla_put_failure:
+	nla_nest_cancel(skb, nest);
+	return -EMSGSIZE;
+}
+
+static int ets_class_dump_stats(struct Qdisc *sch, unsigned long arg,
+				struct gnet_dump *d)
+{
+	struct ets_class *cl = ets_class_from_arg(sch, arg);
+	struct Qdisc *cl_q = cl->qdisc;
+
+	if (gnet_stats_copy_basic(qdisc_root_sleeping_running(sch),
+				  d, NULL, &cl_q->bstats) < 0 ||
+	    qdisc_qstats_copy(d, cl_q) < 0)
+		return -1;
+
+	return 0;
+}
+
+static void ets_qdisc_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	int i;
+
+	if (arg->stop)
+		return;
+
+	for (i = 0; i < q->nbands; i++) {
+		if (arg->count < arg->skip) {
+			arg->count++;
+			continue;
+		}
+		if (arg->fn(sch, i + 1, arg) < 0) {
+			arg->stop = 1;
+			break;
+		}
+		arg->count++;
+	}
+}
+
+static struct tcf_block *
+ets_qdisc_tcf_block(struct Qdisc *sch, unsigned long cl,
+		    struct netlink_ext_ack *extack)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+
+	if (cl) {
+		NL_SET_ERR_MSG(extack, "ETS classid must be zero");
+		return NULL;
+	}
+
+	return q->block;
+}
+
+static unsigned long ets_qdisc_bind_tcf(struct Qdisc *sch, unsigned long parent,
+					u32 classid)
+{
+	return ets_class_find(sch, classid);
+}
+
+static void ets_qdisc_unbind_tcf(struct Qdisc *sch, unsigned long arg)
+{
+}
+
+static struct ets_class *ets_classify(struct sk_buff *skb, struct Qdisc *sch,
+				      int *qerr)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	u32 band = skb->priority;
+	struct tcf_result res;
+	struct tcf_proto *fl;
+	int err;
+
+	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+	if (TC_H_MAJ(skb->priority) != sch->handle) {
+		fl = rcu_dereference_bh(q->filter_list);
+		err = tcf_classify(skb, fl, &res, false);
+#ifdef CONFIG_NET_CLS_ACT
+		switch (err) {
+		case TC_ACT_STOLEN:
+		case TC_ACT_QUEUED:
+		case TC_ACT_TRAP:
+			*qerr = NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
+			/* fall through */
+		case TC_ACT_SHOT:
+			return NULL;
+		}
+#endif
+		if (!fl || err < 0) {
+			if (TC_H_MAJ(band))
+				band = 0;
+			return &q->classes[q->prio2band[band & TC_PRIO_MAX]];
+		}
+		band = res.classid;
+	}
+	band = TC_H_MIN(band) - 1;
+	if (band >= q->nbands)
+		return &q->classes[q->prio2band[0]];
+	return &q->classes[band];
+}
+
+static int ets_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			     struct sk_buff **to_free)
+{
+	unsigned int len = qdisc_pkt_len(skb);
+	struct ets_sched *q = qdisc_priv(sch);
+	struct ets_class *cl;
+	int err = 0;
+	bool first;
+
+	cl = ets_classify(skb, sch, &err);
+	if (!cl) {
+		if (err & __NET_XMIT_BYPASS)
+			qdisc_qstats_drop(sch);
+		__qdisc_drop(skb, to_free);
+		return err;
+	}
+
+	first = !cl->qdisc->q.qlen;
+	err = qdisc_enqueue(skb, cl->qdisc, to_free);
+	if (unlikely(err != NET_XMIT_SUCCESS)) {
+		if (net_xmit_drop_count(err)) {
+			cl->qstats.drops++;
+			qdisc_qstats_drop(sch);
+		}
+		return err;
+	}
+
+	if (first && !ets_class_is_strict(q, cl)) {
+		list_add_tail(&cl->alist, &q->active);
+		cl->deficit = cl->quantum;
+	}
+
+	sch->qstats.backlog += len;
+	sch->q.qlen++;
+	return err;
+}
+
+static struct sk_buff *
+ets_qdisc_dequeue_skb(struct Qdisc *sch, struct sk_buff *skb)
+{
+	qdisc_bstats_update(sch, skb);
+	qdisc_qstats_backlog_dec(sch, skb);
+	sch->q.qlen--;
+	return skb;
+}
+
+static struct sk_buff *ets_qdisc_dequeue(struct Qdisc *sch)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	struct ets_class *cl;
+	struct sk_buff *skb;
+	unsigned int band;
+	unsigned int len;
+
+	while (1) {
+		for (band = 0; band < q->nstrict; band++) {
+			cl = &q->classes[band];
+			skb = qdisc_dequeue_peeked(cl->qdisc);
+			if (skb)
+				return ets_qdisc_dequeue_skb(sch, skb);
+		}
+
+		if (list_empty(&q->active))
+			goto out;
+
+		cl = list_first_entry(&q->active, struct ets_class, alist);
+		skb = cl->qdisc->ops->peek(cl->qdisc);
+		if (!skb) {
+			qdisc_warn_nonwc(__func__, cl->qdisc);
+			goto out;
+		}
+
+		len = qdisc_pkt_len(skb);
+		if (len <= cl->deficit) {
+			cl->deficit -= len;
+			skb = qdisc_dequeue_peeked(cl->qdisc);
+			if (unlikely(!skb))
+				goto out;
+			if (cl->qdisc->q.qlen == 0)
+				list_del(&cl->alist);
+			return ets_qdisc_dequeue_skb(sch, skb);
+		}
+
+		cl->deficit += cl->quantum;
+		list_move_tail(&cl->alist, &q->active);
+	}
+out:
+	return NULL;
+}
+
+static int ets_qdisc_priomap_parse(struct nlattr *priomap_attr,
+				   unsigned int nbands, u8 *priomap,
+				   struct netlink_ext_ack *extack)
+{
+	const struct nlattr *attr;
+	int prio = 0;
+	u8 band;
+	int rem;
+	int err;
+
+	err = __nla_validate_nested(priomap_attr, TCA_ETS_MAX,
+				    ets_priomap_policy, NL_VALIDATE_STRICT,
+				    extack);
+	if (err)
+		return err;
+
+	nla_for_each_nested(attr, priomap_attr, rem) {
+		switch (nla_type(attr)) {
+		case TCA_ETS_PRIOMAP_BAND:
+			if (prio > TC_PRIO_MAX) {
+				NL_SET_ERR_MSG_MOD(extack, "Too many priorities in ETS priomap");
+				return -EINVAL;
+			}
+			band = nla_get_u8(attr);
+			if (band >= nbands) {
+				NL_SET_ERR_MSG_MOD(extack, "Invalid band number in ETS priomap");
+				return -EINVAL;
+			}
+			priomap[prio++] = band;
+			break;
+		default:
+			WARN_ON_ONCE(1); /* Validate should have caught this. */
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int ets_qdisc_quanta_parse(struct Qdisc *sch, struct nlattr *quanta_attr,
+				  unsigned int nbands, unsigned int nstrict,
+				  unsigned int *quanta,
+				  struct netlink_ext_ack *extack)
+{
+	const struct nlattr *attr;
+	int band = nstrict;
+	int rem;
+	int err;
+
+	err = __nla_validate_nested(quanta_attr, TCA_ETS_MAX,
+				    ets_quanta_policy, NL_VALIDATE_STRICT,
+				    extack);
+	if (err < 0)
+		return err;
+
+	nla_for_each_nested(attr, quanta_attr, rem) {
+		switch (nla_type(attr)) {
+		case TCA_ETS_QUANTA_BAND:
+			if (band >= nbands) {
+				NL_SET_ERR_MSG_MOD(extack, "ETS quanta has more values than bands");
+				return -EINVAL;
+			}
+			err = ets_quantum_parse(sch, attr, &quanta[band++],
+						extack);
+			if (err)
+				return err;
+			break;
+		default:
+			WARN_ON_ONCE(1); /* Validate should have caught this. */
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int ets_qdisc_change(struct Qdisc *sch, struct nlattr *opt,
+			    struct netlink_ext_ack *extack)
+{
+	unsigned int quanta[TCQ_ETS_MAX_BANDS] = {0};
+	struct Qdisc *queues[TCQ_ETS_MAX_BANDS];
+	struct ets_sched *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_ETS_MAX + 1];
+	unsigned int oldbands = q->nbands;
+	u8 priomap[TC_PRIO_MAX + 1];
+	unsigned int nstrict = 0;
+	unsigned int nbands;
+	unsigned int i;
+	int err;
+
+	if (!opt) {
+		NL_SET_ERR_MSG(extack, "ETS options are required for this operation");
+		return -EINVAL;
+	}
+
+	err = nla_parse_nested(tb, TCA_ETS_MAX, opt, ets_policy, extack);
+	if (err < 0)
+		return err;
+
+	if (!tb[TCA_ETS_NBANDS]) {
+		NL_SET_ERR_MSG_MOD(extack, "Number of bands is a required argument");
+		return -EINVAL;
+	}
+	nbands = nla_get_u8(tb[TCA_ETS_NBANDS]);
+	if (nbands < 1 || nbands > TCQ_ETS_MAX_BANDS) {
+		NL_SET_ERR_MSG_MOD(extack, "Invalid number of bands");
+		return -EINVAL;
+	}
+	/* Unless overridden, traffic goes to the last band. */
+	memset(priomap, nbands - 1, sizeof(priomap));
+
+	if (tb[TCA_ETS_NSTRICT]) {
+		nstrict = nla_get_u8(tb[TCA_ETS_NSTRICT]);
+		if (nstrict > nbands) {
+			NL_SET_ERR_MSG_MOD(extack, "Invalid number of strict bands");
+			return -EINVAL;
+		}
+	}
+
+	if (tb[TCA_ETS_PRIOMAP]) {
+		err = ets_qdisc_priomap_parse(tb[TCA_ETS_PRIOMAP],
+					      nbands, priomap, extack);
+		if (err)
+			return err;
+	}
+
+	if (tb[TCA_ETS_QUANTA]) {
+		err = ets_qdisc_quanta_parse(sch, tb[TCA_ETS_QUANTA],
+					     nbands, nstrict, quanta, extack);
+		if (err)
+			return err;
+	}
+	/* If there are more bands than strict + quanta provided, the remaining
+	 * ones are ETS with quantum of MTU. Initialize the missing values here.
+	 */
+	for (i = nstrict; i < nbands; i++) {
+		if (!quanta[i])
+			quanta[i] = psched_mtu(qdisc_dev(sch));
+	}
+
+	/* Before commit, make sure we can allocate all new qdiscs */
+	for (i = oldbands; i < nbands; i++) {
+		queues[i] = qdisc_create_dflt(sch->dev_queue, &pfifo_qdisc_ops,
+					      ets_class_id(sch, &q->classes[i]),
+					      extack);
+		if (!queues[i]) {
+			while (i > oldbands)
+				qdisc_put(queues[--i]);
+			return -ENOMEM;
+		}
+	}
+
+	sch_tree_lock(sch);
+
+	q->nbands = nbands;
+	q->nstrict = nstrict;
+	memcpy(q->prio2band, priomap, sizeof(priomap));
+
+	for (i = q->nbands; i < oldbands; i++)
+		qdisc_tree_flush_backlog(q->classes[i].qdisc);
+
+	for (i = 0; i < q->nbands; i++)
+		q->classes[i].quantum = quanta[i];
+
+	for (i = oldbands; i < q->nbands; i++) {
+		q->classes[i].qdisc = queues[i];
+		if (q->classes[i].qdisc != &noop_qdisc)
+			qdisc_hash_add(q->classes[i].qdisc, true);
+	}
+
+	sch_tree_unlock(sch);
+
+	for (i = q->nbands; i < oldbands; i++) {
+		qdisc_put(q->classes[i].qdisc);
+		memset(&q->classes[i], 0, sizeof(q->classes[i]));
+	}
+	return 0;
+}
+
+static int ets_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
+			  struct netlink_ext_ack *extack)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+
+	err = tcf_block_get(&q->block, &q->filter_list, sch, extack);
+	if (err)
+		return err;
+
+	INIT_LIST_HEAD(&q->active);
+	return ets_qdisc_change(sch, opt, extack);
+}
+
+static void ets_qdisc_reset(struct Qdisc *sch)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	int band;
+
+	for (band = q->nstrict; band < q->nbands; band++) {
+		if (q->classes[band].qdisc->q.qlen)
+			list_del(&q->classes[band].alist);
+	}
+	for (band = 0; band < q->nbands; band++)
+		qdisc_reset(q->classes[band].qdisc);
+	sch->qstats.backlog = 0;
+	sch->q.qlen = 0;
+}
+
+static void ets_qdisc_destroy(struct Qdisc *sch)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	int band;
+
+	tcf_block_put(q->block);
+	for (band = 0; band < q->nbands; band++)
+		qdisc_put(q->classes[band].qdisc);
+}
+
+static int ets_qdisc_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct ets_sched *q = qdisc_priv(sch);
+	struct nlattr *opts;
+	struct nlattr *nest;
+	int band;
+	int prio;
+
+	opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
+	if (!opts)
+		goto nla_err;
+
+	if (nla_put_u8(skb, TCA_ETS_NBANDS, q->nbands))
+		goto nla_err;
+
+	if (q->nstrict &&
+	    nla_put_u8(skb, TCA_ETS_NSTRICT, q->nstrict))
+		goto nla_err;
+
+	if (q->nbands > q->nstrict) {
+		nest = nla_nest_start(skb, TCA_ETS_QUANTA);
+		if (!nest)
+			goto nla_err;
+
+		for (band = q->nstrict; band < q->nbands; band++) {
+			if (nla_put_u32(skb, TCA_ETS_QUANTA_BAND,
+					q->classes[band].quantum))
+				goto nla_err;
+		}
+
+		nla_nest_end(skb, nest);
+	}
+
+	nest = nla_nest_start(skb, TCA_ETS_PRIOMAP);
+	if (!nest)
+		goto nla_err;
+
+	for (prio = 0; prio <= TC_PRIO_MAX; prio++) {
+		if (nla_put_u8(skb, TCA_ETS_PRIOMAP_BAND, q->prio2band[prio]))
+			goto nla_err;
+	}
+
+	nla_nest_end(skb, nest);
+
+	return nla_nest_end(skb, opts);
+
+nla_err:
+	nla_nest_cancel(skb, opts);
+	return -EMSGSIZE;
+}
+
+static const struct Qdisc_class_ops ets_class_ops = {
+	.change		= ets_class_change,
+	.graft		= ets_class_graft,
+	.leaf		= ets_class_leaf,
+	.find		= ets_class_find,
+	.qlen_notify	= ets_class_qlen_notify,
+	.dump		= ets_class_dump,
+	.dump_stats	= ets_class_dump_stats,
+	.walk		= ets_qdisc_walk,
+	.tcf_block	= ets_qdisc_tcf_block,
+	.bind_tcf	= ets_qdisc_bind_tcf,
+	.unbind_tcf	= ets_qdisc_unbind_tcf,
+};
+
+static struct Qdisc_ops ets_qdisc_ops __read_mostly = {
+	.cl_ops		= &ets_class_ops,
+	.id		= "ets",
+	.priv_size	= sizeof(struct ets_sched),
+	.enqueue	= ets_qdisc_enqueue,
+	.dequeue	= ets_qdisc_dequeue,
+	.peek		= qdisc_peek_dequeued,
+	.change		= ets_qdisc_change,
+	.init		= ets_qdisc_init,
+	.reset		= ets_qdisc_reset,
+	.destroy	= ets_qdisc_destroy,
+	.dump		= ets_qdisc_dump,
+	.owner		= THIS_MODULE,
+};
+
+static int __init ets_init(void)
+{
+	return register_qdisc(&ets_qdisc_ops);
+}
+
+static void __exit ets_exit(void)
+{
+	unregister_qdisc(&ets_qdisc_ops);
+}
+
+module_init(ets_init);
+module_exit(ets_exit);
+MODULE_LICENSE("GPL");
-- 
2.20.1



* [PATCH net-next mlxsw v2 05/10] net: sch_ets: Make the ETS qdisc offloadable
  2019-12-18 14:55 [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (3 preceding siblings ...)
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 04/10] net: sch_ets: Add a new Qdisc Petr Machata
@ 2019-12-18 14:55 ` Petr Machata
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 06/10] mlxsw: spectrum_qdisc: Generalize PRIO offload to support ETS Petr Machata
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Petr Machata @ 2019-12-18 14:55 UTC (permalink / raw)
  To: netdev
  Cc: Petr Machata, David Miller, Roopa Prabhu, Jakub Kicinski,
	Roman Mashak, Ido Schimmel, Jiri Pirko, Jiri Pirko

Add hooks at appropriate points to make it possible to offload the ETS
Qdisc.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
---

Notes:
    v1 (internal):
    - pkt_cls.h: Note that quantum=0 signifies a strict band.
    - Fix error path handling when ets_offload_dump() fails.

 include/linux/netdevice.h |  1 +
 include/net/pkt_cls.h     | 31 +++++++++++++
 net/sched/sch_ets.c       | 95 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 127 insertions(+)
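
As an illustration of the weights that ets_offload_change() hands to the
driver, the loop can be lifted into a standalone function. The sketch below
mirrors the arithmetic of the patch; the function name quanta_to_weights and
the assertion are hypothetical additions made for this example only. The
point of working on prefix sums is that the rounding error is carried
forward, so the weights of the bandwidth-sharing bands add up to exactly 100.

```c
#include <assert.h>

/* Hypothetical helper mirroring the weight loop in ets_offload_change(). */
static void quanta_to_weights(const unsigned int *quanta, unsigned int nbands,
			      unsigned int *weights)
{
	unsigned int w_psum_prev = 0;
	unsigned int q_psum = 0;
	unsigned int q_sum = 0;
	unsigned int i;

	for (i = 0; i < nbands; i++)
		q_sum += quanta[i];

	for (i = 0; i < nbands; i++) {
		unsigned int quantum = quanta[i];
		unsigned int w_psum;

		/* Assumption of this model: strict (zero-quantum) bands
		 * come before all bandwidth-sharing bands.
		 */
		assert(quantum || !w_psum_prev);

		q_psum += quantum;
		/* Percentage of the total reached so far, rounded down. */
		w_psum = quantum ? q_psum * 100 / q_sum : 0;
		weights[i] = w_psum - w_psum_prev;
		w_psum_prev = w_psum;
	}
}
```

For two strict bands followed by three bands with a quantum of 3000 each,
this yields weights of 0, 0, 33, 33 and 34.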

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 30745068fb39..7a8ed11f5d45 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -849,6 +849,7 @@ enum tc_setup_type {
 	TC_SETUP_QDISC_GRED,
 	TC_SETUP_QDISC_TAPRIO,
 	TC_SETUP_FT,
+	TC_SETUP_QDISC_ETS,
 };
 
 /* These structures hold the attributes of bpf state that are being passed
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a7c5d492bc04..47b115e2012a 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -823,4 +823,35 @@ struct tc_root_qopt_offload {
 	bool ingress;
 };
 
+enum tc_ets_command {
+	TC_ETS_REPLACE,
+	TC_ETS_DESTROY,
+	TC_ETS_STATS,
+	TC_ETS_GRAFT,
+};
+
+struct tc_ets_qopt_offload_replace_params {
+	unsigned int bands;
+	u8 priomap[TC_PRIO_MAX + 1];
+	unsigned int quanta[TCQ_ETS_MAX_BANDS];	/* 0 for strict bands. */
+	unsigned int weights[TCQ_ETS_MAX_BANDS];
+	struct gnet_stats_queue *qstats;
+};
+
+struct tc_ets_qopt_offload_graft_params {
+	u8 band;
+	u32 child_handle;
+};
+
+struct tc_ets_qopt_offload {
+	enum tc_ets_command command;
+	u32 handle;
+	u32 parent;
+	union {
+		struct tc_ets_qopt_offload_replace_params replace_params;
+		struct tc_qopt_offload_stats stats;
+		struct tc_ets_qopt_offload_graft_params graft_params;
+	};
+};
+
 #endif
diff --git a/net/sched/sch_ets.c b/net/sched/sch_ets.c
index e6194b23e9b0..a87e9159338c 100644
--- a/net/sched/sch_ets.c
+++ b/net/sched/sch_ets.c
@@ -102,6 +102,91 @@ static u32 ets_class_id(struct Qdisc *sch, const struct ets_class *cl)
 	return TC_H_MAKE(sch->handle, band + 1);
 }
 
+static void ets_offload_change(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct ets_sched *q = qdisc_priv(sch);
+	struct tc_ets_qopt_offload qopt;
+	unsigned int w_psum_prev = 0;
+	unsigned int q_psum = 0;
+	unsigned int q_sum = 0;
+	unsigned int quantum;
+	unsigned int w_psum;
+	unsigned int weight;
+	unsigned int i;
+
+	if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc)
+		return;
+
+	qopt.command = TC_ETS_REPLACE;
+	qopt.handle = sch->handle;
+	qopt.parent = sch->parent;
+	qopt.replace_params.bands = q->nbands;
+	qopt.replace_params.qstats = &sch->qstats;
+	memcpy(&qopt.replace_params.priomap,
+	       q->prio2band, sizeof(q->prio2band));
+
+	for (i = 0; i < q->nbands; i++)
+		q_sum += q->classes[i].quantum;
+
+	for (i = 0; i < q->nbands; i++) {
+		quantum = q->classes[i].quantum;
+		q_psum += quantum;
+		w_psum = quantum ? q_psum * 100 / q_sum : 0;
+		weight = w_psum - w_psum_prev;
+		w_psum_prev = w_psum;
+
+		qopt.replace_params.quanta[i] = quantum;
+		qopt.replace_params.weights[i] = weight;
+	}
+
+	dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_ETS, &qopt);
+}
+
+static void ets_offload_destroy(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct tc_ets_qopt_offload qopt;
+
+	if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc)
+		return;
+
+	qopt.command = TC_ETS_DESTROY;
+	qopt.handle = sch->handle;
+	qopt.parent = sch->parent;
+	dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_ETS, &qopt);
+}
+
+static void ets_offload_graft(struct Qdisc *sch, struct Qdisc *new,
+			      struct Qdisc *old, unsigned long arg,
+			      struct netlink_ext_ack *extack)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct tc_ets_qopt_offload qopt;
+
+	qopt.command = TC_ETS_GRAFT;
+	qopt.handle = sch->handle;
+	qopt.parent = sch->parent;
+	qopt.graft_params.band = arg - 1;
+	qopt.graft_params.child_handle = new->handle;
+
+	qdisc_offload_graft_helper(dev, sch, new, old, TC_SETUP_QDISC_ETS,
+				   &qopt, extack);
+}
+
+static int ets_offload_dump(struct Qdisc *sch)
+{
+	struct tc_ets_qopt_offload qopt;
+
+	qopt.command = TC_ETS_STATS;
+	qopt.handle = sch->handle;
+	qopt.parent = sch->parent;
+	qopt.stats.bstats = &sch->bstats;
+	qopt.stats.qstats = &sch->qstats;
+
+	return qdisc_offload_dump_helper(sch, TC_SETUP_QDISC_ETS, &qopt);
+}
+
 static bool ets_class_is_strict(struct ets_sched *q, const struct ets_class *cl)
 {
 	unsigned int band = cl - q->classes;
@@ -154,6 +239,8 @@ static int ets_class_change(struct Qdisc *sch, u32 classid, u32 parentid,
 	sch_tree_lock(sch);
 	cl->quantum = quantum;
 	sch_tree_unlock(sch);
+
+	ets_offload_change(sch);
 	return 0;
 }
 
@@ -173,6 +260,7 @@ static int ets_class_graft(struct Qdisc *sch, unsigned long arg,
 	}
 
 	*old = qdisc_replace(sch, new, &cl->qdisc);
+	ets_offload_graft(sch, new, *old, arg, extack);
 	return 0;
 }
 
@@ -589,6 +677,7 @@ static int ets_qdisc_change(struct Qdisc *sch, struct nlattr *opt,
 
 	sch_tree_unlock(sch);
 
+	ets_offload_change(sch);
 	for (i = q->nbands; i < oldbands; i++) {
 		qdisc_put(q->classes[i].qdisc);
 		memset(&q->classes[i], 0, sizeof(q->classes[i]));
@@ -633,6 +722,7 @@ static void ets_qdisc_destroy(struct Qdisc *sch)
 	struct ets_sched *q = qdisc_priv(sch);
 	int band;
 
+	ets_offload_destroy(sch);
 	tcf_block_put(q->block);
 	for (band = 0; band < q->nbands; band++)
 		qdisc_put(q->classes[band].qdisc);
@@ -645,6 +735,11 @@ static int ets_qdisc_dump(struct Qdisc *sch, struct sk_buff *skb)
 	struct nlattr *nest;
 	int band;
 	int prio;
+	int err;
+
+	err = ets_offload_dump(sch);
+	if (err)
+		return err;
 
 	opts = nla_nest_start_noflag(skb, TCA_OPTIONS);
 	if (!opts)
-- 
2.20.1



* [PATCH net-next mlxsw v2 06/10] mlxsw: spectrum_qdisc: Generalize PRIO offload to support ETS
  2019-12-18 14:55 [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (4 preceding siblings ...)
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 05/10] net: sch_ets: Make the ETS qdisc offloadable Petr Machata
@ 2019-12-18 14:55 ` Petr Machata
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 07/10] mlxsw: spectrum_qdisc: Support offloading of ETS Qdisc Petr Machata
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Petr Machata @ 2019-12-18 14:55 UTC (permalink / raw)
  To: netdev
  Cc: Petr Machata, David Miller, Roopa Prabhu, Jakub Kicinski,
	Roman Mashak, Ido Schimmel, Jiri Pirko, Jiri Pirko

Thanks to the similarity between PRIO and ETS, it is possible to reuse
most of the code for offloading the PRIO Qdisc. Extract the common
functionality into separate functions, making the current PRIO handlers
thin API adapters.

Extend the new functions to pass quanta for individual bands, which allows
configuring a subset of bands as WRR. Invoke mlxsw_sp_port_ets_set() as
appropriate to de/configure WRR-ness and weight of individual bands.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
---

Notes:
    v3 (internal):
    - __mlxsw_sp_qdisc_ets_replace(): Pass the weights argument to this
      function in this patch already. Drop the weight computation.
    - mlxsw_sp_qdisc_prio_replace(): Rename "quanta" to "zeroes" and
      pass it for the abovementioned "weights" argument as well.
    - mlxsw_sp_qdisc_prio_graft(): Convert to a wrapper around
      __mlxsw_sp_qdisc_ets_graft(), instead of invoking the latter
      directly from mlxsw_sp_setup_tc_prio().
    - Update to follow the _HIERARCHY_ -> _HR_ renaming.
    
    v1 (internal):
    - __mlxsw_sp_qdisc_ets_replace(): Convert syntax of function arguments
      "quanta" and "priomap" from arrays to pointers.

 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 104 ++++++++++++++----
 1 file changed, 81 insertions(+), 23 deletions(-)
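
To make the adapter idea concrete, here is a minimal userspace sketch of how
a PRIO replace reduces to an ETS replace with all-zero quanta and weights.
It is a model only: band_cfg, model_ets_replace and model_prio_replace are
hypothetical names, and the band-to-tclass mapping, the priomap handling and
the register writes of the real driver are left out. What it does keep is
the rule from this patch that a band is configured as WRR exactly when its
quantum is non-zero.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_BANDS 16

/* Hypothetical per-band result: is the band WRR, and with what weight. */
struct band_cfg {
	bool wrr;
	unsigned int weight;
};

static void model_ets_replace(unsigned int nbands,
			      const unsigned int *quanta,
			      const unsigned int *weights,
			      struct band_cfg *out)
{
	unsigned int band;

	for (band = 0; band < nbands; band++) {
		/* Same test as in the patch: !!quanta[band] selects WRR. */
		out[band].wrr = !!quanta[band];
		out[band].weight = weights[band];
	}
}

/* PRIO has only strict bands, so it passes zeroes for both arrays. */
static void model_prio_replace(unsigned int nbands, struct band_cfg *out)
{
	unsigned int zeroes[MODEL_MAX_BANDS] = {0};

	assert(nbands <= MODEL_MAX_BANDS);
	model_ets_replace(nbands, zeroes, zeroes, out);
}
```

Calling model_prio_replace() leaves every band strict with weight 0, whereas
model_ets_replace() with quanta of 0, 1500 and 3000 marks only the last two
bands as WRR.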

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index 135fef6c54b1..d513af49c0a8 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -471,14 +471,16 @@ int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port,
 }
 
 static int
-mlxsw_sp_qdisc_prio_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
-			    struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
+__mlxsw_sp_qdisc_ets_destroy(struct mlxsw_sp_port *mlxsw_sp_port)
 {
 	int i;
 
 	for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
 		mlxsw_sp_port_prio_tc_set(mlxsw_sp_port, i,
 					  MLXSW_SP_PORT_DEFAULT_TCLASS);
+		mlxsw_sp_port_ets_set(mlxsw_sp_port,
+				      MLXSW_REG_QEEC_HR_SUBGROUP,
+				      i, 0, false, 0);
 		mlxsw_sp_qdisc_destroy(mlxsw_sp_port,
 				       &mlxsw_sp_port->tclass_qdiscs[i]);
 		mlxsw_sp_port->tclass_qdiscs[i].prio_bitmap = 0;
@@ -487,6 +489,22 @@ mlxsw_sp_qdisc_prio_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
 	return 0;
 }
 
+static int
+mlxsw_sp_qdisc_prio_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
+			    struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
+{
+	return __mlxsw_sp_qdisc_ets_destroy(mlxsw_sp_port);
+}
+
+static int
+__mlxsw_sp_qdisc_ets_check_params(unsigned int nbands)
+{
+	if (nbands > IEEE_8021QAZ_MAX_TCS)
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
 static int
 mlxsw_sp_qdisc_prio_check_params(struct mlxsw_sp_port *mlxsw_sp_port,
 				 struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
@@ -494,30 +512,36 @@ mlxsw_sp_qdisc_prio_check_params(struct mlxsw_sp_port *mlxsw_sp_port,
 {
 	struct tc_prio_qopt_offload_params *p = params;
 
-	if (p->bands > IEEE_8021QAZ_MAX_TCS)
-		return -EOPNOTSUPP;
-
-	return 0;
+	return __mlxsw_sp_qdisc_ets_check_params(p->bands);
 }
 
 static int
-mlxsw_sp_qdisc_prio_replace(struct mlxsw_sp_port *mlxsw_sp_port,
-			    struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
-			    void *params)
+__mlxsw_sp_qdisc_ets_replace(struct mlxsw_sp_port *mlxsw_sp_port,
+			     unsigned int nbands,
+			     const unsigned int *quanta,
+			     const unsigned int *weights,
+			     const u8 *priomap)
 {
-	struct tc_prio_qopt_offload_params *p = params;
 	struct mlxsw_sp_qdisc *child_qdisc;
 	int tclass, i, band, backlog;
 	u8 old_priomap;
 	int err;
 
-	for (band = 0; band < p->bands; band++) {
+	for (band = 0; band < nbands; band++) {
 		tclass = MLXSW_SP_PRIO_BAND_TO_TCLASS(band);
 		child_qdisc = &mlxsw_sp_port->tclass_qdiscs[tclass];
 		old_priomap = child_qdisc->prio_bitmap;
 		child_qdisc->prio_bitmap = 0;
+
+		err = mlxsw_sp_port_ets_set(mlxsw_sp_port,
+					    MLXSW_REG_QEEC_HR_SUBGROUP,
+					    tclass, 0, !!quanta[band],
+					    weights[band]);
+		if (err)
+			return err;
+
 		for (i = 0; i < IEEE_8021QAZ_MAX_TCS; i++) {
-			if (p->priomap[i] == band) {
+			if (priomap[i] == band) {
 				child_qdisc->prio_bitmap |= BIT(i);
 				if (BIT(i) & old_priomap)
 					continue;
@@ -540,21 +564,46 @@ mlxsw_sp_qdisc_prio_replace(struct mlxsw_sp_port *mlxsw_sp_port,
 		child_qdisc = &mlxsw_sp_port->tclass_qdiscs[tclass];
 		child_qdisc->prio_bitmap = 0;
 		mlxsw_sp_qdisc_destroy(mlxsw_sp_port, child_qdisc);
+		mlxsw_sp_port_ets_set(mlxsw_sp_port,
+				      MLXSW_REG_QEEC_HR_SUBGROUP,
+				      tclass, 0, false, 0);
 	}
 	return 0;
 }
 
+static int
+mlxsw_sp_qdisc_prio_replace(struct mlxsw_sp_port *mlxsw_sp_port,
+			    struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+			    void *params)
+{
+	struct tc_prio_qopt_offload_params *p = params;
+	unsigned int zeroes[TCQ_ETS_MAX_BANDS] = {0};
+
+	return __mlxsw_sp_qdisc_ets_replace(mlxsw_sp_port, p->bands,
+					    zeroes, zeroes, p->priomap);
+}
+
+static void
+__mlxsw_sp_qdisc_ets_unoffload(struct mlxsw_sp_port *mlxsw_sp_port,
+			       struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+			       struct gnet_stats_queue *qstats)
+{
+	u64 backlog;
+
+	backlog = mlxsw_sp_cells_bytes(mlxsw_sp_port->mlxsw_sp,
+				       mlxsw_sp_qdisc->stats_base.backlog);
+	qstats->backlog -= backlog;
+}
+
 static void
 mlxsw_sp_qdisc_prio_unoffload(struct mlxsw_sp_port *mlxsw_sp_port,
 			      struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
 			      void *params)
 {
 	struct tc_prio_qopt_offload_params *p = params;
-	u64 backlog;
 
-	backlog = mlxsw_sp_cells_bytes(mlxsw_sp_port->mlxsw_sp,
-				       mlxsw_sp_qdisc->stats_base.backlog);
-	p->qstats->backlog -= backlog;
+	__mlxsw_sp_qdisc_ets_unoffload(mlxsw_sp_port, mlxsw_sp_qdisc,
+				       p->qstats);
 }
 
 static int
@@ -657,22 +706,22 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_prio = {
  * unoffload the child.
  */
 static int
-mlxsw_sp_qdisc_prio_graft(struct mlxsw_sp_port *mlxsw_sp_port,
-			  struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
-			  struct tc_prio_qopt_offload_graft_params *p)
+__mlxsw_sp_qdisc_ets_graft(struct mlxsw_sp_port *mlxsw_sp_port,
+			   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+			   u8 band, u32 child_handle)
 {
-	int tclass_num = MLXSW_SP_PRIO_BAND_TO_TCLASS(p->band);
+	int tclass_num = MLXSW_SP_PRIO_BAND_TO_TCLASS(band);
 	struct mlxsw_sp_qdisc *old_qdisc;
 
-	if (p->band < IEEE_8021QAZ_MAX_TCS &&
-	    mlxsw_sp_port->tclass_qdiscs[tclass_num].handle == p->child_handle)
+	if (band < IEEE_8021QAZ_MAX_TCS &&
+	    mlxsw_sp_port->tclass_qdiscs[tclass_num].handle == child_handle)
 		return 0;
 
 	/* See if the grafted qdisc is already offloaded on any tclass. If so,
 	 * unoffload it.
 	 */
 	old_qdisc = mlxsw_sp_qdisc_find_by_handle(mlxsw_sp_port,
-						  p->child_handle);
+						  child_handle);
 	if (old_qdisc)
 		mlxsw_sp_qdisc_destroy(mlxsw_sp_port, old_qdisc);
 
@@ -681,6 +730,15 @@ mlxsw_sp_qdisc_prio_graft(struct mlxsw_sp_port *mlxsw_sp_port,
 	return -EOPNOTSUPP;
 }
 
+static int
+mlxsw_sp_qdisc_prio_graft(struct mlxsw_sp_port *mlxsw_sp_port,
+			  struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+			  struct tc_prio_qopt_offload_graft_params *p)
+{
+	return __mlxsw_sp_qdisc_ets_graft(mlxsw_sp_port, mlxsw_sp_qdisc,
+					  p->band, p->child_handle);
+}
+
 int mlxsw_sp_setup_tc_prio(struct mlxsw_sp_port *mlxsw_sp_port,
 			   struct tc_prio_qopt_offload *p)
 {
-- 
2.20.1



* [PATCH net-next mlxsw v2 07/10] mlxsw: spectrum_qdisc: Support offloading of ETS Qdisc
  2019-12-18 14:55 [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (5 preceding siblings ...)
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 06/10] mlxsw: spectrum_qdisc: Generalize PRIO offload to support ETS Petr Machata
@ 2019-12-18 14:55 ` Petr Machata
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 08/10] selftests: forwarding: Move start_/stop_traffic from mlxsw to lib.sh Petr Machata
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Petr Machata @ 2019-12-18 14:55 UTC (permalink / raw)
  To: netdev
  Cc: Petr Machata, David Miller, Roopa Prabhu, Jakub Kicinski,
	Roman Mashak, Ido Schimmel, Jiri Pirko, Jiri Pirko

Handle TC_SETUP_QDISC_ETS, add a new ops structure for the ETS Qdisc.
Invoke the extended prio handlers implemented in the previous patch. For
stats ops, invoke the prio callbacks directly, as they are not sensitive to
differences between PRIO and ETS.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
---

Notes:
    v3 (internal):
    - __mlxsw_sp_qdisc_ets_replace(): The "weights" argument passing and
      weight computation removal are now done in a previous patch.
    - mlxsw_sp_setup_tc_ets(): Drop case TC_ETS_REPLACE, which is handled
      earlier in the function.
    
    v1 (internal):
    - __mlxsw_sp_qdisc_ets_replace(): Convert syntax of function argument
      "weights" from an array to a pointer.

 .../net/ethernet/mellanox/mlxsw/spectrum.c    |  2 +
 .../net/ethernet/mellanox/mlxsw/spectrum.h    |  2 +
 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 84 +++++++++++++++++++
 3 files changed, 88 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 0d8fce749248..ea632042e609 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1796,6 +1796,8 @@ static int mlxsw_sp_setup_tc(struct net_device *dev, enum tc_setup_type type,
 		return mlxsw_sp_setup_tc_red(mlxsw_sp_port, type_data);
 	case TC_SETUP_QDISC_PRIO:
 		return mlxsw_sp_setup_tc_prio(mlxsw_sp_port, type_data);
+	case TC_SETUP_QDISC_ETS:
+		return mlxsw_sp_setup_tc_ets(mlxsw_sp_port, type_data);
 	default:
 		return -EOPNOTSUPP;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index 347bec9d1ecf..948ef4720d40 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -852,6 +852,8 @@ int mlxsw_sp_setup_tc_red(struct mlxsw_sp_port *mlxsw_sp_port,
 			  struct tc_red_qopt_offload *p);
 int mlxsw_sp_setup_tc_prio(struct mlxsw_sp_port *mlxsw_sp_port,
 			   struct tc_prio_qopt_offload *p);
+int mlxsw_sp_setup_tc_ets(struct mlxsw_sp_port *mlxsw_sp_port,
+			  struct tc_ets_qopt_offload *p);
 
 /* spectrum_fid.c */
 bool mlxsw_sp_fid_is_dummy(struct mlxsw_sp *mlxsw_sp, u16 fid_index);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
index d513af49c0a8..81a2c087f534 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c
@@ -18,6 +18,7 @@ enum mlxsw_sp_qdisc_type {
 	MLXSW_SP_QDISC_NO_QDISC,
 	MLXSW_SP_QDISC_RED,
 	MLXSW_SP_QDISC_PRIO,
+	MLXSW_SP_QDISC_ETS,
 };
 
 struct mlxsw_sp_qdisc_ops {
@@ -680,6 +681,55 @@ static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_prio = {
 	.clean_stats = mlxsw_sp_setup_tc_qdisc_prio_clean_stats,
 };
 
+static int
+mlxsw_sp_qdisc_ets_check_params(struct mlxsw_sp_port *mlxsw_sp_port,
+				struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+				void *params)
+{
+	struct tc_ets_qopt_offload_replace_params *p = params;
+
+	return __mlxsw_sp_qdisc_ets_check_params(p->bands);
+}
+
+static int
+mlxsw_sp_qdisc_ets_replace(struct mlxsw_sp_port *mlxsw_sp_port,
+			   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+			   void *params)
+{
+	struct tc_ets_qopt_offload_replace_params *p = params;
+
+	return __mlxsw_sp_qdisc_ets_replace(mlxsw_sp_port, p->bands,
+					    p->quanta, p->weights, p->priomap);
+}
+
+static void
+mlxsw_sp_qdisc_ets_unoffload(struct mlxsw_sp_port *mlxsw_sp_port,
+			     struct mlxsw_sp_qdisc *mlxsw_sp_qdisc,
+			     void *params)
+{
+	struct tc_ets_qopt_offload_replace_params *p = params;
+
+	__mlxsw_sp_qdisc_ets_unoffload(mlxsw_sp_port, mlxsw_sp_qdisc,
+				       p->qstats);
+}
+
+static int
+mlxsw_sp_qdisc_ets_destroy(struct mlxsw_sp_port *mlxsw_sp_port,
+			   struct mlxsw_sp_qdisc *mlxsw_sp_qdisc)
+{
+	return __mlxsw_sp_qdisc_ets_destroy(mlxsw_sp_port);
+}
+
+static struct mlxsw_sp_qdisc_ops mlxsw_sp_qdisc_ops_ets = {
+	.type = MLXSW_SP_QDISC_ETS,
+	.check_params = mlxsw_sp_qdisc_ets_check_params,
+	.replace = mlxsw_sp_qdisc_ets_replace,
+	.unoffload = mlxsw_sp_qdisc_ets_unoffload,
+	.destroy = mlxsw_sp_qdisc_ets_destroy,
+	.get_stats = mlxsw_sp_qdisc_get_prio_stats,
+	.clean_stats = mlxsw_sp_setup_tc_qdisc_prio_clean_stats,
+};
+
 /* Linux allows linking of Qdiscs to arbitrary classes (so long as the resulting
  * graph is free of cycles). These operations do not change the parent handle
  * though, which means it can be incomplete (if there is more than one class
@@ -772,6 +822,40 @@ int mlxsw_sp_setup_tc_prio(struct mlxsw_sp_port *mlxsw_sp_port,
 	}
 }
 
+int mlxsw_sp_setup_tc_ets(struct mlxsw_sp_port *mlxsw_sp_port,
+			  struct tc_ets_qopt_offload *p)
+{
+	struct mlxsw_sp_qdisc *mlxsw_sp_qdisc;
+
+	mlxsw_sp_qdisc = mlxsw_sp_qdisc_find(mlxsw_sp_port, p->parent, true);
+	if (!mlxsw_sp_qdisc)
+		return -EOPNOTSUPP;
+
+	if (p->command == TC_ETS_REPLACE)
+		return mlxsw_sp_qdisc_replace(mlxsw_sp_port, p->handle,
+					      mlxsw_sp_qdisc,
+					      &mlxsw_sp_qdisc_ops_ets,
+					      &p->replace_params);
+
+	if (!mlxsw_sp_qdisc_compare(mlxsw_sp_qdisc, p->handle,
+				    MLXSW_SP_QDISC_ETS))
+		return -EOPNOTSUPP;
+
+	switch (p->command) {
+	case TC_ETS_DESTROY:
+		return mlxsw_sp_qdisc_destroy(mlxsw_sp_port, mlxsw_sp_qdisc);
+	case TC_ETS_STATS:
+		return mlxsw_sp_qdisc_get_stats(mlxsw_sp_port, mlxsw_sp_qdisc,
+						&p->stats);
+	case TC_ETS_GRAFT:
+		return __mlxsw_sp_qdisc_ets_graft(mlxsw_sp_port, mlxsw_sp_qdisc,
+						  p->graft_params.band,
+						  p->graft_params.child_handle);
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
 int mlxsw_sp_tc_qdisc_init(struct mlxsw_sp_port *mlxsw_sp_port)
 {
 	struct mlxsw_sp_qdisc *mlxsw_sp_qdisc;
-- 
2.20.1



* [PATCH net-next mlxsw v2 08/10] selftests: forwarding: Move start_/stop_traffic from mlxsw to lib.sh
  2019-12-18 14:55 [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (6 preceding siblings ...)
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 07/10] mlxsw: spectrum_qdisc: Support offloading of ETS Qdisc Petr Machata
@ 2019-12-18 14:55 ` Petr Machata
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 09/10] selftests: forwarding: sch_ets: Add test coverage for ETS Qdisc Petr Machata
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Petr Machata @ 2019-12-18 14:55 UTC (permalink / raw)
  To: netdev
  Cc: Petr Machata, David Miller, Roopa Prabhu, Jakub Kicinski,
	Roman Mashak, Ido Schimmel, Jiri Pirko, Jiri Pirko

These two functions are used for starting several streams of traffic, and
then stopping them later. They will be handy for the test coverage of ETS
Qdisc. Move them from mlxsw-specific qos_lib.sh to the generic lib.sh.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 .../selftests/drivers/net/mlxsw/qos_lib.sh     | 18 ------------------
 tools/testing/selftests/net/forwarding/lib.sh  | 18 ++++++++++++++++++
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh b/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh
index e80be65799ad..75a3fb3b5663 100644
--- a/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh
@@ -24,24 +24,6 @@ rate()
 	echo $((8 * (t1 - t0) / interval))
 }
 
-start_traffic()
-{
-	local h_in=$1; shift    # Where the traffic egresses the host
-	local sip=$1; shift
-	local dip=$1; shift
-	local dmac=$1; shift
-
-	$MZ $h_in -p 8000 -A $sip -B $dip -c 0 \
-		-a own -b $dmac -t udp -q &
-	sleep 1
-}
-
-stop_traffic()
-{
-	# Suppress noise from killing mausezahn.
-	{ kill %% && wait %%; } 2>/dev/null
-}
-
 check_rate()
 {
 	local rate=$1; shift
diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 1f64e7348f69..a0b09bb6995e 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -1065,3 +1065,21 @@ flood_test()
 	flood_unicast_test $br_port $host1_if $host2_if
 	flood_multicast_test $br_port $host1_if $host2_if
 }
+
+start_traffic()
+{
+	local h_in=$1; shift    # Where the traffic egresses the host
+	local sip=$1; shift
+	local dip=$1; shift
+	local dmac=$1; shift
+
+	$MZ $h_in -p 8000 -A $sip -B $dip -c 0 \
+		-a own -b $dmac -t udp -q &
+	sleep 1
+}
+
+stop_traffic()
+{
+	# Suppress noise from killing mausezahn.
+	{ kill %% && wait %%; } 2>/dev/null
+}
-- 
2.20.1



* [PATCH net-next mlxsw v2 09/10] selftests: forwarding: sch_ets: Add test coverage for ETS Qdisc
  2019-12-18 14:55 [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (7 preceding siblings ...)
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 08/10] selftests: forwarding: Move start_/stop_traffic from mlxsw to lib.sh Petr Machata
@ 2019-12-18 14:55 ` Petr Machata
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 10/10] selftests: qdiscs: " Petr Machata
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 17+ messages in thread
From: Petr Machata @ 2019-12-18 14:55 UTC (permalink / raw)
  To: netdev
  Cc: Petr Machata, David Miller, Roopa Prabhu, Jakub Kicinski,
	Roman Mashak, Ido Schimmel, Jiri Pirko, Jiri Pirko

This tests the newly-added ETS Qdisc. It runs two to three streams of
traffic, each with a different priority. ETS Qdisc is supposed to allocate
bandwidth according to the DRR algorithm and given weights. After running
the traffic for a while, counters are compared for each stream to check
that the expected ratio is in fact observed.
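
As a rough illustration of that comparison (the test itself delegates it to
multipath_eval from lib.sh), the sketch below checks how far two measured
byte deltas stray from the ratio implied by the band quanta. The helper name
ratio_error_pct and its integer-percent result are assumptions made for this
example only; they are not part of the patch.

```shell
#!/bin/bash
# Hypothetical helper, not part of the selftest: given the quanta of two
# bands and the bytes received for their streams over the same interval,
# print the deviation of the observed ratio from the expected one, in
# whole percent (integer arithmetic only).
ratio_error_pct()
{
	local q0=$1; shift   # quantum of the reference band
	local q1=$1; shift   # quantum of the compared band
	local d0=$1; shift   # bytes received for the reference stream
	local d1=$1; shift   # bytes received for the compared stream

	# Under DRR, d0/d1 should approach q0/q1, i.e. d0*q1 == d1*q0.
	local expected=$((d1 * q0))
	local observed=$((d0 * q1))
	local diff=$((observed - expected))

	if ((! expected)); then
		# No traffic on the compared stream, nothing to evaluate.
		return 1
	fi

	# ${diff#-} strips a leading minus sign, giving the absolute value.
	echo $((100 * ${diff#-} / expected))
}

# Quanta 5000:2500 with deltas 2200:1000 is 10% off the ideal 2:1 split.
ratio_error_pct 5000 2500 2200 1000
```

A caller would compare the printed value against whatever tolerance it
considers acceptable; no particular threshold is implied here.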

In order for the DRR process to kick in, a traffic bottleneck must exist in
the first place. In the slow path, such a bottleneck can be implemented by
wrapping the ETS Qdisc inside a TBF or other shaper. This might however
make the configuration unoffloadable. Instead, on the HW datapath, the
bottleneck is set up by lowering port speed and configuring the shared
buffer suitably.

Therefore the test is structured as a core component that implements the
testing, with two wrapper scripts that implement the details of slow path
and fast path configuration, respectively.
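
The split between core and wrappers can be pictured with the miniature
below. It is only a sketch: stream_delta, run_traffic_interval and
FAKE_COUNTERS are made-up names for this example, and the fake counters
stand in for the real statistics sources. Only the idea that the core calls
a get_stats callback supplied by the wrapper is taken from the patch.

```shell
#!/bin/bash
# Core side: knows how to measure a stream, but not where the counters
# come from. It relies on the wrapper to define get_stats.
stream_delta()
{
	local stream=$1; shift
	local t0 t1

	t0=$(get_stats $stream)
	run_traffic_interval    # stands in for sleeping while traffic runs
	t1=$(get_stats $stream)
	echo $((t1 - t0))
}

# Wrapper side (hypothetical): supplies counters from an in-memory array
# instead of from link or ethtool statistics.
declare -a FAKE_COUNTERS=(1000 2000 3000)

get_stats()
{
	local stream=$1; shift

	echo ${FAKE_COUNTERS[$stream]}
}

# Pretend that stream i received 100 * (i + 1) bytes during the interval.
run_traffic_interval()
{
	local i

	for i in 0 1 2; do
		FAKE_COUNTERS[$i]=$((FAKE_COUNTERS[$i] + 100 * (i + 1)))
	done
}

stream_delta 1
```

Swapping in a different get_stats is all that a wrapper needs to do to
retarget the measurement, which is the property the two drivers rely on.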

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---

Notes:
    v1 (internal):
    - mlxsw/sch_ets.sh: Add a comment explaining packet prioritization.
    - Adjust the whole suite to allow testing of traffic classifiers
      in addition to testing priomap.

 .../selftests/drivers/net/mlxsw/qos_lib.sh    |  28 ++
 .../selftests/drivers/net/mlxsw/sch_ets.sh    |  67 ++++
 .../selftests/net/forwarding/sch_ets.sh       |  44 +++
 .../selftests/net/forwarding/sch_ets_core.sh  | 300 ++++++++++++++++++
 .../selftests/net/forwarding/sch_ets_tests.sh | 227 +++++++++++++
 5 files changed, 666 insertions(+)
 create mode 100755 tools/testing/selftests/drivers/net/mlxsw/sch_ets.sh
 create mode 100755 tools/testing/selftests/net/forwarding/sch_ets.sh
 create mode 100644 tools/testing/selftests/net/forwarding/sch_ets_core.sh
 create mode 100644 tools/testing/selftests/net/forwarding/sch_ets_tests.sh

diff --git a/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh b/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh
index 75a3fb3b5663..a5937069ac16 100644
--- a/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/qos_lib.sh
@@ -78,3 +78,31 @@ measure_rate()
 	echo $ir $er
 	return $ret
 }
+
+bail_on_lldpad()
+{
+	if systemctl is-active --quiet lldpad; then
+
+		cat >/dev/stderr <<-EOF
+		WARNING: lldpad is running
+
+			lldpad will likely configure DCB, and this test will
+			configure Qdiscs. mlxsw does not support both at the
+			same time, one of them is arbitrarily going to overwrite
+			the other. That will cause spurious failures (or,
+			unlikely, passes) of this test.
+		EOF
+
+		if [[ -z $ALLOW_LLDPAD ]]; then
+			cat >/dev/stderr <<-EOF
+
+				If you want to run the test anyway, please set
+				an environment variable ALLOW_LLDPAD to a
+				non-empty string.
+			EOF
+			exit 1
+		else
+			return
+		fi
+	fi
+}
diff --git a/tools/testing/selftests/drivers/net/mlxsw/sch_ets.sh b/tools/testing/selftests/drivers/net/mlxsw/sch_ets.sh
new file mode 100755
index 000000000000..c9fc4d4885c1
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/mlxsw/sch_ets.sh
@@ -0,0 +1,67 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# A driver for the ETS selftest that implements testing in offloaded datapath.
+lib_dir=$(dirname $0)/../../../net/forwarding
+source $lib_dir/sch_ets_core.sh
+source $lib_dir/devlink_lib.sh
+source qos_lib.sh
+
+ALL_TESTS="
+	ping_ipv4
+	priomap_mode
+	ets_test_strict
+	ets_test_mixed
+	ets_test_dwrr
+"
+
+switch_create()
+{
+	ets_switch_create
+
+	# Create a bottleneck so that the DWRR process can kick in.
+	ethtool -s $h2 speed 1000 autoneg off
+	ethtool -s $swp2 speed 1000 autoneg off
+
+	# Set the ingress quota high and use the three egress TCs to limit the
+	# amount of traffic that is admitted to the shared buffers. This makes
+	# sure that there is always enough traffic of all types to select from
+	# for the DWRR process.
+	devlink_port_pool_th_set $swp1 0 12
+	devlink_tc_bind_pool_th_set $swp1 0 ingress 0 12
+	devlink_port_pool_th_set $swp2 4 12
+	devlink_tc_bind_pool_th_set $swp2 7 egress 4 5
+	devlink_tc_bind_pool_th_set $swp2 6 egress 4 5
+	devlink_tc_bind_pool_th_set $swp2 5 egress 4 5
+
+	# Note: sch_ets_core.sh uses VLAN ingress-qos-map to assign packet
+	# priorities at $swp1 based on their 802.1p headers. ingress-qos-map is
+	# not offloaded by mlxsw as of this writing, but the mapping used is
+	# 1:1, which is the mapping currently hard-coded by the driver.
+}
+
+switch_destroy()
+{
+	devlink_tc_bind_pool_th_restore $swp2 5 egress
+	devlink_tc_bind_pool_th_restore $swp2 6 egress
+	devlink_tc_bind_pool_th_restore $swp2 7 egress
+	devlink_port_pool_th_restore $swp2 4
+	devlink_tc_bind_pool_th_restore $swp1 0 ingress
+	devlink_port_pool_th_restore $swp1 0
+
+	ethtool -s $swp2 autoneg on
+	ethtool -s $h2 autoneg on
+
+	ets_switch_destroy
+}
+
+# Callback from sch_ets_tests.sh
+get_stats()
+{
+	local band=$1; shift
+
+	ethtool_stats_get "$h2" rx_octets_prio_$band
+}
+
+bail_on_lldpad
+ets_run
diff --git a/tools/testing/selftests/net/forwarding/sch_ets.sh b/tools/testing/selftests/net/forwarding/sch_ets.sh
new file mode 100755
index 000000000000..40e0ad1bc4f2
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/sch_ets.sh
@@ -0,0 +1,44 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# A driver for the ETS selftest that implements testing in slowpath.
+lib_dir=.
+source sch_ets_core.sh
+
+ALL_TESTS="
+	ping_ipv4
+	priomap_mode
+	ets_test_strict
+	ets_test_mixed
+	ets_test_dwrr
+	classifier_mode
+	ets_test_strict
+	ets_test_mixed
+	ets_test_dwrr
+"
+
+switch_create()
+{
+	ets_switch_create
+
+	# Create a bottleneck so that the DWRR process can kick in.
+	tc qdisc add dev $swp2 root handle 1: tbf \
+	   rate 1Gbit burst 1Mbit latency 100ms
+	PARENT="parent 1:"
+}
+
+switch_destroy()
+{
+	ets_switch_destroy
+	tc qdisc del dev $swp2 root
+}
+
+# Callback from sch_ets_tests.sh
+get_stats()
+{
+	local stream=$1; shift
+
+	link_stats_get $h2.1$stream rx bytes
+}
+
+ets_run
diff --git a/tools/testing/selftests/net/forwarding/sch_ets_core.sh b/tools/testing/selftests/net/forwarding/sch_ets_core.sh
new file mode 100644
index 000000000000..f906fcc66572
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/sch_ets_core.sh
@@ -0,0 +1,300 @@
+# SPDX-License-Identifier: GPL-2.0
+
+# This is a template for ETS Qdisc test.
+#
+# This test sends from H1 several traffic streams with 802.1p-tagged packets.
+# The tags are used at $swp1 to prioritize the traffic. Each stream is then
+# queued at a different ETS band according to the assigned priority. After
+# running for a while, counters at H2 are consulted to determine whether the
+# traffic scheduling was according to the ETS configuration.
+#
+# This template is supposed to be embedded by a test driver, which implements
+# statistics collection, any HW-specific stuff, and prominently configures the
+# system to assure that there is overcommitment at $swp2. That is necessary so
+# that the ETS traffic selection algorithm kicks in and has to schedule some
+# traffic at the expense of other.
+#
+# A driver for veth-based testing is in sch_ets.sh, an example of a driver for
+# an offloaded data path is in selftests/drivers/net/mlxsw/sch_ets.sh.
+#
+# +---------------------------------------------------------------------+
+# | H1                                                                  |
+# |     + $h1.10              + $h1.11              + $h1.12            |
+# |     | 192.0.2.1/28        | 192.0.2.17/28       | 192.0.2.33/28     |
+# |     | egress-qos-map      | egress-qos-map      | egress-qos-map    |
+# |     |  0:0                |  0:1                |  0:2              |
+# |     \____________________ | ____________________/                   |
+# |                          \|/                                        |
+# |                           + $h1                                     |
+# +---------------------------|-----------------------------------------+
+#                             |
+# +---------------------------|-----------------------------------------+
+# | SW                        + $swp1                                   |
+# |                           | >1Gbps                                  |
+# |      ____________________/|\____________________                    |
+# |     /                     |                     \                   |
+# |  +--|----------------+ +--|----------------+ +--|----------------+  |
+# |  |  + $swp1.10       | |  + $swp1.11       | |  + $swp1.12       |  |
+# |  |    ingress-qos-map| |    ingress-qos-map| |    ingress-qos-map|  |
+# |  |     0:0 1:1 2:2   | |     0:0 1:1 2:2   | |     0:0 1:1 2:2   |  |
+# |  |                   | |                   | |                   |  |
+# |  |    BR10           | |    BR11           | |    BR12           |  |
+# |  |                   | |                   | |                   |  |
+# |  |  + $swp2.10       | |  + $swp2.11       | |  + $swp2.12       |  |
+# |  +--|----------------+ +--|----------------+ +--|----------------+  |
+# |     \____________________ | ____________________/                   |
+# |                          \|/                                        |
+# |                           + $swp2                                   |
+# |                           | 1Gbps (ethtool or TBF qdisc)            |
+# |                           | qdisc ets quanta $W0 $W1 $W2            |
+# |                           |           priomap 0 1 2                 |
+# +---------------------------|-----------------------------------------+
+#                             |
+# +---------------------------|-----------------------------------------+
+# | H2                        + $h2                                     |
+# |      ____________________/|\____________________                    |
+# |     /                     |                     \                   |
+# |     + $h2.10              + $h2.11              + $h2.12            |
+# |       192.0.2.2/28          192.0.2.18/28         192.0.2.34/28     |
+# +---------------------------------------------------------------------+
+
+NUM_NETIFS=4
+CHECK_TC=yes
+source $lib_dir/lib.sh
+source $lib_dir/sch_ets_tests.sh
+
+PARENT=root
+QDISC_DEV=
+
+sip()
+{
+	echo 192.0.2.$((16 * $1 + 1))
+}
+
+dip()
+{
+	echo 192.0.2.$((16 * $1 + 2))
+}
+
+# Callback from sch_ets_tests.sh
+ets_start_traffic()
+{
+	local dst_mac=$(mac_get $h2)
+	local i=$1; shift
+
+	start_traffic $h1.1$i $(sip $i) $(dip $i) $dst_mac
+}
+
+ETS_CHANGE_QDISC=
+
+priomap_mode()
+{
+	echo "Running in priomap mode"
+	ets_delete_qdisc
+	ETS_CHANGE_QDISC=ets_change_qdisc_priomap
+}
+
+classifier_mode()
+{
+	echo "Running in classifier mode"
+	ets_delete_qdisc
+	ETS_CHANGE_QDISC=ets_change_qdisc_classifier
+}
+
+ets_change_qdisc_priomap()
+{
+	local dev=$1; shift
+	local nstrict=$1; shift
+	local priomap=$1; shift
+	local quanta=("${@}")
+
+	local op=$(if [[ -n $QDISC_DEV ]]; then echo change; else echo add; fi)
+
+	tc qdisc $op dev $dev $PARENT handle 10: ets			       \
+		$(if ((nstrict)); then echo strict $nstrict; fi)	       \
+		$(if ((${#quanta[@]})); then echo quanta ${quanta[@]}; fi)     \
+		priomap $priomap
+	QDISC_DEV=$dev
+}
+
+ets_change_qdisc_classifier()
+{
+	local dev=$1; shift
+	local nstrict=$1; shift
+	local priomap=$1; shift
+	local quanta=("${@}")
+
+	local op=$(if [[ -n $QDISC_DEV ]]; then echo change; else echo add; fi)
+
+	tc qdisc $op dev $dev $PARENT handle 10: ets			       \
+		$(if ((nstrict)); then echo strict $nstrict; fi)	       \
+		$(if ((${#quanta[@]})); then echo quanta ${quanta[@]}; fi)
+
+	if [[ $op == add ]]; then
+		local prio=0
+		local band
+
+		for band in $priomap; do
+			tc filter add dev $dev parent 10: basic \
+				match "meta(priority eq $prio)" \
+				flowid 10:$((band + 1))
+			((prio++))
+		done
+	fi
+	QDISC_DEV=$dev
+}
+
+# Callback from sch_ets_tests.sh
+ets_change_qdisc()
+{
+	if [[ -z "$ETS_CHANGE_QDISC" ]]; then
+		exit 1
+	fi
+	$ETS_CHANGE_QDISC "$@"
+}
+
+ets_delete_qdisc()
+{
+	if [[ -n $QDISC_DEV ]]; then
+		tc qdisc del dev $QDISC_DEV $PARENT
+		QDISC_DEV=
+	fi
+}
+
+h1_create()
+{
+	local i;
+
+	simple_if_init $h1
+	mtu_set $h1 9900
+	for i in {0..2}; do
+		vlan_create $h1 1$i v$h1 $(sip $i)/28
+		ip link set dev $h1.1$i type vlan egress 0:$i
+	done
+}
+
+h1_destroy()
+{
+	local i
+
+	for i in {0..2}; do
+		vlan_destroy $h1 1$i
+	done
+	mtu_restore $h1
+	simple_if_fini $h1
+}
+
+h2_create()
+{
+	local i
+
+	simple_if_init $h2
+	mtu_set $h2 9900
+	for i in {0..2}; do
+		vlan_create $h2 1$i v$h2 $(dip $i)/28
+	done
+}
+
+h2_destroy()
+{
+	local i
+
+	for i in {0..2}; do
+		vlan_destroy $h2 1$i
+	done
+	mtu_restore $h2
+	simple_if_fini $h2
+}
+
+ets_switch_create()
+{
+	local i
+
+	ip link set dev $swp1 up
+	mtu_set $swp1 9900
+
+	ip link set dev $swp2 up
+	mtu_set $swp2 9900
+
+	for i in {0..2}; do
+		vlan_create $swp1 1$i
+		ip link set dev $swp1.1$i type vlan ingress 0:0 1:1 2:2
+
+		vlan_create $swp2 1$i
+
+		ip link add dev br1$i type bridge
+		ip link set dev $swp1.1$i master br1$i
+		ip link set dev $swp2.1$i master br1$i
+
+		ip link set dev br1$i up
+		ip link set dev $swp1.1$i up
+		ip link set dev $swp2.1$i up
+	done
+}
+
+ets_switch_destroy()
+{
+	local i
+
+	ets_delete_qdisc
+
+	for i in {0..2}; do
+		ip link del dev br1$i
+		vlan_destroy $swp2 1$i
+		vlan_destroy $swp1 1$i
+	done
+
+	mtu_restore $swp2
+	ip link set dev $swp2 down
+
+	mtu_restore $swp1
+	ip link set dev $swp1 down
+}
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	put=$swp2
+	hut=$h2
+
+	vrf_prepare
+
+	h1_create
+	h2_create
+	switch_create
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	switch_destroy
+	h2_destroy
+	h1_destroy
+
+	vrf_cleanup
+}
+
+ping_ipv4()
+{
+	ping_test $h1.10 $(dip 0) " vlan 10"
+	ping_test $h1.11 $(dip 1) " vlan 11"
+	ping_test $h1.12 $(dip 2) " vlan 12"
+}
+
+ets_run()
+{
+	trap cleanup EXIT
+
+	setup_prepare
+	setup_wait
+
+	tests_run
+
+	exit $EXIT_STATUS
+}
diff --git a/tools/testing/selftests/net/forwarding/sch_ets_tests.sh b/tools/testing/selftests/net/forwarding/sch_ets_tests.sh
new file mode 100644
index 000000000000..3c3b204d47e8
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/sch_ets_tests.sh
@@ -0,0 +1,227 @@
+# SPDX-License-Identifier: GPL-2.0
+
+# Global interface:
+#  $put -- port under test (e.g. $swp2)
+#  get_stats($band) -- A function to collect stats for band
+#  ets_start_traffic($band) -- Start traffic for this band
+#  ets_change_qdisc($dev, $nstrict, $priomap, $quanta...) -- Add or change qdisc
+
+# WS describes the Qdisc configuration. It has one value per band (so the
+# number of array elements indicates the number of bands). If the value is
+# 0, it is a strict band, otherwise it's a DRR band and the value is
+# that band's quantum.
+declare -a WS
+
+qdisc_describe()
+{
+	local nbands=${#WS[@]}
+	local nstrict=0
+	local i
+
+	for ((i = 0; i < nbands; i++)); do
+		if ((!${WS[$i]})); then
+			: $((nstrict++))
+		fi
+	done
+
+	echo -n "ets bands $nbands"
+	if ((nstrict)); then
+		echo -n " strict $nstrict"
+	fi
+	if ((nstrict < nbands)); then
+		echo -n " quanta"
+		for ((i = nstrict; i < nbands; i++)); do
+			echo -n " ${WS[$i]}"
+		done
+	fi
+}
+
+__strict_eval()
+{
+	local desc=$1; shift
+	local d=$1; shift
+	local total=$1; shift
+	local above=$1; shift
+
+	RET=0
+
+	if ((! total)); then
+		check_err 1 "No traffic observed"
+		log_test "$desc"
+		return
+	fi
+
+	local ratio=$(echo "scale=2; 100 * $d / $total" | bc -l)
+	if ((above)); then
+		test $(echo "$ratio > 95.0" | bc -l) -eq 1
+		check_err $? "Not enough traffic"
+		log_test "$desc"
+		log_info "Expected ratio >95% Measured ratio $ratio"
+	else
+		test $(echo "$ratio < 5" | bc -l) -eq 1
+		check_err $? "Too much traffic"
+		log_test "$desc"
+		log_info "Expected ratio <5% Measured ratio $ratio"
+	fi
+}
+
+strict_eval()
+{
+	__strict_eval "$@" 1
+}
+
+notraf_eval()
+{
+	__strict_eval "$@" 0
+}
+
+__ets_dwrr_test()
+{
+	local -a streams=("$@")
+
+	local low_stream=${streams[0]}
+	local seen_strict=0
+	local -a t0 t1 d
+	local stream
+	local total
+	local i
+
+	echo "Testing $(qdisc_describe), streams ${streams[@]}"
+
+	for stream in ${streams[@]}; do
+		ets_start_traffic $stream
+	done
+
+	sleep 10
+
+	t0=($(for stream in ${streams[@]}; do
+		  get_stats $stream
+	      done))
+
+	sleep 10
+
+	t1=($(for stream in ${streams[@]}; do
+		  get_stats $stream
+	      done))
+	d=($(for ((i = 0; i < ${#streams[@]}; i++)); do
+		 echo $((${t1[$i]} - ${t0[$i]}))
+	     done))
+	total=$(echo ${d[@]} | sed 's/ /+/g' | bc)
+
+	for ((i = 0; i < ${#streams[@]}; i++)); do
+		local stream=${streams[$i]}
+		if ((seen_strict)); then
+			notraf_eval "band $stream" ${d[$i]} $total
+		elif ((${WS[$stream]} == 0)); then
+			strict_eval "band $stream" ${d[$i]} $total
+			seen_strict=1
+		elif ((stream == low_stream)); then
+			# Low stream is used as DWRR evaluation reference.
+			continue
+		else
+			multipath_eval "bands $low_stream:$stream" \
+				       ${WS[$low_stream]} ${WS[$stream]} \
+				       ${d[0]} ${d[$i]}
+		fi
+	done
+
+	for stream in ${streams[@]}; do
+		stop_traffic
+	done
+}
+
+ets_dwrr_test_012()
+{
+	__ets_dwrr_test 0 1 2
+}
+
+ets_dwrr_test_01()
+{
+	__ets_dwrr_test 0 1
+}
+
+ets_dwrr_test_12()
+{
+	__ets_dwrr_test 1 2
+}
+
+ets_qdisc_setup()
+{
+	local dev=$1; shift
+	local nstrict=$1; shift
+	local -a quanta=("$@")
+
+	local ndwrr=${#quanta[@]}
+	local nbands=$((nstrict + ndwrr))
+	local nstreams=$(if ((nbands > 3)); then echo 3; else echo $nbands; fi)
+	local priomap=$(seq 0 $((nstreams - 1)))
+	local i
+
+	WS=($(
+		for ((i = 0; i < nstrict; i++)); do
+			echo 0
+		done
+		for ((i = 0; i < ndwrr; i++)); do
+			echo ${quanta[$i]}
+		done
+	))
+
+	ets_change_qdisc $dev $nstrict "$priomap" ${quanta[@]}
+}
+
+ets_set_dwrr_uniform()
+{
+	ets_qdisc_setup $put 0 3300 3300 3300
+}
+
+ets_set_dwrr_varying()
+{
+	ets_qdisc_setup $put 0 5000 3500 1500
+}
+
+ets_set_strict()
+{
+	ets_qdisc_setup $put 3
+}
+
+ets_set_mixed()
+{
+	ets_qdisc_setup $put 1 5000 2500 1500
+}
+
+ets_change_quantum()
+{
+	tc class change dev $put classid 10:2 ets quantum 8000
+	WS[1]=8000
+}
+
+ets_set_dwrr_two_bands()
+{
+	ets_qdisc_setup $put 0 5000 2500
+}
+
+ets_test_strict()
+{
+	ets_set_strict
+	ets_dwrr_test_01
+	ets_dwrr_test_12
+}
+
+ets_test_mixed()
+{
+	ets_set_mixed
+	ets_dwrr_test_01
+	ets_dwrr_test_12
+}
+
+ets_test_dwrr()
+{
+	ets_set_dwrr_uniform
+	ets_dwrr_test_012
+	ets_set_dwrr_varying
+	ets_dwrr_test_012
+	ets_change_quantum
+	ets_dwrr_test_012
+	ets_set_dwrr_two_bands
+	ets_dwrr_test_01
+}
-- 
2.20.1



* [PATCH net-next mlxsw v2 10/10] selftests: qdiscs: Add test coverage for ETS Qdisc
  2019-12-18 14:55 [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (8 preceding siblings ...)
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 09/10] selftests: forwarding: sch_ets: Add test coverage for ETS Qdisc Petr Machata
@ 2019-12-18 14:55 ` Petr Machata
  2019-12-18 16:22 ` [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS John Fastabend
  2019-12-18 23:16 ` David Miller
  11 siblings, 0 replies; 17+ messages in thread
From: Petr Machata @ 2019-12-18 14:55 UTC (permalink / raw)
  To: netdev
  Cc: Petr Machata, David Miller, Roopa Prabhu, Jakub Kicinski,
	Roman Mashak, Ido Schimmel, Jiri Pirko

Add TDC coverage for the new ETS Qdisc.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
---

Notes:
    v1 (internal):
    - Add tests for the default priomap band, an excessive number of
      bands, zeroes in quanta, and a quanta keyword with no values.
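
The default-priomap behaviour that tests 557c and a347 expect can be sketched in a few lines of shell. This is only an illustration of what the match patterns in those tests imply (16 priomap entries, unset ones pointing at the last band); the helper name ets_expected_priomap is made up here and is not part of the patch or of tdc.

```shell
#!/bin/bash
# Hypothetical helper, not part of the patch: predict the priomap that
# "tc qdisc show" is expected to print, going by tests 557c and a347.
# Arguments: number of bands, then any explicitly configured priomap entries.
ets_expected_priomap()
{
	local nbands=$1; shift
	local -a map=("$@")
	local i

	# The priomap has 16 entries; unset ones fall back to the last band.
	for ((i = ${#map[@]}; i < 16; i++)); do
		map[i]=$((nbands - 1))
	done
	echo "${map[*]}"
}

ets_expected_priomap 4 0 0 0 0
ets_expected_priomap 4
```

The first call prints "0 0 0 0" followed by twelve 3s, the second prints sixteen 3s, which is what the two matchPattern strings look for.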

 .../tc-testing/tc-tests/qdiscs/ets.json       | 940 ++++++++++++++++++
 1 file changed, 940 insertions(+)
 create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json

diff --git a/tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json
new file mode 100644
index 000000000000..180593010675
--- /dev/null
+++ b/tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json
@@ -0,0 +1,940 @@
+[
+    {
+        "id": "e90e",
+        "name": "Add ETS qdisc using bands",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 2",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .* bands 2",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "b059",
+        "name": "Add ETS qdisc using quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 1000 900 800 700",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 4 quanta 1000 900 800 700",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "e8e7",
+        "name": "Add ETS qdisc using strict",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 3",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 3 strict 3",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "233c",
+        "name": "Add ETS qdisc using bands + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 4 quanta 1000 900 800 700",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 4 quanta 1000 900 800 700 priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "3d35",
+        "name": "Add ETS qdisc using bands + strict",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 3 strict 3",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 3 strict 3 priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "7f3b",
+        "name": "Add ETS qdisc using strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 3 quanta 1500 750",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 5 strict 3 quanta 1500 750 priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "4593",
+        "name": "Add ETS qdisc using strict 0 + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 0 quanta 1500 750",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 2 quanta 1500 750 priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "8938",
+        "name": "Add ETS qdisc using bands + strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 5 strict 3 quanta 1500 750",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 5 .*strict 3 quanta 1500 750 priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "0782",
+        "name": "Add ETS qdisc with more bands than quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 2 quanta 1000",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 2 .*quanta 1000 [1-9][0-9]* priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "501b",
+        "name": "Add ETS qdisc with more bands than strict",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 3 strict 1",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 3 strict 1 quanta ([1-9][0-9]* ){2}priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "671a",
+        "name": "Add ETS qdisc with more bands than strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 3 strict 1 quanta 1000",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 3 strict 1 quanta 1000 [1-9][0-9]* priomap",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "2a23",
+        "name": "Add ETS qdisc with 16 bands",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 16",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .* bands 16",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "8daf",
+        "name": "Add ETS qdisc with 17 bands",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 17",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "7f95",
+        "name": "Add ETS qdisc with 17 strict",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 17",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "837a",
+        "name": "Add ETS qdisc with 16 quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .* bands 16",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "65b6",
+        "name": "Add ETS qdisc with 17 quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17",
+        "expExitCode": "2",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "b9e9",
+        "name": "Add ETS qdisc with 16 strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 8 quanta 1 2 3 4 5 6 7 8",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .* bands 16",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "9877",
+        "name": "Add ETS qdisc with 17 strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 9 quanta 1 2 3 4 5 6 7 8",
+        "expExitCode": "2",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "c696",
+        "name": "Add ETS qdisc with priomap",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 5 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "30c4",
+        "name": "Add ETS qdisc with quanta + priomap",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 1000 2000 3000 4000 5000 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*quanta 1000 2000 3000 4000 5000 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "e8ac",
+        "name": "Add ETS qdisc with strict + priomap",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 5 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*bands 5 strict 5 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "5a7e",
+        "name": "Add ETS qdisc with quanta + strict + priomap",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 2 quanta 1000 2000 3000 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*strict 2 quanta 1000 2000 3000 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "cb8b",
+        "name": "Show ETS class :1",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 4000 3000 2000",
+        "expExitCode": "0",
+        "verifyCmd": "$TC class show dev $DUMMY classid 1:1",
+        "matchPattern": "class ets 1:1 root quantum 4000",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "1b4e",
+        "name": "Show ETS class :2",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 4000 3000 2000",
+        "expExitCode": "0",
+        "verifyCmd": "$TC class show dev $DUMMY classid 1:2",
+        "matchPattern": "class ets 1:2 root quantum 3000",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "f642",
+        "name": "Show ETS class :3",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 4000 3000 2000",
+        "expExitCode": "0",
+        "verifyCmd": "$TC class show dev $DUMMY classid 1:3",
+        "matchPattern": "class ets 1:3 root quantum 2000",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "0a5f",
+        "name": "Show ETS strict class",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 3",
+        "expExitCode": "0",
+        "verifyCmd": "$TC class show dev $DUMMY classid 1:1",
+        "matchPattern": "class ets 1:1 root $",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DUMMY handle 1: root",
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "f7c8",
+        "name": "Add ETS qdisc with too many quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 2 quanta 1000 2000 3000",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "2389",
+        "name": "Add ETS qdisc with too many strict",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 2 strict 3",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "fe3c",
+        "name": "Add ETS qdisc with too many strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 4 strict 2 quanta 1000 2000 3000",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "cb04",
+        "name": "Add ETS qdisc with excess priomap elements",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 5 priomap 0 0 1 0 1 2 0 1 2 3 0 1 2 3 4 0 1 2",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "c32e",
+        "name": "Add ETS qdisc with priomap above bands",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 2 priomap 0 1 2",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "744c",
+        "name": "Add ETS qdisc with priomap above quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 1000 500 priomap 0 1 2",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "7b33",
+        "name": "Add ETS qdisc with priomap above strict",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 2 priomap 0 1 2",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "dbe6",
+        "name": "Add ETS qdisc with priomap above strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets strict 1 quanta 1000 500 priomap 0 1 2 3",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "bdb2",
+        "name": "Add ETS qdisc with priomap within bands with strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 4 strict 1 quanta 1000 500 priomap 0 1 2 3",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "1",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "39a3",
+        "name": "Add ETS qdisc with priomap above bands with strict + quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 4 strict 1 quanta 1000 500 priomap 0 1 2 3 4",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "557c",
+        "name": "Unset priorities default to the last band",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 4 priomap 0 0 0 0",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets .*priomap 0 0 0 0 3 3 3 3 3 3 3 3 3 3 3 3",
+        "matchCount": "1",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "a347",
+        "name": "Unset priorities default to the last band -- no priomap",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 4",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets .*priomap 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3",
+        "matchCount": "1",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "39c4",
+        "name": "Add ETS qdisc with too few bands",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 0",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "930b",
+        "name": "Add ETS qdisc with too many bands",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets bands 17",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "406a",
+        "name": "Add ETS qdisc without parameters",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "e51a",
+        "name": "Zero element in quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 1000 0 800 700",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "e7f2",
+        "name": "Sole zero element in quanta",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta 0",
+        "expExitCode": "1",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "d6e6",
+        "name": "No values after the quanta keyword",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true"
+        ],
+        "cmdUnderTest": "$TC qdisc add dev $DUMMY handle 1: root ets quanta",
+        "expExitCode": "255",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets",
+        "matchCount": "0",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "28c6",
+        "name": "Change ETS band quantum",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true",
+            "$TC qdisc add dev $DUMMY handle 1: root ets quanta 1000 2000 3000"
+        ],
+        "cmdUnderTest": "$TC class change dev $DUMMY classid 1:1 ets quantum 1500",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*quanta 1500 2000 3000 priomap ",
+        "matchCount": "1",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "4714",
+        "name": "Change ETS band without quantum",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true",
+            "$TC qdisc add dev $DUMMY handle 1: root ets quanta 1000 2000 3000"
+        ],
+        "cmdUnderTest": "$TC class change dev $DUMMY classid 1:1 ets",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets 1: root .*quanta 1000 2000 3000 priomap ",
+        "matchCount": "1",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "6979",
+        "name": "Change quantum of a strict ETS band",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true",
+            "$TC qdisc add dev $DUMMY handle 1: root ets strict 5"
+        ],
+        "cmdUnderTest": "$TC class change dev $DUMMY classid 1:2 ets quantum 1500",
+        "expExitCode": "2",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets .*bands 5 .*strict 5",
+        "matchCount": "1",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    },
+    {
+        "id": "9a7d",
+        "name": "Change ETS strict band without quantum",
+        "category": [
+            "qdisc",
+            "ets"
+        ],
+        "setup": [
+            "$IP link add dev $DUMMY type dummy || /bin/true",
+            "$TC qdisc add dev $DUMMY handle 1: root ets strict 5"
+        ],
+        "cmdUnderTest": "$TC class change dev $DUMMY classid 1:2 ets",
+        "expExitCode": "0",
+        "verifyCmd": "$TC qdisc show dev $DUMMY",
+        "matchPattern": "qdisc ets .*bands 5 .*strict 5",
+        "matchCount": "1",
+        "teardown": [
+            "$IP link del dev $DUMMY type dummy"
+        ]
+    }
+]
-- 
2.20.1



* Re: [PATCH net-next mlxsw v2 04/10] net: sch_ets: Add a new Qdisc
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 04/10] net: sch_ets: Add a new Qdisc Petr Machata
@ 2019-12-18 15:02   ` Jiri Pirko
  0 siblings, 0 replies; 17+ messages in thread
From: Jiri Pirko @ 2019-12-18 15:02 UTC (permalink / raw)
  To: Petr Machata
  Cc: netdev, David Miller, Roopa Prabhu, Jakub Kicinski, Roman Mashak,
	Ido Schimmel

Wed, Dec 18, 2019 at 03:55:13PM CET, petrm@mellanox.com wrote:
>Introduces a new Qdisc, which is based on 802.1Q-2014 wording. It is
>PRIO-like in how it is configured, meaning one needs to specify how many
>bands there are, how many are strict and how many are dwrr, quanta for the
>latter, and priomap.
>
>The new Qdisc operates like the PRIO / DRR combo would when configured as
>per the standard. The strict classes, if any, are tried for traffic first.
>When there's no traffic in any of the strict queues, the ETS ones (if any)
>are treated in the same way as in DRR.
>
>Signed-off-by: Petr Machata <petrm@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>


* RE: [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS
  2019-12-18 14:55 [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (9 preceding siblings ...)
  2019-12-18 14:55 ` [PATCH net-next mlxsw v2 10/10] selftests: qdiscs: " Petr Machata
@ 2019-12-18 16:22 ` John Fastabend
  2019-12-18 18:35   ` Petr Machata
  2019-12-18 23:16 ` David Miller
  11 siblings, 1 reply; 17+ messages in thread
From: John Fastabend @ 2019-12-18 16:22 UTC (permalink / raw)
  To: Petr Machata, netdev
  Cc: Petr Machata, David Miller, Roopa Prabhu, Jakub Kicinski,
	Roman Mashak, Ido Schimmel, Jiri Pirko

Petr Machata wrote:
> The IEEE standard 802.1Qaz (and 802.1Q-2014) specifies four principal
> transmission selection algorithms: strict priority, credit-based shaper,
> ETS (bandwidth sharing), and vendor-specific. All these have their
> corresponding knobs in DCB. But DCB does not have interfaces to configure
> RED and ECN, unlike Qdiscs.

So the idea here (way back when I did this years ago) is that marking ECN
traffic was not particularly CPU intensive on any metrics I came up with.
And I don't recall anyone ever wanting to do RED here. The configuration
I usually recommended was to use mqprio + SO_PRIORITY + fq per qdisc. Then
once we got the BPF egress hook we replaced SO_PRIORITY configurations with
the more dynamic BPF action to set it. There was never a compelling perf
reason to offload red/ecn.

But these use cases were edge nodes. I believe this series is mostly about
control path and maybe some light control traffic? This is for switches
not for edge nodes right? I'm guessing because I don't see any performance
analysis on why this is useful; intuitively it makes sense if there is
a small CPU sitting on a 48 port 10gbps box or something like that.

> 
> In the Qdisc land, strict priority is implemented by PRIO. Credit-based
> transmission selection algorithm can then be modeled by having e.g. TBF or
> CBS Qdisc below some of the PRIO bands. ETS would then be modeled by
> placing a DRR Qdisc under the last PRIO band.
> 
> The problem with this approach is that DRR on its own, as well as the
> combination of PRIO and DRR, are tricky to configure and tricky to offload
> to 802.1Qaz-compliant hardware. This is due to several reasons:

I would argue the tricky-to-configure part could be hidden behind tooling to
simplify setup. The more annoying part is that it was stuck behind the qdisc
lock. I was hoping this would implement a lockless ETS qdisc, seeing as we
have the infra to do lockless qdiscs now. But it seems not. I guess software
perf analysis might show prio+drr and ets here are about the same
performance-wise.

offload is tricky with stacked qdiscs though ;)

> 
> - As any classful Qdisc, DRR supports adding classifiers to decide in which
>   class to enqueue packets. Unlike PRIO, there's however no fallback in the
>   form of priomap. A way to achieve classification based on packet priority
>   is e.g. like this:
> 
>     # tc filter add dev swp1 root handle 1: \
> 		basic match 'meta(priority eq 0)' flowid 1:10
> 
>   Expressing the priomap in this manner however forces drivers to deep dive
>   into the classifier block to parse the individual rules.
> 
>   A possible solution would be to extend the classes with a "defmap" a la
>   split / defmap mechanism of CBQ, and introduce this as a last resort
>   classification. However, unlike priomap, this doesn't have the guarantee
>   of covering all priorities. Traffic whose priority is not covered is
>   dropped by DRR as unclassified. But ASICs tend to implement dropping in
>   the ACL block, not in scheduling pipelines. The need to treat these
>   configurations correctly (if only to decide to not offload at all)
>   complicates a driver.
> 
>   It's not clear how to retrofit priomap with all its benefits to DRR
>   without changing it beyond recognition.
> 
> - The interplay between PRIO and DRR is also causing problems. 802.1Qaz has
>   all ETS TCs as a last resort. Switch ASICs that support ETS at all are
>   likely to handle ETS traffic this way as well. However, the Linux model
>   is more generic, allowing the DRR block in any band. Drivers would need
>   to be careful to handle this case correctly, otherwise the offloaded
>   model might not match the slow-path one.

Yep, although cases already exist all over the offload side.

> 
>   In a similar vein, PRIO and DRR need to agree on the list of priorities
>   assigned to DRR. This is doubly problematic--the user needs to take care
>   to keep the two in sync, and the driver needs to watch for any holes in
>   DRR coverage and treat the traffic correctly, as discussed above.
> 
>   Note that at the time that DRR Qdisc is added, it has no classes, and
>   thus any priorities assigned to that PRIO band are not covered. Thus this
>   case is surprisingly rather common, and needs to be handled gracefully by
>   the driver.
> 
> - Similarly due to DRR flexibility, when a Qdisc (such as RED) is attached
>   below it, it is not immediately clear which TC the class represents. This
>   is unlike PRIO with its straightforward classid scheme. When DRR is
>   combined with PRIO, the relationship between classes and TCs gets even
>   more murky.
> 
>   This is a problem for users as well: the TC mapping is rather important
>   for (devlink) shared buffer configuration and (ethtool) counters.
> 
> So instead, this patch set introduces a new Qdisc, which is based on
> 802.1Qaz wording. It is PRIO-like in how it is configured, meaning one
> needs to specify how many bands there are, how many are strict and how many
> are ETS, quanta for the latter, and priomap.
> 
> The new Qdisc operates like the PRIO / DRR combo would when configured as
> per the standard. The strict classes, if any, are tried for traffic first.
> When there's no traffic in any of the strict queues, the ETS ones (if any)
> are treated in the same way as in DRR.
> 
> The chosen interface makes the overall system both reasonably easy to
> configure, and reasonably easy to offload. The extra code to support ETS in
> mlxsw (which already supports PRIO) is about 150 lines, of which perhaps 20
> lines is bona fide new business logic.

Sorry, maybe an obvious question, but I couldn't sort it out. When the qdisc
is offloaded, if packets are sent via the software stack, do they also hit the
sw side qdisc enqueue logic? Or did I miss something in the graft logic that
then skips adding the qdisc to the software side? For example, taprio has
dequeue logic for both offload and software cases, but I don't see that here.

> 
> Credit-based shaping transmission selection algorithm can be configured by
> adding a CBS Qdisc under one of the strict bands (e.g. TBF can be used to a
> similar effect as well). As a non-work-conserving Qdisc, CBS can't be
> hooked under the ETS bands. This is detected and handled identically to DRR
> Qdisc at runtime. Note that offloading CBS is not subject of this patchset.

Any performance data showing how accurate we get on the software side? The
advantage of hardware always seemed to me to be precision in the WRR algorithm.
Also, data showing how much overhead we get hit with from the basic mq case
would help me understand if this is even useful for software or just an
exercise in building some offload logic.

FWIW I like the idea. I meant to write an ETS sw qdisc for years, with
the expectation that it could get close enough to the hardware offload case
for most use cases, all but those that really need <5% tolerance or something.

Thanks!
John

> 
> The patchset proceeds in four stages:
> 
> - Patches #1-#3 are cleanups.
> - Patches #4 and #5 contain the new Qdisc.
> - Patches #6 and #7 update mlxsw to offload the new Qdisc.
> - Patches #8-#10 add selftests for ETS.
> 
> Examples:
> 
> - Add a Qdisc with 6 bands, 3 strict and 3 ETS with 45%-30%-25% weights:
> 
>     # tc qdisc add dev swp1 root handle 1: \
> 	ets strict 3 quanta 4500 3000 2500 priomap 0 1 1 1 2 3 4 5
>     # tc qdisc sh dev swp1
>     qdisc ets 1: root refcnt 2 bands 6 strict 3 quanta 4500 3000 2500 priomap 0 1 1 1 2 3 4 5 5 5 5 5 5 5 5 5 
> 
> - Tweak quantum of one of the classes of the previous Qdisc:
> 
>     # tc class ch dev swp1 classid 1:4 ets quantum 1000
>     # tc qdisc sh dev swp1
>     qdisc ets 1: root refcnt 2 bands 6 strict 3 quanta 1000 3000 2500 priomap 0 1 1 1 2 3 4 5 5 5 5 5 5 5 5 5 
>     # tc class ch dev swp1 classid 1:3 ets quantum 1000
>     Error: Strict bands do not have a configurable quantum.
> 
> - Purely strict Qdisc with 1:1 mapping between priorities and TCs:
> 
>     # tc qdisc add dev swp1 root handle 1: \
> 	ets strict 8 priomap 7 6 5 4 3 2 1 0
>     # tc qdisc sh dev swp1
>     qdisc ets 1: root refcnt 2 bands 8 strict 8 priomap 7 6 5 4 3 2 1 0 7 7 7 7 7 7 7 7 
> 
> - Use "bands" to specify number of bands explicitly. Underspecified bands
>   are implicitly ETS and their quantum is taken from MTU. The following
>   thus gives each band the same weight:
> 
>     # tc qdisc add dev swp1 root handle 1: \
> 	ets bands 8 priomap 7 6 5 4 3 2 1 0
>     # tc qdisc sh dev swp1
>     qdisc ets 1: root refcnt 2 bands 8 quanta 1514 1514 1514 1514 1514 1514 1514 1514 priomap 7 6 5 4 3 2 1 0 7 7 7 7 7 7 7 7 
> 
> v2:
> - This addresses points raised by David Miller.
> - Patch #4:
>     - sch_ets.c: Add a comment with description of the Qdisc and the
>       dequeuing algorithm.
>     - Kconfig: Add a high-level description to the help blurb.
> 
> v1:
> - No changes, first upstream submission after RFC.
> 
> v3 (internal):
> - This addresses review from Jiri Pirko.
> - Patch #3:
>     - Rename to _HR_ instead of to _HIERARCHY_.
> - Patch #4:
>     - pkt_sched.h: Keep all the TCA_ETS_ constants in one enum.
>     - pkt_sched.h: Rename TCA_ETS_BANDS to _NBANDS, _STRICT to _NSTRICT,
>       _BAND_QUANTUM to _QUANTA_BAND and _PMAP_BAND to _PRIOMAP_BAND.
>     - sch_ets.c: Update to reflect the above changes. Add a new policy,
>       ets_class_policy, which is used when parsing class changes.
>       Currently that policy is the same as the quanta policy, but that
>       might change.
>     - sch_ets.c: Move MTU handling from ets_quantum_parse() to the one
>       caller that makes use of it.
>     - sch_ets.c: ets_qdisc_priomap_parse(): WARN_ON_ONCE on invalid
>       attribute instead of returning an extack.
> - Patch #6:
>     - __mlxsw_sp_qdisc_ets_replace(): Pass the weights argument to this
>       function in this patch already. Drop the weight computation.
>     - mlxsw_sp_qdisc_prio_replace(): Rename "quanta" to "zeroes" and
>       pass for the abovementioned "weights".
>     - mlxsw_sp_qdisc_prio_graft(): Convert to a wrapper around
>       __mlxsw_sp_qdisc_ets_graft(), instead of invoking the latter
>       directly from mlxsw_sp_setup_tc_prio().
>     - Update to follow the _HIERARCHY_ -> _HR_ renaming.
> - Patch #7:
>     - __mlxsw_sp_qdisc_ets_replace(): The "weights" argument passing and
>       weight computation removal are now done in a previous patch.
>     - mlxsw_sp_setup_tc_ets(): Drop case TC_ETS_REPLACE, which is handled
>       earlier in the function.
> - Patch #3 (iproute2):
>     - Add an example output to the commit message.
>     - tc-ets.8: Fix output of two examples.
>     - tc-ets.8: Describe default values of "bands", "quanta".
>     - q_ets.c: A number of fixes in error messages.
>     - q_ets.c: Comment formatting: /*padding*/ -> /* padding */
>     - q_ets.c: parse_nbands: Move duplicate checking to callers.
>     - q_ets.c: Don't accept both "quantum" and "quanta" as equivalent.
> 
> v2 (internal):
> - This addresses review from Ido Schimmel and comments from Alexander
>   Kushnarov.
> - Patch #2:
>     - s/coment/comment in the commit message.
> - Patch #4:
>     - sch_ets: ets_class_is_strict(), ets_class_id(): Constify an argument
>     - ets_class_find(): RXTify
> - Patch #3 (iproute2):
>     - tc-ets.8: some spelling fixes
>     - tc-ets.8: add another example
>     - tc.8: add an ETS to "CLASSFUL QDISCS" section
> 
> v1 (internal):
> - This addresses RFC reviews from Ido Schimmel and Roman Mashak, bugs found
>   by Alexander Petrovskiy and myself, and other improvements.
> - Patch #2:
>     - Expand the explanation with an explicit example.
> - Patch #4:
>     - Kconfig: s/sch_drr/sch_ets/
>     - sch_ets: Reorder includes to be in alphabetical order
>     - sch_ets: ets_quantum_parse(): Rename the return-pointer argument
>       from pquantum to quantum, and use it directly, not going through a
>       local temporary.
>     - sch_ets: ets_qdisc_quanta_parse(): Convert syntax of function
>       argument "quanta" from an array to a pointer.
>     - sch_ets: ets_qdisc_priomap_parse(): Likewise with "priomap".
>     - sch_ets: ets_qdisc_quanta_parse(), ets_qdisc_priomap_parse(): Invoke
>       __nla_validate_nested directly instead of nl80211_validate_nested().
>     - sch_ets: ets_qdisc_quanta_parse(): WARN_ON_ONCE on invalid attribute
>       instead of returning an extack.
>     - sch_ets: ets_qdisc_change(): Make the last band the default one for
>       unmentioned priomap priorities.
>     - sch_ets: Fix a panic when an offloaded child in a bandwidth-sharing
>       band notified its ETS parent.
>     - sch_ets: When ungrafting, add the newly-created invisible FIFO to
>       the Qdisc hash
> - Patch #5:
>     - pkt_cls.h: Note that quantum=0 signifies a strict band.
>     - Fix error path handling when ets_offload_dump() fails.
> - Patch #6:
>     - __mlxsw_sp_qdisc_ets_replace(): Convert syntax of function arguments
>       "quanta" and "priomap" from arrays to pointers.
> - Patch #7:
>     - __mlxsw_sp_qdisc_ets_replace(): Convert syntax of function argument
>       "weights" from an array to a pointer.
> - Patch #9:
>     - mlxsw/sch_ets.sh: Add a comment explaining packet prioritization.
>     - Adjust the whole suite to allow testing of traffic classifiers
>       in addition to testing priomap.
> - Patch #10:
>     - Add a number of new tests to test default priomap band, overlarge
>       number of bands, zeroes in quanta, and altogether missing quanta.
> - Patch #1 (iproute2):
>     - State motivation for inclusion of this patch in the patchset in the
>       commit message.
> - Patch #3 (iproute2):
>     - tc-ets.8: it is now December
>     - tc-ets.8: explain inactivity WRT using non-WC Qdiscs under ETS band
>     - tc-ets.8: s/flow/band in explanation of quantum
>     - tc-ets.8: explain what happens with priorities not covered by priomap
>     - tc-ets.8: default priomap band is now the last one
>     - q_ets.c: ets_parse_opt(): Remove unnecessary initialization of
>       priomap and quanta.
> 
> Petr Machata (10):
>   net: pkt_cls: Clarify a comment
>   mlxsw: spectrum_qdisc: Clarify a comment
>   mlxsw: spectrum: Rename MLXSW_REG_QEEC_HIERARCY_* enumerators
>   net: sch_ets: Add a new Qdisc
>   net: sch_ets: Make the ETS qdisc offloadable
>   mlxsw: spectrum_qdisc: Generalize PRIO offload to support ETS
>   mlxsw: spectrum_qdisc: Support offloading of ETS Qdisc
>   selftests: forwarding: Move start_/stop_traffic from mlxsw to lib.sh
>   selftests: forwarding: sch_ets: Add test coverage for ETS Qdisc
>   selftests: qdiscs: Add test coverage for ETS Qdisc
> 
>  drivers/net/ethernet/mellanox/mlxsw/reg.h     |  11 +-
>  .../net/ethernet/mellanox/mlxsw/spectrum.c    |  21 +-
>  .../net/ethernet/mellanox/mlxsw/spectrum.h    |   2 +
>  .../ethernet/mellanox/mlxsw/spectrum_dcb.c    |   8 +-
>  .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 219 +++-
>  include/linux/netdevice.h                     |   1 +
>  include/net/pkt_cls.h                         |  36 +-
>  include/uapi/linux/pkt_sched.h                |  17 +
>  net/sched/Kconfig                             |  17 +
>  net/sched/Makefile                            |   1 +
>  net/sched/sch_ets.c                           | 828 +++++++++++++++
>  .../selftests/drivers/net/mlxsw/qos_lib.sh    |  46 +-
>  .../selftests/drivers/net/mlxsw/sch_ets.sh    |  67 ++
>  tools/testing/selftests/net/forwarding/lib.sh |  18 +
>  .../selftests/net/forwarding/sch_ets.sh       |  44 +
>  .../selftests/net/forwarding/sch_ets_core.sh  | 300 ++++++
>  .../selftests/net/forwarding/sch_ets_tests.sh | 227 +++++
>  .../tc-testing/tc-tests/qdiscs/ets.json       | 940 ++++++++++++++++++
>  18 files changed, 2732 insertions(+), 71 deletions(-)
>  create mode 100644 net/sched/sch_ets.c
>  create mode 100755 tools/testing/selftests/drivers/net/mlxsw/sch_ets.sh
>  create mode 100755 tools/testing/selftests/net/forwarding/sch_ets.sh
>  create mode 100644 tools/testing/selftests/net/forwarding/sch_ets_core.sh
>  create mode 100644 tools/testing/selftests/net/forwarding/sch_ets_tests.sh
>  create mode 100644 tools/testing/selftests/tc-testing/tc-tests/qdiscs/ets.json
> 
> -- 
> 2.20.1
> 




* Re: [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS
  2019-12-18 16:22 ` [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS John Fastabend
@ 2019-12-18 18:35   ` Petr Machata
  2019-12-19 16:32     ` John Fastabend
  0 siblings, 1 reply; 17+ messages in thread
From: Petr Machata @ 2019-12-18 18:35 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev, David Miller, Roopa Prabhu, Jakub Kicinski, Roman Mashak,
	Ido Schimmel, Jiri Pirko


John Fastabend <john.fastabend@gmail.com> writes:

> Petr Machata wrote:
>> The IEEE standard 802.1Qaz (and 802.1Q-2014) specifies four principal
>> transmission selection algorithms: strict priority, credit-based shaper,
>> ETS (bandwidth sharing), and vendor-specific. All these have their
>> corresponding knobs in DCB. But DCB does not have interfaces to configure
>> RED and ECN, unlike Qdiscs.
>
> So the idea here (way back when I did this years ago) is that marking ECN
> traffic was not particularly CPU intensive on any metrics I came up with.
> And I don't recall anyone ever wanting to do RED here. The configuration
> I usually recommended was to use mqprio + SO_PRIORITY + fq per qdisc. Then
> once we got the BPF egress hook we replaced SO_PRIORITY configurations with
> the more dynamic BPF action to set it. There was never a compelling perf
> reason to offload red/ecn.
>
> But these use cases were edge nodes. I believe this series is mostly about
> control path and maybe some light control traffic? This is for switches
> not for edge nodes right? I'm guessing because I don't see any performance
> analysis on why this is useful; intuitively it makes sense if there is
> a small CPU sitting on a 48 port 10gbps box or something like that.

Yes.

Our particular use case is a switch that has throughput in Tbps. There
simply isn't enough bandwidth to even get all this traffic to the CPU,
let alone process it on the CPU. You need to offload, or it doesn't make
sense. 48 x 10Gbps with a small CPU is like that as well, yeah.

From what I hear, RED / ECN was not used very widely in these sorts of
deployments, rather the deal was to have more bandwidth than you need
and not worry about QoS. This is changing, and people experiment with
this stuff more. So there is interest in strict vs. DWRR TCs, shapers,
and RED / ECN.

>> In the Qdisc land, strict priority is implemented by PRIO. Credit-based
>> transmission selection algorithm can then be modeled by having e.g. TBF or
>> CBS Qdisc below some of the PRIO bands. ETS would then be modeled by
>> placing a DRR Qdisc under the last PRIO band.
>>
>> The problem with this approach is that DRR on its own, as well as the
>> combination of PRIO and DRR, are tricky to configure and tricky to offload
>> to 802.1Qaz-compliant hardware. This is due to several reasons:
>
> I would argue the tricky-to-configure part could be hidden behind tooling to
> simplify setup. The more annoying part is that it was stuck behind the qdisc
> lock. I was hoping this would implement a lockless ETS qdisc, seeing as we
> have the infra to do lockless qdiscs now. But it seems not. I guess software
> perf analysis might show prio+drr and ets here are about the same
> performance-wise.

Pretty sure. It's the same algorithm, and I would guess that the one
extra virtual call will not throw it off.

> offload is tricky with stacked qdiscs though ;)

Offload and configuration both.

Of course there could be a script to somehow generate and parse the
configuration on the front end, and some sort of library to consolidate
on the driver side, but it's far cleaner and easier to understand for
all involved if it's a Qdisc. Qdiscs are tricky, but people still
understand them well in comparison.

>> - As any classful Qdisc, DRR supports adding classifiers to decide in which
>>   class to enqueue packets. Unlike PRIO, there's however no fallback in the
>>   form of priomap. A way to achieve classification based on packet priority
>>   is e.g. like this:
>>
>>     # tc filter add dev swp1 root handle 1: \
>> 		basic match 'meta(priority eq 0)' flowid 1:10
>>
>>   Expressing the priomap in this manner however forces drivers to deep dive
>>   into the classifier block to parse the individual rules.
>>
>>   A possible solution would be to extend the classes with a "defmap" a la
>>   split / defmap mechanism of CBQ, and introduce this as a last resort
>>   classification. However, unlike priomap, this doesn't have the guarantee
>>   of covering all priorities. Traffic whose priority is not covered is
>>   dropped by DRR as unclassified. But ASICs tend to implement dropping in
>>   the ACL block, not in scheduling pipelines. The need to treat these
>>   configurations correctly (if only to decide to not offload at all)
>>   complicates a driver.
>>
>>   It's not clear how to retrofit priomap with all its benefits to DRR
>>   without changing it beyond recognition.
>>
>> - The interplay between PRIO and DRR is also causing problems. 802.1Qaz has
>>   all ETS TCs as a last resort. Switch ASICs that support ETS at all are
>>   likely to handle ETS traffic this way as well. However, the Linux model
>>   is more generic, allowing the DRR block in any band. Drivers would need
>>   to be careful to handle this case correctly, otherwise the offloaded
>>   model might not match the slow-path one.
>
> Yep, although cases already exist all over the offload side.
>>
>>   In a similar vein, PRIO and DRR need to agree on the list of priorities
>>   assigned to DRR. This is doubly problematic--the user needs to take care
>>   to keep the two in sync, and the driver needs to watch for any holes in
>>   DRR coverage and treat the traffic correctly, as discussed above.
>>
>>   Note that at the time that DRR Qdisc is added, it has no classes, and
>>   thus any priorities assigned to that PRIO band are not covered. Thus this
>>   case is surprisingly rather common, and needs to be handled gracefully by
>>   the driver.
>>
>> - Similarly due to DRR flexibility, when a Qdisc (such as RED) is attached
>>   below it, it is not immediately clear which TC the class represents. This
>>   is unlike PRIO with its straightforward classid scheme. When DRR is
>>   combined with PRIO, the relationship between classes and TCs gets even
>>   more murky.
>>
>>   This is a problem for users as well: the TC mapping is rather important
>>   for (devlink) shared buffer configuration and (ethtool) counters.
>>
>> So instead, this patch set introduces a new Qdisc, which is based on
>> 802.1Qaz wording. It is PRIO-like in how it is configured, meaning one
>> needs to specify how many bands there are, how many are strict and how many
>> are ETS, quanta for the latter, and priomap.
>>
>> The new Qdisc operates like the PRIO / DRR combo would when configured as
>> per the standard. The strict classes, if any, are tried for traffic first.
>> When there's no traffic in any of the strict queues, the ETS ones (if any)
>> are treated in the same way as in DRR.
>>
>> The chosen interface makes the overall system both reasonably easy to
>> configure, and reasonably easy to offload. The extra code to support ETS in
>> mlxsw (which already supports PRIO) is about 150 lines, of which perhaps 20
>> lines is bona fide new business logic.
>
> Sorry, maybe an obvious question, but I couldn't sort it out. When the qdisc
> is offloaded, if packets are sent via the software stack, do they also hit the
> sw side qdisc enqueue logic? Or did I miss something in the graft logic that
> then skips adding the qdisc to the software side? For example, taprio has
> dequeue logic for both offload and software cases, but I don't see that here.

You mean the graft logic in the driver? All that stuff is in there just
to figure out how to configure the device. SW datapath packets are
still handled as usual.

There even is a selftest for the SW datapath that uses veth pairs to
implement interconnect and TBF to throttle it (so that the scheduling
kicks in).

>>
>> Credit-based shaping transmission selection algorithm can be configured by
>> adding a CBS Qdisc under one of the strict bands (e.g. TBF can be used to a
>> similar effect as well). As a non-work-conserving Qdisc, CBS can't be
>> hooked under the ETS bands. This is detected and handled identically to DRR
>> Qdisc at runtime. Note that offloading CBS is not subject of this patchset.
>
> Any performance data showing how accurate we get on the software side? The
> advantage of hardware always seemed to me to be precision in the WRR algorithm.

Quantum is specified as a number of bytes allowed to dequeue before a
queue loses the medium. Over time, the amount of traffic dequeued from
individual queues should average out to the proportions given by the
quanta you specified. At any point in time, the size of the packets
matters: if I push 1000B packets into a 10000B-quantum queue, it will
use 100% of its allocation. If they are 800B packets, there will be some
waste (and it will compensate next round).
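
To make the packet-size effect concrete, here is a minimal sketch of
textbook deficit round robin in Python. It is not the sch_ets code: the
function name drr_rounds and the list-of-deques representation are made up
for illustration, and it assumes every band is visited once per round.

```python
from collections import deque


def drr_rounds(queues, quanta, rounds):
    """Run textbook deficit round robin for a fixed number of rounds.

    queues: list of deques holding packet sizes in bytes, one per band
    quanta: per-band quantum in bytes
    Returns the number of bytes dequeued from each band.
    (Illustrative helper, not taken from the kernel.)
    """
    deficits = [0] * len(queues)
    sent = [0] * len(queues)
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                # An idle band does not accumulate credit.
                deficits[i] = 0
                continue
            deficits[i] += quanta[i]
            # Dequeue while the head packet fits in the remaining deficit.
            while q and q[0] <= deficits[i]:
                size = q.popleft()
                deficits[i] -= size
                sent[i] += size
    return sent


# 800B packets against a 10000B quantum: 12 packets fit in round one.
print(drr_rounds([deque([800] * 100)], [10000], 1))
# Two rounds: the 400B left over is carried into the second round.
print(drr_rounds([deque([800] * 100)], [10000], 2))
```

With 800B packets the first round dequeues 9600B and leaves 400B of
deficit; the second round then dequeues 10400B, so two rounds add up to
exactly 20000B.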

As far as the Qdisc is defined, the SW side is as accurate as possible
under given traffic patterns. For HW, we translate to %, and rounding
might lead to artifacts. You kinda get the same deal with DCB, where
there's no way to split 100% among 8 TCs perfectly fairly.
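
As an illustration of the rounding, here is a sketch of one way to turn
quanta into integer percentages that always sum to 100. The
cumulative-rounding scheme is an assumption picked for the example, the
name quanta_to_percent is hypothetical, and this is not meant as a
description of what any particular driver does.

```python
def quanta_to_percent(quanta):
    """Convert per-band quanta to integer percent weights summing to 100.

    Rounds the cumulative share rather than each share on its own, so the
    rounding error is spread across bands instead of being lost.
    (Illustrative scheme, assumed for this example.)
    """
    total = sum(quanta)
    weights = []
    running = 0    # cumulative quantum seen so far
    prev_pct = 0   # cumulative percentage already handed out
    for q in quanta:
        running += q
        pct = running * 100 // total
        weights.append(pct - prev_pct)
        prev_pct = pct
    return weights


print(quanta_to_percent([4500, 3000, 2500]))
print(quanta_to_percent([1514] * 8))
```

Quanta of 4500 3000 2500 map cleanly to 45 30 25, whereas eight equal
quanta of 1514 come out as 12 13 12 13 12 13 12 13: equal in software,
slightly uneven once expressed in whole percents.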

> Also data showing how much overhead we get hit with from basic mq case
> would help me understand if this is even useful for software or just an
> exercise in building some offload logic.

So the Qdisc is written to do something reasonable in the SW datapath.
In that respect it's as useful as PRIO and DRR are. Not sure that as a
switch operator you really want to handle this much traffic on the CPU
though.

> FWIW I like the idea. I meant to write an ETS sw qdisc for years, with
> the expectation that it could get close enough to the hardware offload case
> for most use cases, all but those that really need <5% tolerance or something.


* Re: [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS
  2019-12-18 14:55 [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS Petr Machata
                   ` (10 preceding siblings ...)
  2019-12-18 16:22 ` [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS John Fastabend
@ 2019-12-18 23:16 ` David Miller
  11 siblings, 0 replies; 17+ messages in thread
From: David Miller @ 2019-12-18 23:16 UTC (permalink / raw)
  To: petrm; +Cc: netdev, roopa, jakub.kicinski, mrv, idosch, jiri

From: Petr Machata <petrm@mellanox.com>
Date: Wed, 18 Dec 2019 14:55:06 +0000

> The IEEE standard 802.1Qaz (and 802.1Q-2014) specifies four principal
> transmission selection algorithms: strict priority, credit-based shaper,
> ETS (bandwidth sharing), and vendor-specific. All these have their
> corresponding knobs in DCB. But DCB does not have interfaces to configure
> RED and ECN, unlike Qdiscs.
> 
> In the Qdisc land, strict priority is implemented by PRIO. Credit-based
> transmission selection algorithm can then be modeled by having e.g. TBF or
> CBS Qdisc below some of the PRIO bands. ETS would then be modeled by
> placing a DRR Qdisc under the last PRIO band.
> 
> The problem with this approach is that DRR on its own, as well as the
> combination of PRIO and DRR, are tricky to configure and tricky to offload
> to 802.1Qaz-compliant hardware. This is due to several reasons:
 ...
> So instead, this patch set introduces a new Qdisc, which is based on
> 802.1Qaz wording. It is PRIO-like in how it is configured, meaning one
> needs to specify how many bands there are, how many are strict and how many
> are ETS, quanta for the latter, and priomap.
 ...

Series applied, thanks.


* Re: [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS
  2019-12-18 18:35   ` Petr Machata
@ 2019-12-19 16:32     ` John Fastabend
  2019-12-19 17:46       ` Petr Machata
  0 siblings, 1 reply; 17+ messages in thread
From: John Fastabend @ 2019-12-19 16:32 UTC (permalink / raw)
  To: Petr Machata, John Fastabend
  Cc: netdev, David Miller, Roopa Prabhu, Jakub Kicinski, Roman Mashak,
	Ido Schimmel, Jiri Pirko

Petr Machata wrote:
> 
> John Fastabend <john.fastabend@gmail.com> writes:
> 
> > Petr Machata wrote:
> >> The IEEE standard 802.1Qaz (and 802.1Q-2014) specifies four principal
> >> transmission selection algorithms: strict priority, credit-based shaper,
> >> ETS (bandwidth sharing), and vendor-specific. All these have their
> >> corresponding knobs in DCB. But DCB does not have interfaces to configure
> >> RED and ECN, unlike Qdiscs.
> >
> > So the idea here (way back when I did this years ago) is that marking ECN
> > traffic was not particularly CPU intensive on any metrics I came up with.
> > And I don't recall anyone ever wanting to do RED here. The configuration
> > I usually recommended was to use mqprio + SO_PRIORITY + fq per qdisc. Then
> > once we got the BPF egress hook we replaced SO_PRIORITY configurations with
> > the more dynamic BPF action to set it. There was never a compelling perf
> > reason to offload red/ecn.
> >
> > But these use cases were edge nodes. I believe this series is mostly about
> > control path and maybe some light control traffic? This is for switches
> > not for edge nodes right? I'm guessing because I don't see any performance
> > analysis on why this is useful; intuitively it makes sense if there is
> > a small CPU sitting on a 48 port 10gbps box or something like that.
> 
> Yes.
> 
> Our particular use case is a switch that has throughput in Tbps. There
> simply isn't enough bandwidth to even get all this traffic to the CPU,
> let alone process it on the CPU. You need to offload, or it doesn't make
> sense. 48 x 10Gbps with a small CPU is like that as well, yeah.

Got it, so I suspect the primary usage will be offload then, at least
initially.

> 
> From what I hear, RED / ECN was not used very widely in these sorts of
> deployments, rather the deal was to have more bandwidth than you need
> and not worry about QoS. This is changing, and people experiment with
> this stuff more. So there is interest in strict vs. DWRR TCs, shapers,
> and RED / ECN.
> 
> >> In the Qdisc land, strict priority is implemented by PRIO. Credit-based
> >> transmission selection algorithm can then be modeled by having e.g. TBF or
> >> CBS Qdisc below some of the PRIO bands. ETS would then be modeled by
> >> placing a DRR Qdisc under the last PRIO band.
> >>
> >> The problem with this approach is that DRR on its own, as well as the
> >> combination of PRIO and DRR, are tricky to configure and tricky to offload
> >> to 802.1Qaz-compliant hardware. This is due to several reasons:
> >
> > I would argue the tricky-to-configure part could be hidden behind tooling
> > to simplify setup. The more annoying part is that it was stuck behind the
> > qdisc lock. I was hoping this would implement a lockless ETS qdisc, seeing
> > we have the infra to do lockless qdiscs now. But it seems not. I guess
> > software perf analysis might show prio+drr and ets here are about the same
> > performance-wise.
> 
> Pretty sure. It's the same algorithm, and I would guess that the one
> extra virtual call will not throw it off.

Yeah, small in comparison to other performance issues, I would guess.

> 
> > offload is tricky with stacked qdiscs though ;)
> 
> Offload and configuration both.
> 
> Of course there could be a script to somehow generate and parse the
> configuration on the front end, and some sort of library to consolidate
> on the driver side, but it's far cleaner and easier to understand for
> all involved if it's a Qdisc. Qdiscs are tricky, but people still
> understand them well in comparison.

At one point I wrote an app to sit on top of the tc netlink interface
and create common (at least for the customers at the time) setups. But
that tool is probably lost to history at this point.

I don't think it's particularly difficult to build this type of tool
on top of the API, but I'm also not against a new qdisc like this that
folds in a more concrete usage and aligns with a spec. And Dave
already merged it, so good to see ;)

[...]

> >> The chosen interface makes the overall system both reasonably easy to
> >> configure, and reasonably easy to offload. The extra code to support ETS in
> >> mlxsw (which already supports PRIO) is about 150 lines, of which perhaps 20
> >> lines is bona fide new business logic.
> >
> > Sorry, maybe an obvious question, but I couldn't sort it out. When the qdisc
> > is offloaded, if packets are sent via the software stack, do they also hit
> > the sw side qdisc enqueue logic? Or did I miss something in the graft logic
> > that then skips adding the qdisc to the software side? For example, taprio
> > has dequeue logic for both offload and software cases, but I don't see that
> > here.
> 
> You mean the graft logic in the driver? All that stuff is in there just
> to figure out how to configure the device. SW datapath packets are
> still handled as usual.

Got it, it just wasn't clear to me when viewing it from the software +
smartnic use case. So is there a bug, or maybe just a missing feature, where
if I offloaded this on a NIC, both software and hardware would do the ETS
algorithm? How about on the switch: would traffic from the CPU be ETS
classified both in software and in hardware? Or maybe the CPU uses a
different interface without offload on?

> 
> There even is a selftest for the SW datapath that uses veth pairs to
> implement interconnect and TBF to throttle it (so that the scheduling
> kicks in).

+1

> 
> >>
> >> Credit-based shaping transmission selection algorithm can be configured by
> >> adding a CBS Qdisc under one of the strict bands (e.g. TBF can be used to a
> >> similar effect as well). As a non-work-conserving Qdisc, CBS can't be
> >> hooked under the ETS bands. This is detected and handled identically to DRR
> >> Qdisc at runtime. Note that offloading CBS is not the subject of this patchset.
> >
> > Any performance data showing how accurate we get on the software side? To
> > me, the advantage of hardware always seemed to be precision in the WRR
> > algorithm.
> 
> Quantum is specified as a number of bytes allowed to dequeue before a
> queue loses the medium. Over time, the amount of traffic dequeued from
> individual queues should average out to be the quanta you specified. At
> any point in time, size of the packets matters: if I push 1000B packets
> into a 10000B-quantum queue, it will use 100% of its allocation. If they
> are 800B packets, there will be some waste (and it will compensate next
> round).
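The quantum arithmetic quoted above can be checked with a minimal sketch of
classic deficit round robin bookkeeping for a single, always-backlogged
queue. This is an illustration in Python, not the kernel's sch_ets code; the
function name `drr_bytes_per_round` and the fixed packet size are assumptions
made for the example.

```python
def drr_bytes_per_round(packet_size, quantum, rounds):
    """Bytes dequeued in each round from one always-backlogged DRR queue.

    Hypothetical helper for illustration; every packet has the same size.
    """
    deficit = 0
    sent_per_round = []
    for _ in range(rounds):
        deficit += quantum              # the queue earns its quantum each round
        sent = 0
        while packet_size <= deficit:   # dequeue while the head packet still fits
            deficit -= packet_size
            sent += packet_size
        sent_per_round.append(sent)     # any leftover deficit carries over
    return sent_per_round


if __name__ == "__main__":
    print(drr_bytes_per_round(1000, 10000, 4))
    print(drr_bytes_per_round(800, 10000, 4))
```

With 1000B packets every round uses the full 10000B. With 800B packets the
rounds alternate between 9600B and 10400B, so the 400B left unused in one
round is made up in the next and the long-run average is still the quantum.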
> 
> As far as the Qdisc is defined, the SW side is as accurate as possible
> under given traffic patterns. For HW, we translate to %, and rounding
> might lead to artifacts. You kinda get the same deal with DCB, where
> there's no way to split 100% among 8 TCs perfectly fairly.
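A small sketch of why translating quanta to percentages can introduce
rounding artifacts. The translation below (integer percent, rounded down) is
a hypothetical, naive scheme chosen for illustration; the thread does not
describe how mlxsw actually rounds, and `quanta_to_percent` is an invented
name.

```python
def quanta_to_percent(quanta):
    """Naive integer-percent weights for a list of quanta.

    Hypothetical translation for illustration only: each band gets
    floor(100 * quantum / sum of quanta).
    """
    total = sum(quanta)
    return [100 * q // total for q in quanta]


if __name__ == "__main__":
    equal = quanta_to_percent([1500] * 8)
    print(equal, "sum =", sum(equal))
    print(quanta_to_percent([3000, 1000]))
```

Eight equal quanta would each deserve 12.5%, which an integer percentage
cannot express: rounding down gives 12% each and leaves 4% unassigned, and
any way of handing out the remainder makes some bands unequal. A 3000/1000
split, by contrast, maps exactly to 75% and 25%.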
> 
> > Also data showing how much overhead we get hit with from basic mq case
> > would help me understand if this is even useful for software or just an
> > exercise in building some offload logic.
> 
> So the Qdisc is written to do something reasonable in the SW datapath.
> In that respect it's as useful as PRIO and DRR are. Not sure that as a
> switch operator you really want to handle this much traffic on the CPU
> though.

I was more thinking of using it in the smart nic case.

> 
> > FWIW I like the idea. I meant to write an ETS sw qdisc for years, with
> > the expectation that it could get close enough to the hardware offload case
> > for most use cases, all but those that really need <5% tolerance or something.

Anyway, thanks for the answers, that clears it up on my side. One remaining
question is whether packets sent by software get classified both in
software and in hardware. Might be worth thinking about fixing if
that is the case, or, probably more likely, the switch knows not to do
this.

Thanks,
John

* Re: [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS
  2019-12-19 16:32     ` John Fastabend
@ 2019-12-19 17:46       ` Petr Machata
  0 siblings, 0 replies; 17+ messages in thread
From: Petr Machata @ 2019-12-19 17:46 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev, David Miller, Roopa Prabhu, Jakub Kicinski, Roman Mashak,
	Ido Schimmel, Jiri Pirko


John Fastabend <john.fastabend@gmail.com> writes:

> Petr Machata wrote:
>>
>> John Fastabend <john.fastabend@gmail.com> writes:
>>
>> > Petr Machata wrote:
>> >> The IEEE standard 802.1Qaz (and 802.1Q-2014) specifies four principal
>> >> transmission selection algorithms: strict priority, credit-based shaper,
>> >> ETS (bandwidth sharing), and vendor-specific. All these have their
>> >> corresponding knobs in DCB. But DCB does not have interfaces to configure
>> >> RED and ECN, unlike Qdiscs.
>> >
>> > So the idea here (way back when I did this years ago) is that marking ECN
>> > traffic was not particularly CPU intensive on any metrics I came up with.
>> > And I don't recall anyone ever wanting to do RED here. The configuration
>> > I usually recommended was to use mqprio + SO_PRIORITY + fq per qdisc. Then
>> > once we got the BPF egress hook we replaced SO_PRIORITY configurations with
>> > the more dynamic BPF action to set it. There was never a compelling perf
>> > reason to offload red/ecn.
>> >
>> > But these use cases were edge nodes. I believe this series is mostly about
>> > control path and maybe some light control traffic? This is for switches
>> > not for edge nodes, right? I'm guessing because I don't see any performance
>> > analysis on why this is useful; intuitively it makes sense if there is
>> > a small CPU sitting on a 48 port 10Gbps box or something like that.
>>
>> Yes.
>>
>> Our particular use case is a switch that has throughput in Tbps. There
>> simply isn't enough bandwidth to even get all this traffic to the CPU,
>> let alone process it on the CPU. You need to offload, or it doesn't make
>> sense. 48 x 10Gbps with a small CPU is like that as well, yeah.
>
> Got it, so I suspect the primary usage will be offload then, at least
> initially.

Yes, particularly configuration of offloaded forwarding path.

>> > offload is tricky with stacked qdiscs though ;)
>>
>> Offload and configuration both.
>>
>> Of course there could be a script to somehow generate and parse the
>> configuration on the front end, and some sort of library to consolidate
>> on the driver side, but it's far cleaner and easier to understand for
>> all involved if it's a Qdisc. Qdiscs are tricky, but people still
>> understand them well in comparison.
>
> At one point I wrote an app to sit on top of the tc netlink interface
> and create common (at least for the customers at the time) setups. But
> that tool is probably lost to history at this point.
>
> I don't think it's particularly difficult to build this type of tool
> on top of the API, but I'm also not against a new qdisc like this that
> folds in a more concrete usage and aligns with a spec. And Dave
> already merged it, so good to see ;)
>
> [...]
>
>> >> The chosen interface makes the overall system both reasonably easy to
>> >> configure, and reasonably easy to offload. The extra code to support ETS in
>> >> mlxsw (which already supports PRIO) is about 150 lines, of which perhaps 20
>> >> lines is bona fide new business logic.
>> >
>> > Sorry, maybe an obvious question, but I couldn't sort it out. When the qdisc
>> > is offloaded, if packets are sent via the software stack, do they also hit
>> > the sw side qdisc enqueue logic? Or did I miss something in the graft logic
>> > that then skips adding the qdisc to the software side? For example, taprio
>> > has dequeue logic for both offload and software cases, but I don't see that
>> > here.
>>
>> You mean the graft logic in the driver? All that stuff is in there just
>> to figure out how to configure the device. SW datapath packets are
>> still handled as usual.
>
> Got it, it just wasn't clear to me when viewing it from the software +
> smartnic use case. So is there a bug, or maybe just a missing feature, where
> if I offloaded this on a NIC, both software and hardware would do the ETS
> algorithm? How about on the switch: would traffic from the CPU be ETS
> classified both in software and in hardware? Or maybe the CPU uses a
> different interface without offload on?

You would get SW scheduling if there's more traffic than the host
interface can handle.

In the HW, control traffic gets TC 16, which the chip hardcodes as the
highest priority and handles in a dedicated set of queues. So there's no
second classification.

>> >> Credit-based shaping transmission selection algorithm can be configured by
>> >> adding a CBS Qdisc under one of the strict bands (e.g. TBF can be used to a
>> >> similar effect as well). As a non-work-conserving Qdisc, CBS can't be
>> >> hooked under the ETS bands. This is detected and handled identically to DRR
>> >> Qdisc at runtime. Note that offloading CBS is not the subject of this patchset.
>> >
>> > Any performance data showing how accurate we get on the software side? To
>> > me, the advantage of hardware always seemed to be precision in the WRR
>> > algorithm.
>>
>> Quantum is specified as a number of bytes allowed to dequeue before a
>> queue loses the medium. Over time, the amount of traffic dequeued from
>> individual queues should average out to be the quanta you specified. At
>> any point in time, size of the packets matters: if I push 1000B packets
>> into a 10000B-quantum queue, it will use 100% of its allocation. If they
>> are 800B packets, there will be some waste (and it will compensate next
>> round).
>>
>> As far as the Qdisc is defined, the SW side is as accurate as possible
>> under given traffic patterns. For HW, we translate to %, and rounding
>> might lead to artifacts. You kinda get the same deal with DCB, where
>> there's no way to split 100% among 8 TCs perfectly fairly.
>>
>> > Also data showing how much overhead we get hit with from basic mq case
>> > would help me understand if this is even useful for software or just an
>> > exercise in building some offload logic.
>>
>> So the Qdisc is written to do something reasonable in the SW datapath.
>> In that respect it's as useful as PRIO and DRR are. Not sure that as a
>> switch operator you really want to handle this much traffic on the CPU
>> though.
>
> I was more thinking of using it in the smart nic case.

I'm not really familiar with this.

I can imagine some knobs that map the individual bands to NIC queues for
example. I think that's something that mlxsw_spectrum could actually
use. We do have several queues to the chip, and currently round-robin
them by hand in the driver. Logic to determine which queues to use for
which traffic seems to make sense. But currently we simply don't see
these use cases at all.

>> > FWIW I like the idea. I meant to write an ETS sw qdisc for years, with
>> > the expectation that it could get close enough to the hardware offload case
>> > for most use cases, all but those that really need <5% tolerance or something.
>
> Anyway, thanks for the answers, that clears it up on my side. One remaining
> question is whether packets sent by software get classified both in
> software and in hardware. Might be worth thinking about fixing if
> that is the case, or, probably more likely, the switch knows not to do
> this.

Yeah, traffic from the CPU is handled specially.

Thread overview: 17+ messages
2019-12-18 14:55 [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS Petr Machata
2019-12-18 14:55 ` [PATCH net-next mlxsw v2 01/10] net: pkt_cls: Clarify a comment Petr Machata
2019-12-18 14:55 ` [PATCH net-next mlxsw v2 02/10] mlxsw: spectrum_qdisc: " Petr Machata
2019-12-18 14:55 ` [PATCH net-next mlxsw v2 03/10] mlxsw: spectrum: Rename MLXSW_REG_QEEC_HIERARCY_* enumerators Petr Machata
2019-12-18 14:55 ` [PATCH net-next mlxsw v2 04/10] net: sch_ets: Add a new Qdisc Petr Machata
2019-12-18 15:02   ` Jiri Pirko
2019-12-18 14:55 ` [PATCH net-next mlxsw v2 05/10] net: sch_ets: Make the ETS qdisc offloadable Petr Machata
2019-12-18 14:55 ` [PATCH net-next mlxsw v2 06/10] mlxsw: spectrum_qdisc: Generalize PRIO offload to support ETS Petr Machata
2019-12-18 14:55 ` [PATCH net-next mlxsw v2 07/10] mlxsw: spectrum_qdisc: Support offloading of ETS Qdisc Petr Machata
2019-12-18 14:55 ` [PATCH net-next mlxsw v2 08/10] selftests: forwarding: Move start_/stop_traffic from mlxsw to lib.sh Petr Machata
2019-12-18 14:55 ` [PATCH net-next mlxsw v2 09/10] selftests: forwarding: sch_ets: Add test coverage for ETS Qdisc Petr Machata
2019-12-18 14:55 ` [PATCH net-next mlxsw v2 10/10] selftests: qdiscs: " Petr Machata
2019-12-18 16:22 ` [PATCH net-next mlxsw v2 00/10] Add a new Qdisc, ETS John Fastabend
2019-12-18 18:35   ` Petr Machata
2019-12-19 16:32     ` John Fastabend
2019-12-19 17:46       ` Petr Machata
2019-12-18 23:16 ` David Miller
