netdev.vger.kernel.org archive mirror
* [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children
@ 2023-06-02 10:37 Vladimir Oltean
  2023-06-02 10:37 ` [PATCH RESEND net-next 1/5] net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during attach() Vladimir Oltean
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-02 10:37 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

[ Original patch set was lost due to an apparent transient problem with
kernel.org's DNSBL setup. This is an identical resend. ]

Prompted by Vinicius' request to consolidate some child Qdisc
dereferences in taprio:
https://lore.kernel.org/netdev/87edmxv7x2.fsf@intel.com/

I remembered that I had left some unfinished work in this Qdisc, namely
commit af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see
the per-netdev-queue pfifo child qdiscs"").

This patch set represents another stab at, essentially, what's in the
title. Not only does taprio not properly detect when it's grafted as a
non-root qdisc, but it also returns incorrect per-class stats.
Eventually, Vinicius' request is addressed too, although in a different
form than the one he requested (which was purely cosmetic).

Review from people more experienced with Qdiscs than me would be
appreciated. I tried my best to explain what I consider to be problems.
I am deliberately targeting net-next because the changes are too
invasive for net - they were reverted from stable once already.

Vladimir Oltean (5):
  net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during
    attach()
  net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload
    mode
  net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf()
  net/sched: taprio: delete misleading comment about preallocating child
    qdiscs
  net/sched: taprio: dump class stats for the actual q->qdiscs[]

 net/sched/sch_taprio.c | 60 ++++++++++++++++++++++++------------------
 1 file changed, 35 insertions(+), 25 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RESEND net-next 1/5] net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during attach()
  2023-06-02 10:37 [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children Vladimir Oltean
@ 2023-06-02 10:37 ` Vladimir Oltean
  2023-06-02 10:37 ` [PATCH RESEND net-next 2/5] net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode Vladimir Oltean
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-02 10:37 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

This is a simple code transformation with no intended behavior change,
just to make it absolutely clear that q->qdiscs[] is only attached to
the child taprio classes in full offload mode.

Right now we use the q->qdiscs[] variable in taprio_attach() for
software mode too, but that is quite confusing and avoidable. We use
it only to reach the netdev TX queue, but we could as well just use
netdev_get_tx_queue() for that.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 3c4c2c334878..b1c611c72aa4 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -2130,14 +2130,20 @@ static void taprio_attach(struct Qdisc *sch)
 
 	/* Attach underlying qdisc */
 	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
-		struct Qdisc *qdisc = q->qdiscs[ntx];
+		struct netdev_queue *dev_queue = netdev_get_tx_queue(dev, ntx);
 		struct Qdisc *old;
 
 		if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
+			struct Qdisc *qdisc = q->qdiscs[ntx];
+
 			qdisc->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
-			old = dev_graft_qdisc(qdisc->dev_queue, qdisc);
+			old = dev_graft_qdisc(dev_queue, qdisc);
 		} else {
-			old = dev_graft_qdisc(qdisc->dev_queue, sch);
+			/* In software mode, attach the root taprio qdisc
+			 * to all netdev TX queues, so that dev_qdisc_enqueue()
+			 * goes through taprio_enqueue().
+			 */
+			old = dev_graft_qdisc(dev_queue, sch);
 			qdisc_refcount_inc(sch);
 		}
 		if (old)
-- 
2.34.1



* [PATCH RESEND net-next 2/5] net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode
  2023-06-02 10:37 [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children Vladimir Oltean
  2023-06-02 10:37 ` [PATCH RESEND net-next 1/5] net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during attach() Vladimir Oltean
@ 2023-06-02 10:37 ` Vladimir Oltean
  2023-06-06 10:27   ` Paolo Abeni
  2023-06-02 10:37 ` [PATCH RESEND net-next 3/5] net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf() Vladimir Oltean
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-02 10:37 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

The taprio qdisc currently lives in the following equilibrium.

In the software scheduling case, taprio attaches itself to all TXQs,
thus having a refcount of 1 + the number of TX queues. In this mode,
q->qdiscs[] is not visible directly to the Qdisc API. The lifetime of
the Qdiscs from this private array lasts until qdisc_destroy() ->
taprio_destroy().

In the fully offloaded case, the root taprio has a refcount of 1, and
all child q->qdiscs[] also have a refcount of 1. The child q->qdiscs[]
are visible to the Qdisc API (they are attached to the netdev TXQs
directly), however taprio loses a reference to them very early - during
qdisc_graft(parent==NULL) -> taprio_attach(). At that time, taprio frees
the q->qdiscs[] array to not leak memory, but interestingly, it does not
release a reference on these qdiscs because it doesn't effectively own
them - they are created by taprio but owned by the Qdisc core, and will
be freed by qdisc_graft(parent==NULL, new==NULL) -> qdisc_put(old) when
the Qdisc is deleted or when the child Qdisc is replaced with something
else.

My interest is to change this equilibrium such that taprio also owns a
reference on the q->qdiscs[] child Qdiscs for the lifetime of the root
Qdisc, including in full offload mode. I want this because I would like
taprio_leaf(), taprio_dump_class(), taprio_dump_class_stats() to have
insight into q->qdiscs[] for the software scheduling mode - currently
they look at dev_queue->qdisc_sleeping, which is, as mentioned, the same
as the root taprio.

The following set of changes is necessary:
- don't free q->qdiscs[] early in taprio_attach(), free it late in
  taprio_destroy() for consistency with software mode. But:
- currently that's not possible, because taprio doesn't own a reference
  on q->qdiscs[]. So hold that reference - once during the initial
  attach() and once during subsequent graft() calls when the child is
  changed.
- always keep track of the current child in q->qdiscs[], even for full
  offload mode, so that we free in taprio_destroy() what we should, and
  not something stale.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index b1c611c72aa4..8807fc915b79 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -2138,23 +2138,20 @@ static void taprio_attach(struct Qdisc *sch)
 
 			qdisc->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
 			old = dev_graft_qdisc(dev_queue, qdisc);
+			/* Keep refcount of q->qdiscs[ntx] at 2 */
+			qdisc_refcount_inc(qdisc);
 		} else {
 			/* In software mode, attach the root taprio qdisc
 			 * to all netdev TX queues, so that dev_qdisc_enqueue()
 			 * goes through taprio_enqueue().
 			 */
 			old = dev_graft_qdisc(dev_queue, sch);
+			/* Keep root refcount at 1 + num_tx_queues */
 			qdisc_refcount_inc(sch);
 		}
 		if (old)
 			qdisc_put(old);
 	}
-
-	/* access to the child qdiscs is not needed in offload mode */
-	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
-		kfree(q->qdiscs);
-		q->qdiscs = NULL;
-	}
 }
 
 static struct netdev_queue *taprio_queue_get(struct Qdisc *sch,
@@ -2183,15 +2180,24 @@ static int taprio_graft(struct Qdisc *sch, unsigned long cl,
 	if (dev->flags & IFF_UP)
 		dev_deactivate(dev);
 
-	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
+	/* In software mode, the root taprio qdisc is still the one attached to
+	 * all netdev TX queues, and hence responsible for taprio_enqueue() to
+	 * forward the skbs to the child qdiscs from the private q->qdiscs[]
+	 * array. So only attach the new qdisc to the netdev queue in offload
+	 * mode, where the enqueue must bypass taprio. However, save the
+	 * reference to the new qdisc in the private array in both cases, to
+	 * have an up-to-date reference to our children.
+	 */
+	if (FULL_OFFLOAD_IS_ENABLED(q->flags))
 		*old = dev_graft_qdisc(dev_queue, new);
-	} else {
+	else
 		*old = q->qdiscs[cl - 1];
-		q->qdiscs[cl - 1] = new;
-	}
 
-	if (new)
+	q->qdiscs[cl - 1] = new;
+	if (new) {
+		qdisc_refcount_inc(new);
 		new->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
+	}
 
 	if (dev->flags & IFF_UP)
 		dev_activate(dev);
-- 
2.34.1



* [PATCH RESEND net-next 3/5] net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf()
  2023-06-02 10:37 [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children Vladimir Oltean
  2023-06-02 10:37 ` [PATCH RESEND net-next 1/5] net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during attach() Vladimir Oltean
  2023-06-02 10:37 ` [PATCH RESEND net-next 2/5] net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode Vladimir Oltean
@ 2023-06-02 10:37 ` Vladimir Oltean
  2023-06-02 10:37 ` [PATCH RESEND net-next 4/5] net/sched: taprio: delete misleading comment about preallocating child qdiscs Vladimir Oltean
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-02 10:37 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

This is another stab at commit 1461d212ab27 ("net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"), later
reverted in commit af7b29b1deaa ("Revert "net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"").

I believe that the problems that caused the revert were fixed, and thus,
this change is identical to the original patch.

Its purpose is to properly reject attaching a software taprio child
qdisc to a software taprio parent. Because unoffloaded taprio currently
reports itself (the root Qdisc) as the return value from qdisc_leaf(),
the process of attaching another taprio as child to a Qdisc class
of the root will just result in a Qdisc_ops :: change() call for the
root, whereas that's not what we want. We want Qdisc_ops :: init() to be
called for the taprio child, in order to give the taprio child a chance
to check whether its sch->parent is TC_H_ROOT or not (and reject this
configuration).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 8807fc915b79..263306fe38d8 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -2433,12 +2433,14 @@ static int taprio_dump(struct Qdisc *sch, struct sk_buff *skb)
 
 static struct Qdisc *taprio_leaf(struct Qdisc *sch, unsigned long cl)
 {
-	struct netdev_queue *dev_queue = taprio_queue_get(sch, cl);
+	struct taprio_sched *q = qdisc_priv(sch);
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned int ntx = cl - 1;
 
-	if (!dev_queue)
+	if (ntx >= dev->num_tx_queues)
 		return NULL;
 
-	return dev_queue->qdisc_sleeping;
+	return q->qdiscs[ntx];
 }
 
 static unsigned long taprio_find(struct Qdisc *sch, u32 classid)
-- 
2.34.1



* [PATCH RESEND net-next 4/5] net/sched: taprio: delete misleading comment about preallocating child qdiscs
  2023-06-02 10:37 [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children Vladimir Oltean
                   ` (2 preceding siblings ...)
  2023-06-02 10:37 ` [PATCH RESEND net-next 3/5] net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf() Vladimir Oltean
@ 2023-06-02 10:37 ` Vladimir Oltean
  2023-06-02 10:37 ` [PATCH RESEND net-next 5/5] net/sched: taprio: dump class stats for the actual q->qdiscs[] Vladimir Oltean
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-02 10:37 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

As mentioned in commit af7b29b1deaa ("Revert "net/sched: taprio: make
qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"") - unlike
mqprio, taprio doesn't use q->qdiscs[] only as a temporary transport
between Qdisc_ops :: init() and Qdisc_ops :: attach().

Delete the comment, which was simply copied over from mqprio; there,
the usage patterns are quite different, so here it is nothing but
confusing.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 263306fe38d8..cc7ff98e5e86 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -2084,11 +2084,8 @@ static int taprio_init(struct Qdisc *sch, struct nlattr *opt,
 		return -EOPNOTSUPP;
 	}
 
-	/* pre-allocate qdisc, attachment can't fail */
-	q->qdiscs = kcalloc(dev->num_tx_queues,
-			    sizeof(q->qdiscs[0]),
+	q->qdiscs = kcalloc(dev->num_tx_queues, sizeof(q->qdiscs[0]),
 			    GFP_KERNEL);
-
 	if (!q->qdiscs)
 		return -ENOMEM;
 
-- 
2.34.1



* [PATCH RESEND net-next 5/5] net/sched: taprio: dump class stats for the actual q->qdiscs[]
  2023-06-02 10:37 [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children Vladimir Oltean
                   ` (3 preceding siblings ...)
  2023-06-02 10:37 ` [PATCH RESEND net-next 4/5] net/sched: taprio: delete misleading comment about preallocating child qdiscs Vladimir Oltean
@ 2023-06-02 10:37 ` Vladimir Oltean
  2023-06-08 18:44   ` Jamal Hadi Salim
  2023-06-05 12:50 ` [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children Vladimir Oltean
  2023-06-05 15:44 ` Jamal Hadi Salim
  6 siblings, 1 reply; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-02 10:37 UTC (permalink / raw)
  To: netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

This makes a difference for the software scheduling mode, where
dev_queue->qdisc_sleeping is the same as the taprio root Qdisc itself,
but when we're talking about what Qdisc and stats get reported for a
traffic class, the root taprio isn't what comes to mind, but q->qdiscs[]
is.

To understand the difference, I've attempted to send 100 packets in
software mode through traffic class 0 (they are in the Qdisc's backlog),
and recorded the stats before and after the change.

Here is before:

$ tc -s class show dev eth0
class taprio 8001:1 root leaf 8001:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 9400b 100p requeues 0
 Window drops: 0
class taprio 8001:2 root leaf 8001:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 9400b 100p requeues 0
 Window drops: 0
class taprio 8001:3 root leaf 8001:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 9400b 100p requeues 0
 Window drops: 0
class taprio 8001:4 root leaf 8001:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 9400b 100p requeues 0
 Window drops: 0
class taprio 8001:5 root leaf 8001:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 9400b 100p requeues 0
 Window drops: 0
class taprio 8001:6 root leaf 8001:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 9400b 100p requeues 0
 Window drops: 0
class taprio 8001:7 root leaf 8001:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 9400b 100p requeues 0
 Window drops: 0
class taprio 8001:8 root leaf 8001:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 9400b 100p requeues 0
 Window drops: 0

and here is after:

class taprio 8001:1 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 9400b 100p requeues 0
 Window drops: 0
class taprio 8001:2 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 Window drops: 0
class taprio 8001:3 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 Window drops: 0
class taprio 8001:4 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 Window drops: 0
class taprio 8001:5 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 Window drops: 0
class taprio 8001:6 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 Window drops: 0
class taprio 8001:7 root leaf 8010:
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 Window drops: 0
class taprio 8001:8 root
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 Window drops: 0

The most glaring (and expected) difference is that before, all class
stats reported the global stats, whereas now, they really report just
the counters for that traffic class.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/sched/sch_taprio.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index cc7ff98e5e86..23b98c3af8b2 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -2452,11 +2452,11 @@ static unsigned long taprio_find(struct Qdisc *sch, u32 classid)
 static int taprio_dump_class(struct Qdisc *sch, unsigned long cl,
 			     struct sk_buff *skb, struct tcmsg *tcm)
 {
-	struct netdev_queue *dev_queue = taprio_queue_get(sch, cl);
+	struct Qdisc *child = taprio_leaf(sch, cl);
 
 	tcm->tcm_parent = TC_H_ROOT;
 	tcm->tcm_handle |= TC_H_MIN(cl);
-	tcm->tcm_info = dev_queue->qdisc_sleeping->handle;
+	tcm->tcm_info = child->handle;
 
 	return 0;
 }
@@ -2466,8 +2466,7 @@ static int taprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
 	__releases(d->lock)
 	__acquires(d->lock)
 {
-	struct netdev_queue *dev_queue = taprio_queue_get(sch, cl);
-	struct Qdisc *child = dev_queue->qdisc_sleeping;
+	struct Qdisc *child = taprio_leaf(sch, cl);
 	struct tc_taprio_qopt_offload offload = {
 		.cmd = TAPRIO_CMD_TC_STATS,
 		.tc_stats = {
-- 
2.34.1



* Re: [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children
  2023-06-02 10:37 [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children Vladimir Oltean
                   ` (4 preceding siblings ...)
  2023-06-02 10:37 ` [PATCH RESEND net-next 5/5] net/sched: taprio: dump class stats for the actual q->qdiscs[] Vladimir Oltean
@ 2023-06-05 12:50 ` Vladimir Oltean
  2023-06-05 13:27   ` Simon Horman
  2023-06-05 15:44 ` Jamal Hadi Salim
  6 siblings, 1 reply; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-05 12:50 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Vinicius Costa Gomes, linux-kernel, intel-wired-lan,
	Muhammad Husaini Zulkifli, Peilin Ye, Pedro Tammela

Hi,

On Fri, Jun 02, 2023 at 01:37:45PM +0300, Vladimir Oltean wrote:
> [ Original patch set was lost due to an apparent transient problem with
> kernel.org's DNSBL setup. This is an identical resend. ]
> 
> Prompted by Vinicius' request to consolidate some child Qdisc
> dereferences in taprio:
> https://lore.kernel.org/netdev/87edmxv7x2.fsf@intel.com/
> 
> I remembered that I had left some unfinished work in this Qdisc, namely
> commit af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see
> the per-netdev-queue pfifo child qdiscs"").
> 
> This patch set represents another stab at, essentially, what's in the
> title. Not only does taprio not properly detect when it's grafted as a
> non-root qdisc, but it also returns incorrect per-class stats.
> Eventually, Vinicius' request is addressed too, although in a different
> form than the one he requested (which was purely cosmetic).
> 
> Review from people more experienced with Qdiscs than me would be
> appreciated. I tried my best to explain what I consider to be problems.
> I am deliberately targeting net-next because the changes are too
> invasive for net - they were reverted from stable once already.

I noticed that this patch set has "Changes Requested" in patchwork.

I can't completely exclude the fact that maybe someone has requested
some changes to be made, but there is no email in my inbox to that end,
and for that matter, neither patchwork nor the email archive has
processed any responses to this thread.


* Re: [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children
  2023-06-05 12:50 ` [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children Vladimir Oltean
@ 2023-06-05 13:27   ` Simon Horman
  0 siblings, 0 replies; 25+ messages in thread
From: Simon Horman @ 2023-06-05 13:27 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Vinicius Costa Gomes, linux-kernel, intel-wired-lan,
	Muhammad Husaini Zulkifli, Peilin Ye, Pedro Tammela

On Mon, Jun 05, 2023 at 03:50:42PM +0300, Vladimir Oltean wrote:
> Hi,
> 
> On Fri, Jun 02, 2023 at 01:37:45PM +0300, Vladimir Oltean wrote:
> > [ Original patch set was lost due to an apparent transient problem with
> > kernel.org's DNSBL setup. This is an identical resend. ]
> > 
> > Prompted by Vinicius' request to consolidate some child Qdisc
> > dereferences in taprio:
> > https://lore.kernel.org/netdev/87edmxv7x2.fsf@intel.com/
> > 
> > I remembered that I had left some unfinished work in this Qdisc, namely
> > commit af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see
> > the per-netdev-queue pfifo child qdiscs"").
> > 
> > This patch set represents another stab at, essentially, what's in the
> > title. Not only does taprio not properly detect when it's grafted as a
> > non-root qdisc, but it also returns incorrect per-class stats.
> > Eventually, Vinicius' request is addressed too, although in a different
> > form than the one he requested (which was purely cosmetic).
> > 
> > Review from people more experienced with Qdiscs than me would be
> > appreciated. I tried my best to explain what I consider to be problems.
> > I am deliberately targeting net-next because the changes are too
> > invasive for net - they were reverted from stable once already.
> 
> I noticed that this patch set has "Changes Requested" in patchwork.
> 
> I can't completely exclude the fact that maybe someone has requested
> some changes to be made, but there is no email in my inbox to that end,
> and for that matter, neither did patchwork or the email archive process
> any responses to this thread.

I concur. Let's see if this sets it to "Under Review".

-- 
pw-bot: under-review



* Re: [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children
  2023-06-02 10:37 [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children Vladimir Oltean
                   ` (5 preceding siblings ...)
  2023-06-05 12:50 ` [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children Vladimir Oltean
@ 2023-06-05 15:44 ` Jamal Hadi Salim
  2023-06-05 16:53   ` Vladimir Oltean
  6 siblings, 1 reply; 25+ messages in thread
From: Jamal Hadi Salim @ 2023-06-05 15:44 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

On Fri, Jun 2, 2023 at 6:38 AM Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
>
> [ Original patch set was lost due to an apparent transient problem with
> kernel.org's DNSBL setup. This is an identical resend. ]
>
> Prompted by Vinicius' request to consolidate some child Qdisc
> dereferences in taprio:
> https://lore.kernel.org/netdev/87edmxv7x2.fsf@intel.com/
>
> I remembered that I had left some unfinished work in this Qdisc, namely
> commit af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see
> the per-netdev-queue pfifo child qdiscs"").
>
> This patch set represents another stab at, essentially, what's in the
> title. Not only does taprio not properly detect when it's grafted as a
> non-root qdisc, but it also returns incorrect per-class stats.
> Eventually, Vinicius' request is addressed too, although in a different
> form than the one he requested (which was purely cosmetic).
>
> Review from people more experienced with Qdiscs than me would be
> appreciated. I tried my best to explain what I consider to be problems.

I haven't been following - but if you show me sample intended tc
configs for both s/w and hardware offloads I can comment.

In my cursory look I assumed you wanted to go along the path of mqprio,
where nothing much happens in the s/w datapath other than requeues
when the tx hardware path is busy (notice it is missing
enqueue/dequeue ops). In that case the hardware selection is
essentially of a DMA ring based on skb tags. It seems you took it up a
notch by in fact having a choice between a pure s/w and an offload path.

cheers,
jamal
> I am deliberately targeting net-next because the changes are too
> invasive for net - they were reverted from stable once already.
>
> Vladimir Oltean (5):
>   net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during
>     attach()
>   net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload
>     mode
>   net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf()
>   net/sched: taprio: delete misleading comment about preallocating child
>     qdiscs
>   net/sched: taprio: dump class stats for the actual q->qdiscs[]
>
>  net/sched/sch_taprio.c | 60 ++++++++++++++++++++++++------------------
>  1 file changed, 35 insertions(+), 25 deletions(-)
>
> --
> 2.34.1
>


* Re: [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children
  2023-06-05 15:44 ` Jamal Hadi Salim
@ 2023-06-05 16:53   ` Vladimir Oltean
  2023-06-06 15:39     ` Jamal Hadi Salim
  0 siblings, 1 reply; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-05 16:53 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

Hi Jamal,

On Mon, Jun 05, 2023 at 11:44:17AM -0400, Jamal Hadi Salim wrote:
> I havent been following - but if you show me sample intended tc
> configs for both s/w and hardware offloads i can comment.

There is not much difference in usage between the 2 modes. IMO the software
data path logic is only a simulation for demonstrative purposes of what the
shaper is intended to do. If hardware offload is available, it is always
preferable. Otherwise, I'm not sure if anyone uses the pure software
scheduling mode (also without txtime assist) for a real-life use case.

I was working with something like this for testing the code paths affected
by these changes:

#!/bin/bash

add_taprio()
{
	local offload=$1
	local extra_flags

	case $offload in
	true)
		extra_flags="flags 0x2"
		;;
	false)
		extra_flags="clockid CLOCK_TAI"
		;;
	esac

	tc qdisc replace dev eno0 handle 8001: parent root stab overhead 24 taprio \
		num_tc 8 \
		map 0 1 2 3 4 5 6 7 \
		queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
		max-sdu 0 0 0 0 0 200 0 0 \
		base-time 200 \
		sched-entry S 80 20000 \
		sched-entry S a0 20000 \
		sched-entry S 5f 60000 \
		$extra_flags
}

add_cbs()
{
	local offload=$1
	local extra_flags

	case $offload in
	true)
		extra_flags="offload 1"
		;;
	false)
		extra_flags=""
		;;
	esac

	max_frame_size=1500
	data_rate_kbps=20000
	port_transmit_rate_kbps=1000000
	idleslope=$data_rate_kbps
	sendslope=$(($idleslope - $port_transmit_rate_kbps))
	locredit=$(($max_frame_size * $sendslope / $port_transmit_rate_kbps))
	hicredit=$(($max_frame_size * $idleslope / $port_transmit_rate_kbps))
	tc qdisc replace dev eno0 parent 8001:8 cbs \
		idleslope $idleslope \
		sendslope $sendslope \
		hicredit $hicredit \
		locredit $locredit \
		$extra_flags
}

# this should always fail
add_second_taprio()
{
	tc qdisc replace dev eno0 parent 8001:7 taprio \
		num_tc 8 \
		map 0 1 2 3 4 5 6 7 \
		queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
		max-sdu 0 0 0 0 0 200 0 0 \
		base-time 200 \
		sched-entry S 80 20000 \
		sched-entry S a0 20000 \
		sched-entry S 5f 60000 \
		clockid CLOCK_TAI
}

ip link set eno0 up

echo "Offload:"
add_taprio true
add_cbs true
add_second_taprio
mausezahn eno0 -t ip -b 00:04:9f:05:f6:27 -c 100 -p 60
sleep 5
tc -s class show dev eno0
tc qdisc del dev eno0 root

echo "Software:"
add_taprio false
add_cbs false
add_second_taprio
mausezahn eno0 -t ip -b 00:04:9f:05:f6:27 -c 100 -p 60
sleep 5
tc -s class show dev eno0
tc qdisc del dev eno0 root

> In my cursory look i assumed you wanted to go along the path of mqprio
> where nothing much happens in the s/w datapath other than requeues
> when the tx hardware path is busy (notice it is missing an
> enqueue/deque ops). In that case the hardware selection is essentially
> of a DMA ring based on skb tags. It seems you took it up a notch by
> infact having a choice of whether to have pure s/w or offload path.

Yes. Actually the original taprio design always had the enqueue()/dequeue()
ops involved in the data path, then commit 13511704f8d7 ("net: taprio
offload: enforce qdisc to netdev queue mapping") retrofitted the mqprio
model when using the "flags 0x2" argument.

If you have time to read, the discussion behind that redesign was here:
https://lore.kernel.org/netdev/20210511171829.17181-1-yannick.vignon@oss.nxp.com/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RESEND net-next 2/5] net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode
  2023-06-02 10:37 ` [PATCH RESEND net-next 2/5] net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode Vladimir Oltean
@ 2023-06-06 10:27   ` Paolo Abeni
  2023-06-06 15:56     ` Vladimir Oltean
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Abeni @ 2023-06-06 10:27 UTC (permalink / raw)
  To: Vladimir Oltean, netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Vinicius Costa Gomes, linux-kernel,
	intel-wired-lan, Muhammad Husaini Zulkifli, Peilin Ye,
	Pedro Tammela

On Fri, 2023-06-02 at 13:37 +0300, Vladimir Oltean wrote:
> The taprio qdisc currently lives in the following equilibrium.
> 
> In the software scheduling case, taprio attaches itself to all TXQs,
> thus having a refcount of 1 + the number of TX queues. In this mode,
> q->qdiscs[] is not visible directly to the Qdisc API. The lifetime of
> the Qdiscs from this private array lasts until qdisc_destroy() ->
> taprio_destroy().
> 
> In the fully offloaded case, the root taprio has a refcount of 1, and
> all child q->qdiscs[] also have a refcount of 1. The child q->qdiscs[]
> are visible to the Qdisc API (they are attached to the netdev TXQs
> directly), however taprio loses a reference to them very early - during
> qdisc_graft(parent==NULL) -> taprio_attach(). At that time, taprio frees
> the q->qdiscs[] array to not leak memory, but interestingly, it does not
> release a reference on these qdiscs because it doesn't effectively own
> them - they are created by taprio but owned by the Qdisc core, and will
> be freed by qdisc_graft(parent==NULL, new==NULL) -> qdisc_put(old) when
> the Qdisc is deleted or when the child Qdisc is replaced with something
> else.
> 
> My interest is to change this equilibrium such that taprio also owns a
> reference on the q->qdiscs[] child Qdiscs for the lifetime of the root
> Qdisc, including in full offload mode. I want this because I would like
> taprio_leaf(), taprio_dump_class(), taprio_dump_class_stats() to have
> insight into q->qdiscs[] for the software scheduling mode - currently
> they look at dev_queue->qdisc_sleeping, which is, as mentioned, the same
> as the root taprio.
> 
> The following set of changes is necessary:
> - don't free q->qdiscs[] early in taprio_attach(), free it late in
>   taprio_destroy() for consistency with software mode. But:
> - currently that's not possible, because taprio doesn't own a reference
>   on q->qdiscs[]. So hold that reference - once during the initial
>   attach() and once during subsequent graft() calls when the child is
>   changed.
> - always keep track of the current child in q->qdiscs[], even for full
>   offload mode, so that we free in taprio_destroy() what we should, and
>   not something stale.
> 
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>  net/sched/sch_taprio.c | 28 +++++++++++++++++-----------
>  1 file changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index b1c611c72aa4..8807fc915b79 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -2138,23 +2138,20 @@ static void taprio_attach(struct Qdisc *sch)
>  
>  			qdisc->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
>  			old = dev_graft_qdisc(dev_queue, qdisc);
> +			/* Keep refcount of q->qdiscs[ntx] at 2 */
> +			qdisc_refcount_inc(qdisc);
>  		} else {
>  			/* In software mode, attach the root taprio qdisc
>  			 * to all netdev TX queues, so that dev_qdisc_enqueue()
>  			 * goes through taprio_enqueue().
>  			 */
>  			old = dev_graft_qdisc(dev_queue, sch);
> +			/* Keep root refcount at 1 + num_tx_queues */
>  			qdisc_refcount_inc(sch);
>  		}
>  		if (old)
>  			qdisc_put(old);
>  	}
> -
> -	/* access to the child qdiscs is not needed in offload mode */
> -	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
> -		kfree(q->qdiscs);
> -		q->qdiscs = NULL;
> -	}
>  }
>  
>  static struct netdev_queue *taprio_queue_get(struct Qdisc *sch,
> @@ -2183,15 +2180,24 @@ static int taprio_graft(struct Qdisc *sch, unsigned long cl,
>  	if (dev->flags & IFF_UP)
>  		dev_deactivate(dev);
>  
> -	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
> +	/* In software mode, the root taprio qdisc is still the one attached to
> +	 * all netdev TX queues, and hence responsible for taprio_enqueue() to
> +	 * forward the skbs to the child qdiscs from the private q->qdiscs[]
> +	 * array. So only attach the new qdisc to the netdev queue in offload
> +	 * mode, where the enqueue must bypass taprio. However, save the
> +	 * reference to the new qdisc in the private array in both cases, to
> +	 * have an up-to-date reference to our children.
> +	 */
> +	if (FULL_OFFLOAD_IS_ENABLED(q->flags))
>  		*old = dev_graft_qdisc(dev_queue, new);
> -	} else {
> +	else
>  		*old = q->qdiscs[cl - 1];
> -		q->qdiscs[cl - 1] = new;
> -	}
>  
> -	if (new)
> +	q->qdiscs[cl - 1] = new;
> +	if (new) {
> +		qdisc_refcount_inc(new);
>  		new->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
> +	}
>  
Isn't the above leaking a reference to old with something like:

tc qdisc replace dev eth0 handle 8001: parent root stab overhead 24 taprio flags 0x2 #...
	# each q in q->qdiscs has refcnt == 2
tc qdisc replace dev eth0 parent 8001:8 cbs #...
	# -> taprio_graft(..., cbs, ...)
	# cbs refcnt is 2
	# 'old' refcnt decreases by 1, refcount will not reach 0?!?

kmemleak should be able to see that.

BTW, what about including your tests from the cover letter somewhere under tc-testing?

Thanks!

Paolo



* Re: [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children
  2023-06-05 16:53   ` Vladimir Oltean
@ 2023-06-06 15:39     ` Jamal Hadi Salim
  2023-06-06 16:31       ` Vladimir Oltean
  0 siblings, 1 reply; 25+ messages in thread
From: Jamal Hadi Salim @ 2023-06-06 15:39 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

Hi Vladimir,

On Mon, Jun 5, 2023 at 12:53 PM Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
>
> Hi Jamal,
>
> On Mon, Jun 05, 2023 at 11:44:17AM -0400, Jamal Hadi Salim wrote:
> > I havent been following - but if you show me sample intended tc
> > configs for both s/w and hardware offloads i can comment.
>
> There is not much difference in usage between the 2 modes. IMO the software
> data path logic is only a simulation for demonstrative purposes of what the
> shaper is intended to do. If hardware offload is available, it is always
> preferable. Otherwise, I'm not sure if anyone uses the pure software
> scheduling mode (also without txtime assist) for a real life use case.
>

Thanks for the sample. Understood on the software twin being useful
for simulation (our standard rule is that you need to have a software
twin) - it is great that you conform to that.

I took a cursory glance at the existing kernel code in order to get
better context (I will make more time later). Essentially this qdisc
is classful and is capable of hardware offload. My initial comments are
unrelated to the patchset (all this Klingon DCB thingy configuration,
like the flag 0x2, is still magic to me).

1) Just some details become confusing with regard to offload vs. not; e.g.
class grafting (taprio_graft()) is de/activating the device, but that
seems only needed for offload. Would it not be better to have those
separate and call graft_offload vs. graft_software, etc.? We really need
to create a generic document on how someone would write code for
qdiscs for consistency (I started working on one but never completed
it - if there is a volunteer, I would be happy to work with one to
complete it).
2) It seems like this qdisc can only be a root qdisc (like mqprio),
and you don't want to replace the children with other types of qdiscs,
i.e. the children are always pfifo? I.e. is it possible or intended,
for example, to replace 8001:x with bfifo etc.? Or even change the
pfifo queue size, etc.?
3) The offload intention seems really to be bypassing enqueue() and going
straight to the driver xmit() for a specific DMA ring that the skb is
mapped to. Except for the case where the driver says it's busy and
refuses to stash the skb in the ring, in which case you have to requeue
to the appropriate child qdisc/class. I am not sure how that would work
here - mqprio gets away with it by not defining any of the
en/de/requeue() callbacks - but likely it will be the lack of requeue
that makes it work.

I will read the other thread you pointed to when I get a moment.

cheers,
jamal

> I was working with something like this for testing the code paths affected
> by these changes:
>
> #!/bin/bash
>
> add_taprio()
> {
>         local offload=$1
>         local extra_flags
>
>         case $offload in
>         true)
>                 extra_flags="flags 0x2"
>                 ;;
>         false)
>                 extra_flags="clockid CLOCK_TAI"
>                 ;;
>         esac
>
>         tc qdisc replace dev eno0 handle 8001: parent root stab overhead 24 taprio \
>                 num_tc 8 \
>                 map 0 1 2 3 4 5 6 7 \
>                 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
>                 max-sdu 0 0 0 0 0 200 0 0 \
>                 base-time 200 \
>                 sched-entry S 80 20000 \
>                 sched-entry S a0 20000 \
>                 sched-entry S 5f 60000 \
>                 $extra_flags
> }
>
> add_cbs()
> {
>         local offload=$1
>         local extra_flags
>
>         case $offload in
>         true)
>                 extra_flags="offload 1"
>                 ;;
>         false)
>                 extra_flags=""
>                 ;;
>         esac
>
>         max_frame_size=1500
>         data_rate_kbps=20000
>         port_transmit_rate_kbps=1000000
>         idleslope=$data_rate_kbps
>         sendslope=$(($idleslope - $port_transmit_rate_kbps))
>         locredit=$(($max_frame_size * $sendslope / $port_transmit_rate_kbps))
>         hicredit=$(($max_frame_size * $idleslope / $port_transmit_rate_kbps))
>         tc qdisc replace dev eno0 parent 8001:8 cbs \
>                 idleslope $idleslope \
>                 sendslope $sendslope \
>                 hicredit $hicredit \
>                 locredit $locredit \
>                 $extra_flags
> }
>
> # this should always fail
> add_second_taprio()
> {
>         tc qdisc replace dev eno0 parent 8001:7 taprio \
>                 num_tc 8 \
>                 map 0 1 2 3 4 5 6 7 \
>                 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
>                 max-sdu 0 0 0 0 0 200 0 0 \
>                 base-time 200 \
>                 sched-entry S 80 20000 \
>                 sched-entry S a0 20000 \
>                 sched-entry S 5f 60000 \
>                 clockid CLOCK_TAI
> }
>
> ip link set eno0 up
>
> echo "Offload:"
> add_taprio true
> add_cbs true
> add_second_taprio
> mausezahn eno0 -t ip -b 00:04:9f:05:f6:27 -c 100 -p 60
> sleep 5
> tc -s class show dev eno0
> tc qdisc del dev eno0 root
>
> echo "Software:"
> add_taprio false
> add_cbs false
> add_second_taprio
> mausezahn eno0 -t ip -b 00:04:9f:05:f6:27 -c 100 -p 60
> sleep 5
> tc -s class show dev eno0
> tc qdisc del dev eno0 root
>
> > In my cursory look I assumed you wanted to go along the path of mqprio
> > where nothing much happens in the s/w datapath other than requeues
> > when the tx hardware path is busy (notice it is missing the
> > enqueue/dequeue ops). In that case the hardware selection is essentially
> > that of a DMA ring based on skb tags. It seems you took it up a notch by
> > in fact having a choice of whether to have a pure s/w or offload path.
>
> Yes. Actually the original taprio design always had the enqueue()/dequeue()
> ops involved in the data path, then commit 13511704f8d7 ("net: taprio
> offload: enforce qdisc to netdev queue mapping") retrofitted the mqprio
> model when using the "flags 0x2" argument.


> If you have time to read, the discussion behind that redesign was here:
> https://lore.kernel.org/netdev/20210511171829.17181-1-yannick.vignon@oss.nxp.com/


* Re: [PATCH RESEND net-next 2/5] net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode
  2023-06-06 10:27   ` Paolo Abeni
@ 2023-06-06 15:56     ` Vladimir Oltean
  2023-06-07 10:05       ` Paolo Abeni
  0 siblings, 1 reply; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-06 15:56 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

On Tue, Jun 06, 2023 at 12:27:09PM +0200, Paolo Abeni wrote:
> On Fri, 2023-06-02 at 13:37 +0300, Vladimir Oltean wrote:
> > The taprio qdisc currently lives in the following equilibrium.
> > 
> > In the software scheduling case, taprio attaches itself to all TXQs,
> > thus having a refcount of 1 + the number of TX queues. In this mode,
> > q->qdiscs[] is not visible directly to the Qdisc API. The lifetime of
> > the Qdiscs from this private array lasts until qdisc_destroy() ->
> > taprio_destroy().
> > 
> > In the fully offloaded case, the root taprio has a refcount of 1, and
> > all child q->qdiscs[] also have a refcount of 1. The child q->qdiscs[]
> > are visible to the Qdisc API (they are attached to the netdev TXQs
> > directly), however taprio loses a reference to them very early - during
> > qdisc_graft(parent==NULL) -> taprio_attach(). At that time, taprio frees
> > the q->qdiscs[] array to not leak memory, but interestingly, it does not
> > release a reference on these qdiscs because it doesn't effectively own
> > them - they are created by taprio but owned by the Qdisc core, and will
> > be freed by qdisc_graft(parent==NULL, new==NULL) -> qdisc_put(old) when
> > the Qdisc is deleted or when the child Qdisc is replaced with something
> > else.
> > 
> > My interest is to change this equilibrium such that taprio also owns a
> > reference on the q->qdiscs[] child Qdiscs for the lifetime of the root
> > Qdisc, including in full offload mode. I want this because I would like
> > taprio_leaf(), taprio_dump_class(), taprio_dump_class_stats() to have
> > insight into q->qdiscs[] for the software scheduling mode - currently
> > they look at dev_queue->qdisc_sleeping, which is, as mentioned, the same
> > as the root taprio.
> > 
> > The following set of changes is necessary:
> > - don't free q->qdiscs[] early in taprio_attach(), free it late in
> >   taprio_destroy() for consistency with software mode. But:
> > - currently that's not possible, because taprio doesn't own a reference
> >   on q->qdiscs[]. So hold that reference - once during the initial
> >   attach() and once during subsequent graft() calls when the child is
> >   changed.
> > - always keep track of the current child in q->qdiscs[], even for full
> >   offload mode, so that we free in taprio_destroy() what we should, and
> >   not something stale.
> > 
> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> > ---
> >  net/sched/sch_taprio.c | 28 +++++++++++++++++-----------
> >  1 file changed, 17 insertions(+), 11 deletions(-)
> > 
> > diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> > index b1c611c72aa4..8807fc915b79 100644
> > --- a/net/sched/sch_taprio.c
> > +++ b/net/sched/sch_taprio.c
> > @@ -2138,23 +2138,20 @@ static void taprio_attach(struct Qdisc *sch)
> >  
> >  			qdisc->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
> >  			old = dev_graft_qdisc(dev_queue, qdisc);
> > +			/* Keep refcount of q->qdiscs[ntx] at 2 */
> > +			qdisc_refcount_inc(qdisc);
> >  		} else {
> >  			/* In software mode, attach the root taprio qdisc
> >  			 * to all netdev TX queues, so that dev_qdisc_enqueue()
> >  			 * goes through taprio_enqueue().
> >  			 */
> >  			old = dev_graft_qdisc(dev_queue, sch);
> > +			/* Keep root refcount at 1 + num_tx_queues */
> >  			qdisc_refcount_inc(sch);
> >  		}
> >  		if (old)
> >  			qdisc_put(old);
> >  	}
> > -
> > -	/* access to the child qdiscs is not needed in offload mode */
> > -	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
> > -		kfree(q->qdiscs);
> > -		q->qdiscs = NULL;
> > -	}
> >  }
> >  
> >  static struct netdev_queue *taprio_queue_get(struct Qdisc *sch,
> > @@ -2183,15 +2180,24 @@ static int taprio_graft(struct Qdisc *sch, unsigned long cl,
> >  	if (dev->flags & IFF_UP)
> >  		dev_deactivate(dev);
> >  
> > -	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
> > +	/* In software mode, the root taprio qdisc is still the one attached to
> > +	 * all netdev TX queues, and hence responsible for taprio_enqueue() to
> > +	 * forward the skbs to the child qdiscs from the private q->qdiscs[]
> > +	 * array. So only attach the new qdisc to the netdev queue in offload
> > +	 * mode, where the enqueue must bypass taprio. However, save the
> > +	 * reference to the new qdisc in the private array in both cases, to
> > +	 * have an up-to-date reference to our children.
> > +	 */
> > +	if (FULL_OFFLOAD_IS_ENABLED(q->flags))
> >  		*old = dev_graft_qdisc(dev_queue, new);
> > -	} else {
> > +	else
> >  		*old = q->qdiscs[cl - 1];
> > -		q->qdiscs[cl - 1] = new;
> > -	}
> >  
> > -	if (new)
> > +	q->qdiscs[cl - 1] = new;
> > +	if (new) {
> > +		qdisc_refcount_inc(new);
> >  		new->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
> > +	}
> >  
> Isn't the above leaking a reference to old with something like:
> 
> tc qdisc replace dev eth0 handle 8001: parent root stab overhead 24 taprio flags 0x2 #...
> 	# each q in q->qdiscs has refcnt == 2
> tc qdisc replace dev eth0 parent 8001:8 cbs #...
> 	# -> taprio_graft(..., cbs, ...)
> 	# cbs refcnt is 2
> 	# 'old' refcnt decreases by 1, refcount will not reach 0?!?
> 
> kmemleak should be able to see that.

You're right - in full offload mode, the refcount of "old" (pfifo, parent 8001:8)
does not drop to 0 after grafting something else to 8001:8.

I believe this other implementation should work in all cases.

static int taprio_graft(struct Qdisc *sch, unsigned long cl,
			struct Qdisc *new, struct Qdisc **old,
			struct netlink_ext_ack *extack)
{
	struct taprio_sched *q = qdisc_priv(sch);
	struct net_device *dev = qdisc_dev(sch);
	struct netdev_queue *dev_queue = taprio_queue_get(sch, cl);

	if (!dev_queue)
		return -EINVAL;

	if (dev->flags & IFF_UP)
		dev_deactivate(dev);

	/* In offload mode, the child Qdisc is directly attached to the netdev
	 * TX queue, and thus, we need to keep its refcount elevated in order
	 * to counteract qdisc_graft()'s call to qdisc_put() once per TX queue.
	 * However, save the reference to the new qdisc in the private array in
	 * both software and offload cases, to have an up-to-date reference to
	 * our children.
	 */
	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
		*old = dev_graft_qdisc(dev_queue, new);
		if (new)
			qdisc_refcount_inc(new);
		if (*old)
			qdisc_put(*old);
	} else {
		*old = q->qdiscs[cl - 1];
	}

	q->qdiscs[cl - 1] = new;
	if (new)
		new->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;

	if (dev->flags & IFF_UP)
		dev_activate(dev);

	return 0;
}

> BTW, what about including your tests from the cover letter somewhere under tc-testing?

I don't know about that. Does it involve adding taprio hw offload to netdevsim,
so that both code paths are covered?

To be honest with you, I never ran tdc, and I worry about what I may find there,
unrelated to what I'm being asked to add (at a cursory glance I see overlapping
TXQs in taprio.json), which has the potential of turning this into a huge
time sink.


* Re: [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children
  2023-06-06 15:39     ` Jamal Hadi Salim
@ 2023-06-06 16:31       ` Vladimir Oltean
  2023-06-06 17:42         ` Jamal Hadi Salim
  0 siblings, 1 reply; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-06 16:31 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

On Tue, Jun 06, 2023 at 11:39:32AM -0400, Jamal Hadi Salim wrote:
> 1) Just some details become confusing with regard to offload vs. not; e.g.
> class grafting (taprio_graft()) is de/activating the device, but that
> seems only needed for offload. Would it not be better to have those
> separate and call graft_offload vs. graft_software, etc.? We really need
> to create a generic document on how someone would write code for
> qdiscs for consistency (I started working on one but never completed
> it - if there is a volunteer, I would be happy to work with one to
> complete it).

I would be a happy reader of that document. I haven't studied whether
dev_deactivate() and dev_activate() are necessary for the pure software
data path, where the root taprio is also directly attached to the netdev
TXQs and that fact doesn't change across its lifetime.

> 2) It seems like this qdisc can only be a root qdisc (like mqprio),

so far so good

> and you don't want to replace the children with other types of qdiscs,
> i.e. the children are always pfifo? I.e. is it possible or intended,
> for example, to replace 8001:x with bfifo etc.? Or even change the
> pfifo queue size, etc.?

no, this is not true, why do you say this?

> 3) The offload intention seems really to be bypassing enqueue() and going
> straight to the driver xmit() for a specific DMA ring that the skb is
> mapped to. Except for the case where the driver says it's busy and
> refuses to stash the skb in the ring, in which case you have to requeue
> to the appropriate child qdisc/class. I am not sure how that would work
> here - mqprio gets away with it by not defining any of the
> en/de/requeue() callbacks

wait, there is a requeue() callback? where?

> - but likely it will be the lack of requeue that makes it work.

Looking at dev_requeue_skb(), isn't it always going to be requeued to
the same qdisc it was originally dequeued from? I don't see what the
problem is.

My understanding of the offload intention is not really to bypass dequeue()
in general and as a matter of principle, but rather to bypass the root's
taprio_dequeue() specifically, as that could do unrelated work, and jump
right to the specific child's dequeue().

The child could have its own complex enqueue() and dequeue() and that is
perfectly fine - for example cbs_dequeue_soft() is a valid child dequeue
procedure - as long as the process isn't blocked in the sendmsg() call
by __qdisc_run() processing packets belonging to unrelated traffic
classes.


* Re: [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children
  2023-06-06 16:31       ` Vladimir Oltean
@ 2023-06-06 17:42         ` Jamal Hadi Salim
  2023-06-07 10:09           ` Vladimir Oltean
  0 siblings, 1 reply; 25+ messages in thread
From: Jamal Hadi Salim @ 2023-06-06 17:42 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

On Tue, Jun 6, 2023 at 12:32 PM Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
>
> On Tue, Jun 06, 2023 at 11:39:32AM -0400, Jamal Hadi Salim wrote:
> > 1) Just some details become confusing with regard to offload vs. not; e.g.
> > class grafting (taprio_graft()) is de/activating the device, but that
> > seems only needed for offload. Would it not be better to have those
> > separate and call graft_offload vs. graft_software, etc.? We really need
> > to create a generic document on how someone would write code for
> > qdiscs for consistency (I started working on one but never completed
> > it - if there is a volunteer, I would be happy to work with one to
> > complete it).
>
> I would be a happy reader of that document. I haven't studied whether
> dev_deactivate() and dev_activate() are necessary for the pure software
> data path, where the root taprio is also directly attached to the netdev
> TXQs and that fact doesn't change across its lifetime.

I didn't go that far in the doc but was intending to. It was supposed
to be a tutorial somewhere, and I ended up not doing it.
Something like htb would be a good example to compare with (it is a
classful qdisc which is also capable of offloading). It is not the
same as mqprio, which can only be root.

> > 2) It seems like this qdisc can only be a root qdisc (like mqprio),
>
> so far so good
>
> > and you don't want to replace the children with other types of qdiscs,
> > i.e. the children are always pfifo? I.e. is it possible or intended,
> > for example, to replace 8001:x with bfifo etc.? Or even change the
> > pfifo queue size, etc.?
>
> no, this is not true, why do you say this?

I am just asking questions trying to understand ;-> So if I can
replace, for example, with a tbf, would it make sense even in s/w?

> > 3) The offload intention seems really to be bypassing enqueue() and going
> > straight to the driver xmit() for a specific DMA ring that the skb is
> > mapped to. Except for the case where the driver says it's busy and
> > refuses to stash the skb in the ring, in which case you have to requeue
> > to the appropriate child qdisc/class. I am not sure how that would work
> > here - mqprio gets away with it by not defining any of the
> > en/de/requeue() callbacks
>
> wait, there is a requeue() callback? where?

Hrm, someone removed that op at some point, I guess - not sure when,
I need to look at git history.
But the short answer is yes, it used to be there; git will probably
reveal the commit behind its disappearance.

>
> > - but likely it will be the lack of requeue that makes it work.
>
> Looking at dev_requeue_skb(), isn't it always going to be requeued to
> the same qdisc it was originally dequeued from? I don't see what is the
> problem.

In the basic case that approach is sufficient. For pfifo you want it
to be the first skb dequeued at the next opportunity to send to h/w.
Basically the idea is/was: if the hardware is busy, you may need to run
some algorithm (present in requeue but not in enqueue) to decide whether
you should put the skb at the head, at the tail, somewhere else, drop it,
check if some time limit has been exceeded and do something funky, etc.

> My understanding of the offload intention is not really to bypass dequeue()
> in general and as a matter of principle, but rather to bypass the root's
> taprio_dequeue() specifically, as that could do unrelated work, and jump
> right to the specific child's dequeue().
>
> The child could have its own complex enqueue() and dequeue() and that is
> perfectly fine - for example cbs_dequeue_soft() is a valid child dequeue
> procedure - as long as the process isn't blocked in the sendmsg() call
> by __qdisc_run() processing packets belonging to unrelated traffic
> classes.

Does it matter what type the child enqueue/dequeue is? E.g. can I attach htb, etc.?

cheers,
jamal


* Re: [PATCH RESEND net-next 2/5] net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode
  2023-06-06 15:56     ` Vladimir Oltean
@ 2023-06-07 10:05       ` Paolo Abeni
  2023-06-07 10:16         ` Vladimir Oltean
  0 siblings, 1 reply; 25+ messages in thread
From: Paolo Abeni @ 2023-06-07 10:05 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

On Tue, 2023-06-06 at 18:56 +0300, Vladimir Oltean wrote:
> On Tue, Jun 06, 2023 at 12:27:09PM +0200, Paolo Abeni wrote:
> > On Fri, 2023-06-02 at 13:37 +0300, Vladimir Oltean wrote:
> > > The taprio qdisc currently lives in the following equilibrium.
> > > 
> > > In the software scheduling case, taprio attaches itself to all TXQs,
> > > thus having a refcount of 1 + the number of TX queues. In this mode,
> > > q->qdiscs[] is not visible directly to the Qdisc API. The lifetime of
> > > the Qdiscs from this private array lasts until qdisc_destroy() ->
> > > taprio_destroy().
> > > 
> > > In the fully offloaded case, the root taprio has a refcount of 1, and
> > > all child q->qdiscs[] also have a refcount of 1. The child q->qdiscs[]
> > > are visible to the Qdisc API (they are attached to the netdev TXQs
> > > directly), however taprio loses a reference to them very early - during
> > > qdisc_graft(parent==NULL) -> taprio_attach(). At that time, taprio frees
> > > the q->qdiscs[] array to not leak memory, but interestingly, it does not
> > > release a reference on these qdiscs because it doesn't effectively own
> > > them - they are created by taprio but owned by the Qdisc core, and will
> > > be freed by qdisc_graft(parent==NULL, new==NULL) -> qdisc_put(old) when
> > > the Qdisc is deleted or when the child Qdisc is replaced with something
> > > else.
> > > 
> > > My interest is to change this equilibrium such that taprio also owns a
> > > reference on the q->qdiscs[] child Qdiscs for the lifetime of the root
> > > Qdisc, including in full offload mode. I want this because I would like
> > > taprio_leaf(), taprio_dump_class(), taprio_dump_class_stats() to have
> > > insight into q->qdiscs[] for the software scheduling mode - currently
> > > they look at dev_queue->qdisc_sleeping, which is, as mentioned, the same
> > > as the root taprio.
> > > 
> > > The following set of changes is necessary:
> > > - don't free q->qdiscs[] early in taprio_attach(), free it late in
> > >   taprio_destroy() for consistency with software mode. But:
> > > - currently that's not possible, because taprio doesn't own a reference
> > >   on q->qdiscs[]. So hold that reference - once during the initial
> > >   attach() and once during subsequent graft() calls when the child is
> > >   changed.
> > > - always keep track of the current child in q->qdiscs[], even for full
> > >   offload mode, so that we free in taprio_destroy() what we should, and
> > >   not something stale.
> > > 
> > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> > > ---
> > >  net/sched/sch_taprio.c | 28 +++++++++++++++++-----------
> > >  1 file changed, 17 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> > > index b1c611c72aa4..8807fc915b79 100644
> > > --- a/net/sched/sch_taprio.c
> > > +++ b/net/sched/sch_taprio.c
> > > @@ -2138,23 +2138,20 @@ static void taprio_attach(struct Qdisc *sch)
> > >  
> > >  			qdisc->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
> > >  			old = dev_graft_qdisc(dev_queue, qdisc);
> > > +			/* Keep refcount of q->qdiscs[ntx] at 2 */
> > > +			qdisc_refcount_inc(qdisc);
> > >  		} else {
> > >  			/* In software mode, attach the root taprio qdisc
> > >  			 * to all netdev TX queues, so that dev_qdisc_enqueue()
> > >  			 * goes through taprio_enqueue().
> > >  			 */
> > >  			old = dev_graft_qdisc(dev_queue, sch);
> > > +			/* Keep root refcount at 1 + num_tx_queues */
> > >  			qdisc_refcount_inc(sch);
> > >  		}
> > >  		if (old)
> > >  			qdisc_put(old);
> > >  	}
> > > -
> > > -	/* access to the child qdiscs is not needed in offload mode */
> > > -	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
> > > -		kfree(q->qdiscs);
> > > -		q->qdiscs = NULL;
> > > -	}
> > >  }
> > >  
> > >  static struct netdev_queue *taprio_queue_get(struct Qdisc *sch,
> > > @@ -2183,15 +2180,24 @@ static int taprio_graft(struct Qdisc *sch, unsigned long cl,
> > >  	if (dev->flags & IFF_UP)
> > >  		dev_deactivate(dev);
> > >  
> > > -	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
> > > +	/* In software mode, the root taprio qdisc is still the one attached to
> > > +	 * all netdev TX queues, and hence responsible for taprio_enqueue() to
> > > +	 * forward the skbs to the child qdiscs from the private q->qdiscs[]
> > > +	 * array. So only attach the new qdisc to the netdev queue in offload
> > > +	 * mode, where the enqueue must bypass taprio. However, save the
> > > +	 * reference to the new qdisc in the private array in both cases, to
> > > +	 * have an up-to-date reference to our children.
> > > +	 */
> > > +	if (FULL_OFFLOAD_IS_ENABLED(q->flags))
> > >  		*old = dev_graft_qdisc(dev_queue, new);
> > > -	} else {
> > > +	else
> > >  		*old = q->qdiscs[cl - 1];
> > > -		q->qdiscs[cl - 1] = new;
> > > -	}
> > >  
> > > -	if (new)
> > > +	q->qdiscs[cl - 1] = new;
> > > +	if (new) {
> > > +		qdisc_refcount_inc(new);
> > >  		new->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
> > > +	}
> > >  
> > Isn't the above leaking a reference to old with something like:
> > 
> > tc qdisc replace dev eth0 handle 8001: parent root stab overhead 24 taprio flags 0x2 #...
> > 	# each q in q->qdiscs has refcnt == 2
> > tc qdisc replace dev eth0 parent 8001:8 cbs #...
> > 	# -> taprio_graft(..., cbs, ...)
> > 	# cbs refcnt is 2
> > 	# 'old' refcnt decreases by 1, refcount will not reach 0?!?
> > 
> > kmemleak should be able to see that.
> 
> You're right - in full offload mode, the refcount of "old" (pfifo, parent 8001:8)
> does not drop to 0 after grafting something else to 8001:8.
> 
> I believe this other implementation should work in all cases.
> 
> static int taprio_graft(struct Qdisc *sch, unsigned long cl,
> 			struct Qdisc *new, struct Qdisc **old,
> 			struct netlink_ext_ack *extack)
> {
> 	struct taprio_sched *q = qdisc_priv(sch);
> 	struct net_device *dev = qdisc_dev(sch);
> 	struct netdev_queue *dev_queue = taprio_queue_get(sch, cl);
> 
> 	if (!dev_queue)
> 		return -EINVAL;
> 
> 	if (dev->flags & IFF_UP)
> 		dev_deactivate(dev);
> 
> 	/* In offload mode, the child Qdisc is directly attached to the netdev
> 	 * TX queue, and thus, we need to keep its refcount elevated in order
> 	 * to counteract qdisc_graft()'s call to qdisc_put() once per TX queue.
> 	 * However, save the reference to the new qdisc in the private array in
> 	 * both software and offload cases, to have an up-to-date reference to
> 	 * our children.
> 	 */
> 	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
> 		*old = dev_graft_qdisc(dev_queue, new);
> 		if (new)
> 			qdisc_refcount_inc(new);
> 		if (*old)
> 			qdisc_put(*old);
> 	} else {
> 		*old = q->qdiscs[cl - 1];
> 	}

Perhaps the above chunk could be:

	*old = q->qdiscs[cl - 1];
	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
		WARN_ON_ONCE(dev_graft_qdisc(dev_queue, new) != *old);
		if (new)
			qdisc_refcount_inc(new);
		if (*old)
			qdisc_put(*old);
	}

(boldly assuming I'm not completely lost, which looks like a wild bet ;)

> 	q->qdiscs[cl - 1] = new;
> 	if (new)
> 		new->flags |= TCQ_F_ONETXQUEUE | TCQ_F_NOPARENT;
> 
> 	if (dev->flags & IFF_UP)
> 		dev_activate(dev);
> 
> 	return 0;
> }
> 
> > BTW, what about including your tests from the cover letter somewhere under tc-testing?
> 
> I don't know about that. Does it involve adding taprio hw offload to netdevsim,
> so that both code paths are covered?

I guess I underestimated the needed effort; we could live without new
tests here.


Cheers,

Paolo


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children
  2023-06-06 17:42         ` Jamal Hadi Salim
@ 2023-06-07 10:09           ` Vladimir Oltean
  2023-06-07 10:30             ` Vladimir Oltean
  0 siblings, 1 reply; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-07 10:09 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

On Tue, Jun 06, 2023 at 01:42:19PM -0400, Jamal Hadi Salim wrote:
> > > 2) It seems like in mqprio this qdisc can only be root qdisc (like
> > > mqprio)
> >
> > so far so good
> >
> > > and you dont want to replace the children with other types of
> > > qdiscs i.e the children are always pfifo? i.e is it possible or
> > > intended for example to replace 8001:x with bfifo etc? or even change
> > > the pfifo queue size, etc?
> >
> > no, this is not true, why do you say this?
> 
> I am just asking questions trying to understand ;-> So if I can
> replace it with, for example, a tbf, would it make sense even in s/w?
> 
> > The child could have its own complex enqueue() and dequeue() and that is
> > perfectly fine - for example cbs_dequeue_soft() is a valid child dequeue
> > procedure - as long as the process isn't blocked in the sendmsg() call
> > by __qdisc_run() processing packets belonging to unrelated traffic
> > classes.
> 
> Does it matter what type the child enqueue/dequeue? eg can i attach htb, etc?

So in principle, the taprio model is compatible with attaching any child
Qdisc to the per-TXQ child classes. tc-cbs is of particular interest,
because that is a TSN shaper, but tbf or htb could also reasonably be
imagined as children; taprio doesn't object to any Qdisc as its child.

That being said, a non-offloaded cbs/htb will not work under an offloaded
taprio root anymore after commit 13511704f8d7 ("net: taprio offload:
enforce qdisc to netdev queue mapping"), and IMO what should be done
there is to somehow reject those child Qdiscs which can't be offloaded
when the root is offloaded.

Offloading a taprio qdisc (potentially with offloaded cbs children) also
affects the autonomous forwarding data path in the case of an Ethernet
switch. So it's not just about dev_queue_xmit() from Linux.


* Re: [PATCH RESEND net-next 2/5] net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode
  2023-06-07 10:05       ` Paolo Abeni
@ 2023-06-07 10:16         ` Vladimir Oltean
  0 siblings, 0 replies; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-07 10:16 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

On Wed, Jun 07, 2023 at 12:05:19PM +0200, Paolo Abeni wrote:
> Perhaps the above chunk could be:
> 
> 	*old = q->qdiscs[cl - 1];
> 	if (FULL_OFFLOAD_IS_ENABLED(q->flags)) {
> 		WARN_ON_ONCE(dev_graft_qdisc(dev_queue, new) != *old);
> 		if (new)
> 			qdisc_refcount_inc(new);
> 		if (*old)
> 			qdisc_put(*old);
> 	}
> 
> (boldly assuming I'm not completely lost, which looks a wild bet ;)

Yeah, could be like that. In full offload mode, q->qdiscs[cl - 1] is
also what's grafted to the TXQ, so the WARN_ON() would be indicative of
a serious bug if it triggered.

> > > BTW, what about including your tests from the cover letter somewhere under tc-testing?
> > 
> > I don't know about that. Does it involve adding taprio hw offload to netdevsim,
> > so that both code paths are covered?
> 
> I guess I underlooked the needed effort and we could live without new
> tests here.

Let's see how the discussion progresses.


* Re: [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children
  2023-06-07 10:09           ` Vladimir Oltean
@ 2023-06-07 10:30             ` Vladimir Oltean
  0 siblings, 0 replies; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-07 10:30 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

On Wed, Jun 07, 2023 at 01:09:01PM +0300, Vladimir Oltean wrote:
> That being said, a non-offloaded cbs/htb will not work with an offloaded
> taprio root anymore after commit 13511704f8d7 ("net: taprio offload:
> enforce qdisc to netdev queue mapping"), and IMO what should be done
> there is to reject somehow those child Qdiscs which also can't be
> offloaded when the root is offloaded.
> 
> Offloading a taprio qdisc (potentially with offloaded cbs children) also
> affects the autonomous forwarding data path in case of an Ethernet switch.
> So it's not just about dev_queue_xmits() from Linux.

Ah, no, I said something stupid. A non-offloaded child Qdisc should still
work with an offloaded root (shaping done in software, scheduling in
hardware). I was thinking about the autonomous forwarding case in the
back of my mind; that's still problematic, because the shaping wouldn't
affect packets that Linux didn't originate.


* Re: [PATCH RESEND net-next 5/5] net/sched: taprio: dump class stats for the actual q->qdiscs[]
  2023-06-02 10:37 ` [PATCH RESEND net-next 5/5] net/sched: taprio: dump class stats for the actual q->qdiscs[] Vladimir Oltean
@ 2023-06-08 18:44   ` Jamal Hadi Salim
  2023-06-09 12:10     ` Vladimir Oltean
  0 siblings, 1 reply; 25+ messages in thread
From: Jamal Hadi Salim @ 2023-06-08 18:44 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

On Fri, Jun 2, 2023 at 6:38 AM Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
>
> This makes a difference for the software scheduling mode, where
> dev_queue->qdisc_sleeping is the same as the taprio root Qdisc itself,
> but when we're talking about what Qdisc and stats get reported for a
> traffic class, the root taprio isn't what comes to mind, but q->qdiscs[]
> is.
>
> To understand the difference, I've attempted to send 100 packets in
> software mode through traffic class 0 (they are in the Qdisc's backlog),
> and recorded the stats before and after the change.
>

Other than the refcount issue I think the approach looks reasonable to
me. The stats before/after you are showing below though are
interesting; are you showing a transient phase where packets are
temporarily in the backlog? Typically the backlog is a transient phase
which lasts a very short period. Maybe it works differently for
taprio? I took a quick look at the code and do see it decrement the
backlog in dequeue(), so if it is not transient then some code path
is not being hit.

Aside: I realize you are busy - but if you get time and provide some
sample tc command lines for testing we could help create the tests for
you, at least the first time. The advantage of putting these tests in
tools/testing/selftests/tc-testing/ is that there are test tools out
there that run these tests and so regressions are easier to catch
sooner.

cheers,
jamal

> Here is before:
>
> $ tc -s class show dev eth0
> class taprio 8001:1 root leaf 8001:
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 9400b 100p requeues 0
>  Window drops: 0
> class taprio 8001:2 root leaf 8001:
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 9400b 100p requeues 0
>  Window drops: 0
> class taprio 8001:3 root leaf 8001:
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 9400b 100p requeues 0
>  Window drops: 0
> class taprio 8001:4 root leaf 8001:
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 9400b 100p requeues 0
>  Window drops: 0
> class taprio 8001:5 root leaf 8001:
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 9400b 100p requeues 0
>  Window drops: 0
> class taprio 8001:6 root leaf 8001:
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 9400b 100p requeues 0
>  Window drops: 0
> class taprio 8001:7 root leaf 8001:
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 9400b 100p requeues 0
>  Window drops: 0
> class taprio 8001:8 root leaf 8001:
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 9400b 100p requeues 0
>  Window drops: 0
>
> and here is after:
>
> class taprio 8001:1 root
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 9400b 100p requeues 0
>  Window drops: 0
> class taprio 8001:2 root
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
>  Window drops: 0
> class taprio 8001:3 root
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
>  Window drops: 0
> class taprio 8001:4 root
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
>  Window drops: 0
> class taprio 8001:5 root
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
>  Window drops: 0
> class taprio 8001:6 root
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
>  Window drops: 0
> class taprio 8001:7 root leaf 8010:
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
>  Window drops: 0
> class taprio 8001:8 root
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
>  Window drops: 0
>
> The most glaring (and expected) difference is that before, all class
> stats reported the global stats, whereas now, they really report just
> the counters for that traffic class.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>  net/sched/sch_taprio.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index cc7ff98e5e86..23b98c3af8b2 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -2452,11 +2452,11 @@ static unsigned long taprio_find(struct Qdisc *sch, u32 classid)
>  static int taprio_dump_class(struct Qdisc *sch, unsigned long cl,
>                              struct sk_buff *skb, struct tcmsg *tcm)
>  {
> -       struct netdev_queue *dev_queue = taprio_queue_get(sch, cl);
> +       struct Qdisc *child = taprio_leaf(sch, cl);
>
>         tcm->tcm_parent = TC_H_ROOT;
>         tcm->tcm_handle |= TC_H_MIN(cl);
> -       tcm->tcm_info = dev_queue->qdisc_sleeping->handle;
> +       tcm->tcm_info = child->handle;
>
>         return 0;
>  }
> @@ -2466,8 +2466,7 @@ static int taprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
>         __releases(d->lock)
>         __acquires(d->lock)
>  {
> -       struct netdev_queue *dev_queue = taprio_queue_get(sch, cl);
> -       struct Qdisc *child = dev_queue->qdisc_sleeping;
> +       struct Qdisc *child = taprio_leaf(sch, cl);
>         struct tc_taprio_qopt_offload offload = {
>                 .cmd = TAPRIO_CMD_TC_STATS,
>                 .tc_stats = {
> --
> 2.34.1
>


* Re: [PATCH RESEND net-next 5/5] net/sched: taprio: dump class stats for the actual q->qdiscs[]
  2023-06-08 18:44   ` Jamal Hadi Salim
@ 2023-06-09 12:10     ` Vladimir Oltean
  2023-06-09 14:56       ` Victor Nogueira
  2023-06-09 16:19       ` Jamal Hadi Salim
  0 siblings, 2 replies; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-09 12:10 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

On Thu, Jun 08, 2023 at 02:44:46PM -0400, Jamal Hadi Salim wrote:
> Other than the refcount issue i think the approach looks reasonable to
> me. The stats before/after you are showing below though are
> interesting; are you showing a transient phase where packets are
> temporarily in the backlog. Typically the backlog is a transient phase
> which lasts a very short period. Maybe it works differently for
> taprio? I took a quick look at the code and do see to decrement the
> backlog in the dequeue, so if it is not transient then some code path
> is not being hit.

It's a fair concern. The thing is that I put very aggressive time slots
in the schedule that I'm testing with, and my kernel has a lot of
debugging stuff which bogs it down (kasan, kmemleak, lockdep, DMA API
debug etc). Not to mention that the CPU isn't the fastest to begin with.

The way taprio works is that there's a hrtimer which fires at the
expiration time of the current schedule entry and sets up the gates for
the next one. Each schedule entry has a gate for each traffic class
which determines what traffic classes are eligible for dequeue() and
which ones aren't.

The dequeue() procedure, also invoked via the advance_schedule()
hrtimer -> __netif_schedule() path, is time-sensitive as well. By the
time taprio_dequeue() runs, the taprio_entry_allows_tx() function might
return false when the system is so bogged down that it wasn't able to
make enough progress to dequeue() an skb in time. When that happens,
there is currently no mechanism to age out packets that sat for too long
in the TX queues (and what does "too long" mean, anyway?).

Whereas enqueue() is technically not time-sensitive, i.e. you can
enqueue whenever you want and the Qdisc will dequeue whenever it can.
Though in practice, to make this scheduling technique useful, the user
space enqueue should also be time-aware (though you can't capture this
with ping).

If I increase all my sched-entry intervals by a factor of 100, the
backlog issue goes away and the system can make forward progress.

So yeah, sorry, I didn't pay too much attention to the data I was
presenting for illustrative purposes.

> Aside: I realize you are busy - but if you get time and provide some
> sample tc command lines for testing we could help create the tests for
> you, at least the first time. The advantage of putting these tests in
> tools/testing/selftests/tc-testing/ is that there are test tools out
> there that run these tests and so regressions are easier to catch
> sooner.

Yeah, ok. The script posted in a reply to the cover letter is still what
I'm working with. The things it intends to capture are:
- attaching a custom Qdisc to one of taprio's classes doesn't fail
- attaching taprio to one of taprio's classes fails
- sending packets through one queue increases the counters (any counters)
  of just that queue

All the above, replicated once for the software scheduling case and once
for the offload case. Currently netdevsim doesn't attempt to emulate
taprio offload.
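
For what it's worth, the first check could be sketched as a tdc test
case roughly like this (hypothetical and untested; the netdevsim setup,
handles and match patterns are illustrative):

```json
{
    "id": "",
    "name": "Graft a cbs child under a software-mode taprio class",
    "category": ["qdisc", "taprio"],
    "setup": [
        "echo \"1 1 8\" > /sys/bus/netdevsim/new_device",
        "$TC qdisc replace dev $ETH handle 8001: root stab overhead 24 taprio num_tc 8 map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 base-time 0 sched-entry S ff 4000000000 clockid CLOCK_TAI"
    ],
    "cmdUnderTest": "$TC qdisc replace dev $ETH parent 8001:8 cbs idleslope 20000 sendslope -980000 hicredit 30 locredit -1470",
    "expExitCode": "0",
    "verifyCmd": "$TC -d qdisc show dev $ETH",
    "matchPattern": "qdisc cbs",
    "matchCount": "1",
    "teardown": [
        "$TC qdisc del dev $ETH root",
        "echo \"1\" > /sys/bus/netdevsim/del_device"
    ]
}
```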

Is there a way to skip tests? I may look into tdc, but I honestly don't
have time for unrelated stuff such as figuring out why my kernel isn't
configured for the other tests to pass - and it seems that once one test
fails, the others are completely skipped, see below.

Also, by which rule are the test IDs created?

root@debian:~# cd selftests/tc-testing/
root@debian:~/selftests/tc-testing# ./tdc.sh
considering category qdisc
 -- ns/SubPlugin.__init__
Test 0582: Create QFQ with default setting
Test c9a3: Create QFQ with class weight setting
Test d364: Test QFQ with max class weight setting
Test 8452: Create QFQ with class maxpkt setting
Test 22df: Test QFQ class maxpkt setting lower bound
Test 92ee: Test QFQ class maxpkt setting upper bound
Test d920: Create QFQ with multiple class setting
Test 0548: Delete QFQ with handle
Test 5901: Show QFQ class
Test 0385: Create DRR with default setting
Test 2375: Delete DRR with handle
Test 3092: Show DRR class
Test 3460: Create CBQ with default setting
exit: 2
exit: 0
Error: Specified qdisc kind is unknown.


-----> teardown stage *** Could not execute: "$TC qdisc del dev $DUMMY handle 1: root"

-----> teardown stage *** Error message: "Error: Invalid handle.
"
returncode 2; expected [0]

-----> teardown stage *** Aborting test run.


<_io.BufferedReader name=3> *** stdout ***


<_io.BufferedReader name=5> *** stderr ***
"-----> teardown stage" did not complete successfully
Exception <class '__main__.PluginMgrTestFail'> ('teardown', 'Error: Specified qdisc kind is unknown.\n', '"-----> teardown stage" did not complete successfully') (caught in test_runner, running test 14 3460 Create CBQ with default setting stage teardown)
---------------
traceback
  File "/root/selftests/tc-testing/./tdc.py", line 495, in test_runner
    res = run_one_test(pm, args, index, tidx)
  File "/root/selftests/tc-testing/./tdc.py", line 434, in run_one_test
    prepare_env(args, pm, 'teardown', '-----> teardown stage', tidx['teardown'], procout)
  File "/root/selftests/tc-testing/./tdc.py", line 245, in prepare_env
    raise PluginMgrTestFail(
---------------
accumulated output for this test: 
Error: Specified qdisc kind is unknown.
---------------

All test results:

1..336
ok 1 0582 - Create QFQ with default setting
ok 2 c9a3 - Create QFQ with class weight setting
ok 3 d364 - Test QFQ with max class weight setting
ok 4 8452 - Create QFQ with class maxpkt setting
ok 5 22df - Test QFQ class maxpkt setting lower bound
ok 6 92ee - Test QFQ class maxpkt setting upper bound
ok 7 d920 - Create QFQ with multiple class setting
ok 8 0548 - Delete QFQ with handle
ok 9 5901 - Show QFQ class
ok 10 0385 - Create DRR with default setting
ok 11 2375 - Delete DRR with handle
ok 12 3092 - Show DRR class
ok 13 3460 - Create CBQ with default setting # skipped - "-----> teardown stage" did not complete successfully

ok 14 0592 - Create CBQ with mpu # skipped - skipped - previous teardown failed 14 3460

ok 15 4684 - Create CBQ with valid cell num # skipped - skipped - previous teardown failed 14 3460

ok 16 4345 - Create CBQ with invalid cell num # skipped - skipped - previous teardown failed 14 3460

ok 17 4525 - Create CBQ with valid ewma # skipped - skipped - previous teardown failed 14 3460

ok 18 6784 - Create CBQ with invalid ewma # skipped - skipped - previous teardown failed 14 3460

ok 19 5468 - Delete CBQ with handle # skipped - skipped - previous teardown failed 14 3460

ok 20 492a - Show CBQ class # skipped - skipped - previous teardown failed 14 3460

ok 21 9903 - Add mqprio Qdisc to multi-queue device (8 queues) # skipped - skipped - previous teardown failed 14 3460

ok 22 453a - Delete nonexistent mqprio Qdisc # skipped - skipped - previous teardown failed 14 3460

ok 23 5292 - Delete mqprio Qdisc twice # skipped - skipped - previous teardown failed 14 3460

ok 24 45a9 - Add mqprio Qdisc to single-queue device # skipped - skipped - previous teardown failed 14 3460

ok 25 2ba9 - Show mqprio class # skipped - skipped - previous teardown failed 14 3460

ok 26 4812 - Create HHF with default setting # skipped - skipped - previous teardown failed 14 3460

ok 27 8a92 - Create HHF with limit setting # skipped - skipped - previous teardown failed 14 3460

ok 28 3491 - Create HHF with quantum setting # skipped - skipped - previous teardown failed 14 3460
(...)


* Re: [PATCH RESEND net-next 5/5] net/sched: taprio: dump class stats for the actual q->qdiscs[]
  2023-06-09 12:10     ` Vladimir Oltean
@ 2023-06-09 14:56       ` Victor Nogueira
  2023-06-10 17:49         ` Vladimir Oltean
  2023-06-09 16:19       ` Jamal Hadi Salim
  1 sibling, 1 reply; 25+ messages in thread
From: Victor Nogueira @ 2023-06-09 14:56 UTC (permalink / raw)
  To: Vladimir Oltean, Jamal Hadi Salim
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

On 09/06/2023 09:10, Vladimir Oltean wrote:
> On Thu, Jun 08, 2023 at 02:44:46PM -0400, Jamal Hadi Salim wrote:
>> Other than the refcount issue i think the approach looks reasonable to
>> me. The stats before/after you are showing below though are
>> interesting; are you showing a transient phase where packets are
>> temporarily in the backlog. Typically the backlog is a transient phase
>> which lasts a very short period. Maybe it works differently for
>> taprio? I took a quick look at the code and do see to decrement the
>> backlog in the dequeue, so if it is not transient then some code path
>> is not being hit.
> 
> It's a fair concern. The thing is that I put very aggressive time slots
> in the schedule that I'm testing with, and my kernel has a lot of
> debugging stuff which bogs it down (kasan, kmemleak, lockdep, DMA API
> debug etc). Not to mention that the CPU isn't the fastest to begin with.
> 
> The way taprio works is that there's a hrtimer which fires at the
> expiration time of the current schedule entry and sets up the gates for
> the next one. Each schedule entry has a gate for each traffic class
> which determines what traffic classes are eligible for dequeue() and
> which ones aren't.
> 
> The dequeue() procedure, though also invoked by the advance_schedule()
> hrtimer -> __netif_schedule(), is also time-sensitive. By the time
> taprio_dequeue() runs, taprio_entry_allows_tx() function might return
> false when the system is so bogged down that it wasn't able to make
> enough progress to dequeue() an skb in time. When that happens, there is
> no mechanism, currently, to age out packets that stood too much in the
> TX queues (what does "too much" mean?).
> 
> Whereas enqueue() is technically not time-sensitive, i.e. you can
> enqueue whenever you want and the Qdisc will dequeue whenever it can.
> Though in practice, to make this scheduling technique useful, the user
> space enqueue should also be time-aware (though you can't capture this
> with ping).
> 
> If I increase all my sched-entry intervals by a factor of 100, the
> backlog issue goes away and the system can make forward progress.
> 
> So yeah, sorry, I didn't pay too much attention to the data I was
> presenting for illustrative purposes.
> 
>> Aside: I realize you are busy - but if you get time and provide some
>> sample tc command lines for testing we could help create the tests for
>> you, at least the first time. The advantage of putting these tests in
>> tools/testing/selftests/tc-testing/ is that there are test tools out
>> there that run these tests and so regressions are easier to catch
>> sooner.
> 
> Yeah, ok. The script posted in a reply on the cover letter is still what
> I'm working with. The things it intends to capture are:
> - attaching a custom Qdisc to one of taprio's classes doesn't fail
> - attaching taprio to one of taprio's classes fails
> - sending packets through one queue increases the counters (any counters)
>    of just that queue
> 
> All the above, replicated once for the software scheduling case and once
> for the offload case. Currently netdevsim doesn't attempt to emulate
> taprio offload.
> 
> Is there a way to skip tests? I may look into tdc, but I honestly don't
> have time for unrelated stuff such as figuring out why my kernel isn't
> configured for the other tests to pass - and it seems that once one test
> fails, the others are completely skipped, see below.

You can tell tdc to run a specific test file by providing the "-f" option.
For example, if you want to run only taprio tests, you can issue the
following command:

./tdc.py -f tc-tests/qdiscs/taprio.json

This is also described in tdc's README.

> Also, by which rule are the test IDs created?

When creating a test case in tdc, you must have an ID field.
What we do to generate the IDs is leave the "id" field as an
empty string in the test case description, for example:


{
     "id": "",
     "name": "My dummy test case",
     ...
}

and run the following:

./tdc.py -i

This will automatically fill up the "id" field in the JSON
with an appropriate ID.

> root@debian:~# cd selftests/tc-testing/
> root@debian:~/selftests/tc-testing# ./tdc.sh
> considering category qdisc
>   -- ns/SubPlugin.__init__
> Test 0582: Create QFQ with default setting
> Test c9a3: Create QFQ with class weight setting
> Test d364: Test QFQ with max class weight setting
> Test 8452: Create QFQ with class maxpkt setting
> Test 22df: Test QFQ class maxpkt setting lower bound
> Test 92ee: Test QFQ class maxpkt setting upper bound
> Test d920: Create QFQ with multiple class setting
> Test 0548: Delete QFQ with handle
> Test 5901: Show QFQ class
> Test 0385: Create DRR with default setting
> Test 2375: Delete DRR with handle
> Test 3092: Show DRR class
> Test 3460: Create CBQ with default setting
> exit: 2
> exit: 0
> Error: Specified qdisc kind is unknown.

As you stated, this likely means you are missing a config option.
However you can just ask tdc to run a specific test file to avoid this.

> (...)

cheers,
Victor


* Re: [PATCH RESEND net-next 5/5] net/sched: taprio: dump class stats for the actual q->qdiscs[]
  2023-06-09 12:10     ` Vladimir Oltean
  2023-06-09 14:56       ` Victor Nogueira
@ 2023-06-09 16:19       ` Jamal Hadi Salim
  2023-06-09 17:56         ` Vladimir Oltean
  1 sibling, 1 reply; 25+ messages in thread
From: Jamal Hadi Salim @ 2023-06-09 16:19 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

On Fri, Jun 9, 2023 at 8:10 AM Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
>
> On Thu, Jun 08, 2023 at 02:44:46PM -0400, Jamal Hadi Salim wrote:
> > Other than the refcount issue i think the approach looks reasonable to
> > me. The stats before/after you are showing below though are
> > interesting; are you showing a transient phase where packets are
> > temporarily in the backlog. Typically the backlog is a transient phase
> > which lasts a very short period. Maybe it works differently for
> > taprio? I took a quick look at the code and do see to decrement the
> > backlog in the dequeue, so if it is not transient then some code path
> > is not being hit.
>
> It's a fair concern. The thing is that I put very aggressive time slots
> in the schedule that I'm testing with, and my kernel has a lot of
> debugging stuff which bogs it down (kasan, kmemleak, lockdep, DMA API
> debug etc). Not to mention that the CPU isn't the fastest to begin with.
>
> The way taprio works is that there's a hrtimer which fires at the
> expiration time of the current schedule entry and sets up the gates for
> the next one. Each schedule entry has a gate for each traffic class
> which determines what traffic classes are eligible for dequeue() and
> which ones aren't.
>
> The dequeue() procedure, though also invoked by the advance_schedule()
> hrtimer -> __netif_schedule(), is also time-sensitive. By the time
> taprio_dequeue() runs, taprio_entry_allows_tx() function might return
> false when the system is so bogged down that it wasn't able to make
> enough progress to dequeue() an skb in time. When that happens, there is
> no mechanism, currently, to age out packets that stood too much in the
> TX queues (what does "too much" mean?).
>
> Whereas enqueue() is technically not time-sensitive, i.e. you can
> enqueue whenever you want and the Qdisc will dequeue whenever it can.
> Though in practice, to make this scheduling technique useful, the user
> space enqueue should also be time-aware (though you can't capture this
> with ping).
>
> If I increase all my sched-entry intervals by a factor of 100, the
> backlog issue goes away and the system can make forward progress.
>
> So yeah, sorry, I didn't pay too much attention to the data I was
> presenting for illustrative purposes.
>

So it seems to me it is a transient phase and that at some point the
backlog will clear up and the sent stats will go up. Maybe just say so
in your commit message, or show the final result after the packets are gone.

I have to admit, I don't know much about taprio - that's why I am
asking all these leading questions. You spoke of gates etc. and that's
Klingon to me; but IIUC there's some time-sensitive stuff that needs
to be sent out within a deadline. Q: what should happen to skbs that
are no longer valid?
On the aging thing which you say is missing, shouldn't the hrtimer or
schedule kick be able to dequeue timestamped packets and just drop
them?

cheers,
jamal

> > Aside: I realize you are busy - but if you get time and provide some
> > sample tc command lines for testing we could help create the tests for
> > you, at least the first time. The advantage of putting these tests in
> > tools/testing/selftests/tc-testing/ is that there are test tools out
> > there that run these tests and so regressions are easier to catch
> > sooner.
>
> Yeah, ok. The script posted in a reply to the cover letter is still what
> I'm working with. The things it intends to capture are:
> - attaching a custom Qdisc to one of taprio's classes doesn't fail
> - attaching taprio to one of taprio's classes fails
> - sending packets through one queue increases the counters (any counters)
>   of just that queue
>
> All the above, replicated once for the software scheduling case and once
> for the offload case. Currently netdevsim doesn't attempt to emulate
> taprio offload.
>
> Is there a way to skip tests? I may look into tdc, but I honestly don't
> have time for unrelated stuff such as figuring out why my kernel isn't
> configured for the other tests to pass - and it seems that once one test
> fails, the others are completely skipped, see below.
>
> Also, by which rule are the test IDs created?
>
> root@debian:~# cd selftests/tc-testing/
> root@debian:~/selftests/tc-testing# ./tdc.sh
> considering category qdisc
>  -- ns/SubPlugin.__init__
> Test 0582: Create QFQ with default setting
> Test c9a3: Create QFQ with class weight setting
> Test d364: Test QFQ with max class weight setting
> Test 8452: Create QFQ with class maxpkt setting
> Test 22df: Test QFQ class maxpkt setting lower bound
> Test 92ee: Test QFQ class maxpkt setting upper bound
> Test d920: Create QFQ with multiple class setting
> Test 0548: Delete QFQ with handle
> Test 5901: Show QFQ class
> Test 0385: Create DRR with default setting
> Test 2375: Delete DRR with handle
> Test 3092: Show DRR class
> Test 3460: Create CBQ with default setting
> exit: 2
> exit: 0
> Error: Specified qdisc kind is unknown.
>
>
> -----> teardown stage *** Could not execute: "$TC qdisc del dev $DUMMY handle 1: root"
>
> -----> teardown stage *** Error message: "Error: Invalid handle.
> "
> returncode 2; expected [0]
>
> -----> teardown stage *** Aborting test run.
>
>
> <_io.BufferedReader name=3> *** stdout ***
>
>
> <_io.BufferedReader name=5> *** stderr ***
> "-----> teardown stage" did not complete successfully
> Exception <class '__main__.PluginMgrTestFail'> ('teardown', 'Error: Specified qdisc kind is unknown.\n', '"-----> teardown stage" did not complete successfully') (caught in test_runner, running test 14 3460 Create CBQ with default setting stage teardown)
> ---------------
> traceback
>   File "/root/selftests/tc-testing/./tdc.py", line 495, in test_runner
>     res = run_one_test(pm, args, index, tidx)
>   File "/root/selftests/tc-testing/./tdc.py", line 434, in run_one_test
>     prepare_env(args, pm, 'teardown', '-----> teardown stage', tidx['teardown'], procout)
>   File "/root/selftests/tc-testing/./tdc.py", line 245, in prepare_env
>     raise PluginMgrTestFail(
> ---------------
> accumulated output for this test:
> Error: Specified qdisc kind is unknown.
> ---------------
>
> All test results:
>
> 1..336
> ok 1 0582 - Create QFQ with default setting
> ok 2 c9a3 - Create QFQ with class weight setting
> ok 3 d364 - Test QFQ with max class weight setting
> ok 4 8452 - Create QFQ with class maxpkt setting
> ok 5 22df - Test QFQ class maxpkt setting lower bound
> ok 6 92ee - Test QFQ class maxpkt setting upper bound
> ok 7 d920 - Create QFQ with multiple class setting
> ok 8 0548 - Delete QFQ with handle
> ok 9 5901 - Show QFQ class
> ok 10 0385 - Create DRR with default setting
> ok 11 2375 - Delete DRR with handle
> ok 12 3092 - Show DRR class
> ok 13 3460 - Create CBQ with default setting # skipped - "-----> teardown stage" did not complete successfully
>
> ok 14 0592 - Create CBQ with mpu # skipped - skipped - previous teardown failed 14 3460
>
> ok 15 4684 - Create CBQ with valid cell num # skipped - skipped - previous teardown failed 14 3460
>
> ok 16 4345 - Create CBQ with invalid cell num # skipped - skipped - previous teardown failed 14 3460
>
> ok 17 4525 - Create CBQ with valid ewma # skipped - skipped - previous teardown failed 14 3460
>
> ok 18 6784 - Create CBQ with invalid ewma # skipped - skipped - previous teardown failed 14 3460
>
> ok 19 5468 - Delete CBQ with handle # skipped - skipped - previous teardown failed 14 3460
>
> ok 20 492a - Show CBQ class # skipped - skipped - previous teardown failed 14 3460
>
> ok 21 9903 - Add mqprio Qdisc to multi-queue device (8 queues) # skipped - skipped - previous teardown failed 14 3460
>
> ok 22 453a - Delete nonexistent mqprio Qdisc # skipped - skipped - previous teardown failed 14 3460
>
> ok 23 5292 - Delete mqprio Qdisc twice # skipped - skipped - previous teardown failed 14 3460
>
> ok 24 45a9 - Add mqprio Qdisc to single-queue device # skipped - skipped - previous teardown failed 14 3460
>
> ok 25 2ba9 - Show mqprio class # skipped - skipped - previous teardown failed 14 3460
>
> ok 26 4812 - Create HHF with default setting # skipped - skipped - previous teardown failed 14 3460
>
> ok 27 8a92 - Create HHF with limit setting # skipped - skipped - previous teardown failed 14 3460
>
> ok 28 3491 - Create HHF with quantum setting # skipped - skipped - previous teardown failed 14 3460
> (...)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RESEND net-next 5/5] net/sched: taprio: dump class stats for the actual q->qdiscs[]
  2023-06-09 16:19       ` Jamal Hadi Salim
@ 2023-06-09 17:56         ` Vladimir Oltean
  0 siblings, 0 replies; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-09 17:56 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Vinicius Costa Gomes,
	linux-kernel, intel-wired-lan, Muhammad Husaini Zulkifli,
	Peilin Ye, Pedro Tammela

On Fri, Jun 09, 2023 at 12:19:12PM -0400, Jamal Hadi Salim wrote:
> So it seems to me that this is a transient phase and at some point the
> backlog will clear up and the sent stats will go up. Maybe just say so
> in your commit message, or show the final result after the packets are gone.

I will re-collect some stats where there is nothing backlogged.

> I have to admit, I don't know much about taprio - that's why I am
> asking all these leading questions. You spoke of gates etc. and that's
> Klingon to me; but if I understand correctly, there's some time-sensitive
> stuff that needs to be sent out within a deadline.

If sch_taprio.c is klingon to you, you can just imagine how sch_api.c
reads to me :)

> Q: What should happen to skbs that are no longer valid?
> On the aging thing which you say is missing, shouldn't the hrtimer or
> schedule kick be able to dequeue timestamped packets and just drop
> them?

I think whether the skbs are "valid" is an application-defined notion
(except in txtime assist mode, where skbs do truly have a transmit
deadline). User space could reasonably enqueue 100 packets at a time,
fully aware that the cycle length is 1 second and that there is only
room in one cycle to send one packet, so it would take 100 seconds for
all those packets to be dequeued and sent.

It could also be that this isn't the case.
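The back-of-the-envelope arithmetic in that 100-packet example works out as:

```python
# Rough drain-time estimate for the backlog scenario above:
# 100 packets enqueued, a 1-second cycle, room for 1 packet per cycle.
cycle_time_s = 1.0
packets_per_cycle = 1
backlogged_packets = 100

drain_time_s = backlogged_packets / packets_per_cycle * cycle_time_s
print(f"backlog drains in {drain_time_s:.0f} s")  # backlog drains in 100 s
```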

I guess what could be auto-detected and warned about is the case where a
full cycle passes (sort of like an RCU grace period) with a non-zero
backlog and open gates, and still no skb was dequeued. After one cycle,
the schedule repeats itself identically. But then do what? Why drop?
It's a system issue...
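That detection heuristic could be sketched as follows. This is purely
hypothetical - not kernel code - and the function and parameter names are
made up for illustration:

```python
# Hypothetical stall check, evaluated once per cycle (illustrative only).
# It flags the case described above: backlog present, at least one gate
# was open for a backlogged traffic class, yet nothing was dequeued.

def cycle_stalled(backlog_at_start, backlog_at_end,
                  dequeued_in_cycle, gate_opened_for_backlog):
    """Return True if the cycle made no forward progress despite open gates."""
    return (backlog_at_start > 0
            and gate_opened_for_backlog
            and dequeued_in_cycle == 0
            and backlog_at_end == backlog_at_start)

print(cycle_stalled(100, 100, 0, True))  # True: no progress in a full cycle
print(cycle_stalled(100, 99, 1, True))   # False: at least one skb went out
```

Even with such a check, the open question from above remains: detecting the
stall is easy, but dropping on the qdisc's own initiative may not be the
right response when the root cause is the system being too slow.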


* Re: [PATCH RESEND net-next 5/5] net/sched: taprio: dump class stats for the actual q->qdiscs[]
  2023-06-09 14:56       ` Victor Nogueira
@ 2023-06-10 17:49         ` Vladimir Oltean
  0 siblings, 0 replies; 25+ messages in thread
From: Vladimir Oltean @ 2023-06-10 17:49 UTC (permalink / raw)
  To: Victor Nogueira, Paolo Abeni, Jamal Hadi Salim
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski, Cong Wang,
	Jiri Pirko, Vinicius Costa Gomes, linux-kernel, intel-wired-lan,
	Muhammad Husaini Zulkifli, Peilin Ye, Pedro Tammela

On Fri, Jun 09, 2023 at 11:56:35AM -0300, Victor Nogueira wrote:
> You can tell tdc to run a specific test file by providing the "-f" option.
> For example, if you want to run only taprio tests, you can issue the
> following command:
> 
> ./tdc.py -f tc-tests/qdiscs/taprio.json

Thanks. I've been able to make some progress with this.

Unfortunately the updated series conflicts with this in-flight patch
set, so I wouldn't want to post it just yet.
https://patchwork.kernel.org/project/netdevbpf/cover/20230609135917.1084327-1-vladimir.oltean@nxp.com/

However, I did add taprio offload to the netdevsim driver so that this
code path could be covered with tdc. I'd like Jamal and Paolo, who asked
for it, to comment on whether this is what they were looking to see.
https://github.com/vladimiroltean/linux/commits/sch-taprio-relationship-children-v2


end of thread, other threads:[~2023-06-10 17:49 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-02 10:37 [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children Vladimir Oltean
2023-06-02 10:37 ` [PATCH RESEND net-next 1/5] net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during attach() Vladimir Oltean
2023-06-02 10:37 ` [PATCH RESEND net-next 2/5] net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode Vladimir Oltean
2023-06-06 10:27   ` Paolo Abeni
2023-06-06 15:56     ` Vladimir Oltean
2023-06-07 10:05       ` Paolo Abeni
2023-06-07 10:16         ` Vladimir Oltean
2023-06-02 10:37 ` [PATCH RESEND net-next 3/5] net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf() Vladimir Oltean
2023-06-02 10:37 ` [PATCH RESEND net-next 4/5] net/sched: taprio: delete misleading comment about preallocating child qdiscs Vladimir Oltean
2023-06-02 10:37 ` [PATCH RESEND net-next 5/5] net/sched: taprio: dump class stats for the actual q->qdiscs[] Vladimir Oltean
2023-06-08 18:44   ` Jamal Hadi Salim
2023-06-09 12:10     ` Vladimir Oltean
2023-06-09 14:56       ` Victor Nogueira
2023-06-10 17:49         ` Vladimir Oltean
2023-06-09 16:19       ` Jamal Hadi Salim
2023-06-09 17:56         ` Vladimir Oltean
2023-06-05 12:50 ` [PATCH RESEND net-next 0/5] Improve the taprio qdisc's relationship with its children Vladimir Oltean
2023-06-05 13:27   ` Simon Horman
2023-06-05 15:44 ` Jamal Hadi Salim
2023-06-05 16:53   ` Vladimir Oltean
2023-06-06 15:39     ` Jamal Hadi Salim
2023-06-06 16:31       ` Vladimir Oltean
2023-06-06 17:42         ` Jamal Hadi Salim
2023-06-07 10:09           ` Vladimir Oltean
2023-06-07 10:30             ` Vladimir Oltean
