* [PATCH net-next v1 0/5] TC: Introduce qevents
From: Petr Machata @ 2020-06-26 22:45 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Jakub Kicinski, Cong Wang, Eric Dumazet, jiri,
	idosch, Petr Machata

The Spectrum hardware allows executing one of several actions as a result
of queue management decisions: when a packet is tail-dropped, early-dropped
or ECN-marked, or when it passes a configured latency threshold or buffer
size, it can be mirrored, trapped, or sampled.

Modeling the action to be taken as simply a TC action is very attractive,
but it is not obvious where to put these actions. At least with ECN marking
one could imagine a tree of qdiscs and classifiers that effectively
accomplishes this task, albeit in an impractically complex manner. But
there is just no way to match on dropped-ness of a packet, let alone
dropped-ness due to a particular reason.

To allow configuring user-defined actions as a result of the inner workings
of a qdisc, this patch set introduces the concept of qevents. These are
attach points for TC blocks, where filters can be put that are executed as
a packet hits well-defined points in the qdisc algorithms. The attached
blocks can be shared in a manner similar to clsact ingress and egress
blocks, and arbitrary classifiers with arbitrary actions can be put on
them.

For example:

# tc qdisc add dev eth0 root handle 1: \
	red limit 500K avpkt 1K qevent early_drop block 10
# tc filter add block 10 \
	matchall action mirred egress mirror dev eth1

The central patch #2 introduces several helpers to allow easy and uniform
addition of qevents to qdiscs: initialization, destruction, qevent block
number change validation, and qevent handling, i.e. dispatch of the filters
attached to the block bound to a qevent.
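
To give an idea of the intended usage, below is a condensed sketch of how a
qdisc wires these helpers up. It is modeled on the sch_red changes later in
the series, but the "my_" names and the TCA_MY_BLOCK attribute are
placeholders, the binder type would be one of the types added in patch #4
(or a new one), and the callbacks are assumed to receive the already-parsed
attribute array, like __red_change() does:

struct my_sched_data {
	struct tcf_qevent	qe;
};

static int my_init(struct Qdisc *sch, struct nlattr **tb,
		   struct netlink_ext_ack *extack)
{
	struct my_sched_data *q = qdisc_priv(sch);

	/* Bind a block to the qevent if the block attribute was given. */
	return tcf_qevent_init(&q->qe, sch,
			       FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP,
			       tb[TCA_MY_BLOCK], extack);
}

static int my_change(struct Qdisc *sch, struct nlattr **tb,
		     struct netlink_ext_ack *extack)
{
	struct my_sched_data *q = qdisc_priv(sch);

	/* Reject attempts to rebind to a different block. */
	return tcf_qevent_validate_change(&q->qe, tb[TCA_MY_BLOCK], extack);
}

static int my_dump(struct Qdisc *sch, struct sk_buff *skb)
{
	struct my_sched_data *q = qdisc_priv(sch);

	/* Report the bound block number, if any. */
	return tcf_qevent_dump(skb, TCA_MY_BLOCK, &q->qe);
}

static void my_destroy(struct Qdisc *sch)
{
	struct my_sched_data *q = qdisc_priv(sch);

	tcf_qevent_destroy(&q->qe, sch);
}

The enqueue-time dispatch through tcf_qevent_handle() is shown under patch
#2 and used by sch_red in patch #4.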

Patch #1 adds a root_lock argument to the qdisc enqueue op. The problem
this tackles is that if a qevent filter pushes packets to the same qdisc
tree that holds the qevent in the first place, an attempt to take the qdisc
root lock a second time will lead to a deadlock. To solve the issue, the
qevent handler needs to unlock and relock the root lock around the filter
processing. Passing root_lock around makes it possible to get the lock
where it is needed, and visibly so, such that it is obvious the lock will
be used when invoking a qevent.

The following two patches, #3 and #4, then add two qevents to the RED
qdisc: the "early_drop" qevent fires when a packet is early-dropped; the
"mark" qevent, when it is ECN-marked.

Patch #5 contains a selftest. I mentioned this test when pushing the RED
ECN nodrop mode and said that "I have no confidence in its portability
to [...] different configurations". That still holds. The backlog and
packet size are tuned to make the test deterministic. But it is better than
nothing, and on the boxes that I ran it on, it does work and shows that
qevents work the way they are supposed to, and that their addition has not
broken the other tested features.

This patch set does not deal with offloading. The idea there is that a
driver will be able to figure out that a given block is used in a qevent
context by looking at the binder type. A future patch set will add a qdisc
pointer to struct flow_block_offload, which a driver will be able to
consult to glean the TC or other relevant attributes.

Changes from RFC to v1:
- Move a "q = qdisc_priv(sch)" from patch #3 to patch #4.
- Fix deadlock caused by mirroring a packet back to the same qdisc tree.
- Rename "tail" qevent to "tail_drop".
- Adapt to the new 100-column standard.
- Add a selftest.

Petr Machata (5):
  net: sched: Pass root lock to Qdisc_ops.enqueue
  net: sched: Introduce helpers for qevent blocks
  net: sched: sch_red: Split init and change callbacks
  net: sched: sch_red: Add qevents "early_drop" and "mark"
  selftests: forwarding: Add a RED test for SW datapath

 include/net/flow_offload.h                    |   2 +
 include/net/pkt_cls.h                         |  49 ++
 include/net/sch_generic.h                     |   6 +-
 include/uapi/linux/pkt_sched.h                |   2 +
 net/core/dev.c                                |   4 +-
 net/sched/cls_api.c                           | 119 +++++
 net/sched/sch_atm.c                           |   4 +-
 net/sched/sch_blackhole.c                     |   2 +-
 net/sched/sch_cake.c                          |   2 +-
 net/sched/sch_cbq.c                           |   4 +-
 net/sched/sch_cbs.c                           |  18 +-
 net/sched/sch_choke.c                         |   2 +-
 net/sched/sch_codel.c                         |   2 +-
 net/sched/sch_drr.c                           |   4 +-
 net/sched/sch_dsmark.c                        |   4 +-
 net/sched/sch_etf.c                           |   2 +-
 net/sched/sch_ets.c                           |   4 +-
 net/sched/sch_fifo.c                          |   6 +-
 net/sched/sch_fq.c                            |   2 +-
 net/sched/sch_fq_codel.c                      |   2 +-
 net/sched/sch_fq_pie.c                        |   2 +-
 net/sched/sch_generic.c                       |   4 +-
 net/sched/sch_gred.c                          |   2 +-
 net/sched/sch_hfsc.c                          |   6 +-
 net/sched/sch_hhf.c                           |   2 +-
 net/sched/sch_htb.c                           |   4 +-
 net/sched/sch_multiq.c                        |   4 +-
 net/sched/sch_netem.c                         |   8 +-
 net/sched/sch_pie.c                           |   2 +-
 net/sched/sch_plug.c                          |   2 +-
 net/sched/sch_prio.c                          |   6 +-
 net/sched/sch_qfq.c                           |   4 +-
 net/sched/sch_red.c                           | 102 +++-
 net/sched/sch_sfb.c                           |   4 +-
 net/sched/sch_sfq.c                           |   2 +-
 net/sched/sch_skbprio.c                       |   2 +-
 net/sched/sch_taprio.c                        |   4 +-
 net/sched/sch_tbf.c                           |  10 +-
 net/sched/sch_teql.c                          |   4 +-
 .../selftests/net/forwarding/sch_red.sh       | 492 ++++++++++++++++++
 40 files changed, 822 insertions(+), 84 deletions(-)
 create mode 100755 tools/testing/selftests/net/forwarding/sch_red.sh

-- 
2.20.1



* [PATCH net-next v1 1/5] net: sched: Pass root lock to Qdisc_ops.enqueue
From: Petr Machata @ 2020-06-26 22:45 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Jakub Kicinski, Cong Wang, Eric Dumazet, jiri,
	idosch, Petr Machata

A following patch introduces qevents, points in the qdisc algorithm where
a packet can be processed by user-defined filters. Should this processing
lead to a situation where a new packet is to be enqueued on the same port,
holding the root lock would lead to deadlocks. To solve the issue, the
qevent handler needs to unlock and relock the root lock when necessary.

To that end, add a root_lock argument to the qdisc enqueue op and propagate
it throughout.
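
For illustration, the locked path of __dev_xmit_skb() ends up looking
roughly like the sketch below (heavily simplified: the busylock handling,
the TCQ_F_CAN_BYPASS and NOLOCK branches, and error handling are all
omitted). The point is that the root lock is taken once here and then
handed down through the enqueue ops:

static int sketch_xmit_locked(struct sk_buff *skb, struct Qdisc *q)
{
	spinlock_t *root_lock = qdisc_lock(q);
	struct sk_buff *to_free = NULL;
	int rc;

	spin_lock(root_lock);
	/* root_lock is threaded through the enqueue ops so that a qevent
	 * handler deeper in the qdisc tree can drop and re-take it around
	 * filter execution.
	 */
	rc = q->enqueue(skb, q, root_lock, &to_free) & NET_XMIT_MASK;
	if (qdisc_run_begin(q)) {
		__qdisc_run(q);
		qdisc_run_end(q);
	}
	spin_unlock(root_lock);

	if (unlikely(to_free))
		kfree_skb_list(to_free);
	return rc;
}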

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 include/net/sch_generic.h |  6 ++++--
 net/core/dev.c            |  4 ++--
 net/sched/sch_atm.c       |  4 ++--
 net/sched/sch_blackhole.c |  2 +-
 net/sched/sch_cake.c      |  2 +-
 net/sched/sch_cbq.c       |  4 ++--
 net/sched/sch_cbs.c       | 18 +++++++++---------
 net/sched/sch_choke.c     |  2 +-
 net/sched/sch_codel.c     |  2 +-
 net/sched/sch_drr.c       |  4 ++--
 net/sched/sch_dsmark.c    |  4 ++--
 net/sched/sch_etf.c       |  2 +-
 net/sched/sch_ets.c       |  4 ++--
 net/sched/sch_fifo.c      |  6 +++---
 net/sched/sch_fq.c        |  2 +-
 net/sched/sch_fq_codel.c  |  2 +-
 net/sched/sch_fq_pie.c    |  2 +-
 net/sched/sch_generic.c   |  4 ++--
 net/sched/sch_gred.c      |  2 +-
 net/sched/sch_hfsc.c      |  6 +++---
 net/sched/sch_hhf.c       |  2 +-
 net/sched/sch_htb.c       |  4 ++--
 net/sched/sch_multiq.c    |  4 ++--
 net/sched/sch_netem.c     |  8 ++++----
 net/sched/sch_pie.c       |  2 +-
 net/sched/sch_plug.c      |  2 +-
 net/sched/sch_prio.c      |  6 +++---
 net/sched/sch_qfq.c       |  4 ++--
 net/sched/sch_red.c       |  4 ++--
 net/sched/sch_sfb.c       |  4 ++--
 net/sched/sch_sfq.c       |  2 +-
 net/sched/sch_skbprio.c   |  2 +-
 net/sched/sch_taprio.c    |  4 ++--
 net/sched/sch_tbf.c       | 10 +++++-----
 net/sched/sch_teql.c      |  4 ++--
 35 files changed, 73 insertions(+), 71 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index c510b03b9751..fceb3d63c925 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -57,6 +57,7 @@ struct qdisc_skb_head {
 struct Qdisc {
 	int 			(*enqueue)(struct sk_buff *skb,
 					   struct Qdisc *sch,
+					   spinlock_t *root_lock,
 					   struct sk_buff **to_free);
 	struct sk_buff *	(*dequeue)(struct Qdisc *sch);
 	unsigned int		flags;
@@ -241,6 +242,7 @@ struct Qdisc_ops {
 
 	int 			(*enqueue)(struct sk_buff *skb,
 					   struct Qdisc *sch,
+					   spinlock_t *root_lock,
 					   struct sk_buff **to_free);
 	struct sk_buff *	(*dequeue)(struct Qdisc *);
 	struct sk_buff *	(*peek)(struct Qdisc *);
@@ -788,11 +790,11 @@ static inline void qdisc_calculate_pkt_len(struct sk_buff *skb,
 #endif
 }
 
-static inline int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static inline int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 				struct sk_buff **to_free)
 {
 	qdisc_calculate_pkt_len(skb, sch);
-	return sch->enqueue(skb, sch, to_free);
+	return sch->enqueue(skb, sch, root_lock, to_free);
 }
 
 static inline void _bstats_update(struct gnet_stats_basic_packed *bstats,
diff --git a/net/core/dev.c b/net/core/dev.c
index 6bc2388141f6..7d45356e1a6f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3743,7 +3743,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	qdisc_calculate_pkt_len(skb, q);
 
 	if (q->flags & TCQ_F_NOLOCK) {
-		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+		rc = q->enqueue(skb, q, NULL, &to_free) & NET_XMIT_MASK;
 		qdisc_run(q);
 
 		if (unlikely(to_free))
@@ -3786,7 +3786,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		qdisc_run_end(q);
 		rc = NET_XMIT_SUCCESS;
 	} else {
-		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+		rc = q->enqueue(skb, q, root_lock, &to_free) & NET_XMIT_MASK;
 		if (qdisc_run_begin(q)) {
 			if (unlikely(contended)) {
 				spin_unlock(&q->busylock);
diff --git a/net/sched/sch_atm.c b/net/sched/sch_atm.c
index ee12ca9f55b4..fb6b16c4e46d 100644
--- a/net/sched/sch_atm.c
+++ b/net/sched/sch_atm.c
@@ -374,7 +374,7 @@ static struct tcf_block *atm_tc_tcf_block(struct Qdisc *sch, unsigned long cl,
 
 /* --------------------------- Qdisc operations ---------------------------- */
 
-static int atm_tc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int atm_tc_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			  struct sk_buff **to_free)
 {
 	struct atm_qdisc_data *p = qdisc_priv(sch);
@@ -432,7 +432,7 @@ static int atm_tc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 #endif
 	}
 
-	ret = qdisc_enqueue(skb, flow->q, to_free);
+	ret = qdisc_enqueue(skb, flow->q, root_lock, to_free);
 	if (ret != NET_XMIT_SUCCESS) {
 drop: __maybe_unused
 		if (net_xmit_drop_count(ret)) {
diff --git a/net/sched/sch_blackhole.c b/net/sched/sch_blackhole.c
index a7f7667ae984..187644657c4f 100644
--- a/net/sched/sch_blackhole.c
+++ b/net/sched/sch_blackhole.c
@@ -13,7 +13,7 @@
 #include <linux/skbuff.h>
 #include <net/pkt_sched.h>
 
-static int blackhole_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int blackhole_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			     struct sk_buff **to_free)
 {
 	qdisc_drop(skb, sch, to_free);
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 60f8ae578819..d109cae6eb6a 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -1663,7 +1663,7 @@ static u32 cake_classify(struct Qdisc *sch, struct cake_tin_data **t,
 
 static void cake_reconfigure(struct Qdisc *sch);
 
-static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			struct sk_buff **to_free)
 {
 	struct cake_sched_data *q = qdisc_priv(sch);
diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c
index 39b427dc7512..052d4a1af69a 100644
--- a/net/sched/sch_cbq.c
+++ b/net/sched/sch_cbq.c
@@ -356,7 +356,7 @@ cbq_mark_toplevel(struct cbq_sched_data *q, struct cbq_class *cl)
 }
 
 static int
-cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 	    struct sk_buff **to_free)
 {
 	struct cbq_sched_data *q = qdisc_priv(sch);
@@ -373,7 +373,7 @@ cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		return ret;
 	}
 
-	ret = qdisc_enqueue(skb, cl->q, to_free);
+	ret = qdisc_enqueue(skb, cl->q, root_lock, to_free);
 	if (ret == NET_XMIT_SUCCESS) {
 		sch->q.qlen++;
 		cbq_mark_toplevel(q, cl);
diff --git a/net/sched/sch_cbs.c b/net/sched/sch_cbs.c
index 2eaac2ff380f..7af15ebe07f7 100644
--- a/net/sched/sch_cbs.c
+++ b/net/sched/sch_cbs.c
@@ -77,7 +77,7 @@ struct cbs_sched_data {
 	s64 sendslope; /* in bytes/s */
 	s64 idleslope; /* in bytes/s */
 	struct qdisc_watchdog watchdog;
-	int (*enqueue)(struct sk_buff *skb, struct Qdisc *sch,
+	int (*enqueue)(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 		       struct sk_buff **to_free);
 	struct sk_buff *(*dequeue)(struct Qdisc *sch);
 	struct Qdisc *qdisc;
@@ -85,13 +85,13 @@ struct cbs_sched_data {
 };
 
 static int cbs_child_enqueue(struct sk_buff *skb, struct Qdisc *sch,
-			     struct Qdisc *child,
+			     struct Qdisc *child, spinlock_t *root_lock,
 			     struct sk_buff **to_free)
 {
 	unsigned int len = qdisc_pkt_len(skb);
 	int err;
 
-	err = child->ops->enqueue(skb, child, to_free);
+	err = child->ops->enqueue(skb, child, root_lock, to_free);
 	if (err != NET_XMIT_SUCCESS)
 		return err;
 
@@ -101,16 +101,16 @@ static int cbs_child_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	return NET_XMIT_SUCCESS;
 }
 
-static int cbs_enqueue_offload(struct sk_buff *skb, struct Qdisc *sch,
+static int cbs_enqueue_offload(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			       struct sk_buff **to_free)
 {
 	struct cbs_sched_data *q = qdisc_priv(sch);
 	struct Qdisc *qdisc = q->qdisc;
 
-	return cbs_child_enqueue(skb, sch, qdisc, to_free);
+	return cbs_child_enqueue(skb, sch, qdisc, root_lock, to_free);
 }
 
-static int cbs_enqueue_soft(struct sk_buff *skb, struct Qdisc *sch,
+static int cbs_enqueue_soft(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			    struct sk_buff **to_free)
 {
 	struct cbs_sched_data *q = qdisc_priv(sch);
@@ -124,15 +124,15 @@ static int cbs_enqueue_soft(struct sk_buff *skb, struct Qdisc *sch,
 		q->last = ktime_get_ns();
 	}
 
-	return cbs_child_enqueue(skb, sch, qdisc, to_free);
+	return cbs_child_enqueue(skb, sch, qdisc, root_lock, to_free);
 }
 
-static int cbs_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int cbs_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 		       struct sk_buff **to_free)
 {
 	struct cbs_sched_data *q = qdisc_priv(sch);
 
-	return q->enqueue(skb, sch, to_free);
+	return q->enqueue(skb, sch, root_lock, to_free);
 }
 
 /* timediff is in ns, slope is in bytes/s */
diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
index bd618b00d319..baf3faee31aa 100644
--- a/net/sched/sch_choke.c
+++ b/net/sched/sch_choke.c
@@ -210,7 +210,7 @@ static bool choke_match_random(const struct choke_sched_data *q,
 	return choke_match_flow(oskb, nskb);
 }
 
-static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int choke_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			 struct sk_buff **to_free)
 {
 	struct choke_sched_data *q = qdisc_priv(sch);
diff --git a/net/sched/sch_codel.c b/net/sched/sch_codel.c
index 30169b3adbbb..1d94837abdd8 100644
--- a/net/sched/sch_codel.c
+++ b/net/sched/sch_codel.c
@@ -108,7 +108,7 @@ static struct sk_buff *codel_qdisc_dequeue(struct Qdisc *sch)
 	return skb;
 }
 
-static int codel_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int codel_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			       struct sk_buff **to_free)
 {
 	struct codel_sched_data *q;
diff --git a/net/sched/sch_drr.c b/net/sched/sch_drr.c
index 07a2b0b35495..0d5c9a8ec61d 100644
--- a/net/sched/sch_drr.c
+++ b/net/sched/sch_drr.c
@@ -337,7 +337,7 @@ static struct drr_class *drr_classify(struct sk_buff *skb, struct Qdisc *sch,
 	return NULL;
 }
 
-static int drr_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int drr_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 		       struct sk_buff **to_free)
 {
 	unsigned int len = qdisc_pkt_len(skb);
@@ -355,7 +355,7 @@ static int drr_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	}
 
 	first = !cl->qdisc->q.qlen;
-	err = qdisc_enqueue(skb, cl->qdisc, to_free);
+	err = qdisc_enqueue(skb, cl->qdisc, root_lock, to_free);
 	if (unlikely(err != NET_XMIT_SUCCESS)) {
 		if (net_xmit_drop_count(err)) {
 			cl->qstats.drops++;
diff --git a/net/sched/sch_dsmark.c b/net/sched/sch_dsmark.c
index 05605b30bef3..fbe49fffcdbb 100644
--- a/net/sched/sch_dsmark.c
+++ b/net/sched/sch_dsmark.c
@@ -198,7 +198,7 @@ static struct tcf_block *dsmark_tcf_block(struct Qdisc *sch, unsigned long cl,
 
 /* --------------------------- Qdisc operations ---------------------------- */
 
-static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			  struct sk_buff **to_free)
 {
 	unsigned int len = qdisc_pkt_len(skb);
@@ -267,7 +267,7 @@ static int dsmark_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		}
 	}
 
-	err = qdisc_enqueue(skb, p->q, to_free);
+	err = qdisc_enqueue(skb, p->q, root_lock, to_free);
 	if (err != NET_XMIT_SUCCESS) {
 		if (net_xmit_drop_count(err))
 			qdisc_qstats_drop(sch);
diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c
index c48f91075b5c..7a7c50a68115 100644
--- a/net/sched/sch_etf.c
+++ b/net/sched/sch_etf.c
@@ -160,7 +160,7 @@ static void report_sock_error(struct sk_buff *skb, u32 err, u8 code)
 }
 
 static int etf_enqueue_timesortedlist(struct sk_buff *nskb, struct Qdisc *sch,
-				      struct sk_buff **to_free)
+				      spinlock_t *root_lock, struct sk_buff **to_free)
 {
 	struct etf_sched_data *q = qdisc_priv(sch);
 	struct rb_node **p = &q->head.rb_root.rb_node, *parent = NULL;
diff --git a/net/sched/sch_ets.c b/net/sched/sch_ets.c
index a87e9159338c..373dc5855d4e 100644
--- a/net/sched/sch_ets.c
+++ b/net/sched/sch_ets.c
@@ -415,7 +415,7 @@ static struct ets_class *ets_classify(struct sk_buff *skb, struct Qdisc *sch,
 	return &q->classes[band];
 }
 
-static int ets_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int ets_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			     struct sk_buff **to_free)
 {
 	unsigned int len = qdisc_pkt_len(skb);
@@ -433,7 +433,7 @@ static int ets_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	}
 
 	first = !cl->qdisc->q.qlen;
-	err = qdisc_enqueue(skb, cl->qdisc, to_free);
+	err = qdisc_enqueue(skb, cl->qdisc, root_lock, to_free);
 	if (unlikely(err != NET_XMIT_SUCCESS)) {
 		if (net_xmit_drop_count(err)) {
 			cl->qstats.drops++;
diff --git a/net/sched/sch_fifo.c b/net/sched/sch_fifo.c
index a579a4131d22..b4da5b624ad8 100644
--- a/net/sched/sch_fifo.c
+++ b/net/sched/sch_fifo.c
@@ -16,7 +16,7 @@
 
 /* 1 band FIFO pseudo-"scheduler" */
 
-static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			 struct sk_buff **to_free)
 {
 	if (likely(sch->qstats.backlog + qdisc_pkt_len(skb) <= sch->limit))
@@ -25,7 +25,7 @@ static int bfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	return qdisc_drop(skb, sch, to_free);
 }
 
-static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			 struct sk_buff **to_free)
 {
 	if (likely(sch->q.qlen < sch->limit))
@@ -34,7 +34,7 @@ static int pfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	return qdisc_drop(skb, sch, to_free);
 }
 
-static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int pfifo_tail_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			      struct sk_buff **to_free)
 {
 	unsigned int prev_backlog;
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 8f06a808c59a..64f5b4ccf3b0 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -439,7 +439,7 @@ static bool fq_packet_beyond_horizon(const struct sk_buff *skb,
 	return unlikely((s64)skb->tstamp > (s64)(q->ktime_cache + q->horizon));
 }
 
-static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 		      struct sk_buff **to_free)
 {
 	struct fq_sched_data *q = qdisc_priv(sch);
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 436160be9c18..fe044a546e92 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -181,7 +181,7 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
 	return idx;
 }
 
-static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			    struct sk_buff **to_free)
 {
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
diff --git a/net/sched/sch_fq_pie.c b/net/sched/sch_fq_pie.c
index fb760cee824e..a27a250ab8f9 100644
--- a/net/sched/sch_fq_pie.c
+++ b/net/sched/sch_fq_pie.c
@@ -125,7 +125,7 @@ static inline void flow_queue_add(struct fq_pie_flow *flow,
 	skb->next = NULL;
 }
 
-static int fq_pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int fq_pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 				struct sk_buff **to_free)
 {
 	struct fq_pie_sched_data *q = qdisc_priv(sch);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 265a61d011df..715cde1df9e4 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -520,7 +520,7 @@ EXPORT_SYMBOL(netif_carrier_off);
    cheaper.
  */
 
-static int noop_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
+static int noop_enqueue(struct sk_buff *skb, struct Qdisc *qdisc, spinlock_t *root_lock,
 			struct sk_buff **to_free)
 {
 	__qdisc_drop(skb, to_free);
@@ -614,7 +614,7 @@ static inline struct skb_array *band2list(struct pfifo_fast_priv *priv,
 	return &priv->q[band];
 }
 
-static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
+static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc, spinlock_t *root_lock,
 			      struct sk_buff **to_free)
 {
 	int band = prio2band[skb->priority & TC_PRIO_MAX];
diff --git a/net/sched/sch_gred.c b/net/sched/sch_gred.c
index 8599c6f31b05..7d67c6cd6605 100644
--- a/net/sched/sch_gred.c
+++ b/net/sched/sch_gred.c
@@ -161,7 +161,7 @@ static bool gred_per_vq_red_flags_used(struct gred_sched *table)
 	return false;
 }
 
-static int gred_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int gred_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			struct sk_buff **to_free)
 {
 	struct gred_sched_data *q = NULL;
diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 433f2190960f..7f6670044f0a 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -1528,8 +1528,8 @@ hfsc_dump_qdisc(struct Qdisc *sch, struct sk_buff *skb)
 	return -1;
 }
 
-static int
-hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
+static int hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
+			struct sk_buff **to_free)
 {
 	unsigned int len = qdisc_pkt_len(skb);
 	struct hfsc_class *cl;
@@ -1545,7 +1545,7 @@ hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
 	}
 
 	first = !cl->qdisc->q.qlen;
-	err = qdisc_enqueue(skb, cl->qdisc, to_free);
+	err = qdisc_enqueue(skb, cl->qdisc, root_lock, to_free);
 	if (unlikely(err != NET_XMIT_SUCCESS)) {
 		if (net_xmit_drop_count(err)) {
 			cl->qstats.drops++;
diff --git a/net/sched/sch_hhf.c b/net/sched/sch_hhf.c
index be35f03b657b..d37e87fdc2a4 100644
--- a/net/sched/sch_hhf.c
+++ b/net/sched/sch_hhf.c
@@ -368,7 +368,7 @@ static unsigned int hhf_drop(struct Qdisc *sch, struct sk_buff **to_free)
 	return bucket - q->buckets;
 }
 
-static int hhf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int hhf_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 		       struct sk_buff **to_free)
 {
 	struct hhf_sched_data *q = qdisc_priv(sch);
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 8184c87da8be..52fc513688b1 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -576,7 +576,7 @@ static inline void htb_deactivate(struct htb_sched *q, struct htb_class *cl)
 	cl->prio_activity = 0;
 }
 
-static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 		       struct sk_buff **to_free)
 {
 	int uninitialized_var(ret);
@@ -599,7 +599,7 @@ static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		__qdisc_drop(skb, to_free);
 		return ret;
 #endif
-	} else if ((ret = qdisc_enqueue(skb, cl->leaf.q,
+	} else if ((ret = qdisc_enqueue(skb, cl->leaf.q, root_lock,
 					to_free)) != NET_XMIT_SUCCESS) {
 		if (net_xmit_drop_count(ret)) {
 			qdisc_qstats_drop(sch);
diff --git a/net/sched/sch_multiq.c b/net/sched/sch_multiq.c
index 1330ad224931..648611f5c105 100644
--- a/net/sched/sch_multiq.c
+++ b/net/sched/sch_multiq.c
@@ -57,7 +57,7 @@ multiq_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 }
 
 static int
-multiq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+multiq_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 	       struct sk_buff **to_free)
 {
 	struct Qdisc *qdisc;
@@ -74,7 +74,7 @@ multiq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	}
 #endif
 
-	ret = qdisc_enqueue(skb, qdisc, to_free);
+	ret = qdisc_enqueue(skb, qdisc, root_lock, to_free);
 	if (ret == NET_XMIT_SUCCESS) {
 		sch->q.qlen++;
 		return NET_XMIT_SUCCESS;
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 84f82771cdf5..8fb17483a34f 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -431,7 +431,7 @@ static struct sk_buff *netem_segment(struct sk_buff *skb, struct Qdisc *sch,
  * 	NET_XMIT_DROP: queue length didn't change.
  *      NET_XMIT_SUCCESS: one skb was queued.
  */
-static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			 struct sk_buff **to_free)
 {
 	struct netem_sched_data *q = qdisc_priv(sch);
@@ -480,7 +480,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		u32 dupsave = q->duplicate; /* prevent duplicating a dup... */
 
 		q->duplicate = 0;
-		rootq->enqueue(skb2, rootq, to_free);
+		rootq->enqueue(skb2, rootq, root_lock, to_free);
 		q->duplicate = dupsave;
 		rc_drop = NET_XMIT_SUCCESS;
 	}
@@ -604,7 +604,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 			skb_mark_not_on_list(segs);
 			qdisc_skb_cb(segs)->pkt_len = segs->len;
 			last_len = segs->len;
-			rc = qdisc_enqueue(segs, sch, to_free);
+			rc = qdisc_enqueue(segs, sch, root_lock, to_free);
 			if (rc != NET_XMIT_SUCCESS) {
 				if (net_xmit_drop_count(rc))
 					qdisc_qstats_drop(sch);
@@ -720,7 +720,7 @@ static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 				struct sk_buff *to_free = NULL;
 				int err;
 
-				err = qdisc_enqueue(skb, q->qdisc, &to_free);
+				err = qdisc_enqueue(skb, q->qdisc, NULL, &to_free);
 				kfree_skb_list(to_free);
 				if (err != NET_XMIT_SUCCESS &&
 				    net_xmit_drop_count(err)) {
diff --git a/net/sched/sch_pie.c b/net/sched/sch_pie.c
index c65077f0c0f3..b305313b64e3 100644
--- a/net/sched/sch_pie.c
+++ b/net/sched/sch_pie.c
@@ -82,7 +82,7 @@ bool pie_drop_early(struct Qdisc *sch, struct pie_params *params,
 }
 EXPORT_SYMBOL_GPL(pie_drop_early);
 
-static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int pie_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			     struct sk_buff **to_free)
 {
 	struct pie_sched_data *q = qdisc_priv(sch);
diff --git a/net/sched/sch_plug.c b/net/sched/sch_plug.c
index cbc2ebca4548..e5f8b4769b4d 100644
--- a/net/sched/sch_plug.c
+++ b/net/sched/sch_plug.c
@@ -84,7 +84,7 @@ struct plug_sched_data {
 	u32 pkts_to_release;
 };
 
-static int plug_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int plug_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			struct sk_buff **to_free)
 {
 	struct plug_sched_data *q = qdisc_priv(sch);
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 647941702f9f..a3e187f2603c 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -65,8 +65,8 @@ prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
 	return q->queues[band];
 }
 
-static int
-prio_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
+static int prio_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
+			struct sk_buff **to_free)
 {
 	unsigned int len = qdisc_pkt_len(skb);
 	struct Qdisc *qdisc;
@@ -83,7 +83,7 @@ prio_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
 	}
 #endif
 
-	ret = qdisc_enqueue(skb, qdisc, to_free);
+	ret = qdisc_enqueue(skb, qdisc, root_lock, to_free);
 	if (ret == NET_XMIT_SUCCESS) {
 		sch->qstats.backlog += len;
 		sch->q.qlen++;
diff --git a/net/sched/sch_qfq.c b/net/sched/sch_qfq.c
index 0b05ac7c848e..ede854516825 100644
--- a/net/sched/sch_qfq.c
+++ b/net/sched/sch_qfq.c
@@ -1194,7 +1194,7 @@ static struct qfq_aggregate *qfq_choose_next_agg(struct qfq_sched *q)
 	return agg;
 }
 
-static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 		       struct sk_buff **to_free)
 {
 	unsigned int len = qdisc_pkt_len(skb), gso_segs;
@@ -1225,7 +1225,7 @@ static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 
 	gso_segs = skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1;
 	first = !cl->qdisc->q.qlen;
-	err = qdisc_enqueue(skb, cl->qdisc, to_free);
+	err = qdisc_enqueue(skb, cl->qdisc, root_lock, to_free);
 	if (unlikely(err != NET_XMIT_SUCCESS)) {
 		pr_debug("qfq_enqueue: enqueue failed %d\n", err);
 		if (net_xmit_drop_count(err)) {
diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c
index 555a1b9e467f..6ace7d757e8b 100644
--- a/net/sched/sch_red.c
+++ b/net/sched/sch_red.c
@@ -65,7 +65,7 @@ static int red_use_nodrop(struct red_sched_data *q)
 	return q->flags & TC_RED_NODROP;
 }
 
-static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 		       struct sk_buff **to_free)
 {
 	struct red_sched_data *q = qdisc_priv(sch);
@@ -118,7 +118,7 @@ static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		break;
 	}
 
-	ret = qdisc_enqueue(skb, child, to_free);
+	ret = qdisc_enqueue(skb, child, root_lock, to_free);
 	if (likely(ret == NET_XMIT_SUCCESS)) {
 		qdisc_qstats_backlog_inc(sch, skb);
 		sch->q.qlen++;
diff --git a/net/sched/sch_sfb.c b/net/sched/sch_sfb.c
index 4074c50ac3d7..d2a6e78262bb 100644
--- a/net/sched/sch_sfb.c
+++ b/net/sched/sch_sfb.c
@@ -276,7 +276,7 @@ static bool sfb_classify(struct sk_buff *skb, struct tcf_proto *fl,
 	return false;
 }
 
-static int sfb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int sfb_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 		       struct sk_buff **to_free)
 {
 
@@ -399,7 +399,7 @@ static int sfb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	}
 
 enqueue:
-	ret = qdisc_enqueue(skb, child, to_free);
+	ret = qdisc_enqueue(skb, child, root_lock, to_free);
 	if (likely(ret == NET_XMIT_SUCCESS)) {
 		qdisc_qstats_backlog_inc(sch, skb);
 		sch->q.qlen++;
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 5a6def5e4e6d..46cdefd69e44 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -343,7 +343,7 @@ static int sfq_headdrop(const struct sfq_sched_data *q)
 }
 
 static int
-sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
+sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock, struct sk_buff **to_free)
 {
 	struct sfq_sched_data *q = qdisc_priv(sch);
 	unsigned int hash, dropped;
diff --git a/net/sched/sch_skbprio.c b/net/sched/sch_skbprio.c
index 7a5e4c454715..f75f237c4436 100644
--- a/net/sched/sch_skbprio.c
+++ b/net/sched/sch_skbprio.c
@@ -65,7 +65,7 @@ static u16 calc_new_low_prio(const struct skbprio_sched_data *q)
 	return SKBPRIO_MAX_PRIORITY - 1;
 }
 
-static int skbprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int skbprio_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			  struct sk_buff **to_free)
 {
 	const unsigned int max_priority = SKBPRIO_MAX_PRIORITY - 1;
diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index b1eb12d33b9a..b6a9480ec3e7 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -410,7 +410,7 @@ static long get_packet_txtime(struct sk_buff *skb, struct Qdisc *sch)
 	return txtime;
 }
 
-static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 			  struct sk_buff **to_free)
 {
 	struct taprio_sched *q = qdisc_priv(sch);
@@ -435,7 +435,7 @@ static int taprio_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	qdisc_qstats_backlog_inc(sch, skb);
 	sch->q.qlen++;
 
-	return qdisc_enqueue(skb, child, to_free);
+	return qdisc_enqueue(skb, child, root_lock, to_free);
 }
 
 static struct sk_buff *taprio_peek_soft(struct Qdisc *sch)
diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index 78e79029dc63..c3eb5cdb83a8 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -187,7 +187,7 @@ static int tbf_offload_dump(struct Qdisc *sch)
 /* GSO packet is too big, segment it so that tbf can transmit
  * each segment in time
  */
-static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch,
+static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 		       struct sk_buff **to_free)
 {
 	struct tbf_sched_data *q = qdisc_priv(sch);
@@ -206,7 +206,7 @@ static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch,
 		skb_mark_not_on_list(segs);
 		qdisc_skb_cb(segs)->pkt_len = segs->len;
 		len += segs->len;
-		ret = qdisc_enqueue(segs, q->qdisc, to_free);
+		ret = qdisc_enqueue(segs, q->qdisc, root_lock, to_free);
 		if (ret != NET_XMIT_SUCCESS) {
 			if (net_xmit_drop_count(ret))
 				qdisc_qstats_drop(sch);
@@ -221,7 +221,7 @@ static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch,
 	return nb > 0 ? NET_XMIT_SUCCESS : NET_XMIT_DROP;
 }
 
-static int tbf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+static int tbf_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
 		       struct sk_buff **to_free)
 {
 	struct tbf_sched_data *q = qdisc_priv(sch);
@@ -231,10 +231,10 @@ static int tbf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	if (qdisc_pkt_len(skb) > q->max_size) {
 		if (skb_is_gso(skb) &&
 		    skb_gso_validate_mac_len(skb, q->max_size))
-			return tbf_segment(skb, sch, to_free);
+			return tbf_segment(skb, sch, root_lock, to_free);
 		return qdisc_drop(skb, sch, to_free);
 	}
-	ret = qdisc_enqueue(skb, q->qdisc, to_free);
+	ret = qdisc_enqueue(skb, q->qdisc, root_lock, to_free);
 	if (ret != NET_XMIT_SUCCESS) {
 		if (net_xmit_drop_count(ret))
 			qdisc_qstats_drop(sch);
diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index 689ef6f3ded8..511964653476 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -72,8 +72,8 @@ struct teql_sched_data {
 
 /* "teql*" qdisc routines */
 
-static int
-teql_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free)
+static int teql_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
+			struct sk_buff **to_free)
 {
 	struct net_device *dev = qdisc_dev(sch);
 	struct teql_sched_data *q = qdisc_priv(sch);
-- 
2.20.1



* [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
From: Petr Machata @ 2020-06-26 22:45 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Jakub Kicinski, Cong Wang, Eric Dumazet, jiri,
	idosch, Petr Machata

Qevents are attach points for TC blocks, where filters can be put that are
executed when "interesting events" take place in a qdisc. The data to keep
and the functions to invoke to maintain a qevent will be largely the same
between qevents. Therefore introduce sched-wide helpers for qevent
management.

Currently, similarly to the ingress and egress blocks of the clsact
pseudo-qdisc, block attachment cannot be changed after the qdisc is
created. To that end, add a helper tcf_qevent_validate_change(), which
verifies either that the block index attribute is not present, or, if it
is, that its value matches the current one (i.e. there is no material
change).

The function tcf_qevent_handle() should be invoked when a qdisc hits the
"interesting event" corresponding to a block. This function releases the
root lock for the duration of executing the attached filters, to allow
packets generated through user actions (notably mirred) to be reinserted
into the same qdisc tree.
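
A minimal sketch of the expected call site inside an enqueue callback,
based on how sch_red uses it later in this series (q->qe_early_drop is the
qevent member that sch_red will add; ret, root_lock and to_free are the
enqueue callback's own variables):

	/* A NULL return means the attached filters consumed the packet
	 * (dropped, stolen or redirected); *ret then holds the code to
	 * propagate.
	 */
	skb = tcf_qevent_handle(&q->qe_early_drop, sch, skb, root_lock,
				to_free, &ret);
	if (!skb)
		return NET_XMIT_CN | ret;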

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 include/net/pkt_cls.h |  49 +++++++++++++++++
 net/sched/cls_api.c   | 119 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 168 insertions(+)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index ed65619cbc47..b0e11bbb6d7a 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -32,6 +32,12 @@ struct tcf_block_ext_info {
 	u32 block_index;
 };
 
+struct tcf_qevent {
+	struct tcf_block	*block;
+	struct tcf_block_ext_info info;
+	struct tcf_proto __rcu *filter_chain;
+};
+
 struct tcf_block_cb;
 bool tcf_queue_work(struct rcu_work *rwork, work_func_t func);
 
@@ -552,6 +558,49 @@ int tc_setup_cb_reoffload(struct tcf_block *block, struct tcf_proto *tp,
 			  void *cb_priv, u32 *flags, unsigned int *in_hw_count);
 unsigned int tcf_exts_num_actions(struct tcf_exts *exts);
 
+#ifdef CONFIG_NET_CLS_ACT
+int tcf_qevent_init(struct tcf_qevent *qe, struct Qdisc *sch,
+		    enum flow_block_binder_type binder_type,
+		    struct nlattr *block_index_attr,
+		    struct netlink_ext_ack *extack);
+void tcf_qevent_destroy(struct tcf_qevent *qe, struct Qdisc *sch);
+int tcf_qevent_validate_change(struct tcf_qevent *qe, struct nlattr *block_index_attr,
+			       struct netlink_ext_ack *extack);
+struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, struct sk_buff *skb,
+				  spinlock_t *root_lock, struct sk_buff **to_free, int *ret);
+int tcf_qevent_dump(struct sk_buff *skb, int attr_name, struct tcf_qevent *qe);
+#else
+static inline int tcf_qevent_init(struct tcf_qevent *qe, struct Qdisc *sch,
+				  enum flow_block_binder_type binder_type,
+				  struct nlattr *block_index_attr,
+				  struct netlink_ext_ack *extack)
+{
+	return 0;
+}
+
+static inline void tcf_qevent_destroy(struct tcf_qevent *qe, struct Qdisc *sch)
+{
+}
+
+static inline int tcf_qevent_validate_change(struct tcf_qevent *qe, struct nlattr *block_index_attr,
+					     struct netlink_ext_ack *extack)
+{
+	return 0;
+}
+
+static inline struct sk_buff *
+tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, struct sk_buff *skb,
+		  spinlock_t *root_lock, struct sk_buff **to_free, int *ret)
+{
+	return skb;
+}
+
+static inline int tcf_qevent_dump(struct sk_buff *skb, int attr_name, struct tcf_qevent *qe)
+{
+	return 0;
+}
+#endif
+
 struct tc_cls_u32_knode {
 	struct tcf_exts *exts;
 	struct tcf_result *res;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index a00a203b2ef5..c594a6343be1 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3741,6 +3741,125 @@ unsigned int tcf_exts_num_actions(struct tcf_exts *exts)
 }
 EXPORT_SYMBOL(tcf_exts_num_actions);
 
+#ifdef CONFIG_NET_CLS_ACT
+static int tcf_qevent_parse_block_index(struct nlattr *block_index_attr,
+					u32 *p_block_index,
+					struct netlink_ext_ack *extack)
+{
+	*p_block_index = nla_get_u32(block_index_attr);
+	if (!*p_block_index) {
+		NL_SET_ERR_MSG(extack, "Block number may not be zero");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int tcf_qevent_init(struct tcf_qevent *qe, struct Qdisc *sch,
+		    enum flow_block_binder_type binder_type,
+		    struct nlattr *block_index_attr,
+		    struct netlink_ext_ack *extack)
+{
+	u32 block_index;
+	int err;
+
+	if (!block_index_attr)
+		return 0;
+
+	err = tcf_qevent_parse_block_index(block_index_attr, &block_index, extack);
+	if (err)
+		return err;
+
+	if (!block_index)
+		return 0;
+
+	qe->info.binder_type = binder_type;
+	qe->info.chain_head_change = tcf_chain_head_change_dflt;
+	qe->info.chain_head_change_priv = &qe->filter_chain;
+	qe->info.block_index = block_index;
+
+	return tcf_block_get_ext(&qe->block, sch, &qe->info, extack);
+}
+EXPORT_SYMBOL(tcf_qevent_init);
+
+void tcf_qevent_destroy(struct tcf_qevent *qe, struct Qdisc *sch)
+{
+	if (qe->info.block_index)
+		tcf_block_put_ext(qe->block, sch, &qe->info);
+}
+EXPORT_SYMBOL(tcf_qevent_destroy);
+
+int tcf_qevent_validate_change(struct tcf_qevent *qe, struct nlattr *block_index_attr,
+			       struct netlink_ext_ack *extack)
+{
+	u32 block_index;
+	int err;
+
+	if (!block_index_attr)
+		return 0;
+
+	err = tcf_qevent_parse_block_index(block_index_attr, &block_index, extack);
+	if (err)
+		return err;
+
+	/* Bounce newly-configured block or change in block. */
+	if (block_index != qe->info.block_index) {
+		NL_SET_ERR_MSG(extack, "Change of blocks is not supported");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(tcf_qevent_validate_change);
+
+struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, struct sk_buff *skb,
+				  spinlock_t *root_lock, struct sk_buff **to_free, int *ret)
+{
+	struct tcf_result cl_res;
+	struct tcf_proto *fl;
+
+	if (!qe->info.block_index)
+		return skb;
+
+	fl = rcu_dereference_bh(qe->filter_chain);
+
+	if (root_lock)
+		spin_unlock(root_lock);
+
+	switch (tcf_classify(skb, fl, &cl_res, false)) {
+	case TC_ACT_SHOT:
+		qdisc_qstats_drop(sch);
+		__qdisc_drop(skb, to_free);
+		*ret = __NET_XMIT_BYPASS;
+		return NULL;
+	case TC_ACT_STOLEN:
+	case TC_ACT_QUEUED:
+	case TC_ACT_TRAP:
+		__qdisc_drop(skb, to_free);
+		*ret = __NET_XMIT_STOLEN;
+		return NULL;
+	case TC_ACT_REDIRECT:
+		skb_do_redirect(skb);
+		*ret = __NET_XMIT_STOLEN;
+		return NULL;
+	}
+
+	if (root_lock)
+		spin_lock(root_lock);
+
+	return skb;
+}
+EXPORT_SYMBOL(tcf_qevent_handle);
+
+int tcf_qevent_dump(struct sk_buff *skb, int attr_name, struct tcf_qevent *qe)
+{
+	if (!qe->info.block_index)
+		return 0;
+	return nla_put_u32(skb, attr_name, qe->info.block_index);
+}
+EXPORT_SYMBOL(tcf_qevent_dump);
+#endif
+
 static __net_init int tcf_net_init(struct net *net)
 {
 	struct tcf_net *tn = net_generic(net, tcf_net_id);
-- 
2.20.1



* [PATCH net-next v1 3/5] net: sched: sch_red: Split init and change callbacks
From: Petr Machata @ 2020-06-26 22:45 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Jakub Kicinski, Cong Wang, Eric Dumazet, jiri,
	idosch, Petr Machata

In the following patches, RED will get two qevents. The implementation
will be clearer if the change callback is not a pure subset of the init
callback. Split the two, and promote attribute parsing from the common code
to the callbacks themselves, where having the parsed attributes at hand
will come in handy.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 net/sched/sch_red.c | 42 ++++++++++++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c
index 6ace7d757e8b..225ce370e5a8 100644
--- a/net/sched/sch_red.c
+++ b/net/sched/sch_red.c
@@ -215,12 +215,11 @@ static const struct nla_policy red_policy[TCA_RED_MAX + 1] = {
 	[TCA_RED_FLAGS] = NLA_POLICY_BITFIELD32(TC_RED_SUPPORTED_FLAGS),
 };
 
-static int red_change(struct Qdisc *sch, struct nlattr *opt,
-		      struct netlink_ext_ack *extack)
+static int __red_change(struct Qdisc *sch, struct nlattr **tb,
+			struct netlink_ext_ack *extack)
 {
 	struct Qdisc *old_child = NULL, *child = NULL;
 	struct red_sched_data *q = qdisc_priv(sch);
-	struct nlattr *tb[TCA_RED_MAX + 1];
 	struct nla_bitfield32 flags_bf;
 	struct tc_red_qopt *ctl;
 	unsigned char userbits;
@@ -228,14 +227,6 @@ static int red_change(struct Qdisc *sch, struct nlattr *opt,
 	int err;
 	u32 max_P;
 
-	if (opt == NULL)
-		return -EINVAL;
-
-	err = nla_parse_nested_deprecated(tb, TCA_RED_MAX, opt, red_policy,
-					  NULL);
-	if (err < 0)
-		return err;
-
 	if (tb[TCA_RED_PARMS] == NULL ||
 	    tb[TCA_RED_STAB] == NULL)
 		return -EINVAL;
@@ -323,11 +314,38 @@ static int red_init(struct Qdisc *sch, struct nlattr *opt,
 		    struct netlink_ext_ack *extack)
 {
 	struct red_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_RED_MAX + 1];
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+
+	err = nla_parse_nested_deprecated(tb, TCA_RED_MAX, opt, red_policy,
+					  extack);
+	if (err < 0)
+		return err;
 
 	q->qdisc = &noop_qdisc;
 	q->sch = sch;
 	timer_setup(&q->adapt_timer, red_adaptative_timer, 0);
-	return red_change(sch, opt, extack);
+	return __red_change(sch, tb, extack);
+}
+
+static int red_change(struct Qdisc *sch, struct nlattr *opt,
+		      struct netlink_ext_ack *extack)
+{
+	struct nlattr *tb[TCA_RED_MAX + 1];
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+
+	err = nla_parse_nested_deprecated(tb, TCA_RED_MAX, opt, red_policy,
+					  extack);
+	if (err < 0)
+		return err;
+
+	return __red_change(sch, tb, extack);
 }
 
 static int red_dump_offload_stats(struct Qdisc *sch)
-- 
2.20.1



* [PATCH net-next v1 4/5] net: sched: sch_red: Add qevents "early_drop" and "mark"
From: Petr Machata @ 2020-06-26 22:45 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Jakub Kicinski, Cong Wang, Eric Dumazet, jiri,
	idosch, Petr Machata

In order to allow acting on dropped and/or ECN-marked packets, add two new
qevents to the RED qdisc: "early_drop" and "mark". Filters attached at the
"early_drop" block are executed as packets are early-dropped; those
attached at the "mark" block are executed as packets are ECN-marked.

Two new attributes are introduced: TCA_RED_EARLY_DROP_BLOCK with the block
index for the "early_drop" qevent, and TCA_RED_MARK_BLOCK for the "mark"
qevent. Absence of these attributes signifies "don't care": no block is
allocated in that case, or, in the case of the change callback, the
existing blocks are left intact.

For purposes of offloading, blocks attached to these qevents appear with
newly-introduced binder types, FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP and
FLOW_BLOCK_BINDER_TYPE_RED_MARK.
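
Offload of these blocks is out of scope for this set, but to illustrate the
direction, a driver's block-bind callback might eventually distinguish them
roughly as in the hypothetical sketch below. Only the two binder types come
from this patch; struct foo_port, foo_setup_qevent_block(),
foo_setup_clsact_block() and the FOO_TRIGGER_* values are made up for
illustration:

static int foo_setup_block(struct foo_port *port,
			   struct flow_block_offload *f)
{
	switch (f->binder_type) {
	case FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP:
		/* Filters on this block trigger on early-dropped packets. */
		return foo_setup_qevent_block(port, f, FOO_TRIGGER_EARLY_DROP);
	case FLOW_BLOCK_BINDER_TYPE_RED_MARK:
		/* Filters on this block trigger on ECN-marked packets. */
		return foo_setup_qevent_block(port, f, FOO_TRIGGER_ECN_MARK);
	case FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS:
	case FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS:
		return foo_setup_clsact_block(port, f);
	default:
		return -EOPNOTSUPP;
	}
}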

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 include/net/flow_offload.h     |  2 ++
 include/uapi/linux/pkt_sched.h |  2 ++
 net/sched/sch_red.c            | 58 ++++++++++++++++++++++++++++++++--
 3 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index f2c8311a0433..d63a6a164bf4 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -419,6 +419,8 @@ enum flow_block_binder_type {
 	FLOW_BLOCK_BINDER_TYPE_UNSPEC,
 	FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS,
 	FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS,
+	FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP,
+	FLOW_BLOCK_BINDER_TYPE_RED_MARK,
 };
 
 struct flow_block {
diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index a95f3ae7ab37..9e7c2c607845 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -257,6 +257,8 @@ enum {
 	TCA_RED_STAB,
 	TCA_RED_MAX_P,
 	TCA_RED_FLAGS,		/* bitfield32 */
+	TCA_RED_EARLY_DROP_BLOCK, /* u32 */
+	TCA_RED_MARK_BLOCK,	/* u32 */
 	__TCA_RED_MAX,
 };
 
diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c
index 225ce370e5a8..de2be4d04ed6 100644
--- a/net/sched/sch_red.c
+++ b/net/sched/sch_red.c
@@ -46,6 +46,8 @@ struct red_sched_data {
 	struct red_vars		vars;
 	struct red_stats	stats;
 	struct Qdisc		*qdisc;
+	struct tcf_qevent	qe_early_drop;
+	struct tcf_qevent	qe_mark;
 };
 
 #define TC_RED_SUPPORTED_FLAGS (TC_RED_HISTORIC_FLAGS | TC_RED_NODROP)
@@ -92,6 +94,9 @@ static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_
 
 		if (INET_ECN_set_ce(skb)) {
 			q->stats.prob_mark++;
+			skb = tcf_qevent_handle(&q->qe_mark, sch, skb, root_lock, to_free, &ret);
+			if (!skb)
+				return NET_XMIT_CN | ret;
 		} else if (!red_use_nodrop(q)) {
 			q->stats.prob_drop++;
 			goto congestion_drop;
@@ -109,6 +114,9 @@ static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_
 
 		if (INET_ECN_set_ce(skb)) {
 			q->stats.forced_mark++;
+			skb = tcf_qevent_handle(&q->qe_mark, sch, skb, root_lock, to_free, &ret);
+			if (!skb)
+				return NET_XMIT_CN | ret;
 		} else if (!red_use_nodrop(q)) {
 			q->stats.forced_drop++;
 			goto congestion_drop;
@@ -129,6 +137,10 @@ static int red_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_
 	return ret;
 
 congestion_drop:
+	skb = tcf_qevent_handle(&q->qe_early_drop, sch, skb, root_lock, to_free, &ret);
+	if (!skb)
+		return NET_XMIT_CN | ret;
+
 	qdisc_drop(skb, sch, to_free);
 	return NET_XMIT_CN;
 }
@@ -202,6 +214,8 @@ static void red_destroy(struct Qdisc *sch)
 {
 	struct red_sched_data *q = qdisc_priv(sch);
 
+	tcf_qevent_destroy(&q->qe_mark, sch);
+	tcf_qevent_destroy(&q->qe_early_drop, sch);
 	del_timer_sync(&q->adapt_timer);
 	red_offload(sch, false);
 	qdisc_put(q->qdisc);
@@ -213,6 +227,8 @@ static const struct nla_policy red_policy[TCA_RED_MAX + 1] = {
 	[TCA_RED_STAB]	= { .len = RED_STAB_SIZE },
 	[TCA_RED_MAX_P] = { .type = NLA_U32 },
 	[TCA_RED_FLAGS] = NLA_POLICY_BITFIELD32(TC_RED_SUPPORTED_FLAGS),
+	[TCA_RED_EARLY_DROP_BLOCK] = { .type = NLA_U32 },
+	[TCA_RED_MARK_BLOCK] = { .type = NLA_U32 },
 };
 
 static int __red_change(struct Qdisc *sch, struct nlattr **tb,
@@ -328,12 +344,38 @@ static int red_init(struct Qdisc *sch, struct nlattr *opt,
 	q->qdisc = &noop_qdisc;
 	q->sch = sch;
 	timer_setup(&q->adapt_timer, red_adaptative_timer, 0);
-	return __red_change(sch, tb, extack);
+
+	err = __red_change(sch, tb, extack);
+	if (err)
+		return err;
+
+	err = tcf_qevent_init(&q->qe_early_drop, sch,
+			      FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP,
+			      tb[TCA_RED_EARLY_DROP_BLOCK], extack);
+	if (err)
+		goto err_early_drop_init;
+
+	err = tcf_qevent_init(&q->qe_mark, sch,
+			      FLOW_BLOCK_BINDER_TYPE_RED_MARK,
+			      tb[TCA_RED_MARK_BLOCK], extack);
+	if (err)
+		goto err_mark_init;
+
+	return 0;
+
+err_mark_init:
+	tcf_qevent_destroy(&q->qe_early_drop, sch);
+err_early_drop_init:
+	del_timer_sync(&q->adapt_timer);
+	red_offload(sch, false);
+	qdisc_put(q->qdisc);
+	return err;
 }
 
 static int red_change(struct Qdisc *sch, struct nlattr *opt,
 		      struct netlink_ext_ack *extack)
 {
+	struct red_sched_data *q = qdisc_priv(sch);
 	struct nlattr *tb[TCA_RED_MAX + 1];
 	int err;
 
@@ -345,6 +387,16 @@ static int red_change(struct Qdisc *sch, struct nlattr *opt,
 	if (err < 0)
 		return err;
 
+	err = tcf_qevent_validate_change(&q->qe_early_drop,
+					 tb[TCA_RED_EARLY_DROP_BLOCK], extack);
+	if (err)
+		return err;
+
+	err = tcf_qevent_validate_change(&q->qe_mark,
+					 tb[TCA_RED_MARK_BLOCK], extack);
+	if (err)
+		return err;
+
 	return __red_change(sch, tb, extack);
 }
 
@@ -389,7 +441,9 @@ static int red_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (nla_put(skb, TCA_RED_PARMS, sizeof(opt), &opt) ||
 	    nla_put_u32(skb, TCA_RED_MAX_P, q->parms.max_P) ||
 	    nla_put_bitfield32(skb, TCA_RED_FLAGS,
-			       q->flags, TC_RED_SUPPORTED_FLAGS))
+			       q->flags, TC_RED_SUPPORTED_FLAGS) ||
+	    tcf_qevent_dump(skb, TCA_RED_MARK_BLOCK, &q->qe_mark) ||
+	    tcf_qevent_dump(skb, TCA_RED_EARLY_DROP_BLOCK, &q->qe_early_drop))
 		goto nla_put_failure;
 	return nla_nest_end(skb, opts);
 
-- 
2.20.1



* [PATCH net-next v1 5/5] selftests: forwarding: Add a RED test for SW datapath
From: Petr Machata @ 2020-06-26 22:45 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Jakub Kicinski, Cong Wang, Eric Dumazet, jiri,
	idosch, Petr Machata

This test is inspired by the mlxsw RED selftest. It is much simpler to set
up (also because there is no point in testing PRIO / RED encapsulation). It
tests the bare RED, ECN, and ECN+nodrop modes of operation. On top of that,
it tests the RED early_drop and mark qevents.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 .../selftests/net/forwarding/sch_red.sh       | 492 ++++++++++++++++++
 1 file changed, 492 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/sch_red.sh

diff --git a/tools/testing/selftests/net/forwarding/sch_red.sh b/tools/testing/selftests/net/forwarding/sch_red.sh
new file mode 100755
index 000000000000..e714bae473fb
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/sch_red.sh
@@ -0,0 +1,492 @@
+# SPDX-License-Identifier: GPL-2.0
+
+# This test sends one stream of traffic from H1 through a TBF shaper, to a RED
+# within TBF shaper on $swp3. The two shapers have the same configuration, and
+# thus the resulting stream should fill all available bandwidth on the latter
+# shaper. A second stream is sent from H2 also via $swp3, and used to inject
+# additional traffic. Since all available bandwidth is taken, this traffic has
+# to go to backlog.
+#
+# +--------------------------+                     +--------------------------+
+# | H1                       |                     | H2                       |
+# |     + $h1                |                     |     + $h2                |
+# |     | 192.0.2.1/28       |                     |     | 192.0.2.2/28       |
+# |     | TBF 10Mbps         |                     |     |                    |
+# +-----|--------------------+                     +-----|--------------------+
+#       |                                                |
+# +-----|------------------------------------------------|--------------------+
+# | SW  |                                                |                    |
+# |  +--|------------------------------------------------|----------------+   |
+# |  |  + $swp1                                          + $swp2          |   |
+# |  |                               BR                                   |   |
+# |  |                                                                    |   |
+# |  |                                + $swp3                             |   |
+# |  |                                | TBF 10Mbps / RED                  |   |
+# |  +--------------------------------|-----------------------------------+   |
+# |                                   |                                       |
+# +-----------------------------------|---------------------------------------+
+#                                     |
+#                               +-----|--------------------+
+#			        | H3  |                    |
+#			        |     + $h3                |
+#			        |       192.0.2.3/28       |
+#			        |                          |
+#			        +--------------------------+
+
+ALL_TESTS="
+	ping_ipv4
+	ecn_test
+	ecn_nodrop_test
+	red_test
+	red_qevent_test
+	ecn_qevent_test
+"
+
+NUM_NETIFS=6
+CHECK_TC="yes"
+source lib.sh
+
+BACKLOG=30000
+PKTSZ=1400
+
+h1_create()
+{
+	simple_if_init $h1 192.0.2.1/28
+	mtu_set $h1 10000
+	tc qdisc replace dev $h1 root handle 1: tbf \
+	   rate 10Mbit burst 10K limit 1M
+}
+
+h1_destroy()
+{
+	tc qdisc del dev $h1 root
+	mtu_restore $h1
+	simple_if_fini $h1 192.0.2.1/28
+}
+
+h2_create()
+{
+	simple_if_init $h2 192.0.2.2/28
+	mtu_set $h2 10000
+}
+
+h2_destroy()
+{
+	mtu_restore $h2
+	simple_if_fini $h2 192.0.2.2/28
+}
+
+h3_create()
+{
+	simple_if_init $h3 192.0.2.3/28
+	mtu_set $h3 10000
+}
+
+h3_destroy()
+{
+	mtu_restore $h3
+	simple_if_fini $h3 192.0.2.3/28
+}
+
+switch_create()
+{
+	ip link add dev br up type bridge
+	ip link set dev $swp1 up master br
+	ip link set dev $swp2 up master br
+	ip link set dev $swp3 up master br
+
+	mtu_set $swp1 10000
+	mtu_set $swp2 10000
+	mtu_set $swp3 10000
+
+	tc qdisc replace dev $swp3 root handle 1: tbf \
+	   rate 10Mbit burst 10K limit 1M
+	ip link add name _drop_test up type dummy
+}
+
+switch_destroy()
+{
+	ip link del dev _drop_test
+	tc qdisc del dev $swp3 root
+
+	mtu_restore $swp3
+	mtu_restore $swp2
+	mtu_restore $swp1
+
+	ip link set dev $swp3 down nomaster
+	ip link set dev $swp2 down nomaster
+	ip link set dev $swp1 down nomaster
+	ip link del dev br
+}
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	h2=${NETIFS[p3]}
+	swp2=${NETIFS[p4]}
+
+	swp3=${NETIFS[p5]}
+	h3=${NETIFS[p6]}
+
+	h3_mac=$(mac_get $h3)
+
+	vrf_prepare
+
+	h1_create
+	h2_create
+	h3_create
+	switch_create
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	switch_destroy
+	h3_destroy
+	h2_destroy
+	h1_destroy
+
+	vrf_cleanup
+}
+
+ping_ipv4()
+{
+	ping_test $h1 192.0.2.3 " from host 1"
+	ping_test $h2 192.0.2.3 " from host 2"
+}
+
+get_qdisc_backlog()
+{
+	qdisc_stats_get $swp3 11: .backlog
+}
+
+get_nmarked()
+{
+	qdisc_stats_get $swp3 11: .marked
+}
+
+get_qdisc_npackets()
+{
+	qdisc_stats_get $swp3 11: .packets
+}
+
+get_nmirrored()
+{
+	link_stats_get _drop_test tx packets
+}
+
+send_packets()
+{
+	local proto=$1; shift
+	local pkts=$1; shift
+
+	$MZ $h2 -p $PKTSZ -a own -b $h3_mac -A 192.0.2.2 -B 192.0.2.3 -t $proto -q -c $pkts "$@"
+}
+
+# This sends traffic in an attempt to build a backlog of $size. Returns 0 on
+# success. After 10 failed attempts it bails out and returns 1. It dumps the
+# backlog size to stdout.
+build_backlog()
+{
+	local size=$1; shift
+	local proto=$1; shift
+
+	local i=0
+
+	while :; do
+		local cur=$(get_qdisc_backlog)
+		local diff=$((size - cur))
+		local pkts=$(((diff + PKTSZ - 1) / PKTSZ))
+
+		if ((cur >= size)); then
+			echo $cur
+			return 0
+		elif ((i++ > 10)); then
+			echo $cur
+			return 1
+		fi
+
+		send_packets $proto $pkts "$@"
+		sleep 1
+	done
+}
+
+check_marking()
+{
+	local cond=$1; shift
+
+	local npackets_0=$(get_qdisc_npackets)
+	local nmarked_0=$(get_nmarked)
+	sleep 5
+	local npackets_1=$(get_qdisc_npackets)
+	local nmarked_1=$(get_nmarked)
+
+	local nmarked_d=$((nmarked_1 - nmarked_0))
+	local npackets_d=$((npackets_1 - npackets_0))
+	local pct=$((100 * nmarked_d / npackets_d))
+
+	echo $pct
+	((pct $cond))
+}
+
+check_mirroring()
+{
+	local cond=$1; shift
+
+	local npackets_0=$(get_qdisc_npackets)
+	local nmirrored_0=$(get_nmirrored)
+	sleep 5
+	local npackets_1=$(get_qdisc_npackets)
+	local nmirrored_1=$(get_nmirrored)
+
+	local nmirrored_d=$((nmirrored_1 - nmirrored_0))
+	local npackets_d=$((npackets_1 - npackets_0))
+	local pct=$((100 * nmirrored_d / npackets_d))
+
+	echo $pct
+	((pct $cond))
+}
+
+ecn_test_common()
+{
+	local name=$1; shift
+	local limit=$1; shift
+	local backlog
+	local pct
+
+	# Build the below-the-limit backlog using UDP. We could use TCP just
+	# fine, but this way we get a proof that UDP is accepted when queue
+	# length is below the limit. The main stream is using TCP, and if the
+	# limit is misconfigured, we would see this traffic being ECN marked.
+	RET=0
+	backlog=$(build_backlog $((2 * limit / 3)) udp)
+	check_err $? "Could not build the requested backlog"
+	pct=$(check_marking "== 0")
+	check_err $? "backlog $backlog / $limit Got $pct% marked packets, expected == 0."
+	log_test "$name backlog < limit"
+
+	# Now push TCP, because non-TCP traffic would be early-dropped after the
+	# backlog crosses the limit, and we want to make sure that the backlog
+	# is above the limit.
+	RET=0
+	backlog=$(build_backlog $((3 * limit / 2)) tcp tos=0x01)
+	check_err $? "Could not build the requested backlog"
+	pct=$(check_marking ">= 95")
+	check_err $? "backlog $backlog / $limit Got $pct% marked packets, expected >= 95."
+	log_test "$name backlog > limit"
+}
+
+do_ecn_test()
+{
+	local limit=$1; shift
+	local name=ECN
+
+	$MZ $h1 -p $PKTSZ -A 192.0.2.1 -B 192.0.2.3 -c 0 \
+		-a own -b $h3_mac -t tcp -q tos=0x01 &
+	sleep 1
+
+	ecn_test_common "$name" $limit
+
+	# Above we saw that UDP gets accepted when the backlog is below the
+	# limit. Now that it is above, it should all get dropped, and backlog
+	# building should fail.
+	RET=0
+	build_backlog $((2 * limit)) udp >/dev/null
+	check_fail $? "UDP traffic went into backlog instead of being early-dropped"
+	log_test "$name backlog > limit: UDP early-dropped"
+
+	stop_traffic
+	sleep 1
+}
+
+do_ecn_nodrop_test()
+{
+	local limit=$1; shift
+	local name="ECN nodrop"
+
+	$MZ $h1 -p $PKTSZ -A 192.0.2.1 -B 192.0.2.3 -c 0 \
+		-a own -b $h3_mac -t tcp -q tos=0x01 &
+	sleep 1
+
+	ecn_test_common "$name" $limit
+
+	# Above we saw that UDP gets accepted when the backlog is below the
+	# limit. Now that it is above, in nodrop mode, make sure it goes to
+	# backlog as well.
+	RET=0
+	build_backlog $((2 * limit)) udp >/dev/null
+	check_err $? "UDP traffic was early-dropped instead of getting into backlog"
+	log_test "$name backlog > limit: UDP not dropped"
+
+	stop_traffic
+	sleep 1
+}
+
+do_red_test()
+{
+	local limit=$1; shift
+	local backlog
+	local pct
+
+	# Use ECN-capable TCP to verify there's no marking even though the queue
+	# is above the limit.
+	$MZ $h1 -p $PKTSZ -A 192.0.2.1 -B 192.0.2.3 -c 0 \
+		-a own -b $h3_mac -t tcp -q tos=0x01 &
+
+	# Pushing below the queue limit should work.
+	RET=0
+	backlog=$(build_backlog $((2 * limit / 3)) tcp tos=0x01)
+	check_err $? "Could not build the requested backlog"
+	pct=$(check_marking "== 0")
+	check_err $? "backlog $backlog / $limit Got $pct% marked packets, expected == 0."
+	log_test "RED backlog < limit"
+
+	# Pushing above should not.
+	RET=0
+	backlog=$(build_backlog $((3 * limit / 2)) tcp tos=0x01)
+	check_fail $? "Traffic went into backlog instead of being early-dropped"
+	pct=$(check_marking "== 0")
+	check_err $? "backlog $backlog / $limit Got $pct% marked packets, expected == 0."
+	log_test "RED backlog > limit"
+
+	stop_traffic
+	sleep 1
+}
+
+do_red_qevent_test()
+{
+	local limit=$1; shift
+	local backlog
+	local base
+	local now
+	local pct
+
+	RET=0
+
+	$MZ $h1 -p $PKTSZ -A 192.0.2.1 -B 192.0.2.3 -c 0 \
+		-a own -b $h3_mac -t udp -q &
+	sleep 1
+
+	tc filter add block 10 pref 1234 handle 102 matchall skip_hw \
+	   action mirred egress mirror dev _drop_test
+
+	# Push to the queue until it's at the limit. The configured limit is
+	# rounded by the qdisc, so this is the best we can do to get to the real
+	# limit.
+	build_backlog $((3 * limit / 2)) udp >/dev/null
+
+	base=$(get_nmirrored)
+	send_packets udp 100
+	sleep 1
+	now=$(get_nmirrored)
+	((now >= base + 100))
+	check_err $? "Dropped packets not observed: 100 expected, $((now - base)) seen"
+
+	tc filter del block 10 pref 1234 handle 102 matchall
+
+	base=$(get_nmirrored)
+	send_packets udp 100
+	sleep 1
+	now=$(get_nmirrored)
+	((now == base))
+	check_err $? "Dropped packets still observed: 0 expected, $((now - base)) seen"
+
+	log_test "RED early_dropped packets mirrored"
+
+	stop_traffic
+	sleep 1
+}
+
+do_ecn_qevent_test()
+{
+	local limit=$1; shift
+	local name=ECN
+
+	RET=0
+
+	$MZ $h1 -p $PKTSZ -A 192.0.2.1 -B 192.0.2.3 -c 0 \
+		-a own -b $h3_mac -t tcp -q tos=0x01 &
+	sleep 1
+
+	tc filter add block 10 pref 1234 handle 102 matchall skip_hw \
+	   action mirred egress mirror dev _drop_test
+
+	backlog=$(build_backlog $((2 * limit / 3)) tcp tos=0x01)
+	check_err $? "Could not build the requested backlog"
+	pct=$(check_mirroring "== 0")
+	check_err $? "backlog $backlog / $limit Got $pct% mirrored packets, expected == 0."
+
+	backlog=$(build_backlog $((3 * limit / 2)) tcp tos=0x01)
+	check_err $? "Could not build the requested backlog"
+	pct=$(check_mirroring ">= 95")
+	check_err $? "backlog $backlog / $limit Got $pct% mirrored packets, expected >= 95."
+
+	tc filter del block 10 pref 1234 handle 102 matchall
+
+	log_test "ECN marked packets mirrored"
+
+	stop_traffic
+	sleep 1
+}
+
+install_qdisc()
+{
+	local -a args=("$@")
+
+	tc qdisc replace dev $swp3 parent 1:1 handle 11: red \
+	   limit 1M avpkt $PKTSZ probability 1 \
+	   min $BACKLOG max $((BACKLOG + 1)) burst 38 "${args[@]}"
+	sleep 1
+}
+
+uninstall_qdisc()
+{
+	tc qdisc del dev $swp3 parent 1:1
+}
+
+ecn_test()
+{
+	install_qdisc ecn
+	do_ecn_test $BACKLOG
+	uninstall_qdisc
+}
+
+ecn_nodrop_test()
+{
+	install_qdisc ecn nodrop
+	do_ecn_nodrop_test $BACKLOG
+	uninstall_qdisc
+}
+
+red_test()
+{
+	install_qdisc
+	do_red_test $BACKLOG
+	uninstall_qdisc
+}
+
+red_qevent_test()
+{
+	install_qdisc qevent early_drop block 10
+	do_red_qevent_test $BACKLOG
+	uninstall_qdisc
+}
+
+ecn_qevent_test()
+{
+	install_qdisc ecn qevent mark block 10
+	do_ecn_qevent_test $BACKLOG
+	uninstall_qdisc
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tests_run
+
+exit $EXIT_STATUS
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH iproute2-next v1 1/4] uapi: pkt_sched: Add two new RED attributes
  2020-06-26 22:45 [PATCH net-next v1 0/5] TC: Introduce qevents Petr Machata
                   ` (4 preceding siblings ...)
  2020-06-26 22:45 ` [PATCH net-next v1 5/5] selftests: forwarding: Add a RED test for SW datapath Petr Machata
@ 2020-06-26 22:45 ` Petr Machata
  2020-06-26 22:45   ` [PATCH iproute2-next v1 2/4] tc: Add helpers to support qevent handling Petr Machata
                     ` (2 more replies)
  2020-06-26 22:56 ` [PATCH net-next v1 0/5] TC: Introduce qevents Stephen Hemminger
  2020-06-30  0:15 ` David Miller
  7 siblings, 3 replies; 31+ messages in thread
From: Petr Machata @ 2020-06-26 22:45 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Jakub Kicinski, Cong Wang, Eric Dumazet, jiri,
	idosch, Petr Machata

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 include/uapi/linux/pkt_sched.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index a95f3ae7..9e7c2c60 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -257,6 +257,8 @@ enum {
 	TCA_RED_STAB,
 	TCA_RED_MAX_P,
 	TCA_RED_FLAGS,		/* bitfield32 */
+	TCA_RED_EARLY_DROP_BLOCK, /* u32 */
+	TCA_RED_MARK_BLOCK,	/* u32 */
 	__TCA_RED_MAX,
 };
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH iproute2-next v1 2/4] tc: Add helpers to support qevent handling
  2020-06-26 22:45 ` [PATCH iproute2-next v1 1/4] uapi: pkt_sched: Add two new RED attributes Petr Machata
@ 2020-06-26 22:45   ` Petr Machata
  2020-06-26 22:45   ` [PATCH iproute2-next v1 3/4] man: tc: Describe qevents Petr Machata
  2020-06-26 22:45   ` [PATCH iproute2-next v1 4/4] tc: q_red: Add support for qevents "mark" and "early_drop" Petr Machata
  2 siblings, 0 replies; 31+ messages in thread
From: Petr Machata @ 2020-06-26 22:45 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Jakub Kicinski, Cong Wang, Eric Dumazet, jiri,
	idosch, Petr Machata

Introduce a set of helpers to make it easy to add support for qevents to a
qdisc.

The idea behind this is that qevent types will be generally reused between
qdiscs, rather than each having a completely idiosyncratic set of qevents.
The qevent module holds functions for parsing, dumping and formatting of
these common qevent types, and for dispatch to the appropriate set of
handlers based on the qevent name.
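
For illustration, with a qdisc that defines a "mark" qevent, the helpers
accept and dump syntax along these lines (the device name, handle and block
number below are arbitrary, and the show output is abridged):

	# tc qdisc add dev eth0 root handle 1: red limit 500K avpkt 1K ecn \
		qevent mark block 20
	# tc qdisc show dev eth0
	qdisc red 1: root ... ecn qevent mark block 20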

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 tc/Makefile    |   1 +
 tc/tc_qevent.c | 202 +++++++++++++++++++++++++++++++++++++++++++++++++
 tc/tc_qevent.h |  49 ++++++++++++
 3 files changed, 252 insertions(+)
 create mode 100644 tc/tc_qevent.c
 create mode 100644 tc/tc_qevent.h

diff --git a/tc/Makefile b/tc/Makefile
index 79c9c1dd..5a517af2 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -122,6 +122,7 @@ TCLIB += tc_red.o
 TCLIB += tc_cbq.o
 TCLIB += tc_estimator.o
 TCLIB += tc_stab.o
+TCLIB += tc_qevent.o
 
 CFLAGS += -DCONFIG_GACT -DCONFIG_GACT_PROB
 ifneq ($(IPT_LIB_DIR),)
diff --git a/tc/tc_qevent.c b/tc/tc_qevent.c
new file mode 100644
index 00000000..1f8e6506
--- /dev/null
+++ b/tc/tc_qevent.c
@@ -0,0 +1,202 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+
+/*
+ * Helpers for handling qevents.
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include "tc_qevent.h"
+#include "utils.h"
+
+void qevents_init(struct qevent_util *qevents)
+{
+	if (!qevents)
+		return;
+
+	for (; qevents->id; qevents++)
+		memset(qevents->data, 0, qevents->data_size);
+}
+
+int qevent_parse(struct qevent_util *qevents, int *p_argc, char ***p_argv)
+{
+	char **argv = *p_argv;
+	int argc = *p_argc;
+	const char *name = *argv;
+	int err;
+
+	if (!qevents)
+		goto out;
+
+	for (; qevents->id; qevents++) {
+		if (strcmp(name, qevents->id) == 0) {
+			NEXT_ARG();
+			err = qevents->parse_qevent(qevents, &argc, &argv);
+			if (err)
+				return err;
+
+			*p_argc = argc;
+			*p_argv = argv;
+			return 0;
+		}
+	}
+
+out:
+	fprintf(stderr, "Unknown qevent `%s'\n", name);
+	return -1;
+}
+
+int qevents_read(struct qevent_util *qevents, struct rtattr **tb)
+{
+	int err;
+
+	if (!qevents)
+		return 0;
+
+	for (; qevents->id; qevents++) {
+		if (tb[qevents->attr]) {
+			err = qevents->read_qevent(qevents, tb);
+			if (err)
+				return err;
+		}
+	}
+
+	return 0;
+}
+
+void qevents_print(struct qevent_util *qevents, FILE *f)
+{
+	int first = true;
+
+	if (!qevents)
+		return;
+
+	for (; qevents->id; qevents++) {
+		struct qevent_base *qeb = qevents->data;
+
+		if (qeb->block_idx) {
+			if (first) {
+				open_json_array(PRINT_JSON, "qevents");
+				first = false;
+			}
+
+			open_json_object(NULL);
+			print_string(PRINT_ANY, "kind", " qevent %s", qevents->id);
+			qevents->print_qevent(qevents, f);
+			close_json_object();
+		}
+	}
+
+	if (!first)
+		close_json_array(PRINT_ANY, "");
+}
+
+int qevents_dump(struct qevent_util *qevents, struct nlmsghdr *n)
+{
+	int err;
+
+	if (!qevents)
+		return 0;
+
+	for (; qevents->id; qevents++) {
+		struct qevent_base *qeb = qevents->data;
+
+		if (qeb->block_idx) {
+			err = qevents->dump_qevent(qevents, n);
+			if (err)
+				return err;
+		}
+	}
+
+	return 0;
+}
+
+static int parse_block_idx(const char *arg, struct qevent_base *qeb)
+{
+	if (qeb->block_idx) {
+		fprintf(stderr, "Qevent block index already specified\n");
+		return -1;
+	}
+
+	if (get_unsigned(&qeb->block_idx, arg, 10) || !qeb->block_idx) {
+		fprintf(stderr, "Illegal qevent block index\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+static int read_block_idx(struct rtattr *attr, struct qevent_base *qeb)
+{
+	if (qeb->block_idx) {
+		fprintf(stderr, "Qevent block index already specified\n");
+		return -1;
+	}
+
+	qeb->block_idx = rta_getattr_u32(attr);
+	if (!qeb->block_idx) {
+		fprintf(stderr, "Illegal qevent block index\n");
+		return -1;
+	}
+
+	return 0;
+}
+
+static void print_block_idx(FILE *f, __u32 block_idx)
+{
+	print_uint(PRINT_ANY, "block", " block %u", block_idx);
+}
+
+int qevent_parse_plain(struct qevent_util *qu, int *p_argc, char ***p_argv)
+{
+	struct qevent_plain *qe = qu->data;
+	char **argv = *p_argv;
+	int argc = *p_argc;
+
+	if (qe->base.block_idx) {
+		fprintf(stderr, "Duplicate qevent\n");
+		return -1;
+	}
+
+	while (argc > 0) {
+		if (strcmp(*argv, "block") == 0) {
+			NEXT_ARG();
+			if (parse_block_idx(*argv, &qe->base))
+				return -1;
+		} else {
+			break;
+		}
+		NEXT_ARG_FWD();
+	}
+
+	if (!qe->base.block_idx) {
+		fprintf(stderr, "Unspecified qevent block index\n");
+		return -1;
+	}
+
+	*p_argc = argc;
+	*p_argv = argv;
+	return 0;
+}
+
+int qevent_read_plain(struct qevent_util *qu, struct rtattr **tb)
+{
+	struct qevent_plain *qe = qu->data;
+
+	return read_block_idx(tb[qu->attr], &qe->base);
+}
+
+void qevent_print_plain(struct qevent_util *qu, FILE *f)
+{
+	struct qevent_plain *qe = qu->data;
+
+	print_block_idx(f, qe->base.block_idx);
+}
+
+int qevent_dump_plain(struct qevent_util *qu, struct nlmsghdr *n)
+{
+	struct qevent_plain *qe = qu->data;
+
+	return addattr32(n, 1024, qu->attr, qe->base.block_idx);
+}
diff --git a/tc/tc_qevent.h b/tc/tc_qevent.h
new file mode 100644
index 00000000..574e7cff
--- /dev/null
+++ b/tc/tc_qevent.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _TC_QEVENT_H_
+#define _TC_QEVENT_H_
+
+#include <linux/types.h>
+#include <libnetlink.h>
+
+struct qevent_base {
+	__u32 block_idx;
+};
+
+struct qevent_util {
+	const char *id;
+	int (*parse_qevent)(struct qevent_util *qu, int *argc, char ***argv);
+	int (*read_qevent)(struct qevent_util *qu, struct rtattr **tb);
+	void (*print_qevent)(struct qevent_util *qu, FILE *f);
+	int (*dump_qevent)(struct qevent_util *qu, struct nlmsghdr *n);
+	size_t data_size;
+	void *data;
+	int attr;
+};
+
+#define QEVENT(_name, _form, _data, _attr)				\
+	{								\
+		.id = _name,						\
+		.parse_qevent = qevent_parse_##_form,			\
+		.read_qevent = qevent_read_##_form,			\
+		.print_qevent = qevent_print_##_form,			\
+		.dump_qevent = qevent_dump_##_form,			\
+		.data_size = sizeof(struct qevent_##_form),		\
+		.data = _data,						\
+		.attr = _attr,						\
+	}
+
+void qevents_init(struct qevent_util *qevents);
+int qevent_parse(struct qevent_util *qevents, int *p_argc, char ***p_argv);
+int qevents_read(struct qevent_util *qevents, struct rtattr **tb);
+int qevents_dump(struct qevent_util *qevents, struct nlmsghdr *n);
+void qevents_print(struct qevent_util *qevents, FILE *f);
+
+struct qevent_plain {
+	struct qevent_base base;
+};
+int qevent_parse_plain(struct qevent_util *qu, int *p_argc, char ***p_argv);
+int qevent_read_plain(struct qevent_util *qu, struct rtattr **tb);
+void qevent_print_plain(struct qevent_util *qu, FILE *f);
+int qevent_dump_plain(struct qevent_util *qu, struct nlmsghdr *n);
+
+#endif
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH iproute2-next v1 3/4] man: tc: Describe qevents
  2020-06-26 22:45 ` [PATCH iproute2-next v1 1/4] uapi: pkt_sched: Add two new RED attributes Petr Machata
  2020-06-26 22:45   ` [PATCH iproute2-next v1 2/4] tc: Add helpers to support qevent handling Petr Machata
@ 2020-06-26 22:45   ` Petr Machata
  2020-06-26 22:45   ` [PATCH iproute2-next v1 4/4] tc: q_red: Add support for qevents "mark" and "early_drop" Petr Machata
  2 siblings, 0 replies; 31+ messages in thread
From: Petr Machata @ 2020-06-26 22:45 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Jakub Kicinski, Cong Wang, Eric Dumazet, jiri,
	idosch, Petr Machata

Add some general remarks about qevents.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 man/man8/tc.8 | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/man/man8/tc.8 b/man/man8/tc.8
index e8e0cd0f..970440f6 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -254,6 +254,25 @@ Traffic control filter that matches every packet. See
 .BR tc-matchall (8)
 for details.
 
+.SH QEVENTS
+Qdiscs may invoke user-configured actions when certain interesting events
+take place in the qdisc. Each qevent can either be unused, or can have a
+block attached to it. Filters are then attached to this block using the
+"tc filter add block BLOCK_IDX" syntax. The block is executed when the
+qevent associated with the attachment point takes place, for example when
+a packet is dropped or delayed, depending on the qdisc and the qevent in
+question.
+
+For example:
+.PP
+.RS
+tc qdisc add dev eth0 root handle 1: red limit 500K avpkt 1K \\
+   qevent early_drop block 10
+.RE
+.RS
+tc filter add block 10 matchall action mirred egress mirror dev eth1
+.RE
+
 .SH CLASSLESS QDISCS
 The classless qdiscs are:
 .TP
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH iproute2-next v1 4/4] tc: q_red: Add support for qevents "mark" and "early_drop"
  2020-06-26 22:45 ` [PATCH iproute2-next v1 1/4] uapi: pkt_sched: Add two new RED attributes Petr Machata
  2020-06-26 22:45   ` [PATCH iproute2-next v1 2/4] tc: Add helpers to support qevent handling Petr Machata
  2020-06-26 22:45   ` [PATCH iproute2-next v1 3/4] man: tc: Describe qevents Petr Machata
@ 2020-06-26 22:45   ` Petr Machata
  2 siblings, 0 replies; 31+ messages in thread
From: Petr Machata @ 2020-06-26 22:45 UTC (permalink / raw)
  To: netdev
  Cc: David Miller, Jakub Kicinski, Cong Wang, Eric Dumazet, jiri,
	idosch, Petr Machata

The "early_drop" qevent matches packets that have been early-dropped. The
"mark" qevent matches packets that have been ECN-marked.

Signed-off-by: Petr Machata <petrm@mellanox.com>
---
 man/man8/tc-red.8 | 18 +++++++++++++++++-
 tc/q_red.c        | 30 +++++++++++++++++++++++++++---
 2 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/man/man8/tc-red.8 b/man/man8/tc-red.8
index b5aaa986..662e4d8b 100644
--- a/man/man8/tc-red.8
+++ b/man/man8/tc-red.8
@@ -17,7 +17,11 @@ packets
 rate
 .B ] [ probability
 chance
-.B ] [ adaptive ]
+.B ] [ adaptive ] [ qevent early_drop block
+index
+.B ] [ qevent mark block
+index
+.B ]
 
 .SH DESCRIPTION
 Random Early Detection is a classless qdisc which manages its queue size
@@ -134,6 +138,18 @@ Goal of Adaptive RED is to make 'probability' dynamic value between 1% and 50% t
 .B (max - min) / 2
 .fi
 
+.SH QEVENTS
+See tc (8) for some general notes about qevents. The RED qdisc supports the
+following qevents:
+
+.TP
+early_drop
+The associated block is executed when packets are early-dropped. This includes
+non-ECT packets in ECN mode.
+.TP
+mark
+The associated block is executed when packets are marked in ECN mode.
+
 .SH EXAMPLE
 
 .P
diff --git a/tc/q_red.c b/tc/q_red.c
index 53181c82..97856f03 100644
--- a/tc/q_red.c
+++ b/tc/q_red.c
@@ -22,6 +22,7 @@
 
 #include "utils.h"
 #include "tc_util.h"
+#include "tc_qevent.h"
 
 #include "tc_red.h"
 
@@ -30,11 +31,20 @@ static void explain(void)
 	fprintf(stderr,
 		"Usage: ... red	limit BYTES [min BYTES] [max BYTES] avpkt BYTES [burst PACKETS]\n"
 		"		[adaptive] [probability PROBABILITY] [bandwidth KBPS]\n"
-		"		[ecn] [harddrop] [nodrop]\n");
+		"		[ecn] [harddrop] [nodrop]\n"
+		"		[qevent early_drop block IDX] [qevent mark block IDX]\n");
 }
 
 #define RED_SUPPORTED_FLAGS (TC_RED_HISTORIC_FLAGS | TC_RED_NODROP)
 
+static struct qevent_plain qe_early_drop = {};
+static struct qevent_plain qe_mark = {};
+static struct qevent_util qevents[] = {
+	QEVENT("early_drop", plain, &qe_early_drop, TCA_RED_EARLY_DROP_BLOCK),
+	QEVENT("mark", plain, &qe_mark, TCA_RED_MARK_BLOCK),
+	{},
+};
+
 static int red_parse_opt(struct qdisc_util *qu, int argc, char **argv,
 			 struct nlmsghdr *n, const char *dev)
 {
@@ -51,6 +61,8 @@ static int red_parse_opt(struct qdisc_util *qu, int argc, char **argv,
 	__u32 max_P;
 	struct rtattr *tail;
 
+	qevents_init(qevents);
+
 	while (argc > 0) {
 		if (strcmp(*argv, "limit") == 0) {
 			NEXT_ARG();
@@ -109,6 +121,11 @@ static int red_parse_opt(struct qdisc_util *qu, int argc, char **argv,
 			flags_bf.value |= TC_RED_ADAPTATIVE;
 		} else if (strcmp(*argv, "adaptive") == 0) {
 			flags_bf.value |= TC_RED_ADAPTATIVE;
+		} else if (matches(*argv, "qevent") == 0) {
+			NEXT_ARG();
+			if (qevent_parse(qevents, &argc, &argv))
+				return -1;
+			continue;
 		} else if (strcmp(*argv, "help") == 0) {
 			explain();
 			return -1;
@@ -162,6 +179,8 @@ static int red_parse_opt(struct qdisc_util *qu, int argc, char **argv,
 	max_P = probability * pow(2, 32);
 	addattr_l(n, 1024, TCA_RED_MAX_P, &max_P, sizeof(max_P));
 	addattr_l(n, 1024, TCA_RED_FLAGS, &flags_bf, sizeof(flags_bf));
+	if (qevents_dump(qevents, n))
+		return -1;
 	addattr_nest_end(n, tail);
 	return 0;
 }
@@ -203,12 +222,12 @@ static int red_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	print_uint(PRINT_JSON, "min", NULL, qopt->qth_min);
 	print_string(PRINT_FP, NULL, "min %s ", sprint_size(qopt->qth_min, b2));
 	print_uint(PRINT_JSON, "max", NULL, qopt->qth_max);
-	print_string(PRINT_FP, NULL, "max %s ", sprint_size(qopt->qth_max, b3));
+	print_string(PRINT_FP, NULL, "max %s", sprint_size(qopt->qth_max, b3));
 
 	tc_red_print_flags(qopt->flags);
 
 	if (show_details) {
-		print_uint(PRINT_ANY, "ewma", "ewma %u ", qopt->Wlog);
+		print_uint(PRINT_ANY, "ewma", " ewma %u ", qopt->Wlog);
 		if (max_P)
 			print_float(PRINT_ANY, "probability",
 				    "probability %lg ", max_P / pow(2, 32));
@@ -217,6 +236,11 @@ static int red_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 		print_uint(PRINT_ANY, "Scell_log", "Scell_log %u",
 			   qopt->Scell_log);
 	}
+
+	qevents_init(qevents);
+	if (qevents_read(qevents, tb))
+		return -1;
+	qevents_print(qevents, f);
 	return 0;
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 0/5] TC: Introduce qevents
  2020-06-26 22:45 [PATCH net-next v1 0/5] TC: Introduce qevents Petr Machata
                   ` (5 preceding siblings ...)
  2020-06-26 22:45 ` [PATCH iproute2-next v1 1/4] uapi: pkt_sched: Add two new RED attributes Petr Machata
@ 2020-06-26 22:56 ` Stephen Hemminger
  2020-06-29 13:21   ` Petr Machata
  2020-06-30  0:15 ` David Miller
  7 siblings, 1 reply; 31+ messages in thread
From: Stephen Hemminger @ 2020-06-26 22:56 UTC (permalink / raw)
  To: Petr Machata
  Cc: netdev, David Miller, Jakub Kicinski, Cong Wang, Eric Dumazet,
	jiri, idosch

On Sat, 27 Jun 2020 01:45:24 +0300
Petr Machata <petrm@mellanox.com> wrote:

> The Spectrum hardware allows execution of one of several actions as a
> result of queue management decisions: tail-dropping, early-dropping,
> marking a packet, or passing a configured latency threshold or buffer
> size. Such packets can be mirrored, trapped, or sampled.
> 
> Modeling the action to be taken as simply a TC action is very attractive,
> but it is not obvious where to put these actions. At least with ECN marking
> one could imagine a tree of qdiscs and classifiers that effectively
> accomplishes this task, albeit in an impractically complex manner. But
> there is just no way to match on dropped-ness of a packet, let alone
> dropped-ness due to a particular reason.

Would a BPF based hook be more flexible and reuse more existing
infrastructure? 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 0/5] TC: Introduce qevents
  2020-06-26 22:56 ` [PATCH net-next v1 0/5] TC: Introduce qevents Stephen Hemminger
@ 2020-06-29 13:21   ` Petr Machata
  0 siblings, 0 replies; 31+ messages in thread
From: Petr Machata @ 2020-06-29 13:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, David Miller, Jakub Kicinski, Cong Wang, Eric Dumazet,
	jiri, idosch


Stephen Hemminger <stephen@networkplumber.org> writes:

> On Sat, 27 Jun 2020 01:45:24 +0300
> Petr Machata <petrm@mellanox.com> wrote:
>
>> The Spectrum hardware allows execution of one of several actions as a
>> result of queue management decisions: tail-dropping, early-dropping,
>> marking a packet, or passing a configured latency threshold or buffer
>> size. Such packets can be mirrored, trapped, or sampled.
>>
>> Modeling the action to be taken as simply a TC action is very attractive,
>> but it is not obvious where to put these actions. At least with ECN marking
>> one could imagine a tree of qdiscs and classifiers that effectively
>> accomplishes this task, albeit in an impractically complex manner. But
>> there is just no way to match on dropped-ness of a packet, let alone
>> dropped-ness due to a particular reason.
>
> Would a BPF based hook be more flexible and reuse more existing
> infrastructure?

This does reuse the existing infrastructure though: filters, actions,
shared blocks, qdiscs invoking blocks, none of that is new.

And BPF can still be invoked through classifier and / or action bpf. It
looks like you get the best of both worlds here: something symbolic for
those of us that use the filter infrastructure, and a well-defined hook
for those of us who like the BPF approach.
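
E.g. something along these lines ought to work (the BPF object and section
names are just an illustration):

	# tc filter add block 10 bpf direct-action obj qevent_prog.o sec cls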

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 0/5] TC: Introduce qevents
  2020-06-26 22:45 [PATCH net-next v1 0/5] TC: Introduce qevents Petr Machata
                   ` (6 preceding siblings ...)
  2020-06-26 22:56 ` [PATCH net-next v1 0/5] TC: Introduce qevents Stephen Hemminger
@ 2020-06-30  0:15 ` David Miller
  7 siblings, 0 replies; 31+ messages in thread
From: David Miller @ 2020-06-30  0:15 UTC (permalink / raw)
  To: petrm; +Cc: netdev, kuba, xiyou.wangcong, eric.dumazet, jiri, idosch

From: Petr Machata <petrm@mellanox.com>
Date: Sat, 27 Jun 2020 01:45:24 +0300

> The Spectrum hardware allows execution of one of several actions as a
> result of queue management decisions: tail-dropping, early-dropping,
> marking a packet, or passing a configured latency threshold or buffer
> size. Such packets can be mirrored, trapped, or sampled.
> 
> Modeling the action to be taken as simply a TC action is very attractive,
> but it is not obvious where to put these actions. At least with ECN marking
> one could imagine a tree of qdiscs and classifiers that effectively
> accomplishes this task, albeit in an impractically complex manner. But
> there is just no way to match on dropped-ness of a packet, let alone
> dropped-ness due to a particular reason.
> 
> To allow configuring user-defined actions as a result of inner workings of
> a qdisc, this patch set introduces a concept of qevents. Those are attach
> points for TC blocks, where filters can be put that are executed as the
> packet hits well-defined points in the qdisc algorithms. The attached
> blocks can be shared, in a manner similar to clsact ingress and egress
> blocks, arbitrary classifiers with arbitrary actions can be put on them,
> etc.
 ...

Series applied, thank you.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 1/5] net: sched: Pass root lock to Qdisc_ops.enqueue
  2020-06-26 22:45 ` [PATCH net-next v1 1/5] net: sched: Pass root lock to Qdisc_ops.enqueue Petr Machata
@ 2020-07-06 19:21   ` Cong Wang
  2020-07-07 15:25     ` Petr Machata
  0 siblings, 1 reply; 31+ messages in thread
From: Cong Wang @ 2020-07-06 19:21 UTC (permalink / raw)
  To: Petr Machata
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel

On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
>
> A following patch introduces qevents, points in qdisc algorithm where
> packet can be processed by user-defined filters. Should this processing
> lead to a situation where a new packet is to be enqueued on the same port,
> holding the root lock would lead to deadlocks. To solve the issue, qevent
> handler needs to unlock and relock the root lock when necessary.
>
> To that end, add the root lock argument to the qdisc op enqueue, and
> propagate throughout.

Hmm, but why do you pass root lock down to each ->enqueue()?

You can find root lock with sch_tree_lock() (or qdisc_lock() if you don't
care about hierarchy), and you already have qdisc as a parameter of
tcf_qevent_handle().

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-06-26 22:45 ` [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks Petr Machata
@ 2020-07-06 19:48   ` Cong Wang
  2020-07-07 15:22     ` Petr Machata
  2020-07-07 19:48   ` Cong Wang
  1 sibling, 1 reply; 31+ messages in thread
From: Cong Wang @ 2020-07-06 19:48 UTC (permalink / raw)
  To: Petr Machata
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel

On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
> The function tcf_qevent_handle() should be invoked when qdisc hits the
> "interesting event" corresponding to a block. This function releases root
> lock for the duration of executing the attached filters, to allow packets
> generated through user actions (notably mirred) to be reinserted to the
> same qdisc tree.

Are you sure releasing the root lock in the middle of an enqueue operation
is a good idea? I mean, it seems racy with qdisc change or reset path,
for example, __red_change() could update some RED parameters
immediately after you release the root lock.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-07-06 19:48   ` Cong Wang
@ 2020-07-07 15:22     ` Petr Machata
  2020-07-07 19:13       ` Cong Wang
  0 siblings, 1 reply; 31+ messages in thread
From: Petr Machata @ 2020-07-07 15:22 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel


Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
>> The function tcf_qevent_handle() should be invoked when qdisc hits the
>> "interesting event" corresponding to a block. This function releases root
>> lock for the duration of executing the attached filters, to allow packets
>> generated through user actions (notably mirred) to be reinserted to the
>> same qdisc tree.
>
> Are you sure releasing the root lock in the middle of an enqueue operation
> is a good idea? I mean, it seems racy with qdisc change or reset path,
> for example, __red_change() could update some RED parameters
> immediately after you release the root lock.

So I had mulled over this for a while. If packets are enqueued or
dequeued while the lock is released, maybe the packet under
consideration should have been hard_marked instead of prob_marked, or
vice versa. (I can't really go to not marked at all, because the fact
that we ran the qevent is a very observable commitment to put the packet
in the queue with marking.) I figured that is not such a big deal.

Regarding a configuration change, for a brief period after the change, a
few not-yet-pushed packets could have been enqueued with ECN marking
even as I e.g. disabled ECN. This does not seem like a big deal either,
these are transient effects.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 1/5] net: sched: Pass root lock to Qdisc_ops.enqueue
  2020-07-06 19:21   ` Cong Wang
@ 2020-07-07 15:25     ` Petr Machata
  2020-07-07 19:41       ` Cong Wang
  0 siblings, 1 reply; 31+ messages in thread
From: Petr Machata @ 2020-07-07 15:25 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel


Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
>>
>> A following patch introduces qevents, points in qdisc algorithm where
>> packet can be processed by user-defined filters. Should this processing
>> lead to a situation where a new packet is to be enqueued on the same port,
>> holding the root lock would lead to deadlocks. To solve the issue, qevent
>> handler needs to unlock and relock the root lock when necessary.
>>
>> To that end, add the root lock argument to the qdisc op enqueue, and
>> propagate throughout.
>
> Hmm, but why do you pass root lock down to each ->enqueue()?
>
> You can find root lock with sch_tree_lock() (or qdisc_lock() if you don't
> care about hierarchy), and you already have qdisc as a parameter of
> tcf_qevent_handle().

I know, I wanted to make it clear that the lock may end up being used,
instead of doing it "stealthily". If you find this inelegant I can push
a follow-up that converts tcf_qevent_handle() to sch_tree_unlock().

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-07-07 15:22     ` Petr Machata
@ 2020-07-07 19:13       ` Cong Wang
  2020-07-08 12:35         ` Petr Machata
  0 siblings, 1 reply; 31+ messages in thread
From: Cong Wang @ 2020-07-07 19:13 UTC (permalink / raw)
  To: Petr Machata
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel

On Tue, Jul 7, 2020 at 8:22 AM Petr Machata <petrm@mellanox.com> wrote:
>
>
> Cong Wang <xiyou.wangcong@gmail.com> writes:
>
> > On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
> >> The function tcf_qevent_handle() should be invoked when qdisc hits the
> >> "interesting event" corresponding to a block. This function releases root
> >> lock for the duration of executing the attached filters, to allow packets
> >> generated through user actions (notably mirred) to be reinserted to the
> >> same qdisc tree.
> >
> > Are you sure releasing the root lock in the middle of an enqueue operation
> > is a good idea? I mean, it seems racy with qdisc change or reset path,
> > for example, __red_change() could update some RED parameters
> > immediately after you release the root lock.
>
> So I had mulled over this for a while. If packets are enqueued or
> dequeued while the lock is released, maybe the packet under
> consideration should have been hard_marked instead of prob_marked, or
> vice versa. (I can't really go to not marked at all, because the fact
> that we ran the qevent is a very observable commitment to put the packet
> in the queue with marking.) I figured that is not such a big deal.
>
> Regarding a configuration change, for a brief period after the change, a
> few not-yet-pushed packets could have been enqueued with ECN marking
> even as I e.g. disabled ECN. This does not seem like a big deal either,
> these are transient effects.

Hmm, let's see:

1. red_enqueue() caches a pointer to child qdisc, child = q->qdisc
2. root lock is released by tcf_qevent_handle().
3. __red_change() acquires the root lock and then changes child
qdisc with a new one
4. The old child qdisc is put by qdisc_put()
5. tcf_qevent_handle() acquires the root lock again, and still uses
the cached but now freed old child qdisc.

Isn't this a problem?

Even if it really is not, why do we make tcf_qevent_handle() callers'
life so hard? They have to be very careful with race conditions like this.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 1/5] net: sched: Pass root lock to Qdisc_ops.enqueue
  2020-07-07 15:25     ` Petr Machata
@ 2020-07-07 19:41       ` Cong Wang
  0 siblings, 0 replies; 31+ messages in thread
From: Cong Wang @ 2020-07-07 19:41 UTC (permalink / raw)
  To: Petr Machata
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel

On Tue, Jul 7, 2020 at 8:25 AM Petr Machata <petrm@mellanox.com> wrote:
>
>
> Cong Wang <xiyou.wangcong@gmail.com> writes:
>
> > On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
> >>
> >> A following patch introduces qevents, points in qdisc algorithm where
> >> packet can be processed by user-defined filters. Should this processing
> >> lead to a situation where a new packet is to be enqueued on the same port,
> >> holding the root lock would lead to deadlocks. To solve the issue, qevent
> >> handler needs to unlock and relock the root lock when necessary.
> >>
> >> To that end, add the root lock argument to the qdisc op enqueue, and
> >> propagate throughout.
> >
> > Hmm, but why do you pass root lock down to each ->enqueue()?
> >
> > You can find root lock with sch_tree_lock() (or qdisc_lock() if you don't
> > care about hierarchy), and you already have qdisc as a parameter of
> > tcf_qevent_handle().
>
> I know, I wanted to make it clear that the lock may end up being used,
> instead of doing it "stealthily". If you find this inelegant I can push
> a follow-up that converts tcf_qevent_handle() to sch_tree_unlock().

So far only sch_red uses tcf_qevent_handle(), so changing the rest of
the qdiscs just for sch_red seems overkill here.

Of course, if we could eliminate the root lock release, all the pains
would go away.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-06-26 22:45 ` [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks Petr Machata
  2020-07-06 19:48   ` Cong Wang
@ 2020-07-07 19:48   ` Cong Wang
  2020-07-08  9:19     ` Petr Machata
  1 sibling, 1 reply; 31+ messages in thread
From: Cong Wang @ 2020-07-07 19:48 UTC (permalink / raw)
  To: Petr Machata
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel

On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
> The function tcf_qevent_handle() should be invoked when qdisc hits the
> "interesting event" corresponding to a block. This function releases root
> lock for the duration of executing the attached filters, to allow packets
> generated through user actions (notably mirred) to be reinserted to the
> same qdisc tree.

I read this again, another question here is: why is tcf_qevent_handle()
special here? We call tcf_classify() under root lock in many other places
too, for example htb_enqueue(), which of course includes act_mirred
execution, so why isn't it a problem there?

People added MIRRED_RECURSION_LIMIT for this kinda recursion,
but never released that root lock.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-07-07 19:48   ` Cong Wang
@ 2020-07-08  9:19     ` Petr Machata
  0 siblings, 0 replies; 31+ messages in thread
From: Petr Machata @ 2020-07-08  9:19 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel


Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
>> The function tcf_qevent_handle() should be invoked when qdisc hits the
>> "interesting event" corresponding to a block. This function releases root
>> lock for the duration of executing the attached filters, to allow packets
>> generated through user actions (notably mirred) to be reinserted to the
>> same qdisc tree.
>
> I read this again, another question here is: why is tcf_qevent_handle()
> special here? We call tcf_classify() under root lock in many other places
> too, for example htb_enqueue(), which of course includes act_mirred
> execution, so why isn't it a problem there?
>
> People added MIRRED_RECURSION_LIMIT for this kinda recursion,
> but never released that root lock.

Yes, I realized later that the qdiscs that use tcf_classify() for
classification have this exact problem as well. My intention was to fix
it by dropping the lock. Since the classification is the first step of
enqueuing, it should not really lead to races, so hopefully this will be
OK. I don't have any code as of yet.

The recursion limit makes sense for clsact, which is handled out of the
root lock.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-07-07 19:13       ` Cong Wang
@ 2020-07-08 12:35         ` Petr Machata
  2020-07-08 16:21           ` Petr Machata
  2020-07-08 19:04           ` Cong Wang
  0 siblings, 2 replies; 31+ messages in thread
From: Petr Machata @ 2020-07-08 12:35 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel


Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Tue, Jul 7, 2020 at 8:22 AM Petr Machata <petrm@mellanox.com> wrote:
>>
>>
>> Cong Wang <xiyou.wangcong@gmail.com> writes:
>>
>> > On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
>> >> The function tcf_qevent_handle() should be invoked when qdisc hits the
>> >> "interesting event" corresponding to a block. This function releases root
>> >> lock for the duration of executing the attached filters, to allow packets
>> >> generated through user actions (notably mirred) to be reinserted to the
>> >> same qdisc tree.
>> >
>> > Are you sure releasing the root lock in the middle of an enqueue operation
>> > is a good idea? I mean, it seems racy with qdisc change or reset path,
>> > for example, __red_change() could update some RED parameters
>> > immediately after you release the root lock.
>>
>> So I had mulled over this for a while. If packets are enqueued or
>> dequeued while the lock is released, maybe the packet under
>> consideration should have been hard_marked instead of prob_marked, or
>> vice versa. (I can't really go to not marked at all, because the fact
>> that we ran the qevent is a very observable commitment to put the packet
>> in the queue with marking.) I figured that is not such a big deal.
>>
>> Regarding a configuration change, for a brief period after the change, a
>> few not-yet-pushed packets could have been enqueued with ECN marking
>> even as I e.g. disabled ECN. This does not seem like a big deal either,
>> these are transient effects.
>
> Hmm, let's see:
>
> 1. red_enqueue() caches a pointer to child qdisc, child = q->qdisc
> 2. root lock is released by tcf_qevent_handle().
> 3. __red_change() acquires the root lock and then changes child
> qdisc with a new one
> 4. The old child qdisc is put by qdisc_put()
> 5. tcf_qevent_handle() acquires the root lock again, and still uses
> the cached but now freed old child qdisc.
>
> Isn't this a problem?

I missed this. It is a problem, destroy gets called right away and then
enqueue would use invalid data.

Also qdisc_graft() calls cops->graft, which locks the tree and swaps the
qdisc pointers, and then qdisc_put()s the original one. So dropping the
root lock can lead to destruction of the qdisc whose enqueue is
currently executed.

So that's no good. The lock has to be held throughout.

> Even if it really is not, why do we make tcf_qevent_handle() callers'
> life so hard? They have to be very careful with race conditions like this.

Do you have another solution in mind here? I think the deadlock (in both
classification and qevents) is an issue, but really don't know how to
avoid it except by dropping the lock.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-07-08 12:35         ` Petr Machata
@ 2020-07-08 16:21           ` Petr Machata
  2020-07-08 19:09             ` Cong Wang
  2020-07-08 19:04           ` Cong Wang
  1 sibling, 1 reply; 31+ messages in thread
From: Petr Machata @ 2020-07-08 16:21 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel


Petr Machata <petrm@mellanox.com> writes:

> Cong Wang <xiyou.wangcong@gmail.com> writes:
>
>> On Tue, Jul 7, 2020 at 8:22 AM Petr Machata <petrm@mellanox.com> wrote:
>>>
>>>
>>> Cong Wang <xiyou.wangcong@gmail.com> writes:
>>>
>>> > On Fri, Jun 26, 2020 at 3:46 PM Petr Machata <petrm@mellanox.com> wrote:
>>> >> The function tcf_qevent_handle() should be invoked when qdisc hits the
>>> >> "interesting event" corresponding to a block. This function releases root
>>> >> lock for the duration of executing the attached filters, to allow packets
>>> >> generated through user actions (notably mirred) to be reinserted to the
>>> >> same qdisc tree.
>>> >
>>> > Are you sure releasing the root lock in the middle of an enqueue operation
>>> > is a good idea? I mean, it seems racy with qdisc change or reset path,
>>> > for example, __red_change() could update some RED parameters
>>> > immediately after you release the root lock.
>>>
>>> So I had mulled over this for a while. If packets are enqueued or
>>> dequeued while the lock is released, maybe the packet under
>>> consideration should have been hard_marked instead of prob_marked, or
>>> vice versa. (I can't really go to not marked at all, because the fact
>>> that we ran the qevent is a very observable commitment to put the packet
>>> in the queue with marking.) I figured that is not such a big deal.
>>>
>>> Regarding a configuration change, for a brief period after the change, a
>>> few not-yet-pushed packets could have been enqueued with ECN marking
>>> even as I e.g. disabled ECN. This does not seem like a big deal either,
>>> these are transient effects.
>>
>> Hmm, let's see:
>>
>> 1. red_enqueue() caches a pointer to child qdisc, child = q->qdisc
>> 2. root lock is released by tcf_qevent_handle().
>> 3. __red_change() acquires the root lock and then changes child
>> qdisc with a new one
>> 4. The old child qdisc is put by qdisc_put()
>> 5. tcf_qevent_handle() acquires the root lock again, and still uses
>> the cached but now freed old child qdisc.
>>
>> Isn't this a problem?
>
> I missed this. It is a problem, destroy gets called right away and then
> enqueue would use invalid data.
>
> Also qdisc_graft() calls cops->graft, which locks the tree and swaps the
> qdisc pointers, and then qdisc_put()s the original one. So dropping the
> root lock can lead to destruction of the qdisc whose enqueue is
> currently executed.
>
> So that's no good. The lock has to be held throughout.
>
>> Even if it really is not, why do we make tcf_qevent_handle() callers'
>> life so hard? They have to be very careful with race conditions like this.
>
> Do you have another solution in mind here? I think the deadlock (in both
> classification and qevents) is an issue, but really don't know how to
> avoid it except by dropping the lock.

Actually I guess I could qdisc_refcount_inc() the current qdisc so that
it doesn't go away. Then when enqueuing I could access the child
directly, not relying on the now-obsolete cache from the beginning of
the enqueue function. I suppose that a similar approach could be used in
other users of tcf_classify() as well. What do you think?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-07-08 12:35         ` Petr Machata
  2020-07-08 16:21           ` Petr Machata
@ 2020-07-08 19:04           ` Cong Wang
  2020-07-08 21:04             ` Petr Machata
  1 sibling, 1 reply; 31+ messages in thread
From: Cong Wang @ 2020-07-08 19:04 UTC (permalink / raw)
  To: Petr Machata
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel

On Wed, Jul 8, 2020 at 5:35 AM Petr Machata <petrm@mellanox.com> wrote:
> Do you have another solution in mind here? I think the deadlock (in both
> classification and qevents) is an issue, but really don't know how to
> avoid it except by dropping the lock.

Ideally we should only take the lock once, but it clearly requires some
work to teach the dev_queue_xmit() in act_mirred not to acquire it again.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-07-08 16:21           ` Petr Machata
@ 2020-07-08 19:09             ` Cong Wang
  0 siblings, 0 replies; 31+ messages in thread
From: Cong Wang @ 2020-07-08 19:09 UTC (permalink / raw)
  To: Petr Machata
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel

On Wed, Jul 8, 2020 at 9:21 AM Petr Machata <petrm@mellanox.com> wrote:
>
> Actually I guess I could qdisc_refcount_inc() the current qdisc so that
> it doesn't go away. Then when enqueing I could access the child
> directly, not relying on the now-obsolete cache from the beginning of
> the enqueue function. I suppose that a similar approach could be used in
> other users of tcf_classify() as well. What do you think?

The above example is just a quick one I can think of, there could be
more race conditions that lead to other kinds of bugs.

I am sure you can fix that one, but the point is that it is hard to
audit and fix them all. The best solution here is of course not to
release that lock, but again it requires some more work to avoid
the deadlock.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-07-08 19:04           ` Cong Wang
@ 2020-07-08 21:04             ` Petr Machata
  2020-07-09  0:13               ` Petr Machata
  0 siblings, 1 reply; 31+ messages in thread
From: Petr Machata @ 2020-07-08 21:04 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel


Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Wed, Jul 8, 2020 at 5:35 AM Petr Machata <petrm@mellanox.com> wrote:
>> Do you have another solution in mind here? I think the deadlock (in both
>> classification and qevents) is an issue, but really don't know how to
>> avoid it except by dropping the lock.
>
> Ideally we should only take the lock once, but it clearly requires some
> work to teach the dev_queue_xmit() in act_mirred not to acquire it again.

act_mirred does not acquire it though. The egress path does. And the
packet can traverse several mirred instances on several devices before
it gets to the deadlock. Currently I don't see a way to let the egress
path know that the lock is already taken.

I'll think about it some more. For now I will at least fix the lack of
locking.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-07-08 21:04             ` Petr Machata
@ 2020-07-09  0:13               ` Petr Machata
  2020-07-09 19:37                 ` Cong Wang
  0 siblings, 1 reply; 31+ messages in thread
From: Petr Machata @ 2020-07-09  0:13 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel


Petr Machata <petrm@mellanox.com> writes:

> Cong Wang <xiyou.wangcong@gmail.com> writes:
>
> I'll think about it some more. For now I will at least fix the lack of
> locking.

I guess I could store smp_processor_id() that acquired the lock in
struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
the stored value. I'll need to be careful about the race between
unsuccessful trylock and the test, and about making sure CPU ID doesn't
change after it is read. I'll probe this tomorrow.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-07-09  0:13               ` Petr Machata
@ 2020-07-09 19:37                 ` Cong Wang
  2020-07-10 14:40                   ` Petr Machata
  0 siblings, 1 reply; 31+ messages in thread
From: Cong Wang @ 2020-07-09 19:37 UTC (permalink / raw)
  To: Petr Machata
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel

On Wed, Jul 8, 2020 at 5:13 PM Petr Machata <petrm@mellanox.com> wrote:
>
>
> Petr Machata <petrm@mellanox.com> writes:
>
> > Cong Wang <xiyou.wangcong@gmail.com> writes:
> >
> > I'll think about it some more. For now I will at least fix the lack of
> > locking.
>
> I guess I could store smp_processor_id() that acquired the lock in
> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
> the stored value. I'll need to be careful about the race between
> unsuccessful trylock and the test, and about making sure CPU ID doesn't
> change after it is read. I'll probe this tomorrow.

Like __netif_tx_lock(), right? Seems doable.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-07-09 19:37                 ` Cong Wang
@ 2020-07-10 14:40                   ` Petr Machata
  2020-07-14  2:51                     ` Cong Wang
  0 siblings, 1 reply; 31+ messages in thread
From: Petr Machata @ 2020-07-10 14:40 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel


Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Wed, Jul 8, 2020 at 5:13 PM Petr Machata <petrm@mellanox.com> wrote:
>>
>>
>> Petr Machata <petrm@mellanox.com> writes:
>>
>> > Cong Wang <xiyou.wangcong@gmail.com> writes:
>> >
>> > I'll think about it some more. For now I will at least fix the lack of
>> > locking.
>>
>> I guess I could store smp_processor_id() that acquired the lock in
>> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
>> the stored value. I'll need to be careful about the race between
>> unsuccessful trylock and the test, and about making sure CPU ID doesn't
>> change after it is read. I'll probe this tomorrow.
>
> Like __netif_tx_lock(), right? Seems doable.

Good to see it actually used; I wasn't sure if the idea made sense :)

Unfortunately it is not enough.

Consider two threads (A, B) and two netdevices (eth0, eth1):

- "A" takes eth0's root lock and proceeds to classification
- "B" takes eth1's root lock and proceeds to classification
- "A" invokes mirror to eth1, waits on lock held by "B"
- "B" invakes mirror to eth0, waits on lock held by "A"
- Some say they are still waiting to this day.

So one option that I see is to just stash the mirrored packet in a queue
instead of delivering it right away:

- s/netif_receive_skb/netif_rx/ in act_mirred

- Reuse the RX queue for TX packets as well, differentiating the two by
  a bit in SKB CB. Then process_backlog() would call either
  __netif_receive_skb() or dev_queue_xmit().

- Drop mirred_rec_level guard.

This seems to work, but I might be missing something non-obvious, such
as CB actually being used for something already in that context. I would
really rather not introduce a second backlog queue just for mirred
though.

Since mirred_rec_level does not kick in anymore, the same packet can end
up being forwarded from the backlog queue, to the qdisc, and back to the
backlog queue, forever. But that seems OK, that's what the admin
configured, so that's what's happening.
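
As an illustration of the CB bit, something like the sketch below.
The struct and helper names are made up, and the real hunks would live
in act_mirred and net/core/dev.c:

/* Sketch only, not the actual patch. One bit in skb->cb marks packets
 * that mirred queued for *egress* re-injection, so backlog processing
 * can tell them apart from ordinary RX packets.
 */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct mirred_backlog_cb {
	u8 tx:1;	/* 1: re-inject via dev_queue_xmit(), 0: normal RX */
};

/* act_mirred side: queue to the per-CPU backlog instead of transmitting
 * synchronously under the caller's qdisc root lock. */
static int mirred_queue_tx(struct sk_buff *skb)
{
	((struct mirred_backlog_cb *)skb->cb)->tx = 1;
	return netif_rx(skb);
}

/* process_backlog() side (net/core/dev.c): dispatch on the bit. */
static void backlog_dispatch(struct sk_buff *skb)
{
	if (((struct mirred_backlog_cb *)skb->cb)->tx)
		dev_queue_xmit(skb);		/* egress, no root lock held here */
	else
		__netif_receive_skb(skb);	/* ordinary RX delivery */
}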

If this is not a good idea for some reason, this might work as well:

- Convert the current root lock to an rw lock. Convert all current
  lockers to write lock (which should be safe), except of enqueue, which
  will take read lock. That will allow many concurrent threads to enter
  enqueue, or one thread several times, but it will exclude all other
  users.

  So this guards configuration access to the qdisc tree, makes sure
  qdiscs don't go away from under one's feet.

- Introduce another spin lock to guard the private data of the qdisc
  tree, counters etc., things that even two concurrent enqueue
  operations shouldn't trample on. Enqueue takes this spin lock after
  read-locking the root lock. act_mirred drops it before injecting the
  packet and takes it again afterwards.

Any opinions y'all?
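
For concreteness, the second option would follow roughly this locking
pattern. Purely a sketch: the lock names are made up and the real
change would live inside struct Qdisc and sch_generic:

/* Sketch of the rwlock-plus-inner-lock idea; placement and names are
 * illustrative only.
 */
#include <linux/spinlock.h>

struct qdisc_tree_locks {
	rwlock_t tree_lock;	/* former root lock: config paths write-lock,
				 * enqueue only read-locks */
	spinlock_t data_lock;	/* guards queues, counters, other private data */
};

static void enqueue_locking_pattern(struct qdisc_tree_locks *locks)
{
	read_lock(&locks->tree_lock);	/* qdiscs can't be torn down under us */
	spin_lock(&locks->data_lock);	/* serialize the actual queue updates */

	/* ... classification and queueing; a qevent/act_mirred handler
	 * would spin_unlock(data_lock) around packet injection, keeping
	 * the read lock, and spin_lock() it again afterwards ... */

	spin_unlock(&locks->data_lock);
	read_unlock(&locks->tree_lock);
}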

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-07-10 14:40                   ` Petr Machata
@ 2020-07-14  2:51                     ` Cong Wang
  2020-07-14  9:12                       ` Petr Machata
  0 siblings, 1 reply; 31+ messages in thread
From: Cong Wang @ 2020-07-14  2:51 UTC (permalink / raw)
  To: Petr Machata
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel

On Fri, Jul 10, 2020 at 7:40 AM Petr Machata <petrm@mellanox.com> wrote:
>
>
> Cong Wang <xiyou.wangcong@gmail.com> writes:
>
> > On Wed, Jul 8, 2020 at 5:13 PM Petr Machata <petrm@mellanox.com> wrote:
> >>
> >>
> >> Petr Machata <petrm@mellanox.com> writes:
> >>
> >> > Cong Wang <xiyou.wangcong@gmail.com> writes:
> >> >
> >> > I'll think about it some more. For now I will at least fix the lack of
> >> > locking.
> >>
> >> I guess I could store smp_processor_id() that acquired the lock in
> >> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
> >> the stored value. I'll need to be careful about the race between
> >> unsuccessful trylock and the test, and about making sure CPU ID doesn't
> >> change after it is read. I'll probe this tomorrow.
> >
> > Like __netif_tx_lock(), right? Seems doable.
>
> Good to see it actually used, I wasn't sure if the idea made sense :)
>
> Unfortunately it is not enough.
>
> Consider two threads (A, B) and two netdevices (eth0, eth1):
>
> - "A" takes eth0's root lock and proceeds to classification
> - "B" takes eth1's root lock and proceeds to classification
> - "A" invokes mirror to eth1, waits on lock held by "B"
> - "B" invakes mirror to eth0, waits on lock held by "A"
> - Some say they are still waiting to this day.

Sure, AA or ABBA deadlock.

>
> So one option that I see is to just stash the mirrored packet in a queue
> instead of delivering it right away:
>
> - s/netif_receive_skb/netif_rx/ in act_mirred
>
> - Reuse the RX queue for TX packets as well, differentiating the two by
>   a bit in SKB CB. Then process_backlog() would call either
> >   __netif_receive_skb() or dev_queue_xmit().
>
> - Drop mirred_rec_level guard.

I don't think I follow you, the root qdisc lock is on egress which has
nothing to do with ingress, so I don't see how netif_rx() is even involved.

>
> This seems to work, but I might be missing something non-obvious, such
> as CB actually being used for something already in that context. I would
> really rather not introduce a second backlog queue just for mirred
> though.
>
> Since mirred_rec_level does not kick in anymore, the same packet can end
> up being forwarded from the backlog queue, to the qdisc, and back to the
> backlog queue, forever. But that seems OK, that's what the admin
> configured, so that's what's happening.
>
> If this is not a good idea for some reason, this might work as well:
>
> - Convert the current root lock to an rw lock. Convert all current
>   lockers to write lock (which should be safe), except of enqueue, which
>   will take read lock. That will allow many concurrent threads to enter
>   enqueue, or one thread several times, but it will exclude all other
>   users.

Are you sure we can parallelize enqueue()? They all need to move the
skb into some queue, which cannot be parallelized with just a read
lock. Even the "lockless" qdisc takes a spinlock, r->producer_lock,
for enqueue().


>
>   So this guards configuration access to the qdisc tree, makes sure
>   qdiscs don't go away from under one's feet.
>
> - Introduce another spin lock to guard the private data of the qdisc
>   tree, counters etc., things that even two concurrent enqueue
>   operations shouldn't trample on. Enqueue takes this spin lock after
>   read-locking the root lock. act_mirred drops it before injecting the
>   packet and takes it again afterwards.
>
> Any opinions y'all?

I thought about forbidding mirror/redirecting to the same device,
but there might be some legitimate use cases of such. So, I don't
have any other ideas yet, perhaps there is some way to refactor
dev_queue_xmit() to avoid this deadlock.

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks
  2020-07-14  2:51                     ` Cong Wang
@ 2020-07-14  9:12                       ` Petr Machata
  0 siblings, 0 replies; 31+ messages in thread
From: Petr Machata @ 2020-07-14  9:12 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, David Miller, Jakub Kicinski,
	Eric Dumazet, Jiri Pirko, Ido Schimmel


Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Fri, Jul 10, 2020 at 7:40 AM Petr Machata <petrm@mellanox.com> wrote:
>>
>>
>> Cong Wang <xiyou.wangcong@gmail.com> writes:
>>
>> > On Wed, Jul 8, 2020 at 5:13 PM Petr Machata <petrm@mellanox.com> wrote:
>> >>
>> >>
>> >> Petr Machata <petrm@mellanox.com> writes:
>> >>
>> >> > Cong Wang <xiyou.wangcong@gmail.com> writes:
>> >> >
>> >> > I'll think about it some more. For now I will at least fix the lack of
>> >> > locking.
>> >>
>> >> I guess I could store smp_processor_id() that acquired the lock in
>> >> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
>> >> the stored value. I'll need to be careful about the race between
>> >> unsuccessful trylock and the test, and about making sure CPU ID doesn't
>> >> change after it is read. I'll probe this tomorrow.
>> >
>> > Like __netif_tx_lock(), right? Seems doable.
>>
>> Good to see it actually used, I wasn't sure if the idea made sense :)
>>
>> Unfortunately it is not enough.
>>
>> Consider two threads (A, B) and two netdevices (eth0, eth1):
>>
>> - "A" takes eth0's root lock and proceeds to classification
>> - "B" takes eth1's root lock and proceeds to classification
>> - "A" invokes mirror to eth1, waits on lock held by "B"
>> - "B" invakes mirror to eth0, waits on lock held by "A"
>> - Some say they are still waiting to this day.
>
> Sure, AA or ABBA deadlock.
>
>>
>> So one option that I see is to just stash the mirrored packet in a queue
>> instead of delivering it right away:
>>
>> - s/netif_receive_skb/netif_rx/ in act_mirred
>>
>> - Reuse the RX queue for TX packets as well, differentiating the two by
>>   a bit in SKB CB. Then process_backlog() would call either
>> >   __netif_receive_skb() or dev_queue_xmit().
>>
>> - Drop mirred_rec_level guard.
>
> I don't think I follow you, the root qdisc lock is on egress which has
> nothing to do with ingress, so I don't see how netif_rx() is even involved.

netif_rx() isn't, but __netif_receive_skb() is, and that can lead to the
deadlock as well when another mirred redirects it back to the locked
egress queue.

So a way to solve the "mirred ingress dev" action deadlock is to
s/netif_receive_skb/netif_rx/. I.e. don't resolve the mirror right away,
go through the per-CPU queue.
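
The act_mirred end of that would be roughly this. A sketch, not the
actual diff; the helper is made up and the real code sits inline in
tcf_mirred_act():

/* Sketch only: for the ingress direction, hand the clone to the
 * per-CPU backlog with netif_rx() instead of calling
 * netif_receive_skb() synchronously, so the caller's qdisc root lock
 * is never re-entered from here.
 */
#include <linux/netdevice.h>

static int mirred_deliver(struct sk_buff *skb2, bool want_ingress)
{
	if (want_ingress)
		return netif_rx(skb2);		/* deferred via the backlog */
	return dev_queue_xmit(skb2);		/* egress unchanged for now */
}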

Then "mirred egress dev" could be fixed similarly by repurposing the
queue for both ingress and egress, differentiating ingress packets from
egress ones by a bit in SKB CB.

>>
>> This seems to work, but I might be missing something non-obvious, such
>> as CB actually being used for something already in that context. I would
>> really rather not introduce a second backlog queue just for mirred
>> though.
>>
>> Since mirred_rec_level does not kick in anymore, the same packet can end
>> up being forwarded from the backlog queue, to the qdisc, and back to the
>> backlog queue, forever. But that seems OK, that's what the admin
>> configured, so that's what's happening.
>>
>> If this is not a good idea for some reason, this might work as well:
>>
>> - Convert the current root lock to an rw lock. Convert all current
>>   lockers to write lock (which should be safe), except of enqueue, which
>>   will take read lock. That will allow many concurrent threads to enter
>>   enqueue, or one thread several times, but it will exclude all other
>>   users.
>
> Are you sure we can parallelize enqueue()? They all need to move the
> skb into some queue, which cannot be parallelized with just a read
> lock. Even the "lockless" qdisc takes a spinlock, r->producer_lock,
> for enqueue().

That's what the second spin lock is for. It guards the private data,
including the queues.

>>
>>   So this guards configuration access to the qdisc tree, makes sure
>>   qdiscs don't go away from under one's feet.
>>
>> - Introduce another spin lock to guard the private data of the qdisc
>>   tree, counters etc., things that even two concurrent enqueue
>>   operations shouldn't trample on. Enqueue takes this spin lock after
>>   read-locking the root lock. act_mirred drops it before injecting the
>>   packet and takes it again afterwards.
>>
>> Any opinions y'all?
>
> I thought about forbidding mirror/redirecting to the same device,
> but there might be some legitimate use cases of such. So, I don't

Yes, and also that's not enough:

- A chain of mirreds can cause the deadlock as well (i.e. mirror to
  eth1, redirect back to eth0). Or the ABBA case shown above, where
  two actions that don't even handle the same packets cause the
  deadlock.

- I suspect general forwarding could cause this deadlock as well, e.g.
  redirecting to the ingress of a device, where a bridge or router takes
  over and brings the packet back to egress. I have not tried reproducing
  this though; maybe there's a queue or delayed work etc. somewhere in
  there that makes this not an issue.

> have any other ideas yet, perhaps there is some way to refactor
> dev_queue_xmit() to avoid this deadlock.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2020-07-14  9:12 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-26 22:45 [PATCH net-next v1 0/5] TC: Introduce qevents Petr Machata
2020-06-26 22:45 ` [PATCH net-next v1 1/5] net: sched: Pass root lock to Qdisc_ops.enqueue Petr Machata
2020-07-06 19:21   ` Cong Wang
2020-07-07 15:25     ` Petr Machata
2020-07-07 19:41       ` Cong Wang
2020-06-26 22:45 ` [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks Petr Machata
2020-07-06 19:48   ` Cong Wang
2020-07-07 15:22     ` Petr Machata
2020-07-07 19:13       ` Cong Wang
2020-07-08 12:35         ` Petr Machata
2020-07-08 16:21           ` Petr Machata
2020-07-08 19:09             ` Cong Wang
2020-07-08 19:04           ` Cong Wang
2020-07-08 21:04             ` Petr Machata
2020-07-09  0:13               ` Petr Machata
2020-07-09 19:37                 ` Cong Wang
2020-07-10 14:40                   ` Petr Machata
2020-07-14  2:51                     ` Cong Wang
2020-07-14  9:12                       ` Petr Machata
2020-07-07 19:48   ` Cong Wang
2020-07-08  9:19     ` Petr Machata
2020-06-26 22:45 ` [PATCH net-next v1 3/5] net: sched: sch_red: Split init and change callbacks Petr Machata
2020-06-26 22:45 ` [PATCH net-next v1 4/5] net: sched: sch_red: Add qevents "early_drop" and "mark" Petr Machata
2020-06-26 22:45 ` [PATCH net-next v1 5/5] selftests: forwarding: Add a RED test for SW datapath Petr Machata
2020-06-26 22:45 ` [PATCH iproute2-next v1 1/4] uapi: pkt_sched: Add two new RED attributes Petr Machata
2020-06-26 22:45   ` [PATCH iproute2-next v1 2/4] tc: Add helpers to support qevent handling Petr Machata
2020-06-26 22:45   ` [PATCH iproute2-next v1 3/4] man: tc: Describe qevents Petr Machata
2020-06-26 22:45   ` [PATCH iproute2-next v1 4/4] tc: q_red: Add support for qevents "mark" and "early_drop" Petr Machata
2020-06-26 22:56 ` [PATCH net-next v1 0/5] TC: Introduce qevents Stephen Hemminger
2020-06-29 13:21   ` Petr Machata
2020-06-30  0:15 ` David Miller
