* [RFC PATCH v2 00/10] running qdiscs without qdisc_lock
@ 2016-07-14  6:19 John Fastabend
  2016-07-14  6:19 ` [RFC PATCH v2 01/10] net: sched: allow qdiscs to handle locking John Fastabend
                   ` (9 more replies)
  0 siblings, 10 replies; 21+ messages in thread
From: John Fastabend @ 2016-07-14  6:19 UTC (permalink / raw)
  To: fw, jhs, alexei.starovoitov, eric.dumazet, brouer; +Cc: netdev

Hi,

I thought I should go ahead and send this series out for comments.
In it, I allow qdiscs to be run without taking the qdisc lock. As a
result, statistics, gso skb, tx bad skb and a few other things need
to be "safe" to update without locks. It _should_ all be covered
here, although I just noticed I must be missing a dec on the backlog
counter somewhere, as one of my tests just ended with 0 packets but
a nonzero bytes counter.

Also of note: this series uses the skb_array implementation already
in net-next for the tun/tap devices. With this implementation, for
cases where lots of threads are hitting the same qdisc I see a
modest improvement, but for cases like mq with pktgen where
everything is lined up nicely I see a fairly unpleasant regression.

I have a few thoughts on how to resolve this. First, if we support
bulk dequeue as an operation on the skb_array it should help versus
grabbing the consumer lock repeatedly. Also, we really don't need the
HARD_TX_LOCK if we have a core per queue and an XPS setup like many
multiqueue nics default to. And I need to go back and look at the
original alf ring implementation as well to see how it compares; I
don't recall seeing the mq regression there.
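
To make the first point concrete, what I have in mind for bulk
dequeue is roughly the sketch below. It is untested, the helper name
is made up, and since skb_array has no batched consume op today it
pokes at the underlying ptr_ring directly; the caller would still do
the per-skb backlog/qlen accounting:

#include <linux/skb_array.h>

static struct sk_buff *pfifo_fast_bulk_dequeue(struct skb_array *a,
					       int budget)
{
	struct sk_buff *head = NULL, **tail = &head;
	int i;

	/* take the consumer lock once instead of once per packet */
	spin_lock_bh(&a->ring.consumer_lock);
	for (i = 0; i < budget; i++) {
		struct sk_buff *skb = __ptr_ring_consume(&a->ring);

		if (!skb)
			break;
		*tail = skb;
		tail = &skb->next;
	}
	*tail = NULL;
	spin_unlock_bh(&a->ring.consumer_lock);

	return head;
}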

Also, after the above it might be nice to make all qdiscs support
the per cpu statistics and drop the non per cpu cases, just to
simplify the code and all the if/else branching where it's not
needed.

As usual any thoughts, comments, etc are welcome.

I wasn't going to include these numbers because they come from an
untuned system, but why not. They are initial pktgen numbers from my
development system, which is a reasonable machine (E5-2695), but I
didn't do any work to tweak the config so a bunch of debug/hacking
options are still enabled.

The pktgen command is

./samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i eth3  -t X -s 64

pfifo_fast

        original pps        lockless        diff
1       1418168             1269450         -148718
2       1587390             1553408         -33982
4       1084961             1683639         +598678
8       989636              1522723         +533087
12      1014018             1348172         +334154

mq

        original pps        lockless        diff
1       1442018             1205180         -236838
2       2646069             2266095         -379974
4       5136200             4269470         -866730
8
12      13275671            10810909        -2464762

---

John Fastabend (10):
      net: sched: allow qdiscs to handle locking
      net: sched: qdisc_qlen for per cpu logic
      net: sched: provide per cpu qstat helpers
      net: sched: a dflt qdisc may be used with per cpu stats
      net: sched: per cpu gso handlers
      net: sched: support qdisc_reset on NOLOCK qdisc
      net: sched: support skb_bad_tx with lockless qdisc
      net: sched: pfifo_fast use alf_queue
      net: sched: helper to sum qlen
      net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq


 include/net/gen_stats.h   |    3 
 include/net/sch_generic.h |  105 ++++++++++++
 net/core/dev.c            |   32 +++-
 net/core/gen_stats.c      |    9 +
 net/sched/sch_api.c       |   12 +
 net/sched/sch_generic.c   |  385 +++++++++++++++++++++++++++++++++++----------
 net/sched/sch_mq.c        |   25 ++-
 7 files changed, 467 insertions(+), 104 deletions(-)


* [RFC PATCH v2 01/10] net: sched: allow qdiscs to handle locking
  2016-07-14  6:19 [RFC PATCH v2 00/10] running qdiscs without qdisc_lock John Fastabend
@ 2016-07-14  6:19 ` John Fastabend
  2016-07-14  6:44   ` John Fastabend
  2016-07-14  6:20 ` [RFC PATCH v2 02/10] net: sched: qdisc_qlen for per cpu logic John Fastabend
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 21+ messages in thread
From: John Fastabend @ 2016-07-14  6:19 UTC (permalink / raw)
  To: fw, jhs, alexei.starovoitov, eric.dumazet, brouer; +Cc: netdev

This patch adds a flag for queueing disciplines to indicate the stack
does not need to use the qdisc lock to protect operations. This can
be used to build lockless scheduling algorithms and improve
performance.

The flag is checked in the tx path and the qdisc lock is only taken
if it is not set. For now use a conditional if statement. Later we
could be more aggressive if it proves worthwhile and use a static key
or wrap this in a likely().
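
If we go the static key route later, I am thinking of something
along these lines (sketch only, the names are made up): keep the
branch patched out until the first NOLOCK qdisc is created so the
common case pays nothing for the new test.

DEFINE_STATIC_KEY_FALSE(tcq_nolock_used);

static bool qdisc_may_skip_lock(const struct Qdisc *q)
{
	return static_branch_unlikely(&tcq_nolock_used) &&
	       (q->flags & TCQ_F_NOLOCK);
}

/* would be called from the qdisc init path when TCQ_F_NOLOCK is set */
static void qdisc_nolock_inc(void)
{
	static_branch_inc(&tcq_nolock_used);
}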

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |    1 +
 net/core/dev.c            |   32 ++++++++++++++++++++++++++++----
 net/sched/sch_generic.c   |   24 ++++++++++++++++--------
 3 files changed, 45 insertions(+), 12 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 909aff2..3de6a8c 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -58,6 +58,7 @@ struct Qdisc {
 #define TCQ_F_NOPARENT		0x40 /* root of its hierarchy :
 				      * qdisc_tree_decrease_qlen() should stop.
 				      */
+#define TCQ_F_NOLOCK		0x80 /* qdisc does not require locking */
 	u32			limit;
 	const struct Qdisc_ops	*ops;
 	struct qdisc_size_table	__rcu *stab;
diff --git a/net/core/dev.c b/net/core/dev.c
index b92d63b..f35d449 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3075,6 +3075,27 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	int rc;
 
 	qdisc_calculate_pkt_len(skb, q);
+
+	if (q->flags & TCQ_F_NOLOCK) {
+		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
+			__qdisc_drop(skb, &to_free);
+			rc = NET_XMIT_DROP;
+		} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q)) {
+			qdisc_bstats_cpu_update(q, skb);
+			__qdisc_run(q);
+			if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
+				__qdisc_run(q);
+			rc = NET_XMIT_SUCCESS;
+		} else {
+			rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
+			__qdisc_run(q);
+		}
+
+		if (unlikely(to_free))
+			kfree_skb_list(to_free);
+		return rc;
+	}
+
 	/*
 	 * Heuristic to force contended enqueues to serialize on a
 	 * separate lock before trying to get qdisc main lock.
@@ -3896,19 +3917,22 @@ static void net_tx_action(struct softirq_action *h)
 
 		while (head) {
 			struct Qdisc *q = head;
-			spinlock_t *root_lock;
+			spinlock_t *root_lock = NULL;
 
 			head = head->next_sched;
 
-			root_lock = qdisc_lock(q);
-			spin_lock(root_lock);
+			if (!(q->flags & TCQ_F_NOLOCK)) {
+				root_lock = qdisc_lock(q);
+				spin_lock(root_lock);
+			}
 			/* We need to make sure head->next_sched is read
 			 * before clearing __QDISC_STATE_SCHED
 			 */
 			smp_mb__before_atomic();
 			clear_bit(__QDISC_STATE_SCHED, &q->state);
 			qdisc_run(q);
-			spin_unlock(root_lock);
+			if (!(q->flags & TCQ_F_NOLOCK))
+				spin_unlock(root_lock);
 		}
 	}
 }
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e95b67c..2c3e23b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -170,7 +170,8 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 	int ret = NETDEV_TX_BUSY;
 
 	/* And release qdisc */
-	spin_unlock(root_lock);
+	if (!(q->flags & TCQ_F_NOLOCK))
+		spin_unlock(root_lock);
 
 	/* Note that we validate skb (GSO, checksum, ...) outside of locks */
 	if (validate)
@@ -183,10 +184,13 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 
 		HARD_TX_UNLOCK(dev, txq);
 	} else {
-		spin_lock(root_lock);
+		if (!(q->flags & TCQ_F_NOLOCK))
+			spin_lock(root_lock);
 		return qdisc_qlen(q);
 	}
-	spin_lock(root_lock);
+
+	if (!(q->flags & TCQ_F_NOLOCK))
+		spin_lock(root_lock);
 
 	if (dev_xmit_complete(ret)) {
 		/* Driver sent out skb successfully or skb was consumed */
@@ -868,14 +872,18 @@ static bool some_qdisc_is_busy(struct net_device *dev)
 
 		dev_queue = netdev_get_tx_queue(dev, i);
 		q = dev_queue->qdisc_sleeping;
-		root_lock = qdisc_lock(q);
 
-		spin_lock_bh(root_lock);
+		if (q->flags & TCQ_F_NOLOCK) {
+			val = test_bit(__QDISC_STATE_SCHED, &q->state);
+		} else {
+			root_lock = qdisc_lock(q);
+			spin_lock_bh(root_lock);
 
-		val = (qdisc_is_running(q) ||
-		       test_bit(__QDISC_STATE_SCHED, &q->state));
+			val = (qdisc_is_running(q) ||
+			       test_bit(__QDISC_STATE_SCHED, &q->state));
 
-		spin_unlock_bh(root_lock);
+			spin_unlock_bh(root_lock);
+		}
 
 		if (val)
 			return true;


* [RFC PATCH v2 02/10] net: sched: qdisc_qlen for per cpu logic
  2016-07-14  6:19 [RFC PATCH v2 00/10] running qdiscs without qdisc_lock John Fastabend
  2016-07-14  6:19 ` [RFC PATCH v2 01/10] net: sched: allow qdiscs to handle locking John Fastabend
@ 2016-07-14  6:20 ` John Fastabend
  2016-07-14  6:20 ` [RFC PATCH v2 03/10] net: sched: provide per cpu qstat helpers John Fastabend
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2016-07-14  6:20 UTC (permalink / raw)
  To: fw, jhs, alexei.starovoitov, eric.dumazet, brouer; +Cc: netdev

This is a bit interesting because it means sch_direct_xmit will
return a positive value which causes the dequeue/xmit cycle to
continue only when a specific cpu has a qlen > 0.

However, checking each cpu for qlen will hurt performance, so it's
important to note that qdiscs that set the no lock bit need to have
some sort of per cpu enqueue/dequeue data structure that maps to the
per cpu qlen value.
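
For illustration only (none of the names below exist in this patch),
a NOLOCK qdisc is expected to pair whatever lock-free or per cpu
queue it uses with the per cpu qlen counter, roughly:

/* sketch: example_band() is hypothetical and picks the qdisc's ring */
static int example_nolock_enqueue(struct sk_buff *skb, struct Qdisc *sch,
				  struct sk_buff **to_free)
{
	struct skb_array *q = example_band(sch, skb);

	if (skb_array_produce(q, skb))		/* ring full */
		return qdisc_drop(skb, sch, to_free);

	/* keep the per cpu qlen in sync; this is what qdisc_qlen_cpu() reads */
	this_cpu_ptr(sch->cpu_qstats)->qlen++;

	return NET_XMIT_SUCCESS;
}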

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 3de6a8c..354951d 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -247,8 +247,16 @@ static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
 	BUILD_BUG_ON(sizeof(qcb->data) < sz);
 }
 
+static inline int qdisc_qlen_cpu(const struct Qdisc *q)
+{
+	return this_cpu_ptr(q->cpu_qstats)->qlen;
+}
+
 static inline int qdisc_qlen(const struct Qdisc *q)
 {
+	if (q->flags & TCQ_F_NOLOCK)
+		return qdisc_qlen_cpu(q);
+
 	return q->q.qlen;
 }
 


* [RFC PATCH v2 03/10] net: sched: provide per cpu qstat helpers
  2016-07-14  6:19 [RFC PATCH v2 00/10] running qdiscs without qdisc_lock John Fastabend
  2016-07-14  6:19 ` [RFC PATCH v2 01/10] net: sched: allow qdiscs to handle locking John Fastabend
  2016-07-14  6:20 ` [RFC PATCH v2 02/10] net: sched: qdisc_qlen for per cpu logic John Fastabend
@ 2016-07-14  6:20 ` John Fastabend
  2016-07-14  6:21 ` [RFC PATCH v2 04/10] net: sched: a dflt qdisc may be used with per cpu stats John Fastabend
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2016-07-14  6:20 UTC (permalink / raw)
  To: fw, jhs, alexei.starovoitov, eric.dumazet, brouer; +Cc: netdev

The per cpu qstats support was added along with the per cpu bstat
support that is currently used by the ingress qdisc. This patch adds
the set of helpers needed to let other qdiscs use per cpu qstats as
well.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |   39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 354951d..f69da4b 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -573,12 +573,43 @@ static inline void qdisc_qstats_backlog_dec(struct Qdisc *sch,
 	sch->qstats.backlog -= qdisc_pkt_len(skb);
 }
 
+static inline void qdisc_qstats_cpu_backlog_dec(struct Qdisc *sch,
+						const struct sk_buff *skb)
+{
+	struct gnet_stats_queue *q = this_cpu_ptr(sch->cpu_qstats);
+
+	q->backlog -= qdisc_pkt_len(skb);
+}
+
 static inline void qdisc_qstats_backlog_inc(struct Qdisc *sch,
 					    const struct sk_buff *skb)
 {
 	sch->qstats.backlog += qdisc_pkt_len(skb);
 }
 
+static inline void qdisc_qstats_cpu_backlog_inc(struct Qdisc *sch,
+						const struct sk_buff *skb)
+{
+	struct gnet_stats_queue *q = this_cpu_ptr(sch->cpu_qstats);
+
+	q->backlog += qdisc_pkt_len(skb);
+}
+
+static inline void qdisc_qstats_cpu_qlen_inc(struct Qdisc *sch)
+{
+	this_cpu_ptr(sch->cpu_qstats)->qlen++;
+}
+
+static inline void qdisc_qstats_cpu_qlen_dec(struct Qdisc *sch)
+{
+	this_cpu_ptr(sch->cpu_qstats)->qlen--;
+}
+
+static inline void qdisc_qstats_cpu_requeues_inc(struct Qdisc *sch)
+{
+	this_cpu_ptr(sch->cpu_qstats)->requeues++;
+}
+
 static inline void __qdisc_qstats_drop(struct Qdisc *sch, int count)
 {
 	sch->qstats.drops += count;
@@ -751,6 +782,14 @@ static inline void rtnl_qdisc_drop(struct sk_buff *skb, struct Qdisc *sch)
 	qdisc_qstats_drop(sch);
 }
 
+static inline int qdisc_drop_cpu(struct sk_buff *skb, struct Qdisc *sch,
+				 struct sk_buff **to_free)
+{
+	__qdisc_drop(skb, to_free);
+	qdisc_qstats_cpu_drop(sch);
+
+	return NET_XMIT_DROP;
+}
 
 static inline int qdisc_drop(struct sk_buff *skb, struct Qdisc *sch,
 			     struct sk_buff **to_free)


* [RFC PATCH v2 04/10] net: sched: a dflt qdisc may be used with per cpu stats
  2016-07-14  6:19 [RFC PATCH v2 00/10] running qdiscs without qdisc_lock John Fastabend
                   ` (2 preceding siblings ...)
  2016-07-14  6:20 ` [RFC PATCH v2 03/10] net: sched: provide per cpu qstat helpers John Fastabend
@ 2016-07-14  6:21 ` John Fastabend
  2016-07-14  6:21 ` [RFC PATCH v2 05/10] net: sched: per cpu gso handlers John Fastabend
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2016-07-14  6:21 UTC (permalink / raw)
  To: fw, jhs, alexei.starovoitov, eric.dumazet, brouer; +Cc: netdev

Enable dflt qdisc support for per cpu stats. Before this patch a
dflt qdisc was required to use the global statistics, qstats and
bstats.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/sch_generic.c |   24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 2c3e23b..fc70204 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -647,18 +647,34 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue,
 	struct Qdisc *sch;
 
 	if (!try_module_get(ops->owner))
-		goto errout;
+		return NULL;
 
 	sch = qdisc_alloc(dev_queue, ops);
 	if (IS_ERR(sch))
-		goto errout;
+		return NULL;
 	sch->parent = parentid;
 
-	if (!ops->init || ops->init(sch, NULL) == 0)
+	if (!ops->init)
 		return sch;
 
-	qdisc_destroy(sch);
+	if (ops->init(sch, NULL))
+		goto errout;
+
+	/* init() may have set percpu flags so init data structures */
+	if (qdisc_is_percpu_stats(sch)) {
+		sch->cpu_bstats =
+			netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);
+		if (!sch->cpu_bstats)
+			goto errout;
+
+		sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
+		if (!sch->cpu_qstats)
+			goto errout;
+	}
+
+	return sch;
 errout:
+	qdisc_destroy(sch);
 	return NULL;
 }
 EXPORT_SYMBOL(qdisc_create_dflt);


* [RFC PATCH v2 05/10] net: sched: per cpu gso handlers
  2016-07-14  6:19 [RFC PATCH v2 00/10] running qdiscs without qdisc_lock John Fastabend
                   ` (3 preceding siblings ...)
  2016-07-14  6:21 ` [RFC PATCH v2 04/10] net: sched: a dflt qdisc may be used with per cpu stats John Fastabend
@ 2016-07-14  6:21 ` John Fastabend
  2016-07-14  6:22 ` [RFC PATCH v2 06/10] net: sched: support qdisc_reset on NOLOCK qdisc John Fastabend
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2016-07-14  6:21 UTC (permalink / raw)
  To: fw, jhs, alexei.starovoitov, eric.dumazet, brouer; +Cc: netdev

The net sched infrastructure has a gso ptr that points to skb structs
that have failed to be enqueued by the device driver.

This can happen when multiple cores try to push a skb onto the same
underlying hardware queue, resulting in lock contention. This case is
handled by the cpu collision handler, handle_dev_cpu_collision(). Another
case occurs when the stack overruns the driver's low level tx queue
capacity. Ideally these should be rare occurrences in a well-tuned
system, but they do happen.

To handle this in the lockless case use a per cpu gso field to park
the skb until the conflict can be resolved. Note at this point the
skb has already been popped off the qdisc so it has to be handled
by the infrastructure.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |   37 +++++++++++++++++++++++
 net/sched/sch_api.c       |    7 ++++
 net/sched/sch_generic.c   |   71 ++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 110 insertions(+), 5 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f69da4b..7b140e2 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,10 @@ struct qdisc_size_table {
 	u16			data[];
 };
 
+struct gso_cell {
+	struct sk_buff *skb;
+};
+
 struct Qdisc {
 	int 			(*enqueue)(struct sk_buff *skb,
 					   struct Qdisc *sch,
@@ -73,6 +77,8 @@ struct Qdisc {
 	struct gnet_stats_basic_cpu __percpu *cpu_bstats;
 	struct gnet_stats_queue	__percpu *cpu_qstats;
 
+	struct gso_cell __percpu *gso_cpu_skb;
+
 	/*
 	 * For performance sake on SMP, we put highly modified fields at the end
 	 */
@@ -725,6 +731,22 @@ static inline struct sk_buff *qdisc_peek_dequeued(struct Qdisc *sch)
 	return sch->gso_skb;
 }
 
+static inline struct sk_buff *qdisc_peek_dequeued_cpu(struct Qdisc *sch)
+{
+	struct gso_cell *gso = this_cpu_ptr(sch->gso_cpu_skb);
+
+	if (!gso->skb) {
+		struct sk_buff *skb = sch->dequeue(sch);
+
+		if (skb) {
+			gso->skb = skb;
+			qdisc_qstats_cpu_qlen_inc(sch);
+		}
+	}
+
+	return gso->skb;
+}
+
 /* use instead of qdisc->dequeue() for all qdiscs queried with ->peek() */
 static inline struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
 {
@@ -741,6 +763,21 @@ static inline struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
 	return skb;
 }
 
+static inline struct sk_buff *qdisc_dequeue_peeked_skb(struct Qdisc *sch)
+{
+	struct gso_cell *gso = this_cpu_ptr(sch->gso_cpu_skb);
+	struct sk_buff *skb = gso->skb;
+
+	if (skb) {
+		gso->skb = NULL;
+		qdisc_qstats_cpu_qlen_dec(sch);
+	} else {
+		skb = sch->dequeue(sch);
+	}
+
+	return skb;
+}
+
 static inline void __qdisc_reset_queue(struct sk_buff_head *list)
 {
 	/*
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 12ebde8..d713052 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -966,6 +966,12 @@ qdisc_create(struct net_device *dev, struct netdev_queue *dev_queue,
 				goto err_out4;
 		}
 
+		if (sch->flags & TCQ_F_NOLOCK) {
+			sch->gso_cpu_skb = alloc_percpu(struct gso_cell);
+			if (!sch->gso_cpu_skb)
+				goto err_out4;
+		}
+
 		if (tca[TCA_STAB]) {
 			stab = qdisc_get_stab(tca[TCA_STAB]);
 			if (IS_ERR(stab)) {
@@ -1014,6 +1020,7 @@ err_out:
 err_out4:
 	free_percpu(sch->cpu_bstats);
 	free_percpu(sch->cpu_qstats);
+	free_percpu(sch->gso_cpu_skb);
 	/*
 	 * Any broken qdiscs that would require a ops->reset() here?
 	 * The qdisc was never in action so it shouldn't be necessary.
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index fc70204..f903093 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -44,8 +44,25 @@ EXPORT_SYMBOL(default_qdisc_ops);
  * - ingress filtering is also serialized via qdisc root lock
  * - updates to tree and tree walking are only done under the rtnl mutex.
  */
+static inline struct sk_buff *qdisc_dequeue_gso_skb(struct Qdisc *sch)
+{
+	if (sch->gso_cpu_skb)
+		return (this_cpu_ptr(sch->gso_cpu_skb))->skb;
 
-static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
+	return sch->gso_skb;
+}
+
+static inline void qdisc_null_gso_skb(struct Qdisc *sch)
+{
+	if (sch->gso_cpu_skb) {
+		(this_cpu_ptr(sch->gso_cpu_skb))->skb = NULL;
+		return;
+	}
+
+	sch->gso_skb = NULL;
+}
+
+static inline int __dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 {
 	q->gso_skb = skb;
 	q->qstats.requeues++;
@@ -56,6 +73,25 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 	return 0;
 }
 
+static inline int dev_requeue_cpu_skb(struct sk_buff *skb, struct Qdisc *q)
+{
+	this_cpu_ptr(q->gso_cpu_skb)->skb = skb;
+	qdisc_qstats_cpu_requeues_inc(q);
+	qdisc_qstats_cpu_backlog_inc(q, skb);
+	qdisc_qstats_cpu_qlen_inc(q);
+	__netif_schedule(q);
+
+	return 0;
+}
+
+static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
+{
+	if (q->flags & TCQ_F_NOLOCK)
+		return dev_requeue_cpu_skb(skb, q);
+	else
+		return __dev_requeue_skb(skb, q);
+}
+
 static void try_bulk_dequeue_skb(struct Qdisc *q,
 				 struct sk_buff *skb,
 				 const struct netdev_queue *txq,
@@ -111,7 +147,7 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
 static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 				   int *packets)
 {
-	struct sk_buff *skb = q->gso_skb;
+	struct sk_buff *skb = qdisc_dequeue_gso_skb(q);
 	const struct netdev_queue *txq = q->dev_queue;
 
 	*packets = 1;
@@ -121,9 +157,15 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 		/* check the reason of requeuing without tx lock first */
 		txq = skb_get_tx_queue(txq->dev, skb);
 		if (!netif_xmit_frozen_or_stopped(txq)) {
-			q->gso_skb = NULL;
-			qdisc_qstats_backlog_dec(q, skb);
-			q->q.qlen--;
+			qdisc_null_gso_skb(q);
+
+			if (qdisc_is_percpu_stats(q)) {
+				qdisc_qstats_cpu_backlog_inc(q, skb);
+				qdisc_qstats_cpu_qlen_dec(q);
+			} else {
+				qdisc_qstats_backlog_dec(q, skb);
+				q->q.qlen--;
+			}
 		} else
 			skb = NULL;
 		return skb;
@@ -672,6 +714,12 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue,
 			goto errout;
 	}
 
+	if (sch->flags & TCQ_F_NOLOCK) {
+		sch->gso_cpu_skb = alloc_percpu(struct gso_cell);
+		if (!sch->gso_cpu_skb)
+			goto errout;
+	}
+
 	return sch;
 errout:
 	qdisc_destroy(sch);
@@ -708,6 +756,19 @@ static void qdisc_rcu_free(struct rcu_head *head)
 		free_percpu(qdisc->cpu_qstats);
 	}
 
+	if (qdisc->gso_cpu_skb) {
+		int i;
+
+		for_each_possible_cpu(i) {
+			struct gso_cell *cell;
+
+			cell = per_cpu_ptr(qdisc->gso_cpu_skb, i);
+			kfree_skb_list(cell->skb);
+		}
+
+		free_percpu(qdisc->gso_cpu_skb);
+	}
+
 	kfree((char *) qdisc - qdisc->padded);
 }
 


* [RFC PATCH v2 06/10] net: sched: support qdisc_reset on NOLOCK qdisc
  2016-07-14  6:19 [RFC PATCH v2 00/10] running qdiscs without qdisc_lock John Fastabend
                   ` (4 preceding siblings ...)
  2016-07-14  6:21 ` [RFC PATCH v2 05/10] net: sched: per cpu gso handlers John Fastabend
@ 2016-07-14  6:22 ` John Fastabend
  2016-07-14  6:22 ` [RFC PATCH v2 07/10] net: sched: support skb_bad_tx with lockless qdisc John Fastabend
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2016-07-14  6:22 UTC (permalink / raw)
  To: fw, jhs, alexei.starovoitov, eric.dumazet, brouer; +Cc: netdev

The qdisc_reset operation depends on the qdisc lock at the moment
to halt any additions to gso_skb and statistics while the list is
free'd and the stats zeroed.

Without the qdisc lock we can not guarantee another cpu is not in
the process of adding a skb to one of the "cells". Here are the
two cases we have to handle.

 case 1: qdisc_graft operation. In this case a "new" qdisc is attached
	 and the 'qdisc_destroy' operation is called on the old qdisc.
	 The destroy operation will wait an rcu grace period and call
	 qdisc_rcu_free(), at which point gso_cpu_skb is free'd along
	 with all stats, so there is no need to zero stats and
	 gso_cpu_skb from the reset operation itself.

	 Because we can not call qdisc_reset before waiting an rcu
	 grace period (so that the qdisc is detached from all cpus),
	 simply do not call qdisc_reset() at all and let the
	 qdisc_destroy operation clean up the qdisc. Note, a refcnt
	 greater than 1 would cause the destroy operation to be
	 aborted; however, if that ever happened the reference to the
	 qdisc would be lost and we would have a memory leak.

 case 2: dev_deactivate sequence. This can come from a user bringing
	 the interface down which causes the gso_skb list to be flushed
	 and the qlen zero'd. At the moment this is protected by the
	 qdisc lock so while we clear the qlen/gso_skb fields we are
	 guaranteed no new skbs are added. For the lockless case
	 though this is not true. To resolve this move the qdisc_reset
	 call after the new qdisc is assigned and a grace period is
	 exercised to ensure no new skbs can be enqueued. Further
	 the RTNL lock is held so we can not get another call to
	 activate the qdisc while the skb lists are being free'd.

	 Finally, fix qdisc_reset to handle the per cpu stats and
	 skb lists.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/sch_generic.c |   45 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index f903093..8a665dc 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -739,6 +739,20 @@ void qdisc_reset(struct Qdisc *qdisc)
 	kfree_skb(qdisc->skb_bad_txq);
 	qdisc->skb_bad_txq = NULL;
 
+	if (qdisc->gso_cpu_skb) {
+		int i;
+
+		for_each_possible_cpu(i) {
+			struct gso_cell *cell;
+
+			cell = per_cpu_ptr(qdisc->gso_cpu_skb, i);
+			if (cell) {
+				kfree_skb_list(cell->skb);
+				cell = NULL;
+			}
+		}
+	}
+
 	if (qdisc->gso_skb) {
 		kfree_skb_list(qdisc->gso_skb);
 		qdisc->gso_skb = NULL;
@@ -814,10 +828,6 @@ struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 	root_lock = qdisc_lock(oqdisc);
 	spin_lock_bh(root_lock);
 
-	/* Prune old scheduler */
-	if (oqdisc && atomic_read(&oqdisc->refcnt) <= 1)
-		qdisc_reset(oqdisc);
-
 	/* ... and graft new one */
 	if (qdisc == NULL)
 		qdisc = &noop_qdisc;
@@ -931,7 +941,6 @@ static void dev_deactivate_queue(struct net_device *dev,
 			set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
 
 		rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
-		qdisc_reset(qdisc);
 
 		spin_unlock_bh(qdisc_lock(qdisc));
 	}
@@ -968,6 +977,16 @@ static bool some_qdisc_is_busy(struct net_device *dev)
 	return false;
 }
 
+static void dev_qdisc_reset(struct net_device *dev,
+			    struct netdev_queue *dev_queue,
+			    void *none)
+{
+	struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
+
+	if (qdisc)
+		qdisc_reset(qdisc);
+}
+
 /**
  * 	dev_deactivate_many - deactivate transmissions on several devices
  * 	@head: list of devices to deactivate
@@ -978,7 +997,6 @@ static bool some_qdisc_is_busy(struct net_device *dev)
 void dev_deactivate_many(struct list_head *head)
 {
 	struct net_device *dev;
-	bool sync_needed = false;
 
 	list_for_each_entry(dev, head, close_list) {
 		netdev_for_each_tx_queue(dev, dev_deactivate_queue,
@@ -988,20 +1006,27 @@ void dev_deactivate_many(struct list_head *head)
 					     &noop_qdisc);
 
 		dev_watchdog_down(dev);
-		sync_needed |= !dev->dismantle;
 	}
 
 	/* Wait for outstanding qdisc-less dev_queue_xmit calls.
 	 * This is avoided if all devices are in dismantle phase :
 	 * Caller will call synchronize_net() for us
 	 */
-	if (sync_needed)
-		synchronize_net();
+	synchronize_net();
 
 	/* Wait for outstanding qdisc_run calls. */
-	list_for_each_entry(dev, head, close_list)
+	list_for_each_entry(dev, head, close_list) {
 		while (some_qdisc_is_busy(dev))
 			yield();
+
+		/* The new qdisc is assigned at this point so we can safely
+		 * unwind stale skb lists and qdisc statistics
+		 */
+		netdev_for_each_tx_queue(dev, dev_qdisc_reset, NULL);
+		if (dev_ingress_queue(dev))
+			dev_qdisc_reset(dev, dev_ingress_queue(dev), NULL);
+	}
+
 }
 
 void dev_deactivate(struct net_device *dev)


* [RFC PATCH v2 07/10] net: sched: support skb_bad_tx with lockless qdisc
  2016-07-14  6:19 [RFC PATCH v2 00/10] running qdiscs without qdisc_lock John Fastabend
                   ` (5 preceding siblings ...)
  2016-07-14  6:22 ` [RFC PATCH v2 06/10] net: sched: support qdisc_reset on NOLOCK qdisc John Fastabend
@ 2016-07-14  6:22 ` John Fastabend
  2016-07-14  6:23 ` [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue John Fastabend
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2016-07-14  6:22 UTC (permalink / raw)
  To: fw, jhs, alexei.starovoitov, eric.dumazet, brouer; +Cc: netdev

Similar to how gso is handled, skb_bad_tx needs to be per cpu to
handle a lockless qdisc with multiple writers/producers.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |    7 +++
 net/sched/sch_api.c       |    5 ++
 net/sched/sch_generic.c   |   94 +++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 97 insertions(+), 9 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 7b140e2..149f079 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -40,6 +40,10 @@ struct gso_cell {
 	struct sk_buff *skb;
 };
 
+struct bad_txq_cell {
+	struct sk_buff *skb;
+};
+
 struct Qdisc {
 	int 			(*enqueue)(struct sk_buff *skb,
 					   struct Qdisc *sch,
@@ -77,7 +81,8 @@ struct Qdisc {
 	struct gnet_stats_basic_cpu __percpu *cpu_bstats;
 	struct gnet_stats_queue	__percpu *cpu_qstats;
 
-	struct gso_cell __percpu *gso_cpu_skb;
+	struct gso_cell     __percpu *gso_cpu_skb;
+	struct bad_txq_cell __percpu *skb_bad_txq_cpu;
 
 	/*
 	 * For performance sake on SMP, we put highly modified fields at the end
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index d713052..50088e2 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -970,6 +970,10 @@ qdisc_create(struct net_device *dev, struct netdev_queue *dev_queue,
 			sch->gso_cpu_skb = alloc_percpu(struct gso_cell);
 			if (!sch->gso_cpu_skb)
 				goto err_out4;
+
+			sch->skb_bad_txq_cpu = alloc_percpu(struct bad_txq_cell);
+			if (!sch->skb_bad_txq_cpu)
+				goto err_out4;
 		}
 
 		if (tca[TCA_STAB]) {
@@ -1021,6 +1025,7 @@ err_out4:
 	free_percpu(sch->cpu_bstats);
 	free_percpu(sch->cpu_qstats);
 	free_percpu(sch->gso_cpu_skb);
+	free_percpu(sch->skb_bad_txq_cpu);
 	/*
 	 * Any broken qdiscs that would require a ops->reset() here?
 	 * The qdisc was never in action so it shouldn't be necessary.
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 8a665dc..7dcd066 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -44,6 +44,42 @@ EXPORT_SYMBOL(default_qdisc_ops);
  * - ingress filtering is also serialized via qdisc root lock
  * - updates to tree and tree walking are only done under the rtnl mutex.
  */
+static inline struct sk_buff *qdisc_dequeue_skb_bad_txq(struct Qdisc *sch)
+{
+	if (sch->skb_bad_txq_cpu) {
+		struct bad_txq_cell *cell = this_cpu_ptr(sch->skb_bad_txq_cpu);
+
+		return cell->skb;
+	}
+
+	return sch->skb_bad_txq;
+}
+
+static inline void qdisc_enqueue_skb_bad_txq(struct Qdisc *sch,
+					     struct sk_buff *skb)
+{
+	if (sch->skb_bad_txq_cpu) {
+		struct bad_txq_cell *cell = this_cpu_ptr(sch->skb_bad_txq_cpu);
+
+		cell->skb = skb;
+		return;
+	}
+
+	sch->skb_bad_txq = skb;
+}
+
+static inline void qdisc_null_skb_bad_txq(struct Qdisc *sch)
+{
+	if (sch->skb_bad_txq_cpu) {
+		struct bad_txq_cell *cell = this_cpu_ptr(sch->skb_bad_txq_cpu);
+
+		cell->skb = NULL;
+		return;
+	}
+
+	sch->skb_bad_txq = NULL;
+}
+
 static inline struct sk_buff *qdisc_dequeue_gso_skb(struct Qdisc *sch)
 {
 	if (sch->gso_cpu_skb)
@@ -129,9 +165,15 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
 		if (!nskb)
 			break;
 		if (unlikely(skb_get_queue_mapping(nskb) != mapping)) {
-			q->skb_bad_txq = nskb;
-			qdisc_qstats_backlog_inc(q, nskb);
-			q->q.qlen++;
+			qdisc_enqueue_skb_bad_txq(q, nskb);
+
+			if (qdisc_is_percpu_stats(q)) {
+				qdisc_qstats_cpu_backlog_inc(q, nskb);
+				qdisc_qstats_cpu_qlen_inc(q);
+			} else {
+				qdisc_qstats_backlog_inc(q, nskb);
+				q->q.qlen++;
+			}
 			break;
 		}
 		skb->next = nskb;
@@ -160,7 +202,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 			qdisc_null_gso_skb(q);
 
 			if (qdisc_is_percpu_stats(q)) {
-				qdisc_qstats_cpu_backlog_inc(q, skb);
+				qdisc_qstats_cpu_backlog_dec(q, skb);
 				qdisc_qstats_cpu_qlen_dec(q);
 			} else {
 				qdisc_qstats_backlog_dec(q, skb);
@@ -171,14 +213,19 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 		return skb;
 	}
 	*validate = true;
-	skb = q->skb_bad_txq;
+	skb = qdisc_dequeue_skb_bad_txq(q);
 	if (unlikely(skb)) {
 		/* check the reason of requeuing without tx lock first */
 		txq = skb_get_tx_queue(txq->dev, skb);
 		if (!netif_xmit_frozen_or_stopped(txq)) {
-			q->skb_bad_txq = NULL;
-			qdisc_qstats_backlog_dec(q, skb);
-			q->q.qlen--;
+			qdisc_null_skb_bad_txq(q);
+			if (qdisc_is_percpu_stats(q)) {
+				qdisc_qstats_cpu_backlog_dec(q, skb);
+				qdisc_qstats_cpu_qlen_dec(q);
+			} else {
+				qdisc_qstats_backlog_dec(q, skb);
+				q->q.qlen--;
+			}
 			goto bulk;
 		}
 		return NULL;
@@ -718,6 +765,10 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue,
 		sch->gso_cpu_skb = alloc_percpu(struct gso_cell);
 		if (!sch->gso_cpu_skb)
 			goto errout;
+
+		sch->skb_bad_txq_cpu = alloc_percpu(struct bad_txq_cell);
+		if (!sch->skb_bad_txq_cpu)
+			goto errout;
 	}
 
 	return sch;
@@ -748,6 +799,20 @@ void qdisc_reset(struct Qdisc *qdisc)
 			cell = per_cpu_ptr(qdisc->gso_cpu_skb, i);
 			if (cell) {
 				kfree_skb_list(cell->skb);
+				cell->skb = NULL;
+			}
+		}
+	}
+
+	if (qdisc->skb_bad_txq_cpu) {
+		int i;
+
+		for_each_possible_cpu(i) {
+			struct bad_txq_cell *cell;
+
+			cell = per_cpu_ptr(qdisc->skb_bad_txq_cpu, i);
+			if (cell) {
+				kfree_skb(cell->skb);
 				cell = NULL;
 			}
 		}
@@ -783,6 +848,19 @@ static void qdisc_rcu_free(struct rcu_head *head)
 		free_percpu(qdisc->gso_cpu_skb);
 	}
 
+	if (qdisc->skb_bad_txq_cpu) {
+		int i;
+
+		for_each_possible_cpu(i) {
+			struct bad_txq_cell *cell;
+
+			cell = per_cpu_ptr(qdisc->skb_bad_txq_cpu, i);
+			kfree_skb(cell->skb);
+		}
+
+		free_percpu(qdisc->skb_bad_txq_cpu);
+	}
+
 	kfree((char *) qdisc - qdisc->padded);
 }
 


* [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue
  2016-07-14  6:19 [RFC PATCH v2 00/10] running qdiscs without qdisc_lock John Fastabend
                   ` (6 preceding siblings ...)
  2016-07-14  6:22 ` [RFC PATCH v2 07/10] net: sched: support skb_bad_tx with lockless qdisc John Fastabend
@ 2016-07-14  6:23 ` John Fastabend
  2016-07-14 15:11   ` Jesper Dangaard Brouer
  2016-07-14 23:42   ` Alexei Starovoitov
  2016-07-14  6:23 ` [RFC PATCH v2 09/10] net: sched: helper to sum qlen John Fastabend
  2016-07-14  6:24 ` [RFC PATCH v2 10/10] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq John Fastabend
  9 siblings, 2 replies; 21+ messages in thread
From: John Fastabend @ 2016-07-14  6:23 UTC (permalink / raw)
  To: fw, jhs, alexei.starovoitov, eric.dumazet, brouer; +Cc: netdev

This converts the pfifo_fast qdisc to use the alf_queue enqueue and
dequeue routines then sets the NOLOCK bit.

This also removes the logic used to pick the next band to dequeue from
and instead just checks each alf_queue for packets from top priority
to lowest. This might need to be a bit more clever but seems to work
for now.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/sch_generic.c |  131 +++++++++++++++++++++++++++--------------------
 1 file changed, 75 insertions(+), 56 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 7dcd066..2ac3eb9 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -26,6 +26,7 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/if_vlan.h>
+#include <linux/skb_array.h>
 #include <net/sch_generic.h>
 #include <net/pkt_sched.h>
 #include <net/dst.h>
@@ -555,88 +556,79 @@ static const u8 prio2band[TC_PRIO_MAX + 1] = {
 
 /*
  * Private data for a pfifo_fast scheduler containing:
- * 	- queues for the three band
- * 	- bitmap indicating which of the bands contain skbs
+ *	- rings for priority bands
  */
 struct pfifo_fast_priv {
-	u32 bitmap;
-	struct sk_buff_head q[PFIFO_FAST_BANDS];
+	struct skb_array q[PFIFO_FAST_BANDS];
 };
 
-/*
- * Convert a bitmap to the first band number where an skb is queued, where:
- * 	bitmap=0 means there are no skbs on any band.
- * 	bitmap=1 means there is an skb on band 0.
- *	bitmap=7 means there are skbs on all 3 bands, etc.
- */
-static const int bitmap2band[] = {-1, 0, 1, 0, 2, 0, 1, 0};
-
-static inline struct sk_buff_head *band2list(struct pfifo_fast_priv *priv,
-					     int band)
+static inline struct skb_array *band2list(struct pfifo_fast_priv *priv,
+					  int band)
 {
-	return priv->q + band;
+	return &priv->q[band];
 }
 
 static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
 			      struct sk_buff **to_free)
 {
-	if (skb_queue_len(&qdisc->q) < qdisc_dev(qdisc)->tx_queue_len) {
-		int band = prio2band[skb->priority & TC_PRIO_MAX];
-		struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
-		struct sk_buff_head *list = band2list(priv, band);
-
-		priv->bitmap |= (1 << band);
-		qdisc->q.qlen++;
-		return __qdisc_enqueue_tail(skb, qdisc, list);
-	}
+	int band = prio2band[skb->priority & TC_PRIO_MAX];
+	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
+	struct skb_array *q = band2list(priv, band);
+	int err;
 
-	return qdisc_drop(skb, qdisc, to_free);
+	err = skb_array_produce_bh(q, skb);
+
+	if (unlikely(err))
+		return qdisc_drop_cpu(skb, qdisc, to_free);
+
+	qdisc_qstats_cpu_qlen_inc(qdisc);
+	qdisc_qstats_cpu_backlog_inc(qdisc, skb);
+	return NET_XMIT_SUCCESS;
 }
 
 static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 {
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
-	int band = bitmap2band[priv->bitmap];
+	struct sk_buff *skb = NULL;
+	int band;
 
-	if (likely(band >= 0)) {
-		struct sk_buff_head *list = band2list(priv, band);
-		struct sk_buff *skb = __qdisc_dequeue_head(qdisc, list);
+	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
+		struct skb_array *q = band2list(priv, band);
 
-		qdisc->q.qlen--;
-		if (skb_queue_empty(list))
-			priv->bitmap &= ~(1 << band);
+		if (__skb_array_empty(q))
+			continue;
 
-		return skb;
+		skb = skb_array_consume_bh(q);
 	}
 
-	return NULL;
-}
-
-static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
-{
-	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
-	int band = bitmap2band[priv->bitmap];
-
-	if (band >= 0) {
-		struct sk_buff_head *list = band2list(priv, band);
-
-		return skb_peek(list);
+	if (likely(skb)) {
+		qdisc_qstats_cpu_backlog_dec(qdisc, skb);
+		qdisc_bstats_cpu_update(qdisc, skb);
+		qdisc_qstats_cpu_qlen_dec(qdisc);
 	}
 
-	return NULL;
+	return skb;
 }
 
 static void pfifo_fast_reset(struct Qdisc *qdisc)
 {
-	int prio;
+	int i, band;
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
 
-	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++)
-		__qdisc_reset_queue(band2list(priv, prio));
+	for (band = 0; band < PFIFO_FAST_BANDS; band++) {
+		struct skb_array *q = band2list(priv, band);
+		struct sk_buff *skb;
 
-	priv->bitmap = 0;
-	qdisc->qstats.backlog = 0;
-	qdisc->q.qlen = 0;
+		while ((skb = skb_array_consume_bh(q)) != NULL)
+			kfree_skb(skb);
+	}
+
+	for_each_possible_cpu(i) {
+		struct gnet_stats_queue *q = per_cpu_ptr(qdisc->cpu_qstats, i);
+
+		q->backlog = 0;
+		q->qlen = 0;
+	}
 }
 
 static int pfifo_fast_dump(struct Qdisc *qdisc, struct sk_buff *skb)
@@ -654,24 +646,51 @@ nla_put_failure:
 
 static int pfifo_fast_init(struct Qdisc *qdisc, struct nlattr *opt)
 {
-	int prio;
+	unsigned int qlen = qdisc_dev(qdisc)->tx_queue_len;
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
+	int prio;
+
+	/* guard against zero length rings */
+	if (!qlen)
+		return -EINVAL;
+
+	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+		struct skb_array *q = band2list(priv, prio);
+		int err;
 
-	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++)
-		__skb_queue_head_init(band2list(priv, prio));
+		err = skb_array_init(q, qlen, GFP_KERNEL);
+		if (err)
+			return -ENOMEM;
+	}
 
 	/* Can by-pass the queue discipline */
 	qdisc->flags |= TCQ_F_CAN_BYPASS;
+	qdisc->flags |= TCQ_F_NOLOCK;
+	qdisc->flags |= TCQ_F_CPUSTATS;
+
 	return 0;
 }
 
+static void pfifo_fast_destroy(struct Qdisc *sch)
+{
+	struct pfifo_fast_priv *priv = qdisc_priv(sch);
+	int prio;
+
+	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+		struct skb_array *q = band2list(priv, prio);
+
+		skb_array_cleanup(q);
+	}
+}
+
 struct Qdisc_ops pfifo_fast_ops __read_mostly = {
 	.id		=	"pfifo_fast",
 	.priv_size	=	sizeof(struct pfifo_fast_priv),
 	.enqueue	=	pfifo_fast_enqueue,
 	.dequeue	=	pfifo_fast_dequeue,
-	.peek		=	pfifo_fast_peek,
+	.peek		=	qdisc_peek_dequeued,
 	.init		=	pfifo_fast_init,
+	.destroy	=	pfifo_fast_destroy,
 	.reset		=	pfifo_fast_reset,
 	.dump		=	pfifo_fast_dump,
 	.owner		=	THIS_MODULE,


* [RFC PATCH v2 09/10] net: sched: helper to sum qlen
  2016-07-14  6:19 [RFC PATCH v2 00/10] running qdiscs without qdisc_lock John Fastabend
                   ` (7 preceding siblings ...)
  2016-07-14  6:23 ` [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue John Fastabend
@ 2016-07-14  6:23 ` John Fastabend
  2016-07-14  6:24 ` [RFC PATCH v2 10/10] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq John Fastabend
  9 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2016-07-14  6:23 UTC (permalink / raw)
  To: fw, jhs, alexei.starovoitov, eric.dumazet, brouer; +Cc: netdev

Reporting qlen when qlen is per cpu requires aggregating the per
cpu counters. This adds a helper routine for this.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 149f079..d370fee 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -271,6 +271,21 @@ static inline int qdisc_qlen(const struct Qdisc *q)
 	return q->q.qlen;
 }
 
+static inline int qdisc_qlen_sum(const struct Qdisc *q)
+{
+	__u32 qlen = 0;
+	int i;
+
+	if (q->flags & TCQ_F_NOLOCK) {
+		for_each_possible_cpu(i)
+			qlen += per_cpu_ptr(q->cpu_qstats, i)->qlen;
+	} else {
+		qlen = q->q.qlen;
+	}
+
+	return qlen;
+}
+
 static inline struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb)
 {
 	return (struct qdisc_skb_cb *)skb->cb;


* [RFC PATCH v2 10/10] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq
  2016-07-14  6:19 [RFC PATCH v2 00/10] running qdiscs without qdisc_lock John Fastabend
                   ` (8 preceding siblings ...)
  2016-07-14  6:23 ` [RFC PATCH v2 09/10] net: sched: helper to sum qlen John Fastabend
@ 2016-07-14  6:24 ` John Fastabend
  9 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2016-07-14  6:24 UTC (permalink / raw)
  To: fw, jhs, alexei.starovoitov, eric.dumazet, brouer; +Cc: netdev

The sch_mq qdisc creates a sub-qdisc per tx queue; these are then
called independently for enqueue and dequeue operations. However,
statistics are aggregated and pushed up to the "master" qdisc.

This patch adds support for any of the sub-qdiscs to be per cpu
statistics qdiscs. To handle this case add a check when calculating
stats and aggregate the per cpu stats if needed.

Also export __gnet_stats_copy_queue() for use as a helper function.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/gen_stats.h |    3 +++
 net/core/gen_stats.c    |    9 +++++----
 net/sched/sch_mq.c      |   25 ++++++++++++++++++-------
 3 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/include/net/gen_stats.h b/include/net/gen_stats.h
index 231e121..5ddc88b 100644
--- a/include/net/gen_stats.h
+++ b/include/net/gen_stats.h
@@ -47,6 +47,9 @@ int gnet_stats_copy_rate_est(struct gnet_dump *d,
 int gnet_stats_copy_queue(struct gnet_dump *d,
 			  struct gnet_stats_queue __percpu *cpu_q,
 			  struct gnet_stats_queue *q, __u32 qlen);
+void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
+			     const struct gnet_stats_queue __percpu *cpu_q,
+			     const struct gnet_stats_queue *q, __u32 qlen);
 int gnet_stats_copy_app(struct gnet_dump *d, void *st, int len);
 
 int gnet_stats_finish_copy(struct gnet_dump *d);
diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c
index 508e051..a503547 100644
--- a/net/core/gen_stats.c
+++ b/net/core/gen_stats.c
@@ -254,10 +254,10 @@ __gnet_stats_copy_queue_cpu(struct gnet_stats_queue *qstats,
 	}
 }
 
-static void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
-				    const struct gnet_stats_queue __percpu *cpu,
-				    const struct gnet_stats_queue *q,
-				    __u32 qlen)
+void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
+			     const struct gnet_stats_queue __percpu *cpu,
+			     const struct gnet_stats_queue *q,
+			     __u32 qlen)
 {
 	if (cpu) {
 		__gnet_stats_copy_queue_cpu(qstats, cpu);
@@ -271,6 +271,7 @@ static void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
 
 	qstats->qlen = qlen;
 }
+EXPORT_SYMBOL(__gnet_stats_copy_queue);
 
 /**
  * gnet_stats_copy_queue - copy queue statistics into statistics TLV
diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index b943982..f4b5bbb 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -17,6 +17,7 @@
 #include <linux/skbuff.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
+#include <net/sch_generic.h>
 
 struct mq_sched {
 	struct Qdisc		**qdiscs;
@@ -107,15 +108,25 @@ static int mq_dump(struct Qdisc *sch, struct sk_buff *skb)
 	memset(&sch->qstats, 0, sizeof(sch->qstats));
 
 	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+		struct gnet_stats_basic_cpu __percpu *cpu_bstats = NULL;
+		struct gnet_stats_queue __percpu *cpu_qstats = NULL;
+		__u32 qlen = 0;
+
 		qdisc = netdev_get_tx_queue(dev, ntx)->qdisc_sleeping;
 		spin_lock_bh(qdisc_lock(qdisc));
-		sch->q.qlen		+= qdisc->q.qlen;
-		sch->bstats.bytes	+= qdisc->bstats.bytes;
-		sch->bstats.packets	+= qdisc->bstats.packets;
-		sch->qstats.backlog	+= qdisc->qstats.backlog;
-		sch->qstats.drops	+= qdisc->qstats.drops;
-		sch->qstats.requeues	+= qdisc->qstats.requeues;
-		sch->qstats.overlimits	+= qdisc->qstats.overlimits;
+
+		if (qdisc_is_percpu_stats(qdisc)) {
+			cpu_bstats = qdisc->cpu_bstats;
+			cpu_qstats = qdisc->cpu_qstats;
+		}
+
+		qlen = qdisc_qlen_sum(qdisc);
+
+		__gnet_stats_copy_basic(NULL, &sch->bstats,
+					cpu_bstats, &qdisc->bstats);
+		__gnet_stats_copy_queue(&sch->qstats,
+					cpu_qstats, &qdisc->qstats, qlen);
+
 		spin_unlock_bh(qdisc_lock(qdisc));
 	}
 	return 0;


* Re: [RFC PATCH v2 01/10] net: sched: allow qdiscs to handle locking
  2016-07-14  6:19 ` [RFC PATCH v2 01/10] net: sched: allow qdiscs to handle locking John Fastabend
@ 2016-07-14  6:44   ` John Fastabend
  0 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2016-07-14  6:44 UTC (permalink / raw)
  To: fw, jhs, alexei.starovoitov, eric.dumazet, brouer; +Cc: netdev

On 16-07-13 11:19 PM, John Fastabend wrote:
> This patch adds a flag for queueing disciplines to indicate the stack
> does not need to use the qdisc lock to protect operations. This can
> be used to build lockless scheduling algorithms and improve
> performance.
> 
> The flag is checked in the tx path and the qdisc lock is only taken
> if it is not set. For now use a conditional if statement. Later we
> could be more aggressive if it proves worthwhile and use a static key
> or wrap this in a likely().
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---

[...]

> @@ -3075,6 +3075,27 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  	int rc;
>  
>  	qdisc_calculate_pkt_len(skb, q);
> +
> +	if (q->flags & TCQ_F_NOLOCK) {
> +		if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> +			__qdisc_drop(skb, &to_free);
> +			rc = NET_XMIT_DROP;
> +		} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q)) {
> +			qdisc_bstats_cpu_update(q, skb);
> +			__qdisc_run(q);

Reviewing these patches now and I noticed this __qdisc_run() is not
needed.

> +			if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
> +				__qdisc_run(q);
> +			rc = NET_XMIT_SUCCESS;
> +		} else {
> +			rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> +			__qdisc_run(q);
> +		}
> +
> +		if (unlikely(to_free))
> +			kfree_skb_list(to_free);
> +		return rc;
> +	}
> +

[...]

Thanks,
John


* Re: [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue
  2016-07-14  6:23 ` [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue John Fastabend
@ 2016-07-14 15:11   ` Jesper Dangaard Brouer
  2016-07-15  0:09     ` John Fastabend
  2016-07-14 23:42   ` Alexei Starovoitov
  1 sibling, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-14 15:11 UTC (permalink / raw)
  To: John Fastabend; +Cc: fw, jhs, alexei.starovoitov, eric.dumazet, netdev, brouer

On Wed, 13 Jul 2016 23:23:12 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> This converts the pfifo_fast qdisc to use the alf_queue enqueue and
> dequeue routines then sets the NOLOCK bit.

Should have said skb_array, not alf_queue ;-)

> This also removes the logic used to pick the next band to dequeue from
> and instead just checks each alf_queue for packets from top priority
> to lowest. This might need to be a bit more clever but seems to work
> for now.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  net/sched/sch_generic.c |  131 +++++++++++++++++++++++++++--------------------
>  1 file changed, 75 insertions(+), 56 deletions(-)
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 7dcd066..2ac3eb9 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -26,6 +26,7 @@
>  #include <linux/list.h>
>  #include <linux/slab.h>
>  #include <linux/if_vlan.h>
> +#include <linux/skb_array.h>
>  #include <net/sch_generic.h>
>  #include <net/pkt_sched.h>
>  #include <net/dst.h>
> @@ -555,88 +556,79 @@ static const u8 prio2band[TC_PRIO_MAX + 1] = {
>  
>  /*
>   * Private data for a pfifo_fast scheduler containing:
> - * 	- queues for the three band
> - * 	- bitmap indicating which of the bands contain skbs
> + *	- rings for priority bands
>   */
>  struct pfifo_fast_priv {
> -	u32 bitmap;
> -	struct sk_buff_head q[PFIFO_FAST_BANDS];
> +	struct skb_array q[PFIFO_FAST_BANDS];
>  };
>  
> -/*
> - * Convert a bitmap to the first band number where an skb is queued, where:
> - * 	bitmap=0 means there are no skbs on any band.
> - * 	bitmap=1 means there is an skb on band 0.
> - *	bitmap=7 means there are skbs on all 3 bands, etc.
> - */
> -static const int bitmap2band[] = {-1, 0, 1, 0, 2, 0, 1, 0};
> -
> -static inline struct sk_buff_head *band2list(struct pfifo_fast_priv *priv,
> -					     int band)
> +static inline struct skb_array *band2list(struct pfifo_fast_priv *priv,
> +					  int band)
>  {
> -	return priv->q + band;
> +	return &priv->q[band];
>  }
>  
>  static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
>  			      struct sk_buff **to_free)
>  {
> -	if (skb_queue_len(&qdisc->q) < qdisc_dev(qdisc)->tx_queue_len) {
> -		int band = prio2band[skb->priority & TC_PRIO_MAX];
> -		struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
> -		struct sk_buff_head *list = band2list(priv, band);
> -
> -		priv->bitmap |= (1 << band);
> -		qdisc->q.qlen++;
> -		return __qdisc_enqueue_tail(skb, qdisc, list);
> -	}
> +	int band = prio2band[skb->priority & TC_PRIO_MAX];
> +	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
> +	struct skb_array *q = band2list(priv, band);
> +	int err;
>  
> -	return qdisc_drop(skb, qdisc, to_free);
> +	err = skb_array_produce_bh(q, skb);

Do you need the _bh variant here?  (Doesn't the qdisc run with BH disabled?)
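
The two variants only differ in how the producer lock is taken;
paraphrasing include/linux/skb_array.h:

static inline int skb_array_produce(struct skb_array *a, struct sk_buff *skb)
{
	return ptr_ring_produce(&a->ring, skb);    /* spin_lock() */
}

static inline int skb_array_produce_bh(struct skb_array *a, struct sk_buff *skb)
{
	return ptr_ring_produce_bh(&a->ring, skb); /* spin_lock_bh() */
}

If we are already in BH context on this path the plain variant
should be enough.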


> +
> +	if (unlikely(err))
> +		return qdisc_drop_cpu(skb, qdisc, to_free);
> +
> +	qdisc_qstats_cpu_qlen_inc(qdisc);
> +	qdisc_qstats_cpu_backlog_inc(qdisc, skb);
> +	return NET_XMIT_SUCCESS;
>  }
>  
>  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>  {
>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
> -	int band = bitmap2band[priv->bitmap];
> +	struct sk_buff *skb = NULL;
> +	int band;
>  
> -	if (likely(band >= 0)) {
> -		struct sk_buff_head *list = band2list(priv, band);
> -		struct sk_buff *skb = __qdisc_dequeue_head(qdisc, list);
> +	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
> +		struct skb_array *q = band2list(priv, band);
>  
> -		qdisc->q.qlen--;
> -		if (skb_queue_empty(list))
> -			priv->bitmap &= ~(1 << band);
> +		if (__skb_array_empty(q))
> +			continue;
>  
> -		return skb;
> +		skb = skb_array_consume_bh(q);

Also _bh variant.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue
  2016-07-14  6:23 ` [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue John Fastabend
  2016-07-14 15:11   ` Jesper Dangaard Brouer
@ 2016-07-14 23:42   ` Alexei Starovoitov
  2016-07-15  0:07     ` John Fastabend
  1 sibling, 1 reply; 21+ messages in thread
From: Alexei Starovoitov @ 2016-07-14 23:42 UTC (permalink / raw)
  To: John Fastabend; +Cc: fw, jhs, eric.dumazet, brouer, netdev

On Wed, Jul 13, 2016 at 11:23:12PM -0700, John Fastabend wrote:
> This converts the pfifo_fast qdisc to use the alf_queue enqueue and
> dequeue routines then sets the NOLOCK bit.
> 
> This also removes the logic used to pick the next band to dequeue from
> and instead just checks each alf_queue for packets from top priority
> to lowest. This might need to be a bit more clever but seems to work
> for now.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  net/sched/sch_generic.c |  131 +++++++++++++++++++++++++++--------------------

>  static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
>  			      struct sk_buff **to_free)
>  {
> -	return qdisc_drop(skb, qdisc, to_free);
> +	err = skb_array_produce_bh(q, skb);
..
>  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>  {
> +		skb = skb_array_consume_bh(q);

For this particular qdisc the performance gain should come from the
granularity of the spin_lock, right?
Before we were taking the lock much earlier. Here we keep the lock,
but only for a very short time.
        original pps        lockless        diff
        1       1418168             1269450         -148718
        2       1587390             1553408         -33982
        4       1084961             1683639         +598678
        8       989636              1522723         +533087
        12      1014018             1348172         +334154
so perf for the 1 cpu case is mainly due to array vs list,
since the number of locks is still the same and there is no collision?
but then why does the shorter lock give higher overhead in the multi cpu cases?
That would have been the main target for performance improvement?

Looks like mq gets the most benefit, because it's lockless internally
which makes sense.
In general I think this is the right direction where tc infra should move to.
I'm only not sure whether it's worth converting pfifo to skb_array.
Probably alf queue would have been a better alternative.


* Re: [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue
  2016-07-14 23:42   ` Alexei Starovoitov
@ 2016-07-15  0:07     ` John Fastabend
  2016-07-15 11:23       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 21+ messages in thread
From: John Fastabend @ 2016-07-15  0:07 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: fw, jhs, eric.dumazet, brouer, netdev

On 16-07-14 04:42 PM, Alexei Starovoitov wrote:
> On Wed, Jul 13, 2016 at 11:23:12PM -0700, John Fastabend wrote:
>> This converts the pfifo_fast qdisc to use the alf_queue enqueue and
>> dequeue routines then sets the NOLOCK bit.
>>
>> This also removes the logic used to pick the next band to dequeue from
>> and instead just checks each alf_queue for packets from top priority
>> to lowest. This might need to be a bit more clever but seems to work
>> for now.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---
>>  net/sched/sch_generic.c |  131 +++++++++++++++++++++++++++--------------------
> 
>>  static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
>>  			      struct sk_buff **to_free)
>>  {
>> -	return qdisc_drop(skb, qdisc, to_free);
>> +	err = skb_array_produce_bh(q, skb);
> ..
>>  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>  {
>> +		skb = skb_array_consume_bh(q);
> 
> For this particular qdisc the performance gain should come from
> granularity of the spin_lock, right?

And the fact that the consumer and producer are using different
locks now.

> Before we were taking the lock much earlier. Here we keep the lock,
> but for the very short time.
>         original pps        lockless        diff
>         1       1418168             1269450         -148718
>         2       1587390             1553408         -33982
>         4       1084961             1683639         +598678
>         8       989636              1522723         +533087
>         12      1014018             1348172         +334154
> so perf for 1 cpu case is mainly due to array vs list,
> since number of locks is still the same and there is no collision ?
> but then why does the shorter lock give higher overhead in multi cpu cases?

So in this case running pfifo_fast as the root qdisc with 12 threads
means we have 12 producers hitting a single enqueue() path, whereas with
mq, looking only at the pktgen numbers, we have one thread for each
skb_array.

> That would have been the main target for performance improvement?
> 

Maybe I should fire up a TCP test with 1000's of threads to see what
the perf numbers look like.

> Looks like mq gets the most benefit, because it's lockless internally
> which makes sense.
> In general I think this is the right direction where tc infra should move to.
> I'm only not sure whether it's worth converting pfifo to skb_array.
> Probably alf queue would have been a better alternative.
> 

Tomorrow's task is to resurrect the alf_queue and look at its numbers
compared to this. Today was spent trying to remove the HARD_TX_LOCK
that protects the driver; in the mq case it seems this is not really
needed either.

.John

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue
  2016-07-14 15:11   ` Jesper Dangaard Brouer
@ 2016-07-15  0:09     ` John Fastabend
  2016-07-15 10:09       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 21+ messages in thread
From: John Fastabend @ 2016-07-15  0:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: fw, jhs, alexei.starovoitov, eric.dumazet, netdev

On 16-07-14 08:11 AM, Jesper Dangaard Brouer wrote:
> On Wed, 13 Jul 2016 23:23:12 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>> This converts the pfifo_fast qdisc to use the alf_queue enqueue and
>> dequeue routines then sets the NOLOCK bit.
> 
> Should have said skb_array, not alf_queue ;-)

I think I'll get numbers for the alf_queue to so we can compare them
as well. On the todo list for tomorrow.

> 
>> This also removes the logic used to pick the next band to dequeue from
>> and instead just checks each alf_queue for packets from top priority
>> to lowest. This might need to be a bit more clever but seems to work
>> for now.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---
>>  net/sched/sch_generic.c |  131 +++++++++++++++++++++++++++--------------------
>>  1 file changed, 75 insertions(+), 56 deletions(-)
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 7dcd066..2ac3eb9 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -26,6 +26,7 @@
>>  #include <linux/list.h>
>>  #include <linux/slab.h>
>>  #include <linux/if_vlan.h>
>> +#include <linux/skb_array.h>
>>  #include <net/sch_generic.h>
>>  #include <net/pkt_sched.h>
>>  #include <net/dst.h>
>> @@ -555,88 +556,79 @@ static const u8 prio2band[TC_PRIO_MAX + 1] = {
>>  
>>  /*
>>   * Private data for a pfifo_fast scheduler containing:
>> - * 	- queues for the three band
>> - * 	- bitmap indicating which of the bands contain skbs
>> + *	- rings for priority bands
>>   */
>>  struct pfifo_fast_priv {
>> -	u32 bitmap;
>> -	struct sk_buff_head q[PFIFO_FAST_BANDS];
>> +	struct skb_array q[PFIFO_FAST_BANDS];
>>  };
>>  
>> -/*
>> - * Convert a bitmap to the first band number where an skb is queued, where:
>> - * 	bitmap=0 means there are no skbs on any band.
>> - * 	bitmap=1 means there is an skb on band 0.
>> - *	bitmap=7 means there are skbs on all 3 bands, etc.
>> - */
>> -static const int bitmap2band[] = {-1, 0, 1, 0, 2, 0, 1, 0};
>> -
>> -static inline struct sk_buff_head *band2list(struct pfifo_fast_priv *priv,
>> -					     int band)
>> +static inline struct skb_array *band2list(struct pfifo_fast_priv *priv,
>> +					  int band)
>>  {
>> -	return priv->q + band;
>> +	return &priv->q[band];
>>  }
>>  
>>  static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
>>  			      struct sk_buff **to_free)
>>  {
>> -	if (skb_queue_len(&qdisc->q) < qdisc_dev(qdisc)->tx_queue_len) {
>> -		int band = prio2band[skb->priority & TC_PRIO_MAX];
>> -		struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>> -		struct sk_buff_head *list = band2list(priv, band);
>> -
>> -		priv->bitmap |= (1 << band);
>> -		qdisc->q.qlen++;
>> -		return __qdisc_enqueue_tail(skb, qdisc, list);
>> -	}
>> +	int band = prio2band[skb->priority & TC_PRIO_MAX];
>> +	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>> +	struct skb_array *q = band2list(priv, band);
>> +	int err;
>>  
>> -	return qdisc_drop(skb, qdisc, to_free);
>> +	err = skb_array_produce_bh(q, skb);
> 
> Do you need the _bh variant here?  (Doesn't the qdisc run with BH disabled?)
> 
> 

Yep, it's inside rcu_read_lock_bh().

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue
  2016-07-15  0:09     ` John Fastabend
@ 2016-07-15 10:09       ` Jesper Dangaard Brouer
  2016-07-15 17:29         ` John Fastabend
  0 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-15 10:09 UTC (permalink / raw)
  To: John Fastabend
  Cc: fw, jhs, alexei.starovoitov, eric.dumazet, netdev, brouer,
	Michael S. Tsirkin

On Thu, 14 Jul 2016 17:09:43 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> >>  static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
> >>  			      struct sk_buff **to_free)
> >>  {
> >> -	if (skb_queue_len(&qdisc->q) < qdisc_dev(qdisc)->tx_queue_len) {
> >> -		int band = prio2band[skb->priority & TC_PRIO_MAX];
> >> -		struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
> >> -		struct sk_buff_head *list = band2list(priv, band);
> >> -
> >> -		priv->bitmap |= (1 << band);
> >> -		qdisc->q.qlen++;
> >> -		return __qdisc_enqueue_tail(skb, qdisc, list);
> >> -	}
> >> +	int band = prio2band[skb->priority & TC_PRIO_MAX];
> >> +	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
> >> +	struct skb_array *q = band2list(priv, band);
> >> +	int err;
> >>  
> >> -	return qdisc_drop(skb, qdisc, to_free);
> >> +	err = skb_array_produce_bh(q, skb);  
> > 
> > Do you need the _bh variant here?  (Doesn't the qdisc run with BH disabled?)
> > 
> >   
> 
> Yep, it's inside rcu_read_lock_bh().

The call rcu_read_lock_bh() already disables BH (local_bh_disable()).
Thus, you can use the normal variant skb_array_produce(), which is
(approx 20 cycles) faster than the _bh variant...

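So the enqueue side could become something like this (sketch only, with
the per-CPU stats handling from the earlier patches elided):

static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
			      struct sk_buff **to_free)
{
	int band = prio2band[skb->priority & TC_PRIO_MAX];
	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
	struct skb_array *q = band2list(priv, band);
	int err;

	/* plain variant: the caller (__dev_xmit_skb) already runs under
	 * rcu_read_lock_bh(), so BH is disabled for this whole section
	 */
	err = skb_array_produce(q, skb);
	if (unlikely(err))		/* ring full, count as a drop */
		return qdisc_drop(skb, qdisc, to_free);

	/* bump per-CPU qlen/backlog for the enqueued skb here */
	return NET_XMIT_SUCCESS;
}
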
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue
  2016-07-15  0:07     ` John Fastabend
@ 2016-07-15 11:23       ` Jesper Dangaard Brouer
  2016-07-15 22:18         ` John Fastabend
  0 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-15 11:23 UTC (permalink / raw)
  To: John Fastabend; +Cc: Alexei Starovoitov, fw, jhs, eric.dumazet, netdev, brouer

On Thu, 14 Jul 2016 17:07:33 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> On 16-07-14 04:42 PM, Alexei Starovoitov wrote:
> > On Wed, Jul 13, 2016 at 11:23:12PM -0700, John Fastabend wrote:  
> >> This converts the pfifo_fast qdisc to use the alf_queue enqueue and
> >> dequeue routines then sets the NOLOCK bit.
> >>
> >> This also removes the logic used to pick the next band to dequeue from
> >> and instead just checks each alf_queue for packets from top priority
> >> to lowest. This might need to be a bit more clever but seems to work
> >> for now.
> >>
> >> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> >> ---
> >>  net/sched/sch_generic.c |  131 +++++++++++++++++++++++++++--------------------  
> >   
> >>  static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
> >>  			      struct sk_buff **to_free)
> >>  {
> >> -	return qdisc_drop(skb, qdisc, to_free);
> >> +	err = skb_array_produce_bh(q, skb);  
> > ..  
> >>  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
> >>  {
> >> +		skb = skb_array_consume_bh(q);  
> > 
> > For this particular qdisc the performance gain should come from
> > granularity of the spin_lock, right?
> 
> And the fact that the consumer and producer are using different
> locks now.

Yes. Splitting up enqueue'ers (producers) from the dequeuer (consumer)
is an important step, because today the qdisc layer has this problem
that enqueue'ers can starve the single dequeuer.  The current
mitigation tricks are the enq busy_lock and bulk dequeue.

As John says, using skb_array causes producers and the consumer to use
different locks.

> > Before we were taking the lock much earlier. Here we keep the lock,
> > but for the very short time.
> >         original pps        lockless        diff
> >         1       1418168             1269450         -148718
> >         2       1587390             1553408         -33982
> >         4       1084961             1683639         +598678
> >         8       989636              1522723         +533087
> >         12      1014018             1348172         +334154
> >

It looks problematic that lockless is slower in the base single CPU
case, orig (1418168 pps) vs lockless (1269450 pps), by approx
(1/1418168 - 1/1269450)*10^9 = -82.60 ns per packet.  That base
"regression" looks too high to start with.

How I view the results:
Negative scaling starts early for "original", which is the main problem
that we are looking to solve.  The lockless does not show as much
scaling effect as expected, and starts to show a trend towards negative
scaling.

> > so perf for 1 cpu case is mainly due to array vs list,
> > since number of locks is still the same and there is no collision ?
> > but then why does the shorter lock give higher overhead in multi cpu cases?
> 
> So in this case running pfifo_fast as the root qdisc with 12 threads
> means we have 12 producers hitting a single enqueue() path, whereas with
> mq, looking only at the pktgen numbers, we have one thread for each
> skb_array.

The skb_array allows multi producer multi consumer (MPMC), but is
optimized towards the case of a single producer and a single consumer
(SPSC).  The SPSC queue is usually implemented with much less
synchronization, but then you need some other guarantee that only a
single producer/consumer can be running.

> > That would have been the main target for performance improvement?
> >   
> 
> Maybe I should fire up a TCP test with 1000's of threads to see what
> the perf numbers look like.
> 
> > Looks like mq gets the most benefit, because it's lockless internally
> > which makes sense.
> > In general I think this is the right direction where tc infra should move to.
> > I'm only not sure whether it's worth converting pfifo to skb_array.
> > Probably alf queue would have been a better alternative.
> >   
> 
> Tomorrow's task is to resurrect the alf_queue and look at its numbers
> compared to this. Today was spent trying to remove the HARD_TX_LOCK
> that protects the driver; in the mq case it seems this is not really
> needed either.

I would be very interested in the alf_queue numbers, as alf_queue
should perform better with multiple producers.  If you kept the single
TC dequeuer rule, then you could use the MPSC variant of alf_queue.

My benchmarks from 2014 are avail in this presentation:
 http://people.netfilter.org/hawk/presentations/nfws2014/dp-accel-qdisc-lockless.pdf
Also showing the mitigation effect of bulk-dequeue patches slide 12
(But I would like to see some new benchmarks for alf_queue slide 14)


But do notice, alf_queue has a design problem that skb_array solved.
Alf_queue has the problem of read-bouncing the remote/opposite CPU's
cache line, because it needs to (1) at enqueue (producer) check if there
is free space (reading consumer.tail), and (2) at dequeue (consumer)
check if any objects are available (reading producer.tail).

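In code terms, the alf_queue-style producer-side free-space check is
roughly the following (simplified sketch, field names assumed):

static inline bool alf_queue_has_space(u32 prod_head, u32 cons_tail,
				       u32 size)
{
	/* the producer must read the consumer's tail index, which sits
	 * in a cache line the consumer CPU keeps writing -- that read
	 * is the cross-CPU cache-line bounce described above
	 */
	return (prod_head - cons_tail) < size;
}
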
The alf_queue mitigates this design problem by allowing bulk
enqueue and bulk dequeue. But I guess you will not use that "mode";
I think I did in my NFWS2014 benchmarks.

The skb_array solved this design problem by using the content of the
queue objects as a marker for the free/full condition.  The alf_queue
cannot do so easily, because of its bulk design, which needs to reserve
part of the queue.  Thus, we likely need some new queue design, which is
a hybrid between alf_queue and skb_array, for this use-case.

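For reference, the skb_array/ptr_ring scheme boils down to roughly the
sketch below (simplified, not verbatim kernel code; barriers and
cacheline padding omitted): each side takes only its own lock and uses
the slot pointer itself as the empty/full marker, so neither side has
to read the other side's index.

struct ring_sketch {
	void		**queue;
	int		size;
	int		producer;
	spinlock_t	producer_lock;
	int		consumer;
	spinlock_t	consumer_lock;
};

static int ring_produce(struct ring_sketch *r, void *ptr)
{
	int ret = -ENOSPC;

	spin_lock(&r->producer_lock);	/* producers serialize here only */
	if (!r->queue[r->producer]) {	/* NULL slot == free, no remote index read */
		r->queue[r->producer++] = ptr;
		if (r->producer >= r->size)
			r->producer = 0;
		ret = 0;
	}
	spin_unlock(&r->producer_lock);
	return ret;
}

static void *ring_consume(struct ring_sketch *r)
{
	void *ptr;

	spin_lock(&r->consumer_lock);	/* independent of the producer lock */
	ptr = r->queue[r->consumer];
	if (ptr) {
		r->queue[r->consumer++] = NULL;	/* clearing the slot marks it free */
		if (r->consumer >= r->size)
			r->consumer = 0;
	}
	spin_unlock(&r->consumer_lock);
	return ptr;
}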


I actually wrote some queue micro-benchmarks that show the strengths and
weaknesses of both skb_array[1] and alf_queue[2].

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/alf_queue_parallel01.c
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue
  2016-07-15 10:09       ` Jesper Dangaard Brouer
@ 2016-07-15 17:29         ` John Fastabend
  0 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2016-07-15 17:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: fw, jhs, alexei.starovoitov, eric.dumazet, netdev, Michael S. Tsirkin

On 16-07-15 03:09 AM, Jesper Dangaard Brouer wrote:
> On Thu, 14 Jul 2016 17:09:43 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>>>>  static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
>>>>  			      struct sk_buff **to_free)
>>>>  {
>>>> -	if (skb_queue_len(&qdisc->q) < qdisc_dev(qdisc)->tx_queue_len) {
>>>> -		int band = prio2band[skb->priority & TC_PRIO_MAX];
>>>> -		struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>>>> -		struct sk_buff_head *list = band2list(priv, band);
>>>> -
>>>> -		priv->bitmap |= (1 << band);
>>>> -		qdisc->q.qlen++;
>>>> -		return __qdisc_enqueue_tail(skb, qdisc, list);
>>>> -	}
>>>> +	int band = prio2band[skb->priority & TC_PRIO_MAX];
>>>> +	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>>>> +	struct skb_array *q = band2list(priv, band);
>>>> +	int err;
>>>>  
>>>> -	return qdisc_drop(skb, qdisc, to_free);
>>>> +	err = skb_array_produce_bh(q, skb);  
>>>
>>> Do you need the _bh variant here?  (Doesn't the qdisc run with BH disabled?)
>>>
>>>   
>>
>> Yep, it's inside rcu_read_lock_bh().
> 
> The call rcu_read_lock_bh() already disables BH (local_bh_disable()).
> Thus, you can use the normal variant skb_array_produce(), which is
> (approx 20 cycles) faster than the _bh variant...
> 

hah I was agreeing with you as in yep no need for the _bh variant :)
I must have been low on coffee or something when I wrote that response
because when I read it now it sounds like I really think the _bh is
needed.

At any rate _bh removed thanks!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue
  2016-07-15 11:23       ` Jesper Dangaard Brouer
@ 2016-07-15 22:18         ` John Fastabend
  2016-07-15 22:48           ` Alexei Starovoitov
  0 siblings, 1 reply; 21+ messages in thread
From: John Fastabend @ 2016-07-15 22:18 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Alexei Starovoitov, fw, jhs, eric.dumazet, netdev

On 16-07-15 04:23 AM, Jesper Dangaard Brouer wrote:
> On Thu, 14 Jul 2016 17:07:33 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>> On 16-07-14 04:42 PM, Alexei Starovoitov wrote:
>>> On Wed, Jul 13, 2016 at 11:23:12PM -0700, John Fastabend wrote:  
>>>> This converts the pfifo_fast qdisc to use the alf_queue enqueue and
>>>> dequeue routines then sets the NOLOCK bit.
>>>>
>>>> This also removes the logic used to pick the next band to dequeue from
>>>> and instead just checks each alf_queue for packets from top priority
>>>> to lowest. This might need to be a bit more clever but seems to work
>>>> for now.
>>>>
>>>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>>>> ---
>>>>  net/sched/sch_generic.c |  131 +++++++++++++++++++++++++++--------------------  
>>>   
>>>>  static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
>>>>  			      struct sk_buff **to_free)
>>>>  {
>>>> -	return qdisc_drop(skb, qdisc, to_free);
>>>> +	err = skb_array_produce_bh(q, skb);  
>>> ..  
>>>>  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>>  {
>>>> +		skb = skb_array_consume_bh(q);  
>>>
>>> For this particular qdisc the performance gain should come from
>>> granularity of the spin_lock, right?
>>
>> And the fact that the consumer and producer are using different
>> locks now.
> 
> Yes. Splitting up enqueue'ers (producers) from the dequeuer (consumer)
> is an important step, because today the qdisc layer has this problem
> that enqueue'ers can starve the single dequeuer.  The current
> mitigation tricks are the enq busy_lock and bulk dequeue.
> 
> As John says, using skb_array causes producers and the consumer to use
> different locks.
> 
>>> Before we were taking the lock much earlier. Here we keep the lock,
>>> but for the very short time.
>>>         original pps        lockless        diff
>>>         1       1418168             1269450         -148718
>>>         2       1587390             1553408         -33982
>>>         4       1084961             1683639         +598678
>>>         8       989636              1522723         +533087
>>>         12      1014018             1348172         +334154
>>>
> 

I was able to recover the performance loss here and actually improve it
by fixing a few things in the patchset. Namely, qdisc_run was
being called in a few places unnecessarily, creating a fairly large per
packet overhead, and using the _bh locks was costing quite a
bit and is not needed, as Jesper pointed out.

So new pps data here in somewhat raw format. I ran five iterations of
each thread count (1,2,4,8,12)

nolock (pfifo_fast)
1:  1440293 1421602 1409553 1393469 1424543
2:  1754890 1819292 1727948 1797711 1743427
4:  3282665 3344095 3315220 3332777 3348972
8:  2940079 1644450 2950777 2922085 2946310
12: 2042084 2610060 2857581 3493162 3104611

lock (pfifo_fast)
1:  1471479 1469142 1458825 1456788 1453952
2:  1746231 1749490 1753176 1753780 1755959
4:  1119626 1120515 1121478 1119220 1121115
8:  1001471  999308 1000318 1000776 1000384
12:  989269  992122  991590  986581  990430

nolock (mq)
1:   1435952  1459523  1448860  1385451   1435031
2:   2850662  2855702  2859105  2855443   2843382
4:   5288135  5271192  5252242  5270192   5311642
8:  10042731 10018063  9891813  9968382   9956727
12: 13265277 13384199 13438955 13363771  13436198

lock (mq)
1:   1448374  1444208  1437459  1437088  1452453
2:   2687963  2679221  2651059  2691630  2667479
4:   5153884  4684153  5091728  4635261  4902381
8:   9292395  9625869  9681835  9711651  9660498
12: 13553918 13682410 14084055 13946138 13724726

So then, if we just use the first test iteration (because I'm being a
bit lazy and don't want to calculate the avg/mean/whatever), we get
a pfifo_fast chart like,

      locked             nolock           diff
---------------------------------------------------
1     1471479            1440293          -  31186
2     1746231            1754890          +   8659
4     1119626            3282665          +2163039
8     1001471            2940079          +1938608
12     989269            2857581*         +1868312

[*] I pulled the 3rd iteration here as the 1st one seems off

And the mq chart looks reasonable again with these changes,


       locked            nolock           diff
---------------------------------------------------
1       1448374          1435952          -  12422
2       2687963          2850662          + 162699
4       5153884          5288135          + 134251
8       9292395         10042731          + 750336
12     13553918         13265277          - 288641

So the mq case is a bit of a wash from my point of view, which I sort
of expected, seeing as in this test case there is no contention on the
enqueue()/producer or dequeue()/consumer side when running pktgen
at 1 thread per qdisc/queue. A better test would be to fire up a few
thousand udp sessions and bang on the qdiscs to get contention on the
enqueue side. I'll try this next. On another note, the variance is a
touch concerning in the data above for the no lock case, so I might look
into that a bit more to see why we can get a 1 Mpps swing in one of those
cases; I sort of wonder if something kicked off on my test machine
to cause that.

Also I'm going to take a look at Jesper's microbenchmark numbers but I
think if I can convince myself that using skb_array helps or at least
does no harm I might push to have this included with skb_array and then
work on optimizing the ring type/kind/etc. as a follow up patch.
Additionally it does seem to provide goodness on the pfifo_fast single
queue case.

A final point is there are more optimizations we can do once the enqueue
and dequeue are separated. For example, two fairly easy things are
removing HARD_TX_LOCK on NICs with a ring per core and adding bulk
dequeue() to the skb_array or alf queue or whatever object we end up
with. And I expect this will provide an additional perf boost.

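To make the bulk dequeue() idea concrete, a hypothetical helper (name
and shape assumed, nothing like it exists in skb_array today, and it
reaches into the ptr_ring internals) could grab the consumer lock once
per batch:

static int skb_array_consume_batch_sketch(struct skb_array *a,
					  struct sk_buff **skbs, int n)
{
	int i = 0;

	spin_lock(&a->ring.consumer_lock);	/* one lock round-trip per batch */
	while (i < n) {
		struct sk_buff *skb = __ptr_ring_consume(&a->ring);

		if (!skb)
			break;
		skbs[i++] = skb;
	}
	spin_unlock(&a->ring.consumer_lock);

	return i;
}
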
Thanks,
John

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue
  2016-07-15 22:18         ` John Fastabend
@ 2016-07-15 22:48           ` Alexei Starovoitov
  0 siblings, 0 replies; 21+ messages in thread
From: Alexei Starovoitov @ 2016-07-15 22:48 UTC (permalink / raw)
  To: John Fastabend; +Cc: Jesper Dangaard Brouer, fw, jhs, eric.dumazet, netdev

On Fri, Jul 15, 2016 at 03:18:12PM -0700, John Fastabend wrote:
> 
> nolock (pfifo_fast)
> 1:  1440293 1421602 1409553 1393469 1424543
> 2:  1754890 1819292 1727948 1797711 1743427
> 4:  3282665 3344095 3315220 3332777 3348972
> 8:  2940079 1644450 2950777 2922085 2946310
> 12: 2042084 2610060 2857581 3493162 3104611
> 
> lock (pfifo_fast)
> 1:  1471479 1469142 1458825 1456788 1453952
> 2:  1746231 1749490 1753176 1753780 1755959
> 4:  1119626 1120515 1121478 1119220 1121115
> 8:  1001471  999308 1000318 1000776 1000384
> 12:  989269  992122  991590  986581  990430
> 
> So then if we just use the first test example because I'm being a
> bit lazy and don't want to calculate the avg/mean/whatever we get
> a pfifo_fast chart like,
> 
>       locked             nolock           diff
> ---------------------------------------------------
> 1     1471479            1440293          -  31186
> 2     1746231            1754890          +   8659
> 4     1119626            3282665          +2163039
> 8     1001471            2940079          +1938608
> 12     989269            2857581*         +1868312
...
> Also I'm going to take a look at Jesper's microbenchmark numbers but I
> think if I can convince myself that using skb_array helps or at least
> does no harm I might push to have this included with skb_array and then
> work on optimizing the ring type/kind/etc. as a follow up patch.
> Additionally it does seem to provide goodness on the pfifo_fast single
> queue case.

Agree. I think the pfifo_fast gains worth applying this patch set
as-is and work on further improvements in follow up.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-07-15 22:48 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-14  6:19 [RFC PATCH v2 00/10] running qdiscs without qdisc_lock John Fastabend
2016-07-14  6:19 ` [RFC PATCH v2 01/10] net: sched: allow qdiscs to handle locking John Fastabend
2016-07-14  6:44   ` John Fastabend
2016-07-14  6:20 ` [RFC PATCH v2 02/10] net: sched: qdisc_qlen for per cpu logic John Fastabend
2016-07-14  6:20 ` [RFC PATCH v2 03/10] net: sched: provide per cpu qstat helpers John Fastabend
2016-07-14  6:21 ` [RFC PATCH v2 04/10] net: sched: a dflt qdisc may be used with per cpu stats John Fastabend
2016-07-14  6:21 ` [RFC PATCH v2 05/10] net: sched: per cpu gso handlers John Fastabend
2016-07-14  6:22 ` [RFC PATCH v2 06/10] net: sched: support qdisc_reset on NOLOCK qdisc John Fastabend
2016-07-14  6:22 ` [RFC PATCH v2 07/10] net: sched: support skb_bad_tx with lockless qdisc John Fastabend
2016-07-14  6:23 ` [RFC PATCH v2 08/10] net: sched: pfifo_fast use alf_queue John Fastabend
2016-07-14 15:11   ` Jesper Dangaard Brouer
2016-07-15  0:09     ` John Fastabend
2016-07-15 10:09       ` Jesper Dangaard Brouer
2016-07-15 17:29         ` John Fastabend
2016-07-14 23:42   ` Alexei Starovoitov
2016-07-15  0:07     ` John Fastabend
2016-07-15 11:23       ` Jesper Dangaard Brouer
2016-07-15 22:18         ` John Fastabend
2016-07-15 22:48           ` Alexei Starovoitov
2016-07-14  6:23 ` [RFC PATCH v2 09/10] net: sched: helper to sum qlen John Fastabend
2016-07-14  6:24 ` [RFC PATCH v2 10/10] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq John Fastabend
