netdev.vger.kernel.org archive mirror
* [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq
@ 2015-12-30 17:50 John Fastabend
  2015-12-30 17:51 ` [RFC PATCH 01/12] lib: array based lock free queue John Fastabend
                   ` (12 more replies)
  0 siblings, 13 replies; 29+ messages in thread
From: John Fastabend @ 2015-12-30 17:50 UTC (permalink / raw)
  To: daniel, eric.dumazet, jhs, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev, john.fastabend

Hi,

This is a first take at removing the qdisc lock on the xmit path
where qdiscs actually have queues of skbs. The ingress qdisc,
which is already lockless, was "easy", at least in the sense that
we did not need any lock-free data structures to hold skbs.

The series here is experimental at the moment. I decided to dump
it to the netdev list when the list of folks I wanted to send it to
privately grew to three or four. Hopefully more people will take
a look at it and give feedback/criticism/whatever. For now I've
only done very basic performance tests and they showed a slight
performance improvement with pfifo_fast, but this is somewhat to
be expected because the dequeue operation in the qdisc only removes
a single skb at a time; a bulk dequeue would presumably be better,
so I'm tinkering with a pfifo_bulk or an option to pfifo_fast to
make that work. All that said, I ran some traffic over night and my
kernel didn't crash, and after a few interface resets and up/downs
everything is still functionally up and running. Still on the TODO
list is to review all the code paths into/out of sch_generic and
sch_api, so at the moment no promises I didn't miss a path.

The plan of attack here was

 - use the alf_queue (patch 1 from Jesper) and then convert
   pfifo_fast linked list of skbs over to the alf_queue.

 - fixup all the cases where pfifo fast uses qstats to be per-cpu

 - fixup qlen to support per cpu operations

 - make the gso_skb logic per cpu so any given cpu can park an
   skb when the driver throws an error or we get a cpu collision

 - wrap all the qdisc_lock calls in the xmit path with a wrapper
   that checks for a NOLOCK flag first (see the sketch after this list)

 - set the per cpu stats bit and nolock bit in pfifo fast and
   see if it works.
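
To make the wrapper idea above concrete, here is a rough sketch of
what the check looks like (the helper names are illustrative only;
the series as posted open-codes the flag test, and TCQ_F_NOLOCK is
only introduced in patch 3):

static inline void qdisc_root_lock_maybe(struct Qdisc *q,
					 spinlock_t *root_lock)
{
	/* lockless qdiscs opt out of the root lock entirely */
	if (!(q->flags & TCQ_F_NOLOCK))
		spin_lock(root_lock);
}

static inline void qdisc_root_unlock_maybe(struct Qdisc *q,
					   spinlock_t *root_lock)
{
	if (!(q->flags & TCQ_F_NOLOCK))
		spin_unlock(root_lock);
}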

On the TODO list,

 - get some performance numbers for various cases; all I've done
   so far is run some basic pktgen tests with a debug kernel and
   a few 'perf record' sessions. Both look positive but I'll do
   some more tests over the next few days.

 - review the code paths some more

 - have some cleanup/improvements/review to do in alf_queue

 - add helpers to remove nasty **void casts in alf_queue ops

 - support bulk dequeue from qdisc either pfifo_fast or new qdisc

 - support mqprio and multiq. multiq lets me run classifiers/actions,
   and with the lockless bit it lets multiple cpus run in parallel
   for performance close to mq and mqprio.

Another note: in my original take on this I tried to rework some of
the error handling out of the drivers and cpu_collision paths to drop
the gso_skb logic altogether. By using dql we could/should(?) know
whether a pkt can be consumed, at least in the ONETXQUEUE case. I
haven't given up on this but it got a bit tricky so I dropped it for
now.
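
For reference, the DQL idea would look roughly like the sketch below.
This is purely illustrative: it assumes BQL (CONFIG_BQL) is enabled on
the txq and that the qdisc is its only feeder (ONETXQUEUE), and the
helper itself is made up for this example, not part of the series:

#include <linux/netdevice.h>
#include <linux/dynamic_queue_limits.h>

static inline bool txq_has_room(const struct netdev_queue *txq,
				const struct sk_buff *skb)
{
#ifdef CONFIG_BQL
	/* bytes still allowed under the BQL limit vs. this packet */
	return dql_avail(&txq->dql) >= (int)skb->len;
#else
	return true;
#endif
}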

---

John Fastabend (12):
      lib: array based lock free queue
      net: sched: free per cpu bstats
      net: sched: allow qdiscs to handle locking
      net: sched: provide per cpu qstat helpers
      net: sched: per cpu gso handlers
      net: sched: support qdisc_reset on NOLOCK qdisc
      net: sched: qdisc_qlen for per cpu logic
      net: sched: a dflt qdisc may be used with per cpu stats
      net: sched: pfifo_fast use alf_queue
      net: sched: helper to sum qlen
      net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq
      net: sched: pfifo_fast new option to deque multiple pkts


 include/linux/alf_queue.h |  368 +++++++++++++++++++++++++++++++++++++++++++++
 include/net/gen_stats.h   |    3 
 include/net/sch_generic.h |  101 ++++++++++++
 lib/Makefile              |    2 
 lib/alf_queue.c           |   42 +++++
 net/core/dev.c            |   20 +-
 net/core/gen_stats.c      |    9 +
 net/sched/sch_generic.c   |  237 +++++++++++++++++++++--------
 net/sched/sch_mq.c        |   25 ++-
 9 files changed, 717 insertions(+), 90 deletions(-)
 create mode 100644 include/linux/alf_queue.h
 create mode 100644 lib/alf_queue.c

--
Signature


* [RFC PATCH 01/12] lib: array based lock free queue
  2015-12-30 17:50 [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq John Fastabend
@ 2015-12-30 17:51 ` John Fastabend
  2016-01-13 19:28   ` Jesper Dangaard Brouer
  2015-12-30 17:51 ` [RFC PATCH 02/12] net: sched: free per cpu bstats John Fastabend
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: John Fastabend @ 2015-12-30 17:51 UTC (permalink / raw)
  To: daniel, eric.dumazet, jhs, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev, john.fastabend

Initial implementation of an array based lock free queue. This work
was originally done by Jesper Dangaard Brouer; I've grabbed it and
made only minor tweaks so far. The plan is to use it with the 'tc'
subsystem, although it is general enough to be used elsewhere.

Certainly this implementation can be further optimized and improved,
but it is a good base implementation.
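
A minimal usage sketch of the API (not part of the patch itself): a
bounded MPMC ring of skb pointers, including the void ** casts that
the cover letter wants to hide behind helpers later:

static int alf_queue_example(struct sk_buff *skb)
{
	struct alf_queue *q;
	struct sk_buff *out[8];
	int i, n;

	/* size must be a power of 2 and <= 65536 */
	q = alf_queue_alloc(256, GFP_KERNEL);
	if (IS_ERR(q))
		return PTR_ERR(q);

	/* multi-producer enqueue of one pointer; a return of 0 means
	 * the ring was full and the caller still owns the skb
	 */
	n = alf_mp_enqueue(q, (void **)&skb, 1);
	if (!n)
		kfree_skb(skb);

	/* multi-consumer bulk dequeue of up to 8 pointers */
	n = alf_mc_dequeue(q, (void **)out, 8);
	for (i = 0; i < n; i++)
		kfree_skb(out[i]);

	alf_queue_free(q);
	return 0;
}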

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/linux/alf_queue.h |  368 +++++++++++++++++++++++++++++++++++++++++++++
 lib/Makefile              |    2 
 lib/alf_queue.c           |   42 +++++
 3 files changed, 411 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/alf_queue.h
 create mode 100644 lib/alf_queue.c

diff --git a/include/linux/alf_queue.h b/include/linux/alf_queue.h
new file mode 100644
index 0000000..dac304e
--- /dev/null
+++ b/include/linux/alf_queue.h
@@ -0,0 +1,368 @@
+#ifndef _LINUX_ALF_QUEUE_H
+#define _LINUX_ALF_QUEUE_H
+/* linux/alf_queue.h
+ *
+ * ALF: Array-based Lock-Free queue
+ *
+ * Queue properties
+ *  - Array based for cache-line optimization
+ *  - Bounded by the array size
+ *  - FIFO Producer/Consumer queue, no queue traversal supported
+ *  - Very fast
+ *  - Designed as a queue for pointers to objects
+ *  - Bulk enqueue and dequeue support
+ *  - Supports combinations of Multi and Single Producer/Consumer
+ *
+ * Copyright (C) 2014, Red Hat, Inc.,
+ *  by Jesper Dangaard Brouer and Hannes Frederic Sowa
+ *  for licensing details see kernel-base/COPYING
+ */
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+
+struct alf_actor {
+	u32 head;
+	u32 tail;
+};
+
+struct alf_queue {
+	u32 size;
+	u32 mask;
+	u32 flags;
+	struct alf_actor producer ____cacheline_aligned_in_smp;
+	struct alf_actor consumer ____cacheline_aligned_in_smp;
+	void *ring[0] ____cacheline_aligned_in_smp;
+};
+
+struct alf_queue *alf_queue_alloc(u32 size, gfp_t gfp);
+void		  alf_queue_free(struct alf_queue *q);
+
+/* Helpers for LOAD and STORE of elements, have been split-out because:
+ *  1. They can be reused for both "Single" and "Multi" variants
+ *  2. Allow us to experiment with (pipeline) optimizations in this area.
+ */
+/* Only a single of these helpers will survive upstream submission */
+#define __helper_alf_enqueue_store __helper_alf_enqueue_store_unroll
+#define __helper_alf_dequeue_load  __helper_alf_dequeue_load_unroll
+
+static inline void
+__helper_alf_enqueue_store_unroll(u32 p_head, struct alf_queue *q,
+				  void **ptr, const u32 n)
+{
+	int i, iterations = n & ~3UL;
+	u32 index = p_head & q->mask;
+
+	if (likely((index + n) <= q->mask)) {
+		/* Can save masked-AND knowing we cannot wrap */
+		/* Loop unroll */
+		for (i = 0; i < iterations; i += 4, index += 4) {
+			q->ring[index]   = ptr[i];
+			q->ring[index+1] = ptr[i+1];
+			q->ring[index+2] = ptr[i+2];
+			q->ring[index+3] = ptr[i+3];
+		}
+		/* Remainder handling */
+		switch (n & 0x3) {
+		case 3:
+			q->ring[index]   = ptr[i];
+			q->ring[index+1] = ptr[i+1];
+			q->ring[index+2] = ptr[i+2];
+			break;
+		case 2:
+			q->ring[index]   = ptr[i];
+			q->ring[index+1] = ptr[i+1];
+			break;
+		case 1:
+			q->ring[index] = ptr[i];
+		}
+	} else {
+		/* Fall-back to "mask" version */
+		for (i = 0; i < n; i++, index++)
+			q->ring[index & q->mask] = ptr[i];
+	}
+}
+
+static inline void
+__helper_alf_dequeue_load_unroll(u32 c_head, struct alf_queue *q,
+				 void **ptr, const u32 elems)
+{
+	int i, iterations = elems & ~3UL;
+	u32 index = c_head & q->mask;
+
+	if (likely((index + elems) <= q->mask)) {
+		/* Can save masked-AND knowing we cannot wrap */
+		/* Loop unroll */
+		for (i = 0; i < iterations; i += 4, index += 4) {
+			ptr[i]   = q->ring[index];
+			ptr[i+1] = q->ring[index+1];
+			ptr[i+2] = q->ring[index+2];
+			ptr[i+3] = q->ring[index+3];
+		}
+		/* Remainder handling */
+		switch (elems & 0x3) {
+		case 3:
+			ptr[i]   = q->ring[index];
+			ptr[i+1] = q->ring[index+1];
+			ptr[i+2] = q->ring[index+2];
+			break;
+		case 2:
+			ptr[i]   = q->ring[index];
+			ptr[i+1] = q->ring[index+1];
+			break;
+		case 1:
+			ptr[i]   = q->ring[index];
+		}
+	} else {
+		/* Fall-back to "mask" version */
+		for (i = 0; i < elems; i++, index++)
+			ptr[i] = q->ring[index & q->mask];
+	}
+}
+
+/* Main Multi-Producer ENQUEUE
+ *
+ * Even though the current API has "fixed" semantics of aborting if it
+ * cannot enqueue the full bulk size, users of this API should check
+ * that the returned number of enqueued elements matches, to verify the
+ * enqueue was successful.  This allows us to introduce a "variable"
+ * enqueue scheme later.
+ *
+ * Not preemption safe. Multiple CPUs can enqueue elements, but the
+ * same CPU is not allowed to be preempted and access the same
+ * queue. Due to how the tail is updated, this can result in a soft
+ * lock-up. (Same goes for alf_mc_dequeue).
+ */
+static inline int
+alf_mp_enqueue(const u32 n;
+	       struct alf_queue *q, void *ptr[n], const u32 n)
+{
+	u32 p_head, p_next, c_tail, space;
+
+	/* Reserve part of the array for enqueue STORE/WRITE */
+	do {
+		p_head = ACCESS_ONCE(q->producer.head);
+		c_tail = ACCESS_ONCE(q->consumer.tail); /* as smp_load_acquire */
+
+		space = q->size + c_tail - p_head;
+		if (n > space)
+			return 0;
+
+		p_next = p_head + n;
+	}
+	while (unlikely(cmpxchg(&q->producer.head, p_head, p_next) != p_head));
+
+	/* The memory barrier of smp_load_acquire(&q->consumer.tail)
+	 * is satisfied by cmpxchg implicit full memory barrier
+	 */
+
+	/* STORE the elems into the queue array */
+	__helper_alf_enqueue_store(p_head, q, ptr, n);
+	smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
+
+	/* Wait for other concurrent preceding enqueues not yet done;
+	 * this part makes us non-wait-free and could be problematic
+	 * in case of congestion with many CPUs
+	 */
+	while (unlikely(ACCESS_ONCE(q->producer.tail) != p_head))
+		cpu_relax();
+	/* Mark this enq done and avail for consumption */
+	ACCESS_ONCE(q->producer.tail) = p_next;
+
+	return n;
+}
+
+/* Main Multi-Consumer DEQUEUE */
+static inline int
+alf_mc_dequeue(const u32 n;
+	       struct alf_queue *q, void *ptr[n], const u32 n)
+{
+	u32 c_head, c_next, p_tail, elems;
+
+	/* Reserve part of the array for dequeue LOAD/READ */
+	do {
+		c_head = ACCESS_ONCE(q->consumer.head);
+		p_tail = ACCESS_ONCE(q->producer.tail);
+
+		elems = p_tail - c_head;
+
+		if (elems == 0)
+			return 0;
+		else
+			elems = min(elems, n);
+
+		c_next = c_head + elems;
+	}
+	while (unlikely(cmpxchg(&q->consumer.head, c_head, c_next) != c_head));
+
+	/* LOAD the elems from the queue array.
+	 *   We don't need a smb_rmb() Read-Memory-Barrier here because
+	 *   the above cmpxchg is an implied full Memory-Barrier.
+	 */
+	__helper_alf_dequeue_load(c_head, q, ptr, elems);
+
+	/* Wait for other concurrent preceding dequeues not yet done */
+	while (unlikely(ACCESS_ONCE(q->consumer.tail) != c_head))
+		cpu_relax();
+	/* Mark this deq done and avail for producers */
+	smp_store_release(&q->consumer.tail, c_next);
+	/* Archs with weak Memory Ordering need a memory barrier
+	 * (store_release) here.  As the STORE to q->consumer.tail,
+	 * must happen after the dequeue LOADs.  Paired with enqueue
+	 * implicit full-MB in cmpxchg.
+	 */
+
+	return elems;
+}
+
+/* #define ASSERT_DEBUG_SPSC 1 */
+#ifndef ASSERT_DEBUG_SPSC
+#define ASSERT(x) do { } while (0)
+#else
+#define ASSERT(x)							\
+	do {								\
+		if (unlikely(!(x))) {					\
+			pr_crit("Assertion failed %s:%d: \"%s\"\n",	\
+				__FILE__, __LINE__, #x);		\
+			BUG();						\
+		}							\
+	} while (0)
+#endif
+
+/* Main SINGLE Producer ENQUEUE
+ *  caller MUST make sure preemption is disabled
+ */
+static inline int
+alf_sp_enqueue(const u32 n;
+	       struct alf_queue *q, void *ptr[n], const u32 n)
+{
+	u32 p_head, p_next, c_tail, space;
+
+	/* Reserve part of the array for enqueue STORE/WRITE */
+	p_head = q->producer.head;
+	smp_rmb(); /* for consumer.tail write, making sure deq loads are done */
+	c_tail = ACCESS_ONCE(q->consumer.tail);
+
+	space = q->size + c_tail - p_head;
+	if (n > space)
+		return 0;
+
+	p_next = p_head + n;
+	ASSERT(ACCESS_ONCE(q->producer.head) == p_head);
+	q->producer.head = p_next;
+
+	/* STORE the elems into the queue array */
+	__helper_alf_enqueue_store(p_head, q, ptr, n);
+	smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
+
+	/* Assert no other CPU (or same CPU via preemption) changed queue */
+	ASSERT(ACCESS_ONCE(q->producer.tail) == p_head);
+
+	/* Mark this enq done and avail for consumption */
+	ACCESS_ONCE(q->producer.tail) = p_next;
+
+	return n;
+}
+
+/* Main SINGLE Consumer DEQUEUE
+ *  caller MUST make sure preemption is disabled
+ */
+static inline int
+alf_sc_dequeue(const u32 n;
+	       struct alf_queue *q, void *ptr[n], const u32 n)
+{
+	u32 c_head, c_next, p_tail, elems;
+
+	/* Reserve part of the array for dequeue LOAD/READ */
+	c_head = q->consumer.head;
+	p_tail = ACCESS_ONCE(q->producer.tail);
+
+	elems = p_tail - c_head;
+
+	if (elems == 0)
+		return 0;
+	else
+		elems = min(elems, n);
+
+	c_next = c_head + elems;
+	ASSERT(ACCESS_ONCE(q->consumer.head) == c_head);
+	q->consumer.head = c_next;
+
+	smp_rmb(); /* Read-Memory-Barrier matching enq STOREs */
+	__helper_alf_dequeue_load(c_head, q, ptr, elems);
+
+	/* Archs with weak Memory Ordering need a memory barrier here.
+	 * As the STORE to q->consumer.tail, must happen after the
+	 * dequeue LOADs. Dequeue LOADs have a dependent STORE into
+	 * ptr, thus a smp_wmb() is enough.
+	 */
+	smp_wmb();
+
+	/* Assert no other CPU (or same CPU via preemption) changed queue */
+	ASSERT(ACCESS_ONCE(q->consumer.tail) == c_head);
+
+	/* Mark this deq done and avail for producers */
+	ACCESS_ONCE(q->consumer.tail) = c_next;
+
+	return elems;
+}
+
+static inline bool
+alf_queue_empty(struct alf_queue *q)
+{
+	u32 c_tail = ACCESS_ONCE(q->consumer.tail);
+	u32 p_tail = ACCESS_ONCE(q->producer.tail);
+
+	/* The empty (and initial) state is when the consumer has
+	 * caught up with the producer.
+	 *
+	 * DOUBLE-CHECK: Should we use producer.head, as this indicate
+	 * a producer is in-progress(?)
+	 */
+	return c_tail == p_tail;
+}
+
+static inline int
+alf_queue_count(struct alf_queue *q)
+{
+	u32 c_head = ACCESS_ONCE(q->consumer.head);
+	u32 p_tail = ACCESS_ONCE(q->producer.tail);
+	u32 elems;
+
+	/* Due to u32 arithmetic the values are implicitly
+	 * masked/modulo 32-bit, thus saving one mask operation
+	 */
+	elems = p_tail - c_head;
+	/* Thus, same as:
+	 *  elems = (p_tail - c_head) & q->mask;
+	 */
+	return elems;
+}
+
+static inline int
+alf_queue_avail_space(struct alf_queue *q)
+{
+	u32 p_head = ACCESS_ONCE(q->producer.head);
+	u32 c_tail = ACCESS_ONCE(q->consumer.tail);
+	u32 space;
+
+	/* The max avail space is q->size and
+	 * the empty state is when (consumer == producer)
+	 */
+
+	/* Due to u32 arithmetic the values are implicitly
+	 * masked/modulo 32-bit, thus saving one mask operation
+	 */
+	space = q->size + c_tail - p_head;
+	/* Thus, same as:
+	 *  space = (q->size + c_tail - p_head) & q->mask;
+	 */
+	return space;
+}
+
+static inline void
+alf_queue_flush(struct alf_queue *q)
+{
+
+}
+
+#endif /* _LINUX_ALF_QUEUE_H */
diff --git a/lib/Makefile b/lib/Makefile
index 180dd4d..6b7b161 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -27,7 +27,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o
+	 once.o alf_queue.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/alf_queue.c b/lib/alf_queue.c
new file mode 100644
index 0000000..ac026ad
--- /dev/null
+++ b/lib/alf_queue.c
@@ -0,0 +1,42 @@
+/*
+ * lib/alf_queue.c
+ *
+ * ALF: Array-based Lock-Free queue
+ *  - Main implementation in: include/linux/alf_queue.h
+ *
+ * Copyright (C) 2014, Red Hat, Inc.,
+ *  by Jesper Dangaard Brouer and Hannes Frederic Sowa
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/alf_queue.h>
+#include <linux/log2.h>
+
+struct alf_queue *alf_queue_alloc(u32 size, gfp_t gfp)
+{
+	struct alf_queue *q;
+	size_t mem_size;
+
+	if (!(is_power_of_2(size)) || size > 65536)
+		return ERR_PTR(-EINVAL);
+
+	/* The ring array is allocated together with the queue struct */
+	mem_size = size * sizeof(void *) + sizeof(struct alf_queue);
+	q = kzalloc(mem_size, gfp);
+	if (!q)
+		return ERR_PTR(-ENOMEM);
+
+	q->size = size;
+	q->mask = size - 1;
+
+	return q;
+}
+EXPORT_SYMBOL_GPL(alf_queue_alloc);
+
+void alf_queue_free(struct alf_queue *q)
+{
+	kfree(q);
+}
+EXPORT_SYMBOL_GPL(alf_queue_free);


* [RFC PATCH 02/12] net: sched: free per cpu bstats
  2015-12-30 17:50 [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq John Fastabend
  2015-12-30 17:51 ` [RFC PATCH 01/12] lib: array based lock free queue John Fastabend
@ 2015-12-30 17:51 ` John Fastabend
  2016-01-04 15:21   ` Daniel Borkmann
  2015-12-30 17:51 ` [RFC PATCH 03/12] net: sched: allow qdiscs to handle locking John Fastabend
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: John Fastabend @ 2015-12-30 17:51 UTC (permalink / raw)
  To: daniel, eric.dumazet, jhs, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev, john.fastabend

When a qdisc is using per cpu stats, only the bstats are currently
being freed. Also free the qstats.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/sch_generic.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e82a1ad..16bc83b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -658,8 +658,10 @@ static void qdisc_rcu_free(struct rcu_head *head)
 {
 	struct Qdisc *qdisc = container_of(head, struct Qdisc, rcu_head);
 
-	if (qdisc_is_percpu_stats(qdisc))
+	if (qdisc_is_percpu_stats(qdisc)) {
 		free_percpu(qdisc->cpu_bstats);
+		free_percpu(qdisc->cpu_qstats);
+	}
 
 	kfree((char *) qdisc - qdisc->padded);
 }


* [RFC PATCH 03/12] net: sched: allow qdiscs to handle locking
  2015-12-30 17:50 [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq John Fastabend
  2015-12-30 17:51 ` [RFC PATCH 01/12] lib: array based lock free queue John Fastabend
  2015-12-30 17:51 ` [RFC PATCH 02/12] net: sched: free per cpu bstats John Fastabend
@ 2015-12-30 17:51 ` John Fastabend
  2015-12-30 17:52 ` [RFC PATCH 04/12] net: sched: provide per cpu qstat helpers John Fastabend
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: John Fastabend @ 2015-12-30 17:51 UTC (permalink / raw)
  To: daniel, eric.dumazet, jhs, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev, john.fastabend

This patch adds a flag for queueing disciplines to indicate
the stack does not need to use the qdisc lock to protect
operations. This can be used to build lockless scheduling
algorithms and improve performance.

The flag is checked in the tx path and the qdisc lock is
only taken if it is not set. For now use a conditional
if statement. Later we could be more aggressive if it
proves worthwhile and use a static key or wrap this in
a likely().
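
As a rough sketch (not part of this patch), the "more aggressive"
variant could gate the test behind a static key that is only enabled
once the first NOLOCK qdisc is created; the key and helper names here
are made up for illustration:

/* needs <linux/jump_label.h> and <net/sch_generic.h> */
DEFINE_STATIC_KEY_FALSE(qdisc_nolock_in_use);

static inline bool qdisc_needs_root_lock(const struct Qdisc *q)
{
	/* common case: no NOLOCK qdisc exists, always take the lock */
	if (!static_branch_unlikely(&qdisc_nolock_in_use))
		return true;

	return !(q->flags & TCQ_F_NOLOCK);
}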

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |    1 +
 net/core/dev.c            |   20 ++++++++++++--------
 net/sched/sch_generic.c   |    7 +++++--
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index b2a8e63..c8d42c3 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -64,6 +64,7 @@ struct Qdisc {
 #define TCQ_F_NOPARENT		0x40 /* root of its hierarchy :
 				      * qdisc_tree_decrease_qlen() should stop.
 				      */
+#define TCQ_F_NOLOCK		0x80 /* qdisc does not require locking */
 	u32			limit;
 	const struct Qdisc_ops	*ops;
 	struct qdisc_size_table	__rcu *stab;
diff --git a/net/core/dev.c b/net/core/dev.c
index 914b4a2..7a51609 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3004,7 +3004,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 				 struct netdev_queue *txq)
 {
 	spinlock_t *root_lock = qdisc_lock(q);
-	bool contended;
+	bool contended = false;
 	int rc;
 
 	qdisc_pkt_len_init(skb);
@@ -3015,11 +3015,13 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	 * This permits __QDISC___STATE_RUNNING owner to get the lock more
 	 * often and dequeue packets faster.
 	 */
-	contended = qdisc_is_running(q);
-	if (unlikely(contended))
-		spin_lock(&q->busylock);
+	if (!(q->flags & TCQ_F_NOLOCK)) {
+		contended = qdisc_is_running(q);
+		if (unlikely(contended))
+			spin_lock(&q->busylock);
+		spin_lock(root_lock);
+	}
 
-	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
 		kfree_skb(skb);
 		rc = NET_XMIT_DROP;
@@ -3053,9 +3055,11 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 			__qdisc_run(q);
 		}
 	}
-	spin_unlock(root_lock);
-	if (unlikely(contended))
-		spin_unlock(&q->busylock);
+	if (!(q->flags & TCQ_F_NOLOCK)) {
+		spin_unlock(root_lock);
+		if (unlikely(contended))
+			spin_unlock(&q->busylock);
+	}
 	return rc;
 }
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 16bc83b..37dfa4a 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -153,7 +153,8 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 	int ret = NETDEV_TX_BUSY;
 
 	/* And release qdisc */
-	spin_unlock(root_lock);
+	if (!(q->flags & TCQ_F_NOLOCK))
+		spin_unlock(root_lock);
 
 	/* Note that we validate skb (GSO, checksum, ...) outside of locks */
 	if (validate)
@@ -166,7 +167,9 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 
 		HARD_TX_UNLOCK(dev, txq);
 	}
-	spin_lock(root_lock);
+
+	if (!(q->flags & TCQ_F_NOLOCK))
+		spin_lock(root_lock);
 
 	if (dev_xmit_complete(ret)) {
 		/* Driver sent out skb successfully or skb was consumed */


* [RFC PATCH 04/12] net: sched: provide per cpu qstat helpers
  2015-12-30 17:50 [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq John Fastabend
                   ` (2 preceding siblings ...)
  2015-12-30 17:51 ` [RFC PATCH 03/12] net: sched: allow qdiscs to handle locking John Fastabend
@ 2015-12-30 17:52 ` John Fastabend
  2015-12-30 17:52 ` [RFC PATCH 05/12] net: sched: per cpu gso handlers John Fastabend
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: John Fastabend @ 2015-12-30 17:52 UTC (permalink / raw)
  To: daniel, eric.dumazet, jhs, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev, john.fastabend

The per cpu qstats support was added with per cpu bstat support
which is currently used by the ingress qdisc. This patch adds
a set of helpers needed to make other qdiscs that use qstats
per cpu as well.
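
A sketch of how a lockless qdisc's enqueue path would use these
helpers (my_ring_enqueue() is a placeholder for the real per-cpu
friendly enqueue, not an actual function; the pfifo_fast conversion
later in this series does essentially this):

static int my_lockless_enqueue(struct sk_buff *skb, struct Qdisc *sch)
{
	if (!my_ring_enqueue(sch, skb))		/* placeholder */
		return qdisc_drop_cpu(skb, sch);

	qdisc_qstats_cpu_backlog_inc(sch, skb);
	qdisc_qstats_cpu_qlen_inc(sch);
	return NET_XMIT_SUCCESS;
}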

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |   39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index c8d42c3..9966c17 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -545,12 +545,43 @@ static inline void qdisc_qstats_backlog_dec(struct Qdisc *sch,
 	sch->qstats.backlog -= qdisc_pkt_len(skb);
 }
 
+static inline void qdisc_qstats_cpu_backlog_dec(struct Qdisc *sch,
+						const struct sk_buff *skb)
+{
+	struct gnet_stats_queue *q = this_cpu_ptr(sch->cpu_qstats);
+
+	q->backlog -= qdisc_pkt_len(skb);
+}
+
 static inline void qdisc_qstats_backlog_inc(struct Qdisc *sch,
 					    const struct sk_buff *skb)
 {
 	sch->qstats.backlog += qdisc_pkt_len(skb);
 }
 
+static inline void qdisc_qstats_cpu_backlog_inc(struct Qdisc *sch,
+						const struct sk_buff *skb)
+{
+	struct gnet_stats_queue *q = this_cpu_ptr(sch->cpu_qstats);
+
+	q->backlog += qdisc_pkt_len(skb);
+}
+
+static inline void qdisc_qstats_cpu_qlen_inc(struct Qdisc *sch)
+{
+	this_cpu_ptr(sch->cpu_qstats)->qlen++;
+}
+
+static inline void qdisc_qstats_cpu_qlen_dec(struct Qdisc *sch)
+{
+	this_cpu_ptr(sch->cpu_qstats)->qlen--;
+}
+
+static inline void qdisc_qstats_cpu_requeues_inc(struct Qdisc *sch)
+{
+	this_cpu_ptr(sch->cpu_qstats)->requeues++;
+}
+
 static inline void __qdisc_qstats_drop(struct Qdisc *sch, int count)
 {
 	sch->qstats.drops += count;
@@ -726,6 +757,14 @@ static inline int qdisc_drop(struct sk_buff *skb, struct Qdisc *sch)
 	return NET_XMIT_DROP;
 }
 
+static inline int qdisc_drop_cpu(struct sk_buff *skb, struct Qdisc *sch)
+{
+	kfree_skb(skb);
+	qdisc_qstats_cpu_drop(sch);
+
+	return NET_XMIT_DROP;
+}
+
 static inline int qdisc_reshape_fail(struct sk_buff *skb, struct Qdisc *sch)
 {
 	qdisc_qstats_drop(sch);


* [RFC PATCH 05/12] net: sched: per cpu gso handlers
  2015-12-30 17:50 [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq John Fastabend
                   ` (3 preceding siblings ...)
  2015-12-30 17:52 ` [RFC PATCH 04/12] net: sched: provide per cpu qstat helpers John Fastabend
@ 2015-12-30 17:52 ` John Fastabend
  2015-12-30 20:26   ` Jesper Dangaard Brouer
  2015-12-30 17:53 ` [RFC PATCH 06/12] net: sched: support qdisc_reset on NOLOCK qdisc John Fastabend
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: John Fastabend @ 2015-12-30 17:52 UTC (permalink / raw)
  To: daniel, eric.dumazet, jhs, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev, john.fastabend

The net sched infrastructure has a gso ptr that points to skb structs
that have failed to be enqueued by the device driver.

This can happen when multiple cores try to push a skb onto the same
underlying hardware queue, resulting in lock contention. This case is
handled by the cpu collision handler handle_dev_cpu_collision(). Another
case occurs when the stack overruns the driver's low level tx queue
capacity. Ideally these should be rare occurrences in a well-tuned
system, but they do happen.

To handle this in the lockless case use a per cpu gso field to park
the skb until the conflict can be resolved. Note at this point the
skb has already been popped off the qdisc so it has to be handled
by the infrastructure.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |   36 ++++++++++++++++++++++++++++++++++++
 net/sched/sch_generic.c   |   34 ++++++++++++++++++++++++++++++++--
 2 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9966c17..aa39dd4 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -44,6 +44,10 @@ struct qdisc_size_table {
 	u16			data[];
 };
 
+struct gso_cell {
+	struct sk_buff *skb;
+};
+
 struct Qdisc {
 	int 			(*enqueue)(struct sk_buff *skb, struct Qdisc *dev);
 	struct sk_buff *	(*dequeue)(struct Qdisc *dev);
@@ -88,6 +92,7 @@ struct Qdisc {
 
 	struct Qdisc		*next_sched;
 	struct sk_buff		*gso_skb;
+	struct gso_cell __percpu *gso_cpu_skb;
 	/*
 	 * For performance sake on SMP, we put highly modified fields at the end
 	 */
@@ -699,6 +704,22 @@ static inline struct sk_buff *qdisc_peek_dequeued(struct Qdisc *sch)
 	return sch->gso_skb;
 }
 
+static inline struct sk_buff *qdisc_peek_dequeued_cpu(struct Qdisc *sch)
+{
+	struct gso_cell *gso = this_cpu_ptr(sch->gso_cpu_skb);
+
+	if (!gso->skb) {
+		struct sk_buff *skb = sch->dequeue(sch);
+
+		if (skb) {
+			gso->skb = skb;
+			qdisc_qstats_cpu_qlen_inc(sch);
+		}
+	}
+
+	return gso->skb;
+}
+
 /* use instead of qdisc->dequeue() for all qdiscs queried with ->peek() */
 static inline struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
 {
@@ -714,6 +735,21 @@ static inline struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
 	return skb;
 }
 
+static inline struct sk_buff *qdisc_dequeue_peeked_skb(struct Qdisc *sch)
+{
+	struct gso_cell *gso = this_cpu_ptr(sch->gso_cpu_skb);
+	struct sk_buff *skb = gso->skb;
+
+	if (skb) {
+		gso->skb = NULL;
+		qdisc_qstats_cpu_qlen_dec(sch);
+	} else {
+		skb = sch->dequeue(sch);
+	}
+
+	return skb;
+}
+
 static inline void __qdisc_reset_queue(struct Qdisc *sch,
 				       struct sk_buff_head *list)
 {
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37dfa4a..9aeb51f 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -44,8 +44,7 @@ EXPORT_SYMBOL(default_qdisc_ops);
  * - ingress filtering is also serialized via qdisc root lock
  * - updates to tree and tree walking are only done under the rtnl mutex.
  */
-
-static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
+static inline int __dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 {
 	q->gso_skb = skb;
 	q->qstats.requeues++;
@@ -55,6 +54,24 @@ static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 	return 0;
 }
 
+static inline int dev_requeue_cpu_skb(struct sk_buff *skb, struct Qdisc *q)
+{
+	this_cpu_ptr(q->gso_cpu_skb)->skb = skb;
+	qdisc_qstats_cpu_requeues_inc(q);
+	qdisc_qstats_cpu_qlen_inc(q);
+	__netif_schedule(q);
+
+	return 0;
+}
+
+static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
+{
+	if (q->flags & TCQ_F_NOLOCK)
+		return dev_requeue_cpu_skb(skb, q);
+	else
+		return __dev_requeue_skb(skb, q);
+}
+
 static void try_bulk_dequeue_skb(struct Qdisc *q,
 				 struct sk_buff *skb,
 				 const struct netdev_queue *txq,
@@ -666,6 +683,19 @@ static void qdisc_rcu_free(struct rcu_head *head)
 		free_percpu(qdisc->cpu_qstats);
 	}
 
+	if (qdisc->gso_cpu_skb) {
+		int i;
+
+		for_each_possible_cpu(i) {
+			struct gso_cell *cell;
+
+			cell = per_cpu_ptr(qdisc->gso_cpu_skb, i);
+			kfree_skb_list(cell->skb);
+		}
+
+		free_percpu(qdisc->gso_cpu_skb);
+	}
+
 	kfree((char *) qdisc - qdisc->padded);
 }
 


* [RFC PATCH 06/12] net: sched: support qdisc_reset on NOLOCK qdisc
  2015-12-30 17:50 [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq John Fastabend
                   ` (4 preceding siblings ...)
  2015-12-30 17:52 ` [RFC PATCH 05/12] net: sched: per cpu gso handlers John Fastabend
@ 2015-12-30 17:53 ` John Fastabend
  2016-01-01  2:30   ` Alexei Starovoitov
  2016-01-13 16:20   ` David Miller
  2015-12-30 17:53 ` [RFC PATCH 07/12] net: sched: qdisc_qlen for per cpu logic John Fastabend
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 29+ messages in thread
From: John Fastabend @ 2015-12-30 17:53 UTC (permalink / raw)
  To: daniel, eric.dumazet, jhs, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev, john.fastabend

The qdisc_reset operation depends on the qdisc lock at the moment
to halt any additions to gso_skb and statistics while the list is
free'd and the stats zeroed.

Without the qdisc lock we can not guarantee another cpu is not in
the process of adding a skb to one of the "cells". Here are the
two cases we have to handle.

 case 1: qdisc_graft operation. In this case a "new" qdisc is attached
	 and the 'qdisc_destroy' operation is called on the old qdisc.
	 The destroy operation will wait an rcu grace period and call
	 qdisc_rcu_free(), at which point gso_cpu_skb is freed along
	 with all stats, so there is no need to zero stats and
	 gso_cpu_skb from the reset operation itself.

	 Because we cannot call qdisc_reset before waiting an rcu
	 grace period (so that the qdisc is detached from all cpus),
	 simply do not call qdisc_reset() at all and let the
	 qdisc_destroy operation clean up the qdisc. Note, a refcnt
	 greater than 1 would cause the destroy operation to be
	 aborted; however, if this ever happened the reference to the
	 qdisc would be lost and we would have a memory leak.

 case 2: dev_deactivate sequence. This can come from a user bringing
	 the interface down, which causes the gso_skb list to be flushed
	 and the qlen zeroed. At the moment this is protected by the
	 qdisc lock, so while we clear the qlen/gso_skb fields we are
	 guaranteed no new skbs are added. For the lockless case,
	 though, this is not true. To resolve this, move the qdisc_reset
	 call after the new qdisc is assigned and a grace period has
	 passed to ensure no new skbs can be enqueued. Further, the
	 RTNL lock is held so we cannot get another call to activate
	 the qdisc while the skb lists are being freed.

	 Finally, fix qdisc_reset to handle the per cpu stats and
	 skb lists.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/sch_generic.c |   42 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 35 insertions(+), 7 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 9aeb51f..134fb95 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -666,6 +666,18 @@ void qdisc_reset(struct Qdisc *qdisc)
 	if (ops->reset)
 		ops->reset(qdisc);
 
+	if (qdisc->gso_cpu_skb) {
+		int i;
+
+		for_each_possible_cpu(i) {
+			struct gso_cell *cell;
+
+			cell = per_cpu_ptr(qdisc->gso_cpu_skb, i);
+			if (cell)
+				kfree_skb_list(cell->skb);
+		}
+	}
+
 	if (qdisc->gso_skb) {
 		kfree_skb_list(qdisc->gso_skb);
 		qdisc->gso_skb = NULL;
@@ -740,10 +752,6 @@ struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 	root_lock = qdisc_lock(oqdisc);
 	spin_lock_bh(root_lock);
 
-	/* Prune old scheduler */
-	if (oqdisc && atomic_read(&oqdisc->refcnt) <= 1)
-		qdisc_reset(oqdisc);
-
 	/* ... and graft new one */
 	if (qdisc == NULL)
 		qdisc = &noop_qdisc;
@@ -857,7 +865,6 @@ static void dev_deactivate_queue(struct net_device *dev,
 			set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
 
 		rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
-		qdisc_reset(qdisc);
 
 		spin_unlock_bh(qdisc_lock(qdisc));
 	}
@@ -890,6 +897,18 @@ static bool some_qdisc_is_busy(struct net_device *dev)
 	return false;
 }
 
+static void dev_qdisc_reset(struct net_device *dev,
+			    struct netdev_queue *dev_queue,
+			    void *none)
+{
+	struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
+
+	WARN_ON(!qdisc);
+
+	if (qdisc)
+		qdisc_reset(qdisc);
+}
+
 /**
  * 	dev_deactivate_many - deactivate transmissions on several devices
  * 	@head: list of devices to deactivate
@@ -910,7 +929,7 @@ void dev_deactivate_many(struct list_head *head)
 					     &noop_qdisc);
 
 		dev_watchdog_down(dev);
-		sync_needed |= !dev->dismantle;
+		sync_needed = true;
 	}
 
 	/* Wait for outstanding qdisc-less dev_queue_xmit calls.
@@ -921,9 +940,18 @@ void dev_deactivate_many(struct list_head *head)
 		synchronize_net();
 
 	/* Wait for outstanding qdisc_run calls. */
-	list_for_each_entry(dev, head, close_list)
+	list_for_each_entry(dev, head, close_list) {
 		while (some_qdisc_is_busy(dev))
 			yield();
+
+		/* The new qdisc is assigned at this point so we can safely
+		 * unwind stale skb lists and qdisc statistics
+		 */
+		netdev_for_each_tx_queue(dev, dev_qdisc_reset, NULL);
+		if (dev_ingress_queue(dev))
+			dev_qdisc_reset(dev, dev_ingress_queue(dev), NULL);
+	}
+
 }
 
 void dev_deactivate(struct net_device *dev)


* [RFC PATCH 07/12] net: sched: qdisc_qlen for per cpu logic
  2015-12-30 17:50 [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq John Fastabend
                   ` (5 preceding siblings ...)
  2015-12-30 17:53 ` [RFC PATCH 06/12] net: sched: support qdisc_reset on NOLOCK qdisc John Fastabend
@ 2015-12-30 17:53 ` John Fastabend
  2015-12-30 17:53 ` [RFC PATCH 08/12] net: sched: a dflt qdisc may be used with per cpu stats John Fastabend
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: John Fastabend @ 2015-12-30 17:53 UTC (permalink / raw)
  To: daniel, eric.dumazet, jhs, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev, john.fastabend

This is a bit interesting because it means sch_direct_xmit will
return a positive value, which causes the dequeue/xmit cycle to
continue, only when the current cpu has a qlen > 0.

However, checking every cpu for qlen would hurt performance, so
it's important to note that qdiscs that set the no lock bit need
to have some sort of per cpu enqueue/dequeue data structure that
maps to the per cpu qlen value.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index aa39dd4..30f4c60 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -273,8 +273,16 @@ static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
 	BUILD_BUG_ON(sizeof(qcb->data) < sz);
 }
 
+static inline int qdisc_qlen_cpu(const struct Qdisc *q)
+{
+	return this_cpu_ptr(q->cpu_qstats)->qlen;
+}
+
 static inline int qdisc_qlen(const struct Qdisc *q)
 {
+	if (q->flags & TCQ_F_NOLOCK)
+		return qdisc_qlen_cpu(q);
+
 	return q->q.qlen;
 }
 


* [RFC PATCH 08/12] net: sched: a dflt qdisc may be used with per cpu stats
  2015-12-30 17:50 [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq John Fastabend
                   ` (6 preceding siblings ...)
  2015-12-30 17:53 ` [RFC PATCH 07/12] net: sched: qdisc_qlen for per cpu logic John Fastabend
@ 2015-12-30 17:53 ` John Fastabend
  2015-12-30 17:54 ` [RFC PATCH 09/12] net: sched: pfifo_fast use alf_queue John Fastabend
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: John Fastabend @ 2015-12-30 17:53 UTC (permalink / raw)
  To: daniel, eric.dumazet, jhs, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev, john.fastabend

Enable dflt qdisc support for per cpu stats. Before this patch a
dflt qdisc was required to use the global statistics qstats and
bstats.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/sch_generic.c |   24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 134fb95..be5d63a 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -641,18 +641,34 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue,
 	struct Qdisc *sch;
 
 	if (!try_module_get(ops->owner))
-		goto errout;
+		return NULL;
 
 	sch = qdisc_alloc(dev_queue, ops);
 	if (IS_ERR(sch))
-		goto errout;
+		return NULL;
 	sch->parent = parentid;
 
-	if (!ops->init || ops->init(sch, NULL) == 0)
+	if (!ops->init)
 		return sch;
 
-	qdisc_destroy(sch);
+	if (ops->init(sch, NULL))
+		goto errout;
+
+	/* init() may have set percpu flags so init data structures */
+	if (qdisc_is_percpu_stats(sch)) {
+		sch->cpu_bstats =
+			netdev_alloc_pcpu_stats(struct gnet_stats_basic_cpu);
+		if (!sch->cpu_bstats)
+			goto errout;
+
+		sch->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
+		if (!sch->cpu_qstats)
+			goto errout;
+	}
+
+	return sch;
 errout:
+	qdisc_destroy(sch);
 	return NULL;
 }
 EXPORT_SYMBOL(qdisc_create_dflt);


* [RFC PATCH 09/12] net: sched: pfifo_fast use alf_queue
  2015-12-30 17:50 [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq John Fastabend
                   ` (7 preceding siblings ...)
  2015-12-30 17:53 ` [RFC PATCH 08/12] net: sched: a dflt qdisc may be used with per cpu stats John Fastabend
@ 2015-12-30 17:54 ` John Fastabend
  2016-01-13 16:24   ` David Miller
  2015-12-30 17:54 ` [RFC PATCH 10/12] net: sched: helper to sum qlen John Fastabend
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 29+ messages in thread
From: John Fastabend @ 2015-12-30 17:54 UTC (permalink / raw)
  To: daniel, eric.dumazet, jhs, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev, john.fastabend

This converts the pfifo_fast qdisc to use the alf_queue enqueue
and dequeue routines then sets the NOLOCK bit.

This also removes the logic used to pick the next band to dequeue
from and instead just checks each alf_queue for packets from
top priority to lowest. This might need to be a bit more clever
but seems to work for now.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 net/sched/sch_generic.c |  120 +++++++++++++++++++++++++----------------------
 1 file changed, 65 insertions(+), 55 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index be5d63a..480cf63 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -26,6 +26,7 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/if_vlan.h>
+#include <linux/alf_queue.h>
 #include <net/sch_generic.h>
 #include <net/pkt_sched.h>
 #include <net/dst.h>
@@ -467,87 +468,80 @@ static const u8 prio2band[TC_PRIO_MAX + 1] = {
 
 /*
  * Private data for a pfifo_fast scheduler containing:
- * 	- queues for the three band
- * 	- bitmap indicating which of the bands contain skbs
+ *	- rings for the priority bands
  */
 struct pfifo_fast_priv {
-	u32 bitmap;
-	struct sk_buff_head q[PFIFO_FAST_BANDS];
+	struct alf_queue *q[PFIFO_FAST_BANDS];
 };
 
-/*
- * Convert a bitmap to the first band number where an skb is queued, where:
- * 	bitmap=0 means there are no skbs on any band.
- * 	bitmap=1 means there is an skb on band 0.
- *	bitmap=7 means there are skbs on all 3 bands, etc.
- */
-static const int bitmap2band[] = {-1, 0, 1, 0, 2, 0, 1, 0};
-
-static inline struct sk_buff_head *band2list(struct pfifo_fast_priv *priv,
-					     int band)
+static inline struct alf_queue *band2list(struct pfifo_fast_priv *priv,
+					  int band)
 {
-	return priv->q + band;
+	return priv->q[band];
 }
 
 static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc)
 {
-	if (skb_queue_len(&qdisc->q) < qdisc_dev(qdisc)->tx_queue_len) {
-		int band = prio2band[skb->priority & TC_PRIO_MAX];
-		struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
-		struct sk_buff_head *list = band2list(priv, band);
-
-		priv->bitmap |= (1 << band);
-		qdisc->q.qlen++;
-		return __qdisc_enqueue_tail(skb, qdisc, list);
-	}
-
-	return qdisc_drop(skb, qdisc);
-}
-
-static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
-{
+	int band = prio2band[skb->priority & TC_PRIO_MAX];
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
-	int band = bitmap2band[priv->bitmap];
+	struct alf_queue *q = band2list(priv, band);
+	int n;
 
-	if (likely(band >= 0)) {
-		struct sk_buff_head *list = band2list(priv, band);
-		struct sk_buff *skb = __qdisc_dequeue_head(qdisc, list);
+	if (!q) {
+		WARN_ON(1);
+		return qdisc_drop(skb, qdisc);
+	}
 
-		qdisc->q.qlen--;
-		if (skb_queue_empty(list))
-			priv->bitmap &= ~(1 << band);
+	n = alf_mp_enqueue(q, &skb, 1);
 
-		return skb;
+	/* If queue is overrun fall through to drop */
+	if (n) {
+		qdisc_qstats_cpu_qlen_inc(qdisc);
+		qdisc_qstats_cpu_backlog_inc(qdisc, skb);
+		return NET_XMIT_SUCCESS;
 	}
 
-	return NULL;
+	return qdisc_drop_cpu(skb, qdisc);
 }
 
-static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
+static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 {
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
-	int band = bitmap2band[priv->bitmap];
+	struct sk_buff *skb = NULL;
+	int band;
+
+	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
+		struct alf_queue *q = band2list(priv, band);
 
-	if (band >= 0) {
-		struct sk_buff_head *list = band2list(priv, band);
+		if (alf_queue_empty(q))
+			continue;
 
-		return skb_peek(list);
+		alf_mc_dequeue(q, &skb, 1);
 	}
 
-	return NULL;
+	if (likely(skb)) {
+		qdisc_qstats_cpu_backlog_dec(qdisc, skb);
+		qdisc_bstats_cpu_update(qdisc, skb);
+		qdisc_qstats_cpu_qlen_dec(qdisc);
+	}
+
+	return skb;
 }
 
 static void pfifo_fast_reset(struct Qdisc *qdisc)
 {
-	int prio;
+	int i, band;
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
 
-	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++)
-		__qdisc_reset_queue(qdisc, band2list(priv, prio));
+	for (band = 0; band < PFIFO_FAST_BANDS; band++)
+		alf_queue_flush(band2list(priv, band));
 
-	priv->bitmap = 0;
-	qdisc->qstats.backlog = 0;
-	qdisc->q.qlen = 0;
+	for_each_possible_cpu(i) {
+		struct gnet_stats_queue *q = per_cpu_ptr(qdisc->cpu_qstats, i);
+
+		q->backlog = 0;
+		q->qlen = 0;
+	}
 }
 
 static int pfifo_fast_dump(struct Qdisc *qdisc, struct sk_buff *skb)
@@ -565,14 +559,30 @@ nla_put_failure:
 
 static int pfifo_fast_init(struct Qdisc *qdisc, struct nlattr *opt)
 {
-	int prio;
+	unsigned int qlen = qdisc_dev(qdisc)->tx_queue_len;
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
+	int prio;
+
+	/* guard against zero length rings */
+	if (!qlen)
+		return -EINVAL;
+
+	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+		unsigned int size = roundup_pow_of_two(qlen);
+		struct alf_queue *q;
 
-	for (prio = 0; prio < PFIFO_FAST_BANDS; prio++)
-		__skb_queue_head_init(band2list(priv, prio));
+		q = alf_queue_alloc(size, GFP_KERNEL);
+		if (IS_ERR_OR_NULL(q))
+			return -ENOMEM;
+
+		priv->q[prio] = q;
+	}
 
 	/* Can by-pass the queue discipline */
 	qdisc->flags |= TCQ_F_CAN_BYPASS;
+	qdisc->flags |= TCQ_F_NOLOCK;
+	qdisc->flags |= TCQ_F_CPUSTATS;
+
 	return 0;
 }
 
@@ -581,7 +591,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
 	.priv_size	=	sizeof(struct pfifo_fast_priv),
 	.enqueue	=	pfifo_fast_enqueue,
 	.dequeue	=	pfifo_fast_dequeue,
-	.peek		=	pfifo_fast_peek,
+	.peek		=	qdisc_peek_dequeued,
 	.init		=	pfifo_fast_init,
 	.reset		=	pfifo_fast_reset,
 	.dump		=	pfifo_fast_dump,


* [RFC PATCH 10/12] net: sched: helper to sum qlen
  2015-12-30 17:50 [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq John Fastabend
                   ` (8 preceding siblings ...)
  2015-12-30 17:54 ` [RFC PATCH 09/12] net: sched: pfifo_fast use alf_queue John Fastabend
@ 2015-12-30 17:54 ` John Fastabend
  2015-12-30 17:55 ` [RFC PATCH 11/12] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq John Fastabend
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 29+ messages in thread
From: John Fastabend @ 2015-12-30 17:54 UTC (permalink / raw)
  To: daniel, eric.dumazet, jhs, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev, john.fastabend

Reporting qlen when qlen is per cpu requires aggregating the per
cpu counters. This adds a helper routine for this.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |   15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 30f4c60..2c57278 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -286,6 +286,21 @@ static inline int qdisc_qlen(const struct Qdisc *q)
 	return q->q.qlen;
 }
 
+static inline int qdisc_qlen_sum(const struct Qdisc *q)
+{
+	__u32 qlen = 0;
+	int i;
+
+	if (q->flags & TCQ_F_NOLOCK) {
+		for_each_possible_cpu(i)
+			qlen += per_cpu_ptr(q->cpu_qstats, i)->qlen;
+	} else {
+		qlen = q->q.qlen;
+	}
+
+	return qlen;
+}
+
 static inline struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb)
 {
 	return (struct qdisc_skb_cb *)skb->cb;


* [RFC PATCH 11/12] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq
  2015-12-30 17:50 [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq John Fastabend
                   ` (9 preceding siblings ...)
  2015-12-30 17:54 ` [RFC PATCH 10/12] net: sched: helper to sum qlen John Fastabend
@ 2015-12-30 17:55 ` John Fastabend
  2015-12-30 17:55 ` [RFC PATCH 12/12] net: sched: pfifo_fast new option to deque multiple pkts John Fastabend
  2016-01-06 13:14 ` [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq Jamal Hadi Salim
  12 siblings, 0 replies; 29+ messages in thread
From: John Fastabend @ 2015-12-30 17:55 UTC (permalink / raw)
  To: daniel, eric.dumazet, jhs, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev, john.fastabend

The sch_mq qdisc creates a sub-qdisc per tx queue, which are then
called independently for enqueue and dequeue operations. However,
statistics are aggregated and pushed up to the "master" qdisc.

This patch adds support for any of the sub-qdiscs to be per cpu
statistic qdiscs. To handle this case add a check when calculating
stats and aggregate the per cpu stats if needed.

Also exports __gnet_stats_copy_queue() to use as a helper function.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/gen_stats.h |    3 +++
 net/core/gen_stats.c    |    9 +++++----
 net/sched/sch_mq.c      |   25 ++++++++++++++++++-------
 3 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/include/net/gen_stats.h b/include/net/gen_stats.h
index cbafa37..a902061 100644
--- a/include/net/gen_stats.h
+++ b/include/net/gen_stats.h
@@ -43,6 +43,9 @@ int gnet_stats_copy_rate_est(struct gnet_dump *d,
 int gnet_stats_copy_queue(struct gnet_dump *d,
 			  struct gnet_stats_queue __percpu *cpu_q,
 			  struct gnet_stats_queue *q, __u32 qlen);
+void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
+			     const struct gnet_stats_queue __percpu *cpu_q,
+			     const struct gnet_stats_queue *q, __u32 qlen);
 int gnet_stats_copy_app(struct gnet_dump *d, void *st, int len);
 
 int gnet_stats_finish_copy(struct gnet_dump *d);
diff --git a/net/core/gen_stats.c b/net/core/gen_stats.c
index 1e2f46a..b653a56 100644
--- a/net/core/gen_stats.c
+++ b/net/core/gen_stats.c
@@ -235,10 +235,10 @@ __gnet_stats_copy_queue_cpu(struct gnet_stats_queue *qstats,
 	}
 }
 
-static void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
-				    const struct gnet_stats_queue __percpu *cpu,
-				    const struct gnet_stats_queue *q,
-				    __u32 qlen)
+void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
+			     const struct gnet_stats_queue __percpu *cpu,
+			     const struct gnet_stats_queue *q,
+			     __u32 qlen)
 {
 	if (cpu) {
 		__gnet_stats_copy_queue_cpu(qstats, cpu);
@@ -252,6 +252,7 @@ static void __gnet_stats_copy_queue(struct gnet_stats_queue *qstats,
 
 	qstats->qlen = qlen;
 }
+EXPORT_SYMBOL(__gnet_stats_copy_queue);
 
 /**
  * gnet_stats_copy_queue - copy queue statistics into statistics TLV
diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index 3e82f04..3468317 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -17,6 +17,7 @@
 #include <linux/skbuff.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
+#include <net/sch_generic.h>
 
 struct mq_sched {
 	struct Qdisc		**qdiscs;
@@ -107,15 +108,25 @@ static int mq_dump(struct Qdisc *sch, struct sk_buff *skb)
 	memset(&sch->qstats, 0, sizeof(sch->qstats));
 
 	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+		struct gnet_stats_basic_cpu __percpu *cpu_bstats = NULL;
+		struct gnet_stats_queue __percpu *cpu_qstats = NULL;
+		__u32 qlen = 0;
+
 		qdisc = netdev_get_tx_queue(dev, ntx)->qdisc_sleeping;
 		spin_lock_bh(qdisc_lock(qdisc));
-		sch->q.qlen		+= qdisc->q.qlen;
-		sch->bstats.bytes	+= qdisc->bstats.bytes;
-		sch->bstats.packets	+= qdisc->bstats.packets;
-		sch->qstats.backlog	+= qdisc->qstats.backlog;
-		sch->qstats.drops	+= qdisc->qstats.drops;
-		sch->qstats.requeues	+= qdisc->qstats.requeues;
-		sch->qstats.overlimits	+= qdisc->qstats.overlimits;
+
+		if (qdisc_is_percpu_stats(qdisc)) {
+			cpu_bstats = qdisc->cpu_bstats;
+			cpu_qstats = qdisc->cpu_qstats;
+		}
+
+		qlen = qdisc_qlen_sum(qdisc);
+
+		__gnet_stats_copy_basic(&sch->bstats,
+					cpu_bstats, &qdisc->bstats);
+		__gnet_stats_copy_queue(&sch->qstats,
+					cpu_qstats, &qdisc->qstats, qlen);
+
 		spin_unlock_bh(qdisc_lock(qdisc));
 	}
 	return 0;


* [RFC PATCH 12/12] net: sched: pfifo_fast new option to deque multiple pkts
  2015-12-30 17:50 [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq John Fastabend
                   ` (10 preceding siblings ...)
  2015-12-30 17:55 ` [RFC PATCH 11/12] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq John Fastabend
@ 2015-12-30 17:55 ` John Fastabend
  2015-12-30 18:13   ` John Fastabend
  2016-01-06 13:14 ` [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq Jamal Hadi Salim
  12 siblings, 1 reply; 29+ messages in thread
From: John Fastabend @ 2015-12-30 17:55 UTC (permalink / raw)
  To: daniel, eric.dumazet, jhs, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev, john.fastabend

Now that pfifo_fast is using the alf_queue data structures we can
dequeue multiple skbs and save some overhead.

This works because the bulk dequeue logic accepts skb lists already.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/net/sch_generic.h |    2 +-
 net/sched/sch_generic.c   |   30 ++++++++++++++++++++----------
 2 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 2c57278..95c11ed 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -128,7 +128,7 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
 {
-	return qdisc->flags & TCQ_F_ONETXQUEUE;
+	return (qdisc->flags & TCQ_F_ONETXQUEUE) && !(qdisc->flags & TCQ_F_NOLOCK);
 }
 
 static inline int qdisc_avail_bulklimit(const struct netdev_queue *txq)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 480cf63..ec5e78e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -507,25 +507,35 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc)
 static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 {
 	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
-	struct sk_buff *skb = NULL;
-	int band;
+	struct sk_buff *skb[8+1] = {NULL};
+	int band, i, elems = 0;
 
-	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
+	if (this_cpu_ptr(qdisc->cpu_qstats)->qlen < 8)
+		return NULL;
+
+	for (band = 0; band < PFIFO_FAST_BANDS && !skb[0]; band++) {
 		struct alf_queue *q = band2list(priv, band);
 
 		if (alf_queue_empty(q))
 			continue;
 
-		alf_mc_dequeue(q, &skb, 1);
+		elems = alf_mc_dequeue(q, skb, 8);
+
+		/* link array of skbs for driver to process */
+		for (i = 0; i < elems; i++)
+			skb[i]->next = skb[i+1];
 	}
 
-	if (likely(skb)) {
-		qdisc_qstats_cpu_backlog_dec(qdisc, skb);
-		qdisc_bstats_cpu_update(qdisc, skb);
-		qdisc_qstats_cpu_qlen_dec(qdisc);
+	if (likely(skb[0])) {
+		for (i = 0; i < elems; i++) {
+			qdisc_qstats_cpu_backlog_dec(qdisc, skb[i]);
+			qdisc_bstats_cpu_update(qdisc, skb[i]);
+		}
+
+		this_cpu_ptr(qdisc->cpu_qstats)->qlen -= elems;
 	}
 
-	return skb;
+	return skb[0];
 }
 
 static void pfifo_fast_reset(struct Qdisc *qdisc)
@@ -579,7 +589,7 @@ static int pfifo_fast_init(struct Qdisc *qdisc, struct nlattr *opt)
 	}
 
 	/* Can by-pass the queue discipline */
-	qdisc->flags |= TCQ_F_CAN_BYPASS;
+	//qdisc->flags |= TCQ_F_CAN_BYPASS;
 	qdisc->flags |= TCQ_F_NOLOCK;
 	qdisc->flags |= TCQ_F_CPUSTATS;
 


* Re: [RFC PATCH 12/12] net: sched: pfifo_fast new option to deque multiple pkts
  2015-12-30 17:55 ` [RFC PATCH 12/12] net: sched: pfifo_fast new option to deque multiple pkts John Fastabend
@ 2015-12-30 18:13   ` John Fastabend
  0 siblings, 0 replies; 29+ messages in thread
From: John Fastabend @ 2015-12-30 18:13 UTC (permalink / raw)
  To: daniel, eric.dumazet, jhs, aduyck, brouer, davem; +Cc: john.r.fastabend, netdev

On 15-12-30 09:55 AM, John Fastabend wrote:
> Now that pfifo_fast is using the alf_queue data structures we can
> dequeue multiple skbs and save some overhead.
> 
> This works because the bulk dequeue logic accepts skb lists already.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---

Oops, I didn't mean to send this; it obviously doesn't work because
nothing gets dequeued until you have 8 skbs. This was just a test
patch I was looking at for perf numbers. Maybe it provides some
insight into how we could build a pfifo_bulk or add an option to
pfifo_fast to dequeue multiple pkts at a time. The trick is to
sort out how long to wait for packets to build up, or possibly
just remove this line,

+	if (this_cpu_ptr(qdisc->cpu_qstats)->qlen < 8)
+		return NULL;

And opportunistically pull packets out at the risk of over-running
the driver if those are large skbs.
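
Roughly, the opportunistic variant is just the dequeue loop from the
patch above without the qlen gate, with the burst capped (the 8 here
is arbitrary), e.g.:

	for (band = 0; band < PFIFO_FAST_BANDS && !skb[0]; band++) {
		struct alf_queue *q = band2list(priv, band);

		if (alf_queue_empty(q))
			continue;

		/* pull whatever is available, up to a small burst */
		elems = alf_mc_dequeue(q, skb, 8);

		/* link array of skbs for driver to process */
		for (i = 0; i < elems; i++)
			skb[i]->next = skb[i+1];
	}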

.John

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 05/12] net: sched: per cpu gso handlers
  2015-12-30 17:52 ` [RFC PATCH 05/12] net: sched: per cpu gso handlers John Fastabend
@ 2015-12-30 20:26   ` Jesper Dangaard Brouer
  2015-12-30 20:42     ` John Fastabend
  0 siblings, 1 reply; 29+ messages in thread
From: Jesper Dangaard Brouer @ 2015-12-30 20:26 UTC (permalink / raw)
  To: John Fastabend
  Cc: daniel, eric.dumazet, jhs, aduyck, davem, john.r.fastabend,
	netdev, brouer

On Wed, 30 Dec 2015 09:52:49 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> The net sched infrastructure has a gso ptr that points to skb structs
> that have failed to be enqueued by the device driver.

What about fixing up the naming "gso" to something else like "requeue"
in the process (or in a pre-patch)?


> This can happen when multiple cores try to push a skb onto the same
> underlying hardware queue resulting in lock contention. This case is
> handled by a cpu collision handler handle_dev_cpu_collision(). Another
> case occurs when the stack overruns the drivers low level tx queues
> capacity. Ideally these should be a rare occurrence in a well-tuned
> system but they do happen.
> 
> To handle this in the lockless case use a per cpu gso field to park
> the skb until the conflict can be resolved. Note at this point the
> skb has already been popped off the qdisc so it has to be handled
> by the infrastructure.

I generally like this idea of resolving this per cpu.  (I stalled here,
on the requeue issue, last time I implemented a lockless qdisc
approach).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 05/12] net: sched: per cpu gso handlers
  2015-12-30 20:26   ` Jesper Dangaard Brouer
@ 2015-12-30 20:42     ` John Fastabend
  0 siblings, 0 replies; 29+ messages in thread
From: John Fastabend @ 2015-12-30 20:42 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: daniel, eric.dumazet, jhs, aduyck, davem, john.r.fastabend, netdev

On 15-12-30 12:26 PM, Jesper Dangaard Brouer wrote:
> On Wed, 30 Dec 2015 09:52:49 -0800
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>> The net sched infrastructure has a gso ptr that points to skb structs
>> that have failed to be enqueued by the device driver.
> 
> What about fixing up the naming "gso" to something else like "requeue"
> in the process (or in a pre-patch)?

Sure, I'll throw a patch in front of this to rename it.

> 
> 
>> This can happen when multiple cores try to push a skb onto the same
>> underlying hardware queue resulting in lock contention. This case is
>> handled by a cpu collision handler handle_dev_cpu_collision(). Another
>> case occurs when the stack overruns the drivers low level tx queues
>> capacity. Ideally these should be a rare occurrence in a well-tuned
>> system but they do happen.
>>
>> To handle this in the lockless case use a per cpu gso field to park
>> the skb until the conflict can be resolved. Note at this point the
>> skb has already been popped off the qdisc so it has to be handled
>> by the infrastructure.
> 
> I generally like this idea of resolving this per cpu.  (I stalled here,
> on the requeue issue, last time I implemented a lockless qdisc
> approach).
> 

Great, this approach seems to work OK.
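
For reference, the per cpu "parking spot" is conceptually just a tiny
per cpu cell hanging off the Qdisc, roughly like this (the names here
are only illustrative of what the RFC does):

	struct gso_cell {			/* illustrative name */
		struct sk_buff *skb;		/* at most one parked skb per cpu */
	};

	/* in struct Qdisc, used when TCQ_F_NOLOCK is set */
	struct gso_cell __percpu *gso_cpu_skb;

so parking/retrieving an skb becomes a this_cpu_ptr() access instead
of touching a shared qdisc->gso_skb under the qdisc lock.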

On another note, even if we only get a single skb dequeued at a time
in the initial implementation, this is still a win as soon as we start
running classifiers/actions, even though with plain pfifo_fast and no
classifiers the raw throughput gain is minimal.

.John

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 06/12] net: sched: support qdisc_reset on NOLOCK qdisc
  2015-12-30 17:53 ` [RFC PATCH 06/12] net: sched: support qdisc_reset on NOLOCK qdisc John Fastabend
@ 2016-01-01  2:30   ` Alexei Starovoitov
  2016-01-03 19:37     ` John Fastabend
  2016-01-13 16:20   ` David Miller
  1 sibling, 1 reply; 29+ messages in thread
From: Alexei Starovoitov @ 2016-01-01  2:30 UTC (permalink / raw)
  To: John Fastabend
  Cc: daniel, eric.dumazet, jhs, aduyck, brouer, davem,
	john.r.fastabend, netdev

On Wed, Dec 30, 2015 at 09:53:13AM -0800, John Fastabend wrote:
> The qdisc_reset operation depends on the qdisc lock at the moment
> to halt any additions to gso_skb and statistics while the list is
> free'd and the stats zeroed.
> 
> Without the qdisc lock we can not guarantee another cpu is not in
> the process of adding a skb to one of the "cells". Here are the
> two cases we have to handle.
> 
>  case 1: qdisc_graft operation. In this case a "new" qdisc is attached
> 	 and the 'qdisc_destroy' operation is called on the old qdisc.
> 	 The destroy operation will wait a rcu grace period and call
> 	 qdisc_rcu_free(). At which point gso_cpu_skb is free'd along
> 	 with all stats so no need to zero stats and gso_cpu_skb from
> 	 the reset operation itself.
> 
> 	 Because we can not continue to call qdisc_reset before waiting
> 	 an rcu grace period so that the qdisc is detached from all
> 	 cpus simply do not call qdisc_reset() at all and let the
> 	 qdisc_destroy operation clean up the qdisc. Note, a refcnt
> 	 greater than 1 would cause the destroy operation to be
> 	 aborted however if this ever happened the reference to the
> 	 qdisc would be lost and we would have a memory leak.
> 
>  case 2: dev_deactivate sequence. This can come from a user bringing
> 	 the interface down which causes the gso_skb list to be flushed
> 	 and the qlen zero'd. At the moment this is protected by the
> 	 qdisc lock so while we clear the qlen/gso_skb fields we are
> 	 guaranteed no new skbs are added. For the lockless case
> 	 though this is not true. To resolve this move the qdisc_reset
> 	 call after the new qdisc is assigned and a grace period is
> 	 exercised to ensure no new skbs can be enqueued. Further
> 	 the RTNL lock is held so we can not get another call to
> 	 activate the qdisc while the skb lists are being free'd.
> 
> 	 Finally, fix qdisc_reset to handle the per cpu stats and
> 	 skb lists.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
...
> -	/* Prune old scheduler */
> -	if (oqdisc && atomic_read(&oqdisc->refcnt) <= 1)
> -		qdisc_reset(oqdisc);
> -
...
> -		sync_needed |= !dev->dismantle;
> +		sync_needed = true;

I think killing the above <=1 check and forcing synchronize_net() will
make qdisc destruction more reliable than it is right now.
Your commit log sounds too pessimistic :)

btw, sync_needed variable can be removed as well.
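
I.e. something along these lines (sketching dev_deactivate_many() from
memory, so the surrounding code may differ in detail):

	/* surrounding dev_deactivate_many() code paraphrased from memory */
	list_for_each_entry(dev, head, close_list) {
		netdev_for_each_tx_queue(dev, dev_deactivate_queue,
					 &noop_qdisc);
		if (dev_ingress_queue(dev))
			dev_deactivate_queue(dev, dev_ingress_queue(dev),
					     &noop_qdisc);
		dev_watchdog_down(dev);
	}

	/* always wait for outstanding qdisc-less dev_queue_xmit calls */
	synchronize_net();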

All other patches look good. Great stuff overall!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 06/12] net: sched: support qdisc_reset on NOLOCK qdisc
  2016-01-01  2:30   ` Alexei Starovoitov
@ 2016-01-03 19:37     ` John Fastabend
  0 siblings, 0 replies; 29+ messages in thread
From: John Fastabend @ 2016-01-03 19:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: daniel, eric.dumazet, jhs, aduyck, brouer, davem,
	john.r.fastabend, netdev

On 15-12-31 06:30 PM, Alexei Starovoitov wrote:
> On Wed, Dec 30, 2015 at 09:53:13AM -0800, John Fastabend wrote:
>> The qdisc_reset operation depends on the qdisc lock at the moment
>> to halt any additions to gso_skb and statistics while the list is
>> free'd and the stats zeroed.
>>
>> Without the qdisc lock we can not guarantee another cpu is not in
>> the process of adding a skb to one of the "cells". Here are the
>> two cases we have to handle.
>>
>>  case 1: qdisc_graft operation. In this case a "new" qdisc is attached
>> 	 and the 'qdisc_destroy' operation is called on the old qdisc.
>> 	 The destroy operation will wait a rcu grace period and call
>> 	 qdisc_rcu_free(). At which point gso_cpu_skb is free'd along
>> 	 with all stats so no need to zero stats and gso_cpu_skb from
>> 	 the reset operation itself.
>>
>> 	 Because we can not continue to call qdisc_reset before waiting
>> 	 an rcu grace period so that the qdisc is detached from all
>> 	 cpus simply do not call qdisc_reset() at all and let the
>> 	 qdisc_destroy operation clean up the qdisc. Note, a refcnt
>> 	 greater than 1 would cause the destroy operation to be
>> 	 aborted however if this ever happened the reference to the
>> 	 qdisc would be lost and we would have a memory leak.
>>
>>  case 2: dev_deactivate sequence. This can come from a user bringing
>> 	 the interface down which causes the gso_skb list to be flushed
>> 	 and the qlen zero'd. At the moment this is protected by the
>> 	 qdisc lock so while we clear the qlen/gso_skb fields we are
>> 	 guaranteed no new skbs are added. For the lockless case
>> 	 though this is not true. To resolve this move the qdisc_reset
>> 	 call after the new qdisc is assigned and a grace period is
>> 	 exercised to ensure no new skbs can be enqueued. Further
>> 	 the RTNL lock is held so we can not get another call to
>> 	 activate the qdisc while the skb lists are being free'd.
>>
>> 	 Finally, fix qdisc_reset to handle the per cpu stats and
>> 	 skb lists.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ...
>> -	/* Prune old scheduler */
>> -	if (oqdisc && atomic_read(&oqdisc->refcnt) <= 1)
>> -		qdisc_reset(oqdisc);
>> -
> ...
>> -		sync_needed |= !dev->dismantle;
>> +		sync_needed = true;
> 
> I think killing the above <=1 check and forcing synchronize_net() will
> make qdisc destruction more reliable than it is right now.

I'll probably do that in a follow-up series. It's a bit tricky, and
actually I'm not convinced we aren't already leaking memory before
this series in a few cases when we transition between per-txq qdiscs
and the more traditional single big qdisc. I'll take a look tomorrow
with kmemleak and other tools. My guess is that in practice people
load a qdisc and don't change it much, and the better setup for
multiqueue devices is to load your favorite qdisc under mq or mqprio.

> Your commit log sounds too pessimistic :)
> 
> btw, sync_needed variable can be removed as well.

Yep, thanks, good catch.

> 
> All other patches look good. Great stuff overall!
> 

Thanks, I'm going to add mqprio and multiq support, do some testing and
some perf work, then submit it. The alf_queue bits can be optimized
further later.

Also, once multiq support is lockless we can run filters on the egress
side without the qdisc lock overhead.

.John

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 02/12] net: sched: free per cpu bstats
  2015-12-30 17:51 ` [RFC PATCH 02/12] net: sched: free per cpu bstats John Fastabend
@ 2016-01-04 15:21   ` Daniel Borkmann
  2016-01-04 17:32     ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel Borkmann @ 2016-01-04 15:21 UTC (permalink / raw)
  To: John Fastabend, eric.dumazet, jhs, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev

On 12/30/2015 06:51 PM, John Fastabend wrote:
> When a qdisc is using per cpu stats, only the bstats are being
> freed. This also frees the qstats.
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>

Seems like a bug fix, current code seems to free this only in error
path in qdisc_create(). Should this go to -net as an individual one?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 02/12] net: sched: free per cpu bstats
  2016-01-04 15:21   ` Daniel Borkmann
@ 2016-01-04 17:32     ` Eric Dumazet
  2016-01-04 18:08       ` John Fastabend
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2016-01-04 17:32 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: John Fastabend, jhs, aduyck, brouer, davem, john.r.fastabend, netdev

On Mon, 2016-01-04 at 16:21 +0100, Daniel Borkmann wrote:
> On 12/30/2015 06:51 PM, John Fastabend wrote:
> > When a qdisc is using per cpu stats, only the bstats are being
> > freed. This also frees the qstats.
> >
> > Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> 
> Seems like a bug fix, current code seems to free this only in error
> path in qdisc_create(). Should this go to -net as an individual one?

Absolutely, this is needed in -net as ingress qdisc uses this stuff
already.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 02/12] net: sched: free per cpu bstats
  2016-01-04 17:32     ` Eric Dumazet
@ 2016-01-04 18:08       ` John Fastabend
  0 siblings, 0 replies; 29+ messages in thread
From: John Fastabend @ 2016-01-04 18:08 UTC (permalink / raw)
  To: Eric Dumazet, Daniel Borkmann
  Cc: jhs, aduyck, brouer, davem, john.r.fastabend, netdev

On 16-01-04 09:32 AM, Eric Dumazet wrote:
> On Mon, 2016-01-04 at 16:21 +0100, Daniel Borkmann wrote:
>> On 12/30/2015 06:51 PM, John Fastabend wrote:
>>> When a qdisc is using per cpu stats, only the bstats are being
>>> freed. This also frees the qstats.
>>>
>>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>>
>> Seems like a bug fix, current code seems to free this only in error
>> path in qdisc_create(). Should this go to -net as an individual one?
> 
> Absolutely, this is needed in -net as ingress qdisc uses this stuff
> already.
> 
> 

Yep, I'll push a patch for -net today.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq
  2015-12-30 17:50 [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq John Fastabend
                   ` (11 preceding siblings ...)
  2015-12-30 17:55 ` [RFC PATCH 12/12] net: sched: pfifo_fast new option to deque multiple pkts John Fastabend
@ 2016-01-06 13:14 ` Jamal Hadi Salim
  2016-01-07 23:30   ` John Fastabend
  12 siblings, 1 reply; 29+ messages in thread
From: Jamal Hadi Salim @ 2016-01-06 13:14 UTC (permalink / raw)
  To: John Fastabend, daniel, eric.dumazet, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev


Sorry for not being as responsive as I would like to be
(the man calls and I have to go).
This looks like a good (tc workshop) candidate discussion, if still
active by netdev11 time.

On 15-12-30 12:50 PM, John Fastabend wrote:
> Hi,
>
> This is a first take at removing the qdisc lock on the xmit path
> where qdiscs actually have queues of skbs. The ingress qdisc
> which is already lockless was "easy" at least in the sense that
> we did not need any lock-free data structures to hold skbs.
>

I did some testing over the holidays for a netdev11 paper submission
and the ingress qdisc side of things looks very impressive (on
an average packet size) on a single (i7 class) cpu.
It can handle about 3x what an egress side pktgen type test (not
very real life) can handle. Analysis shows the lock is killing us.
So if you are looking at low hanging fruit, the egress is
the place to look.
I have a pktgen change that may be useful for you - I will post
it next time I get cycles.
I am also a willing guinea pig (given upcoming netdev11) to do
some perf testing ;->

cheers,
jamal

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq
  2016-01-06 13:14 ` [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq Jamal Hadi Salim
@ 2016-01-07 23:30   ` John Fastabend
  0 siblings, 0 replies; 29+ messages in thread
From: John Fastabend @ 2016-01-07 23:30 UTC (permalink / raw)
  To: Jamal Hadi Salim, daniel, eric.dumazet, aduyck, brouer, davem
  Cc: john.r.fastabend, netdev

On 16-01-06 05:14 AM, Jamal Hadi Salim wrote:
> 
> Sorry for not being as responsive as I would like to be
> (the man calls and I have to go).
> This looks like a good (tc workshop) candidate discussion, if still
> active by netdev11 time.
> 
> On 15-12-30 12:50 PM, John Fastabend wrote:
>> Hi,
>>
>> This is a first take at removing the qdisc lock on the xmit path
>> where qdiscs actually have queues of skbs. The ingress qdisc
>> which is already lockless was "easy" at least in the sense that
>> we did not need any lock-free data structures to hold skbs.
>>
> 
> I did some testing over the holidays for a netdev11 paper submission
> and the ingress qdisc side of things looks very impressive (on
> an average packet size) on a single (i7 class) cpu.
> It can handle about 3x what an egress side pktgen type test (not
> very real life) can handle. Analysis shows the lock is killing us.
> So if you are looking at low hanging fruit, the egress is
> the place to look.

So this series drops the qdisc lock, which, per some older data I
have, should hopefully bring it on par with the ingress side. But
I need to retest with this latest patch series.

> I have a pktgen change that may be useful for you - I will post
> it next time I get cycles.
> I am also a willing guinea pig (given upcoming netdev11) to do
> some perf testing ;->
> 

I typically test this by putting vlans on top of a dev with the
qdisc I want to test and running pktgen on the vlans; the entry point
of the skb is pretty much where I want it in this case.

> cheers,
> jamal
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 06/12] net: sched: support qdisc_reset on NOLOCK qdisc
  2015-12-30 17:53 ` [RFC PATCH 06/12] net: sched: support qdisc_reset on NOLOCK qdisc John Fastabend
  2016-01-01  2:30   ` Alexei Starovoitov
@ 2016-01-13 16:20   ` David Miller
  2016-01-13 18:03     ` John Fastabend
  1 sibling, 1 reply; 29+ messages in thread
From: David Miller @ 2016-01-13 16:20 UTC (permalink / raw)
  To: john.fastabend
  Cc: daniel, eric.dumazet, jhs, aduyck, brouer, john.r.fastabend, netdev

From: John Fastabend <john.fastabend@gmail.com>
Date: Wed, 30 Dec 2015 09:53:13 -0800

>  case 2: dev_deactivate sequence. This can come from a user bringing
> 	 the interface down which causes the gso_skb list to be flushed
> 	 and the qlen zero'd. At the moment this is protected by the
> 	 qdisc lock so while we clear the qlen/gso_skb fields we are
> 	 guaranteed no new skbs are added. For the lockless case
> 	 though this is not true. To resolve this move the qdisc_reset
> 	 call after the new qdisc is assigned and a grace period is
> 	 exercised to ensure no new skbs can be enqueued. Further
> 	 the RTNL lock is held so we can not get another call to
> 	 activate the qdisc while the skb lists are being free'd.
> 
> 	 Finally, fix qdisc_reset to handle the per cpu stats and
> 	 skb lists.

Just wanted to note that some setups are sensitive to device
register/deregister costs.  This is why we batch register and
unregister operations in the core, so that the RCU grace period
is consolidated into one when we register/unregister a lot of
net devices.

If we now will incur a new per-device unregister RCU grace period
when the qdisc is destroyed, it could cause a regression.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 09/12] net: sched: pfifo_fast use alf_queue
  2015-12-30 17:54 ` [RFC PATCH 09/12] net: sched: pfifo_fast use alf_queue John Fastabend
@ 2016-01-13 16:24   ` David Miller
  2016-01-13 18:18     ` John Fastabend
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2016-01-13 16:24 UTC (permalink / raw)
  To: john.fastabend
  Cc: daniel, eric.dumazet, jhs, aduyck, brouer, john.r.fastabend, netdev

From: John Fastabend <john.fastabend@gmail.com>
Date: Wed, 30 Dec 2015 09:54:20 -0800

> This also removes the logic used to pick the next band to dequeue
> from and instead just checks each alf_queue for packets from
> top priority to lowest. This might need to be a bit more clever
> but seems to work for now.

I suspect we won't need to be more clever, there's only 3 bands
after all and the head/tail tests should be fast enough.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 06/12] net: sched: support qdisc_reset on NOLOCK qdisc
  2016-01-13 16:20   ` David Miller
@ 2016-01-13 18:03     ` John Fastabend
  2016-01-15 19:44       ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: John Fastabend @ 2016-01-13 18:03 UTC (permalink / raw)
  To: David Miller
  Cc: daniel, eric.dumazet, jhs, aduyck, brouer, john.r.fastabend, netdev

On 16-01-13 08:20 AM, David Miller wrote:
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Wed, 30 Dec 2015 09:53:13 -0800
> 
>>  case 2: dev_deactivate sequence. This can come from a user bringing
>> 	 the interface down which causes the gso_skb list to be flushed
>> 	 and the qlen zero'd. At the moment this is protected by the
>> 	 qdisc lock so while we clear the qlen/gso_skb fields we are
>> 	 guaranteed no new skbs are added. For the lockless case
>> 	 though this is not true. To resolve this move the qdisc_reset
>> 	 call after the new qdisc is assigned and a grace period is
>> 	 exercised to ensure no new skbs can be enqueued. Further
>> 	 the RTNL lock is held so we can not get another call to
>> 	 activate the qdisc while the skb lists are being free'd.
>>
>> 	 Finally, fix qdisc_reset to handle the per cpu stats and
>> 	 skb lists.
> 
> Just wanted to note that some setups are sensitive to device
> register/deregister costs.  This is why we batch register and
> unregister operations in the core, so that the RCU grace period
> is consolidated into one when we register/unregister a lot of
> net devices.
> 
> If we now will incur a new per-device unregister RCU grace period
> when the qdisc is destroyed, it could cause a regression.
> 

It adds a synchronize_net in the error case for many users of
unregister_netdevice(). I think this should be rare and I believe
it's OK to add the extra sync net in these cases. For example this
may happen when we try to add a tunnel and __dev_get_by_name() fails.
But if you're worried about bring-up/tear-down performance I think you
should be using ifindex numbers and also not fat-fingering dev
names on the CLI.

Also there are a few drivers still doing their own walking of lists
and calling unregister_netdevice() directly instead of the better
APIs like unregister_netdevice_queue() and friends. I can patch these
drivers if that helps; it's a mechanical change, but I'm not super
excited about testing things like the caif driver ;)
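
The mechanical change is the usual batching pattern, roughly (sketch
only; "port"/"priv" stand in for whatever per-driver list is walked):

	LIST_HEAD(unreg_list);

	/* "port"/"priv" are placeholders for the driver's own list.
	 * Instead of calling unregister_netdevice(port->dev) per device ...
	 */
	list_for_each_entry(port, &priv->ports, list)
		unregister_netdevice_queue(port->dev, &unreg_list);

	/* ... batch them so the rcu grace period is paid once */
	unregister_netdevice_many(&unreg_list);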

Further, just looking at it now, there are three calls to sync net in
the dev down paths. It seems we should be able to remove at least one
of those if we re-organize the teardown a bit better. But that is
another patch series.

.John

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 09/12] net: sched: pfifo_fast use alf_queue
  2016-01-13 16:24   ` David Miller
@ 2016-01-13 18:18     ` John Fastabend
  0 siblings, 0 replies; 29+ messages in thread
From: John Fastabend @ 2016-01-13 18:18 UTC (permalink / raw)
  To: David Miller
  Cc: daniel, eric.dumazet, jhs, aduyck, brouer, john.r.fastabend, netdev

On 16-01-13 08:24 AM, David Miller wrote:
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Wed, 30 Dec 2015 09:54:20 -0800
> 
>> This also removes the logic used to pick the next band to dequeue
>> from and instead just checks each alf_queue for packets from
>> top priority to lowest. This might need to be a bit more clever
>> but seems to work for now.
> 
> I suspect we won't need to be more clever, there's only 3 bands
> after all and the head/tail tests should be fast enough.
> 

Even with the alf_dequeue operation dequeueing a single skb at a time
and iterating over the bands as I did here, I see a perf improvement
on my desktop:

threads 	     mq + pfifo_fast

		before		after

1		1.70 Mpps	 2.00 Mpps
2		3.15 Mpps	 3.90 Mpps
4		4.70 Mpps	 6.98 Mpps
8		9.57 Mpps	11.62 Mpps

This is using my previously posted pktgen patch with bulking set to
0 in both cases. This doesn't really say anything about the contention
cases, etc., so I'll do some more testing before the merge window opens.
Also, my kernel isn't really optimized; I had some of the kernel hacking
debug options enabled, etc. It at least looks promising though, and
dequeueing more than a single skb out of pfifo_fast should help.

Something like,

static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
{
        struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
        struct sk_buff *skb[8+1] = {NULL};	/* zeroed so skb[n] terminates the list */
        int band, n = 0, i;

        for (band = 0; band < PFIFO_FAST_BANDS && !skb[0]; band++) {
                struct alf_queue *q = band2list(priv, band);

                if (alf_queue_empty(q))
                        continue;

                n = alf_mc_dequeue(q, skb, 8); /* 4, 8, or something */

                /* link the skbs so the driver sees a list */
                for (i = 0; i < n; i++)
                        skb[i]->next = skb[i + 1];
        }

        /* per cpu backlog/bstats/qlen accounting as in the patch, then: */
        return skb[0];
}

.John

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 01/12] lib: array based lock free queue
  2015-12-30 17:51 ` [RFC PATCH 01/12] lib: array based lock free queue John Fastabend
@ 2016-01-13 19:28   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 29+ messages in thread
From: Jesper Dangaard Brouer @ 2016-01-13 19:28 UTC (permalink / raw)
  To: John Fastabend
  Cc: daniel, eric.dumazet, jhs, aduyck, davem, john.r.fastabend,
	netdev, brouer


On Wed, 30 Dec 2015 09:51:03 -0800 John Fastabend <john.fastabend@gmail.com> wrote:

> Initial implementation of an array based lock free queue. This work
> was originally done by Jesper Dangaard Brouer; I've grabbed it and made
> only minor tweaks at the moment, and plan to use it with the 'tc'
> subsystem although it is general enough to be used elsewhere.

First, thanks for using my work. P.S. Hannes also deserves some credit
for developing this queue.  And Alex Duyck helped review it (and found
some bugs ;-))

A question about the qdisc usage context: is it guaranteed that qdisc
enqueue and dequeue always run either 1) in softirq context or 2) with
softirq/BH disabled?

This is important because (as noted in a comment), alf_queue is:

 Not preemption safe. Multiple CPUs can enqueue elements, but the
 same CPU is not allowed to be preempted and access the same
 queue. Due to how the tail is updated, this can result in a soft
 lock-up
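
I.e. a producer running outside of softirq context would need to keep
BH (and with it preemption) off around the call, just to illustrate
the constraint:

	/* illustrative usage only */
	void *ptr[1] = { skb };

	local_bh_disable();	/* stay on this cpu, no softirq re-entry */
	alf_mp_enqueue(q, ptr, 1);
	local_bh_enable();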


> Certainly this implementation can be further optimized and improved
> but it is a good base implementation.

Agreed. E.g. the helpers you chose for enqueue/dequeue (store/load)
might not be the best; they look optimal (loop unrolling etc.) but the
extra icache usage cancels out the performance gain, at least in my
qmempool testing.
 But let's not get into these details now; as you say, it is a good base
implementation.


> diff --git a/include/linux/alf_queue.h b/include/linux/alf_queue.h
> new file mode 100644
> index 0000000..dac304e
> --- /dev/null
> +++ b/include/linux/alf_queue.h
> @@ -0,0 +1,368 @@
[...]
> +
> +/* Main Multi-Producer ENQUEUE
> + *
[...]
> + *
> + * Not preemption safe. Multiple CPUs can enqueue elements, but the
> + * same CPU is not allowed to be preempted and access the same
> + * queue. Due to how the tail is updated, this can result in a soft
> + * lock-up. (Same goes for alf_mc_dequeue).
> + */
> +static inline int
> +alf_mp_enqueue(const u32 n;
> +	       struct alf_queue *q, void *ptr[n], const u32 n)
> +{
> +	u32 p_head, p_next, c_tail, space;
> +
> +	/* Reserve part of the array for enqueue STORE/WRITE */
> +	do {
> +		p_head = ACCESS_ONCE(q->producer.head);
> +		c_tail = ACCESS_ONCE(q->consumer.tail);/* as smp_load_aquire */
> +
> +		space = q->size + c_tail - p_head;
> +		if (n > space)
> +			return 0;
> +
> +		p_next = p_head + n;
> +	}
> +	while (unlikely(cmpxchg(&q->producer.head, p_head, p_next) != p_head));
> +
> +	/* The memory barrier of smp_load_acquire(&q->consumer.tail)
> +	 * is satisfied by cmpxchg implicit full memory barrier
> +	 */
> +
> +	/* STORE the elems into the queue array */
> +	__helper_alf_enqueue_store(p_head, q, ptr, n);
> +	smp_wmb(); /* Write-Memory-Barrier matching dequeue LOADs */
> +
> +	/* Wait for other concurrent preceding enqueues not yet done,
> +	 * this part make us none-wait-free and could be problematic
> +	 * in case of congestion with many CPUs
> +	 */
> +	while (unlikely(ACCESS_ONCE(q->producer.tail) != p_head))
> +		cpu_relax();
> +	/* Mark this enq done and avail for consumption */
> +	ACCESS_ONCE(q->producer.tail) = p_next;
> +
> +	return n;
> +}

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 06/12] net: sched: support qdisc_reset on NOLOCK qdisc
  2016-01-13 18:03     ` John Fastabend
@ 2016-01-15 19:44       ` David Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David Miller @ 2016-01-15 19:44 UTC (permalink / raw)
  To: john.fastabend
  Cc: daniel, eric.dumazet, jhs, aduyck, brouer, john.r.fastabend, netdev

From: John Fastabend <john.fastabend@gmail.com>
Date: Wed, 13 Jan 2016 10:03:54 -0800

> On 16-01-13 08:20 AM, David Miller wrote:
>> From: John Fastabend <john.fastabend@gmail.com>
>> Date: Wed, 30 Dec 2015 09:53:13 -0800
>> 
>>>  case 2: dev_deactivate sequence. This can come from a user bringing
>>> 	 the interface down which causes the gso_skb list to be flushed
>>> 	 and the qlen zero'd. At the moment this is protected by the
>>> 	 qdisc lock so while we clear the qlen/gso_skb fields we are
>>> 	 guaranteed no new skbs are added. For the lockless case
>>> 	 though this is not true. To resolve this move the qdisc_reset
>>> 	 call after the new qdisc is assigned and a grace period is
>>> 	 exercised to ensure no new skbs can be enqueued. Further
>>> 	 the RTNL lock is held so we can not get another call to
>>> 	 activate the qdisc while the skb lists are being free'd.
>>>
>>> 	 Finally, fix qdisc_reset to handle the per cpu stats and
>>> 	 skb lists.
>> 
>> Just wanted to note that some setups are sensitive to device
>> register/deregister costs.  This is why we batch register and
>> unregister operations in the core, so that the RCU grace period
>> is consolidated into one when we register/unregister a lot of
>> net devices.
>> 
>> If we now will incur a new per-device unregister RCU grace period
>> when the qdisc is destroyed, it could cause a regression.
>> 
> 
> It adds a synchronize_net in the error case for many users of
> unregister_netdevice(). I think this should be rare and I believe
> it's OK to add the extra sync net in these cases. For example this
> may happen when we try to add a tunnel and __dev_get_by_name() fails.
> But if you're worried about bring-up/tear-down performance I think you
> should be using ifindex numbers and also not fat-fingering dev
> names on the CLI.

Ok, agreed.

> Also there are a few drivers still doing their own walking of lists
> and calling unregister_netdevice() directly instead of the better
> APIs like unregister_netdevice_queue() and friends. I can patch these
> drivers if that helps; it's a mechanical change, but I'm not super
> excited about testing things like the caif driver ;)

No, that's not necessary.  If anyone works on that, I'd rather it be
someone who is interested in the individual drivers and therefore has
the ability to test it easily.

> Further, just looking at it now, there are three calls to sync net in
> the dev down paths. It seems we should be able to remove at least one
> of those if we re-organize the teardown a bit better. But that is
> another patch series.

Indeed, there isn't any reason why these can't be consolidated into
two or even just one.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2016-01-15 19:44 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-30 17:50 [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq John Fastabend
2015-12-30 17:51 ` [RFC PATCH 01/12] lib: array based lock free queue John Fastabend
2016-01-13 19:28   ` Jesper Dangaard Brouer
2015-12-30 17:51 ` [RFC PATCH 02/12] net: sched: free per cpu bstats John Fastabend
2016-01-04 15:21   ` Daniel Borkmann
2016-01-04 17:32     ` Eric Dumazet
2016-01-04 18:08       ` John Fastabend
2015-12-30 17:51 ` [RFC PATCH 03/12] net: sched: allow qdiscs to handle locking John Fastabend
2015-12-30 17:52 ` [RFC PATCH 04/12] net: sched: provide per cpu qstat helpers John Fastabend
2015-12-30 17:52 ` [RFC PATCH 05/12] net: sched: per cpu gso handlers John Fastabend
2015-12-30 20:26   ` Jesper Dangaard Brouer
2015-12-30 20:42     ` John Fastabend
2015-12-30 17:53 ` [RFC PATCH 06/12] net: sched: support qdisc_reset on NOLOCK qdisc John Fastabend
2016-01-01  2:30   ` Alexei Starovoitov
2016-01-03 19:37     ` John Fastabend
2016-01-13 16:20   ` David Miller
2016-01-13 18:03     ` John Fastabend
2016-01-15 19:44       ` David Miller
2015-12-30 17:53 ` [RFC PATCH 07/12] net: sched: qdisc_qlen for per cpu logic John Fastabend
2015-12-30 17:53 ` [RFC PATCH 08/12] net: sched: a dflt qdisc may be used with per cpu stats John Fastabend
2015-12-30 17:54 ` [RFC PATCH 09/12] net: sched: pfifo_fast use alf_queue John Fastabend
2016-01-13 16:24   ` David Miller
2016-01-13 18:18     ` John Fastabend
2015-12-30 17:54 ` [RFC PATCH 10/12] net: sched: helper to sum qlen John Fastabend
2015-12-30 17:55 ` [RFC PATCH 11/12] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq John Fastabend
2015-12-30 17:55 ` [RFC PATCH 12/12] net: sched: pfifo_fast new option to deque multiple pkts John Fastabend
2015-12-30 18:13   ` John Fastabend
2016-01-06 13:14 ` [RFC PATCH 00/12] drop the qdisc lock for pfifo_fast/mq Jamal Hadi Salim
2016-01-07 23:30   ` John Fastabend
