* [PATCH v5 0/2] skb_array: array based FIFO for skbs
From: Michael S. Tsirkin @ 2016-05-23 10:43 UTC
  To: linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Eric Dumazet, davem, netdev,
	Steven Rostedt, brouer

This is in response to the proposal by Jason to make the tun
rx packet queue lockless using a circular buffer.
My testing seems to show that, at least for the common use case in
networking, which isn't lockless, a circular buffer with indices does
not perform that well: each index access causes a cache line to bounce
between CPUs, and the data dependency on the index causes stalls.

By comparison, an array of pointers where NULL means invalid and
!NULL means valid can be updated without fiddling with barriers at all
and does not have this issue.
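
To illustrate the difference (a simplified sketch, not the patch
below; the field layout is hypothetical):

	/* Index-based ring: to detect full/empty, the producer must
	 * read the consumer's index and vice versa, so both index
	 * cache lines bounce between the two CPUs on every operation.
	 */
	struct index_ring {
		unsigned int head;	/* producer writes, consumer reads */
		unsigned int tail;	/* consumer writes, producer reads */
		void *queue[256];
	};

	/* NULL-means-free ring: each side detects full/empty from the
	 * slot it is about to use, so neither CPU ever has to read the
	 * other side's index.
	 */
	struct null_slot_ring {
		unsigned int head;	/* private to the producer */
		unsigned int tail;	/* private to the consumer */
		void *queue[256];	/* NULL = free, !NULL = valid */
	};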

On the flip side, large queues may cause cache pressure: tun has a
queue of 1000 entries by default, and that's 8K of pointers.  At this
point I'm not sure this can be solved efficiently; the correct
solution might be sizing the queues appropriately.

Here's an implementation of this idea: it can be used more
or less whenever sk_buff_head can be used, except you need
to know the queue size in advance.
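
For example, a caller might use it like this (a sketch against the
API in patch 1; error handling trimmed):

	struct skb_array a;
	struct sk_buff *skb;

	if (skb_array_init(&a, 128, GFP_KERNEL))
		return -ENOMEM;

	/* producer context; skb obtained elsewhere (e.g. rx path) */
	if (skb_array_produce_bh(&a, skb))
		kfree_skb(skb);		/* -ENOSPC: queue full */

	/* consumer context */
	while ((skb = skb_array_consume_bh(&a)))
		netif_receive_skb(skb);

	/* drain before cleanup: it frees the array, not the skbs */
	skb_array_cleanup(&a);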

It's using skb pointers, but switching to void * would be easy at the
cost of type safety, though it appears that people want lockless push
etc., so I'm not sure of the value.

I didn't implement resizing but it's possible by holding
both consumer and producer locks.
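
A sketch of what that could look like (hypothetical helper, not part
of this series; assumes the new size can hold all queued entries):

	static int skb_array_resize(struct skb_array *a, int size, gfp_t gfp)
	{
		struct sk_buff **queue, **old;
		int i, n = 0;

		queue = kzalloc(ALIGN(size * sizeof(*queue), SMP_CACHE_BYTES),
				gfp);
		if (!queue)
			return -ENOMEM;

		spin_lock_bh(&a->producer_lock);
		spin_lock_bh(&a->consumer_lock);
		/* Both locks held: nobody can produce or consume now. */
		for (i = 0; i < a->size && n < size; i++) {
			struct sk_buff *skb;

			skb = a->queue[(a->consumer + i) % a->size];
			if (skb)
				queue[n++] = skb;	/* keep FIFO order */
		}
		old = a->queue;
		a->queue = queue;
		a->size = size;
		a->consumer = 0;
		a->producer = n < size ? n : 0;	/* wrap if now full */
		spin_unlock_bh(&a->consumer_lock);
		spin_unlock_bh(&a->producer_lock);
		kfree(old);
		return 0;
	}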

I think this code works fine without any extra memory barriers, since
we always read and write the same location, so the accesses cannot be
reordered.
Multiple writes of the same value into memory would mess things up for
us; I don't think compilers would do that, though.
But if people feel it's better to be safe wrt compiler optimizations,
specifying queue as volatile would probably do it in a cleaner way
than converting all accesses to READ_ONCE/WRITE_ONCE. Thoughts?
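
For reference, the once-accessor version of the produce path would
look roughly like this (a sketch for comparison only, not what the
patch below does; the name is made up):

	static inline int __skb_array_produce_once(struct skb_array *a,
						   struct sk_buff *skb)
	{
		/* READ_ONCE/WRITE_ONCE stop the compiler from fusing,
		 * tearing or hoisting the slot accesses.
		 */
		if (READ_ONCE(a->queue[a->producer]))
			return -ENOSPC;

		WRITE_ONCE(a->queue[a->producer], skb);
		if (unlikely(++a->producer >= a->size))
			a->producer = 0;
		return 0;
	}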

The only issue is with calls within a loop using the __skb_array_XXX
accessors - in theory the compiler could hoist the accesses out of the
loop.

Following volatile-considered-harmful.txt I merely
documented that callers that busy-poll should invoke cpu_relax().
Most people will use the external skb_array_XXX APIs, which take a
spinlock, so this should not be an issue for them.
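
For instance, a busy-polling consumer using the internal accessors
would look like this (sketch; the cpu_relax() doubles as the compiler
barrier that forces the peek to be re-read):

	struct sk_buff *skb;

	/* 'a' is an initialized struct skb_array */
	while (!(skb = __skb_array_peek(&a)))
		cpu_relax();
	__skb_array_consume(&a);	/* peek returned !NULL, so valid */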

changes since v4 (v3 was never posted):
	documentation
	dropped SKB_ARRAY_MIN_SIZE heuristic
	unit test (in userspace, included as patch 2)

changes since v2:
        fixed integer overflow pointed out by Eric.
        added some comments.

changes since v1:
        fixed bug pointed out by Eric.


Michael S. Tsirkin (2):
  skb_array: array based FIFO for skbs
  skb_array: ring test

 include/linux/skb_array.h         | 127 +++++++++++++++++++++++++++++
 tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
 tools/virtio/ringtest/Makefile    |   4 +-
 3 files changed, 297 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/skb_array.h
 create mode 100644 tools/virtio/ringtest/skb_array.c

-- 
MST


* [PATCH v5 1/2] skb_array: array based FIFO for skbs
From: Michael S. Tsirkin @ 2016-05-23 10:43 UTC
  To: linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Eric Dumazet, davem, netdev,
	Steven Rostedt, brouer

A simple array based FIFO of pointers.  Intended for the net stack,
so it uses skbs for type safety, but we can replace this with void *
if others find it useful outside of the net stack.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/linux/skb_array.h | 127 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 127 insertions(+)
 create mode 100644 include/linux/skb_array.h

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
new file mode 100644
index 0000000..d2535ad
--- /dev/null
+++ b/include/linux/skb_array.h
@@ -0,0 +1,127 @@
+/*
+ *	Author:
+ *		Michael S. Tsirkin <mst@redhat.com>
+ *
+ *	This program is free software; you can redistribute it and/or modify it
+ *	under the terms of the GNU General Public License as published by the
+ *	Free Software Foundation; either version 2 of the License, or (at your
+ *	option) any later version.
+ *
+ *	This is a limited-size FIFO maintaining sk_buff structures in FIFO
+ *	order, with one CPU producing entries and another consuming entries
+ *	from a FIFO.
+ *
+ *	This implementation tries to minimize cache-contention when there is a
+ *	single producer and a single consumer CPU.
+ */
+
+#ifndef _LINUX_SKB_ARRAY_H
+#define _LINUX_SKB_ARRAY_H 1
+
+#ifdef __KERNEL__
+#include <linux/spinlock.h>
+#include <linux/cache.h>
+#include <linux/types.h>
+#include <linux/compiler.h>
+#include <linux/cache.h>
+#include <linux/slab.h>
+#include <asm/errno.h>
+#endif
+
+struct sk_buff;
+
+struct skb_array {
+	int producer ____cacheline_aligned_in_smp;
+	spinlock_t producer_lock;
+	int consumer ____cacheline_aligned_in_smp;
+	spinlock_t consumer_lock;
+	/* Shared consumer/producer data */
+	int size ____cacheline_aligned_in_smp; /* max entries in queue */
+	struct sk_buff **queue;
+};
+
+/* Note: callers invoking this in a loop must use a compiler barrier,
+ * for example cpu_relax().
+ */
+static inline bool __skb_array_full(struct skb_array *a)
+{
+	return a->queue[a->producer];
+}
+
+/* Note: callers invoking this in a loop must use a compiler barrier,
+ * for example cpu_relax().
+ */
+static inline int __skb_array_produce(struct skb_array *a,
+				       struct sk_buff *skb)
+{
+	if (__skb_array_full(a))
+		return -ENOSPC;
+
+	a->queue[a->producer++] = skb;
+	if (unlikely(a->producer >= a->size))
+		a->producer = 0;
+	return 0;
+}
+
+static inline int skb_array_produce_bh(struct skb_array *a,
+				       struct sk_buff *skb)
+{
+	int ret;
+
+	spin_lock_bh(&a->producer_lock);
+	ret = __skb_array_produce(a, skb);
+	spin_unlock_bh(&a->producer_lock);
+
+	return ret;
+}
+
+/* Note: callers invoking this in a loop must use a compiler barrier,
+ * for example cpu_relax().
+ */
+static inline struct sk_buff *__skb_array_peek(struct skb_array *a)
+{
+	return a->queue[a->consumer];
+}
+
+/* Must only be called after __skb_array_peek returned !NULL */
+static inline void __skb_array_consume(struct skb_array *a)
+{
+	a->queue[a->consumer++] = NULL;
+	if (unlikely(a->consumer >= a->size))
+		a->consumer = 0;
+}
+
+static inline struct sk_buff *skb_array_consume_bh(struct skb_array *a)
+{
+	struct sk_buff *skb;
+
+	spin_lock_bh(&a->consumer_lock);
+	skb = __skb_array_peek(a);
+	if (skb)
+		__skb_array_consume(a);
+	spin_unlock_bh(&a->consumer_lock);
+
+	return skb;
+}
+
+static inline int skb_array_init(struct skb_array *a, int size, gfp_t gfp)
+{
+	a->queue = kzalloc(ALIGN(size * sizeof *(a->queue), SMP_CACHE_BYTES),
+			   gfp);
+	if (!a->queue)
+		return -ENOMEM;
+
+	a->size = size;
+	a->producer = a->consumer = 0;
+	spin_lock_init(&a->producer_lock);
+	spin_lock_init(&a->consumer_lock);
+
+	return 0;
+}
+
+static inline void skb_array_cleanup(struct skb_array *a)
+{
+	kfree(a->queue);
+}
+
+#endif /* _LINUX_SKB_ARRAY_H  */
-- 
MST


* [PATCH v5 2/2] skb_array: ring test
From: Michael S. Tsirkin @ 2016-05-23 10:43 UTC
  To: linux-kernel
  Cc: Michael S. Tsirkin, Jason Wang, Eric Dumazet, davem, netdev,
	Steven Rostedt, brouer

Add ringtest based unit test for skb array.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
 tools/virtio/ringtest/Makefile    |   4 +-
 2 files changed, 170 insertions(+), 1 deletion(-)
 create mode 100644 tools/virtio/ringtest/skb_array.c

diff --git a/tools/virtio/ringtest/skb_array.c b/tools/virtio/ringtest/skb_array.c
new file mode 100644
index 0000000..4ab7e31
--- /dev/null
+++ b/tools/virtio/ringtest/skb_array.c
@@ -0,0 +1,167 @@
+#define _GNU_SOURCE
+#include "main.h"
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <pthread.h>
+#include <malloc.h>
+#include <assert.h>
+#include <errno.h>
+#include <limits.h>
+
+struct sk_buff;
+#define SMP_CACHE_BYTES 64
+#define cache_line_size() SMP_CACHE_BYTES
+#define ____cacheline_aligned_in_smp __attribute__ ((aligned (SMP_CACHE_BYTES)))
+#define unlikely(x)    (__builtin_expect(!!(x), 0))
+#define ALIGN(x, a) (((x) + (a) - 1) / (a) * (a))
+typedef pthread_spinlock_t  spinlock_t;
+
+typedef int gfp_t;
+static void *kzalloc(unsigned size, gfp_t gfp)
+{
+	void *p = memalign(64, size);
+	if (!p)
+		return p;
+	memset(p, 0, size);
+
+	return p;
+}
+
+static void kfree(void *p)
+{
+	if (p)
+		free(p);
+}
+
+static void spin_lock_init(spinlock_t *lock)
+{
+	int r = pthread_spin_init(lock, 0);
+	assert(!r);
+}
+
+static void spin_lock_bh(spinlock_t *lock)
+{
+	int ret = pthread_spin_lock(lock);
+	assert(!ret);
+}
+
+static void spin_unlock_bh(spinlock_t *lock)
+{
+	int ret = pthread_spin_unlock(lock);
+	assert(!ret);
+}
+
+#include "../../../include/linux/skb_array.h"
+
+static unsigned long long headcnt, tailcnt;
+static struct skb_array array ____cacheline_aligned_in_smp;
+
+/* implemented by ring */
+void alloc_ring(void)
+{
+	int ret = skb_array_init(&array, ring_size, 0);
+	assert(!ret);
+}
+
+/* guest side */
+int add_inbuf(unsigned len, void *buf, void *datap)
+{
+	int ret;
+	
+	assert(headcnt - tailcnt <= ring_size);
+	ret = __skb_array_produce(&array, buf);
+	if (ret >= 0) {
+		ret = 0;
+		headcnt++;
+	}
+
+	return ret;
+}
+
+/*
+ * The skb_array API provides no way for the producer to find out whether a
+ * given buffer was consumed.  Our tests merely require that a successful
+ * get_buf implies that an add_inbuf succeeded in the past and that a future
+ * add_inbuf will succeed; fake it accordingly.
+ */
+void *get_buf(unsigned *lenp, void **bufp)
+{
+	void *datap;
+
+	if (tailcnt == headcnt || __skb_array_full(&array))
+		datap = NULL;
+	else {
+		datap = "Buffer\n";
+		++tailcnt;
+	}
+
+	return datap;
+}
+
+void poll_used(void)
+{
+	void *b;
+
+	do {
+		if (tailcnt == headcnt || __skb_array_full(&array)) {
+			b = NULL;
+			barrier();
+		} else {
+			b = "Buffer\n";
+		}
+	} while (!b);
+}
+
+void disable_call()
+{
+	assert(0);
+}
+
+bool enable_call()
+{
+	assert(0);
+}
+
+void kick_available(void)
+{
+	assert(0);
+}
+
+/* host side */
+void disable_kick()
+{
+	assert(0);
+}
+
+bool enable_kick()
+{
+	assert(0);
+}
+
+void poll_avail(void)
+{
+	void *b;
+
+	do {
+		b = __skb_array_peek(&array);
+		barrier();
+	} while (!b);
+}
+
+bool use_buf(unsigned *lenp, void **bufp)
+{
+	struct sk_buff *skb;
+
+	skb = __skb_array_peek(&array);
+	if (skb) {
+		__skb_array_consume(&array);
+	}
+
+	return skb;
+}
+
+void call_used(void)
+{
+	assert(0);
+}
diff --git a/tools/virtio/ringtest/Makefile b/tools/virtio/ringtest/Makefile
index 6ba7455..87e58cf 100644
--- a/tools/virtio/ringtest/Makefile
+++ b/tools/virtio/ringtest/Makefile
@@ -1,6 +1,6 @@
 all:
 
-all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder
+all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder skb_array
 
 CFLAGS += -Wall
 CFLAGS += -pthread -O2 -ggdb
@@ -8,6 +8,7 @@ LDFLAGS += -pthread -O2 -ggdb
 
 main.o: main.c main.h
 ring.o: ring.c main.h
+skb_array.o: skb_array.c main.h ../../../include/linux/skb_array.h
 virtio_ring_0_9.o: virtio_ring_0_9.c main.h
 virtio_ring_poll.o: virtio_ring_poll.c virtio_ring_0_9.c main.h
 virtio_ring_inorder.o: virtio_ring_inorder.c virtio_ring_0_9.c main.h
@@ -15,6 +16,7 @@ ring: ring.o main.o
 virtio_ring_0_9: virtio_ring_0_9.o main.o
 virtio_ring_poll: virtio_ring_poll.o main.o
 virtio_ring_inorder: virtio_ring_inorder.o main.o
+skb_array: skb_array.o main.o
 clean:
 	-rm main.o
 	-rm ring.o ring
-- 
MST


* Re: [PATCH v5 2/2] skb_array: ring test
From: Jesper Dangaard Brouer @ 2016-05-23 13:09 UTC
  To: Michael S. Tsirkin
  Cc: linux-kernel, Jason Wang, Eric Dumazet, davem, netdev,
	Steven Rostedt, brouer

On Mon, 23 May 2016 13:43:46 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> Add ringtest based unit test for skb array.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
>  tools/virtio/ringtest/Makefile    |   4 +-

The patch didn't apply cleanly to the Makefile, as you also seem to
have "virtio_ring_inorder"; I applied it manually.

I chdir'ed to tools/virtio/ringtest/ and could compile "skb_array",
BUT how do I use it??? (the README is not helpful)

What is the "output", are there any performance measurement results?


> diff --git a/tools/virtio/ringtest/Makefile b/tools/virtio/ringtest/Makefile
> index 6ba7455..87e58cf 100644
> --- a/tools/virtio/ringtest/Makefile
> +++ b/tools/virtio/ringtest/Makefile
> @@ -1,6 +1,6 @@
>  all:
>  
> -all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder
> +all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder skb_array
                                              ^^^^^^^^^^^^^^^^^^^
>  
>  CFLAGS += -Wall
>  CFLAGS += -pthread -O2 -ggdb
> @@ -8,6 +8,7 @@ LDFLAGS += -pthread -O2 -ggdb
>  
>  main.o: main.c main.h
>  ring.o: ring.c main.h
> +skb_array.o: skb_array.c main.h ../../../include/linux/skb_array.h
>  virtio_ring_0_9.o: virtio_ring_0_9.c main.h
>  virtio_ring_poll.o: virtio_ring_poll.c virtio_ring_0_9.c main.h
>  virtio_ring_inorder.o: virtio_ring_inorder.c virtio_ring_0_9.c main.h
> @@ -15,6 +16,7 @@ ring: ring.o main.o
>  virtio_ring_0_9: virtio_ring_0_9.o main.o
>  virtio_ring_poll: virtio_ring_poll.o main.o
>  virtio_ring_inorder: virtio_ring_inorder.o main.o
   ^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^
> +skb_array: skb_array.o main.o
>  clean:
>  	-rm main.o
>  	-rm ring.o ring


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH v5 0/2] skb_array: array based FIFO for skbs
From: Eric Dumazet @ 2016-05-23 13:31 UTC
  To: Michael S. Tsirkin
  Cc: linux-kernel, Jason Wang, davem, netdev, Steven Rostedt, brouer

On Mon, 2016-05-23 at 13:43 +0300, Michael S. Tsirkin wrote:
> This is in response to the proposal by Jason to make tun
> rx packet queue lockless using a circular buffer.
> My testing seems to show that at least for the common usecase
> in networking, which isn't lockless, circular buffer
> with indices does not perform that well, because
> each index access causes a cache line to bounce between
> CPUs, and index access causes stalls due to the dependency.
> 
> By comparison, an array of pointers where NULL means invalid
> and !NULL means valid, can be updated without messing up barriers
> at all and does not have this issue.

Note that both consumers and producers write to the array, so under
light load (like TCP_RR) there are 2 cache lines used by the producers
and 2 cache lines used by the consumers, with potential bouncing.

On the other hand, the traditional sk_buff_head has one cache line,
holding the spinlock and the list head/tail.

We might use the 'shared cache line':

+       /* Shared consumer/producer data */
+       int size ____cacheline_aligned_in_smp; /* max entries in queue */
+       struct sk_buff **queue;

to put some fast path there involving a single cache line access when
the queue has 0 or 1 items.
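
One way to read that suggestion (purely a layout sketch - nothing
like this was implemented in this series, and keeping FIFO order
between the fast slot and the array is the hard part):

	struct skb_array {
		int producer ____cacheline_aligned_in_smp;
		spinlock_t producer_lock;
		int consumer ____cacheline_aligned_in_smp;
		spinlock_t consumer_lock;
		/* Shared consumer/producer data */
		int size ____cacheline_aligned_in_smp;
		struct sk_buff *fast;	/* hypothetical 0/1-item slot */
		struct sk_buff **queue;
	};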


* Re: [PATCH v5 0/2] skb_array: array based FIFO for skbs
From: Michael S. Tsirkin @ 2016-05-23 20:35 UTC
  To: Eric Dumazet
  Cc: linux-kernel, Jason Wang, davem, netdev, Steven Rostedt, brouer

On Mon, May 23, 2016 at 06:31:46AM -0700, Eric Dumazet wrote:
> On Mon, 2016-05-23 at 13:43 +0300, Michael S. Tsirkin wrote:
> > This is in response to the proposal by Jason to make tun
> > rx packet queue lockless using a circular buffer.
> > My testing seems to show that at least for the common usecase
> > in networking, which isn't lockless, circular buffer
> > with indices does not perform that well, because
> > each index access causes a cache line to bounce between
> > CPUs, and index access causes stalls due to the dependency.
> > 
> > By comparison, an array of pointers where NULL means invalid
> > and !NULL means valid, can be updated without messing up barriers
> > at all and does not have this issue.
> 
> Note that both consumers and producers write in the array, so in light
> load (like TCP_RR), there are 2 cache line used byt the producers, and 2
> cache line used for consumers, with potential bouncing.

The shared part is RO by producer and consumer both,
so it's not bouncing - it can be shared in both caches.

Clearly the memory footprint of this data structure is bigger, so it
might cause more misses.

> In the other hand, the traditional sk_buff_head has one cache line,
> holding the spinlock and list head/tail.
>
> We might use the 'shared cache line' :
> 
> +       /* Shared consumer/producer data */
> +       int size ____cacheline_aligned_in_smp; /* max entries in queue
> */
> +       struct sk_buff **queue;
> 
> 
> To put here some fast path involving a single cache line access when
> queue has 0 or 1 item.
> 

I will try to experiment with it, but please note that this cache
line is currently RO for both producer and consumer; if we make it
writable it will start bouncing.

-- 
MST


* Re: [PATCH v5 2/2] skb_array: ring test
From: Michael S. Tsirkin @ 2016-05-23 20:52 UTC
  To: Jesper Dangaard Brouer
  Cc: linux-kernel, Jason Wang, Eric Dumazet, davem, netdev, Steven Rostedt

On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 23 May 2016 13:43:46 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > Add ringtest based unit test for skb array.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >  tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
> >  tools/virtio/ringtest/Makefile    |   4 +-
> 
> Patch didn't apply cleanly to Makefile, as you also seems to have
> "virtio_ring_inorder", I manually applied it.
> 
> I chdir to tools/virtio/ringtest/ and I could compile "skb_array",
> BUT how do I use it??? (the README is not helpful)
> 
> What is the "output", are there any performance measurement results?

First, if it completes successfully this means it completed a ton of
cycles without errors. It catches any missing barriers which aren't
nops on your system.

Second - use perf.

E.g. a simple perf stat will measure how long it takes to execute.
There's a script that runs it on different CPUs, so I normally do:

sh run-on-all.sh perf stat -r 5 ./skb_array


> > diff --git a/tools/virtio/ringtest/Makefile b/tools/virtio/ringtest/Makefile
> > index 6ba7455..87e58cf 100644
> > --- a/tools/virtio/ringtest/Makefile
> > +++ b/tools/virtio/ringtest/Makefile
> > @@ -1,6 +1,6 @@
> >  all:
> >  
> > -all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder
> > +all: ring virtio_ring_0_9 virtio_ring_poll virtio_ring_inorder skb_array
>                                               ^^^^^^^^^^^^^^^^^^^
> >  
> >  CFLAGS += -Wall
> >  CFLAGS += -pthread -O2 -ggdb
> > @@ -8,6 +8,7 @@ LDFLAGS += -pthread -O2 -ggdb
> >  
> >  main.o: main.c main.h
> >  ring.o: ring.c main.h
> > +skb_array.o: skb_array.c main.h ../../../include/linux/skb_array.h
> >  virtio_ring_0_9.o: virtio_ring_0_9.c main.h
> >  virtio_ring_poll.o: virtio_ring_poll.c virtio_ring_0_9.c main.h
> >  virtio_ring_inorder.o: virtio_ring_inorder.c virtio_ring_0_9.c main.h
> > @@ -15,6 +16,7 @@ ring: ring.o main.o
> >  virtio_ring_0_9: virtio_ring_0_9.o main.o
> >  virtio_ring_poll: virtio_ring_poll.o main.o
> >  virtio_ring_inorder: virtio_ring_inorder.o main.o
>    ^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^
> > +skb_array: skb_array.o main.o
> >  clean:
> >  	-rm main.o
> >  	-rm ring.o ring
> 
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH v5 2/2] skb_array: ring test
From: Jesper Dangaard Brouer @ 2016-05-24 10:28 UTC
  To: Michael S. Tsirkin
  Cc: linux-kernel, Jason Wang, Eric Dumazet, davem, netdev,
	Steven Rostedt, brouer

On Mon, 23 May 2016 23:52:47 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote:
> > On Mon, 23 May 2016 13:43:46 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > Add ringtest based unit test for skb array.
> > >
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > >  tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
> > >  tools/virtio/ringtest/Makefile    |   4 +-  
> > 
> > Patch didn't apply cleanly to Makefile, as you also seems to have
> > "virtio_ring_inorder", I manually applied it.
> > 
> > I chdir to tools/virtio/ringtest/ and I could compile "skb_array",
> > BUT how do I use it??? (the README is not helpful)
> > 
> > What is the "output", are there any performance measurement results?  
> 
> First, if it completes successfully this means it completed
> a ton of cycles without errors. It caches any missing barriers
> which aren't nops on your system.

I applied these patches on net-next (at commit 07b75260e) and the
skb_array test program never terminates.  Strangely, if I use your git
tree[1] (on branch vhost) the program does terminate... I didn't spot
the difference.
 
> Second - use perf.

I do like perf, but it does not answer my questions about the
performance of this queue. I will code something up in my own
framework[2] to answer my own performance questions.

Like what is the minimum overhead (in cycles) achievable with this
type of queue, in the most optimal situation (e.g. same-CPU enq+deq,
cache hot) for fastpath usage.

Then I also want to know how this performs when two CPUs are
involved, as this is also a primary use-case for you when sending
packets into a guest.


> E.g. simple perf stat will measure how long does it take to execute.
> there's a script that runs it on different CPUs,
> so I normally do:
>
> sh run-on-all.sh perf stat -r 5 ./skb_array

I recommend documenting this in the README file in the same dir ;-)

[1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost
[2] https://github.com/netoptimizer/prototype-kernel
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH v5 2/2] skb_array: ring test
From: Michael S. Tsirkin @ 2016-05-24 10:33 UTC
  To: Jesper Dangaard Brouer
  Cc: linux-kernel, Jason Wang, Eric Dumazet, davem, netdev, Steven Rostedt

On Tue, May 24, 2016 at 12:28:09PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 23 May 2016 23:52:47 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote:
> > > On Mon, 23 May 2016 13:43:46 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > Add ringtest based unit test for skb array.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > ---
> > > >  tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
> > > >  tools/virtio/ringtest/Makefile    |   4 +-  
> > > 
> > > Patch didn't apply cleanly to Makefile, as you also seems to have
> > > "virtio_ring_inorder", I manually applied it.
> > > 
> > > I chdir to tools/virtio/ringtest/ and I could compile "skb_array",
> > > BUT how do I use it??? (the README is not helpful)
> > > 
> > > What is the "output", are there any performance measurement results?  
> > 
> > First, if it completes successfully this means it completed
> > a ton of cycles without errors. It caches any missing barriers
> > which aren't nops on your system.
> 
> I applied these patches on net-next (at commit 07b75260e) and the
> skb_array test program never terminates.   Strangely if I use your git
> tree[1] (on branch vhost) the program does terminate... I didn't spot
> the difference.

Disassemble the binaries and compare? Should be identical.
Or attach gdb and look at array.producer and array.consumer.

> > Second - use perf.
> 
> I do like perf, but it does not answer my questions about the
> performance of this queue. I will code something up in my own
> framework[2] to answer my own performance questions.

Sounds good.

> Like what is be minimum overhead (in cycles) achievable with this type
> of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> for fastpath usage.

Interesting.

> Then I also want to know how this performs when two CPUs are involved.

This has flags to pin threads to different CPUs.


> As this is also a primary use-case, for you when sending packets into a
> guest.
> 

That's absolutely the primary use case.
It was designed with this in mind.

> 
> > E.g. simple perf stat will measure how long does it take to execute.
> > there's a script that runs it on different CPUs,
> > so I normally do:
> >
> > sh run-on-all.sh perf stat -r 5 ./skb_array
> 
> I recommend documenting this in the README file in the same dir ;-)

Good idea.  Will do.
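
Such a note might read something like this (just a sketch):

	Each test runs a fixed number of ring cycles and then exits;
	if it never terminates, a barrier is likely missing.  To
	benchmark, run a binary under perf for each CPU pair, e.g.:

		sh run-on-all.sh perf stat -r 5 ./skb_array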

> [1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost
> [2] https://github.com/netoptimizer/prototype-kernel
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH v5 2/2] skb_array: ring test
From: Michael S. Tsirkin @ 2016-05-24 11:54 UTC
  To: Jesper Dangaard Brouer
  Cc: linux-kernel, Jason Wang, Eric Dumazet, davem, netdev, Steven Rostedt

On Tue, May 24, 2016 at 12:28:09PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 23 May 2016 23:52:47 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote:
> > > On Mon, 23 May 2016 13:43:46 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > Add ringtest based unit test for skb array.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > ---
> > > >  tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
> > > >  tools/virtio/ringtest/Makefile    |   4 +-  
> > > 
> > > Patch didn't apply cleanly to Makefile, as you also seems to have
> > > "virtio_ring_inorder", I manually applied it.
> > > 
> > > I chdir to tools/virtio/ringtest/ and I could compile "skb_array",
> > > BUT how do I use it??? (the README is not helpful)
> > > 
> > > What is the "output", are there any performance measurement results?  
> > 
> > First, if it completes successfully this means it completed
> > a ton of cycles without errors. It caches any missing barriers
> > which aren't nops on your system.
> 
> I applied these patches on net-next (at commit 07b75260e) and the
> skb_array test program never terminates.   Strangely if I use your git
> tree[1] (on branch vhost) the program does terminate... I didn't spot
> the difference.

Oh, that's my bad. You need

    ringtest: pass buf != NULL
    
    just a stub pointer for now.
    
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

from my tree.

-- 
MST


* Re: [PATCH v5 2/2] skb_array: ring test
From: Michael S. Tsirkin @ 2016-05-24 12:11 UTC
  To: Jesper Dangaard Brouer
  Cc: linux-kernel, Jason Wang, Eric Dumazet, davem, netdev, Steven Rostedt

On Tue, May 24, 2016 at 12:28:09PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 23 May 2016 23:52:47 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Mon, May 23, 2016 at 03:09:18PM +0200, Jesper Dangaard Brouer wrote:
> > > On Mon, 23 May 2016 13:43:46 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > Add ringtest based unit test for skb array.
> > > >
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > ---
> > > >  tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
> > > >  tools/virtio/ringtest/Makefile    |   4 +-  
> > > 
> > > Patch didn't apply cleanly to Makefile, as you also seems to have
> > > "virtio_ring_inorder", I manually applied it.
> > > 
> > > I chdir to tools/virtio/ringtest/ and I could compile "skb_array",
> > > BUT how do I use it??? (the README is not helpful)
> > > 
> > > What is the "output", are there any performance measurement results?  
> > 
> > First, if it completes successfully this means it completed
> > a ton of cycles without errors. It caches any missing barriers
> > which aren't nops on your system.
> 
> I applied these patches on net-next (at commit 07b75260e) and the
> skb_array test program never terminates.   Strangely if I use your git
> tree[1] (on branch vhost) the program does terminate... I didn't spot
> the difference.
>  
> > Second - use perf.
> 
> I do like perf, but it does not answer my questions about the
> performance of this queue. I will code something up in my own
> framework[2] to answer my own performance questions.
> 
> Like what is be minimum overhead (in cycles) achievable with this type
> of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> for fastpath usage.

Actually there is, kind of, a way to find out with my tool if you
have an HT CPU.  When you do run-on-all.sh it will pin the consumer to
the last CPU, then run the producer on all of them.  Look for the
number for the HT pair - this shares cache between producer and
consumer.

This is not the same as doing produce + consume on the same CPU, but
it's close enough, I think.

To measure overhead I guess I should build a NOP tool
that does not actually produce or consume anything.
Will do.

> Then I also want to know how this performs when two CPUs are involved.
> As this is also a primary use-case, for you when sending packets into a
> guest.
> 
> 
> > E.g. simple perf stat will measure how long does it take to execute.
> > there's a script that runs it on different CPUs,
> > so I normally do:
> >
> > sh run-on-all.sh perf stat -r 5 ./skb_array
> 
> I recommend documenting this in the README file in the same dir ;-)
> 
> [1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost
> [2] https://github.com/netoptimizer/prototype-kernel
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH v5 2/2] skb_array: ring test
From: Jesper Dangaard Brouer @ 2016-05-24 17:03 UTC
  To: Michael S. Tsirkin
  Cc: linux-kernel, Jason Wang, Eric Dumazet, davem, netdev,
	Steven Rostedt, brouer


On Tue, 24 May 2016 12:28:09 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> I do like perf, but it does not answer my questions about the
> performance of this queue. I will code something up in my own
> framework[2] to answer my own performance questions.
> 
> Like what is be minimum overhead (in cycles) achievable with this type
> of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> for fastpath usage.

Coded it up here:
 https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c

This is a really fake benchmark, but it sort of shows the minimum
overhead achievable with this type of queue, where it is the same
CPU enqueuing and dequeuing, and cache is guaranteed to be hot.

Measured on an i7-4790K CPU @ 4.00GHz, the average cost of
enqueue+dequeue of a single object is around 102 cycles(tsc).
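
The measurement loop is roughly of this shape (a simplified sketch of
the linked benchmark, reusing the userspace wrappers from patch 2; the
real code is a kernel module in the prototype-kernel tree):

	#include <x86intrin.h>	/* __rdtsc() */

	static void bench_same_cpu(struct skb_array *a,
				   unsigned long long loops)
	{
		/* Any !NULL token will do for a cost measurement. */
		struct sk_buff *obj = (struct sk_buff *)0x1;
		unsigned long long i, start, stop;

		start = __rdtsc();
		for (i = 0; i < loops; i++) {
			__skb_array_produce(a, obj);	/* ring is empty */
			__skb_array_consume(a);		/* slot just filled */
			barrier();	/* keep the accesses in the loop */
		}
		stop = __rdtsc();
		printf("%llu cycles(tsc) per enq+deq pair\n",
		       (stop - start) / loops);
	}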

To compare this with below, where enq and deq are measured separately:
 102 / 2 = 51 cycles

 
> Then I also want to know how this performs when two CPUs are involved.
> As this is also a primary use-case, for you when sending packets into a
> guest.

Coded it up here:
 https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
 
This parallel benchmark tries to keep two (or more) CPUs busy
enqueuing or dequeuing on the same skb_array queue.  It prefills the
queue, and stops the test as soon as the queue is empty or full, or
after it completes a number of "loops"/cycles.

For two CPUs the results are really good:
 enqueue: 54 cycles(tsc)
 dequeue: 53 cycles(tsc)

Going to 4 CPUs, things break down (but that was not the primary use-case?):
 CPU(0) 927 cycles(tsc) enqueue
 CPU(1) 921 cycles(tsc) dequeue
 CPU(2) 927 cycles(tsc) enqueue
 CPU(3) 898 cycles(tsc) dequeue


Next on my todo-list is to implement same tests for e.g. alf_queue, so
we can compare the concurrency part (which is the important part).
But FYI I'll be busy the next days at conf http://fosd2016.itu.dk/

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


[1] https://git.kernel.org/cgit/linux/kernel/git/mst/vhost.git/log/?h=vhost
[2] https://github.com/netoptimizer/prototype-kernel


* Re: [PATCH v5 2/2] skb_array: ring test
From: Michael S. Tsirkin @ 2016-05-24 20:34 UTC
  To: Jesper Dangaard Brouer
  Cc: linux-kernel, Jason Wang, Eric Dumazet, davem, netdev, Steven Rostedt

On Tue, May 24, 2016 at 07:03:20PM +0200, Jesper Dangaard Brouer wrote:
> 
> On Tue, 24 May 2016 12:28:09 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
> > I do like perf, but it does not answer my questions about the
> > performance of this queue. I will code something up in my own
> > framework[2] to answer my own performance questions.
> > 
> > Like what is be minimum overhead (in cycles) achievable with this type
> > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> > for fastpath usage.
> 
> Coded it up here:
>  https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c
> 
> This is a really fake benchmark, but it sort of shows the minimum
> overhead achievable with this type of queue, where it is the same
> CPU enqueuing and dequeuing, and cache is guaranteed to be hot.
> 
> Measured on a i7-4790K CPU @ 4.00GHz, the average cost of
> enqueue+dequeue of a single object is around 102 cycles(tsc).
> 
> To compare this with below, where enq and deq is measured separately:
>  102 / 2 = 51 cycles
> 
>  
> > Then I also want to know how this performs when two CPUs are involved.
> > As this is also a primary use-case, for you when sending packets into a
> > guest.
> 
> Coded it up here:
>  https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
>  
> This parallel benchmark try to keep two (or more) CPUs busy enqueuing or
> dequeuing on the same skb_array queue.  It prefills the queue,
> and stops the test as soon as queue is empty or full, or
> completes a number of "loops"/cycles.
> 
> For two CPUs the results are really good:
>  enqueue: 54 cycles(tsc)
>  dequeue: 53 cycles(tsc)
> 
> Going to 4 CPUs, things break down (but it was not primary use-case?):
>  CPU(0) 927 cycles(tsc) enqueue
>  CPU(1) 921 cycles(tsc) dequeue
>  CPU(2) 927 cycles(tsc) enqueue
>  CPU(3) 898 cycles(tsc) dequeue

It's mostly the spinlock contention I guess.
Maybe we don't need fair spinlocks in this case.
Try replacing spinlocks with simple cmpxchg
and see what happens?
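
In the userspace harness that could be as simple as swapping the
pthread spinlock wrappers for an unfair compare-and-swap lock (one
possible reading of the suggestion; a sketch using GCC builtins,
untested):

	/* Unfair CAS lock: the winner takes it, with no FIFO ordering
	 * of waiters, unlike fair ticket spinlocks.
	 */
	static inline void cas_lock(volatile int *l)
	{
		while (__sync_val_compare_and_swap(l, 0, 1))
			while (*l)
				barrier();	/* read-only spin */
	}

	static inline void cas_unlock(volatile int *l)
	{
		__sync_lock_release(l);
	}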


* Re: [PATCH v5 0/2] skb_array: array based FIFO for skbs
From: Jason Wang @ 2016-05-30 9:59 UTC
  To: Michael S. Tsirkin, linux-kernel
  Cc: Eric Dumazet, davem, netdev, Steven Rostedt, brouer



On 2016年05月23日 18:43, Michael S. Tsirkin wrote:
> This is in response to the proposal by Jason to make tun
> rx packet queue lockless using a circular buffer.
> My testing seems to show that at least for the common usecase
> in networking, which isn't lockless, circular buffer
> with indices does not perform that well, because
> each index access causes a cache line to bounce between
> CPUs, and index access causes stalls due to the dependency.
>
> By comparison, an array of pointers where NULL means invalid
> and !NULL means valid, can be updated without messing up barriers
> at all and does not have this issue.
>
> On the flip side, cache pressure may be caused by using large queues.
> tun has a queue of 1000 entries by default and that's 8K.
> At this point I'm not sure this can be solved efficiently.
> The correct solution might be sizing the queues appropriately.
>
> Here's an implementation of this idea: it can be used more
> or less whenever sk_buff_head can be used, except you need
> to know the queue size in advance.
>
> It's using skb pointers but we switching to void * would be easy at cost
> of type safety, though it appears that people want lockless  push
> etc so I'm not sure of the value.
>
> I didn't implement resizing but it's possible by holding
> both consumer and producer locks.
>
> I think this code works fine without any extra memory barriers since we
> always read and write the same location, so the accesses can not be
> reordered.
> Multiple writes of the same value into memory would mess things up
> for us, I don't think compilers would do it though.
> But if people feel it's better to be safe wrt compiler optimizations,
> specifying queue as volatile would probably do it in a cleaner way
> than converting all accesses to READ_ONCE/WRITE_ONCE. Thoughts?
>
> The only issue is with calls within a loop using the __skb_array_XXX
> accessors - in theory compiler could hoist accesses out of the loop.
>
> Following volatile-considered-harmful.txt I merely
> documented that callers that busy-poll should invoke cpu_relax().
> Most people will use the external skb_array_XXX APIs with a spinlock,
> so this should not be an issue for them.
>
> changes since v4 (v3 was never posted)
> 	documentation
> 	dropped SKB_ARRAY_MIN_SIZE heuristic
> 	unit test (in userspace, included as patch 2)
>
> changes since v2:
>          fixed integer overflow pointed out by Eric.
>          added some comments.
>
> changes since v1:
>          fixed bug pointed out by Eric.
>
>
> Michael S. Tsirkin (2):
>    skb_array: array based FIFO for skbs
>    skb_array: ring test
>
>   include/linux/skb_array.h         | 127 +++++++++++++++++++++++++++++
>   tools/virtio/ringtest/skb_array.c | 167 ++++++++++++++++++++++++++++++++++++++
>   tools/virtio/ringtest/Makefile    |   4 +-
>   3 files changed, 297 insertions(+), 1 deletion(-)
>   create mode 100644 include/linux/skb_array.h
>   create mode 100644 tools/virtio/ringtest/skb_array.c
>

I changed tun to use skb array; it looks like it is about 5% faster
than the skb ring.

And we usually don't need to touch bhs during consume and produce
(e.g. in the case of tun).

Thanks
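
The variants Jason is referring to would presumably look like this
(a sketch following the same pattern as the _bh helpers in patch 1,
just without disabling bottom halves):

	static inline int skb_array_produce(struct skb_array *a,
					    struct sk_buff *skb)
	{
		int ret;

		spin_lock(&a->producer_lock);
		ret = __skb_array_produce(a, skb);
		spin_unlock(&a->producer_lock);

		return ret;
	}

	static inline struct sk_buff *skb_array_consume(struct skb_array *a)
	{
		struct sk_buff *skb;

		spin_lock(&a->consumer_lock);
		skb = __skb_array_peek(a);
		if (skb)
			__skb_array_consume(a);
		spin_unlock(&a->consumer_lock);

		return skb;
	}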


* Re: [PATCH v5 0/2] skb_array: array based FIFO for skbs
From: Michael S. Tsirkin @ 2016-05-30 15:37 UTC
  To: Jason Wang
  Cc: linux-kernel, Eric Dumazet, davem, netdev, Steven Rostedt, brouer

On Mon, May 30, 2016 at 05:59:33PM +0800, Jason Wang wrote:
> 
> 
> On 2016年05月23日 18:43, Michael S. Tsirkin wrote:
> >This is in response to the proposal by Jason to make tun
> >rx packet queue lockless using a circular buffer.
> >My testing seems to show that at least for the common usecase
> >in networking, which isn't lockless, circular buffer
> >with indices does not perform that well, because
> >each index access causes a cache line to bounce between
> >CPUs, and index access causes stalls due to the dependency.
> 
> I change tun to use skb array, looks like it can give about 5% more faster
> than skb ring.

OK and skb ring is 9% faster than the linked list, so together
this is a 14% speedup?

> And we usually don't need touch bhs during consume and produce (e.g for the
> case of tun).
> 
> Thanks

Maybe I'll drop it in v6 then ...
Could you post the full tun patchset please?

-- 
MST


* Re: [PATCH v5 0/2] skb_array: array based FIFO for skbs
From: Jason Wang @ 2016-05-31 2:29 UTC
  To: Michael S. Tsirkin
  Cc: linux-kernel, Eric Dumazet, davem, netdev, Steven Rostedt, brouer



On 2016年05月30日 23:37, Michael S. Tsirkin wrote:
> On Mon, May 30, 2016 at 05:59:33PM +0800, Jason Wang wrote:
>>
>> On 2016年05月23日 18:43, Michael S. Tsirkin wrote:
>>> This is in response to the proposal by Jason to make tun
>>> rx packet queue lockless using a circular buffer.
>>> My testing seems to show that at least for the common usecase
>>> in networking, which isn't lockless, circular buffer
>>> with indices does not perform that well, because
>>> each index access causes a cache line to bounce between
>>> CPUs, and index access causes stalls due to the dependency.
>> I change tun to use skb array, looks like it can give about 5% more faster
>> than skb ring.
> OK and skb ring is 9% faster than the linked list, so together
> this is a 14% speedup?

Right.

>
>> And we usually don't need touch bhs during consume and produce (e.g for the
>> case of tun).
>>
>> Thanks
> Maybe I'll drop it in v6 then ...
> Could you post the full tun patchset please?
>

Since it needs the non-bh versions of produce/consume, maybe you can
post v6 first, and then I can post the tun patches?


* Re: [PATCH v5 2/2] skb_array: ring test
From: Jesper Dangaard Brouer @ 2016-06-02 18:47 UTC
  To: Michael S. Tsirkin
  Cc: linux-kernel, Jason Wang, Eric Dumazet, davem, netdev,
	Steven Rostedt, brouer

On Tue, 24 May 2016 23:34:14 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, May 24, 2016 at 07:03:20PM +0200, Jesper Dangaard Brouer wrote:
> > 
> > On Tue, 24 May 2016 12:28:09 +0200
> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >   
> > > I do like perf, but it does not answer my questions about the
> > > performance of this queue. I will code something up in my own
> > > framework[2] to answer my own performance questions.
> > > 
> > > Like what is be minimum overhead (in cycles) achievable with this type
> > > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> > > for fastpath usage.  
> > 
> > Coded it up here:
> >  https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184
> >  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c
> > 
> > This is a really fake benchmark, but it sort of shows the minimum
> > overhead achievable with this type of queue, where it is the same
> > CPU enqueuing and dequeuing, and cache is guaranteed to be hot.
> > 
> > Measured on a i7-4790K CPU @ 4.00GHz, the average cost of
> > enqueue+dequeue of a single object is around 102 cycles(tsc).
> > 
> > To compare this with below, where enq and deq is measured separately:
> >  102 / 2 = 51 cycles

The alf_queue[1] baseline is 26 cycles in this minimum-overhead
benchmark with an MPMC (Multi-Producer/Multi-Consumer) queue which
uses a locked cmpxchg.  (The SPSC variant is 5 cycles; thus most of
the cost comes from the locked cmpxchg.)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h

> > > Then I also want to know how this performs when two CPUs are involved.
> > > As this is also a primary use-case, for you when sending packets into a
> > > guest.  
> > 
> > Coded it up here:
> >  https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e
> >  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
> >  
> > This parallel benchmark try to keep two (or more) CPUs busy enqueuing or
> > dequeuing on the same skb_array queue.  It prefills the queue,
> > and stops the test as soon as queue is empty or full, or
> > completes a number of "loops"/cycles.
> > 
> > For two CPUs the results are really good:
> >  enqueue: 54 cycles(tsc)
> >  dequeue: 53 cycles(tsc)

As MST points out, a scheme like the alf_queue[1] has the issue that
it "reads" the opposite cacheline (consumer.tail/producer.tail) to
determine if space is left or the queue is empty.  This causes an
expensive transition in the cache coherency protocol.

Coded up similar test for alf_queue:
 https://github.com/netoptimizer/prototype-kernel/commit/b3ff2624f1
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/alf_queue_parallel01.c

For two CPUs the MPMC results are significantly worse, and demonstrate MST's point:
 enqueue: 227 cycles(tsc)
 dequeue: 231 cycles(tsc)

Alf_queue also has an SPSC (Single-Producer/Single-Consumer) variant:
 enqueue: 24 cycles(tsc)
 dequeue: 23 cycles(tsc)


> > Going to 4 CPUs, things break down (but it was not primary use-case?):
> >  CPU(0) 927 cycles(tsc) enqueue
> >  CPU(1) 921 cycles(tsc) dequeue
> >  CPU(2) 927 cycles(tsc) enqueue
> >  CPU(3) 898 cycles(tsc) dequeue  
> 
> It's mostly the spinlock contention I guess.
> Maybe we don't need fair spinlocks in this case.
> Try replacing spinlocks with simple cmpxchg
> and see what happens?

The alf_queue uses a cmpxchg scheme, and it does scale better when the
number of CPUs increases:

 CPUs:4 Average: 586 cycles(tsc)
 CPUs:6 Average: 744 cycles(tsc)
 CPUs:8 Average: 1578 cycles(tsc)

Notice that the alf_queue was designed with bulking in mind, to
mitigate the effect of this cacheline bouncing, but bulking was not
covered in this test.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH v5 2/2] skb_array: ring test
From: Jesper Dangaard Brouer @ 2016-06-03 12:15 UTC
  To: Michael S. Tsirkin
  Cc: linux-kernel, Jason Wang, Eric Dumazet, davem, netdev,
	Steven Rostedt, brouer

On Thu, 2 Jun 2016 20:47:25 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Tue, 24 May 2016 23:34:14 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Tue, May 24, 2016 at 07:03:20PM +0200, Jesper Dangaard Brouer wrote:  
> > > 
> > > On Tue, 24 May 2016 12:28:09 +0200
> > > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > >     
> > > > I do like perf, but it does not answer my questions about the
> > > > performance of this queue. I will code something up in my own
> > > > framework[2] to answer my own performance questions.
> > > > 
> > > > Like what is be minimum overhead (in cycles) achievable with this type
> > > > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> > > > for fastpath usage.    
> > > 
> > > Coded it up here:
> > >  https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184
> > >  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c
> > > 
> > > This is a really fake benchmark, but it sort of shows the minimum
> > > overhead achievable with this type of queue, where it is the same
> > > CPU enqueuing and dequeuing, and cache is guaranteed to be hot.
> > > 
> > > Measured on a i7-4790K CPU @ 4.00GHz, the average cost of
> > > enqueue+dequeue of a single object is around 102 cycles(tsc).
> > > 
> > > To compare this with below, where enq and deq is measured separately:
> > >  102 / 2 = 51 cycles  
> 
> The alf_queue[1] baseline is 26 cycles in this minimum overhead
> achievable benchmark with a MPMC (Multi-Producer/Multi-Consumer) queue
> which use a locked cmpxchg.  (SPSC variant is 5 cycles, thus most cost
> comes from locked cmpxchg).
> 
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h
> 
> > > > Then I also want to know how this performs when two CPUs are involved.
> > > > As this is also a primary use-case, for you when sending packets into a
> > > > guest.    
> > > 
> > > Coded it up here:
> > >  https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e
> > >  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
> > >  
> > > This parallel benchmark try to keep two (or more) CPUs busy enqueuing or
> > > dequeuing on the same skb_array queue.  It prefills the queue,
> > > and stops the test as soon as queue is empty or full, or
> > > completes a number of "loops"/cycles.
> > > 
> > > For two CPUs the results are really good:
> > >  enqueue: 54 cycles(tsc)
> > >  dequeue: 53 cycles(tsc)  
> 
> As MST points out, a scheme like the alf_queue[1] have the issue that it
> "reads" the opposite cacheline of the consumer.tail/producer.tail to
> determine if space-is-left/queue-is-empty.  This cause an expensive
> transition for the cache coherency protocol.
> 
> Coded up similar test for alf_queue:
>  https://github.com/netoptimizer/prototype-kernel/commit/b3ff2624f1
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/alf_queue_parallel01.c
> 
> For two CPUs MPMC results are, significantly worse, and demonstrate MSTs point:
>  enqueue: 227 cycles(tsc)
>  dequeue: 231 cycles(tsc)
> 
> Alf_queue also have a SPSC (Single-Producer/Single-Consumer) variant:
>  enqueue: 24 cycles(tsc)
>  dequeue: 23 cycles(tsc)
> 
> 
> > > Going to 4 CPUs, things break down (but it was not primary use-case?):
> > >  CPU(0) 927 cycles(tsc) enqueue
> > >  CPU(1) 921 cycles(tsc) dequeue
> > >  CPU(2) 927 cycles(tsc) enqueue
> > >  CPU(3) 898 cycles(tsc) dequeue    
> > 
> > It's mostly the spinlock contention I guess.
> > Maybe we don't need fair spinlocks in this case.
> > Try replacing spinlocks with simple cmpxchg
> > and see what happens?  
> 
> The alf_queue uses a cmpxchg scheme, and it does scale better when the
> number of CPUs increase:
> 
>  CPUs:4 Average: 586 cycles(tsc)
>  CPUs:6 Average: 744 cycles(tsc)
>  CPUs:8 Average: 1578 cycles(tsc)
> 
> Notice the alf_queue was designed with the purpose of bulking, to
> mitigate the effect of this cacheline bouncing, but it was not covered
> in this test.

Added bulking to the alf_queue test:
 https://github.com/netoptimizer/prototype-kernel/commit/e22a0d8745 

This does help significantly, but it requires use-cases where there
are packets to bulk enq/deq.  On the other hand, the skb_array also
requires that the objects in the queue/array span more than one
cacheline before it starts to scale.

For two CPUs we need bulk=4 before beating skb_array.  See benchmark
adjusting bulk size:
 CPUs:2 bulk=step:1 Average: 231 cycles(tsc)
 CPUs:2 bulk=step:2 Average: 118 cycles(tsc)
 CPUs:2 bulk=step:3 Average: 65 cycles(tsc)
 CPUs:2 bulk=step:4 Average: 48 cycles(tsc)
 CPUs:2 bulk=step:5 Average: 40 cycles(tsc)
 CPUs:2 bulk=step:6 Average: 37 cycles(tsc)
 CPUs:2 bulk=step:7 Average: 29 cycles(tsc)
 CPUs:2 bulk=step:8 Average: 24 cycles(tsc)
 CPUs:2 bulk=step:9 Average: 23 cycles(tsc)
 CPUs:2 bulk=step:10 Average: 20 cycles(tsc)

Keeping bulk=8 and increasing the number of CPUs does show better
scalability, due to bulking.

This system (i7-4790K CPU @ 4.00GHz) only has 8 CPUs:
 CPUs:2 bulk=step:8 Average: 25 cycles(tsc)
 CPUs:4 bulk=step:8 Average: 71 cycles(tsc)
 CPUs:6 bulk=step:8 Average: 100 cycles(tsc)
 CPUs:8 bulk=step:8 Average: 185 cycles(tsc)

Found a (slower) 24-core CPU system (E5-2695v2-ES @ 2.50GHz):
 CPUs:2 bulk=step:8 Average: 50 cycles(tsc)
 CPUs:4 bulk=step:8 Average: 101 cycles(tsc)
 CPUs:6 bulk=step:8 Average: 214 cycles(tsc)
 CPUs:8 bulk=step:8 Average: 347 cycles(tsc)
 CPUs:10 bulk=step:8 Average: 468 cycles(tsc)
 CPUs:12 bulk=step:8 Average: 670 cycles(tsc)
 CPUs:14 bulk=step:8 Average: 698 cycles(tsc)
 CPUs:16 bulk=step:8 Average: 1149 cycles(tsc)
 CPUs:18 bulk=step:8 Average: 1094 cycles(tsc)
 CPUs:20 bulk=step:8 Average: 1349 cycles(tsc)
 CPUs:22 bulk=step:8 Average: 1406 cycles(tsc)
 CPUs:24 bulk=step:8 Average: 1553 cycles(tsc)

I still think skb_array is the winner, when the normal use-case is two
CPUs, and we cannot guarantee CPU pinning (thus cannot use SPSC).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

