* [PATCH 1/3] ptr_ring: batch ring zeroing
@ 2017-04-07  5:49 Michael S. Tsirkin
  2017-04-07  5:50   ` Michael S. Tsirkin
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2017-04-07  5:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jason Wang, brouer

A known weakness in ptr_ring design is that it does not handle well the
situation when ring is almost full: as entries are consumed they are
immediately used again by the producer, so consumer and producer are
writing to a shared cache line.

To fix this, add batching to consume calls: as entries are
consumed do not write NULL into the ring until we get
a multiple (in current implementation 2x) of cache lines
away from the producer. At that point, write them all out.
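(For example, with 64-byte cache lines and 8-byte pointers that works
out to 2 * 64 / 8 = 16 entries per batch; rings too small for this
fall back to a batch of 1.)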

We do the write out in the reverse order to keep
producer from sharing cache with consumer for as long
as possible.

Writeout also triggers when ring wraps around - there's
no special reason to do this but it helps keep the code
a bit simpler.
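
For intuition, here is the consumer-side idea reduced to a standalone
sketch (plain C, single consumer, no locking; names and sizes are
invented for illustration):

#include <stddef.h>

#define RING_SIZE 16
#define BATCH      4

static void *queue[RING_SIZE];
static int consumer_head;	/* next valid entry */
static int consumer_tail;	/* next entry to invalidate */

static void discard_one(void)
{
	int head = consumer_head++;

	if (consumer_head - consumer_tail >= BATCH ||
	    consumer_head >= RING_SIZE) {
		/* Zero in reverse so the cache line the producer is
		 * polling is touched last.
		 */
		while (head >= consumer_tail)
			queue[head--] = NULL;
		consumer_tail = consumer_head;
	}
	if (consumer_head >= RING_SIZE) {
		consumer_head = 0;
		consumer_tail = 0;
	}
}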

What should we do if getting away from producer by 2 cache lines
would mean we are keeping the ring more than half empty?
Maybe we should reduce the batching in this case;
the current patch simply reduces the batching.

Notes:
- it is no longer true that a call to consume guarantees
  that the following call to produce will succeed.
  No users seem to assume that.
- batching can also in theory reduce the signalling rate:
  users that would previously send interrupts to the producer
  to wake it up after consuming each entry would now only
  need to do this once in a batch.
  Doing this would be easy by returning a flag to the caller.
  No users seem to do signalling on consume yet, so this was not
  implemented; a sketch of one possible shape follows.
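
A hypothetical variant (illustration only, not part of this patch): the
discard step could report a flush, letting callers signal the producer
once per batch:

static inline bool __ptr_ring_discard_one_flag(struct ptr_ring *r)
{
	bool flushed = false;
	int head = r->consumer_head++;

	if (unlikely(r->consumer_head - r->consumer_tail >= r->batch ||
		     r->consumer_head >= r->size)) {
		while (likely(head >= r->consumer_tail))
			r->queue[head--] = NULL;
		r->consumer_tail = r->consumer_head;
		flushed = true;
	}
	if (unlikely(r->consumer_head >= r->size)) {
		r->consumer_head = 0;
		r->consumer_tail = 0;
	}
	/* true means a batch boundary was crossed: signal the producer */
	return flushed;
}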

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

Jason, I am curious whether the following gives you some of
the performance boost that you see with vhost batching
patches. Is vhost batching on top still helpful?

 include/linux/ptr_ring.h | 63 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 54 insertions(+), 9 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 6c70444..6b2e0dd 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -34,11 +34,13 @@
 struct ptr_ring {
 	int producer ____cacheline_aligned_in_smp;
 	spinlock_t producer_lock;
-	int consumer ____cacheline_aligned_in_smp;
+	int consumer_head ____cacheline_aligned_in_smp; /* next valid entry */
+	int consumer_tail; /* next entry to invalidate */
 	spinlock_t consumer_lock;
 	/* Shared consumer/producer data */
 	/* Read-only by both the producer and the consumer */
 	int size ____cacheline_aligned_in_smp; /* max entries in queue */
+	int batch; /* number of entries to consume in a batch */
 	void **queue;
 };
 
@@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, void *ptr)
 static inline void *__ptr_ring_peek(struct ptr_ring *r)
 {
 	if (likely(r->size))
-		return r->queue[r->consumer];
+		return r->queue[r->consumer_head];
 	return NULL;
 }
 
@@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
 /* Must only be called after __ptr_ring_peek returned !NULL */
 static inline void __ptr_ring_discard_one(struct ptr_ring *r)
 {
-	r->queue[r->consumer++] = NULL;
-	if (unlikely(r->consumer >= r->size))
-		r->consumer = 0;
+	/* Fundamentally, what we want to do is update consumer
+	 * index and zero out the entry so producer can reuse it.
+	 * Doing it naively at each consume would be as simple as:
+	 *       r->queue[r->consumer++] = NULL;
+	 *       if (unlikely(r->consumer >= r->size))
+	 *               r->consumer = 0;
+	 * but that is suboptimal when the ring is full as producer is writing
+	 * out new entries in the same cache line.  Defer these updates until a
+	 * batch of entries has been consumed.
+	 */
+	int head = r->consumer_head++;
+
+	/* Once we have processed enough entries invalidate them in
+	 * the ring all at once so producer can reuse their space in the ring.
+	 * We also do this when we reach end of the ring - not mandatory
+	 * but helps keep the implementation simple.
+	 */
+	if (unlikely(r->consumer_head - r->consumer_tail >= r->batch ||
+		     r->consumer_head >= r->size)) {
+		/* Zero out entries in the reverse order: this way we touch the
+		 * cache line that producer might currently be reading the last;
+		 * producer won't make progress and touch other cache lines
+		 * besides the first one until we write out all entries.
+		 */
+		while (likely(head >= r->consumer_tail))
+			r->queue[head--] = NULL;
+		r->consumer_tail = r->consumer_head;
+	}
+	if (unlikely(r->consumer_head >= r->size)) {
+		r->consumer_head = 0;
+		r->consumer_tail = 0;
+	}
 }
 
 static inline void *__ptr_ring_consume(struct ptr_ring *r)
@@ -345,14 +376,27 @@ static inline void **__ptr_ring_init_queue_alloc(int size, gfp_t gfp)
 	return kzalloc(ALIGN(size * sizeof(void *), SMP_CACHE_BYTES), gfp);
 }
 
+static inline void __ptr_ring_set_size(struct ptr_ring *r, int size)
+{
+	r->size = size;
+	r->batch = SMP_CACHE_BYTES * 2 / sizeof(*(r->queue));
+	/* We need to set batch at least to 1 to make logic
+	 * in __ptr_ring_discard_one work correctly.
+	 * Batching too much (because ring is small) would cause a lot of
+	 * burstiness. Needs tuning, for now disable batching.
+	 */
+	if (r->batch > r->size / 2 || !r->batch)
+		r->batch = 1;
+}
+
 static inline int ptr_ring_init(struct ptr_ring *r, int size, gfp_t gfp)
 {
 	r->queue = __ptr_ring_init_queue_alloc(size, gfp);
 	if (!r->queue)
 		return -ENOMEM;
 
-	r->size = size;
-	r->producer = r->consumer = 0;
+	__ptr_ring_set_size(r, size);
+	r->producer = r->consumer_head = r->consumer_tail = 0;
 	spin_lock_init(&r->producer_lock);
 	spin_lock_init(&r->consumer_lock);
 
@@ -373,9 +417,10 @@ static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue,
 		else if (destroy)
 			destroy(ptr);
 
-	r->size = size;
+	__ptr_ring_set_size(r, size);
 	r->producer = producer;
-	r->consumer = 0;
+	r->consumer_head = 0;
+	r->consumer_tail = 0;
 	old = r->queue;
 	r->queue = queue;
 
-- 
MST

* [PATCH 2/3] ringtest: support test specific parameters
  2017-04-07  5:49 [PATCH 1/3] ptr_ring: batch ring zeroing Michael S. Tsirkin
@ 2017-04-07  5:50   ` Michael S. Tsirkin
  2017-04-07  5:50   ` Michael S. Tsirkin
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2017-04-07  5:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jason Wang, brouer, virtualization

Add a new flag for passing test-specific parameters.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tools/virtio/ringtest/main.c | 13 +++++++++++++
 tools/virtio/ringtest/main.h |  2 ++
 2 files changed, 15 insertions(+)

diff --git a/tools/virtio/ringtest/main.c b/tools/virtio/ringtest/main.c
index f31353f..022ae95 100644
--- a/tools/virtio/ringtest/main.c
+++ b/tools/virtio/ringtest/main.c
@@ -20,6 +20,7 @@
 int runcycles = 10000000;
 int max_outstanding = INT_MAX;
 int batch = 1;
+int param = 0;
 
 bool do_sleep = false;
 bool do_relax = false;
@@ -247,6 +248,11 @@ static const struct option longopts[] = {
 		.val = 'b',
 	},
 	{
+		.name = "param",
+		.has_arg = required_argument,
+		.val = 'p',
+	},
+	{
 		.name = "sleep",
 		.has_arg = no_argument,
 		.val = 's',
@@ -274,6 +280,7 @@ static void help(void)
 		" [--run-cycles C (default: %d)]"
 		" [--batch b]"
 		" [--outstanding o]"
+		" [--param p]"
 		" [--sleep]"
 		" [--relax]"
 		" [--exit]"
@@ -328,6 +335,12 @@ int main(int argc, char **argv)
 			assert(c > 0 && c < INT_MAX);
 			max_outstanding = c;
 			break;
+		case 'p':
+			c = strtol(optarg, &endptr, 0);
+			assert(!*endptr);
+			assert(c > 0 && c < INT_MAX);
+			param = c;
+			break;
 		case 'b':
 			c = strtol(optarg, &endptr, 0);
 			assert(!*endptr);
diff --git a/tools/virtio/ringtest/main.h b/tools/virtio/ringtest/main.h
index 14142fa..90b0133 100644
--- a/tools/virtio/ringtest/main.h
+++ b/tools/virtio/ringtest/main.h
@@ -10,6 +10,8 @@
 
 #include <stdbool.h>
 
+extern int param;
+
 extern bool do_exit;
 
 #if defined(__x86_64__) || defined(__i386__)
-- 
MST
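
With this applied, a per-test value can be passed on the command line,
e.g. (a hypothetical run; binary names as built in tools/virtio/ringtest):

$ ./ptr_ring --run-cycles 1000000 --param 16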

* [PATCH 3/3] ptr_ring: support testing different batching sizes
  2017-04-07  5:49 [PATCH 1/3] ptr_ring: batch ring zeroing Michael S. Tsirkin
@ 2017-04-07  5:50   ` Michael S. Tsirkin
  2017-04-07  5:50   ` Michael S. Tsirkin
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2017-04-07  5:50 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jason Wang, brouer, virtualization

Use the param flag for that.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 tools/virtio/ringtest/ptr_ring.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/virtio/ringtest/ptr_ring.c b/tools/virtio/ringtest/ptr_ring.c
index 635b07b..7b22f1b 100644
--- a/tools/virtio/ringtest/ptr_ring.c
+++ b/tools/virtio/ringtest/ptr_ring.c
@@ -97,6 +97,9 @@ void alloc_ring(void)
 {
 	int ret = ptr_ring_init(&array, ring_size, 0);
 	assert(!ret);
+	/* Hacky way to poke at ring internals. Useful for testing though. */
+	if (param)
+		array.batch = param;
 }
 
 /* guest side */
-- 
MST
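
Together with the previous patch this lets the batch size be varied per
run without rebuilding, e.g. (hypothetical invocations):

$ ./ptr_ring --param 1    # batch of 1, close to the pre-batching behaviour
$ ./ptr_ring --param 32   # zero out 32 entries at a time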

* Re: [PATCH 1/3] ptr_ring: batch ring zeroing
  2017-04-07  5:49 [PATCH 1/3] ptr_ring: batch ring zeroing Michael S. Tsirkin
  2017-04-07  5:50   ` Michael S. Tsirkin
  2017-04-07  5:50   ` Michael S. Tsirkin
@ 2017-04-08 12:14 ` Jesper Dangaard Brouer
  2017-05-09 13:33   ` Michael S. Tsirkin
  2017-04-12  8:03 ` Jason Wang
  3 siblings, 1 reply; 17+ messages in thread
From: Jesper Dangaard Brouer @ 2017-04-08 12:14 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: linux-kernel, Jason Wang, brouer

On Fri, 7 Apr 2017 08:49:57 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> A known weakness in ptr_ring design is that it does not handle well the
> situation when ring is almost full: as entries are consumed they are
> immediately used again by the producer, so consumer and producer are
> writing to a shared cache line.
> 
> To fix this, add batching to consume calls: as entries are
> consumed do not write NULL into the ring until we get
> a multiple (in current implementation 2x) of cache lines
> away from the producer. At that point, write them all out.
> 
> We do the write out in the reverse order to keep
> producer from sharing cache with consumer for as long
> as possible.
> 
> Writeout also triggers when ring wraps around - there's
> no special reason to do this but it helps keep the code
> a bit simpler.
> 
> What should we do if getting away from producer by 2 cache lines
> would mean we are keeping the ring more than half empty?
> Maybe we should reduce the batching in this case;
> the current patch simply reduces the batching.
> 
> Notes:
> - it is no longer true that a call to consume guarantees
>   that the following call to produce will succeed.
>   No users seem to assume that.
> - batching can also in theory reduce the signalling rate:
>   users that would previously send interrupts to the producer
>   to wake it up after consuming each entry would now only
>   need to do this once in a batch.
>   Doing this would be easy by returning a flag to the caller.
>   No users seem to do signalling on consume yet, so this was not
>   implemented.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> 
> Jason, I am curious whether the following gives you some of
> the performance boost that you see with vhost batching
> patches. Is vhost batching on top still helpful?
> 
>  include/linux/ptr_ring.h | 63 +++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 54 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 6c70444..6b2e0dd 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -34,11 +34,13 @@
>  struct ptr_ring {
>  	int producer ____cacheline_aligned_in_smp;
>  	spinlock_t producer_lock;
> -	int consumer ____cacheline_aligned_in_smp;
> +	int consumer_head ____cacheline_aligned_in_smp; /* next valid entry */
> +	int consumer_tail; /* next entry to invalidate */
>  	spinlock_t consumer_lock;
>  	/* Shared consumer/producer data */
>  	/* Read-only by both the producer and the consumer */
>  	int size ____cacheline_aligned_in_smp; /* max entries in queue */
> +	int batch; /* number of entries to consume in a batch */
>  	void **queue;
>  };
>  
> @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, void *ptr)
>  static inline void *__ptr_ring_peek(struct ptr_ring *r)
>  {
>  	if (likely(r->size))
> -		return r->queue[r->consumer];
> +		return r->queue[r->consumer_head];
>  	return NULL;
>  }
>  
> @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
>  /* Must only be called after __ptr_ring_peek returned !NULL */
>  static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>  {
> -	r->queue[r->consumer++] = NULL;
> -	if (unlikely(r->consumer >= r->size))
> -		r->consumer = 0;
> +	/* Fundamentally, what we want to do is update consumer
> +	 * index and zero out the entry so producer can reuse it.
> +	 * Doing it naively at each consume would be as simple as:
> +	 *       r->queue[r->consumer++] = NULL;
> +	 *       if (unlikely(r->consumer >= r->size))
> +	 *               r->consumer = 0;
> +	 * but that is suboptimal when the ring is full as producer is writing
> +	 * out new entries in the same cache line.  Defer these updates until a
> +	 * batch of entries has been consumed.
> +	 */
> +	int head = r->consumer_head++;
> +
> +	/* Once we have processed enough entries invalidate them in
> +	 * the ring all at once so producer can reuse their space in the ring.
> +	 * We also do this when we reach end of the ring - not mandatory
> +	 * but helps keep the implementation simple.
> +	 */
> +	if (unlikely(r->consumer_head - r->consumer_tail >= r->batch ||
> +		     r->consumer_head >= r->size)) {
> +		/* Zero out entries in the reverse order: this way we touch the
> +		 * cache line that producer might currently be reading the last;
> +		 * producer won't make progress and touch other cache lines
> +		 * besides the first one until we write out all entries.
> +		 */
> +		while (likely(head >= r->consumer_tail))
> +			r->queue[head--] = NULL;
> +		r->consumer_tail = r->consumer_head;
> +	}
> +	if (unlikely(r->consumer_head >= r->size)) {
> +		r->consumer_head = 0;
> +		r->consumer_tail = 0;
> +	}
>  }

I love this idea.  Reviewed and discussed the idea in-person with MST
during netdevconf[1] at this laptop.  I promised I will also run it
through my micro-benchmarking[2] once I return home (hint ptr_ring gets
used in network stack as skb_array).

Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>

[1] http://netdevconf.org/2.1/
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [PATCH 1/3] ptr_ring: batch ring zeroing
  2017-04-07  5:49 [PATCH 1/3] ptr_ring: batch ring zeroing Michael S. Tsirkin
                   ` (2 preceding siblings ...)
  2017-04-08 12:14 ` [PATCH 1/3] ptr_ring: batch ring zeroing Jesper Dangaard Brouer
@ 2017-04-12  8:03 ` Jason Wang
  2017-04-14  7:52   ` Jason Wang
  3 siblings, 1 reply; 17+ messages in thread
From: Jason Wang @ 2017-04-12  8:03 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel; +Cc: brouer



On 2017-04-07 13:49, Michael S. Tsirkin wrote:
> A known weakness in ptr_ring design is that it does not handle well the
> situation when ring is almost full: as entries are consumed they are
> immediately used again by the producer, so consumer and producer are
> writing to a shared cache line.
>
> To fix this, add batching to consume calls: as entries are
> consumed do not write NULL into the ring until we get
> a multiple (in current implementation 2x) of cache lines
> away from the producer. At that point, write them all out.
>
> We do the write out in the reverse order to keep
> producer from sharing cache with consumer for as long
> as possible.
>
> Writeout also triggers when ring wraps around - there's
> no special reason to do this but it helps keep the code
> a bit simpler.
>
> What should we do if getting away from producer by 2 cache lines
> would mean we are keeping the ring more than half empty?
> Maybe we should reduce the batching in this case;
> the current patch simply reduces the batching.
>
> Notes:
> - it is no longer true that a call to consume guarantees
>    that the following call to produce will succeed.
>    No users seem to assume that.
> - batching can also in theory reduce the signalling rate:
>    users that would previously send interrupts to the producer
>    to wake it up after consuming each entry would now only
>    need to do this once in a batch.
>    Doing this would be easy by returning a flag to the caller.
>    No users seem to do signalling on consume yet, so this was not
>    implemented.
>
> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> ---
>
> Jason, I am curious whether the following gives you some of
> the performance boost that you see with vhost batching
> patches. Is vhost batching on top still helpful?

The patch looks good to me, will have a test for vhost batching patches.

Thanks

* Re: [PATCH 1/3] ptr_ring: batch ring zeroing
  2017-04-12  8:03 ` Jason Wang
@ 2017-04-14  7:52   ` Jason Wang
  2017-04-14 21:00     ` Michael S. Tsirkin
  2017-04-14 22:50     ` Michael S. Tsirkin
  0 siblings, 2 replies; 17+ messages in thread
From: Jason Wang @ 2017-04-14  7:52 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel; +Cc: brouer



On 2017-04-12 16:03, Jason Wang wrote:
>
>
> On 2017-04-07 13:49, Michael S. Tsirkin wrote:
>> A known weakness in ptr_ring design is that it does not handle well the
>> situation when ring is almost full: as entries are consumed they are
>> immediately used again by the producer, so consumer and producer are
>> writing to a shared cache line.
>>
>> To fix this, add batching to consume calls: as entries are
>> consumed do not write NULL into the ring until we get
>> a multiple (in current implementation 2x) of cache lines
>> away from the producer. At that point, write them all out.
>>
>> We do the write out in the reverse order to keep
>> producer from sharing cache with consumer for as long
>> as possible.
>>
>> Writeout also triggers when ring wraps around - there's
>> no special reason to do this but it helps keep the code
>> a bit simpler.
>>
>> What should we do if getting away from producer by 2 cache lines
>> would mean we are keeping the ring more than half empty?
>> Maybe we should reduce the batching in this case;
>> the current patch simply reduces the batching.
>>
>> Notes:
>> - it is no longer true that a call to consume guarantees
>>    that the following call to produce will succeed.
>>    No users seem to assume that.
>> - batching can also in theory reduce the signalling rate:
>>    users that would previously send interrupts to the producer
>>    to wake it up after consuming each entry would now only
>>    need to do this once in a batch.
>>    Doing this would be easy by returning a flag to the caller.
>>    No users seem to do signalling on consume yet, so this was not
>>    implemented.
>>
>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>> ---
>>
>> Jason, I am curious whether the following gives you some of
>> the performance boost that you see with vhost batching
>> patches. Is vhost batching on top still helpful?
>
> The patch looks good to me, will have a test for vhost batching patches.
>
> Thanks

Still helpful:

before this patch: 1.84Mpps
with this patch: 2.00Mpps
with batch dequeuing: 2.30Mpps

Acked-by: Jason Wang <jasowang@redhat.com>

Thanks

* Re: [PATCH 1/3] ptr_ring: batch ring zeroing
  2017-04-14  7:52   ` Jason Wang
@ 2017-04-14 21:00     ` Michael S. Tsirkin
  2017-04-18  2:16       ` Jason Wang
  2017-04-14 22:50     ` Michael S. Tsirkin
  1 sibling, 1 reply; 17+ messages in thread
From: Michael S. Tsirkin @ 2017-04-14 21:00 UTC (permalink / raw)
  To: Jason Wang; +Cc: linux-kernel, brouer

On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote:
> 
> 
> On 2017-04-12 16:03, Jason Wang wrote:
> > 
> > 
> > On 2017-04-07 13:49, Michael S. Tsirkin wrote:
> > > A known weakness in ptr_ring design is that it does not handle well the
> > > situation when ring is almost full: as entries are consumed they are
> > > immediately used again by the producer, so consumer and producer are
> > > writing to a shared cache line.
> > > 
> > > To fix this, add batching to consume calls: as entries are
> > > consumed do not write NULL into the ring until we get
> > > a multiple (in current implementation 2x) of cache lines
> > > away from the producer. At that point, write them all out.
> > > 
> > > We do the write out in the reverse order to keep
> > > producer from sharing cache with consumer for as long
> > > as possible.
> > > 
> > > Writeout also triggers when ring wraps around - there's
> > > no special reason to do this but it helps keep the code
> > > a bit simpler.
> > > 
> > > What should we do if getting away from producer by 2 cache lines
> > > would mean we are keeping the ring more than half empty?
> > > Maybe we should reduce the batching in this case;
> > > the current patch simply reduces the batching.
> > > 
> > > Notes:
> > > - it is no longer true that a call to consume guarantees
> > >    that the following call to produce will succeed.
> > >    No users seem to assume that.
> > > - batching can also in theory reduce the signalling rate:
> > >    users that would previously send interrupts to the producer
> > >    to wake it up after consuming each entry would now only
> > >    need to do this once in a batch.
> > >    Doing this would be easy by returning a flag to the caller.
> > >    No users seem to do signalling on consume yet, so this was not
> > >    implemented.
> > > 
> > > Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> > > ---
> > > 
> > > Jason, I am curious whether the following gives you some of
> > > the performance boost that you see with vhost batching
> > > patches. Is vhost batching on top still helpful?
> > 
> > The patch looks good to me, will have a test for vhost batching patches.
> > 
> > Thanks
> 
> Still helpful:
> 
> before this patch: 1.84Mpps
> with this patch: 2.00Mpps
> with batch dequeuing: 2.30Mpps
> 
> Acked-by: Jason Wang <jasowang@redhat.com>
> 
> Thanks

Fascinating. How do we explain the gain with batch dequeue?
Is it just the lock overhead? Can you please try to replace
the lock with a simple non-fair atomic and see what happens?
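
For illustration, such a non-fair lock might look roughly like this
(an untested sketch, not a real patch):

static atomic_t consumer_owner = ATOMIC_INIT(0);

static inline void consumer_lock_unfair(void)
{
	/* spin until we flip 0 -> 1; no fairness or queueing */
	while (atomic_cmpxchg(&consumer_owner, 0, 1) != 0)
		cpu_relax();
}

static inline void consumer_unlock_unfair(void)
{
	atomic_set_release(&consumer_owner, 0);
}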

-- 
MST

* Re: [PATCH 1/3] ptr_ring: batch ring zeroing
  2017-04-14  7:52   ` Jason Wang
  2017-04-14 21:00     ` Michael S. Tsirkin
@ 2017-04-14 22:50     ` Michael S. Tsirkin
  2017-04-18  2:18       ` Jason Wang
  1 sibling, 1 reply; 17+ messages in thread
From: Michael S. Tsirkin @ 2017-04-14 22:50 UTC (permalink / raw)
  To: Jason Wang; +Cc: linux-kernel, brouer

On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote:
> 
> 
> On 2017-04-12 16:03, Jason Wang wrote:
> > 
> > 
> > On 2017-04-07 13:49, Michael S. Tsirkin wrote:
> > > A known weakness in ptr_ring design is that it does not handle well the
> > > situation when ring is almost full: as entries are consumed they are
> > > immediately used again by the producer, so consumer and producer are
> > > writing to a shared cache line.
> > > 
> > > To fix this, add batching to consume calls: as entries are
> > > consumed do not write NULL into the ring until we get
> > > a multiple (in current implementation 2x) of cache lines
> > > away from the producer. At that point, write them all out.
> > > 
> > > We do the write out in the reverse order to keep
> > > producer from sharing cache with consumer for as long
> > > as possible.
> > > 
> > > Writeout also triggers when ring wraps around - there's
> > > no special reason to do this but it helps keep the code
> > > a bit simpler.
> > > 
> > > What should we do if getting away from producer by 2 cache lines
> > > would mean we are keeping the ring more than half empty?
> > > Maybe we should reduce the batching in this case;
> > > the current patch simply reduces the batching.
> > > 
> > > Notes:
> > > - it is no longer true that a call to consume guarantees
> > >    that the following call to produce will succeed.
> > >    No users seem to assume that.
> > > - batching can also in theory reduce the signalling rate:
> > >    users that would previously send interrupts to the producer
> > >    to wake it up after consuming each entry would now only
> > >    need to do this once in a batch.
> > >    Doing this would be easy by returning a flag to the caller.
> > >    No users seem to do signalling on consume yet, so this was not
> > >    implemented.
> > > 
> > > Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> > > ---
> > > 
> > > Jason, I am curious whether the following gives you some of
> > > the performance boost that you see with vhost batching
> > > patches. Is vhost batching on top still helpful?
> > 
> > The patch looks good to me, will have a test for vhost batching patches.
> > 
> > Thanks
> 
> Still helpful:
> 
> before this patch: 1.84Mpps
> with this patch: 2.00Mpps
> with batch dequeuing: 2.30Mpps

Just a thought: could you test dropping the consumer spinlock
completely?  Just around the peek?

As I said previously, the perf c2c tool should be helpful
for locating sources of cache-related latency.
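
For example, roughly:

$ perf c2c record -a -- sleep 10
$ perf c2c report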

> Acked-by: Jason Wang <jasowang@redhat.com>
> 
> Thanks

* Re: [PATCH 1/3] ptr_ring: batch ring zeroing
  2017-04-14 21:00     ` Michael S. Tsirkin
@ 2017-04-18  2:16       ` Jason Wang
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Wang @ 2017-04-18  2:16 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: linux-kernel, brouer



On 2017-04-15 05:00, Michael S. Tsirkin wrote:
> On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote:
>>
>> On 2017-04-12 16:03, Jason Wang wrote:
>>>
>>> On 2017-04-07 13:49, Michael S. Tsirkin wrote:
>>>> A known weakness in ptr_ring design is that it does not handle well the
>>>> situation when ring is almost full: as entries are consumed they are
>>>> immediately used again by the producer, so consumer and producer are
>>>> writing to a shared cache line.
>>>>
>>>> To fix this, add batching to consume calls: as entries are
>>>> consumed do not write NULL into the ring until we get
>>>> a multiple (in current implementation 2x) of cache lines
>>>> away from the producer. At that point, write them all out.
>>>>
>>>> We do the write out in the reverse order to keep
>>>> producer from sharing cache with consumer for as long
>>>> as possible.
>>>>
>>>> Writeout also triggers when ring wraps around - there's
>>>> no special reason to do this but it helps keep the code
>>>> a bit simpler.
>>>>
>>>> What should we do if getting away from producer by 2 cache lines
>>>> would mean we are keeping the ring more than half empty?
>>>> Maybe we should reduce the batching in this case;
>>>> the current patch simply reduces the batching.
>>>>
>>>> Notes:
>>>> - it is no longer true that a call to consume guarantees
>>>>     that the following call to produce will succeed.
>>>>     No users seem to assume that.
>>>> - batching can also in theory reduce the signalling rate:
>>>>     users that would previously send interrupts to the producer
>>>>     to wake it up after consuming each entry would now only
>>>>     need to do this once in a batch.
>>>>     Doing this would be easy by returning a flag to the caller.
>>>>     No users seem to do signalling on consume yet, so this was not
>>>>     implemented.
>>>>
>>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>>> ---
>>>>
>>>> Jason, I am curious whether the following gives you some of
>>>> the performance boost that you see with vhost batching
>>>> patches. Is vhost batching on top still helpful?
>>> The patch looks good to me, will have a test for vhost batching patches.
>>>
>>> Thanks
>> Still helpful:
>>
>> before this patch: 1.84Mpps
>> with this patch: 2.00Mpps
>> with batch dequeuing: 2.30Mpps
>>
>> Acked-by: Jason Wang <jasowang@redhat.com>
>>
>> Thanks
> Fascinating. How do we explain the gain with batch dequeue?

I counted the drop rate (pktgen on tun, counting tun TX) and maybe it
can explain things more or less:

Before this patch:      TX xmit 1.8Mpps   TX dropped 0.23Mpps  TX total 2.04Mpps  (11% dropped)
After this patch:       TX xmit 1.95Mpps  TX dropped 0.33Mpps  TX total 2.28Mpps  (14% dropped)
With batched dequeuing: TX xmit 2.24Mpps  TX dropped 0.01Mpps  TX total 2.26Mpps  (~0% dropped)

With this patch, we remove the cache contention more or less by
blocking the producer. With batch dequeuing, the ring is not full in
99% of the cases, which probably means the producer is not blocked
most of the time.

> Is it just the lock overhead?

I removed the spinlocks for peeking and dequeuing on top of this patch.
TX pps increased from ~2Mpps to ~2.08Mpps, so it was not only the lock
overhead.

> Can you please try to replace
> the lock with a simple non-fair atomic and see what happens?
>

Not sure I get the idea; we are taking the spinlock fast path for
sure, which is just a cmpxchg().

Thanks

* Re: [PATCH 1/3] ptr_ring: batch ring zeroing
  2017-04-14 22:50     ` Michael S. Tsirkin
@ 2017-04-18  2:18       ` Jason Wang
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Wang @ 2017-04-18  2:18 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: linux-kernel, brouer



On 2017-04-15 06:50, Michael S. Tsirkin wrote:
> On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote:
>>
>> On 2017-04-12 16:03, Jason Wang wrote:
>>>
>>> On 2017-04-07 13:49, Michael S. Tsirkin wrote:
>>>> A known weakness in ptr_ring design is that it does not handle well the
>>>> situation when ring is almost full: as entries are consumed they are
>>>> immediately used again by the producer, so consumer and producer are
>>>> writing to a shared cache line.
>>>>
>>>> To fix this, add batching to consume calls: as entries are
>>>> consumed do not write NULL into the ring until we get
>>>> a multiple (in current implementation 2x) of cache lines
>>>> away from the producer. At that point, write them all out.
>>>>
>>>> We do the write out in the reverse order to keep
>>>> producer from sharing cache with consumer for as long
>>>> as possible.
>>>>
>>>> Writeout also triggers when ring wraps around - there's
>>>> no special reason to do this but it helps keep the code
>>>> a bit simpler.
>>>>
>>>> What should we do if getting away from producer by 2 cache lines
>>>> would mean we are keeping the ring more than half empty?
>>>> Maybe we should reduce the batching in this case;
>>>> the current patch simply reduces the batching.
>>>>
>>>> Notes:
>>>> - it is no longer true that a call to consume guarantees
>>>>     that the following call to produce will succeed.
>>>>     No users seem to assume that.
>>>> - batching can also in theory reduce the signalling rate:
>>>>     users that would previously send interrupts to the producer
>>>>     to wake it up after consuming each entry would now only
>>>>     need to do this once in a batch.
>>>>     Doing this would be easy by returning a flag to the caller.
>>>>     No users seem to do signalling on consume yet, so this was not
>>>>     implemented.
>>>>
>>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>>> ---
>>>>
>>>> Jason, I am curious whether the following gives you some of
>>>> the performance boost that you see with vhost batching
>>>> patches. Is vhost batching on top still helpful?
>>> The patch looks good to me, will have a test for vhost batching patches.
>>>
>>> Thanks
>> Still helpful:
>>
>> before this patch: 1.84Mpps
>> with this patch: 2.00Mpps
>> with batch dequeuing: 2.30Mpps
> Just a thought: could you test dropping the consumer spinlock
> completely?  Just around the peek?

2% improvement for dropping the spinlock around peeking, and 2% more
for dropping the spinlock around consuming.

>
> As I said previously, the perf c2c tool should be helpful
> for locating sources of cache-related latency.
>

perf c2c indeed shows that some false sharing was reduced by this
patch. But it does not show an obvious difference with batch dequeuing
on top.

Thanks

>> Acked-by: Jason Wang <jasowang@redhat.com>
>>
>> Thanks

* Re: [PATCH 1/3] ptr_ring: batch ring zeroing
  2017-04-08 12:14 ` [PATCH 1/3] ptr_ring: batch ring zeroing Jesper Dangaard Brouer
@ 2017-05-09 13:33   ` Michael S. Tsirkin
  2017-05-10  3:30     ` Jason Wang
  2017-05-10  9:18     ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2017-05-09 13:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: linux-kernel, Jason Wang

On Sat, Apr 08, 2017 at 02:14:08PM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 7 Apr 2017 08:49:57 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > A known weakness in ptr_ring design is that it does not handle well the
> > situation when ring is almost full: as entries are consumed they are
> > immediately used again by the producer, so consumer and producer are
> > writing to a shared cache line.
> > 
> > To fix this, add batching to consume calls: as entries are
> > consumed do not write NULL into the ring until we get
> > a multiple (in current implementation 2x) of cache lines
> > away from the producer. At that point, write them all out.
> > 
> > We do the write out in the reverse order to keep
> > producer from sharing cache with consumer for as long
> > as possible.
> > 
> > Writeout also triggers when ring wraps around - there's
> > no special reason to do this but it helps keep the code
> > a bit simpler.
> > 
> > What should we do if getting away from producer by 2 cache lines
> > would mean we are keeping the ring more than half empty?
> > Maybe we should reduce the batching in this case;
> > the current patch simply reduces the batching.
> > 
> > Notes:
> > - it is no longer true that a call to consume guarantees
> >   that the following call to produce will succeed.
> >   No users seem to assume that.
> > - batching can also in theory reduce the signalling rate:
> >   users that would previously send interrupts to the producer
> >   to wake it up after consuming each entry would now only
> >   need to do this once in a batch.
> >   Doing this would be easy by returning a flag to the caller.
> >   No users seem to do signalling on consume yet, so this was not
> >   implemented.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > 
> > Jason, I am curious whether the following gives you some of
> > the performance boost that you see with vhost batching
> > patches. Is vhost batching on top still helpful?
> > 
> >  include/linux/ptr_ring.h | 63 +++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 54 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > index 6c70444..6b2e0dd 100644
> > --- a/include/linux/ptr_ring.h
> > +++ b/include/linux/ptr_ring.h
> > @@ -34,11 +34,13 @@
> >  struct ptr_ring {
> >  	int producer ____cacheline_aligned_in_smp;
> >  	spinlock_t producer_lock;
> > -	int consumer ____cacheline_aligned_in_smp;
> > +	int consumer_head ____cacheline_aligned_in_smp; /* next valid entry */
> > +	int consumer_tail; /* next entry to invalidate */
> >  	spinlock_t consumer_lock;
> >  	/* Shared consumer/producer data */
> >  	/* Read-only by both the producer and the consumer */
> >  	int size ____cacheline_aligned_in_smp; /* max entries in queue */
> > +	int batch; /* number of entries to consume in a batch */
> >  	void **queue;
> >  };
> >  
> > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, void *ptr)
> >  static inline void *__ptr_ring_peek(struct ptr_ring *r)
> >  {
> >  	if (likely(r->size))
> > -		return r->queue[r->consumer];
> > +		return r->queue[r->consumer_head];
> >  	return NULL;
> >  }
> >  
> > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
> >  /* Must only be called after __ptr_ring_peek returned !NULL */
> >  static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> >  {
> > -	r->queue[r->consumer++] = NULL;
> > -	if (unlikely(r->consumer >= r->size))
> > -		r->consumer = 0;
> > +	/* Fundamentally, what we want to do is update consumer
> > +	 * index and zero out the entry so producer can reuse it.
> > +	 * Doing it naively at each consume would be as simple as:
> > +	 *       r->queue[r->consumer++] = NULL;
> > +	 *       if (unlikely(r->consumer >= r->size))
> > +	 *               r->consumer = 0;
> > +	 * but that is suboptimal when the ring is full as producer is writing
> > +	 * out new entries in the same cache line.  Defer these updates until a
> > +	 * batch of entries has been consumed.
> > +	 */
> > +	int head = r->consumer_head++;
> > +
> > +	/* Once we have processed enough entries invalidate them in
> > +	 * the ring all at once so producer can reuse their space in the ring.
> > +	 * We also do this when we reach end of the ring - not mandatory
> > +	 * but helps keep the implementation simple.
> > +	 */
> > +	if (unlikely(r->consumer_head - r->consumer_tail >= r->batch ||
> > +		     r->consumer_head >= r->size)) {
> > +		/* Zero out entries in the reverse order: this way we touch the
> > +		 * cache line that producer might currently be reading the last;
> > +		 * producer won't make progress and touch other cache lines
> > +		 * besides the first one until we write out all entries.
> > +		 */
> > +		while (likely(head >= r->consumer_tail))
> > +			r->queue[head--] = NULL;
> > +		r->consumer_tail = r->consumer_head;
> > +	}
> > +	if (unlikely(r->consumer_head >= r->size)) {
> > +		r->consumer_head = 0;
> > +		r->consumer_tail = 0;
> > +	}
> >  }
> 
> I love this idea.  Reviewed and discussed the idea in-person with MST
> during netdevconf[1] at this laptop.  I promised I will also run it
> through my micro-benchmarking[2] once I return home (hint ptr_ring gets
> used in network stack as skb_array).

I'm merging this through my tree. Any objections?

> Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> [1] http://netdevconf.org/2.1/
> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

* Re: [PATCH 1/3] ptr_ring: batch ring zeroing
  2017-05-09 13:33   ` Michael S. Tsirkin
@ 2017-05-10  3:30     ` Jason Wang
  2017-05-10 12:22       ` Michael S. Tsirkin
  2017-05-10  9:18     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 17+ messages in thread
From: Jason Wang @ 2017-05-10  3:30 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jesper Dangaard Brouer; +Cc: linux-kernel



On 2017-05-09 21:33, Michael S. Tsirkin wrote:
>> I love this idea.  Reviewed and discussed the idea in-person with MST
>> during netdevconf[1] at this laptop.  I promised I will also run it
>> through my micro-benchmarking[2] once I return home (hint ptr_ring gets
>> used in network stack as skb_array).
> I'm merging this through my tree. Any objections?
>

The batch dequeuing series depends on this; maybe it's better to have
it in that series. Let me post a V4 series with this.

Thanks

* Re: [PATCH 1/3] ptr_ring: batch ring zeroing
  2017-05-09 13:33   ` Michael S. Tsirkin
  2017-05-10  3:30     ` Jason Wang
@ 2017-05-10  9:18     ` Jesper Dangaard Brouer
  2017-05-10 12:20       ` Michael S. Tsirkin
  1 sibling, 1 reply; 17+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-10  9:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Jason Wang, brouer, netdev, John Fastabend

On Tue, 9 May 2017 16:33:14 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Sat, Apr 08, 2017 at 02:14:08PM +0200, Jesper Dangaard Brouer wrote:
> > On Fri, 7 Apr 2017 08:49:57 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > A known weakness in ptr_ring design is that it does not handle well the
> > > situation when ring is almost full: as entries are consumed they are
> > > immediately used again by the producer, so consumer and producer are
> > > writing to a shared cache line.
> > > 
> > > To fix this, add batching to consume calls: as entries are
> > > consumed do not write NULL into the ring until we get
> > > a multiple (in current implementation 2x) of cache lines
> > > away from the producer. At that point, write them all out.
> > > 
> > > We do the write out in the reverse order to keep
> > > producer from sharing cache with consumer for as long
> > > as possible.
> > > 
> > > Writeout also triggers when ring wraps around - there's
> > > no special reason to do this but it helps keep the code
> > > a bit simpler.
> > > 
> > > What should we do if getting away from producer by 2 cache lines
> > > would mean we are keeping the ring more than half empty?
> > > Maybe we should reduce the batching in this case;
> > > the current patch simply reduces the batching.
> > > 
> > > Notes:
> > > - it is no longer true that a call to consume guarantees
> > >   that the following call to produce will succeed.
> > >   No users seem to assume that.
> > > - batching can also in theory reduce the signalling rate:
> > >   users that would previously send interrupts to the producer
> > >   to wake it up after consuming each entry would now only
> > >   need to do this once in a batch.
> > >   Doing this would be easy by returning a flag to the caller.
> > >   No users seem to do signalling on consume yet, so this was not
> > >   implemented.
> > > 
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > > 
> > > Jason, I am curious whether the following gives you some of
> > > the performance boost that you see with vhost batching
> > > patches. Is vhost batching on top still helpful?
> > > 
> > >  include/linux/ptr_ring.h | 63 +++++++++++++++++++++++++++++++++++++++++-------
> > >  1 file changed, 54 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > > index 6c70444..6b2e0dd 100644
> > > --- a/include/linux/ptr_ring.h
> > > +++ b/include/linux/ptr_ring.h
> > > @@ -34,11 +34,13 @@
> > >  struct ptr_ring {
> > >  	int producer ____cacheline_aligned_in_smp;
> > >  	spinlock_t producer_lock;
> > > -	int consumer ____cacheline_aligned_in_smp;
> > > +	int consumer_head ____cacheline_aligned_in_smp; /* next valid entry */
> > > +	int consumer_tail; /* next entry to invalidate */
> > >  	spinlock_t consumer_lock;
> > >  	/* Shared consumer/producer data */
> > >  	/* Read-only by both the producer and the consumer */
> > >  	int size ____cacheline_aligned_in_smp; /* max entries in queue */
> > > +	int batch; /* number of entries to consume in a batch */
> > >  	void **queue;
> > >  };
> > >  
> > > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, void *ptr)
> > >  static inline void *__ptr_ring_peek(struct ptr_ring *r)
> > >  {
> > >  	if (likely(r->size))
> > > -		return r->queue[r->consumer];
> > > +		return r->queue[r->consumer_head];
> > >  	return NULL;
> > >  }
> > >  
> > > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
> > >  /* Must only be called after __ptr_ring_peek returned !NULL */
> > >  static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> > >  {
> > > -	r->queue[r->consumer++] = NULL;
> > > -	if (unlikely(r->consumer >= r->size))
> > > -		r->consumer = 0;
> > > +	/* Fundamentally, what we want to do is update consumer
> > > +	 * index and zero out the entry so producer can reuse it.
> > > +	 * Doing it naively at each consume would be as simple as:
> > > +	 *       r->queue[r->consumer++] = NULL;
> > > +	 *       if (unlikely(r->consumer >= r->size))
> > > +	 *               r->consumer = 0;
> > > +	 * but that is suboptimal when the ring is full as producer is writing
> > > +	 * out new entries in the same cache line.  Defer these updates until a
> > > +	 * batch of entries has been consumed.
> > > +	 */
> > > +	int head = r->consumer_head++;
> > > +
> > > +	/* Once we have processed enough entries invalidate them in
> > > +	 * the ring all at once so producer can reuse their space in the ring.
> > > +	 * We also do this when we reach end of the ring - not mandatory
> > > +	 * but helps keep the implementation simple.
> > > +	 */
> > > +	if (unlikely(r->consumer_head - r->consumer_tail >= r->batch ||
> > > +		     r->consumer_head >= r->size)) {
> > > +		/* Zero out entries in the reverse order: this way we touch the
> > > +		 * cache line that producer might currently be reading the last;
> > > +		 * producer won't make progress and touch other cache lines
> > > +		 * besides the first one until we write out all entries.
> > > +		 */
> > > +		while (likely(head >= r->consumer_tail))
> > > +			r->queue[head--] = NULL;
> > > +		r->consumer_tail = r->consumer_head;
> > > +	}
> > > +	if (unlikely(r->consumer_head >= r->size)) {
> > > +		r->consumer_head = 0;
> > > +		r->consumer_tail = 0;
> > > +	}
> > >  }  
> > 
> > I love this idea.  Reviewed and discussed the idea in-person with MST
> > during netdevconf[1] at this laptop.  I promised I will also run it
> > through my micro-benchmarking[2] once I return home (hint ptr_ring gets
> > used in network stack as skb_array).  
> 
> I'm merging this through my tree. Any objections?

I just did the micro-benchmark evaluation I promised and everything
looks good, so no objections from me.

John Fastabend recently posted an RFC patchset for removing the qdisc
lock.  The main purpose is to keep enqueuers and dequeuers from
serializing on the same lock.  But we need a new queue implementation
that avoids the false sharing between enqueue and dequeue.

This is why John chose ptr_ring and changed the (pfifo_fast) qdisc to
use it.  I think this patch might help his overload testing, as my
theory is that he is hitting false sharing on the queue, due to the
queue always being full.


> > Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > 
> > [1] http://netdevconf.org/2.1/
> > [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c

If you like, you can also add my:

Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

$ modprobe skb_array_test01 
$ dmesg 
[73228.381497] skb_array_test01: Loaded
[73228.381498] skb_array_test01: PASSED - basic_init_and_cleanup()
[73228.381498] skb_array_test01: PASSED - basic_add_and_remove_object()
[73228.381505] skb_array_test01: PASSED - test_queue_full_condition()
[73228.381505] skb_array_test01: PASSED - test_queue_empty_condition()
[73228.381510] skb_array_test01: PASSED - test_queue_resize()

* Re: [PATCH 1/3] ptr_ring: batch ring zeroing
  2017-05-10  9:18     ` Jesper Dangaard Brouer
@ 2017-05-10 12:20       ` Michael S. Tsirkin
  0 siblings, 0 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2017-05-10 12:20 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: linux-kernel, Jason Wang, netdev, John Fastabend

On Wed, May 10, 2017 at 11:18:13AM +0200, Jesper Dangaard Brouer wrote:
> On Tue, 9 May 2017 16:33:14 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Sat, Apr 08, 2017 at 02:14:08PM +0200, Jesper Dangaard Brouer wrote:
> > > On Fri, 7 Apr 2017 08:49:57 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > >   
> > > > A known weakness in ptr_ring design is that it does not handle well the
> > > > situation when ring is almost full: as entries are consumed they are
> > > > immediately used again by the producer, so consumer and producer are
> > > > writing to a shared cache line.
> > > > 
> > > > To fix this, add batching to consume calls: as entries are
> > > > consumed do not write NULL into the ring until we get
> > > > a multiple (in current implementation 2x) of cache lines
> > > > away from the producer. At that point, write them all out.
> > > > 
> > > > We do the write out in the reverse order to keep
> > > > producer from sharing cache with consumer for as long
> > > > as possible.
> > > > 
> > > > Writeout also triggers when ring wraps around - there's
> > > > no special reason to do this but it helps keep the code
> > > > a bit simpler.
> > > > 
> > > > What should we do if getting away from producer by 2 cache lines
> > > > would mean we are keeping the ring more than half empty?
> > > > Maybe we should reduce the batching in this case;
> > > > the current patch simply reduces the batching.
> > > > 
> > > > Notes:
> > > > - it is no longer true that a call to consume guarantees
> > > >   that the following call to produce will succeed.
> > > >   No users seem to assume that.
> > > > - batching can also in theory reduce the signalling rate:
> > > >   users that would previously send interrupts to the producer
> > > >   to wake it up after consuming each entry would now only
> > > >   need to do this once in a batch.
> > > >   Doing this would be easy by returning a flag to the caller.
> > > >   No users seem to do signalling on consume yet, so this was not
> > > >   implemented.
> > > > 
> > > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > > ---
> > > > 
> > > > Jason, I am curious whether the following gives you some of
> > > > the performance boost that you see with vhost batching
> > > > patches. Is vhost batching on top still helpful?
> > > > 
> > > >  include/linux/ptr_ring.h | 63 +++++++++++++++++++++++++++++++++++++++++-------
> > > >  1 file changed, 54 insertions(+), 9 deletions(-)
> > > > 
> > > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > > > index 6c70444..6b2e0dd 100644
> > > > --- a/include/linux/ptr_ring.h
> > > > +++ b/include/linux/ptr_ring.h
> > > > @@ -34,11 +34,13 @@
> > > >  struct ptr_ring {
> > > >  	int producer ____cacheline_aligned_in_smp;
> > > >  	spinlock_t producer_lock;
> > > > -	int consumer ____cacheline_aligned_in_smp;
> > > > +	int consumer_head ____cacheline_aligned_in_smp; /* next valid entry */
> > > > +	int consumer_tail; /* next entry to invalidate */
> > > >  	spinlock_t consumer_lock;
> > > >  	/* Shared consumer/producer data */
> > > >  	/* Read-only by both the producer and the consumer */
> > > >  	int size ____cacheline_aligned_in_smp; /* max entries in queue */
> > > > +	int batch; /* number of entries to consume in a batch */
> > > >  	void **queue;
> > > >  };
> > > >  
> > > > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, void *ptr)
> > > >  static inline void *__ptr_ring_peek(struct ptr_ring *r)
> > > >  {
> > > >  	if (likely(r->size))
> > > > -		return r->queue[r->consumer];
> > > > +		return r->queue[r->consumer_head];
> > > >  	return NULL;
> > > >  }
> > > >  
> > > > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
> > > >  /* Must only be called after __ptr_ring_peek returned !NULL */
> > > >  static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> > > >  {
> > > > -	r->queue[r->consumer++] = NULL;
> > > > -	if (unlikely(r->consumer >= r->size))
> > > > -		r->consumer = 0;
> > > > +	/* Fundamentally, what we want to do is update consumer
> > > > +	 * index and zero out the entry so producer can reuse it.
> > > > +	 * Doing it naively at each consume would be as simple as:
> > > > +	 *       r->queue[r->consumer++] = NULL;
> > > > +	 *       if (unlikely(r->consumer >= r->size))
> > > > +	 *               r->consumer = 0;
> > > > +	 * but that is suboptimal when the ring is full as producer is writing
> > > > +	 * out new entries in the same cache line.  Defer these updates until a
> > > > +	 * batch of entries has been consumed.
> > > > +	 */
> > > > +	int head = r->consumer_head++;
> > > > +
> > > > +	/* Once we have processed enough entries invalidate them in
> > > > +	 * the ring all at once so producer can reuse their space in the ring.
> > > > +	 * We also do this when we reach end of the ring - not mandatory
> > > > +	 * but helps keep the implementation simple.
> > > > +	 */
> > > > +	if (unlikely(r->consumer_head - r->consumer_tail >= r->batch ||
> > > > +		     r->consumer_head >= r->size)) {
> > > > +		/* Zero out entries in the reverse order: this way we touch the
> > > > +		 * cache line that producer might currently be reading the last;
> > > > +		 * producer won't make progress and touch other cache lines
> > > > +		 * besides the first one until we write out all entries.
> > > > +		 */
> > > > +		while (likely(head >= r->consumer_tail))
> > > > +			r->queue[head--] = NULL;
> > > > +		r->consumer_tail = r->consumer_head;
> > > > +	}
> > > > +	if (unlikely(r->consumer_head >= r->size)) {
> > > > +		r->consumer_head = 0;
> > > > +		r->consumer_tail = 0;
> > > > +	}
> > > >  }  
> > > 
> > > I love this idea.  Reviewed and discussed the idea in-person with MST
> > > during netdevconf[1] at this laptop.  I promised I will also run it
> > > through my micro-benchmarking[2] once I return home (hint ptr_ring gets
> > > used in network stack as skb_array).  
> > 
> > I'm merging this through my tree. Any objections?
> 
> I just did the micro-benchmark evaluation I promised and everything
> looks good, so no objections from me.
> 
> John Fastabend recently posted a RFC patchset for removing the qdisc
> lock.  The main purpose is to separate enqueue'ers and dequeue'ers from
> serializing on the same lock.  But we need a new queue implementation
> that avoids the false-sharing between enq+deq.
> 
> This is why John choose to use ptr_ring, changed (pfifo_fast) qdisc to
> use this ptr_ring.  I think this patch might help his overload testing,
> as my theory is that he is hitting false-sharing on the queue, due to
> the queue always being full.
> 
> 
> > > Reviewed-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > > 
> > > [1] http://netdevconf.org/2.1/
> > > [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c
> 
> If you like, you can also add my:
> 
> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>

Unfortunately I already pushed it out, so I can't attach that.
Sorry.  Thanks a lot for the testing!

> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 
> $ modprobe skb_array_test01 
> $ dmesg 
> [73228.381497] skb_array_test01: Loaded
> [73228.381498] skb_array_test01: PASSED - basic_init_and_cleanup()
> [73228.381498] skb_array_test01: PASSED - basic_add_and_remove_object()
> [73228.381505] skb_array_test01: PASSED - test_queue_full_condition()
> [73228.381505] skb_array_test01: PASSED - test_queue_empty_condition()
> [73228.381510] skb_array_test01: PASSED - test_queue_resize()

* Re: [PATCH 1/3] ptr_ring: batch ring zeroing
  2017-05-10  3:30     ` Jason Wang
@ 2017-05-10 12:22       ` Michael S. Tsirkin
  0 siblings, 0 replies; 17+ messages in thread
From: Michael S. Tsirkin @ 2017-05-10 12:22 UTC (permalink / raw)
  To: Jason Wang; +Cc: Jesper Dangaard Brouer, linux-kernel

On Wed, May 10, 2017 at 11:30:42AM +0800, Jason Wang wrote:
> 
> 
> On 2017-05-09 21:33, Michael S. Tsirkin wrote:
> > > I love this idea.  Reviewed and discussed the idea in-person with MST
> > > during netdevconf[1] at this laptop.  I promised I will also run it
> > > through my micro-benchmarking[2] once I return home (hint ptr_ring gets
> > > used in network stack as skb_array).
> > I'm merging this through my tree. Any objections?
> > 
> 
> The batch dequeuing series depends on this; maybe it's better to have
> it in that series. Let me post a V4 series with this.
> 
> Thanks

FYI I'm including 1/3 in the pull request.

-- 
MST
