linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next v2 0/2] add benchmark selftest and optimization for ptr_ring
@ 2021-06-25  3:18 Yunsheng Lin
  2021-06-25  3:18 ` [PATCH net-next v2 1/2] selftests/ptr_ring: add benchmark application " Yunsheng Lin
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Yunsheng Lin @ 2021-06-25  3:18 UTC (permalink / raw)
  To: davem, kuba, jasowang, mst
  Cc: brouer, paulmck, peterz, will, shuah, linux-kernel, netdev,
	linux-kselftest, linuxarm

Patch 1: add a selftest app to benchmark the performance
         of ptr_ring.
Patch 2: make __ptr_ring_empty() checking more reliable
         and use the just added selftest to benchmark the
         performance impact.

V2: add patch 1 and add performance data for patch 2.

Yunsheng Lin (2):
  selftests/ptr_ring: add benchmark application for ptr_ring
  ptr_ring: make __ptr_ring_empty() checking more reliable

 MAINTAINERS                                      |   5 +
 include/linux/ptr_ring.h                         |  25 ++-
 tools/testing/selftests/ptr_ring/Makefile        |   6 +
 tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++
 tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++
 5 files changed, 426 insertions(+), 9 deletions(-)
 create mode 100644 tools/testing/selftests/ptr_ring/Makefile
 create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.c
 create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.h

-- 
2.7.4


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH net-next v2 1/2] selftests/ptr_ring: add benchmark application for ptr_ring
  2021-06-25  3:18 [PATCH net-next v2 0/2] add benchmark selftest and optimization for ptr_ring Yunsheng Lin
@ 2021-06-25  3:18 ` Yunsheng Lin
  2021-06-25  3:36   ` Jason Wang
  2021-06-25  6:37   ` Michael S. Tsirkin
  2021-06-25  3:18 ` [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable Yunsheng Lin
  2021-06-25  6:42 ` [PATCH net-next v2 0/2] add benchmark selftest and optimization for ptr_ring Michael S. Tsirkin
  2 siblings, 2 replies; 20+ messages in thread
From: Yunsheng Lin @ 2021-06-25  3:18 UTC (permalink / raw)
  To: davem, kuba, jasowang, mst
  Cc: brouer, paulmck, peterz, will, shuah, linux-kernel, netdev,
	linux-kselftest, linuxarm

Currently ptr_ring selftest is embedded within the virtio
selftest, which involves some specific virtio operation,
such as notifying and kicking.

As ptr_ring has been used by various subsystems, it deserves
it's owner's selftest in order to benchmark different usecase
of ptr_ring, such as page pool and pfifo_fast qdisc.

So add a simple application to benchmark ptr_ring performance.
Currently two test mode is supported:
Mode 0: Both enqueuing and dequeuing is done in a single thread,
        it is called simple test mode in the test app.
Mode 1: Enqueuing and dequeuing is done in different thread
        concurrently, also known as SPSC(single-producer/
        single-consumer) test.

The multi-producer/single-consumer test for pfifo_fast case is
not added yet, which can be added if using CAS atomic operation
to enable lockless multi-producer is proved to be better than
using r->producer_lock.

Only supported on x86 and arm64 for now.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 MAINTAINERS                                      |   5 +
 tools/testing/selftests/ptr_ring/Makefile        |   6 +
 tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++
 tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++
 4 files changed, 410 insertions(+)
 create mode 100644 tools/testing/selftests/ptr_ring/Makefile
 create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.c
 create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.h

diff --git a/MAINTAINERS b/MAINTAINERS
index cc375fd..1227022 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14847,6 +14847,11 @@ F:	drivers/net/phy/dp83640*
 F:	drivers/ptp/*
 F:	include/linux/ptp_cl*
 
+PTR RING BENCHMARK
+M:	Yunsheng Lin <linyunsheng@huawei.com>
+L:	netdev@vger.kernel.org
+F:	tools/testing/selftests/ptr_ring/
+
 PTRACE SUPPORT
 M:	Oleg Nesterov <oleg@redhat.com>
 S:	Maintained
diff --git a/tools/testing/selftests/ptr_ring/Makefile b/tools/testing/selftests/ptr_ring/Makefile
new file mode 100644
index 0000000..346dea9
--- /dev/null
+++ b/tools/testing/selftests/ptr_ring/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+LDLIBS = -lpthread
+
+TEST_GEN_PROGS := ptr_ring_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.c b/tools/testing/selftests/ptr_ring/ptr_ring_test.c
new file mode 100644
index 0000000..4f32d3d
--- /dev/null
+++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.c
@@ -0,0 +1,249 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2021 HiSilicon Limited.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+#include <sys/time.h>
+#include <malloc.h>
+#include <assert.h>
+#include <stdbool.h>
+#include <pthread.h>
+
+#include "ptr_ring_test.h"
+#include "../../../../include/linux/ptr_ring.h"
+
+#define MIN_RING_SIZE	2
+#define MAX_RING_SIZE	10000000
+
+static struct ptr_ring ring ____cacheline_aligned_in_smp;
+
+struct worker_info {
+	pthread_t tid;
+	int test_count;
+	bool error;
+	long duration_us;
+};
+
+static void *produce_worker(void *arg)
+{
+	struct worker_info *info = arg;
+	struct timeval start, end;
+	unsigned long i = 0;
+	long sec, us;
+	int ret;
+
+	gettimeofday(&start, NULL);
+
+	while (++i <= info->test_count) {
+		while (__ptr_ring_full(&ring))
+			cpu_relax();
+
+		ret = __ptr_ring_produce(&ring, (void *)i);
+		if (ret) {
+			fprintf(stderr, "produce failed: %d\n", ret);
+			info->error = true;
+			return NULL;
+		}
+	}
+
+	gettimeofday(&end, NULL);
+
+	sec = (end.tv_sec - start.tv_sec);
+	us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
+	info->duration_us = us;
+	info->error = false;
+
+	return NULL;
+}
+
+static void *consume_worker(void *arg)
+{
+	struct worker_info *info = arg;
+	struct timeval start, end;
+	unsigned long i = 0;
+	long sec, us;
+	int *ptr;
+
+	gettimeofday(&start, NULL);
+
+	while (++i <= info->test_count) {
+		while (__ptr_ring_empty(&ring))
+			cpu_relax();
+
+		ptr = __ptr_ring_consume(&ring);
+		if ((unsigned long)ptr != i) {
+			fprintf(stderr, "consumer failed, ptr: %lu, i: %lu\n",
+				(unsigned long)ptr, i);
+			info->error = true;
+			return NULL;
+		}
+	}
+
+	gettimeofday(&end, NULL);
+
+	if (!__ptr_ring_empty(&ring)) {
+		fprintf(stderr, "ring should be empty, test failed\n");
+		info->error = true;
+		return NULL;
+	}
+
+	sec = (end.tv_sec - start.tv_sec);
+	us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
+	info->duration_us = us;
+	info->error = false;
+	return NULL;
+}
+
+/* test case for single producer single consumer */
+static void spsc_test(int size, int count)
+{
+	struct worker_info producer, consumer;
+	pthread_attr_t attr;
+	void *res;
+	int ret;
+
+	ret = ptr_ring_init(&ring, size, 0);
+	if (ret) {
+		fprintf(stderr, "init failed: %d\n", ret);
+		return;
+	}
+
+	producer.test_count = count;
+	consumer.test_count = count;
+
+	ret = pthread_attr_init(&attr);
+	if (ret) {
+		fprintf(stderr, "pthread attr init failed: %d\n", ret);
+		goto out;
+	}
+
+	ret = pthread_create(&producer.tid, &attr,
+			     produce_worker, &producer);
+	if (ret) {
+		fprintf(stderr, "create producer thread failed: %d\n", ret);
+		goto out;
+	}
+
+	ret = pthread_create(&consumer.tid, &attr,
+			     consume_worker, &consumer);
+	if (ret) {
+		fprintf(stderr, "create consumer thread failed: %d\n", ret);
+		goto out;
+	}
+
+	ret = pthread_join(producer.tid, &res);
+	if (ret) {
+		fprintf(stderr, "join producer thread failed: %d\n", ret);
+		goto out;
+	}
+
+	ret = pthread_join(consumer.tid, &res);
+	if (ret) {
+		fprintf(stderr, "join consumer thread failed: %d\n", ret);
+		goto out;
+	}
+
+	if (producer.error || consumer.error) {
+		fprintf(stderr, "spsc test failed\n");
+		goto out;
+	}
+
+	printf("ptr_ring(size:%d) perf spsc test for %d times, took %ld us + %ld us\n",
+	       size, count, producer.duration_us, consumer.duration_us);
+out:
+	ptr_ring_cleanup(&ring, NULL);
+}
+
+static void simple_test(int size, int count)
+{
+	struct timeval start, end;
+	long sec, us;
+	int i = 0;
+	int *ptr;
+	int ret;
+
+	ret = ptr_ring_init(&ring, size, 0);
+	if (ret) {
+		fprintf(stderr, "init failed: %d\n", ret);
+		return;
+	}
+
+	gettimeofday(&start, NULL);
+
+	while (++i <= count) {
+		ret = __ptr_ring_produce(&ring, &count);
+		if (ret) {
+			fprintf(stderr, "produce failed: %d\n", ret);
+			goto out;
+		}
+
+		ptr = __ptr_ring_consume(&ring);
+		if (ptr != &count)  {
+			fprintf(stderr, "consume failed: %p\n", ptr);
+			goto out;
+		}
+	}
+
+	gettimeofday(&end, NULL);
+	sec = (end.tv_sec - start.tv_sec);
+	us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
+	printf("ptr_ring(size:%d) perf simple test for %d times, took %ld us\n",
+	       size, count, us);
+
+out:
+	ptr_ring_cleanup(&ring, NULL);
+}
+
+int main(int argc, char *argv[])
+{
+	int count = 1000000;
+	int size = 1000;
+	int mode = 0;
+	int opt;
+
+	while ((opt = getopt(argc, argv, "N:s:m:")) != -1) {
+		switch (opt) {
+		case 'N':
+			count = atoi(optarg);
+			break;
+		case 's':
+			size = atoi(optarg);
+			break;
+		case 'm':
+			mode = atoi(optarg);
+			break;
+		default:
+			return -1;
+		}
+	}
+
+	if (count <= 0) {
+		fprintf(stderr, "invalid test count, must be > 0\n");
+		return -1;
+	}
+
+	if (size < MIN_RING_SIZE || size > MAX_RING_SIZE) {
+		fprintf(stderr, "invalid ring size, must be in %d-%d\n",
+			MIN_RING_SIZE, MAX_RING_SIZE);
+		return -1;
+	}
+
+	switch (mode) {
+	case 0:
+		simple_test(size, count);
+		break;
+	case 1:
+		spsc_test(size, count);
+		break;
+	default:
+		fprintf(stderr, "invalid test mode\n");
+		return -1;
+	}
+
+	return 0;
+}
diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.h b/tools/testing/selftests/ptr_ring/ptr_ring_test.h
new file mode 100644
index 0000000..6bf2494
--- /dev/null
+++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.h
@@ -0,0 +1,150 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef _TEST_PTR_RING_IMPL_H
+#define _TEST_PTR_RING_IMPL_H
+
+#if defined(__x86_64__) || defined(__i386__)
+static inline void cpu_relax(void)
+{
+	asm volatile ("rep; nop" ::: "memory");
+}
+#elif defined(__aarch64__)
+static inline void cpu_relax(void)
+{
+	asm volatile("yield" ::: "memory");
+}
+#else
+#define cpu_relax() assert(0)
+#endif
+
+static inline void barrier(void)
+{
+	asm volatile("" ::: "memory");
+}
+
+/*
+ * This abuses the atomic builtins for thread fences, and
+ * adds a compiler barrier.
+ */
+#define smp_release() do { \
+	barrier(); \
+	__atomic_thread_fence(__ATOMIC_RELEASE); \
+} while (0)
+
+#define smp_acquire() do { \
+	__atomic_thread_fence(__ATOMIC_ACQUIRE); \
+	barrier(); \
+} while (0)
+
+#if defined(__i386__) || defined(__x86_64__)
+#define smp_wmb()		barrier()
+#else
+#define smp_wmb()		smp_release()
+#endif
+
+#define READ_ONCE(x)		(*(volatile typeof(x) *)&(x))
+#define WRITE_ONCE(x, val)	((*(volatile typeof(x) *)&(x)) = (val))
+#define SMP_CACHE_BYTES		64
+#define cache_line_size		SMP_CACHE_BYTES
+#define unlikely(x)		(__builtin_expect(!!(x), 0))
+#define likely(x)		(__builtin_expect(!!(x), 1))
+#define ALIGN(x, a)		(((x) + (a) - 1) / (a) * (a))
+#define SIZE_MAX		(~(size_t)0)
+#define KMALLOC_MAX_SIZE	SIZE_MAX
+#define spinlock_t		pthread_spinlock_t
+#define gfp_t			int
+#define __GFP_ZERO		0x1
+
+#define ____cacheline_aligned_in_smp __attribute__((aligned(SMP_CACHE_BYTES)))
+
+static void *kmalloc(unsigned int size, gfp_t gfp)
+{
+	void *p;
+
+	p = memalign(64, size);
+	if (!p)
+		return p;
+
+	if (gfp & __GFP_ZERO)
+		memset(p, 0, size);
+
+	return p;
+}
+
+static inline void *kzalloc(unsigned int size, gfp_t flags)
+{
+	return kmalloc(size, flags | __GFP_ZERO);
+}
+
+static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
+{
+	if (size != 0 && n > SIZE_MAX / size)
+		return NULL;
+	return kmalloc(n * size, flags);
+}
+
+static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
+{
+	return kmalloc_array(n, size, flags | __GFP_ZERO);
+}
+
+static void kfree(void *p)
+{
+	free(p);
+}
+
+#define kvmalloc_array		kmalloc_array
+#define kvfree			kfree
+
+static void spin_lock_init(spinlock_t *lock)
+{
+	int r = pthread_spin_init(lock, 0);
+
+	assert(!r);
+}
+
+static void spin_lock(spinlock_t *lock)
+{
+	int ret = pthread_spin_lock(lock);
+
+	assert(!ret);
+}
+
+static void spin_unlock(spinlock_t *lock)
+{
+	int ret = pthread_spin_unlock(lock);
+
+	assert(!ret);
+}
+
+static void spin_lock_bh(spinlock_t *lock)
+{
+	spin_lock(lock);
+}
+
+static void spin_unlock_bh(spinlock_t *lock)
+{
+	spin_unlock(lock);
+}
+
+static void spin_lock_irq(spinlock_t *lock)
+{
+	spin_lock(lock);
+}
+
+static void spin_unlock_irq(spinlock_t *lock)
+{
+	spin_unlock(lock);
+}
+
+static void spin_lock_irqsave(spinlock_t *lock, unsigned long f)
+{
+	spin_lock(lock);
+}
+
+static void spin_unlock_irqrestore(spinlock_t *lock, unsigned long f)
+{
+	spin_unlock(lock);
+}
+
+#endif
-- 
2.7.4


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable
  2021-06-25  3:18 [PATCH net-next v2 0/2] add benchmark selftest and optimization for ptr_ring Yunsheng Lin
  2021-06-25  3:18 ` [PATCH net-next v2 1/2] selftests/ptr_ring: add benchmark application " Yunsheng Lin
@ 2021-06-25  3:18 ` Yunsheng Lin
  2021-06-25  6:32   ` Michael S. Tsirkin
  2021-06-25  6:39   ` Michael S. Tsirkin
  2021-06-25  6:42 ` [PATCH net-next v2 0/2] add benchmark selftest and optimization for ptr_ring Michael S. Tsirkin
  2 siblings, 2 replies; 20+ messages in thread
From: Yunsheng Lin @ 2021-06-25  3:18 UTC (permalink / raw)
  To: davem, kuba, jasowang, mst
  Cc: brouer, paulmck, peterz, will, shuah, linux-kernel, netdev,
	linux-kselftest, linuxarm

Currently r->queue[] is cleared after r->consumer_head is moved
forward, which makes the __ptr_ring_empty() checking called in
page_pool_refill_alloc_cache() unreliable if the checking is done
after the r->queue clearing and before the consumer_head moving
forward.

Move the r->queue[] clearing after consumer_head moving forward
to make __ptr_ring_empty() checking more reliable.

As a side effect of above change, a consumer_head checking is
avoided for the likely case, and it has noticeable performance
improvement when it is tested using the ptr_ring_test selftest
added in the previous patch.

Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
to test the case of single thread doing both the enqueuing and
dequeuing:

 arch     unpatched           patched       delta
arm64      4648 ms            4464 ms       +3.9%
 X86       2562 ms            2401 ms       +6.2%

Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
to test the case of one thread doing enqueuing and another thread
doing dequeuing concurrently, also known as single-producer/single-
consumer:

 arch      unpatched             patched         delta
arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
 x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
V2: Add performance data.
---
 include/linux/ptr_ring.h | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 808f9d3..db9c282 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
 	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
 	 * to work correctly.
 	 */
-	int consumer_head = r->consumer_head;
-	int head = consumer_head++;
+	int consumer_head = r->consumer_head + 1;
 
 	/* Once we have processed enough entries invalidate them in
 	 * the ring all at once so producer can reuse their space in the ring.
@@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
 	 */
 	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
 		     consumer_head >= r->size)) {
+		int tail = r->consumer_tail;
+
+		if (unlikely(consumer_head >= r->size)) {
+			r->consumer_tail = 0;
+			WRITE_ONCE(r->consumer_head, 0);
+		} else {
+			r->consumer_tail = consumer_head;
+			WRITE_ONCE(r->consumer_head, consumer_head);
+		}
+
 		/* Zero out entries in the reverse order: this way we touch the
 		 * cache line that producer might currently be reading the last;
 		 * producer won't make progress and touch other cache lines
 		 * besides the first one until we write out all entries.
 		 */
-		while (likely(head >= r->consumer_tail))
-			r->queue[head--] = NULL;
-		r->consumer_tail = consumer_head;
-	}
-	if (unlikely(consumer_head >= r->size)) {
-		consumer_head = 0;
-		r->consumer_tail = 0;
+		while (likely(--consumer_head >= tail))
+			r->queue[consumer_head] = NULL;
+
+		return;
 	}
+
 	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
 	WRITE_ONCE(r->consumer_head, consumer_head);
 }
-- 
2.7.4


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 1/2] selftests/ptr_ring: add benchmark application for ptr_ring
  2021-06-25  3:18 ` [PATCH net-next v2 1/2] selftests/ptr_ring: add benchmark application " Yunsheng Lin
@ 2021-06-25  3:36   ` Jason Wang
  2021-06-25  3:52     ` Yunsheng Lin
  2021-06-25  6:37   ` Michael S. Tsirkin
  1 sibling, 1 reply; 20+ messages in thread
From: Jason Wang @ 2021-06-25  3:36 UTC (permalink / raw)
  To: Yunsheng Lin, davem, kuba, mst
  Cc: brouer, paulmck, peterz, will, shuah, linux-kernel, netdev,
	linux-kselftest, linuxarm


在 2021/6/25 上午11:18, Yunsheng Lin 写道:
> Currently ptr_ring selftest is embedded within the virtio
> selftest, which involves some specific virtio operation,
> such as notifying and kicking.
>
> As ptr_ring has been used by various subsystems, it deserves
> it's owner's selftest in order to benchmark different usecase
> of ptr_ring, such as page pool and pfifo_fast qdisc.
>
> So add a simple application to benchmark ptr_ring performance.
> Currently two test mode is supported:
> Mode 0: Both enqueuing and dequeuing is done in a single thread,
>          it is called simple test mode in the test app.
> Mode 1: Enqueuing and dequeuing is done in different thread
>          concurrently, also known as SPSC(single-producer/
>          single-consumer) test.
>
> The multi-producer/single-consumer test for pfifo_fast case is
> not added yet, which can be added if using CAS atomic operation
> to enable lockless multi-producer is proved to be better than
> using r->producer_lock.
>
> Only supported on x86 and arm64 for now.
>
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>   MAINTAINERS                                      |   5 +
>   tools/testing/selftests/ptr_ring/Makefile        |   6 +
>   tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++
>   tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++
>   4 files changed, 410 insertions(+)


Why can't you simply reuse tools/virtio/ringtest?

Thanks


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 1/2] selftests/ptr_ring: add benchmark application for ptr_ring
  2021-06-25  3:36   ` Jason Wang
@ 2021-06-25  3:52     ` Yunsheng Lin
  2021-06-27  6:09       ` Michael S. Tsirkin
  0 siblings, 1 reply; 20+ messages in thread
From: Yunsheng Lin @ 2021-06-25  3:52 UTC (permalink / raw)
  To: Jason Wang, davem, kuba, mst
  Cc: brouer, paulmck, peterz, will, shuah, linux-kernel, netdev,
	linux-kselftest, linuxarm

On 2021/6/25 11:36, Jason Wang wrote:
> 
> 在 2021/6/25 上午11:18, Yunsheng Lin 写道:
>> Currently ptr_ring selftest is embedded within the virtio
>> selftest, which involves some specific virtio operation,
>> such as notifying and kicking.
>>
>> As ptr_ring has been used by various subsystems, it deserves
>> it's owner's selftest in order to benchmark different usecase
>> of ptr_ring, such as page pool and pfifo_fast qdisc.
>>
>> So add a simple application to benchmark ptr_ring performance.
>> Currently two test mode is supported:
>> Mode 0: Both enqueuing and dequeuing is done in a single thread,
>>          it is called simple test mode in the test app.
>> Mode 1: Enqueuing and dequeuing is done in different thread
>>          concurrently, also known as SPSC(single-producer/
>>          single-consumer) test.
>>
>> The multi-producer/single-consumer test for pfifo_fast case is
>> not added yet, which can be added if using CAS atomic operation
>> to enable lockless multi-producer is proved to be better than
>> using r->producer_lock.
>>
>> Only supported on x86 and arm64 for now.
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>>   MAINTAINERS                                      |   5 +
>>   tools/testing/selftests/ptr_ring/Makefile        |   6 +
>>   tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++
>>   tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++
>>   4 files changed, 410 insertions(+)
> 
> 
> Why can't you simply reuse tools/virtio/ringtest?

The main reason is stated in the commit log:
"Currently ptr_ring selftest is embedded within the virtio
selftest, which involves some specific virtio operation,
such as notifying and kicking.

As ptr_ring has been used by various subsystems, it deserves
it's owner's selftest in order to benchmark different usecase
of ptr_ring, such as page pool and pfifo_fast qdisc."

More specificly in tools/virtio/ringtest/main.c and
tools/virtio/ringtest/ptr_ring.c, there are a lot of operation
related to virtio usecase, such as start_guest(), start_host(),
poll_used(), notify() or kick() ....., so it makes more sense
to add a generic selftest for ptr ring as it is not only used
by virtio now.


> 
> Thanks
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable
  2021-06-25  3:18 ` [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable Yunsheng Lin
@ 2021-06-25  6:32   ` Michael S. Tsirkin
  2021-06-25  7:21     ` Yunsheng Lin
  2021-06-25  6:39   ` Michael S. Tsirkin
  1 sibling, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2021-06-25  6:32 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, jasowang, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> Currently r->queue[] is cleared after r->consumer_head is moved
> forward, which makes the __ptr_ring_empty() checking called in
> page_pool_refill_alloc_cache() unreliable if the checking is done
> after the r->queue clearing and before the consumer_head moving
> forward.


Well the documentation for __ptr_ring_empty clearly states is
is not guaranteed to be reliable.

 *
 * NB: This is only safe to call if ring is never resized.
 *
 * However, if some other CPU consumes ring entries at the same time, the value
 * returned is not guaranteed to be correct.
 *
 * In this case - to avoid incorrectly detecting the ring
 * as empty - the CPU consuming the ring entries is responsible
 * for either consuming all ring entries until the ring is empty,
 * or synchronizing with some other CPU and causing it to
 * re-test __ptr_ring_empty and/or consume the ring enteries
 * after the synchronization point.
 *

Is it then the case that page_pool_refill_alloc_cache violates
this requirement? How?

It looks like you are trying to make the guarantee stronger and ensure
no false positives.

If yes please document this as such, update the comment so all
code can be evaluated with the eye towards whether the new stronger
guarantee is maintained. In particular I think I see at least one
issue with this immediately.


> Move the r->queue[] clearing after consumer_head moving forward
> to make __ptr_ring_empty() checking more reliable.
> 
> As a side effect of above change, a consumer_head checking is
> avoided for the likely case, and it has noticeable performance
> improvement when it is tested using the ptr_ring_test selftest
> added in the previous patch.
> 
> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> to test the case of single thread doing both the enqueuing and
> dequeuing:
> 
>  arch     unpatched           patched       delta
> arm64      4648 ms            4464 ms       +3.9%
>  X86       2562 ms            2401 ms       +6.2%
> 
> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> to test the case of one thread doing enqueuing and another thread
> doing dequeuing concurrently, also known as single-producer/single-
> consumer:
> 
>  arch      unpatched             patched         delta
> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> V2: Add performance data.
> ---
>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 808f9d3..db9c282 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>  	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
>  	 * to work correctly.
>  	 */
> -	int consumer_head = r->consumer_head;
> -	int head = consumer_head++;
> +	int consumer_head = r->consumer_head + 1;
>  
>  	/* Once we have processed enough entries invalidate them in
>  	 * the ring all at once so producer can reuse their space in the ring.
> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>  	 */
>  	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
>  		     consumer_head >= r->size)) {
> +		int tail = r->consumer_tail;
> +
> +		if (unlikely(consumer_head >= r->size)) {
> +			r->consumer_tail = 0;
> +			WRITE_ONCE(r->consumer_head, 0);
> +		} else {
> +			r->consumer_tail = consumer_head;
> +			WRITE_ONCE(r->consumer_head, consumer_head);
> +		}
> +
>  		/* Zero out entries in the reverse order: this way we touch the
>  		 * cache line that producer might currently be reading the last;
>  		 * producer won't make progress and touch other cache lines
>  		 * besides the first one until we write out all entries.
>  		 */
> -		while (likely(head >= r->consumer_tail))
> -			r->queue[head--] = NULL;
> -		r->consumer_tail = consumer_head;
> -	}
> -	if (unlikely(consumer_head >= r->size)) {
> -		consumer_head = 0;
> -		r->consumer_tail = 0;
> +		while (likely(--consumer_head >= tail))
> +			r->queue[consumer_head] = NULL;
> +
> +		return;


So if now we need this to be reliable then
we also need smp_wmb before writing r->queue[consumer_head],
there could be other gotchas.

>  	}
> +
>  	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
>  	WRITE_ONCE(r->consumer_head, consumer_head);
>  }
> -- 
> 2.7.4


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 1/2] selftests/ptr_ring: add benchmark application for ptr_ring
  2021-06-25  3:18 ` [PATCH net-next v2 1/2] selftests/ptr_ring: add benchmark application " Yunsheng Lin
  2021-06-25  3:36   ` Jason Wang
@ 2021-06-25  6:37   ` Michael S. Tsirkin
  2021-06-25  7:40     ` Yunsheng Lin
  1 sibling, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2021-06-25  6:37 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, jasowang, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On Fri, Jun 25, 2021 at 11:18:55AM +0800, Yunsheng Lin wrote:
> Currently ptr_ring selftest is embedded within the virtio
> selftest, which involves some specific virtio operation,
> such as notifying and kicking.
> 
> As ptr_ring has been used by various subsystems, it deserves
> it's owner's selftest in order to benchmark different usecase
> of ptr_ring, such as page pool and pfifo_fast qdisc.
> 
> So add a simple application to benchmark ptr_ring performance.
> Currently two test mode is supported:
> Mode 0: Both enqueuing and dequeuing is done in a single thread,
>         it is called simple test mode in the test app.
> Mode 1: Enqueuing and dequeuing is done in different thread
>         concurrently, also known as SPSC(single-producer/
>         single-consumer) test.
> 
> The multi-producer/single-consumer test for pfifo_fast case is
> not added yet, which can be added if using CAS atomic operation
> to enable lockless multi-producer is proved to be better than
> using r->producer_lock.
> 
> Only supported on x86 and arm64 for now.
> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
>  MAINTAINERS                                      |   5 +
>  tools/testing/selftests/ptr_ring/Makefile        |   6 +
>  tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++
>  tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++
>  4 files changed, 410 insertions(+)
>  create mode 100644 tools/testing/selftests/ptr_ring/Makefile
>  create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.c
>  create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index cc375fd..1227022 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -14847,6 +14847,11 @@ F:	drivers/net/phy/dp83640*
>  F:	drivers/ptp/*
>  F:	include/linux/ptp_cl*
>  
> +PTR RING BENCHMARK
> +M:	Yunsheng Lin <linyunsheng@huawei.com>
> +L:	netdev@vger.kernel.org
> +F:	tools/testing/selftests/ptr_ring/
> +
>  PTRACE SUPPORT
>  M:	Oleg Nesterov <oleg@redhat.com>
>  S:	Maintained
> diff --git a/tools/testing/selftests/ptr_ring/Makefile b/tools/testing/selftests/ptr_ring/Makefile
> new file mode 100644
> index 0000000..346dea9
> --- /dev/null
> +++ b/tools/testing/selftests/ptr_ring/Makefile
> @@ -0,0 +1,6 @@
> +# SPDX-License-Identifier: GPL-2.0
> +LDLIBS = -lpthread
> +
> +TEST_GEN_PROGS := ptr_ring_test
> +
> +include ../lib.mk
> diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.c b/tools/testing/selftests/ptr_ring/ptr_ring_test.c
> new file mode 100644
> index 0000000..4f32d3d
> --- /dev/null
> +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.c
> @@ -0,0 +1,249 @@
> +// SPDX-License-Identifier: GPL-2.0-only

Can we keep this GPL-2.0-or-later same as ptr ring itself?
Encourages reuse ...

> +/*
> + * Copyright (C) 2021 HiSilicon Limited.
> + */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <sys/time.h>
> +#include <malloc.h>
> +#include <assert.h>
> +#include <stdbool.h>
> +#include <pthread.h>
> +
> +#include "ptr_ring_test.h"
> +#include "../../../../include/linux/ptr_ring.h"
> +
> +#define MIN_RING_SIZE	2
> +#define MAX_RING_SIZE	10000000
> +
> +static struct ptr_ring ring ____cacheline_aligned_in_smp;
> +
> +struct worker_info {
> +	pthread_t tid;
> +	int test_count;
> +	bool error;
> +	long duration_us;
> +};
> +
> +static void *produce_worker(void *arg)
> +{
> +	struct worker_info *info = arg;
> +	struct timeval start, end;
> +	unsigned long i = 0;
> +	long sec, us;
> +	int ret;
> +
> +	gettimeofday(&start, NULL);
> +
> +	while (++i <= info->test_count) {
> +		while (__ptr_ring_full(&ring))
> +			cpu_relax();
> +
> +		ret = __ptr_ring_produce(&ring, (void *)i);
> +		if (ret) {
> +			fprintf(stderr, "produce failed: %d\n", ret);
> +			info->error = true;
> +			return NULL;
> +		}
> +	}
> +
> +	gettimeofday(&end, NULL);
> +
> +	sec = (end.tv_sec - start.tv_sec);
> +	us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
> +	info->duration_us = us;
> +	info->error = false;
> +
> +	return NULL;
> +}

perf does all of this and more. Let's not reinvent the wheel - just run
the test.

> +
> +static void *consume_worker(void *arg)
> +{
> +	struct worker_info *info = arg;
> +	struct timeval start, end;
> +	unsigned long i = 0;
> +	long sec, us;
> +	int *ptr;
> +
> +	gettimeofday(&start, NULL);
> +
> +	while (++i <= info->test_count) {
> +		while (__ptr_ring_empty(&ring))
> +			cpu_relax();
> +
> +		ptr = __ptr_ring_consume(&ring);
> +		if ((unsigned long)ptr != i) {
> +			fprintf(stderr, "consumer failed, ptr: %lu, i: %lu\n",
> +				(unsigned long)ptr, i);
> +			info->error = true;
> +			return NULL;
> +		}
> +	}
> +
> +	gettimeofday(&end, NULL);
> +
> +	if (!__ptr_ring_empty(&ring)) {
> +		fprintf(stderr, "ring should be empty, test failed\n");
> +		info->error = true;
> +		return NULL;
> +	}
> +
> +	sec = (end.tv_sec - start.tv_sec);
> +	us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
> +	info->duration_us = us;
> +	info->error = false;
> +	return NULL;
> +}
> +
> +/* test case for single producer single consumer */
> +static void spsc_test(int size, int count)
> +{
> +	struct worker_info producer, consumer;
> +	pthread_attr_t attr;
> +	void *res;
> +	int ret;
> +
> +	ret = ptr_ring_init(&ring, size, 0);
> +	if (ret) {
> +		fprintf(stderr, "init failed: %d\n", ret);
> +		return;
> +	}
> +
> +	producer.test_count = count;
> +	consumer.test_count = count;
> +
> +	ret = pthread_attr_init(&attr);
> +	if (ret) {
> +		fprintf(stderr, "pthread attr init failed: %d\n", ret);
> +		goto out;
> +	}
> +
> +	ret = pthread_create(&producer.tid, &attr,
> +			     produce_worker, &producer);
> +	if (ret) {
> +		fprintf(stderr, "create producer thread failed: %d\n", ret);
> +		goto out;
> +	}
> +
> +	ret = pthread_create(&consumer.tid, &attr,
> +			     consume_worker, &consumer);
> +	if (ret) {
> +		fprintf(stderr, "create consumer thread failed: %d\n", ret);
> +		goto out;
> +	}
> +
> +	ret = pthread_join(producer.tid, &res);
> +	if (ret) {
> +		fprintf(stderr, "join producer thread failed: %d\n", ret);
> +		goto out;
> +	}
> +
> +	ret = pthread_join(consumer.tid, &res);
> +	if (ret) {
> +		fprintf(stderr, "join consumer thread failed: %d\n", ret);
> +		goto out;
> +	}
> +
> +	if (producer.error || consumer.error) {
> +		fprintf(stderr, "spsc test failed\n");
> +		goto out;
> +	}
> +
> +	printf("ptr_ring(size:%d) perf spsc test for %d times, took %ld us + %ld us\n",
> +	       size, count, producer.duration_us, consumer.duration_us);
> +out:
> +	ptr_ring_cleanup(&ring, NULL);
> +}
> +
> +static void simple_test(int size, int count)
> +{
> +	struct timeval start, end;
> +	long sec, us;
> +	int i = 0;
> +	int *ptr;
> +	int ret;
> +
> +	ret = ptr_ring_init(&ring, size, 0);
> +	if (ret) {
> +		fprintf(stderr, "init failed: %d\n", ret);
> +		return;
> +	}
> +
> +	gettimeofday(&start, NULL);
> +
> +	while (++i <= count) {
> +		ret = __ptr_ring_produce(&ring, &count);
> +		if (ret) {
> +			fprintf(stderr, "produce failed: %d\n", ret);
> +			goto out;
> +		}
> +
> +		ptr = __ptr_ring_consume(&ring);
> +		if (ptr != &count)  {
> +			fprintf(stderr, "consume failed: %p\n", ptr);
> +			goto out;
> +		}
> +	}
> +
> +	gettimeofday(&end, NULL);
> +	sec = (end.tv_sec - start.tv_sec);
> +	us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
> +	printf("ptr_ring(size:%d) perf simple test for %d times, took %ld us\n",
> +	       size, count, us);
> +
> +out:
> +	ptr_ring_cleanup(&ring, NULL);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	int count = 1000000;
> +	int size = 1000;
> +	int mode = 0;
> +	int opt;
> +
> +	while ((opt = getopt(argc, argv, "N:s:m:")) != -1) {
> +		switch (opt) {
> +		case 'N':
> +			count = atoi(optarg);
> +			break;
> +		case 's':
> +			size = atoi(optarg);
> +			break;
> +		case 'm':
> +			mode = atoi(optarg);
> +			break;
> +		default:
> +			return -1;
> +		}
> +	}
> +
> +	if (count <= 0) {
> +		fprintf(stderr, "invalid test count, must be > 0\n");
> +		return -1;
> +	}
> +
> +	if (size < MIN_RING_SIZE || size > MAX_RING_SIZE) {
> +		fprintf(stderr, "invalid ring size, must be in %d-%d\n",
> +			MIN_RING_SIZE, MAX_RING_SIZE);
> +		return -1;
> +	}
> +
> +	switch (mode) {
> +	case 0:
> +		simple_test(size, count);
> +		break;
> +	case 1:
> +		spsc_test(size, count);
> +		break;
> +	default:
> +		fprintf(stderr, "invalid test mode\n");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.h b/tools/testing/selftests/ptr_ring/ptr_ring_test.h
> new file mode 100644
> index 0000000..6bf2494
> --- /dev/null
> +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.h
> @@ -0,0 +1,150 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */

We already have hacks like this in the virtio test.
Let's refactor not duplicate please.


> +
> +#ifndef _TEST_PTR_RING_IMPL_H
> +#define _TEST_PTR_RING_IMPL_H
> +
> +#if defined(__x86_64__) || defined(__i386__)
> +static inline void cpu_relax(void)
> +{
> +	asm volatile ("rep; nop" ::: "memory");
> +}
> +#elif defined(__aarch64__)
> +static inline void cpu_relax(void)
> +{
> +	asm volatile("yield" ::: "memory");
> +}
> +#else
> +#define cpu_relax() assert(0)
> +#endif
> +
> +static inline void barrier(void)
> +{
> +	asm volatile("" ::: "memory");
> +}
> +
> +/*
> + * This abuses the atomic builtins for thread fences, and
> + * adds a compiler barrier.
> + */
> +#define smp_release() do { \
> +	barrier(); \
> +	__atomic_thread_fence(__ATOMIC_RELEASE); \
> +} while (0)
> +
> +#define smp_acquire() do { \
> +	__atomic_thread_fence(__ATOMIC_ACQUIRE); \
> +	barrier(); \
> +} while (0)
> +
> +#if defined(__i386__) || defined(__x86_64__)
> +#define smp_wmb()		barrier()
> +#else
> +#define smp_wmb()		smp_release()
> +#endif
> +
> +#define READ_ONCE(x)		(*(volatile typeof(x) *)&(x))
> +#define WRITE_ONCE(x, val)	((*(volatile typeof(x) *)&(x)) = (val))
> +#define SMP_CACHE_BYTES		64
> +#define cache_line_size		SMP_CACHE_BYTES
> +#define unlikely(x)		(__builtin_expect(!!(x), 0))
> +#define likely(x)		(__builtin_expect(!!(x), 1))
> +#define ALIGN(x, a)		(((x) + (a) - 1) / (a) * (a))
> +#define SIZE_MAX		(~(size_t)0)
> +#define KMALLOC_MAX_SIZE	SIZE_MAX
> +#define spinlock_t		pthread_spinlock_t
> +#define gfp_t			int
> +#define __GFP_ZERO		0x1
> +
> +#define ____cacheline_aligned_in_smp __attribute__((aligned(SMP_CACHE_BYTES)))
> +
> +static void *kmalloc(unsigned int size, gfp_t gfp)
> +{
> +	void *p;
> +
> +	p = memalign(64, size);
> +	if (!p)
> +		return p;
> +
> +	if (gfp & __GFP_ZERO)
> +		memset(p, 0, size);
> +
> +	return p;
> +}
> +
> +static inline void *kzalloc(unsigned int size, gfp_t flags)
> +{
> +	return kmalloc(size, flags | __GFP_ZERO);
> +}
> +
> +static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
> +{
> +	if (size != 0 && n > SIZE_MAX / size)
> +		return NULL;
> +	return kmalloc(n * size, flags);
> +}
> +
> +static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
> +{
> +	return kmalloc_array(n, size, flags | __GFP_ZERO);
> +}
> +
> +static void kfree(void *p)
> +{
> +	free(p);
> +}
> +
> +#define kvmalloc_array		kmalloc_array
> +#define kvfree			kfree
> +
> +static void spin_lock_init(spinlock_t *lock)
> +{
> +	int r = pthread_spin_init(lock, 0);
> +
> +	assert(!r);
> +}
> +
> +static void spin_lock(spinlock_t *lock)
> +{
> +	int ret = pthread_spin_lock(lock);
> +
> +	assert(!ret);
> +}
> +
> +static void spin_unlock(spinlock_t *lock)
> +{
> +	int ret = pthread_spin_unlock(lock);
> +
> +	assert(!ret);
> +}
> +
> +static void spin_lock_bh(spinlock_t *lock)
> +{
> +	spin_lock(lock);
> +}
> +
> +static void spin_unlock_bh(spinlock_t *lock)
> +{
> +	spin_unlock(lock);
> +}
> +
> +static void spin_lock_irq(spinlock_t *lock)
> +{
> +	spin_lock(lock);
> +}
> +
> +static void spin_unlock_irq(spinlock_t *lock)
> +{
> +	spin_unlock(lock);
> +}
> +
> +static void spin_lock_irqsave(spinlock_t *lock, unsigned long f)
> +{
> +	spin_lock(lock);
> +}
> +
> +static void spin_unlock_irqrestore(spinlock_t *lock, unsigned long f)
> +{
> +	spin_unlock(lock);
> +}
> +
> +#endif
> -- 
> 2.7.4


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable
  2021-06-25  3:18 ` [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable Yunsheng Lin
  2021-06-25  6:32   ` Michael S. Tsirkin
@ 2021-06-25  6:39   ` Michael S. Tsirkin
  2021-06-25  9:20     ` Yunsheng Lin
  1 sibling, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2021-06-25  6:39 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, jasowang, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> Currently r->queue[] is cleared after r->consumer_head is moved
> forward, which makes the __ptr_ring_empty() checking called in
> page_pool_refill_alloc_cache() unreliable if the checking is done
> after the r->queue clearing and before the consumer_head moving
> forward.
> 
> Move the r->queue[] clearing after consumer_head moving forward
> to make __ptr_ring_empty() checking more reliable.
> 
> As a side effect of above change, a consumer_head checking is
> avoided for the likely case, and it has noticeable performance
> improvement when it is tested using the ptr_ring_test selftest
> added in the previous patch.
> 
> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> to test the case of single thread doing both the enqueuing and
> dequeuing:
> 
>  arch     unpatched           patched       delta
> arm64      4648 ms            4464 ms       +3.9%
>  X86       2562 ms            2401 ms       +6.2%
> 
> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> to test the case of one thread doing enqueuing and another thread
> doing dequeuing concurrently, also known as single-producer/single-
> consumer:
> 
>  arch      unpatched             patched         delta
> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%

Nice but it's small - could be a fluke.
How many tests did you run? What is the variance?
Did you try pinning to different CPUs to observe numa effects?
Please use perf or some other modern tool for this kind
of benchmark. Thanks!

> 
> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> ---
> V2: Add performance data.
> ---
>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 808f9d3..db9c282 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>  	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
>  	 * to work correctly.
>  	 */
> -	int consumer_head = r->consumer_head;
> -	int head = consumer_head++;
> +	int consumer_head = r->consumer_head + 1;
>  
>  	/* Once we have processed enough entries invalidate them in
>  	 * the ring all at once so producer can reuse their space in the ring.
> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>  	 */
>  	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
>  		     consumer_head >= r->size)) {
> +		int tail = r->consumer_tail;
> +
> +		if (unlikely(consumer_head >= r->size)) {
> +			r->consumer_tail = 0;
> +			WRITE_ONCE(r->consumer_head, 0);
> +		} else {
> +			r->consumer_tail = consumer_head;
> +			WRITE_ONCE(r->consumer_head, consumer_head);
> +		}
> +
>  		/* Zero out entries in the reverse order: this way we touch the
>  		 * cache line that producer might currently be reading the last;
>  		 * producer won't make progress and touch other cache lines
>  		 * besides the first one until we write out all entries.
>  		 */
> -		while (likely(head >= r->consumer_tail))
> -			r->queue[head--] = NULL;
> -		r->consumer_tail = consumer_head;
> -	}
> -	if (unlikely(consumer_head >= r->size)) {
> -		consumer_head = 0;
> -		r->consumer_tail = 0;
> +		while (likely(--consumer_head >= tail))
> +			r->queue[consumer_head] = NULL;
> +
> +		return;
>  	}
> +
>  	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
>  	WRITE_ONCE(r->consumer_head, consumer_head);
>  }
> -- 
> 2.7.4


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 0/2] add benchmark selftest and optimization for ptr_ring
  2021-06-25  3:18 [PATCH net-next v2 0/2] add benchmark selftest and optimization for ptr_ring Yunsheng Lin
  2021-06-25  3:18 ` [PATCH net-next v2 1/2] selftests/ptr_ring: add benchmark application " Yunsheng Lin
  2021-06-25  3:18 ` [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable Yunsheng Lin
@ 2021-06-25  6:42 ` Michael S. Tsirkin
  2 siblings, 0 replies; 20+ messages in thread
From: Michael S. Tsirkin @ 2021-06-25  6:42 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, jasowang, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On Fri, Jun 25, 2021 at 11:18:54AM +0800, Yunsheng Lin wrote:
> Patch 1: add a selftest app to benchmark the performance
>          of ptr_ring.
> Patch 2: make __ptr_ring_empty() checking more reliable
>          and use the just added selftest to benchmark the
>          performance impact.
> 
> V2: add patch 1 and add performance data for patch 2.

Thanks for the patches!
There are some things to improve there - I sent comments
in response to invididual patches.

> Yunsheng Lin (2):
>   selftests/ptr_ring: add benchmark application for ptr_ring
>   ptr_ring: make __ptr_ring_empty() checking more reliable
> 
>  MAINTAINERS                                      |   5 +
>  include/linux/ptr_ring.h                         |  25 ++-
>  tools/testing/selftests/ptr_ring/Makefile        |   6 +
>  tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++
>  tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++
>  5 files changed, 426 insertions(+), 9 deletions(-)
>  create mode 100644 tools/testing/selftests/ptr_ring/Makefile
>  create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.c
>  create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.h
> 
> -- 
> 2.7.4


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable
  2021-06-25  6:32   ` Michael S. Tsirkin
@ 2021-06-25  7:21     ` Yunsheng Lin
  2021-06-25  7:30       ` Michael S. Tsirkin
  0 siblings, 1 reply; 20+ messages in thread
From: Yunsheng Lin @ 2021-06-25  7:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: davem, kuba, jasowang, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On 2021/6/25 14:32, Michael S. Tsirkin wrote:
> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
>> Currently r->queue[] is cleared after r->consumer_head is moved
>> forward, which makes the __ptr_ring_empty() checking called in
>> page_pool_refill_alloc_cache() unreliable if the checking is done
>> after the r->queue clearing and before the consumer_head moving
>> forward.
> 
> 
> Well the documentation for __ptr_ring_empty clearly states is
> is not guaranteed to be reliable.

Yes, this patch does not make __ptr_ring_empty() strictly reliable
without taking the r->consumer_lock, as the disscuission in [1].

1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@huawei.com/#24207011

> 
>  *
>  * NB: This is only safe to call if ring is never resized.
>  *
>  * However, if some other CPU consumes ring entries at the same time, the value
>  * returned is not guaranteed to be correct.
>  *
>  * In this case - to avoid incorrectly detecting the ring
>  * as empty - the CPU consuming the ring entries is responsible
>  * for either consuming all ring entries until the ring is empty,
>  * or synchronizing with some other CPU and causing it to
>  * re-test __ptr_ring_empty and/or consume the ring enteries
>  * after the synchronization point.
>  *
> 
> Is it then the case that page_pool_refill_alloc_cache violates
> this requirement? How?

As my understanding:
page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid
taking r->consumer_lock, when the above data race happens, it will
exit out and allocate page from the page allocator instead of reusing
the page in ptr_ring, which *may* not be happening if __ptr_ring_empty()
is more reliable.

> 
> It looks like you are trying to make the guarantee stronger and ensure
> no false positives.
> 
> If yes please document this as such, update the comment so all
> code can be evaluated with the eye towards whether the new stronger
> guarantee is maintained. In particular I think I see at least one
> issue with this immediately.
> 
> 
>> Move the r->queue[] clearing after consumer_head moving forward
>> to make __ptr_ring_empty() checking more reliable.
>>
>> As a side effect of above change, a consumer_head checking is
>> avoided for the likely case, and it has noticeable performance
>> improvement when it is tested using the ptr_ring_test selftest
>> added in the previous patch.
>>
>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
>> to test the case of single thread doing both the enqueuing and
>> dequeuing:
>>
>>  arch     unpatched           patched       delta
>> arm64      4648 ms            4464 ms       +3.9%
>>  X86       2562 ms            2401 ms       +6.2%
>>
>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
>> to test the case of one thread doing enqueuing and another thread
>> doing dequeuing concurrently, also known as single-producer/single-
>> consumer:
>>
>>  arch      unpatched             patched         delta
>> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
>>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>> V2: Add performance data.
>> ---
>>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
>>  1 file changed, 16 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>> index 808f9d3..db9c282 100644
>> --- a/include/linux/ptr_ring.h
>> +++ b/include/linux/ptr_ring.h
>> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>>  	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
>>  	 * to work correctly.
>>  	 */
>> -	int consumer_head = r->consumer_head;
>> -	int head = consumer_head++;
>> +	int consumer_head = r->consumer_head + 1;
>>  
>>  	/* Once we have processed enough entries invalidate them in
>>  	 * the ring all at once so producer can reuse their space in the ring.
>> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>>  	 */
>>  	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
>>  		     consumer_head >= r->size)) {
>> +		int tail = r->consumer_tail;
>> +
>> +		if (unlikely(consumer_head >= r->size)) {
>> +			r->consumer_tail = 0;
>> +			WRITE_ONCE(r->consumer_head, 0);
>> +		} else {
>> +			r->consumer_tail = consumer_head;
>> +			WRITE_ONCE(r->consumer_head, consumer_head);
>> +		}
>> +
>>  		/* Zero out entries in the reverse order: this way we touch the
>>  		 * cache line that producer might currently be reading the last;
>>  		 * producer won't make progress and touch other cache lines
>>  		 * besides the first one until we write out all entries.
>>  		 */
>> -		while (likely(head >= r->consumer_tail))
>> -			r->queue[head--] = NULL;
>> -		r->consumer_tail = consumer_head;
>> -	}
>> -	if (unlikely(consumer_head >= r->size)) {
>> -		consumer_head = 0;
>> -		r->consumer_tail = 0;
>> +		while (likely(--consumer_head >= tail))
>> +			r->queue[consumer_head] = NULL;
>> +
>> +		return;
> 
> 
> So if now we need this to be reliable then
> we also need smp_wmb before writing r->queue[consumer_head],
> there could be other gotchas.

Yes, This patch does not make it strictly reliable.
T think I could mention that in the commit log?

> 
>>  	}
>> +
>>  	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
>>  	WRITE_ONCE(r->consumer_head, consumer_head);
>>  }
>> -- 
>> 2.7.4
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable
  2021-06-25  7:21     ` Yunsheng Lin
@ 2021-06-25  7:30       ` Michael S. Tsirkin
  2021-06-25  8:33         ` Yunsheng Lin
  0 siblings, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2021-06-25  7:30 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, jasowang, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote:
> On 2021/6/25 14:32, Michael S. Tsirkin wrote:
> > On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> >> Currently r->queue[] is cleared after r->consumer_head is moved
> >> forward, which makes the __ptr_ring_empty() checking called in
> >> page_pool_refill_alloc_cache() unreliable if the checking is done
> >> after the r->queue clearing and before the consumer_head moving
> >> forward.
> > 
> > 
> > Well the documentation for __ptr_ring_empty clearly states is
> > is not guaranteed to be reliable.
> 
> Yes, this patch does not make __ptr_ring_empty() strictly reliable
> without taking the r->consumer_lock, as the disscuission in [1].
> 
> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@huawei.com/#24207011
> 
> > 
> >  *
> >  * NB: This is only safe to call if ring is never resized.
> >  *
> >  * However, if some other CPU consumes ring entries at the same time, the value
> >  * returned is not guaranteed to be correct.
> >  *
> >  * In this case - to avoid incorrectly detecting the ring
> >  * as empty - the CPU consuming the ring entries is responsible
> >  * for either consuming all ring entries until the ring is empty,
> >  * or synchronizing with some other CPU and causing it to
> >  * re-test __ptr_ring_empty and/or consume the ring enteries
> >  * after the synchronization point.
> >  *
> > 
> > Is it then the case that page_pool_refill_alloc_cache violates
> > this requirement? How?
> 
> As my understanding:
> page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid
> taking r->consumer_lock, when the above data race happens, it will
> exit out and allocate page from the page allocator instead of reusing
> the page in ptr_ring, which *may* not be happening if __ptr_ring_empty()
> is more reliable.

Question is how do we know it's more reliable?
It would be nice if we did actually made it more reliable,
as it is we are just shifting races around.


> > 
> > It looks like you are trying to make the guarantee stronger and ensure
> > no false positives.
> > 
> > If yes please document this as such, update the comment so all
> > code can be evaluated with the eye towards whether the new stronger
> > guarantee is maintained. In particular I think I see at least one
> > issue with this immediately.
> > 
> > 
> >> Move the r->queue[] clearing after consumer_head moving forward
> >> to make __ptr_ring_empty() checking more reliable.
> >>
> >> As a side effect of above change, a consumer_head checking is
> >> avoided for the likely case, and it has noticeable performance
> >> improvement when it is tested using the ptr_ring_test selftest
> >> added in the previous patch.
> >>
> >> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> >> to test the case of single thread doing both the enqueuing and
> >> dequeuing:
> >>
> >>  arch     unpatched           patched       delta
> >> arm64      4648 ms            4464 ms       +3.9%
> >>  X86       2562 ms            2401 ms       +6.2%
> >>
> >> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> >> to test the case of one thread doing enqueuing and another thread
> >> doing dequeuing concurrently, also known as single-producer/single-
> >> consumer:
> >>
> >>  arch      unpatched             patched         delta
> >> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
> >>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
> >>
> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >> ---
> >> V2: Add performance data.
> >> ---
> >>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
> >>  1 file changed, 16 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> >> index 808f9d3..db9c282 100644
> >> --- a/include/linux/ptr_ring.h
> >> +++ b/include/linux/ptr_ring.h
> >> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> >>  	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
> >>  	 * to work correctly.
> >>  	 */
> >> -	int consumer_head = r->consumer_head;
> >> -	int head = consumer_head++;
> >> +	int consumer_head = r->consumer_head + 1;
> >>  
> >>  	/* Once we have processed enough entries invalidate them in
> >>  	 * the ring all at once so producer can reuse their space in the ring.
> >> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> >>  	 */
> >>  	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
> >>  		     consumer_head >= r->size)) {
> >> +		int tail = r->consumer_tail;
> >> +
> >> +		if (unlikely(consumer_head >= r->size)) {
> >> +			r->consumer_tail = 0;
> >> +			WRITE_ONCE(r->consumer_head, 0);
> >> +		} else {
> >> +			r->consumer_tail = consumer_head;
> >> +			WRITE_ONCE(r->consumer_head, consumer_head);
> >> +		}
> >> +
> >>  		/* Zero out entries in the reverse order: this way we touch the
> >>  		 * cache line that producer might currently be reading the last;
> >>  		 * producer won't make progress and touch other cache lines
> >>  		 * besides the first one until we write out all entries.
> >>  		 */
> >> -		while (likely(head >= r->consumer_tail))
> >> -			r->queue[head--] = NULL;
> >> -		r->consumer_tail = consumer_head;
> >> -	}
> >> -	if (unlikely(consumer_head >= r->size)) {
> >> -		consumer_head = 0;
> >> -		r->consumer_tail = 0;
> >> +		while (likely(--consumer_head >= tail))
> >> +			r->queue[consumer_head] = NULL;
> >> +
> >> +		return;
> > 
> > 
> > So if now we need this to be reliable then
> > we also need smp_wmb before writing r->queue[consumer_head],
> > there could be other gotchas.
> 
> Yes, This patch does not make it strictly reliable.
> T think I could mention that in the commit log?

OK so it's not that it makes it more reliable - this patch simply makes
a possible false positive less likely while making  a false negative
more likely. Our assumption is that a false negative is cheaper then?

How do we know that it is?

And even if we prove the ptr_ring itself is faster now,
how do we know what affects callers in a better way a
false positive or a false negative?

I would rather we worked on actually making it reliable
e.g. if we can guarantee no false positives, that would be
a net win.

> 
> > 
> >>  	}
> >> +
> >>  	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
> >>  	WRITE_ONCE(r->consumer_head, consumer_head);
> >>  }
> >> -- 
> >> 2.7.4
> > 
> > 
> > .
> > 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 1/2] selftests/ptr_ring: add benchmark application for ptr_ring
  2021-06-25  6:37   ` Michael S. Tsirkin
@ 2021-06-25  7:40     ` Yunsheng Lin
  0 siblings, 0 replies; 20+ messages in thread
From: Yunsheng Lin @ 2021-06-25  7:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: davem, kuba, jasowang, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On 2021/6/25 14:37, Michael S. Tsirkin wrote:
> On Fri, Jun 25, 2021 at 11:18:55AM +0800, Yunsheng Lin wrote:
>> Currently ptr_ring selftest is embedded within the virtio
>> selftest, which involves some specific virtio operation,
>> such as notifying and kicking.
>>
>> As ptr_ring has been used by various subsystems, it deserves
>> it's owner's selftest in order to benchmark different usecase
>> of ptr_ring, such as page pool and pfifo_fast qdisc.
>>
>> So add a simple application to benchmark ptr_ring performance.
>> Currently two test mode is supported:
>> Mode 0: Both enqueuing and dequeuing is done in a single thread,
>>         it is called simple test mode in the test app.
>> Mode 1: Enqueuing and dequeuing is done in different thread
>>         concurrently, also known as SPSC(single-producer/
>>         single-consumer) test.
>>
>> The multi-producer/single-consumer test for pfifo_fast case is
>> not added yet, which can be added if using CAS atomic operation
>> to enable lockless multi-producer is proved to be better than
>> using r->producer_lock.
>>
>> Only supported on x86 and arm64 for now.
>>
>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>> ---
>>  MAINTAINERS                                      |   5 +
>>  tools/testing/selftests/ptr_ring/Makefile        |   6 +
>>  tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++
>>  tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++
>>  4 files changed, 410 insertions(+)
>>  create mode 100644 tools/testing/selftests/ptr_ring/Makefile
>>  create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.c
>>  create mode 100644 tools/testing/selftests/ptr_ring/ptr_ring_test.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index cc375fd..1227022 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -14847,6 +14847,11 @@ F:	drivers/net/phy/dp83640*
>>  F:	drivers/ptp/*
>>  F:	include/linux/ptp_cl*
>>  
>> +PTR RING BENCHMARK
>> +M:	Yunsheng Lin <linyunsheng@huawei.com>
>> +L:	netdev@vger.kernel.org
>> +F:	tools/testing/selftests/ptr_ring/
>> +
>>  PTRACE SUPPORT
>>  M:	Oleg Nesterov <oleg@redhat.com>
>>  S:	Maintained
>> diff --git a/tools/testing/selftests/ptr_ring/Makefile b/tools/testing/selftests/ptr_ring/Makefile
>> new file mode 100644
>> index 0000000..346dea9
>> --- /dev/null
>> +++ b/tools/testing/selftests/ptr_ring/Makefile
>> @@ -0,0 +1,6 @@
>> +# SPDX-License-Identifier: GPL-2.0
>> +LDLIBS = -lpthread
>> +
>> +TEST_GEN_PROGS := ptr_ring_test
>> +
>> +include ../lib.mk
>> diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.c b/tools/testing/selftests/ptr_ring/ptr_ring_test.c
>> new file mode 100644
>> index 0000000..4f32d3d
>> --- /dev/null
>> +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.c
>> @@ -0,0 +1,249 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
> 
> Can we keep this GPL-2.0-or-later same as ptr ring itself?
> Encourages reuse ...

Ok.

> 
>> +/*
>> + * Copyright (C) 2021 HiSilicon Limited.
>> + */
>> +
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <unistd.h>
>> +#include <string.h>
>> +#include <errno.h>
>> +#include <sys/time.h>
>> +#include <malloc.h>
>> +#include <assert.h>
>> +#include <stdbool.h>
>> +#include <pthread.h>
>> +
>> +#include "ptr_ring_test.h"
>> +#include "../../../../include/linux/ptr_ring.h"
>> +
>> +#define MIN_RING_SIZE	2
>> +#define MAX_RING_SIZE	10000000
>> +
>> +static struct ptr_ring ring ____cacheline_aligned_in_smp;
>> +
>> +struct worker_info {
>> +	pthread_t tid;
>> +	int test_count;
>> +	bool error;
>> +	long duration_us;
>> +};
>> +
>> +static void *produce_worker(void *arg)
>> +{
>> +	struct worker_info *info = arg;
>> +	struct timeval start, end;
>> +	unsigned long i = 0;
>> +	long sec, us;
>> +	int ret;
>> +
>> +	gettimeofday(&start, NULL);
>> +
>> +	while (++i <= info->test_count) {
>> +		while (__ptr_ring_full(&ring))
>> +			cpu_relax();
>> +
>> +		ret = __ptr_ring_produce(&ring, (void *)i);
>> +		if (ret) {
>> +			fprintf(stderr, "produce failed: %d\n", ret);
>> +			info->error = true;
>> +			return NULL;
>> +		}
>> +	}
>> +
>> +	gettimeofday(&end, NULL);
>> +
>> +	sec = (end.tv_sec - start.tv_sec);
>> +	us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
>> +	info->duration_us = us;
>> +	info->error = false;
>> +
>> +	return NULL;
>> +}
> 
> perf does all of this and more. Let's not reinvent the wheel - just run
> the test.

You are suggesting to use perf stat + "test cmd" and remove
the above timestamp sampling, right?

> 
>> +
>> +static void *consume_worker(void *arg)
>> +{
>> +	struct worker_info *info = arg;
>> +	struct timeval start, end;
>> +	unsigned long i = 0;
>> +	long sec, us;
>> +	int *ptr;
>> +
>> +	gettimeofday(&start, NULL);
>> +
>> +	while (++i <= info->test_count) {
>> +		while (__ptr_ring_empty(&ring))
>> +			cpu_relax();
>> +
>> +		ptr = __ptr_ring_consume(&ring);
>> +		if ((unsigned long)ptr != i) {
>> +			fprintf(stderr, "consumer failed, ptr: %lu, i: %lu\n",
>> +				(unsigned long)ptr, i);
>> +			info->error = true;
>> +			return NULL;
>> +		}
>> +	}
>> +
>> +	gettimeofday(&end, NULL);
>> +
>> +	if (!__ptr_ring_empty(&ring)) {
>> +		fprintf(stderr, "ring should be empty, test failed\n");
>> +		info->error = true;
>> +		return NULL;
>> +	}
>> +
>> +	sec = (end.tv_sec - start.tv_sec);
>> +	us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
>> +	info->duration_us = us;
>> +	info->error = false;
>> +	return NULL;
>> +}
>> +
>> +/* test case for single producer single consumer */
>> +static void spsc_test(int size, int count)
>> +{
>> +	struct worker_info producer, consumer;
>> +	pthread_attr_t attr;
>> +	void *res;
>> +	int ret;
>> +
>> +	ret = ptr_ring_init(&ring, size, 0);
>> +	if (ret) {
>> +		fprintf(stderr, "init failed: %d\n", ret);
>> +		return;
>> +	}
>> +
>> +	producer.test_count = count;
>> +	consumer.test_count = count;
>> +
>> +	ret = pthread_attr_init(&attr);
>> +	if (ret) {
>> +		fprintf(stderr, "pthread attr init failed: %d\n", ret);
>> +		goto out;
>> +	}
>> +
>> +	ret = pthread_create(&producer.tid, &attr,
>> +			     produce_worker, &producer);
>> +	if (ret) {
>> +		fprintf(stderr, "create producer thread failed: %d\n", ret);
>> +		goto out;
>> +	}
>> +
>> +	ret = pthread_create(&consumer.tid, &attr,
>> +			     consume_worker, &consumer);
>> +	if (ret) {
>> +		fprintf(stderr, "create consumer thread failed: %d\n", ret);
>> +		goto out;
>> +	}
>> +
>> +	ret = pthread_join(producer.tid, &res);
>> +	if (ret) {
>> +		fprintf(stderr, "join producer thread failed: %d\n", ret);
>> +		goto out;
>> +	}
>> +
>> +	ret = pthread_join(consumer.tid, &res);
>> +	if (ret) {
>> +		fprintf(stderr, "join consumer thread failed: %d\n", ret);
>> +		goto out;
>> +	}
>> +
>> +	if (producer.error || consumer.error) {
>> +		fprintf(stderr, "spsc test failed\n");
>> +		goto out;
>> +	}
>> +
>> +	printf("ptr_ring(size:%d) perf spsc test for %d times, took %ld us + %ld us\n",
>> +	       size, count, producer.duration_us, consumer.duration_us);
>> +out:
>> +	ptr_ring_cleanup(&ring, NULL);
>> +}
>> +
>> +static void simple_test(int size, int count)
>> +{
>> +	struct timeval start, end;
>> +	long sec, us;
>> +	int i = 0;
>> +	int *ptr;
>> +	int ret;
>> +
>> +	ret = ptr_ring_init(&ring, size, 0);
>> +	if (ret) {
>> +		fprintf(stderr, "init failed: %d\n", ret);
>> +		return;
>> +	}
>> +
>> +	gettimeofday(&start, NULL);
>> +
>> +	while (++i <= count) {
>> +		ret = __ptr_ring_produce(&ring, &count);
>> +		if (ret) {
>> +			fprintf(stderr, "produce failed: %d\n", ret);
>> +			goto out;
>> +		}
>> +
>> +		ptr = __ptr_ring_consume(&ring);
>> +		if (ptr != &count)  {
>> +			fprintf(stderr, "consume failed: %p\n", ptr);
>> +			goto out;
>> +		}
>> +	}
>> +
>> +	gettimeofday(&end, NULL);
>> +	sec = (end.tv_sec - start.tv_sec);
>> +	us = ((sec * 1000000) + end.tv_usec) - (start.tv_usec);
>> +	printf("ptr_ring(size:%d) perf simple test for %d times, took %ld us\n",
>> +	       size, count, us);
>> +
>> +out:
>> +	ptr_ring_cleanup(&ring, NULL);
>> +}
>> +
>> +int main(int argc, char *argv[])
>> +{
>> +	int count = 1000000;
>> +	int size = 1000;
>> +	int mode = 0;
>> +	int opt;
>> +
>> +	while ((opt = getopt(argc, argv, "N:s:m:")) != -1) {
>> +		switch (opt) {
>> +		case 'N':
>> +			count = atoi(optarg);
>> +			break;
>> +		case 's':
>> +			size = atoi(optarg);
>> +			break;
>> +		case 'm':
>> +			mode = atoi(optarg);
>> +			break;
>> +		default:
>> +			return -1;
>> +		}
>> +	}
>> +
>> +	if (count <= 0) {
>> +		fprintf(stderr, "invalid test count, must be > 0\n");
>> +		return -1;
>> +	}
>> +
>> +	if (size < MIN_RING_SIZE || size > MAX_RING_SIZE) {
>> +		fprintf(stderr, "invalid ring size, must be in %d-%d\n",
>> +			MIN_RING_SIZE, MAX_RING_SIZE);
>> +		return -1;
>> +	}
>> +
>> +	switch (mode) {
>> +	case 0:
>> +		simple_test(size, count);
>> +		break;
>> +	case 1:
>> +		spsc_test(size, count);
>> +		break;
>> +	default:
>> +		fprintf(stderr, "invalid test mode\n");
>> +		return -1;
>> +	}
>> +
>> +	return 0;
>> +}
>> diff --git a/tools/testing/selftests/ptr_ring/ptr_ring_test.h b/tools/testing/selftests/ptr_ring/ptr_ring_test.h
>> new file mode 100644
>> index 0000000..6bf2494
>> --- /dev/null
>> +++ b/tools/testing/selftests/ptr_ring/ptr_ring_test.h
>> @@ -0,0 +1,150 @@
>> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> 
> We already have hacks like this in the virtio test.
> Let's refactor not duplicate please.

Yes, I took most of below from virtio test.
But I am not sure I understand what you meant by refactoring.
Are you suggesting to use function from standard C library
instead of using the below "#if defined" hack?

I am not sure if all of the below function has a similiar
one in standard C library.

Would you be more specific about what does refactoring
mean?

> 
> 
>> +
>> +#ifndef _TEST_PTR_RING_IMPL_H
>> +#define _TEST_PTR_RING_IMPL_H
>> +
>> +#if defined(__x86_64__) || defined(__i386__)
>> +static inline void cpu_relax(void)
>> +{
>> +	asm volatile ("rep; nop" ::: "memory");
>> +}
>> +#elif defined(__aarch64__)
>> +static inline void cpu_relax(void)
>> +{
>> +	asm volatile("yield" ::: "memory");
>> +}
>> +#else
>> +#define cpu_relax() assert(0)
>> +#endif
>> +
>> +static inline void barrier(void)
>> +{
>> +	asm volatile("" ::: "memory");
>> +}
>> +
>> +/*
>> + * This abuses the atomic builtins for thread fences, and
>> + * adds a compiler barrier.
>> + */
>> +#define smp_release() do { \
>> +	barrier(); \
>> +	__atomic_thread_fence(__ATOMIC_RELEASE); \
>> +} while (0)
>> +
>> +#define smp_acquire() do { \
>> +	__atomic_thread_fence(__ATOMIC_ACQUIRE); \
>> +	barrier(); \
>> +} while (0)
>> +
>> +#if defined(__i386__) || defined(__x86_64__)
>> +#define smp_wmb()		barrier()
>> +#else
>> +#define smp_wmb()		smp_release()
>> +#endif
>> +
>> +#define READ_ONCE(x)		(*(volatile typeof(x) *)&(x))
>> +#define WRITE_ONCE(x, val)	((*(volatile typeof(x) *)&(x)) = (val))
>> +#define SMP_CACHE_BYTES		64
>> +#define cache_line_size		SMP_CACHE_BYTES
>> +#define unlikely(x)		(__builtin_expect(!!(x), 0))
>> +#define likely(x)		(__builtin_expect(!!(x), 1))
>> +#define ALIGN(x, a)		(((x) + (a) - 1) / (a) * (a))
>> +#define SIZE_MAX		(~(size_t)0)
>> +#define KMALLOC_MAX_SIZE	SIZE_MAX
>> +#define spinlock_t		pthread_spinlock_t
>> +#define gfp_t			int
>> +#define __GFP_ZERO		0x1
>> +
>> +#define ____cacheline_aligned_in_smp __attribute__((aligned(SMP_CACHE_BYTES)))
>> +
>> +static void *kmalloc(unsigned int size, gfp_t gfp)
>> +{
>> +	void *p;
>> +
>> +	p = memalign(64, size);
>> +	if (!p)
>> +		return p;
>> +
>> +	if (gfp & __GFP_ZERO)
>> +		memset(p, 0, size);
>> +
>> +	return p;
>> +}
>> +
>> +static inline void *kzalloc(unsigned int size, gfp_t flags)
>> +{
>> +	return kmalloc(size, flags | __GFP_ZERO);
>> +}
>> +
>> +static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
>> +{
>> +	if (size != 0 && n > SIZE_MAX / size)
>> +		return NULL;
>> +	return kmalloc(n * size, flags);
>> +}
>> +
>> +static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
>> +{
>> +	return kmalloc_array(n, size, flags | __GFP_ZERO);
>> +}
>> +
>> +static void kfree(void *p)
>> +{
>> +	free(p);
>> +}
>> +
>> +#define kvmalloc_array		kmalloc_array
>> +#define kvfree			kfree
>> +
>> +static void spin_lock_init(spinlock_t *lock)
>> +{
>> +	int r = pthread_spin_init(lock, 0);
>> +
>> +	assert(!r);
>> +}
>> +
>> +static void spin_lock(spinlock_t *lock)
>> +{
>> +	int ret = pthread_spin_lock(lock);
>> +
>> +	assert(!ret);
>> +}
>> +
>> +static void spin_unlock(spinlock_t *lock)
>> +{
>> +	int ret = pthread_spin_unlock(lock);
>> +
>> +	assert(!ret);
>> +}
>> +
>> +static void spin_lock_bh(spinlock_t *lock)
>> +{
>> +	spin_lock(lock);
>> +}
>> +
>> +static void spin_unlock_bh(spinlock_t *lock)
>> +{
>> +	spin_unlock(lock);
>> +}
>> +
>> +static void spin_lock_irq(spinlock_t *lock)
>> +{
>> +	spin_lock(lock);
>> +}
>> +
>> +static void spin_unlock_irq(spinlock_t *lock)
>> +{
>> +	spin_unlock(lock);
>> +}
>> +
>> +static void spin_lock_irqsave(spinlock_t *lock, unsigned long f)
>> +{
>> +	spin_lock(lock);
>> +}
>> +
>> +static void spin_unlock_irqrestore(spinlock_t *lock, unsigned long f)
>> +{
>> +	spin_unlock(lock);
>> +}
>> +
>> +#endif
>> -- 
>> 2.7.4
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable
  2021-06-25  7:30       ` Michael S. Tsirkin
@ 2021-06-25  8:33         ` Yunsheng Lin
  2021-06-27  6:03           ` Michael S. Tsirkin
  0 siblings, 1 reply; 20+ messages in thread
From: Yunsheng Lin @ 2021-06-25  8:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: davem, kuba, jasowang, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On 2021/6/25 15:30, Michael S. Tsirkin wrote:
> On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote:
>> On 2021/6/25 14:32, Michael S. Tsirkin wrote:
>>> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
>>>> Currently r->queue[] is cleared after r->consumer_head is moved
>>>> forward, which makes the __ptr_ring_empty() checking called in
>>>> page_pool_refill_alloc_cache() unreliable if the checking is done
>>>> after the r->queue clearing and before the consumer_head moving
>>>> forward.
>>>
>>>
>>> Well the documentation for __ptr_ring_empty clearly states is
>>> is not guaranteed to be reliable.
>>
>> Yes, this patch does not make __ptr_ring_empty() strictly reliable
>> without taking the r->consumer_lock, as the disscuission in [1].
>>
>> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@huawei.com/#24207011
>>
>>>
>>>  *
>>>  * NB: This is only safe to call if ring is never resized.
>>>  *
>>>  * However, if some other CPU consumes ring entries at the same time, the value
>>>  * returned is not guaranteed to be correct.
>>>  *
>>>  * In this case - to avoid incorrectly detecting the ring
>>>  * as empty - the CPU consuming the ring entries is responsible
>>>  * for either consuming all ring entries until the ring is empty,
>>>  * or synchronizing with some other CPU and causing it to
>>>  * re-test __ptr_ring_empty and/or consume the ring enteries
>>>  * after the synchronization point.
>>>  *
>>>
>>> Is it then the case that page_pool_refill_alloc_cache violates
>>> this requirement? How?
>>
>> As my understanding:
>> page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid
>> taking r->consumer_lock, when the above data race happens, it will
>> exit out and allocate page from the page allocator instead of reusing
>> the page in ptr_ring, which *may* not be happening if __ptr_ring_empty()
>> is more reliable.
> 
> Question is how do we know it's more reliable?
> It would be nice if we did actually made it more reliable,
> as it is we are just shifting races around.
> 
> 
>>>
>>> It looks like you are trying to make the guarantee stronger and ensure
>>> no false positives.
>>>
>>> If yes please document this as such, update the comment so all
>>> code can be evaluated with the eye towards whether the new stronger
>>> guarantee is maintained. In particular I think I see at least one
>>> issue with this immediately.
>>>
>>>
>>>> Move the r->queue[] clearing after consumer_head moving forward
>>>> to make __ptr_ring_empty() checking more reliable.
>>>>
>>>> As a side effect of above change, a consumer_head checking is
>>>> avoided for the likely case, and it has noticeable performance
>>>> improvement when it is tested using the ptr_ring_test selftest
>>>> added in the previous patch.
>>>>
>>>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
>>>> to test the case of single thread doing both the enqueuing and
>>>> dequeuing:
>>>>
>>>>  arch     unpatched           patched       delta
>>>> arm64      4648 ms            4464 ms       +3.9%
>>>>  X86       2562 ms            2401 ms       +6.2%
>>>>
>>>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
>>>> to test the case of one thread doing enqueuing and another thread
>>>> doing dequeuing concurrently, also known as single-producer/single-
>>>> consumer:
>>>>
>>>>  arch      unpatched             patched         delta
>>>> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
>>>>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
>>>>
>>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>>> ---
>>>> V2: Add performance data.
>>>> ---
>>>>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
>>>>  1 file changed, 16 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
>>>> index 808f9d3..db9c282 100644
>>>> --- a/include/linux/ptr_ring.h
>>>> +++ b/include/linux/ptr_ring.h
>>>> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>>>>  	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
>>>>  	 * to work correctly.
>>>>  	 */
>>>> -	int consumer_head = r->consumer_head;
>>>> -	int head = consumer_head++;
>>>> +	int consumer_head = r->consumer_head + 1;
>>>>  
>>>>  	/* Once we have processed enough entries invalidate them in
>>>>  	 * the ring all at once so producer can reuse their space in the ring.
>>>> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
>>>>  	 */
>>>>  	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
>>>>  		     consumer_head >= r->size)) {
>>>> +		int tail = r->consumer_tail;
>>>> +
>>>> +		if (unlikely(consumer_head >= r->size)) {
>>>> +			r->consumer_tail = 0;
>>>> +			WRITE_ONCE(r->consumer_head, 0);
>>>> +		} else {
>>>> +			r->consumer_tail = consumer_head;
>>>> +			WRITE_ONCE(r->consumer_head, consumer_head);
>>>> +		}
>>>> +
>>>>  		/* Zero out entries in the reverse order: this way we touch the
>>>>  		 * cache line that producer might currently be reading the last;
>>>>  		 * producer won't make progress and touch other cache lines
>>>>  		 * besides the first one until we write out all entries.
>>>>  		 */
>>>> -		while (likely(head >= r->consumer_tail))
>>>> -			r->queue[head--] = NULL;
>>>> -		r->consumer_tail = consumer_head;
>>>> -	}
>>>> -	if (unlikely(consumer_head >= r->size)) {
>>>> -		consumer_head = 0;
>>>> -		r->consumer_tail = 0;
>>>> +		while (likely(--consumer_head >= tail))
>>>> +			r->queue[consumer_head] = NULL;
>>>> +
>>>> +		return;
>>>
>>>
>>> So if now we need this to be reliable then
>>> we also need smp_wmb before writing r->queue[consumer_head],
>>> there could be other gotchas.
>>
>> Yes, This patch does not make it strictly reliable.
>> T think I could mention that in the commit log?
> 
> OK so it's not that it makes it more reliable - this patch simply makes
> a possible false positive less likely while making  a false negative
> more likely. Our assumption is that a false negative is cheaper then?
> 
> How do we know that it is?
> 
> And even if we prove the ptr_ring itself is faster now,
> how do we know what affects callers in a better way a
> false positive or a false negative?
> 
> I would rather we worked on actually making it reliable
> e.g. if we can guarantee no false positives, that would be
> a net win.
I thought deeper about the case you mentioned above, it
seems for the above to happen, the consumer_head need to
be rolled back to zero and incremented to the point when
caller of __ptr_ring_empty() is still *not* able to see the
r->queue[] which has been set to NULL in __ptr_ring_discard_one().

It seems smp_wmb() only need to be done once when consumer_head
is rolled back to zero, and maybe that is enough to make sure the
case you mentioned is fixed too?

And the smp_wmb() is only done once in a round of producing/
consuming, so the performance impact should be minimized?(of
course we need to test it too).

> 
>>
>>>
>>>>  	}
>>>> +
>>>>  	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
>>>>  	WRITE_ONCE(r->consumer_head, consumer_head);
>>>>  }
>>>> -- 
>>>> 2.7.4
>>>
>>>
>>> .
>>>
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable
  2021-06-25  6:39   ` Michael S. Tsirkin
@ 2021-06-25  9:20     ` Yunsheng Lin
  2021-06-27  6:07       ` Michael S. Tsirkin
  0 siblings, 1 reply; 20+ messages in thread
From: Yunsheng Lin @ 2021-06-25  9:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: davem, kuba, jasowang, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On 2021/6/25 14:39, Michael S. Tsirkin wrote:
> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
>> Currently r->queue[] is cleared after r->consumer_head is moved
>> forward, which makes the __ptr_ring_empty() checking called in
>> page_pool_refill_alloc_cache() unreliable if the checking is done
>> after the r->queue clearing and before the consumer_head moving
>> forward.
>>
>> Move the r->queue[] clearing after consumer_head moving forward
>> to make __ptr_ring_empty() checking more reliable.
>>
>> As a side effect of above change, a consumer_head checking is
>> avoided for the likely case, and it has noticeable performance
>> improvement when it is tested using the ptr_ring_test selftest
>> added in the previous patch.
>>
>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
>> to test the case of single thread doing both the enqueuing and
>> dequeuing:
>>
>>  arch     unpatched           patched       delta
>> arm64      4648 ms            4464 ms       +3.9%
>>  X86       2562 ms            2401 ms       +6.2%
>>
>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
>> to test the case of one thread doing enqueuing and another thread
>> doing dequeuing concurrently, also known as single-producer/single-
>> consumer:
>>
>>  arch      unpatched             patched         delta
>> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
>>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
> 
> Nice but it's small - could be a fluke.
> How many tests did you run? What is the variance?
> Did you try pinning to different CPUs to observe numa effects?
> Please use perf or some other modern tool for this kind
> of benchmark. Thanks!

The result is quite stable, and retest using perf stat:

---------------unpatched ptr_ring.c begin----------------------------------

perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2385198 us

 Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':

           2385.49 msec task-clock                #    1.000 CPUs utilized
                26      context-switches          #    0.011 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.024 K/sec
        6202023521      cycles                    #    2.600 GHz
       17424187640      instructions              #    2.81  insn per cycle
   <not supported>      branches
           6506477      branch-misses

       2.385785170 seconds time elapsed

       2.384014000 seconds user
       0.000000000 seconds sys


root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2383385 us

 Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':

           2383.67 msec task-clock                #    1.000 CPUs utilized
                26      context-switches          #    0.011 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.024 K/sec
        6197278066      cycles                    #    2.600 GHz
       17424207772      instructions              #    2.81  insn per cycle
   <not supported>      branches
           6495766      branch-misses

       2.383941170 seconds time elapsed

       2.382215000 seconds user
       0.000000000 seconds sys


root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2390858 us

 Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':

           2391.16 msec task-clock                #    1.000 CPUs utilized
                25      context-switches          #    0.010 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.024 K/sec
        6216704120      cycles                    #    2.600 GHz
       17424243041      instructions              #    2.80  insn per cycle
   <not supported>      branches
           6483886      branch-misses

       2.391420440 seconds time elapsed

       2.389647000 seconds user
       0.000000000 seconds sys


root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2389810 us

 Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':

           2390.10 msec task-clock                #    1.000 CPUs utilized
                26      context-switches          #    0.011 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                58      page-faults               #    0.024 K/sec
        6213995715      cycles                    #    2.600 GHz
       17424227499      instructions              #    2.80  insn per cycle
   <not supported>      branches
           6474069      branch-misses

       2.390367070 seconds time elapsed

       2.388644000 seconds user
       0.000000000 seconds sys

---------------unpatched ptr_ring.c end----------------------------------



---------------patched ptr_ring.c begin----------------------------------
root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2198894 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

           2199.18 msec task-clock                #    1.000 CPUs utilized
                23      context-switches          #    0.010 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                56      page-faults               #    0.025 K/sec
        5717671859      cycles                    #    2.600 GHz
       16124164124      instructions              #    2.82  insn per cycle
   <not supported>      branches
           6564829      branch-misses

       2.199445990 seconds time elapsed

       2.197859000 seconds user
       0.000000000 seconds sys


root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2222337 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

           2222.63 msec task-clock                #    1.000 CPUs utilized
                23      context-switches          #    0.010 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.026 K/sec
        5778632853      cycles                    #    2.600 GHz
       16124210769      instructions              #    2.79  insn per cycle
   <not supported>      branches
           6603904      branch-misses

       2.222901020 seconds time elapsed

       2.221312000 seconds user
       0.000000000 seconds sys


root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2251980 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

           2252.28 msec task-clock                #    1.000 CPUs utilized
                25      context-switches          #    0.011 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.025 K/sec
        5855668335      cycles                    #    2.600 GHz
       16124310588      instructions              #    2.75  insn per cycle
   <not supported>      branches
           6777279      branch-misses

       2.252543340 seconds time elapsed

       2.250897000 seconds user
       0.000000000 seconds sys


root@(none):~#
root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2209415 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

           2209.70 msec task-clock                #    1.000 CPUs utilized
                24      context-switches          #    0.011 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                58      page-faults               #    0.026 K/sec
        5745003772      cycles                    #    2.600 GHz
       16124198886      instructions              #    2.81  insn per cycle
   <not supported>      branches
           6508414      branch-misses

       2.209973960 seconds time elapsed

       2.208354000 seconds user
       0.000000000 seconds sys


root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
ptr_ring(size:1000) perf simple test for 100000000 times, took 2211409 us

 Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':

           2211.70 msec task-clock                #    1.000 CPUs utilized
                24      context-switches          #    0.011 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.026 K/sec
        5750136694      cycles                    #    2.600 GHz
       16124176577      instructions              #    2.80  insn per cycle
   <not supported>      branches
           6553023      branch-misses

       2.211968470 seconds time elapsed

       2.210303000 seconds user
       0.000000000 seconds sys

---------------patched ptr_ring.c end----------------------------------

> 
>>



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable
  2021-06-25  8:33         ` Yunsheng Lin
@ 2021-06-27  6:03           ` Michael S. Tsirkin
  2021-06-28  2:17             ` Yunsheng Lin
  0 siblings, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2021-06-27  6:03 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, jasowang, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On Fri, Jun 25, 2021 at 04:33:40PM +0800, Yunsheng Lin wrote:
> On 2021/6/25 15:30, Michael S. Tsirkin wrote:
> > On Fri, Jun 25, 2021 at 03:21:33PM +0800, Yunsheng Lin wrote:
> >> On 2021/6/25 14:32, Michael S. Tsirkin wrote:
> >>> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> >>>> Currently r->queue[] is cleared after r->consumer_head is moved
> >>>> forward, which makes the __ptr_ring_empty() checking called in
> >>>> page_pool_refill_alloc_cache() unreliable if the checking is done
> >>>> after the r->queue clearing and before the consumer_head moving
> >>>> forward.
> >>>
> >>>
> >>> Well the documentation for __ptr_ring_empty clearly states is
> >>> is not guaranteed to be reliable.
> >>
> >> Yes, this patch does not make __ptr_ring_empty() strictly reliable
> >> without taking the r->consumer_lock, as the disscuission in [1].
> >>
> >> 1. https://patchwork.kernel.org/project/netdevbpf/patch/1622032173-11883-1-git-send-email-linyunsheng@huawei.com/#24207011
> >>
> >>>
> >>>  *
> >>>  * NB: This is only safe to call if ring is never resized.
> >>>  *
> >>>  * However, if some other CPU consumes ring entries at the same time, the value
> >>>  * returned is not guaranteed to be correct.
> >>>  *
> >>>  * In this case - to avoid incorrectly detecting the ring
> >>>  * as empty - the CPU consuming the ring entries is responsible
> >>>  * for either consuming all ring entries until the ring is empty,
> >>>  * or synchronizing with some other CPU and causing it to
> >>>  * re-test __ptr_ring_empty and/or consume the ring enteries
> >>>  * after the synchronization point.
> >>>  *
> >>>
> >>> Is it then the case that page_pool_refill_alloc_cache violates
> >>> this requirement? How?
> >>
> >> As my understanding:
> >> page_pool_refill_alloc_cache() uses __ptr_ring_empty() to avoid
> >> taking r->consumer_lock, when the above data race happens, it will
> >> exit out and allocate page from the page allocator instead of reusing
> >> the page in ptr_ring, which *may* not be happening if __ptr_ring_empty()
> >> is more reliable.
> > 
> > Question is how do we know it's more reliable?
> > It would be nice if we did actually made it more reliable,
> > as it is we are just shifting races around.
> > 
> > 
> >>>
> >>> It looks like you are trying to make the guarantee stronger and ensure
> >>> no false positives.
> >>>
> >>> If yes please document this as such, update the comment so all
> >>> code can be evaluated with the eye towards whether the new stronger
> >>> guarantee is maintained. In particular I think I see at least one
> >>> issue with this immediately.
> >>>
> >>>
> >>>> Move the r->queue[] clearing after consumer_head moving forward
> >>>> to make __ptr_ring_empty() checking more reliable.
> >>>>
> >>>> As a side effect of above change, a consumer_head checking is
> >>>> avoided for the likely case, and it has noticeable performance
> >>>> improvement when it is tested using the ptr_ring_test selftest
> >>>> added in the previous patch.
> >>>>
> >>>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> >>>> to test the case of single thread doing both the enqueuing and
> >>>> dequeuing:
> >>>>
> >>>>  arch     unpatched           patched       delta
> >>>> arm64      4648 ms            4464 ms       +3.9%
> >>>>  X86       2562 ms            2401 ms       +6.2%
> >>>>
> >>>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> >>>> to test the case of one thread doing enqueuing and another thread
> >>>> doing dequeuing concurrently, also known as single-producer/single-
> >>>> consumer:
> >>>>
> >>>>  arch      unpatched             patched         delta
> >>>> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
> >>>>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
> >>>>
> >>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >>>> ---
> >>>> V2: Add performance data.
> >>>> ---
> >>>>  include/linux/ptr_ring.h | 25 ++++++++++++++++---------
> >>>>  1 file changed, 16 insertions(+), 9 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> >>>> index 808f9d3..db9c282 100644
> >>>> --- a/include/linux/ptr_ring.h
> >>>> +++ b/include/linux/ptr_ring.h
> >>>> @@ -261,8 +261,7 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> >>>>  	/* Note: we must keep consumer_head valid at all times for __ptr_ring_empty
> >>>>  	 * to work correctly.
> >>>>  	 */
> >>>> -	int consumer_head = r->consumer_head;
> >>>> -	int head = consumer_head++;
> >>>> +	int consumer_head = r->consumer_head + 1;
> >>>>  
> >>>>  	/* Once we have processed enough entries invalidate them in
> >>>>  	 * the ring all at once so producer can reuse their space in the ring.
> >>>> @@ -271,19 +270,27 @@ static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> >>>>  	 */
> >>>>  	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
> >>>>  		     consumer_head >= r->size)) {
> >>>> +		int tail = r->consumer_tail;
> >>>> +
> >>>> +		if (unlikely(consumer_head >= r->size)) {
> >>>> +			r->consumer_tail = 0;
> >>>> +			WRITE_ONCE(r->consumer_head, 0);
> >>>> +		} else {
> >>>> +			r->consumer_tail = consumer_head;
> >>>> +			WRITE_ONCE(r->consumer_head, consumer_head);
> >>>> +		}
> >>>> +
> >>>>  		/* Zero out entries in the reverse order: this way we touch the
> >>>>  		 * cache line that producer might currently be reading the last;
> >>>>  		 * producer won't make progress and touch other cache lines
> >>>>  		 * besides the first one until we write out all entries.
> >>>>  		 */
> >>>> -		while (likely(head >= r->consumer_tail))
> >>>> -			r->queue[head--] = NULL;
> >>>> -		r->consumer_tail = consumer_head;
> >>>> -	}
> >>>> -	if (unlikely(consumer_head >= r->size)) {
> >>>> -		consumer_head = 0;
> >>>> -		r->consumer_tail = 0;
> >>>> +		while (likely(--consumer_head >= tail))
> >>>> +			r->queue[consumer_head] = NULL;
> >>>> +
> >>>> +		return;
> >>>
> >>>
> >>> So if now we need this to be reliable then
> >>> we also need smp_wmb before writing r->queue[consumer_head],
> >>> there could be other gotchas.
> >>
> >> Yes, This patch does not make it strictly reliable.
> >> T think I could mention that in the commit log?
> > 
> > OK so it's not that it makes it more reliable - this patch simply makes
> > a possible false positive less likely while making  a false negative
> > more likely. Our assumption is that a false negative is cheaper then?
> > 
> > How do we know that it is?
> > 
> > And even if we prove the ptr_ring itself is faster now,
> > how do we know what affects callers in a better way a
> > false positive or a false negative?
> > 
> > I would rather we worked on actually making it reliable
> > e.g. if we can guarantee no false positives, that would be
> > a net win.
> I thought deeper about the case you mentioned above, it
> seems for the above to happen, the consumer_head need to
> be rolled back to zero and incremented to the point when
> caller of __ptr_ring_empty() is still *not* able to see the
> r->queue[] which has been set to NULL in __ptr_ring_discard_one().
> 
> It seems smp_wmb() only need to be done once when consumer_head
> is rolled back to zero, and maybe that is enough to make sure the
> case you mentioned is fixed too?
> 
> And the smp_wmb() is only done once in a round of producing/
> consuming, so the performance impact should be minimized?(of
> course we need to test it too).


Sorry I don't really understand the question here.
I think I agree it's enough to do one smp_wmb between
the write of r->queue and write of consumer_head
to help guarantee no false positives.
What other code changes are necessary I can't yet say
without more a deeper code review.

> > 
> >>
> >>>
> >>>>  	}
> >>>> +
> >>>>  	/* matching READ_ONCE in __ptr_ring_empty for lockless tests */
> >>>>  	WRITE_ONCE(r->consumer_head, consumer_head);
> >>>>  }
> >>>> -- 
> >>>> 2.7.4
> >>>
> >>>
> >>> .
> >>>
> > 
> > 
> > .
> > 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable
  2021-06-25  9:20     ` Yunsheng Lin
@ 2021-06-27  6:07       ` Michael S. Tsirkin
  2021-06-28  2:11         ` [Linuxarm] " Yunsheng Lin
  0 siblings, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2021-06-27  6:07 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, jasowang, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On Fri, Jun 25, 2021 at 05:20:10PM +0800, Yunsheng Lin wrote:
> On 2021/6/25 14:39, Michael S. Tsirkin wrote:
> > On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
> >> Currently r->queue[] is cleared after r->consumer_head is moved
> >> forward, which makes the __ptr_ring_empty() checking called in
> >> page_pool_refill_alloc_cache() unreliable if the checking is done
> >> after the r->queue clearing and before the consumer_head moving
> >> forward.
> >>
> >> Move the r->queue[] clearing after consumer_head moving forward
> >> to make __ptr_ring_empty() checking more reliable.
> >>
> >> As a side effect of above change, a consumer_head checking is
> >> avoided for the likely case, and it has noticeable performance
> >> improvement when it is tested using the ptr_ring_test selftest
> >> added in the previous patch.
> >>
> >> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
> >> to test the case of single thread doing both the enqueuing and
> >> dequeuing:
> >>
> >>  arch     unpatched           patched       delta
> >> arm64      4648 ms            4464 ms       +3.9%
> >>  X86       2562 ms            2401 ms       +6.2%
> >>
> >> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
> >> to test the case of one thread doing enqueuing and another thread
> >> doing dequeuing concurrently, also known as single-producer/single-
> >> consumer:
> >>
> >>  arch      unpatched             patched         delta
> >> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
> >>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
> > 
> > Nice but it's small - could be a fluke.
> > How many tests did you run? What is the variance?
> > Did you try pinning to different CPUs to observe numa effects?
> > Please use perf or some other modern tool for this kind
> > of benchmark. Thanks!
> 
> The result is quite stable, and retest using perf stat:

How stable exactly?  Try with -r so we can find out.

> ---------------unpatched ptr_ring.c begin----------------------------------
> 
> perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2385198 us
> 
>  Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
> 
>            2385.49 msec task-clock                #    1.000 CPUs utilized
>                 26      context-switches          #    0.011 K/sec
>                  0      cpu-migrations            #    0.000 K/sec
>                 57      page-faults               #    0.024 K/sec
>         6202023521      cycles                    #    2.600 GHz
>        17424187640      instructions              #    2.81  insn per cycle
>    <not supported>      branches
>            6506477      branch-misses
> 
>        2.385785170 seconds time elapsed
> 
>        2.384014000 seconds user
>        0.000000000 seconds sys
> 
> 
> root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2383385 us
> 
>  Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
> 
>            2383.67 msec task-clock                #    1.000 CPUs utilized
>                 26      context-switches          #    0.011 K/sec
>                  0      cpu-migrations            #    0.000 K/sec
>                 57      page-faults               #    0.024 K/sec
>         6197278066      cycles                    #    2.600 GHz
>        17424207772      instructions              #    2.81  insn per cycle
>    <not supported>      branches
>            6495766      branch-misses
> 
>        2.383941170 seconds time elapsed
> 
>        2.382215000 seconds user
>        0.000000000 seconds sys
> 
> 
> root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2390858 us
> 
>  Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
> 
>            2391.16 msec task-clock                #    1.000 CPUs utilized
>                 25      context-switches          #    0.010 K/sec
>                  0      cpu-migrations            #    0.000 K/sec
>                 57      page-faults               #    0.024 K/sec
>         6216704120      cycles                    #    2.600 GHz
>        17424243041      instructions              #    2.80  insn per cycle
>    <not supported>      branches
>            6483886      branch-misses
> 
>        2.391420440 seconds time elapsed
> 
>        2.389647000 seconds user
>        0.000000000 seconds sys
> 
> 
> root@(none):~# perf stat ./ptr_ring_test -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2389810 us
> 
>  Performance counter stats for './ptr_ring_test -s 1000 -m 0 -N 100000000':
> 
>            2390.10 msec task-clock                #    1.000 CPUs utilized
>                 26      context-switches          #    0.011 K/sec
>                  0      cpu-migrations            #    0.000 K/sec
>                 58      page-faults               #    0.024 K/sec
>         6213995715      cycles                    #    2.600 GHz
>        17424227499      instructions              #    2.80  insn per cycle
>    <not supported>      branches
>            6474069      branch-misses
> 
>        2.390367070 seconds time elapsed
> 
>        2.388644000 seconds user
>        0.000000000 seconds sys
> 
> ---------------unpatched ptr_ring.c end----------------------------------
> 
> 
> 
> ---------------patched ptr_ring.c begin----------------------------------
> root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2198894 us
> 
>  Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
> 
>            2199.18 msec task-clock                #    1.000 CPUs utilized
>                 23      context-switches          #    0.010 K/sec
>                  0      cpu-migrations            #    0.000 K/sec
>                 56      page-faults               #    0.025 K/sec
>         5717671859      cycles                    #    2.600 GHz
>        16124164124      instructions              #    2.82  insn per cycle
>    <not supported>      branches
>            6564829      branch-misses
> 
>        2.199445990 seconds time elapsed
> 
>        2.197859000 seconds user
>        0.000000000 seconds sys
> 
> 
> root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2222337 us
> 
>  Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
> 
>            2222.63 msec task-clock                #    1.000 CPUs utilized
>                 23      context-switches          #    0.010 K/sec
>                  0      cpu-migrations            #    0.000 K/sec
>                 57      page-faults               #    0.026 K/sec
>         5778632853      cycles                    #    2.600 GHz
>        16124210769      instructions              #    2.79  insn per cycle
>    <not supported>      branches
>            6603904      branch-misses
> 
>        2.222901020 seconds time elapsed
> 
>        2.221312000 seconds user
>        0.000000000 seconds sys
> 
> 
> root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2251980 us
> 
>  Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
> 
>            2252.28 msec task-clock                #    1.000 CPUs utilized
>                 25      context-switches          #    0.011 K/sec
>                  0      cpu-migrations            #    0.000 K/sec
>                 57      page-faults               #    0.025 K/sec
>         5855668335      cycles                    #    2.600 GHz
>        16124310588      instructions              #    2.75  insn per cycle
>    <not supported>      branches
>            6777279      branch-misses
> 
>        2.252543340 seconds time elapsed
> 
>        2.250897000 seconds user
>        0.000000000 seconds sys
> 
> 
> root@(none):~#
> root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2209415 us
> 
>  Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
> 
>            2209.70 msec task-clock                #    1.000 CPUs utilized
>                 24      context-switches          #    0.011 K/sec
>                  0      cpu-migrations            #    0.000 K/sec
>                 58      page-faults               #    0.026 K/sec
>         5745003772      cycles                    #    2.600 GHz
>        16124198886      instructions              #    2.81  insn per cycle
>    <not supported>      branches
>            6508414      branch-misses
> 
>        2.209973960 seconds time elapsed
> 
>        2.208354000 seconds user
>        0.000000000 seconds sys
> 
> 
> root@(none):~# perf stat ./ptr_ring_test_opt -s 1000 -m 0 -N 100000000
> ptr_ring(size:1000) perf simple test for 100000000 times, took 2211409 us
> 
>  Performance counter stats for './ptr_ring_test_opt -s 1000 -m 0 -N 100000000':
> 
>            2211.70 msec task-clock                #    1.000 CPUs utilized
>                 24      context-switches          #    0.011 K/sec
>                  0      cpu-migrations            #    0.000 K/sec
>                 57      page-faults               #    0.026 K/sec
>         5750136694      cycles                    #    2.600 GHz
>        16124176577      instructions              #    2.80  insn per cycle
>    <not supported>      branches
>            6553023      branch-misses
> 
>        2.211968470 seconds time elapsed
> 
>        2.210303000 seconds user
>        0.000000000 seconds sys
> 
> ---------------patched ptr_ring.c end----------------------------------
> 
> > 
> >>
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 1/2] selftests/ptr_ring: add benchmark application for ptr_ring
  2021-06-25  3:52     ` Yunsheng Lin
@ 2021-06-27  6:09       ` Michael S. Tsirkin
  2021-06-28  1:42         ` Yunsheng Lin
  0 siblings, 1 reply; 20+ messages in thread
From: Michael S. Tsirkin @ 2021-06-27  6:09 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: Jason Wang, davem, kuba, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On Fri, Jun 25, 2021 at 11:52:16AM +0800, Yunsheng Lin wrote:
> On 2021/6/25 11:36, Jason Wang wrote:
> > 
> > 在 2021/6/25 上午11:18, Yunsheng Lin 写道:
> >> Currently ptr_ring selftest is embedded within the virtio
> >> selftest, which involves some specific virtio operation,
> >> such as notifying and kicking.
> >>
> >> As ptr_ring has been used by various subsystems, it deserves
> >> it's owner's selftest in order to benchmark different usecase
> >> of ptr_ring, such as page pool and pfifo_fast qdisc.
> >>
> >> So add a simple application to benchmark ptr_ring performance.
> >> Currently two test mode is supported:
> >> Mode 0: Both enqueuing and dequeuing is done in a single thread,
> >>          it is called simple test mode in the test app.
> >> Mode 1: Enqueuing and dequeuing is done in different thread
> >>          concurrently, also known as SPSC(single-producer/
> >>          single-consumer) test.
> >>
> >> The multi-producer/single-consumer test for pfifo_fast case is
> >> not added yet, which can be added if using CAS atomic operation
> >> to enable lockless multi-producer is proved to be better than
> >> using r->producer_lock.
> >>
> >> Only supported on x86 and arm64 for now.
> >>
> >> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
> >> ---
> >>   MAINTAINERS                                      |   5 +
> >>   tools/testing/selftests/ptr_ring/Makefile        |   6 +
> >>   tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++
> >>   tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++
> >>   4 files changed, 410 insertions(+)
> > 
> > 
> > Why can't you simply reuse tools/virtio/ringtest?
> 
> The main reason is stated in the commit log:
> "Currently ptr_ring selftest is embedded within the virtio
> selftest, which involves some specific virtio operation,
> such as notifying and kicking.
> 
> As ptr_ring has been used by various subsystems, it deserves
> it's owner's selftest in order to benchmark different usecase
> of ptr_ring, such as page pool and pfifo_fast qdisc."
> 
> More specificly in tools/virtio/ringtest/main.c and
> tools/virtio/ringtest/ptr_ring.c, there are a lot of operation
> related to virtio usecase, such as start_guest(), start_host(),
> poll_used(), notify() or kick() ....., so it makes more sense
> to add a generic selftest for ptr ring as it is not only used
> by virtio now.


Okay that answers why you didn't just run main.c
but why not add a new test under tools/virtio/ringtest/
reusing the rest of infrastructure that you currently copied?

> 
> > 
> > Thanks
> > 
> > 
> > .
> > 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 1/2] selftests/ptr_ring: add benchmark application for ptr_ring
  2021-06-27  6:09       ` Michael S. Tsirkin
@ 2021-06-28  1:42         ` Yunsheng Lin
  0 siblings, 0 replies; 20+ messages in thread
From: Yunsheng Lin @ 2021-06-28  1:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, davem, kuba, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On 2021/6/27 14:09, Michael S. Tsirkin wrote:
> On Fri, Jun 25, 2021 at 11:52:16AM +0800, Yunsheng Lin wrote:
>> On 2021/6/25 11:36, Jason Wang wrote:
>>>
>>> 在 2021/6/25 上午11:18, Yunsheng Lin 写道:
>>>> Currently ptr_ring selftest is embedded within the virtio
>>>> selftest, which involves some specific virtio operation,
>>>> such as notifying and kicking.
>>>>
>>>> As ptr_ring has been used by various subsystems, it deserves
>>>> it's owner's selftest in order to benchmark different usecase
>>>> of ptr_ring, such as page pool and pfifo_fast qdisc.
>>>>
>>>> So add a simple application to benchmark ptr_ring performance.
>>>> Currently two test mode is supported:
>>>> Mode 0: Both enqueuing and dequeuing is done in a single thread,
>>>>          it is called simple test mode in the test app.
>>>> Mode 1: Enqueuing and dequeuing is done in different thread
>>>>          concurrently, also known as SPSC(single-producer/
>>>>          single-consumer) test.
>>>>
>>>> The multi-producer/single-consumer test for pfifo_fast case is
>>>> not added yet, which can be added if using CAS atomic operation
>>>> to enable lockless multi-producer is proved to be better than
>>>> using r->producer_lock.
>>>>
>>>> Only supported on x86 and arm64 for now.
>>>>
>>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
>>>> ---
>>>>   MAINTAINERS                                      |   5 +
>>>>   tools/testing/selftests/ptr_ring/Makefile        |   6 +
>>>>   tools/testing/selftests/ptr_ring/ptr_ring_test.c | 249 +++++++++++++++++++++++
>>>>   tools/testing/selftests/ptr_ring/ptr_ring_test.h | 150 ++++++++++++++
>>>>   4 files changed, 410 insertions(+)
>>>
>>>
>>> Why can't you simply reuse tools/virtio/ringtest?
>>
>> The main reason is stated in the commit log:
>> "Currently ptr_ring selftest is embedded within the virtio
>> selftest, which involves some specific virtio operation,
>> such as notifying and kicking.
>>
>> As ptr_ring has been used by various subsystems, it deserves
>> it's owner's selftest in order to benchmark different usecase
>> of ptr_ring, such as page pool and pfifo_fast qdisc."
>>
>> More specificly in tools/virtio/ringtest/main.c and
>> tools/virtio/ringtest/ptr_ring.c, there are a lot of operation
>> related to virtio usecase, such as start_guest(), start_host(),
>> poll_used(), notify() or kick() ....., so it makes more sense
>> to add a generic selftest for ptr ring as it is not only used
>> by virtio now.
> 
> 
> Okay that answers why you didn't just run main.c
> but why not add a new test under tools/virtio/ringtest/
> reusing the rest of infrastructure that you currently copied?

Actually, my first attempt was to reuse the infrastructure in
tools/virtio/ or tools/virtio/ringtest/, and neither of them
was able to be compiled in the latest kernel.

And then I read through the code to try fixing the compile error,
I found that the testcase under tools/virtio/ is coupled deeply
to virtio as explained above, which was difficult to read for
someone who is not fimiliar with virtio.

So I searched for how testing is supposed to be added in the kernel,
it seems it is more common to add the testing in tools/testing or
tools/testing/selftest, and ptr ring is not only used by virtio now,
so it seems more appropriate to add a sperate testing for virtio by
instinct.

Most of tools/virtio/ is to do testing related to virtio testing, IMHO,
most of them are better to be in tools/testing/selftest. Even if most of
virtio testing is moved to tools/testing/selftest, I think it makes more
sense to decouple the virtio testing to ptr_ring testing too if we can
find some mechanism to share the abstract infrastructure in ptr_ring_test.h
for both virtio and ptr_ring testing.


> 
>>
>>>
>>> Thanks
>>>
>>>
>>> .
>>>
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Linuxarm] Re: [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable
  2021-06-27  6:07       ` Michael S. Tsirkin
@ 2021-06-28  2:11         ` Yunsheng Lin
  0 siblings, 0 replies; 20+ messages in thread
From: Yunsheng Lin @ 2021-06-28  2:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: davem, kuba, jasowang, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On 2021/6/27 14:07, Michael S. Tsirkin wrote:
> On Fri, Jun 25, 2021 at 05:20:10PM +0800, Yunsheng Lin wrote:
>> On 2021/6/25 14:39, Michael S. Tsirkin wrote:
>>> On Fri, Jun 25, 2021 at 11:18:56AM +0800, Yunsheng Lin wrote:
>>>> Currently r->queue[] is cleared after r->consumer_head is moved
>>>> forward, which makes the __ptr_ring_empty() checking called in
>>>> page_pool_refill_alloc_cache() unreliable if the checking is done
>>>> after the r->queue clearing and before the consumer_head moving
>>>> forward.
>>>>
>>>> Move the r->queue[] clearing after consumer_head moving forward
>>>> to make __ptr_ring_empty() checking more reliable.
>>>>
>>>> As a side effect of above change, a consumer_head checking is
>>>> avoided for the likely case, and it has noticeable performance
>>>> improvement when it is tested using the ptr_ring_test selftest
>>>> added in the previous patch.
>>>>
>>>> Using "taskset -c 1 ./ptr_ring_test -s 1000 -m 0 -N 100000000"
>>>> to test the case of single thread doing both the enqueuing and
>>>> dequeuing:
>>>>
>>>>  arch     unpatched           patched       delta
>>>> arm64      4648 ms            4464 ms       +3.9%
>>>>  X86       2562 ms            2401 ms       +6.2%
>>>>
>>>> Using "taskset -c 1-2 ./ptr_ring_test -s 1000 -m 1 -N 100000000"
>>>> to test the case of one thread doing enqueuing and another thread
>>>> doing dequeuing concurrently, also known as single-producer/single-
>>>> consumer:
>>>>
>>>>  arch      unpatched             patched         delta
>>>> arm64   3624 ms + 3624 ms   3462 ms + 3462 ms    +4.4%
>>>>  x86    2758 ms + 2758 ms   2547 ms + 2547 ms    +7.6%
>>>
>>> Nice but it's small - could be a fluke.
>>> How many tests did you run? What is the variance?
>>> Did you try pinning to different CPUs to observe numa effects?
>>> Please use perf or some other modern tool for this kind
>>> of benchmark. Thanks!
>>
>> The result is quite stable, and retest using perf stat:
> 
> How stable exactly?  Try with -r so we can find out.

Retest with "perf stat -r":

For unpatched one:
Performance counter stats for './ptr_ring_test -s 1000 -m 1 -N 100000000' (100 runs):

           6780.97 msec task-clock                #    2.000 CPUs utilized            ( +-  5.36% )
                73      context-switches          #    0.011 K/sec                    ( +-  5.07% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
                81      page-faults               #    0.012 K/sec                    ( +-  0.76% )
       17629544748      cycles                    #    2.600 GHz                      ( +-  5.36% )
       25496488950      instructions              #    1.45  insn per cycle           ( +-  0.26% )
   <not supported>      branches
          11489031      branch-misses                                                 ( +-  1.69% )

             3.391 +- 0.182 seconds time elapsed  ( +-  5.35% )

For patched one:
Performance counter stats for './ptr_ring_test_opt -s 1000 -m 1 -N 100000000' (100 runs):

           6567.83 msec task-clock                #    2.000 CPUs utilized            ( +-  5.53% )
                71      context-switches          #    0.011 K/sec                    ( +-  5.26% )
                 0      cpu-migrations            #    0.000 K/sec
                82      page-faults               #    0.012 K/sec                    ( +-  0.85% )
       17075489298      cycles                    #    2.600 GHz                      ( +-  5.53% )
       23861051578      instructions              #    1.40  insn per cycle           ( +-  0.07% )
   <not supported>      branches
          10473776      branch-misses                                                 ( +-  0.60% )

             3.284 +- 0.182 seconds time elapsed  ( +-  5.53% )


The result is more stable when using taskset to limit the running cpu, but I suppose
the above data is stable enough to justify the performance improvement?











^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable
  2021-06-27  6:03           ` Michael S. Tsirkin
@ 2021-06-28  2:17             ` Yunsheng Lin
  0 siblings, 0 replies; 20+ messages in thread
From: Yunsheng Lin @ 2021-06-28  2:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: davem, kuba, jasowang, brouer, paulmck, peterz, will, shuah,
	linux-kernel, netdev, linux-kselftest, linuxarm

On 2021/6/27 14:03, Michael S. Tsirkin wrote:
>>>>>
>>>>>
>>>>> So if now we need this to be reliable then
>>>>> we also need smp_wmb before writing r->queue[consumer_head],
>>>>> there could be other gotchas.
>>>>
>>>> Yes, This patch does not make it strictly reliable.
>>>> T think I could mention that in the commit log?
>>>
>>> OK so it's not that it makes it more reliable - this patch simply makes
>>> a possible false positive less likely while making  a false negative
>>> more likely. Our assumption is that a false negative is cheaper then?
>>>
>>> How do we know that it is?
>>>
>>> And even if we prove the ptr_ring itself is faster now,
>>> how do we know what affects callers in a better way a
>>> false positive or a false negative?
>>>
>>> I would rather we worked on actually making it reliable
>>> e.g. if we can guarantee no false positives, that would be
>>> a net win.
>> I thought deeper about the case you mentioned above, it
>> seems for the above to happen, the consumer_head need to
>> be rolled back to zero and incremented to the point when
>> caller of __ptr_ring_empty() is still *not* able to see the
>> r->queue[] which has been set to NULL in __ptr_ring_discard_one().
>>
>> It seems smp_wmb() only need to be done once when consumer_head
>> is rolled back to zero, and maybe that is enough to make sure the
>> case you mentioned is fixed too?
>>
>> And the smp_wmb() is only done once in a round of producing/
>> consuming, so the performance impact should be minimized?(of
>> course we need to test it too).
> 
> 
> Sorry I don't really understand the question here.
> I think I agree it's enough to do one smp_wmb between
> the write of r->queue and write of consumer_head
> to help guarantee no false positives.
> What other code changes are necessary I can't yet say
> without more a deeper code review.
> 

Ok, thanks for the reviewing.
Will add handling the case you mentioned above in V3 if there
is no noticable performanc impact for handling the above case.


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2021-06-28  2:17 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-25  3:18 [PATCH net-next v2 0/2] add benchmark selftest and optimization for ptr_ring Yunsheng Lin
2021-06-25  3:18 ` [PATCH net-next v2 1/2] selftests/ptr_ring: add benchmark application " Yunsheng Lin
2021-06-25  3:36   ` Jason Wang
2021-06-25  3:52     ` Yunsheng Lin
2021-06-27  6:09       ` Michael S. Tsirkin
2021-06-28  1:42         ` Yunsheng Lin
2021-06-25  6:37   ` Michael S. Tsirkin
2021-06-25  7:40     ` Yunsheng Lin
2021-06-25  3:18 ` [PATCH net-next v2 2/2] ptr_ring: make __ptr_ring_empty() checking more reliable Yunsheng Lin
2021-06-25  6:32   ` Michael S. Tsirkin
2021-06-25  7:21     ` Yunsheng Lin
2021-06-25  7:30       ` Michael S. Tsirkin
2021-06-25  8:33         ` Yunsheng Lin
2021-06-27  6:03           ` Michael S. Tsirkin
2021-06-28  2:17             ` Yunsheng Lin
2021-06-25  6:39   ` Michael S. Tsirkin
2021-06-25  9:20     ` Yunsheng Lin
2021-06-27  6:07       ` Michael S. Tsirkin
2021-06-28  2:11         ` [Linuxarm] " Yunsheng Lin
2021-06-25  6:42 ` [PATCH net-next v2 0/2] add benchmark selftest and optimization for ptr_ring Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).