[PATCH bpf-next 0/3] improve and fix barriers for walking perf rb

* [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb
@ 2018-10-17 14:41 Daniel Borkmann
  2018-10-17 14:41 ` [PATCH bpf-next 1/3] tools: add smp_* barrier variants to include infrastructure Daniel Borkmann
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Daniel Borkmann @ 2018-10-17 14:41 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: peterz, paulmck, will.deacon, acme, yhs, john.fastabend, netdev,
	Daniel Borkmann

This set first adds smp_* barrier variants to tools infrastructure
and in a second step updates perf and libbpf to make use of them.
For details, please see individual patches, thanks!

Arnaldo, if there are no objections, could this be routed via bpf-next
with Acked-by's due to later dependencies in libbpf? Alternatively,
I could also get the 2nd patch out during merge window, but perhaps
it's okay to do in one go as there shouldn't be much conflict in perf.

Thanks!

Daniel Borkmann (3):
  tools: add smp_* barrier variants to include infrastructure
  tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  bpf, libbpf: use proper barriers in perf ring buffer walk

 tools/arch/arm64/include/asm/barrier.h | 10 ++++++++++
 tools/arch/x86/include/asm/barrier.h   |  9 ++++++---
 tools/include/asm/barrier.h            | 11 +++++++++++
 tools/lib/bpf/libbpf.c                 | 25 +++++++++++++++++++------
 tools/perf/util/mmap.h                 |  5 +++--
 5 files changed, 49 insertions(+), 11 deletions(-)

-- 
2.9.5


* [PATCH bpf-next 1/3] tools: add smp_* barrier variants to include infrastructure
  2018-10-17 14:41 [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb Daniel Borkmann
@ 2018-10-17 14:41 ` Daniel Borkmann
  2018-10-17 14:41 ` [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb} Daniel Borkmann
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2018-10-17 14:41 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: peterz, paulmck, will.deacon, acme, yhs, john.fastabend, netdev,
	Daniel Borkmann

Add definitions of smp_rmb(), smp_wmb(), and smp_mb() to the tools
include infrastructure. This patch implements them for x86-64 and
arm64; all other archs fall back to rmb(), wmb(), and mb() for now,
so that proper implementations can be added successively by those
who have access to test machines. On x86-64, smp_mb() uses a lock +
add combination on an address below the red zone.
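
As a rough userspace illustration of the description above (not the patch
itself), the x86-64 variants and a conservative fallback could be sketched
as follows. All demo_* names are hypothetical, and the fallback to
__sync_synchronize() is an assumption made for this sketch; the patch
falls back to the existing rmb()/wmb()/mb() macros instead.

```c
#include <assert.h>
#include <stdint.h>

/* Compiler-only barrier: emits no instruction, it only keeps the
 * compiler from reordering memory accesses across this point. */
#define demo_barrier()	__asm__ __volatile__("" ::: "memory")

#if defined(__x86_64__)
/* x86-64 already orders load-load and store-store in hardware, so a
 * compiler barrier is enough for the read and write variants. */
# define demo_smp_rmb()	demo_barrier()
# define demo_smp_wmb()	demo_barrier()
/* Locked add of zero to the dword just below the 128-byte red zone:
 * a full barrier that leaves live stack data untouched. */
# define demo_smp_mb()	\
	__asm__ __volatile__("lock; addl $0,-132(%%rsp)" ::: "memory", "cc")
#else
/* Assumed conservative fallback for this sketch: a full barrier. */
# define demo_smp_rmb()	__sync_synchronize()
# define demo_smp_wmb()	__sync_synchronize()
# define demo_smp_mb()	__sync_synchronize()
#endif

/* Hypothetical helper: store a value, issue a full barrier, read back. */
static uint64_t demo_store_then_load(volatile uint64_t *slot, uint64_t v)
{
	assert(slot);
	*slot = v;
	demo_smp_mb();
	return *slot;
}
```

The offset of -132 puts the 4-byte operand at rsp-132..rsp-129, directly
below the red zone that the x86-64 System V ABI reserves at rsp-128..rsp-1,
so the locked add cannot touch data a leaf function keeps there.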

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/arch/arm64/include/asm/barrier.h | 10 ++++++++++
 tools/arch/x86/include/asm/barrier.h   |  9 ++++++---
 tools/include/asm/barrier.h            | 11 +++++++++++
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/tools/arch/arm64/include/asm/barrier.h b/tools/arch/arm64/include/asm/barrier.h
index 40bde6b..acf1f06 100644
--- a/tools/arch/arm64/include/asm/barrier.h
+++ b/tools/arch/arm64/include/asm/barrier.h
@@ -14,4 +14,14 @@
 #define wmb()		asm volatile("dmb ishst" ::: "memory")
 #define rmb()		asm volatile("dmb ishld" ::: "memory")
 
+/*
+ * Kernel uses dmb variants on arm64 for smp_*() barriers. Pretty much the same
+ * implementation as above mb()/wmb()/rmb(), though for the latter kernel uses
+ * dsb. In any case, should above mb()/wmb()/rmb() change, make sure the below
+ * smp_*() don't.
+ */
+#define smp_mb()	asm volatile("dmb ish" ::: "memory")
+#define smp_wmb()	asm volatile("dmb ishst" ::: "memory")
+#define smp_rmb()	asm volatile("dmb ishld" ::: "memory")
+
 #endif /* _TOOLS_LINUX_ASM_AARCH64_BARRIER_H */
diff --git a/tools/arch/x86/include/asm/barrier.h b/tools/arch/x86/include/asm/barrier.h
index 8774dee..c97c0c5 100644
--- a/tools/arch/x86/include/asm/barrier.h
+++ b/tools/arch/x86/include/asm/barrier.h
@@ -21,9 +21,12 @@
 #define rmb()	asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
 #define wmb()	asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
 #elif defined(__x86_64__)
-#define mb() 	asm volatile("mfence":::"memory")
-#define rmb()	asm volatile("lfence":::"memory")
-#define wmb()	asm volatile("sfence" ::: "memory")
+#define mb()      asm volatile("mfence" ::: "memory")
+#define rmb()     asm volatile("lfence" ::: "memory")
+#define wmb()     asm volatile("sfence" ::: "memory")
+#define smp_rmb() barrier()
+#define smp_wmb() barrier()
+#define smp_mb()  asm volatile("lock; addl $0,-132(%%rsp)" ::: "memory", "cc")
 #endif
 
 #endif /* _TOOLS_LINUX_ASM_X86_BARRIER_H */
diff --git a/tools/include/asm/barrier.h b/tools/include/asm/barrier.h
index 391d942..e4c8845 100644
--- a/tools/include/asm/barrier.h
+++ b/tools/include/asm/barrier.h
@@ -1,4 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/compiler.h>
 #if defined(__i386__) || defined(__x86_64__)
 #include "../../arch/x86/include/asm/barrier.h"
 #elif defined(__arm__)
@@ -26,3 +27,13 @@
 #else
 #include <asm-generic/barrier.h>
 #endif
+/* Fallback definitions for archs that haven't been updated yet. */
+#ifndef smp_rmb
+# define smp_rmb()	rmb()
+#endif
+#ifndef smp_wmb
+# define smp_wmb()	wmb()
+#endif
+#ifndef smp_mb
+# define smp_mb()	mb()
+#endif
-- 
2.9.5


* [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  2018-10-17 14:41 [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb Daniel Borkmann
  2018-10-17 14:41 ` [PATCH bpf-next 1/3] tools: add smp_* barrier variants to include infrastructure Daniel Borkmann
@ 2018-10-17 14:41 ` Daniel Borkmann
  2018-10-17 15:50   ` Peter Zijlstra
  2018-10-17 14:41 ` [PATCH bpf-next 3/3] bpf, libbpf: use proper barriers in perf ring buffer walk Daniel Borkmann
  2018-10-17 15:03 ` [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb Arnaldo Carvalho de Melo
  3 siblings, 1 reply; 18+ messages in thread
From: Daniel Borkmann @ 2018-10-17 14:41 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: peterz, paulmck, will.deacon, acme, yhs, john.fastabend, netdev,
	Daniel Borkmann

Switch both rmb()/mb() barriers to more lightweight smp_rmb()/smp_mb()
ones. When walking the perf ring buffer they pair the following way,
quoting kernel/events/ring_buffer.c:

  Since the mmap() consumer (userspace) can run on a different CPU:

    kernel                               user

    if (LOAD ->data_tail) {              LOAD ->data_head
                          (A)            smp_rmb()       (C)
      STORE $data                        LOAD $data
      smp_wmb()           (B)            smp_mb()        (D)
      STORE ->data_head                  STORE ->data_tail
    }

  Where A pairs with D, and B pairs with C.

  In our case (A) is a control dependency that separates the load
  of the ->data_tail and the stores of $data. In case ->data_tail
  indicates there is no room in the buffer to store $data we do not.

  D needs to be a full barrier since it separates the data READ from
  the tail WRITE.

  For B a WMB is sufficient since it separates two WRITEs, and for C
  an RMB is sufficient since it separates two READs.

Currently, on x86-64, perf uses LFENCE and MFENCE, which is overkill:
lighter-weight barriers suffice, and that matters in particular since
this is a fast path.

According to Peter, rmb()/mb() were added back then via a94d342b9cb0
("tools/perf: Add required memory barriers") at a time when the kernel
still supported chips that needed them. Support for these chips has
since been dropped completely, so we can fix the barriers up as well.
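
To make the A/B/C/D pairing above more concrete, here is a minimal
single-file sketch using C11 atomics as a userspace stand-in. The
struct demo_ring layout, the demo_* functions and the ring size are
hypothetical and not taken from perf; the fences are chosen to be at
least as strong as the smp_*() barriers they stand in for. C11 does not
give ordering guarantees for control dependencies, so (A) is expressed
as an acquire load here, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_RING_SIZE 8u	/* hypothetical capacity, power of two */

struct demo_ring {
	_Atomic uint64_t data_head;	/* advanced by the producer */
	_Atomic uint64_t data_tail;	/* advanced by the consumer */
	uint32_t data[DEMO_RING_SIZE];
};

/* Producer ("kernel" column): returns 1 if the record was stored. */
static int demo_produce(struct demo_ring *rb, uint32_t v)
{
	uint64_t head = atomic_load_explicit(&rb->data_head, memory_order_relaxed);
	/* (A) stand-in: acquire load instead of a control dependency */
	uint64_t tail = atomic_load_explicit(&rb->data_tail, memory_order_acquire);

	if (head - tail == DEMO_RING_SIZE)
		return 0;			/* no room: do not store $data */
	rb->data[head % DEMO_RING_SIZE] = v;	/* STORE $data */
	atomic_thread_fence(memory_order_release);	/* (B) for smp_wmb() */
	atomic_store_explicit(&rb->data_head, head + 1, memory_order_relaxed);
	return 1;
}

/* Consumer ("user" column): drains up to max records into out. */
static unsigned demo_consume(struct demo_ring *rb, uint32_t *out, unsigned max)
{
	uint64_t head = atomic_load_explicit(&rb->data_head, memory_order_relaxed);
	uint64_t tail = atomic_load_explicit(&rb->data_tail, memory_order_relaxed);
	unsigned n = 0;

	atomic_thread_fence(memory_order_acquire);	/* (C) for smp_rmb() */
	while (tail != head && n < max)
		out[n++] = rb->data[tail++ % DEMO_RING_SIZE];	/* LOAD $data */
	atomic_thread_fence(memory_order_seq_cst);	/* (D) for smp_mb() */
	atomic_store_explicit(&rb->data_tail, tail, memory_order_relaxed);
	return n;
}
```

The order of operations in demo_consume() follows the "user" column of
the diagram line by line: head load, (C), data loads, (D), tail store.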

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/mmap.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 05a6d47..de6dc2e 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -73,7 +73,8 @@ static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
 {
 	struct perf_event_mmap_page *pc = mm->base;
 	u64 head = READ_ONCE(pc->data_head);
-	rmb();
+
+	smp_rmb();
 	return head;
 }

@@ -84,7 +85,7 @@ static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
 	/*
 	 * ensure all reads are done before we write the tail out.
 	 */
-	mb();
+	smp_mb();
 	pc->data_tail = tail;
 }

-- 
2.9.5


* [PATCH bpf-next 3/3] bpf, libbpf: use proper barriers in perf ring buffer walk
  2018-10-17 14:41 [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb Daniel Borkmann
  2018-10-17 14:41 ` [PATCH bpf-next 1/3] tools: add smp_* barrier variants to include infrastructure Daniel Borkmann
  2018-10-17 14:41 ` [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb} Daniel Borkmann
@ 2018-10-17 14:41 ` Daniel Borkmann
  2018-10-17 15:51   ` Peter Zijlstra
  2018-10-17 15:03 ` [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb Arnaldo Carvalho de Melo
  3 siblings, 1 reply; 18+ messages in thread
From: Daniel Borkmann @ 2018-10-17 14:41 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: peterz, paulmck, will.deacon, acme, yhs, john.fastabend, netdev,
	Daniel Borkmann

Add bpf_perf_read_head() and bpf_perf_write_tail() helpers to make it
clearer in what context the barriers are used, and use smp_rmb() as
well as smp_mb() in them. Given libbpf is not restricted to x86-64, the
plain compiler barrier needs to be replaced with smp_rmb(). Also,
__sync_synchronize() emits mfence on x86-64, whereas the faster lock +
add sequence can be used there via smp_mb().

Fixes: d0cabbb021be ("tools: bpf: move the event reading loop to libbpf")
Fixes: 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/lib/bpf/libbpf.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index bd71efc..e8ae8db 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -20,6 +20,7 @@
 #include <fcntl.h>
 #include <errno.h>
 #include <asm/unistd.h>
+#include <asm/barrier.h>
 #include <linux/err.h>
 #include <linux/kernel.h>
 #include <linux/bpf.h>
@@ -2413,18 +2414,32 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
 	return 0;
 }
 
+static __u64 bpf_perf_read_head(struct perf_event_mmap_page *header)
+{
+	__u64 data_head = READ_ONCE(header->data_head);
+
+	smp_rmb();
+	return data_head;
+}
+
+static void bpf_perf_write_tail(struct perf_event_mmap_page *header,
+				__u64 data_tail)
+{
+	smp_mb();
+	header->data_tail = data_tail;
+}
+
 enum bpf_perf_event_ret
 bpf_perf_event_read_simple(void *mem, unsigned long size,
 			   unsigned long page_size, void **buf, size_t *buf_len,
 			   bpf_perf_event_print_t fn, void *priv)
 {
-	volatile struct perf_event_mmap_page *header = mem;
+	struct perf_event_mmap_page *header = mem;
+	__u64 data_head = bpf_perf_read_head(header);
 	__u64 data_tail = header->data_tail;
-	__u64 data_head = header->data_head;
 	int ret = LIBBPF_PERF_EVENT_ERROR;
 	void *base, *begin, *end;
 
-	asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
 	if (data_head == data_tail)
 		return LIBBPF_PERF_EVENT_CONT;
 
@@ -2467,8 +2482,6 @@ bpf_perf_event_read_simple(void *mem, unsigned long size,
 		data_tail += ehdr->size;
 	}
 
-	__sync_synchronize(); /* smp_mb() */
-	header->data_tail = data_tail;
-
+	bpf_perf_write_tail(header, data_tail);
 	return ret;
 }
-- 
2.9.5


* Re: [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb
  2018-10-17 14:41 [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb Daniel Borkmann
                   ` (2 preceding siblings ...)
  2018-10-17 14:41 ` [PATCH bpf-next 3/3] bpf, libbpf: use proper barriers in perf ring buffer walk Daniel Borkmann
@ 2018-10-17 15:03 ` Arnaldo Carvalho de Melo
  3 siblings, 0 replies; 18+ messages in thread
From: Arnaldo Carvalho de Melo @ 2018-10-17 15:03 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: alexei.starovoitov, peterz, paulmck, will.deacon, yhs,
	john.fastabend, netdev

Em Wed, Oct 17, 2018 at 04:41:53PM +0200, Daniel Borkmann escreveu:
> This set first adds smp_* barrier variants to tools infrastructure
> and in a second step updates perf and libbpf to make use of them.
> For details, please see individual patches, thanks!
> 
> Arnaldo, if there are no objections, could this be routed via bpf-next
> with Acked-by's due to later dependencies in libbpf? Alternatively,
> I could also get the 2nd patch out during merge window, but perhaps
> it's okay to do in one go as there shouldn't be much conflict in perf.

Right, when updating kernel/events/ring_buffer.c the corresponding
code in tools/ should've been changed :-)

Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>

- Arnaldo
 
> Thanks!
> 
> Daniel Borkmann (3):
>   tools: add smp_* barrier variants to include infrastructure
>   tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
>   bpf, libbpf: use proper barriers in perf ring buffer walk
> 
>  tools/arch/arm64/include/asm/barrier.h | 10 ++++++++++
>  tools/arch/x86/include/asm/barrier.h   |  9 ++++++---
>  tools/include/asm/barrier.h            | 11 +++++++++++
>  tools/lib/bpf/libbpf.c                 | 25 +++++++++++++++++++------
>  tools/perf/util/mmap.h                 |  5 +++--
>  5 files changed, 49 insertions(+), 11 deletions(-)
> 
> -- 
> 2.9.5


* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  2018-10-17 14:41 ` [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb} Daniel Borkmann
@ 2018-10-17 15:50   ` Peter Zijlstra
  2018-10-17 23:10     ` Daniel Borkmann
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2018-10-17 15:50 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs,
	john.fastabend, netdev

On Wed, Oct 17, 2018 at 04:41:55PM +0200, Daniel Borkmann wrote:
> @@ -73,7 +73,8 @@ static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
>  {
>  	struct perf_event_mmap_page *pc = mm->base;
>  	u64 head = READ_ONCE(pc->data_head);
> -	rmb();
> +
> +	smp_rmb();
>  	return head;
>  }
>  
> @@ -84,7 +85,7 @@ static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
>  	/*
>  	 * ensure all reads are done before we write the tail out.
>  	 */
> -	mb();
> +	smp_mb();
>  	pc->data_tail = tail;

Ideally that would be a WRITE_ONCE() to avoid store tearing.

Alternatively, I think we can use smp_store_release() here, all we care
about is that the prior loads stay prior.

Similarly, I suppose, we could use smp_load_acquire() for the data_head
load above.

>  }
>  
> -- 
> 2.9.5
> 


* Re: [PATCH bpf-next 3/3] bpf, libbpf: use proper barriers in perf ring buffer walk
  2018-10-17 14:41 ` [PATCH bpf-next 3/3] bpf, libbpf: use proper barriers in perf ring buffer walk Daniel Borkmann
@ 2018-10-17 15:51   ` Peter Zijlstra
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2018-10-17 15:51 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs,
	john.fastabend, netdev

On Wed, Oct 17, 2018 at 04:41:56PM +0200, Daniel Borkmann wrote:
> +static __u64 bpf_perf_read_head(struct perf_event_mmap_page *header)
> +{
> +	__u64 data_head = READ_ONCE(header->data_head);
> +
> +	smp_rmb();
> +	return data_head;
> +}
> +
> +static void bpf_perf_write_tail(struct perf_event_mmap_page *header,
> +				__u64 data_tail)
> +{
> +	smp_mb();
> +	header->data_tail = data_tail;
> +}

Same comments: either smp_load_acquire()/smp_store_release() or at the
very least a WRITE_ONCE() there.
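
For readers outside the kernel tree, the acquire/release variant of the
two helpers can be approximated with the GCC/Clang __atomic builtins.
This is only a sketch under that assumption: struct demo_page and the
demo_* helpers are hypothetical, and the builtins are not the tools/
smp_load_acquire()/smp_store_release() macros. An atomic store also
covers the store-tearing concern that WRITE_ONCE() addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two fields of perf_event_mmap_page
 * that matter for the ring buffer walk. */
struct demo_page {
	uint64_t data_head;
	uint64_t data_tail;
};

/* Load head with acquire semantics: later loads of ring data cannot
 * be reordered before this load. */
static inline uint64_t demo_read_head(struct demo_page *pg)
{
	assert(pg);
	return __atomic_load_n(&pg->data_head, __ATOMIC_ACQUIRE);
}

/* Store tail with release semantics: earlier loads of ring data are
 * completed before the new tail becomes visible. The store is a single
 * atomic access, so it cannot be torn. */
static inline void demo_write_tail(struct demo_page *pg, uint64_t tail)
{
	assert(pg);
	__atomic_store_n(&pg->data_tail, tail, __ATOMIC_RELEASE);
}
```

On 32-bit targets, 64-bit __atomic operations may need libatomic, so the
sketch is best tried on a 64-bit build.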


* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  2018-10-17 15:50   ` Peter Zijlstra
@ 2018-10-17 23:10     ` Daniel Borkmann
  2018-10-18  8:14       ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Borkmann @ 2018-10-17 23:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs,
	john.fastabend, netdev

On 10/17/2018 05:50 PM, Peter Zijlstra wrote:
> On Wed, Oct 17, 2018 at 04:41:55PM +0200, Daniel Borkmann wrote:
>> @@ -73,7 +73,8 @@ static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
>>  {
>>  	struct perf_event_mmap_page *pc = mm->base;
>>  	u64 head = READ_ONCE(pc->data_head);
>> -	rmb();
>> +
>> +	smp_rmb();
>>  	return head;
>>  }
>>  
>> @@ -84,7 +85,7 @@ static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
>>  	/*
>>  	 * ensure all reads are done before we write the tail out.
>>  	 */
>> -	mb();
>> +	smp_mb();
>>  	pc->data_tail = tail;
> 
> Ideally that would be a WRITE_ONCE() to avoid store tearing.

Right, agree.

> Alternatively, I think we can use smp_store_release() here, all we care
> about is that the prior loads stay prior.
> 
> Similarly, I suppose, we could use smp_load_acquire() for the data_head
> load above.

Wouldn't this then also allow the kernel side to use smp_store_release()
when it updates the head? We'd be pretty much at the model as described
in Documentation/core-api/circular-buffers.rst.

Meaning, rough pseudo-code diff would look as:

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5d3cf40..3d96275 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -84,8 +84,9 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 	 *
 	 * See perf_output_begin().
 	 */
-	smp_wmb(); /* B, matches C */
-	rb->user_page->data_head = head;
+
+	/* B, matches C */
+	smp_store_release(&rb->user_page->data_head, head);

 	/*
 	 * Now check if we missed an update -- rely on previous implied

Plus, user space side of perf (assuming we have the barriers imported):

diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 05a6d47..66e1304 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -72,20 +72,15 @@ void perf_mmap__consume(struct perf_mmap *map);
 static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
 {
 	struct perf_event_mmap_page *pc = mm->base;
-	u64 head = READ_ONCE(pc->data_head);
-	rmb();
-	return head;
+
+	return smp_load_acquire(&pc->data_head);
 }

 static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
 {
 	struct perf_event_mmap_page *pc = md->base;

-	/*
-	 * ensure all reads are done before we write the tail out.
-	 */
-	mb();
-	pc->data_tail = tail;
+	smp_store_release(&pc->data_tail, tail);
 }

 union perf_event *perf_mmap__read_forward(struct perf_mmap *map);


* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  2018-10-17 23:10     ` Daniel Borkmann
@ 2018-10-18  8:14       ` Peter Zijlstra
  2018-10-18 15:04         ` Daniel Borkmann
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2018-10-18  8:14 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs,
	john.fastabend, netdev

On Thu, Oct 18, 2018 at 01:10:15AM +0200, Daniel Borkmann wrote:

> Wouldn't this then also allow the kernel side to use smp_store_release()
> when it updates the head? We'd be pretty much at the model as described
> in Documentation/core-api/circular-buffers.rst.
> 
> Meaning, rough pseudo-code diff would look as:
> 
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 5d3cf40..3d96275 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -84,8 +84,9 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
>  	 *
>  	 * See perf_output_begin().
>  	 */
> -	smp_wmb(); /* B, matches C */
> -	rb->user_page->data_head = head;
> +
> +	/* B, matches C */
> +	smp_store_release(&rb->user_page->data_head, head);

Yes, this would be correct.

The reason we didn't do this is because smp_store_release() ends up
being smp_mb() + WRITE_ONCE() for a fair number of platforms, even if
they have a cheaper smp_wmb(). Most notably ARM.

(ARM64 OTOH would like to have smp_store_release() there I imagine;
while x86 doesn't care either way around).

A similar concern exists for the smp_load_acquire() I proposed for the
userspace side, ARM would have to resort to smp_mb() in that situation,
instead of the cheaper smp_rmb().

The smp_store_release() on the userspace side will actually be of equal
cost or cheaper, since it already has an smp_mb(). Most notably, x86 can
avoid barrier entirely, because TSO doesn't allow the LOAD-STORE reorder
(it only allows the STORE-LOAD reorder). And PowerPC can use LWSYNC
instead of SYNC.


* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  2018-10-18  8:14       ` Peter Zijlstra
@ 2018-10-18 15:04         ` Daniel Borkmann
  2018-10-18 15:33           ` Alexei Starovoitov
  2018-10-19  9:44           ` Peter Zijlstra
  0 siblings, 2 replies; 18+ messages in thread
From: Daniel Borkmann @ 2018-10-18 15:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs,
	john.fastabend, netdev

On 10/18/2018 10:14 AM, Peter Zijlstra wrote:
> On Thu, Oct 18, 2018 at 01:10:15AM +0200, Daniel Borkmann wrote:
> 
>> Wouldn't this then also allow the kernel side to use smp_store_release()
>> when it updates the head? We'd be pretty much at the model as described
>> in Documentation/core-api/circular-buffers.rst.
>>
>> Meaning, rough pseudo-code diff would look as:
>>
>> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
>> index 5d3cf40..3d96275 100644
>> --- a/kernel/events/ring_buffer.c
>> +++ b/kernel/events/ring_buffer.c
>> @@ -84,8 +84,9 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
>>  	 *
>>  	 * See perf_output_begin().
>>  	 */
>> -	smp_wmb(); /* B, matches C */
>> -	rb->user_page->data_head = head;
>> +
>> +	/* B, matches C */
>> +	smp_store_release(&rb->user_page->data_head, head);
> 
> Yes, this would be correct.
> 
> The reason we didn't do this is because smp_store_release() ends up
> being smp_mb() + WRITE_ONCE() for a fair number of platforms, even if
> they have a cheaper smp_wmb(). Most notably ARM.

Yep agree, that would be worse..

> (ARM64 OTOH would like to have smp_store_release() there I imagine;
> while x86 doesn't care either way around).
> 
> A similar concern exists for the smp_load_acquire() I proposed for the
> userspace side, ARM would have to resort to smp_mb() in that situation,
> instead of the cheaper smp_rmb().
> 
> The smp_store_release() on the userspace side will actually be of equal
> cost or cheaper, since it already has an smp_mb(). Most notably, x86 can
> avoid barrier entirely, because TSO doesn't allow the LOAD-STORE reorder
> (it only allows the STORE-LOAD reorder). And PowerPC can use LWSYNC
> instead of SYNC.

Ok, thanks a lot for your feedback, Peter! I've changed the user space
side now to the following diff (also moving to a small helper so it can
be reused by libbpf in the subsequent fix I had in the series):

 tools/arch/arm64/include/asm/barrier.h    | 70 +++++++++++++++++++++++++++++++
 tools/arch/ia64/include/asm/barrier.h     | 13 ++++++
 tools/arch/powerpc/include/asm/barrier.h  | 16 +++++++
 tools/arch/s390/include/asm/barrier.h     | 13 ++++++
 tools/arch/sparc/include/asm/barrier_64.h | 13 ++++++
 tools/arch/x86/include/asm/barrier.h      | 14 +++++++
 tools/include/asm/barrier.h               | 35 ++++++++++++++++
 tools/include/linux/ring_buffer.h         | 69 ++++++++++++++++++++++++++++++
 tools/perf/util/mmap.h                    | 15 ++-----
 9 files changed, 246 insertions(+), 12 deletions(-)
 create mode 100644 tools/include/linux/ring_buffer.h

diff --git a/tools/arch/arm64/include/asm/barrier.h b/tools/arch/arm64/include/asm/barrier.h
index 40bde6b..12835ea 100644
--- a/tools/arch/arm64/include/asm/barrier.h
+++ b/tools/arch/arm64/include/asm/barrier.h
@@ -14,4 +14,74 @@
 #define wmb()		asm volatile("dmb ishst" ::: "memory")
 #define rmb()		asm volatile("dmb ishld" ::: "memory")

+#define smp_store_release(p, v)					\
+do {								\
+	union { typeof(*p) __val; char __c[1]; } __u =		\
+		{ .__val = (__force typeof(*p)) (v) }; 		\
+								\
+	switch (sizeof(*p)) {					\
+	case 1:							\
+		asm volatile ("stlrb %w1, %0"			\
+				: "=Q" (*p)			\
+				: "r" (*(__u8 *)__u.__c)	\
+				: "memory");			\
+		break;						\
+	case 2:							\
+		asm volatile ("stlrh %w1, %0"			\
+				: "=Q" (*p)			\
+				: "r" (*(__u16 *)__u.__c)	\
+				: "memory");			\
+		break;						\
+	case 4:							\
+		asm volatile ("stlr %w1, %0"			\
+				: "=Q" (*p)			\
+				: "r" (*(__u32 *)__u.__c)	\
+				: "memory");			\
+		break;						\
+	case 8:							\
+		asm volatile ("stlr %1, %0"			\
+				: "=Q" (*p)			\
+				: "r" (*(__u64 *)__u.__c)	\
+				: "memory");			\
+		break;						\
+	default:						\
+		/* Only to shut up gcc ... */			\
+		mb();						\
+		break;						\
+	}							\
+} while (0)
+
+#define smp_load_acquire(p)					\
+({								\
+	union { typeof(*p) __val; char __c[1]; } __u;		\
+								\
+	switch (sizeof(*p)) {					\
+	case 1:							\
+		asm volatile ("ldarb %w0, %1"			\
+			: "=r" (*(__u8 *)__u.__c)		\
+			: "Q" (*p) : "memory");			\
+		break;						\
+	case 2:							\
+		asm volatile ("ldarh %w0, %1"			\
+			: "=r" (*(__u16 *)__u.__c)		\
+			: "Q" (*p) : "memory");			\
+		break;						\
+	case 4:							\
+		asm volatile ("ldar %w0, %1"			\
+			: "=r" (*(__u32 *)__u.__c)		\
+			: "Q" (*p) : "memory");			\
+		break;						\
+	case 8:							\
+		asm volatile ("ldar %0, %1"			\
+			: "=r" (*(__u64 *)__u.__c)		\
+			: "Q" (*p) : "memory");			\
+		break;						\
+	default:						\
+		/* Only to shut up gcc ... */			\
+		mb();						\
+		break;						\
+	}							\
+	__u.__val;						\
+})
+
 #endif /* _TOOLS_LINUX_ASM_AARCH64_BARRIER_H */
diff --git a/tools/arch/ia64/include/asm/barrier.h b/tools/arch/ia64/include/asm/barrier.h
index d808ee0..4d471d9 100644
--- a/tools/arch/ia64/include/asm/barrier.h
+++ b/tools/arch/ia64/include/asm/barrier.h
@@ -46,4 +46,17 @@
 #define rmb()		mb()
 #define wmb()		mb()

+#define smp_store_release(p, v)			\
+do {						\
+	barrier();				\
+	WRITE_ONCE(*p, v);			\
+} while (0)
+
+#define smp_load_acquire(p)			\
+({						\
+	typeof(*p) ___p1 = READ_ONCE(*p);	\
+	barrier();				\
+	___p1;					\
+})
+
 #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h
index a634da0..905a2c6 100644
--- a/tools/arch/powerpc/include/asm/barrier.h
+++ b/tools/arch/powerpc/include/asm/barrier.h
@@ -27,4 +27,20 @@
 #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
 #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")

+#if defined(__powerpc64__)
+#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
+
+#define smp_store_release(p, v)			\
+do {						\
+	smp_lwsync();				\
+	WRITE_ONCE(*p, v);			\
+} while (0)
+
+#define smp_load_acquire(p)			\
+({						\
+	typeof(*p) ___p1 = READ_ONCE(*p);	\
+	smp_lwsync();				\
+	___p1;					\
+})
+#endif /* defined(__powerpc64__) */
 #endif /* _TOOLS_LINUX_ASM_POWERPC_BARRIER_H */
diff --git a/tools/arch/s390/include/asm/barrier.h b/tools/arch/s390/include/asm/barrier.h
index 5030c99..de362fa6 100644
--- a/tools/arch/s390/include/asm/barrier.h
+++ b/tools/arch/s390/include/asm/barrier.h
@@ -28,4 +28,17 @@
 #define rmb()				mb()
 #define wmb()				mb()

+#define smp_store_release(p, v)			\
+do {						\
+	barrier();				\
+	WRITE_ONCE(*p, v);			\
+} while (0)
+
+#define smp_load_acquire(p)			\
+({						\
+	typeof(*p) ___p1 = READ_ONCE(*p);	\
+	barrier();				\
+	___p1;					\
+})
+
 #endif /* __TOOLS_LIB_ASM_BARRIER_H */
diff --git a/tools/arch/sparc/include/asm/barrier_64.h b/tools/arch/sparc/include/asm/barrier_64.h
index ba61344..cfb0fdc 100644
--- a/tools/arch/sparc/include/asm/barrier_64.h
+++ b/tools/arch/sparc/include/asm/barrier_64.h
@@ -40,4 +40,17 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
 #define rmb()	__asm__ __volatile__("":::"memory")
 #define wmb()	__asm__ __volatile__("":::"memory")

+#define smp_store_release(p, v)			\
+do {						\
+	barrier();				\
+	WRITE_ONCE(*p, v);			\
+} while (0)
+
+#define smp_load_acquire(p)			\
+({						\
+	typeof(*p) ___p1 = READ_ONCE(*p);	\
+	barrier();				\
+	___p1;					\
+})
+
 #endif /* !(__TOOLS_LINUX_SPARC64_BARRIER_H) */
diff --git a/tools/arch/x86/include/asm/barrier.h b/tools/arch/x86/include/asm/barrier.h
index 8774dee..5891986 100644
--- a/tools/arch/x86/include/asm/barrier.h
+++ b/tools/arch/x86/include/asm/barrier.h
@@ -26,4 +26,18 @@
 #define wmb()	asm volatile("sfence" ::: "memory")
 #endif

+#if defined(__x86_64__)
+#define smp_store_release(p, v)			\
+do {						\
+	barrier();				\
+	WRITE_ONCE(*p, v);			\
+} while (0)
+
+#define smp_load_acquire(p)			\
+({						\
+	typeof(*p) ___p1 = READ_ONCE(*p);	\
+	barrier();				\
+	___p1;					\
+})
+#endif /* defined(__x86_64__) */
 #endif /* _TOOLS_LINUX_ASM_X86_BARRIER_H */
diff --git a/tools/include/asm/barrier.h b/tools/include/asm/barrier.h
index 391d942..8d378c5 100644
--- a/tools/include/asm/barrier.h
+++ b/tools/include/asm/barrier.h
@@ -1,4 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/compiler.h>
 #if defined(__i386__) || defined(__x86_64__)
 #include "../../arch/x86/include/asm/barrier.h"
 #elif defined(__arm__)
@@ -26,3 +27,37 @@
 #else
 #include <asm-generic/barrier.h>
 #endif
+
+/*
+ * Generic fallback smp_*() definitions for archs that haven't
+ * been updated yet.
+ */
+
+#ifndef smp_rmb
+# define smp_rmb()	rmb()
+#endif
+
+#ifndef smp_wmb
+# define smp_wmb()	wmb()
+#endif
+
+#ifndef smp_mb
+# define smp_mb()	mb()
+#endif
+
+#ifndef smp_store_release
+# define smp_store_release(p, v)		\
+do {						\
+	smp_mb();				\
+	WRITE_ONCE(*p, v);			\
+} while (0)
+#endif
+
+#ifndef smp_load_acquire
+# define smp_load_acquire(p)			\
+({						\
+	typeof(*p) ___p1 = READ_ONCE(*p);	\
+	smp_mb();				\
+	___p1;					\
+})
+#endif
diff --git a/tools/include/linux/ring_buffer.h b/tools/include/linux/ring_buffer.h
new file mode 100644
index 0000000..48200e0
--- /dev/null
+++ b/tools/include/linux/ring_buffer.h
@@ -0,0 +1,69 @@
+#ifndef _TOOLS_LINUX_RING_BUFFER_H_
+#define _TOOLS_LINUX_RING_BUFFER_H_
+
+#include <linux/compiler.h>
+#include <asm/barrier.h>
+
+/*
+ * Below barriers pair as follows (kernel/events/ring_buffer.c):
+ *
+ * Since the mmap() consumer (userspace) can run on a different CPU:
+ *
+ *   kernel                             user
+ *
+ *   if (LOAD ->data_tail) {            LOAD ->data_head
+ *                      (A)             smp_rmb()       (C)
+ *      STORE $data                     LOAD $data
+ *      smp_wmb()       (B)             smp_mb()        (D)
+ *      STORE ->data_head               STORE ->data_tail
+ *   }
+ *
+ * Where A pairs with D, and B pairs with C.
+ *
+ * In our case A is a control dependency that separates the load
+ * of the ->data_tail and the stores of $data. In case ->data_tail
+ * indicates there is no room in the buffer to store $data we do not.
+ *
+ * D needs to be a full barrier since it separates the data READ
+ * from the tail WRITE.
+ *
+ * For B a WMB is sufficient since it separates two WRITEs, and for
+ * C an RMB is sufficient since it separates two READs.
+ */
+
+/*
+ * Note, instead of B, C, D we could also use smp_store_release()
+ * in B and D as well as smp_load_acquire() in C. However, this
+ * optimization does not make sense for all architectures: where
+ * smp_load_acquire() resolves into a READ_ONCE() + smp_mb() pair
+ * and smp_store_release() into an smp_mb() + WRITE_ONCE() pair,
+ * smp_wmb() in B and smp_rmb() in C would still be less
+ * expensive. For the case of D this has either the same cost or
+ * is less expensive. For example, due to TSO (total store order),
+ * x86 can avoid the CPU barrier entirely.
+ */
+
+static inline u64 ring_buffer_read_head(struct perf_event_mmap_page *base)
+{
+/*
+ * Architectures where smp_load_acquire() does not fall back to
+ * a READ_ONCE() + smp_mb() pair.
+ */
+#if defined(__x86_64__) || defined(__aarch64__) || defined(__powerpc64__) || \
+    defined(__ia64__) || (defined(__sparc__) && defined(__arch64__))
+	return smp_load_acquire(&base->data_head);
+#else
+	u64 head = READ_ONCE(base->data_head);
+
+	smp_rmb();
+	return head;
+#endif
+}
+
+static inline void ring_buffer_write_tail(struct perf_event_mmap_page *base,
+					  u64 tail)
+{
+	smp_store_release(&base->data_tail, tail);
+}
+
+#endif /* _TOOLS_LINUX_RING_BUFFER_H_ */
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 05a6d47..8f6531f 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -4,7 +4,7 @@
 #include <linux/compiler.h>
 #include <linux/refcount.h>
 #include <linux/types.h>
-#include <asm/barrier.h>
+#include <linux/ring_buffer.h>
 #include <stdbool.h>
 #include "auxtrace.h"
 #include "event.h"
@@ -71,21 +71,12 @@ void perf_mmap__consume(struct perf_mmap *map);

 static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
 {
-	struct perf_event_mmap_page *pc = mm->base;
-	u64 head = READ_ONCE(pc->data_head);
-	rmb();
-	return head;
+	return ring_buffer_read_head(mm->base);
 }

 static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
 {
-	struct perf_event_mmap_page *pc = md->base;
-
-	/*
-	 * ensure all reads are done before we write the tail out.
-	 */
-	mb();
-	pc->data_tail = tail;
+	ring_buffer_write_tail(md->base, tail);
 }

 union perf_event *perf_mmap__read_forward(struct perf_mmap *map);
-- 
2.9.5
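
For illustration, the consumer side of the pairing described in the
ring_buffer.h comment can be sketched as below. This is a minimal sketch
under stated assumptions, not code from the patch: struct demo_rb,
DEMO_SLOTS and the demo_* helpers are hypothetical stand-ins, the records
are fixed-size u64 slots rather than perf events, and
__sync_synchronize() (a GCC full-barrier builtin) stands in for smp_mb(),
mirroring the generic READ_ONCE() + smp_mb() and smp_mb() + WRITE_ONCE()
fallbacks.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SLOTS 8u	/* hypothetical slot count, must be a power of two */

/* Hypothetical stand-in for the head/tail pair of the mmap'ed page. */
struct demo_rb {
	uint64_t data_head;		/* advanced by the producer */
	uint64_t data_tail;		/* advanced by the consumer */
	uint64_t data[DEMO_SLOTS];
};

/* READ_ONCE() + full barrier, like the generic smp_load_acquire() fallback. */
static inline uint64_t demo_read_head(struct demo_rb *rb)
{
	uint64_t head = *(volatile uint64_t *)&rb->data_head;

	__sync_synchronize();		/* (C): head is read before any data */
	return head;
}

/* Full barrier + WRITE_ONCE(), like the generic smp_store_release() fallback. */
static inline void demo_write_tail(struct demo_rb *rb, uint64_t tail)
{
	__sync_synchronize();		/* (D): data reads finish before the tail store */
	*(volatile uint64_t *)&rb->data_tail = tail;
}

/* Drain everything between tail and head; returns the number of slots read. */
static uint64_t demo_drain(struct demo_rb *rb, uint64_t *sum)
{
	uint64_t head = demo_read_head(rb);
	uint64_t tail = rb->data_tail;	/* only the consumer writes the tail */
	uint64_t n = 0;

	assert(head - tail <= DEMO_SLOTS);
	for (; tail != head; tail++, n++)
		*sum += rb->data[tail & (DEMO_SLOTS - 1)];
	demo_write_tail(rb, tail);	/* hand the space back to the producer */
	return n;
}
```

The point of the ordering is that the slots are only read after the head
load, and the tail is only published once those reads are done.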


* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  2018-10-18 15:04         ` Daniel Borkmann
@ 2018-10-18 15:33           ` Alexei Starovoitov
  2018-10-18 19:00             ` Daniel Borkmann
  2018-10-19  8:04             ` Peter Zijlstra
  2018-10-19  9:44           ` Peter Zijlstra
  1 sibling, 2 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2018-10-18 15:33 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Peter Zijlstra, paulmck, will.deacon, acme, yhs, john.fastabend, netdev

On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
>  #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
> diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h
> index a634da0..905a2c6 100644
> --- a/tools/arch/powerpc/include/asm/barrier.h
> +++ b/tools/arch/powerpc/include/asm/barrier.h
> @@ -27,4 +27,20 @@
>  #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
>  #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> 
> +#if defined(__powerpc64__)
> +#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
> +
> +#define smp_store_release(p, v)			\
> +do {						\
> +	smp_lwsync();				\
> +	WRITE_ONCE(*p, v);			\
> +} while (0)
> +
> +#define smp_load_acquire(p)			\
> +({						\
> +	typeof(*p) ___p1 = READ_ONCE(*p);	\
> +	smp_lwsync();				\
> +	___p1;					\

I don't like this proliferation of asm.
Why do we think that we can do a better job than the compiler?
Can we please use gcc builtins instead?
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
__atomic_load_n(ptr, __ATOMIC_ACQUIRE);
__atomic_store_n(ptr, val, __ATOMIC_RELEASE);
are done specifically for this use case if I'm not mistaken.
I think it pays to learn what the compiler provides.
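
As a rough illustration of the suggestion above, the two ring buffer
helpers could be written on top of the builtins as follows. This is only
a sketch: struct demo_page and the demo_* names are hypothetical, and
whether the builtins are an acceptable substitute for the kernel-derived
macros is exactly what the rest of the thread debates.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two fields of the mmap'ed control page. */
struct demo_page {
	uint64_t data_head;
	uint64_t data_tail;
};

/* Load-acquire of the head: later data reads are ordered after it. */
static inline uint64_t demo_read_head_builtin(struct demo_page *pg)
{
	return __atomic_load_n(&pg->data_head, __ATOMIC_ACQUIRE);
}

/* Store-release of the tail: earlier data reads are ordered before it. */
static inline void demo_write_tail_builtin(struct demo_page *pg, uint64_t tail)
{
	__atomic_store_n(&pg->data_tail, tail, __ATOMIC_RELEASE);
}

/* Mark nbytes as consumed; the caller must already have read them. */
static inline void demo_consume(struct demo_page *pg, uint64_t nbytes)
{
	uint64_t head = demo_read_head_builtin(pg);
	uint64_t tail = pg->data_tail;	/* only the consumer writes the tail */

	assert(head - tail >= nbytes);
	demo_write_tail_builtin(pg, tail + nbytes);
}
```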


* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  2018-10-18 15:33           ` Alexei Starovoitov
@ 2018-10-18 19:00             ` Daniel Borkmann
  2018-10-19  3:53               ` Alexei Starovoitov
  2018-10-19  8:04             ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Daniel Borkmann @ 2018-10-18 19:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peter Zijlstra, paulmck, will.deacon, acme, yhs, john.fastabend, netdev

On 10/18/2018 05:33 PM, Alexei Starovoitov wrote:
> On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
>>  #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
>> diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h
>> index a634da0..905a2c6 100644
>> --- a/tools/arch/powerpc/include/asm/barrier.h
>> +++ b/tools/arch/powerpc/include/asm/barrier.h
>> @@ -27,4 +27,20 @@
>>  #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
>>  #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
>>
>> +#if defined(__powerpc64__)
>> +#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
>> +
>> +#define smp_store_release(p, v)			\
>> +do {						\
>> +	smp_lwsync();				\
>> +	WRITE_ONCE(*p, v);			\
>> +} while (0)
>> +
>> +#define smp_load_acquire(p)			\
>> +({						\
>> +	typeof(*p) ___p1 = READ_ONCE(*p);	\
>> +	smp_lwsync();				\
>> +	___p1;					\
> 
> I don't like this proliferation of asm.
> Why do we think that we can do better job than compiler?
> can we please use gcc builtins instead?
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
> __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
> are done specifically for this use case if I'm not mistaken.
> I think it pays to learn what compiler provides.

But are you sure the C11 memory model exactly matches the kernel's model?
Seems like last time Will looked into it [0] it wasn't the case ...

The above was pulled in and slightly adapted from the kernel side of the
arch asm barriers. Hm, it would probably be safest if an arch decides to
adopt C11 barriers on the kernel side first; user space could then use the
exact same matching builtin functions for scenarios like these as well.

  [0] https://lore.kernel.org/lkml/20170308174300.GL20400@arm.com/


* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  2018-10-18 19:00             ` Daniel Borkmann
@ 2018-10-19  3:53               ` Alexei Starovoitov
  2018-10-19 11:02                 ` Will Deacon
  0 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2018-10-19  3:53 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Peter Zijlstra, paulmck, will.deacon, acme, yhs, john.fastabend, netdev

On Thu, Oct 18, 2018 at 09:00:46PM +0200, Daniel Borkmann wrote:
> On 10/18/2018 05:33 PM, Alexei Starovoitov wrote:
> > On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
> >>  #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
> >> diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h
> >> index a634da0..905a2c6 100644
> >> --- a/tools/arch/powerpc/include/asm/barrier.h
> >> +++ b/tools/arch/powerpc/include/asm/barrier.h
> >> @@ -27,4 +27,20 @@
> >>  #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
> >>  #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> >>
> >> +#if defined(__powerpc64__)
> >> +#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
> >> +
> >> +#define smp_store_release(p, v)			\
> >> +do {						\
> >> +	smp_lwsync();				\
> >> +	WRITE_ONCE(*p, v);			\
> >> +} while (0)
> >> +
> >> +#define smp_load_acquire(p)			\
> >> +({						\
> >> +	typeof(*p) ___p1 = READ_ONCE(*p);	\
> >> +	smp_lwsync();				\
> >> +	___p1;					\
> > 
> > I don't like this proliferation of asm.
> > Why do we think that we can do better job than compiler?
> > can we please use gcc builtins instead?
> > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> > __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
> > __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
> > are done specifically for this use case if I'm not mistaken.
> > I think it pays to learn what compiler provides.
> 
> But are you sure the C11 memory model matches exact same model as kernel?
> Seems like last time Will looked into it [0] it wasn't the case ...

I'm only suggesting equivalence of __atomic_load_n(ptr, __ATOMIC_ACQUIRE)
with the kernel's smp_load_acquire().
I've seen a bunch of user space ring buffer implementations built
on the __atomic_load_n() primitives.
But let's ask experts who live in both worlds.

Paul,
what would you recommend?
Should we copy-paste smp_store_release() from the kernel to be used
in user space libraries/tools,
or use the __atomic_load_n() builtins instead?

> The above was pulled in and slightly adapted from kernel side of arch
> asm barriers. Hm, it would probably be safest if an arch decides to adapt
> C11 barriers first from kernel side and user space could then use the
> exact same matching builtin functions for scenarios like these as well.
> 
>   [0] https://lore.kernel.org/lkml/20170308174300.GL20400@arm.com/


* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  2018-10-18 15:33           ` Alexei Starovoitov
  2018-10-18 19:00             ` Daniel Borkmann
@ 2018-10-19  8:04             ` Peter Zijlstra
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2018-10-19  8:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, paulmck, will.deacon, acme, yhs, john.fastabend, netdev

On Thu, Oct 18, 2018 at 08:33:09AM -0700, Alexei Starovoitov wrote:
> On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
> >  #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
> > diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h
> > index a634da0..905a2c6 100644
> > --- a/tools/arch/powerpc/include/asm/barrier.h
> > +++ b/tools/arch/powerpc/include/asm/barrier.h
> > @@ -27,4 +27,20 @@
> >  #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
> >  #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > 
> > +#if defined(__powerpc64__)
> > +#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
> > +
> > +#define smp_store_release(p, v)			\
> > +do {						\
> > +	smp_lwsync();				\
> > +	WRITE_ONCE(*p, v);			\
> > +} while (0)
> > +
> > +#define smp_load_acquire(p)			\
> > +({						\
> > +	typeof(*p) ___p1 = READ_ONCE(*p);	\
> > +	smp_lwsync();				\
> > +	___p1;					\
> 
> I don't like this proliferation of asm.
> Why do we think that we can do better job than compiler?
> can we please use gcc builtins instead?
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
> __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
> are done specifically for this use case if I'm not mistaken.
> I think it pays to learn what compiler provides.

My problem with using the C11 stuff for this is that we're then limited
to compilers that actually support that. The kernel has a minimum of
gcc-4.6 (and thus perf does too I think) and gcc-4.6 does not have C11.

What Daniel writes is also true; the kernel and C11 memory models don't
align; but you're right in that for this purpose the C11 load-acquire
and store-release would indeed suffice.
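
One way to reconcile the two concerns, sketched here purely as an
illustration, is to select the implementation at compile time. The
demo_* macro names are hypothetical; the feature test relies on
__ATOMIC_ACQUIRE being predefined by compilers that provide the __atomic
builtins (GCC 4.7 and later, as far as I know, and clang), and the
fallback uses the older __sync_synchronize() full barrier, which is
stronger than strictly needed.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __ATOMIC_ACQUIRE
/* Compiler provides the __atomic builtins. */
# define demo_load_acquire(p)	__atomic_load_n((p), __ATOMIC_ACQUIRE)
# define demo_store_release(p, v)	__atomic_store_n((p), (v), __ATOMIC_RELEASE)
#else
/* Conservative fallback for older compilers: volatile access + full barrier. */
# define demo_load_acquire(p)						\
({									\
	__typeof__(*(p)) ___v = *(volatile __typeof__(*(p)) *)(p);	\
	__sync_synchronize();						\
	___v;								\
})
# define demo_store_release(p, v)					\
do {									\
	__sync_synchronize();						\
	*(volatile __typeof__(*(p)) *)(p) = (v);			\
} while (0)
#endif

/* Store a value with release semantics and read it back with acquire. */
static uint64_t demo_roundtrip(uint64_t v)
{
	uint64_t slot = ~v;	/* start from a different value */

	demo_store_release(&slot, v);
	assert(slot == v);	/* single-threaded sanity check only */
	return demo_load_acquire(&slot);
}
```

This only addresses compiler availability; it does not settle whether
the compiler's mapping is compatible with the kernel's on a given arch.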


* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  2018-10-18 15:04         ` Daniel Borkmann
  2018-10-18 15:33           ` Alexei Starovoitov
@ 2018-10-19  9:44           ` Peter Zijlstra
  2018-10-19 10:37             ` Daniel Borkmann
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2018-10-19  9:44 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs,
	john.fastabend, netdev

On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
> diff --git a/tools/include/linux/ring_buffer.h b/tools/include/linux/ring_buffer.h
> new file mode 100644
> index 0000000..48200e0
> --- /dev/null
> +++ b/tools/include/linux/ring_buffer.h
> @@ -0,0 +1,69 @@
> +#ifndef _TOOLS_LINUX_RING_BUFFER_H_
> +#define _TOOLS_LINUX_RING_BUFFER_H_
> +
> +#include <linux/compiler.h>
> +#include <asm/barrier.h>
> +
> +/*
> + * Below barriers pair as follows (kernel/events/ring_buffer.c):
> + *
> + * Since the mmap() consumer (userspace) can run on a different CPU:
> + *
> + *   kernel                             user
> + *
> + *   if (LOAD ->data_tail) {            LOAD ->data_head
> + *                      (A)             smp_rmb()       (C)
> + *      STORE $data                     LOAD $data
> + *      smp_wmb()       (B)             smp_mb()        (D)
> + *      STORE ->data_head               STORE ->data_tail
> + *   }
> + *
> + * Where A pairs with D, and B pairs with C.
> + *
> + * In our case A is a control dependency that separates the load
> + * of the ->data_tail and the stores of $data. In case ->data_tail
> + * indicates there is no room in the buffer to store $data we do not.
> + *
> + * D needs to be a full barrier since it separates the data READ
> + * from the tail WRITE.
> + *
> + * For B a WMB is sufficient since it separates two WRITEs, and for
> + * C an RMB is sufficient since it separates two READs.
> + */
> +
> +/*
> + * Note, instead of B, C, D we could also use smp_store_release()
> + * in B and D as well as smp_load_acquire() in C. However, this
> + * optimization makes sense not for all architectures since it
> + * would resolve into READ_ONCE() + smp_mb() pair for smp_load_acquire()
> + * and smp_mb() + WRITE_ONCE() pair for smp_store_release(), thus
> + * for those smp_wmb() in B and smp_rmb() in C would still be less
> + * expensive. For the case of D this has either the same cost or
> + * is less expensive. For example, due to TSO (total store order),
> + * x86 can avoid the CPU barrier entirely.
> + */
> +
> +static inline u64 ring_buffer_read_head(struct perf_event_mmap_page *base)
> +{
> +/*
> + * Architectures where smp_load_acquire() does not fallback to
> + * READ_ONCE() + smp_mb() pair.
> + */
> +#if defined(__x86_64__) || defined(__aarch64__) || defined(__powerpc64__) || \
> +    defined(__ia64__) || defined(__sparc__) && defined(__arch64__)
> +	return smp_load_acquire(&base->data_head);
> +#else
> +	u64 head = READ_ONCE(base->data_head);
> +
> +	smp_rmb();
> +	return head;
> +#endif
> +}
> +
> +static inline void ring_buffer_write_tail(struct perf_event_mmap_page *base,
> +					  u64 tail)
> +{
> +	smp_store_release(&base->data_tail, tail);
> +}
> +
> +#endif /* _TOOLS_LINUX_RING_BUFFER_H_ */

(for the whole patch, but in particular the above)

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>


* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  2018-10-19  9:44           ` Peter Zijlstra
@ 2018-10-19 10:37             ` Daniel Borkmann
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel Borkmann @ 2018-10-19 10:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs,
	john.fastabend, netdev

On 10/19/2018 11:44 AM, Peter Zijlstra wrote:
> On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
>> diff --git a/tools/include/linux/ring_buffer.h b/tools/include/linux/ring_buffer.h
>> new file mode 100644
>> index 0000000..48200e0
>> --- /dev/null
>> +++ b/tools/include/linux/ring_buffer.h
>> @@ -0,0 +1,69 @@
>> +#ifndef _TOOLS_LINUX_RING_BUFFER_H_
>> +#define _TOOLS_LINUX_RING_BUFFER_H_
>> +
>> +#include <linux/compiler.h>
>> +#include <asm/barrier.h>
>> +
>> +/*
>> + * Below barriers pair as follows (kernel/events/ring_buffer.c):
>> + *
>> + * Since the mmap() consumer (userspace) can run on a different CPU:
>> + *
>> + *   kernel                             user
>> + *
>> + *   if (LOAD ->data_tail) {            LOAD ->data_head
>> + *                      (A)             smp_rmb()       (C)
>> + *      STORE $data                     LOAD $data
>> + *      smp_wmb()       (B)             smp_mb()        (D)
>> + *      STORE ->data_head               STORE ->data_tail
>> + *   }
>> + *
>> + * Where A pairs with D, and B pairs with C.
>> + *
>> + * In our case A is a control dependency that separates the load
>> + * of the ->data_tail and the stores of $data. In case ->data_tail
>> + * indicates there is no room in the buffer to store $data we do not.
>> + *
>> + * D needs to be a full barrier since it separates the data READ
>> + * from the tail WRITE.
>> + *
>> + * For B a WMB is sufficient since it separates two WRITEs, and for
>> + * C an RMB is sufficient since it separates two READs.
>> + */
>> +
>> +/*
>> + * Note, instead of B, C, D we could also use smp_store_release()
>> + * in B and D as well as smp_load_acquire() in C. However, this
>> + * optimization makes sense not for all architectures since it
>> + * would resolve into READ_ONCE() + smp_mb() pair for smp_load_acquire()
>> + * and smp_mb() + WRITE_ONCE() pair for smp_store_release(), thus
>> + * for those smp_wmb() in B and smp_rmb() in C would still be less
>> + * expensive. For the case of D this has either the same cost or
>> + * is less expensive. For example, due to TSO (total store order),
>> + * x86 can avoid the CPU barrier entirely.
>> + */
>> +
>> +static inline u64 ring_buffer_read_head(struct perf_event_mmap_page *base)
>> +{
>> +/*
>> + * Architectures where smp_load_acquire() does not fallback to
>> + * READ_ONCE() + smp_mb() pair.
>> + */
>> +#if defined(__x86_64__) || defined(__aarch64__) || defined(__powerpc64__) || \
>> +    defined(__ia64__) || defined(__sparc__) && defined(__arch64__)
>> +	return smp_load_acquire(&base->data_head);
>> +#else
>> +	u64 head = READ_ONCE(base->data_head);
>> +
>> +	smp_rmb();
>> +	return head;
>> +#endif
>> +}
>> +
>> +static inline void ring_buffer_write_tail(struct perf_event_mmap_page *base,
>> +					  u64 tail)
>> +{
>> +	smp_store_release(&base->data_tail, tail);
>> +}
>> +
>> +#endif /* _TOOLS_LINUX_RING_BUFFER_H_ */
> 
> (for the whole patch, but in particular the above)
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Great, thanks a lot, Peter! Will flush out v2 in a bit.


* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  2018-10-19  3:53               ` Alexei Starovoitov
@ 2018-10-19 11:02                 ` Will Deacon
  2018-10-19 11:56                   ` Paul E. McKenney
  0 siblings, 1 reply; 18+ messages in thread
From: Will Deacon @ 2018-10-19 11:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Peter Zijlstra, paulmck, acme, yhs,
	john.fastabend, netdev

On Thu, Oct 18, 2018 at 08:53:42PM -0700, Alexei Starovoitov wrote:
> On Thu, Oct 18, 2018 at 09:00:46PM +0200, Daniel Borkmann wrote:
> > On 10/18/2018 05:33 PM, Alexei Starovoitov wrote:
> > > On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
> > >>  #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
> > >> diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h
> > >> index a634da0..905a2c6 100644
> > >> --- a/tools/arch/powerpc/include/asm/barrier.h
> > >> +++ b/tools/arch/powerpc/include/asm/barrier.h
> > >> @@ -27,4 +27,20 @@
> > >>  #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > >>  #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > >>
> > >> +#if defined(__powerpc64__)
> > >> +#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
> > >> +
> > >> +#define smp_store_release(p, v)			\
> > >> +do {						\
> > >> +	smp_lwsync();				\
> > >> +	WRITE_ONCE(*p, v);			\
> > >> +} while (0)
> > >> +
> > >> +#define smp_load_acquire(p)			\
> > >> +({						\
> > >> +	typeof(*p) ___p1 = READ_ONCE(*p);	\
> > >> +	smp_lwsync();				\
> > >> +	___p1;					\
> > > 
> > > I don't like this proliferation of asm.
> > > Why do we think that we can do better job than compiler?
> > > can we please use gcc builtins instead?
> > > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> > > __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
> > > __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
> > > are done specifically for this use case if I'm not mistaken.
> > > I think it pays to learn what compiler provides.
> > 
> > But are you sure the C11 memory model matches exact same model as kernel?
> > Seems like last time Will looked into it [0] it wasn't the case ...
> 
> I'm only suggesting equivalence of __atomic_load_n(ptr, __ATOMIC_ACQUIRE)
> with kernel's smp_load_acquire().
> I've seen a bunch of user space ring buffer implementations implemented
> with __atomic_load_n() primitives.
> But let's ask experts who live in both worlds.

One thing to be wary of is whether a given architecture offers more than
one way to implement load-acquire and store-release.
In these situations, it's often important that concurrent software agrees
on the "mapping", so we'd need to be sure that (a) all userspace compilers
that we care about have compatible mappings and (b) these mappings are
compatible with the kernel code.

Will


* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  2018-10-19 11:02                 ` Will Deacon
@ 2018-10-19 11:56                   ` Paul E. McKenney
  0 siblings, 0 replies; 18+ messages in thread
From: Paul E. McKenney @ 2018-10-19 11:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: Alexei Starovoitov, Daniel Borkmann, Peter Zijlstra, acme, yhs,
	john.fastabend, netdev

On Fri, Oct 19, 2018 at 12:02:43PM +0100, Will Deacon wrote:
> On Thu, Oct 18, 2018 at 08:53:42PM -0700, Alexei Starovoitov wrote:
> > On Thu, Oct 18, 2018 at 09:00:46PM +0200, Daniel Borkmann wrote:
> > > On 10/18/2018 05:33 PM, Alexei Starovoitov wrote:
> > > > On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
> > > >>  #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
> > > >> diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h
> > > >> index a634da0..905a2c6 100644
> > > >> --- a/tools/arch/powerpc/include/asm/barrier.h
> > > >> +++ b/tools/arch/powerpc/include/asm/barrier.h
> > > >> @@ -27,4 +27,20 @@
> > > >>  #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > > >>  #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > > >>
> > > >> +#if defined(__powerpc64__)
> > > >> +#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
> > > >> +
> > > >> +#define smp_store_release(p, v)			\
> > > >> +do {						\
> > > >> +	smp_lwsync();				\
> > > >> +	WRITE_ONCE(*p, v);			\
> > > >> +} while (0)
> > > >> +
> > > >> +#define smp_load_acquire(p)			\
> > > >> +({						\
> > > >> +	typeof(*p) ___p1 = READ_ONCE(*p);	\
> > > >> +	smp_lwsync();				\
> > > >> +	___p1;					\
> > > > 
> > > > I don't like this proliferation of asm.
> > > > Why do we think that we can do better job than compiler?
> > > > can we please use gcc builtins instead?
> > > > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> > > > __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
> > > > __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
> > > > are done specifically for this use case if I'm not mistaken.
> > > > I think it pays to learn what compiler provides.
> > > 
> > > But are you sure the C11 memory model matches exact same model as kernel?
> > > Seems like last time Will looked into it [0] it wasn't the case ...
> > 
> > I'm only suggesting equivalence of __atomic_load_n(ptr, __ATOMIC_ACQUIRE)
> > with kernel's smp_load_acquire().
> > I've seen a bunch of user space ring buffer implementations implemented
> > with __atomic_load_n() primitives.
> > But let's ask experts who live in both worlds.
> 
> One thing to be wary of is if there is an implementation choice between
> how to implement load-acquire and store-release for a given architecture.
> In these situations, it's often important that concurrent software agrees
> on the "mapping", so we'd need to be sure that (a) All userspace compilers
> that we care about have compatible mappings and (b) These mappings are
> compatible with the kernel code.

Agreed!  Mixing and matching can be done, but it does require quite a
bit of care.

							Thanx, Paul


Thread overview: 18+ messages
2018-10-17 14:41 [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb Daniel Borkmann
2018-10-17 14:41 ` [PATCH bpf-next 1/3] tools: add smp_* barrier variants to include infrastructure Daniel Borkmann
2018-10-17 14:41 ` [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb} Daniel Borkmann
2018-10-17 15:50   ` Peter Zijlstra
2018-10-17 23:10     ` Daniel Borkmann
2018-10-18  8:14       ` Peter Zijlstra
2018-10-18 15:04         ` Daniel Borkmann
2018-10-18 15:33           ` Alexei Starovoitov
2018-10-18 19:00             ` Daniel Borkmann
2018-10-19  3:53               ` Alexei Starovoitov
2018-10-19 11:02                 ` Will Deacon
2018-10-19 11:56                   ` Paul E. McKenney
2018-10-19  8:04             ` Peter Zijlstra
2018-10-19  9:44           ` Peter Zijlstra
2018-10-19 10:37             ` Daniel Borkmann
2018-10-17 14:41 ` [PATCH bpf-next 3/3] bpf, libbpf: use proper barriers in perf ring buffer walk Daniel Borkmann
2018-10-17 15:51   ` Peter Zijlstra
2018-10-17 15:03 ` [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb Arnaldo Carvalho de Melo
