* [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb
From: Daniel Borkmann @ 2018-10-17 14:41 UTC
To: alexei.starovoitov
Cc: peterz, paulmck, will.deacon, acme, yhs, john.fastabend, netdev, Daniel Borkmann

This set first adds smp_* barrier variants to tools infrastructure
and in a second step updates perf and libbpf to make use of them.
For details, please see individual patches, thanks!

Arnaldo, if there are no objections, could this be routed via bpf-next
with Acked-by's due to later dependencies in libbpf? Alternatively,
I could also get the 2nd patch out during merge window, but perhaps
it's okay to do in one go as there shouldn't be much conflict in perf.

Thanks!

Daniel Borkmann (3):
  tools: add smp_* barrier variants to include infrastructure
  tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
  bpf, libbpf: use proper barriers in perf ring buffer walk

 tools/arch/arm64/include/asm/barrier.h | 10 ++++++++++
 tools/arch/x86/include/asm/barrier.h   |  9 ++++++---
 tools/include/asm/barrier.h            | 11 +++++++++++
 tools/lib/bpf/libbpf.c                 | 25 +++++++++++++++++++------
 tools/perf/util/mmap.h                 |  5 +++--
 5 files changed, 49 insertions(+), 11 deletions(-)

-- 
2.9.5

* [PATCH bpf-next 1/3] tools: add smp_* barrier variants to include infrastructure
From: Daniel Borkmann @ 2018-10-17 14:41 UTC
To: alexei.starovoitov
Cc: peterz, paulmck, will.deacon, acme, yhs, john.fastabend, netdev, Daniel Borkmann

Add definitions for smp_rmb(), smp_wmb(), and smp_mb() to the tools
include infrastructure. This patch adds the implementation for x86-64
and arm64, and falls back to the existing rmb()/wmb()/mb() for other
archs which do not have it implemented at this point, so that they can
be added successively by those who have access to test machines. The
x86-64 one uses a lock + add combination for smp_mb() with an address
below the red zone.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/arch/arm64/include/asm/barrier.h | 10 ++++++++++
 tools/arch/x86/include/asm/barrier.h   |  9 ++++++---
 tools/include/asm/barrier.h            | 11 +++++++++++
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/tools/arch/arm64/include/asm/barrier.h b/tools/arch/arm64/include/asm/barrier.h
index 40bde6b..acf1f06 100644
--- a/tools/arch/arm64/include/asm/barrier.h
+++ b/tools/arch/arm64/include/asm/barrier.h
@@ -14,4 +14,14 @@
 #define wmb()		asm volatile("dmb ishst" ::: "memory")
 #define rmb()		asm volatile("dmb ishld" ::: "memory")
 
+/*
+ * Kernel uses dmb variants on arm64 for smp_*() barriers. Pretty much the same
+ * implementation as above mb()/wmb()/rmb(), though for the latter kernel uses
+ * dsb. In any case, should above mb()/wmb()/rmb() change, make sure the below
+ * smp_*() don't.
+ */
+#define smp_mb()	asm volatile("dmb ish" ::: "memory")
+#define smp_wmb()	asm volatile("dmb ishst" ::: "memory")
+#define smp_rmb()	asm volatile("dmb ishld" ::: "memory")
+
 #endif /* _TOOLS_LINUX_ASM_AARCH64_BARRIER_H */
diff --git a/tools/arch/x86/include/asm/barrier.h b/tools/arch/x86/include/asm/barrier.h
index 8774dee..c97c0c5 100644
--- a/tools/arch/x86/include/asm/barrier.h
+++ b/tools/arch/x86/include/asm/barrier.h
@@ -21,9 +21,12 @@
 #define rmb()	asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
 #define wmb()	asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
 #elif defined(__x86_64__)
-#define mb()	asm volatile("mfence":::"memory")
-#define rmb()	asm volatile("lfence":::"memory")
-#define wmb()	asm volatile("sfence" ::: "memory")
+#define mb()	asm volatile("mfence" ::: "memory")
+#define rmb()	asm volatile("lfence" ::: "memory")
+#define wmb()	asm volatile("sfence" ::: "memory")
+#define smp_rmb() barrier()
+#define smp_wmb() barrier()
+#define smp_mb()  asm volatile("lock; addl $0,-132(%%rsp)" ::: "memory", "cc")
 #endif
 
 #endif /* _TOOLS_LINUX_ASM_X86_BARRIER_H */
diff --git a/tools/include/asm/barrier.h b/tools/include/asm/barrier.h
index 391d942..e4c8845 100644
--- a/tools/include/asm/barrier.h
+++ b/tools/include/asm/barrier.h
@@ -1,4 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/compiler.h>
 #if defined(__i386__) || defined(__x86_64__)
 #include "../../arch/x86/include/asm/barrier.h"
 #elif defined(__arm__)
@@ -26,3 +27,13 @@
 #else
 #include <asm-generic/barrier.h>
 #endif
+/* Fallback definitions for archs that haven't been updated yet. */
+#ifndef smp_rmb
+# define smp_rmb()	rmb()
+#endif
+#ifndef smp_wmb
+# define smp_wmb()	wmb()
+#endif
+#ifndef smp_mb
+# define smp_mb()	mb()
+#endif
-- 
2.9.5

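The override-plus-fallback layering in patch 1/3 can be sketched outside the tools tree roughly as follows. This is only an illustration under stated assumptions: `__sync_synchronize()` stands in for the heavyweight `mb()`, and `publish()` is a hypothetical helper that is not part of the patch.

```c
#include <assert.h>

/* Compiler-only barrier: keeps the compiler from reordering across it. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Assumption: the GCC/Clang full-fence builtin stands in for mb(). */
#define mb()  __sync_synchronize()
#define rmb() mb()
#define wmb() mb()

/* An "arch header" may provide cheaper SMP variants ... */
#if defined(__x86_64__)
# define smp_rmb() barrier()
# define smp_wmb() barrier()
#endif

/* ... and whatever it leaves undefined falls back to the full barriers. */
#ifndef smp_rmb
# define smp_rmb() rmb()
#endif
#ifndef smp_wmb
# define smp_wmb() wmb()
#endif
#ifndef smp_mb
# define smp_mb() mb()
#endif

/* Hypothetical helper: store the payload, then raise the flag, in order. */
static inline void publish(int *data, int *flag, int value)
{
	assert(data && flag);
	*data = value;
	smp_wmb();	/* order the data store before the flag store */
	*flag = 1;
}
```

On x86-64 the two `smp_*mb()` read/write variants reduce to a compiler barrier here, while every other target silently picks up the full fence until someone adds a cheaper definition.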
* [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
From: Daniel Borkmann @ 2018-10-17 14:41 UTC
To: alexei.starovoitov
Cc: peterz, paulmck, will.deacon, acme, yhs, john.fastabend, netdev, Daniel Borkmann

Switch both rmb()/mb() barriers to more lightweight smp_rmb()/smp_mb()
ones. When walking the perf ring buffer they pair the following way,
quoting kernel/events/ring_buffer.c:

  Since the mmap() consumer (userspace) can run on a different CPU:

  kernel                             user

  if (LOAD ->data_tail) {            LOAD ->data_head
                     (A)             smp_rmb()       (C)
     STORE $data                     LOAD $data
     smp_wmb()       (B)             smp_mb()        (D)
     STORE ->data_head               STORE ->data_tail
  }

  Where A pairs with D, and B pairs with C.

  In our case (A) is a control dependency that separates the load
  of the ->data_tail and the stores of $data. In case ->data_tail
  indicates there is no room in the buffer to store $data we do not.

  D needs to be a full barrier since it separates the data READ
  from the tail WRITE.

  For B a WMB is sufficient since it separates two WRITEs, and for
  C an RMB is sufficient since it separates two READs.

Currently, on x86-64, perf uses LFENCE and MFENCE (rmb() and mb(),
respectively), which is overkill: lighter-weight barriers suffice,
which matters in particular given this is fast-path. According to
Peter, rmb()/mb() were added back then via a94d342b9cb0 ("tools/perf:
Add required memory barriers") at a time when the kernel still
supported chips that needed them, but nowadays support for these has
been ditched completely, therefore we can fix them up as well.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/util/mmap.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 05a6d47..de6dc2e 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -73,7 +73,8 @@ static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
 {
 	struct perf_event_mmap_page *pc = mm->base;
 	u64 head = READ_ONCE(pc->data_head);
-	rmb();
+
+	smp_rmb();
 	return head;
 }
 
@@ -84,7 +85,7 @@ static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
 	/*
 	 * ensure all reads are done before we write the tail out.
 	 */
-	mb();
+	smp_mb();
 	pc->data_tail = tail;
 }
 
-- 
2.9.5

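To make the user-side column of the pairing diagram concrete, here is a minimal single-file sketch of a consumer walk. Everything named `toy_*` is hypothetical and much simpler than the real perf mmap layout, and the two `__atomic_thread_fence()` calls are assumed stand-ins for the tools' `smp_rmb()` and `smp_mb()`, not the definitions from this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC/Clang fences standing in for smp_rmb()/smp_mb(). */
#define smp_rmb() __atomic_thread_fence(__ATOMIC_ACQUIRE)
#define smp_mb()  __atomic_thread_fence(__ATOMIC_SEQ_CST)

/* Hypothetical, simplified stand-in for the control page plus data area. */
struct toy_rb {
	volatile uint64_t data_head;	/* advanced by the producer */
	volatile uint64_t data_tail;	/* advanced by the consumer */
	uint8_t data[64];		/* size must be a power of two */
};

/* Copy out the bytes between tail and head; returns the number copied. */
static size_t toy_rb_consume(struct toy_rb *rb, uint8_t *out, size_t out_len)
{
	uint64_t mask = sizeof(rb->data) - 1;
	uint64_t head = rb->data_head;		/* LOAD ->data_head */
	uint64_t tail = rb->data_tail;
	size_t n = 0;

	assert((sizeof(rb->data) & mask) == 0);
	smp_rmb();				/* (C): head load before data loads */

	while (tail != head && n < out_len)
		out[n++] = rb->data[tail++ & mask];	/* LOAD $data */

	smp_mb();				/* (D): data loads before tail store */
	rb->data_tail = tail;			/* STORE ->data_tail */
	return n;
}
```

The fence at (D) is what lets the producer safely reuse the consumed bytes once it observes the new tail.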
* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
From: Peter Zijlstra @ 2018-10-17 15:50 UTC
To: Daniel Borkmann
Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs, john.fastabend, netdev

On Wed, Oct 17, 2018 at 04:41:55PM +0200, Daniel Borkmann wrote:
> @@ -73,7 +73,8 @@ static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
>  {
>  	struct perf_event_mmap_page *pc = mm->base;
>  	u64 head = READ_ONCE(pc->data_head);
> -	rmb();
> +
> +	smp_rmb();
>  	return head;
>  }
>  
> @@ -84,7 +85,7 @@ static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
>  	/*
>  	 * ensure all reads are done before we write the tail out.
>  	 */
> -	mb();
> +	smp_mb();
>  	pc->data_tail = tail;

Ideally that would be a WRITE_ONCE() to avoid store tearing.

Alternatively, I think we can use smp_store_release() here, all we care
about is that the prior loads stay prior.

Similarly, I suppose, we could use smp_load_acquire() for the data_head
load above.

>  }
>  
> -- 
> 2.9.5
> 

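As a rough illustration of the WRITE_ONCE() point, the sketch below uses simplified volatile-cast versions of the macros. They are assumptions for this example only, not the tools' actual definitions, and `toy_page`/`toy_write_tail()` are hypothetical names. The volatile access asks the compiler to emit the store once, as written, which in practice keeps a naturally aligned 64-bit store from being split or merged.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified sketches; the real macros in the tree are more elaborate. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical stand-in for the consumer-owned part of the control page. */
struct toy_page {
	uint64_t data_tail;
};

static inline void toy_write_tail(struct toy_page *pc, uint64_t tail)
{
	assert(pc);
	/* Assumed stand-in for smp_mb(): prior data loads stay prior. */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
	/* One store, as written, rather than a plain assignment. */
	WRITE_ONCE(pc->data_tail, tail);
}
```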
* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
From: Daniel Borkmann @ 2018-10-17 23:10 UTC
To: Peter Zijlstra
Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs, john.fastabend, netdev

On 10/17/2018 05:50 PM, Peter Zijlstra wrote:
> On Wed, Oct 17, 2018 at 04:41:55PM +0200, Daniel Borkmann wrote:
>> @@ -73,7 +73,8 @@ static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
>>  {
>>  	struct perf_event_mmap_page *pc = mm->base;
>>  	u64 head = READ_ONCE(pc->data_head);
>> -	rmb();
>> +
>> +	smp_rmb();
>>  	return head;
>>  }
>>  
>> @@ -84,7 +85,7 @@ static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
>>  	/*
>>  	 * ensure all reads are done before we write the tail out.
>>  	 */
>> -	mb();
>> +	smp_mb();
>>  	pc->data_tail = tail;
> 
> Ideally that would be a WRITE_ONCE() to avoid store tearing.

Right, agree.

> Alternatively, I think we can use smp_store_release() here, all we care
> about is that the prior loads stay prior.
> 
> Similarly, I suppose, we could use smp_load_acquire() for the data_head
> load above.

Wouldn't this then also allow the kernel side to use smp_store_release()
when it updates the head? We'd be pretty much at the model as described
in Documentation/core-api/circular-buffers.rst.

Meaning, rough pseudo-code diff would look as:

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5d3cf40..3d96275 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -84,8 +84,9 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 	 *
 	 * See perf_output_begin().
 	 */
-	smp_wmb(); /* B, matches C */
-	rb->user_page->data_head = head;
+
+	/* B, matches C */
+	smp_store_release(&rb->user_page->data_head, head);
 
 	/*
 	 * Now check if we missed an update -- rely on previous implied

Plus, user space side of perf (assuming we have the barriers imported):

diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 05a6d47..66e1304 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -72,20 +72,15 @@ void perf_mmap__consume(struct perf_mmap *map);
 static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
 {
 	struct perf_event_mmap_page *pc = mm->base;
-	u64 head = READ_ONCE(pc->data_head);
-	rmb();
-	return head;
+
+	return smp_load_acquire(&pc->data_head);
 }
 
 static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
 {
 	struct perf_event_mmap_page *pc = md->base;
 
-	/*
-	 * ensure all reads are done before we write the tail out.
-	 */
-	mb();
-	pc->data_tail = tail;
+	smp_store_release(&pc->data_tail, tail);
 }
 
 union perf_event *perf_mmap__read_forward(struct perf_mmap *map);

* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
From: Peter Zijlstra @ 2018-10-18 8:14 UTC
To: Daniel Borkmann
Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs, john.fastabend, netdev

On Thu, Oct 18, 2018 at 01:10:15AM +0200, Daniel Borkmann wrote:

> Wouldn't this then also allow the kernel side to use smp_store_release()
> when it updates the head? We'd be pretty much at the model as described
> in Documentation/core-api/circular-buffers.rst.
> 
> Meaning, rough pseudo-code diff would look as:
> 
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 5d3cf40..3d96275 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -84,8 +84,9 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
>  	 *
>  	 * See perf_output_begin().
>  	 */
> -	smp_wmb(); /* B, matches C */
> -	rb->user_page->data_head = head;
> +
> +	/* B, matches C */
> +	smp_store_release(&rb->user_page->data_head, head);

Yes, this would be correct.

The reason we didn't do this is because smp_store_release() ends up
being smp_mb() + WRITE_ONCE() for a fair number of platforms, even if
they have a cheaper smp_wmb(). Most notably ARM.

(ARM64 OTOH would like to have smp_store_release() there I imagine;
while x86 doesn't care either way around).

A similar concern exists for the smp_load_acquire() I proposed for the
userspace side, ARM would have to resort to smp_mb() in that situation,
instead of the cheaper smp_rmb().

The smp_store_release() on the userspace side will actually be of equal
cost or cheaper, since it already has an smp_mb(). Most notably, x86 can
avoid barrier entirely, because TSO doesn't allow the LOAD-STORE reorder
(it only allows the STORE-LOAD reorder). And PowerPC can use LWSYNC
instead of SYNC.

* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
From: Daniel Borkmann @ 2018-10-18 15:04 UTC
To: Peter Zijlstra
Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs, john.fastabend, netdev

On 10/18/2018 10:14 AM, Peter Zijlstra wrote:
> On Thu, Oct 18, 2018 at 01:10:15AM +0200, Daniel Borkmann wrote:
> 
>> Wouldn't this then also allow the kernel side to use smp_store_release()
>> when it updates the head? We'd be pretty much at the model as described
>> in Documentation/core-api/circular-buffers.rst.
>>
>> Meaning, rough pseudo-code diff would look as:
>>
>> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
>> index 5d3cf40..3d96275 100644
>> --- a/kernel/events/ring_buffer.c
>> +++ b/kernel/events/ring_buffer.c
>> @@ -84,8 +84,9 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
>>  	 *
>>  	 * See perf_output_begin().
>>  	 */
>> -	smp_wmb(); /* B, matches C */
>> -	rb->user_page->data_head = head;
>> +
>> +	/* B, matches C */
>> +	smp_store_release(&rb->user_page->data_head, head);
> 
> Yes, this would be correct.
> 
> The reason we didn't do this is because smp_store_release() ends up
> being smp_mb() + WRITE_ONCE() for a fair number of platforms, even if
> they have a cheaper smp_wmb(). Most notably ARM.

Yep agree, that would be worse..

> (ARM64 OTOH would like to have smp_store_release() there I imagine;
> while x86 doesn't care either way around).
> 
> A similar concern exists for the smp_load_acquire() I proposed for the
> userspace side, ARM would have to resort to smp_mb() in that situation,
> instead of the cheaper smp_rmb().
> 
> The smp_store_release() on the userspace side will actually be of equal
> cost or cheaper, since it already has an smp_mb(). Most notably, x86 can
> avoid barrier entirely, because TSO doesn't allow the LOAD-STORE reorder
> (it only allows the STORE-LOAD reorder). And PowerPC can use LWSYNC
> instead of SYNC.

Ok, thanks a lot for your feedback, Peter! I've changed the user space
side now to the following diff (also moving to a small helper so it can
be reused by libbpf in the subsequent fix I had in the series):

 tools/arch/arm64/include/asm/barrier.h    | 70 +++++++++++++++++++++++++++++++
 tools/arch/ia64/include/asm/barrier.h     | 13 ++++++
 tools/arch/powerpc/include/asm/barrier.h  | 16 +++++++
 tools/arch/s390/include/asm/barrier.h     | 13 ++++++
 tools/arch/sparc/include/asm/barrier_64.h | 13 ++++++
 tools/arch/x86/include/asm/barrier.h      | 14 +++++++
 tools/include/asm/barrier.h               | 35 ++++++++++++++++
 tools/include/linux/ring_buffer.h         | 69 ++++++++++++++++++++++++++++++
 tools/perf/util/mmap.h                    | 15 ++-----
 9 files changed, 246 insertions(+), 12 deletions(-)
 create mode 100644 tools/include/linux/ring_buffer.h

diff --git a/tools/arch/arm64/include/asm/barrier.h b/tools/arch/arm64/include/asm/barrier.h
index 40bde6b..12835ea 100644
--- a/tools/arch/arm64/include/asm/barrier.h
+++ b/tools/arch/arm64/include/asm/barrier.h
@@ -14,4 +14,74 @@
 #define wmb()		asm volatile("dmb ishst" ::: "memory")
 #define rmb()		asm volatile("dmb ishld" ::: "memory")
 
+#define smp_store_release(p, v) \
+do { \
+	union { typeof(*p) __val; char __c[1]; } __u = \
+		{ .__val = (__force typeof(*p)) (v) }; \
+ \
+	switch (sizeof(*p)) { \
+	case 1: \
+		asm volatile ("stlrb %w1, %0" \
+				: "=Q" (*p) \
+				: "r" (*(__u8 *)__u.__c) \
+				: "memory"); \
+		break; \
+	case 2: \
+		asm volatile ("stlrh %w1, %0" \
+				: "=Q" (*p) \
+				: "r" (*(__u16 *)__u.__c) \
+				: "memory"); \
+		break; \
+	case 4: \
+		asm volatile ("stlr %w1, %0" \
+				: "=Q" (*p) \
+				: "r" (*(__u32 *)__u.__c) \
+				: "memory"); \
+		break; \
+	case 8: \
+		asm volatile ("stlr %1, %0" \
+				: "=Q" (*p) \
+				: "r" (*(__u64 *)__u.__c) \
+				: "memory"); \
+		break; \
+	default: \
+		/* Only to shut up gcc ... */ \
+		mb(); \
+		break; \
+	} \
+} while (0)
+
+#define smp_load_acquire(p) \
+({ \
+	union { typeof(*p) __val; char __c[1]; } __u; \
+ \
+	switch (sizeof(*p)) { \
+	case 1: \
+		asm volatile ("ldarb %w0, %1" \
+			: "=r" (*(__u8 *)__u.__c) \
+			: "Q" (*p) : "memory"); \
+		break; \
+	case 2: \
+		asm volatile ("ldarh %w0, %1" \
+			: "=r" (*(__u16 *)__u.__c) \
+			: "Q" (*p) : "memory"); \
+		break; \
+	case 4: \
+		asm volatile ("ldar %w0, %1" \
+			: "=r" (*(__u32 *)__u.__c) \
+			: "Q" (*p) : "memory"); \
+		break; \
+	case 8: \
+		asm volatile ("ldar %0, %1" \
+			: "=r" (*(__u64 *)__u.__c) \
+			: "Q" (*p) : "memory"); \
+		break; \
+	default: \
+		/* Only to shut up gcc ... */ \
+		mb(); \
+		break; \
+	} \
+	__u.__val; \
+})
+
 #endif /* _TOOLS_LINUX_ASM_AARCH64_BARRIER_H */
diff --git a/tools/arch/ia64/include/asm/barrier.h b/tools/arch/ia64/include/asm/barrier.h
index d808ee0..4d471d9 100644
--- a/tools/arch/ia64/include/asm/barrier.h
+++ b/tools/arch/ia64/include/asm/barrier.h
@@ -46,4 +46,17 @@
 #define rmb()		mb()
 #define wmb()		mb()
 
+#define smp_store_release(p, v) \
+do { \
+	barrier(); \
+	WRITE_ONCE(*p, v); \
+} while (0)
+
+#define smp_load_acquire(p) \
+({ \
+	typeof(*p) ___p1 = READ_ONCE(*p); \
+	barrier(); \
+	___p1; \
+})
+
 #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h
index a634da0..905a2c6 100644
--- a/tools/arch/powerpc/include/asm/barrier.h
+++ b/tools/arch/powerpc/include/asm/barrier.h
@@ -27,4 +27,20 @@
 #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
 #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
 
+#if defined(__powerpc64__)
+#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
+
+#define smp_store_release(p, v) \
+do { \
+	smp_lwsync(); \
+	WRITE_ONCE(*p, v); \
+} while (0)
+
+#define smp_load_acquire(p) \
+({ \
+	typeof(*p) ___p1 = READ_ONCE(*p); \
+	smp_lwsync(); \
+	___p1; \
+})
+#endif /* defined(__powerpc64__) */
 #endif /* _TOOLS_LINUX_ASM_POWERPC_BARRIER_H */
diff --git a/tools/arch/s390/include/asm/barrier.h b/tools/arch/s390/include/asm/barrier.h
index 5030c99..de362fa6 100644
--- a/tools/arch/s390/include/asm/barrier.h
+++ b/tools/arch/s390/include/asm/barrier.h
@@ -28,4 +28,17 @@
 #define rmb()				mb()
 #define wmb()				mb()
 
+#define smp_store_release(p, v) \
+do { \
+	barrier(); \
+	WRITE_ONCE(*p, v); \
+} while (0)
+
+#define smp_load_acquire(p) \
+({ \
+	typeof(*p) ___p1 = READ_ONCE(*p); \
+	barrier(); \
+	___p1; \
+})
+
 #endif /* __TOOLS_LIB_ASM_BARRIER_H */
diff --git a/tools/arch/sparc/include/asm/barrier_64.h b/tools/arch/sparc/include/asm/barrier_64.h
index ba61344..cfb0fdc 100644
--- a/tools/arch/sparc/include/asm/barrier_64.h
+++ b/tools/arch/sparc/include/asm/barrier_64.h
@@ -40,4 +40,17 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
 #define rmb()	__asm__ __volatile__("":::"memory")
 #define wmb()	__asm__ __volatile__("":::"memory")
 
+#define smp_store_release(p, v) \
+do { \
+	barrier(); \
+	WRITE_ONCE(*p, v); \
+} while (0)
+
+#define smp_load_acquire(p) \
+({ \
+	typeof(*p) ___p1 = READ_ONCE(*p); \
+	barrier(); \
+	___p1; \
+})
+
 #endif /* !(__TOOLS_LINUX_SPARC64_BARRIER_H) */
diff --git a/tools/arch/x86/include/asm/barrier.h b/tools/arch/x86/include/asm/barrier.h
index 8774dee..5891986 100644
--- a/tools/arch/x86/include/asm/barrier.h
+++ b/tools/arch/x86/include/asm/barrier.h
@@ -26,4 +26,18 @@
 #define wmb()	asm volatile("sfence" ::: "memory")
 #endif
 
+#if defined(__x86_64__)
+#define smp_store_release(p, v) \
+do { \
+	barrier(); \
+	WRITE_ONCE(*p, v); \
+} while (0)
+
+#define smp_load_acquire(p) \
+({ \
+	typeof(*p) ___p1 = READ_ONCE(*p); \
+	barrier(); \
+	___p1; \
+})
+#endif /* defined(__x86_64__) */
 #endif /* _TOOLS_LINUX_ASM_X86_BARRIER_H */
diff --git a/tools/include/asm/barrier.h b/tools/include/asm/barrier.h
index 391d942..8d378c5 100644
--- a/tools/include/asm/barrier.h
+++ b/tools/include/asm/barrier.h
@@ -1,4 +1,5 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/compiler.h>
 #if defined(__i386__) || defined(__x86_64__)
 #include "../../arch/x86/include/asm/barrier.h"
 #elif defined(__arm__)
@@ -26,3 +27,37 @@
 #else
 #include <asm-generic/barrier.h>
 #endif
+
+/*
+ * Generic fallback smp_*() definitions for archs that haven't
+ * been updated yet.
+ */
+
+#ifndef smp_rmb
+# define smp_rmb()	rmb()
+#endif
+
+#ifndef smp_wmb
+# define smp_wmb()	wmb()
+#endif
+
+#ifndef smp_mb
+# define smp_mb()	mb()
+#endif
+
+#ifndef smp_store_release
+# define smp_store_release(p, v) \
+do { \
+	smp_mb(); \
+	WRITE_ONCE(*p, v); \
+} while (0)
+#endif
+
+#ifndef smp_load_acquire
+# define smp_load_acquire(p) \
+({ \
+	typeof(*p) ___p1 = READ_ONCE(*p); \
+	smp_mb(); \
+	___p1; \
+})
+#endif
diff --git a/tools/include/linux/ring_buffer.h b/tools/include/linux/ring_buffer.h
new file mode 100644
index 0000000..48200e0
--- /dev/null
+++ b/tools/include/linux/ring_buffer.h
@@ -0,0 +1,69 @@
+#ifndef _TOOLS_LINUX_RING_BUFFER_H_
+#define _TOOLS_LINUX_RING_BUFFER_H_
+
+#include <linux/compiler.h>
+#include <asm/barrier.h>
+
+/*
+ * Below barriers pair as follows (kernel/events/ring_buffer.c):
+ *
+ * Since the mmap() consumer (userspace) can run on a different CPU:
+ *
+ *   kernel                             user
+ *
+ *   if (LOAD ->data_tail) {            LOAD ->data_head
+ *                      (A)             smp_rmb()       (C)
+ *      STORE $data                     LOAD $data
+ *      smp_wmb()       (B)             smp_mb()        (D)
+ *      STORE ->data_head               STORE ->data_tail
+ *   }
+ *
+ * Where A pairs with D, and B pairs with C.
+ *
+ * In our case A is a control dependency that separates the load
+ * of the ->data_tail and the stores of $data. In case ->data_tail
+ * indicates there is no room in the buffer to store $data we do not.
+ *
+ * D needs to be a full barrier since it separates the data READ
+ * from the tail WRITE.
+ *
+ * For B a WMB is sufficient since it separates two WRITEs, and for
+ * C an RMB is sufficient since it separates two READs.
+ */
+
+/*
+ * Note, instead of B, C, D we could also use smp_store_release()
+ * in B and D as well as smp_load_acquire() in C. However, this
+ * optimization makes sense not for all architectures since it
+ * would resolve into READ_ONCE() + smp_mb() pair for smp_load_acquire()
+ * and smp_mb() + WRITE_ONCE() pair for smp_store_release(), thus
+ * for those smp_wmb() in B and smp_rmb() in C would still be less
+ * expensive. For the case of D this has either the same cost or
+ * is less expensive. For example, due to TSO (total store order),
+ * x86 can avoid the CPU barrier entirely.
+ */
+
+static inline u64 ring_buffer_read_head(struct perf_event_mmap_page *base)
+{
+/*
+ * Architectures where smp_load_acquire() does not fallback to
+ * READ_ONCE() + smp_mb() pair.
+ */
+#if defined(__x86_64__) || defined(__aarch64__) || defined(__powerpc64__) || \
+    defined(__ia64__) || defined(__sparc__) && defined(__arch64__)
+	return smp_load_acquire(&base->data_head);
+#else
+	u64 head = READ_ONCE(base->data_head);
+
+	smp_rmb();
+	return head;
+#endif
+}
+
+static inline void ring_buffer_write_tail(struct perf_event_mmap_page *base,
+					  u64 tail)
+{
+	smp_store_release(&base->data_tail, tail);
+}
+
+#endif /* _TOOLS_LINUX_RING_BUFFER_H_ */
diff --git a/tools/perf/util/mmap.h b/tools/perf/util/mmap.h
index 05a6d47..8f6531f 100644
--- a/tools/perf/util/mmap.h
+++ b/tools/perf/util/mmap.h
@@ -4,7 +4,7 @@
 #include <linux/compiler.h>
 #include <linux/refcount.h>
 #include <linux/types.h>
-#include <asm/barrier.h>
+#include <linux/ring_buffer.h>
 #include <stdbool.h>
 #include "auxtrace.h"
 #include "event.h"
@@ -71,21 +71,12 @@ void perf_mmap__consume(struct perf_mmap *map);
 
 static inline u64 perf_mmap__read_head(struct perf_mmap *mm)
 {
-	struct perf_event_mmap_page *pc = mm->base;
-	u64 head = READ_ONCE(pc->data_head);
-	rmb();
-	return head;
+	return ring_buffer_read_head(mm->base);
 }
 
 static inline void perf_mmap__write_tail(struct perf_mmap *md, u64 tail)
 {
-	struct perf_event_mmap_page *pc = md->base;
-
-	/*
-	 * ensure all reads are done before we write the tail out.
-	 */
-	mb();
-	pc->data_tail = tail;
+	ring_buffer_write_tail(md->base, tail);
 }
 
 union perf_event *perf_mmap__read_forward(struct perf_mmap *map);
-- 
2.9.5

* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
From: Alexei Starovoitov @ 2018-10-18 15:33 UTC
To: Daniel Borkmann
Cc: Peter Zijlstra, paulmck, will.deacon, acme, yhs, john.fastabend, netdev

On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
>  #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
> diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h
> index a634da0..905a2c6 100644
> --- a/tools/arch/powerpc/include/asm/barrier.h
> +++ b/tools/arch/powerpc/include/asm/barrier.h
> @@ -27,4 +27,20 @@
>  #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
>  #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
>  
> +#if defined(__powerpc64__)
> +#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
> +
> +#define smp_store_release(p, v) \
> +do { \
> +	smp_lwsync(); \
> +	WRITE_ONCE(*p, v); \
> +} while (0)
> +
> +#define smp_load_acquire(p) \
> +({ \
> +	typeof(*p) ___p1 = READ_ONCE(*p); \
> +	smp_lwsync(); \
> +	___p1; \

I don't like this proliferation of asm.
Why do we think that we can do better job than compiler?
can we please use gcc builtins instead?
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
__atomic_load_n(ptr, __ATOMIC_ACQUIRE);
__atomic_store_n(ptr, val, __ATOMIC_RELEASE);
are done specifically for this use case if I'm not mistaken.
I think it pays to learn what compiler provides.

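For reference, the builtin-based variant being suggested might look roughly like the sketch below. The `toy_*` names are hypothetical; only `__atomic_load_n()` and `__atomic_store_n()` with their memory-order arguments are real GCC/Clang builtins. Whether acquire/release under the C11 model is an exact match for the kernel's smp_load_acquire()/smp_store_release() is not settled by this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the head/tail words of the control page. */
struct toy_mmap_page {
	uint64_t data_head;	/* advanced by the producer */
	uint64_t data_tail;	/* advanced by the consumer */
};

/* Acquire load: later data reads cannot be hoisted above it. */
static inline uint64_t toy_read_head(struct toy_mmap_page *pc)
{
	assert(pc);
	return __atomic_load_n(&pc->data_head, __ATOMIC_ACQUIRE);
}

/* Release store: earlier data reads cannot sink below it. */
static inline void toy_write_tail(struct toy_mmap_page *pc, uint64_t tail)
{
	assert(pc);
	__atomic_store_n(&pc->data_tail, tail, __ATOMIC_RELEASE);
}
```

With this approach the compiler picks the instruction sequence per target, so no per-arch inline asm is needed in the tools headers.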
* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
From: Daniel Borkmann @ 2018-10-18 19:00 UTC
To: Alexei Starovoitov
Cc: Peter Zijlstra, paulmck, will.deacon, acme, yhs, john.fastabend, netdev

On 10/18/2018 05:33 PM, Alexei Starovoitov wrote:
> On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
>>  #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
>> diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h
>> index a634da0..905a2c6 100644
>> --- a/tools/arch/powerpc/include/asm/barrier.h
>> +++ b/tools/arch/powerpc/include/asm/barrier.h
>> @@ -27,4 +27,20 @@
>>  #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
>>  #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
>>  
>> +#if defined(__powerpc64__)
>> +#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
>> +
>> +#define smp_store_release(p, v) \
>> +do { \
>> +	smp_lwsync(); \
>> +	WRITE_ONCE(*p, v); \
>> +} while (0)
>> +
>> +#define smp_load_acquire(p) \
>> +({ \
>> +	typeof(*p) ___p1 = READ_ONCE(*p); \
>> +	smp_lwsync(); \
>> +	___p1; \
> 
> I don't like this proliferation of asm.
> Why do we think that we can do better job than compiler?
> can we please use gcc builtins instead?
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
> __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
> are done specifically for this use case if I'm not mistaken.
> I think it pays to learn what compiler provides.

But are you sure the C11 memory model matches exact same model as kernel?
Seems like last time Will looked into it [0] it wasn't the case ...

The above was pulled in and slightly adapted from kernel side of arch
asm barriers. Hm, it would probably be safest if an arch decides to adapt
C11 barriers first from kernel side and user space could then use the
exact same matching builtin functions for scenarios like these as well.

  [0] https://lore.kernel.org/lkml/20170308174300.GL20400@arm.com/

* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
From: Alexei Starovoitov @ 2018-10-19 3:53 UTC
To: Daniel Borkmann
Cc: Peter Zijlstra, paulmck, will.deacon, acme, yhs, john.fastabend, netdev

On Thu, Oct 18, 2018 at 09:00:46PM +0200, Daniel Borkmann wrote:
> On 10/18/2018 05:33 PM, Alexei Starovoitov wrote:
> > On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
> >>  #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
> >> diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h
> >> index a634da0..905a2c6 100644
> >> --- a/tools/arch/powerpc/include/asm/barrier.h
> >> +++ b/tools/arch/powerpc/include/asm/barrier.h
> >> @@ -27,4 +27,20 @@
> >>  #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
> >>  #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> >>  
> >> +#if defined(__powerpc64__)
> >> +#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
> >> +
> >> +#define smp_store_release(p, v) \
> >> +do { \
> >> +	smp_lwsync(); \
> >> +	WRITE_ONCE(*p, v); \
> >> +} while (0)
> >> +
> >> +#define smp_load_acquire(p) \
> >> +({ \
> >> +	typeof(*p) ___p1 = READ_ONCE(*p); \
> >> +	smp_lwsync(); \
> >> +	___p1; \
> > 
> > I don't like this proliferation of asm.
> > Why do we think that we can do better job than compiler?
> > can we please use gcc builtins instead?
> > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> > __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
> > __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
> > are done specifically for this use case if I'm not mistaken.
> > I think it pays to learn what compiler provides.
> 
> But are you sure the C11 memory model matches exact same model as kernel?
> Seems like last time Will looked into it [0] it wasn't the case ...

I'm only suggesting equivalence of __atomic_load_n(ptr, __ATOMIC_ACQUIRE)
with kernel's smp_load_acquire().
I've seen a bunch of user space ring buffer implementations implemented
with __atomic_load_n() primitives.
But let's ask experts who live in both worlds.
Paul, what would you recommend?
Should we copy paste smp_store_release() from kernel to be used
in user space library/tools or use __atomic_load_n() builtins instead?

> The above was pulled in and slightly adapted from kernel side of arch
> asm barriers. Hm, it would probably be safest if an arch decides to adapt
> C11 barriers first from kernel side and user space could then use the
> exact same matching builtin functions for scenarios like these as well.
> 
>   [0] https://lore.kernel.org/lkml/20170308174300.GL20400@arm.com/

* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb} 2018-10-19 3:53 ` Alexei Starovoitov @ 2018-10-19 11:02 ` Will Deacon 2018-10-19 11:56 ` Paul E. McKenney 0 siblings, 1 reply; 18+ messages in thread From: Will Deacon @ 2018-10-19 11:02 UTC (permalink / raw) To: Alexei Starovoitov Cc: Daniel Borkmann, Peter Zijlstra, paulmck, acme, yhs, john.fastabend, netdev On Thu, Oct 18, 2018 at 08:53:42PM -0700, Alexei Starovoitov wrote: > On Thu, Oct 18, 2018 at 09:00:46PM +0200, Daniel Borkmann wrote: > > On 10/18/2018 05:33 PM, Alexei Starovoitov wrote: > > > On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote: > > >> #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */ > > >> diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h > > >> index a634da0..905a2c6 100644 > > >> --- a/tools/arch/powerpc/include/asm/barrier.h > > >> +++ b/tools/arch/powerpc/include/asm/barrier.h > > >> @@ -27,4 +27,20 @@ > > >> #define rmb() __asm__ __volatile__ ("sync" : : : "memory") > > >> #define wmb() __asm__ __volatile__ ("sync" : : : "memory") > > >> > > >> +#if defined(__powerpc64__) > > >> +#define smp_lwsync() __asm__ __volatile__ ("lwsync" : : : "memory") > > >> + > > >> +#define smp_store_release(p, v) \ > > >> +do { \ > > >> + smp_lwsync(); \ > > >> + WRITE_ONCE(*p, v); \ > > >> +} while (0) > > >> + > > >> +#define smp_load_acquire(p) \ > > >> +({ \ > > >> + typeof(*p) ___p1 = READ_ONCE(*p); \ > > >> + smp_lwsync(); \ > > >> + ___p1; \ > > > > > > I don't like this proliferation of asm. > > > Why do we think that we can do better job than compiler? > > > can we please use gcc builtins instead? > > > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html > > > __atomic_load_n(ptr, __ATOMIC_ACQUIRE); > > > __atomic_store_n(ptr, val, __ATOMIC_RELEASE); > > > are done specifically for this use case if I'm not mistaken. > > > I think it pays to learn what compiler provides. 
> > > > But are you sure the C11 memory model matches exact same model as kernel? > > Seems like last time Will looked into it [0] it wasn't the case ... > > I'm only suggesting equivalence of __atomic_load_n(ptr, __ATOMIC_ACQUIRE) > with kernel's smp_load_acquire(). > I've seen a bunch of user space ring buffer implementations implemented > with __atomic_load_n() primitives. > But let's ask experts who live in both worlds. One thing to be wary of is if there is an implementation choice between how to implement load-acquire and store-release for a given architecture. In these situations, it's often important that concurrent software agrees on the "mapping", so we'd need to be sure that (a) All userspace compilers that we care about have compatible mappings and (b) These mappings are compatible with the kernel code. Will ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb} 2018-10-19 11:02 ` Will Deacon @ 2018-10-19 11:56 ` Paul E. McKenney 0 siblings, 0 replies; 18+ messages in thread From: Paul E. McKenney @ 2018-10-19 11:56 UTC (permalink / raw) To: Will Deacon Cc: Alexei Starovoitov, Daniel Borkmann, Peter Zijlstra, acme, yhs, john.fastabend, netdev On Fri, Oct 19, 2018 at 12:02:43PM +0100, Will Deacon wrote: > On Thu, Oct 18, 2018 at 08:53:42PM -0700, Alexei Starovoitov wrote: > > On Thu, Oct 18, 2018 at 09:00:46PM +0200, Daniel Borkmann wrote: > > > On 10/18/2018 05:33 PM, Alexei Starovoitov wrote: > > > > On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote: > > > >> #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */ > > > >> diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h > > > >> index a634da0..905a2c6 100644 > > > >> --- a/tools/arch/powerpc/include/asm/barrier.h > > > >> +++ b/tools/arch/powerpc/include/asm/barrier.h > > > >> @@ -27,4 +27,20 @@ > > > >> #define rmb() __asm__ __volatile__ ("sync" : : : "memory") > > > >> #define wmb() __asm__ __volatile__ ("sync" : : : "memory") > > > >> > > > >> +#if defined(__powerpc64__) > > > >> +#define smp_lwsync() __asm__ __volatile__ ("lwsync" : : : "memory") > > > >> + > > > >> +#define smp_store_release(p, v) \ > > > >> +do { \ > > > >> + smp_lwsync(); \ > > > >> + WRITE_ONCE(*p, v); \ > > > >> +} while (0) > > > >> + > > > >> +#define smp_load_acquire(p) \ > > > >> +({ \ > > > >> + typeof(*p) ___p1 = READ_ONCE(*p); \ > > > >> + smp_lwsync(); \ > > > >> + ___p1; \ > > > > > > > > I don't like this proliferation of asm. > > > > Why do we think that we can do better job than compiler? > > > > can we please use gcc builtins instead? 
> > > > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html > > > > __atomic_load_n(ptr, __ATOMIC_ACQUIRE); > > > > __atomic_store_n(ptr, val, __ATOMIC_RELEASE); > > > > are done specifically for this use case if I'm not mistaken. > > > > I think it pays to learn what compiler provides. > > > > > > But are you sure the C11 memory model matches exact same model as kernel? > > > Seems like last time Will looked into it [0] it wasn't the case ... > > > > I'm only suggesting equivalence of __atomic_load_n(ptr, __ATOMIC_ACQUIRE) > > with kernel's smp_load_acquire(). > > I've seen a bunch of user space ring buffer implementations implemented > > with __atomic_load_n() primitives. > > But let's ask experts who live in both worlds. > > One thing to be wary of is if there is an implementation choice between > how to implement load-acquire and store-release for a given architecture. > In these situations, it's often important that concurrent software agrees > on the "mapping", so we'd need to be sure that (a) All userspace compilers > that we care about have compatible mappings and (b) These mappings are > compatible with the kernel code. Agreed! Mixing and matching can be done, but it does require quite a bit of care. Thanx, Paul ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb} 2018-10-18 15:33 ` Alexei Starovoitov 2018-10-18 19:00 ` Daniel Borkmann @ 2018-10-19 8:04 ` Peter Zijlstra 1 sibling, 0 replies; 18+ messages in thread From: Peter Zijlstra @ 2018-10-19 8:04 UTC (permalink / raw) To: Alexei Starovoitov Cc: Daniel Borkmann, paulmck, will.deacon, acme, yhs, john.fastabend, netdev On Thu, Oct 18, 2018 at 08:33:09AM -0700, Alexei Starovoitov wrote: > On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote: > > #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */ > > diff --git a/tools/arch/powerpc/include/asm/barrier.h b/tools/arch/powerpc/include/asm/barrier.h > > index a634da0..905a2c6 100644 > > --- a/tools/arch/powerpc/include/asm/barrier.h > > +++ b/tools/arch/powerpc/include/asm/barrier.h > > @@ -27,4 +27,20 @@ > > #define rmb() __asm__ __volatile__ ("sync" : : : "memory") > > #define wmb() __asm__ __volatile__ ("sync" : : : "memory") > > > > +#if defined(__powerpc64__) > > +#define smp_lwsync() __asm__ __volatile__ ("lwsync" : : : "memory") > > + > > +#define smp_store_release(p, v) \ > > +do { \ > > + smp_lwsync(); \ > > + WRITE_ONCE(*p, v); \ > > +} while (0) > > + > > +#define smp_load_acquire(p) \ > > +({ \ > > + typeof(*p) ___p1 = READ_ONCE(*p); \ > > + smp_lwsync(); \ > > + ___p1; \ > > I don't like this proliferation of asm. > Why do we think that we can do better job than compiler? > can we please use gcc builtins instead? > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html > __atomic_load_n(ptr, __ATOMIC_ACQUIRE); > __atomic_store_n(ptr, val, __ATOMIC_RELEASE); > are done specifically for this use case if I'm not mistaken. > I think it pays to learn what compiler provides. My problem with using the C11 stuff for this is that we're then limited to compilers that actually support that. The kernel has a minimum of gcc-4.6 (and thus perf does too I think) and gcc-4.6 does not have C11. 
What Daniel writes is also true; the kernel and C11 memory models don't align; but you're right in that for this purpose the C11 load-acquire and store-release would indeed suffice. ^ permalink raw reply [flat|nested] 18+ messages in thread
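The compiler-support concern above can be sketched as a compile-time fallback. This is a minimal illustration, not code from the patch set: `HAVE_ATOMIC_BUILTINS`, `load_acquire_u64()` and `store_release_u64()` are hypothetical names. It assumes the `__atomic` builtins are available from GCC 4.7 on and in any clang recent enough to be in use, and it falls back to a volatile access plus the older `__sync_synchronize()` full barrier otherwise.

```c
#include <stdint.h>

/*
 * The __atomic builtins appeared in GCC 4.7. clang reports itself as
 * GCC 4.2 through __GNUC__/__GNUC_MINOR__, so it is tested separately
 * (assumption: the clang in use is new enough to provide them).
 */
#if defined(__clang__) || \
    (defined(__GNUC__) && \
     (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 7)))
#define HAVE_ATOMIC_BUILTINS 1	/* hypothetical feature macro */
#else
#define HAVE_ATOMIC_BUILTINS 0
#endif

/* Load *p so that later loads and stores are ordered after it. */
static inline uint64_t load_acquire_u64(uint64_t *p)
{
#if HAVE_ATOMIC_BUILTINS
	return __atomic_load_n(p, __ATOMIC_ACQUIRE);
#else
	uint64_t v = *(volatile uint64_t *)p;

	__sync_synchronize();	/* full barrier: stronger than needed */
	return v;
#endif
}

/* Store v to *p so that earlier loads and stores are ordered before it. */
static inline void store_release_u64(uint64_t *p, uint64_t v)
{
#if HAVE_ATOMIC_BUILTINS
	__atomic_store_n(p, v, __ATOMIC_RELEASE);
#else
	__sync_synchronize();	/* full barrier: stronger than needed */
	*(volatile uint64_t *)p = v;
#endif
}
```

The fallback path is deliberately conservative: a full barrier costs more than an acquire or release on most architectures, but it does not depend on compiler features that gcc-4.6 lacks.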
* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb} 2018-10-18 15:04 ` Daniel Borkmann 2018-10-18 15:33 ` Alexei Starovoitov @ 2018-10-19 9:44 ` Peter Zijlstra 2018-10-19 10:37 ` Daniel Borkmann 1 sibling, 1 reply; 18+ messages in thread From: Peter Zijlstra @ 2018-10-19 9:44 UTC (permalink / raw) To: Daniel Borkmann Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs, john.fastabend, netdev On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote: > diff --git a/tools/include/linux/ring_buffer.h b/tools/include/linux/ring_buffer.h > new file mode 100644 > index 0000000..48200e0 > --- /dev/null > +++ b/tools/include/linux/ring_buffer.h > @@ -0,0 +1,69 @@ > +#ifndef _TOOLS_LINUX_RING_BUFFER_H_ > +#define _TOOLS_LINUX_RING_BUFFER_H_ > + > +#include <linux/compiler.h> > +#include <asm/barrier.h> > + > +/* > + * Below barriers pair as follows (kernel/events/ring_buffer.c): > + * > + * Since the mmap() consumer (userspace) can run on a different CPU: > + * > + * kernel user > + * > + * if (LOAD ->data_tail) { LOAD ->data_head > + * (A) smp_rmb() (C) > + * STORE $data LOAD $data > + * smp_wmb() (B) smp_mb() (D) > + * STORE ->data_head STORE ->data_tail > + * } > + * > + * Where A pairs with D, and B pairs with C. > + * > + * In our case A is a control dependency that separates the load > + * of the ->data_tail and the stores of $data. In case ->data_tail > + * indicates there is no room in the buffer to store $data we do not. > + * > + * D needs to be a full barrier since it separates the data READ > + * from the tail WRITE. > + * > + * For B a WMB is sufficient since it separates two WRITEs, and for > + * C an RMB is sufficient since it separates two READs. > + */ > + > +/* > + * Note, instead of B, C, D we could also use smp_store_release() > + * in B and D as well as smp_load_acquire() in C. 
However, this > + * optimization makes sense not for all architectures since it > + * would resolve into READ_ONCE() + smp_mb() pair for smp_load_acquire() > + * and smp_mb() + WRITE_ONCE() pair for smp_store_release(), thus > + * for those smp_wmb() in B and smp_rmb() in C would still be less > + * expensive. For the case of D this has either the same cost or > + * is less expensive. For example, due to TSO (total store order), > + * x86 can avoid the CPU barrier entirely. > + */ > + > +static inline u64 ring_buffer_read_head(struct perf_event_mmap_page *base) > +{ > +/* > + * Architectures where smp_load_acquire() does not fallback to > + * READ_ONCE() + smp_mb() pair. > + */ > +#if defined(__x86_64__) || defined(__aarch64__) || defined(__powerpc64__) || \ > + defined(__ia64__) || defined(__sparc__) && defined(__arch64__) > + return smp_load_acquire(&base->data_head); > +#else > + u64 head = READ_ONCE(base->data_head); > + > + smp_rmb(); > + return head; > +#endif > +} > + > +static inline void ring_buffer_write_tail(struct perf_event_mmap_page *base, > + u64 tail) > +{ > + smp_store_release(&base->data_tail, tail); > +} > + > +#endif /* _TOOLS_LINUX_RING_BUFFER_H_ */ (for the whole patch, but in particular the above) Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb} 2018-10-19 9:44 ` Peter Zijlstra @ 2018-10-19 10:37 ` Daniel Borkmann 0 siblings, 0 replies; 18+ messages in thread From: Daniel Borkmann @ 2018-10-19 10:37 UTC (permalink / raw) To: Peter Zijlstra Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs, john.fastabend, netdev On 10/19/2018 11:44 AM, Peter Zijlstra wrote: > On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote: >> diff --git a/tools/include/linux/ring_buffer.h b/tools/include/linux/ring_buffer.h >> new file mode 100644 >> index 0000000..48200e0 >> --- /dev/null >> +++ b/tools/include/linux/ring_buffer.h >> @@ -0,0 +1,69 @@ >> +#ifndef _TOOLS_LINUX_RING_BUFFER_H_ >> +#define _TOOLS_LINUX_RING_BUFFER_H_ >> + >> +#include <linux/compiler.h> >> +#include <asm/barrier.h> >> + >> +/* >> + * Below barriers pair as follows (kernel/events/ring_buffer.c): >> + * >> + * Since the mmap() consumer (userspace) can run on a different CPU: >> + * >> + * kernel user >> + * >> + * if (LOAD ->data_tail) { LOAD ->data_head >> + * (A) smp_rmb() (C) >> + * STORE $data LOAD $data >> + * smp_wmb() (B) smp_mb() (D) >> + * STORE ->data_head STORE ->data_tail >> + * } >> + * >> + * Where A pairs with D, and B pairs with C. >> + * >> + * In our case A is a control dependency that separates the load >> + * of the ->data_tail and the stores of $data. In case ->data_tail >> + * indicates there is no room in the buffer to store $data we do not. >> + * >> + * D needs to be a full barrier since it separates the data READ >> + * from the tail WRITE. >> + * >> + * For B a WMB is sufficient since it separates two WRITEs, and for >> + * C an RMB is sufficient since it separates two READs. >> + */ >> + >> +/* >> + * Note, instead of B, C, D we could also use smp_store_release() >> + * in B and D as well as smp_load_acquire() in C. 
However, this >> + * optimization makes sense not for all architectures since it >> + * would resolve into READ_ONCE() + smp_mb() pair for smp_load_acquire() >> + * and smp_mb() + WRITE_ONCE() pair for smp_store_release(), thus >> + * for those smp_wmb() in B and smp_rmb() in C would still be less >> + * expensive. For the case of D this has either the same cost or >> + * is less expensive. For example, due to TSO (total store order), >> + * x86 can avoid the CPU barrier entirely. >> + */ >> + >> +static inline u64 ring_buffer_read_head(struct perf_event_mmap_page *base) >> +{ >> +/* >> + * Architectures where smp_load_acquire() does not fallback to >> + * READ_ONCE() + smp_mb() pair. >> + */ >> +#if defined(__x86_64__) || defined(__aarch64__) || defined(__powerpc64__) || \ >> + defined(__ia64__) || defined(__sparc__) && defined(__arch64__) >> + return smp_load_acquire(&base->data_head); >> +#else >> + u64 head = READ_ONCE(base->data_head); >> + >> + smp_rmb(); >> + return head; >> +#endif >> +} >> + >> +static inline void ring_buffer_write_tail(struct perf_event_mmap_page *base, >> + u64 tail) >> +{ >> + smp_store_release(&base->data_tail, tail); >> +} >> + >> +#endif /* _TOOLS_LINUX_RING_BUFFER_H_ */ > > (for the whole patch, but in particular the above) > > Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Great, thanks a lot, Peter! Will flush out v2 in a bit. ^ permalink raw reply [flat|nested] 18+ messages in thread
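The pairing described in the header comment above (C on the head load, D on the tail store) can be sketched from the consumer side as follows. This is an illustration under stated assumptions: `struct demo_ring` and `demo_ring_drain()` are hypothetical and much simpler than `struct perf_event_mmap_page`, records are single bytes, and the GCC `__atomic` builtins stand in for `smp_load_acquire()` and `smp_store_release()`.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, simplified ring with a power-of-two data area. */
struct demo_ring {
	uint64_t data_head;	/* advanced by the producer */
	uint64_t data_tail;	/* advanced by the consumer */
	uint64_t mask;		/* size - 1, size being a power of two */
	unsigned char *data;
};

/* Sum all unread bytes, then hand the space back to the producer. */
static uint64_t demo_ring_drain(struct demo_ring *rb)
{
	/* (C): acquire orders the head load before the data loads below. */
	uint64_t head = __atomic_load_n(&rb->data_head, __ATOMIC_ACQUIRE);
	uint64_t tail = rb->data_tail;	/* only the consumer writes this */
	uint64_t sum = 0;

	while (tail != head) {
		sum += rb->data[tail & rb->mask];
		tail++;
	}

	/* (D): release orders the data loads before the tail store. */
	__atomic_store_n(&rb->data_tail, tail, __ATOMIC_RELEASE);
	return sum;
}
```

The tail is published only after all records have been read, so the producer cannot overwrite a slot that the consumer is still looking at.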
* [PATCH bpf-next 3/3] bpf, libbpf: use proper barriers in perf ring buffer walk 2018-10-17 14:41 [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb Daniel Borkmann 2018-10-17 14:41 ` [PATCH bpf-next 1/3] tools: add smp_* barrier variants to include infrastructure Daniel Borkmann 2018-10-17 14:41 ` [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb} Daniel Borkmann @ 2018-10-17 14:41 ` Daniel Borkmann 2018-10-17 15:51 ` Peter Zijlstra 2018-10-17 15:03 ` [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb Arnaldo Carvalho de Melo 3 siblings, 1 reply; 18+ messages in thread From: Daniel Borkmann @ 2018-10-17 14:41 UTC (permalink / raw) To: alexei.starovoitov Cc: peterz, paulmck, will.deacon, acme, yhs, john.fastabend, netdev, Daniel Borkmann Add bpf_perf_read_head() and bpf_perf_write_tail() helpers to make it more clear in what context barriers are used here, and use smp_rmb() as well as smp_mb() barriers. Given libbpf is not restricted to x86-64 only, the compiler barrier needs to be replaced with smp_rmb(). Also the __sync_synchronize() emits mfence whereas faster lock + add can be used on x86-64 via smp_mb(). 
Fixes: d0cabbb021be ("tools: bpf: move the event reading loop to libbpf")
Fixes: 39111695b1b8 ("samples: bpf: add bpf_perf_event_output example")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/lib/bpf/libbpf.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index bd71efc..e8ae8db 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -20,6 +20,7 @@
 #include <fcntl.h>
 #include <errno.h>
 #include <asm/unistd.h>
+#include <asm/barrier.h>
 #include <linux/err.h>
 #include <linux/kernel.h>
 #include <linux/bpf.h>
@@ -2413,18 +2414,32 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
 	return 0;
 }
 
+static __u64 bpf_perf_read_head(struct perf_event_mmap_page *header)
+{
+	__u64 data_head = READ_ONCE(header->data_head);
+
+	smp_rmb();
+	return data_head;
+}
+
+static void bpf_perf_write_tail(struct perf_event_mmap_page *header,
+				__u64 data_tail)
+{
+	smp_mb();
+	header->data_tail = data_tail;
+}
+
 enum bpf_perf_event_ret
 bpf_perf_event_read_simple(void *mem, unsigned long size,
 			   unsigned long page_size, void **buf, size_t *buf_len,
 			   bpf_perf_event_print_t fn, void *priv)
 {
-	volatile struct perf_event_mmap_page *header = mem;
+	struct perf_event_mmap_page *header = mem;
+	__u64 data_head = bpf_perf_read_head(header);
 	__u64 data_tail = header->data_tail;
-	__u64 data_head = header->data_head;
 	int ret = LIBBPF_PERF_EVENT_ERROR;
 	void *base, *begin, *end;
 
-	asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
 	if (data_head == data_tail)
 		return LIBBPF_PERF_EVENT_CONT;
 
@@ -2467,8 +2482,6 @@ bpf_perf_event_read_simple(void *mem, unsigned long size,
 		data_tail += ehdr->size;
 	}
 
-	__sync_synchronize(); /* smp_mb() */
-	header->data_tail = data_tail;
-
+	bpf_perf_write_tail(header, data_tail);
 	return ret;
 }
-- 
2.9.5
* Re: [PATCH bpf-next 3/3] bpf, libbpf: use proper barriers in perf ring buffer walk
  2018-10-17 14:41 ` [PATCH bpf-next 3/3] bpf, libbpf: use proper barriers in perf ring buffer walk Daniel Borkmann
@ 2018-10-17 15:51   ` Peter Zijlstra
From: Peter Zijlstra @ 2018-10-17 15:51 UTC
To: Daniel Borkmann
Cc: alexei.starovoitov, paulmck, will.deacon, acme, yhs, john.fastabend, netdev

On Wed, Oct 17, 2018 at 04:41:56PM +0200, Daniel Borkmann wrote:
> +static __u64 bpf_perf_read_head(struct perf_event_mmap_page *header)
> +{
> +	__u64 data_head = READ_ONCE(header->data_head);
> +
> +	smp_rmb();
> +	return data_head;
> +}
> +
> +static void bpf_perf_write_tail(struct perf_event_mmap_page *header,
> +				__u64 data_tail)
> +{
> +	smp_mb();
> +	header->data_tail = data_tail;
> +}

Same comments: either use smp_load_acquire()/smp_store_release(), or at the very least a WRITE_ONCE() there.
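The "at the very least a WRITE_ONCE()" variant of the tail store could look roughly like the sketch below. It is an illustration only: `DEMO_WRITE_ONCE()` is a simplified, hypothetical stand-in for the kernel's `WRITE_ONCE()` (a plain volatile store), `struct demo_header` is a cut-down substitute for `struct perf_event_mmap_page`, and `__sync_synchronize()` stands in for `smp_mb()`, as it did in the code this patch replaces.

```c
#include <stdint.h>

/* Simplified stand-in; the kernel's WRITE_ONCE() is more elaborate. */
#define DEMO_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical, cut-down header with only the two position fields. */
struct demo_header {
	uint64_t data_head;
	uint64_t data_tail;
};

static void demo_write_tail(struct demo_header *h, uint64_t tail)
{
	__sync_synchronize();	/* full barrier, standing in for smp_mb() */
	/* Volatile store: the compiler may not tear, fuse or elide it. */
	DEMO_WRITE_ONCE(h->data_tail, tail);
}
```

The barrier orders the preceding data reads before the store; the volatile access only constrains what the compiler may do with the store itself.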
* Re: [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb
  2018-10-17 14:41 [PATCH bpf-next 0/3] improve and fix barriers for walking perf rb Daniel Borkmann
                   ` (2 preceding siblings ...)
  2018-10-17 14:41 ` [PATCH bpf-next 3/3] bpf, libbpf: use proper barriers in perf ring buffer walk Daniel Borkmann
@ 2018-10-17 15:03 ` Arnaldo Carvalho de Melo
From: Arnaldo Carvalho de Melo @ 2018-10-17 15:03 UTC
To: Daniel Borkmann
Cc: alexei.starovoitov, peterz, paulmck, will.deacon, yhs, john.fastabend, netdev

On Wed, Oct 17, 2018 at 04:41:53PM +0200, Daniel Borkmann wrote:
> This set first adds smp_* barrier variants to tools infrastructure
> and in a second step updates perf and libbpf to make use of them.
> For details, please see individual patches, thanks!
>
> Arnaldo, if there are no objections, could this be routed via bpf-next
> with Acked-by's due to later dependencies in libbpf? Alternatively,
> I could also get the 2nd patch out during merge window, but perhaps
> it's okay to do in one go as there shouldn't be much conflict in perf.

Right, when updating kernel/events/ring_buffer.c the corresponding code in tools/ should've been changed :-)

Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>

- Arnaldo

> Thanks!
>
> Daniel Borkmann (3):
>   tools: add smp_* barrier variants to include infrastructure
>   tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}
>   bpf, libbpf: use proper barriers in perf ring buffer walk
>
>  tools/arch/arm64/include/asm/barrier.h | 10 ++++++++++
>  tools/arch/x86/include/asm/barrier.h   |  9 ++++++---
>  tools/include/asm/barrier.h            | 11 +++++++++++
>  tools/lib/bpf/libbpf.c                 | 25 +++++++++++++++++++------
>  tools/perf/util/mmap.h                 |  5 +++--
>  5 files changed, 49 insertions(+), 11 deletions(-)
>
> --
> 2.9.5