All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 0/2] atomics: fix small RCU perf. regression + update documentation
@ 2016-05-21 20:42 Emilio G. Cota
  2016-05-21 20:42 ` [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics Emilio G. Cota
  2016-05-21 20:42 ` [Qemu-devel] [PATCH 2/2] docs/atomics: update atomic_read/set comparison with Linux Emilio G. Cota
  0 siblings, 2 replies; 14+ messages in thread
From: Emilio G. Cota @ 2016-05-21 20:42 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

Patch 1 fixes a small performance regression introduced when moving
our atomics to __atomic primitives. The regression can be measured
on RMO architectures (I used aarch64); the effect is very small
but consistently measurable: for instance, rcutorture performance
degraded by about 0.3%.

Patch 2 originates from a recent discussion[1] that led me to
look into what guarantees can be assumed from our atomic_read/set
implementations.

Thanks,

		Emilio

[1] https://lists.gnu.org/archive/html/qemu-devel/2016-05/msg03088.html

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics
  2016-05-21 20:42 [Qemu-devel] [PATCH 0/2] atomics: fix small RCU perf. regression + update documentation Emilio G. Cota
@ 2016-05-21 20:42 ` Emilio G. Cota
  2016-05-22  7:58   ` Alex Bennée
                     ` (2 more replies)
  2016-05-21 20:42 ` [Qemu-devel] [PATCH 2/2] docs/atomics: update atomic_read/set comparison with Linux Emilio G. Cota
  1 sibling, 3 replies; 14+ messages in thread
From: Emilio G. Cota @ 2016-05-21 20:42 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

Commit a0aa44b4 ("include/qemu/atomic.h: default to __atomic functions")
set all atomics to default (on recent GCC versions) to __atomic primitives.

In the process, the atomic_rcu_read/set were converted to implement
consume/release semantics, respectively. This is inefficient; for
correctness and maximum performance we only need an smp_barrier_depends
for reads, and an smp_wmb for writes. Fix it by using the original
definition of these two primitives for all compilers.

Note that in case these semantics were necessary to avoid false
positives under Thread Sanitizer, we could have them defined as such
under #ifdef __SANITIZE_THREAD__. I tried running QEMU under tsan
to explore this, but unfortunately I couldn't get it to work (tsan dies
with an "unexpected memory mapping"). I suspect though that the
atomic_read/set embedded in atomic_rcu_read/set is enough for tsan,
though.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/atomic.h | 99 ++++++++++++++++++++++-----------------------------
 1 file changed, 43 insertions(+), 56 deletions(-)

diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 5bc4d6c..4d62425 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -56,22 +56,6 @@
     __atomic_store(ptr, &_val, __ATOMIC_RELAXED);     \
 } while(0)
 
-/* Atomic RCU operations imply weak memory barriers */
-
-#define atomic_rcu_read(ptr)                          \
-    ({                                                \
-    QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *)); \
-    typeof(*ptr) _val;                                \
-    __atomic_load(ptr, &_val, __ATOMIC_CONSUME);      \
-    _val;                                             \
-    })
-
-#define atomic_rcu_set(ptr, i) do {                   \
-    QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *)); \
-    typeof(*ptr) _val = (i);                          \
-    __atomic_store(ptr, &_val, __ATOMIC_RELEASE);     \
-} while(0)
-
 /* atomic_mb_read/set semantics map Java volatile variables. They are
  * less expensive on some platforms (notably POWER & ARMv7) than fully
  * sequentially consistent operations.
@@ -242,46 +226,6 @@
 #define atomic_read(ptr)       (*(__typeof__(*ptr) volatile*) (ptr))
 #define atomic_set(ptr, i)     ((*(__typeof__(*ptr) volatile*) (ptr)) = (i))
 
-/**
- * atomic_rcu_read - reads a RCU-protected pointer to a local variable
- * into a RCU read-side critical section. The pointer can later be safely
- * dereferenced within the critical section.
- *
- * This ensures that the pointer copy is invariant thorough the whole critical
- * section.
- *
- * Inserts memory barriers on architectures that require them (currently only
- * Alpha) and documents which pointers are protected by RCU.
- *
- * atomic_rcu_read also includes a compiler barrier to ensure that
- * value-speculative optimizations (e.g. VSS: Value Speculation
- * Scheduling) does not perform the data read before the pointer read
- * by speculating the value of the pointer.
- *
- * Should match atomic_rcu_set(), atomic_xchg(), atomic_cmpxchg().
- */
-#define atomic_rcu_read(ptr)    ({                \
-    typeof(*ptr) _val = atomic_read(ptr);         \
-    smp_read_barrier_depends();                   \
-    _val;                                         \
-})
-
-/**
- * atomic_rcu_set - assigns (publicizes) a pointer to a new data structure
- * meant to be read by RCU read-side critical sections.
- *
- * Documents which pointers will be dereferenced by RCU read-side critical
- * sections and adds the required memory barriers on architectures requiring
- * them. It also makes sure the compiler does not reorder code initializing the
- * data structure before its publication.
- *
- * Should match atomic_rcu_read().
- */
-#define atomic_rcu_set(ptr, i)  do {              \
-    smp_wmb();                                    \
-    atomic_set(ptr, i);                           \
-} while (0)
-
 /* These have the same semantics as Java volatile variables.
  * See http://gee.cs.oswego.edu/dl/jmm/cookbook.html:
  * "1. Issue a StoreStore barrier (wmb) before each volatile store."
@@ -345,4 +289,47 @@
 #define atomic_or(ptr, n)      ((void) __sync_fetch_and_or(ptr, n))
 
 #endif /* __ATOMIC_RELAXED */
+
+/**
+ * atomic_rcu_read - reads a RCU-protected pointer to a local variable
+ * into a RCU read-side critical section. The pointer can later be safely
+ * dereferenced within the critical section.
+ *
+ * This ensures that the pointer copy is invariant thorough the whole critical
+ * section.
+ *
+ * Inserts memory barriers on architectures that require them (currently only
+ * Alpha) and documents which pointers are protected by RCU.
+ *
+ * atomic_rcu_read also includes a compiler barrier to ensure that
+ * value-speculative optimizations (e.g. VSS: Value Speculation
+ * Scheduling) does not perform the data read before the pointer read
+ * by speculating the value of the pointer.
+ *
+ * Should match atomic_rcu_set(), atomic_xchg(), atomic_cmpxchg().
+ */
+#define atomic_rcu_read(ptr)    ({                    \
+    QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *)); \
+    typeof(*ptr) _val = atomic_read(ptr);             \
+    smp_read_barrier_depends();                       \
+    _val;                                             \
+})
+
+/**
+ * atomic_rcu_set - assigns (publicizes) a pointer to a new data structure
+ * meant to be read by RCU read-side critical sections.
+ *
+ * Documents which pointers will be dereferenced by RCU read-side critical
+ * sections and adds the required memory barriers on architectures requiring
+ * them. It also makes sure the compiler does not reorder code initializing the
+ * data structure before its publication.
+ *
+ * Should match atomic_rcu_read().
+ */
+#define atomic_rcu_set(ptr, i)  do {                  \
+    QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *)); \
+    smp_wmb();                                        \
+    atomic_set(ptr, i);                               \
+} while (0)
+
 #endif /* __QEMU_ATOMIC_H */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [Qemu-devel] [PATCH 2/2] docs/atomics: update atomic_read/set comparison with Linux
  2016-05-21 20:42 [Qemu-devel] [PATCH 0/2] atomics: fix small RCU perf. regression + update documentation Emilio G. Cota
  2016-05-21 20:42 ` [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics Emilio G. Cota
@ 2016-05-21 20:42 ` Emilio G. Cota
  1 sibling, 0 replies; 14+ messages in thread
From: Emilio G. Cota @ 2016-05-21 20:42 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

Recently Linux did a mass conversion of its atomic_read/set calls
so that they at least are READ/WRITE_ONCE. See Linux's commit
62e8a325 ("atomic, arch: Audit atomic_{read,set}()"). It seems though
that their documentation hasn't been updated to reflect this.

The appended updates our documentation to reflect the change, which
means there is effectively no difference between our atomic_read/set
and the current Linux implementation.

While at it, fix the statement that a barrier is implied by
atomic_read/set, which is incorrect. Volatile/atomic semantics prevent
transformations pertaining the variable they apply to; this, however,
has no effect on surrounding statements like barriers do. For more
details on this, see:
  https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 docs/atomics.txt | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/docs/atomics.txt b/docs/atomics.txt
index ef285e3..7540990 100644
--- a/docs/atomics.txt
+++ b/docs/atomics.txt
@@ -326,9 +326,19 @@ and memory barriers, and the equivalents in QEMU:
   use a boxed atomic_t type; atomic operations in QEMU are polymorphic
   and use normal C types.
 
-- atomic_read and atomic_set in Linux give no guarantee at all;
-  atomic_read and atomic_set in QEMU include a compiler barrier
-  (similar to the ACCESS_ONCE macro in Linux).
+- Originally, atomic_read and atomic_set in Linux gave no guarantee
+  at all. Recently they have been updated to implement volatile
+  semantics via ACCESS_ONCE (or the more recent READ/WRITE_ONCE).
+
+  QEMU's atomic_read/set implement, if the compiler supports it, C11
+  atomic relaxed semantics, and volatile semantics otherwise.
+  Both semantics prevent the compiler from doing certain transformations;
+  the difference is that atomic accesses are guaranteed to be atomic,
+  while volatile accesses aren't. Thus, in the volatile case we just cross
+  our fingers hoping that the compiler will generate atomic accesses,
+  since we assume the variables passed are machine-word sized and
+  properly aligned.
+  No barriers are implied by atomic_read/set in either Linux or QEMU.
 
 - most atomic read-modify-write operations in Linux return void;
   in QEMU, all of them return the old value of the variable.
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics
  2016-05-21 20:42 ` [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics Emilio G. Cota
@ 2016-05-22  7:58   ` Alex Bennée
  2016-05-24 18:42     ` Emilio G. Cota
  2016-05-23 14:21   ` Paolo Bonzini
  2016-05-23 16:53   ` Richard Henderson
  2 siblings, 1 reply; 14+ messages in thread
From: Alex Bennée @ 2016-05-22  7:58 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> Commit a0aa44b4 ("include/qemu/atomic.h: default to __atomic functions")
> set all atomics to default (on recent GCC versions) to __atomic primitives.
>
> In the process, the atomic_rcu_read/set were converted to implement
> consume/release semantics, respectively. This is inefficient; for
> correctness and maximum performance we only need an smp_barrier_depends
> for reads, and an smp_wmb for writes. Fix it by using the original
> definition of these two primitives for all compilers.
>
> Note that in case these semantics were necessary to avoid false
> positives under Thread Sanitizer, we could have them defined as such
> under #ifdef __SANITIZE_THREAD__. I tried running QEMU under tsan
> to explore this, but unfortunately I couldn't get it to work (tsan dies
> with an "unexpected memory mapping"). I suspect though that the
> atomic_read/set embedded in atomic_rcu_read/set is enough for tsan,
> though.

For tsan runs you need to re-build with:

  ./configure --cc=gcc --extra-cflags="-pie -fPIE -fsanitize=thread" --with-coroutine=gthread

Specifically the coroutine ucontext messing really confuses TSAN.

>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/atomic.h | 99 ++++++++++++++++++++++-----------------------------
>  1 file changed, 43 insertions(+), 56 deletions(-)
>
> diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
> index 5bc4d6c..4d62425 100644
> --- a/include/qemu/atomic.h
> +++ b/include/qemu/atomic.h
> @@ -56,22 +56,6 @@
>      __atomic_store(ptr, &_val, __ATOMIC_RELAXED);     \
>  } while(0)
>
> -/* Atomic RCU operations imply weak memory barriers */
> -
> -#define atomic_rcu_read(ptr)                          \
> -    ({                                                \
> -    QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *)); \
> -    typeof(*ptr) _val;                                \
> -    __atomic_load(ptr, &_val, __ATOMIC_CONSUME);      \
> -    _val;                                             \
> -    })
> -
> -#define atomic_rcu_set(ptr, i) do {                   \
> -    QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *)); \
> -    typeof(*ptr) _val = (i);                          \
> -    __atomic_store(ptr, &_val, __ATOMIC_RELEASE);     \
> -} while(0)
> -
>  /* atomic_mb_read/set semantics map Java volatile variables. They are
>   * less expensive on some platforms (notably POWER & ARMv7) than fully
>   * sequentially consistent operations.
> @@ -242,46 +226,6 @@
>  #define atomic_read(ptr)       (*(__typeof__(*ptr) volatile*) (ptr))
>  #define atomic_set(ptr, i)     ((*(__typeof__(*ptr) volatile*) (ptr)) = (i))
>
> -/**
> - * atomic_rcu_read - reads a RCU-protected pointer to a local variable
> - * into a RCU read-side critical section. The pointer can later be safely
> - * dereferenced within the critical section.
> - *
> - * This ensures that the pointer copy is invariant thorough the whole critical
> - * section.
> - *
> - * Inserts memory barriers on architectures that require them (currently only
> - * Alpha) and documents which pointers are protected by RCU.
> - *
> - * atomic_rcu_read also includes a compiler barrier to ensure that
> - * value-speculative optimizations (e.g. VSS: Value Speculation
> - * Scheduling) does not perform the data read before the pointer read
> - * by speculating the value of the pointer.
> - *
> - * Should match atomic_rcu_set(), atomic_xchg(), atomic_cmpxchg().
> - */
> -#define atomic_rcu_read(ptr)    ({                \
> -    typeof(*ptr) _val = atomic_read(ptr);         \
> -    smp_read_barrier_depends();                   \
> -    _val;                                         \
> -})
> -
> -/**
> - * atomic_rcu_set - assigns (publicizes) a pointer to a new data structure
> - * meant to be read by RCU read-side critical sections.
> - *
> - * Documents which pointers will be dereferenced by RCU read-side critical
> - * sections and adds the required memory barriers on architectures requiring
> - * them. It also makes sure the compiler does not reorder code initializing the
> - * data structure before its publication.
> - *
> - * Should match atomic_rcu_read().
> - */
> -#define atomic_rcu_set(ptr, i)  do {              \
> -    smp_wmb();                                    \
> -    atomic_set(ptr, i);                           \
> -} while (0)
> -
>  /* These have the same semantics as Java volatile variables.
>   * See http://gee.cs.oswego.edu/dl/jmm/cookbook.html:
>   * "1. Issue a StoreStore barrier (wmb) before each volatile store."
> @@ -345,4 +289,47 @@
>  #define atomic_or(ptr, n)      ((void) __sync_fetch_and_or(ptr, n))
>
>  #endif /* __ATOMIC_RELAXED */
> +
> +/**
> + * atomic_rcu_read - reads a RCU-protected pointer to a local variable
> + * into a RCU read-side critical section. The pointer can later be safely
> + * dereferenced within the critical section.
> + *
> + * This ensures that the pointer copy is invariant thorough the whole critical
> + * section.
> + *
> + * Inserts memory barriers on architectures that require them (currently only
> + * Alpha) and documents which pointers are protected by RCU.
> + *
> + * atomic_rcu_read also includes a compiler barrier to ensure that
> + * value-speculative optimizations (e.g. VSS: Value Speculation
> + * Scheduling) does not perform the data read before the pointer read
> + * by speculating the value of the pointer.
> + *
> + * Should match atomic_rcu_set(), atomic_xchg(), atomic_cmpxchg().
> + */
> +#define atomic_rcu_read(ptr)    ({                    \
> +    QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *)); \
> +    typeof(*ptr) _val = atomic_read(ptr);             \
> +    smp_read_barrier_depends();                       \
> +    _val;                                             \
> +})
> +
> +/**
> + * atomic_rcu_set - assigns (publicizes) a pointer to a new data structure
> + * meant to be read by RCU read-side critical sections.
> + *
> + * Documents which pointers will be dereferenced by RCU read-side critical
> + * sections and adds the required memory barriers on architectures requiring
> + * them. It also makes sure the compiler does not reorder code initializing the
> + * data structure before its publication.
> + *
> + * Should match atomic_rcu_read().
> + */
> +#define atomic_rcu_set(ptr, i)  do {                  \
> +    QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *)); \
> +    smp_wmb();                                        \
> +    atomic_set(ptr, i);                               \
> +} while (0)
> +
>  #endif /* __QEMU_ATOMIC_H */


--
Alex Bennée

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics
  2016-05-21 20:42 ` [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics Emilio G. Cota
  2016-05-22  7:58   ` Alex Bennée
@ 2016-05-23 14:21   ` Paolo Bonzini
  2016-05-23 15:55     ` Emilio G. Cota
  2016-05-23 16:53   ` Richard Henderson
  2 siblings, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2016-05-23 14:21 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Richard Henderson, Sergey Fedorov



On 21/05/2016 22:42, Emilio G. Cota wrote:
> Commit a0aa44b4 ("include/qemu/atomic.h: default to __atomic functions")
> set all atomics to default (on recent GCC versions) to __atomic primitives.
> 
> In the process, the atomic_rcu_read/set were converted to implement
> consume/release semantics, respectively. This is inefficient; for
> correctness and maximum performance we only need an smp_barrier_depends
> for reads, and an smp_wmb for writes. Fix it by using the original
> definition of these two primitives for all compilers.

Indeed most compilers implement consume the same as acquire, which is
inefficient.  However, isn't in practice atomic_thread_fence(release) +
atomic_store(relaxed) the same as atomic_store(release)?

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics
  2016-05-23 14:21   ` Paolo Bonzini
@ 2016-05-23 15:55     ` Emilio G. Cota
  0 siblings, 0 replies; 14+ messages in thread
From: Emilio G. Cota @ 2016-05-23 15:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée,
	Richard Henderson, Sergey Fedorov

On Mon, May 23, 2016 at 16:21:36 +0200, Paolo Bonzini wrote:
> On 21/05/2016 22:42, Emilio G. Cota wrote:
> > Commit a0aa44b4 ("include/qemu/atomic.h: default to __atomic functions")
> > set all atomics to default (on recent GCC versions) to __atomic primitives.
> > 
> > In the process, the atomic_rcu_read/set were converted to implement
> > consume/release semantics, respectively. This is inefficient; for
> > correctness and maximum performance we only need an smp_barrier_depends
> > for reads, and an smp_wmb for writes. Fix it by using the original
> > definition of these two primitives for all compilers.
> 
> Indeed most compilers implement consume the same as acquire, which is
> inefficient.
> However, isn't in practice atomic_thread_fence(release) +
> atomic_store(relaxed) the same as atomic_store(release)?

Yes. However this is not the issue I'm addressing with the patch.

The performance regression I measured is due to using load-acquire vs.
load+smp_read_barrier_depends(). In the latter case only Alpha will
emit a fence; in the former we always emit store-release, which
is "stronger" (i.e. more constraining.)

A similar thing applies to atomic_rcu_write, although I haven't
measured its impact. We only need smp_wmb+store, yet we emit a
store-release, which is again "stronger".

		E.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics
  2016-05-21 20:42 ` [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics Emilio G. Cota
  2016-05-22  7:58   ` Alex Bennée
  2016-05-23 14:21   ` Paolo Bonzini
@ 2016-05-23 16:53   ` Richard Henderson
  2016-05-23 17:09     ` Emilio G. Cota
  2 siblings, 1 reply; 14+ messages in thread
From: Richard Henderson @ 2016-05-23 16:53 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Sergey Fedorov

On 05/21/2016 01:42 PM, Emilio G. Cota wrote:
> In the process, the atomic_rcu_read/set were converted to implement
> consume/release semantics, respectively. This is inefficient; for
> correctness and maximum performance we only need an smp_barrier_depends
> for reads, and an smp_wmb for writes. Fix it by using the original
> definition of these two primitives for all compilers.

For what host do you think this is inefficient?

In particular, what you've done is going to be less efficient for e.g. armv8, 
where the __atomic formulation is going to produce load-acquire and 
store-release instructions.  Whereas the separate barriers are going to produce 
two insns.

As for the common case of x86_64, what you're doing is going to make no 
difference at all.

So what are you trying to improve?


r~

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics
  2016-05-23 16:53   ` Richard Henderson
@ 2016-05-23 17:09     ` Emilio G. Cota
  2016-05-24  7:08       ` Paolo Bonzini
  0 siblings, 1 reply; 14+ messages in thread
From: Emilio G. Cota @ 2016-05-23 17:09 UTC (permalink / raw)
  To: Richard Henderson
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Sergey Fedorov

On Mon, May 23, 2016 at 09:53:00 -0700, Richard Henderson wrote:
> On 05/21/2016 01:42 PM, Emilio G. Cota wrote:
> >In the process, the atomic_rcu_read/set were converted to implement
> >consume/release semantics, respectively. This is inefficient; for
> >correctness and maximum performance we only need an smp_barrier_depends
> >for reads, and an smp_wmb for writes. Fix it by using the original
> >definition of these two primitives for all compilers.
> 
> For what host do you think this is inefficient?
> 
> In particular, what you've done is going to be less efficient for e.g.
> armv8, where the __atomic formulation is going to produce load-acquire and
> store-release instructions.  Whereas the separate barriers are going to
> produce two insns.
> 
> As for the common case of x86_64, what you're doing is going to make no
> difference at all.
> 
> So what are you trying to improve?

Precisely I tested this on ARMv8. The goal is to not emit a fence at
all, i.e. to emit a single store instead of LDR (load-acquire).

I just realised that under #ifdef __ATOMIC we have:

#define smp_read_barrier_depends() ({ barrier(); __atomic_thread_fence(__ATOMIC_CONSUME); barrier(); })

Why? This should be:

#ifdef __alpha__
#define smp_read_barrier_depends()   asm volatile("mb":::"memory")
#endif

unconditionally.

My patch should have included this additional change to make sense.
Sorry for the confusion.

		E.

PS. And really equating smp_wmb/rmb to release/acquire as we have under
#ifdef __ATOMIC is hard to justify, other than to please tsan.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics
  2016-05-23 17:09     ` Emilio G. Cota
@ 2016-05-24  7:08       ` Paolo Bonzini
  2016-05-24 19:56         ` Emilio G. Cota
  0 siblings, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2016-05-24  7:08 UTC (permalink / raw)
  To: Emilio G. Cota, Richard Henderson
  Cc: MTTCG Devel, Alex Bennée, QEMU Developers, Sergey Fedorov



On 23/05/2016 19:09, Emilio G. Cota wrote:
> 		E.
> 
> PS. And really equating smp_wmb/rmb to release/acquire as we have under
> #ifdef __ATOMIC is hard to justify, other than to please tsan.

That only makes a difference on arm64, right?

	acquire		release		rmb		wmb
x86	--		--		--		--
power	lwsync		lwsync		lwsync		lwsync
armv7	dmb		dmb		dmb		dmb
arm64	dmb ishld	dmb ish		dmb ishld	dmb ishst
ia64	--		--		--		--

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics
  2016-05-22  7:58   ` Alex Bennée
@ 2016-05-24 18:42     ` Emilio G. Cota
  0 siblings, 0 replies; 14+ messages in thread
From: Emilio G. Cota @ 2016-05-24 18:42 UTC (permalink / raw)
  To: Alex Bennée
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov

On Sun, May 22, 2016 at 08:58:51 +0100, Alex Bennée wrote:
> For tsan runs you need to re-build with:
> 
>   ./configure --cc=gcc --extra-cflags="-pie -fPIE -fsanitize=thread" --with-coroutine=gthread
> 
> Specifically the coroutine ucontext messing really confuses TSAN.

With your configure args + the appended I can at least compile
arm-softmmu, but I still get:

FATAL: ThreadSanitizer: unexpected memory mapping 0x55d9f389d000-0x55d9f4111000

Am I missing something? Thanks,

		Emilio


diff --git a/configure b/configure
index ab54f3c..a8903fe 100755
--- a/configure
+++ b/configure
@@ -112,7 +112,7 @@ compile_object() {

 compile_prog() {
   local_cflags="$1"
-  local_ldflags="$2"
+  local_ldflags="$2 -ltsan"
   do_cc $QEMU_CFLAGS $local_cflags -o $TMPE $TMPC $LDFLAGS $local_ldflags
 }

@@ -174,12 +174,12 @@ audio_drv_list=""
 block_drv_rw_whitelist=""
 block_drv_ro_whitelist=""
 host_cc="cc"
-libs_softmmu=""
-libs_tools=""
+libs_softmmu="-ltsan"
+libs_tools="-ltsan"
 audio_pt_int=""
 audio_win_int=""
 cc_i386=i386-pc-linux-gnu-gcc
-libs_qga=""
+libs_qga="-ltsan"
 debug_info="yes"
 stack_protector=""

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics
  2016-05-24  7:08       ` Paolo Bonzini
@ 2016-05-24 19:56         ` Emilio G. Cota
  2016-05-24 19:59           ` Sergey Fedorov
  0 siblings, 1 reply; 14+ messages in thread
From: Emilio G. Cota @ 2016-05-24 19:56 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Richard Henderson, MTTCG Devel, Alex Bennée,
	QEMU Developers, Sergey Fedorov

On Tue, May 24, 2016 at 09:08:01 +0200, Paolo Bonzini wrote:
> On 23/05/2016 19:09, Emilio G. Cota wrote:
> > PS. And really equating smp_wmb/rmb to release/acquire as we have under
> > #ifdef __ATOMIC is hard to justify, other than to please tsan.
> 
> That only makes a difference on arm64, right?
> 
> 	acquire		release		rmb		wmb
> x86	--		--		--		--
> power	lwsync		lwsync		lwsync		lwsync
> armv7	dmb		dmb		dmb		dmb
> arm64	dmb ishld	dmb ish		dmb ishld	dmb ishst
> ia64	--		--		--		--

Yes. I now see why we're defining rmb/wmb based on acquire/release:
it's quite convenient given that the compiler provides them, and
the (tiny) differences in practice are not worth the trouble of
adding asm for them. So I take back my comment =)

The gains of getting rid of the consume barrier from atomic_rcu_read
are clear though; updated patch to follow.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics
  2016-05-24 19:56         ` Emilio G. Cota
@ 2016-05-24 19:59           ` Sergey Fedorov
  2016-05-25  8:52             ` Alex Bennée
  0 siblings, 1 reply; 14+ messages in thread
From: Sergey Fedorov @ 2016-05-24 19:59 UTC (permalink / raw)
  To: Emilio G. Cota, Paolo Bonzini
  Cc: Richard Henderson, MTTCG Devel, Alex Bennée, QEMU Developers

On 24/05/16 22:56, Emilio G. Cota wrote:
> On Tue, May 24, 2016 at 09:08:01 +0200, Paolo Bonzini wrote:
>> On 23/05/2016 19:09, Emilio G. Cota wrote:
>>> PS. And really equating smp_wmb/rmb to release/acquire as we have under
>>> #ifdef __ATOMIC is hard to justify, other than to please tsan.
>> That only makes a difference on arm64, right?
>>
>> 	acquire		release		rmb		wmb
>> x86	--		--		--		--
>> power	lwsync		lwsync		lwsync		lwsync
>> armv7	dmb		dmb		dmb		dmb
>> arm64	dmb ishld	dmb ish		dmb ishld	dmb ishst
>> ia64	--		--		--		--
> Yes. I now see why we're defining rmb/wmb based on acquire/release:
> it's quite convenient given that the compiler provides them, and
> the (tiny) differences in practice are not worth the trouble of
> adding asm for them. So I take back my comment =)
>
> The gains of getting rid of the consume barrier from atomic_rcu_read
> are clear though; updated patch to follow.

However, maybe it's not such a pain to maintain an optimized version for
AArch64 in assembly :P

Best,
Sergey

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics
  2016-05-24 19:59           ` Sergey Fedorov
@ 2016-05-25  8:52             ` Alex Bennée
  2016-05-25 11:02               ` Sergey Fedorov
  0 siblings, 1 reply; 14+ messages in thread
From: Alex Bennée @ 2016-05-25  8:52 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: Emilio G. Cota, Paolo Bonzini, Richard Henderson, MTTCG Devel,
	QEMU Developers


Sergey Fedorov <serge.fdrv@gmail.com> writes:

> On 24/05/16 22:56, Emilio G. Cota wrote:
>> On Tue, May 24, 2016 at 09:08:01 +0200, Paolo Bonzini wrote:
>>> On 23/05/2016 19:09, Emilio G. Cota wrote:
>>>> PS. And really equating smp_wmb/rmb to release/acquire as we have under
>>>> #ifdef __ATOMIC is hard to justify, other than to please tsan.
>>> That only makes a difference on arm64, right?
>>>
>>> 	acquire		release		rmb		wmb
>>> x86	--		--		--		--
>>> power	lwsync		lwsync		lwsync		lwsync
>>> armv7	dmb		dmb		dmb		dmb
>>> arm64	dmb ishld	dmb ish		dmb ishld	dmb ishst
>>> ia64	--		--		--		--
>> Yes. I now see why we're defining rmb/wmb based on acquire/release:
>> it's quite convenient given that the compiler provides them, and
>> the (tiny) differences in practice are not worth the trouble of
>> adding asm for them. So I take back my comment =)
>>
>> The gains of getting rid of the consume barrier from atomic_rcu_read
>> are clear though; updated patch to follow.
>
> However, maybe it's not such a pain to maintain an optimized version for
> AArch64 in assembly :P

Please don't. The advantage of the builtins is they are known by things
like tsan.

>
> Best,
> Sergey


--
Alex Bennée

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics
  2016-05-25  8:52             ` Alex Bennée
@ 2016-05-25 11:02               ` Sergey Fedorov
  0 siblings, 0 replies; 14+ messages in thread
From: Sergey Fedorov @ 2016-05-25 11:02 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Emilio G. Cota, Paolo Bonzini, Richard Henderson, MTTCG Devel,
	QEMU Developers

On 25/05/16 11:52, Alex Bennée wrote:
> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>
>> On 24/05/16 22:56, Emilio G. Cota wrote:
>>> On Tue, May 24, 2016 at 09:08:01 +0200, Paolo Bonzini wrote:
>>>> On 23/05/2016 19:09, Emilio G. Cota wrote:
>>>>> PS. And really equating smp_wmb/rmb to release/acquire as we have under
>>>>> #ifdef __ATOMIC is hard to justify, other than to please tsan.
>>>> That only makes a difference on arm64, right?
>>>>
>>>> 	acquire		release		rmb		wmb
>>>> x86	--		--		--		--
>>>> power	lwsync		lwsync		lwsync		lwsync
>>>> armv7	dmb		dmb		dmb		dmb
>>>> arm64	dmb ishld	dmb ish		dmb ishld	dmb ishst
>>>> ia64	--		--		--		--
>>> Yes. I now see why we're defining rmb/wmb based on acquire/release:
>>> it's quite convenient given that the compiler provides them, and
>>> the (tiny) differences in practice are not worth the trouble of
>>> adding asm for them. So I take back my comment =)
>>>
>>> The gains of getting rid of the consume barrier from atomic_rcu_read
>>> are clear though; updated patch to follow.
>> However, maybe it's not such a pain to maintain an optimized version for
>> AArch64 in assembly :P
> Please don't. The advantage of the builtins is they are known by things
> like tsan.
>

We can always do:

    #if defined(__aarch64__) && !defined(__SANITIZE_THREAD__)
    /* AArch64 asm variant */
    #else
    /* GCC __atomic variant */
    #endif


Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2016-05-25 11:02 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-21 20:42 [Qemu-devel] [PATCH 0/2] atomics: fix small RCU perf. regression + update documentation Emilio G. Cota
2016-05-21 20:42 ` [Qemu-devel] [PATCH 1/2] atomics: do not use __atomic primitives for RCU atomics Emilio G. Cota
2016-05-22  7:58   ` Alex Bennée
2016-05-24 18:42     ` Emilio G. Cota
2016-05-23 14:21   ` Paolo Bonzini
2016-05-23 15:55     ` Emilio G. Cota
2016-05-23 16:53   ` Richard Henderson
2016-05-23 17:09     ` Emilio G. Cota
2016-05-24  7:08       ` Paolo Bonzini
2016-05-24 19:56         ` Emilio G. Cota
2016-05-24 19:59           ` Sergey Fedorov
2016-05-25  8:52             ` Alex Bennée
2016-05-25 11:02               ` Sergey Fedorov
2016-05-21 20:42 ` [Qemu-devel] [PATCH 2/2] docs/atomics: update atomic_read/set comparison with Linux Emilio G. Cota

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.