* [Qemu-devel] [PATCH v5 00/18] tb hash improvements
@ 2016-05-14  3:34 Emilio G. Cota
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 01/18] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
                   ` (18 more replies)
  0 siblings, 19 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

This patchset applies on top of tcg-next (8b1fe3f4 "cpu-exec:
Clean up 'interrupt_request' reloading", tagged "pull-tcg-20160512").

For reference, here is v4:
  https://lists.gnu.org/archive/html/qemu-devel/2016-04/msg04670.html

Changes from v4:

- atomics.h:
  + Add atomic_read_acquire and atomic_set_release
  + Rename atomic_test_and_set to atomic_test_and_set_acquire
  [ Richard: I removed your reviewed-by ]

- qemu_spin @ thread.h:
  + add bool qemu_spin_locked() to check whether the lock is taken.
  + Use newly-added acquire/release atomic ops. This is clearer and
    improves performance; for instance, now we don't emit an
    unnecessary smp_mb() thanks to using atomic_set_release()
    instead of atomic_mb_set(). Also, note that __sync_lock_test_and_set
    has acquire semantics, so it makes sense to have an
    atomic_test_and_set_acquire that directly calls it, instead
    of calling atomic_xchg, which emits a full barrier (that we don't
    need) before __sync_lock_test_and_set.
  [ Richard: I removed your reviewed-by ]

- tests:
  + add parallel benchmark (qht-bench). Some perf numbers in
    the commit message, comparing QHT vs. CLHT and ck_hs.

  + invoke qht-bench from `make check` with test-qht-par. It
    uses system(3); I couldn't find a way to detect from qht-bench
    when it is run from gtester, so I decided to just add a silly
    program to invoke it.

- trivial: util/Makefile.objs: add qdist.o and qht.o each on a
           separate line

- trivial: added copyright header to test programs

- trivial: updated phys_pc, pc, flags commit message with Richard's
           comment that hashing cs_base probably isn't worth it.

- qht:
  + Document that duplicate pointer values cannot be inserted.
  + qht_insert: return true/false upon success/failure, just like
                qht_remove. This can help find bugs.
  + qht_remove: only write to seqlock if the removal happens --
                otherwise the write is unnecessary, since nothing
		is written to the bucket.
  + trivial: s/n_items/n_entries/ for consistency.
  + qht_grow: replace it with qht_resize. This is mostly useful
              for testing.
  + resize: do not track qht_map->n_entries; track instead the
            number of non-head buckets added.
	    This improves scalability, since we only increment
	    this number (with the relatively expensive atomic_inc)
	    every time a new non-head bucket is allocated, instead
	    of every time an entry is added/removed.
    * return bool from qht_resize and qht_reset_size; they return
      false if the resize was not needed (i.e. if the previous size
      was the requested size).
  + qht_lookup: do not check for !NULL entries; check directly
                for a hash match.
		This gives a ~2% perf. increase during
		benchmarking. The buckets in the microbenchmarks
		are equally sized and well distributed, which is
		approximately the case in QEMU thanks to xxhash
		and resizing.
  + Remove MRU bucket promotion policy. With automatic resizing,
    this is not needed. Furthermore, removing it saves code.
  + qht_lookup: add a fast path that avoids the do {} while (seqlock)
                retry loop (see the sketch after this list). This
                gives a 4% perf. improvement on a read-only benchmark.
  + struct qht_bucket: document the struct
  + rename qht_lock() to qht_map_lock_buckets()
  + add map__atomic_mb and bucket_next__atomic_mb helpers that
    include the necessary atomic_read() and rmb().

  [ All the above changes for qht are simple enough that I kept
    Richard's reviewed-by.]

  + Support concurrent writes to separate buckets. This is in an
    additional patch to ease reviewing; feel free to squash it on
    top of the QHT patch.
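
For clarity, here is a sketch of the qht_lookup fast path mentioned
above. The names (struct bucket, its 'sequence' seqlock field, and the
bucket_lookup helper) are hypothetical and not actual QHT code; the
point is only to show how the common case skips the seqlock retry loop:

  static void *lookup(struct bucket *b, uint32_t hash)
  {
      unsigned version = seqlock_read_begin(&b->sequence);
      void *p = bucket_lookup(b, hash);   /* hypothetical helper */

      if (likely(!seqlock_read_retry(&b->sequence, version))) {
          return p;   /* fast path: no retry loop entered */
      }
      /* slow path: loop until we read a consistent snapshot */
      do {
          version = seqlock_read_begin(&b->sequence);
          p = bucket_lookup(b, hash);
      } while (seqlock_read_retry(&b->sequence, version));
      return p;
  }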

Thanks,

		Emilio


* [Qemu-devel] [PATCH v5 01/18] compiler.h: add QEMU_ALIGNED() to enforce struct alignment
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 02/18] seqlock: remove optional mutex Emilio G. Cota
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/compiler.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
index 8f1cc7b..b64f899 100644
--- a/include/qemu/compiler.h
+++ b/include/qemu/compiler.h
@@ -41,6 +41,8 @@
 # define QEMU_PACKED __attribute__((packed))
 #endif
 
+#define QEMU_ALIGNED(X) __attribute__((aligned(X)))
+
 #ifndef glue
 #define xglue(x, y) x ## y
 #define glue(x, y) xglue(x, y)
-- 
2.5.0
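
For illustration, a minimal usage sketch of the new macro (hypothetical
struct, not part of this patch): aligning a type to a cache line, e.g.
to avoid false sharing between per-thread counters:

  struct CacheCounter {
      unsigned long count;
  } QEMU_ALIGNED(64);   /* instances are aligned to a 64-byte boundary */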


* [Qemu-devel] [PATCH v5 02/18] seqlock: remove optional mutex
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 01/18] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 03/18] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

This option is unused; besides, it bloats the struct when not needed.
Let's just let writers define their own locks elsewhere.
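
As an illustration (hypothetical names, not part of this patch), a
writer now pairs its own lock with the seqlock instead of embedding it:

  static QemuMutex lock;        /* serializes writers; qemu_mutex_init() */
  static QemuSeqLock sl;        /* init with seqlock_init(&sl) */
  static int64_t shared_value;  /* protected by sl */

  static void update_value(int64_t v)
  {
      qemu_mutex_lock(&lock);
      seqlock_write_lock(&sl);
      shared_value = v;
      seqlock_write_unlock(&sl);
      qemu_mutex_unlock(&lock);
  }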

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpus.c                 |  2 +-
 include/qemu/seqlock.h | 10 +---------
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/cpus.c b/cpus.c
index cbeb1f6..dd86da5 100644
--- a/cpus.c
+++ b/cpus.c
@@ -619,7 +619,7 @@ int cpu_throttle_get_percentage(void)
 
 void cpu_ticks_init(void)
 {
-    seqlock_init(&timers_state.vm_clock_seqlock, NULL);
+    seqlock_init(&timers_state.vm_clock_seqlock);
     vmstate_register(NULL, 0, &vmstate_timers, &timers_state);
     throttle_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL_RT,
                                            cpu_throttle_timer_tick, NULL);
diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index 70b01fd..e673482 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -19,22 +19,17 @@
 typedef struct QemuSeqLock QemuSeqLock;
 
 struct QemuSeqLock {
-    QemuMutex *mutex;
     unsigned sequence;
 };
 
-static inline void seqlock_init(QemuSeqLock *sl, QemuMutex *mutex)
+static inline void seqlock_init(QemuSeqLock *sl)
 {
-    sl->mutex = mutex;
     sl->sequence = 0;
 }
 
 /* Lock out other writers and update the count.  */
 static inline void seqlock_write_lock(QemuSeqLock *sl)
 {
-    if (sl->mutex) {
-        qemu_mutex_lock(sl->mutex);
-    }
     ++sl->sequence;
 
     /* Write sequence before updating other fields.  */
@@ -47,9 +42,6 @@ static inline void seqlock_write_unlock(QemuSeqLock *sl)
     smp_wmb();
 
     ++sl->sequence;
-    if (sl->mutex) {
-        qemu_mutex_unlock(sl->mutex);
-    }
 }
 
 static inline unsigned seqlock_read_begin(QemuSeqLock *sl)
-- 
2.5.0


* [Qemu-devel] [PATCH v5 03/18] seqlock: rename write_lock/unlock to write_begin/end
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 01/18] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 02/18] seqlock: remove optional mutex Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 04/18] include/processor.h: define cpu_relax() Emilio G. Cota
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

It is a more appropriate name, now that the mutex embedded
in the seqlock is gone.
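
For context, here is the matching reader side (using the hypothetical
'sl' seqlock and 'shared_value' from the previous patch's sketch, not
part of this patch); readers never lock anything, which is also why
begin/end describes the write side better than lock/unlock:

  static int64_t read_value(void)
  {
      unsigned start;
      int64_t v;

      do {
          start = seqlock_read_begin(&sl);
          v = shared_value;
      } while (seqlock_read_retry(&sl, start));
      return v;
  }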

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpus.c                 | 28 ++++++++++++++--------------
 include/qemu/seqlock.h |  4 ++--
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/cpus.c b/cpus.c
index dd86da5..735c9b2 100644
--- a/cpus.c
+++ b/cpus.c
@@ -247,13 +247,13 @@ int64_t cpu_get_clock(void)
 void cpu_enable_ticks(void)
 {
     /* Here, the really thing protected by seqlock is cpu_clock_offset. */
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     if (!timers_state.cpu_ticks_enabled) {
         timers_state.cpu_ticks_offset -= cpu_get_host_ticks();
         timers_state.cpu_clock_offset -= get_clock();
         timers_state.cpu_ticks_enabled = 1;
     }
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 }
 
 /* disable cpu_get_ticks() : the clock is stopped. You must not call
@@ -263,13 +263,13 @@ void cpu_enable_ticks(void)
 void cpu_disable_ticks(void)
 {
     /* Here, the really thing protected by seqlock is cpu_clock_offset. */
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     if (timers_state.cpu_ticks_enabled) {
         timers_state.cpu_ticks_offset += cpu_get_host_ticks();
         timers_state.cpu_clock_offset = cpu_get_clock_locked();
         timers_state.cpu_ticks_enabled = 0;
     }
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 }
 
 /* Correlation between real and virtual time is always going to be
@@ -292,7 +292,7 @@ static void icount_adjust(void)
         return;
     }
 
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     cur_time = cpu_get_clock_locked();
     cur_icount = cpu_get_icount_locked();
 
@@ -313,7 +313,7 @@ static void icount_adjust(void)
     last_delta = delta;
     timers_state.qemu_icount_bias = cur_icount
                               - (timers_state.qemu_icount << icount_time_shift);
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 }
 
 static void icount_adjust_rt(void *opaque)
@@ -353,7 +353,7 @@ static void icount_warp_rt(void)
         return;
     }
 
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     if (runstate_is_running()) {
         int64_t clock = REPLAY_CLOCK(REPLAY_CLOCK_VIRTUAL_RT,
                                      cpu_get_clock_locked());
@@ -372,7 +372,7 @@ static void icount_warp_rt(void)
         timers_state.qemu_icount_bias += warp_delta;
     }
     vm_clock_warp_start = -1;
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 
     if (qemu_clock_expired(QEMU_CLOCK_VIRTUAL)) {
         qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
@@ -397,9 +397,9 @@ void qtest_clock_warp(int64_t dest)
         int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
         int64_t warp = qemu_soonest_timeout(dest - clock, deadline);
 
-        seqlock_write_lock(&timers_state.vm_clock_seqlock);
+        seqlock_write_begin(&timers_state.vm_clock_seqlock);
         timers_state.qemu_icount_bias += warp;
-        seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+        seqlock_write_end(&timers_state.vm_clock_seqlock);
 
         qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
         timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]);
@@ -466,9 +466,9 @@ void qemu_start_warp_timer(void)
              * It is useful when we want a deterministic execution time,
              * isolated from host latencies.
              */
-            seqlock_write_lock(&timers_state.vm_clock_seqlock);
+            seqlock_write_begin(&timers_state.vm_clock_seqlock);
             timers_state.qemu_icount_bias += deadline;
-            seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+            seqlock_write_end(&timers_state.vm_clock_seqlock);
             qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
         } else {
             /*
@@ -479,11 +479,11 @@ void qemu_start_warp_timer(void)
              * you will not be sending network packets continuously instead of
              * every 100ms.
              */
-            seqlock_write_lock(&timers_state.vm_clock_seqlock);
+            seqlock_write_begin(&timers_state.vm_clock_seqlock);
             if (vm_clock_warp_start == -1 || vm_clock_warp_start > clock) {
                 vm_clock_warp_start = clock;
             }
-            seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+            seqlock_write_end(&timers_state.vm_clock_seqlock);
             timer_mod_anticipate(icount_warp_timer, clock + deadline);
         }
     } else if (deadline == 0) {
diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index e673482..4dfc055 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -28,7 +28,7 @@ static inline void seqlock_init(QemuSeqLock *sl)
 }
 
 /* Lock out other writers and update the count.  */
-static inline void seqlock_write_lock(QemuSeqLock *sl)
+static inline void seqlock_write_begin(QemuSeqLock *sl)
 {
     ++sl->sequence;
 
@@ -36,7 +36,7 @@ static inline void seqlock_write_lock(QemuSeqLock *sl)
     smp_wmb();
 }
 
-static inline void seqlock_write_unlock(QemuSeqLock *sl)
+static inline void seqlock_write_end(QemuSeqLock *sl)
 {
     /* Write other fields before finalizing sequence.  */
     smp_wmb();
-- 
2.5.0


* [Qemu-devel] [PATCH v5 04/18] include/processor.h: define cpu_relax()
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (2 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 03/18] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-18 17:47   ` Sergey Fedorov
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 05/18] atomics: add atomic_test_and_set_acquire Emilio G. Cota
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

Taken from the linux kernel.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/processor.h | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)
 create mode 100644 include/qemu/processor.h

diff --git a/include/qemu/processor.h b/include/qemu/processor.h
new file mode 100644
index 0000000..4e6a71f
--- /dev/null
+++ b/include/qemu/processor.h
@@ -0,0 +1,34 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2.
+ *   See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_PROCESSOR_H
+#define QEMU_PROCESSOR_H
+
+#include "qemu/atomic.h"
+
+#if defined(__i386__) || defined(__x86_64__)
+#define cpu_relax() asm volatile("rep; nop" ::: "memory")
+#endif
+
+#ifdef __ia64__
+#define cpu_relax() asm volatile("hint @pause" ::: "memory")
+#endif
+
+#ifdef __aarch64__
+#define cpu_relax() asm volatile("yield" ::: "memory")
+#endif
+
+#if defined(__powerpc64__)
+/* set Hardware Multi-Threading (HMT) priority to low; then back to medium */
+#define cpu_relax() asm volatile("or 1, 1, 1;" \
+                                 "or 2, 2, 2;" ::: "memory")
+#endif
+
+#ifndef cpu_relax
+#define cpu_relax() barrier()
+#endif
+
+#endif /* QEMU_PROCESSOR_H */
-- 
2.5.0


* [Qemu-devel] [PATCH v5 05/18] atomics: add atomic_test_and_set_acquire
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (3 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 04/18] include/processor.h: define cpu_relax() Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-16 10:05   ` Paolo Bonzini
  2016-05-17 16:15   ` Sergey Fedorov
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 06/18] atomics: add atomic_read_acquire and atomic_set_release Emilio G. Cota
                   ` (13 subsequent siblings)
  18 siblings, 2 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

This new helper expands to __atomic_test_and_set with acquire semantics
where available; otherwise it expands to __sync_lock_test_and_set, which
also has acquire semantics.
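
As a usage illustration (hypothetical names, not part of this patch):
claiming a one-shot flag, where the acquire barrier keeps later accesses
from being reordered before the test-and-set:

  static bool flag;

  static bool try_claim(void)
  {
      /* returns true only for the first caller that sets the flag */
      return !atomic_test_and_set_acquire(&flag);
  }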

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/atomic.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 5bc4d6c..6061a46 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -113,6 +113,7 @@
 } while(0)
 #endif
 
+#define atomic_test_and_set_acquire(ptr) __atomic_test_and_set(ptr, __ATOMIC_ACQUIRE)
 
 /* All the remaining operations are fully sequentially consistent */
 
@@ -327,6 +328,8 @@
 #endif
 #endif
 
+#define atomic_test_and_set_acquire(ptr) __sync_lock_test_and_set(ptr, true)
+
 /* Provide shorter names for GCC atomic builtins.  */
 #define atomic_fetch_inc(ptr)  __sync_fetch_and_add(ptr, 1)
 #define atomic_fetch_dec(ptr)  __sync_fetch_and_add(ptr, -1)
-- 
2.5.0


* [Qemu-devel] [PATCH v5 06/18] atomics: add atomic_read_acquire and atomic_set_release
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (4 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 05/18] atomics: add atomic_test_and_set_acquire Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-15 10:22   ` Pranith Kumar
  2016-05-17 16:53   ` Sergey Fedorov
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
                   ` (12 subsequent siblings)
  18 siblings, 2 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

When __atomic is not available, we use full memory barriers (smp_mb)
instead of smp_rmb/smp_wmb, since acquire and release barriers order
all memory operations, not just loads or stores, respectively.
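
As an illustration of the intended pairing (hypothetical names, not part
of this patch), a release store publishes data that an acquire load then
safely observes:

  static int payload;
  static int ready;

  static void publish(int v)
  {
      payload = v;                    /* plain store */
      atomic_set_release(&ready, 1);  /* orders the store above before it */
  }

  static bool consume(int *out)
  {
      if (atomic_read_acquire(&ready)) {  /* orders the load below after it */
          *out = payload;
          return true;
      }
      return false;
  }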

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/atomic.h | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 6061a46..1766c22 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -56,6 +56,21 @@
     __atomic_store(ptr, &_val, __ATOMIC_RELAXED);     \
 } while(0)
 
+/* atomic read/set with acquire/release barrier */
+#define atomic_read_acquire(ptr)                      \
+    ({                                                \
+    QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *)); \
+    typeof(*ptr) _val;                                \
+    __atomic_load(ptr, &_val, __ATOMIC_ACQUIRE);      \
+    _val;                                             \
+    })
+
+#define atomic_set_release(ptr, i)  do {              \
+    QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *)); \
+    typeof(*ptr) _val = (i);                          \
+    __atomic_store(ptr, &_val, __ATOMIC_RELEASE);     \
+} while(0)
+
 /* Atomic RCU operations imply weak memory barriers */
 
 #define atomic_rcu_read(ptr)                          \
@@ -243,6 +258,18 @@
 #define atomic_read(ptr)       (*(__typeof__(*ptr) volatile*) (ptr))
 #define atomic_set(ptr, i)     ((*(__typeof__(*ptr) volatile*) (ptr)) = (i))
 
+/* atomic read/set with acquire/release barrier */
+#define atomic_read_acquire(ptr)    ({            \
+    typeof(*ptr) _val = atomic_read(ptr);         \
+    smp_mb();                                     \
+    _val;                                         \
+})
+
+#define atomic_set_release(ptr, i)  do {          \
+    smp_mb();                                     \
+    atomic_set(ptr, i);                           \
+} while (0)
+
 /**
  * atomic_rcu_read - reads a RCU-protected pointer to a local variable
  * into a RCU read-side critical section. The pointer can later be safely
-- 
2.5.0


* [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (5 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 06/18] atomics: add atomic_read_acquire and atomic_set_release Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
       [not found]   ` <573B5134.8060104@gmail.com>
                     ` (2 more replies)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 08/18] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
                   ` (11 subsequent siblings)
  18 siblings, 3 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

From: Guillaume Delbergue <guillaume.delbergue@greensocs.com>

Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
[Rewritten. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Emilio's additions: use TAS instead of atomic_xchg; emit acquire/release
 barriers; call cpu_relax() while spinning; optimize for uncontended locks by
 acquiring the lock with TAS instead of TATAS; add qemu_spin_locked().]
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/thread.h | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index bdae6df..4b74ee5 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -1,6 +1,9 @@
 #ifndef __QEMU_THREAD_H
 #define __QEMU_THREAD_H 1
 
+#include <errno.h>
+#include "qemu/processor.h"
+#include "qemu/atomic.h"
 
 typedef struct QemuMutex QemuMutex;
 typedef struct QemuCond QemuCond;
@@ -60,4 +63,40 @@ struct Notifier;
 void qemu_thread_atexit_add(struct Notifier *notifier);
 void qemu_thread_atexit_remove(struct Notifier *notifier);
 
+typedef struct QemuSpin {
+    int value;
+} QemuSpin;
+
+static inline void qemu_spin_init(QemuSpin *spin)
+{
+    atomic_set_release(&spin->value, 0);
+}
+
+static inline void qemu_spin_lock(QemuSpin *spin)
+{
+    while (atomic_test_and_set_acquire(&spin->value)) {
+        while (atomic_read(&spin->value)) {
+            cpu_relax();
+        }
+    }
+}
+
+static inline int qemu_spin_trylock(QemuSpin *spin)
+{
+    if (atomic_test_and_set_acquire(&spin->value)) {
+        return -EBUSY;
+    }
+    return 0;
+}
+
+static inline bool qemu_spin_locked(QemuSpin *spin)
+{
+    return atomic_read_acquire(&spin->value);
+}
+
+static inline void qemu_spin_unlock(QemuSpin *spin)
+{
+    atomic_set_release(&spin->value, 0);
+}
+
 #endif
-- 
2.5.0
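
For illustration, a minimal usage sketch of the new spinlock (hypothetical
names, not part of this patch):

  static QemuSpin lock;          /* call qemu_spin_init(&lock) once */
  static unsigned long counter;  /* protected by lock */

  static void counter_inc(void)
  {
      qemu_spin_lock(&lock);
      counter++;
      qemu_spin_unlock(&lock);
  }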


* [Qemu-devel] [PATCH v5 08/18] exec: add tb_hash_func5, derived from xxhash
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (6 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-17 17:22   ` Sergey Fedorov
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 09/18] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

This will be used by upcoming changes for computing the TB hash.

Add it in a separate file so that we can include the copyright notice
from xxhash alongside the derived code.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/tb-hash-xx.h | 94 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)
 create mode 100644 include/exec/tb-hash-xx.h

diff --git a/include/exec/tb-hash-xx.h b/include/exec/tb-hash-xx.h
new file mode 100644
index 0000000..67f4e6f
--- /dev/null
+++ b/include/exec/tb-hash-xx.h
@@ -0,0 +1,94 @@
+/*
+ * xxHash - Fast Hash algorithm
+ * Copyright (C) 2012-2016, Yann Collet
+ *
+ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * + Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * + Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * You can contact the author at :
+ * - xxHash source repository : https://github.com/Cyan4973/xxHash
+ */
+#ifndef EXEC_TB_HASH_XX
+#define EXEC_TB_HASH_XX
+
+#include <qemu/bitops.h>
+
+#define PRIME32_1   2654435761U
+#define PRIME32_2   2246822519U
+#define PRIME32_3   3266489917U
+#define PRIME32_4    668265263U
+#define PRIME32_5    374761393U
+
+#define TB_HASH_XX_SEED 1
+
+/*
+ * xxhash32, customized for input variables that are not guaranteed to be
+ * contiguous in memory.
+ */
+static inline
+uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e)
+{
+    uint32_t v1 = TB_HASH_XX_SEED + PRIME32_1 + PRIME32_2;
+    uint32_t v2 = TB_HASH_XX_SEED + PRIME32_2;
+    uint32_t v3 = TB_HASH_XX_SEED + 0;
+    uint32_t v4 = TB_HASH_XX_SEED - PRIME32_1;
+    uint32_t a = a0 >> 31 >> 1;
+    uint32_t b = a0;
+    uint32_t c = b0 >> 31 >> 1;
+    uint32_t d = b0;
+    uint32_t h32;
+
+    v1 += a * PRIME32_2;
+    v1 = rol32(v1, 13);
+    v1 *= PRIME32_1;
+
+    v2 += b * PRIME32_2;
+    v2 = rol32(v2, 13);
+    v2 *= PRIME32_1;
+
+    v3 += c * PRIME32_2;
+    v3 = rol32(v3, 13);
+    v3 *= PRIME32_1;
+
+    v4 += d * PRIME32_2;
+    v4 = rol32(v4, 13);
+    v4 *= PRIME32_1;
+
+    h32 = rol32(v1, 1) + rol32(v2, 7) + rol32(v3, 12) + rol32(v4, 18);
+    h32 += 20;
+
+    h32 += e * PRIME32_3;
+    h32  = rol32(h32, 17) * PRIME32_4;
+
+    h32 ^= h32 >> 15;
+    h32 *= PRIME32_2;
+    h32 ^= h32 >> 13;
+    h32 *= PRIME32_3;
+    h32 ^= h32 >> 16;
+
+    return h32;
+}
+
+#endif /* EXEC_TB_HASH_XX */
-- 
2.5.0


* [Qemu-devel] [PATCH v5 09/18] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (7 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 08/18] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-17 17:47   ` Sergey Fedorov
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 10/18] qdist: add module to represent frequency distributions of data Emilio G. Cota
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

For some workloads such as arm bootup, tb_phys_hash is performance-critical.
This is due to the high frequency of accesses to the hash table, caused
by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
More info:
  https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html

To dig further into this I modified an arm image booting debian jessie to
immediately shut down after boot. Analysis revealed that quite a bit of time
is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
results in very uneven loading of chains in the hash table's buckets;
the longest observed chain had ~550 elements.

The appended addresses this with two changes:

1) Use xxhash as the hash table's hash function. xxhash is a fast,
   high-quality hashing function.

2) Feed the hashing function with not just tb_phys, but also pc and flags.

This improves performance over using just tb_phys for hashing, since that
resulted in some hash buckets having many TBs while others got very few;
with these changes, the longest observed chain on a single hash bucket is
brought down from ~550 to ~40.

Tests show that the other element checked for in tb_find_physical,
cs_base, is always a match when tb_phys+pc+flags are a match,
so hashing cs_base is wasteful. It could be that this is an ARM-only
thing, though. UPDATE:
On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
> The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
> consisting of only a delay slot).
> It may well still turn out to be reasonable to ignore cs_base for hashing.

BTW, after this change the hash table should not be called "tb_phys_hash"
anymore; this is addressed later in this series.

This change gives consistent bootup time improvements. I tested two
host machines:
- Intel Xeon E5-2690: 11.6% less time
- Intel i7-4790K: 19.2% less time

Increasing the number of hash buckets yields further improvements. However,
using a larger, fixed number of buckets can degrade performance for other
workloads that do not translate as many blocks (600K+ for debian-jessie arm
bootup). This is dealt with later in this series.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c             |  4 ++--
 include/exec/tb-hash.h |  8 ++++++--
 translate-all.c        | 10 +++++-----
 3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 14df1aa..1735032 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -231,13 +231,13 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
 {
     CPUArchState *env = (CPUArchState *)cpu->env_ptr;
     TranslationBlock *tb, **tb_hash_head, **ptb1;
-    unsigned int h;
+    uint32_t h;
     tb_page_addr_t phys_pc, phys_page1;
 
     /* find translated block using physical mappings */
     phys_pc = get_page_addr_code(env, pc);
     phys_page1 = phys_pc & TARGET_PAGE_MASK;
-    h = tb_phys_hash_func(phys_pc);
+    h = tb_hash_func(phys_pc, pc, flags);
 
     /* Start at head of the hash entry */
     ptb1 = tb_hash_head = &tcg_ctx.tb_ctx.tb_phys_hash[h];
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 0f4e8a0..4b9635a 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -20,6 +20,9 @@
 #ifndef EXEC_TB_HASH
 #define EXEC_TB_HASH
 
+#include "exec/exec-all.h"
+#include "exec/tb-hash-xx.h"
+
 /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
    addresses on the same page.  The top bits are the same.  This allows
    TLB invalidation to quickly clear a subset of the hash table.  */
@@ -43,9 +46,10 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
            | (tmp & TB_JMP_ADDR_MASK));
 }
 
-static inline unsigned int tb_phys_hash_func(tb_page_addr_t pc)
+static inline
+uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
 {
-    return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
+    return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
 }
 
 #endif
diff --git a/translate-all.c b/translate-all.c
index b54f472..c48fccb 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -991,12 +991,12 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 {
     CPUState *cpu;
     PageDesc *p;
-    unsigned int h;
+    uint32_t h;
     tb_page_addr_t phys_pc;
 
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
-    h = tb_phys_hash_func(phys_pc);
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
     tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
 
     /* remove the TB from the page list */
@@ -1126,11 +1126,11 @@ static inline void tb_alloc_page(TranslationBlock *tb,
 static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                          tb_page_addr_t phys_page2)
 {
-    unsigned int h;
+    uint32_t h;
     TranslationBlock **ptb;
 
-    /* add in the physical hash table */
-    h = tb_phys_hash_func(phys_pc);
+    /* add in the hash table */
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
     ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
     tb->phys_hash_next = *ptb;
     *ptb = tb;
-- 
2.5.0


* [Qemu-devel] [PATCH v5 10/18] qdist: add module to represent frequency distributions of data
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (8 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 09/18] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 11/18] qdist: add test program Emilio G. Cota
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

Sometimes it is useful to have a quick histogram to represent a certain
distribution -- for example, when investigating a performance regression
in a hash table due to inadequate hashing.

The appended allows us to easily represent a distribution using Unicode
characters. Further, the data structure keeping track of the distribution
is so simple that obtaining its values for off-line processing is trivial.

Example, taking the last 10 commits to QEMU:

 Characters in commit title  Count
-----------------------------------
                         39      1
                         48      1
                         53      1
                         54      2
                         57      1
                         61      1
                         67      1
                         78      1
                         80      1
qdist_init(&dist);
qdist_inc(&dist, 39);
[...]
qdist_inc(&dist, 80);

char *str = qdist_pr(&dist, 9, QDIST_PR_LABELS);
// -> [39.0,43.6)▂▂ █▂ ▂ ▄[75.4,80.0]
g_free(str);

char *str = qdist_pr(&dist, 4, QDIST_PR_LABELS);
// -> [39.0,49.2)▁█▁▁[69.8,80.0]
g_free(str);

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/qdist.h |  62 +++++++++
 util/Makefile.objs   |   1 +
 util/qdist.c         | 386 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 449 insertions(+)
 create mode 100644 include/qemu/qdist.h
 create mode 100644 util/qdist.c

diff --git a/include/qemu/qdist.h b/include/qemu/qdist.h
new file mode 100644
index 0000000..6d8b701
--- /dev/null
+++ b/include/qemu/qdist.h
@@ -0,0 +1,62 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_QDIST_H
+#define QEMU_QDIST_H
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/bitops.h"
+
+/*
+ * Samples with the same 'x value' end up in the same qdist_entry,
+ * e.g. inc(0.1) and inc(0.1) end up as {x=0.1, count=2}.
+ *
+ * Binning happens only at print time, so that we retain the flexibility to
+ * choose the binning. This might not be ideal for workloads that do not care
+ * much about precision and insert many samples all with different x values;
+ * in that case, pre-binning (e.g. entering both 0.115 and 0.097 as 0.1)
+ * should be considered.
+ */
+struct qdist_entry {
+    double x;
+    unsigned long count;
+};
+
+struct qdist {
+    struct qdist_entry *entries;
+    size_t n;
+};
+
+#define QDIST_PR_BORDER     BIT(0)
+#define QDIST_PR_LABELS     BIT(1)
+/* the remaining options only work if PR_LABELS is set */
+#define QDIST_PR_NODECIMAL  BIT(2)
+#define QDIST_PR_PERCENT    BIT(3)
+#define QDIST_PR_100X       BIT(4)
+#define QDIST_PR_NOBINRANGE BIT(5)
+
+void qdist_init(struct qdist *dist);
+void qdist_destroy(struct qdist *dist);
+
+void qdist_add(struct qdist *dist, double x, long count);
+void qdist_inc(struct qdist *dist, double x);
+double qdist_xmin(const struct qdist *dist);
+double qdist_xmax(const struct qdist *dist);
+double qdist_avg(const struct qdist *dist);
+unsigned long qdist_sample_count(const struct qdist *dist);
+size_t qdist_unique_entries(const struct qdist *dist);
+
+/* callers must free the returned string with g_free() */
+char *qdist_pr_plain(const struct qdist *dist, size_t n_groups);
+
+/* callers must free the returned string with g_free() */
+char *qdist_pr(const struct qdist *dist, size_t n_groups, uint32_t opt);
+
+/* Only qdist code and test code should ever call this function */
+void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n);
+
+#endif /* QEMU_QDIST_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index a8a777e..702435e 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -32,3 +32,4 @@ util-obj-y += buffer.o
 util-obj-y += timed-average.o
 util-obj-y += base64.o
 util-obj-y += log.o
+util-obj-y += qdist.o
diff --git a/util/qdist.c b/util/qdist.c
new file mode 100644
index 0000000..3343640
--- /dev/null
+++ b/util/qdist.c
@@ -0,0 +1,386 @@
+/*
+ * qdist.c - QEMU helpers for handling frequency distributions of data.
+ *
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#include "qemu/qdist.h"
+
+#include <math.h>
+#ifndef NAN
+#define NAN (0.0 / 0.0)
+#endif
+
+void qdist_init(struct qdist *dist)
+{
+    dist->entries = NULL;
+    dist->n = 0;
+}
+
+void qdist_destroy(struct qdist *dist)
+{
+    g_free(dist->entries);
+}
+
+static inline int qdist_cmp_double(double a, double b)
+{
+    if (a > b) {
+        return 1;
+    } else if (a < b) {
+        return -1;
+    }
+    return 0;
+}
+
+static int qdist_cmp(const void *ap, const void *bp)
+{
+    const struct qdist_entry *a = ap;
+    const struct qdist_entry *b = bp;
+
+    return qdist_cmp_double(a->x, b->x);
+}
+
+void qdist_add(struct qdist *dist, double x, long count)
+{
+    struct qdist_entry *entry = NULL;
+
+    if (dist->entries) {
+        struct qdist_entry e;
+
+        e.x = x;
+        entry = bsearch(&e, dist->entries, dist->n, sizeof(e), qdist_cmp);
+    }
+
+    if (entry) {
+        entry->count += count;
+        return;
+    }
+
+    dist->entries = g_realloc(dist->entries,
+                              sizeof(*dist->entries) * (dist->n + 1));
+    dist->n++;
+    entry = &dist->entries[dist->n - 1];
+    entry->x = x;
+    entry->count = count;
+    qsort(dist->entries, dist->n, sizeof(*entry), qdist_cmp);
+}
+
+void qdist_inc(struct qdist *dist, double x)
+{
+    qdist_add(dist, x, 1);
+}
+
+/*
+ * Unicode for block elements. See:
+ *   https://en.wikipedia.org/wiki/Block_Elements
+ */
+static const gunichar qdist_blocks[] = {
+    0x2581,
+    0x2582,
+    0x2583,
+    0x2584,
+    0x2585,
+    0x2586,
+    0x2587,
+    0x2588
+};
+
+#define QDIST_NR_BLOCK_CODES ARRAY_SIZE(qdist_blocks)
+
+/*
+ * Print a distribution into a string.
+ *
+ * This function assumes that appropriate binning has been done on the input;
+ * see qdist_bin__internal() and qdist_pr_plain().
+ *
+ * Callers must free the returned string with g_free().
+ */
+static char *qdist_pr_internal(const struct qdist *dist)
+{
+    double min, max, step;
+    GString *s = g_string_new("");
+    size_t i;
+
+    /* if only one entry, its printout will be either full or empty */
+    if (dist->n == 1) {
+        if (dist->entries[0].count) {
+            g_string_append_unichar(s, qdist_blocks[QDIST_NR_BLOCK_CODES - 1]);
+        } else {
+            g_string_append_c(s, ' ');
+        }
+        goto out;
+    }
+
+    /* get min and max counts */
+    min = dist->entries[0].count;
+    max = min;
+    for (i = 0; i < dist->n; i++) {
+        struct qdist_entry *e = &dist->entries[i];
+
+        if (e->count < min) {
+            min = e->count;
+        }
+        if (e->count > max) {
+            max = e->count;
+        }
+    }
+
+    /* floor((count - min) * step) will give us the block index */
+    step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
+
+    for (i = 0; i < dist->n; i++) {
+        struct qdist_entry *e = &dist->entries[i];
+        int index;
+
+        /* make an exception with 0; instead of using block[0], print a space */
+        if (e->count) {
+            index = (int)((e->count - min) * step);
+            g_string_append_unichar(s, qdist_blocks[index]);
+        } else {
+            g_string_append_c(s, ' ');
+        }
+    }
+ out:
+    return g_string_free(s, FALSE);
+}
+
+/*
+ * Bin the distribution in @from into @n bins of consecutive, non-overlapping
+ * intervals, copying the result to @to.
+ *
+ * This function is internal to qdist: only this file and test code should
+ * ever call it.
+ *
+ * Note: calling this function on an already-binned qdist is a bug.
+ *
+ * If @n == 0 or @from->n == 1, use @from->n.
+ */
+void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
+{
+    double xmin, xmax;
+    double step;
+    size_t i, j, j_min;
+
+    qdist_init(to);
+
+    if (!from->entries) {
+        return;
+    }
+    if (!n || from->n == 1) {
+        n = from->n;
+    }
+
+    /* set equally-sized bins between @from's left and right */
+    xmin = qdist_xmin(from);
+    xmax = qdist_xmax(from);
+    step = (xmax - xmin) / n;
+
+    if (n == from->n) {
+        /* if @from's entries are equally spaced, no need to re-bin */
+        for (i = 0; i < from->n; i++) {
+            if (from->entries[i].x != xmin + i * step) {
+                goto rebin;
+            }
+        }
+        /* they're equally spaced, so copy the dist and bail out */
+        to->entries = g_malloc(sizeof(*to->entries) * from->n);
+        to->n = from->n;
+        memcpy(to->entries, from->entries, sizeof(*to->entries) * to->n);
+        return;
+    }
+
+ rebin:
+    j_min = 0;
+    for (i = 0; i < n; i++) {
+        double x;
+        double left, right;
+
+        left = xmin + i * step;
+        right = xmin + (i + 1) * step;
+
+        /* Add x, even if it might not get any counts later */
+        x = left;
+        qdist_add(to, x, 0);
+
+        /*
+         * To avoid double-counting we capture [left, right) ranges, except for
+         * the rightmost bin, which captures a [left, right] range.
+         */
+        for (j = j_min; j < from->n; j++) {
+            struct qdist_entry *o = &from->entries[j];
+
+            /* entries are ordered so do not check beyond right */
+            if (o->x > right) {
+                break;
+            }
+            if (o->x >= left && (o->x < right ||
+                                   (i == n - 1 && o->x == right))) {
+                qdist_add(to, x, o->count);
+                /* don't check this entry again */
+                j_min = j + 1;
+            }
+        }
+    }
+}
+
+/*
+ * Print @dist into a string, after re-binning it into @n bins of consecutive,
+ * non-overlapping intervals.
+ *
+ * If @n == 0, use @orig->n.
+ *
+ * Callers must free the returned string with g_free().
+ */
+char *qdist_pr_plain(const struct qdist *dist, size_t n)
+{
+    struct qdist binned;
+    char *ret;
+
+    if (!dist->entries) {
+        return NULL;
+    }
+    qdist_bin__internal(&binned, dist, n);
+    ret = qdist_pr_internal(&binned);
+    qdist_destroy(&binned);
+    return ret;
+}
+
+static char *qdist_pr_label(const struct qdist *dist, size_t n_bins,
+                            uint32_t opt, bool is_left)
+{
+    const char *percent;
+    const char *lparen;
+    const char *rparen;
+    GString *s;
+    double x1, x2, step;
+    double x;
+    double n;
+    int dec;
+
+    s = g_string_new("");
+    if (!(opt & QDIST_PR_LABELS)) {
+        goto out;
+    }
+
+    dec = opt & QDIST_PR_NODECIMAL ? 0 : 1;
+    percent = opt & QDIST_PR_PERCENT ? "%" : "";
+
+    n = n_bins ? n_bins : dist->n;
+    x = is_left ? qdist_xmin(dist) : qdist_xmax(dist);
+    step = (qdist_xmax(dist) - qdist_xmin(dist)) / n;
+
+    if (opt & QDIST_PR_100X) {
+        x *= 100.0;
+        step *= 100.0;
+    }
+    if (opt & QDIST_PR_NOBINRANGE) {
+        lparen = rparen = "";
+        x1 = x;
+        x2 = x; /* unnecessary, but a dumb compiler might not figure it out */
+    } else {
+        lparen = "[";
+        rparen = is_left ? ")" : "]";
+        if (is_left) {
+            x1 = x;
+            x2 = x + step;
+        } else {
+            x1 = x - step;
+            x2 = x;
+        }
+    }
+    g_string_append_printf(s, "%s%.*f", lparen, dec, x1);
+    if (!(opt & QDIST_PR_NOBINRANGE)) {
+        g_string_append_printf(s, ",%.*f%s", dec, x2, rparen);
+    }
+    g_string_append(s, percent);
+ out:
+    return g_string_free(s, FALSE);
+}
+
+/*
+ * Print the distribution's histogram into a string.
+ *
+ * See also: qdist_pr_plain().
+ *
+ * Callers must free the returned string with g_free().
+ */
+char *qdist_pr(const struct qdist *dist, size_t n_bins, uint32_t opt)
+{
+    const char *border = opt & QDIST_PR_BORDER ? "|" : "";
+    char *llabel, *rlabel;
+    char *hgram;
+    GString *s;
+
+    if (dist->entries == NULL) {
+        return NULL;
+    }
+
+    s = g_string_new("");
+
+    llabel = qdist_pr_label(dist, n_bins, opt, true);
+    rlabel = qdist_pr_label(dist, n_bins, opt, false);
+    hgram = qdist_pr_plain(dist, n_bins);
+    g_string_append_printf(s, "%s%s%s%s%s",
+                           llabel, border, hgram, border, rlabel);
+    g_free(llabel);
+    g_free(rlabel);
+    g_free(hgram);
+
+    return g_string_free(s, FALSE);
+}
+
+static inline double qdist_x(const struct qdist *dist, int index)
+{
+    if (dist->entries == NULL) {
+        return NAN;
+    }
+    return dist->entries[index].x;
+}
+
+double qdist_xmin(const struct qdist *dist)
+{
+    return qdist_x(dist, 0);
+}
+
+double qdist_xmax(const struct qdist *dist)
+{
+    return qdist_x(dist, dist->n - 1);
+}
+
+size_t qdist_unique_entries(const struct qdist *dist)
+{
+    return dist->n;
+}
+
+unsigned long qdist_sample_count(const struct qdist *dist)
+{
+    unsigned long count = 0;
+    size_t i;
+
+    for (i = 0; i < dist->n; i++) {
+        struct qdist_entry *e = &dist->entries[i];
+
+        count += e->count;
+    }
+    return count;
+}
+
+double qdist_avg(const struct qdist *dist)
+{
+    unsigned long count;
+    size_t i;
+    double ret = 0;
+
+    count = qdist_sample_count(dist);
+    if (!count) {
+        return NAN;
+    }
+    for (i = 0; i < dist->n; i++) {
+        struct qdist_entry *e = &dist->entries[i];
+
+        ret += e->x * e->count / count;
+    }
+    return ret;
+}
-- 
2.5.0


* [Qemu-devel] [PATCH v5 11/18] qdist: add test program
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (9 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 10/18] qdist: add module to represent frequency distributions of data Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 12/18] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tests/.gitignore   |   1 +
 tests/Makefile     |   6 +-
 tests/test-qdist.c | 369 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 375 insertions(+), 1 deletion(-)
 create mode 100644 tests/test-qdist.c

diff --git a/tests/.gitignore b/tests/.gitignore
index a06a8ba..7c0d156 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -48,6 +48,7 @@ test-qapi-types.[ch]
 test-qapi-visit.[ch]
 test-qdev-global-props
 test-qemu-opts
+test-qdist
 test-qga
 test-qmp-commands
 test-qmp-commands.h
diff --git a/tests/Makefile b/tests/Makefile
index 9dddde6..a5af20b 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -70,6 +70,8 @@ check-unit-y += tests/rcutorture$(EXESUF)
 gcov-files-rcutorture-y = util/rcu.c
 check-unit-y += tests/test-rcu-list$(EXESUF)
 gcov-files-test-rcu-list-y = util/rcu.c
+check-unit-y += tests/test-qdist$(EXESUF)
+gcov-files-test-qdist-y = util/qdist.c
 check-unit-y += tests/test-bitops$(EXESUF)
 check-unit-$(CONFIG_HAS_GLIB_SUBPROCESS_TESTS) += tests/test-qdev-global-props$(EXESUF)
 check-unit-y += tests/check-qom-interface$(EXESUF)
@@ -392,7 +394,8 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
 	tests/test-qmp-commands.o tests/test-visitor-serialization.o \
 	tests/test-x86-cpuid.o tests/test-mul64.o tests/test-int128.o \
 	tests/test-opts-visitor.o tests/test-qmp-event.o \
-	tests/rcutorture.o tests/test-rcu-list.o
+	tests/rcutorture.o tests/test-rcu-list.o \
+	tests/test-qdist.o
 
 $(test-obj-y): QEMU_INCLUDES += -Itests
 QEMU_CFLAGS += -I$(SRC_PATH)/tests
@@ -431,6 +434,7 @@ tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o
 tests/test-int128$(EXESUF): tests/test-int128.o
 tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
 tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
+tests/test-qdist$(EXESUF): tests/test-qdist.o $(test-util-obj-y)
 
 tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
 	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
diff --git a/tests/test-qdist.c b/tests/test-qdist.c
new file mode 100644
index 0000000..7625a57
--- /dev/null
+++ b/tests/test-qdist.c
@@ -0,0 +1,369 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include <glib.h>
+#include "qemu/qdist.h"
+
+#include <math.h>
+
+struct entry_desc {
+    double x;
+    unsigned long count;
+
+    /* 0 prints a space, 1-8 prints from qdist_blocks[] */
+    int fill_code;
+};
+
+/* See: https://en.wikipedia.org/wiki/Block_Elements */
+static const gunichar qdist_blocks[] = {
+    0x2581,
+    0x2582,
+    0x2583,
+    0x2584,
+    0x2585,
+    0x2586,
+    0x2587,
+    0x2588
+};
+
+#define QDIST_NR_BLOCK_CODES ARRAY_SIZE(qdist_blocks)
+
+static char *pr_hist(const struct entry_desc *darr, size_t n)
+{
+    GString *s = g_string_new("");
+    size_t i;
+
+    for (i = 0; i < n; i++) {
+        int fill = darr[i].fill_code;
+
+        if (fill) {
+            assert(fill <= QDIST_NR_BLOCK_CODES);
+            g_string_append_unichar(s, qdist_blocks[fill - 1]);
+        } else {
+            g_string_append_c(s, ' ');
+        }
+    }
+    return g_string_free(s, FALSE);
+}
+
+static void
+histogram_check(const struct qdist *dist, const struct entry_desc *darr,
+                size_t n, size_t n_bins)
+{
+    char *pr = qdist_pr_plain(dist, n_bins);
+    char *str = pr_hist(darr, n);
+
+    g_assert_cmpstr(pr, ==, str);
+    g_free(pr);
+    g_free(str);
+}
+
+static void histogram_check_single_full(const struct qdist *dist, size_t n_bins)
+{
+    struct entry_desc desc = { .fill_code = 8 };
+
+    histogram_check(dist, &desc, 1, n_bins);
+}
+
+static void
+entries_check(const struct qdist *dist, const struct entry_desc *darr, size_t n)
+{
+    size_t i;
+
+    for (i = 0; i < n; i++) {
+        struct qdist_entry *e = &dist->entries[i];
+
+        g_assert_cmpuint(e->count, ==, darr[i].count);
+    }
+}
+
+static void
+entries_insert(struct qdist *dist, const struct entry_desc *darr, size_t n)
+{
+    size_t i;
+
+    for (i = 0; i < n; i++) {
+        qdist_add(dist, darr[i].x, darr[i].count);
+    }
+}
+
+static void do_test_bin(const struct entry_desc *a, size_t n_a,
+                        const struct entry_desc *b, size_t n_b)
+{
+    struct qdist qda;
+    struct qdist qdb;
+
+    qdist_init(&qda);
+
+    entries_insert(&qda, a, n_a);
+    qdist_inc(&qda, a[0].x);
+    qdist_add(&qda, a[0].x, -1);
+
+    g_assert_cmpuint(qdist_unique_entries(&qda), ==, n_a);
+    g_assert_cmpfloat(qdist_xmin(&qda), ==, a[0].x);
+    g_assert_cmpfloat(qdist_xmax(&qda), ==, a[n_a - 1].x);
+    histogram_check(&qda, a, n_a, 0);
+    histogram_check(&qda, a, n_a, n_a);
+
+    qdist_bin__internal(&qdb, &qda, n_b);
+    g_assert_cmpuint(qdb.n, ==, n_b);
+    entries_check(&qdb, b, n_b);
+    g_assert_cmpuint(qdist_sample_count(&qda), ==, qdist_sample_count(&qdb));
+    /*
+     * No histogram_check() for $qdb, since we'd rebin it and that is a bug.
+     * Instead, regenerate it from $qda.
+     */
+    histogram_check(&qda, b, n_b, n_b);
+
+    qdist_destroy(&qdb);
+    qdist_destroy(&qda);
+}
+
+static void do_test_pr(uint32_t opt)
+{
+    static const struct entry_desc desc[] = {
+        [0] = { 1, 900, 8 },
+        [1] = { 2, 1, 1 },
+        [2] = { 3, 2, 1 }
+    };
+    static const char border[] = "|";
+    const char *llabel = NULL;
+    const char *rlabel = NULL;
+    struct qdist dist;
+    GString *s;
+    char *str;
+    char *pr;
+    size_t n;
+
+    n = ARRAY_SIZE(desc);
+    qdist_init(&dist);
+
+    entries_insert(&dist, desc, n);
+    histogram_check(&dist, desc, n, 0);
+
+    s = g_string_new("");
+
+    if (opt & QDIST_PR_LABELS) {
+        unsigned int lopts = opt & (QDIST_PR_NODECIMAL |
+                                    QDIST_PR_PERCENT |
+                                    QDIST_PR_100X |
+                                    QDIST_PR_NOBINRANGE);
+
+        if (lopts == 0) {
+            llabel = "[1.0,1.7)";
+            rlabel = "[2.3,3.0]";
+        } else if (lopts == QDIST_PR_NODECIMAL) {
+            llabel = "[1,2)";
+            rlabel = "[2,3]";
+        } else if (lopts == (QDIST_PR_PERCENT | QDIST_PR_NODECIMAL)) {
+            llabel = "[1,2)%";
+            rlabel = "[2,3]%";
+        } else if (lopts == QDIST_PR_100X) {
+            llabel = "[100.0,166.7)";
+            rlabel = "[233.3,300.0]";
+        } else if (lopts == (QDIST_PR_NOBINRANGE | QDIST_PR_NODECIMAL)) {
+            llabel = "1";
+            rlabel = "3";
+        } else {
+            g_assert_cmpstr("BUG", ==, "This is not meant to be exhaustive");
+        }
+    }
+
+    if (llabel) {
+        g_string_append(s, llabel);
+    }
+    if (opt & QDIST_PR_BORDER) {
+        g_string_append(s, border);
+    }
+
+    str = pr_hist(desc, n);
+    g_string_append(s, str);
+    g_free(str);
+
+    if (opt & QDIST_PR_BORDER) {
+        g_string_append(s, border);
+    }
+    if (rlabel) {
+        g_string_append(s, rlabel);
+    }
+
+    str = g_string_free(s, FALSE);
+    pr = qdist_pr(&dist, n, opt);
+    g_assert_cmpstr(pr, ==, str);
+    g_free(pr);
+    g_free(str);
+
+    qdist_destroy(&dist);
+}
+
+static inline void do_test_pr_label(uint32_t opt)
+{
+    opt |= QDIST_PR_LABELS;
+    do_test_pr(opt);
+}
+
+static void test_pr(void)
+{
+    do_test_pr(0);
+
+    do_test_pr(QDIST_PR_BORDER);
+
+    /* 100X should be ignored because we're not setting LABELS */
+    do_test_pr(QDIST_PR_100X);
+
+    do_test_pr_label(0);
+    do_test_pr_label(QDIST_PR_NODECIMAL);
+    do_test_pr_label(QDIST_PR_PERCENT | QDIST_PR_NODECIMAL);
+    do_test_pr_label(QDIST_PR_100X);
+    do_test_pr_label(QDIST_PR_NOBINRANGE | QDIST_PR_NODECIMAL);
+}
+
+static void test_bin_shrink(void)
+{
+    static const struct entry_desc a[] = {
+        [0] = { 0.0,   42922, 7 },
+        [1] = { 0.25,  47834, 8 },
+        [2] = { 0.50,  26628, 0 },
+        [3] = { 0.625, 597,   4 },
+        [4] = { 0.75,  10298, 1 },
+        [5] = { 0.875, 22,    2 },
+        [6] = { 1.0,   2771,  1 }
+    };
+    static const struct entry_desc b[] = {
+        [0] = { 0.0, 42922, 7 },
+        [1] = { 0.25, 47834, 8 },
+        [2] = { 0.50, 27225, 3 },
+        [3] = { 0.75, 13091, 1 }
+    };
+
+    return do_test_bin(a, ARRAY_SIZE(a), b, ARRAY_SIZE(b));
+}
+
+static void test_bin_expand(void)
+{
+    static const struct entry_desc a[] = {
+        [0] = { 0.0,   11713, 5 },
+        [1] = { 0.25,  20294, 0 },
+        [2] = { 0.50,  17266, 8 },
+        [3] = { 0.625, 1506,  0 },
+        [4] = { 0.75,  10355, 6 },
+        [5] = { 0.833, 2,     1 },
+        [6] = { 0.875, 99,    4 },
+        [7] = { 1.0,   4301,  2 }
+    };
+    static const struct entry_desc b[] = {
+        [0] = { 0.0, 11713, 5 },
+        [1] = { 0.0, 0,     0 },
+        [2] = { 0.0, 20294, 8 },
+        [3] = { 0.0, 0,     0 },
+        [4] = { 0.0, 0,     0 },
+        [5] = { 0.0, 17266, 6 },
+        [6] = { 0.0, 1506,  1 },
+        [7] = { 0.0, 10355, 4 },
+        [8] = { 0.0, 101,   1 },
+        [9] = { 0.0, 4301,  2 }
+    };
+
+    return do_test_bin(a, ARRAY_SIZE(a), b, ARRAY_SIZE(b));
+}
+
+static void test_bin_simple(void)
+{
+    static const struct entry_desc a[] = {
+        [0] = { 10, 101, 8 },
+        [1] = { 11, 0, 0 },
+        [2] = { 12, 2, 1 }
+    };
+    static const struct entry_desc b[] = {
+        [0] = { 0, 101, 8 },
+        [1] = { 0, 0, 0 },
+        [2] = { 0, 0, 0 },
+        [3] = { 0, 0, 0 },
+        [4] = { 0, 2, 1 }
+    };
+
+    return do_test_bin(a, ARRAY_SIZE(a), b, ARRAY_SIZE(b));
+}
+
+static void test_single_full(void)
+{
+    struct qdist dist;
+
+    qdist_init(&dist);
+
+    qdist_add(&dist, 3, 102);
+    g_assert_cmpfloat(qdist_avg(&dist), ==, 3);
+    g_assert_cmpfloat(qdist_xmin(&dist), ==, 3);
+    g_assert_cmpfloat(qdist_xmax(&dist), ==, 3);
+
+    histogram_check_single_full(&dist, 0);
+    histogram_check_single_full(&dist, 1);
+    histogram_check_single_full(&dist, 10);
+
+    qdist_destroy(&dist);
+}
+
+static void test_single_empty(void)
+{
+    struct qdist dist;
+    char *pr;
+
+    qdist_init(&dist);
+
+    qdist_add(&dist, 3, 0);
+    g_assert_cmpuint(qdist_sample_count(&dist), ==, 0);
+    g_assert(isnan(qdist_avg(&dist)));
+    g_assert_cmpfloat(qdist_xmin(&dist), ==, 3);
+    g_assert_cmpfloat(qdist_xmax(&dist), ==, 3);
+
+    pr = qdist_pr_plain(&dist, 0);
+    g_assert_cmpstr(pr, ==, " ");
+    g_free(pr);
+
+    pr = qdist_pr_plain(&dist, 1);
+    g_assert_cmpstr(pr, ==, " ");
+    g_free(pr);
+
+    pr = qdist_pr_plain(&dist, 2);
+    g_assert_cmpstr(pr, ==, " ");
+    g_free(pr);
+
+    qdist_destroy(&dist);
+}
+
+static void test_none(void)
+{
+    struct qdist dist;
+    char *pr;
+
+    qdist_init(&dist);
+
+    g_assert(isnan(qdist_avg(&dist)));
+    g_assert(isnan(qdist_xmin(&dist)));
+    g_assert(isnan(qdist_xmax(&dist)));
+
+    pr = qdist_pr_plain(&dist, 0);
+    g_assert(pr == NULL);
+
+    pr = qdist_pr_plain(&dist, 2);
+    g_assert(pr == NULL);
+
+    qdist_destroy(&dist);
+}
+
+int main(int argc, char *argv[])
+{
+    g_test_init(&argc, &argv, NULL);
+    g_test_add_func("/qdist/none", test_none);
+    g_test_add_func("/qdist/single/empty", test_single_empty);
+    g_test_add_func("/qdist/single/full", test_single_full);
+    g_test_add_func("/qdist/binning/simple", test_bin_simple);
+    g_test_add_func("/qdist/binning/expand", test_bin_expand);
+    g_test_add_func("/qdist/binning/shrink", test_bin_shrink);
+    g_test_add_func("/qdist/pr", test_pr);
+    return g_test_run();
+}
-- 
2.5.0


* [Qemu-devel] [PATCH v5 12/18] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (10 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 11/18] qdist: add test program Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-20 22:13   ` Sergey Fedorov
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 13/18] qht: support parallel writes Emilio G. Cota
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

This is a fast, scalable, chained hash table with optional auto-resizing. It
allows reads to run concurrently with other reads, and reads/writes to run
concurrently with writes to separate buckets.

A hash table with these features will be necessary for the scalability
of the ongoing MTTCG work; before those changes arrive we can already
benefit from the single-threaded speedup that qht also provides.
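
As a rough illustration of the intended usage, here is a minimal sketch
against the API declared in qht.h below. The comparison callback, the key
and the 32-bit hash value are placeholders for illustration, not part of
this patch; note that writers still need an external lock at this point
in the series:

  #include "qemu/qht.h"

  static bool word_is_equal(const void *obj, const void *userp)
  {
      /* compare the stored object against the key we are looking for */
      return *(const uint64_t *)obj == *(const uint64_t *)userp;
  }

  static void example(void)
  {
      static uint64_t val = 0x12345678;
      uint32_t hash = (uint32_t)val; /* any well-distributed 32-bit hash */
      struct qht ht;
      void *p;

      qht_init(&ht, 1 << 16, QHT_MODE_AUTO_RESIZE);
      qht_insert(&ht, &val, hash);              /* external lock required */
      p = qht_lookup(&ht, word_is_equal, &val, hash); /* lock-free; p == &val */
      qht_remove(&ht, p, hash);                 /* external lock required */
      qht_destroy(&ht);
  }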

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/qht.h |  66 +++++
 util/Makefile.objs |   1 +
 util/qht.c         | 703 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 770 insertions(+)
 create mode 100644 include/qemu/qht.h
 create mode 100644 util/qht.c

diff --git a/include/qemu/qht.h b/include/qemu/qht.h
new file mode 100644
index 0000000..c2ab8b8
--- /dev/null
+++ b/include/qemu/qht.h
@@ -0,0 +1,66 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_QHT_H
+#define QEMU_QHT_H
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/seqlock.h"
+#include "qemu/qdist.h"
+#include "qemu/rcu.h"
+
+struct qht {
+    struct qht_map *map;
+    unsigned int mode;
+};
+
+struct qht_stats {
+    size_t head_buckets;
+    size_t used_head_buckets;
+    size_t entries;
+    struct qdist chain;
+    struct qdist occupancy;
+};
+
+typedef bool (*qht_lookup_func_t)(const void *obj, const void *userp);
+typedef void (*qht_iter_func_t)(struct qht *ht, void *p, uint32_t h, void *up);
+
+#define QHT_MODE_AUTO_RESIZE 0x1 /* auto-resize when heavily loaded */
+
+void qht_init(struct qht *ht, size_t n_elems, unsigned int mode);
+
+/* call only when there are no readers left */
+void qht_destroy(struct qht *ht);
+
+/* call with an external lock held */
+void qht_reset(struct qht *ht);
+
+/* call with an external lock held */
+bool qht_reset_size(struct qht *ht, size_t n_elems);
+
+/* call with an external lock held */
+bool qht_insert(struct qht *ht, void *p, uint32_t hash);
+
+/* call with an external lock held */
+bool qht_remove(struct qht *ht, const void *p, uint32_t hash);
+
+/* call with an external lock held */
+void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp);
+
+/* call with an external lock held */
+bool qht_resize(struct qht *ht, size_t n_elems);
+
+/* @func compares each candidate object against @userp; it must not be NULL */
+void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
+                 uint32_t hash);
+
+/* pass @stats to qht_statistics_destroy() when done */
+void qht_statistics_init(struct qht *ht, struct qht_stats *stats);
+
+void qht_statistics_destroy(struct qht_stats *stats);
+
+#endif /* QEMU_QHT_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 702435e..45f8794 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -33,3 +33,4 @@ util-obj-y += timed-average.o
 util-obj-y += base64.o
 util-obj-y += log.o
 util-obj-y += qdist.o
+util-obj-y += qht.o
diff --git a/util/qht.c b/util/qht.c
new file mode 100644
index 0000000..112f32d
--- /dev/null
+++ b/util/qht.c
@@ -0,0 +1,703 @@
+/*
+ * qht.c - QEMU Hash Table, designed to scale for read-mostly workloads.
+ *
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ *
+ * Assumptions:
+ * - Writers and iterators must take an external lock.
+ * - NULL cannot be inserted as a pointer value.
+ * - Duplicate pointer values cannot be inserted.
+ *
+ * Features:
+ * - Optional auto-resizing: the hash table resizes up if the load surpasses
+ *   a certain threshold. Resizing is done concurrently with readers.
+ *
+ * The key structure is the bucket, which is cacheline-sized. Buckets
+ * contain a few hash values and pointers; the u32 hash values are stored in
+ * full so that resizing is fast. Having this structure instead of directly
+ * chaining items has two advantages:
+ * - Failed lookups fail fast, and touch a minimum number of cache lines.
+ * - Resizing the hash table with concurrent lookups is easy.
+ *
+ * There are two types of buckets:
+ * 1. "head" buckets are the ones allocated in the array of buckets in qht_map.
+ * 2. all "non-head" buckets (i.e. all others) are members of a chain that
+ *    starts from a head bucket.
+ * Note that the seqlock and spinlock of a head bucket apply to all buckets
+ * chained to it; these two fields are unused in non-head buckets.
+ *
+ * On removals, we move the last valid item in the chain to the position of the
+ * just-removed entry. This makes lookups slightly faster, since the moment an
+ * invalid entry is found, the (failed) lookup is over.
+ *
+ * Resizing is done by taking all spinlocks (so that no readers-turned-writers
+ * can race with us) and then placing all elements into a new hash table. Last,
+ * the ht->map pointer is set, and the old map is freed once no RCU readers can
+ * see it anymore.
+ *
+ * Related Work:
+ * - Idea of cacheline-sized buckets with full hashes taken from:
+ *   David, Guerraoui & Trigonakis, "Asynchronized Concurrency:
+ *   The Secret to Scaling Concurrent Search Data Structures", ASPLOS'15.
+ * - Why not RCU-based hash tables? They would allow us to get rid of the
+ *   seqlock, but resizing would take forever since RCU read critical
+ *   sections in QEMU take quite a long time.
+ *   More info on relativistic hash tables:
+ *   + Triplett, McKenney & Walpole, "Resizable, Scalable, Concurrent Hash
+ *     Tables via Relativistic Programming", USENIX ATC'11.
+ *   + Corbet, "Relativistic hash tables, part 1: Algorithms", @ lwn.net, 2014.
+ *     https://lwn.net/Articles/612021/
+ */
+#include "qemu/qht.h"
+#include "qemu/atomic.h"
+
+//#define QHT_DEBUG
+
+/*
+ * We want to avoid false sharing of cache lines. Most systems have 64-byte
+ * cache lines so we go with it for simplicity.
+ *
+ * Note that systems with smaller cache lines will be fine (the struct is
+ * almost 64 bytes); systems with larger cache lines might suffer from
+ * some false sharing.
+ */
+#define QHT_BUCKET_ALIGN 64
+
+/* define these to keep sizeof(qht_bucket) within QHT_BUCKET_ALIGN */
+#if HOST_LONG_BITS == 32
+#define QHT_BUCKET_ENTRIES 6
+#else /* 64-bit */
+#define QHT_BUCKET_ENTRIES 4
+#endif
+
+struct qht_bucket {
+    QemuSpin lock;
+    QemuSeqLock sequence;
+    uint32_t hashes[QHT_BUCKET_ENTRIES];
+    void *pointers[QHT_BUCKET_ENTRIES];
+    struct qht_bucket *next;
+} QEMU_ALIGNED(QHT_BUCKET_ALIGN);
+
+QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
+
+/**
+ * struct qht_map - structure to track an array of buckets
+ * @rcu: used by RCU. Keep it as the top field in the struct to help valgrind
+ *       find the whole struct.
+ * @buckets: array of head buckets. It is constant once the map is created.
+ * @n: number of head buckets. It is constant once the map is created.
+ * @n_added_buckets: number of added (i.e. "non-head") buckets
+ * @n_added_buckets_threshold: threshold to trigger an upward resize once the
+ *                             number of added buckets surpasses it.
+ *
+ * Buckets are tracked in what we call a "map", i.e. this structure.
+ */
+struct qht_map {
+    struct rcu_head rcu;
+    struct qht_bucket *buckets;
+    size_t n;
+    size_t n_added_buckets;
+    size_t n_added_buckets_threshold;
+};
+
+/* trigger a resize when n_added_buckets > n_buckets / div */
+#define QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV 8
+
+static void qht_do_resize(struct qht *ht, size_t n);
+
+static inline struct qht_map *qht_map__atomic_mb(const struct qht *ht)
+{
+    struct qht_map *map;
+
+    map = atomic_read(&ht->map);
+    /* paired with smp_wmb() before setting ht->map */
+    smp_rmb();
+    return map;
+}
+
+/* helper for lockless bucket chain traversals */
+static inline
+struct qht_bucket *bucket_next__atomic_mb(const struct qht_bucket *b)
+{
+    struct qht_bucket *ret;
+
+    ret = atomic_read(&b->next);
+    /*
+     * This barrier guarantees that we will read a properly initialized b->next;
+     * it is paired with an smp_wmb() before setting b->next.
+     */
+    smp_rmb();
+    return ret;
+}
+
+#ifdef QHT_DEBUG
+static void qht_bucket_debug(struct qht_bucket *b)
+{
+    bool seen_empty = false;
+    bool corrupt = false;
+    int i;
+
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->pointers[i] == NULL) {
+                seen_empty = true;
+                continue;
+            }
+            if (seen_empty) {
+                fprintf(stderr, "%s: b: %p, pos: %i, hash: 0x%x, p: %p\n",
+                       __func__, b, i, b->hashes[i], b->pointers[i]);
+                corrupt = true;
+            }
+        }
+        b = b->next;
+    } while (b);
+    assert(!corrupt);
+}
+
+static void qht_map_debug(struct qht_map *map)
+{
+    int i;
+
+    for (i = 0; i < map->n; i++) {
+        qht_bucket_debug(&map->buckets[i]);
+    }
+}
+#else
+static inline void qht_bucket_debug(struct qht_bucket *b)
+{ }
+
+static inline void qht_map_debug(struct qht_map *map)
+{ }
+#endif /* QHT_DEBUG */
+
+static inline size_t qht_elems_to_buckets(size_t n_elems)
+{
+    return pow2ceil(n_elems / QHT_BUCKET_ENTRIES);
+}
+
+static inline void qht_head_init(struct qht_bucket *b)
+{
+    memset(b, 0, sizeof(*b));
+    qemu_spin_init(&b->lock);
+    seqlock_init(&b->sequence);
+}
+
+static inline
+struct qht_bucket *qht_map_to_bucket(struct qht_map *map, uint32_t hash)
+{
+    return &map->buckets[hash & (map->n - 1)];
+}
+
+/* acquire all bucket locks from a map */
+static void qht_map_lock_buckets(struct qht_map *map)
+{
+    size_t i;
+
+    for (i = 0; i < map->n; i++) {
+        struct qht_bucket *b = &map->buckets[i];
+
+        qemu_spin_lock(&b->lock);
+    }
+}
+
+static void qht_map_unlock_buckets(struct qht_map *map)
+{
+    size_t i;
+
+    for (i = 0; i < map->n; i++) {
+        struct qht_bucket *b = &map->buckets[i];
+
+        qemu_spin_unlock(&b->lock);
+    }
+}
+
+static inline bool qht_map_needs_resize(struct qht_map *map)
+{
+    return atomic_read(&map->n_added_buckets) > map->n_added_buckets_threshold;
+}
+
+static inline void qht_chain_destroy(struct qht_bucket *head)
+{
+    struct qht_bucket *curr = head->next;
+    struct qht_bucket *prev;
+
+    while (curr) {
+        prev = curr;
+        curr = curr->next;
+        qemu_vfree(prev);
+    }
+}
+
+/* pass only an orphan map */
+static void qht_map_destroy(struct qht_map *map)
+{
+    size_t i;
+
+    for (i = 0; i < map->n; i++) {
+        qht_chain_destroy(&map->buckets[i]);
+    }
+    qemu_vfree(map->buckets);
+    g_free(map);
+}
+
+static void qht_map_reclaim(struct rcu_head *rcu)
+{
+    struct qht_map *map = container_of(rcu, struct qht_map, rcu);
+
+    qht_map_destroy(map);
+}
+
+static struct qht_map *qht_map_create(size_t n)
+{
+    struct qht_map *map;
+    size_t i;
+
+    map = g_malloc(sizeof(*map));
+    map->n = n;
+
+    map->n_added_buckets = 0;
+    map->n_added_buckets_threshold = n / QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV;
+
+    /* let tiny hash tables add at least one non-head bucket */
+    if (unlikely(map->n_added_buckets_threshold == 0)) {
+        map->n_added_buckets_threshold = 1;
+    }
+
+    map->buckets = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*map->buckets) * n);
+    for (i = 0; i < n; i++) {
+        qht_head_init(&map->buckets[i]);
+    }
+    return map;
+}
+
+static inline void qht_publish(struct qht *ht, struct qht_map *new)
+{
+    /* Readers should see a properly initialized map; pair with smp_rmb() */
+    smp_wmb();
+    atomic_set(&ht->map, new);
+}
+
+void qht_init(struct qht *ht, size_t n_elems, unsigned int mode)
+{
+    struct qht_map *map;
+    size_t n = qht_elems_to_buckets(n_elems);
+
+    ht->mode = mode;
+    map = qht_map_create(n);
+    qht_publish(ht, map);
+}
+
+/* call only when there are no readers left */
+void qht_destroy(struct qht *ht)
+{
+    qht_map_destroy(ht->map);
+    memset(ht, 0, sizeof(*ht));
+}
+
+static void qht_bucket_reset(struct qht_bucket *head)
+{
+    struct qht_bucket *b = head;
+    int i;
+
+    qemu_spin_lock(&head->lock);
+    seqlock_write_begin(&head->sequence);
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->pointers[i] == NULL) {
+                goto done;
+            }
+            atomic_set(&b->hashes[i], 0);
+            atomic_set(&b->pointers[i], NULL);
+        }
+        b = b->next;
+    } while (b);
+ done:
+    seqlock_write_end(&head->sequence);
+    qemu_spin_unlock(&head->lock);
+}
+
+/* call with an external lock held */
+void qht_reset(struct qht *ht)
+{
+    struct qht_map *map = ht->map;
+    size_t i;
+
+    for (i = 0; i < map->n; i++) {
+        qht_bucket_reset(&map->buckets[i]);
+    }
+    qht_map_debug(map);
+}
+
+/* call with an external lock held */
+bool qht_reset_size(struct qht *ht, size_t n_elems)
+{
+    struct qht_map *old = ht->map;
+
+    qht_reset(ht);
+    if (old->n == qht_elems_to_buckets(n_elems)) {
+        return false;
+    }
+    qht_init(ht, n_elems, ht->mode);
+    call_rcu1(&old->rcu, qht_map_reclaim);
+    return true;
+}
+
+static inline
+void *qht_do_lookup(struct qht_bucket *head, qht_lookup_func_t func,
+                    const void *userp, uint32_t hash)
+{
+    struct qht_bucket *b = head;
+    int i;
+
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (atomic_read(&b->hashes[i]) == hash) {
+                void *p = atomic_read(&b->pointers[i]);
+
+                if (likely(p) && likely(func(p, userp))) {
+                    return p;
+                }
+            }
+        }
+        b = bucket_next__atomic_mb(b);
+    } while (b);
+
+    return NULL;
+}
+
+static __attribute__((noinline))
+void *qht_lookup__slowpath(struct qht_bucket *b, qht_lookup_func_t func,
+                           const void *userp, uint32_t hash)
+{
+    uint32_t version;
+    void *ret;
+
+    do {
+        version = seqlock_read_begin(&b->sequence);
+        ret = qht_do_lookup(b, func, userp, hash);
+    } while (seqlock_read_retry(&b->sequence, version));
+    return ret;
+}
+
+void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
+                 uint32_t hash)
+{
+    struct qht_bucket *b;
+    struct qht_map *map;
+    uint32_t version;
+    void *ret;
+
+    map = qht_map__atomic_mb(ht);
+    b = qht_map_to_bucket(map, hash);
+
+    version = seqlock_read_begin(&b->sequence);
+    ret = qht_do_lookup(b, func, userp, hash);
+    if (likely(!seqlock_read_retry(&b->sequence, version))) {
+        return ret;
+    }
+    /*
+     * Removing the do/while from the fastpath gives a 4% perf. increase when
+     * running a 100%-lookup microbenchmark.
+     */
+    return qht_lookup__slowpath(b, func, userp, hash);
+}
+
+/* call with head->lock held */
+static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
+                               struct qht_bucket *head, void *p, uint32_t hash,
+                               bool *needs_resize)
+{
+    struct qht_bucket *b = head;
+    struct qht_bucket *prev = NULL;
+    struct qht_bucket *new = NULL;
+    int i;
+
+    for (;;) {
+        if (b == NULL) {
+            b = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*b));
+            memset(b, 0, sizeof(*b));
+            new = b;
+            atomic_inc(&map->n_added_buckets);
+            if (unlikely(qht_map_needs_resize(map)) && needs_resize) {
+                *needs_resize = true;
+            }
+        }
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->pointers[i]) {
+                if (unlikely(b->pointers[i] == p)) {
+                    return false;
+                }
+                continue;
+            }
+            /* found an empty key: acquire the seqlock and write */
+            seqlock_write_begin(&head->sequence);
+            if (new) {
+                /*
+                 * This barrier is paired with smp_rmb() after reading
+                 * b->next when not holding b->lock.
+                 */
+                smp_wmb();
+                atomic_set(&prev->next, b);
+            }
+            atomic_set(&b->hashes[i], hash);
+            atomic_set(&b->pointers[i], p);
+            seqlock_write_end(&head->sequence);
+            return true;
+        }
+        prev = b;
+        b = b->next;
+    }
+}
+
+/* call with an external lock held */
+bool qht_insert(struct qht *ht, void *p, uint32_t hash)
+{
+    struct qht_map *map = ht->map;
+    struct qht_bucket *b = qht_map_to_bucket(map, hash);
+    bool needs_resize = false;
+    bool ret;
+
+    /* NULL pointers are not supported */
+    assert(p);
+
+    qemu_spin_lock(&b->lock);
+    ret = qht_insert__locked(ht, map, b, p, hash, &needs_resize);
+    qht_bucket_debug(b);
+    qemu_spin_unlock(&b->lock);
+
+    if (unlikely(needs_resize) && ht->mode & QHT_MODE_AUTO_RESIZE) {
+        qht_do_resize(ht, map->n * 2);
+    }
+    return ret;
+}
+
+static inline bool qht_entry_is_last(struct qht_bucket *b, int pos)
+{
+    if (pos == QHT_BUCKET_ENTRIES - 1) {
+        if (b->next == NULL) {
+            return true;
+        }
+        return b->next->pointers[0] == NULL;
+    }
+    return b->pointers[pos + 1] == NULL;
+}
+
+static void
+qht_entry_move(struct qht_bucket *to, int i, struct qht_bucket *from, int j)
+{
+    assert(!(to == from && i == j));
+    assert(to->pointers[i] == NULL);
+    assert(from->pointers[j]);
+
+    atomic_set(&to->hashes[i], from->hashes[j]);
+    atomic_set(&to->pointers[i], from->pointers[j]);
+
+    atomic_set(&from->hashes[j], 0);
+    atomic_set(&from->pointers[j], NULL);
+}
+
+/*
+ * Find the last valid entry in the chain starting at @orig, and move it into
+ * @orig[pos], which has just been invalidated.
+ */
+static inline void qht_bucket_fill_hole(struct qht_bucket *orig, int pos)
+{
+    struct qht_bucket *b = orig;
+    struct qht_bucket *prev = NULL;
+    int i;
+
+    if (qht_entry_is_last(orig, pos)) {
+        return;
+    }
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->pointers[i] || (b == orig && i == pos)) {
+                continue;
+            }
+            if (i > 0) {
+                return qht_entry_move(orig, pos, b, i - 1);
+            }
+            assert(prev);
+            return qht_entry_move(orig, pos, prev, QHT_BUCKET_ENTRIES - 1);
+        }
+        prev = b;
+        b = b->next;
+    } while (b);
+    /* no free entries other than orig[pos], so swap it with the last one */
+    qht_entry_move(orig, pos, prev, QHT_BUCKET_ENTRIES - 1);
+}
+
+/* call with b->lock held */
+static inline
+bool qht_remove__locked(struct qht_map *map, struct qht_bucket *head,
+                        const void *p, uint32_t hash)
+{
+    struct qht_bucket *b = head;
+    int i;
+
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            void *q = b->pointers[i];
+
+            if (unlikely(q == NULL)) {
+                return false;
+            }
+            if (q == p) {
+                assert(b->hashes[i] == hash);
+                seqlock_write_begin(&head->sequence);
+                atomic_set(&b->hashes[i], 0);
+                atomic_set(&b->pointers[i], NULL);
+                qht_bucket_fill_hole(b, i);
+                seqlock_write_end(&head->sequence);
+                return true;
+            }
+        }
+        b = b->next;
+    } while (b);
+    return false;
+}
+
+/* call with an external lock held */
+bool qht_remove(struct qht *ht, const void *p, uint32_t hash)
+{
+    struct qht_map *map = ht->map;
+    struct qht_bucket *b = qht_map_to_bucket(map, hash);
+    bool ret;
+
+    qemu_spin_lock(&b->lock);
+    ret = qht_remove__locked(map, b, p, hash);
+    qht_bucket_debug(b);
+    qemu_spin_unlock(&b->lock);
+    return ret;
+}
+
+static inline void qht_bucket_iter(struct qht *ht, struct qht_bucket *b,
+                                   qht_iter_func_t func, void *userp)
+{
+    int i;
+
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->pointers[i] == NULL) {
+                return;
+            }
+            func(ht, b->pointers[i], b->hashes[i], userp);
+        }
+        b = b->next;
+    } while (b);
+}
+
+/* external lock + all of the map's locks held */
+static inline void qht_map_iter__locked(struct qht *ht, struct qht_map *map,
+                                        qht_iter_func_t func, void *userp)
+{
+    size_t i;
+
+    for (i = 0; i < map->n; i++) {
+        qht_bucket_iter(ht, &map->buckets[i], func, userp);
+    }
+}
+
+/* call with an external lock held */
+void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp)
+{
+    qht_map_lock_buckets(ht->map);
+    qht_map_iter__locked(ht, ht->map, func, userp);
+    qht_map_unlock_buckets(ht->map);
+}
+
+static void qht_map_copy(struct qht *ht, void *p, uint32_t hash, void *userp)
+{
+    struct qht_map *new = userp;
+    struct qht_bucket *b = qht_map_to_bucket(new, hash);
+
+    /* no need to acquire b->lock because no thread has seen this map yet */
+    qht_insert__locked(ht, new, b, p, hash, NULL);
+}
+
+/* call with an external lock held */
+static void qht_do_resize(struct qht *ht, size_t n)
+{
+    struct qht_map *old = ht->map;
+    struct qht_map *new;
+
+    g_assert_cmpuint(n, !=, old->n);
+    new = qht_map_create(n);
+    qht_iter(ht, qht_map_copy, new);
+    qht_map_debug(new);
+
+    qht_publish(ht, new);
+    call_rcu1(&old->rcu, qht_map_reclaim);
+}
+
+/* call with an external lock held */
+bool qht_resize(struct qht *ht, size_t n_elems)
+{
+    size_t n = qht_elems_to_buckets(n_elems);
+
+    if (n == ht->map->n) {
+        return false;
+    }
+    qht_do_resize(ht, n);
+    return true;
+}
+
+/* pass @stats to qht_statistics_destroy() when done */
+void qht_statistics_init(struct qht *ht, struct qht_stats *stats)
+{
+    struct qht_map *map;
+    int i;
+
+    map = qht_map__atomic_mb(ht);
+
+    stats->head_buckets = map->n;
+    stats->used_head_buckets = 0;
+    stats->entries = 0;
+    qdist_init(&stats->chain);
+    qdist_init(&stats->occupancy);
+
+    for (i = 0; i < map->n; i++) {
+        struct qht_bucket *head = &map->buckets[i];
+        struct qht_bucket *b;
+        uint32_t version;
+        size_t buckets;
+        size_t entries;
+        int j;
+
+        do {
+            version = seqlock_read_begin(&head->sequence);
+            buckets = 0;
+            entries = 0;
+            b = head;
+            do {
+                for (j = 0; j < QHT_BUCKET_ENTRIES; j++) {
+                    if (atomic_read(&b->pointers[j]) == NULL) {
+                        break;
+                    }
+                    entries++;
+                }
+                buckets++;
+                b = bucket_next__atomic_mb(b);
+            } while (b);
+        } while (seqlock_read_retry(&head->sequence, version));
+
+        if (entries) {
+            qdist_inc(&stats->chain, buckets);
+            qdist_inc(&stats->occupancy,
+                      (double)entries / QHT_BUCKET_ENTRIES / buckets);
+            stats->used_head_buckets++;
+            stats->entries += entries;
+        } else {
+            qdist_inc(&stats->occupancy, 0);
+        }
+    }
+}
+
+void qht_statistics_destroy(struct qht_stats *stats)
+{
+    qdist_destroy(&stats->occupancy);
+    qdist_destroy(&stats->chain);
+}
-- 
2.5.0


* [Qemu-devel] [PATCH v5 13/18] qht: support parallel writes
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (11 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 12/18] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-23 20:28   ` Sergey Fedorov
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 14/18] qht: add test program Emilio G. Cota
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

The appended patch increases write scalability by allowing concurrent writes
to separate buckets. It also removes the need for an external lock when
operating on the hash table.

Lookup code remains the same; insert and remove fast paths get an extra
smp_rmb() before reading the map pointer and a likely() check after
acquiring the lock of the bucket they operate on.
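
With this change writers need no external serialization at all. As a hedged
illustration only (the pthread usage, key layout and trivial hash below are
assumptions of this sketch, not part of the patch), two threads can update
disjoint key sets concurrently like this:

  #include <pthread.h>
  #include "qemu/qht.h"

  #define KEYS_PER_THREAD 1024

  static struct qht ht;                      /* shared; no external lock */
  static uint64_t keys[2][KEYS_PER_THREAD];

  /* each thread updates its own key range; bucket locks do the rest */
  static void *writer(void *arg)
  {
      uint64_t *k = arg;
      size_t i;

      for (i = 0; i < KEYS_PER_THREAD; i++) {
          uint32_t hash = (uint32_t)k[i];    /* placeholder hash */

          qht_insert(&ht, &k[i], hash);
          qht_remove(&ht, &k[i], hash);
      }
      return NULL;
  }

  static void run_two_writers(void)
  {
      pthread_t threads[2];
      size_t i, j;

      for (i = 0; i < 2; i++) {
          for (j = 0; j < KEYS_PER_THREAD; j++) {
              keys[i][j] = i * KEYS_PER_THREAD + j;
          }
      }
      qht_init(&ht, 2 * KEYS_PER_THREAD, QHT_MODE_AUTO_RESIZE);
      for (i = 0; i < 2; i++) {
          pthread_create(&threads[i], NULL, writer, keys[i]);
      }
      for (i = 0; i < 2; i++) {
          pthread_join(threads[i], NULL);
      }
      qht_destroy(&ht);
  }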

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/qht.h |  10 +-
 util/qht.c         | 276 ++++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 215 insertions(+), 71 deletions(-)

diff --git a/include/qemu/qht.h b/include/qemu/qht.h
index c2ab8b8..f0d68fb 100644
--- a/include/qemu/qht.h
+++ b/include/qemu/qht.h
@@ -10,11 +10,13 @@
 #include "qemu/osdep.h"
 #include "qemu-common.h"
 #include "qemu/seqlock.h"
+#include "qemu/thread.h"
 #include "qemu/qdist.h"
 #include "qemu/rcu.h"
 
 struct qht {
     struct qht_map *map;
+    QemuSpin lock; /* serializes setters of ht->map */
     unsigned int mode;
 };
 
@@ -33,25 +35,19 @@ typedef void (*qht_iter_func_t)(struct qht *ht, void *p, uint32_t h, void *up);
 
 void qht_init(struct qht *ht, size_t n_elems, unsigned int mode);
 
-/* call only when there are no readers left */
+/* call only when there are no readers/writers left */
 void qht_destroy(struct qht *ht);
 
-/* call with an external lock held */
 void qht_reset(struct qht *ht);
 
-/* call with an external lock held */
 bool qht_reset_size(struct qht *ht, size_t n_elems);
 
-/* call with an external lock held */
 bool qht_insert(struct qht *ht, void *p, uint32_t hash);
 
-/* call with an external lock held */
 bool qht_remove(struct qht *ht, const void *p, uint32_t hash);
 
-/* call with an external lock held */
 void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp);
 
-/* call with an external lock held */
 bool qht_resize(struct qht *ht, size_t n_elems);
 
 /* @func compares each candidate object against @userp; it must not be NULL */
diff --git a/util/qht.c b/util/qht.c
index 112f32d..8a9b66a 100644
--- a/util/qht.c
+++ b/util/qht.c
@@ -7,13 +7,19 @@
  *   See the COPYING file in the top-level directory.
  *
  * Assumptions:
- * - Writers and iterators must take an external lock.
  * - NULL cannot be inserted as a pointer value.
  * - Duplicate pointer values cannot be inserted.
  *
  * Features:
+ * - Reads (i.e. lookups and iterators) can be concurrent with other reads.
+ *   Lookups that are concurrent with writes to the same bucket will retry
+ *   via a seqlock; iterators acquire all bucket locks and therefore can be
+ *   concurrent with lookups and are serialized wrt writers.
+ * - Writes (i.e. insertions/removals) can be concurrent with writes to
+ *   different buckets; writes to the same bucket are serialized through a lock.
  * - Optional auto-resizing: the hash table resizes up if the load surpasses
- *   a certain threshold. Resizing is done concurrently with readers.
+ *   a certain threshold. Resizing is done concurrently with readers; writes
+ *   are serialized with the resize operation.
  *
  * The key structure is the bucket, which is cacheline-sized. Buckets
  * contain a few hash values and pointers; the u32 hash values are stored in
@@ -33,10 +39,11 @@
  * just-removed entry. This makes lookups slightly faster, since the moment an
  * invalid entry is found, the (failed) lookup is over.
  *
- * Resizing is done by taking all spinlocks (so that no readers-turned-writers
- * can race with us) and then placing all elements into a new hash table. Last,
- * the ht->map pointer is set, and the old map is freed once no RCU readers can
- * see it anymore.
+ * Resizing is done by taking all bucket spinlocks (so that no other writers can
+ * race with us) and then copying all entries into a new hash map. The old
+ * hash map is marked as stale, so that no further insertions/removals will
+ * be performed on it. Last, the ht->map pointer is set, and the old map is
+ * freed once no RCU readers can see it anymore.
  *
  * Related Work:
  * - Idea of cacheline-sized buckets with full hashes taken from:
@@ -92,8 +99,14 @@ QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
  * @n_added_buckets: number of added (i.e. "non-head") buckets
  * @n_added_buckets_threshold: threshold to trigger an upward resize once the
  *                             number of added buckets surpasses it.
+ * @stale: marks that the map is stale
  *
  * Buckets are tracked in what we call a "map", i.e. this structure.
+ *
+ * The @stale flag is protected by the locks embedded in map->buckets: _all_ of
+ * these locks have to be acquired (in order from 0 to n-1) before setting
+ * the flag. Reading the flag is cheaper: it can be read after acquiring any
+ * of the bucket locks.
  */
 struct qht_map {
     struct rcu_head rcu;
@@ -101,12 +114,14 @@ struct qht_map {
     size_t n;
     size_t n_added_buckets;
     size_t n_added_buckets_threshold;
+    bool stale;
 };
 
 /* trigger a resize when n_added_buckets > n_buckets / div */
 #define QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV 8
 
-static void qht_do_resize(struct qht *ht, size_t n);
+static void qht_do_resize(struct qht *ht, struct qht_map *new);
+static void qht_grow_maybe(struct qht *ht);
 
 static inline struct qht_map *qht_map__atomic_mb(const struct qht *ht)
 {
@@ -134,7 +149,7 @@ struct qht_bucket *bucket_next__atomic_mb(const struct qht_bucket *b)
 }
 
 #ifdef QHT_DEBUG
-static void qht_bucket_debug(struct qht_bucket *b)
+static void qht_bucket_debug__locked(struct qht_bucket *b)
 {
     bool seen_empty = false;
     bool corrupt = false;
@@ -157,19 +172,19 @@ static void qht_bucket_debug(struct qht_bucket *b)
     assert(!corrupt);
 }
 
-static void qht_map_debug(struct qht_map *map)
+static void qht_map_debug__all_locked(struct qht_map *map)
 {
     int i;
 
     for (i = 0; i < map->n; i++) {
-        qht_bucket_debug(&map->buckets[i]);
+        qht_bucket_debug__locked(&map->buckets[i]);
     }
 }
 #else
-static inline void qht_bucket_debug(struct qht_bucket *b)
+static inline void qht_bucket_debug__locked(struct qht_bucket *b)
 { }
 
-static inline void qht_map_debug(struct qht_map *map)
+static inline void qht_map_debug__all_locked(struct qht_map *map)
 { }
 #endif /* QHT_DEBUG */
 
@@ -214,6 +229,63 @@ static void qht_map_unlock_buckets(struct qht_map *map)
     }
 }
 
+/*
+ * Grab all bucket locks, and set @pmap after making sure the map isn't stale.
+ *
+ * Pairs with qht_map_unlock_buckets(); @pmap returns the map we locked.
+ */
+static inline
+void qht_map_lock_buckets__no_stale(struct qht *ht, struct qht_map **pmap)
+{
+    struct qht_map *map;
+
+    for (;;) {
+        map = qht_map__atomic_mb(ht);
+        qht_map_lock_buckets(map);
+        if (likely(!map->stale)) {
+            *pmap = map;
+            return;
+        }
+        qht_map_unlock_buckets(map);
+
+        /* resize in progress; wait until it completes */
+        while (qemu_spin_locked(&ht->lock)) {
+            cpu_relax();
+        }
+    }
+}
+
+/*
+ * Get a head bucket and lock it, making sure its parent map is not stale.
+ * @pmap is filled with a pointer to the bucket's parent map.
+ *
+ * Unlock with qemu_spin_unlock(&b->lock).
+ */
+static inline
+struct qht_bucket *qht_bucket_lock__no_stale(struct qht *ht, uint32_t hash,
+                                             struct qht_map **pmap)
+{
+    struct qht_bucket *b;
+    struct qht_map *map;
+
+    for (;;) {
+        map = qht_map__atomic_mb(ht);
+        b = qht_map_to_bucket(map, hash);
+
+        qemu_spin_lock(&b->lock);
+        if (likely(!map->stale)) {
+            *pmap = map;
+            return b;
+        }
+        qemu_spin_unlock(&b->lock);
+
+        /* resize in progress; wait until it completes */
+        while (qemu_spin_locked(&ht->lock)) {
+            cpu_relax();
+        }
+    }
+}
+
 static inline bool qht_map_needs_resize(struct qht_map *map)
 {
     return atomic_read(&map->n_added_buckets) > map->n_added_buckets_threshold;
@@ -266,6 +338,7 @@ static struct qht_map *qht_map_create(size_t n)
         map->n_added_buckets_threshold = 1;
     }
 
+    map->stale = false;
     map->buckets = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*map->buckets) * n);
     for (i = 0; i < n; i++) {
         qht_head_init(&map->buckets[i]);
@@ -273,7 +346,8 @@ static struct qht_map *qht_map_create(size_t n)
     return map;
 }
 
-static inline void qht_publish(struct qht *ht, struct qht_map *new)
+/* call with ht->lock held */
+static inline void qht_publish__htlocked(struct qht *ht, struct qht_map *new)
 {
     /* Readers should see a properly initialized map; pair with smp_rmb() */
     smp_wmb();
@@ -286,23 +360,24 @@ void qht_init(struct qht *ht, size_t n_elems, unsigned int mode)
     size_t n = qht_elems_to_buckets(n_elems);
 
     ht->mode = mode;
+    qemu_spin_init(&ht->lock);
     map = qht_map_create(n);
-    qht_publish(ht, map);
+    /* no need to take ht->lock since we just initialized it */
+    qht_publish__htlocked(ht, map);
 }
 
-/* call only when there are no readers left */
+/* call only when there are no readers/writers left */
 void qht_destroy(struct qht *ht)
 {
     qht_map_destroy(ht->map);
     memset(ht, 0, sizeof(*ht));
 }
 
-static void qht_bucket_reset(struct qht_bucket *head)
+static void qht_bucket_reset__locked(struct qht_bucket *head)
 {
     struct qht_bucket *b = head;
     int i;
 
-    qemu_spin_lock(&head->lock);
     seqlock_write_begin(&head->sequence);
     do {
         for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
@@ -316,33 +391,56 @@ static void qht_bucket_reset(struct qht_bucket *head)
     } while (b);
  done:
     seqlock_write_end(&head->sequence);
-    qemu_spin_unlock(&head->lock);
 }
 
-/* call with an external lock held */
-void qht_reset(struct qht *ht)
+/* call with all bucket locks held */
+static void qht_map_reset__all_locked(struct qht_map *map)
 {
-    struct qht_map *map = ht->map;
     size_t i;
 
     for (i = 0; i < map->n; i++) {
-        qht_bucket_reset(&map->buckets[i]);
+        qht_bucket_reset__locked(&map->buckets[i]);
     }
-    qht_map_debug(map);
+    qht_map_debug__all_locked(map);
+}
+
+void qht_reset(struct qht *ht)
+{
+    struct qht_map *map;
+
+    qht_map_lock_buckets__no_stale(ht, &map);
+    qht_map_reset__all_locked(map);
+    qht_map_unlock_buckets(map);
 }
 
-/* call with an external lock held */
 bool qht_reset_size(struct qht *ht, size_t n_elems)
 {
-    struct qht_map *old = ht->map;
+    struct qht_map *new;
+    struct qht_map *map;
+    bool resize;
+    size_t n;
 
-    qht_reset(ht);
-    if (old->n == qht_elems_to_buckets(n_elems)) {
-        return false;
+    n = qht_elems_to_buckets(n_elems);
+    /* allocate the new map first so that we don't hold up ht->lock */
+    new = qht_map_create(n);
+    resize = false;
+
+    qemu_spin_lock(&ht->lock);
+    map = ht->map;
+    qht_map_lock_buckets(map);
+    qht_map_reset__all_locked(map);
+    if (n != map->n) {
+        qht_do_resize(ht, new);
+        resize = true;
     }
-    qht_init(ht, n_elems, ht->mode);
-    call_rcu1(&old->rcu, qht_map_reclaim);
-    return true;
+    qht_map_unlock_buckets(map);
+    qemu_spin_unlock(&ht->lock);
+
+    if (unlikely(!resize)) {
+        qht_map_destroy(new);
+    }
+
+    return resize;
 }
 
 static inline
@@ -452,24 +550,23 @@ static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
     }
 }
 
-/* call with an external lock held */
 bool qht_insert(struct qht *ht, void *p, uint32_t hash)
 {
-    struct qht_map *map = ht->map;
-    struct qht_bucket *b = qht_map_to_bucket(map, hash);
+    struct qht_bucket *b;
+    struct qht_map *map;
     bool needs_resize = false;
     bool ret;
 
     /* NULL pointers are not supported */
     assert(p);
 
-    qemu_spin_lock(&b->lock);
+    b = qht_bucket_lock__no_stale(ht, hash, &map);
     ret = qht_insert__locked(ht, map, b, p, hash, &needs_resize);
-    qht_bucket_debug(b);
+    qht_bucket_debug__locked(b);
     qemu_spin_unlock(&b->lock);
 
     if (unlikely(needs_resize) && ht->mode & QHT_MODE_AUTO_RESIZE) {
-        qht_do_resize(ht, map->n * 2);
+        qht_grow_maybe(ht);
     }
     return ret;
 }
@@ -560,16 +657,15 @@ bool qht_remove__locked(struct qht_map *map, struct qht_bucket *head,
     return false;
 }
 
-/* call with an external lock held */
 bool qht_remove(struct qht *ht, const void *p, uint32_t hash)
 {
-    struct qht_map *map = ht->map;
-    struct qht_bucket *b = qht_map_to_bucket(map, hash);
+    struct qht_bucket *b;
+    struct qht_map *map;
     bool ret;
 
-    qemu_spin_lock(&b->lock);
+    b = qht_bucket_lock__no_stale(ht, hash, &map);
     ret = qht_remove__locked(map, b, p, hash);
-    qht_bucket_debug(b);
+    qht_bucket_debug__locked(b);
     qemu_spin_unlock(&b->lock);
     return ret;
 }
@@ -590,9 +686,9 @@ static inline void qht_bucket_iter(struct qht *ht, struct qht_bucket *b,
     } while (b);
 }
 
-/* external lock + all of the map's locks held */
-static inline void qht_map_iter__locked(struct qht *ht, struct qht_map *map,
-                                        qht_iter_func_t func, void *userp)
+/* call with all of the map's locks held */
+static inline void qht_map_iter__all_locked(struct qht *ht, struct qht_map *map,
+                                            qht_iter_func_t func, void *userp)
 {
     size_t i;
 
@@ -601,12 +697,15 @@ static inline void qht_map_iter__locked(struct qht *ht, struct qht_map *map,
     }
 }
 
-/* call with an external lock held */
 void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp)
 {
-    qht_map_lock_buckets(ht->map);
-    qht_map_iter__locked(ht, ht->map, func, userp);
-    qht_map_unlock_buckets(ht->map);
+    struct qht_map *map;
+
+    map = qht_map__atomic_mb(ht);
+    qht_map_lock_buckets(map);
+    /* Note: ht here is merely for carrying ht->mode; ht->map won't be read */
+    qht_map_iter__all_locked(ht, map, func, userp);
+    qht_map_unlock_buckets(map);
 }
 
 static void qht_map_copy(struct qht *ht, void *p, uint32_t hash, void *userp)
@@ -618,31 +717,80 @@ static void qht_map_copy(struct qht *ht, void *p, uint32_t hash, void *userp)
     qht_insert__locked(ht, new, b, p, hash, NULL);
 }
 
-/* call with an external lock held */
-static void qht_do_resize(struct qht *ht, size_t n)
+/*
+ * Call with ht->lock and all bucket locks held.
+ *
+ * Creating the @new map here would add unnecessary delay while all the locks
+ * are held--holding up the bucket locks is particularly bad, since no writes
+ * can occur while these are held. Thus, we let callers create the new map,
+ * hopefully without the bucket locks held.
+ */
+static void qht_do_resize(struct qht *ht, struct qht_map *new)
 {
-    struct qht_map *old = ht->map;
-    struct qht_map *new;
+    struct qht_map *old;
 
-    g_assert_cmpuint(n, !=, old->n);
-    new = qht_map_create(n);
-    qht_iter(ht, qht_map_copy, new);
-    qht_map_debug(new);
+    old = ht->map;
+    g_assert_cmpuint(new->n, !=, old->n);
 
-    qht_publish(ht, new);
+    assert(!old->stale);
+    old->stale = true;
+    qht_map_iter__all_locked(ht, old, qht_map_copy, new);
+
+    qht_map_debug__all_locked(new);
+
+    qht_publish__htlocked(ht, new);
     call_rcu1(&old->rcu, qht_map_reclaim);
 }
 
-/* call with an external lock held */
 bool qht_resize(struct qht *ht, size_t n_elems)
 {
-    size_t n = qht_elems_to_buckets(n_elems);
+    struct qht_map *new;
+    bool resized;
+    size_t n;
+
+    n = qht_elems_to_buckets(n_elems);
+    /* allocate the new map first so that we don't hold up ht->lock */
+    new = qht_map_create(n);
+    resized = false;
+
+    qemu_spin_lock(&ht->lock);
+    if (n != ht->map->n) {
+        struct qht_map *old = ht->map;
+
+        qht_map_lock_buckets(old);
+        qht_do_resize(ht, new);
+        qht_map_unlock_buckets(old);
+        resized = true;
+    }
+    qemu_spin_unlock(&ht->lock);
 
-    if (n == ht->map->n) {
-        return false;
+    if (unlikely(!resized)) {
+        qht_map_destroy(new);
+    }
+    return resized;
+}
+
+static __attribute__((noinline)) void qht_grow_maybe(struct qht *ht)
+{
+    struct qht_map *map;
+
+    /*
+     * If the lock is taken it probably means there's an ongoing resize,
+     * so bail out.
+     */
+    if (qemu_spin_trylock(&ht->lock)) {
+        return;
+    }
+    map = ht->map;
+    /* another thread might have just performed the resize we were after */
+    if (qht_map_needs_resize(map)) {
+        struct qht_map *new = qht_map_create(map->n * 2);
+
+        qht_map_lock_buckets(map);
+        qht_do_resize(ht, new);
+        qht_map_unlock_buckets(map);
     }
-    qht_do_resize(ht, n);
-    return true;
+    qemu_spin_unlock(&ht->lock);
 }
 
 /* pass @stats to qht_statistics_destroy() when done */
-- 
2.5.0


* [Qemu-devel] [PATCH v5 14/18] qht: add test program
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (12 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 13/18] qht: support parallel writes Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 15/18] qht: add qht-bench, a performance benchmark Emilio G. Cota
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tests/.gitignore |   1 +
 tests/Makefile   |   6 ++-
 tests/test-qht.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 165 insertions(+), 1 deletion(-)
 create mode 100644 tests/test-qht.c

diff --git a/tests/.gitignore b/tests/.gitignore
index 7c0d156..ffde5d2 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -50,6 +50,7 @@ test-qdev-global-props
 test-qemu-opts
 test-qdist
 test-qga
+test-qht
 test-qmp-commands
 test-qmp-commands.h
 test-qmp-event
diff --git a/tests/Makefile b/tests/Makefile
index a5af20b..8589b11 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -72,6 +72,8 @@ check-unit-y += tests/test-rcu-list$(EXESUF)
 gcov-files-test-rcu-list-y = util/rcu.c
 check-unit-y += tests/test-qdist$(EXESUF)
 gcov-files-test-qdist-y = util/qdist.c
+check-unit-y += tests/test-qht$(EXESUF)
+gcov-files-test-qht-y = util/qht.c
 check-unit-y += tests/test-bitops$(EXESUF)
 check-unit-$(CONFIG_HAS_GLIB_SUBPROCESS_TESTS) += tests/test-qdev-global-props$(EXESUF)
 check-unit-y += tests/check-qom-interface$(EXESUF)
@@ -395,7 +397,8 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
 	tests/test-x86-cpuid.o tests/test-mul64.o tests/test-int128.o \
 	tests/test-opts-visitor.o tests/test-qmp-event.o \
 	tests/rcutorture.o tests/test-rcu-list.o \
-	tests/test-qdist.o
+	tests/test-qdist.o \
+	tests/test-qht.o
 
 $(test-obj-y): QEMU_INCLUDES += -Itests
 QEMU_CFLAGS += -I$(SRC_PATH)/tests
@@ -435,6 +438,7 @@ tests/test-int128$(EXESUF): tests/test-int128.o
 tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
 tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
 tests/test-qdist$(EXESUF): tests/test-qdist.o $(test-util-obj-y)
+tests/test-qht$(EXESUF): tests/test-qht.o $(test-util-obj-y)
 
 tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
 	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
diff --git a/tests/test-qht.c b/tests/test-qht.c
new file mode 100644
index 0000000..c8eb930
--- /dev/null
+++ b/tests/test-qht.c
@@ -0,0 +1,159 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include <glib.h>
+#include "qemu/qht.h"
+
+#define N 5000
+
+static struct qht ht;
+static int32_t arr[N * 2];
+
+static bool is_equal(const void *obj, const void *userp)
+{
+    const int32_t *a = obj;
+    const int32_t *b = userp;
+
+    return *a == *b;
+}
+
+static void insert(int a, int b)
+{
+    int i;
+
+    for (i = a; i < b; i++) {
+        uint32_t hash;
+
+        arr[i] = i;
+        hash = i;
+
+        qht_insert(&ht, &arr[i], hash);
+    }
+}
+
+static void rm(int init, int end)
+{
+    int i;
+
+    for (i = init; i < end; i++) {
+        uint32_t hash;
+
+        hash = arr[i];
+        g_assert_true(qht_remove(&ht, &arr[i], hash));
+    }
+}
+
+static void check(int a, int b, bool expected)
+{
+    struct qht_stats stats;
+    int i;
+
+    for (i = a; i < b; i++) {
+        void *p;
+        uint32_t hash;
+        int32_t val;
+
+        val = i;
+        hash = i;
+        p = qht_lookup(&ht, is_equal, &val, hash);
+        g_assert_true(!!p == expected);
+    }
+    qht_statistics_init(&ht, &stats);
+    if (stats.used_head_buckets) {
+        g_assert_cmpfloat(qdist_avg(&stats.chain), >=, 1.0);
+    }
+    g_assert_cmpuint(stats.head_buckets, >, 0);
+    qht_statistics_destroy(&stats);
+}
+
+static void count_func(struct qht *ht, void *p, uint32_t hash, void *userp)
+{
+    unsigned int *curr = userp;
+
+    (*curr)++;
+}
+
+static void check_n(size_t expected)
+{
+    struct qht_stats stats;
+
+    qht_statistics_init(&ht, &stats);
+    g_assert_cmpuint(stats.entries, ==, expected);
+    qht_statistics_destroy(&stats);
+}
+
+static void iter_check(unsigned int count)
+{
+    unsigned int curr = 0;
+
+    qht_iter(&ht, count_func, &curr);
+    g_assert_cmpuint(curr, ==, count);
+}
+
+static void qht_do_test(unsigned int mode, size_t init_entries)
+{
+    qht_init(&ht, init_entries, mode);
+
+    insert(0, N);
+    check(0, N, true);
+    check_n(N);
+    check(-N, -1, false);
+    iter_check(N);
+
+    rm(101, 102);
+    check_n(N - 1);
+    insert(N, N * 2);
+    check_n(N + N - 1);
+    rm(N, N * 2);
+    check_n(N - 1);
+    insert(101, 102);
+    check_n(N);
+
+    rm(10, 200);
+    check_n(N - 190);
+    insert(150, 200);
+    check_n(N - 190 + 50);
+    insert(10, 150);
+    check_n(N);
+
+    rm(1, 2);
+    check_n(N - 1);
+    qht_reset_size(&ht, 0);
+    check_n(0);
+    check(0, N, false);
+
+    qht_destroy(&ht);
+}
+
+static void qht_test(unsigned int mode)
+{
+    qht_do_test(mode, 0);
+    qht_do_test(mode, 1);
+    qht_do_test(mode, 2);
+    qht_do_test(mode, 8);
+    qht_do_test(mode, 16);
+    qht_do_test(mode, 8192);
+    qht_do_test(mode, 16384);
+}
+
+static void test_default(void)
+{
+    qht_test(0);
+}
+
+static void test_resize(void)
+{
+    qht_test(QHT_MODE_AUTO_RESIZE);
+}
+
+int main(int argc, char *argv[])
+{
+    g_test_init(&argc, &argv, NULL);
+    g_test_add_func("/qht/mode/default", test_default);
+    g_test_add_func("/qht/mode/resize", test_resize);
+    return g_test_run();
+}
-- 
2.5.0


* [Qemu-devel] [PATCH v5 15/18] qht: add qht-bench, a performance benchmark
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (13 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 14/18] qht: add test program Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 16/18] qht: add test-qht-par to invoke qht-bench from 'check' target Emilio G. Cota
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

This serves as a performance benchmark as well as a stress test
for QHT. We can tweak quite a number of things, including the
number of resize threads and how frequently resizes are triggered.
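
The heart of such a benchmark can be pictured roughly as below. This is only
a hedged sketch of how an update rate maps onto qht operations, not the
actual qht-bench implementation (whose thread setup, RNG and command-line
knobs live in the file added by this patch); random(), the key array and
the comparison callback are stand-ins:

  #include "qemu/qht.h"

  /* mix lookups and updates over a fixed key set, per @update_pct */
  static void benchmark_loop(struct qht *ht, qht_lookup_func_t cmp,
                             uint64_t *keys, size_t n_keys,
                             unsigned int update_pct, size_t n_ops)
  {
      size_t i;

      for (i = 0; i < n_ops; i++) {
          uint64_t *key = &keys[random() % n_keys];
          uint32_t hash = (uint32_t)*key;    /* placeholder hash */

          if (random() % 100 < update_pct) {
              /* an "update" toggles the key's presence in the table */
              if (!qht_remove(ht, key, hash)) {
                  qht_insert(ht, key, hash);
              }
          } else {
              qht_lookup(ht, cmp, key, hash);
          }
      }
  }

Toggling presence on updates keeps the table at a roughly constant fill, so
runs with different update rates remain comparable.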

A performance comparison of QHT vs CLHT[1] and ck_hs[2] using
this same benchmark program can be found here:
  http://imgur.com/a/0Bms4

The tests are run on a 64-core AMD Opteron 6376.

Note that ck_hs's performance drops significantly as writes go
up, since it requires an external lock (I used a ck_spinlock)
around every write.

Also, note that CLHT, instead of using a seqlock, relies on an
allocator that never returns the same address during the same
read-critical section. This gives it a slight performance
advantage over QHT on read-heavy workloads, since it avoids the
seqlock's write-side overhead.

[1] CLHT: https://github.com/LPD-EPFL/CLHT
          https://infoscience.epfl.ch/record/207109/files/ascy_asplos15.pdf

[2] ck_hs: http://concurrencykit.org/
           http://backtrace.io/blog/blog/2015/03/13/workload-specialization/

A few of those plots are reproduced in text here, since that site
might not be online forever. The Y axis shows throughput in Mops/s.

                             200K keys, 0 % updates

  450 ++--+------+------+-------+-------+-------+-------+------+-------+--++
      |   +      +      +       +       +       +       +      +      +N+  |
  400 ++                                                           ---+E+ ++
      |                                                       +++----      |
  350 ++          9 ++------+------++                       --+E+    -+H+ ++
      |             |      +H+-     |                 -+N+----   ---- +++  |
  300 ++          8 ++     +E+     ++             -----+E+  --+H+         ++
      |             |      +++      |         -+N+-----+H+--               |
  250 ++          7 ++------+------++  +++-----+E+----                    ++
  200 ++                    1         -+E+-----+H+                        ++
      |                           ----                     qht +-E--+      |
  150 ++                      -+E+                        clht +-H--+     ++
      |                   ----                              ck +-N--+      |
  100 ++               +E+                                                ++
      |            ----                                                    |
   50 ++       -+E+                                                       ++
      |   +E+E+  +      +       +       +       +       +      +       +   |
    0 ++--E------+------+-------+-------+-------+-------+------+-------+--++
          1      8      16      24      32      40      48     56      64
                                Number of threads

                             200K keys, 1 % updates

  350 ++--+------+------+-------+-------+-------+-------+------+-------+--++
      |   +      +      +       +       +       +       +      +     -+E+  |
  300 ++                                                         -----+H+ ++
      |                                                       +E+--        |
      |           9 ++------+------++                  +++----             |
  250 ++            |      +E+   -- |                 -+E+                ++
      |           8 ++         --  ++             ----                     |
  200 ++            |      +++-     |  +++  ---+E+                        ++
      |           7 ++------N------++ -+E+--               qht +-E--+      |
      |                     1  +++----                    clht +-H--+      |
  150 ++                      -+E+                          ck +-N--+     ++
      |                   ----                                             |
  100 ++               +E+                                                ++
      |            ----                                                    |
      |        -+E+                                                        |
   50 ++    +H+-+N+----+N+-----+N+------                                  ++
      |   +E+E+  +      +       +      +N+-----+N+-----+N+----+N+-----+N+  |
    0 ++--E------+------+-------+-------+-------+-------+------+-------+--++
          1      8      16      24      32      40      48     56      64
                                Number of threads

                             200K keys, 20 % updates

  300 ++--+------+------+-------+-------+-------+-------+------+-------+--++
      |   +      +      +       +       +       +       +      +       +   |
      |                                                              -+H+  |
  250 ++                                                         ----     ++
      |           9 ++------+------++                       --+H+  ---+E+  |
      |           8 ++     +H+--   ++                 -+H+----+E+--        |
  200 ++            |      +E+    --|             -----+E+--  +++         ++
      |           7 ++      + ---- ++       ---+H+---- +++ qht +-E--+      |
  150 ++          6 ++------N------++ -+H+-----+E+        clht +-H--+     ++
      |                     1     -----+E+--                ck +-N--+      |
      |                       -+H+----                                     |
  100 ++                  -----+E+                                        ++
      |                +E+--                                               |
      |            ----+++                                                 |
   50 ++       -+E+                                                       ++
      |     +E+ +++                                                        |
      |   +E+N+-+N+-----+       +       +       +       +      +       +   |
    0 ++--E------+------N-------N-------N-------N-------N------N-------N--++
          1      8      16      24      32      40      48     56      64
                                Number of threads

                            200K keys, 100 % updates       qht +-E--+
                                                          clht +-H--+
  160 ++--+------+------+-------+-------+-------+-------+---ck-+-N-----+--++
      |   +      +      +       +       +       +       +      +   ----H   |
  140 ++                                                      +H+--  -+E+ ++
      |                                                +++----   ----      |
  120 ++          8 ++------+------++                 -+H+    +E+         ++
      |           7 ++     +H+---- ++             ---- +++----             |
  100 ++            |      +E+      |  +++  ---+H+    -+E+                ++
      |           6 ++     +++     ++ -+H+--   +++----                     |
   80 ++          5 ++------N----------+E+-----+E+                        ++
      |                     1 -+H+---- +++                                 |
      |                   -----+E+                                         |
   60 ++               +H+---- +++                                        ++
      |            ----+E+                                                 |
   40 ++        +H+----                                                   ++
      |       --+E+                                                        |
   20 ++    +E+                                                           ++
      |  +EE+    +      +       +       +       +       +      +       +   |
    0 ++--+N-N---N------N-------N-------N-------N-------N------N-------N--++
          1      8      16      24      32      40      48     56      64
                                Number of threads
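
For reference, a run similar to the "200K keys, 20 % updates" sweep
above can be reproduced with an invocation along the lines of

    tests/qht-bench -d 10 -n 32 -u 20 -k 200000 -K 200000 -l 200000 \
                    -r 200000 -R

using the flags defined in qht-bench.c below (key ranges are rounded up
to a power of two). The exact parameters behind the plots are not
recorded here, so treat these values as illustrative.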

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tests/.gitignore  |   1 +
 tests/Makefile    |   3 +-
 tests/qht-bench.c | 473 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 476 insertions(+), 1 deletion(-)
 create mode 100644 tests/qht-bench.c

diff --git a/tests/.gitignore b/tests/.gitignore
index ffde5d2..d19023e 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -7,6 +7,7 @@ check-qnull
 check-qstring
 check-qom-interface
 check-qom-proplist
+qht-bench
 rcutorture
 test-aio
 test-base64
diff --git a/tests/Makefile b/tests/Makefile
index 8589b11..176bbd8 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -398,7 +398,7 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
 	tests/test-opts-visitor.o tests/test-qmp-event.o \
 	tests/rcutorture.o tests/test-rcu-list.o \
 	tests/test-qdist.o \
-	tests/test-qht.o
+	tests/test-qht.o tests/qht-bench.o
 
 $(test-obj-y): QEMU_INCLUDES += -Itests
 QEMU_CFLAGS += -I$(SRC_PATH)/tests
@@ -439,6 +439,7 @@ tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
 tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
 tests/test-qdist$(EXESUF): tests/test-qdist.o $(test-util-obj-y)
 tests/test-qht$(EXESUF): tests/test-qht.o $(test-util-obj-y)
+tests/qht-bench$(EXESUF): tests/qht-bench.o $(test-util-obj-y)
 
 tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
 	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
diff --git a/tests/qht-bench.c b/tests/qht-bench.c
new file mode 100644
index 0000000..d86688d
--- /dev/null
+++ b/tests/qht-bench.c
@@ -0,0 +1,473 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include <glib.h>
+#include "qemu/processor.h"
+#include "qemu/atomic.h"
+#include "qemu/qht.h"
+#include "exec/tb-hash-xx.h"
+
+struct thread_stats {
+    size_t rd;
+    size_t not_rd;
+    size_t in;
+    size_t not_in;
+    size_t rm;
+    size_t not_rm;
+    size_t rz;
+    size_t not_rz;
+};
+
+struct thread_info {
+    void (*func)(struct thread_info *);
+    struct thread_stats stats;
+    uint64_t r;
+    bool write_op; /* writes alternate between insertions and removals */
+    bool resize_down;
+} QEMU_ALIGNED(64); /* avoid false sharing among threads */
+
+static struct qht ht;
+static QemuThread *rw_threads;
+
+#define DEFAULT_RANGE (4096)
+#define DEFAULT_QHT_N_ELEMS DEFAULT_RANGE
+
+static unsigned int duration = 1;
+static unsigned int n_rw_threads = 1;
+static unsigned long lookup_range = DEFAULT_RANGE;
+static unsigned long update_range = DEFAULT_RANGE;
+static size_t init_range = DEFAULT_RANGE;
+static size_t init_size = DEFAULT_RANGE;
+static long populate_offset;
+static long *keys;
+
+static size_t resize_min;
+static size_t resize_max;
+static struct thread_info *rz_info;
+static unsigned long resize_delay = 1000;
+static double resize_rate; /* 0.0 to 1.0 */
+static unsigned int n_rz_threads = 1;
+static QemuThread *rz_threads;
+
+static double update_rate; /* 0.0 to 1.0 */
+static uint64_t update_threshold;
+static uint64_t resize_threshold;
+
+static size_t qht_n_elems = DEFAULT_QHT_N_ELEMS;
+static int qht_mode;
+
+static bool test_start;
+static bool test_stop;
+
+static struct thread_info *rw_info;
+
+static const char commands_string[] =
+    " -d = duration, in seconds\n"
+    " -n = number of threads\n"
+    "\n"
+    " -k = initial number of keys\n"
+    " -o = offset at which keys start\n"
+    " -K = initial range of keys (will be rounded up to pow2)\n"
+    " -l = lookup range of keys (will be rounded up to pow2)\n"
+    " -r = update range of keys (will be rounded up to pow2)\n"
+    "\n"
+    " -u = update rate (0.0 to 100.0), 50/50 split of insertions/removals\n"
+    "\n"
+    " -s = initial size hint\n"
+    " -R = enable auto-resize\n"
+    " -S = resize rate (0.0 to 100.0)\n"
+    " -D = delay (in us) between potential resizes\n"
+    " -N = number of resize threads";
+
+static void usage_complete(int argc, char *argv[])
+{
+    fprintf(stderr, "Usage: %s [options]\n", argv[0]);
+    fprintf(stderr, "options:\n%s\n", commands_string);
+    exit(-1);
+}
+
+static bool is_equal(const void *obj, const void *userp)
+{
+    const long *a = obj;
+    const long *b = userp;
+
+    return *a == *b;
+}
+
+static inline uint32_t h(unsigned long v)
+{
+    return tb_hash_func5(v, 0, 0);
+}
+
+/*
+ * From: https://en.wikipedia.org/wiki/Xorshift
+ * This is faster than rand_r(), and gives us a wider range (RAND_MAX is only
+ * guaranteed to be >= 32767).
+ */
+static uint64_t xorshift64star(uint64_t x)
+{
+    x ^= x >> 12; /* a */
+    x ^= x << 25; /* b */
+    x ^= x >> 27; /* c */
+    return x * UINT64_C(2685821657736338717);
+}
+
+static void do_rz(struct thread_info *info)
+{
+    struct thread_stats *stats = &info->stats;
+
+    if (info->r < resize_threshold) {
+        size_t size = info->resize_down ? resize_min : resize_max;
+        bool resized;
+
+        resized = qht_resize(&ht, size);
+        info->resize_down = !info->resize_down;
+
+        if (resized) {
+            stats->rz++;
+        } else {
+            stats->not_rz++;
+        }
+    }
+    g_usleep(resize_delay);
+}
+
+static void do_rw(struct thread_info *info)
+{
+    struct thread_stats *stats = &info->stats;
+    uint32_t hash;
+    long *p;
+
+    if (info->r >= update_threshold) {
+        bool read;
+
+        p = &keys[info->r & (lookup_range - 1)];
+        hash = h(*p);
+        read = qht_lookup(&ht, is_equal, p, hash);
+        if (read) {
+            stats->rd++;
+        } else {
+            stats->not_rd++;
+        }
+    } else {
+        p = &keys[info->r & (update_range - 1)];
+        hash = h(*p);
+        if (info->write_op) {
+            bool written = false;
+
+            if (qht_lookup(&ht, is_equal, p, hash) == NULL) {
+                written = qht_insert(&ht, p, hash);
+            }
+            if (written) {
+                stats->in++;
+            } else {
+                stats->not_in++;
+            }
+        } else {
+            bool removed = false;
+
+            if (qht_lookup(&ht, is_equal, p, hash)) {
+                removed = qht_remove(&ht, p, hash);
+            }
+            if (removed) {
+                stats->rm++;
+            } else {
+                stats->not_rm++;
+            }
+        }
+        info->write_op = !info->write_op;
+    }
+}
+
+static void *thread_func(void *p)
+{
+    struct thread_info *info = p;
+
+    while (!atomic_mb_read(&test_start)) {
+        cpu_relax();
+    }
+
+    rcu_register_thread();
+
+    rcu_read_lock();
+    while (!atomic_read(&test_stop)) {
+        info->r = xorshift64star(info->r);
+        info->func(info);
+    }
+    rcu_read_unlock();
+
+    rcu_unregister_thread();
+    return NULL;
+}
+
+/* sets everything except info->func */
+static void prepare_thread_info(struct thread_info *info, int i)
+{
+    /* seed for the RNG; each thread should have a different one */
+    info->r = (i + 1) ^ time(NULL);
+    /* the first update will be a write */
+    info->write_op = true;
+    /* the first resize will be down */
+    info->resize_down = true;
+
+    memset(&info->stats, 0, sizeof(info->stats));
+}
+
+static void
+th_create_n(QemuThread **threads, struct thread_info **infos, const char *name,
+            void (*func)(struct thread_info *), int offset, int n)
+{
+    struct thread_info *info;
+    QemuThread *th;
+    int i;
+
+    th = g_malloc(sizeof(*th) * n);
+    *threads = th;
+
+    info = qemu_memalign(64, sizeof(*info) * n);
+    *infos = info;
+
+    for (i = 0; i < n; i++) {
+        prepare_thread_info(&info[i], i);
+        info[i].func = func;
+        qemu_thread_create(&th[i], name, thread_func, &info[i],
+                           QEMU_THREAD_JOINABLE);
+    }
+}
+
+static void create_threads(void)
+{
+    th_create_n(&rw_threads, &rw_info, "rw", do_rw, 0, n_rw_threads);
+    th_create_n(&rz_threads, &rz_info, "rz", do_rz, n_rw_threads, n_rz_threads);
+}
+
+static void pr_params(void)
+{
+    printf("Parameters:\n");
+    printf(" duration:          %d s\n", duration);
+    printf(" # of threads:      %u\n", n_rw_threads);
+    printf(" initial # of keys: %zu\n", init_size);
+    printf(" initial size hint: %zu\n", qht_n_elems);
+    printf(" auto-resize:       %s\n",
+           qht_mode & QHT_MODE_AUTO_RESIZE ? "on" : "off");
+    if (resize_rate) {
+        printf(" resize_rate:       %f%%\n", resize_rate * 100.0);
+        printf(" resize range:      %zu-%zu\n", resize_min, resize_max);
+        printf(" # resize threads   %u\n", n_rz_threads);
+    }
+    printf(" update rate:       %f%%\n", update_rate * 100.0);
+    printf(" offset:            %ld\n", populate_offset);
+    printf(" initial key range: %zu\n", init_range);
+    printf(" lookup range:      %zu\n", lookup_range);
+    printf(" update range:      %zu\n", update_range);
+}
+
+static void do_threshold(double rate, uint64_t *threshold)
+{
+    if (rate == 1.0) {
+        *threshold = UINT64_MAX;
+    } else {
+        *threshold = rate * UINT64_MAX;
+    }
+}
+
+static void htable_init(void)
+{
+    unsigned long n = MAX(init_range, update_range);
+    uint64_t r = time(NULL);
+    size_t retries = 0;
+    size_t i;
+
+    /* avoid allocating memory later by allocating all the keys now */
+    keys = g_malloc(sizeof(*keys) * n);
+    for (i = 0; i < n; i++) {
+        keys[i] = populate_offset + i;
+    }
+
+    /* some sanity checks */
+    g_assert_cmpuint(lookup_range, <=, n);
+
+    /* compute thresholds */
+    do_threshold(update_rate, &update_threshold);
+    do_threshold(resize_rate, &resize_threshold);
+
+    if (resize_rate) {
+        resize_min = n / 2;
+        resize_max = n;
+        assert(resize_min < resize_max);
+    } else {
+        n_rz_threads = 0;
+    }
+
+    /* initialize the hash table */
+    qht_init(&ht, qht_n_elems, qht_mode);
+    assert(init_size <= init_range);
+
+    pr_params();
+
+    fprintf(stderr, "Initialization: populating %zu items...", init_size);
+    for (i = 0; i < init_size; i++) {
+        for (;;) {
+            uint32_t hash;
+            long *p;
+
+            r = xorshift64star(r);
+            p = &keys[r & (init_range - 1)];
+            hash = h(*p);
+            if (qht_insert(&ht, p, hash)) {
+                break;
+            }
+            retries++;
+        }
+    }
+    fprintf(stderr, " populated after %zu retries\n", retries);
+}
+
+static void add_stats(struct thread_stats *s, struct thread_info *info, int n)
+{
+    int i;
+
+    for (i = 0; i < n; i++) {
+        struct thread_stats *stats = &info[i].stats;
+
+        s->rd += stats->rd;
+        s->not_rd += stats->not_rd;
+
+        s->in += stats->in;
+        s->not_in += stats->not_in;
+
+        s->rm += stats->rm;
+        s->not_rm += stats->not_rm;
+
+        s->rz += stats->rz;
+        s->not_rz += stats->not_rz;
+    }
+}
+
+static void pr_stats(void)
+{
+    struct thread_stats s = {};
+    double tx;
+
+    add_stats(&s, rw_info, n_rw_threads);
+    add_stats(&s, rz_info, n_rz_threads);
+
+    printf("Results:\n");
+
+    if (resize_rate) {
+        printf(" Resizes:           %zu (%.2f%% of %zu)\n",
+               s.rz, (double)s.rz / (s.rz + s.not_rz) * 100, s.rz + s.not_rz);
+    }
+
+    printf(" Read:              %.2f M (%.2f%% of %.2fM)\n",
+           (double)s.rd / 1e6,
+           (double)s.rd / (s.rd + s.not_rd) * 100,
+           (double)(s.rd + s.not_rd) / 1e6);
+    printf(" Inserted:          %.2f M (%.2f%% of %.2fM)\n",
+           (double)s.in / 1e6,
+           (double)s.in / (s.in + s.not_in) * 100,
+           (double)(s.in + s.not_in) / 1e6);
+    printf(" Removed:           %.2f M (%.2f%% of %.2fM)\n",
+           (double)s.rm / 1e6,
+           (double)s.rm / (s.rm + s.not_rm) * 100,
+           (double)(s.rm + s.not_rm) / 1e6);
+
+    tx = (s.rd + s.not_rd + s.in + s.not_in + s.rm + s.not_rm) / 1e6 / duration;
+    printf(" Throughput:        %.2f MT/s\n", tx);
+    printf(" Throughput/thread: %.2f MT/s/thread\n", tx / n_rw_threads);
+}
+
+static void run_test(void)
+{
+    unsigned int remaining;
+    int i;
+
+    atomic_mb_set(&test_start, true);
+    do {
+        remaining = sleep(duration);
+    } while (remaining);
+    atomic_mb_set(&test_stop, true);
+
+    for (i = 0; i < n_rw_threads; i++) {
+        qemu_thread_join(&rw_threads[i]);
+    }
+    for (i = 0; i < n_rz_threads; i++) {
+        qemu_thread_join(&rz_threads[i]);
+    }
+}
+
+static void parse_args(int argc, char *argv[])
+{
+    int c;
+
+    for (;;) {
+        c = getopt(argc, argv, "d:D:k:K:l:hn:N:o:r:Rs:S:u:");
+        if (c < 0) {
+            break;
+        }
+        switch (c) {
+        case 'd':
+            duration = atoi(optarg);
+            break;
+        case 'D':
+            resize_delay = atol(optarg);
+            break;
+        case 'h':
+            usage_complete(argc, argv);
+            exit(0);
+        case 'k':
+            init_size = atol(optarg);
+            break;
+        case 'K':
+            init_range = pow2ceil(atol(optarg));
+            break;
+        case 'l':
+            lookup_range = pow2ceil(atol(optarg));
+            break;
+        case 'n':
+            n_rw_threads = atoi(optarg);
+            break;
+        case 'N':
+            n_rz_threads = atoi(optarg);
+            break;
+        case 'o':
+            populate_offset = atol(optarg);
+            break;
+        case 'r':
+            update_range = pow2ceil(atol(optarg));
+            break;
+        case 'R':
+            qht_mode |= QHT_MODE_AUTO_RESIZE;
+            break;
+        case 's':
+            qht_n_elems = atol(optarg);
+            break;
+        case 'S':
+            resize_rate = atof(optarg) / 100.0;
+            if (resize_rate > 1.0) {
+                resize_rate = 1.0;
+            }
+            break;
+        case 'u':
+            update_rate = atof(optarg) / 100.0;
+            if (update_rate > 1.0) {
+                update_rate = 1.0;
+            }
+            break;
+        }
+    }
+}
+
+int main(int argc, char *argv[])
+{
+    parse_args(argc, argv);
+    htable_init();
+    create_threads();
+    run_test();
+    pr_stats();
+    return 0;
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [Qemu-devel] [PATCH v5 16/18] qht: add test-qht-par to invoke qht-bench from 'check' target
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (14 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 15/18] qht: add qht-bench, a performance benchmark Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 17/18] tb hash: track translated blocks with qht Emilio G. Cota
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tests/.gitignore     |  1 +
 tests/Makefile       |  5 ++++-
 tests/test-qht-par.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 1 deletion(-)
 create mode 100644 tests/test-qht-par.c

diff --git a/tests/.gitignore b/tests/.gitignore
index d19023e..840ea39 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -52,6 +52,7 @@ test-qemu-opts
 test-qdist
 test-qga
 test-qht
+test-qht-par
 test-qmp-commands
 test-qmp-commands.h
 test-qmp-event
diff --git a/tests/Makefile b/tests/Makefile
index 176bbd8..b4e4e21 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -74,6 +74,8 @@ check-unit-y += tests/test-qdist$(EXESUF)
 gcov-files-test-qdist-y = util/qdist.c
 check-unit-y += tests/test-qht$(EXESUF)
 gcov-files-test-qht-y = util/qht.c
+check-unit-y += tests/test-qht-par$(EXESUF)
+gcov-files-test-qht-par-y = util/qht.c
 check-unit-y += tests/test-bitops$(EXESUF)
 check-unit-$(CONFIG_HAS_GLIB_SUBPROCESS_TESTS) += tests/test-qdev-global-props$(EXESUF)
 check-unit-y += tests/check-qom-interface$(EXESUF)
@@ -398,7 +400,7 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
 	tests/test-opts-visitor.o tests/test-qmp-event.o \
 	tests/rcutorture.o tests/test-rcu-list.o \
 	tests/test-qdist.o \
-	tests/test-qht.o tests/qht-bench.o
+	tests/test-qht.o tests/qht-bench.o tests/test-qht-par.o
 
 $(test-obj-y): QEMU_INCLUDES += -Itests
 QEMU_CFLAGS += -I$(SRC_PATH)/tests
@@ -439,6 +441,7 @@ tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
 tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
 tests/test-qdist$(EXESUF): tests/test-qdist.o $(test-util-obj-y)
 tests/test-qht$(EXESUF): tests/test-qht.o $(test-util-obj-y)
+tests/test-qht-par$(EXESUF): tests/test-qht-par.o tests/qht-bench$(EXESUF) $(test-util-obj-y)
 tests/qht-bench$(EXESUF): tests/qht-bench.o $(test-util-obj-y)
 
 tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
diff --git a/tests/test-qht-par.c b/tests/test-qht-par.c
new file mode 100644
index 0000000..fc0cb23
--- /dev/null
+++ b/tests/test-qht-par.c
@@ -0,0 +1,56 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include <glib.h>
+
+#define TEST_QHT_STRING "tests/qht-bench 1>/dev/null 2>&1 -R -S0.1 -D10000 -N1"
+
+static void test_qht(int n_threads, int update_rate, int duration)
+{
+    char *str;
+    int rc;
+
+    str = g_strdup_printf(TEST_QHT_STRING " -n %d -u %d -d %d",
+                          n_threads, update_rate, duration);
+    rc = system(str);
+    g_free(str);
+    g_assert_cmpint(rc, ==, 0);
+}
+
+static void test_2th0u1s(void)
+{
+    test_qht(2, 0, 1);
+}
+
+static void test_2th20u1s(void)
+{
+    test_qht(2, 20, 1);
+}
+
+static void test_2th0u5s(void)
+{
+    test_qht(2, 0, 5);
+}
+
+static void test_2th20u5s(void)
+{
+    test_qht(2, 20, 5);
+}
+
+int main(int argc, char *argv[])
+{
+    g_test_init(&argc, &argv, NULL);
+
+    if (g_test_quick()) {
+        g_test_add_func("/qht/parallel/2threads-0%updates-1s", test_2th0u1s);
+        g_test_add_func("/qht/parallel/2threads-20%updates-1s", test_2th20u1s);
+    } else {
+        g_test_add_func("/qht/parallel/2threads-0%updates-5s", test_2th0u5s);
+        g_test_add_func("/qht/parallel/2threads-20%updates-5s", test_2th20u5s);
+    }
+    return g_test_run();
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [Qemu-devel] [PATCH v5 17/18] tb hash: track translated blocks with qht
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (15 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 16/18] qht: add test-qht-par to invoke qht-bench from 'check' target Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 18/18] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
  2016-05-23 22:26 ` [Qemu-devel] [PATCH v5 00/18] tb hash improvements Sergey Fedorov
  18 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

Having a fixed-size hash table for keeping track of all translation blocks
is suboptimal: some workloads are just too big or too small to get maximum
performance from the hash table. The MRU promotion policy helps improve
performance when the hash table is a little undersized, but it cannot
make up for severely undersized hash tables.

Furthermore, frequent MRU promotions result in writes that are a scalability
bottleneck. For scalability, lookups should only perform reads, not writes.
This is not a big deal for now, but it will become one once MTTCG matures.

The appended patch fixes these issues by using qht as the implementation of
the TB hash table. This solution is superior to other alternatives considered,
namely:

- master: implementation in QEMU before this patchset
- xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
- xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
              MRU is implemented here by adding an intermediate struct
              (sketched after this list) that contains the u32 hash and
              a pointer to the TB; on an MRU promotion, we copy that
              struct into a new element placed at the head of the list.
              After a grace period, the original non-head struct can be
              eliminated, and after another grace period, freed.
- qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
                   no MRU for lookups; MRU for inserts.
The appended solution is the following:
- qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
                 no MRU for lookups; MRU for inserts.
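
For clarity, the intermediate element that xxhash-rcu interposes
between the bucket list and the TB would look roughly as follows; this
is an illustrative sketch only (the struct is not part of this series,
and the field names are made up):

    #include <stdint.h>

    struct TranslationBlock;           /* the TB itself, defined elsewhere */

    /* MRU promotion copies one of these into a fresh element placed at
     * the head, so concurrent RCU readers never see a node being
     * modified in place. */
    struct tb_hash_elem {
        uint32_t hash;                 /* full 32-bit xxhash of the TB */
        struct TranslationBlock *tb;   /* the translated block */
        struct tb_hash_elem *next;     /* RCU-protected bucket list */
    };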

The plots below compare the considered solutions. The Y axis shows the
boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
sweeps the number of buckets (or initial number of buckets for qht-autoresize).
The plots in PNG format (and with errorbars) can be seen here:
  http://imgur.com/a/Awgnq

Each test runs 5 times, and the entire QEMU process is pinned to a
single core for repeatability of results.

                            Host: Intel Xeon E5-2690

  28 ++------------+-------------+-------------+-------------+------------++
     A*****        +             +             +             master **A*** +
  27 ++    *                                                 xxhash ##B###++
     |      A******A******                               xxhash-rcu $$C$$$ |
  26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
     D%%$$                              A******A******A*qht-dyn-mru A*E****A
  25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
     B#####%                                                               |
  24 ++    #C$$$$$                                                        ++
     |      B###  $                                                        |
     |          ## C$$$$$$                                                 |
  23 ++           #       C$$$$$$                                         ++
     |             B######       C$$$$$$                                %%%D
  22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
     |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
  21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
     +             E@@@   F&&&   +      E@     +      F&&&   +             +
  20 ++------------+-------------+-------------+-------------+------------++
     14            16            18            20            22            24
                             log2 number of buckets

                                 Host: Intel i7-4790K

  14.5 ++------------+------------+-------------+------------+------------++
       A**           +            +             +            master **A*** +
    14 ++ **                                                 xxhash ##B###++
  13.5 ++   **                                           xxhash-rcu $$C$$$++
       |                                            qht-fixed-nomru %%D%%% |
    13 ++     A******                                   qht-dyn-mru @@E@@@++
       |             A*****A******A******             qht-dyn-nomru &&F&&& |
  12.5 C$$                               A******A******A*****A******    ***A
    12 ++ $$                                                        A***  ++
       D%%% $$                                                             |
  11.5 ++  %%                                                             ++
       B###  %C$$$$$$                                                      |
    11 ++  ## D%%%%% C$$$$$                                               ++
       |     #      %      C$$$$$$                                         |
  10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
    10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
       +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
   9.5 ++------------+------------+-------------+------------+------------++
       14            16           18            20           22            24
                              log2 number of buckets

Note that the original point before this patch series is X=15 for "master";
the little sensitivity to the increased number of buckets is due to the
poor hashing function in master.

xxhash-rcu has significant overhead due to the constant churn of allocating
and deallocating intermediate structs for implementing MRU. An alternative
would be to consider failed lookups as "maybe not there", and then
acquire the external lock (tb_lock in this case) to really confirm that
there was indeed a failed lookup. This, however, would not be enough
to implement dynamic resizing--this is more complex: see
"Resizable, Scalable, Concurrent Hash Tables via Relativistic
Programming" by Triplett, McKenney and Walpole. This solution was
discarded due to the very coarse RCU read critical sections that we have
in MTTCG; resizing requires waiting for readers after every pointer update,
and resizes require many pointer updates, so this would quickly become
prohibitive.

qht-fixed-nomru shows that MRU promotion is advisable for undersized
hash tables.

However, qht-dyn-mru shows that MRU promotion is not important if the
hash table is properly sized: there is virtually no difference in
performance between qht-dyn-nomru and qht-dyn-mru.

Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
can achieve with optimum sizing of the hash table, while keeping the hash
table scalable for readers.

The improvement we get before and after this patch for booting debian jessie
with arm-softmmu is:

- Intel Xeon E5-2690: 10.5% less time
- Intel i7-4790K: 5.2% less time

We could get this same improvement _for this particular workload_ by
statically increasing the size of the hash table. But this would hurt
workloads that do not need a large hash table. The dynamic (upward)
resizing allows us to start small and enlarge the hash table as needed.

A quick note on downsizing: the table is resized back to 2**15 buckets
on every tb_flush, since there is no guarantee that the table will reach
the same number of TBs later on (e.g. most bootup code is thrown away
after boot); it is therefore better to start small again and grow the
hash table as more code blocks are translated. This also avoids the
complication of building downsizing hysteresis logic into qht.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c              | 86 ++++++++++++++++++++++++-------------------------
 include/exec/exec-all.h |  9 +++---
 include/exec/tb-hash.h  |  3 +-
 translate-all.c         | 85 ++++++++++++++++++++++--------------------------
 4 files changed, 86 insertions(+), 97 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 1735032..6a2350d 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -224,57 +224,57 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
 }
 #endif
 
+struct tb_desc {
+    target_ulong pc;
+    target_ulong cs_base;
+    CPUArchState *env;
+    tb_page_addr_t phys_page1;
+    uint32_t flags;
+};
+
+static bool tb_cmp(const void *p, const void *d)
+{
+    const TranslationBlock *tb = p;
+    const struct tb_desc *desc = d;
+
+    if (tb->pc == desc->pc &&
+        tb->page_addr[0] == desc->phys_page1 &&
+        tb->cs_base == desc->cs_base &&
+        tb->flags == desc->flags) {
+        /* check next page if needed */
+        if (tb->page_addr[1] == -1) {
+            return true;
+        } else {
+            tb_page_addr_t phys_page2;
+            target_ulong virt_page2;
+
+            virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
+            phys_page2 = get_page_addr_code(desc->env, virt_page2);
+            if (tb->page_addr[1] == phys_page2) {
+                return true;
+            }
+        }
+    }
+    return false;
+}
+
 static TranslationBlock *tb_find_physical(CPUState *cpu,
                                           target_ulong pc,
                                           target_ulong cs_base,
                                           uint32_t flags)
 {
-    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
-    TranslationBlock *tb, **tb_hash_head, **ptb1;
+    tb_page_addr_t phys_pc;
+    struct tb_desc desc;
     uint32_t h;
-    tb_page_addr_t phys_pc, phys_page1;
 
-    /* find translated block using physical mappings */
-    phys_pc = get_page_addr_code(env, pc);
-    phys_page1 = phys_pc & TARGET_PAGE_MASK;
+    desc.env = (CPUArchState *)cpu->env_ptr;
+    desc.cs_base = cs_base;
+    desc.flags = flags;
+    desc.pc = pc;
+    phys_pc = get_page_addr_code(desc.env, pc);
+    desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
     h = tb_hash_func(phys_pc, pc, flags);
-
-    /* Start at head of the hash entry */
-    ptb1 = tb_hash_head = &tcg_ctx.tb_ctx.tb_phys_hash[h];
-    tb = *ptb1;
-
-    while (tb) {
-        if (tb->pc == pc &&
-            tb->page_addr[0] == phys_page1 &&
-            tb->cs_base == cs_base &&
-            tb->flags == flags) {
-
-            if (tb->page_addr[1] == -1) {
-                /* done, we have a match */
-                break;
-            } else {
-                /* check next page if needed */
-                target_ulong virt_page2 = (pc & TARGET_PAGE_MASK) +
-                                          TARGET_PAGE_SIZE;
-                tb_page_addr_t phys_page2 = get_page_addr_code(env, virt_page2);
-
-                if (tb->page_addr[1] == phys_page2) {
-                    break;
-                }
-            }
-        }
-
-        ptb1 = &tb->phys_hash_next;
-        tb = *ptb1;
-    }
-
-    if (tb) {
-        /* Move the TB to the head of the list */
-        *ptb1 = tb->phys_hash_next;
-        tb->phys_hash_next = *tb_hash_head;
-        *tb_hash_head = tb;
-    }
-    return tb;
+    return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
 }
 
 static TranslationBlock *tb_find_slow(CPUState *cpu,
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 85528f9..68e73b6 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -21,6 +21,7 @@
 #define _EXEC_ALL_H_
 
 #include "qemu-common.h"
+#include "qemu/qht.h"
 
 /* allow to see translation results - the slowdown should be negligible, so we leave it */
 #define DEBUG_DISAS
@@ -212,8 +213,8 @@ static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
 
 #define CODE_GEN_ALIGN           16 /* must be >= of the size of a icache line */
 
-#define CODE_GEN_PHYS_HASH_BITS     15
-#define CODE_GEN_PHYS_HASH_SIZE     (1 << CODE_GEN_PHYS_HASH_BITS)
+#define CODE_GEN_HTABLE_BITS     15
+#define CODE_GEN_HTABLE_SIZE     (1 << CODE_GEN_HTABLE_BITS)
 
 /* Estimated block size for TB allocation.  */
 /* ??? The following is based on a 2015 survey of x86_64 host output.
@@ -250,8 +251,6 @@ struct TranslationBlock {
 
     void *tc_ptr;    /* pointer to the translated code */
     uint8_t *tc_search;  /* pointer to search data */
-    /* next matching tb for physical address. */
-    struct TranslationBlock *phys_hash_next;
     /* original tb when cflags has CF_NOCACHE */
     struct TranslationBlock *orig_tb;
     /* first and second physical page containing code. The lower bit
@@ -296,7 +295,7 @@ typedef struct TBContext TBContext;
 struct TBContext {
 
     TranslationBlock *tbs;
-    TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
+    struct qht htable;
     int nb_tbs;
     /* any access to the tbs or the page table must use this lock */
     QemuMutex tb_lock;
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 4b9635a..d274357 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -20,7 +20,6 @@
 #ifndef EXEC_TB_HASH
 #define EXEC_TB_HASH
 
-#include "exec/exec-all.h"
 #include "exec/tb-hash-xx.h"
 
 /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
@@ -49,7 +48,7 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
 static inline
 uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
 {
-    return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
+    return tb_hash_func5(phys_pc, pc, flags);
 }
 
 #endif
diff --git a/translate-all.c b/translate-all.c
index c48fccb..5357737 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -734,6 +734,13 @@ static inline void code_gen_alloc(size_t tb_size)
     qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
 }
 
+static void tb_htable_init(void)
+{
+    unsigned int mode = QHT_MODE_AUTO_RESIZE;
+
+    qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
+}
+
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
    (in bytes) allocated to the translation buffer. Zero means default
    size. */
@@ -741,6 +748,7 @@ void tcg_exec_init(unsigned long tb_size)
 {
     cpu_gen_init();
     page_init();
+    tb_htable_init();
     code_gen_alloc(tb_size);
 #if defined(CONFIG_SOFTMMU)
     /* There's no guest base to take into account, so go ahead and
@@ -845,7 +853,7 @@ void tb_flush(CPUState *cpu)
         cpu->tb_flushed = true;
     }
 
-    memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
+    qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
     tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
@@ -856,60 +864,46 @@ void tb_flush(CPUState *cpu)
 
 #ifdef DEBUG_TB_CHECK
 
-static void tb_invalidate_check(target_ulong address)
+static void
+do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 {
-    TranslationBlock *tb;
-    int i;
+    TranslationBlock *tb = p;
+    target_ulong addr = *(target_ulong *)userp;
 
-    address &= TARGET_PAGE_MASK;
-    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
-        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
-             tb = tb->phys_hash_next) {
-            if (!(address + TARGET_PAGE_SIZE <= tb->pc ||
-                  address >= tb->pc + tb->size)) {
-                printf("ERROR invalidate: address=" TARGET_FMT_lx
-                       " PC=%08lx size=%04x\n",
-                       address, (long)tb->pc, tb->size);
-            }
-        }
+    if (!(addr + TARGET_PAGE_SIZE <= tb->pc || addr >= tb->pc + tb->size)) {
+        printf("ERROR invalidate: address=" TARGET_FMT_lx
+               " PC=%08lx size=%04x\n", addr, (long)tb->pc, tb->size);
     }
 }
 
-/* verify that all the pages have correct rights for code */
-static void tb_page_check(void)
+static void tb_invalidate_check(target_ulong address)
 {
-    TranslationBlock *tb;
-    int i, flags1, flags2;
-
-    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
-        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
-                tb = tb->phys_hash_next) {
-            flags1 = page_get_flags(tb->pc);
-            flags2 = page_get_flags(tb->pc + tb->size - 1);
-            if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
-                printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
-                       (long)tb->pc, tb->size, flags1, flags2);
-            }
-        }
-    }
+    address &= TARGET_PAGE_MASK;
+    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address);
 }
 
-#endif
-
-static inline void tb_hash_remove(TranslationBlock **ptb, TranslationBlock *tb)
+static void
+do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 {
-    TranslationBlock *tb1;
+    TranslationBlock *tb = p;
+    int flags1, flags2;
 
-    for (;;) {
-        tb1 = *ptb;
-        if (tb1 == tb) {
-            *ptb = tb1->phys_hash_next;
-            break;
-        }
-        ptb = &tb1->phys_hash_next;
+    flags1 = page_get_flags(tb->pc);
+    flags2 = page_get_flags(tb->pc + tb->size - 1);
+    if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
+        printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
+               (long)tb->pc, tb->size, flags1, flags2);
     }
 }
 
+/* verify that all the pages have correct rights for code */
+static void tb_page_check(void)
+{
+    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
+}
+
+#endif
+
 static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
 {
     TranslationBlock *tb1;
@@ -997,7 +991,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
     h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
+    qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
 
     /* remove the TB from the page list */
     if (tb->page_addr[0] != page_addr) {
@@ -1127,13 +1121,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                          tb_page_addr_t phys_page2)
 {
     uint32_t h;
-    TranslationBlock **ptb;
 
     /* add in the hash table */
     h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
-    tb->phys_hash_next = *ptb;
-    *ptb = tb;
+    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
 
     /* add in the page list */
     tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* [Qemu-devel] [PATCH v5 18/18] translate-all: add tb hash bucket info to 'info jit' dump
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (16 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 17/18] tb hash: track translated blocks with qht Emilio G. Cota
@ 2016-05-14  3:34 ` Emilio G. Cota
  2016-05-23 22:26 ` [Qemu-devel] [PATCH v5 00/18] tb hash improvements Sergey Fedorov
  18 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-14  3:34 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Sergey Fedorov

Examples:

- Good hashing, i.e. tb_hash_func5(phys_pc, pc, flags):
TB count            715135/2684354
[...]
TB hash buckets     388775/524288 (74.15% head buckets used)
TB hash occupancy   33.04% avg chain occ. Histogram: [0,10)%|▆ █  ▅▁▃▁▁|[90,100]%
TB hash avg chain   1.017 buckets. Histogram: 1|█▁▁|3

- Not-so-good hashing, i.e. tb_hash_func5(phys_pc, pc, 0):
TB count            712636/2684354
[...]
TB hash buckets     344924/524288 (65.79% head buckets used)
TB hash occupancy   31.64% avg chain occ. Histogram: [0,10)%|█ ▆  ▅▁▃▁▂|[90,100]%
TB hash avg chain   1.047 buckets. Histogram: 1|█▁▁▁|4

- Bad hashing, i.e. tb_hash_func5(phys_pc, 0, 0):
TB count            702818/2684354
[...]
TB hash buckets     112741/524288 (21.50% head buckets used)
TB hash occupancy   10.15% avg chain occ. Histogram: [0,10)%|█ ▁  ▁▁▁▁▁|[90,100]%
TB hash avg chain   2.107 buckets. Histogram: [1.0,10.2)|█▁▁▁▁▁▁▁▁▁|[83.8,93.0]

- Good hashing, but no auto-resize:
TB count            715634/2684354
TB hash buckets     8192/8192 (100.00% head buckets used)
TB hash occupancy   98.30% avg chain occ. Histogram: [95.3,95.8)%|▁▁▃▄▃▄▁▇▁█|[99.5,100.0]%
TB hash avg chain   22.070 buckets. Histogram: [15.0,16.7)|▁▂▅▄█▅▁▁▁▁|[30.3,32.0]

Suggested-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 translate-all.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/translate-all.c b/translate-all.c
index 5357737..c8074cf 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1667,6 +1667,10 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     int i, target_code_size, max_target_code_size;
     int direct_jmp_count, direct_jmp2_count, cross_page;
     TranslationBlock *tb;
+    struct qht_stats hst;
+    uint32_t hgram_opts;
+    size_t hgram_bins;
+    char *hgram;
 
     target_code_size = 0;
     max_target_code_size = 0;
@@ -1717,6 +1721,38 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
                 direct_jmp2_count,
                 tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp2_count * 100) /
                         tcg_ctx.tb_ctx.nb_tbs : 0);
+
+    qht_statistics_init(&tcg_ctx.tb_ctx.htable, &hst);
+
+    cpu_fprintf(f, "TB hash buckets     %zu/%zu (%0.2f%% head buckets used)\n",
+                hst.used_head_buckets, hst.head_buckets,
+                (double)hst.used_head_buckets / hst.head_buckets * 100);
+
+    hgram_opts =  QDIST_PR_BORDER | QDIST_PR_LABELS;
+    hgram_opts |= QDIST_PR_100X   | QDIST_PR_PERCENT;
+    if (qdist_xmax(&hst.occupancy) - qdist_xmin(&hst.occupancy) == 1) {
+        hgram_opts |= QDIST_PR_NODECIMAL;
+    }
+    hgram = qdist_pr(&hst.occupancy, 10, hgram_opts);
+    cpu_fprintf(f, "TB hash occupancy   %0.2f%% avg chain occ. Histogram: %s\n",
+                qdist_avg(&hst.occupancy) * 100, hgram);
+    g_free(hgram);
+
+    hgram_opts = QDIST_PR_BORDER | QDIST_PR_LABELS;
+    hgram_bins = qdist_xmax(&hst.chain) - qdist_xmin(&hst.chain);
+    if (hgram_bins > 10) {
+        hgram_bins = 10;
+    } else {
+        hgram_bins = 0;
+        hgram_opts |= QDIST_PR_NODECIMAL | QDIST_PR_NOBINRANGE;
+    }
+    hgram = qdist_pr(&hst.chain, hgram_bins, hgram_opts);
+    cpu_fprintf(f, "TB hash avg chain   %0.3f buckets. Histogram: %s\n",
+                qdist_avg(&hst.chain), hgram);
+    g_free(hgram);
+
+    qht_statistics_destroy(&hst);
+
     cpu_fprintf(f, "\nStatistics:\n");
     cpu_fprintf(f, "TB flush count      %d\n", tcg_ctx.tb_ctx.tb_flush_count);
     cpu_fprintf(f, "TB invalidate count %d\n",
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 06/18] atomics: add atomic_read_acquire and atomic_set_release
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 06/18] atomics: add atomic_read_acquire and atomic_set_release Emilio G. Cota
@ 2016-05-15 10:22   ` Pranith Kumar
  2016-05-16 18:27     ` Emilio G. Cota
  2016-05-17 16:53   ` Sergey Fedorov
  1 sibling, 1 reply; 79+ messages in thread
From: Pranith Kumar @ 2016-05-15 10:22 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson, Sergey Fedorov

Hi Emilio,

On Fri, May 13, 2016 at 11:34 PM, Emilio G. Cota <cota@braap.org> wrote:
> When __atomic is not available, we use full memory barriers instead
> of smp/wmb, since acquire/release barriers apply to all memory
> operations and not just to loads/stores, respectively.
>
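
Concretely, the fallback described above amounts to something like the
following (an illustrative sketch only; the actual patch may differ in
detail):

    /* Sketch of the no-__atomic fallback: pair the plain access with a
     * full barrier, since smp_rmb()/smp_wmb() would only order loads or
     * stores, not all memory operations. */
    #define atomic_read_acquire(ptr)                  \
        ({                                            \
            typeof(*ptr) _val = atomic_read(ptr);     \
            smp_mb();                                 \
            _val;                                     \
        })

    #define atomic_set_release(ptr, i)  do {          \
            smp_mb();                                 \
            atomic_set(ptr, i);                       \
        } while (0)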

If it is not too late, can we rename this to
atomic_load_acquire()/atomic_store_release() like in the linux kernel?
Looks good either way.

Reviewed-by: Pranith Kumar <bobby.prani@gmail.com>

-- 
Pranith

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 05/18] atomics: add atomic_test_and_set_acquire
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 05/18] atomics: add atomic_test_and_set_acquire Emilio G. Cota
@ 2016-05-16 10:05   ` Paolo Bonzini
  2016-05-17 16:15   ` Sergey Fedorov
  1 sibling, 0 replies; 79+ messages in thread
From: Paolo Bonzini @ 2016-05-16 10:05 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Sergey Fedorov, Richard Henderson, Alex Bennée, Peter Crosthwaite



On 14/05/2016 05:34, Emilio G. Cota wrote:
> This new helper expands to __atomic_test_and_set with acquire semantics
> where available; otherwise it expands to __sync_test_and_set, which
> has acquire semantics.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Non-seqcst read-modify-write operations are beyond what I expected to
have in qemu/atomic.h, but I guess it's okay for test-and-set because of
__sync_test_and_set.

Paolo

> ---
>  include/qemu/atomic.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
> index 5bc4d6c..6061a46 100644
> --- a/include/qemu/atomic.h
> +++ b/include/qemu/atomic.h
> @@ -113,6 +113,7 @@
>  } while(0)
>  #endif
>  
> +#define atomic_test_and_set_acquire(ptr) __atomic_test_and_set(ptr, __ATOMIC_ACQUIRE)
>  
>  /* All the remaining operations are fully sequentially consistent */
>  
> @@ -327,6 +328,8 @@
>  #endif
>  #endif
>  
> +#define atomic_test_and_set_acquire(ptr) __sync_lock_test_and_set(ptr, true)
> +
>  /* Provide shorter names for GCC atomic builtins.  */
>  #define atomic_fetch_inc(ptr)  __sync_fetch_and_add(ptr, 1)
>  #define atomic_fetch_dec(ptr)  __sync_fetch_and_add(ptr, -1)
> 
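
As a usage illustration, the kind of lock-acquire fast path this
primitive enables looks roughly like this (a sketch only, not code from
this series):

    #include <stdbool.h>

    /* Illustrative test-and-test-and-set lock built on the new macro. */
    typedef struct {
        bool locked;
    } spinlock_t;

    static inline void spin_lock(spinlock_t *s)
    {
        while (atomic_test_and_set_acquire(&s->locked)) {
            /* spin read-only until the holder clears the flag */
            while (atomic_read(&s->locked)) {
                cpu_relax();
            }
        }
    }

The matching unlock would be a release store of false, e.g. via the
atomic_set_release() added by the previous patch.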

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 06/18] atomics: add atomic_read_acquire and atomic_set_release
  2016-05-15 10:22   ` Pranith Kumar
@ 2016-05-16 18:27     ` Emilio G. Cota
  0 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-16 18:27 UTC (permalink / raw)
  To: Pranith Kumar
  Cc: MTTCG Devel, Peter Crosthwaite, QEMU Developers, Sergey Fedorov,
	Paolo Bonzini, Alex Bennée, Richard Henderson

On Sun, May 15, 2016 at 06:22:36 -0400, Pranith Kumar wrote:
> On Fri, May 13, 2016 at 11:34 PM, Emilio G. Cota <cota@braap.org> wrote:
> > When __atomic is not available, we use full memory barriers instead
> > of smp/wmb, since acquire/release barriers apply to all memory
> > operations and not just to loads/stores, respectively.
> 
> If it is not too late can we rename this to
> atomic_load_acquire()/atomic_store_release() like in the linux kernel?

I'd keep read/set just for consistency with the rest of the file.

BTW in the kernel, atomic_{read/set}_{acquire/release} are defined
in include/linux/atomic.h:

    #ifndef atomic_read_acquire
    #define  atomic_read_acquire(v)         smp_load_acquire(&(v)->counter)
    #endif

    #ifndef atomic_set_release
    #define  atomic_set_release(v, i)       smp_store_release(&(v)->counter, (i))
    #endif

The smp_load/store variants are called much more frequently, though.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 05/18] atomics: add atomic_test_and_set_acquire
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 05/18] atomics: add atomic_test_and_set_acquire Emilio G. Cota
  2016-05-16 10:05   ` Paolo Bonzini
@ 2016-05-17 16:15   ` Sergey Fedorov
  2016-05-17 16:23     ` Paolo Bonzini
  1 sibling, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-17 16:15 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite, Richard Henderson

On 14/05/16 06:34, Emilio G. Cota wrote:
> This new helper expands to __atomic_test_and_set with acquire semantics
> where available; otherwise it expands to __sync_test_and_set, which
> has acquire semantics.

Why not also add atomic_clear_release() for completeness?

Kind regards,
Sergey

>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/atomic.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
> index 5bc4d6c..6061a46 100644
> --- a/include/qemu/atomic.h
> +++ b/include/qemu/atomic.h
> @@ -113,6 +113,7 @@
>  } while(0)
>  #endif
>  
> +#define atomic_test_and_set_acquire(ptr) __atomic_test_and_set(ptr, __ATOMIC_ACQUIRE)
>  
>  /* All the remaining operations are fully sequentially consistent */
>  
> @@ -327,6 +328,8 @@
>  #endif
>  #endif
>  
> +#define atomic_test_and_set_acquire(ptr) __sync_lock_test_and_set(ptr, true)
> +
>  /* Provide shorter names for GCC atomic builtins.  */
>  #define atomic_fetch_inc(ptr)  __sync_fetch_and_add(ptr, 1)
>  #define atomic_fetch_dec(ptr)  __sync_fetch_and_add(ptr, -1)

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 05/18] atomics: add atomic_test_and_set_acquire
  2016-05-17 16:15   ` Sergey Fedorov
@ 2016-05-17 16:23     ` Paolo Bonzini
  2016-05-17 16:47       ` Sergey Fedorov
  0 siblings, 1 reply; 79+ messages in thread
From: Paolo Bonzini @ 2016-05-17 16:23 UTC (permalink / raw)
  To: Sergey Fedorov, Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Peter Crosthwaite, Richard Henderson



On 17/05/2016 18:15, Sergey Fedorov wrote:
> On 14/05/16 06:34, Emilio G. Cota wrote:
>> This new helper expands to __atomic_test_and_set with acquire semantics
>> where available; otherwise it expands to __sync_test_and_set, which
>> has acquire semantics.
> 
> Why don't also add atomic_clear_release() for completeness?

The previous patch adds atomic_set_release.

Paolo

> Kind regards,
> Sergey
> 
>>
>> Signed-off-by: Emilio G. Cota <cota@braap.org>
>> ---
>>  include/qemu/atomic.h | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
>> index 5bc4d6c..6061a46 100644
>> --- a/include/qemu/atomic.h
>> +++ b/include/qemu/atomic.h
>> @@ -113,6 +113,7 @@
>>  } while(0)
>>  #endif
>>  
>> +#define atomic_test_and_set_acquire(ptr) __atomic_test_and_set(ptr, __ATOMIC_ACQUIRE)
>>  
>>  /* All the remaining operations are fully sequentially consistent */
>>  
>> @@ -327,6 +328,8 @@
>>  #endif
>>  #endif
>>  
>> +#define atomic_test_and_set_acquire(ptr) __sync_lock_test_and_set(ptr, true)
>> +
>>  /* Provide shorter names for GCC atomic builtins.  */
>>  #define atomic_fetch_inc(ptr)  __sync_fetch_and_add(ptr, 1)
>>  #define atomic_fetch_dec(ptr)  __sync_fetch_and_add(ptr, -1)
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 05/18] atomics: add atomic_test_and_set_acquire
  2016-05-17 16:23     ` Paolo Bonzini
@ 2016-05-17 16:47       ` Sergey Fedorov
  2016-05-17 17:08         ` Paolo Bonzini
  0 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-17 16:47 UTC (permalink / raw)
  To: Paolo Bonzini, Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Peter Crosthwaite, Richard Henderson

On 17/05/16 19:23, Paolo Bonzini wrote:
>
> On 17/05/2016 18:15, Sergey Fedorov wrote:
>> On 14/05/16 06:34, Emilio G. Cota wrote:
>>> This new helper expands to __atomic_test_and_set with acquire semantics
>>> where available; otherwise it expands to __sync_test_and_set, which
>>> has acquire semantics.
>> Why don't also add atomic_clear_release() for completeness?
> The previous patch adds atomic_set_release.

Yes, but it would take advantage of __sync_lock_release() being just a
release barrier, rather than the full smp_mb() barrier used in
atomic_set_release(). That's only the case for the legacy __sync
built-ins (before GCC 4.7.0), though.

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 06/18] atomics: add atomic_read_acquire and atomic_set_release
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 06/18] atomics: add atomic_read_acquire and atomic_set_release Emilio G. Cota
  2016-05-15 10:22   ` Pranith Kumar
@ 2016-05-17 16:53   ` Sergey Fedorov
  2016-05-17 17:08     ` Paolo Bonzini
  1 sibling, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-17 16:53 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite, Richard Henderson

On 14/05/16 06:34, Emilio G. Cota wrote:
> When __atomic is not available, we use full memory barriers instead
> of smp/wmb, since acquire/release barriers apply to all memory
> operations and not just to loads/stores, respectively.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/atomic.h | 27 +++++++++++++++++++++++++++
>

Update docs/atomics.txt? (The same for the previous patch.)

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v5 05/18] atomics: add atomic_test_and_set_acquire
  2016-05-17 16:47       ` Sergey Fedorov
@ 2016-05-17 17:08         ` Paolo Bonzini
  0 siblings, 0 replies; 79+ messages in thread
From: Paolo Bonzini @ 2016-05-17 17:08 UTC (permalink / raw)
  To: Sergey Fedorov, Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Peter Crosthwaite, Richard Henderson



On 17/05/2016 18:47, Sergey Fedorov wrote:
>>> >> On 14/05/16 06:34, Emilio G. Cota wrote:
>>>> >>> This new helper expands to __atomic_test_and_set with acquire semantics
>>>> >>> where available; otherwise it expands to __sync_test_and_set, which
>>>> >>> has acquire semantics.
>>> >> Why don't also add atomic_clear_release() for completeness?
>> > The previous patch adds atomic_set_release.
> Yes, but it would take the advantage of __sync_lock_release() being just
> a release barrier rather than a full barrier of smp_mb() in
> atomic_set_release(). But that's only the case for legacy __sync
> built-ins (before GCC 4.7.0), though.

That will be fixed soon by adding smp_mb_acquire() and smp_mb_release(). :)

Paolo


* Re: [Qemu-devel] [PATCH v5 06/18] atomics: add atomic_read_acquire and atomic_set_release
  2016-05-17 16:53   ` Sergey Fedorov
@ 2016-05-17 17:08     ` Paolo Bonzini
  0 siblings, 0 replies; 79+ messages in thread
From: Paolo Bonzini @ 2016-05-17 17:08 UTC (permalink / raw)
  To: Sergey Fedorov, Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Peter Crosthwaite, Richard Henderson



On 17/05/2016 18:53, Sergey Fedorov wrote:
> On 14/05/16 06:34, Emilio G. Cota wrote:
>> > When __atomic is not available, we use full memory barriers instead
>> > of smp/wmb, since acquire/release barriers apply to all memory
>> > operations and not just to loads/stores, respectively.
>> >
>> > Signed-off-by: Emilio G. Cota <cota@braap.org>
>> > ---
>> >  include/qemu/atomic.h | 27 +++++++++++++++++++++++++++
>> >
> Update docs/atomics.txt? (The same for the previous patch.)

I'm okay with doing this separately.

Paolo


* Re: [Qemu-devel] [PATCH v5 08/18] exec: add tb_hash_func5, derived from xxhash
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 08/18] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
@ 2016-05-17 17:22   ` Sergey Fedorov
  2016-05-17 19:48     ` Emilio G. Cota
  0 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-17 17:22 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite, Richard Henderson

On 14/05/16 06:34, Emilio G. Cota wrote:
> This will be used by upcoming changes for hashing the tb hash.
>
> Add this into a separate file to include the copyright notice from
> xxhash.
>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/exec/tb-hash-xx.h | 94 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 94 insertions(+)
>  create mode 100644 include/exec/tb-hash-xx.h
>
> diff --git a/include/exec/tb-hash-xx.h b/include/exec/tb-hash-xx.h
> new file mode 100644
> index 0000000..67f4e6f
> --- /dev/null
> +++ b/include/exec/tb-hash-xx.h
> @@ -0,0 +1,94 @@
> +/*
> + * xxHash - Fast Hash algorithm
> + * Copyright (C) 2012-2016, Yann Collet
> + *
> + * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions are
> + * met:
> + *
> + * + Redistributions of source code must retain the above copyright
> + * notice, this list of conditions and the following disclaimer.
> + * + Redistributions in binary form must reproduce the above
> + * copyright notice, this list of conditions and the following disclaimer
> + * in the documentation and/or other materials provided with the
> + * distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + *
> + * You can contact the author at :
> + * - xxHash source repository : https://github.com/Cyan4973/xxHash
> + */
> +#ifndef EXEC_TB_HASH_XX
> +#define EXEC_TB_HASH_XX
> +
> +#include <qemu/bitops.h>
> +
> +#define PRIME32_1   2654435761U
> +#define PRIME32_2   2246822519U
> +#define PRIME32_3   3266489917U
> +#define PRIME32_4    668265263U
> +#define PRIME32_5    374761393U
> +
> +#define TB_HASH_XX_SEED 1
> +
> +/*
> + * xxhash32, customized for input variables that are not guaranteed to be
> + * contiguous in memory.
> + */
> +static inline
> +uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e)
> +{
> +    uint32_t v1 = TB_HASH_XX_SEED + PRIME32_1 + PRIME32_2;
> +    uint32_t v2 = TB_HASH_XX_SEED + PRIME32_2;
> +    uint32_t v3 = TB_HASH_XX_SEED + 0;
> +    uint32_t v4 = TB_HASH_XX_SEED - PRIME32_1;
> +    uint32_t a = a0 >> 31 >> 1;

I'm wondering if there's something special forcing us to make ">> 31
>>1" instead of just ">> 32" on uint64_t?

Kind regards,
Sergey

> +    uint32_t b = a0;
> +    uint32_t c = b0 >> 31 >> 1;
> +    uint32_t d = b0;
> +    uint32_t h32;
> +
> +    v1 += a * PRIME32_2;
> +    v1 = rol32(v1, 13);
> +    v1 *= PRIME32_1;
> +
> +    v2 += b * PRIME32_2;
> +    v2 = rol32(v2, 13);
> +    v2 *= PRIME32_1;
> +
> +    v3 += c * PRIME32_2;
> +    v3 = rol32(v3, 13);
> +    v3 *= PRIME32_1;
> +
> +    v4 += d * PRIME32_2;
> +    v4 = rol32(v4, 13);
> +    v4 *= PRIME32_1;
> +
> +    h32 = rol32(v1, 1) + rol32(v2, 7) + rol32(v3, 12) + rol32(v4, 18);
> +    h32 += 20;
> +
> +    h32 += e * PRIME32_3;
> +    h32  = rol32(h32, 17) * PRIME32_4;
> +
> +    h32 ^= h32 >> 15;
> +    h32 *= PRIME32_2;
> +    h32 ^= h32 >> 13;
> +    h32 *= PRIME32_3;
> +    h32 ^= h32 >> 16;
> +
> +    return h32;
> +}
> +
> +#endif /* EXEC_TB_HASH_XX */


* Re: [Qemu-devel] [PATCH v5 09/18] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 09/18] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
@ 2016-05-17 17:47   ` Sergey Fedorov
  2016-05-17 19:09     ` Emilio G. Cota
  0 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-17 17:47 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite, Richard Henderson

On 14/05/16 06:34, Emilio G. Cota wrote:
> For some workloads such as arm bootup, tb_phys_hash is performance-critical.
> This is due to the high frequency of accesses to the hash table, originated
> by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
> More info:
>   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html
>
> To dig further into this I modified an arm image booting debian jessie to
> immediately shut down after boot. Analysis revealed that quite a bit of time
> is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
> results in very uneven loading of chains in the hash table's buckets;
> the longest observed chain had ~550 elements.
>
> The appended addresses this with two changes:

Does "the appended" means "this patch"? Sorry, I've just never seen such
expression before...

>
> 1) Use xxhash as the hash table's hash function. xxhash is a fast,
>    high-quality hashing function.
>
> 2) Feed the hashing function with not just tb_phys, but also pc and flags.
>
> This improves performance over using just tb_phys for hashing, since that
> resulted in some hash buckets having many TB's, while others getting very few;
> with these changes, the longest observed chain on a single hash bucket is
> brought down from ~550 to ~40.
>
> Tests show that the other element checked for in tb_find_physical,
> cs_base, is always a match when tb_phys+pc+flags are a match,
> so hashing cs_base is wasteful. It could be that this is an ARM-only
> thing, though. UPDATE:
> On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
>> The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
>> consisting of only a delay slot).
>> It may well still turn out to be reasonable to ignore cs_base for hashing.
> BTW, after this change the hash table should not be called "tb_hash_phys"
> anymore; this is addressed later in this series.
>
> This change gives consistent bootup time improvements. I tested two
> host machines:
> - Intel Xeon E5-2690: 11.6% less time
> - Intel i7-4790K: 19.2% less time
>
> Increasing the number of hash buckets yields further improvements. However,
> using a larger, fixed number of buckets can degrade performance for other
> workloads that do not translate as many blocks (600K+ for debian-jessie arm
> bootup). This is dealt with later in this series.
>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cpu-exec.c             |  4 ++--
>  include/exec/tb-hash.h |  8 ++++++--
>  translate-all.c        | 10 +++++-----
>  3 files changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 14df1aa..1735032 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -231,13 +231,13 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
>  {
>      CPUArchState *env = (CPUArchState *)cpu->env_ptr;
>      TranslationBlock *tb, **tb_hash_head, **ptb1;
> -    unsigned int h;
> +    uint32_t h;
>      tb_page_addr_t phys_pc, phys_page1;
>  
>      /* find translated block using physical mappings */
>      phys_pc = get_page_addr_code(env, pc);
>      phys_page1 = phys_pc & TARGET_PAGE_MASK;
> -    h = tb_phys_hash_func(phys_pc);
> +    h = tb_hash_func(phys_pc, pc, flags);
>  
>      /* Start at head of the hash entry */
>      ptb1 = tb_hash_head = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
> index 0f4e8a0..4b9635a 100644
> --- a/include/exec/tb-hash.h
> +++ b/include/exec/tb-hash.h
> @@ -20,6 +20,9 @@
>  #ifndef EXEC_TB_HASH
>  #define EXEC_TB_HASH
>  
> +#include "exec/exec-all.h"
> +#include "exec/tb-hash-xx.h"
> +
>  /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
>     addresses on the same page.  The top bits are the same.  This allows
>     TLB invalidation to quickly clear a subset of the hash table.  */
> @@ -43,9 +46,10 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
>             | (tmp & TB_JMP_ADDR_MASK));
>  }
>  
> -static inline unsigned int tb_phys_hash_func(tb_page_addr_t pc)
> +static inline
> +uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)

Nitpicking: now 'flags' is of uint32_t type.

>  {
> -    return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
> +    return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
>  }
>  
>  #endif
> diff --git a/translate-all.c b/translate-all.c
> index b54f472..c48fccb 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -991,12 +991,12 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>  {
>      CPUState *cpu;
>      PageDesc *p;
> -    unsigned int h;
> +    uint32_t h;
>      tb_page_addr_t phys_pc;
>  
>      /* remove the TB from the hash list */
>      phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
> -    h = tb_phys_hash_func(phys_pc);
> +    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
>      tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
>  
>      /* remove the TB from the page list */
> @@ -1126,11 +1126,11 @@ static inline void tb_alloc_page(TranslationBlock *tb,
>  static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>                           tb_page_addr_t phys_page2)
>  {
> -    unsigned int h;
> +    uint32_t h;
>      TranslationBlock **ptb;
>  
> -    /* add in the physical hash table */
> -    h = tb_phys_hash_func(phys_pc);
> +    /* add in the hash table */
> +    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
>      ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
>      tb->phys_hash_next = *ptb;
>      *ptb = tb;

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v5 09/18] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-05-17 17:47   ` Sergey Fedorov
@ 2016-05-17 19:09     ` Emilio G. Cota
  0 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-17 19:09 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On Tue, May 17, 2016 at 20:47:52 +0300, Sergey Fedorov wrote:
> On 14/05/16 06:34, Emilio G. Cota wrote:
> > For some workloads such as arm bootup, tb_phys_hash is performance-critical.
> > The is due to the high frequency of accesses to the hash table, originated
> > by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
> > More info:
> >   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html
> >
> > To dig further into this I modified an arm image booting debian jessie to
> > immediately shut down after boot. Analysis revealed that quite a bit of time
> > is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
> > results in very uneven loading of chains in the hash table's buckets;
> > the longest observed chain had ~550 elements.
> >
> > The appended addresses this with two changes:
> 
> Does "the appended" means "this patch"? Sorry, I've just never seen such
> expression before...

Yes, in this context a patch is _appended_ to the (long-ish) discussion.

(snip)
> > -static inline unsigned int tb_phys_hash_func(tb_page_addr_t pc)
> > +static inline
> > +uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
> 
> Nitpicking: now 'flags' is of uint32_t type.

I've changed this in my tree -- thanks!

		Emilio


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
       [not found]   ` <573B5134.8060104@gmail.com>
@ 2016-05-17 19:19     ` Richard Henderson
  2016-05-17 19:57       ` Sergey Fedorov
  2016-05-17 20:04       ` Emilio G. Cota
  2016-05-17 19:38     ` Emilio G. Cota
  1 sibling, 2 replies; 79+ messages in thread
From: Richard Henderson @ 2016-05-17 19:19 UTC (permalink / raw)
  To: Sergey Fedorov, Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite

On 05/17/2016 10:13 AM, Sergey Fedorov wrote:
>> > +static inline void qemu_spin_lock(QemuSpin *spin)
>> > +{
>> > +    while (atomic_test_and_set_acquire(&spin->value)) {
>>From gcc-4.8 info page, node "__atomic Builtins", description of
> __atomic_test_and_set():
> 
>     It should be only used for operands of type 'bool' or 'char'.
> 

Hum.  I thought I remembered all operand sizes there, but I've just re-checked
and you're right about bool (and really only bool).

Perhaps we should just stick with __sync_test_and_set then.  I'm thinking here
of e.g. armv6, a reasonable host, which can't operate on 1 byte atomic values.


r~


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
       [not found]   ` <573B5134.8060104@gmail.com>
  2016-05-17 19:19     ` Richard Henderson
@ 2016-05-17 19:38     ` Emilio G. Cota
  2016-05-17 20:35       ` Sergey Fedorov
  1 sibling, 1 reply; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-17 19:38 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On Tue, May 17, 2016 at 20:13:24 +0300, Sergey Fedorov wrote:
> On 14/05/16 06:34, Emilio G. Cota wrote:
(snip)
> > +static inline void qemu_spin_lock(QemuSpin *spin)
> > +{
> > +    while (atomic_test_and_set_acquire(&spin->value)) {
> 
> From gcc-4.8 info page, node "__atomic Builtins", description of
> __atomic_test_and_set():
> 
>     It should be only used for operands of type 'bool' or 'char'.

Yes, I'm aware of that. The way I interpret it is that if you're
storing something other than an effectively boolean value, you might
be in trouble, since it might get cleared.
We use 'int' but effectively store a bool in it, so we're safe.

As to why we're using int, see
  http://thread.gmane.org/gmane.comp.emulators.qemu/405812/focus=405965

> > +        while (atomic_read(&spin->value)) {
> > +            cpu_relax();
> > +        }
> > +    }
> Looks like relaxed atomic access can be a subject to various
> optimisations according to
> https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync#Relaxed.

The important thing here is that the read actually happens
on every iteration; this is achieved with atomic_read().
Barriers etc. do not matter here because once we exit
the loop, we try to acquire the lock -- and if we succeed,
we then emit the right barrier.
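
For reference, piecing the quoted fragments back together, the locking
path in the patch looks roughly like this (a sketch assuming the QemuSpin
type and the atomic_* / cpu_relax helpers introduced by this series):

    static inline void qemu_spin_lock(QemuSpin *spin)
    {
        /* Try to grab the lock with an acquire-semantics test-and-set... */
        while (atomic_test_and_set_acquire(&spin->value)) {
            /* ...and while it is held, spin on plain reads so the cache
             * line is not hammered by atomic read-modify-write ops;
             * atomic_read() only guarantees the load happens each pass. */
            while (atomic_read(&spin->value)) {
                cpu_relax();
            }
        }
    }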

> > +static inline bool qemu_spin_locked(QemuSpin *spin)
> > +{
> > +    return atomic_read_acquire(&spin->value);
> 
> Why not just atomic_read()?

I think atomic_read() is better, yes. I'll change it. I went
with the fence because I wanted to have at least a caller
of atomic_read_acquire :P

I also hesitated between calling it _locked or _is_locked;
I used _locked for consistency with qemu_mutex_iothread_locked,
although I think _is_locked is a bit clearer:
qemu_spin_locked(foo)
   is a little too similar to
qemu_spin_lock(foo).

Thanks,

		Emilio


* Re: [Qemu-devel] [PATCH v5 08/18] exec: add tb_hash_func5, derived from xxhash
  2016-05-17 17:22   ` Sergey Fedorov
@ 2016-05-17 19:48     ` Emilio G. Cota
  0 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-17 19:48 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On Tue, May 17, 2016 at 20:22:52 +0300, Sergey Fedorov wrote:
> On 14/05/16 06:34, Emilio G. Cota wrote:
(snip)
> > +static inline
> > +uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e)
> > +{
> > +    uint32_t v1 = TB_HASH_XX_SEED + PRIME32_1 + PRIME32_2;
> > +    uint32_t v2 = TB_HASH_XX_SEED + PRIME32_2;
> > +    uint32_t v3 = TB_HASH_XX_SEED + 0;
> > +    uint32_t v4 = TB_HASH_XX_SEED - PRIME32_1;
> > +    uint32_t a = a0 >> 31 >> 1;
> 
> I'm wondering if there's something special forcing us to make ">> 31
> >>1" instead of just ">> 32" on uint64_t?

Not really; it's perfectly fine to do >> 32 since both a0 and b0
are u64's.
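
As a side note (illustration only, not part of the patch): for a
uint64_t operand both spellings are equivalent; the two-step shift is
the defensive form that stays defined even when the operand's type
might be only 32 bits wide, where a plain ">> 32" would be undefined.

    #include <stdint.h>

    /* Both helpers extract the high 32 bits of a 64-bit value. */
    static inline uint32_t high32(uint64_t x)
    {
        return x >> 32;          /* fine here: x is 64 bits wide */
    }

    static inline uint32_t high32_defensive(uint64_t x)
    {
        return x >> 31 >> 1;     /* equivalent; also defined if the
                                  * operand were only 32 bits wide */
    }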

I've changed it in my tree, thanks.

		Emilio


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-17 19:19     ` Richard Henderson
@ 2016-05-17 19:57       ` Sergey Fedorov
  2016-05-17 20:01         ` Sergey Fedorov
  2016-05-17 20:04       ` Emilio G. Cota
  1 sibling, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-17 19:57 UTC (permalink / raw)
  To: Richard Henderson, Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite

On 17/05/16 22:19, Richard Henderson wrote:
> On 05/17/2016 10:13 AM, Sergey Fedorov wrote:
>>>> +static inline void qemu_spin_lock(QemuSpin *spin)
>>>> +{
>>>> +    while (atomic_test_and_set_acquire(&spin->value)) {
>> >From gcc-4.8 info page, node "__atomic Builtins", description of
>> __atomic_test_and_set():
>>
>>     It should be only used for operands of type 'bool' or 'char'.
>>
> Hum.  I thought I remembered all operand sizes there, but I've just re-checked
> and you're right about bool (and really only bool).
>
> Perhaps we should just stick with __sync_test_and_set then.  I'm thinking here
> of e.g. armv6, a reasonable host, which can't operate on 1 byte atomic values.
>

Sorry, reading the ARMv6 ARM I can't see that a 1-byte access can't be atomic.
What I've found:

    B2.4.1 Normal memory attribute
    (snip)
    Shared Normal memory

        (snip)
        ... Reads to Shared Normal Memory that are aligned in memory to
        the size of the access must be atomic.

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-17 19:57       ` Sergey Fedorov
@ 2016-05-17 20:01         ` Sergey Fedorov
  2016-05-17 22:12           ` Richard Henderson
  0 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-17 20:01 UTC (permalink / raw)
  To: Richard Henderson, Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite

On 17/05/16 22:57, Sergey Fedorov wrote:
> On 17/05/16 22:19, Richard Henderson wrote:
>> On 05/17/2016 10:13 AM, Sergey Fedorov wrote:
>>>>> +static inline void qemu_spin_lock(QemuSpin *spin)
>>>>> +{
>>>>> +    while (atomic_test_and_set_acquire(&spin->value)) {
>>> >From gcc-4.8 info page, node "__atomic Builtins", description of
>>> __atomic_test_and_set():
>>>
>>>     It should be only used for operands of type 'bool' or 'char'.
>>>
>> Hum.  I thought I remembered all operand sizes there, but I've just re-checked
>> and you're right about bool (and really only bool).
>>
>> Perhaps we should just stick with __sync_test_and_set then.  I'm thinking here
>> of e.g. armv6, a reasonable host, which can't operate on 1 byte atomic values.
>>
>
> Sorry, I can't see reading ARMv6 ARM that 1-byte access can't be
> atomic. What I've found:
>
>     B2.4.1 Normal memory attribute
>     (snip)
>     Shared Normal memory
>
>         (snip)
>         ... Reads to Shared Normal Memory that are aligned in memory
>         to the size of the access must be atomic.
>
>

Actually, here's the sample code:

    #include <stdbool.h>

    struct foo {
        bool b;
        int i;
    };

    int main(void)
    {
        struct foo f;
        __atomic_store_n(&f.b, 0, __ATOMIC_SEQ_CST);
        __atomic_store_n(&f.i, 0, __ATOMIC_SEQ_CST);
        return 0;
    }


compiles with:

    arm-linux-gnueabi-gcc -march=armv6 -O2 -c a.c

and disasm:

    00000000 <main>:
       0:    e24dd008     sub    sp, sp, #8
       4:    ee070fba     mcr    15, 0, r0, cr7, cr10, {5}
       8:    e3a03000     mov    r3, #0
       c:    e5cd3000     strb    r3, [sp]
      10:    ee070fba     mcr    15, 0, r0, cr7, cr10, {5}
      14:    ee070fba     mcr    15, 0, r0, cr7, cr10, {5}
      18:    e1a00003     mov    r0, r3
      1c:    e58d3004     str    r3, [sp, #4]
      20:    ee070fba     mcr    15, 0, r0, cr7, cr10, {5}
      24:    e28dd008     add    sp, sp, #8
      28:    e12fff1e     bx    lr

Looks like GCC has no trouble generating __atomic_store_n() for 1-byte
bool...

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-17 19:19     ` Richard Henderson
  2016-05-17 19:57       ` Sergey Fedorov
@ 2016-05-17 20:04       ` Emilio G. Cota
  2016-05-17 20:20         ` Sergey Fedorov
  1 sibling, 1 reply; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-17 20:04 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Sergey Fedorov, QEMU Developers, MTTCG Devel, Alex Bennée,
	Paolo Bonzini, Peter Crosthwaite

On Tue, May 17, 2016 at 12:19:27 -0700, Richard Henderson wrote:
> On 05/17/2016 10:13 AM, Sergey Fedorov wrote:
> >> > +static inline void qemu_spin_lock(QemuSpin *spin)
> >> > +{
> >> > +    while (atomic_test_and_set_acquire(&spin->value)) {
> >>From gcc-4.8 info page, node "__atomic Builtins", description of
> > __atomic_test_and_set():
> > 
> >     It should be only used for operands of type 'bool' or 'char'.
> > 
> 
> Hum.  I thought I remembered all operand sizes there, but I've just re-checked
> and you're right about bool (and really only bool).
> 
> Perhaps we should just stick with __sync_test_and_set then.  I'm thinking here
> of e.g. armv6, a reasonable host, which can't operate on 1 byte atomic values.

I like this idea, it gets rid of any guesswork (as in my previous email).
I've changed the patch to:

commit 8f89d36b6203b78df2bf1e3f82871b8aa2ca83b7
Author: Emilio G. Cota <cota@braap.org>
Date:   Thu Apr 28 10:56:26 2016 -0400

    atomics: add atomic_test_and_set_acquire
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>

diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 5bc4d6c..95de7a7 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -113,6 +113,13 @@
 } while(0)
 #endif
 
+/*
+ * We might be tempted to use __atomic_test_and_set with __ATOMIC_ACQUIRE;
+ * however, the documentation explicitly says that we should only pass
+ * a boolean to it, so we use __sync_lock_test_and_set, which doesn't
+ * have this limitation, and is documented to have acquire semantics.
+ */
+#define atomic_test_and_set_acquire(ptr) __sync_lock_test_and_set(ptr, true)
 
 /* All the remaining operations are fully sequentially consistent */
 
@@ -327,6 +334,8 @@
 #endif
 #endif
 
+#define atomic_test_and_set_acquire(ptr) __sync_lock_test_and_set(ptr, true)
+
 /* Provide shorter names for GCC atomic builtins.  */
 #define atomic_fetch_inc(ptr)  __sync_fetch_and_add(ptr, 1)
 #define atomic_fetch_dec(ptr)  __sync_fetch_and_add(ptr, -1)
---

An alternative would be to add just a single line, right below the
barrier() definition or at the end of the file. Adding both lines
is IMO a bit clearer, since the newly-added comment only applies
under the C11 definitions.

Thanks,

		Emilio


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-17 20:04       ` Emilio G. Cota
@ 2016-05-17 20:20         ` Sergey Fedorov
  2016-05-18  0:28           ` Emilio G. Cota
  0 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-17 20:20 UTC (permalink / raw)
  To: Emilio G. Cota, Richard Henderson
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite

On 17/05/16 23:04, Emilio G. Cota wrote:
> On Tue, May 17, 2016 at 12:19:27 -0700, Richard Henderson wrote:
>> On 05/17/2016 10:13 AM, Sergey Fedorov wrote:
>>>>> +static inline void qemu_spin_lock(QemuSpin *spin)
>>>>> +{
>>>>> +    while (atomic_test_and_set_acquire(&spin->value)) {
>>> >From gcc-4.8 info page, node "__atomic Builtins", description of
>>> __atomic_test_and_set():
>>>
>>>     It should be only used for operands of type 'bool' or 'char'.
>>>
>> Hum.  I thought I remembered all operand sizes there, but I've just re-checked
>> and you're right about bool (and really only bool).
>>
>> Perhaps we should just stick with __sync_test_and_set then.  I'm thinking here
>> of e.g. armv6, a reasonable host, which can't operate on 1 byte atomic values.
> I like this idea, it gets rid of any guesswork (as in my previous email).
> I've changed the patch to:
>
> commit 8f89d36b6203b78df2bf1e3f82871b8aa2ca83b7
> Author: Emilio G. Cota <cota@braap.org>
> Date:   Thu Apr 28 10:56:26 2016 -0400
>
>     atomics: add atomic_test_and_set_acquire
>     
>     Signed-off-by: Emilio G. Cota <cota@braap.org>
>
> diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
> index 5bc4d6c..95de7a7 100644
> --- a/include/qemu/atomic.h
> +++ b/include/qemu/atomic.h
> @@ -113,6 +113,13 @@
>  } while(0)
>  #endif
>  
> +/*
> + * We might we tempted to use __atomic_test_and_set with __ATOMIC_ACQUIRE;
> + * however, the documentation explicitly says that we should only pass
> + * a boolean to it, so we use __sync_lock_test_and_set, which doesn't
> + * have this limitation, and is documented to have acquire semantics.
> + */
> +#define atomic_test_and_set_acquire(ptr) __sync_lock_test_and_set(ptr, true)

So you are going to stick to *legacy* built-ins?

Kind regards,
Sergey

>  
>  /* All the remaining operations are fully sequentially consistent */
>  
> @@ -327,6 +334,8 @@
>  #endif
>  #endif
>  
> +#define atomic_test_and_set_acquire(ptr) __sync_lock_test_and_set(ptr, true)
> +
>  /* Provide shorter names for GCC atomic builtins.  */
>  #define atomic_fetch_inc(ptr)  __sync_fetch_and_add(ptr, 1)
>  #define atomic_fetch_dec(ptr)  __sync_fetch_and_add(ptr, -1)
> ---
>
> An alternative would be to add just a single line, right below the
> barrier() definition or at the end of the file. Adding both lines
> is IMO a bit clearer, since the newly-added comment only applies
> under the C11 definitions.


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-17 19:38     ` Emilio G. Cota
@ 2016-05-17 20:35       ` Sergey Fedorov
  2016-05-17 23:18         ` Emilio G. Cota
  0 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-17 20:35 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On 17/05/16 22:38, Emilio G. Cota wrote:
> On Tue, May 17, 2016 at 20:13:24 +0300, Sergey Fedorov wrote:
>> On 14/05/16 06:34, Emilio G. Cota wrote:
(snip)
>>> +        while (atomic_read(&spin->value)) {
>>> +            cpu_relax();
>>> +        }
>>> +    }
>> Looks like relaxed atomic access can be a subject to various
>> optimisations according to
>> https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync#Relaxed.
> The important thing here is that the read actually happens
> on every iteration; this is achieved with atomic_read().
> Barriers etc. do not matter here because once we exit
> the loop, the try to acquire the lock -- and if we succeed,
> we then emit the right barrier.

I just can't find where it is stated that an expression like
"__atomic_load(ptr, &_val, __ATOMIC_RELAXED)" has a _compiler_ barrier
or volatile-access semantics. Hopefully, cpu_relax() serves as a compiler
barrier. If we rely on that, we'd better put a comment about it.
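
For what it's worth, such compiler barriers are usually obtained from an
asm statement with a "memory" clobber; a sketch of the idea (assumed
typical definitions, not a quote of QEMU's headers):

    /* Compiler barrier: the compiler may not cache memory values across
     * this point; no CPU fence instruction is emitted. */
    #define barrier()   asm volatile("" ::: "memory")

    /* An x86-style cpu_relax() typically adds a pause hint on top of the
     * same "memory" clobber, so it doubles as a compiler barrier. */
    #define cpu_relax() asm volatile("rep; nop" ::: "memory")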

Kind regards,
Sergey

>>> +static inline bool qemu_spin_locked(QemuSpin *spin)
>>> +{
>>> +    return atomic_read_acquire(&spin->value);
>> Why not just atomic_read()?
> I think atomic_read() is better, yes. I'll change it. I went
> with the fence because I wanted to have at least a caller
> of atomic_read_acquire :P
>
> I also hesitated between calling it _locked or _is_locked;
> I used _locked for consistency with qemu_mutex_iothread_locked,
> although I think _is_locked is a bit clearer:
> qemu_spin_locked(foo)
>    is a little too similar to
> qemu_spin_lock(foo).
>
> Thanks,
>
> 		Emilio


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-17 20:01         ` Sergey Fedorov
@ 2016-05-17 22:12           ` Richard Henderson
  2016-05-17 22:22             ` Richard Henderson
  0 siblings, 1 reply; 79+ messages in thread
From: Richard Henderson @ 2016-05-17 22:12 UTC (permalink / raw)
  To: Sergey Fedorov, Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite

On 05/17/2016 01:01 PM, Sergey Fedorov wrote:
>> Sorry, I can't see reading ARMv6 ARM that 1-byte access can't be atomic. What
>> I've found:
>>
>>     B2.4.1 Normal memory attribute
>>     (snip)
>>     Shared Normal memory
>>
>>         (snip)
>>         ... Reads to Shared Normal Memory that are aligned in memory to the
>>         size of the access must be atomic.
...
> Looks like GCC has no trouble generating __atomic_store_n() for 1-byte bool...

Not loads and stores, but other atomic ops like xchg.  The native atomic
operations are all 4 bytes long.

I suppose the compiler may well be able to synthesize sub-word atomic ops, but
it'll be 2 or 3 times the size of a word-sized atomic op, and for no good reason.


r~


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-17 22:12           ` Richard Henderson
@ 2016-05-17 22:22             ` Richard Henderson
  0 siblings, 0 replies; 79+ messages in thread
From: Richard Henderson @ 2016-05-17 22:22 UTC (permalink / raw)
  To: Sergey Fedorov, Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite

On 05/17/2016 03:12 PM, Richard Henderson wrote:
> On 05/17/2016 01:01 PM, Sergey Fedorov wrote:
>>> Sorry, I can't see reading ARMv6 ARM that 1-byte access can't be atomic. What
>>> I've found:
>>>
>>>     B2.4.1 Normal memory attribute
>>>     (snip)
>>>     Shared Normal memory
>>>
>>>         (snip)
>>>         ... Reads to Shared Normal Memory that are aligned in memory to the
>>>         size of the access must be atomic.
> ...
>> Looks like GCC has no trouble generating __atomic_store_n() for 1-byte bool...
> 
> Not loads and stores, but other atomic ops like xchg.  The native atomic
> operations are all 4 bytes long.
> 
> I suppose the compiler may well be able to synthesize sub-word atomic ops, but
> it'll be 2 or 3 times the size of a word-sized atomic op, and for no good reason.

Indeed, even with gcc 7 branch,


struct foo {
  _Bool b;
  int i;
} f;

void a()
{
  __atomic_exchange_n(&f.b, 1, __ATOMIC_ACQUIRE);
  __atomic_exchange_n(&f.i, 1, __ATOMIC_ACQUIRE);
}

void b()
{
  __sync_lock_test_and_set(&f.b, 1);
  __sync_lock_test_and_set(&f.i, 1);
}

$ ./gcc/xgcc -B./gcc/ -O2 -S ~/z.c -march=armv6
$ cat z.s
a:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	push	{r4, r5, r6, lr}
	mov	r6, #1
	ldr	r5, .L8
	ldrb	r3, [r5]	@ zero_extendqisi2
.L2:
	mov	r2, r6
	sxtb	r1, r3
	mov	r0, r5
	mov	r4, r3
	bl	__sync_val_compare_and_swap_1
	uxtb	r4, r4
	uxtb	r2, r0
	cmp	r2, r4
	mov	r3, r0
	bne	.L2
	ldr	r3, .L8+4
	mov	r2, #1
.L5:
	ldrex	r1, [r3]
	strex	r0, r2, [r3]
	cmp	r0, #0
	bne	.L5
	mcr	p15, 0, r0, c7, c10, 5
	pop	{r4, r5, r6, pc}
...
b:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	push	{r4, lr}
	mov	r1, #1
	ldr	r4, .L13
	mov	r0, r4
	bl	__sync_lock_test_and_set_1
	add	r4, r4, #4
	mov	r3, #1
.L11:
	ldrex	r2, [r4]
	strex	r1, r3, [r4]
	cmp	r1, #0
	bne	.L11
	mcr	p15, 0, r0, c7, c10, 5
	pop	{r4, pc}



r~


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-17 20:35       ` Sergey Fedorov
@ 2016-05-17 23:18         ` Emilio G. Cota
  2016-05-18 13:59           ` Sergey Fedorov
  0 siblings, 1 reply; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-17 23:18 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On Tue, May 17, 2016 at 23:35:57 +0300, Sergey Fedorov wrote:
> On 17/05/16 22:38, Emilio G. Cota wrote:
> > On Tue, May 17, 2016 at 20:13:24 +0300, Sergey Fedorov wrote:
> >> On 14/05/16 06:34, Emilio G. Cota wrote:
> (snip)
> >>> +        while (atomic_read(&spin->value)) {
> >>> +            cpu_relax();
> >>> +        }
> >>> +    }
> >> Looks like relaxed atomic access can be a subject to various
> >> optimisations according to
> >> https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync#Relaxed.
> > The important thing here is that the read actually happens
> > on every iteration; this is achieved with atomic_read().
> > Barriers etc. do not matter here because once we exit
> > the loop, the try to acquire the lock -- and if we succeed,
> > we then emit the right barrier.
> 
> I just can't find where it is stated that an expression like
> "__atomic_load(ptr, &_val, __ATOMIC_RELAXED)" has a _compiler_ barrier
> or volatile access semantic. Hopefully, cpu_relax() serves as a compiler
> barrier. If we rely on that, we'd better put a comment about it.

I treat atomic_read/set as ACCESS_ONCE[1], i.e. volatile cast.
>From docs/atomics.txt:

  COMPARISON WITH LINUX KERNEL MEMORY BARRIERS
  ============================================
  [...]
  - atomic_read and atomic_set in Linux give no guarantee at all;
    atomic_read and atomic_set in QEMU include a compiler barrier
    (similar to the ACCESS_ONCE macro in Linux).

[1] https://lwn.net/Articles/508991/
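
For readers who haven't seen it, ACCESS_ONCE in Linux is essentially a
volatile cast (reproduced here for illustration; see the LWN article
above for the details):

    /* Force the compiler to emit exactly one load or store for the
     * access; no ordering guarantees beyond that. */
    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))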

		Emilio


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-17 20:20         ` Sergey Fedorov
@ 2016-05-18  0:28           ` Emilio G. Cota
  2016-05-18 14:18             ` Sergey Fedorov
  0 siblings, 1 reply; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-18  0:28 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: Richard Henderson, QEMU Developers, MTTCG Devel,
	Alex Bennée, Paolo Bonzini, Peter Crosthwaite

On Tue, May 17, 2016 at 23:20:11 +0300, Sergey Fedorov wrote:
> On 17/05/16 23:04, Emilio G. Cota wrote:
(snip)
> > +/*
> > + * We might we tempted to use __atomic_test_and_set with __ATOMIC_ACQUIRE;
> > + * however, the documentation explicitly says that we should only pass
> > + * a boolean to it, so we use __sync_lock_test_and_set, which doesn't
> > + * have this limitation, and is documented to have acquire semantics.
> > + */
> > +#define atomic_test_and_set_acquire(ptr) __sync_lock_test_and_set(ptr, true)
> 
> So you are going to stick to *legacy* built-ins?

Why not? AFAIK the reason to avoid __sync primitives is that in most cases
they include barriers that callers might not necessarily need; __atomic's
allow for finer tuning, which is in general a good thing. However,
__sync_test_and_set has the exact semantics we need, without the limitations
documented for __atomic_test_and_set; so why not use it?

		E.


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-17 23:18         ` Emilio G. Cota
@ 2016-05-18 13:59           ` Sergey Fedorov
  2016-05-18 14:05             ` Paolo Bonzini
  0 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-18 13:59 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On 18/05/16 02:18, Emilio G. Cota wrote:
> On Tue, May 17, 2016 at 23:35:57 +0300, Sergey Fedorov wrote:
>> On 17/05/16 22:38, Emilio G. Cota wrote:
>>> On Tue, May 17, 2016 at 20:13:24 +0300, Sergey Fedorov wrote:
>>>> On 14/05/16 06:34, Emilio G. Cota wrote:
>> (snip)
>>>>> +        while (atomic_read(&spin->value)) {
>>>>> +            cpu_relax();
>>>>> +        }
>>>>> +    }
>>>> Looks like relaxed atomic access can be a subject to various
>>>> optimisations according to
>>>> https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync#Relaxed.
>>> The important thing here is that the read actually happens
>>> on every iteration; this is achieved with atomic_read().
>>> Barriers etc. do not matter here because once we exit
>>> the loop, the try to acquire the lock -- and if we succeed,
>>> we then emit the right barrier.
>> I just can't find where it is stated that an expression like
>> "__atomic_load(ptr, &_val, __ATOMIC_RELAXED)" has a _compiler_ barrier
>> or volatile access semantic. Hopefully, cpu_relax() serves as a compiler
>> barrier. If we rely on that, we'd better put a comment about it.
> I treat atomic_read/set as ACCESS_ONCE[1], i.e. volatile cast.
> From docs/atomics.txt:
>
>   COMPARISON WITH LINUX KERNEL MEMORY BARRIERS
>   ============================================
>   [...]
>   - atomic_read and atomic_set in Linux give no guarantee at all;
>     atomic_read and atomic_set in QEMU include a compiler barrier
>     (similar to the ACCESS_ONCE macro in Linux).
>
> [1] https://lwn.net/Articles/508991/

But actually (cf include/qemu/atomic.h) we can have:

    #define atomic_read(ptr)                              \
        ({                                                \
        QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *)); \
        typeof(*ptr) _val;                                \
         __atomic_load(ptr, &_val, __ATOMIC_RELAXED);     \
        _val;                                             \
        })


I can't find anywhere if this __atomic_load() has volatile/compiler
barrier semantics...

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 13:59           ` Sergey Fedorov
@ 2016-05-18 14:05             ` Paolo Bonzini
  2016-05-18 14:10               ` Sergey Fedorov
  0 siblings, 1 reply; 79+ messages in thread
From: Paolo Bonzini @ 2016-05-18 14:05 UTC (permalink / raw)
  To: Sergey Fedorov, Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée,
	Peter Crosthwaite, Richard Henderson



On 18/05/2016 15:59, Sergey Fedorov wrote:
> 
> But actually (cf include/qemu/atomic.h) we can have:
> 
>     #define atomic_read(ptr)                              \
>         ({                                                \
>         QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *)); \
>         typeof(*ptr) _val;                                \
>          __atomic_load(ptr, &_val, __ATOMIC_RELAXED);     \
>         _val;                                             \
>         })
> 
> 
> I can't find anywhere if this __atomic_load() has volatile/compiler
> barrier semantics...

The standard says "you can have data races on atomic loads", that is
very close to compiler barrier semantics but indeed atomics.txt should
be updated to explain the C11 memory model in not-so-formal terms.

For example this:

  atomic_set(&x, 1);
  atomic_set(&y, 1);
  atomic_set(&x, 2);
  atomic_set(&y, 2);

could become

  atomic_set(&x, 2);
  atomic_set(&y, 2);

with C11 atomics but not with volatile.  However this:

  if (atomic_read(&x) != 1) {
    atomic_set(&x, 1);
  }

couldn't become an unconditional

  atomic_set(&x, 1);

Thanks,

Paolo


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 14:05             ` Paolo Bonzini
@ 2016-05-18 14:10               ` Sergey Fedorov
  2016-05-18 14:40                 ` Paolo Bonzini
  0 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-18 14:10 UTC (permalink / raw)
  To: Paolo Bonzini, Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée,
	Peter Crosthwaite, Richard Henderson

On 18/05/16 17:05, Paolo Bonzini wrote:
>
> On 18/05/2016 15:59, Sergey Fedorov wrote:
>> But actually (cf include/qemu/atomic.h) we can have:
>>
>>     #define atomic_read(ptr)                              \
>>         ({                                                \
>>         QEMU_BUILD_BUG_ON(sizeof(*ptr) > sizeof(void *)); \
>>         typeof(*ptr) _val;                                \
>>          __atomic_load(ptr, &_val, __ATOMIC_RELAXED);     \
>>         _val;                                             \
>>         })
>>
>>
>> I can't find anywhere if this __atomic_load() has volatile/compiler
>> barrier semantics...
> The standard says "you can have data races on atomic loads", that is
> very close to compiler barrier semantics but indeed atomics.txt should
> be updated to explain the C11 memory model in not-so-formal terms.
>
> For example this:
>
>   atomic_set(&x, 1);
>   atomic_set(&y, 1);
>   atomic_set(&x, 2);
>   atomic_set(&y, 2);
>
> could become
>
>   atomic_set(&x, 2);
>   atomic_set(&y, 2);
>
> with C11 atomics but not with volatile.  However this:
>
>   if (atomic_read(&x) != 1) {
>     atomic_set(&x, 1);
>   }
>
> couldn't become an unconditional
>
>   atomic_set(&x, 1);

Sorry, I can't figure out why it couldn't...

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18  0:28           ` Emilio G. Cota
@ 2016-05-18 14:18             ` Sergey Fedorov
  2016-05-18 14:47               ` Sergey Fedorov
  0 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-18 14:18 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: Richard Henderson, QEMU Developers, MTTCG Devel,
	Alex Bennée, Paolo Bonzini, Peter Crosthwaite

On 18/05/16 03:28, Emilio G. Cota wrote:
> On Tue, May 17, 2016 at 23:20:11 +0300, Sergey Fedorov wrote:
>> On 17/05/16 23:04, Emilio G. Cota wrote:
> (snip)
>>> +/*
>>> + * We might we tempted to use __atomic_test_and_set with __ATOMIC_ACQUIRE;
>>> + * however, the documentation explicitly says that we should only pass
>>> + * a boolean to it, so we use __sync_lock_test_and_set, which doesn't
>>> + * have this limitation, and is documented to have acquire semantics.
>>> + */
>>> +#define atomic_test_and_set_acquire(ptr) __sync_lock_test_and_set(ptr, true)
>> So you are going to stick to *legacy* built-ins?
> Why not? AFAIK the reason to avoid __sync primitives is that in most cases
> they include barriers that callers might not necessarily need; __atomic's
> allow for finer tuning, which is in general a good thing. However,
> __sync_test_and_set has the exact semantics we need, without the limitations
> documented for __atomic_test_and_set; so why not use it?

So it should be okay as long as the legacy built-ins are supported.

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 14:10               ` Sergey Fedorov
@ 2016-05-18 14:40                 ` Paolo Bonzini
  0 siblings, 0 replies; 79+ messages in thread
From: Paolo Bonzini @ 2016-05-18 14:40 UTC (permalink / raw)
  To: Sergey Fedorov, Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée,
	Peter Crosthwaite, Richard Henderson



On 18/05/2016 16:10, Sergey Fedorov wrote:
> On 18/05/16 17:05, Paolo Bonzini wrote:
>> this:
>>
>>   if (atomic_read(&x) != 1) {
>>     atomic_set(&x, 1);
>>   }
>>
>> couldn't become an unconditional
>>
>>   atomic_set(&x, 1);
> 
> Sorry, I can't figure out why it couldn't...

Because atomics cannot create new unconditional writes (and neither can
volatile).

Thanks,

paolo


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 14:18             ` Sergey Fedorov
@ 2016-05-18 14:47               ` Sergey Fedorov
  2016-05-18 14:59                 ` Paolo Bonzini
  0 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-18 14:47 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: Richard Henderson, QEMU Developers, MTTCG Devel,
	Alex Bennée, Paolo Bonzini, Peter Crosthwaite

On 18/05/16 17:18, Sergey Fedorov wrote:
> On 18/05/16 03:28, Emilio G. Cota wrote:
>> On Tue, May 17, 2016 at 23:20:11 +0300, Sergey Fedorov wrote:
>>> On 17/05/16 23:04, Emilio G. Cota wrote:
>> (snip)
>>>> +/*
>>>> + * We might we tempted to use __atomic_test_and_set with __ATOMIC_ACQUIRE;
>>>> + * however, the documentation explicitly says that we should only pass
>>>> + * a boolean to it, so we use __sync_lock_test_and_set, which doesn't
>>>> + * have this limitation, and is documented to have acquire semantics.
>>>> + */
>>>> +#define atomic_test_and_set_acquire(ptr) __sync_lock_test_and_set(ptr, true)
>>> So you are going to stick to *legacy* built-ins?
>> Why not? AFAIK the reason to avoid __sync primitives is that in most cases
>> they include barriers that callers might not necessarily need; __atomic's
>> allow for finer tuning, which is in general a good thing. However,
>> __sync_test_and_set has the exact semantics we need, without the limitations
>> documented for __atomic_test_and_set; so why not use it?
> So it should be okay as long as the legacy build-ins are supported.

However, there's also __atomic_compare_exchange_n(). Could it be the choice?

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 14:47               ` Sergey Fedorov
@ 2016-05-18 14:59                 ` Paolo Bonzini
  2016-05-18 15:05                   ` Sergey Fedorov
  0 siblings, 1 reply; 79+ messages in thread
From: Paolo Bonzini @ 2016-05-18 14:59 UTC (permalink / raw)
  To: Sergey Fedorov, Emilio G. Cota
  Cc: Richard Henderson, QEMU Developers, MTTCG Devel,
	Alex Bennée, Peter Crosthwaite



On 18/05/2016 16:47, Sergey Fedorov wrote:
>>> >> Why not? AFAIK the reason to avoid __sync primitives is that in most cases
>>> >> they include barriers that callers might not necessarily need; __atomic's
>>> >> allow for finer tuning, which is in general a good thing. However,
>>> >> __sync_test_and_set has the exact semantics we need, without the limitations
>>> >> documented for __atomic_test_and_set; so why not use it?
>> > So it should be okay as long as the legacy build-ins are supported.
> However, there's also __atomic_compare_exchange_n(). Could it be the choice?

cmpxchg is not TAS.  I don't see any reason not to use
__sync_test_and_set, the only sensible alternative is to ignore the
standard and use __atomic_test_and_set on int.

Paolo


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 14:59                 ` Paolo Bonzini
@ 2016-05-18 15:05                   ` Sergey Fedorov
  2016-05-18 15:09                     ` Paolo Bonzini
  2016-05-18 15:35                     ` Peter Maydell
  0 siblings, 2 replies; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-18 15:05 UTC (permalink / raw)
  To: Paolo Bonzini, Emilio G. Cota
  Cc: Richard Henderson, QEMU Developers, MTTCG Devel,
	Alex Bennée, Peter Crosthwaite

On 18/05/16 17:59, Paolo Bonzini wrote:
>
> On 18/05/2016 16:47, Sergey Fedorov wrote:
>>>>>> Why not? AFAIK the reason to avoid __sync primitives is that in most cases
>>>>>> they include barriers that callers might not necessarily need; __atomic's
>>>>>> allow for finer tuning, which is in general a good thing. However,
>>>>>> __sync_test_and_set has the exact semantics we need, without the limitations
>>>>>> documented for __atomic_test_and_set; so why not use it?
>>>> So it should be okay as long as the legacy build-ins are supported.
>> However, there's also __atomic_compare_exchange_n(). Could it be the choice?
> cmpxchg is not TAS.  I don't see any reason not to use
> __sync_test_and_set, the only sensible alternative is to ignore the
> standard and use __atomic_test_and_set on int.

Please look at this:

$ cat >a.c <<EOF
int atomic_exchange(int *x, int v)
{
    return __atomic_exchange_n(x, v, __ATOMIC_ACQUIRE);
}

_Bool atomic_compare_exchange(int *x, int o, int n)
{
    return __atomic_compare_exchange_n(x, &o, n, 1,
            __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

_Bool sync_val_compare_and_swap(int *x, int o, int n)
{
    return __sync_val_compare_and_swap(x, 0, n);
}

int sync_lock_test_and_set(int *x, int v)
{
    __sync_lock_test_and_set(x, v);
}
EOF

$ arm-linux-gnueabi-gcc -march=armv6 -O2 -c a.c

$ arm-linux-gnueabi-objdump -d a.o

a.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <atomic_exchange>:
   0:    e1902f9f     ldrex    r2, [r0]
   4:    e1803f91     strex    r3, r1, [r0]
   8:    e3530000     cmp    r3, #0
   c:    1afffffb     bne    0 <atomic_exchange>
  10:    ee070fba     mcr    15, 0, r0, cr7, cr10, {5}
  14:    e1a00002     mov    r0, r2
  18:    e12fff1e     bx    lr

0000001c <atomic_compare_exchange>:
  1c:    e24dd008     sub    sp, sp, #8
  20:    e58d1004     str    r1, [sp, #4]
  24:    e1903f9f     ldrex    r3, [r0]
  28:    e1530001     cmp    r3, r1
  2c:    1a000002     bne    3c <atomic_compare_exchange+0x20>
  30:    e180cf92     strex    ip, r2, [r0]
  34:    e35c0000     cmp    ip, #0
  38:    ee070fba     mcr    15, 0, r0, cr7, cr10, {5}
  3c:    13a00000     movne    r0, #0
  40:    03a00001     moveq    r0, #1
  44:    e28dd008     add    sp, sp, #8
  48:    e12fff1e     bx    lr

0000004c <sync_val_compare_and_swap>:
  4c:    ee070fba     mcr    15, 0, r0, cr7, cr10, {5}
  50:    e1901f9f     ldrex    r1, [r0]
  54:    e3510000     cmp    r1, #0
  58:    1a000002     bne    68 <sync_val_compare_and_swap+0x1c>
  5c:    e1803f92     strex    r3, r2, [r0]
  60:    e3530000     cmp    r3, #0
  64:    1afffff9     bne    50 <sync_val_compare_and_swap+0x4>
  68:    e2910000     adds    r0, r1, #0
  6c:    ee070fba     mcr    15, 0, r0, cr7, cr10, {5}
  70:    13a00001     movne    r0, #1
  74:    e12fff1e     bx    lr

00000078 <sync_lock_test_and_set>:
  78:    e1902f9f     ldrex    r2, [r0]
  7c:    e1803f91     strex    r3, r1, [r0]
  80:    e3530000     cmp    r3, #0
  84:    1afffffb     bne    78 <sync_lock_test_and_set>
  88:    ee070fba     mcr    15, 0, r0, cr7, cr10, {5}
  8c:    e12fff1e     bx    lr


atomic_compare_exchange() looks pretty good, doesn't it? Could we use it
to implement qemu_spin_lock()?

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 15:05                   ` Sergey Fedorov
@ 2016-05-18 15:09                     ` Paolo Bonzini
  2016-05-18 16:59                       ` Emilio G. Cota
  2016-05-18 15:35                     ` Peter Maydell
  1 sibling, 1 reply; 79+ messages in thread
From: Paolo Bonzini @ 2016-05-18 15:09 UTC (permalink / raw)
  To: Sergey Fedorov, Emilio G. Cota
  Cc: Richard Henderson, QEMU Developers, MTTCG Devel,
	Alex Bennée, Peter Crosthwaite



On 18/05/2016 17:05, Sergey Fedorov wrote:
> Please look at this:
> 
> $ cat >a.c <<EOF
> int atomic_exchange(int *x, int v)
> {
>     return __atomic_exchange_n(x, v, __ATOMIC_ACQUIRE);
> }
> 
> int sync_lock_test_and_set(int *x, int v)
> {
>     __sync_lock_test_and_set(x, v);
> }
> EOF
> 
> Disassembly of section .text:
> 
> 00000000 <atomic_exchange>:
>    0:    e1902f9f     ldrex    r2, [r0]
>    4:    e1803f91     strex    r3, r1, [r0]
>    8:    e3530000     cmp    r3, #0
>    c:    1afffffb     bne    0 <atomic_exchange>
>   10:    ee070fba     mcr    15, 0, r0, cr7, cr10, {5}
>   14:    e1a00002     mov    r0, r2
>   18:    e12fff1e     bx    lr
> 
> 00000078 <sync_lock_test_and_set>:
>   78:    e1902f9f     ldrex    r2, [r0]
>   7c:    e1803f91     strex    r3, r1, [r0]
>   80:    e3530000     cmp    r3, #0
>   84:    1afffffb     bne    78 <sync_lock_test_and_set>
>   88:    ee070fba     mcr    15, 0, r0, cr7, cr10, {5}
>   8c:    e12fff1e     bx    lr
> 
> 
> atomic_compare_exchange() looks pretty good, doesn't it? Could we use it
> to implement qemu_spin_lock()?

I guess you mean atomic_exchange?  That one looks good, indeed it's
equivalent to __sync_lock_test_and_set.

But honestly I think it would be even better to just use
__sync_lock_test_and_set in the spinlock implementation and not add this
to atomics.h.  There are already enough issues with the current subset of
atomics, I am not really happy to add non-SC read-modify-write
operations to the mix.

Thanks,

Paolo


* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 15:05                   ` Sergey Fedorov
  2016-05-18 15:09                     ` Paolo Bonzini
@ 2016-05-18 15:35                     ` Peter Maydell
  2016-05-18 15:36                       ` Paolo Bonzini
  2016-05-18 16:02                       ` Richard Henderson
  1 sibling, 2 replies; 79+ messages in thread
From: Peter Maydell @ 2016-05-18 15:35 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: Paolo Bonzini, Emilio G. Cota, Richard Henderson,
	QEMU Developers, MTTCG Devel, Alex Bennée,
	Peter Crosthwaite

On 18 May 2016 at 16:05, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
> $ arm-linux-gnueabi-gcc -march=armv6 -O2 -c a.c

I don't think armv6 is a sufficiently common host for us to
worry too much about how its atomic primitives come out.
ARMv7 and 64-bit ARMv8 are more relevant, I think.
(v7 probably gets compiled the same way as v6 here, though.)

thanks
-- PMM

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 15:35                     ` Peter Maydell
@ 2016-05-18 15:36                       ` Paolo Bonzini
  2016-05-18 15:44                         ` Peter Maydell
  2016-05-18 16:02                       ` Richard Henderson
  1 sibling, 1 reply; 79+ messages in thread
From: Paolo Bonzini @ 2016-05-18 15:36 UTC (permalink / raw)
  To: Peter Maydell, Sergey Fedorov
  Cc: Emilio G. Cota, Richard Henderson, QEMU Developers, MTTCG Devel,
	Alex Bennée, Peter Crosthwaite



On 18/05/2016 17:35, Peter Maydell wrote:
>> > $ arm-linux-gnueabi-gcc -march=armv6 -O2 -c a.c
> I don't think armv6 is a sufficiently common host for us to
> worry too much about how its atomic primitives come out.
> ARMv7 and 64-bit ARMv8 are more relevant, I think.
> (v7 probably gets compiled the same way as v6 here, though.)

Well, v6 is raspberry pi isn't it?

Paolo

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 15:36                       ` Paolo Bonzini
@ 2016-05-18 15:44                         ` Peter Maydell
  2016-05-18 15:59                           ` Sergey Fedorov
  0 siblings, 1 reply; 79+ messages in thread
From: Peter Maydell @ 2016-05-18 15:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sergey Fedorov, Emilio G. Cota, Richard Henderson,
	QEMU Developers, MTTCG Devel, Alex Bennée,
	Peter Crosthwaite

On 18 May 2016 at 16:36, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 18/05/2016 17:35, Peter Maydell wrote:
>>> > $ arm-linux-gnueabi-gcc -march=armv6 -O2 -c a.c
>> I don't think armv6 is a sufficiently common host for us to
>> worry too much about how its atomic primitives come out.
>> ARMv7 and 64-bit ARMv8 are more relevant, I think.
>> (v7 probably gets compiled the same way as v6 here, though.)
>
> Well, v6 is raspberry pi isn't it?

Yes, but v6 is also pretty slow anyhow, and if it wasn't
for the outlier raspi case then v6 would be definitely
irrelevant to everybody. Running QEMU on a slow ARM
board is unlikely to be a great experience regardless.
I'm not saying we should happily break v6, but I think
we're better off making optimisation decisions looking
forwards at v7 and v8 boards, rather than backwards at a
single legacy v6 board.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 15:44                         ` Peter Maydell
@ 2016-05-18 15:59                           ` Sergey Fedorov
  0 siblings, 0 replies; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-18 15:59 UTC (permalink / raw)
  To: Peter Maydell, Paolo Bonzini
  Cc: Emilio G. Cota, Richard Henderson, QEMU Developers, MTTCG Devel,
	Alex Bennée, Peter Crosthwaite

On 18/05/16 18:44, Peter Maydell wrote:
> On 18 May 2016 at 16:36, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 18/05/2016 17:35, Peter Maydell wrote:
>>>>> $ arm-linux-gnueabi-gcc -march=armv6 -O2 -c a.c
>>> I don't think armv6 is a sufficiently common host for us to
>>> worry too much about how its atomic primitives come out.
>>> ARMv7 and 64-bit ARMv8 are more relevant, I think.
>>> (v7 probably gets compiled the same way as v6 here, though.)
>> Well, v6 is raspberry pi isn't it?
> Yes, but v6 is also pretty slow anyhow, and if it wasn't
> for the outlier raspi case then v6 would be definitely
> irrelevant to everybody. Running QEMU on a slow ARM
> board is unlikely to be a great experience regardless.
> I'm not saying we should happily break v6, but I think
> we're better off making optimisation decisions looking
> forwards at v7 and v8 boards, rather than backwards at a
> single legacy v6 board.

Well, the ARMv7 code looks exactly the same, except we have "dmb sy"
instead of "mcr 15, 0, r0, cr7, cr10, {5}".

Here is ARMv8 code for reference:


a.o:     file format elf64-littleaarch64


Disassembly of section .text:

    0000000000000000 <atomic_exchange>:
       0:    885ffc02     ldaxr    w2, [x0]
       4:    88037c01     stxr    w3, w1, [x0]
       8:    35ffffc3     cbnz    w3, 0 <atomic_exchange>
       c:    2a0203e0     mov    w0, w2
      10:    d65f03c0     ret

    0000000000000014 <atomic_compare_exchange>:
      14:    d10043ff     sub    sp, sp, #0x10
      18:    b9000fe1     str    w1, [sp,#12]
      1c:    885ffc03     ldaxr    w3, [x0]
      20:    6b01007f     cmp    w3, w1
      24:    54000061     b.ne    30 <atomic_compare_exchange+0x1c>
      28:    88047c02     stxr    w4, w2, [x0]
      2c:    6b1f009f     cmp    w4, wzr
      30:    1a9f17e0     cset    w0, eq
      34:    910043ff     add    sp, sp, #0x10
      38:    d65f03c0     ret

    000000000000003c <sync_val_compare_and_swap>:
      3c:    885ffc01     ldaxr    w1, [x0]
      40:    6b1f003f     cmp    w1, wzr
      44:    54000061     b.ne    50 <sync_val_compare_and_swap+0x14>
      48:    8803fc02     stlxr    w3, w2, [x0]
      4c:    35ffff83     cbnz    w3, 3c <sync_val_compare_and_swap>
      50:    6b1f003f     cmp    w1, wzr
      54:    1a9f07e0     cset    w0, ne
      58:    d65f03c0     ret

    000000000000005c <sync_lock_test_and_set>:
      5c:    885ffc02     ldaxr    w2, [x0]
      60:    88037c01     stxr    w3, w1, [x0]
      64:    35ffffc3     cbnz    w3, 5c <sync_lock_test_and_set>
      68:    d65f03c0     ret


and x86-64 as well (but I'm not good at reading x86 code):

a.o:     file format elf64-x86-64


Disassembly of section .text:

    0000000000000000 <atomic_exchange>:
       0:   89 f0                   mov    %esi,%eax
       2:   87 07                   xchg   %eax,(%rdi)
       4:   c3                      retq  
       5:   66 66 2e 0f 1f 84 00    data32 nopw %cs:0x0(%rax,%rax,1)
       c:   00 00 00 00

    0000000000000010 <atomic_compare_exchange>:
      10:   89 f0                   mov    %esi,%eax
      12:   89 74 24 fc             mov    %esi,-0x4(%rsp)
      16:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
      1a:   0f 94 c0                sete   %al
      1d:   c3                      retq  
      1e:   66 90                   xchg   %ax,%ax

    0000000000000020 <sync_val_compare_and_swap>:
      20:   31 c0                   xor    %eax,%eax
      22:   f0 0f b1 17             lock cmpxchg %edx,(%rdi)
      26:   85 c0                   test   %eax,%eax
      28:   0f 95 c0                setne  %al
      2b:   c3                      retq  
      2c:   0f 1f 40 00             nopl   0x0(%rax)

    0000000000000030 <sync_lock_test_and_set>:
      30:   87 37                   xchg   %esi,(%rdi)
      32:   c3                      retq  

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 15:35                     ` Peter Maydell
  2016-05-18 15:36                       ` Paolo Bonzini
@ 2016-05-18 16:02                       ` Richard Henderson
  1 sibling, 0 replies; 79+ messages in thread
From: Richard Henderson @ 2016-05-18 16:02 UTC (permalink / raw)
  To: Peter Maydell, Sergey Fedorov
  Cc: Paolo Bonzini, Emilio G. Cota, QEMU Developers, MTTCG Devel,
	Alex Bennée, Peter Crosthwaite

On 05/18/2016 08:35 AM, Peter Maydell wrote:
> (v7 probably gets compiled the same way as v6 here, though.)

The meaningful difference in v7 is ldrexb, which goes to the byte semantics of 
__atomic_test_and_set, which is how this digression got started.
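
For reference, __atomic_test_and_set operates on a single byte, so a
byte-wide test-and-set along these lines can use ldrexb/strexb on v7,
whereas plain ARMv6 only has the word-sized exclusives (a sketch; the
function name is illustrative, not from the thread):

#include <stdbool.h>

bool test_and_set_byte(bool *x)
{
    /* returns the previous contents of *x; acquire ordering */
    return __atomic_test_and_set(x, __ATOMIC_ACQUIRE);
}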


r~

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 15:09                     ` Paolo Bonzini
@ 2016-05-18 16:59                       ` Emilio G. Cota
  2016-05-18 17:00                         ` Paolo Bonzini
  0 siblings, 1 reply; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-18 16:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sergey Fedorov, Richard Henderson, QEMU Developers, MTTCG Devel,
	Alex Bennée, Peter Crosthwaite

On Wed, May 18, 2016 at 17:09:48 +0200, Paolo Bonzini wrote:
> But honestly I think it would be even better to just use
> __sync_lock_test_and_set in the spinlock implementation and not add this
> to atomics.h.  There's already enough issues with the current subset of
> atomics, I am not really happy to add non-SC read-modify-write
> operations to the mix.

I can drop the two patches that touch atomic.h, and have the
spinlock patch as appended. OK with this?

Thanks,

		Emilio

commit 99c8f1049a1508edc9de30e3cf27898888aa7f68
Author: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
Date:   Sun Oct 18 09:44:02 2015 +0200

    qemu-thread: add simple test-and-set spinlock
    
    Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
    [Rewritten. - Paolo]
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    [Emilio's additions: use TAS instead of atomic_xchg; emit acquire/release
     barriers; call cpu_relax() while spinning; optimize for uncontended locks by
     acquiring the lock with TAS instead of TATAS; add qemu_spin_locked().]
    Signed-off-by: Emilio G. Cota <cota@braap.org>

diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index bdae6df..2d225ff 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -1,6 +1,9 @@
 #ifndef __QEMU_THREAD_H
 #define __QEMU_THREAD_H 1
 
+#include <errno.h>
+#include "qemu/processor.h"
+#include "qemu/atomic.h"
 
 typedef struct QemuMutex QemuMutex;
 typedef struct QemuCond QemuCond;
@@ -60,4 +63,40 @@ struct Notifier;
 void qemu_thread_atexit_add(struct Notifier *notifier);
 void qemu_thread_atexit_remove(struct Notifier *notifier);
 
+typedef struct QemuSpin {
+    int value;
+} QemuSpin;
+
+static inline void qemu_spin_init(QemuSpin *spin)
+{
+    __sync_lock_release(&spin->value);
+}
+
+static inline void qemu_spin_lock(QemuSpin *spin)
+{
+    while (__sync_lock_test_and_set(&spin->value, true)) {
+        while (atomic_read(&spin->value)) {
+            cpu_relax();
+        }
+    }
+}
+
+static inline int qemu_spin_trylock(QemuSpin *spin)
+{
+    if (__sync_lock_test_and_set(&spin->value, true)) {
+        return -EBUSY;
+    }
+    return 0;
+}
+
+static inline bool qemu_spin_locked(QemuSpin *spin)
+{
+    return atomic_read(&spin->value);
+}
+
+static inline void qemu_spin_unlock(QemuSpin *spin)
+{
+    __sync_lock_release(&spin->value);
+}
+
 #endif
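
For reference, a minimal usage sketch of the API above (not part of the
patch; the counter example and includes are illustrative):

#include <stdbool.h>
#include "qemu/thread.h"

static QemuSpin lock;
static int counter;

static void counter_setup(void)
{
    qemu_spin_init(&lock);
}

static void counter_inc(void)
{
    qemu_spin_lock(&lock);
    counter++;
    qemu_spin_unlock(&lock);
}

static bool counter_try_inc(void)
{
    if (qemu_spin_trylock(&lock)) {
        return false;   /* lock is contended; trylock returned -EBUSY */
    }
    counter++;
    qemu_spin_unlock(&lock);
    return true;
}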

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 16:59                       ` Emilio G. Cota
@ 2016-05-18 17:00                         ` Paolo Bonzini
  0 siblings, 0 replies; 79+ messages in thread
From: Paolo Bonzini @ 2016-05-18 17:00 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: Sergey Fedorov, Richard Henderson, QEMU Developers, MTTCG Devel,
	Alex Bennée, Peter Crosthwaite



On 18/05/2016 18:59, Emilio G. Cota wrote:
> On Wed, May 18, 2016 at 17:09:48 +0200, Paolo Bonzini wrote:
>> But honestly I think it would be even better to just use
>> __sync_lock_test_and_set in the spinlock implementation and not add this
>> to atomics.h.  There's already enough issues with the current subset of
>> atomics, I am not really happy to add non-SC read-modify-write
>> operations to the mix.
> 
> I can drop the two patches that touch atomic.h, and have the
> spinlock patch as appended.

Load acquire and store release are fine by me, I wanted to add them too.
 But the patch you attached is fine, I will salvage the
load-acquire/store-release patch later.
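
For reference, a minimal sketch of such helpers on top of the GCC
__atomic builtins; the exact names and definitions that would land in
atomic.h may differ:

#define atomic_read_acquire(ptr) \
    __atomic_load_n(ptr, __ATOMIC_ACQUIRE)

#define atomic_set_release(ptr, val) \
    __atomic_store_n(ptr, val, __ATOMIC_RELEASE)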

Thanks,

Paolo

> Thanks,
> 
> 		Emilio
> 
> commit 99c8f1049a1508edc9de30e3cf27898888aa7f68
> Author: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
> Date:   Sun Oct 18 09:44:02 2015 +0200
> 
>     qemu-thread: add simple test-and-set spinlock
>     
>     Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
>     [Rewritten. - Paolo]
>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>     [Emilio's additions: use TAS instead of atomic_xchg; emit acquire/release
>      barriers; call cpu_relax() while spinning; optimize for uncontended locks by
>      acquiring the lock with TAS instead of TATAS; add qemu_spin_locked().]
>     Signed-off-by: Emilio G. Cota <cota@braap.org>
> 
> diff --git a/include/qemu/thread.h b/include/qemu/thread.h
> index bdae6df..2d225ff 100644
> --- a/include/qemu/thread.h
> +++ b/include/qemu/thread.h
> @@ -1,6 +1,9 @@
>  #ifndef __QEMU_THREAD_H
>  #define __QEMU_THREAD_H 1
>  
> +#include <errno.h>
> +#include "qemu/processor.h"
> +#include "qemu/atomic.h"
>  
>  typedef struct QemuMutex QemuMutex;
>  typedef struct QemuCond QemuCond;
> @@ -60,4 +63,40 @@ struct Notifier;
>  void qemu_thread_atexit_add(struct Notifier *notifier);
>  void qemu_thread_atexit_remove(struct Notifier *notifier);
>  
> +typedef struct QemuSpin {
> +    int value;
> +} QemuSpin;
> +
> +static inline void qemu_spin_init(QemuSpin *spin)
> +{
> +    __sync_lock_release(&spin->value);
> +}
> +
> +static inline void qemu_spin_lock(QemuSpin *spin)
> +{
> +    while (__sync_lock_test_and_set(&spin->value, true)) {
> +        while (atomic_read(&spin->value)) {
> +            cpu_relax();
> +        }
> +    }
> +}
> +
> +static inline int qemu_spin_trylock(QemuSpin *spin)
> +{
> +    if (__sync_lock_test_and_set(&spin->value, true)) {
> +        return -EBUSY;
> +    }
> +    return 0;
> +}
> +
> +static inline bool qemu_spin_locked(QemuSpin *spin)
> +{
> +    return atomic_read(&spin->value);
> +}
> +
> +static inline void qemu_spin_unlock(QemuSpin *spin)
> +{
> +    __sync_lock_release(&spin->value);
> +}
> +
>  #endif
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 04/18] include/processor.h: define cpu_relax()
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 04/18] include/processor.h: define cpu_relax() Emilio G. Cota
@ 2016-05-18 17:47   ` Sergey Fedorov
  2016-05-18 18:29     ` Emilio G. Cota
  0 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-18 17:47 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite, Richard Henderson

On 14/05/16 06:34, Emilio G. Cota wrote:
> Taken from the linux kernel.
>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/processor.h | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>  create mode 100644 include/qemu/processor.h
>
> diff --git a/include/qemu/processor.h b/include/qemu/processor.h
> new file mode 100644
> index 0000000..4e6a71f
> --- /dev/null
> +++ b/include/qemu/processor.h
> @@ -0,0 +1,34 @@
> +/*
> + * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
> + *
> + * License: GNU GPL, version 2.
> + *   See the COPYING file in the top-level directory.
> + */
> +#ifndef QEMU_PROCESSOR_H
> +#define QEMU_PROCESSOR_H
> +
> +#include "qemu/atomic.h"
> +
> +#if defined(__i386__) || defined(__x86_64__)
> +#define cpu_relax() asm volatile("rep; nop" ::: "memory")
> +#endif
> +
> +#ifdef __ia64__
> +#define cpu_relax() asm volatile("hint @pause" ::: "memory")
> +#endif
> +
> +#ifdef __aarch64__
> +#define cpu_relax() asm volatile("yield" ::: "memory")
> +#endif
> +
> +#if defined(__powerpc64__)
> +/* set Hardware Multi-Threading (HMT) priority to low; then back to medium */
> +#define cpu_relax() asm volatile("or 1, 1, 1;" \
> +                                 "or 2, 2, 2;" ::: "memory")
> +#endif
> +
> +#ifndef cpu_relax
> +#define cpu_relax() barrier()
> +#endif
> +
> +#endif /* QEMU_PROCESSOR_H */

Why not do it like this:

#if defined(__foo__)
#  define ...
#elif defined(__bar__)
#  define ...
#else
#  define ...
#endif

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
       [not found]   ` <573B5134.8060104@gmail.com>
@ 2016-05-18 18:21   ` Sergey Fedorov
  2016-05-18 19:04     ` Emilio G. Cota
  2016-05-18 19:51   ` Sergey Fedorov
  2 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-18 18:21 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite, Richard Henderson

On 14/05/16 06:34, Emilio G. Cota wrote:
> +static inline int qemu_spin_trylock(QemuSpin *spin)
> +{
> +    if (atomic_test_and_set_acquire(&spin->value)) {
> +        return -EBUSY;

Seems this should be:

    return EBUSY;

> +    }
> +    return 0;
> +}

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 04/18] include/processor.h: define cpu_relax()
  2016-05-18 17:47   ` Sergey Fedorov
@ 2016-05-18 18:29     ` Emilio G. Cota
  2016-05-18 18:37       ` Sergey Fedorov
  0 siblings, 1 reply; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-18 18:29 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On Wed, May 18, 2016 at 20:47:56 +0300, Sergey Fedorov wrote:
> Why not do it like this:
> 
> #if defined(__foo__)
> #  define ...
> #elif defined(__bar__)
> #  define ...
> #else
> #  define ...
> #endif

Good point. Changed to:

commit ad31d6cff8e309e41bd4bed110f173e473c27c5a
Author: Emilio G. Cota <cota@braap.org>
Date:   Wed Apr 6 18:21:08 2016 -0400

    include/processor.h: define cpu_relax()
    
    Taken from the linux kernel.
    
    Reviewed-by: Richard Henderson <rth@twiddle.net>
    Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
    Signed-off-by: Emilio G. Cota <cota@braap.org>

diff --git a/include/qemu/processor.h b/include/qemu/processor.h
new file mode 100644
index 0000000..42bcc99
--- /dev/null
+++ b/include/qemu/processor.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2.
+ *   See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_PROCESSOR_H
+#define QEMU_PROCESSOR_H
+
+#include "qemu/atomic.h"
+
+#if defined(__i386__) || defined(__x86_64__)
+# define cpu_relax() asm volatile("rep; nop" ::: "memory")
+
+#elif defined(__ia64__)
+# define cpu_relax() asm volatile("hint @pause" ::: "memory")
+
+#elif defined(__aarch64__)
+# define cpu_relax() asm volatile("yield" ::: "memory")
+
+#elif defined(__powerpc64__)
+/* set Hardware Multi-Threading (HMT) priority to low; then back to medium */
+# define cpu_relax() asm volatile("or 1, 1, 1;" \
+                                  "or 2, 2, 2;" ::: "memory")
+
+#else
+# define cpu_relax() barrier()
+#endif
+
+#endif /* QEMU_PROCESSOR_H */

^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 04/18] include/processor.h: define cpu_relax()
  2016-05-18 18:29     ` Emilio G. Cota
@ 2016-05-18 18:37       ` Sergey Fedorov
  0 siblings, 0 replies; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-18 18:37 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On 18/05/16 21:29, Emilio G. Cota wrote:
> On Wed, May 18, 2016 at 20:47:56 +0300, Sergey Fedorov wrote:
>> Why not do it like this:
>>
>> #if defined(__foo__)
>> #  define ...
>> #elif defined(__bar__)
>> #  define ...
>> #else
>> #  define ...
>> #endif
> Good point. Changed to:
>
> commit ad31d6cff8e309e41bd4bed110f173e473c27c5a
> Author: Emilio G. Cota <cota@braap.org>
> Date:   Wed Apr 6 18:21:08 2016 -0400
>
>     include/processor.h: define cpu_relax()
>     
>     Taken from the linux kernel.
>     
>     Reviewed-by: Richard Henderson <rth@twiddle.net>
>     Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
>     Signed-off-by: Emilio G. Cota <cota@braap.org>
>
> diff --git a/include/qemu/processor.h b/include/qemu/processor.h
> new file mode 100644
> index 0000000..42bcc99
> --- /dev/null
> +++ b/include/qemu/processor.h
> @@ -0,0 +1,30 @@
> +/*
> + * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
> + *
> + * License: GNU GPL, version 2.
> + *   See the COPYING file in the top-level directory.
> + */
> +#ifndef QEMU_PROCESSOR_H
> +#define QEMU_PROCESSOR_H
> +
> +#include "qemu/atomic.h"
> +
> +#if defined(__i386__) || defined(__x86_64__)
> +# define cpu_relax() asm volatile("rep; nop" ::: "memory")
> +
> +#elif defined(__ia64__)
> +# define cpu_relax() asm volatile("hint @pause" ::: "memory")
> +
> +#elif defined(__aarch64__)
> +# define cpu_relax() asm volatile("yield" ::: "memory")
> +
> +#elif defined(__powerpc64__)
> +/* set Hardware Multi-Threading (HMT) priority to low; then back to medium */
> +# define cpu_relax() asm volatile("or 1, 1, 1;" \
> +                                  "or 2, 2, 2;" ::: "memory")
> +
> +#else
> +# define cpu_relax() barrier()
> +#endif
> +
> +#endif /* QEMU_PROCESSOR_H */

Looks like you prefer "sparse" code :)

-Sergey

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 18:21   ` Sergey Fedorov
@ 2016-05-18 19:04     ` Emilio G. Cota
  0 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-18 19:04 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On Wed, May 18, 2016 at 21:21:26 +0300, Sergey Fedorov wrote:
> On 14/05/16 06:34, Emilio G. Cota wrote:
> > +static inline int qemu_spin_trylock(QemuSpin *spin)
> > +{
> > +    if (atomic_test_and_set_acquire(&spin->value)) {
> > +        return -EBUSY;
> 
> Seems this should be:
> 
>     return EBUSY;

I don't think any caller would/should ever check this value, other
than if (!trylock).

It's true though that pthread_mutex_trylock returns EBUSY, so it's
probably best to remain consistent with it; I've made the change.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
       [not found]   ` <573B5134.8060104@gmail.com>
  2016-05-18 18:21   ` Sergey Fedorov
@ 2016-05-18 19:51   ` Sergey Fedorov
  2016-05-18 20:52     ` Emilio G. Cota
  2 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-18 19:51 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite, Richard Henderson

On 14/05/16 06:34, Emilio G. Cota wrote:
> +static inline void qemu_spin_lock(QemuSpin *spin)
> +{
> +    while (atomic_test_and_set_acquire(&spin->value)) {

A possible optimization might be using unlikely() here, compare:

spin.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <spin_lock__no_hint>:
   0:    52800022     mov    w2, #0x1                       // #1
   4:    885ffc01     ldaxr    w1, [x0]
   8:    88037c02     stxr    w3, w2, [x0]
   c:    35ffffc3     cbnz    w3, 4 <spin_lock__no_hint+0x4>
  10:    340000a1     cbz    w1, 24 <spin_lock__no_hint+0x24>
  14:    b9400001     ldr    w1, [x0]
  18:    34ffff61     cbz    w1, 4 <spin_lock__no_hint+0x4>
  1c:    d503203f     yield
  20:    17fffffd     b    14 <spin_lock__no_hint+0x14>
  24:    d65f03c0     ret

0000000000000028 <spin_lock__hint>:
  28:    52800022     mov    w2, #0x1                       // #1
  2c:    885ffc01     ldaxr    w1, [x0]
  30:    88037c02     stxr    w3, w2, [x0]
  34:    35ffffc3     cbnz    w3, 2c <spin_lock__hint+0x4>
  38:    35000061     cbnz    w1, 44 <spin_lock__hint+0x1c>
  3c:    d65f03c0     ret
  40:    d503203f     yield
  44:    b9400001     ldr    w1, [x0]
  48:    35ffffc1     cbnz    w1, 40 <spin_lock__hint+0x18>
  4c:    17fffff8     b    2c <spin_lock__hint+0x4>

spin_lock__hint(), the one where unlikely() is used, gives a slightly
more CPU-pipeline-friendly fast path.
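
The C source for the two variants is not shown above; presumably it was
along these lines (a reconstruction, so the names and the exact
test-and-set primitive are assumptions):

#define unlikely(x) __builtin_expect(!!(x), 0)
#define cpu_relax() asm volatile("yield" ::: "memory")  /* aarch64 */

void spin_lock__no_hint(int *value)
{
    while (__sync_lock_test_and_set(value, 1)) {
        while (__atomic_load_n(value, __ATOMIC_RELAXED)) {
            cpu_relax();
        }
    }
}

void spin_lock__hint(int *value)
{
    while (unlikely(__sync_lock_test_and_set(value, 1))) {
        while (__atomic_load_n(value, __ATOMIC_RELAXED)) {
            cpu_relax();
        }
    }
}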

> +        while (atomic_read(&spin->value)) {
> +            cpu_relax();
> +        }
> +    }
> +}
> +
> +static inline int qemu_spin_trylock(QemuSpin *spin)
> +{
> +    if (atomic_test_and_set_acquire(&spin->value)) {
> +        return -EBUSY;
> +    }
> +    return 0;
> +}

Here we could also benefit from unlikely(), I think.

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 19:51   ` Sergey Fedorov
@ 2016-05-18 20:52     ` Emilio G. Cota
  2016-05-18 20:57       ` Sergey Fedorov
  0 siblings, 1 reply; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-18 20:52 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On Wed, May 18, 2016 at 22:51:09 +0300, Sergey Fedorov wrote:
> On 14/05/16 06:34, Emilio G. Cota wrote:
> > +static inline void qemu_spin_lock(QemuSpin *spin)
> > +{
> > +    while (atomic_test_and_set_acquire(&spin->value)) {
> 
> A possible optimization might be using unlikely() here, compare:

Testing with a spinlock-heavy workload reveals a little improvement:

taskset -c 0 tests/qht-bench \
	-d 5 -n 1 -u 100 -k 4096 -K 4096 -l 4096 -r 4096 -s 4096

I'm running this 10 times. Results in Mops/s:
Head			31.283 +- 0.190557661148069
while (unlikely)	31.397 +- 0.107501937967028
if (likely) + while	31.524 +- 0.219605707272527 

The last case does:
    if (likely(__sync_lock_test_and_set(&spin->value, true) == false)) {
        return;
    }
    while (__sync_lock_test_and_set(&spin->value, true)) {
        while (atomic_read(&spin->value)) {
            cpu_relax();
        }
    }

Although I don't like how this will do the TAS twice if the lock is
contended.

I'll just add the unlikely() to while().

> > +static inline int qemu_spin_trylock(QemuSpin *spin)
> > +{
> > +    if (atomic_test_and_set_acquire(&spin->value)) {
> > +        return -EBUSY;
> > +    }
> > +    return 0;
> > +}
> 
> Here we could also benefit from unlikely(), I think.

I never liked this branch in _trylock, because there will
be a branch anyway around the function. How about:

static inline bool qemu_spin_trylock(QemuSpin *spin)
{
    return __sync_lock_test_and_set(&spin->value, true);
}

We don't return EBUSY, which nobody cares about anyway; callers
will still do if (!trylock). With this we save a branch,
and let callers sprinkle likely/unlikely based on how contended
they expect the lock to be.

		Emilio

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock
  2016-05-18 20:52     ` Emilio G. Cota
@ 2016-05-18 20:57       ` Sergey Fedorov
  0 siblings, 0 replies; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-18 20:57 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On 18/05/16 23:52, Emilio G. Cota wrote:
> On Wed, May 18, 2016 at 22:51:09 +0300, Sergey Fedorov wrote:
>> On 14/05/16 06:34, Emilio G. Cota wrote:
>>> +static inline void qemu_spin_lock(QemuSpin *spin)
>>> +{
>>> +    while (atomic_test_and_set_acquire(&spin->value)) {
>> A possible optimization might be using unlikely() here, compare:
> Testing with a spinlock-heavy workload reveals a little improvement:
>
> taskset -c 0 tests/qht-bench \
> 	-d 5 -n 1 -u 100 -k 4096 -K 4096 -l 4096 -r 4096 -s 4096
>
> I'm running this 10 times. Results in Mops/s:
> Head			31.283 +- 0.190557661148069
> while (unlikely)	31.397 +- 0.107501937967028
> if (likely) + while	31.524 +- 0.219605707272527 
>
> The last case does:
>     if (likely(__sync_lock_test_and_set(&spin->value, true) == false)) {
>         return;
>     }
>     while (__sync_lock_test_and_set(&spin->value, true)) {
>         while (atomic_read(&spin->value)) {
>             cpu_relax();
>         }
>     }
>
> Although I don't like how this will do the TAS twice if the lock is
> contended.
>
> I'll just add the unlikely() to while().

Great!

>
>>> +static inline int qemu_spin_trylock(QemuSpin *spin)
>>> +{
>>> +    if (atomic_test_and_set_acquire(&spin->value)) {
>>> +        return -EBUSY;
>>> +    }
>>> +    return 0;
>>> +}
>> Here we could also benefit from unlikely(), I think.
> I never liked this branch in _trylock, because there will
> be a branch anyway around the function. How about:
>
> static inline bool qemu_spin_trylock(QemuSpin *spin)
> {
>     return __sync_lock_test_and_set(&spin->value, true);
> }
>
> We don't return EBUSY, which nobody cares about anyway; callers
> will still do if (!trylock). With this we save a branch,
> and let callers sprinkle likely/unlikely based on how contended
> they expect the lock to be.

It's "static inline" anyway, so that shouldn't matter much if we have
that "if". But you're right, let's save likely/unlikely for the user of
the function.

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 12/18] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 12/18] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
@ 2016-05-20 22:13   ` Sergey Fedorov
  2016-05-21  2:48     ` Emilio G. Cota
  0 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-20 22:13 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite, Richard Henderson

Wow, that's really great stuff!

On 14/05/16 06:34, Emilio G. Cota wrote:
> diff --git a/include/qemu/qht.h b/include/qemu/qht.h
> new file mode 100644
> index 0000000..c2ab8b8
> --- /dev/null
> +++ b/include/qemu/qht.h
> @@ -0,0 +1,66 @@
> +/*
> + * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
> + *
> + * License: GNU GPL, version 2 or later.
> + *   See the COPYING file in the top-level directory.
> + */
> +#ifndef QEMU_QHT_H
> +#define QEMU_QHT_H
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"

There's no need for qemu-common.h here.

> +#include "qemu/seqlock.h"
> +#include "qemu/qdist.h"
> +#include "qemu/rcu.h"

qemu/rcu.h is really required in qht.c, not here.

> +
> +struct qht {
> +    struct qht_map *map;
> +    unsigned int mode;
> +};
> +
> +struct qht_stats {
> +    size_t head_buckets;
> +    size_t used_head_buckets;
> +    size_t entries;
> +    struct qdist chain;
> +    struct qdist occupancy;
> +};
> +
> +typedef bool (*qht_lookup_func_t)(const void *obj, const void *userp);
> +typedef void (*qht_iter_func_t)(struct qht *ht, void *p, uint32_t h, void *up);

We could move qht_stats, qht_lookup_func_t and qht_iter_func_t closer to
the relevant function declarations. Anyway, I think it's also fine to
keep them here. :)

Although the API is mostly intuitive, some kernel-doc-style comments
wouldn’t hurt, I think. ;-)
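
For instance, a kernel-doc-style comment for qht_lookup() could look
like this (a sketch, not taken from the series):

/**
 * qht_lookup - look up a pointer in a QHT
 * @ht: QHT to be looked up
 * @func: function to compare existing pointers against @userp
 * @userp: pointer to pass to @func
 * @hash: hash of the pointer to be looked up
 *
 * Returns the matching pointer, or NULL if no match is found.
 */
void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
                 uint32_t hash);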

> +
> +#define QHT_MODE_AUTO_RESIZE 0x1 /* auto-resize when heavily loaded */
> +
> +void qht_init(struct qht *ht, size_t n_elems, unsigned int mode);
> +
> +/* call only when there are no readers left */
> +void qht_destroy(struct qht *ht);
> +
> +/* call with an external lock held */
> +void qht_reset(struct qht *ht);
> +
> +/* call with an external lock held */
> +bool qht_reset_size(struct qht *ht, size_t n_elems);
> +
> +/* call with an external lock held */
> +bool qht_insert(struct qht *ht, void *p, uint32_t hash);
> +
> +/* call with an external lock held */
> +bool qht_remove(struct qht *ht, const void *p, uint32_t hash);
> +
> +/* call with an external lock held */
> +void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp);
> +
> +/* call with an external lock held */
> +bool qht_resize(struct qht *ht, size_t n_elems);
> +
> +/* if @func is NULL, then pointer comparison is used */
> +void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
> +                 uint32_t hash);
> +
> +/* pass @stats to qht_statistics_destroy() when done */
> +void qht_statistics_init(struct qht *ht, struct qht_stats *stats);
> +
> +void qht_statistics_destroy(struct qht_stats *stats);
> +
> +#endif /* QEMU_QHT_H */
(snip)
> diff --git a/util/qht.c b/util/qht.c
> new file mode 100644
> index 0000000..112f32d
> --- /dev/null
> +++ b/util/qht.c
> @@ -0,0 +1,703 @@
> +/*
> + * qht.c - QEMU Hash Table, designed to scale for read-mostly workloads.
> + *
> + * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
> + *
> + * License: GNU GPL, version 2 or later.
> + *   See the COPYING file in the top-level directory.
> + *
> + * Assumptions:
> + * - Writers and iterators must take an external lock.
> + * - NULL cannot be inserted as a pointer value.
> + * - Duplicate pointer values cannot be inserted.
> + *
> + * Features:
> + * - Optional auto-resizing: the hash table resizes up if the load surpasses
> + *   a certain threshold. Resizing is done concurrently with readers.
> + *
> + * The key structure is the bucket, which is cacheline-sized. Buckets
> + * contain a few hash values and pointers; the u32 hash values are stored in
> + * full so that resizing is fast. Having this structure instead of directly
> + * chaining items has three advantages:

s/three/two/?

> + * - Failed lookups fail fast, and touch a minimum number of cache lines.
> + * - Resizing the hash table with concurrent lookups is easy.
> + *
> + * There are two types of buckets:
> + * 1. "head" buckets are the ones allocated in the array of buckets in qht_map.
> + * 2. all "non-head" buckets (i.e. all others) are members of a chain that
> + *    starts from a head bucket.
> + * Note that the seqlock and spinlock of a head bucket applies to all buckets
> + * chained to it; these two fields are unused in non-head buckets.
> + *
> + * On removals, we move the last valid item in the chain to the position of the
> + * just-removed entry. This makes lookups slightly faster, since the moment an
> + * invalid entry is found, the (failed) lookup is over.
> + *
> + * Resizing is done by taking all spinlocks (so that no readers-turned-writers
> + * can race with us) and then placing all elements into a new hash table. Last,
> + * the ht->map pointer is set, and the old map is freed once no RCU readers can
> + * see it anymore.
> + *
> + * Related Work:
> + * - Idea of cacheline-sized buckets with full hashes taken from:
> + *   David, Guerraoui & Trigonakis, "Asynchronized Concurrency:
> + *   The Secret to Scaling Concurrent Search Data Structures", ASPLOS'15.
> + * - Why not RCU-based hash tables? They would allow us to get rid of the
> + *   seqlock, but resizing would take forever since RCU read critical
> + *   sections in QEMU take quite a long time.
> + *   More info on relativistic hash tables:
> + *   + Triplett, McKenney & Walpole, "Resizable, Scalable, Concurrent Hash
> + *     Tables via Relativistic Programming", USENIX ATC'11.
> + *   + Corbet, "Relativistic hash tables, part 1: Algorithms", @ lwn.net, 2014.
> + *     https://lwn.net/Articles/612021/
> + */
> +#include "qemu/qht.h"
> +#include "qemu/atomic.h"
> +
> +//#define QHT_DEBUG
> +
> +/*
> + * We want to avoid false sharing of cache lines. Most systems have 64-byte
> + * cache lines so we go with it for simplicity.
> + *
> + * Note that systems with smaller cache lines will be fine (the struct is
> + * almost 64-bytes); systems with larger cache lines might suffer from
> + * some false sharing.
> + */
> +#define QHT_BUCKET_ALIGN 64
> +
> +/* define these to keep sizeof(qht_bucket) within QHT_BUCKET_ALIGN */
> +#if HOST_LONG_BITS == 32
> +#define QHT_BUCKET_ENTRIES 6
> +#else /* 64-bit */
> +#define QHT_BUCKET_ENTRIES 4
> +#endif
> +
> +struct qht_bucket {
> +    QemuSpin lock;
> +    QemuSeqLock sequence;
> +    uint32_t hashes[QHT_BUCKET_ENTRIES];
> +    void *pointers[QHT_BUCKET_ENTRIES];
> +    struct qht_bucket *next;
> +} QEMU_ALIGNED(QHT_BUCKET_ALIGN);
> +
> +QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);

Have you considered using separate structures for head buckets and
non-head buckets, e.g. "struct qht_head_bucket" and "struct
qht_added_bucket"? That would give us a few more entries per cache line.

> +
> +/**
> + * struct qht_map - structure to track an array of buckets
> + * @rcu: used by RCU. Keep it as the top field in the struct to help valgrind
> + *       find the whole struct.
> + * @buckets: array of head buckets. It is constant once the map is created.
> + * @n: number of head buckets. It is constant once the map is created.
> + * @n_added_buckets: number of added (i.e. "non-head") buckets
> + * @n_added_buckets_threshold: threshold to trigger an upward resize once the
> + *                             number of added buckets surpasses it.
> + *
> + * Buckets are tracked in what we call a "map", i.e. this structure.
> + */
> +struct qht_map {
> +    struct rcu_head rcu;
> +    struct qht_bucket *buckets;
> +    size_t n;

s/n/n_buckets/? (Actually, 'n_buckets' is already mentioned in a comment
below.)

> +    size_t n_added_buckets;
> +    size_t n_added_buckets_threshold;
> +};
> +
> +/* trigger a resize when n_added_buckets > n_buckets / div */
> +#define QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV 8
> +
> +static void qht_do_resize(struct qht *ht, size_t n);
> +
> +static inline struct qht_map *qht_map__atomic_mb(const struct qht *ht)
> +{
> +    struct qht_map *map;
> +
> +    map = atomic_read(&ht->map);
> +    /* paired with smp_wmb() before setting ht->map */
> +    smp_rmb();
> +    return map;
> +}

Why not just use atomic_rcu_read()/atomic_rcu_set()? They look like they
were meant for exactly this purpose.
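
i.e., a sketch of the suggestion:

static inline struct qht_map *qht_map__atomic_mb(const struct qht *ht)
{
    /* atomic_rcu_read() already provides the read-side ordering that
     * the explicit smp_rmb() gives here */
    return atomic_rcu_read(&ht->map);
}

static inline void qht_publish(struct qht *ht, struct qht_map *new)
{
    /* atomic_rcu_set() provides the write barrier / release semantics
     * needed before publishing the new map */
    atomic_rcu_set(&ht->map, new);
}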

> +
> +/* helper for lockless bucket chain traversals */
> +static inline
> +struct qht_bucket *bucket_next__atomic_mb(const struct qht_bucket *b)
> +{
> +    struct qht_bucket *ret;
> +
> +    ret = atomic_read(&b->next);
> +    /*
> +     * This barrier guarantees that we will read a properly initialized b->next;
> +     * it is paired with an smp_wmb() before setting b->next.
> +     */
> +    smp_rmb();
> +    return ret;
> +}
> +
> +#ifdef QHT_DEBUG
> +static void qht_bucket_debug(struct qht_bucket *b)
> +{
> +    bool seen_empty = false;
> +    bool corrupt = false;
> +    int i;
> +
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (b->pointers[i] == NULL) {
> +                seen_empty = true;
> +                continue;
> +            }
> +            if (seen_empty) {
> +                fprintf(stderr, "%s: b: %p, pos: %i, hash: 0x%x, p: %p\n",
> +                       __func__, b, i, b->hashes[i], b->pointers[i]);
> +                corrupt = true;
> +            }
> +        }
> +        b = b->next;
> +    } while (b);
> +    assert(!corrupt);
> +}
> +
> +static void qht_map_debug(struct qht_map *map)
> +{
> +    int i;
> +
> +    for (i = 0; i < map->n; i++) {
> +        qht_bucket_debug(&map->buckets[i]);
> +    }
> +}
> +#else
> +static inline void qht_bucket_debug(struct qht_bucket *b)
> +{ }
> +
> +static inline void qht_map_debug(struct qht_map *map)
> +{ }
> +#endif /* QHT_DEBUG */
> +
> +static inline size_t qht_elems_to_buckets(size_t n_elems)
> +{
> +    return pow2ceil(n_elems / QHT_BUCKET_ENTRIES);
> +}
> +
> +static inline void qht_head_init(struct qht_bucket *b)
> +{
> +    memset(b, 0, sizeof(*b));
> +    qemu_spin_init(&b->lock);
> +    seqlock_init(&b->sequence);
> +}
> +
> +static inline
> +struct qht_bucket *qht_map_to_bucket(struct qht_map *map, uint32_t hash)
> +{
> +    return &map->buckets[hash & (map->n - 1)];
> +}
> +
> +/* acquire all bucket locks from a map */
> +static void qht_map_lock_buckets(struct qht_map *map)
> +{
> +    size_t i;
> +
> +    for (i = 0; i < map->n; i++) {
> +        struct qht_bucket *b = &map->buckets[i];
> +
> +        qemu_spin_lock(&b->lock);
> +    }
> +}
> +
> +static void qht_map_unlock_buckets(struct qht_map *map)
> +{
> +    size_t i;
> +
> +    for (i = 0; i < map->n; i++) {
> +        struct qht_bucket *b = &map->buckets[i];
> +
> +        qemu_spin_unlock(&b->lock);
> +    }
> +}
> +
> +static inline bool qht_map_needs_resize(struct qht_map *map)
> +{
> +    return atomic_read(&map->n_added_buckets) > map->n_added_buckets_threshold;
> +}
> +
> +static inline void qht_chain_destroy(struct qht_bucket *head)
> +{
> +    struct qht_bucket *curr = head->next;
> +    struct qht_bucket *prev;
> +
> +    while (curr) {
> +        prev = curr;
> +        curr = curr->next;
> +        qemu_vfree(prev);
> +    }
> +}
> +
> +/* pass only an orphan map */
> +static void qht_map_destroy(struct qht_map *map)
> +{
> +    size_t i;
> +
> +    for (i = 0; i < map->n; i++) {
> +        qht_chain_destroy(&map->buckets[i]);
> +    }
> +    qemu_vfree(map->buckets);
> +    g_free(map);
> +}
> +
> +static void qht_map_reclaim(struct rcu_head *rcu)
> +{
> +    struct qht_map *map = container_of(rcu, struct qht_map, rcu);
> +
> +    qht_map_destroy(map);
> +}
> +
> +static struct qht_map *qht_map_create(size_t n)
> +{
> +    struct qht_map *map;
> +    size_t i;
> +
> +    map = g_malloc(sizeof(*map));
> +    map->n = n;
> +
> +    map->n_added_buckets = 0;
> +    map->n_added_buckets_threshold = n / QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV;
> +
> +    /* let tiny hash tables at least add one non-head bucket */
> +    if (unlikely(map->n_added_buckets_threshold == 0)) {
> +        map->n_added_buckets_threshold = 1;
> +    }
> +
> +    map->buckets = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*map->buckets) * n);
> +    for (i = 0; i < n; i++) {
> +        qht_head_init(&map->buckets[i]);
> +    }
> +    return map;
> +}
> +
> +static inline void qht_publish(struct qht *ht, struct qht_map *new)
> +{
> +    /* Readers should see a properly initialized map; pair with smp_rmb() */
> +    smp_wmb();
> +    atomic_set(&ht->map, new);
> +}
> +
> +void qht_init(struct qht *ht, size_t n_elems, unsigned int mode)
> +{
> +    struct qht_map *map;
> +    size_t n = qht_elems_to_buckets(n_elems);
> +
> +    ht->mode = mode;
> +    map = qht_map_create(n);
> +    qht_publish(ht, map);
> +}
> +
> +/* call only when there are no readers left */
> +void qht_destroy(struct qht *ht)
> +{
> +    qht_map_destroy(ht->map);
> +    memset(ht, 0, sizeof(*ht));
> +}
> +
> +static void qht_bucket_reset(struct qht_bucket *head)
> +{
> +    struct qht_bucket *b = head;
> +    int i;
> +
> +    qemu_spin_lock(&head->lock);
> +    seqlock_write_begin(&head->sequence);
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (b->pointers[i] == NULL) {
> +                goto done;
> +            }
> +            atomic_set(&b->hashes[i], 0);
> +            atomic_set(&b->pointers[i], NULL);
> +        }
> +        b = b->next;
> +    } while (b);
> + done:
> +    seqlock_write_end(&head->sequence);
> +    qemu_spin_unlock(&head->lock);
> +}
> +
> +/* call with an external lock held */
> +void qht_reset(struct qht *ht)
> +{
> +    struct qht_map *map = ht->map;
> +    size_t i;
> +
> +    for (i = 0; i < map->n; i++) {
> +        qht_bucket_reset(&map->buckets[i]);
> +    }
> +    qht_map_debug(map);
> +}
> +
> +/* call with an external lock held */
> +bool qht_reset_size(struct qht *ht, size_t n_elems)
> +{
> +    struct qht_map *old = ht->map;
> +
> +    qht_reset(ht);
> +    if (old->n == qht_elems_to_buckets(n_elems)) {
> +        return false;
> +    }
> +    qht_init(ht, n_elems, ht->mode);
> +    call_rcu1(&old->rcu, qht_map_reclaim);
> +    return true;
> +}
> +
> +static inline
> +void *qht_do_lookup(struct qht_bucket *head, qht_lookup_func_t func,
> +                    const void *userp, uint32_t hash)
> +{
> +    struct qht_bucket *b = head;
> +    int i;
> +
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (atomic_read(&b->hashes[i]) == hash) {
> +                void *p = atomic_read(&b->pointers[i]);

Why do we need this atomic_read() and the other (somewhat
inconsistent-looking) atomic operations on 'b->pointers' and 'b->hashes',
if we always have to access them properly protected by a seqlock
together with a spinlock?

> +
> +                if (likely(p) && likely(func(p, userp))) {
> +                    return p;
> +                }
> +            }
> +        }
> +        b = bucket_next__atomic_mb(b);
> +    } while (b);
> +
> +    return NULL;
> +}
> +
> +static __attribute__((noinline))
> +void *qht_lookup__slowpath(struct qht_bucket *b, qht_lookup_func_t func,
> +                           const void *userp, uint32_t hash)
> +{
> +    uint32_t version;
> +    void *ret;
> +
> +    do {
> +        version = seqlock_read_begin(&b->sequence);

seqlock_read_begin() returns "unsigned".

> +        ret = qht_do_lookup(b, func, userp, hash);
> +    } while (seqlock_read_retry(&b->sequence, version));
> +    return ret;
> +}
> +
> +void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
> +                 uint32_t hash)
> +{
> +    struct qht_bucket *b;
> +    struct qht_map *map;
> +    uint32_t version;
> +    void *ret;
> +
> +    map = qht_map__atomic_mb(ht);
> +    b = qht_map_to_bucket(map, hash);
> +
> +    version = seqlock_read_begin(&b->sequence);
> +    ret = qht_do_lookup(b, func, userp, hash);
> +    if (likely(!seqlock_read_retry(&b->sequence, version))) {
> +        return ret;
> +    }
> +    /*
> +     * Removing the do/while from the fastpath gives a 4% perf. increase when
> +     * running a 100%-lookup microbenchmark.
> +     */
> +    return qht_lookup__slowpath(b, func, userp, hash);
> +}
> +
> +/* call with head->lock held */
> +static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
> +                               struct qht_bucket *head, void *p, uint32_t hash,
> +                               bool *needs_resize)
> +{
> +    struct qht_bucket *b = head;
> +    struct qht_bucket *prev = NULL;
> +    struct qht_bucket *new = NULL;
> +    int i;
> +
> +    for (;;) {
> +        if (b == NULL) {
> +            b = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*b));
> +            memset(b, 0, sizeof(*b));
> +            new = b;
> +            atomic_inc(&map->n_added_buckets);
> +            if (unlikely(qht_map_needs_resize(map)) && needs_resize) {
> +                *needs_resize = true;
> +            }
> +        }
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (b->pointers[i]) {
> +                if (unlikely(b->pointers[i] == p)) {
> +                    return false;
> +                }
> +                continue;
> +            }
> +            /* found an empty key: acquire the seqlock and write */
> +            seqlock_write_begin(&head->sequence);
> +            if (new) {
> +                /*
> +                 * This barrier is paired with smp_rmb() after reading
> +                 * b->next when not holding b->lock.
> +                 */
> +                smp_wmb();
> +                atomic_set(&prev->next, b);
> +            }
> +            atomic_set(&b->hashes[i], hash);
> +            atomic_set(&b->pointers[i], p);
> +            seqlock_write_end(&head->sequence);
> +            return true;
> +        }
> +        prev = b;
> +        b = b->next;
> +    }
> +}
> +
> +/* call with an external lock held */
> +bool qht_insert(struct qht *ht, void *p, uint32_t hash)
> +{
> +    struct qht_map *map = ht->map;
> +    struct qht_bucket *b = qht_map_to_bucket(map, hash);
> +    bool needs_resize = false;
> +    bool ret;
> +
> +    /* NULL pointers are not supported */
> +    assert(p);

Maybe wrap such assertions in a macro, something like tcg_debug_assert()?
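
e.g. something along these lines (the qht_debug_assert name is just a
suggestion):

/* needs <assert.h> (or glib's g_assert) */
#ifdef QHT_DEBUG
# define qht_debug_assert(X) assert(X)
#else
# define qht_debug_assert(X) ((void)0)
#endif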

> +
> +    qemu_spin_lock(&b->lock);
> +    ret = qht_insert__locked(ht, map, b, p, hash, &needs_resize);
> +    qht_bucket_debug(b);
> +    qemu_spin_unlock(&b->lock);
> +
> +    if (unlikely(needs_resize) && ht->mode & QHT_MODE_AUTO_RESIZE) {
> +        qht_do_resize(ht, map->n * 2);
> +    }
> +    return ret;
> +}
> +
> +static inline bool qht_entry_is_last(struct qht_bucket *b, int pos)
> +{
> +    if (pos == QHT_BUCKET_ENTRIES - 1) {
> +        if (b->next == NULL) {
> +            return true;
> +        }
> +        return b->next->pointers[0] == NULL;
> +    }
> +    return b->pointers[pos + 1] == NULL;
> +}
> +
> +static void
> +qht_entry_move(struct qht_bucket *to, int i, struct qht_bucket *from, int j)
> +{
> +    assert(!(to == from && i == j));
> +    assert(to->pointers[i] == NULL);
> +    assert(from->pointers[j]);
> +
> +    atomic_set(&to->hashes[i], from->hashes[j]);
> +    atomic_set(&to->pointers[i], from->pointers[j]);
> +
> +    atomic_set(&from->hashes[j], 0);
> +    atomic_set(&from->pointers[j], NULL);
> +}
> +
> +/*
> + * Find the last valid entry in @head, and swap it with @orig[pos], which has
> + * just been invalidated.
> + */
> +static inline void qht_bucket_fill_hole(struct qht_bucket *orig, int pos)
> +{
> +    struct qht_bucket *b = orig;
> +    struct qht_bucket *prev = NULL;
> +    int i;
> +
> +    if (qht_entry_is_last(orig, pos)) {
> +        return;
> +    }
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {

We could iterate in the opposite direction: from the last entry in a
qht_bucket to the first. That would allow us to fast-forward to the next
qht_bucket in the chain whenever the last entry is non-NULL, and speed
up the search.

> +            if (b->pointers[i] || (b == orig && i == pos)) {
> +                continue;
> +            }
> +            if (i > 0) {
> +                return qht_entry_move(orig, pos, b, i - 1);
> +            }
> +            assert(prev);
> +            return qht_entry_move(orig, pos, prev, QHT_BUCKET_ENTRIES - 1);
> +        }
> +        prev = b;
> +        b = b->next;
> +    } while (b);
> +    /* no free entries other than orig[pos], so swap it with the last one */
> +    qht_entry_move(orig, pos, prev, QHT_BUCKET_ENTRIES - 1);
> +}
> +
> +/* call with b->lock held */
> +static inline
> +bool qht_remove__locked(struct qht_map *map, struct qht_bucket *head,
> +                        const void *p, uint32_t hash)
> +{
> +    struct qht_bucket *b = head;
> +    int i;
> +
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            void *q = b->pointers[i];
> +
> +            if (unlikely(q == NULL)) {
> +                return false;
> +            }
> +            if (q == p) {
> +                assert(b->hashes[i] == hash);
> +                seqlock_write_begin(&head->sequence);
> +                atomic_set(&b->hashes[i], 0);
> +                atomic_set(&b->pointers[i], NULL);

Couldn't we avoid zeroing these here? (At least when debugging is
disabled.)

> +                qht_bucket_fill_hole(b, i);
> +                seqlock_write_end(&head->sequence);
> +                return true;
> +            }
> +        }
> +        b = b->next;
> +    } while (b);
> +    return false;
> +}
> +
> +/* call with an external lock held */
> +bool qht_remove(struct qht *ht, const void *p, uint32_t hash)
> +{
> +    struct qht_map *map = ht->map;
> +    struct qht_bucket *b = qht_map_to_bucket(map, hash);
> +    bool ret;
> +
> +    qemu_spin_lock(&b->lock);
> +    ret = qht_remove__locked(map, b, p, hash);
> +    qht_bucket_debug(b);
> +    qemu_spin_unlock(&b->lock);
> +    return ret;
> +}
> +
> +static inline void qht_bucket_iter(struct qht *ht, struct qht_bucket *b,
> +                                   qht_iter_func_t func, void *userp)
> +{
> +    int i;
> +
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (b->pointers[i] == NULL) {
> +                return;
> +            }
> +            func(ht, b->pointers[i], b->hashes[i], userp);
> +        }
> +        b = b->next;
> +    } while (b);
> +}
> +
> +/* external lock + all of the map's locks held */
> +static inline void qht_map_iter__locked(struct qht *ht, struct qht_map *map,
> +                                        qht_iter_func_t func, void *userp)
> +{
> +    size_t i;
> +
> +    for (i = 0; i < map->n; i++) {
> +        qht_bucket_iter(ht, &map->buckets[i], func, userp);
> +    }
> +}
> +
> +/* call with an external lock held */
> +void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp)
> +{
> +    qht_map_lock_buckets(ht->map);
> +    qht_map_iter__locked(ht, ht->map, func, userp);
> +    qht_map_unlock_buckets(ht->map);
> +}
> +
> +static void qht_map_copy(struct qht *ht, void *p, uint32_t hash, void *userp)
> +{
> +    struct qht_map *new = userp;
> +    struct qht_bucket *b = qht_map_to_bucket(new, hash);
> +
> +    /* no need to acquire b->lock because no thread has seen this map yet */
> +    qht_insert__locked(ht, new, b, p, hash, NULL);
> +}
> +
> +/* call with an external lock held */
> +static void qht_do_resize(struct qht *ht, size_t n)
> +{
> +    struct qht_map *old = ht->map;
> +    struct qht_map *new;
> +
> +    g_assert_cmpuint(n, !=, old->n);

Maybe use g_assert*() consistently? (I mean using either assert() or
the g_assert*() variants.)

Kind regards,
Sergey

> +    new = qht_map_create(n);
> +    qht_iter(ht, qht_map_copy, new);
> +    qht_map_debug(new);
> +
> +    qht_publish(ht, new);
> +    call_rcu1(&old->rcu, qht_map_reclaim);
> +}
> +
> +/* call with an external lock held */
> +bool qht_resize(struct qht *ht, size_t n_elems)
> +{
> +    size_t n = qht_elems_to_buckets(n_elems);
> +
> +    if (n == ht->map->n) {
> +        return false;
> +    }
> +    qht_do_resize(ht, n);
> +    return true;
> +}
> +
> +/* pass @stats to qht_statistics_destroy() when done */
> +void qht_statistics_init(struct qht *ht, struct qht_stats *stats)
> +{
> +    struct qht_map *map;
> +    int i;
> +
> +    map = qht_map__atomic_mb(ht);
> +
> +    stats->head_buckets = map->n;
> +    stats->used_head_buckets = 0;
> +    stats->entries = 0;
> +    qdist_init(&stats->chain);
> +    qdist_init(&stats->occupancy);
> +
> +    for (i = 0; i < map->n; i++) {
> +        struct qht_bucket *head = &map->buckets[i];
> +        struct qht_bucket *b;
> +        uint32_t version;
> +        size_t buckets;
> +        size_t entries;
> +        int j;
> +
> +        do {
> +            version = seqlock_read_begin(&head->sequence);
> +            buckets = 0;
> +            entries = 0;
> +            b = head;
> +            do {
> +                for (j = 0; j < QHT_BUCKET_ENTRIES; j++) {
> +                    if (atomic_read(&b->pointers[j]) == NULL) {
> +                        break;
> +                    }
> +                    entries++;
> +                }
> +                buckets++;
> +                b = bucket_next__atomic_mb(b);
> +            } while (b);
> +        } while (seqlock_read_retry(&head->sequence, version));
> +
> +        if (entries) {
> +            qdist_inc(&stats->chain, buckets);
> +            qdist_inc(&stats->occupancy,
> +                      (double)entries / QHT_BUCKET_ENTRIES / buckets);
> +            stats->used_head_buckets++;
> +            stats->entries += entries;
> +        } else {
> +            qdist_inc(&stats->occupancy, 0);
> +        }
> +    }
> +}
> +
> +void qht_statistics_destroy(struct qht_stats *stats)
> +{
> +    qdist_destroy(&stats->occupancy);
> +    qdist_destroy(&stats->chain);
> +}

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [Qemu-devel] [PATCH v5 12/18] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-20 22:13   ` Sergey Fedorov
@ 2016-05-21  2:48     ` Emilio G. Cota
  2016-05-21 17:41       ` Emilio G. Cota
                         ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-21  2:48 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

Hi Sergey,

Any 'Ack' below means the change has made it to my tree.

On Sat, May 21, 2016 at 01:13:20 +0300, Sergey Fedorov wrote:
> > +#include "qemu/osdep.h"
> > +#include "qemu-common.h"
> 
> There's no need in qemu-common.h

Ack

> > +#include "qemu/seqlock.h"
> > +#include "qemu/qdist.h"
> > +#include "qemu/rcu.h"
> 
> qemu/rcu.h is really required in qht.c, not here.

Ack

> > +struct qht {
> > +    struct qht_map *map;
> > +    unsigned int mode;
> > +};
> > +
> > +struct qht_stats {
> > +    size_t head_buckets;
> > +    size_t used_head_buckets;
> > +    size_t entries;
> > +    struct qdist chain;
> > +    struct qdist occupancy;
> > +};
> > +
> > +typedef bool (*qht_lookup_func_t)(const void *obj, const void *userp);
> > +typedef void (*qht_iter_func_t)(struct qht *ht, void *p, uint32_t h, void *up);
> 
> We could move ght_stats, qht_lookup_func_t and qht_iter_func_t closer to
> the relevant functions declarations. Anyway, I think it's also fine to
> keep them here. :)

I like these at the top of files, so that functions can be easily moved around.

> Although the API is mostly intuitive some kernel-doc-style comments
> wouldn’t hurt, I think. ;-)

The nit that bothered me is the "external lock needed" bit, but it's
removed by the subsequent patch (which once it gets reviewed should be merged
onto this patch); I think the interface is simple enough that comments
would just add noise and maintenance burden. Plus, there are tests under
tests/.

(snip)
> > +/* if @func is NULL, then pointer comparison is used */
> > +void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
> > +                 uint32_t hash);

BTW the comment above this signature isn't true anymore; I've removed it.
[ This is an example of why I'd rather have a simple interface,
  than a complex one that needs documentation :P ]

> (snip)
> > diff --git a/util/qht.c b/util/qht.c
(snip)
> > + * The key structure is the bucket, which is cacheline-sized. Buckets
> > + * contain a few hash values and pointers; the u32 hash values are stored in
> > + * full so that resizing is fast. Having this structure instead of directly
> > + * chaining items has three advantages:
> 
> s/three/two/?

Ack

(snip)
> > +/* define these to keep sizeof(qht_bucket) within QHT_BUCKET_ALIGN */
> > +#if HOST_LONG_BITS == 32
> > +#define QHT_BUCKET_ENTRIES 6
> > +#else /* 64-bit */
> > +#define QHT_BUCKET_ENTRIES 4
> > +#endif
> > +
> > +struct qht_bucket {
> > +    QemuSpin lock;
> > +    QemuSeqLock sequence;
> > +    uint32_t hashes[QHT_BUCKET_ENTRIES];
> > +    void *pointers[QHT_BUCKET_ENTRIES];
> > +    struct qht_bucket *next;
> > +} QEMU_ALIGNED(QHT_BUCKET_ALIGN);
> > +
> > +QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
> 
> Have you considered using separate structures for head buckets and
> non-head buckets, e.g. "struct qht_head_bucket" and "struct
> qht_added_bucket"? This would give us a little more entries per cache-line.

I considered it. Note however that the gain would only apply to
32-bit hosts, since on 64-bit we'd only save 8 bytes but we'd
need 12 to store hash+pointer. (lock+sequence=8, hashes=4*4=16,
pointers=4*8=32, next=8, that is 8+16+32+8=32+32=64).

On 32-bits with 6 entries we have 4 bytes of waste; we could squeeze in
an extra entry. I'm reluctant to do this because (1) it would complicate
code and (2) I don't think we should care too much about performance on
32-bit hosts.
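
(The 64-bit arithmetic above can be double-checked with a throwaway sketch
like this one; the 4-byte stand-ins for QemuSpin and QemuSeqLock are
assumptions that match the lock+sequence=8 figure.)

    #include <stdint.h>

    #define ENTRIES 4   /* 64-bit case */

    struct bucket_64 {
        uint32_t lock;              /* stand-in for QemuSpin    */
        uint32_t sequence;          /* stand-in for QemuSeqLock */
        uint32_t hashes[ENTRIES];   /* 16 bytes */
        void *pointers[ENTRIES];    /* 32 bytes */
        struct bucket_64 *next;     /*  8 bytes */
    };

    _Static_assert(sizeof(struct bucket_64) == 64, "one cache line on 64-bit");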

> > +/**
> > + * struct qht_map - structure to track an array of buckets
> > + * @rcu: used by RCU. Keep it as the top field in the struct to help valgrind
> > + *       find the whole struct.
> > + * @buckets: array of head buckets. It is constant once the map is created.
> > + * @n: number of head buckets. It is constant once the map is created.
> > + * @n_added_buckets: number of added (i.e. "non-head") buckets
> > + * @n_added_buckets_threshold: threshold to trigger an upward resize once the
> > + *                             number of added buckets surpasses it.
> > + *
> > + * Buckets are tracked in what we call a "map", i.e. this structure.
> > + */
> > +struct qht_map {
> > +    struct rcu_head rcu;
> > +    struct qht_bucket *buckets;
> > +    size_t n;
> 
> s/n/n_buckets/? (Actually, 'n_buckets' is already mentioned in a comment
> below.)

Ack. I also applied this change to other functions, e.g. qht_map_create and
qht_resize.

> > +    size_t n_added_buckets;
> > +    size_t n_added_buckets_threshold;
> > +};
> > +
> > +/* trigger a resize when n_added_buckets > n_buckets / div */
> > +#define QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV 8
> > +
> > +static void qht_do_resize(struct qht *ht, size_t n);
> > +
> > +static inline struct qht_map *qht_map__atomic_mb(const struct qht *ht)
> > +{
> > +    struct qht_map *map;
> > +
> > +    map = atomic_read(&ht->map);
> > +    /* paired with smp_wmb() before setting ht->map */
> > +    smp_rmb();
> > +    return map;
> > +}
> 
> Why don't just use atomic_rcu_read/set()? Looks like they were meant for
> that exact purpose.

Ack. I also applied this change to the bucket->next helpers.

In fact the right barrier to emit is smp_read_barrier_depends; I just
noticed that atomic_rcu_read emits a stronger-than-necessary CONSUME for
recent gcc's. I'll address this in a separate patch if nobody
beats me to it.
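
With that, the helper quoted above reduces to something like this (roughly
what v6 would carry; the publish counterpart is shown for symmetry, assuming
atomic_rcu_set()):

    static inline struct qht_map *qht_map__atomic_mb(const struct qht *ht)
    {
        return atomic_rcu_read(&ht->map);
    }

    static inline void qht_publish(struct qht *ht, struct qht_map *new)
    {
        /* pairs with atomic_rcu_read() in qht_map__atomic_mb() */
        atomic_rcu_set(&ht->map, new);
    }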

(snip)
> > +static inline
> > +void *qht_do_lookup(struct qht_bucket *head, qht_lookup_func_t func,
> > +                    const void *userp, uint32_t hash)
> > +{
> > +    struct qht_bucket *b = head;
> > +    int i;
> > +
> > +    do {
> > +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> > +            if (atomic_read(&b->hashes[i]) == hash) {
> > +                void *p = atomic_read(&b->pointers[i]);
> 
> Why do we need this atomic_read() and other (looking a bit inconsistent)
> atomic operations on 'b->pointers' and 'b->hash'? if we always have to
> access them protected properly by a seqlock together with a spinlock?

[ There should be consistency: read accesses use the atomic ops to read,
  while write accesses have acquired the bucket lock so don't need them.
  Well, they need care when they write, since there may be concurrent
  readers. ]

I'm using atomic_read but what I really want is ACCESS_ONCE. That is:
(1) Make sure that the accesses are done in a single instruction (even
    though gcc doesn't explicitly guarantee it even to aligned addresses
    anymore[1])
(2) Make sure the pointer value is only read once, and never refetched.
    This is what comes right after the pointer is read:
> +                if (likely(p) && likely(func(p, userp))) {
> +                    return p;
> +                }
    Refetching the pointer value might result in us passing a NULL p
    value to the comparison function (since there may be
    concurrent updaters!), with an immediate segfault. See [2] for a
    discussion on this (essentially the compiler assumes that there's
    only a single thread).

Given that even reading a garbled hash is OK (we don't really need (1),
since the seqlock will make us retry anyway), I've changed the code to:

         for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
-            if (atomic_read(&b->hashes[i]) == hash) {
+            if (b->hashes[i] == hash) {
+                /* make sure the pointer is read only once */
                 void *p = atomic_read(&b->pointers[i]);

                 if (likely(p) && likely(func(p, userp))) {

Performance-wise this is the impact after 10 tries for:
	$ taskset -c 0 tests/qht-bench \
	  -d 5 -n 1 -u 0 -k 4096 -K 4096 -l 4096 -r 4096 -s 4096
on my Haswell machine I get, in Mops/s:
	atomic_read() for all		40.389 +- 0.20888327415622
	atomic_read(p) only		40.759 +- 0.212835356294224
	no atomic_read(p) (unsafe)	40.559 +- 0.121422128680622

Note that the unsafe version is slightly slower; I guess the CPU is trying
to speculate too much and is gaining little from it.

[1] "Linux-Kernel Memory Model" by Paul McKenney
    http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4374.html
[2] https://lwn.net/Articles/508991/
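
To make point (2) concrete, here is a minimal standalone sketch of the
hazard (not patch code):

    #include <stdbool.h>
    #include <stddef.h>

    typedef bool (*cmp_fn)(const void *obj, const void *userp);

    /* plain load: the compiler may re-read *slot after the NULL check, and a
     * concurrent remover may have set it to NULL in between */
    static void *lookup_refetch_unsafe(void **slot, cmp_fn func, const void *userp)
    {
        void *p = *slot;
        if (p && func(p, userp)) {
            return p;
        }
        return NULL;
    }

    /* volatile access (the effect of atomic_read/ACCESS_ONCE): exactly one load */
    static void *lookup_read_once(void **slot, cmp_fn func, const void *userp)
    {
        void *p = *(void * volatile *)slot;
        if (p && func(p, userp)) {
            return p;
        }
        return NULL;
    }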

(snip)
> > +static __attribute__((noinline))
> > +void *qht_lookup__slowpath(struct qht_bucket *b, qht_lookup_func_t func,
> > +                           const void *userp, uint32_t hash)
> > +{
> > +    uint32_t version;
> > +    void *ret;
> > +
> > +    do {
> > +        version = seqlock_read_begin(&b->sequence);
> 
> seqlock_read_begin() returns "unsigned".

Ack. Changed all callers to unsigned int.

(snip)
> > +/* call with an external lock held */
> > +bool qht_insert(struct qht *ht, void *p, uint32_t hash)
> > +{
> > +    struct qht_map *map = ht->map;
> > +    struct qht_bucket *b = qht_map_to_bucket(map, hash);
> > +    bool needs_resize = false;
> > +    bool ret;
> > +
> > +    /* NULL pointers are not supported */
> > +    assert(p);
> 
> Maybe wrap such assertions in a macro, something like tcg_debug_assert()?

Ack. Defined qht_debug_assert, which only generates the assert if
QHT_DEBUG is defined.
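
Something along these lines (a sketch; the exact definition may differ in v6):

    #ifdef QHT_DEBUG
    #define qht_debug_assert(X) do { assert(X); } while (0)
    #else
    #define qht_debug_assert(X) do { (void)(X); } while (0)
    #endif

(Evaluating (void)(X) in !QHT_DEBUG builds keeps unused-variable warnings away.)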

(snip)
> > +/*
> > + * Find the last valid entry in @head, and swap it with @orig[pos], which has
> > + * just been invalidated.
> > + */
> > +static inline void qht_bucket_fill_hole(struct qht_bucket *orig, int pos)
> > +{
> > +    struct qht_bucket *b = orig;
> > +    struct qht_bucket *prev = NULL;
> > +    int i;
> > +
> > +    if (qht_entry_is_last(orig, pos)) {
> > +        return;
> > +    }
> > +    do {
> > +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> 
> We could iterate in the opposite direction: from the last entry in a
> qht_bucket to the first. It would allow us to fast-forward to the next
> qht_bucket in a chain in case of non-NULL last entry and speed-up the
> search.

But it would slow us down if--say--only the first entry is set. Also
it would complicate the code a bit.

Note that with the resizing threshold that we have, we're guaranteed to
have only up to 1/8 of the head buckets full. We should therefore optimize
for the case where the head bucket isn't full.

> > +            if (q == p) {
> > +                assert(b->hashes[i] == hash);
> > +                seqlock_write_begin(&head->sequence);
> > +                atomic_set(&b->hashes[i], 0);
> > +                atomic_set(&b->pointers[i], NULL);
> 
> Could we better avoid zeroing these here? (At least when debugging is
> disabled.)

Ack. We get a small speedup. Before and after, for the same test as
above with update rate 100%:
  33.779  0.19518936674135
  33.937  0.132249763704892

Changed the code to:

@@ -557,7 +557,7 @@ static void
 qht_entry_move(struct qht_bucket *to, int i, struct qht_bucket *from, int j)
 {
     qht_debug_assert(!(to == from && i == j));
-    qht_debug_assert(to->pointers[i] == NULL);
+    qht_debug_assert(to->pointers[i]);
     qht_debug_assert(from->pointers[j]);

     atomic_set(&to->hashes[i], from->hashes[j]);
@@ -578,11 +578,13 @@ static inline void qht_bucket_fill_hole(struct qht_bucket *orig, int pos)
     int i;

     if (qht_entry_is_last(orig, pos)) {
+        atomic_set(&orig->hashes[pos], 0);
+        atomic_set(&orig->pointers[pos], NULL);
         return;
     }
     do {
         for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
-            if (b->pointers[i] || (b == orig && i == pos)) {
+            if (b->pointers[i]) {
                 continue;
             }
             if (i > 0) {
@@ -616,8 +618,6 @@ bool qht_remove__locked(struct qht_map *map, struct qht_bucket *head,
             if (q == p) {
                 qht_debug_assert(b->hashes[i] == hash);
                 seqlock_write_begin(&head->sequence);
-                atomic_set(&b->hashes[i], 0);
-                atomic_set(&b->pointers[i], NULL);
                 qht_bucket_fill_hole(b, i);
                 seqlock_write_end(&head->sequence);
                 return true;
@@ -634,6 +634,9 @@ bool qht_remove(struct qht *ht, const void *p, uint32_t hash)
     struct qht_map *map;
     bool ret;

+    /* NULL pointers are not supported */
+    qht_debug_assert(p);
+

(snip)
> > +/* call with an external lock held */
> > +static void qht_do_resize(struct qht *ht, size_t n)
> > +{
> > +    struct qht_map *old = ht->map;
> > +    struct qht_map *new;
> > +
> > +    g_assert_cmpuint(n, !=, old->n);
> 
> Maybe use g_assert*() consistently? (I mean using either assert() or
> g_assert*() variants.)

This is the only g_assert I'm using. This is a slow path and I quite like
the fact that the numbers would be printed; I left it as is.

All other asserts are of the type assert(bool), so there's little
point in using g_assert for them. BTW all other asserts are now
under qht_debug_assert.

Thanks for taking a look! If you have time, please check patch 13 out.
That patch should eventually be merged onto this one.

		Emilio


* Re: [Qemu-devel] [PATCH v5 12/18] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-21  2:48     ` Emilio G. Cota
@ 2016-05-21 17:41       ` Emilio G. Cota
  2016-05-22  8:01         ` Alex Bennée
  2016-05-21 20:07       ` Sergey Fedorov
  2016-05-23 19:29       ` Sergey Fedorov
  2 siblings, 1 reply; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-21 17:41 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On Fri, May 20, 2016 at 22:48:11 -0400, Emilio G. Cota wrote:
> On Sat, May 21, 2016 at 01:13:20 +0300, Sergey Fedorov wrote:
> > > +static inline
> > > +void *qht_do_lookup(struct qht_bucket *head, qht_lookup_func_t func,
> > > +                    const void *userp, uint32_t hash)
> > > +{
> > > +    struct qht_bucket *b = head;
> > > +    int i;
> > > +
> > > +    do {
> > > +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> > > +            if (atomic_read(&b->hashes[i]) == hash) {
> > > +                void *p = atomic_read(&b->pointers[i]);
> > 
> > Why do we need this atomic_read() and other (looking a bit inconsistent)
> > atomic operations on 'b->pointers' and 'b->hash'? if we always have to
> > access them protected properly by a seqlock together with a spinlock?
> 
> [ There should be consistency: read accesses use the atomic ops to read,
>   while write accesses have acquired the bucket lock so don't need them.
>   Well, they need care when they write, since there may be concurrent
>   readers. ]
> 
> I'm using atomic_read but what I really want is ACCESS_ONCE. That is:
> (1) Make sure that the accesses are done in a single instruction (even
>     though gcc doesn't explicitly guarantee it even to aligned addresses
>     anymore[1])
> (2) Make sure the pointer value is only read once, and never refetched.
>     This is what comes right after the pointer is read:
> > +                if (likely(p) && likely(func(p, userp))) {
> > +                    return p;
> > +                }
>     Refetching the pointer value might result in us passing a NULL p
>     value to the comparison function (since there may be
>     concurrent updaters!), with an immediate segfault. See [2] for a
>     discussion on this (essentially the compiler assumes that there's
>     only a single thread).
> 
> Given that even reading a garbled hash is OK (we don't really need (1),
> since the seqlock will make us retry anyway), I've changed the code to:
> 
>          for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> -            if (atomic_read(&b->hashes[i]) == hash) {
> +            if (b->hashes[i] == hash) {
> +                /* make sure the pointer is read only once */
>                  void *p = atomic_read(&b->pointers[i]);
> 
>                  if (likely(p) && likely(func(p, userp))) {
> 
> Performance-wise this is the impact after 10 tries for:
> 	$ taskset -c 0 tests/qht-bench \
> 	  -d 5 -n 1 -u 0 -k 4096 -K 4096 -l 4096 -r 4096 -s 4096
> on my Haswell machine I get, in Mops/s:
> 	atomic_read() for all		40.389 +- 0.20888327415622
> 	atomic_read(p) only		40.759 +- 0.212835356294224
> 	no atomic_read(p) (unsafe)	40.559 +- 0.121422128680622
> 
> Note that the unsafe version is slightly slower; I guess the CPU is trying
> to speculate too much and is gaining little from it.
> 
> [1] "Linux-Kernel Memory Model" by Paul McKenney
>     http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4374.html
> [2] https://lwn.net/Articles/508991/

A small update: I just got rid of all the atomic_read/set's that
apply to the hashes, since retries will take care of possible races.

The atomic_read/set's remain only for b->pointers[], for the
above reasons.

		E.


* Re: [Qemu-devel] [PATCH v5 12/18] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-21  2:48     ` Emilio G. Cota
  2016-05-21 17:41       ` Emilio G. Cota
@ 2016-05-21 20:07       ` Sergey Fedorov
  2016-05-23 19:29       ` Sergey Fedorov
  2 siblings, 0 replies; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-21 20:07 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On 21/05/16 05:48, Emilio G. Cota wrote:
> Thanks for taking a look! If you have time, please check patch 13 out.
> That patch should eventually be merged onto this one.

Actually, I was reviewing the final code with all the series applied :)
I'd like to spend some time and review all the series in detail since it
is very important for our MTTCG work.

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v5 12/18] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-21 17:41       ` Emilio G. Cota
@ 2016-05-22  8:01         ` Alex Bennée
  2016-05-23  5:35           ` Emilio G. Cota
  0 siblings, 1 reply; 79+ messages in thread
From: Alex Bennée @ 2016-05-22  8:01 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: Sergey Fedorov, QEMU Developers, MTTCG Devel, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> On Fri, May 20, 2016 at 22:48:11 -0400, Emilio G. Cota wrote:
>> On Sat, May 21, 2016 at 01:13:20 +0300, Sergey Fedorov wrote:
>> > > +static inline
>> > > +void *qht_do_lookup(struct qht_bucket *head, qht_lookup_func_t func,
>> > > +                    const void *userp, uint32_t hash)
>> > > +{
>> > > +    struct qht_bucket *b = head;
>> > > +    int i;
>> > > +
>> > > +    do {
>> > > +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
>> > > +            if (atomic_read(&b->hashes[i]) == hash) {
>> > > +                void *p = atomic_read(&b->pointers[i]);
>> >
>> > Why do we need this atomic_read() and other (looking a bit inconsistent)
>> > atomic operations on 'b->pointers' and 'b->hash'? if we always have to
>> > access them protected properly by a seqlock together with a spinlock?
>>
>> [ There should be consistency: read accesses use the atomic ops to read,
>>   while write accesses have acquired the bucket lock so don't need them.
>>   Well, they need care when they write, since there may be concurrent
>>   readers. ]
>>
>> I'm using atomic_read but what I really want is ACCESS_ONCE. That is:
>> (1) Make sure that the accesses are done in a single instruction (even
>>     though gcc doesn't explicitly guarantee it even to aligned addresses
>>     anymore[1])
>> (2) Make sure the pointer value is only read once, and never refetched.
>>     This is what comes right after the pointer is read:
>> > +                if (likely(p) && likely(func(p, userp))) {
>> > +                    return p;
>> > +                }
>>     Refetching the pointer value might result in us passing a NULL p
>>     value to the comparison function (since there may be
>>     concurrent updaters!), with an immediate segfault. See [2] for a
>>     discussion on this (essentially the compiler assumes that there's
>>     only a single thread).
>>
>> Given that even reading a garbled hash is OK (we don't really need (1),
>> since the seqlock will make us retry anyway), I've changed the code to:
>>
>>          for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
>> -            if (atomic_read(&b->hashes[i]) == hash) {
>> +            if (b->hashes[i] == hash) {
>> +                /* make sure the pointer is read only once */
>>                  void *p = atomic_read(&b->pointers[i]);
>>
>>                  if (likely(p) && likely(func(p, userp))) {
>>
>> Performance-wise this is the impact after 10 tries for:
>> 	$ taskset -c 0 tests/qht-bench \
>> 	  -d 5 -n 1 -u 0 -k 4096 -K 4096 -l 4096 -r 4096 -s 4096
>> on my Haswell machine I get, in Mops/s:
>> 	atomic_read() for all		40.389 +- 0.20888327415622
>> 	atomic_read(p) only		40.759 +- 0.212835356294224
>> 	no atomic_read(p) (unsafe)	40.559 +- 0.121422128680622
>>
>> Note that the unsafe version is slightly slower; I guess the CPU is trying
>> to speculate too much and is gaining little from it.
>>
>> [1] "Linux-Kernel Memory Model" by Paul McKenney
>>     http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4374.html
>> [2] https://lwn.net/Articles/508991/
>
> A small update: I just got rid of all the atomic_read/set's that
> apply to the hashes, since retries will take care of possible races.

I guess the potential hash-clash from a partially read or set hash is
handled by the eventual compare against an always-valid pointer?

>
> The atomic_read/set's remain only for b->pointers[], for the
> above reasons.
>
> 		E.


--
Alex Bennée


* Re: [Qemu-devel] [PATCH v5 12/18] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-22  8:01         ` Alex Bennée
@ 2016-05-23  5:35           ` Emilio G. Cota
  0 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-23  5:35 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Sergey Fedorov, QEMU Developers, MTTCG Devel, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On Sun, May 22, 2016 at 09:01:59 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
> > A small update: I just got rid of all the atomic_read/set's that
> > apply to the hashes, since retries will take care of possible races.
> 
> I guess the potential hash-clash from a partially read or set hash is
> handled by the eventual compare against a always valid pointer?

As long as we call the cmp function with a non-NULL pointer,
we're safe. Given that pointers are read and set atomically, the
'if (p)' check before calling the cmp function guarantees that we
won't cause a segfault. This also means that items removed from
the hash table must be freed after an RCU grace period, since readers
might still see the pointers and pass them to the cmp function.
I'll document this.

If there's a reader concurrent with a writer, it's possible that
the reader might read a partially-updated hash. That's fine because
the reader will retry anyway until it can see the effect
of the writer calling seqlock_write_end. The only important
concern here is to make sure the pointers are read/set atomically.
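
Putting that together, the reader side is roughly (a sketch reusing the
patch's types and helpers, not the literal code):

    static void *qht_lookup_sketch(struct qht_bucket *head, qht_lookup_func_t func,
                                   const void *userp, uint32_t hash)
    {
        struct qht_bucket *b;
        unsigned int version;
        void *ret;
        int i;

        do {
            version = seqlock_read_begin(&head->sequence);
            ret = NULL;
            b = head;
            do {
                for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
                    if (b->hashes[i] == hash) {      /* may be torn; retry fixes it */
                        void *p = atomic_read(&b->pointers[i]);  /* single load */

                        /* p is either NULL or points to an object that is only
                         * freed after an RCU grace period, so func() is safe */
                        if (p && func(p, userp)) {
                            ret = p;
                        }
                    }
                }
                b = atomic_rcu_read(&b->next);
            } while (b && !ret);
        } while (seqlock_read_retry(&head->sequence, version));

        return ret;
    }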

		Emilio


* Re: [Qemu-devel] [PATCH v5 12/18] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-21  2:48     ` Emilio G. Cota
  2016-05-21 17:41       ` Emilio G. Cota
  2016-05-21 20:07       ` Sergey Fedorov
@ 2016-05-23 19:29       ` Sergey Fedorov
  2 siblings, 0 replies; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-23 19:29 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On 21/05/16 05:48, Emilio G. Cota wrote:
> On Sat, May 21, 2016 at 01:13:20 +0300, Sergey Fedorov wrote:
>> Although the API is mostly intuitive some kernel-doc-style comments
>> wouldn’t hurt, I think. ;-)
> The nit that bothered me is the "external lock needed" bit, but it's
> removed by the subsequent patch (which once it gets reviewed should be merged
> onto this patch); I think the interface is simple enough that comments
> would just add noise and maintenance burden. Plus, there are tests under
> tests/.

The interface is simple enough, but e.g. the return value convention for
some of the functions may not be clear at first glance. Regarding
maintenance burden, as soon as we have a good stable API it shouldn't be
painful.
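
For instance, a kernel-doc-style block for qht_lookup() could be as simple
as this sketch (illustration only, matching the prototype quoted earlier):

    /**
     * qht_lookup - look up a pointer in a QHT
     * @ht: QHT to be looked up
     * @func: function to compare existing pointers against @userp
     * @userp: pointer passed to @func for the comparison
     * @hash: hash of the pointer to be looked up
     *
     * Returns the found pointer on a match; NULL otherwise.
     */
    void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
                     uint32_t hash);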

> (snip)
>>> +/* define these to keep sizeof(qht_bucket) within QHT_BUCKET_ALIGN */
>>> +#if HOST_LONG_BITS == 32
>>> +#define QHT_BUCKET_ENTRIES 6
>>> +#else /* 64-bit */
>>> +#define QHT_BUCKET_ENTRIES 4
>>> +#endif
>>> +
>>> +struct qht_bucket {
>>> +    QemuSpin lock;
>>> +    QemuSeqLock sequence;
>>> +    uint32_t hashes[QHT_BUCKET_ENTRIES];
>>> +    void *pointers[QHT_BUCKET_ENTRIES];
>>> +    struct qht_bucket *next;
>>> +} QEMU_ALIGNED(QHT_BUCKET_ALIGN);
>>> +
>>> +QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
>> Have you considered using separate structures for head buckets and
>> non-head buckets, e.g. "struct qht_head_bucket" and "struct
>> qht_added_bucket"? This would give us a little more entries per cache-line.
> I considered it. Note however that the gain would only apply to
> 32-bit hosts, since on 64-bit we'd only save 8 bytes but we'd
> need 12 to store hash+pointer. (lock+sequence=8, hashes=4*4=16,
> pointers=4*8=32, next=8, that is 8+16+32+8=32+32=64).
>
> On 32-bits with 6 entries we have 4 bytes of waste; we could squeeze in
> an extra entry. I'm reluctant to do this because (1) it would complicate
> code and (2) I don't think we should care too much about performance on
> 32-bit hosts.

Fair enough.

> (snip)
>>> +static inline
>>> +void *qht_do_lookup(struct qht_bucket *head, qht_lookup_func_t func,
>>> +                    const void *userp, uint32_t hash)
>>> +{
>>> +    struct qht_bucket *b = head;
>>> +    int i;
>>> +
>>> +    do {
>>> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
>>> +            if (atomic_read(&b->hashes[i]) == hash) {
>>> +                void *p = atomic_read(&b->pointers[i]);
>> Why do we need this atomic_read() and other (looking a bit inconsistent)
>> atomic operations on 'b->pointers' and 'b->hash'? if we always have to
>> access them protected properly by a seqlock together with a spinlock?
> [ There should be consistency: read accesses use the atomic ops to read,
>   while write accesses have acquired the bucket lock so don't need them.
>   Well, they need care when they write, since there may be concurrent
>   readers. ]

Well, I see the consistency now =)

> I'm using atomic_read but what I really want is ACCESS_ONCE. That is:
> (1) Make sure that the accesses are done in a single instruction (even
>     though gcc doesn't explicitly guarantee it even to aligned addresses
>     anymore[1])
> (2) Make sure the pointer value is only read once, and never refetched.
>     This is what comes right after the pointer is read:
>> +                if (likely(p) && likely(func(p, userp))) {
>> +                    return p;
>> +                }
>     Refetching the pointer value might result in us passing a NULL p
>     value to the comparison function (since there may be
>     concurrent updaters!), with an immediate segfault. See [2] for a
>     discussion on this (essentially the compiler assumes that there's
>     only a single thread).
>
> Given that even reading a garbled hash is OK (we don't really need (1),
> since the seqlock will make us retry anyway), I've changed the code to:
>
>          for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> -            if (atomic_read(&b->hashes[i]) == hash) {
> +            if (b->hashes[i] == hash) {
> +                /* make sure the pointer is read only once */
>                  void *p = atomic_read(&b->pointers[i]);
>
>                  if (likely(p) && likely(func(p, userp))) {
>
> Performance-wise this is the impact after 10 tries for:
> 	$ taskset -c 0 tests/qht-bench \
> 	  -d 5 -n 1 -u 0 -k 4096 -K 4096 -l 4096 -r 4096 -s 4096
> on my Haswell machine I get, in Mops/s:
> 	atomic_read() for all		40.389 +- 0.20888327415622
> 	atomic_read(p) only		40.759 +- 0.212835356294224
> 	no atomic_read(p) (unsafe)	40.559 +- 0.121422128680622
>
> Note that the unsafe version is slightly slower; I guess the CPU is trying
> to speculate too much and is gaining little from it.
>
> [1] "Linux-Kernel Memory Model" by Paul McKenney
>     http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4374.html
> [2] https://lwn.net/Articles/508991/

Okay.

> (snip)
>>> +/*
>>> + * Find the last valid entry in @head, and swap it with @orig[pos], which has
>>> + * just been invalidated.
>>> + */
>>> +static inline void qht_bucket_fill_hole(struct qht_bucket *orig, int pos)
>>> +{
>>> +    struct qht_bucket *b = orig;
>>> +    struct qht_bucket *prev = NULL;
>>> +    int i;
>>> +
>>> +    if (qht_entry_is_last(orig, pos)) {
>>> +        return;
>>> +    }
>>> +    do {
>>> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
>> We could iterate in the opposite direction: from the last entry in a
>> qht_bucket to the first. It would allow us to fast-forward to the next
>> qht_bucket in a chain in case of non-NULL last entry and speed-up the
>> search.
> But it would slow us down if--say--only the first entry is set. Also
> it would complicate the code a bit.
>
> Note that with the resizing threshold that we have, we're guaranteed to
> have only up to 1/8 of the head buckets full. We should therefore optimize
> for the case where the head bucket isn't full.

Okay.


Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v5 13/18] qht: support parallel writes
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 13/18] qht: support parallel writes Emilio G. Cota
@ 2016-05-23 20:28   ` Sergey Fedorov
  2016-05-24 22:07     ` Emilio G. Cota
  0 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-23 20:28 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite, Richard Henderson

On 14/05/16 06:34, Emilio G. Cota wrote:
> +/*
> + * Get a head bucket and lock it, making sure its parent map is not stale.
> + * @pmap is filled with a pointer to the bucket's parent map.
> + *
> + * Unlock with qemu_spin_unlock(&b->lock).
> + */
> +static inline
> +struct qht_bucket *qht_bucket_lock__no_stale(struct qht *ht, uint32_t hash,
> +                                             struct qht_map **pmap)
> +{
> +    struct qht_bucket *b;
> +    struct qht_map *map;
> +
> +    for (;;) {
> +        map = qht_map__atomic_mb(ht);
> +        b = qht_map_to_bucket(map, hash);
> +
> +        qemu_spin_lock(&b->lock);
> +        if (likely(!map->stale)) {
> +            *pmap = map;
> +            return b;
> +        }
> +        qemu_spin_unlock(&b->lock);
> +
> +        /* resize in progress; wait until it completes */
> +        while (qemu_spin_locked(&ht->lock)) {
> +            cpu_relax();
> +        }
> +    }
> +}

What if we turn qht::lock into a mutex and change the function as follows:

    static inline
    struct qht_bucket *qht_bucket_lock__no_stale(struct qht *ht, uint32_t hash,
                                                 struct qht_map **pmap)
    {
        struct qht_bucket *b;
        struct qht_map *map;

        map = atomic_rcu_read(&ht->map);
        b = qht_map_to_bucket(map, hash);

        qemu_spin_lock(&b->lock);
        /* 'ht->map' access is serialized by 'b->lock' here */
        if (likely(map == ht->map)) {
            /* no resize in progress; we're done */
            *pmap = map;
            return b;
        }
        qemu_spin_unlock(&b->lock);

        /* resize in progress; retry grabbing 'ht->lock' */
        qemu_mutex_lock(&ht->lock);
        b = qht_map_to_bucket(ht->map, hash);
        *pmap = ht->map;
        qemu_spin_lock(&b->lock);
        qemu_mutex_unlock(&ht->lock);

        return b;
    }


With this implementation we could:
 (1) get rid of qht_map::stale
 (2) don't waste cycles waiting for resize to complete
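
For the map == ht->map check to stay sound, the resize side would then have
to hold every old head bucket lock while publishing the new map, roughly like
this (a sketch reusing helpers from the patch; it assumes the caller holds
ht->lock and that qht_iter() does not itself take the bucket locks):

    static void qht_do_resize(struct qht *ht, size_t n_buckets)
    {
        struct qht_map *old = ht->map;
        struct qht_map *new;
        size_t i;

        new = qht_map_create(n_buckets);

        /* block fast-path writers on the old map while we copy and publish */
        for (i = 0; i < old->n; i++) {
            qemu_spin_lock(&old->buckets[i].lock);
        }
        qht_iter(ht, qht_map_copy, new);   /* ht->map still points to 'old' */
        atomic_rcu_set(&ht->map, new);
        for (i = 0; i < old->n; i++) {
            qemu_spin_unlock(&old->buckets[i].lock);
        }
        call_rcu1(&old->rcu, qht_map_reclaim);
    }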

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v5 00/18] tb hash improvements
  2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
                   ` (17 preceding siblings ...)
  2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 18/18] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
@ 2016-05-23 22:26 ` Sergey Fedorov
  18 siblings, 0 replies; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-23 22:26 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite, Richard Henderson

I think I'm done reviewing v5. (Though I haven't reviewed tests and
statistics patches.)

Kind regards,
Sergey

On 14/05/16 06:34, Emilio G. Cota wrote:
> This patchset applies on top of tcg-next (8b1fe3f4 "cpu-exec:
> Clean up 'interrupt_request' reloading", tagged "pull-tcg-20160512").
>
> For reference, here is v4:
>   https://lists.gnu.org/archive/html/qemu-devel/2016-04/msg04670.html
>
> Changes from v4:
>
> - atomics.h:
>   + Add atomic_read_acquire and atomic_set_release
>   + Rename atomic_test_and_set to atomic_test_and_set_acquire
>   [ Richard: I removed your reviewed-by ]
>
> - qemu_spin @ thread.h:
>   + add bool qemu_spin_locked() to check whether the lock is taken.
>   + Use newly-added acquire/release atomic ops. This is clearer and
>     improves performance; for instance, now we don't emit an
>     unnecessary smp_mb() thanks to using atomic_set_release()
>     instead of atomic_mb_set(). Also, note that __sync_test_and_set
>     has acquire semantics, so it makes sense to have an
>     atomic_test_and_set_acquire that directly calls it, instead
>     of calling atomic_xchg, which emits a full barrier (that we don't
>     need) before __sync_test_and_set.
>   [ Richard: I removed your reviewed-by ]
>
> - tests:
>   + add parallel benchmark (qht-bench). Some perf numbers in
>     the commit message, comparing QHT vs. CLHT and ck_hs.
>
>   + invoke qht-bench from `make check` with test-qht-par. It
>     uses system(3); I couldn't find a way to detect from qht-bench
>     when it is run from gtester, so I decided to just add a silly
>     program to invoke it.
>
> - trivial: util/Makefile.objs: add qdist.o and qht.o each on a
>            separate line
>
> - trivial: added copyright header to test programs
>
> - trivial: updated phys_pc, pc, flags commit message with Richard's
>            comment that hashing cs_base probably isn't worth it.
>
> - qht:
>   + Document that duplicate pointer values cannot be inserted.
>   + qht_insert: return true/false upon success/failure, just like
>                 qht_remove. This can help find bugs.
>   + qht_remove: only write to seqlock if the removal happens --
>                 otherwise the write is unnecessary, since nothing
> 		is written to the bucket.
>   + trivial: s/n_items/n_entries/ for consistency.
>   + qht_grow: substitute it for qht_resize. This is mostly useful
>               for testing.
>   + resize: do not track qht_map->n_entries; track instead the
>             number of non-head buckets added.
> 	    This improves scalability, since we only increment
> 	    this number (with the relatively expensive atomic_inc)
> 	    every time a new non-head bucket is allocated, instead
> 	    of every time an entry is added/removed.
>     * return bool from qht_resize and qht_reset_size; they return
>       false if the resize was not needed (i.e. if the previous size
>       was the requested size).
>   + qht_lookup: do not check for !NULL entries; check directly
>                 for a hash match.
> 		This gives a ~2% perf. increase during
> 		benchmarking. The buckets in the microbenchmarks
> 		are equally-sized well distributed, which is
> 		approximately the case in QEMU thanks to xxhash
> 		and resizing.
>   + Remove MRU bucket promotion policy. With automatic resizing,
>     this is not needed. Furthermore, removing it saves code.
>   + qht_lookup: Add fast-path without do {} while (seqlock). This
>                 gives a 4% perf. improvement on a read-only benchmark.
>   + struct qht_bucket: document the struct
>   + rename qht_lock() to qht_map_lock_buckets()
>   + add map__atomic_mb and bucket_next__atomic_mb helpers that
>     include the necessary atomic_read() and rmb().
>
>   [ All the above changes for qht are simple enough that I kept
>     Richard's reviewed-by.]
>
>   + Support concurrent writes to separate buckets. This is in an
>     additional patch to ease reviewing; feel free to squash it on
>     top of the QHT patch.
>
> Thanks,
>
> 		Emilio
>


* Re: [Qemu-devel] [PATCH v5 13/18] qht: support parallel writes
  2016-05-23 20:28   ` Sergey Fedorov
@ 2016-05-24 22:07     ` Emilio G. Cota
  2016-05-24 22:17       ` Sergey Fedorov
  0 siblings, 1 reply; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-24 22:07 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On Mon, May 23, 2016 at 23:28:27 +0300, Sergey Fedorov wrote:
> What if we turn qht::lock into a mutex and change the function as follows:
> 
>     static inline
>     struct qht_bucket *qht_bucket_lock__no_stale(struct qht *ht, uint32_t hash,
>                                                  struct qht_map **pmap)
>     {
>         struct qht_bucket *b;
>         struct qht_map *map;
> 
>         map = atomic_rcu_read(&ht->map);
>         b = qht_map_to_bucket(map, hash);
> 
>         qemu_spin_lock(&b->lock);
>         /* 'ht->map' access is serialized by 'b->lock' here */
>         if (likely(map == ht->map)) {
>             /* no resize in progress; we're done */
>             *pmap = map;
>             return b;
>         }
>         qemu_spin_unlock(&b->lock);
> 
>         /* resize in progress; retry grabbing 'ht->lock' */
>         qemu_mutex_lock(&ht->lock);
>         b = qht_map_to_bucket(ht->map, hash);
>         *pmap = ht->map;
>         qemu_spin_lock(&b->lock);
>         qemu_mutex_unlock(&ht->lock);
> 
>         return b;
>     }
> 
> 
> With this implementation we could:
>  (1) get rid of qht_map::stale
>  (2) don't waste cycles waiting for resize to complete

I'll include this in v6.

Thanks,

		Emilio


* Re: [Qemu-devel] [PATCH v5 13/18] qht: support parallel writes
  2016-05-24 22:07     ` Emilio G. Cota
@ 2016-05-24 22:17       ` Sergey Fedorov
  2016-05-25  0:10         ` Emilio G. Cota
  0 siblings, 1 reply; 79+ messages in thread
From: Sergey Fedorov @ 2016-05-24 22:17 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On 25/05/16 01:07, Emilio G. Cota wrote:
> On Mon, May 23, 2016 at 23:28:27 +0300, Sergey Fedorov wrote:
>> What if we turn qht::lock into a mutex and change the function as follows:
>>
>>     static inline
>>     struct qht_bucket *qht_bucket_lock__no_stale(struct qht *ht, uint32_t hash,
>>                                                  struct qht_map **pmap)
>>     {
>>         struct qht_bucket *b;
>>         struct qht_map *map;
>>
>>         map = atomic_rcu_read(&ht->map);
>>         b = qht_map_to_bucket(map, hash);
>>
>>         qemu_spin_lock(&b->lock);
>>         /* 'ht->map' access is serialized by 'b->lock' here */
>>         if (likely(map == ht->map)) {
>>             /* no resize in progress; we're done */
>>             *pmap = map;
>>             return b;
>>         }
>>         qemu_spin_unlock(&b->lock);
>>
>>         /* resize in progress; retry grabbing 'ht->lock' */
>>         qemu_mutex_lock(&ht->lock);
>>         b = qht_map_to_bucket(ht->map, hash);
>>         *pmap = ht->map;
>>         qemu_spin_lock(&b->lock);
>>         qemu_mutex_unlock(&ht->lock);
>>
>>         return b;
>>     }
>>
>>
>> With this implementation we could:
>>  (1) get rid of qht_map::stale
>>  (2) don't waste cycles waiting for resize to complete
> I'll include this in v6.

How does it perform?

Regards,
Sergey


* Re: [Qemu-devel] [PATCH v5 13/18] qht: support parallel writes
  2016-05-24 22:17       ` Sergey Fedorov
@ 2016-05-25  0:10         ` Emilio G. Cota
  0 siblings, 0 replies; 79+ messages in thread
From: Emilio G. Cota @ 2016-05-25  0:10 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

On Wed, May 25, 2016 at 01:17:21 +0300, Sergey Fedorov wrote:
> >> With this implementation we could:
> >>  (1) get rid of qht_map::stale
> >>  (2) don't waste cycles waiting for resize to complete
> > I'll include this in v6.
> 
> How is it by perf?

Not much of a difference, since resize is a slow path. Calling
qht-bench with lots of update and resize threads performs very
poorly either way =D

I like the change though because using the mutex here simplifies
the resize code; there's no guilt anymore attached to holding the lock
for some time (e.g. when allocating a new, possibly quite large,
map), whereas with the spinlock we would allocate it before
acquiring the lock, without knowing whether the allocation would
be needed in the end.
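
i.e. roughly (a sketch assuming ht->lock becomes a QemuMutex and that
qht_resize() takes it itself in v6):

    bool qht_resize(struct qht *ht, size_t n_elems)
    {
        size_t n_buckets = qht_elems_to_buckets(n_elems);
        bool ret = false;

        qemu_mutex_lock(&ht->lock);
        if (n_buckets != ht->map->n) {
            qht_do_resize(ht, n_buckets);   /* new map allocated under the lock */
            ret = true;
        }
        qemu_mutex_unlock(&ht->lock);
        return ret;
    }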

		Emilio


end of thread, other threads:[~2016-05-25  0:10 UTC | newest]

Thread overview: 79+ messages
2016-05-14  3:34 [Qemu-devel] [PATCH v5 00/18] tb hash improvements Emilio G. Cota
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 01/18] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 02/18] seqlock: remove optional mutex Emilio G. Cota
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 03/18] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 04/18] include/processor.h: define cpu_relax() Emilio G. Cota
2016-05-18 17:47   ` Sergey Fedorov
2016-05-18 18:29     ` Emilio G. Cota
2016-05-18 18:37       ` Sergey Fedorov
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 05/18] atomics: add atomic_test_and_set_acquire Emilio G. Cota
2016-05-16 10:05   ` Paolo Bonzini
2016-05-17 16:15   ` Sergey Fedorov
2016-05-17 16:23     ` Paolo Bonzini
2016-05-17 16:47       ` Sergey Fedorov
2016-05-17 17:08         ` Paolo Bonzini
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 06/18] atomics: add atomic_read_acquire and atomic_set_release Emilio G. Cota
2016-05-15 10:22   ` Pranith Kumar
2016-05-16 18:27     ` Emilio G. Cota
2016-05-17 16:53   ` Sergey Fedorov
2016-05-17 17:08     ` Paolo Bonzini
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 07/18] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
     [not found]   ` <573B5134.8060104@gmail.com>
2016-05-17 19:19     ` Richard Henderson
2016-05-17 19:57       ` Sergey Fedorov
2016-05-17 20:01         ` Sergey Fedorov
2016-05-17 22:12           ` Richard Henderson
2016-05-17 22:22             ` Richard Henderson
2016-05-17 20:04       ` Emilio G. Cota
2016-05-17 20:20         ` Sergey Fedorov
2016-05-18  0:28           ` Emilio G. Cota
2016-05-18 14:18             ` Sergey Fedorov
2016-05-18 14:47               ` Sergey Fedorov
2016-05-18 14:59                 ` Paolo Bonzini
2016-05-18 15:05                   ` Sergey Fedorov
2016-05-18 15:09                     ` Paolo Bonzini
2016-05-18 16:59                       ` Emilio G. Cota
2016-05-18 17:00                         ` Paolo Bonzini
2016-05-18 15:35                     ` Peter Maydell
2016-05-18 15:36                       ` Paolo Bonzini
2016-05-18 15:44                         ` Peter Maydell
2016-05-18 15:59                           ` Sergey Fedorov
2016-05-18 16:02                       ` Richard Henderson
2016-05-17 19:38     ` Emilio G. Cota
2016-05-17 20:35       ` Sergey Fedorov
2016-05-17 23:18         ` Emilio G. Cota
2016-05-18 13:59           ` Sergey Fedorov
2016-05-18 14:05             ` Paolo Bonzini
2016-05-18 14:10               ` Sergey Fedorov
2016-05-18 14:40                 ` Paolo Bonzini
2016-05-18 18:21   ` Sergey Fedorov
2016-05-18 19:04     ` Emilio G. Cota
2016-05-18 19:51   ` Sergey Fedorov
2016-05-18 20:52     ` Emilio G. Cota
2016-05-18 20:57       ` Sergey Fedorov
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 08/18] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
2016-05-17 17:22   ` Sergey Fedorov
2016-05-17 19:48     ` Emilio G. Cota
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 09/18] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
2016-05-17 17:47   ` Sergey Fedorov
2016-05-17 19:09     ` Emilio G. Cota
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 10/18] qdist: add module to represent frequency distributions of data Emilio G. Cota
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 11/18] qdist: add test program Emilio G. Cota
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 12/18] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
2016-05-20 22:13   ` Sergey Fedorov
2016-05-21  2:48     ` Emilio G. Cota
2016-05-21 17:41       ` Emilio G. Cota
2016-05-22  8:01         ` Alex Bennée
2016-05-23  5:35           ` Emilio G. Cota
2016-05-21 20:07       ` Sergey Fedorov
2016-05-23 19:29       ` Sergey Fedorov
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 13/18] qht: support parallel writes Emilio G. Cota
2016-05-23 20:28   ` Sergey Fedorov
2016-05-24 22:07     ` Emilio G. Cota
2016-05-24 22:17       ` Sergey Fedorov
2016-05-25  0:10         ` Emilio G. Cota
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 14/18] qht: add test program Emilio G. Cota
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 15/18] qht: add qht-bench, a performance benchmark Emilio G. Cota
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 16/18] qht: add test-qht-par to invoke qht-bench from 'check' target Emilio G. Cota
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 17/18] tb hash: track translated blocks with qht Emilio G. Cota
2016-05-14  3:34 ` [Qemu-devel] [PATCH v5 18/18] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
2016-05-23 22:26 ` [Qemu-devel] [PATCH v5 00/18] tb hash improvements Sergey Fedorov
