* [Qemu-devel] [PATCH v3 00/11] tb hash improvements
@ 2016-04-19 23:07 Emilio G. Cota
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 01/11] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
                   ` (10 more replies)
  0 siblings, 11 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-19 23:07 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

See v2 here:

  https://lists.gnu.org/archive/html/qemu-devel/2016-04/msg01307.html

Changes from v2:

- Dropped "add missing fold of tb_ctx into tcg_ctx", already merged
  upstream as commit 7e6bd36d611.
- Added reviewed-by tags from Alex and Richard
- xxhash:
  + use rol32 from qemu/bitops.h
  + remove seed parameter from tb_hash_func5
- qht:
  + add comments suggested by Alex, almost all of them about MRU
  + add BUILD_BUG_ON check for the size of qht_bucket
  + add assert(orig != head) in MRU promotion function, and delete
    code path that dealt with that case (it was dead code)
  + fold qht_bucket_reset__locked into qht_bucket_reset
  + do not inline qht_lookup
    + move definitions of qht_bucket and qht_map to qht.c
  + remove 'count' variable for knowing whether lookups/insertions
    were on non-head buckets; just check 'b != head' instead.
  + add avg bucket chain length to 'info jit'. 
- qht-test:
  + drive tests with g_test 
  + add avg_bucket_chain_length checks

Thanks,

		Emilio

* [Qemu-devel] [PATCH v3 01/11] compiler.h: add QEMU_ALIGNED() to enforce struct alignment
  2016-04-19 23:07 [Qemu-devel] [PATCH v3 00/11] tb hash improvements Emilio G. Cota
@ 2016-04-19 23:07 ` Emilio G. Cota
  2016-04-22  9:32   ` Alex Bennée
  2016-04-22  9:35   ` Peter Maydell
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 02/11] seqlock: remove optional mutex Emilio G. Cota
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-19 23:07 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/compiler.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
index 8f1cc7b..1978ddc 100644
--- a/include/qemu/compiler.h
+++ b/include/qemu/compiler.h
@@ -41,6 +41,8 @@
 # define QEMU_PACKED __attribute__((packed))
 #endif
 
+#define QEMU_ALIGNED(B) __attribute__((aligned(B)))
+
 #ifndef glue
 #define xglue(x, y) x ## y
 #define glue(x, y) xglue(x, y)
-- 
2.5.0
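
For reference, a minimal usage sketch of the new macro (the struct below is
hypothetical, not part of this patch): aligning a structure to a 64-byte
boundary keeps each instance on its own cache line and avoids false sharing.

    #include "qemu/compiler.h"

    /* hypothetical example: one counter per cache line */
    struct PaddedCounter {
        unsigned long count;
    } QEMU_ALIGNED(64);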

* [Qemu-devel] [PATCH v3 02/11] seqlock: remove optional mutex
  2016-04-19 23:07 [Qemu-devel] [PATCH v3 00/11] tb hash improvements Emilio G. Cota
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 01/11] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
@ 2016-04-19 23:07 ` Emilio G. Cota
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 03/11] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-19 23:07 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

This option is unused; besides, it bloats the struct when not needed.
Let's just let writers define their own locks elsewhere.
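
The intended pattern after this change, sketched below with illustrative
names (not from this patch): the writer serializes with a lock of its own
choosing and uses the seqlock only to publish the update to lockless readers.

    #include "qemu/osdep.h"
    #include "qemu/seqlock.h"
    #include "qemu/thread.h"

    static QemuSeqLock clock_seqlock;   /* seqlock_init() at startup */
    static QemuMutex clock_mutex;       /* qemu_mutex_init() at startup */
    static int64_t clock_offset;

    static void update_clock_offset(int64_t delta)
    {
        qemu_mutex_lock(&clock_mutex);        /* serialize writers */
        seqlock_write_lock(&clock_seqlock);   /* readers will retry meanwhile */
        clock_offset += delta;
        seqlock_write_unlock(&clock_seqlock);
        qemu_mutex_unlock(&clock_mutex);
    }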

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpus.c                 |  2 +-
 include/qemu/seqlock.h | 10 +---------
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/cpus.c b/cpus.c
index cbeb1f6..dd86da5 100644
--- a/cpus.c
+++ b/cpus.c
@@ -619,7 +619,7 @@ int cpu_throttle_get_percentage(void)
 
 void cpu_ticks_init(void)
 {
-    seqlock_init(&timers_state.vm_clock_seqlock, NULL);
+    seqlock_init(&timers_state.vm_clock_seqlock);
     vmstate_register(NULL, 0, &vmstate_timers, &timers_state);
     throttle_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL_RT,
                                            cpu_throttle_timer_tick, NULL);
diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index 70b01fd..e673482 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -19,22 +19,17 @@
 typedef struct QemuSeqLock QemuSeqLock;
 
 struct QemuSeqLock {
-    QemuMutex *mutex;
     unsigned sequence;
 };
 
-static inline void seqlock_init(QemuSeqLock *sl, QemuMutex *mutex)
+static inline void seqlock_init(QemuSeqLock *sl)
 {
-    sl->mutex = mutex;
     sl->sequence = 0;
 }
 
 /* Lock out other writers and update the count.  */
 static inline void seqlock_write_lock(QemuSeqLock *sl)
 {
-    if (sl->mutex) {
-        qemu_mutex_lock(sl->mutex);
-    }
     ++sl->sequence;
 
     /* Write sequence before updating other fields.  */
@@ -47,9 +42,6 @@ static inline void seqlock_write_unlock(QemuSeqLock *sl)
     smp_wmb();
 
     ++sl->sequence;
-    if (sl->mutex) {
-        qemu_mutex_unlock(sl->mutex);
-    }
 }
 
 static inline unsigned seqlock_read_begin(QemuSeqLock *sl)
-- 
2.5.0

* [Qemu-devel] [PATCH v3 03/11] seqlock: rename write_lock/unlock to write_begin/end
  2016-04-19 23:07 [Qemu-devel] [PATCH v3 00/11] tb hash improvements Emilio G. Cota
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 01/11] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 02/11] seqlock: remove optional mutex Emilio G. Cota
@ 2016-04-19 23:07 ` Emilio G. Cota
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 04/11] include/processor.h: define cpu_relax() Emilio G. Cota
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-19 23:07 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

It is a more appropriate name, now that the mutex embedded
in the seqlock is gone.
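
For context, a sketch of how the renamed calls pair with the (unchanged)
read side, which simply retries if it raced with a writer; the helper name
below is illustrative, not part of this patch:

    /* reader: no lock taken, just retry on a concurrent write */
    static int64_t read_icount_bias(void)
    {
        unsigned start;
        int64_t bias;

        do {
            start = seqlock_read_begin(&timers_state.vm_clock_seqlock);
            bias = timers_state.qemu_icount_bias;
        } while (seqlock_read_retry(&timers_state.vm_clock_seqlock, start));
        return bias;
    }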

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpus.c                 | 28 ++++++++++++++--------------
 include/qemu/seqlock.h |  4 ++--
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/cpus.c b/cpus.c
index dd86da5..735c9b2 100644
--- a/cpus.c
+++ b/cpus.c
@@ -247,13 +247,13 @@ int64_t cpu_get_clock(void)
 void cpu_enable_ticks(void)
 {
     /* Here, the really thing protected by seqlock is cpu_clock_offset. */
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     if (!timers_state.cpu_ticks_enabled) {
         timers_state.cpu_ticks_offset -= cpu_get_host_ticks();
         timers_state.cpu_clock_offset -= get_clock();
         timers_state.cpu_ticks_enabled = 1;
     }
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 }
 
 /* disable cpu_get_ticks() : the clock is stopped. You must not call
@@ -263,13 +263,13 @@ void cpu_enable_ticks(void)
 void cpu_disable_ticks(void)
 {
     /* Here, the really thing protected by seqlock is cpu_clock_offset. */
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     if (timers_state.cpu_ticks_enabled) {
         timers_state.cpu_ticks_offset += cpu_get_host_ticks();
         timers_state.cpu_clock_offset = cpu_get_clock_locked();
         timers_state.cpu_ticks_enabled = 0;
     }
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 }
 
 /* Correlation between real and virtual time is always going to be
@@ -292,7 +292,7 @@ static void icount_adjust(void)
         return;
     }
 
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     cur_time = cpu_get_clock_locked();
     cur_icount = cpu_get_icount_locked();
 
@@ -313,7 +313,7 @@ static void icount_adjust(void)
     last_delta = delta;
     timers_state.qemu_icount_bias = cur_icount
                               - (timers_state.qemu_icount << icount_time_shift);
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 }
 
 static void icount_adjust_rt(void *opaque)
@@ -353,7 +353,7 @@ static void icount_warp_rt(void)
         return;
     }
 
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     if (runstate_is_running()) {
         int64_t clock = REPLAY_CLOCK(REPLAY_CLOCK_VIRTUAL_RT,
                                      cpu_get_clock_locked());
@@ -372,7 +372,7 @@ static void icount_warp_rt(void)
         timers_state.qemu_icount_bias += warp_delta;
     }
     vm_clock_warp_start = -1;
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 
     if (qemu_clock_expired(QEMU_CLOCK_VIRTUAL)) {
         qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
@@ -397,9 +397,9 @@ void qtest_clock_warp(int64_t dest)
         int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
         int64_t warp = qemu_soonest_timeout(dest - clock, deadline);
 
-        seqlock_write_lock(&timers_state.vm_clock_seqlock);
+        seqlock_write_begin(&timers_state.vm_clock_seqlock);
         timers_state.qemu_icount_bias += warp;
-        seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+        seqlock_write_end(&timers_state.vm_clock_seqlock);
 
         qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
         timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]);
@@ -466,9 +466,9 @@ void qemu_start_warp_timer(void)
              * It is useful when we want a deterministic execution time,
              * isolated from host latencies.
              */
-            seqlock_write_lock(&timers_state.vm_clock_seqlock);
+            seqlock_write_begin(&timers_state.vm_clock_seqlock);
             timers_state.qemu_icount_bias += deadline;
-            seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+            seqlock_write_end(&timers_state.vm_clock_seqlock);
             qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
         } else {
             /*
@@ -479,11 +479,11 @@ void qemu_start_warp_timer(void)
              * you will not be sending network packets continuously instead of
              * every 100ms.
              */
-            seqlock_write_lock(&timers_state.vm_clock_seqlock);
+            seqlock_write_begin(&timers_state.vm_clock_seqlock);
             if (vm_clock_warp_start == -1 || vm_clock_warp_start > clock) {
                 vm_clock_warp_start = clock;
             }
-            seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+            seqlock_write_end(&timers_state.vm_clock_seqlock);
             timer_mod_anticipate(icount_warp_timer, clock + deadline);
         }
     } else if (deadline == 0) {
diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index e673482..4dfc055 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -28,7 +28,7 @@ static inline void seqlock_init(QemuSeqLock *sl)
 }
 
 /* Lock out other writers and update the count.  */
-static inline void seqlock_write_lock(QemuSeqLock *sl)
+static inline void seqlock_write_begin(QemuSeqLock *sl)
 {
     ++sl->sequence;
 
@@ -36,7 +36,7 @@ static inline void seqlock_write_lock(QemuSeqLock *sl)
     smp_wmb();
 }
 
-static inline void seqlock_write_unlock(QemuSeqLock *sl)
+static inline void seqlock_write_end(QemuSeqLock *sl)
 {
     /* Write other fields before finalizing sequence.  */
     smp_wmb();
-- 
2.5.0

* [Qemu-devel] [PATCH v3 04/11] include/processor.h: define cpu_relax()
  2016-04-19 23:07 [Qemu-devel] [PATCH v3 00/11] tb hash improvements Emilio G. Cota
                   ` (2 preceding siblings ...)
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 03/11] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
@ 2016-04-19 23:07 ` Emilio G. Cota
  2016-04-20 12:15   ` KONRAD Frederic
                     ` (2 more replies)
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 05/11] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
                   ` (6 subsequent siblings)
  10 siblings, 3 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-19 23:07 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

Taken from the Linux kernel.
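
A sketch of the intended use (the flag below is illustrative, not from this
patch): busy-wait loops call cpu_relax() on every iteration to lower power
consumption and pipeline pressure while spinning.

    #include "qemu/osdep.h"
    #include "qemu/atomic.h"
    #include "qemu/processor.h"

    static int flag;

    static void wait_for_flag(void)
    {
        /* spin until another thread sets 'flag', hinting the CPU meanwhile */
        while (!atomic_read(&flag)) {
            cpu_relax();
        }
    }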

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/processor.h | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)
 create mode 100644 include/qemu/processor.h

diff --git a/include/qemu/processor.h b/include/qemu/processor.h
new file mode 100644
index 0000000..675a00a
--- /dev/null
+++ b/include/qemu/processor.h
@@ -0,0 +1,28 @@
+#ifndef QEMU_PROCESSOR_H
+#define QEMU_PROCESSOR_H
+
+#include "qemu/atomic.h"
+
+#if defined(__i386__) || defined(__x86_64__)
+#define cpu_relax() asm volatile("rep; nop" ::: "memory")
+#endif
+
+#ifdef __ia64__
+#define cpu_relax() asm volatile("hint @pause" ::: "memory")
+#endif
+
+#ifdef __aarch64__
+#define cpu_relax() asm volatile("yield" ::: "memory")
+#endif
+
+#if defined(__powerpc64__)
+/* set Hardware Multi-Threading (HMT) priority to low; then back to medium */
+#define cpu_relax() asm volatile("or 1, 1, 1;"
+                                 "or 2, 2, 2;" ::: "memory")
+#endif
+
+#ifndef cpu_relax
+#define cpu_relax() barrier()
+#endif
+
+#endif /* QEMU_PROCESSOR_H */
-- 
2.5.0

* [Qemu-devel] [PATCH v3 05/11] qemu-thread: add simple test-and-set spinlock
  2016-04-19 23:07 [Qemu-devel] [PATCH v3 00/11] tb hash improvements Emilio G. Cota
                   ` (3 preceding siblings ...)
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 04/11] include/processor.h: define cpu_relax() Emilio G. Cota
@ 2016-04-19 23:07 ` Emilio G. Cota
  2016-04-20 15:18   ` Richard Henderson
  2016-04-20 17:35   ` [Qemu-devel] [UPDATED " Emilio G. Cota
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 06/11] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-19 23:07 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

From: Guillaume Delbergue <guillaume.delbergue@greensocs.com>

Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
[Rewritten. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Emilio's additions: call cpu_relax() while spinning; optimize for
 uncontended locks by acquiring the lock with xchg+test instead of
 test+xchg+test.]
Signed-off-by: Emilio G. Cota <cota@braap.org>
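
A minimal usage sketch of the new API (the counter below is hypothetical,
not part of this patch); note that qemu_spin_trylock() returns 0 on success
and -EBUSY otherwise:

    #include "qemu/thread.h"

    static QemuSpin counter_lock;   /* qemu_spin_init(&counter_lock) at startup */
    static unsigned long counter;

    static void counter_inc(void)
    {
        qemu_spin_lock(&counter_lock);    /* spins with cpu_relax() if contended */
        counter++;
        qemu_spin_unlock(&counter_lock);
    }
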
---
 include/qemu/thread.h | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index bdae6df..e2af57c 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -1,6 +1,9 @@
 #ifndef __QEMU_THREAD_H
 #define __QEMU_THREAD_H 1
 
+#include <errno.h>
+#include "qemu/processor.h"
+#include "qemu/atomic.h"
 
 typedef struct QemuMutex QemuMutex;
 typedef struct QemuCond QemuCond;
@@ -60,4 +63,35 @@ struct Notifier;
 void qemu_thread_atexit_add(struct Notifier *notifier);
 void qemu_thread_atexit_remove(struct Notifier *notifier);
 
+typedef struct QemuSpin {
+    int value;
+} QemuSpin;
+
+static inline void qemu_spin_init(QemuSpin *spin)
+{
+    spin->value = 0;
+}
+
+static inline void qemu_spin_lock(QemuSpin *spin)
+{
+    while (atomic_xchg(&spin->value, true)) {
+        while (atomic_read(&spin->value)) {
+            cpu_relax();
+        }
+    }
+}
+
+static inline int qemu_spin_trylock(QemuSpin *spin)
+{
+    if (atomic_read(&spin->value) || atomic_xchg(&spin->value, true)) {
+        return -EBUSY;
+    }
+    return 0;
+}
+
+static inline void qemu_spin_unlock(QemuSpin *spin)
+{
+    atomic_mb_set(&spin->value, 0);
+}
+
 #endif
-- 
2.5.0

* [Qemu-devel] [PATCH v3 06/11] exec: add tb_hash_func5, derived from xxhash
  2016-04-19 23:07 [Qemu-devel] [PATCH v3 00/11] tb hash improvements Emilio G. Cota
                   ` (4 preceding siblings ...)
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 05/11] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
@ 2016-04-19 23:07 ` Emilio G. Cota
  2016-04-20 15:19   ` Richard Henderson
  2016-04-22 12:58   ` Alex Bennée
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 07/11] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-19 23:07 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

This will be used by upcoming changes for computing the TB hash.

Add this into a separate file to include the copyright notice from
xxhash.
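
A sketch of how the function is meant to be fed -- the actual hook-up into
tb_hash_func() comes later in this series; the wrapper below is illustrative:

    #include "exec/tb-hash-xx.h"

    /* a0 and b0 each contribute two 32-bit words to the hash; e is the fifth */
    static uint32_t hash_tb_keys(uint64_t phys_pc, uint64_t pc, uint32_t flags)
    {
        return tb_hash_func5(phys_pc, pc, flags);
    }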

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/tb-hash-xx.h | 94 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)
 create mode 100644 include/exec/tb-hash-xx.h

diff --git a/include/exec/tb-hash-xx.h b/include/exec/tb-hash-xx.h
new file mode 100644
index 0000000..67f4e6f
--- /dev/null
+++ b/include/exec/tb-hash-xx.h
@@ -0,0 +1,94 @@
+/*
+ * xxHash - Fast Hash algorithm
+ * Copyright (C) 2012-2016, Yann Collet
+ *
+ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * + Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * + Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * You can contact the author at :
+ * - xxHash source repository : https://github.com/Cyan4973/xxHash
+ */
+#ifndef EXEC_TB_HASH_XX
+#define EXEC_TB_HASH_XX
+
+#include "qemu/bitops.h"
+
+#define PRIME32_1   2654435761U
+#define PRIME32_2   2246822519U
+#define PRIME32_3   3266489917U
+#define PRIME32_4    668265263U
+#define PRIME32_5    374761393U
+
+#define TB_HASH_XX_SEED 1
+
+/*
+ * xxhash32, customized for input variables that are not guaranteed to be
+ * contiguous in memory.
+ */
+static inline
+uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e)
+{
+    uint32_t v1 = TB_HASH_XX_SEED + PRIME32_1 + PRIME32_2;
+    uint32_t v2 = TB_HASH_XX_SEED + PRIME32_2;
+    uint32_t v3 = TB_HASH_XX_SEED + 0;
+    uint32_t v4 = TB_HASH_XX_SEED - PRIME32_1;
+    uint32_t a = a0 >> 31 >> 1;
+    uint32_t b = a0;
+    uint32_t c = b0 >> 31 >> 1;
+    uint32_t d = b0;
+    uint32_t h32;
+
+    v1 += a * PRIME32_2;
+    v1 = rol32(v1, 13);
+    v1 *= PRIME32_1;
+
+    v2 += b * PRIME32_2;
+    v2 = rol32(v2, 13);
+    v2 *= PRIME32_1;
+
+    v3 += c * PRIME32_2;
+    v3 = rol32(v3, 13);
+    v3 *= PRIME32_1;
+
+    v4 += d * PRIME32_2;
+    v4 = rol32(v4, 13);
+    v4 *= PRIME32_1;
+
+    h32 = rol32(v1, 1) + rol32(v2, 7) + rol32(v3, 12) + rol32(v4, 18);
+    h32 += 20;
+
+    h32 += e * PRIME32_3;
+    h32  = rol32(h32, 17) * PRIME32_4;
+
+    h32 ^= h32 >> 15;
+    h32 *= PRIME32_2;
+    h32 ^= h32 >> 13;
+    h32 *= PRIME32_3;
+    h32 ^= h32 >> 16;
+
+    return h32;
+}
+
+#endif /* EXEC_TB_HASH_XX */
-- 
2.5.0

* [Qemu-devel] [PATCH v3 07/11] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-19 23:07 [Qemu-devel] [PATCH v3 00/11] tb hash improvements Emilio G. Cota
                   ` (5 preceding siblings ...)
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 06/11] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
@ 2016-04-19 23:07 ` Emilio G. Cota
  2016-04-22 12:58   ` Alex Bennée
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 08/11] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-19 23:07 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

For some workloads such as arm bootup, tb_phys_hash is performance-critical.
This is due to the high frequency of accesses to the hash table, caused
by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
More info:
  https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html

To dig further into this I modified an arm image booting debian jessie to
immediately shut down after boot. Analysis revealed that quite a bit of time
is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
results in very uneven loading of chains in the hash table's buckets;
the longest observed chain had ~550 elements.

The appended patch addresses this with two changes:

1) Use xxhash as the hash table's hash function. xxhash is a fast,
   high-quality hashing function.

2) Feed the hashing function with not just tb_phys, but also pc and flags.

This improves performance over using just tb_phys for hashing, since that
resulted in some hash buckets having many TBs while others got very few;
with these changes, the longest observed chain on a single hash bucket is
brought down from ~550 to ~40.

Tests show that the other element checked for in tb_find_physical,
cs_base, is always a match when tb_phys+pc+flags are a match,
so hashing cs_base is wasteful. It could be that this is an ARM-only
thing, though.

BTW, after this change the hash table should not be called "tb_phys_hash"
anymore; this is addressed later in this series.

This change gives consistent bootup time improvements. I tested two
host machines:
- Intel Xeon E5-2690: 11.6% less time
- Intel i7-4790K: 19.2% less time

Increasing the number of hash buckets yields further improvements. However,
using a larger, fixed number of buckets can degrade performance for other
workloads that do not translate as many blocks (600K+ for debian-jessie arm
bootup). This is dealt with later in this series.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c             | 2 +-
 include/exec/tb-hash.h | 8 ++++++--
 translate-all.c        | 6 +++---
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index debc65c..a889cf1 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -233,7 +233,7 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
     /* find translated block using physical mappings */
     phys_pc = get_page_addr_code(env, pc);
     phys_page1 = phys_pc & TARGET_PAGE_MASK;
-    h = tb_phys_hash_func(phys_pc);
+    h = tb_hash_func(phys_pc, pc, flags);
     ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
     for(;;) {
         tb = *ptb1;
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 0f4e8a0..4b9635a 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -20,6 +20,9 @@
 #ifndef EXEC_TB_HASH
 #define EXEC_TB_HASH
 
+#include "exec/exec-all.h"
+#include "exec/tb-hash-xx.h"
+
 /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
    addresses on the same page.  The top bits are the same.  This allows
    TLB invalidation to quickly clear a subset of the hash table.  */
@@ -43,9 +46,10 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
            | (tmp & TB_JMP_ADDR_MASK));
 }
 
-static inline unsigned int tb_phys_hash_func(tb_page_addr_t pc)
+static inline
+uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
 {
-    return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
+    return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
 }
 
 #endif
diff --git a/translate-all.c b/translate-all.c
index 1a8f68b..eca2f16 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -972,7 +972,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
-    h = tb_phys_hash_func(phys_pc);
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
     tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
 
     /* remove the TB from the page list */
@@ -1474,8 +1474,8 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
     unsigned int h;
     TranslationBlock **ptb;
 
-    /* add in the physical hash table */
-    h = tb_phys_hash_func(phys_pc);
+    /* add in the hash table */
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
     ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
     tb->phys_hash_next = *ptb;
     *ptb = tb;
-- 
2.5.0

* [Qemu-devel] [PATCH v3 08/11] qht: QEMU's fast, resizable and scalable Hash Table
  2016-04-19 23:07 [Qemu-devel] [PATCH v3 00/11] tb hash improvements Emilio G. Cota
                   ` (6 preceding siblings ...)
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 07/11] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
@ 2016-04-19 23:07 ` Emilio G. Cota
  2016-04-22 14:04   ` Alex Bennée
  2016-04-24 20:01   ` Richard Henderson
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 09/11] qht: add test program Emilio G. Cota
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-19 23:07 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

This is a hash table with optional auto-resizing and MRU promotion for
reads and writes. Its implementation goal is to stay fast while
scaling for read-mostly workloads.

A hash table with these features will be necessary for the scalability
of the ongoing MTTCG work; before those changes arrive we can already
benefit from the single-threaded speedup that qht also provides.
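
A minimal usage sketch of the public API (values below are illustrative; see
the comments in qht.h and qht.c for the locking rules):

    #include <assert.h>
    #include "qemu/qht.h"

    static bool int_is_equal(const void *obj, const void *userp)
    {
        return *(const int *)obj == *(const int *)userp;
    }

    static void qht_example(void)
    {
        struct qht ht;
        static int stored = 42;
        int key = 42;
        uint32_t hash = 0xdeadbeef;   /* any non-zero value, normally from xxhash */

        qht_init(&ht, 1 << 16, QHT_MODE_AUTO_RESIZE);
        /* writers (insert/remove) must hold an external lock, per qht.h */
        qht_insert(&ht, &stored, hash);
        assert(qht_lookup(&ht, int_is_equal, &key, hash) == &stored);
        qht_remove(&ht, &stored, hash);
        qht_destroy(&ht);
    }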

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/qht.h |  54 +++++
 util/Makefile.objs |   2 +-
 util/qht.c         | 590 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 645 insertions(+), 1 deletion(-)
 create mode 100644 include/qemu/qht.h
 create mode 100644 util/qht.c

diff --git a/include/qemu/qht.h b/include/qemu/qht.h
new file mode 100644
index 0000000..a0a1aa8
--- /dev/null
+++ b/include/qemu/qht.h
@@ -0,0 +1,54 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#ifndef QHT_H
+#define QHT_H
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/seqlock.h"
+#include "qemu/rcu.h"
+
+struct qht {
+    struct qht_map *map;
+    unsigned int mode;
+};
+
+typedef bool (*qht_lookup_func_t)(const void *obj, const void *userp);
+typedef void (*qht_iter_func_t)(struct qht *ht, void *p, uint32_t h, void *up);
+
+#define QHT_MODE_MRU_LOOKUP  0x1 /* move looked-up items to head */
+#define QHT_MODE_MRU_INSERT  0x2 /* insert new elements at the head */
+#define QHT_MODE_AUTO_RESIZE 0x4 /* auto-resize when heavily loaded */
+
+void qht_init(struct qht *ht, size_t n_elems, unsigned int mode);
+
+/* call only when there are no readers left */
+void qht_destroy(struct qht *ht);
+
+/* call with an external lock held */
+void qht_reset(struct qht *ht);
+
+/* call with an external lock held */
+void qht_reset_size(struct qht *ht, size_t n_elems);
+
+/* call with an external lock held */
+void qht_insert(struct qht *ht, void *p, uint32_t hash);
+
+/* call with an external lock held */
+bool qht_remove(struct qht *ht, const void *p, uint32_t hash);
+
+/* call with an external lock held */
+void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp);
+
+/* call with an external lock held */
+void qht_grow(struct qht *ht);
+
+void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
+                 uint32_t hash);
+double qht_avg_bucket_chain_length(struct qht *ht, size_t *n_head_buckets);
+
+#endif /* QHT_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index a8a777e..ae893dc 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -1,4 +1,4 @@
-util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
+util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o qht.o
 util-obj-$(CONFIG_POSIX) += compatfd.o
 util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
 util-obj-$(CONFIG_POSIX) += mmap-alloc.o
diff --git a/util/qht.c b/util/qht.c
new file mode 100644
index 0000000..05ea5e8
--- /dev/null
+++ b/util/qht.c
@@ -0,0 +1,590 @@
+/*
+ * qht.c - QEMU Hash Table, designed to scale for read-mostly workloads.
+ *
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ *
+ * Assumptions:
+ * - Writers and iterators must take an external lock.
+ * - A hash value of 0 is invalid.
+ * - NULL cannot be inserted as a pointer value.
+ *
+ * Features:
+ * - Optional auto-resizing: the hash table resizes up if the load surpasses
+ *   a certain threshold. Resizing is done concurrently with readers.
+ * - Optional bucket MRU promotion policy.
+ *
+ * The key structure is the bucket, which is cacheline-sized. Buckets
+ * contain a few hash values and pointers; the u32 hash values are stored in
+ * full so that resizing is fast. Having this structure instead of directly
+ * chaining items has three advantages:
+ * - Failed lookups fail fast, and touch a minimum number of cache lines.
+ * - Resizing the hash table with concurrent lookups is easy.
+ * - We can have a few Most-Recently-Used (MRU) hash-pointer pairs in the same
+ *   head bucket. This helps scalability, since MRU promotions (i.e. writes to
+ *   the bucket) become less common.
+ *
+ * For concurrent lookups we use a per-bucket seqlock; per-bucket spinlocks
+ * allow readers (lookups) to upgrade to writers and thus implement an MRU
+ * promotion policy; these MRU-induced writes do not touch the cache lines of
+ * other head buckets.
+ *
+ * Note that there are two types of buckets:
+ * 1. "head" buckets are the ones allocated in the array of buckets in qht_map.
+ * 2. all "non-head" buckets (i.e. all others) are members of a chain that
+ *    starts from a head bucket.
+ * Note that the seqlock and spinlock of a head bucket applies to all buckets
+ * chained to it; these two fields are unused in non-head buckets.
+ *
+ * Resizing is done by taking all spinlocks (so that no readers-turned-writers
+ * can race with us) and then placing all elements into a new hash table. Last,
+ * the ht->map pointer is set, and the old map is freed once no RCU readers can
+ * see it anymore.
+ *
+ * Related Work:
+ * - Idea of cacheline-sized buckets with full hashes taken from:
+ *   David, Guerraoui & Trigonakis, "Asynchronized Concurrency:
+ *   The Secret to Scaling Concurrent Search Data Structures", ASPLOS'15.
+ * - Why not RCU-based hash tables? They would allow us to get rid of the
+ *   seqlock, but resizing would take forever since RCU read critical
+ *   sections in QEMU take quite a long time.
+ *   More info on relativistic hash tables:
+ *   + Triplett, McKenney & Walpole, "Resizable, Scalable, Concurrent Hash
+ *     Tables via Relativistic Programming", USENIX ATC'11.
+ *   + Corbet, "Relativistic hash tables, part 1: Algorithms", @ lwn.net, 2014.
+ *     https://lwn.net/Articles/612021/
+ */
+#include "qemu/qht.h"
+#include "qemu/atomic.h"
+
+/*
+ * We want to avoid false sharing of cache lines. Most systems have 64-byte
+ * cache lines so we go with it for simplicity.
+ *
+ * Note that systems with smaller cache lines will be fine (the struct is
+ * almost 64-bytes); systems with larger cache lines might suffer from
+ * some false sharing.
+ */
+#define QHT_BUCKET_ALIGN 64
+
+/* define these to keep sizeof(qht_bucket) within QHT_BUCKET_ALIGN */
+#if HOST_LONG_BITS == 32
+#define QHT_BUCKET_ENTRIES 6
+#else /* 64-bit */
+#define QHT_BUCKET_ENTRIES 4
+#endif
+
+struct qht_bucket {
+    QemuSpin lock;
+    QemuSeqLock sequence;
+    uint32_t hashes[QHT_BUCKET_ENTRIES];
+    void *pointers[QHT_BUCKET_ENTRIES];
+    struct qht_bucket *next;
+} QEMU_ALIGNED(QHT_BUCKET_ALIGN);
+
+QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
+
+struct qht_map {
+    struct qht_bucket *buckets;
+    size_t n;
+    size_t n_items;
+    size_t n_items_threshold;
+    struct rcu_head rcu;
+};
+
+static inline size_t qht_elems_to_buckets(size_t n_elems)
+{
+    return pow2ceil(n_elems / QHT_BUCKET_ENTRIES);
+}
+
+static inline void qht_head_init(struct qht_bucket *b)
+{
+    memset(b, 0, sizeof(*b));
+    qemu_spin_init(&b->lock);
+    seqlock_init(&b->sequence);
+}
+
+static inline
+struct qht_bucket *qht_map_to_bucket(struct qht_map *map, uint32_t hash)
+{
+    return &map->buckets[hash & (map->n - 1)];
+}
+
+static inline void qht_chain_destroy(struct qht_bucket *head)
+{
+    struct qht_bucket *curr = head->next;
+    struct qht_bucket *prev;
+
+    while (curr) {
+        prev = curr;
+        curr = curr->next;
+        qemu_vfree(prev);
+    }
+}
+
+/* pass only an orphan map */
+static void qht_map_destroy(struct qht_map *map)
+{
+    size_t i;
+
+    for (i = 0; i < map->n; i++) {
+        qht_chain_destroy(&map->buckets[i]);
+    }
+    qemu_vfree(map->buckets);
+    g_free(map);
+}
+
+static void qht_map_reclaim(struct rcu_head *rcu)
+{
+    struct qht_map *map = container_of(rcu, struct qht_map, rcu);
+
+    qht_map_destroy(map);
+}
+
+static struct qht_map *qht_map_create(size_t n)
+{
+    struct qht_map *map;
+    size_t i;
+
+    map = g_malloc(sizeof(*map));
+    map->n = n;
+    map->n_items = 0;
+    map->n_items_threshold = n * QHT_BUCKET_ENTRIES / 2;
+    map->buckets = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*map->buckets) * n);
+    for (i = 0; i < n; i++) {
+        qht_head_init(&map->buckets[i]);
+    }
+    return map;
+}
+
+static inline void qht_publish(struct qht *ht, struct qht_map *new)
+{
+    /* Readers should see a properly initialized map; pair with smp_rmb() */
+    smp_wmb();
+    atomic_set(&ht->map, new);
+}
+
+void qht_init(struct qht *ht, size_t n_elems, unsigned int mode)
+{
+    struct qht_map *map;
+    size_t n = qht_elems_to_buckets(n_elems);
+
+    map = qht_map_create(n);
+    ht->mode = mode;
+    qht_publish(ht, map);
+}
+
+/* call only when there are no readers left */
+void qht_destroy(struct qht *ht)
+{
+    qht_map_destroy(ht->map);
+    memset(ht, 0, sizeof(*ht));
+}
+
+static void qht_bucket_reset(struct qht_bucket *head)
+{
+    struct qht_bucket *b = head;
+    int i;
+
+    qemu_spin_lock(&head->lock);
+    seqlock_write_begin(&head->sequence);
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            atomic_set(&b->hashes[i], 0);
+            atomic_set(&b->pointers[i], NULL);
+        }
+        b = b->next;
+    } while (b);
+    seqlock_write_end(&head->sequence);
+    qemu_spin_unlock(&head->lock);
+}
+
+/* call with an external lock held */
+void qht_reset(struct qht *ht)
+{
+    struct qht_map *map = ht->map;
+    size_t i;
+
+    for (i = 0; i < map->n; i++) {
+        qht_bucket_reset(&map->buckets[i]);
+    }
+}
+
+/* call with an external lock held */
+void qht_reset_size(struct qht *ht, size_t n_elems)
+{
+    struct qht_map *old = ht->map;
+
+    qht_reset(ht);
+    if (old->n == qht_elems_to_buckets(n_elems)) {
+        return;
+    }
+    qht_init(ht, n_elems, ht->mode);
+    call_rcu1(&old->rcu, qht_map_reclaim);
+}
+
+/* Call only when orig is at least the 3rd link, i.e. head->2nd->orig */
+static void qht_move_bucket_to_front__locked(struct qht_bucket *head,
+                                             struct qht_bucket *orig)
+{
+    struct qht_bucket *b = head->next;
+
+    for (;;) {
+        if (b->next == orig) {
+            atomic_set(&b->next, orig->next);
+            atomic_set(&orig->next, head->next);
+            atomic_set(&head->next, orig);
+            return;
+        }
+        b = b->next;
+    }
+}
+
+/*
+ * Call with head->lock held.
+ *
+ * This function moves the item in @orig[@pos] to @head[0]. In order to
+ * do this, the item i in @head is moved to @head's i+1 position,
+ * with the last item being moved to @orig[@pos].
+ * Further, @orig is brought to the front of the bucket chain, i.e.
+ * @head->next is set to point to @orig. This is done to make sure that
+ * the last item in @head is not moved to a too-distant place. An
+ * alternative would be to perform full MRU for the entire bucket
+ * chain, but that would cause too much churn.
+ *
+ * Note: @orig == @head is a bug.
+ */
+static inline void qht_bucket_mru__locked(struct qht_bucket *head,
+                                          struct qht_bucket *orig, int pos)
+{
+    uint32_t *dest_hash;
+    void **dest_p;
+    void *p;
+    uint32_t hash;
+    int i;
+
+    assert(orig != head);
+    for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+        if (i == QHT_BUCKET_ENTRIES - 1) {
+            dest_hash = &orig->hashes[pos];
+            dest_p = &orig->pointers[pos];
+        } else {
+            dest_hash = &head->hashes[i + 1];
+            dest_p = &head->pointers[i + 1];
+        }
+        hash = *dest_hash;
+        p = *dest_p;
+
+        atomic_set(dest_hash, head->hashes[i]);
+        atomic_set(dest_p, head->pointers[i]);
+
+        atomic_set(&head->hashes[i], hash);
+        atomic_set(&head->pointers[i], p);
+    }
+    if (head->next != orig) {
+        qht_move_bucket_to_front__locked(head, orig);
+    }
+}
+
+static void qht_bucket_mru(struct qht_bucket *head, struct qht_bucket *orig,
+                           const void *p, int pos)
+{
+    qemu_spin_lock(&head->lock);
+    if (unlikely(orig->pointers[pos] != p)) {
+        /* while we acquired the lock, the bucket was updated, so bail out */
+        goto out;
+    }
+    seqlock_write_begin(&head->sequence);
+    qht_bucket_mru__locked(head, orig, pos);
+    seqlock_write_end(&head->sequence);
+ out:
+    qemu_spin_unlock(&head->lock);
+}
+
+/*
+ * @far_bucket and @pos are only filled in if the match is found in a bucket
+ * that is not the head bucket.
+ */
+static inline
+void *qht_do_lookup(struct qht_bucket *head, struct qht_bucket **far_bucket,
+                    int *pos, qht_lookup_func_t func, const void *userp,
+                    uint32_t hash)
+{
+    struct qht_bucket *b = head;
+    int i;
+
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->hashes[i] == hash) {
+                void *p = atomic_read(&b->pointers[i]);
+
+                /*
+                 * p could be NULL if we're racing with a writer. We could use
+                 * barriers here but for performance we only issue the ones
+                 * in the seqlock.
+                 */
+                if (likely(p && func(p, userp))) {
+                    if (unlikely(b != head)) {
+                        *far_bucket = b;
+                        *pos = i;
+                    }
+                    return p;
+                }
+            }
+        }
+        b = atomic_read(&b->next);
+        /*
+         * This barrier guarantees that we will read a properly initialized
+         * b->next; it is paired with an smp_wmb() before setting b->next.
+         */
+        smp_rmb();
+    } while (b);
+    return NULL;
+}
+
+void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
+                 uint32_t hash)
+{
+    struct qht_bucket *far_bucket = NULL;
+    struct qht_bucket *b;
+    struct qht_map *map;
+    uint32_t version;
+    int pos = 0;
+    void *ret;
+
+    map = atomic_read(&ht->map);
+    /* paired with smp_wmb() before setting ht->map */
+    smp_rmb();
+    b = qht_map_to_bucket(map, hash);
+
+    do {
+        version = seqlock_read_begin(&b->sequence);
+        ret = qht_do_lookup(b, &far_bucket, &pos, func, userp, hash);
+    } while (seqlock_read_retry(&b->sequence, version));
+    if ((ht->mode & QHT_MODE_MRU_LOOKUP) && unlikely(far_bucket)) {
+        qht_bucket_mru(b, far_bucket, ret, pos);
+    }
+    return ret;
+}
+
+/* call with b->lock held */
+static void qht_insert__locked(struct qht *ht, struct qht_map *map,
+                               struct qht_bucket *head, void *p, uint32_t hash)
+{
+    struct qht_bucket *b = head;
+    struct qht_bucket *prev = NULL;
+    struct qht_bucket *new = NULL;
+    int i;
+
+    for (;;) {
+        if (b == NULL) {
+            b = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*b));
+            memset(b, 0, sizeof(*b));
+            new = b;
+        }
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->hashes[i]) {
+                continue;
+            }
+            /* found an empty key: acquire the seqlock and write */
+            seqlock_write_begin(&head->sequence);
+            if (new) {
+                /*
+                 * This barrier is paired with smp_rmb() after reading
+                 * b->next when not holding b->lock.
+                 */
+                smp_wmb();
+                atomic_set(&prev->next, b);
+            }
+            atomic_set(&b->hashes[i], hash);
+            atomic_set(&b->pointers[i], p);
+            if ((ht->mode & QHT_MODE_MRU_INSERT) && b != head) {
+                qht_bucket_mru__locked(head, b, i);
+            }
+            seqlock_write_end(&head->sequence);
+            map->n_items++;
+            return;
+        }
+        prev = b;
+        b = b->next;
+    }
+}
+
+/* call with an external lock held */
+void qht_insert(struct qht *ht, void *p, uint32_t hash)
+{
+    struct qht_map *map = ht->map;
+    struct qht_bucket *b = qht_map_to_bucket(map, hash);
+
+    qemu_spin_lock(&b->lock);
+    qht_insert__locked(ht, map, b, p, hash);
+    qemu_spin_unlock(&b->lock);
+
+    if ((ht->mode & QHT_MODE_AUTO_RESIZE) &&
+        unlikely(map->n_items > map->n_items_threshold)) {
+        qht_grow(ht);
+    }
+}
+
+/* call with b->lock held */
+static inline bool qht_remove__locked(struct qht_map *map, struct qht_bucket *b,
+                                      const void *p, uint32_t hash)
+{
+    int i;
+
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->hashes[i] == hash && b->pointers[i] == p) {
+                atomic_set(&b->hashes[i], 0);
+                atomic_set(&b->pointers[i], NULL);
+                map->n_items--;
+                return true;
+            }
+        }
+        b = b->next;
+    } while (b);
+    return false;
+}
+
+/* call with an external lock held */
+bool qht_remove(struct qht *ht, const void *p, uint32_t hash)
+{
+    struct qht_map *map = ht->map;
+    struct qht_bucket *b = qht_map_to_bucket(map, hash);
+    bool ret;
+
+    qemu_spin_lock(&b->lock);
+    seqlock_write_begin(&b->sequence);
+    ret = qht_remove__locked(map, b, p, hash);
+    seqlock_write_end(&b->sequence);
+    qemu_spin_unlock(&b->lock);
+    return ret;
+}
+
+/*
+ * acquire all spinlocks from a map, so that writers cannot race with
+ * readers-turned-writers.
+ */
+static void qht_lock(struct qht *ht)
+{
+    struct qht_map *map = ht->map;
+    size_t i;
+
+    /* if readers cannot upgrade, do nothing */
+    if (!(ht->mode & QHT_MODE_MRU_LOOKUP)) {
+        return;
+    }
+    for (i = 0; i < map->n; i++) {
+        struct qht_bucket *b = &map->buckets[i];
+
+        qemu_spin_lock(&b->lock);
+    }
+}
+
+static void qht_unlock(struct qht *ht)
+{
+    struct qht_map *map = ht->map;
+    size_t i;
+
+    if (!(ht->mode & QHT_MODE_MRU_LOOKUP)) {
+        return;
+    }
+    for (i = 0; i < map->n; i++) {
+        struct qht_bucket *b = &map->buckets[i];
+
+        qemu_spin_unlock(&b->lock);
+    }
+}
+
+/* external lock + all of the map's locks held (if !MRU_LOOKUP) */
+static inline void qht_map_iter__locked(struct qht *ht, struct qht_map *map,
+                                        qht_iter_func_t func, void *userp)
+{
+    size_t i;
+
+    for (i = 0; i < map->n; i++) {
+        struct qht_bucket *b = &map->buckets[i];
+        int j;
+
+        do {
+            for (j = 0; j < QHT_BUCKET_ENTRIES; j++) {
+                if (b->hashes[j]) {
+                    func(ht, b->pointers[j], b->hashes[j], userp);
+                }
+            }
+            b = b->next;
+        } while (b);
+    }
+}
+
+/* call with an external lock held */
+void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp)
+{
+    qht_lock(ht);
+    qht_map_iter__locked(ht, ht->map, func, userp);
+    qht_unlock(ht);
+}
+
+static void qht_map_copy(struct qht *ht, void *p, uint32_t hash, void *userp)
+{
+    struct qht_map *new = userp;
+    struct qht_bucket *b = qht_map_to_bucket(new, hash);
+
+    /* no need to acquire b->lock because no thread has seen this map yet */
+    qht_insert__locked(ht, new, b, p, hash);
+}
+
+/* call with an external lock held */
+void qht_grow(struct qht *ht)
+{
+    struct qht_map *old = ht->map;
+    struct qht_map *new;
+    size_t n = old->n * 2;
+
+    new = qht_map_create(n);
+    qht_iter(ht, qht_map_copy, new);
+
+    qht_publish(ht, new);
+    call_rcu1(&old->rcu, qht_map_reclaim);
+}
+
+/*
+ * Returns the number of buckets that an average lookup will traverse.
+ * The return value is always >= 1. With a good hashing function, this
+ * value should be close to 1.
+ * Note that each bucket tracks up to QHT_BUCKET_ENTRIES items.
+ */
+double qht_avg_bucket_chain_length(struct qht *ht, size_t *n_head_buckets)
+{
+    struct qht_map *map;
+    size_t count = 0;
+    size_t i;
+
+    map = atomic_read(&ht->map);
+    /* paired with smp_wmb() before setting ht->map */
+    smp_rmb();
+
+    for (i = 0; i < map->n; i++) {
+        struct qht_bucket *head = &map->buckets[i];
+        struct qht_bucket *b;
+        size_t bucket_count;
+        uint32_t version;
+
+        do {
+            version = seqlock_read_begin(&head->sequence);
+            bucket_count = 0;
+            b = head;
+            do {
+                bucket_count++;
+                b = b->next;
+            } while (b);
+        } while (seqlock_read_retry(&head->sequence, version));
+        count += bucket_count;
+    }
+    if (n_head_buckets) {
+        *n_head_buckets = map->n;
+    }
+    return (double)count / map->n;
+}
-- 
2.5.0

* [Qemu-devel] [PATCH v3 09/11] qht: add test program
  2016-04-19 23:07 [Qemu-devel] [PATCH v3 00/11] tb hash improvements Emilio G. Cota
                   ` (7 preceding siblings ...)
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 08/11] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
@ 2016-04-19 23:07 ` Emilio G. Cota
  2016-04-22 14:35   ` Alex Bennée
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 10/11] tb hash: track translated blocks with qht Emilio G. Cota
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
  10 siblings, 1 reply; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-19 23:07 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tests/.gitignore |   1 +
 tests/Makefile   |   6 ++-
 tests/test-qht.c | 139 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 145 insertions(+), 1 deletion(-)
 create mode 100644 tests/test-qht.c

diff --git a/tests/.gitignore b/tests/.gitignore
index 9eed229..d095e05 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -48,6 +48,7 @@ test-qapi-visit.[ch]
 test-qdev-global-props
 test-qemu-opts
 test-qga
+test-qht
 test-qmp-commands
 test-qmp-commands.h
 test-qmp-event
diff --git a/tests/Makefile b/tests/Makefile
index 9de9598..0568ac2 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -68,6 +68,8 @@ check-unit-y += tests/rcutorture$(EXESUF)
 gcov-files-rcutorture-y = util/rcu.c
 check-unit-y += tests/test-rcu-list$(EXESUF)
 gcov-files-test-rcu-list-y = util/rcu.c
+check-unit-y += tests/test-qht$(EXESUF)
+gcov-files-test-qht-y = util/qht.c
 check-unit-y += tests/test-bitops$(EXESUF)
 check-unit-$(CONFIG_HAS_GLIB_SUBPROCESS_TESTS) += tests/test-qdev-global-props$(EXESUF)
 check-unit-y += tests/check-qom-interface$(EXESUF)
@@ -387,7 +389,8 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
 	tests/test-qmp-commands.o tests/test-visitor-serialization.o \
 	tests/test-x86-cpuid.o tests/test-mul64.o tests/test-int128.o \
 	tests/test-opts-visitor.o tests/test-qmp-event.o \
-	tests/rcutorture.o tests/test-rcu-list.o
+	tests/rcutorture.o tests/test-rcu-list.o \
+	tests/test-qht.o
 
 $(test-obj-y): QEMU_INCLUDES += -Itests
 QEMU_CFLAGS += -I$(SRC_PATH)/tests
@@ -425,6 +428,7 @@ tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o
 tests/test-int128$(EXESUF): tests/test-int128.o
 tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
 tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
+tests/test-qht$(EXESUF): tests/test-qht.o $(test-util-obj-y)
 
 tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
 	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
diff --git a/tests/test-qht.c b/tests/test-qht.c
new file mode 100644
index 0000000..47024c6
--- /dev/null
+++ b/tests/test-qht.c
@@ -0,0 +1,139 @@
+#include "qemu/osdep.h"
+#include <glib.h>
+#include "qemu/qht.h"
+#include "exec/tb-hash-xx.h"
+
+#define N 5000
+
+static struct qht ht;
+static int32_t arr[N];
+
+/*
+ * We might be tempted to use val as the hash. However, val
+ * could be 0, and all hashes passed to qht must be !0.
+ */
+static inline uint32_t hash_func(uint32_t val)
+{
+    return tb_hash_func5(val, 0, 0);
+}
+
+static bool is_equal(const void *obj, const void *userp)
+{
+    const int32_t *a = obj;
+    const int32_t *b = userp;
+
+    return *a == *b;
+}
+
+static void insert(int a, int b)
+{
+    int i;
+
+    for (i = a; i < b; i++) {
+        uint32_t hash;
+
+        arr[i] = i;
+        hash = hash_func(i);
+
+        qht_insert(&ht, &arr[i], hash);
+    }
+}
+
+static void rm(int init, int end)
+{
+    int i;
+
+    for (i = init; i < end; i++) {
+        uint32_t hash;
+
+        hash = hash_func(arr[i]);
+        g_assert_true(qht_remove(&ht, &arr[i], hash));
+    }
+}
+
+static void check(int a, int b, bool expected)
+{
+    double avg_chain;
+    size_t n_head_buckets;
+    int i;
+
+    for (i = a; i < b; i++) {
+        void *p;
+        uint32_t hash;
+        int32_t val;
+
+        val = i;
+        hash = hash_func(i);
+        p = qht_lookup(&ht, is_equal, &val, hash);
+        g_assert_true(!!p == expected);
+    }
+    avg_chain = qht_avg_bucket_chain_length(&ht, &n_head_buckets);
+    g_assert_cmpfloat(avg_chain, >=, 1.0);
+    g_assert_cmpuint(n_head_buckets, >, 0);
+}
+
+static void count_func(struct qht *ht, void *p, uint32_t hash, void *userp)
+{
+    unsigned int *curr = userp;
+
+    (*curr)++;
+}
+
+static void iter_check(unsigned int count)
+{
+    unsigned int curr = 0;
+
+    qht_iter(&ht, count_func, &curr);
+    g_assert_cmpuint(curr, ==, count);
+}
+
+static void qht_test(unsigned int mode)
+{
+    qht_init(&ht, 0, mode);
+
+    insert(0, N);
+    check(0, N, true);
+    check(-N, -1, false);
+    iter_check(N);
+    rm(1, 2);
+    qht_reset_size(&ht, 0);
+    check(0, N, false);
+
+    qht_destroy(&ht);
+}
+
+static void test_default(void)
+{
+    qht_test(0);
+}
+
+static void test_mru_lookup(void)
+{
+    qht_test(QHT_MODE_MRU_LOOKUP);
+}
+
+static void test_mru_all(void)
+{
+    qht_test(QHT_MODE_MRU_LOOKUP | QHT_MODE_MRU_INSERT);
+}
+
+static void test_mru_all_resize(void)
+{
+    qht_test(QHT_MODE_MRU_LOOKUP | QHT_MODE_MRU_INSERT | QHT_MODE_AUTO_RESIZE);
+}
+
+static void test_mru_insert_resize(void)
+{
+    qht_test(QHT_MODE_AUTO_RESIZE | QHT_MODE_MRU_INSERT);
+}
+
+int main(int argc, char *argv[])
+{
+    g_test_init(&argc, &argv, NULL);
+    g_test_add_func("/qht/mode/default", test_default);
+    g_test_add_func("/qht/mode/mru_lookup", test_mru_lookup);
+    g_test_add_func("/qht/mode/mru_all", test_mru_all);
+    g_test_add_func("/qht/mode/mru_all_with_resize", test_mru_all_resize);
+    g_test_add_func("/qht/mode/mru_insert_with_resize", test_mru_insert_resize);
+    return g_test_run();
+}
-- 
2.5.0

* [Qemu-devel] [PATCH v3 10/11] tb hash: track translated blocks with qht
  2016-04-19 23:07 [Qemu-devel] [PATCH v3 00/11] tb hash improvements Emilio G. Cota
                   ` (8 preceding siblings ...)
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 09/11] qht: add test program Emilio G. Cota
@ 2016-04-19 23:07 ` Emilio G. Cota
  2016-04-28 13:27   ` Alex Bennée
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
  10 siblings, 1 reply; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-19 23:07 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

Having a fixed-size hash table for keeping track of all translation blocks
is suboptimal: some workloads are just too big or too small to get maximum
performance from the hash table. The MRU promotion policy helps improve
performance when the hash table is a little undersized, but it cannot
make up for severely undersized hash tables.

Furthermore, frequent MRU promotions result in writes that are a scalability
bottleneck. For scalability, lookups should only perform reads, not writes.
This is not a big deal for now, but it will become one once MTTCG matures.

The appended patch fixes these issues by using qht as the implementation of
the TB hash table. This solution is superior to other alternatives considered,
namely:

- master: implementation in QEMU before this patchset
- xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
- xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
              MRU is implemented here by adding an intermediate struct
              that contains the u32 hash and a pointer to the TB; this
              allows us, on an MRU promotion, to copy said struct (that is not
              at the head), and put this new copy at the head. After a grace
              period, the original non-head struct can be eliminated, and
              after another grace period, freed.
- qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
                   no MRU for lookups; MRU for inserts.
The appended solution is the following:
- qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
                 no MRU for lookups; MRU for inserts.

The plots below compare the considered solutions. The Y axis shows the
boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
sweeps the number of buckets (or initial number of buckets for qht-autoresize).
The plots in PNG format (and with errorbars) can be seen here:
  http://imgur.com/a/Awgnq

Each test runs 5 times, and the entire QEMU process is pinned to a
single core for repeatability of results.

                            Host: Intel Xeon E5-2690

  28 ++------------+-------------+-------------+-------------+------------++
     A*****        +             +             +             master **A*** +
  27 ++    *                                                 xxhash ##B###++
     |      A******A******                               xxhash-rcu $$C$$$ |
  26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
     D%%$$                              A******A******A*qht-dyn-mru A*E****A
  25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
     B#####%                                                               |
  24 ++    #C$$$$$                                                        ++
     |      B###  $                                                        |
     |          ## C$$$$$$                                                 |
  23 ++           #       C$$$$$$                                         ++
     |             B######       C$$$$$$                                %%%D
  22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
     |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
  21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
     +             E@@@   F&&&   +      E@     +      F&&&   +             +
  20 ++------------+-------------+-------------+-------------+------------++
     14            16            18            20            22            24
                             log2 number of buckets

                                 Host: Intel i7-4790K

  14.5 ++------------+------------+-------------+------------+------------++
       A**           +            +             +            master **A*** +
    14 ++ **                                                 xxhash ##B###++
  13.5 ++   **                                           xxhash-rcu $$C$$$++
       |                                            qht-fixed-nomru %%D%%% |
    13 ++     A******                                   qht-dyn-mru @@E@@@++
       |             A*****A******A******             qht-dyn-nomru &&F&&& |
  12.5 C$$                               A******A******A*****A******    ***A
    12 ++ $$                                                        A***  ++
       D%%% $$                                                             |
  11.5 ++  %%                                                             ++
       B###  %C$$$$$$                                                      |
    11 ++  ## D%%%%% C$$$$$                                               ++
       |     #      %      C$$$$$$                                         |
  10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
    10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
       +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
   9.5 ++------------+------------+-------------+------------+------------++
       14            16           18            20           22            24
                              log2 number of buckets

Note that the original point before this patch series is X=15 for "master";
master's low sensitivity to the increased number of buckets is due to its
poor hashing function.

xxhash-rcu has significant overhead due to the constant churn of allocating
and deallocating intermediate structs for implementing MRU. An alternative
would be to consider failed lookups as "maybe not there", and then
acquire the external lock (tb_lock in this case) to really confirm that
there was indeed a failed lookup. This, however, would not be enough
to implement dynamic resizing--this is more complex: see
"Resizable, Scalable, Concurrent Hash Tables via Relativistic
Programming" by Triplett, McKenney and Walpole. This solution was
discarded due to the very coarse RCU read critical sections that we have
in MTTCG; resizing requires waiting for readers after every pointer update,
and resizes require many pointer updates, so this would quickly become
prohibitive.

qht-fixed-nomru shows that MRU promotion is advisable for undersized
hash tables.

However, qht-dyn-mru shows that MRU promotion is not important if the
hash table is properly sized: there is virtually no difference in
performance between qht-dyn-nomru and qht-dyn-mru.

Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
can achieve with optimum sizing of the hash table, while keeping the hash
table scalable for readers.

The improvement we get before and after this patch for booting debian jessie
with arm-softmmu is:

- Intel Xeon E5-2690: 10.5% less time
- Intel i7-4790K: 5.2% less time

We could get this same improvement _for this particular workload_ by
statically increasing the size of the hash table. But this would hurt
workloads that do not need a large hash table. The dynamic (upward)
resizing allows us to start small and enlarge the hash table as needed.

A quick note on downsizing: the table is resized back to 2**15 buckets
on every tb_flush; this makes sense because it is not guaranteed that the
table will reach the same number of TBs later on (e.g. most bootup code is
thrown away after boot), so it is better to start small again and grow the
hash table as more code blocks are translated. This also avoids the
complication of having to build downsizing hysteresis logic into qht.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c              | 82 ++++++++++++++++++++++++-----------------------
 include/exec/exec-all.h |  9 +++---
 include/exec/tb-hash.h  |  3 +-
 translate-all.c         | 85 ++++++++++++++++++++++---------------------------
 4 files changed, 86 insertions(+), 93 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index a889cf1..fd4b42f 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -217,55 +217,59 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
     tb_free(tb);
 }
 
+struct tb_desc {
+    target_ulong pc;
+    target_ulong cs_base;
+    uint64_t flags;
+    tb_page_addr_t phys_page1;
+    CPUArchState *env;
+};
+
+static bool tb_cmp(const void *p, const void *d)
+{
+    const TranslationBlock *tb = p;
+    const struct tb_desc *desc = d;
+
+    if (tb->pc == desc->pc &&
+        tb->page_addr[0] == desc->phys_page1 &&
+        tb->cs_base == desc->cs_base &&
+        tb->flags == desc->flags) {
+        /* check next page if needed */
+        if (tb->page_addr[1] == -1) {
+            return true;
+        } else {
+            tb_page_addr_t phys_page2;
+            target_ulong virt_page2;
+
+            virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
+            phys_page2 = get_page_addr_code(desc->env, virt_page2);
+            if (tb->page_addr[1] == phys_page2) {
+                return true;
+            }
+        }
+    }
+    return false;
+}
+
 static TranslationBlock *tb_find_physical(CPUState *cpu,
                                           target_ulong pc,
                                           target_ulong cs_base,
                                           uint32_t flags)
 {
-    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
-    TranslationBlock *tb, **ptb1;
     unsigned int h;
-    tb_page_addr_t phys_pc, phys_page1;
-    target_ulong virt_page2;
+    tb_page_addr_t phys_pc;
+    struct tb_desc desc;
 
     tcg_ctx.tb_ctx.tb_invalidated_flag = 0;
 
-    /* find translated block using physical mappings */
-    phys_pc = get_page_addr_code(env, pc);
-    phys_page1 = phys_pc & TARGET_PAGE_MASK;
+    desc.env = (CPUArchState *)cpu->env_ptr;
+    desc.cs_base = cs_base;
+    desc.flags = flags;
+    desc.pc = pc;
+    phys_pc = get_page_addr_code(desc.env, pc);
+    desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
     h = tb_hash_func(phys_pc, pc, flags);
-    ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
-    for(;;) {
-        tb = *ptb1;
-        if (!tb) {
-            return NULL;
-        }
-        if (tb->pc == pc &&
-            tb->page_addr[0] == phys_page1 &&
-            tb->cs_base == cs_base &&
-            tb->flags == flags) {
-            /* check next page if needed */
-            if (tb->page_addr[1] != -1) {
-                tb_page_addr_t phys_page2;
-
-                virt_page2 = (pc & TARGET_PAGE_MASK) +
-                    TARGET_PAGE_SIZE;
-                phys_page2 = get_page_addr_code(env, virt_page2);
-                if (tb->page_addr[1] == phys_page2) {
-                    break;
-                }
-            } else {
-                break;
-            }
-        }
-        ptb1 = &tb->phys_hash_next;
-    }
-
-    /* Move the TB to the head of the list */
-    *ptb1 = tb->phys_hash_next;
-    tb->phys_hash_next = tcg_ctx.tb_ctx.tb_phys_hash[h];
-    tcg_ctx.tb_ctx.tb_phys_hash[h] = tb;
-    return tb;
+    return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
 }
 
 static TranslationBlock *tb_find_slow(CPUState *cpu,
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index c75fb3a..27c25da 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -21,6 +21,7 @@
 #define _EXEC_ALL_H_
 
 #include "qemu-common.h"
+#include "qemu/qht.h"
 
 /* allow to see translation results - the slowdown should be negligible, so we leave it */
 #define DEBUG_DISAS
@@ -212,8 +213,8 @@ static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
 
 #define CODE_GEN_ALIGN           16 /* must be >= of the size of a icache line */
 
-#define CODE_GEN_PHYS_HASH_BITS     15
-#define CODE_GEN_PHYS_HASH_SIZE     (1 << CODE_GEN_PHYS_HASH_BITS)
+#define CODE_GEN_HTABLE_BITS     15
+#define CODE_GEN_HTABLE_SIZE     (1 << CODE_GEN_HTABLE_BITS)
 
 /* Estimated block size for TB allocation.  */
 /* ??? The following is based on a 2015 survey of x86_64 host output.
@@ -249,8 +250,6 @@ struct TranslationBlock {
 
     void *tc_ptr;    /* pointer to the translated code */
     uint8_t *tc_search;  /* pointer to search data */
-    /* next matching tb for physical address. */
-    struct TranslationBlock *phys_hash_next;
     /* original tb when cflags has CF_NOCACHE */
     struct TranslationBlock *orig_tb;
     /* first and second physical page containing code. The lower bit
@@ -281,7 +280,7 @@ typedef struct TBContext TBContext;
 struct TBContext {
 
     TranslationBlock *tbs;
-    TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
+    struct qht htable;
     int nb_tbs;
     /* any access to the tbs or the page table must use this lock */
     QemuMutex tb_lock;
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 4b9635a..d274357 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -20,7 +20,6 @@
 #ifndef EXEC_TB_HASH
 #define EXEC_TB_HASH
 
-#include "exec/exec-all.h"
 #include "exec/tb-hash-xx.h"
 
 /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
@@ -49,7 +48,7 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
 static inline
 uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
 {
-    return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
+    return tb_hash_func5(phys_pc, pc, flags);
 }
 
 #endif
diff --git a/translate-all.c b/translate-all.c
index eca2f16..617a572 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -735,6 +735,13 @@ static inline void code_gen_alloc(size_t tb_size)
     qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
 }
 
+static void tb_htable_init(void)
+{
+    unsigned int mode = QHT_MODE_AUTO_RESIZE | QHT_MODE_MRU_INSERT;
+
+    qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
+}
+
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
    (in bytes) allocated to the translation buffer. Zero means default
    size. */
@@ -742,6 +749,7 @@ void tcg_exec_init(unsigned long tb_size)
 {
     cpu_gen_init();
     page_init();
+    tb_htable_init();
     code_gen_alloc(tb_size);
 #if defined(CONFIG_SOFTMMU)
     /* There's no guest base to take into account, so go ahead and
@@ -843,7 +851,7 @@ void tb_flush(CPUState *cpu)
         memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
     }
 
-    memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
+    qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
     tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
@@ -854,60 +862,46 @@ void tb_flush(CPUState *cpu)
 
 #ifdef DEBUG_TB_CHECK
 
-static void tb_invalidate_check(target_ulong address)
+static void
+do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 {
-    TranslationBlock *tb;
-    int i;
+    TranslationBlock *tb = p;
+    target_ulong addr = *(target_ulong *)userp;
 
-    address &= TARGET_PAGE_MASK;
-    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
-        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
-             tb = tb->phys_hash_next) {
-            if (!(address + TARGET_PAGE_SIZE <= tb->pc ||
-                  address >= tb->pc + tb->size)) {
-                printf("ERROR invalidate: address=" TARGET_FMT_lx
-                       " PC=%08lx size=%04x\n",
-                       address, (long)tb->pc, tb->size);
-            }
-        }
+    if (!(addr + TARGET_PAGE_SIZE <= tb->pc || addr >= tb->pc + tb->size)) {
+        printf("ERROR invalidate: address=" TARGET_FMT_lx
+               " PC=%08lx size=%04x\n", addr, (long)tb->pc, tb->size);
     }
 }
 
-/* verify that all the pages have correct rights for code */
-static void tb_page_check(void)
+static void tb_invalidate_check(target_ulong address)
 {
-    TranslationBlock *tb;
-    int i, flags1, flags2;
-
-    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
-        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
-                tb = tb->phys_hash_next) {
-            flags1 = page_get_flags(tb->pc);
-            flags2 = page_get_flags(tb->pc + tb->size - 1);
-            if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
-                printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
-                       (long)tb->pc, tb->size, flags1, flags2);
-            }
-        }
-    }
+    address &= TARGET_PAGE_MASK;
+    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address);
 }
 
-#endif
-
-static inline void tb_hash_remove(TranslationBlock **ptb, TranslationBlock *tb)
+static void
+do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 {
-    TranslationBlock *tb1;
+    TranslationBlock *tb = p;
+    int flags1, flags2;
 
-    for (;;) {
-        tb1 = *ptb;
-        if (tb1 == tb) {
-            *ptb = tb1->phys_hash_next;
-            break;
-        }
-        ptb = &tb1->phys_hash_next;
+    flags1 = page_get_flags(tb->pc);
+    flags2 = page_get_flags(tb->pc + tb->size - 1);
+    if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
+        printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
+               (long)tb->pc, tb->size, flags1, flags2);
     }
 }
 
+/* verify that all the pages have correct rights for code */
+static void tb_page_check(void)
+{
+    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
+}
+
+#endif
+
 static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
 {
     TranslationBlock *tb1;
@@ -973,7 +967,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
     h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
+    qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
 
     /* remove the TB from the page list */
     if (tb->page_addr[0] != page_addr) {
@@ -1472,13 +1466,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                          tb_page_addr_t phys_page2)
 {
     unsigned int h;
-    TranslationBlock **ptb;
 
     /* add in the hash table */
     h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
-    tb->phys_hash_next = *ptb;
-    *ptb = tb;
+    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
 
     /* add in the page list */
     tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump
  2016-04-19 23:07 [Qemu-devel] [PATCH v3 00/11] tb hash improvements Emilio G. Cota
                   ` (9 preceding siblings ...)
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 10/11] tb hash: track translated blocks with qht Emilio G. Cota
@ 2016-04-19 23:07 ` Emilio G. Cota
  2016-04-20 15:21   ` Richard Henderson
                     ` (2 more replies)
  10 siblings, 3 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-19 23:07 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

Suggested-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 translate-all.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/translate-all.c b/translate-all.c
index 617a572..769bffc 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1664,6 +1664,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
 {
     int i, target_code_size, max_target_code_size;
     int direct_jmp_count, direct_jmp2_count, cross_page;
+    double ht_avg_len;
+    size_t ht_heads;
     TranslationBlock *tb;
 
     target_code_size = 0;
@@ -1715,6 +1717,9 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
                 direct_jmp2_count,
                 tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp2_count * 100) /
                         tcg_ctx.tb_ctx.nb_tbs : 0);
+    ht_avg_len = qht_avg_bucket_chain_length(&tcg_ctx.tb_ctx.htable, &ht_heads);
+    cpu_fprintf(f, "TB hash avg chain   %0.5f buckets\n", ht_avg_len);
+    cpu_fprintf(f, "TB hash size        %zu head buckets\n", ht_heads);
     cpu_fprintf(f, "\nStatistics:\n");
     cpu_fprintf(f, "TB flush count      %d\n", tcg_ctx.tb_ctx.tb_flush_count);
     cpu_fprintf(f, "TB invalidate count %d\n",
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 04/11] include/processor.h: define cpu_relax()
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 04/11] include/processor.h: define cpu_relax() Emilio G. Cota
@ 2016-04-20 12:15   ` KONRAD Frederic
  2016-04-20 17:16     ` Emilio G. Cota
  2016-04-20 17:32   ` [Qemu-devel] [UPDATED " Emilio G. Cota
  2016-04-22  9:35   ` [Qemu-devel] [PATCH " Alex Bennée
  2 siblings, 1 reply; 45+ messages in thread
From: KONRAD Frederic @ 2016-04-20 12:15 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov



Le 20/04/2016 01:07, Emilio G. Cota a écrit :
> Taken from the linux kernel.
>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>   include/qemu/processor.h | 28 ++++++++++++++++++++++++++++
>   1 file changed, 28 insertions(+)
>   create mode 100644 include/qemu/processor.h
>
> diff --git a/include/qemu/processor.h b/include/qemu/processor.h
> new file mode 100644
> index 0000000..675a00a
> --- /dev/null
> +++ b/include/qemu/processor.h
> @@ -0,0 +1,28 @@

Hi,

Does this file need a license?

Fred

> +#ifndef QEMU_PROCESSOR_H
> +#define QEMU_PROCESSOR_H
> +
> +#include "qemu/atomic.h"
> +
> +#if defined(__i386__) || defined(__x86_64__)
> +#define cpu_relax() asm volatile("rep; nop" ::: "memory")
> +#endif
> +
> +#ifdef __ia64__
> +#define cpu_relax() asm volatile("hint @pause" ::: "memory")
> +#endif
> +
> +#ifdef __aarch64__
> +#define cpu_relax() asm volatile("yield" ::: "memory")
> +#endif
> +
> +#if defined(__powerpc64__)
> +/* set Hardware Multi-Threading (HMT) priority to low; then back to medium */
> +#define cpu_relax() asm volatile("or 1, 1, 1;" \
> +                                 "or 2, 2, 2;" ::: "memory")
> +#endif
> +
> +#ifndef cpu_relax
> +#define cpu_relax() barrier()
> +#endif
> +
> +#endif /* QEMU_PROCESSOR_H */
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 05/11] qemu-thread: add simple test-and-set spinlock
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 05/11] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
@ 2016-04-20 15:18   ` Richard Henderson
  2016-04-20 17:17     ` Emilio G. Cota
  2016-04-20 17:35   ` [Qemu-devel] [UPDATED " Emilio G. Cota
  1 sibling, 1 reply; 45+ messages in thread
From: Richard Henderson @ 2016-04-20 15:18 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Sergey Fedorov

On 04/19/2016 04:07 PM, Emilio G. Cota wrote:
> From: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
>
> Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
> [Rewritten. - Paolo]
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> [Emilio's additions: call cpu_relax() while spinning; optimize for
>   uncontended locks by acquiring the lock with xchg+test instead of
>   test+xchg+test.]
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---

It probably doesn't matter for any real hosts, but do note that there are 
compiler primitives for test-and-set that (can be) simpler for a cpu to 
implement than xchg.  This likely affects only ancient hosts like sparcv7,
or tiny hosts like SH.

We don't have to change anything here, but it does seem more natural to use a 
test-and-set primitive.

> +static inline int qemu_spin_trylock(QemuSpin *spin)
> +{
> +    if (atomic_read(&spin->value) || atomic_xchg(&spin->value, true)) {
> +        return -EBUSY;

I think there's no point in the extra read here.


r~

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 06/11] exec: add tb_hash_func5, derived from xxhash
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 06/11] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
@ 2016-04-20 15:19   ` Richard Henderson
  2016-04-22 12:58   ` Alex Bennée
  1 sibling, 0 replies; 45+ messages in thread
From: Richard Henderson @ 2016-04-20 15:19 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Sergey Fedorov

On 04/19/2016 04:07 PM, Emilio G. Cota wrote:
> This will be used by upcoming changes for hashing the tb hash.
>
> Add this into a separate file to include the copyright notice from
> xxhash.
>
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   include/exec/tb-hash-xx.h | 94 +++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 94 insertions(+)
>   create mode 100644 include/exec/tb-hash-xx.h

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
@ 2016-04-20 15:21   ` Richard Henderson
  2016-04-22 14:38   ` Alex Bennée
  2016-04-22 17:41   ` Richard Henderson
  2 siblings, 0 replies; 45+ messages in thread
From: Richard Henderson @ 2016-04-20 15:21 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Sergey Fedorov

On 04/19/2016 04:07 PM, Emilio G. Cota wrote:
> Suggested-by: Richard Henderson<rth@twiddle.net>
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   translate-all.c | 5 +++++
>   1 file changed, 5 insertions(+)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 04/11] include/processor.h: define cpu_relax()
  2016-04-20 12:15   ` KONRAD Frederic
@ 2016-04-20 17:16     ` Emilio G. Cota
  2016-04-20 17:18       ` Peter Maydell
  0 siblings, 1 reply; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-20 17:16 UTC (permalink / raw)
  To: KONRAD Frederic
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson, Peter Maydell,
	Sergey Fedorov

On Wed, Apr 20, 2016 at 14:15:41 +0200, KONRAD Frederic wrote:
> Le 20/04/2016 01:07, Emilio G. Cota a écrit :
> >Taken from the linux kernel.
> >
> >Reviewed-by: Richard Henderson <rth@twiddle.net>
> >Signed-off-by: Emilio G. Cota <cota@braap.org>
> >---
> >  include/qemu/processor.h | 28 ++++++++++++++++++++++++++++
> >  1 file changed, 28 insertions(+)
> >  create mode 100644 include/qemu/processor.h
> >
> >diff --git a/include/qemu/processor.h b/include/qemu/processor.h
> >new file mode 100644
> >index 0000000..675a00a
> >--- /dev/null
> >+++ b/include/qemu/processor.h
> >@@ -0,0 +1,28 @@
> 
> Hi,
> 
> Does this file need a license?

I figured GPLv2 would apply to it by default. However, IANAL so I'd
be happy to add a line to the patch stating the file is covered under
GPLv2.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 05/11] qemu-thread: add simple test-and-set spinlock
  2016-04-20 15:18   ` Richard Henderson
@ 2016-04-20 17:17     ` Emilio G. Cota
  2016-04-20 17:55       ` Richard Henderson
  0 siblings, 1 reply; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-20 17:17 UTC (permalink / raw)
  To: Richard Henderson
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Peter Maydell, Sergey Fedorov

On Wed, Apr 20, 2016 at 08:18:56 -0700, Richard Henderson wrote:
> On 04/19/2016 04:07 PM, Emilio G. Cota wrote:
> >From: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
> >
> >Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
> >[Rewritten. - Paolo]
> >Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >[Emilio's additions: call cpu_relax() while spinning; optimize for
> >  uncontended locks by acquiring the lock with xchg+test instead of
> >  test+xchg+test.]
> >Signed-off-by: Emilio G. Cota <cota@braap.org>
> >---
> 
> It probably doesn't matter for any real hosts, but do note that there are
> compiler primitives for test-and-set that (can be) simpler for a cpu to
> implement than xchg.  This likely affects only ancient hosts like sparcv7,
> or tiny hosts like SH.
> 
> We don't have to change anything here, but it does seem more natural to use
> a test-and-set primitive.

I've tried to find a GCC intrinsic for test-and-set, and I've only found
lock_test_and_set, which is what we use for atomic_xchg (except on ppc)
because it really is an atomic exchange:
 "This builtin, as described by Intel, is not a traditional test-and-set
  operation, but rather an atomic exchange operation."
 https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html

> >+static inline int qemu_spin_trylock(QemuSpin *spin)
> >+{
> >+    if (atomic_read(&spin->value) || atomic_xchg(&spin->value, true)) {
> >+        return -EBUSY;
> 
> I think there's no point in the extra read here.

Yep that's a remnant of the original "TATAS" implementation. Will
send a revised patch in a few minutes.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 04/11] include/processor.h: define cpu_relax()
  2016-04-20 17:16     ` Emilio G. Cota
@ 2016-04-20 17:18       ` Peter Maydell
  0 siblings, 0 replies; 45+ messages in thread
From: Peter Maydell @ 2016-04-20 17:18 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: KONRAD Frederic, QEMU Developers, MTTCG Devel, Alex Bennée,
	Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Sergey Fedorov

On 20 April 2016 at 18:16, Emilio G. Cota <cota@braap.org> wrote:
> On Wed, Apr 20, 2016 at 14:15:41 +0200, KONRAD Frederic wrote:
>> Le 20/04/2016 01:07, Emilio G. Cota a écrit :
>> >Taken from the linux kernel.
>> >
>> >Reviewed-by: Richard Henderson <rth@twiddle.net>
>> >Signed-off-by: Emilio G. Cota <cota@braap.org>
>> >---
>> >  include/qemu/processor.h | 28 ++++++++++++++++++++++++++++
>> >  1 file changed, 28 insertions(+)
>> >  create mode 100644 include/qemu/processor.h
>> >
>> >diff --git a/include/qemu/processor.h b/include/qemu/processor.h
>> >new file mode 100644
>> >index 0000000..675a00a
>> >--- /dev/null
>> >+++ b/include/qemu/processor.h
>> >@@ -0,0 +1,28 @@
>>
>> Hi,
>>
>> Does this file need a license?
>
> I figured GPLv2 would apply to it by default. However, IANAL so I'd
> be happy to add a line to the patch stating the file is covered under
> GPLv2.

Yeah, in general we prefer new files to have at least a short
copyright-and-license statement, though there are some older
files (particularly headers) without.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [Qemu-devel] [UPDATED v3 04/11] include/processor.h: define cpu_relax()
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 04/11] include/processor.h: define cpu_relax() Emilio G. Cota
  2016-04-20 12:15   ` KONRAD Frederic
@ 2016-04-20 17:32   ` Emilio G. Cota
  2016-04-22  9:35   ` [Qemu-devel] [PATCH " Alex Bennée
  2 siblings, 0 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-20 17:32 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

Taken from the linux kernel.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/processor.h | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)
 create mode 100644 include/qemu/processor.h

diff --git a/include/qemu/processor.h b/include/qemu/processor.h
new file mode 100644
index 0000000..4e6a71f
--- /dev/null
+++ b/include/qemu/processor.h
@@ -0,0 +1,34 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2.
+ *   See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_PROCESSOR_H
+#define QEMU_PROCESSOR_H
+
+#include "qemu/atomic.h"
+
+#if defined(__i386__) || defined(__x86_64__)
+#define cpu_relax() asm volatile("rep; nop" ::: "memory")
+#endif
+
+#ifdef __ia64__
+#define cpu_relax() asm volatile("hint @pause" ::: "memory")
+#endif
+
+#ifdef __aarch64__
+#define cpu_relax() asm volatile("yield" ::: "memory")
+#endif
+
+#if defined(__powerpc64__)
+/* set Hardware Multi-Threading (HMT) priority to low; then back to medium */
+#define cpu_relax() asm volatile("or 1, 1, 1;" \
+                                 "or 2, 2, 2;" ::: "memory")
+#endif
+
+#ifndef cpu_relax
+#define cpu_relax() barrier()
+#endif
+
+#endif /* QEMU_PROCESSOR_H */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [Qemu-devel] [UPDATED v3 05/11] qemu-thread: add simple test-and-set spinlock
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 05/11] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
  2016-04-20 15:18   ` Richard Henderson
@ 2016-04-20 17:35   ` Emilio G. Cota
  1 sibling, 0 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-20 17:35 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

From: Guillaume Delbergue <guillaume.delbergue@greensocs.com>

Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
[Rewritten. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Emilio's additions: call cpu_relax() while spinning; optimize for
 uncontended locks by acquiring the lock with test-and-set instead of
 test-and-test-and-set.]
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/thread.h | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index bdae6df..a216941 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -1,6 +1,9 @@
 #ifndef __QEMU_THREAD_H
 #define __QEMU_THREAD_H 1
 
+#include <errno.h>
+#include "qemu/processor.h"
+#include "qemu/atomic.h"
 
 typedef struct QemuMutex QemuMutex;
 typedef struct QemuCond QemuCond;
@@ -60,4 +63,35 @@ struct Notifier;
 void qemu_thread_atexit_add(struct Notifier *notifier);
 void qemu_thread_atexit_remove(struct Notifier *notifier);
 
+typedef struct QemuSpin {
+    int value;
+} QemuSpin;
+
+static inline void qemu_spin_init(QemuSpin *spin)
+{
+    spin->value = 0;
+}
+
+static inline void qemu_spin_lock(QemuSpin *spin)
+{
+    while (atomic_xchg(&spin->value, true)) {
+        while (atomic_read(&spin->value)) {
+            cpu_relax();
+        }
+    }
+}
+
+static inline int qemu_spin_trylock(QemuSpin *spin)
+{
+    if (atomic_xchg(&spin->value, true)) {
+        return -EBUSY;
+    }
+    return 0;
+}
+
+static inline void qemu_spin_unlock(QemuSpin *spin)
+{
+    atomic_mb_set(&spin->value, 0);
+}
+
 #endif
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 45+ messages in thread
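
A usage sketch for the QemuSpin API above (the guarded counter is purely
illustrative and not part of the patch):

    static QemuSpin lock;
    static int counter;

    static void counter_init(void)
    {
        qemu_spin_init(&lock);
    }

    static void counter_inc(void)
    {
        qemu_spin_lock(&lock);          /* spins, calling cpu_relax() */
        counter++;
        qemu_spin_unlock(&lock);
    }

    static bool counter_try_inc(void)
    {
        if (qemu_spin_trylock(&lock)) { /* non-zero (-EBUSY) if contended */
            return false;
        }
        counter++;
        qemu_spin_unlock(&lock);
        return true;
    }

As with any spinlock, critical sections should be kept short, since a
contended waiter burns CPU instead of sleeping.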

* Re: [Qemu-devel] [PATCH v3 05/11] qemu-thread: add simple test-and-set spinlock
  2016-04-20 17:17     ` Emilio G. Cota
@ 2016-04-20 17:55       ` Richard Henderson
  2016-04-20 18:11         ` Emilio G. Cota
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Henderson @ 2016-04-20 17:55 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Peter Maydell, Sergey Fedorov

On 04/20/2016 10:17 AM, Emilio G. Cota wrote:
> I've tried to find a GCC intrinsic for test-and-set, and I've only found
> lock_test_and_set, which is what we use for atomic_xchg (except on ppc)
> because it really is an atomic exchange:
>   "This builtin, as described by Intel, is not a traditional test-and-set
>    operation, but rather an atomic exchange operation."
>   https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html

Please read the entire documentation, not just the first sentence.


r~

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 05/11] qemu-thread: add simple test-and-set spinlock
  2016-04-20 17:55       ` Richard Henderson
@ 2016-04-20 18:11         ` Emilio G. Cota
  2016-04-20 19:39           ` Richard Henderson
  0 siblings, 1 reply; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-20 18:11 UTC (permalink / raw)
  To: Richard Henderson
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Peter Maydell, Sergey Fedorov

On Wed, Apr 20, 2016 at 10:55:45 -0700, Richard Henderson wrote:
> On 04/20/2016 10:17 AM, Emilio G. Cota wrote:
> >I've tried to find a GCC intrinsic for test-and-set, and I've only found
> >lock_test_and_set, which is what we use for atomic_xchg (except on ppc)
> >because it really is an atomic exchange:
> >  "This builtin, as described by Intel, is not a traditional test-and-set
> >   operation, but rather an atomic exchange operation."
> >  https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
> 
> Please read the entire documentation, not just the first sentence.

I did read it and I'm aware of the limitations of using xchg.

My comment was related to this:

> [...] do note that there are compiler primitives for test-and-set that
> (can be) simpler for a cpu to implement than xchg.

What compiler (I assume gcc) primitives are these? I couldn't find them.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 05/11] qemu-thread: add simple test-and-set spinlock
  2016-04-20 18:11         ` Emilio G. Cota
@ 2016-04-20 19:39           ` Richard Henderson
  2016-04-21 16:24             ` Emilio G. Cota
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Henderson @ 2016-04-20 19:39 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Peter Maydell, Sergey Fedorov

On 04/20/2016 11:11 AM, Emilio G. Cota wrote:
> On Wed, Apr 20, 2016 at 10:55:45 -0700, Richard Henderson wrote:
>> On 04/20/2016 10:17 AM, Emilio G. Cota wrote:
>>> I've tried to find a GCC intrinsic for test-and-set, and I've only found
>>> lock_test_and_set, which is what we use for atomic_xchg (except on ppc)
>>> because it really is an atomic exchange:
>>>   "This builtin, as described by Intel, is not a traditional test-and-set
>>>    operation, but rather an atomic exchange operation."
>>>   https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
>>
>> Please read the entire documentation, not just the first sentence.
>
> I did read it and I'm aware of the limitations of using xchg.
>
> My comment was related to this:
>
>> [...] do note that there are compiler primitives for test-and-set that
>> (can be) simpler for a cpu to implement than xchg.
>
> What compiler (I assume gcc) primitives are these? I couldn't find them.

__sync_lock_test_and_set and __atomic_test_and_set.

Both expand to ldstub on sparcv7, tas on coldfire, tas.b on sh.
None of these are xchg operations.

I had forgotten that there wasn't a __sync_exchange builtin, so 
__sync_lock_test_and_set plays double-duty as both xchg and test-and-set.

But when __atomic builtins are available, __atomic_exchange does not fall back; 
you must use __atomic_test_and_set for less capable hosts.

But of course none of this is really relevant for normal hosts.


r~

^ permalink raw reply	[flat|nested] 45+ messages in thread
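
For reference, a minimal sketch of a byte lock built directly on the
builtins mentioned above (purely illustrative; the follow-up below instead
wraps the builtin as an atomic_test_and_set macro in qemu/atomic.h):

    #include <stdbool.h>

    static inline void byte_lock_acquire(bool *lock)
    {
        while (__atomic_test_and_set(lock, __ATOMIC_ACQUIRE)) {
            /* spin; a real lock would call cpu_relax() here */
        }
    }

    static inline void byte_lock_release(bool *lock)
    {
        __atomic_clear(lock, __ATOMIC_RELEASE);
    }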

* Re: [Qemu-devel] [PATCH v3 05/11] qemu-thread: add simple test-and-set spinlock
  2016-04-20 19:39           ` Richard Henderson
@ 2016-04-21 16:24             ` Emilio G. Cota
  0 siblings, 0 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-21 16:24 UTC (permalink / raw)
  To: Richard Henderson
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Peter Maydell, Sergey Fedorov

On Wed, Apr 20, 2016 at 12:39:45 -0700, Richard Henderson wrote:
> On 04/20/2016 11:11 AM, Emilio G. Cota wrote:
> >On Wed, Apr 20, 2016 at 10:55:45 -0700, Richard Henderson wrote:
> >>On 04/20/2016 10:17 AM, Emilio G. Cota wrote:
(snip)
> >My comment was related to this:
> >
> >>[...] do note that there are compiler primitives for test-and-set that
> >>(can be) simpler for a cpu to implement than xchg.
> >
> >What compiler (I assume gcc) primitives are these? I couldn't find them.
> 
> __sync_lock_test_and_set and __atomic_test_and_set.
> 
> Both expand to ldstub on sparcv7, tas on coldfire, tas.b on sh.
> None of these are xchg operations.
> 
> I had forgotten that there wasn't a __sync_exchange builtin, so
> __sync_lock_test_and_set plays double-duty as both xchg and test-and-set.
> 
> But when __atomic builtins are available, __atomic_exchange does not fall
> back; you must use __atomic_test_and_set for less capable hosts.
> 
> But of course none of this is really relevant for normal hosts.

I see. Maybe we should do something like the appended, then?

		Emilio

diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 5bc4d6c..6132dad 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -134,6 +134,7 @@
     })
 
 /* Provide shorter names for GCC atomic builtins, return old value */
+#define atomic_test_and_set(ptr) __atomic_test_and_set(ptr, __ATOMIC_SEQ_CST)
 #define atomic_fetch_inc(ptr)  __atomic_fetch_add(ptr, 1, __ATOMIC_SEQ_CST)
 #define atomic_fetch_dec(ptr)  __atomic_fetch_sub(ptr, 1, __ATOMIC_SEQ_CST)
 #define atomic_fetch_add(ptr, n) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST)
@@ -328,6 +329,7 @@
 #endif
 
 /* Provide shorter names for GCC atomic builtins.  */
+#define atomic_test_and_set(ptr) atomic_xchg(ptr, true)
 #define atomic_fetch_inc(ptr)  __sync_fetch_and_add(ptr, 1)
 #define atomic_fetch_dec(ptr)  __sync_fetch_and_add(ptr, -1)
 #define atomic_fetch_add       __sync_fetch_and_add
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index a216941..39ff1ac 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -74,7 +74,7 @@ static inline void qemu_spin_init(QemuSpin *spin)
 
 static inline void qemu_spin_lock(QemuSpin *spin)
 {
-    while (atomic_xchg(&spin->value, true)) {
+    while (atomic_test_and_set(&spin->value)) {
         while (atomic_read(&spin->value)) {
             cpu_relax();
         }
@@ -83,7 +83,7 @@ static inline void qemu_spin_lock(QemuSpin *spin)
 
 static inline int qemu_spin_trylock(QemuSpin *spin)
 {
-    if (atomic_xchg(&spin->value, true)) {
+    if (atomic_test_and_set(&spin->value)) {
         return -EBUSY;
     }
     return 0;

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 01/11] compiler.h: add QEMU_ALIGNED() to enforce struct alignment
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 01/11] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
@ 2016-04-22  9:32   ` Alex Bennée
  2016-04-22  9:35   ` Peter Maydell
  1 sibling, 0 replies; 45+ messages in thread
From: Alex Bennée @ 2016-04-22  9:32 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  include/qemu/compiler.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
> index 8f1cc7b..1978ddc 100644
> --- a/include/qemu/compiler.h
> +++ b/include/qemu/compiler.h
> @@ -41,6 +41,8 @@
>  # define QEMU_PACKED __attribute__((packed))
>  #endif
>
> +#define QEMU_ALIGNED(B) __attribute__((aligned(B)))
> +
>  #ifndef glue
>  #define xglue(x, y) x ## y
>  #define glue(x, y) xglue(x, y)


--
Alex Bennée

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 04/11] include/processor.h: define cpu_relax()
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 04/11] include/processor.h: define cpu_relax() Emilio G. Cota
  2016-04-20 12:15   ` KONRAD Frederic
  2016-04-20 17:32   ` [Qemu-devel] [UPDATED " Emilio G. Cota
@ 2016-04-22  9:35   ` Alex Bennée
  2 siblings, 0 replies; 45+ messages in thread
From: Alex Bennée @ 2016-04-22  9:35 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> Taken from the linux kernel.
>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

License header notwithstanding:


Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  include/qemu/processor.h | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
>  create mode 100644 include/qemu/processor.h
>
> diff --git a/include/qemu/processor.h b/include/qemu/processor.h
> new file mode 100644
> index 0000000..675a00a
> --- /dev/null
> +++ b/include/qemu/processor.h
> @@ -0,0 +1,28 @@
> +#ifndef QEMU_PROCESSOR_H
> +#define QEMU_PROCESSOR_H
> +
> +#include "qemu/atomic.h"
> +
> +#if defined(__i386__) || defined(__x86_64__)
> +#define cpu_relax() asm volatile("rep; nop" ::: "memory")
> +#endif
> +
> +#ifdef __ia64__
> +#define cpu_relax() asm volatile("hint @pause" ::: "memory")
> +#endif
> +
> +#ifdef __aarch64__
> +#define cpu_relax() asm volatile("yield" ::: "memory")
> +#endif
> +
> +#if defined(__powerpc64__)
> +/* set Hardware Multi-Threading (HMT) priority to low; then back to medium */
> +#define cpu_relax() asm volatile("or 1, 1, 1;" \
> +                                 "or 2, 2, 2;" ::: "memory")
> +#endif
> +
> +#ifndef cpu_relax
> +#define cpu_relax() barrier()
> +#endif
> +
> +#endif /* QEMU_PROCESSOR_H */


--
Alex Bennée

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 01/11] compiler.h: add QEMU_ALIGNED() to enforce struct alignment
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 01/11] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
  2016-04-22  9:32   ` Alex Bennée
@ 2016-04-22  9:35   ` Peter Maydell
  2016-04-22 15:50     ` Emilio G. Cota
  1 sibling, 1 reply; 45+ messages in thread
From: Peter Maydell @ 2016-04-22  9:35 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson, Sergey Fedorov

On 20 April 2016 at 00:07, Emilio G. Cota <cota@braap.org> wrote:
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/compiler.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
> index 8f1cc7b..1978ddc 100644
> --- a/include/qemu/compiler.h
> +++ b/include/qemu/compiler.h
> @@ -41,6 +41,8 @@
>  # define QEMU_PACKED __attribute__((packed))
>  #endif
>
> +#define QEMU_ALIGNED(B) __attribute__((aligned(B)))

A rather trivial thing, but if we have to respin this series
for some other reason could we use a different macro parameter
than 'B', please? I had to re-read the patch carefully before
I realised that (a) it wasn't "aligned(8)" and (b) it wasn't a
typo for "aligned(8)" either...

thanks
-- PMM

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 07/11] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 07/11] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
@ 2016-04-22 12:58   ` Alex Bennée
  0 siblings, 0 replies; 45+ messages in thread
From: Alex Bennée @ 2016-04-22 12:58 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> For some workloads such as arm bootup, tb_phys_hash is performance-critical.
> This is due to the high frequency of accesses to the hash table, originated
> by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
> More info:
>   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html
>
> To dig further into this I modified an arm image booting debian jessie to
> immediately shut down after boot. Analysis revealed that quite a bit of time
> is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
> results in very uneven loading of chains in the hash table's buckets;
> the longest observed chain had ~550 elements.
>
> The appended addresses this with two changes:
>
> 1) Use xxhash as the hash table's hash function. xxhash is a fast,
>    high-quality hashing function.
>
> 2) Feed the hashing function with not just tb_phys, but also pc and flags.
>
> This improves performance over using just tb_phys for hashing, since that
> resulted in some hash buckets having many TB's, while others getting very few;
> with these changes, the longest observed chain on a single hash bucket is
> brought down from ~550 to ~40.
>
> Tests show that the other element checked for in tb_find_physical,
> cs_base, is always a match when tb_phys+pc+flags are a match,
> so hashing cs_base is wasteful. It could be that this is an ARM-only
> thing, though.
>
> BTW, after this change the hash table should not be called "tb_phys_hash"
> anymore; this is addressed later in this series.
>
> This change gives consistent bootup time improvements. I tested two
> host machines:
> - Intel Xeon E5-2690: 11.6% less time
> - Intel i7-4790K: 19.2% less time
>
> Increasing the number of hash buckets yields further improvements. However,
> using a larger, fixed number of buckets can degrade performance for other
> workloads that do not translate as many blocks (600K+ for debian-jessie arm
> bootup). This is dealt with later in this series.
>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>


> ---
>  cpu-exec.c             | 2 +-
>  include/exec/tb-hash.h | 8 ++++++--
>  translate-all.c        | 6 +++---
>  3 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index debc65c..a889cf1 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -233,7 +233,7 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
>      /* find translated block using physical mappings */
>      phys_pc = get_page_addr_code(env, pc);
>      phys_page1 = phys_pc & TARGET_PAGE_MASK;
> -    h = tb_phys_hash_func(phys_pc);
> +    h = tb_hash_func(phys_pc, pc, flags);
>      ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
>      for(;;) {
>          tb = *ptb1;
> diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
> index 0f4e8a0..4b9635a 100644
> --- a/include/exec/tb-hash.h
> +++ b/include/exec/tb-hash.h
> @@ -20,6 +20,9 @@
>  #ifndef EXEC_TB_HASH
>  #define EXEC_TB_HASH
>
> +#include "exec/exec-all.h"
> +#include "exec/tb-hash-xx.h"
> +
>  /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
>     addresses on the same page.  The top bits are the same.  This allows
>     TLB invalidation to quickly clear a subset of the hash table.  */
> @@ -43,9 +46,10 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
>             | (tmp & TB_JMP_ADDR_MASK));
>  }
>
> -static inline unsigned int tb_phys_hash_func(tb_page_addr_t pc)
> +static inline
> +uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
>  {
> -    return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
> +    return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
>  }
>
>  #endif
> diff --git a/translate-all.c b/translate-all.c
> index 1a8f68b..eca2f16 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -972,7 +972,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>
>      /* remove the TB from the hash list */
>      phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
> -    h = tb_phys_hash_func(phys_pc);
> +    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
>      tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
>
>      /* remove the TB from the page list */
> @@ -1474,8 +1474,8 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>      unsigned int h;
>      TranslationBlock **ptb;
>
> -    /* add in the physical hash table */
> -    h = tb_phys_hash_func(phys_pc);
> +    /* add in the hash table */
> +    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
>      ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
>      tb->phys_hash_next = *ptb;
>      *ptb = tb;


--
Alex Bennée

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 06/11] exec: add tb_hash_func5, derived from xxhash
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 06/11] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
  2016-04-20 15:19   ` Richard Henderson
@ 2016-04-22 12:58   ` Alex Bennée
  1 sibling, 0 replies; 45+ messages in thread
From: Alex Bennée @ 2016-04-22 12:58 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> This will be used by upcoming changes for hashing the tb hash.
>
> Add this into a separate file to include the copyright notice from
> xxhash.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  include/exec/tb-hash-xx.h | 94 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 94 insertions(+)
>  create mode 100644 include/exec/tb-hash-xx.h
>
> diff --git a/include/exec/tb-hash-xx.h b/include/exec/tb-hash-xx.h
> new file mode 100644
> index 0000000..67f4e6f
> --- /dev/null
> +++ b/include/exec/tb-hash-xx.h
> @@ -0,0 +1,94 @@
> +/*
> + * xxHash - Fast Hash algorithm
> + * Copyright (C) 2012-2016, Yann Collet
> + *
> + * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions are
> + * met:
> + *
> + * + Redistributions of source code must retain the above copyright
> + * notice, this list of conditions and the following disclaimer.
> + * + Redistributions in binary form must reproduce the above
> + * copyright notice, this list of conditions and the following disclaimer
> + * in the documentation and/or other materials provided with the
> + * distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + *
> + * You can contact the author at :
> + * - xxHash source repository : https://github.com/Cyan4973/xxHash
> + */
> +#ifndef EXEC_TB_HASH_XX
> +#define EXEC_TB_HASH_XX
> +
> +#include <qemu/bitops.h>
> +
> +#define PRIME32_1   2654435761U
> +#define PRIME32_2   2246822519U
> +#define PRIME32_3   3266489917U
> +#define PRIME32_4    668265263U
> +#define PRIME32_5    374761393U
> +
> +#define TB_HASH_XX_SEED 1
> +
> +/*
> + * xxhash32, customized for input variables that are not guaranteed to be
> + * contiguous in memory.
> + */
> +static inline
> +uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e)
> +{
> +    uint32_t v1 = TB_HASH_XX_SEED + PRIME32_1 + PRIME32_2;
> +    uint32_t v2 = TB_HASH_XX_SEED + PRIME32_2;
> +    uint32_t v3 = TB_HASH_XX_SEED + 0;
> +    uint32_t v4 = TB_HASH_XX_SEED - PRIME32_1;
> +    uint32_t a = a0 >> 31 >> 1;
> +    uint32_t b = a0;
> +    uint32_t c = b0 >> 31 >> 1;
> +    uint32_t d = b0;
> +    uint32_t h32;
> +
> +    v1 += a * PRIME32_2;
> +    v1 = rol32(v1, 13);
> +    v1 *= PRIME32_1;
> +
> +    v2 += b * PRIME32_2;
> +    v2 = rol32(v2, 13);
> +    v2 *= PRIME32_1;
> +
> +    v3 += c * PRIME32_2;
> +    v3 = rol32(v3, 13);
> +    v3 *= PRIME32_1;
> +
> +    v4 += d * PRIME32_2;
> +    v4 = rol32(v4, 13);
> +    v4 *= PRIME32_1;
> +
> +    h32 = rol32(v1, 1) + rol32(v2, 7) + rol32(v3, 12) + rol32(v4, 18);
> +    h32 += 20;
> +
> +    h32 += e * PRIME32_3;
> +    h32  = rol32(h32, 17) * PRIME32_4;
> +
> +    h32 ^= h32 >> 15;
> +    h32 *= PRIME32_2;
> +    h32 ^= h32 >> 13;
> +    h32 *= PRIME32_3;
> +    h32 ^= h32 >> 16;
> +
> +    return h32;
> +}
> +
> +#endif /* EXEC_TB_HASH_XX */


--
Alex Bennée

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 08/11] qht: QEMU's fast, resizable and scalable Hash Table
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 08/11] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
@ 2016-04-22 14:04   ` Alex Bennée
  2016-04-24 20:01   ` Richard Henderson
  1 sibling, 0 replies; 45+ messages in thread
From: Alex Bennée @ 2016-04-22 14:04 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> This is a hash table with optional auto-resizing and MRU promotion for
> reads and writes. Its implementation goal is to stay fast while
> scaling for read-mostly workloads.
>
> A hash table with these features will be necessary for the scalability
> of the ongoing MTTCG work; before those changes arrive we can already
> benefit from the single-threaded speedup that qht also provides.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>


> ---
>  include/qemu/qht.h |  54 +++++
>  util/Makefile.objs |   2 +-
>  util/qht.c         | 590 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 645 insertions(+), 1 deletion(-)
>  create mode 100644 include/qemu/qht.h
>  create mode 100644 util/qht.c
>
> diff --git a/include/qemu/qht.h b/include/qemu/qht.h
> new file mode 100644
> index 0000000..a0a1aa8
> --- /dev/null
> +++ b/include/qemu/qht.h
> @@ -0,0 +1,54 @@
> +/*
> + * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
> + *
> + * License: GNU GPL, version 2 or later.
> + *   See the COPYING file in the top-level directory.
> + */
> +#ifndef QHT_H
> +#define QHT_H
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "qemu/seqlock.h"
> +#include "qemu/rcu.h"
> +
> +struct qht {
> +    struct qht_map *map;
> +    unsigned int mode;
> +};
> +
> +typedef bool (*qht_lookup_func_t)(const void *obj, const void *userp);
> +typedef void (*qht_iter_func_t)(struct qht *ht, void *p, uint32_t h, void *up);
> +
> +#define QHT_MODE_MRU_LOOKUP  0x1 /* move looked-up items to head */
> +#define QHT_MODE_MRU_INSERT  0x2 /* insert new elements at the head */
> +#define QHT_MODE_AUTO_RESIZE 0x4 /* auto-resize when heavily loaded */
> +
> +void qht_init(struct qht *ht, size_t n_elems, unsigned int mode);
> +
> +/* call only when there are no readers left */
> +void qht_destroy(struct qht *ht);
> +
> +/* call with an external lock held */
> +void qht_reset(struct qht *ht);
> +
> +/* call with an external lock held */
> +void qht_reset_size(struct qht *ht, size_t n_elems);
> +
> +/* call with an external lock held */
> +void qht_insert(struct qht *ht, void *p, uint32_t hash);
> +
> +/* call with an external lock held */
> +bool qht_remove(struct qht *ht, const void *p, uint32_t hash);
> +
> +/* call with an external lock held */
> +void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp);
> +
> +/* call with an external lock held */
> +void qht_grow(struct qht *ht);
> +
> +void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
> +                 uint32_t hash);
> +double qht_avg_bucket_chain_length(struct qht *ht, size_t *n_head_buckets);
> +
> +#endif /* QHT_H */
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index a8a777e..ae893dc 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -1,4 +1,4 @@
> -util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
> +util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o qht.o
>  util-obj-$(CONFIG_POSIX) += compatfd.o
>  util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
>  util-obj-$(CONFIG_POSIX) += mmap-alloc.o
> diff --git a/util/qht.c b/util/qht.c
> new file mode 100644
> index 0000000..05ea5e8
> --- /dev/null
> +++ b/util/qht.c
> @@ -0,0 +1,590 @@
> +/*
> + * qht.c - QEMU Hash Table, designed to scale for read-mostly workloads.
> + *
> + * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
> + *
> + * License: GNU GPL, version 2 or later.
> + *   See the COPYING file in the top-level directory.
> + *
> + * Assumptions:
> + * - Writers and iterators must take an external lock.
> + * - A hash value of 0 is invalid.
> + * - NULL cannot be inserted as a pointer value.
> + *
> + * Features:
> + * - Optional auto-resizing: the hash table resizes up if the load surpasses
> + *   a certain threshold. Resizing is done concurrently with readers.
> + * - Optional bucket MRU promotion policy.
> + *
> + * The key structure is the bucket, which is cacheline-sized. Buckets
> + * contain a few hash values and pointers; the u32 hash values are stored in
> + * full so that resizing is fast. Having this structure instead of directly
> + * chaining items has three advantages:
> + * - Failed lookups fail fast, and touch a minimum number of cache lines.
> + * - Resizing the hash table with concurrent lookups is easy.
> + * - We can have a few Most-Recently-Used (MRU) hash-pointer pairs in the same
> + *   head bucket. This helps scalability, since MRU promotions (i.e. writes to
> + *   the bucket) become less common.
> + *
> + * For concurrent lookups we use a per-bucket seqlock; per-bucket spinlocks
> + * allow readers (lookups) to upgrade to writers and thus implement an MRU
> + * promotion policy; these MRU-induced writes do not touch the cache lines of
> + * other head buckets.
> + *
> + * Note that there are two types of buckets:
> + * 1. "head" buckets are the ones allocated in the array of buckets in qht_map.
> + * 2. all "non-head" buckets (i.e. all others) are members of a chain that
> + *    starts from a head bucket.
> + * Note that the seqlock and spinlock of a head bucket applies to all buckets
> + * chained to it; these two fields are unused in non-head buckets.
> + *
> + * Resizing is done by taking all spinlocks (so that no readers-turned-writers
> + * can race with us) and then placing all elements into a new hash table. Last,
> + * the ht->map pointer is set, and the old map is freed once no RCU readers can
> + * see it anymore.
> + *
> + * Related Work:
> + * - Idea of cacheline-sized buckets with full hashes taken from:
> + *   David, Guerraoui & Trigonakis, "Asynchronized Concurrency:
> + *   The Secret to Scaling Concurrent Search Data Structures", ASPLOS'15.
> + * - Why not RCU-based hash tables? They would allow us to get rid of the
> + *   seqlock, but resizing would take forever since RCU read critical
> + *   sections in QEMU take quite a long time.
> + *   More info on relativistic hash tables:
> + *   + Triplett, McKenney & Walpole, "Resizable, Scalable, Concurrent Hash
> + *     Tables via Relativistic Programming", USENIX ATC'11.
> + *   + Corbet, "Relativistic hash tables, part 1: Algorithms", @ lwn.net, 2014.
> + *     https://lwn.net/Articles/612021/
> + */
> +#include "qemu/qht.h"
> +#include "qemu/atomic.h"
> +
> +/*
> + * We want to avoid false sharing of cache lines. Most systems have 64-byte
> + * cache lines so we go with it for simplicity.
> + *
> + * Note that systems with smaller cache lines will be fine (the struct is
> + * almost 64-bytes); systems with larger cache lines might suffer from
> + * some false sharing.
> + */
> +#define QHT_BUCKET_ALIGN 64
> +
> +/* define these to keep sizeof(qht_bucket) within QHT_BUCKET_ALIGN */
> +#if HOST_LONG_BITS == 32
> +#define QHT_BUCKET_ENTRIES 6
> +#else /* 64-bit */
> +#define QHT_BUCKET_ENTRIES 4
> +#endif
> +
> +struct qht_bucket {
> +    QemuSpin lock;
> +    QemuSeqLock sequence;
> +    uint32_t hashes[QHT_BUCKET_ENTRIES];
> +    void *pointers[QHT_BUCKET_ENTRIES];
> +    struct qht_bucket *next;
> +} QEMU_ALIGNED(QHT_BUCKET_ALIGN);
> +
> +QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
> +
> +struct qht_map {
> +    struct qht_bucket *buckets;
> +    size_t n;
> +    size_t n_items;
> +    size_t n_items_threshold;
> +    struct rcu_head rcu;
> +};
> +
> +static inline size_t qht_elems_to_buckets(size_t n_elems)
> +{
> +    return pow2ceil(n_elems / QHT_BUCKET_ENTRIES);
> +}
> +
> +static inline void qht_head_init(struct qht_bucket *b)
> +{
> +    memset(b, 0, sizeof(*b));
> +    qemu_spin_init(&b->lock);
> +    seqlock_init(&b->sequence);
> +}
> +
> +static inline
> +struct qht_bucket *qht_map_to_bucket(struct qht_map *map, uint32_t hash)
> +{
> +    return &map->buckets[hash & (map->n - 1)];
> +}
> +
> +static inline void qht_chain_destroy(struct qht_bucket *head)
> +{
> +    struct qht_bucket *curr = head->next;
> +    struct qht_bucket *prev;
> +
> +    while (curr) {
> +        prev = curr;
> +        curr = curr->next;
> +        qemu_vfree(prev);
> +    }
> +}
> +
> +/* pass only an orphan map */
> +static void qht_map_destroy(struct qht_map *map)
> +{
> +    size_t i;
> +
> +    for (i = 0; i < map->n; i++) {
> +        qht_chain_destroy(&map->buckets[i]);
> +    }
> +    qemu_vfree(map->buckets);
> +    g_free(map);
> +}
> +
> +static void qht_map_reclaim(struct rcu_head *rcu)
> +{
> +    struct qht_map *map = container_of(rcu, struct qht_map, rcu);
> +
> +    qht_map_destroy(map);
> +}
> +
> +static struct qht_map *qht_map_create(size_t n)
> +{
> +    struct qht_map *map;
> +    size_t i;
> +
> +    map = g_malloc(sizeof(*map));
> +    map->n = n;
> +    map->n_items = 0;
> +    map->n_items_threshold = n * QHT_BUCKET_ENTRIES / 2;
> +    map->buckets = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*map->buckets) * n);
> +    for (i = 0; i < n; i++) {
> +        qht_head_init(&map->buckets[i]);
> +    }
> +    return map;
> +}
> +
> +static inline void qht_publish(struct qht *ht, struct qht_map *new)
> +{
> +    /* Readers should see a properly initialized map; pair with smp_rmb() */
> +    smp_wmb();
> +    atomic_set(&ht->map, new);
> +}
> +
> +void qht_init(struct qht *ht, size_t n_elems, unsigned int mode)
> +{
> +    struct qht_map *map;
> +    size_t n = qht_elems_to_buckets(n_elems);
> +
> +    map = qht_map_create(n);
> +    ht->mode = mode;
> +    qht_publish(ht, map);
> +}
> +
> +/* call only when there are no readers left */
> +void qht_destroy(struct qht *ht)
> +{
> +    qht_map_destroy(ht->map);
> +    memset(ht, 0, sizeof(*ht));
> +}
> +
> +static void qht_bucket_reset(struct qht_bucket *head)
> +{
> +    struct qht_bucket *b = head;
> +    int i;
> +
> +    qemu_spin_lock(&head->lock);
> +    seqlock_write_begin(&head->sequence);
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            atomic_set(&b->hashes[i], 0);
> +            atomic_set(&b->pointers[i], NULL);
> +        }
> +        b = b->next;
> +    } while (b);
> +    seqlock_write_end(&head->sequence);
> +    qemu_spin_unlock(&head->lock);
> +}
> +
> +/* call with an external lock held */
> +void qht_reset(struct qht *ht)
> +{
> +    struct qht_map *map = ht->map;
> +    size_t i;
> +
> +    for (i = 0; i < map->n; i++) {
> +        qht_bucket_reset(&map->buckets[i]);
> +    }
> +}
> +
> +/* call with an external lock held */
> +void qht_reset_size(struct qht *ht, size_t n_elems)
> +{
> +    struct qht_map *old = ht->map;
> +
> +    qht_reset(ht);
> +    if (old->n == qht_elems_to_buckets(n_elems)) {
> +        return;
> +    }
> +    qht_init(ht, n_elems, ht->mode);
> +    call_rcu1(&old->rcu, qht_map_reclaim);
> +}
> +
> +/* Can only be called when at least orig is the 3rd link i.e. head->2nd->orig */
> +static void qht_move_bucket_to_front__locked(struct qht_bucket *head,
> +                                             struct qht_bucket *orig)
> +{
> +    struct qht_bucket *b = head->next;
> +
> +    for (;;) {
> +        if (b->next == orig) {
> +            atomic_set(&b->next, orig->next);
> +            atomic_set(&orig->next, head->next);
> +            atomic_set(&head->next, orig);
> +            return;
> +        }
> +        b = b->next;
> +    }
> +}
> +
> +/*
> + * Call with head->lock held.
> + *
> + * This function moves the item in @orig[@pos] to @head[0]. In order to
> + * do this, the item i in @head is moved to @head's i+1 position,
> + * with the last item being moved to @orig[@pos].
> + * Further, @orig is brought to the front of the bucket chain, i.e.
> + * @head->next is set to point to @orig. This is done to make sure that
> + * the last item in @head is not moved to a too-distant place. An
> + * alternative would be to perform full MRU for the entire bucket
> + * chain, but that would cause too much churn.
> + *
> + * Note: @orig == @head is a bug.
> + */
> +static inline void qht_bucket_mru__locked(struct qht_bucket *head,
> +                                          struct qht_bucket *orig, int pos)
> +{
> +    uint32_t *dest_hash;
> +    void **dest_p;
> +    void *p;
> +    uint32_t hash;
> +    int i;
> +
> +    assert(orig != head);
> +    for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +        if (i == QHT_BUCKET_ENTRIES - 1) {
> +            dest_hash = &orig->hashes[pos];
> +            dest_p = &orig->pointers[pos];
> +        } else {
> +            dest_hash = &head->hashes[i + 1];
> +            dest_p = &head->pointers[i + 1];
> +        }
> +        hash = *dest_hash;
> +        p = *dest_p;
> +
> +        atomic_set(dest_hash, head->hashes[i]);
> +        atomic_set(dest_p, head->pointers[i]);
> +
> +        atomic_set(&head->hashes[i], hash);
> +        atomic_set(&head->pointers[i], p);
> +    }
> +    if (head->next != orig) {
> +        qht_move_bucket_to_front__locked(head, orig);
> +    }
> +}
> +
> +static void qht_bucket_mru(struct qht_bucket *head, struct qht_bucket *orig,
> +                           const void *p, int pos)
> +{
> +    qemu_spin_lock(&head->lock);
> +    if (unlikely(orig->pointers[pos] != p)) {
> +        /* while we acquired the lock, the bucket was updated, so bail out */
> +        goto out;
> +    }
> +    seqlock_write_begin(&head->sequence);
> +    qht_bucket_mru__locked(head, orig, pos);
> +    seqlock_write_end(&head->sequence);
> + out:
> +    qemu_spin_unlock(&head->lock);
> +}
> +
> +/*
> + * @far_bucket and @pos are only filled in if the match is found in a bucket
> + * that is not the head bucket.
> + */
> +static inline
> +void *qht_do_lookup(struct qht_bucket *head, struct qht_bucket **far_bucket,
> +                    int *pos, qht_lookup_func_t func, const void *userp,
> +                    uint32_t hash)
> +{
> +    struct qht_bucket *b = head;
> +    int i;
> +
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (b->hashes[i] == hash) {
> +                void *p = atomic_read(&b->pointers[i]);
> +
> +                /*
> +                 * p could be NULL if we're racing with a writer. We could use
> +                 * barriers here but for performance we only issue the ones
> +                 * in the seqlock.
> +                 */
> +                if (likely(p && func(p, userp))) {
> +                    if (unlikely(b != head)) {
> +                        *far_bucket = b;
> +                        *pos = i;
> +                    }
> +                    return p;
> +                }
> +            }
> +        }
> +        b = atomic_read(&b->next);
> +        /*
> +         * This barrier guarantees that we will read a properly initialized
> +         * b->next; it is paired with an smp_wmb() before setting b->next.
> +         */
> +        smp_rmb();
> +    } while (b);
> +    return NULL;
> +}
> +
> +void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
> +                 uint32_t hash)
> +{
> +    struct qht_bucket *far_bucket = NULL;
> +    struct qht_bucket *b;
> +    struct qht_map *map;
> +    uint32_t version;
> +    int pos = 0;
> +    void *ret;
> +
> +    map = atomic_read(&ht->map);
> +    /* paired with smp_wmb() before setting ht->map */
> +    smp_rmb();
> +    b = qht_map_to_bucket(map, hash);
> +
> +    do {
> +        version = seqlock_read_begin(&b->sequence);
> +        ret = qht_do_lookup(b, &far_bucket, &pos, func, userp, hash);
> +    } while (seqlock_read_retry(&b->sequence, version));
> +    if ((ht->mode & QHT_MODE_MRU_LOOKUP) && unlikely(far_bucket)) {
> +        qht_bucket_mru(b, far_bucket, ret, pos);
> +    }
> +    return ret;
> +}
> +
> +/* call with b->lock held */
> +static void qht_insert__locked(struct qht *ht, struct qht_map *map,
> +                               struct qht_bucket *head, void *p, uint32_t hash)
> +{
> +    struct qht_bucket *b = head;
> +    struct qht_bucket *prev = NULL;
> +    struct qht_bucket *new = NULL;
> +    int i;
> +
> +    for (;;) {
> +        if (b == NULL) {
> +            b = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*b));
> +            memset(b, 0, sizeof(*b));
> +            new = b;
> +        }
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (b->hashes[i]) {
> +                continue;
> +            }
> +            /* found an empty key: acquire the seqlock and write */
> +            seqlock_write_begin(&head->sequence);
> +            if (new) {
> +                /*
> +                 * This barrier is paired with smp_rmb() after reading
> +                 * b->next when not holding b->lock.
> +                 */
> +                smp_wmb();
> +                atomic_set(&prev->next, b);
> +            }
> +            atomic_set(&b->hashes[i], hash);
> +            atomic_set(&b->pointers[i], p);
> +            if ((ht->mode & QHT_MODE_MRU_INSERT) && b != head) {
> +                qht_bucket_mru__locked(head, b, i);
> +            }
> +            seqlock_write_end(&head->sequence);
> +            map->n_items++;
> +            return;
> +        }
> +        prev = b;
> +        b = b->next;
> +    }
> +}
> +
> +/* call with an external lock held */
> +void qht_insert(struct qht *ht, void *p, uint32_t hash)
> +{
> +    struct qht_map *map = ht->map;
> +    struct qht_bucket *b = qht_map_to_bucket(map, hash);
> +
> +    qemu_spin_lock(&b->lock);
> +    qht_insert__locked(ht, map, b, p, hash);
> +    qemu_spin_unlock(&b->lock);
> +
> +    if ((ht->mode & QHT_MODE_AUTO_RESIZE) &&
> +        unlikely(map->n_items > map->n_items_threshold)) {
> +        qht_grow(ht);
> +    }
> +}
> +
> +/* call with b->lock held */
> +static inline bool qht_remove__locked(struct qht_map *map, struct qht_bucket *b,
> +                                      const void *p, uint32_t hash)
> +{
> +    int i;
> +
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (b->hashes[i] == hash && b->pointers[i] == p) {
> +                atomic_set(&b->hashes[i], 0);
> +                atomic_set(&b->pointers[i], NULL);
> +                map->n_items--;
> +                return true;
> +            }
> +        }
> +        b = b->next;
> +    } while (b);
> +    return false;
> +}
> +
> +/* call with an external lock held */
> +bool qht_remove(struct qht *ht, const void *p, uint32_t hash)
> +{
> +    struct qht_map *map = ht->map;
> +    struct qht_bucket *b = qht_map_to_bucket(map, hash);
> +    bool ret;
> +
> +    qemu_spin_lock(&b->lock);
> +    seqlock_write_begin(&b->sequence);
> +    ret = qht_remove__locked(map, b, p, hash);
> +    seqlock_write_end(&b->sequence);
> +    qemu_spin_unlock(&b->lock);
> +    return ret;
> +}
> +
> +/*
> + * acquire all spinlocks from a map, so that writers cannot race with
> + * readers-turned-writers.
> + */
> +static void qht_lock(struct qht *ht)
> +{
> +    struct qht_map *map = ht->map;
> +    size_t i;
> +
> +    /* if readers cannot upgrade, do nothing */
> +    if (!(ht->mode & QHT_MODE_MRU_LOOKUP)) {
> +        return;
> +    }
> +    for (i = 0; i < map->n; i++) {
> +        struct qht_bucket *b = &map->buckets[i];
> +
> +        qemu_spin_lock(&b->lock);
> +    }
> +}
> +
> +static void qht_unlock(struct qht *ht)
> +{
> +    struct qht_map *map = ht->map;
> +    size_t i;
> +
> +    if (!(ht->mode & QHT_MODE_MRU_LOOKUP)) {
> +        return;
> +    }
> +    for (i = 0; i < map->n; i++) {
> +        struct qht_bucket *b = &map->buckets[i];
> +
> +        qemu_spin_unlock(&b->lock);
> +    }
> +}
> +
> +/* external lock + all of the map's locks held (if !MRU_LOOKUP) */
> +static inline void qht_map_iter__locked(struct qht *ht, struct qht_map *map,
> +                                        qht_iter_func_t func, void *userp)
> +{
> +    size_t i;
> +
> +    for (i = 0; i < map->n; i++) {
> +        struct qht_bucket *b = &map->buckets[i];
> +        int j;
> +
> +        do {
> +            for (j = 0; j < QHT_BUCKET_ENTRIES; j++) {
> +                if (b->hashes[j]) {
> +                    func(ht, b->pointers[j], b->hashes[j], userp);
> +                }
> +            }
> +            b = b->next;
> +        } while (b);
> +    }
> +}
> +
> +/* call with an external lock held */
> +void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp)
> +{
> +    qht_lock(ht);
> +    qht_map_iter__locked(ht, ht->map, func, userp);
> +    qht_unlock(ht);
> +}
> +
> +static void qht_map_copy(struct qht *ht, void *p, uint32_t hash, void *userp)
> +{
> +    struct qht_map *new = userp;
> +    struct qht_bucket *b = qht_map_to_bucket(new, hash);
> +
> +    /* no need to acquire b->lock because no thread has seen this map yet */
> +    qht_insert__locked(ht, new, b, p, hash);
> +}
> +
> +/* call with an external lock held */
> +void qht_grow(struct qht *ht)
> +{
> +    struct qht_map *old = ht->map;
> +    struct qht_map *new;
> +    size_t n = old->n * 2;
> +
> +    new = qht_map_create(n);
> +    qht_iter(ht, qht_map_copy, new);
> +
> +    qht_publish(ht, new);
> +    call_rcu1(&old->rcu, qht_map_reclaim);
> +}
> +
> +/*
> + * Returns the number of buckets that an average lookup will traverse.
> + * The return value is always >= 1. With a good hashing function, this
> + * value should be close to 1.
> + * Note that each bucket tracks up to QHT_BUCKET_ENTRIES items.
> + */
> +double qht_avg_bucket_chain_length(struct qht *ht, size_t *n_head_buckets)
> +{
> +    struct qht_map *map;
> +    size_t count = 0;
> +    size_t i;
> +
> +    map = atomic_read(&ht->map);
> +    /* paired with smp_wmb() before setting ht->map */
> +    smp_rmb();
> +
> +    for (i = 0; i < map->n; i++) {
> +        struct qht_bucket *head = &map->buckets[i];
> +        struct qht_bucket *b;
> +        size_t bucket_count;
> +        uint32_t version;
> +
> +        do {
> +            version = seqlock_read_begin(&head->sequence);
> +            bucket_count = 0;
> +            b = head;
> +            do {
> +                bucket_count++;
> +                b = b->next;
> +            } while (b);
> +        } while (seqlock_read_retry(&head->sequence, version));
> +        count += bucket_count;
> +    }
> +    if (n_head_buckets) {
> +        *n_head_buckets = map->n;
> +    }
> +    return (double)count / map->n;
> +}


--
Alex Bennée

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 09/11] qht: add test program
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 09/11] qht: add test program Emilio G. Cota
@ 2016-04-22 14:35   ` Alex Bennée
  0 siblings, 0 replies; 45+ messages in thread
From: Alex Bennée @ 2016-04-22 14:35 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  tests/.gitignore |   1 +
>  tests/Makefile   |   6 ++-
>  tests/test-qht.c | 139 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 145 insertions(+), 1 deletion(-)
>  create mode 100644 tests/test-qht.c
>
> diff --git a/tests/.gitignore b/tests/.gitignore
> index 9eed229..d095e05 100644
> --- a/tests/.gitignore
> +++ b/tests/.gitignore
> @@ -48,6 +48,7 @@ test-qapi-visit.[ch]
>  test-qdev-global-props
>  test-qemu-opts
>  test-qga
> +test-qht
>  test-qmp-commands
>  test-qmp-commands.h
>  test-qmp-event
> diff --git a/tests/Makefile b/tests/Makefile
> index 9de9598..0568ac2 100644
> --- a/tests/Makefile
> +++ b/tests/Makefile
> @@ -68,6 +68,8 @@ check-unit-y += tests/rcutorture$(EXESUF)
>  gcov-files-rcutorture-y = util/rcu.c
>  check-unit-y += tests/test-rcu-list$(EXESUF)
>  gcov-files-test-rcu-list-y = util/rcu.c
> +check-unit-y += tests/test-qht$(EXESUF)
> +gcov-files-test-qht-y = util/qht.c
>  check-unit-y += tests/test-bitops$(EXESUF)
>  check-unit-$(CONFIG_HAS_GLIB_SUBPROCESS_TESTS) += tests/test-qdev-global-props$(EXESUF)
>  check-unit-y += tests/check-qom-interface$(EXESUF)
> @@ -387,7 +389,8 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
>  	tests/test-qmp-commands.o tests/test-visitor-serialization.o \
>  	tests/test-x86-cpuid.o tests/test-mul64.o tests/test-int128.o \
>  	tests/test-opts-visitor.o tests/test-qmp-event.o \
> -	tests/rcutorture.o tests/test-rcu-list.o
> +	tests/rcutorture.o tests/test-rcu-list.o \
> +	tests/test-qht.o
>
>  $(test-obj-y): QEMU_INCLUDES += -Itests
>  QEMU_CFLAGS += -I$(SRC_PATH)/tests
> @@ -425,6 +428,7 @@ tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o
>  tests/test-int128$(EXESUF): tests/test-int128.o
>  tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
>  tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
> +tests/test-qht$(EXESUF): tests/test-qht.o $(test-util-obj-y)
>
>  tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
>  	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
> diff --git a/tests/test-qht.c b/tests/test-qht.c
> new file mode 100644
> index 0000000..47024c6
> --- /dev/null
> +++ b/tests/test-qht.c
> @@ -0,0 +1,139 @@
> +#include "qemu/osdep.h"
> +#include <glib.h>
> +#include "qemu/qht.h"
> +#include "exec/tb-hash-xx.h"
> +
> +#define N 5000
> +
> +static struct qht ht;
> +static int32_t arr[N];
> +
> +/*
> + * We might be tempted to use val as the hash. However, val
> + * could be 0, and all hashes passed to qht must be !0.
> + */
> +static inline uint32_t hash_func(uint32_t val)
> +{
> +    return tb_hash_func5(val, 0, 0);
> +}
> +
> +static bool is_equal(const void *obj, const void *userp)
> +{
> +    const int32_t *a = obj;
> +    const int32_t *b = userp;
> +
> +    return *a == *b;
> +}
> +
> +static void insert(int a, int b)
> +{
> +    int i;
> +
> +    for (i = a; i < b; i++) {
> +        uint32_t hash;
> +
> +        arr[i] = i;
> +        hash = hash_func(i);
> +
> +        qht_insert(&ht, &arr[i], hash);
> +    }
> +}
> +
> +static void rm(int init, int end)
> +{
> +    int i;
> +
> +    for (i = init; i < end; i++) {
> +        uint32_t hash;
> +
> +        hash = hash_func(arr[i]);
> +        g_assert_true(qht_remove(&ht, &arr[i], hash));
> +    }
> +}
> +
> +static void check(int a, int b, bool expected)
> +{
> +    double avg_chain;
> +    size_t n_head_buckets;
> +    int i;
> +
> +    for (i = a; i < b; i++) {
> +        void *p;
> +        uint32_t hash;
> +        int32_t val;
> +
> +        val = i;
> +        hash = hash_func(i);
> +        p = qht_lookup(&ht, is_equal, &val, hash);
> +        g_assert_true(!!p == expected);
> +    }
> +    avg_chain = qht_avg_bucket_chain_length(&ht, &n_head_buckets);
> +    g_assert_cmpfloat(avg_chain, >=, 1.0);
> +    g_assert_cmpuint(n_head_buckets, >, 0);
> +}
> +
> +static void count_func(struct qht *ht, void *p, uint32_t hash, void *userp)
> +{
> +    unsigned int *curr = userp;
> +
> +    (*curr)++;
> +}
> +
> +static void iter_check(unsigned int count)
> +{
> +    unsigned int curr = 0;
> +
> +    qht_iter(&ht, count_func, &curr);
> +    g_assert_cmpuint(curr, ==, count);
> +}
> +
> +static void qht_test(unsigned int mode)
> +{
> +    qht_init(&ht, 0, mode);
> +
> +    insert(0, N);
> +    check(0, N, true);
> +    check(-N, -1, false);
> +    iter_check(N);
> +    rm(1, 2);
> +    qht_reset_size(&ht, 0);
> +    check(0, N, false);
> +
> +    qht_destroy(&ht);
> +}
> +
> +static void test_default(void)
> +{
> +    qht_test(0);
> +}
> +
> +static void test_mru_lookup(void)
> +{
> +    qht_test(QHT_MODE_MRU_LOOKUP);
> +}
> +
> +static void test_mru_all(void)
> +{
> +    qht_test(QHT_MODE_MRU_LOOKUP | QHT_MODE_MRU_INSERT);
> +}
> +
> +static void test_mru_all_resize(void)
> +{
> +    qht_test(QHT_MODE_MRU_LOOKUP | QHT_MODE_MRU_INSERT | QHT_MODE_AUTO_RESIZE);
> +}
> +
> +static void test_mru_insert_resize(void)
> +{
> +    qht_test(QHT_MODE_AUTO_RESIZE | QHT_MODE_MRU_INSERT);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +    g_test_init(&argc, &argv, NULL);
> +    g_test_add_func("/qht/mode/default", test_default);
> +    g_test_add_func("/qht/mode/mru_lookup", test_mru_lookup);
> +    g_test_add_func("/qht/mode/mru_all", test_mru_all);
> +    g_test_add_func("/qht/mode/mru_all_with_resize", test_mru_all_resize);
> +    g_test_add_func("/qht/mode/mru_insert_with_resize", test_mru_insert_resize);
> +    return g_test_run();
> +}


--
Alex Bennée

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
  2016-04-20 15:21   ` Richard Henderson
@ 2016-04-22 14:38   ` Alex Bennée
  2016-04-22 17:41   ` Richard Henderson
  2 siblings, 0 replies; 45+ messages in thread
From: Alex Bennée @ 2016-04-22 14:38 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> Suggested-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>


> ---
>  translate-all.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/translate-all.c b/translate-all.c
> index 617a572..769bffc 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -1664,6 +1664,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>  {
>      int i, target_code_size, max_target_code_size;
>      int direct_jmp_count, direct_jmp2_count, cross_page;
> +    double ht_avg_len;
> +    size_t ht_heads;
>      TranslationBlock *tb;
>
>      target_code_size = 0;
> @@ -1715,6 +1717,9 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>                  direct_jmp2_count,
>                  tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp2_count * 100) /
>                          tcg_ctx.tb_ctx.nb_tbs : 0);
> +    ht_avg_len = qht_avg_bucket_chain_length(&tcg_ctx.tb_ctx.htable, &ht_heads);
> +    cpu_fprintf(f, "TB hash avg chain   %0.5f buckets\n", ht_avg_len);
> +    cpu_fprintf(f, "TB hash size        %zu head buckets\n", ht_heads);
>      cpu_fprintf(f, "\nStatistics:\n");
>      cpu_fprintf(f, "TB flush count      %d\n", tcg_ctx.tb_ctx.tb_flush_count);
>      cpu_fprintf(f, "TB invalidate count %d\n",


--
Alex Bennée

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 01/11] compiler.h: add QEMU_ALIGNED() to enforce struct alignment
  2016-04-22  9:35   ` Peter Maydell
@ 2016-04-22 15:50     ` Emilio G. Cota
  0 siblings, 0 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-22 15:50 UTC (permalink / raw)
  To: Peter Maydell
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson, Sergey Fedorov

On Fri, Apr 22, 2016 at 10:35:42 +0100, Peter Maydell wrote:
> On 20 April 2016 at 00:07, Emilio G. Cota <cota@braap.org> wrote:
> > +#define QEMU_ALIGNED(B) __attribute__((aligned(B)))
> 
> A rather trivial thing, but if we have to respin this series
> for some other reason could we use a different macro parameter
> than 'B', please? I had to re-read the patch carefully before
> I realised that (a) it wasn't "aligned(8)" and (b) it wasn't a
> typo for "aligned(8)" either...

Heh sure, will use X.
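
For example (just a sketch of how the respin would look):

    #define QEMU_ALIGNED(X) __attribute__((aligned(X)))

    /* usage stays the same, e.g. keep each bucket on its own cache line */
    struct qht_bucket { ... } QEMU_ALIGNED(QHT_BUCKET_ALIGN);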

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
  2016-04-20 15:21   ` Richard Henderson
  2016-04-22 14:38   ` Alex Bennée
@ 2016-04-22 17:41   ` Richard Henderson
  2016-04-22 19:23     ` Emilio G. Cota
  2016-04-22 19:59     ` Richard Henderson
  2 siblings, 2 replies; 45+ messages in thread
From: Richard Henderson @ 2016-04-22 17:41 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Sergey Fedorov

On 04/19/2016 04:07 PM, Emilio G. Cota wrote:
> +    ht_avg_len = qht_avg_bucket_chain_length(&tcg_ctx.tb_ctx.htable, &ht_heads);
> +    cpu_fprintf(f, "TB hash avg chain   %0.5f buckets\n", ht_avg_len);
> +    cpu_fprintf(f, "TB hash size        %zu head buckets\n", ht_heads);


I think the accounting is questionable here.

Consider the following data:

TB count             230467/671088
TB invalidate count  25915
TB hash avg chain    1.03073 buckets
TB hash size         131072 head buckets

This means that we've got 230467 - 25915 = 204552 active TB's, installed into a
hash table with 131072 heads.  For a perfectly uniform distribution of TBs,
that would be an average chain length of 204552 / 131072 = 1.56.

In order to get the average down to 1.03, there need to be a substantial number
of heads with zero entries.

I think perhaps it might be more enlightening to separately account for empty
and non-empty heads.  E.g.

TB hash buckets used  xxxx/131072
TB hash avg chain     yyyy

where xxxx is the number of non-empty heads, and yyyy = |TBs| / xxxx.

I also wonder if it wouldn't be better to size the hash table as appropriate
for the maximum number of allowable TBs.


r~

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump
  2016-04-22 17:41   ` Richard Henderson
@ 2016-04-22 19:23     ` Emilio G. Cota
  2016-04-22 19:59     ` Richard Henderson
  1 sibling, 0 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-22 19:23 UTC (permalink / raw)
  To: Richard Henderson
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Peter Maydell, Sergey Fedorov

On Fri, Apr 22, 2016 at 10:41:25 -0700, Richard Henderson wrote:
> On 04/19/2016 04:07 PM, Emilio G. Cota wrote:
> > +    ht_avg_len = qht_avg_bucket_chain_length(&tcg_ctx.tb_ctx.htable, &ht_heads);
> > +    cpu_fprintf(f, "TB hash avg chain   %0.5f buckets\n", ht_avg_len);
> > +    cpu_fprintf(f, "TB hash size        %zu head buckets\n", ht_heads);
> 
> I think the accounting is questionable here.
> 
> Consider the following data:
> 
> TB count             230467/671088
> TB invalidate count  25915
> TB hash avg chain    1.03073 buckets
> TB hash size         131072 head buckets
> 
> This means that we've got 230467 - 25915 = 204552 active TB's, installed into a
> hash table with 131072 heads.  For a perfectly uniform distribution of TBs,
> that would be an average chain length of 204552 / 131072 = 1.56.
>
> In order to get the average down to 1.03, there need to be a substantial number
> of heads with zero entries.

Remember that each head can host up to QHT_BUCKET_ENTRIES, so that
1) the bucket fits in a cache line
2) resizing the hash table is fast since the full hashes are also stored
   in the bucket
3) full comparisons (i.e. calls to the user-provided cmp() function) are only
   due to full hash collisions, not due to the hash table being
   undersized.
QHT_BUCKET_ENTRIES is 4 entries on 64-bit hosts, and 6 entries on 32-bit.

For your example hash table, assuming we're on 64-bit and a uniform
distribution, occupancy would be
  204552 / 131072 / 4 = 0.39

The confusion comes from the fact that qht_avg_bucket_chain_length()
measures chain length in buckets, not in entries:
/*
 * Returns the number of buckets that an average lookup will traverse.
 * The return value is always >= 1. With a good hashing function, this
 * value should be close to 1.
 * Note that each bucket tracks up to QHT_BUCKET_ENTRIES items.
 */
double qht_avg_bucket_chain_length(struct qht *ht, size_t *n_head_buckets)

The rationale is that all entries in the same bucket can be found in O(1):
even if some of a bucket's slots are empty, a lookup still performs
QHT_BUCKET_ENTRIES comparisons, since a removal simply sets bucket[entry]
to NULL without touching the other entries (in that bucket or in any other
chained bucket). This is done to limit churn on removals (otherwise the
corresponding bucket chain would have to be traversed and compacted).
However, given the relatively low number of removals that we have,
compacting on removal might be worth exploring.

> I also wonder if it wouldn't be better to size the hash table as appropriate
> for the maximum number of allowable TBs.

This would hurt performance for workloads that have very low numbers of TBs
and that go to the hash table often, such as booting a minimal ARM Linux image.

		Emilio

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump
  2016-04-22 17:41   ` Richard Henderson
  2016-04-22 19:23     ` Emilio G. Cota
@ 2016-04-22 19:59     ` Richard Henderson
  2016-04-22 23:57       ` Emilio G. Cota
  1 sibling, 1 reply; 45+ messages in thread
From: Richard Henderson @ 2016-04-22 19:59 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Sergey Fedorov

[-- Attachment #1: Type: text/plain, Size: 2222 bytes --]

On 04/22/2016 10:41 AM, Richard Henderson wrote:
> On 04/19/2016 04:07 PM, Emilio G. Cota wrote:
>> +    ht_avg_len = qht_avg_bucket_chain_length(&tcg_ctx.tb_ctx.htable, &ht_heads);
>> +    cpu_fprintf(f, "TB hash avg chain   %0.5f buckets\n", ht_avg_len);
>> +    cpu_fprintf(f, "TB hash size        %zu head buckets\n", ht_heads);
> 
> 
> I think the accounting is questionable here.
> 
> Consider the following data:
> 
> TB count             230467/671088
> TB invalidate count  25915
> TB hash avg chain    1.03073 buckets
> TB hash size         131072 head buckets
> 
> This means that we've got 230467 - 25915 = 204552 active TB's, installed into a
> hash table with 131072 heads.  For a perfectly uniform distribution of TBs,
> that would be an average chain length of 204552 / 131072 = 1.56.
> 
> In order to get the average down to 1.03, there need to be a substantial number
> of heads with zero entries.
> 
> I think perhaps it might be more enlightening to separately account for empty
> and non-empty heads.  E.g.
> 
> TB hash buckets used  xxxx/131072
> TB hash avg chain     yyyy
> 
> where xxxx is the number of non-empty heads, and yyyy = |TBs| / xxxx.
> 
> I also wonder if it wouldn't be better to size the hash table as appropriate
> for the maximum number of allowable TBs.

FWIW, so that I could get an idea of how the stats change as we improve the
hashing, I inserted the patch in attachment 1 between patches 5 and 6, with
attachment 2 attempting to fix the accounting for patches 9 and 10.

For booting an alpha kernel to login prompt:

Before hashing changes (@5/11)

TB count             175363/671088
TB invalidate count  3996
TB hash buckets      31731/32768
TB hash avg chain    5.289 max=59

After xxhash patch (@7/11)

TB hash buckets      32582/32768
TB hash avg chain    5.260 max=18

So far so good!

After qht patches (@11/11)

TB hash buckets      94360/131072
TB hash avg chain    1.774 max=8

Do note that those last numbers are off: 1.774 avg * 94360 used buckets =
167394 total entries, which is far from 171367, the correct number of total
entries.

I'm tempted to pull over gcc's non-chaining hash table implementation
(libiberty/hashtab.c, still gplv2+) and compare...



r~

[-- Attachment #2: z1.txt --]
[-- Type: text/plain, Size: 4375 bytes --]

From cdc7b3631fd78bd2e31d2823f7543e2a56681149 Mon Sep 17 00:00:00 2001
From: Richard Henderson <rth@twiddle.net>
Date: Fri, 22 Apr 2016 11:28:52 -0700
Subject: translate-all: Add hashtable accounting to info jit

Dump hash table occupancy numbers with "info jit".

Signed-off-by: Richard Henderson <rth@twiddle.net>

diff --git a/translate-all.c b/translate-all.c
index 1a8f68b..ed296d5 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1671,39 +1671,55 @@ void tb_flush_jmp_cache(CPUState *cpu, target_ulong addr)
 
 void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
 {
-    int i, target_code_size, max_target_code_size;
-    int direct_jmp_count, direct_jmp2_count, cross_page;
+    size_t target_code_size, max_target_code_size;
+    unsigned direct_jmp_count, direct_jmp2_count, cross_page;
+    unsigned used_buckets, max_chain, hash_tbs;
     TranslationBlock *tb;
+    int i;
 
     target_code_size = 0;
     max_target_code_size = 0;
     cross_page = 0;
     direct_jmp_count = 0;
     direct_jmp2_count = 0;
-    for (i = 0; i < tcg_ctx.tb_ctx.nb_tbs; i++) {
-        tb = &tcg_ctx.tb_ctx.tbs[i];
-        target_code_size += tb->size;
-        if (tb->size > max_target_code_size) {
-            max_target_code_size = tb->size;
-        }
-        if (tb->page_addr[1] != -1) {
-            cross_page++;
-        }
-        if (tb->tb_next_offset[0] != 0xffff) {
-            direct_jmp_count++;
-            if (tb->tb_next_offset[1] != 0xffff) {
-                direct_jmp2_count++;
+    used_buckets = 0;
+    hash_tbs = 0;
+    max_chain = 0;
+
+    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
+        if (tcg_ctx.tb_ctx.tb_phys_hash[i]) {
+            unsigned this_chain = 0;
+            for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
+                 tb = tb->phys_hash_next) {
+                this_chain++;
+                hash_tbs++;
+                target_code_size += tb->size;
+                if (tb->page_addr[1] != -1) {
+                    cross_page++;
+                }
+                if (tb->tb_next_offset[0] != 0xffff) {
+                    direct_jmp_count++;
+                    if (tb->tb_next_offset[1] != 0xffff) {
+                        direct_jmp2_count++;
+                    }
+                }
             }
+            if (this_chain > max_chain) {
+                max_chain = this_chain;
+            }
+            used_buckets++;
         }
     }
-    /* XXX: avoid using doubles ? */
+    assert(hash_tbs ==
+           tcg_ctx.tb_ctx.nb_tbs - tcg_ctx.tb_ctx.tb_phys_invalidate_count);
+
     cpu_fprintf(f, "Translation buffer state:\n");
     cpu_fprintf(f, "gen code size       %td/%zd\n",
                 tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
                 tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer);
     cpu_fprintf(f, "TB count            %d/%d\n",
             tcg_ctx.tb_ctx.nb_tbs, tcg_ctx.code_gen_max_blocks);
-    cpu_fprintf(f, "TB avg target size  %d max=%d bytes\n",
+    cpu_fprintf(f, "TB avg target size  %zd max=%zd bytes\n",
             tcg_ctx.tb_ctx.nb_tbs ? target_code_size /
                     tcg_ctx.tb_ctx.nb_tbs : 0,
             max_target_code_size);
@@ -1717,13 +1733,18 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     cpu_fprintf(f, "cross page TB count %d (%d%%)\n", cross_page,
             tcg_ctx.tb_ctx.nb_tbs ? (cross_page * 100) /
                                     tcg_ctx.tb_ctx.nb_tbs : 0);
-    cpu_fprintf(f, "direct jump count   %d (%d%%) (2 jumps=%d %d%%)\n",
+    cpu_fprintf(f, "direct jump count   %u (%u%%) (2 jumps=%u %u%%)\n",
                 direct_jmp_count,
                 tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp_count * 100) /
                         tcg_ctx.tb_ctx.nb_tbs : 0,
                 direct_jmp2_count,
                 tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp2_count * 100) /
                         tcg_ctx.tb_ctx.nb_tbs : 0);
+    cpu_fprintf(f, "TB hash buckets     %u/%d\n",
+                used_buckets, CODE_GEN_PHYS_HASH_SIZE);
+    cpu_fprintf(f, "TB hash avg chain   %0.3f max=%u\n",
+                used_buckets ? (double)hash_tbs / used_buckets : 0.0,
+                max_chain);
     cpu_fprintf(f, "\nStatistics:\n");
     cpu_fprintf(f, "TB flush count      %d\n", tcg_ctx.tb_ctx.tb_flush_count);
     cpu_fprintf(f, "TB invalidate count %d\n",

[-- Attachment #3: z2.txt --]
[-- Type: text/plain, Size: 5743 bytes --]

From 7f1d677f3d085b5891e1adbd5f602185d68ba81a Mon Sep 17 00:00:00 2001
From: Richard Henderson <rth@twiddle.net>
Date: Fri, 22 Apr 2016 12:50:00 -0700
Subject: fixup to {09,10,11}/11


diff --git a/include/qemu/qht.h b/include/qemu/qht.h
index a0a1aa8..2d0b58f 100644
--- a/include/qemu/qht.h
+++ b/include/qemu/qht.h
@@ -49,6 +49,14 @@ void qht_grow(struct qht *ht);
 
 void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
                  uint32_t hash);
-double qht_avg_bucket_chain_length(struct qht *ht, size_t *n_head_buckets);
+
+struct qht_stats {
+    size_t used_buckets;
+    size_t max_buckets;
+    size_t used_entries;
+    size_t max_chain;
+};
+
+struct qht_stats qht_statistics(struct qht *ht);
 
 #endif /* QHT_H */
diff --git a/translate-all.c b/translate-all.c
index 3b73b46..a9ceb0a 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1664,8 +1664,7 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
 {
     size_t target_code_size, max_target_code_size;
     unsigned direct_jmp_count, direct_jmp2_count, cross_page;
-    unsigned used_buckets, max_chain, hash_tbs;
-    TranslationBlock *tb;
+    struct qht_stats hinfo;
     int i;
 
     target_code_size = 0;
@@ -1673,35 +1672,23 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     cross_page = 0;
     direct_jmp_count = 0;
     direct_jmp2_count = 0;
-    used_buckets = 0;
-    hash_tbs = 0;
-    max_chain = 0;
-
-    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
-        if (tcg_ctx.tb_ctx.tb_phys_hash[i]) {
-            unsigned this_chain = 0;
-            for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
-                 tb = tb->phys_hash_next) {
-                this_chain++;
-                hash_tbs++;
-                target_code_size += tb->size;
-                if (tb->page_addr[1] != -1) {
-                    cross_page++;
-                }
-                if (tb->tb_next_offset[0] != 0xffff) {
-                    direct_jmp_count++;
-                    if (tb->tb_next_offset[1] != 0xffff) {
-                        direct_jmp2_count++;
-                    }
-                }
-            }
-            if (this_chain > max_chain) {
-                max_chain = this_chain;
+
+    for (i = 0; i < tcg_ctx.tb_ctx.nb_tbs; i++) {
+        const TranslationBlock *tb = &tcg_ctx.tb_ctx.tbs[i];
+        target_code_size += tb->size;
+        if (tb->page_addr[1] != -1) {
+            cross_page++;
+        }
+        if (tb->tb_next_offset[0] != 0xffff) {
+            direct_jmp_count++;
+            if (tb->tb_next_offset[1] != 0xffff) {
+                direct_jmp2_count++;
             }
-            used_buckets++;
         }
     }
-    assert(hash_tbs ==
+
+    hinfo = qht_statistics(&tcg_ctx.tb_ctx.htable);
+    assert(hinfo.used_entries ==
            tcg_ctx.tb_ctx.nb_tbs - tcg_ctx.tb_ctx.tb_phys_invalidate_count);
 
     cpu_fprintf(f, "Translation buffer state:\n");
@@ -1731,11 +1718,12 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
                 direct_jmp2_count,
                 tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp2_count * 100) /
                         tcg_ctx.tb_ctx.nb_tbs : 0);
-    cpu_fprintf(f, "TB hash buckets     %u/%d\n",
-                used_buckets, CODE_GEN_PHYS_HASH_SIZE);
-    cpu_fprintf(f, "TB hash avg chain   %0.3f max=%u\n",
-                used_buckets ? (double)hash_tbs / used_buckets : 0.0,
-                max_chain);
+    cpu_fprintf(f, "TB hash buckets     %zu/%zu\n",
+                hinfo.used_buckets, hinfo.max_buckets);
+    cpu_fprintf(f, "TB hash avg chain   %0.3f max=%zu\n",
+                hinfo.used_buckets
+                ? (double)hinfo.used_entries / hinfo.used_buckets : 0.0,
+                hinfo.max_chain);
     cpu_fprintf(f, "\nStatistics:\n");
     cpu_fprintf(f, "TB flush count      %d\n", tcg_ctx.tb_ctx.tb_flush_count);
     cpu_fprintf(f, "TB invalidate count %d\n",
diff --git a/util/qht.c b/util/qht.c
index 05ea5e8..535057b 100644
--- a/util/qht.c
+++ b/util/qht.c
@@ -556,35 +556,45 @@ void qht_grow(struct qht *ht)
  * value should be close to 1.
  * Note that each bucket tracks up to QHT_BUCKET_ENTRIES items.
  */
-double qht_avg_bucket_chain_length(struct qht *ht, size_t *n_head_buckets)
+struct qht_stats qht_statistics(struct qht *ht)
 {
     struct qht_map *map;
-    size_t count = 0;
-    size_t i;
+    struct qht_stats s = {};
+    size_t i, n;
 
     map = atomic_read(&ht->map);
     /* paired with smp_wmb() before setting ht->map */
     smp_rmb();
+    s.max_buckets = n = map->n;
 
-    for (i = 0; i < map->n; i++) {
+    for (i = 0; i < n; i++) {
         struct qht_bucket *head = &map->buckets[i];
         struct qht_bucket *b;
-        size_t bucket_count;
+        size_t this_chain;
         uint32_t version;
 
         do {
             version = seqlock_read_begin(&head->sequence);
-            bucket_count = 0;
+            this_chain = 0;
             b = head;
             do {
-                bucket_count++;
+                int j;
+                for (j = 0; j < QHT_BUCKET_ENTRIES; j++) {
+                    if (b->hashes[j]) {
+                        this_chain++;
+                    }
+                }
                 b = b->next;
             } while (b);
         } while (seqlock_read_retry(&head->sequence, version));
-        count += bucket_count;
-    }
-    if (n_head_buckets) {
-        *n_head_buckets = map->n;
+        if (this_chain != 0) {
+            s.used_entries += this_chain;
+            if (s.max_chain < this_chain) {
+                s.max_chain = this_chain;
+            }
+            s.used_buckets++;
+        }
     }
-    return (double)count / map->n;
+
+    return s;
 }

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump
  2016-04-22 19:59     ` Richard Henderson
@ 2016-04-22 23:57       ` Emilio G. Cota
  2016-04-24 19:46         ` Richard Henderson
  0 siblings, 1 reply; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-22 23:57 UTC (permalink / raw)
  To: Richard Henderson
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Peter Maydell, Sergey Fedorov

On Fri, Apr 22, 2016 at 12:59:52 -0700, Richard Henderson wrote:
> FWIW, so that I could get an idea of how the stats change as we improve the
> hashing, I inserted the attachment 1 patch between patches 5 and 6, and with
> attachment 2 attempting to fix the accounting for patches 9 and 10.

For qht, I dislike the approach of reporting "avg chain" per-element,
instead of per-bucket. Performance for a bucket whose entries are
all valid is virtually the same as that of a bucket that only
has one valid element; thus, with per-bucket reporting, we'd say that
the chain length is 1 in both cases, i.e. "perfect". With per-element
reporting, we'd report 4 (on a 64-bit host, since that's the value of
QHT_BUCKET_ENTRIES) when the bucket is full, which IMO gives the
wrong idea (users would think they're in trouble, when they're not).

Using the avg-bucket-chain metric you can test how good the hashing is.
For instance, the metric is 1.01 for xxhash with phys_pc, pc and flags
(i.e. func5), and 1.21 if func5 takes only a valid phys_pc (the other two are 0).

I think reporting fully empty buckets as well as the longest chain
(of buckets for qht) in addition to this metric is a good idea, though.

> For booting an alpha kernel to login prompt:
> 
> Before hashing changes (@5/11)
> 
> TB count             175363/671088
> TB invalidate count  3996
> TB hash buckets      31731/32768
> TB hash avg chain    5.289 max=59
> 
> After xxhash patch (@7/11)
> 
> TB hash buckets      32582/32768
> TB hash avg chain    5.260 max=18
> 
> So far so good!
> 
> After qht patches (@11/11)
> 
> TB hash buckets      94360/131072
> TB hash avg chain    1.774 max=8
> 
> Do note that those last numbers are off: 1.774 avg * 94360 used buckets =
> 167394 total entries, which is far from 171367, the correct number of total
> entries.

If those numbers are off, then either this
    assert(hinfo.used_entries ==
           tcg_ctx.tb_ctx.nb_tbs - tcg_ctx.tb_ctx.tb_phys_invalidate_count);
should trigger, or the accounting isn't right.

Another possibility is that the "TB count - invalidate_count" is different
for each test you ran. I think this is what's going on, otherwise we couldn't
explain why the first report ("before 5/11") is also "wrong":

  5.289*31731=167825.259

Only the second report ("after 7/11") seems good (taking into account
lack of precision of just 3 decimals):
  5.26*32582=171381.32 ~= 171367
which leads me to believe that you've used the TB and invalidate
counts from that test.

I just tested your patches (on an ARM bootup) and the assert doesn't trigger,
and the stats are spot on for "after 11/11":

TB count            643610/2684354
TB hash buckets     369534/524288
TB hash avg chain   1.729 max=8
TB flush count      0
TB invalidate count 4718

1.729*369534=638924.286, which is ~= 643610-4718 = 638892.

> I'm tempted to pull over gcc's non-chaining hash table implementation
> (libiberty/hashtab.c, still gplv2+) and compare...

You can try, but I think performance wouldn't be great, because
the comparison function would be called way too often due to the
ht using open addressing. The problem there is not only the comparisons
themselves, but also all the cache lines needed to read the fields of
the comparison. I haven't tested libiberty's htable but I did test
the htable in concurrencykit[1], which also uses open addressing.

With ck's ht, performance was not good when booting ARM: IIRC ~30% of
runtime was spent on tb_cmp(); I also added the full hash to each TB so
that it would be compared first, but it didn't make a difference since
the delay was due to loading the cache line (I saw this with perf(1)'s
annotated code, which showed that ~80% of the time spent in tb_cmp()
was in performing the first load of the TB's fields).

This led me to a design that had buckets with a small set of
hash & pointer pairs, all in the same cache line as the head (then
I discovered somebody else had thought of this, and that's why there's
a link to the CLHT paper in qht.c).
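
As a back-of-the-envelope check (assuming 4-byte lock and sequence fields):
on a 64-bit host a bucket is 4 + 4 + 4*4 (hashes) + 4*8 (pointers) + 8 (next)
= 64 bytes, and on a 32-bit host, with 6 entries, 4 + 4 + 6*4 + 6*4 + 4 = 60
bytes -- either way a bucket fits in a single 64-byte cache line.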

BTW I tested ck's htable also because of a requirement we have for MTTCG,
which is to support lock-free concurrent lookups. AFAICT libiberty's ht
doesn't support this, so it might be a bit faster than ck's.

Thanks,

		Emilio

[1] http://concurrencykit.org/
    More info on their htable implementation here:
    http://backtrace.io/blog/blog/2015/03/13/workload-specialization/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump
  2016-04-22 23:57       ` Emilio G. Cota
@ 2016-04-24 19:46         ` Richard Henderson
  2016-04-24 22:06           ` Emilio G. Cota
  0 siblings, 1 reply; 45+ messages in thread
From: Richard Henderson @ 2016-04-24 19:46 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Peter Maydell, Sergey Fedorov

On 04/22/2016 04:57 PM, Emilio G. Cota wrote:
> On Fri, Apr 22, 2016 at 12:59:52 -0700, Richard Henderson wrote:
>> FWIW, so that I could get an idea of how the stats change as we improve the
>> hashing, I inserted the attachment 1 patch between patches 5 and 6, and with
>> attachment 2 attempting to fix the accounting for patches 9 and 10.
>
> For qht, I dislike the approach of reporting "avg chain" per-element,
> instead of per-bucket. Performance for a bucket whose entries are
> all valid is virtually the same as that of a bucket that only
> has one valid element; thus, with per-bucket reporting, we'd say that
> the chain lenght is 1 in both cases, i.e. "perfect". With per-element
> reporting, we'd report 4 (on a 64-bit host, since that's the value of
> QHT_BUCKET_ENTRIES) when the bucket is full, which IMO gives the
> wrong idea (users would think they're in trouble, when they're not).

But otherwise you have no way of knowing how full the buckets are.  The bucket 
size is just something that you have to keep in mind.

> If those numbers are off, then either this
>      assert(hinfo.used_entries ==
>             tcg_ctx.tb_ctx.nb_tbs - tcg_ctx.tb_ctx.tb_phys_invalidate_count);
> should trigger, or the accounting isn't right.

I think I used an NDEBUG build, so these weren't effective.

> Only the second report ("after 7/11") seems good (taking into account
> lack of precision of just 3 decimals):
>    5.26*32582=171381.32 ~= 171367
> which leads me to believe that you've used the TB and invalidate
> counts from that test.

The TB and invalidate numbers are repeatable; the same every time.

> You can try, but I think performance wouldn't be great, because
> the comparison function would be called way too often due to the
> ht using open addressing. The problem there is not only the comparisons
> themselves, but also all the cache lines needed to read the fields of
> the comparison. I haven't tested libiberty's htable but I did test
> the htable in concurrencykit[1], which also uses open addressing.

You are right that having the full hash for primary comparison is a big win, 
especially with how complex our comparison functions are.  And you're right 
that we have to have two of them.

> This led me to a design that had buckets with a small set of
> hash & pointer pairs, all in the same cache line as the head (then
> I discovered somebody else had thought of this, and that's why there's
> a link to the CLHT paper in qht.c).

Fair.  It's a good design.


r~

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 08/11] qht: QEMU's fast, resizable and scalable Hash Table
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 08/11] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
  2016-04-22 14:04   ` Alex Bennée
@ 2016-04-24 20:01   ` Richard Henderson
  2016-04-24 21:58     ` Emilio G. Cota
  1 sibling, 1 reply; 45+ messages in thread
From: Richard Henderson @ 2016-04-24 20:01 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Sergey Fedorov

On 04/19/2016 04:07 PM, Emilio G. Cota wrote:
> +static void qht_insert__locked(struct qht *ht, struct qht_map *map,
> +                               struct qht_bucket *head, void *p, uint32_t hash)
> +{
> +    struct qht_bucket *b = head;
> +    struct qht_bucket *prev = NULL;
> +    struct qht_bucket *new = NULL;
> +    int i;
> +
> +    for (;;) {
> +        if (b == NULL) {
> +            b = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*b));
> +            memset(b, 0, sizeof(*b));
> +            new = b;
> +        }
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (b->hashes[i]) {
> +                continue;
> +            }

Surely that's b->pointers[i] != NULL.  We've made no provision that the hash 
function must return non-zero.

> +static inline bool qht_remove__locked(struct qht_map *map, struct qht_bucket *b,
> +                                      const void *p, uint32_t hash)
> +{
> +    int i;
> +
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (b->hashes[i] == hash && b->pointers[i] == p) {

Don't you only need to test p here?


r~

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 08/11] qht: QEMU's fast, resizable and scalable Hash Table
  2016-04-24 20:01   ` Richard Henderson
@ 2016-04-24 21:58     ` Emilio G. Cota
  0 siblings, 0 replies; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-24 21:58 UTC (permalink / raw)
  To: Richard Henderson
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Peter Maydell, Sergey Fedorov

On Sun, Apr 24, 2016 at 13:01:31 -0700, Richard Henderson wrote:
> On 04/19/2016 04:07 PM, Emilio G. Cota wrote:
> >+static void qht_insert__locked(struct qht *ht, struct qht_map *map,
> >+                               struct qht_bucket *head, void *p, uint32_t hash)
> >+{
> >+    struct qht_bucket *b = head;
> >+    struct qht_bucket *prev = NULL;
> >+    struct qht_bucket *new = NULL;
> >+    int i;
> >+
> >+    for (;;) {
> >+        if (b == NULL) {
> >+            b = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*b));
> >+            memset(b, 0, sizeof(*b));
> >+            new = b;
> >+        }
> >+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> >+            if (b->hashes[i]) {
> >+                continue;
> >+            }
> 
> Surely that's b->pointers[i] != NULL.  We've made no provision that the hash
> function must return non-zero.

Good catch! I initially banned 0 from being a valid hash value,
knowing that I'd have to eventually test how stupid this assumption
was, but I forgot to do so.

It turns out that banning any hash value is a stupid idea.
For example, looping over [0,0xffffffff] shows that
xxh32(0x7aac3812, seed=1) == 0.

Will fix this in v4 -- the only assumption will be that the passed
pointer should be !NULL.
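
For illustration, the emptiness test could then look something like this
(hypothetical helper, not the actual v4 code):

    /* Return the index of a free slot in the bucket, or -1 if it is full.
     * Emptiness is keyed off the pointer: any 32-bit value, including 0,
     * is a legitimate hash. */
    static int bucket_find_empty(void * const pointers[], int n)
    {
        int i;

        for (i = 0; i < n; i++) {
            if (pointers[i] == NULL) {
                return i;
            }
        }
        return -1;   /* bucket full: caller allocates and chains a new one */
    }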

> >+static inline bool qht_remove__locked(struct qht_map *map, struct qht_bucket *b,
> >+                                      const void *p, uint32_t hash)
> >+{
> >+    int i;
> >+
> >+    do {
> >+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> >+            if (b->hashes[i] == hash && b->pointers[i] == p) {
> 
> Don't you only need to test p here?

Fair point. Will fix in v4.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump
  2016-04-24 19:46         ` Richard Henderson
@ 2016-04-24 22:06           ` Emilio G. Cota
  2016-04-27  2:43             ` Emilio G. Cota
  0 siblings, 1 reply; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-24 22:06 UTC (permalink / raw)
  To: Richard Henderson
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Peter Maydell, Sergey Fedorov

On Sun, Apr 24, 2016 at 12:46:08 -0700, Richard Henderson wrote:
> On 04/22/2016 04:57 PM, Emilio G. Cota wrote:
> >On Fri, Apr 22, 2016 at 12:59:52 -0700, Richard Henderson wrote:
> >>FWIW, so that I could get an idea of how the stats change as we improve the
> >>hashing, I inserted the attachment 1 patch between patches 5 and 6, and with
> >>attachment 2 attempting to fix the accounting for patches 9 and 10.
> >
> >For qht, I dislike the approach of reporting "avg chain" per-element,
> >instead of per-bucket. Performance for a bucket whose entries are
> >all valid is virtually the same as that of a bucket that only
> >has one valid element; thus, with per-bucket reporting, we'd say that
> >the chain length is 1 in both cases, i.e. "perfect". With per-element
> >reporting, we'd report 4 (on a 64-bit host, since that's the value of
> >QHT_BUCKET_ENTRIES) when the bucket is full, which IMO gives the
> >wrong idea (users would think they're in trouble, when they're not).
> 
> But otherwise you have no way of knowing how full the buckets are.  The
> bucket size is just something that you have to keep in mind.

I'll make some changes in v4 that I think will address both your and
my concerns:
- Report the number of empty buckets
- Do not count empty buckets when reporting avg bucket chain length
- Report average bucket occupancy (in %, so that QHT_BUCKET_ENTRIES
  does not have to be reported.)

> >If those numbers are off, then either this
> >     assert(hinfo.used_entries ==
> >            tcg_ctx.tb_ctx.nb_tbs - tcg_ctx.tb_ctx.tb_phys_invalidate_count);
> >should trigger, or the accounting isn't right.
> 
> I think I used an NDEBUG build, so these weren't effective.
> 
> >Only the second report ("after 7/11") seems good (taking into account
> >the limited precision of just 3 decimals):
> >   5.26*32582=171381.32 ~= 171367
> >which leads me to believe that you've used the TB and invalidate
> >counts from that test.
> 
> The TB and invalidate numbers are repeatable; the same every time.

Then something else is going on, because both the 1st and 3rd tests are
way off. I'd re-test with assertions enabled.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump
  2016-04-24 22:06           ` Emilio G. Cota
@ 2016-04-27  2:43             ` Emilio G. Cota
  2016-04-28 16:37               ` Richard Henderson
  0 siblings, 1 reply; 45+ messages in thread
From: Emilio G. Cota @ 2016-04-27  2:43 UTC (permalink / raw)
  To: Richard Henderson
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Peter Maydell, Sergey Fedorov

On Sun, Apr 24, 2016 at 18:06:51 -0400, Emilio G. Cota wrote:
> On Sun, Apr 24, 2016 at 12:46:08 -0700, Richard Henderson wrote:
> > On 04/22/2016 04:57 PM, Emilio G. Cota wrote:
> > >On Fri, Apr 22, 2016 at 12:59:52 -0700, Richard Henderson wrote:
> > >>FWIW, so that I could get an idea of how the stats change as we improve the
> > >>hashing, I inserted the attachment 1 patch between patches 5 and 6, and with
> > >>attachment 2 attempting to fix the accounting for patches 9 and 10.
> > >
> > >For qht, I dislike the approach of reporting "avg chain" per-element,
> > >instead of per-bucket. Performance for a bucket whose entries are
> > >all valid is virtually the same as that of a bucket that only
> > >has one valid element; thus, with per-bucket reporting, we'd say that
> > >the chain length is 1 in both cases, i.e. "perfect". With per-element
> > >reporting, we'd report 4 (on a 64-bit host, since that's the value of
> > >QHT_BUCKET_ENTRIES) when the bucket is full, which IMO gives the
> > >wrong idea (users would think they're in trouble, when they're not).
> > 
> > But otherwise you have no way of knowing how full the buckets are.  The
> > bucket size is just something that you have to keep in mind.
> 
> I'll make some changes in v4 that I think will address both your and
> my concerns:
> - Report the number of empty buckets
> - Do not count empty buckets when reporting avg bucket chain length
> - Report average bucket occupancy (in %, so that QHT_BUCKET_ENTRIES
>   does not have to be reported.)

How does the following look?

Example with good hashing, i.e. func5(phys_pc, pc, flags):
TB count            704242/1342156
[...]
TB hash buckets     386484/524288 (73.72% used)
TB hash occupancy   32.57% avg chain occupancy. Histogram: 0-10%▆▁█▁▁▅▁▃▁▁90-100%
TB hash avg chain   1.02 buckets. Histogram: 1█▁▁3

Example with bad hashing, i.e. func5(phys_pc, 0, 0):
TB count            710748/1342156
[...]
TB hash buckets     113569/524288 (21.66% used)
TB hash occupancy   10.24% avg chain occupancy. Histogram: 0-10%█▁▁▁▁▁▁▁▁▁90-100%
TB hash avg chain   2.11 buckets. Histogram: 1▇▁▁▁▁▁▁▁▁▁93

Note that:

- "TB hash avg chain" does _not_ count empty buckets. This gives
  an idea of how many buckets a typical hit goes through.

- "TB hash occupancy" _does_ count empty buckets. It is called
  "avg chain occupancy" and not "avg occupancy" because the
  counts are only valid per-chain due to the seqlock protecting
  each chain.
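
In code, that could look roughly like this (counter names and the exact
accounting are assumptions -- the v4 code may differ):

    #include <stddef.h>

    /* Per-chain counters, gathered while walking every head bucket. */
    struct chain_stats {
        size_t buckets;   /* buckets in this chain (>= 1, the head) */
        size_t entries;   /* valid (hash, pointer) pairs in the chain */
    };

    /* "TB hash occupancy": mean of entries/slots over *all* chains,
     * so empty chains drag the average down. */
    static double avg_chain_occupancy(const struct chain_stats *c,
                                      size_t nchains, int entries_per_bucket)
    {
        double sum = 0.0;
        size_t i;

        for (i = 0; i < nchains; i++) {
            sum += (double)c[i].entries / (c[i].buckets * entries_per_bucket);
        }
        return nchains ? 100.0 * sum / nchains : 0.0;
    }

    /* "TB hash avg chain": mean chain length in buckets, over non-empty
     * chains only. */
    static double avg_chain_length(const struct chain_stats *c, size_t nchains)
    {
        size_t buckets = 0, used = 0, i;

        for (i = 0; i < nchains; i++) {
            if (c[i].entries) {
                buckets += c[i].buckets;
                used++;
            }
        }
        return used ? (double)buckets / used : 0.0;
    }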

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 10/11] tb hash: track translated blocks with qht
  2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 10/11] tb hash: track translated blocks with qht Emilio G. Cota
@ 2016-04-28 13:27   ` Alex Bennée
  0 siblings, 0 replies; 45+ messages in thread
From: Alex Bennée @ 2016-04-28 13:27 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> Having a fixed-size hash table for keeping track of all translation blocks
> is suboptimal: some workloads are just too big or too small to get maximum
> performance from the hash table. The MRU promotion policy helps improve
> performance when the hash table is a little undersized, but it cannot
> make up for severely undersized hash tables.
>
> Furthermore, frequent MRU promotions result in writes that are a scalability
> bottleneck. For scalability, lookups should only perform reads, not writes.
> This is not a big deal for now, but it will become one once MTTCG matures.
>
> The appended patch fixes these issues by using qht as the implementation of
> the TB hash table. This solution is superior to other alternatives considered,
> namely:
>
> - master: implementation in QEMU before this patchset
> - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
> - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
>               MRU is implemented here by adding an intermediate struct
>               that contains the u32 hash and a pointer to the TB; this
>               allows us, on an MRU promotion, to copy said struct (that is not
>               at the head), and put this new copy at the head. After a grace
>               period, the original non-head struct can be eliminated, and
>               after another grace period, freed.
> - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
>                    no MRU for lookups; MRU for inserts.
>
> The appended solution is the following:
> - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
>                  no MRU for lookups; MRU for inserts.
>
> The plots below compare the considered solutions. The Y axis shows the
> boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
> sweeps the number of buckets (or initial number of buckets for qht-autoresize).
> The plots in PNG format (and with errorbars) can be seen here:
>   http://imgur.com/a/Awgnq
>
> Each test runs 5 times, and the entire QEMU process is pinned to a
> single core for repeatability of results.
>
>                             Host: Intel Xeon E5-2690
>
>   28 ++------------+-------------+-------------+-------------+------------++
>      A*****        +             +             +             master **A*** +
>   27 ++    *                                                 xxhash ##B###++
>      |      A******A******                               xxhash-rcu $$C$$$ |
>   26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
>      D%%$$                              A******A******A*qht-dyn-mru A*E****A
>   25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
>      B#####%                                                               |
>   24 ++    #C$$$$$                                                        ++
>      |      B###  $                                                        |
>      |          ## C$$$$$$                                                 |
>   23 ++           #       C$$$$$$                                         ++
>      |             B######       C$$$$$$                                %%%D
>   22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
>      |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
>   21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
>      +             E@@@   F&&&   +      E@     +      F&&&   +             +
>   20 ++------------+-------------+-------------+-------------+------------++
>      14            16            18            20            22            24
>                              log2 number of buckets
>
>                                  Host: Intel i7-4790K
>
>   14.5 ++------------+------------+-------------+------------+------------++
>        A**           +            +             +            master **A*** +
>     14 ++ **                                                 xxhash ##B###++
>   13.5 ++   **                                           xxhash-rcu $$C$$$++
>        |                                            qht-fixed-nomru %%D%%% |
>     13 ++     A******                                   qht-dyn-mru @@E@@@++
>        |             A*****A******A******             qht-dyn-nomru &&F&&& |
>   12.5 C$$                               A******A******A*****A******    ***A
>     12 ++ $$                                                        A***  ++
>        D%%% $$                                                             |
>   11.5 ++  %%                                                             ++
>        B###  %C$$$$$$                                                      |
>     11 ++  ## D%%%%% C$$$$$                                               ++
>        |     #      %      C$$$$$$                                         |
>   10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
>     10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
>        +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
>    9.5 ++------------+------------+-------------+------------+------------++
>        14            16           18            20           22            24
>                               log2 number of buckets
>
> Note that the original point before this patch series is X=15 for "master";
> the little sensitivity to the increased number of buckets is due to the
> poor hashing function in master.
>
> xxhash-rcu has significant overhead due to the constant churn of allocating
> and deallocating intermediate structs for implementing MRU. An alternative
> would be to consider failed lookups as "maybe not there", and then
> acquire the external lock (tb_lock in this case) to really confirm that
> there was indeed a failed lookup. This, however, would not be enough
> to implement dynamic resizing--this is more complex: see
> "Resizable, Scalable, Concurrent Hash Tables via Relativistic
> Programming" by Triplett, McKenney and Walpole. This solution was
> discarded due to the very coarse RCU read critical sections that we have
> in MTTCG; resizing requires waiting for readers after every pointer update,
> and resizes require many pointer updates, so this would quickly become
> prohibitive.
>
> qht-fixed-nomru shows that MRU promotion is advisable for undersized
> hash tables.
>
> However, qht-dyn-mru shows that MRU promotion is not important if the
> hash table is properly sized: there is virtually no difference in
> performance between qht-dyn-nomru and qht-dyn-mru.
>
> Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
> X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
> can achieve with optimum sizing of the hash table, while keeping the hash
> table scalable for readers.
>
> The improvement we get before and after this patch for booting debian jessie
> with arm-softmmu is:
>
> - Intel Xeon E5-2690: 10.5% less time
> - Intel i7-4790K: 5.2% less time
>
> We could get this same improvement _for this particular workload_ by
> statically increasing the size of the hash table. But this would hurt
> workloads that do not need a large hash table. The dynamic (upward)
> resizing allows us to start small and enlarge the hash table as needed.
>
> A quick note on downsizing: the table is resized back to 2**15 buckets
> on every tb_flush; this makes sense because it is not guaranteed that the
> table will reach the same number of TBs later on (e.g. most bootup code is
> thrown away after boot); the table can simply grow again as more code
> blocks are translated. This also avoids the complication of
> having to build downsizing hysteresis logic into qht.
>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cpu-exec.c              | 82 ++++++++++++++++++++++++-----------------------
>  include/exec/exec-all.h |  9 +++---
>  include/exec/tb-hash.h  |  3 +-
>  translate-all.c         | 85 ++++++++++++++++++++++---------------------------
>  4 files changed, 86 insertions(+), 93 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index a889cf1..fd4b42f 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -217,55 +217,59 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
>      tb_free(tb);
>  }
>
> +struct tb_desc {
> +    target_ulong pc;
> +    target_ulong cs_base;
> +    uint64_t flags;
> +    tb_page_addr_t phys_page1;
> +    CPUArchState *env;
> +};
> +
> +static bool tb_cmp(const void *p, const void *d)
> +{
> +    const TranslationBlock *tb = p;
> +    const struct tb_desc *desc = d;
> +
> +    if (tb->pc == desc->pc &&
> +        tb->page_addr[0] == desc->phys_page1 &&
> +        tb->cs_base == desc->cs_base &&
> +        tb->flags == desc->flags) {
> +        /* check next page if needed */
> +        if (tb->page_addr[1] == -1) {
> +            return true;
> +        } else {
> +            tb_page_addr_t phys_page2;
> +            target_ulong virt_page2;
> +
> +            virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
> +            phys_page2 = get_page_addr_code(desc->env, virt_page2);
> +            if (tb->page_addr[1] == phys_page2) {
> +                return true;
> +            }
> +        }
> +    }
> +    return false;
> +}
> +
>  static TranslationBlock *tb_find_physical(CPUState *cpu,
>                                            target_ulong pc,
>                                            target_ulong cs_base,
>                                            uint32_t flags)
>  {
> -    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
> -    TranslationBlock *tb, **ptb1;
>      unsigned int h;

nit: this should be uint32_t given the hash and lookup functions are
defined as such.

> -    tb_page_addr_t phys_pc, phys_page1;
> -    target_ulong virt_page2;
> +    tb_page_addr_t phys_pc;
> +    struct tb_desc desc;
>
>      tcg_ctx.tb_ctx.tb_invalidated_flag = 0;
>
> -    /* find translated block using physical mappings */
> -    phys_pc = get_page_addr_code(env, pc);
> -    phys_page1 = phys_pc & TARGET_PAGE_MASK;
> +    desc.env = (CPUArchState *)cpu->env_ptr;
> +    desc.cs_base = cs_base;
> +    desc.flags = flags;
> +    desc.pc = pc;
> +    phys_pc = get_page_addr_code(desc.env, pc);
> +    desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
>      h = tb_hash_func(phys_pc, pc, flags);
> -    ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> -    for(;;) {
> -        tb = *ptb1;
> -        if (!tb) {
> -            return NULL;
> -        }
> -        if (tb->pc == pc &&
> -            tb->page_addr[0] == phys_page1 &&
> -            tb->cs_base == cs_base &&
> -            tb->flags == flags) {
> -            /* check next page if needed */
> -            if (tb->page_addr[1] != -1) {
> -                tb_page_addr_t phys_page2;
> -
> -                virt_page2 = (pc & TARGET_PAGE_MASK) +
> -                    TARGET_PAGE_SIZE;
> -                phys_page2 = get_page_addr_code(env, virt_page2);
> -                if (tb->page_addr[1] == phys_page2) {
> -                    break;
> -                }
> -            } else {
> -                break;
> -            }
> -        }
> -        ptb1 = &tb->phys_hash_next;
> -    }
> -
> -    /* Move the TB to the head of the list */
> -    *ptb1 = tb->phys_hash_next;
> -    tb->phys_hash_next = tcg_ctx.tb_ctx.tb_phys_hash[h];
> -    tcg_ctx.tb_ctx.tb_phys_hash[h] = tb;
> -    return tb;
> +    return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
>  }
>
>  static TranslationBlock *tb_find_slow(CPUState *cpu,
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index c75fb3a..27c25da 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -21,6 +21,7 @@
>  #define _EXEC_ALL_H_
>
>  #include "qemu-common.h"
> +#include "qemu/qht.h"
>
>  /* allow to see translation results - the slowdown should be negligible, so we leave it */
>  #define DEBUG_DISAS
> @@ -212,8 +213,8 @@ static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
>
>  #define CODE_GEN_ALIGN           16 /* must be >= of the size of a icache line */
>
> -#define CODE_GEN_PHYS_HASH_BITS     15
> -#define CODE_GEN_PHYS_HASH_SIZE     (1 << CODE_GEN_PHYS_HASH_BITS)
> +#define CODE_GEN_HTABLE_BITS     15
> +#define CODE_GEN_HTABLE_SIZE     (1 << CODE_GEN_HTABLE_BITS)
>
>  /* Estimated block size for TB allocation.  */
>  /* ??? The following is based on a 2015 survey of x86_64 host output.
> @@ -249,8 +250,6 @@ struct TranslationBlock {
>
>      void *tc_ptr;    /* pointer to the translated code */
>      uint8_t *tc_search;  /* pointer to search data */
> -    /* next matching tb for physical address. */
> -    struct TranslationBlock *phys_hash_next;
>      /* original tb when cflags has CF_NOCACHE */
>      struct TranslationBlock *orig_tb;
>      /* first and second physical page containing code. The lower bit
> @@ -281,7 +280,7 @@ typedef struct TBContext TBContext;
>  struct TBContext {
>
>      TranslationBlock *tbs;
> -    TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
> +    struct qht htable;
>      int nb_tbs;
>      /* any access to the tbs or the page table must use this lock */
>      QemuMutex tb_lock;
> diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
> index 4b9635a..d274357 100644
> --- a/include/exec/tb-hash.h
> +++ b/include/exec/tb-hash.h
> @@ -20,7 +20,6 @@
>  #ifndef EXEC_TB_HASH
>  #define EXEC_TB_HASH
>
> -#include "exec/exec-all.h"
>  #include "exec/tb-hash-xx.h"
>
>  /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
> @@ -49,7 +48,7 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
>  static inline
>  uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
>  {
> -    return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
> +    return tb_hash_func5(phys_pc, pc, flags);
>  }
>
>  #endif
> diff --git a/translate-all.c b/translate-all.c
> index eca2f16..617a572 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -735,6 +735,13 @@ static inline void code_gen_alloc(size_t tb_size)
>      qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
>  }
>
> +static void tb_htable_init(void)
> +{
> +    unsigned int mode = QHT_MODE_AUTO_RESIZE | QHT_MODE_MRU_INSERT;
> +
> +    qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
> +}
> +
>  /* Must be called before using the QEMU cpus. 'tb_size' is the size
>     (in bytes) allocated to the translation buffer. Zero means default
>     size. */
> @@ -742,6 +749,7 @@ void tcg_exec_init(unsigned long tb_size)
>  {
>      cpu_gen_init();
>      page_init();
> +    tb_htable_init();
>      code_gen_alloc(tb_size);
>  #if defined(CONFIG_SOFTMMU)
>      /* There's no guest base to take into account, so go ahead and
> @@ -843,7 +851,7 @@ void tb_flush(CPUState *cpu)
>          memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
>      }
>
> -    memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
> +    qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
>      page_flush_tb();
>
>      tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
> @@ -854,60 +862,46 @@ void tb_flush(CPUState *cpu)
>
>  #ifdef DEBUG_TB_CHECK
>
> -static void tb_invalidate_check(target_ulong address)
> +static void
> +do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
>  {
> -    TranslationBlock *tb;
> -    int i;
> +    TranslationBlock *tb = p;
> +    target_ulong addr = *(target_ulong *)userp;
>
> -    address &= TARGET_PAGE_MASK;
> -    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
> -        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
> -             tb = tb->phys_hash_next) {
> -            if (!(address + TARGET_PAGE_SIZE <= tb->pc ||
> -                  address >= tb->pc + tb->size)) {
> -                printf("ERROR invalidate: address=" TARGET_FMT_lx
> -                       " PC=%08lx size=%04x\n",
> -                       address, (long)tb->pc, tb->size);
> -            }
> -        }
> +    if (!(addr + TARGET_PAGE_SIZE <= tb->pc || addr >= tb->pc + tb->size)) {
> +        printf("ERROR invalidate: address=" TARGET_FMT_lx
> +               " PC=%08lx size=%04x\n", addr, (long)tb->pc, tb->size);
>      }
>  }
>
> -/* verify that all the pages have correct rights for code */
> -static void tb_page_check(void)
> +static void tb_invalidate_check(target_ulong address)
>  {
> -    TranslationBlock *tb;
> -    int i, flags1, flags2;
> -
> -    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
> -        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
> -                tb = tb->phys_hash_next) {
> -            flags1 = page_get_flags(tb->pc);
> -            flags2 = page_get_flags(tb->pc + tb->size - 1);
> -            if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
> -                printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
> -                       (long)tb->pc, tb->size, flags1, flags2);
> -            }
> -        }
> -    }
> +    address &= TARGET_PAGE_MASK;
> +    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address);
>  }
>
> -#endif
> -
> -static inline void tb_hash_remove(TranslationBlock **ptb, TranslationBlock *tb)
> +static void
> +do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
>  {
> -    TranslationBlock *tb1;
> +    TranslationBlock *tb = p;
> +    int flags1, flags2;
>
> -    for (;;) {
> -        tb1 = *ptb;
> -        if (tb1 == tb) {
> -            *ptb = tb1->phys_hash_next;
> -            break;
> -        }
> -        ptb = &tb1->phys_hash_next;
> +    flags1 = page_get_flags(tb->pc);
> +    flags2 = page_get_flags(tb->pc + tb->size - 1);
> +    if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
> +        printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
> +               (long)tb->pc, tb->size, flags1, flags2);
>      }
>  }
>
> +/* verify that all the pages have correct rights for code */
> +static void tb_page_check(void)
> +{
> +    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
> +}
> +
> +#endif
> +
>  static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
>  {
>      TranslationBlock *tb1;
> @@ -973,7 +967,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>      /* remove the TB from the hash list */
>      phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
>      h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> -    tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
> +    qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
>
>      /* remove the TB from the page list */
>      if (tb->page_addr[0] != page_addr) {
> @@ -1472,13 +1466,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>                           tb_page_addr_t phys_page2)
>  {
>      unsigned int h;
> -    TranslationBlock **ptb;
>
>      /* add in the hash table */
>      h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> -    ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> -    tb->phys_hash_next = *ptb;
> -    *ptb = tb;
> +    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
>
>      /* add in the page list */
>      tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);


--
Alex Bennée

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump
  2016-04-27  2:43             ` Emilio G. Cota
@ 2016-04-28 16:37               ` Richard Henderson
  0 siblings, 0 replies; 45+ messages in thread
From: Richard Henderson @ 2016-04-28 16:37 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Peter Crosthwaite, Peter Maydell, Sergey Fedorov

On 04/26/2016 07:43 PM, Emilio G. Cota wrote:
> How does the following look?
> 
> Example with good hashing, i.e. func5(phys_pc, pc, flags):
> TB count            704242/1342156
> [...]
> TB hash buckets     386484/524288 (73.72% used)
> TB hash occupancy   32.57% avg chain occupancy. Histogram: 0-10%▆▁█▁▁▅▁▃▁▁90-100%
> TB hash avg chain   1.02 buckets. Histogram: 1█▁▁3
> 
> Example with bad hashing, i.e. func5(phys_pc, 0, 0):
> TB count            710748/1342156
> [...]
> TB hash buckets     113569/524288 (21.66% used)
> TB hash occupancy   10.24% avg chain occupancy. Histogram: 0-10%█▁▁▁▁▁▁▁▁▁90-100%
> TB hash avg chain   2.11 buckets. Histogram: 1▇▁▁▁▁▁▁▁▁▁93
> 
> Note that:
> 
> - "TB hash avg chain" does _not_ count empty buckets. This gives
>   an idea of how many buckets a typical hit goes through.
> 
> - "TB hash occupancy" _does_ count empty buckets. It is called
>   "avg chain occupancy" and not "avg occupancy" because the
>   counts are only valid per-chain due to the seqlock protecting
>   each chain.

Looks really good.


r~

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2016-04-28 16:37 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-19 23:07 [Qemu-devel] [PATCH v3 00/11] tb hash improvements Emilio G. Cota
2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 01/11] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
2016-04-22  9:32   ` Alex Bennée
2016-04-22  9:35   ` Peter Maydell
2016-04-22 15:50     ` Emilio G. Cota
2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 02/11] seqlock: remove optional mutex Emilio G. Cota
2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 03/11] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 04/11] include/processor.h: define cpu_relax() Emilio G. Cota
2016-04-20 12:15   ` KONRAD Frederic
2016-04-20 17:16     ` Emilio G. Cota
2016-04-20 17:18       ` Peter Maydell
2016-04-20 17:32   ` [Qemu-devel] [UPDATED " Emilio G. Cota
2016-04-22  9:35   ` [Qemu-devel] [PATCH " Alex Bennée
2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 05/11] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
2016-04-20 15:18   ` Richard Henderson
2016-04-20 17:17     ` Emilio G. Cota
2016-04-20 17:55       ` Richard Henderson
2016-04-20 18:11         ` Emilio G. Cota
2016-04-20 19:39           ` Richard Henderson
2016-04-21 16:24             ` Emilio G. Cota
2016-04-20 17:35   ` [Qemu-devel] [UPDATED " Emilio G. Cota
2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 06/11] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
2016-04-20 15:19   ` Richard Henderson
2016-04-22 12:58   ` Alex Bennée
2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 07/11] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
2016-04-22 12:58   ` Alex Bennée
2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 08/11] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
2016-04-22 14:04   ` Alex Bennée
2016-04-24 20:01   ` Richard Henderson
2016-04-24 21:58     ` Emilio G. Cota
2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 09/11] qht: add test program Emilio G. Cota
2016-04-22 14:35   ` Alex Bennée
2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 10/11] tb hash: track translated blocks with qht Emilio G. Cota
2016-04-28 13:27   ` Alex Bennée
2016-04-19 23:07 ` [Qemu-devel] [PATCH v3 11/11] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
2016-04-20 15:21   ` Richard Henderson
2016-04-22 14:38   ` Alex Bennée
2016-04-22 17:41   ` Richard Henderson
2016-04-22 19:23     ` Emilio G. Cota
2016-04-22 19:59     ` Richard Henderson
2016-04-22 23:57       ` Emilio G. Cota
2016-04-24 19:46         ` Richard Henderson
2016-04-24 22:06           ` Emilio G. Cota
2016-04-27  2:43             ` Emilio G. Cota
2016-04-28 16:37               ` Richard Henderson
