All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 00/10] tb hash improvements
@ 2016-04-05  5:30 Emilio G. Cota
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 01/10] translate-all: add missing fold of tb_ctx into tcg_ctx Emilio G. Cota
                   ` (11 more replies)
  0 siblings, 12 replies; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-05  5:30 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Peter Crosthwaite, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée, Richard Henderson

This patchset is derived from my ongoing work on MTTCG, but does
not depend on it and brings improvements that we can already
benefit from. It applies cleanly on the current master and
is checkpatch-clean.

The key goal is to make the TB hash table faster, and while at it,
scalable. Tested on two different host machines, the execution time
improvement before and after this series, when booting a debian
jessie arm image[*] with arm-softmmu, is:

- Intel Xeon E5-2690: 21.2% less time
- Intel i7-4790K: 23.5% less time

This workload is particularly sensitive to TB hash performance.
Other workloads not as sensitive might see a slight performance
degradation with this patchset, since the hashing + lookup
functions take now more instructions. In any case, no significant
slowdowns should occur.

The commit logs are sometimes long because I have lots of numbers
to share.

The only bits I'm not too comfortable with in this series are patches
2 and 5; I don't develop on Windows so I'm shooting in the dark there.

Please take a look and if possible, test on workloads you care about!

Thanks,

		Emilio

[*] taskset -c 0 arm-softmmu/qemu-system-arm -machine type=virt -nographic \
     -smp 1 -m 4096 -netdev user,id=unet,hostfwd=tcp::2222-:22 \
     -device virtio-net-device,netdev=unet \
     -drive file=jessie-arm32.qcow2,id=myblock,index=0,if=none \
     -device virtio-blk-device,drive=myblock \
     -kernel aarch32-current-linux-kernel-only.img \
     -append 'console=ttyAMA0 root=/dev/vda1' \
     -name arm,debug-threads=on -smp 1 -tb-size 1024
The image is taken from:
  http://people.linaro.org/~alex.bennee/images/jessie-arm32.qcow2
The image was modified to call `shutdown -h now` right after boot.
The kernel is taken from:
  http://people.linaro.org/~alex.bennee/images/aarch32-current-linux-kernel-only.img

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [Qemu-devel] [PATCH 01/10] translate-all: add missing fold of tb_ctx into tcg_ctx
  2016-04-05  5:30 [Qemu-devel] [PATCH 00/10] tb hash improvements Emilio G. Cota
@ 2016-04-05  5:30 ` Emilio G. Cota
  2016-04-05  8:49   ` Paolo Bonzini
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED Emilio G. Cota
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-05  5:30 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Peter Crosthwaite, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée, Richard Henderson

Since 5e5f07e08 "TCG: Move translation block variables
to new context inside tcg_ctx: tb_ctx" on Feb 1 2013, compilation
of usermode + TB_DEBUG_CHECK has been broken. Fix it.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 translate-all.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/translate-all.c b/translate-all.c
index b4df1ec..8329ea6 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -861,7 +861,8 @@ static void tb_invalidate_check(target_ulong address)
 
     address &= TARGET_PAGE_MASK;
     for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
-        for (tb = tb_ctx.tb_phys_hash[i]; tb != NULL; tb = tb->phys_hash_next) {
+        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
+             tb = tb->phys_hash_next) {
             if (!(address + TARGET_PAGE_SIZE <= tb->pc ||
                   address >= tb->pc + tb->size)) {
                 printf("ERROR invalidate: address=" TARGET_FMT_lx
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05  5:30 [Qemu-devel] [PATCH 00/10] tb hash improvements Emilio G. Cota
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 01/10] translate-all: add missing fold of tb_ctx into tcg_ctx Emilio G. Cota
@ 2016-04-05  5:30 ` Emilio G. Cota
  2016-04-05  7:57   ` Peter Maydell
                     ` (2 more replies)
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 03/10] seqlock: remove optional mutex Emilio G. Cota
                   ` (9 subsequent siblings)
  11 siblings, 3 replies; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-05  5:30 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Peter Crosthwaite, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée, Richard Henderson

I'm assuming windows compilers don't support this attribute.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/compiler.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
index 8f1cc7b..fb946f1 100644
--- a/include/qemu/compiler.h
+++ b/include/qemu/compiler.h
@@ -41,6 +41,16 @@
 # define QEMU_PACKED __attribute__((packed))
 #endif
 
+#define QEMU_CACHELINE (64)
+
+#if defined(_WIN32)
+#define QEMU_ALIGN(B)
+#else
+#define QEMU_ALIGN(B) __attribute__((aligned(B)))
+#endif
+
+#define QEMU_CACHELINE_ALIGNED QEMU_ALIGN(QEMU_CACHELINE)
+
 #ifndef glue
 #define xglue(x, y) x ## y
 #define glue(x, y) xglue(x, y)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [Qemu-devel] [PATCH 03/10] seqlock: remove optional mutex
  2016-04-05  5:30 [Qemu-devel] [PATCH 00/10] tb hash improvements Emilio G. Cota
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 01/10] translate-all: add missing fold of tb_ctx into tcg_ctx Emilio G. Cota
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED Emilio G. Cota
@ 2016-04-05  5:30 ` Emilio G. Cota
  2016-04-06  8:38   ` Alex Bennée
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 04/10] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-05  5:30 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Peter Crosthwaite, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée, Richard Henderson

This option is unused; besides, it bloats the struct when not needed.
Let's just let writers define their own locks elsewhere.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpus.c                 |  2 +-
 include/qemu/seqlock.h | 10 +---------
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/cpus.c b/cpus.c
index 8ae4777..7f550b9 100644
--- a/cpus.c
+++ b/cpus.c
@@ -611,7 +611,7 @@ int cpu_throttle_get_percentage(void)
 
 void cpu_ticks_init(void)
 {
-    seqlock_init(&timers_state.vm_clock_seqlock, NULL);
+    seqlock_init(&timers_state.vm_clock_seqlock);
     vmstate_register(NULL, 0, &vmstate_timers, &timers_state);
     throttle_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL_RT,
                                            cpu_throttle_timer_tick, NULL);
diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index 70b01fd..e673482 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -19,22 +19,17 @@
 typedef struct QemuSeqLock QemuSeqLock;
 
 struct QemuSeqLock {
-    QemuMutex *mutex;
     unsigned sequence;
 };
 
-static inline void seqlock_init(QemuSeqLock *sl, QemuMutex *mutex)
+static inline void seqlock_init(QemuSeqLock *sl)
 {
-    sl->mutex = mutex;
     sl->sequence = 0;
 }
 
 /* Lock out other writers and update the count.  */
 static inline void seqlock_write_lock(QemuSeqLock *sl)
 {
-    if (sl->mutex) {
-        qemu_mutex_lock(sl->mutex);
-    }
     ++sl->sequence;
 
     /* Write sequence before updating other fields.  */
@@ -47,9 +42,6 @@ static inline void seqlock_write_unlock(QemuSeqLock *sl)
     smp_wmb();
 
     ++sl->sequence;
-    if (sl->mutex) {
-        qemu_mutex_unlock(sl->mutex);
-    }
 }
 
 static inline unsigned seqlock_read_begin(QemuSeqLock *sl)
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [Qemu-devel] [PATCH 04/10] seqlock: rename write_lock/unlock to write_begin/end
  2016-04-05  5:30 [Qemu-devel] [PATCH 00/10] tb hash improvements Emilio G. Cota
                   ` (2 preceding siblings ...)
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 03/10] seqlock: remove optional mutex Emilio G. Cota
@ 2016-04-05  5:30 ` Emilio G. Cota
  2016-04-06  8:42   ` Alex Bennée
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 05/10] include: add spinlock wrapper Emilio G. Cota
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-05  5:30 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Peter Crosthwaite, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée, Richard Henderson

It is a more appropriate name, now that the mutex embedded
in the seqlock is gone.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpus.c                 | 28 ++++++++++++++--------------
 include/qemu/seqlock.h |  4 ++--
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/cpus.c b/cpus.c
index 7f550b9..7dad2e6 100644
--- a/cpus.c
+++ b/cpus.c
@@ -247,13 +247,13 @@ int64_t cpu_get_clock(void)
 void cpu_enable_ticks(void)
 {
     /* Here, the really thing protected by seqlock is cpu_clock_offset. */
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     if (!timers_state.cpu_ticks_enabled) {
         timers_state.cpu_ticks_offset -= cpu_get_host_ticks();
         timers_state.cpu_clock_offset -= get_clock();
         timers_state.cpu_ticks_enabled = 1;
     }
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 }
 
 /* disable cpu_get_ticks() : the clock is stopped. You must not call
@@ -263,13 +263,13 @@ void cpu_enable_ticks(void)
 void cpu_disable_ticks(void)
 {
     /* Here, the really thing protected by seqlock is cpu_clock_offset. */
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     if (timers_state.cpu_ticks_enabled) {
         timers_state.cpu_ticks_offset += cpu_get_host_ticks();
         timers_state.cpu_clock_offset = cpu_get_clock_locked();
         timers_state.cpu_ticks_enabled = 0;
     }
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 }
 
 /* Correlation between real and virtual time is always going to be
@@ -292,7 +292,7 @@ static void icount_adjust(void)
         return;
     }
 
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     cur_time = cpu_get_clock_locked();
     cur_icount = cpu_get_icount_locked();
 
@@ -313,7 +313,7 @@ static void icount_adjust(void)
     last_delta = delta;
     timers_state.qemu_icount_bias = cur_icount
                               - (timers_state.qemu_icount << icount_time_shift);
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 }
 
 static void icount_adjust_rt(void *opaque)
@@ -345,7 +345,7 @@ static void icount_warp_rt(void)
         return;
     }
 
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     if (runstate_is_running()) {
         int64_t clock = REPLAY_CLOCK(REPLAY_CLOCK_VIRTUAL_RT,
                                      cpu_get_clock_locked());
@@ -364,7 +364,7 @@ static void icount_warp_rt(void)
         timers_state.qemu_icount_bias += warp_delta;
     }
     vm_clock_warp_start = -1;
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 
     if (qemu_clock_expired(QEMU_CLOCK_VIRTUAL)) {
         qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
@@ -389,9 +389,9 @@ void qtest_clock_warp(int64_t dest)
         int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
         int64_t warp = qemu_soonest_timeout(dest - clock, deadline);
 
-        seqlock_write_lock(&timers_state.vm_clock_seqlock);
+        seqlock_write_begin(&timers_state.vm_clock_seqlock);
         timers_state.qemu_icount_bias += warp;
-        seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+        seqlock_write_end(&timers_state.vm_clock_seqlock);
 
         qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
         timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]);
@@ -458,9 +458,9 @@ void qemu_start_warp_timer(void)
              * It is useful when we want a deterministic execution time,
              * isolated from host latencies.
              */
-            seqlock_write_lock(&timers_state.vm_clock_seqlock);
+            seqlock_write_begin(&timers_state.vm_clock_seqlock);
             timers_state.qemu_icount_bias += deadline;
-            seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+            seqlock_write_end(&timers_state.vm_clock_seqlock);
             qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
         } else {
             /*
@@ -471,11 +471,11 @@ void qemu_start_warp_timer(void)
              * you will not be sending network packets continuously instead of
              * every 100ms.
              */
-            seqlock_write_lock(&timers_state.vm_clock_seqlock);
+            seqlock_write_begin(&timers_state.vm_clock_seqlock);
             if (vm_clock_warp_start == -1 || vm_clock_warp_start > clock) {
                 vm_clock_warp_start = clock;
             }
-            seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+            seqlock_write_end(&timers_state.vm_clock_seqlock);
             timer_mod_anticipate(icount_warp_timer, clock + deadline);
         }
     } else if (deadline == 0) {
diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index e673482..4dfc055 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -28,7 +28,7 @@ static inline void seqlock_init(QemuSeqLock *sl)
 }
 
 /* Lock out other writers and update the count.  */
-static inline void seqlock_write_lock(QemuSeqLock *sl)
+static inline void seqlock_write_begin(QemuSeqLock *sl)
 {
     ++sl->sequence;
 
@@ -36,7 +36,7 @@ static inline void seqlock_write_lock(QemuSeqLock *sl)
     smp_wmb();
 }
 
-static inline void seqlock_write_unlock(QemuSeqLock *sl)
+static inline void seqlock_write_end(QemuSeqLock *sl)
 {
     /* Write other fields before finalizing sequence.  */
     smp_wmb();
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [Qemu-devel] [PATCH 05/10] include: add spinlock wrapper
  2016-04-05  5:30 [Qemu-devel] [PATCH 00/10] tb hash improvements Emilio G. Cota
                   ` (3 preceding siblings ...)
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 04/10] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
@ 2016-04-05  5:30 ` Emilio G. Cota
  2016-04-05  8:51   ` Paolo Bonzini
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 06/10] include: add xxhash.h Emilio G. Cota
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-05  5:30 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Peter Crosthwaite, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée, Richard Henderson

Wrap pthread_spin on POSIX, or QemuMutex on Windows.

AFAIK there are is no off-the-shelf spinlock implementation for
Windows, so we'll just use QemuMutex.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/spinlock-posix.h | 60 +++++++++++++++++++++++++++++++++++++++++++
 include/qemu/spinlock-win32.h | 33 ++++++++++++++++++++++++
 include/qemu/spinlock.h       | 10 ++++++++
 3 files changed, 103 insertions(+)
 create mode 100644 include/qemu/spinlock-posix.h
 create mode 100644 include/qemu/spinlock-win32.h
 create mode 100644 include/qemu/spinlock.h

diff --git a/include/qemu/spinlock-posix.h b/include/qemu/spinlock-posix.h
new file mode 100644
index 0000000..51c2c08
--- /dev/null
+++ b/include/qemu/spinlock-posix.h
@@ -0,0 +1,60 @@
+#ifndef QEMU_SPINLOCK_POSIX_H
+#define QEMU_SPINLOCK_POSIX_H
+
+#include <qemu/thread.h>
+#include <qemu/osdep.h>
+
+typedef pthread_spinlock_t QemuSpinLock;
+
+static inline void qemu_spinlock_error_exit(int err, const char *msg)
+{
+    fprintf(stderr, "qemu: %s: %s\n", msg, strerror(err));
+    abort();
+}
+
+static inline void qemu_spinlock_init(QemuSpinLock *lock)
+{
+    int rc;
+
+    rc = pthread_spin_init(lock, PTHREAD_PROCESS_SHARED);
+    if (unlikely(rc)) {
+        qemu_spinlock_error_exit(rc, __func__);
+    }
+}
+
+static inline void qemu_spinlock_destroy(QemuSpinLock *lock)
+{
+    int rc;
+
+    rc = pthread_spin_destroy(lock);
+    if (unlikely(rc)) {
+        qemu_spinlock_error_exit(rc, __func__);
+    }
+}
+
+static inline void qemu_spinlock_lock(QemuSpinLock *lock)
+{
+    int rc;
+
+    rc = pthread_spin_lock(lock);
+    if (unlikely(rc)) {
+        qemu_spinlock_error_exit(rc, __func__);
+    }
+}
+
+static inline int qemu_spinlock_trylock(QemuSpinLock *lock)
+{
+    return pthread_spin_trylock(lock);
+}
+
+static inline void qemu_spinlock_unlock(QemuSpinLock *lock)
+{
+    int rc;
+
+    rc = pthread_spin_unlock(lock);
+    if (unlikely(rc)) {
+        qemu_spinlock_error_exit(rc, __func__);
+    }
+}
+
+#endif /* QEMU_SPINLOCK_POSIX_H */
diff --git a/include/qemu/spinlock-win32.h b/include/qemu/spinlock-win32.h
new file mode 100644
index 0000000..5a105fb
--- /dev/null
+++ b/include/qemu/spinlock-win32.h
@@ -0,0 +1,33 @@
+#ifndef QEMU_SPINLOCK_WIN32_H
+#define QEMU_SPINLOCK_WIN32_H
+
+#include <qemu/thread.h>
+
+typedef QemuMutex QemuSpinLock;
+
+static inline void qemu_spinlock_init(QemuSpinLock *lock)
+{
+    qemu_mutex_init(lock);
+}
+
+static inline void qemu_spinlock_destroy(QemuSpinLock *lock)
+{
+    qemu_mutex_destroy(lock);
+}
+
+static inline void qemu_spinlock_lock(QemuSpinLock *lock)
+{
+    qemu_mutex_lock(lock);
+}
+
+static inline int qemu_spinlock_trylock(QemuSpinLock *lock)
+{
+    return qemu_mutex_trylock(lock);
+}
+
+static inline void qemu_spinlock_unlock(QemuSpinLock *lock)
+{
+    qemu_mutex_unlock(lock);
+}
+
+#endif /* QEMU_SPINLOCK_WIN32_H */
diff --git a/include/qemu/spinlock.h b/include/qemu/spinlock.h
new file mode 100644
index 0000000..001b55e
--- /dev/null
+++ b/include/qemu/spinlock.h
@@ -0,0 +1,10 @@
+#ifndef QEMU_SPINLOCK_H
+#define QEMU_SPINLOCK_H
+
+#ifdef _WIN32
+#include "spinlock-win32.h"
+#else
+#include "spinlock-posix.h"
+#endif
+
+#endif /* QEMU_SPINLOCK_H */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [Qemu-devel] [PATCH 06/10] include: add xxhash.h
  2016-04-05  5:30 [Qemu-devel] [PATCH 00/10] tb hash improvements Emilio G. Cota
                   ` (4 preceding siblings ...)
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 05/10] include: add spinlock wrapper Emilio G. Cota
@ 2016-04-05  5:30 ` Emilio G. Cota
  2016-04-06 11:39   ` Alex Bennée
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-05  5:30 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Peter Crosthwaite, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée, Richard Henderson

xxhash is a fast, high-quality hashing function. The appended
brings in the 32-bit version of it, with the small modification that
it assumes the data to be hashed is made of 32-bit chunks; this increases
speed slightly for the use-case we care about, i.e. tb-hash.

The original algorithm, as well as a 64-bit implementation, can be found at:
  https://github.com/Cyan4973/xxHash

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/xxhash.h | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 include/qemu/xxhash.h

diff --git a/include/qemu/xxhash.h b/include/qemu/xxhash.h
new file mode 100644
index 0000000..a13a665
--- /dev/null
+++ b/include/qemu/xxhash.h
@@ -0,0 +1,106 @@
+/*
+ * xxHash - Fast Hash algorithm
+ * Copyright (C) 2012-2016, Yann Collet
+ *
+ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * + Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * + Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * You can contact the author at :
+ * - xxHash source repository : https://github.com/Cyan4973/xxHash
+ */
+#ifndef QEMU_XXHASH_H
+#define QEMU_XXHASH_H
+
+#define PRIME32_1   2654435761U
+#define PRIME32_2   2246822519U
+#define PRIME32_3   3266489917U
+#define PRIME32_4    668265263U
+#define PRIME32_5    374761393U
+
+/*
+ * Note : although _rotl exists for minGW (GCC under windows), performance
+ * seems poor.
+ */
+#if defined(_MSC_VER)
+#  define XXH_rotl32(x, r) _rotl(x, r)
+#else
+#  define XXH_rotl32(x, r) ((x << r) | (x >> (32 - r)))
+#endif
+
+/* u32 hash of @n contiguous chunks of u32's */
+static inline uint32_t qemu_xxh32(const uint32_t *p, size_t n, uint32_t seed)
+{
+    const uint32_t *end = p + n;
+    uint32_t h32;
+
+    if (n >= 4) {
+        const uint32_t * const limit = end - 4;
+        uint32_t v1 = seed + PRIME32_1 + PRIME32_2;
+        uint32_t v2 = seed + PRIME32_2;
+        uint32_t v3 = seed + 0;
+        uint32_t v4 = seed - PRIME32_1;
+
+        do {
+            v1 += *p * PRIME32_2;
+            v1 = XXH_rotl32(v1, 13);
+            v1 *= PRIME32_1;
+            p++;
+            v2 += *p * PRIME32_2;
+            v2 = XXH_rotl32(v2, 13);
+            v2 *= PRIME32_1;
+            p++;
+            v3 += *p * PRIME32_2;
+            v3 = XXH_rotl32(v3, 13);
+            v3 *= PRIME32_1;
+            p++;
+            v4 += *p * PRIME32_2;
+            v4 = XXH_rotl32(v4, 13);
+            v4 *= PRIME32_1;
+            p++;
+        } while (p <= limit);
+        h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) +
+            XXH_rotl32(v4, 18);
+    } else {
+        h32  = seed + PRIME32_5;
+    }
+
+    h32 += n * sizeof(uint32_t);
+
+    while (p < end) {
+        h32 += *p * PRIME32_3;
+        h32  = XXH_rotl32(h32, 17) * PRIME32_4 ;
+        p++;
+    }
+
+    h32 ^= h32 >> 15;
+    h32 *= PRIME32_2;
+    h32 ^= h32 >> 13;
+    h32 *= PRIME32_3;
+    h32 ^= h32 >> 16;
+
+    return h32;
+}
+
+#endif /* QEMU_XXHASH_H */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-05  5:30 [Qemu-devel] [PATCH 00/10] tb hash improvements Emilio G. Cota
                   ` (5 preceding siblings ...)
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 06/10] include: add xxhash.h Emilio G. Cota
@ 2016-04-05  5:30 ` Emilio G. Cota
  2016-04-05 15:41   ` Richard Henderson
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 08/10] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-05  5:30 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Peter Crosthwaite, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée, Richard Henderson

For some workloads such as arm bootup, tb_phys_hash is performance-critical.
The is due to the high frequency of accesses to the hash table, originated
by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
More info:
  https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html

To dig further into this I modified an arm image booting debian jessie to
immediately shut down after boot. Analysis revealed that quite a bit of time
is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
results in very uneven loading of chains in the hash table's buckets;
the longest observed chain had ~550 elements.

The appended addresses this with two changes:

1) Use xxhash as the hash table's hash function. xxhash is a fast,
   high-quality hashing function.

2) Feed the hashing function with not just tb_phys, but also pc and flags.

This improves performance over using just tb_phys for hashing, since that
resulted in some hash buckets having many TB's, while others getting very few;
with these changes, the longest observed chain on a single hash bucket is
brought down from ~550 to ~40.

Tests show that the other element checked for in tb_find_physical,
cs_base, is always a match when tb_phys+pc+flags are a match,
so hashing cs_base is wasteful. It could be that this is an ARM-only
thing, though.

BTW, after this change the hash table should not be called "tb_hash_phys"
anymore; this is addressed later in this series.

This change gives consistent bootup time improvements. I tested two
host machines:
- Intel Xeon E5-2690: 11.6% less time
- Intel i7-4790K: 19.2% less time

Increasing the number of hash buckets yields further improvements. However,
using a larger, fixed number of buckets can degrade performance for other
workloads that do not translate as many blocks (600K+ for debian-jessie arm
bootup). This is dealt with later in this series.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c             |  2 +-
 include/exec/tb-hash.h | 19 +++++++++++++++++--
 translate-all.c        |  6 +++---
 3 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index bbfcbfb..4194a4a 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -233,7 +233,7 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
     /* find translated block using physical mappings */
     phys_pc = get_page_addr_code(env, pc);
     phys_page1 = phys_pc & TARGET_PAGE_MASK;
-    h = tb_phys_hash_func(phys_pc);
+    h = tb_hash_func(phys_pc, pc, flags);
     ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
     for(;;) {
         tb = *ptb1;
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 0f4e8a0..c4942e1 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -20,6 +20,9 @@
 #ifndef EXEC_TB_HASH
 #define EXEC_TB_HASH
 
+#include "exec/exec-all.h"
+#include "qemu/xxhash.h"
+
 /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
    addresses on the same page.  The top bits are the same.  This allows
    TLB invalidation to quickly clear a subset of the hash table.  */
@@ -43,9 +46,21 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
            | (tmp & TB_JMP_ADDR_MASK));
 }
 
-static inline unsigned int tb_phys_hash_func(tb_page_addr_t pc)
+static inline
+uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint64_t flags)
 {
-    return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
+    struct key {
+        tb_page_addr_t phys_pc;
+        target_ulong pc;
+        uint64_t flags;
+    } QEMU_PACKED k;
+    unsigned int hash;
+
+    k.phys_pc = phys_pc;
+    k.pc = pc;
+    k.flags = flags;
+    hash = qemu_xxh32((uint32_t *)&k, sizeof(k) / sizeof(uint32_t), 1);
+    return hash & (CODE_GEN_PHYS_HASH_SIZE - 1);
 }
 
 #endif
diff --git a/translate-all.c b/translate-all.c
index 8329ea6..8e7f9a7 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -972,7 +972,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
-    h = tb_phys_hash_func(phys_pc);
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
     tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
 
     /* remove the TB from the page list */
@@ -1474,8 +1474,8 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
     unsigned int h;
     TranslationBlock **ptb;
 
-    /* add in the physical hash table */
-    h = tb_phys_hash_func(phys_pc);
+    /* add in the hash table */
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
     ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
     tb->phys_hash_next = *ptb;
     *ptb = tb;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [Qemu-devel] [PATCH 08/10] qht: QEMU's fast, resizable and scalable Hash Table
  2016-04-05  5:30 [Qemu-devel] [PATCH 00/10] tb hash improvements Emilio G. Cota
                   ` (6 preceding siblings ...)
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
@ 2016-04-05  5:30 ` Emilio G. Cota
  2016-04-05  9:01   ` Paolo Bonzini
                     ` (2 more replies)
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 09/10] qht: add test program Emilio G. Cota
                   ` (3 subsequent siblings)
  11 siblings, 3 replies; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-05  5:30 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Peter Crosthwaite, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée, Richard Henderson

This is a hash table with optional auto-resizing and MRU promotion for
reads and writes. Its implementation goal is to stay fast while
scaling for read-mostly workloads.

A hash table with these features will be necessary for the scalability
of the ongoing MTTCG work; before those changes arrive we can already
benefit from the single-threaded speedup that qht also provides.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/qht.h | 150 ++++++++++++++++++
 util/Makefile.objs |   2 +-
 util/qht.c         | 458 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 609 insertions(+), 1 deletion(-)
 create mode 100644 include/qemu/qht.h
 create mode 100644 util/qht.c

diff --git a/include/qemu/qht.h b/include/qemu/qht.h
new file mode 100644
index 0000000..63244e4
--- /dev/null
+++ b/include/qemu/qht.h
@@ -0,0 +1,150 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#ifndef QHT_H
+#define QHT_H
+
+#include <stdbool.h>
+#include <assert.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/spinlock.h"
+#include "qemu/seqlock.h"
+#include "qemu/rcu.h"
+
+#if HOST_LONG_BITS == 32
+#define QHT_BUCKET_ENTRIES 6
+#else /* 64-bit */
+#define QHT_BUCKET_ENTRIES 4
+#endif
+
+struct qht_bucket {
+    QemuSpinLock lock;
+    QemuSeqLock sequence;
+    uint32_t hashes[QHT_BUCKET_ENTRIES];
+    void *pointers[QHT_BUCKET_ENTRIES];
+    struct qht_bucket *next;
+} QEMU_CACHELINE_ALIGNED;
+
+struct qht_map {
+    struct qht_bucket *buckets;
+    uint64_t n;
+    uint64_t n_items;
+    uint64_t n_items_threshold;
+    struct rcu_head rcu;
+};
+
+struct qht {
+    struct qht_map *map;
+    unsigned int mode;
+};
+
+typedef bool (*qht_lookup_func_t)(const void *obj, const void *userp);
+typedef void (*qht_iter_func_t)(struct qht *ht, void *p, uint32_t h, void *up);
+
+#define QHT_MODE_MRU_LOOKUP  0x1 /* move looked-up items to head */
+#define QHT_MODE_MRU_INSERT  0x2 /* insert new elements at the head */
+#define QHT_MODE_AUTO_RESIZE 0x4 /* auto-resize when heavily loaded */
+
+void qht_init(struct qht *ht, uint64_t n_elems, unsigned int mode);
+
+/* call only when there are no readers left */
+void qht_destroy(struct qht *ht);
+
+/* call with an external lock held */
+void qht_reset(struct qht *ht);
+
+/* call with an external lock held */
+void qht_reset_size(struct qht *ht, uint64_t n_elems);
+
+/* call with an external lock held */
+void qht_insert(struct qht *ht, void *p, uint32_t hash);
+
+/* call with an external lock held */
+bool qht_remove(struct qht *ht, const void *p, uint32_t hash);
+
+/* call with an external lock held */
+void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp);
+
+/* call with an external lock held */
+void qht_grow(struct qht *ht);
+
+void qht_bucket_mru(struct qht_bucket *b, struct qht_bucket *orig,
+                    const void *p, int pos);
+
+static inline
+struct qht_bucket *qht_map_to_bucket(struct qht_map *map, uint32_t hash)
+{
+    return &map->buckets[hash & (map->n - 1)];
+}
+
+static inline
+void *__qht_lookup(struct qht_bucket *b, struct qht_bucket **far_bucket,
+                    int *pos, qht_lookup_func_t func, const void *userp,
+                    uint32_t hash)
+{
+    unsigned int count = 0;
+    int i;
+
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->hashes[i] == hash) {
+                void *p = atomic_read(&b->pointers[i]);
+
+                /*
+                 * p could be NULL if we're racing with a writer. We could use
+                 * barriers here but for performance we only issue the ones
+                 * in the seqlock.
+                 */
+                if (likely(p && func(p, userp))) {
+                    if (unlikely(count)) {
+                        *far_bucket = b;
+                        *pos = i;
+                    }
+                    return p;
+                }
+            }
+        }
+        count++;
+        b = atomic_read(&b->next);
+        /*
+         * This barrier guarantees that we will read a properly initialized
+         * b->next; it is paired with an smp_wmb() before setting b->next.
+         */
+        smp_rmb();
+    } while (b);
+    return NULL;
+}
+
+static inline void *qht_lookup(struct qht *ht, qht_lookup_func_t func,
+                                const void *userp, uint32_t hash)
+{
+    struct qht_bucket *far_bucket = NULL;
+    struct qht_bucket *b;
+    struct qht_map *map;
+    uint32_t version;
+    int pos = 0;
+    void *ret;
+
+    map = atomic_read(&ht->map);
+    /* paired with smp_wmb() before setting ht->map */
+    smp_rmb();
+    b = qht_map_to_bucket(map, hash);
+
+    do {
+        version = seqlock_read_begin(&b->sequence);
+        ret = __qht_lookup(b, &far_bucket, &pos, func, userp, hash);
+    } while (seqlock_read_retry(&b->sequence, version));
+    if ((ht->mode & QHT_MODE_MRU_LOOKUP) && unlikely(far_bucket)) {
+        qht_bucket_mru(b, far_bucket, ret, pos);
+    }
+    return ret;
+}
+
+#endif /* QHT_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index a8a777e..ae893dc 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -1,4 +1,4 @@
-util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
+util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o qht.o
 util-obj-$(CONFIG_POSIX) += compatfd.o
 util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
 util-obj-$(CONFIG_POSIX) += mmap-alloc.o
diff --git a/util/qht.c b/util/qht.c
new file mode 100644
index 0000000..590cd8a
--- /dev/null
+++ b/util/qht.c
@@ -0,0 +1,458 @@
+/*
+ * qht.c - QEMU Hash Table, designed to scale for read-mostly workloads.
+ *
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ *
+ * Assumptions:
+ * - Writers and iterators must take an external lock.
+ * - A hash value of 0 is invalid.
+ * - NULL cannot be inserted as a pointer value.
+ *
+ * Features:
+ * - Optional auto-resizing: the hash table resizes up if the load surpasses
+ *   a certain threshold. Resizing is done concurrently with readers.
+ * - Optional bucket MRU promotion policy.
+ *
+ * The key structure is the bucket, which is cacheline-sized. Buckets
+ * contain a few hash values and pointers; the u32 hash values are stored in
+ * full so that resizing is fast. Having this structure instead of directly
+ * chaining items has three advantages:
+ * - Failed lookups fail fast, and touch a minimum number of cache lines.
+ * - Resizing the hash table with concurrent lookups is easy.
+ * - We can have a few Most-Recently-Used (MRU) hash-pointer pairs in the same
+ *   head bucket. This helps scalability, since MRU promotions (i.e. writes to
+ *   the bucket) become less common.
+ *
+ * For concurrent lookups we use a per-bucket seqlock; per-bucket spinlocks
+ * allow readers (lookups) to upgrade to writers and thus implement an MRU
+ * promotion policy; these MRU-induced writes do not touch the cache lines of
+ * other head buckets.
+ *
+ * Note that there are two types of buckets:
+ * 1. "head" buckets are the ones allocated in the array of buckets in qht_map.
+ * 2. all "non-head" buckets (i.e. all others) are members of a chain that
+ *    starts from a head bucket.
+ * Note that the seqlock and spinlock of a head bucket applies to all buckets
+ * chained to it; these two fields are unused in non-head buckets.
+ *
+ * Resizing is done by taking all spinlocks (so that no readers-turned-writers
+ * can race with us) and then placing all elements into a new hash table. Last,
+ * the ht->map pointer is set, and the old map is freed once no RCU readers can
+ * see it anymore.
+ *
+ * Related Work:
+ * - Idea of cacheline-sized buckets with full hashes taken from:
+ *   David, Guerraoui & Trigonakis, "Asynchronized Concurrency:
+ *   The Secret to Scaling Concurrent Search Data Structures", ASPLOS'15.
+ * - Why not RCU-based hash tables? They would allow us to get rid of the
+ *   seqlock, but resizing would take forever since RCU read critical
+ *   sections in QEMU take quite a long time.
+ *   More info on relativistic hash tables:
+ *   + Triplett, McKenney & Walpole, "Resizable, Scalable, Concurrent Hash
+ *     Tables via Relativistic Programming", USENIX ATC'11.
+ *   + Corbet, "Relativistic hash tables, part 1: Algorithms", @ lwn.net, 2014.
+ *     https://lwn.net/Articles/612021/
+ */
+#include "qemu/qht.h"
+#include "qemu/atomic.h"
+
+static inline uint64_t qht_elems_to_buckets(uint64_t n_elems)
+{
+    return pow2ceil(n_elems / QHT_BUCKET_ENTRIES);
+}
+
+static inline void qht_head_init(struct qht_bucket *b)
+{
+    memset(b, 0, sizeof(*b));
+    qemu_spinlock_init(&b->lock);
+    seqlock_init(&b->sequence);
+}
+
+static inline void qht_chain_destroy(struct qht_bucket *head)
+{
+    struct qht_bucket *curr = head->next;
+    struct qht_bucket *prev;
+
+    while (curr) {
+        prev = curr;
+        curr = curr->next;
+        qemu_vfree(prev);
+    }
+}
+
+/* pass only an orphan map */
+static void qht_map_destroy(struct qht_map *map)
+{
+    uint64_t i;
+
+    for (i = 0; i < map->n; i++) {
+        qht_chain_destroy(&map->buckets[i]);
+    }
+    qemu_vfree(map->buckets);
+    g_free(map);
+}
+
+static void qht_map_reclaim(struct rcu_head *rcu)
+{
+    struct qht_map *map = container_of(rcu, struct qht_map, rcu);
+
+    qht_map_destroy(map);
+}
+
+static struct qht_map *qht_map_create(uint64_t n)
+{
+    struct qht_map *map;
+    uint64_t i;
+
+    assert(n <= (1ULL << 32)); /* we're using a 32-bit hash func */
+    map = g_malloc(sizeof(*map));
+    map->n = n;
+    map->n_items = 0;
+    map->n_items_threshold = n * QHT_BUCKET_ENTRIES / 2;
+    map->buckets = qemu_memalign(QEMU_CACHELINE, sizeof(*map->buckets) * n);
+    for (i = 0; i < n; i++) {
+        qht_head_init(&map->buckets[i]);
+    }
+    return map;
+}
+
+static inline void qht_publish(struct qht *ht, struct qht_map *new)
+{
+    /* Readers should see a properly initialized map; pair with smp_rmb() */
+    smp_wmb();
+    atomic_set(&ht->map, new);
+}
+
+void qht_init(struct qht *ht, uint64_t n_elems, unsigned int mode)
+{
+    struct qht_map *map;
+    uint64_t n = qht_elems_to_buckets(n_elems);
+
+    map = qht_map_create(n);
+    ht->mode = mode;
+    qht_publish(ht, map);
+}
+
+/* call only when there are no readers left */
+void qht_destroy(struct qht *ht)
+{
+    qht_map_destroy(ht->map);
+    memset(ht, 0, sizeof(*ht));
+}
+
+static void __qht_bucket_reset(struct qht_bucket *b)
+{
+    int i;
+
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            atomic_set(&b->hashes[i], 0);
+            atomic_set(&b->pointers[i], NULL);
+        }
+        b = b->next;
+    } while (b);
+}
+
+static void qht_bucket_reset(struct qht_bucket *b)
+{
+    qemu_spinlock_lock(&b->lock);
+    seqlock_write_begin(&b->sequence);
+    __qht_bucket_reset(b);
+    seqlock_write_end(&b->sequence);
+    qemu_spinlock_unlock(&b->lock);
+}
+
+/* call with an external lock held */
+void qht_reset(struct qht *ht)
+{
+    struct qht_map *map = ht->map;
+    uint64_t i;
+
+    for (i = 0; i < map->n; i++) {
+        qht_bucket_reset(&map->buckets[i]);
+    }
+}
+
+/* call with an external lock held */
+void qht_reset_size(struct qht *ht, uint64_t n_elems)
+{
+    struct qht_map *old = ht->map;
+
+    qht_reset(ht);
+    if (old->n == qht_elems_to_buckets(n_elems)) {
+        return;
+    }
+    qht_init(ht, n_elems, ht->mode);
+    call_rcu1(&old->rcu, qht_map_reclaim);
+}
+
+/* Can only be called when at least orig is the 3rd link i.e. head->2nd->orig */
+static void
+__qht_move_bucket_to_front(struct qht_bucket *head, struct qht_bucket *orig)
+{
+    struct qht_bucket *b = head->next;
+
+    for (;;) {
+        if (b->next == orig) {
+            atomic_set(&b->next, orig->next);
+            atomic_set(&orig->next, head->next);
+            atomic_set(&head->next, orig);
+            return;
+        }
+        b = b->next;
+    }
+}
+
+static inline void __qht_bucket_mru_head(struct qht_bucket *b, int pos)
+{
+    uint32_t orig_hash = b->hashes[pos];
+    void *orig_p = b->pointers[pos];
+    int i;
+
+    for (i = 0; i < pos; i++) {
+        atomic_set(&b->hashes[i + 1], b->hashes[i]);
+        atomic_set(&b->pointers[i + 1], b->pointers[i]);
+    }
+    atomic_set(&b->hashes[0], orig_hash);
+    atomic_set(&b->pointers[0], orig_p);
+}
+
+
+/* call with head->lock held */
+static inline void
+__qht_bucket_mru(struct qht_bucket *head, struct qht_bucket *orig, int pos)
+{
+    uint32_t *dest_hash;
+    void **dest_p;
+    void *p;
+    uint32_t hash;
+    int i;
+
+    if (head == orig) {
+        return __qht_bucket_mru_head(head, pos);
+    }
+    for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+        if (i == QHT_BUCKET_ENTRIES - 1) {
+            dest_hash = &orig->hashes[pos];
+            dest_p = &orig->pointers[pos];
+        } else {
+            dest_hash = &head->hashes[i + 1];
+            dest_p = &head->pointers[i + 1];
+        }
+        hash = *dest_hash;
+        p = *dest_p;
+
+        atomic_set(dest_hash, head->hashes[i]);
+        atomic_set(dest_p, head->pointers[i]);
+
+        atomic_set(&head->hashes[i], hash);
+        atomic_set(&head->pointers[i], p);
+    }
+    if (head->next != orig) {
+        __qht_move_bucket_to_front(head, orig);
+    }
+}
+
+void qht_bucket_mru(struct qht_bucket *head, struct qht_bucket *orig,
+                    const void *p, int pos)
+{
+    qemu_spinlock_lock(&head->lock);
+    if (unlikely(orig->pointers[pos] != p)) {
+        /* while we acquired the lock, the bucket was updated, so bail out */
+        goto out;
+    }
+    seqlock_write_begin(&head->sequence);
+    __qht_bucket_mru(head, orig, pos);
+    seqlock_write_end(&head->sequence);
+ out:
+    qemu_spinlock_unlock(&head->lock);
+}
+
+/* call with b->lock held */
+static void __qht_insert(struct qht *ht, struct qht_map *map,
+                         struct qht_bucket *b, void *p, uint32_t hash)
+{
+    struct qht_bucket *head = b;
+    struct qht_bucket *prev = NULL;
+    struct qht_bucket *new = NULL;
+    unsigned int count = 0;
+    int i;
+
+    for (;;) {
+        if (b == NULL) {
+            b = qemu_memalign(QEMU_CACHELINE, sizeof(*b));
+            memset(b, 0, sizeof(*b));
+            new = b;
+        }
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->hashes[i]) {
+                continue;
+            }
+            /* found an empty key: acquire the seqlock and write */
+            seqlock_write_begin(&head->sequence);
+            if (new) {
+                /*
+                 * This barrier is paired with smp_rmb() after reading
+                 * b->next when not holding b->lock.
+                 */
+                smp_wmb();
+                atomic_set(&prev->next, b);
+            }
+            atomic_set(&b->hashes[i], hash);
+            atomic_set(&b->pointers[i], p);
+            if ((ht->mode & QHT_MODE_MRU_INSERT) && count) {
+                __qht_bucket_mru(head, b, i);
+            }
+            seqlock_write_end(&head->sequence);
+            map->n_items++;
+            return;
+        }
+        prev = b;
+        b = b->next;
+        count++;
+    }
+}
+
+/* call with an external lock held */
+void qht_insert(struct qht *ht, void *p, uint32_t hash)
+{
+    struct qht_map *map = ht->map;
+    struct qht_bucket *b = qht_map_to_bucket(map, hash);
+
+    qemu_spinlock_lock(&b->lock);
+    __qht_insert(ht, map, b, p, hash);
+    qemu_spinlock_unlock(&b->lock);
+
+    if ((ht->mode & QHT_MODE_AUTO_RESIZE) &&
+        unlikely(map->n_items > map->n_items_threshold)) {
+        qht_grow(ht);
+    }
+}
+
+/* call with b->lock held */
+static inline bool __qht_remove(struct qht_map *map, struct qht_bucket *b,
+                                const void *p, uint32_t hash)
+{
+    int i;
+
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->hashes[i] == hash && b->pointers[i] == p) {
+                atomic_set(&b->hashes[i], 0);
+                atomic_set(&b->pointers[i], NULL);
+                map->n_items--;
+                return true;
+            }
+        }
+        b = b->next;
+    } while (b);
+    return false;
+}
+
+/* call with an external lock held */
+bool qht_remove(struct qht *ht, const void *p, uint32_t hash)
+{
+    struct qht_map *map = ht->map;
+    struct qht_bucket *b = qht_map_to_bucket(map, hash);
+    bool ret;
+
+    qemu_spinlock_lock(&b->lock);
+    seqlock_write_begin(&b->sequence);
+    ret = __qht_remove(map, b, p, hash);
+    seqlock_write_end(&b->sequence);
+    qemu_spinlock_unlock(&b->lock);
+    return ret;
+}
+
+/*
+ * acquire all spinlocks from a map, so that writers cannot race with
+ * readers-turned-writers.
+ */
+static void qht_lock(struct qht *ht)
+{
+    struct qht_map *map = ht->map;
+    uint64_t i;
+
+    /* if readers cannot upgrade, do nothing */
+    if (!(ht->mode & QHT_MODE_MRU_LOOKUP)) {
+        return;
+    }
+    for (i = 0; i < map->n; i++) {
+        struct qht_bucket *b = &map->buckets[i];
+
+        qemu_spinlock_lock(&b->lock);
+    }
+}
+
+static void qht_unlock(struct qht *ht)
+{
+    struct qht_map *map = ht->map;
+    uint64_t i;
+
+    if (!(ht->mode & QHT_MODE_MRU_LOOKUP)) {
+        return;
+    }
+    for (i = 0; i < map->n; i++) {
+        struct qht_bucket *b = &map->buckets[i];
+
+        qemu_spinlock_unlock(&b->lock);
+    }
+}
+
+/* external lock + all of the map's locks held (if !MRU_LOOKUP) */
+static inline void __qht_map_iter(struct qht *ht, struct qht_map *map,
+                                  qht_iter_func_t func, void *userp)
+{
+    uint64_t i;
+
+    for (i = 0; i < map->n; i++) {
+        struct qht_bucket *b = &map->buckets[i];
+        int j;
+
+        do {
+            for (j = 0; j < QHT_BUCKET_ENTRIES; j++) {
+                if (b->hashes[j]) {
+                    func(ht, b->pointers[j], b->hashes[j], userp);
+                }
+            }
+            b = b->next;
+        } while (b);
+    }
+}
+
+/* call with an external lock held */
+void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp)
+{
+    qht_lock(ht);
+    __qht_map_iter(ht, ht->map, func, userp);
+    qht_unlock(ht);
+}
+
+static void qht_map_copy(struct qht *ht, void *p, uint32_t hash, void *userp)
+{
+    struct qht_map *new = userp;
+    struct qht_bucket *b = qht_map_to_bucket(new, hash);
+
+    /* no need to acquire b->lock because no thread has seen this map yet */
+    __qht_insert(ht, new, b, p, hash);
+}
+
+/* call with an external lock held */
+void qht_grow(struct qht *ht)
+{
+    struct qht_map *old = ht->map;
+    struct qht_map *new;
+    uint64_t n = old->n * 2;
+
+    if (unlikely(n > (1ULL << 32))) {
+        return;
+    }
+    new = qht_map_create(n);
+    qht_iter(ht, qht_map_copy, new);
+
+    qht_publish(ht, new);
+    call_rcu1(&old->rcu, qht_map_reclaim);
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [Qemu-devel] [PATCH 09/10] qht: add test program
  2016-04-05  5:30 [Qemu-devel] [PATCH 00/10] tb hash improvements Emilio G. Cota
                   ` (7 preceding siblings ...)
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 08/10] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
@ 2016-04-05  5:30 ` Emilio G. Cota
  2016-04-08 10:45   ` Alex Bennée
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 10/10] tb hash: track translated blocks with qht Emilio G. Cota
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-05  5:30 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Peter Crosthwaite, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée, Richard Henderson

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tests/.gitignore |   1 +
 tests/Makefile   |   6 +++-
 tests/test-qht.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 106 insertions(+), 1 deletion(-)
 create mode 100644 tests/test-qht.c

diff --git a/tests/.gitignore b/tests/.gitignore
index b7bf13e..d6d0700 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -47,6 +47,7 @@ test-qapi-visit.[ch]
 test-qdev-global-props
 test-qemu-opts
 test-qga
+test-qht
 test-qmp-commands
 test-qmp-commands.h
 test-qmp-event
diff --git a/tests/Makefile b/tests/Makefile
index 45b9048..b5a99d6 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -68,6 +68,8 @@ check-unit-y += tests/rcutorture$(EXESUF)
 gcov-files-rcutorture-y = util/rcu.c
 check-unit-y += tests/test-rcu-list$(EXESUF)
 gcov-files-test-rcu-list-y = util/rcu.c
+check-unit-y += tests/test-qht$(EXESUF)
+gcov-files-test-qht-y = util/qht.c
 check-unit-y += tests/test-bitops$(EXESUF)
 check-unit-$(CONFIG_HAS_GLIB_SUBPROCESS_TESTS) += tests/test-qdev-global-props$(EXESUF)
 check-unit-y += tests/check-qom-interface$(EXESUF)
@@ -387,7 +389,8 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
 	tests/test-qmp-commands.o tests/test-visitor-serialization.o \
 	tests/test-x86-cpuid.o tests/test-mul64.o tests/test-int128.o \
 	tests/test-opts-visitor.o tests/test-qmp-event.o \
-	tests/rcutorture.o tests/test-rcu-list.o
+	tests/rcutorture.o tests/test-rcu-list.o \
+	tests/test-qht.o
 
 $(test-obj-y): QEMU_INCLUDES += -Itests
 QEMU_CFLAGS += -I$(SRC_PATH)/tests
@@ -425,6 +428,7 @@ tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o
 tests/test-int128$(EXESUF): tests/test-int128.o
 tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
 tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
+tests/test-qht$(EXESUF): tests/test-qht.o $(test-util-obj-y)
 
 tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
 	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
diff --git a/tests/test-qht.c b/tests/test-qht.c
new file mode 100644
index 0000000..f553fbc
--- /dev/null
+++ b/tests/test-qht.c
@@ -0,0 +1,100 @@
+#include "qemu/osdep.h"
+#include "qemu/xxhash.h"
+#include "qemu/qht.h"
+
+#define N 5000
+#define SEED 1
+
+static struct qht ht;
+static int32_t arr[N];
+
+static bool is_equal(const void *obj, const void *userp)
+{
+    const int32_t *a = obj;
+    const int32_t *b = userp;
+
+    return *a == *b;
+}
+
+static void insert(int a, int b)
+{
+    int i;
+
+    for (i = a; i < b; i++) {
+        uint32_t hash;
+
+        arr[i] = i;
+        hash = qemu_xxh32((uint32_t *)&arr[i], 1, SEED);
+
+        qht_insert(&ht, &arr[i], hash);
+    }
+}
+
+static void rm(int init, int end)
+{
+    int i;
+
+    for (i = init; i < end; i++) {
+        uint32_t hash;
+
+        hash = qemu_xxh32((uint32_t *)&arr[i], 1, SEED);
+        assert(qht_remove(&ht, &arr[i], hash));
+    }
+}
+
+static void check(int a, int b, bool expected)
+{
+    int i;
+
+    for (i = a; i < b; i++) {
+        void *p;
+        uint32_t hash;
+        int32_t val;
+
+        val = i;
+        hash = qemu_xxh32((uint32_t *)&val, 1, SEED);
+        p = qht_lookup(&ht, is_equal, &val, hash);
+        assert(!!p == expected);
+    }
+}
+
+static void count_func(struct qht *ht, void *p, uint32_t hash, void *userp)
+{
+    unsigned int *curr = userp;
+
+    (*curr)++;
+}
+
+static void iter_check(unsigned int count)
+{
+    unsigned int curr = 0;
+
+    qht_iter(&ht, count_func, &curr);
+    assert(curr == count);
+}
+
+static void qht_test(unsigned int mode)
+{
+    qht_init(&ht, 0, mode);
+
+    insert(0, N);
+    check(0, N, true);
+    check(-N, -1, false);
+    iter_check(N);
+    rm(1, 2);
+    qht_reset_size(&ht, 0);
+    check(0, N, false);
+
+    qht_destroy(&ht);
+}
+
+int main(int argc, char *argv[])
+{
+    qht_test(0);
+    qht_test(QHT_MODE_MRU_LOOKUP);
+    qht_test(QHT_MODE_MRU_LOOKUP | QHT_MODE_MRU_INSERT);
+    qht_test(QHT_MODE_MRU_LOOKUP | QHT_MODE_MRU_INSERT | QHT_MODE_AUTO_RESIZE);
+    qht_test(QHT_MODE_AUTO_RESIZE | QHT_MODE_MRU_INSERT);
+    qht_test(QHT_MODE_MRU_LOOKUP | QHT_MODE_MRU_INSERT);
+    return 0;
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [Qemu-devel] [PATCH 10/10] tb hash: track translated blocks with qht
  2016-04-05  5:30 [Qemu-devel] [PATCH 00/10] tb hash improvements Emilio G. Cota
                   ` (8 preceding siblings ...)
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 09/10] qht: add test program Emilio G. Cota
@ 2016-04-05  5:30 ` Emilio G. Cota
  2016-04-08 12:39   ` Alex Bennée
  2016-04-05  8:47 ` [Qemu-devel] [PATCH 00/10] tb hash improvements Alex Bennée
  2016-04-05  9:01 ` Paolo Bonzini
  11 siblings, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-05  5:30 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Peter Crosthwaite, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée, Richard Henderson

Having a fixed-size hash table for keeping track of all translation blocks
is suboptimal: some workloads are just too big or too small to get maximum
performance from the hash table. The MRU promotion policy helps improve
performance when the hash table is a little undersized, but it cannot
make up for severely undersized hash tables.

Furthermore, frequent MRU promotions result in writes that are a scalability
bottleneck. For scalability, lookups should only perform reads, not writes.
This is not a big deal for now, but it will become one once MTTCG matures.

The appended fixes these issues by using qht as the implementation of
the TB hash table. This solution is superior to other alternatives considered,
namely:

- master: implementation in QEMU before this patchset
- xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
- xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
              MRU is implemented here by adding an intermediate struct
              that contains the u32 hash and a pointer to the TB; this
              allows us, on an MRU promotion, to copy said struct (that is not
              at the head), and put this new copy at the head. After a grace
              period, the original non-head struct can be eliminated, and
              after another grace period, freed.
- qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
                   no MRU for lookups; MRU for inserts.
The appended solution is the following:
- qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
                 no MRU for lookups; MRU for inserts.

The plots below compare the considered solutions. The Y axis shows the
boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
sweeps the number of buckets (or initial number of buckets for qht-autoresize).
The plots in PNG format (and with errorbars) can be seen here:
  http://imgur.com/a/Awgnq

Each test runs 5 times, and the entire QEMU process is pinned to a
single core for repeatability of results.

                            Host: Intel Xeon E5-2690

  28 ++------------+-------------+-------------+-------------+------------++
     A*****        +             +             +             master **A*** +
  27 ++    *                                                 xxhash ##B###++
     |      A******A******                               xxhash-rcu $$C$$$ |
  26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
     D%%$$                              A******A******A*qht-dyn-mru A*E****A
  25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
     B#####%                                                               |
  24 ++    #C$$$$$                                                        ++
     |      B###  $                                                        |
     |          ## C$$$$$$                                                 |
  23 ++           #       C$$$$$$                                         ++
     |             B######       C$$$$$$                                %%%D
  22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
     |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
  21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
     +             E@@@   F&&&   +      E@     +      F&&&   +             +
  20 ++------------+-------------+-------------+-------------+------------++
     14            16            18            20            22            24
                             log2 number of buckets

                                 Host: Intel i7-4790K

  14.5 ++------------+------------+-------------+------------+------------++
       A**           +            +             +            master **A*** +
    14 ++ **                                                 xxhash ##B###++
  13.5 ++   **                                           xxhash-rcu $$C$$$++
       |                                            qht-fixed-nomru %%D%%% |
    13 ++     A******                                   qht-dyn-mru @@E@@@++
       |             A*****A******A******             qht-dyn-nomru &&F&&& |
  12.5 C$$                               A******A******A*****A******    ***A
    12 ++ $$                                                        A***  ++
       D%%% $$                                                             |
  11.5 ++  %%                                                             ++
       B###  %C$$$$$$                                                      |
    11 ++  ## D%%%%% C$$$$$                                               ++
       |     #      %      C$$$$$$                                         |
  10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
    10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
       +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
   9.5 ++------------+------------+-------------+------------+------------++
       14            16           18            20           22            24
                              log2 number of buckets

Note that the original point before this patch series is X=15 for "master";
the little sensitivity to the increased number of buckets is due to the
poor hashing function in master.

xxhash-rcu has significant overhead due to the constant churn of allocating
and deallocating intermediate structs for implementing MRU. An alternative
would be do consider failed lookups as "maybe not there", and then
acquire the external lock (tb_lock in this case) to really confirm that
there was indeed a failed lookup. This, however, would not be enough
to implement dynamic resizing--this is more complex: see
"Resizable, Scalable, Concurrent Hash Tables via Relativistic
Programming" by Triplett, McKenney and Walpole. This solution was
discarded due to the very coarse RCU read critical sections that we have
in MTTCG; resizing requires waiting for readers after every pointer update,
and resizes require many pointer updates, so this would quickly become
prohibitive.

qht-fixed-nomru shows that MRU promotion is advisable for undersized
hash tables.

However, qht-dyn-mru shows that MRU promotion is not important if the
hash table is properly sized: there is virtually no difference in
performance between qht-dyn-nomru and qht-dyn-mru.

Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
can achieve with optimum sizing of the hash table, while keeping the hash
table scalable for readers.

The improvement we get before and after this patch for booting debian jessie
with arm-softmmu is:

- Intel Xeon E5-2690: 10.5% less time
- Intel i7-4790K: 5.2% less time

We could get this same improvement _for this particular workload_ by
statically increasing the size of the hash table. But this would hurt
workloads that do not need a large hash table. The dynamic (upward)
resizing allows us to start small and enlarge the hash table as needed.

A quick note on downsizing: the table is resized back to 2**15 buckets
on every tb_flush; this makes sense because it is not guaranteed that the
table will reach the same number of TBs later on (e.g. most bootup code is
thrown away after boot); it makes sense to grow the hash table as
more code blocks are translated. This also avoids the complication of
having to build downsizing hysteresis logic into qht.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c              | 82 ++++++++++++++++++++++++-----------------------
 include/exec/exec-all.h |  9 +++---
 include/exec/tb-hash.h  |  5 +--
 translate-all.c         | 84 ++++++++++++++++++++++---------------------------
 4 files changed, 85 insertions(+), 95 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 4194a4a..e948ffd 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -217,55 +217,59 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
     tb_free(tb);
 }
 
+struct tb_desc {
+    target_ulong pc;
+    target_ulong cs_base;
+    uint64_t flags;
+    tb_page_addr_t phys_page1;
+    CPUArchState *env;
+};
+
+static bool tb_cmp(const void *p, const void *d)
+{
+    const TranslationBlock *tb = p;
+    const struct tb_desc *desc = d;
+
+    if (tb->pc == desc->pc &&
+        tb->page_addr[0] == desc->phys_page1 &&
+        tb->cs_base == desc->cs_base &&
+        tb->flags == desc->flags) {
+        /* check next page if needed */
+        if (tb->page_addr[1] == -1) {
+            return true;
+        } else {
+            tb_page_addr_t phys_page2;
+            target_ulong virt_page2;
+
+            virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
+            phys_page2 = get_page_addr_code(desc->env, virt_page2);
+            if (tb->page_addr[1] == phys_page2) {
+                return true;
+            }
+        }
+    }
+    return false;
+}
+
 static TranslationBlock *tb_find_physical(CPUState *cpu,
                                           target_ulong pc,
                                           target_ulong cs_base,
                                           uint64_t flags)
 {
-    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
-    TranslationBlock *tb, **ptb1;
     unsigned int h;
-    tb_page_addr_t phys_pc, phys_page1;
-    target_ulong virt_page2;
+    tb_page_addr_t phys_pc;
+    struct tb_desc desc;
 
     tcg_ctx.tb_ctx.tb_invalidated_flag = 0;
 
-    /* find translated block using physical mappings */
-    phys_pc = get_page_addr_code(env, pc);
-    phys_page1 = phys_pc & TARGET_PAGE_MASK;
+    desc.env = (CPUArchState *)cpu->env_ptr;
+    desc.cs_base = cs_base;
+    desc.flags = flags;
+    desc.pc = pc;
+    phys_pc = get_page_addr_code(desc.env, pc);
+    desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
     h = tb_hash_func(phys_pc, pc, flags);
-    ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
-    for(;;) {
-        tb = *ptb1;
-        if (!tb) {
-            return NULL;
-        }
-        if (tb->pc == pc &&
-            tb->page_addr[0] == phys_page1 &&
-            tb->cs_base == cs_base &&
-            tb->flags == flags) {
-            /* check next page if needed */
-            if (tb->page_addr[1] != -1) {
-                tb_page_addr_t phys_page2;
-
-                virt_page2 = (pc & TARGET_PAGE_MASK) +
-                    TARGET_PAGE_SIZE;
-                phys_page2 = get_page_addr_code(env, virt_page2);
-                if (tb->page_addr[1] == phys_page2) {
-                    break;
-                }
-            } else {
-                break;
-            }
-        }
-        ptb1 = &tb->phys_hash_next;
-    }
-
-    /* Move the TB to the head of the list */
-    *ptb1 = tb->phys_hash_next;
-    tb->phys_hash_next = tcg_ctx.tb_ctx.tb_phys_hash[h];
-    tcg_ctx.tb_ctx.tb_phys_hash[h] = tb;
-    return tb;
+    return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
 }
 
 static TranslationBlock *tb_find_slow(CPUState *cpu,
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 7362095..a894817 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -21,6 +21,7 @@
 #define _EXEC_ALL_H_
 
 #include "qemu-common.h"
+#include "qemu/qht.h"
 
 /* allow to see translation results - the slowdown should be negligible, so we leave it */
 #define DEBUG_DISAS
@@ -211,8 +212,8 @@ static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
 
 #define CODE_GEN_ALIGN           16 /* must be >= of the size of a icache line */
 
-#define CODE_GEN_PHYS_HASH_BITS     15
-#define CODE_GEN_PHYS_HASH_SIZE     (1 << CODE_GEN_PHYS_HASH_BITS)
+#define CODE_GEN_HTABLE_BITS     15
+#define CODE_GEN_HTABLE_SIZE     (1 << CODE_GEN_HTABLE_BITS)
 
 /* Estimated block size for TB allocation.  */
 /* ??? The following is based on a 2015 survey of x86_64 host output.
@@ -248,8 +249,6 @@ struct TranslationBlock {
 
     void *tc_ptr;    /* pointer to the translated code */
     uint8_t *tc_search;  /* pointer to search data */
-    /* next matching tb for physical address. */
-    struct TranslationBlock *phys_hash_next;
     /* original tb when cflags has CF_NOCACHE */
     struct TranslationBlock *orig_tb;
     /* first and second physical page containing code. The lower bit
@@ -280,7 +279,7 @@ typedef struct TBContext TBContext;
 struct TBContext {
 
     TranslationBlock *tbs;
-    TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
+    struct qht htable;
     int nb_tbs;
     /* any access to the tbs or the page table must use this lock */
     QemuMutex tb_lock;
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index c4942e1..fabc936 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -20,7 +20,6 @@
 #ifndef EXEC_TB_HASH
 #define EXEC_TB_HASH
 
-#include "exec/exec-all.h"
 #include "qemu/xxhash.h"
 
 /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
@@ -54,13 +53,11 @@ uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint64_t flags)
         target_ulong pc;
         uint64_t flags;
     } QEMU_PACKED k;
-    unsigned int hash;
 
     k.phys_pc = phys_pc;
     k.pc = pc;
     k.flags = flags;
-    hash = qemu_xxh32((uint32_t *)&k, sizeof(k) / sizeof(uint32_t), 1);
-    return hash & (CODE_GEN_PHYS_HASH_SIZE - 1);
+    return qemu_xxh32((uint32_t *)&k, sizeof(k) / sizeof(uint32_t), 1);
 }
 
 #endif
diff --git a/translate-all.c b/translate-all.c
index 8e7f9a7..f0d7f1a 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -735,6 +735,13 @@ static inline void code_gen_alloc(size_t tb_size)
     qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
 }
 
+static void tb_htable_init(void)
+{
+    unsigned int mode = QHT_MODE_AUTO_RESIZE | QHT_MODE_MRU_INSERT;
+
+    qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
+}
+
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
    (in bytes) allocated to the translation buffer. Zero means default
    size. */
@@ -742,6 +749,7 @@ void tcg_exec_init(unsigned long tb_size)
 {
     cpu_gen_init();
     page_init();
+    tb_htable_init();
     code_gen_alloc(tb_size);
 #if defined(CONFIG_SOFTMMU)
     /* There's no guest base to take into account, so go ahead and
@@ -843,7 +851,7 @@ void tb_flush(CPUState *cpu)
         memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
     }
 
-    memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
+    qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
     tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
@@ -854,60 +862,45 @@ void tb_flush(CPUState *cpu)
 
 #ifdef DEBUG_TB_CHECK
 
-static void tb_invalidate_check(target_ulong address)
+static void
+__tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 {
-    TranslationBlock *tb;
-    int i;
+    TranslationBlock *tb = p;
+    target_ulong addr = *(target_ulong *)userp;
 
-    address &= TARGET_PAGE_MASK;
-    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
-        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
-             tb = tb->phys_hash_next) {
-            if (!(address + TARGET_PAGE_SIZE <= tb->pc ||
-                  address >= tb->pc + tb->size)) {
-                printf("ERROR invalidate: address=" TARGET_FMT_lx
-                       " PC=%08lx size=%04x\n",
-                       address, (long)tb->pc, tb->size);
-            }
-        }
+    if (!(addr + TARGET_PAGE_SIZE <= tb->pc || addr >= tb->pc + tb->size)) {
+        printf("ERROR invalidate: address=" TARGET_FMT_lx
+               " PC=%08lx size=%04x\n", addr, (long)tb->pc, tb->size);
     }
 }
 
-/* verify that all the pages have correct rights for code */
-static void tb_page_check(void)
+static void tb_invalidate_check(target_ulong address)
 {
-    TranslationBlock *tb;
-    int i, flags1, flags2;
-
-    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
-        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
-                tb = tb->phys_hash_next) {
-            flags1 = page_get_flags(tb->pc);
-            flags2 = page_get_flags(tb->pc + tb->size - 1);
-            if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
-                printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
-                       (long)tb->pc, tb->size, flags1, flags2);
-            }
-        }
-    }
+    address &= TARGET_PAGE_MASK;
+    qht_iter(&tcg_ctx.tb_ctx.htable, __tb_invalidate_check, &address);
 }
 
-#endif
-
-static inline void tb_hash_remove(TranslationBlock **ptb, TranslationBlock *tb)
+static void __tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 {
-    TranslationBlock *tb1;
+    TranslationBlock *tb = p;
+    int flags1, flags2;
 
-    for (;;) {
-        tb1 = *ptb;
-        if (tb1 == tb) {
-            *ptb = tb1->phys_hash_next;
-            break;
-        }
-        ptb = &tb1->phys_hash_next;
+    flags1 = page_get_flags(tb->pc);
+    flags2 = page_get_flags(tb->pc + tb->size - 1);
+    if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
+        printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
+               (long)tb->pc, tb->size, flags1, flags2);
     }
 }
 
+/* verify that all the pages have correct rights for code */
+static void tb_page_check(void)
+{
+    qht_iter(&tcg_ctx.tb_ctx.htable, __tb_page_check, NULL);
+}
+
+#endif
+
 static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
 {
     TranslationBlock *tb1;
@@ -973,7 +966,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
     h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
+    qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
 
     /* remove the TB from the page list */
     if (tb->page_addr[0] != page_addr) {
@@ -1472,13 +1465,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                          tb_page_addr_t phys_page2)
 {
     unsigned int h;
-    TranslationBlock **ptb;
 
     /* add in the hash table */
     h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
-    tb->phys_hash_next = *ptb;
-    *ptb = tb;
+    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
 
     /* add in the page list */
     tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED Emilio G. Cota
@ 2016-04-05  7:57   ` Peter Maydell
  2016-04-05 17:24     ` Emilio G. Cota
  2016-04-05  8:49   ` Paolo Bonzini
  2016-04-05 12:57   ` Lluís Vilanova
  2 siblings, 1 reply; 62+ messages in thread
From: Peter Maydell @ 2016-04-05  7:57 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: MTTCG Devel, Peter Crosthwaite, QEMU Developers, Sergey Fedorov,
	Paolo Bonzini, Alex Bennée, Richard Henderson

On 5 April 2016 at 06:30, Emilio G. Cota <cota@braap.org> wrote:
> I'm assuming windows compilers don't support this attribute.

Our windows compiler is gcc, so I would expect it to support
attribute aligned, and a quick grep through the sources shows
we already use it in several places.

> +#define QEMU_CACHELINE (64)

Why 64? Does anything bad happen if the host's cache line
size turns out to be greater than the value here ?

thanks
-- PMM

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 00/10] tb hash improvements
  2016-04-05  5:30 [Qemu-devel] [PATCH 00/10] tb hash improvements Emilio G. Cota
                   ` (9 preceding siblings ...)
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 10/10] tb hash: track translated blocks with qht Emilio G. Cota
@ 2016-04-05  8:47 ` Alex Bennée
  2016-04-05  9:01 ` Paolo Bonzini
  11 siblings, 0 replies; 62+ messages in thread
From: Alex Bennée @ 2016-04-05  8:47 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> This patchset is derived from my ongoing work on MTTCG, but does
> not depend on it and brings improvements that we can already
> benefit from. It applies cleanly on the current master and
> is checkpatch-clean.
>
> The key goal is to make the TB hash table faster, and while at it,
> scalable. Tested on two different host machines, the execution time
> improvement before and after this series, when booting a debian
> jessie arm image[*] with arm-softmmu, is:
>
> - Intel Xeon E5-2690: 21.2% less time
> - Intel i7-4790K: 23.5% less time
>
> This workload is particularly sensitive to TB hash performance.
> Other workloads not as sensitive might see a slight performance
> degradation with this patchset, since the hashing + lookup
> functions take now more instructions. In any case, no significant
> slowdowns should occur.
>
> The commit logs are sometimes long because I have lots of numbers
> to share.
>
> The only bits I'm not too comfortable with in this series are patches
> 2 and 5; I don't develop on Windows so I'm shooting in the dark there.
>
> Please take a look and if possible, test on workloads you care about!

Excellent stuff, will have a look this week.

>
> Thanks,
>
> 		Emilio
>
> [*] taskset -c 0 arm-softmmu/qemu-system-arm -machine type=virt -nographic \
>      -smp 1 -m 4096 -netdev user,id=unet,hostfwd=tcp::2222-:22 \
>      -device virtio-net-device,netdev=unet \
>      -drive file=jessie-arm32.qcow2,id=myblock,index=0,if=none \
>      -device virtio-blk-device,drive=myblock \
>      -kernel aarch32-current-linux-kernel-only.img \
>      -append 'console=ttyAMA0 root=/dev/vda1' \
>      -name arm,debug-threads=on -smp 1 -tb-size 1024
> The image is taken from:
>   http://people.linaro.org/~alex.bennee/images/jessie-arm32.qcow2
> The image was modified to call `shutdown -h now` right after boot.
> The kernel is taken from:
>   http://people.linaro.org/~alex.bennee/images/aarch32-current-linux-kernel-only.img


--
Alex Bennée

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 01/10] translate-all: add missing fold of tb_ctx into tcg_ctx
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 01/10] translate-all: add missing fold of tb_ctx into tcg_ctx Emilio G. Cota
@ 2016-04-05  8:49   ` Paolo Bonzini
  0 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2016-04-05  8:49 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Sergey Fedorov, Richard Henderson,
	Alex Bennée, Peter Crosthwaite



On 05/04/2016 07:30, Emilio G. Cota wrote:
> Since 5e5f07e08 "TCG: Move translation block variables
> to new context inside tcg_ctx: tb_ctx" on Feb 1 2013, compilation
> of usermode + TB_DEBUG_CHECK has been broken. Fix it.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  translate-all.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/translate-all.c b/translate-all.c
> index b4df1ec..8329ea6 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -861,7 +861,8 @@ static void tb_invalidate_check(target_ulong address)
>  
>      address &= TARGET_PAGE_MASK;
>      for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
> -        for (tb = tb_ctx.tb_phys_hash[i]; tb != NULL; tb = tb->phys_hash_next) {
> +        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
> +             tb = tb->phys_hash_next) {
>              if (!(address + TARGET_PAGE_SIZE <= tb->pc ||
>                    address >= tb->pc + tb->size)) {
>                  printf("ERROR invalidate: address=" TARGET_FMT_lx
> 

Thanks, queued for 2.6.

Paolo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED Emilio G. Cota
  2016-04-05  7:57   ` Peter Maydell
@ 2016-04-05  8:49   ` Paolo Bonzini
  2016-04-05 12:57   ` Lluís Vilanova
  2 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2016-04-05  8:49 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Sergey Fedorov, Richard Henderson,
	Alex Bennée, Peter Crosthwaite



On 05/04/2016 07:30, Emilio G. Cota wrote:
> I'm assuming windows compilers don't support this attribute.

They actually do. :)

Paolo

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/compiler.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
> index 8f1cc7b..fb946f1 100644
> --- a/include/qemu/compiler.h
> +++ b/include/qemu/compiler.h
> @@ -41,6 +41,16 @@
>  # define QEMU_PACKED __attribute__((packed))
>  #endif
>  
> +#define QEMU_CACHELINE (64)
> +
> +#if defined(_WIN32)
> +#define QEMU_ALIGN(B)
> +#else
> +#define QEMU_ALIGN(B) __attribute__((aligned(B)))
> +#endif
> +
> +#define QEMU_CACHELINE_ALIGNED QEMU_ALIGN(QEMU_CACHELINE)
> +
>  #ifndef glue
>  #define xglue(x, y) x ## y
>  #define glue(x, y) xglue(x, y)
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 05/10] include: add spinlock wrapper
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 05/10] include: add spinlock wrapper Emilio G. Cota
@ 2016-04-05  8:51   ` Paolo Bonzini
  2016-04-06 15:51     ` Alex Bennée
  0 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2016-04-05  8:51 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Sergey Fedorov, Richard Henderson,
	Alex Bennée, Peter Crosthwaite



On 05/04/2016 07:30, Emilio G. Cota wrote:
> Wrap pthread_spin on POSIX, or QemuMutex on Windows.
> 
> AFAIK there are is no off-the-shelf spinlock implementation for
> Windows, so we'll just use QemuMutex.

It's much simpler to use a simple test-and-set spinlock.

GitHub is down, but this should be the link
http://github.com/bonzini/qemu/commit/e1361634 (it's in my mttcg branch).

Paolo

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/spinlock-posix.h | 60 +++++++++++++++++++++++++++++++++++++++++++
>  include/qemu/spinlock-win32.h | 33 ++++++++++++++++++++++++
>  include/qemu/spinlock.h       | 10 ++++++++
>  3 files changed, 103 insertions(+)
>  create mode 100644 include/qemu/spinlock-posix.h
>  create mode 100644 include/qemu/spinlock-win32.h
>  create mode 100644 include/qemu/spinlock.h
> 
> diff --git a/include/qemu/spinlock-posix.h b/include/qemu/spinlock-posix.h
> new file mode 100644
> index 0000000..51c2c08
> --- /dev/null
> +++ b/include/qemu/spinlock-posix.h
> @@ -0,0 +1,60 @@
> +#ifndef QEMU_SPINLOCK_POSIX_H
> +#define QEMU_SPINLOCK_POSIX_H
> +
> +#include <qemu/thread.h>
> +#include <qemu/osdep.h>
> +
> +typedef pthread_spinlock_t QemuSpinLock;
> +
> +static inline void qemu_spinlock_error_exit(int err, const char *msg)
> +{
> +    fprintf(stderr, "qemu: %s: %s\n", msg, strerror(err));
> +    abort();
> +}
> +
> +static inline void qemu_spinlock_init(QemuSpinLock *lock)
> +{
> +    int rc;
> +
> +    rc = pthread_spin_init(lock, PTHREAD_PROCESS_SHARED);
> +    if (unlikely(rc)) {
> +        qemu_spinlock_error_exit(rc, __func__);
> +    }
> +}
> +
> +static inline void qemu_spinlock_destroy(QemuSpinLock *lock)
> +{
> +    int rc;
> +
> +    rc = pthread_spin_destroy(lock);
> +    if (unlikely(rc)) {
> +        qemu_spinlock_error_exit(rc, __func__);
> +    }
> +}
> +
> +static inline void qemu_spinlock_lock(QemuSpinLock *lock)
> +{
> +    int rc;
> +
> +    rc = pthread_spin_lock(lock);
> +    if (unlikely(rc)) {
> +        qemu_spinlock_error_exit(rc, __func__);
> +    }
> +}
> +
> +static inline int qemu_spinlock_trylock(QemuSpinLock *lock)
> +{
> +    return pthread_spin_trylock(lock);
> +}
> +
> +static inline void qemu_spinlock_unlock(QemuSpinLock *lock)
> +{
> +    int rc;
> +
> +    rc = pthread_spin_unlock(lock);
> +    if (unlikely(rc)) {
> +        qemu_spinlock_error_exit(rc, __func__);
> +    }
> +}
> +
> +#endif /* QEMU_SPINLOCK_POSIX_H */
> diff --git a/include/qemu/spinlock-win32.h b/include/qemu/spinlock-win32.h
> new file mode 100644
> index 0000000..5a105fb
> --- /dev/null
> +++ b/include/qemu/spinlock-win32.h
> @@ -0,0 +1,33 @@
> +#ifndef QEMU_SPINLOCK_WIN32_H
> +#define QEMU_SPINLOCK_WIN32_H
> +
> +#include <qemu/thread.h>
> +
> +typedef QemuMutex QemuSpinLock;
> +
> +static inline void qemu_spinlock_init(QemuSpinLock *lock)
> +{
> +    qemu_mutex_init(lock);
> +}
> +
> +static inline void qemu_spinlock_destroy(QemuSpinLock *lock)
> +{
> +    qemu_mutex_destroy(lock);
> +}
> +
> +static inline void qemu_spinlock_lock(QemuSpinLock *lock)
> +{
> +    qemu_mutex_lock(lock);
> +}
> +
> +static inline int qemu_spinlock_trylock(QemuSpinLock *lock)
> +{
> +    return qemu_mutex_trylock(lock);
> +}
> +
> +static inline void qemu_spinlock_unlock(QemuSpinLock *lock)
> +{
> +    qemu_mutex_unlock(lock);
> +}
> +
> +#endif /* QEMU_SPINLOCK_WIN32_H */
> diff --git a/include/qemu/spinlock.h b/include/qemu/spinlock.h
> new file mode 100644
> index 0000000..001b55e
> --- /dev/null
> +++ b/include/qemu/spinlock.h
> @@ -0,0 +1,10 @@
> +#ifndef QEMU_SPINLOCK_H
> +#define QEMU_SPINLOCK_H
> +
> +#ifdef _WIN32
> +#include "spinlock-win32.h"
> +#else
> +#include "spinlock-posix.h"
> +#endif
> +
> +#endif /* QEMU_SPINLOCK_H */
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] qht: QEMU's fast, resizable and scalable Hash Table
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 08/10] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
@ 2016-04-05  9:01   ` Paolo Bonzini
  2016-04-05 15:50   ` Richard Henderson
  2016-04-08 10:27   ` Alex Bennée
  2 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2016-04-05  9:01 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Sergey Fedorov, Richard Henderson,
	Alex Bennée, Peter Crosthwaite



On 05/04/2016 07:30, Emilio G. Cota wrote:
> +static void qht_bucket_reset(struct qht_bucket *b)
> +{
> +    qemu_spinlock_lock(&b->lock);
> +    seqlock_write_begin(&b->sequence);
> +    __qht_bucket_reset(b);

No __ names, please use names like qht_bucket_reset_locked.
"_locked" doesn't make much sense for seqlock read-side critical
sections, I think qht_do_lookup is good enough for that case.

Otherwise, this is great stuff. :)

Paolo

> +    seqlock_write_end(&b->sequence);
> +    qemu_spinlock_unlock(&b->lock);
> +}

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 00/10] tb hash improvements
  2016-04-05  5:30 [Qemu-devel] [PATCH 00/10] tb hash improvements Emilio G. Cota
                   ` (10 preceding siblings ...)
  2016-04-05  8:47 ` [Qemu-devel] [PATCH 00/10] tb hash improvements Alex Bennée
@ 2016-04-05  9:01 ` Paolo Bonzini
  11 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2016-04-05  9:01 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Sergey Fedorov, Richard Henderson,
	Alex Bennée, Peter Crosthwaite



On 05/04/2016 07:30, Emilio G. Cota wrote:
> This patchset is derived from my ongoing work on MTTCG, but does
> not depend on it and brings improvements that we can already
> benefit from. It applies cleanly on the current master and
> is checkpatch-clean.
> 
> The key goal is to make the TB hash table faster, and while at it,
> scalable. Tested on two different host machines, the execution time
> improvement before and after this series, when booting a debian
> jessie arm image[*] with arm-softmmu, is:
> 
> - Intel Xeon E5-2690: 21.2% less time
> - Intel i7-4790K: 23.5% less time
> 
> This workload is particularly sensitive to TB hash performance.
> Other workloads not as sensitive might see a slight performance
> degradation with this patchset, since the hashing + lookup
> functions take now more instructions. In any case, no significant
> slowdowns should occur.
> 
> The commit logs are sometimes long because I have lots of numbers
> to share.
> 
> The only bits I'm not too comfortable with in this series are patches
> 2 and 5; I don't develop on Windows so I'm shooting in the dark there.
> 
> Please take a look and if possible, test on workloads you care about!

That's great stuff.  It will have to wait for 2.7, but it's really good.

Paolo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED Emilio G. Cota
  2016-04-05  7:57   ` Peter Maydell
  2016-04-05  8:49   ` Paolo Bonzini
@ 2016-04-05 12:57   ` Lluís Vilanova
  2016-04-05 12:58     ` Peter Maydell
  2 siblings, 1 reply; 62+ messages in thread
From: Lluís Vilanova @ 2016-04-05 12:57 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Paolo Bonzini, Sergey Fedorov, Alex Bennée,
	Richard Henderson

Emilio G Cota writes:

> I'm assuming windows compilers don't support this attribute.
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/compiler.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)

> diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
> index 8f1cc7b..fb946f1 100644
> --- a/include/qemu/compiler.h
> +++ b/include/qemu/compiler.h
> @@ -41,6 +41,16 @@
>  # define QEMU_PACKED __attribute__((packed))
>  #endif
 
> +#define QEMU_CACHELINE (64)
[...]

You should make this a value taken from configure. In linux you could use
something like:

  getconf LEVEL1_DCACHE_LINESIZE

Or an equivalent program using sysconf(3) if getconf is not present.


Cheers,
  Lluis

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05 12:57   ` Lluís Vilanova
@ 2016-04-05 12:58     ` Peter Maydell
  2016-04-05 15:29       ` Paolo Bonzini
  2016-04-05 16:23       ` Lluís Vilanova
  0 siblings, 2 replies; 62+ messages in thread
From: Peter Maydell @ 2016-04-05 12:58 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel, Peter Maydell,
	Peter Crosthwaite, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée, Richard Henderson

On 5 April 2016 at 13:57, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> Emilio G Cota writes:
>
>> I'm assuming windows compilers don't support this attribute.
>> Signed-off-by: Emilio G. Cota <cota@braap.org>
>> ---
>>  include/qemu/compiler.h | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>
>> diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
>> index 8f1cc7b..fb946f1 100644
>> --- a/include/qemu/compiler.h
>> +++ b/include/qemu/compiler.h
>> @@ -41,6 +41,16 @@
>>  # define QEMU_PACKED __attribute__((packed))
>>  #endif
>
>> +#define QEMU_CACHELINE (64)
> [...]
>
> You should make this a value taken from configure. In linux you could use
> something like:
>
>   getconf LEVEL1_DCACHE_LINESIZE
>
> Or an equivalent program using sysconf(3) if getconf is not present.

The cache line size at compile time is not necessarily the cache
line size at runtime...

thanks
-- PMM

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05 12:58     ` Peter Maydell
@ 2016-04-05 15:29       ` Paolo Bonzini
  2016-04-05 16:23       ` Lluís Vilanova
  1 sibling, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2016-04-05 15:29 UTC (permalink / raw)
  To: Peter Maydell, Emilio G. Cota, QEMU Developers, MTTCG Devel,
	Peter Crosthwaite, Sergey Fedorov, Alex Bennée,
	Richard Henderson



On 05/04/2016 14:58, Peter Maydell wrote:
>> >
>> > You should make this a value taken from configure. In linux you could use
>> > something like:
>> >
>> >   getconf LEVEL1_DCACHE_LINESIZE
>> >
>> > Or an equivalent program using sysconf(3) if getconf is not present.
> The cache line size at compile time is not necessarily the cache
> line size at runtime...

But the size of data structures cannot change at run-time.  A bigger
worry is cross-compilation.

Linux has these defaults:

	Alpha	32 or 64
	ARM	configurable, 64 for ARMv7
	AArch64	128
	PPC     128
	s390	256
	x86	configurable, default 64

which should be easy to copy in QEMU too.

Paolo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
@ 2016-04-05 15:41   ` Richard Henderson
  2016-04-05 15:48     ` Paolo Bonzini
  2016-04-05 16:33     ` Laurent Desnogues
  0 siblings, 2 replies; 62+ messages in thread
From: Richard Henderson @ 2016-04-05 15:41 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Paolo Bonzini, Sergey Fedorov, Alex Bennée,
	Peter Crosthwaite

On 04/04/2016 10:30 PM, Emilio G. Cota wrote:
> Tests show that the other element checked for in tb_find_physical,
> cs_base, is always a match when tb_phys+pc+flags are a match,
> so hashing cs_base is wasteful. It could be that this is an ARM-only
> thing, though.

The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB 
consisting of only a delay slot).

It may well still turn out to be reasonable to ignore cs_base for hashing.

> +static inline
> +uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint64_t flags)
>   {
> -    return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
> +    struct key {
> +        tb_page_addr_t phys_pc;
> +        target_ulong pc;
> +        uint64_t flags;

The flags field should be uint32_t, not uint64_t.
See its definition in TranslationBlock.

> +    } QEMU_PACKED k;
> +    unsigned int hash;
> +
> +    k.phys_pc = phys_pc;
> +    k.pc = pc;
> +    k.flags = flags;
> +    hash = qemu_xxh32((uint32_t *)&k, sizeof(k) / sizeof(uint32_t), 1);

I'm less than happy about putting data into a struct and hashing that memory. 
Why don't you just take the ideas from xxh32/xxh64 and perform the hash 
directly right here?

In particular, you've 3 data elements here, a maximum of 5 32-bit words, thus 
the loop within xxh32 will never roll, so it's mostly dead code.


r~

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-05 15:41   ` Richard Henderson
@ 2016-04-05 15:48     ` Paolo Bonzini
  2016-04-05 16:07       ` Richard Henderson
  2016-04-05 16:33     ` Laurent Desnogues
  1 sibling, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2016-04-05 15:48 UTC (permalink / raw)
  To: Richard Henderson, Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Sergey Fedorov, Alex Bennée, Peter Crosthwaite



On 05/04/2016 17:41, Richard Henderson wrote:
> 
>> +static inline
>> +uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc,
>> uint64_t flags)
>>   {
>> -    return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
>> +    struct key {
>> +        tb_page_addr_t phys_pc;
>> +        target_ulong pc;
>> +        uint64_t flags;
> 
> The flags field should be uint32_t, not uint64_t.
> See its definition in TranslationBlock.
> 
>> +    } QEMU_PACKED k;
>> +    unsigned int hash;
>> +
>> +    k.phys_pc = phys_pc;
>> +    k.pc = pc;
>> +    k.flags = flags;
>> +    hash = qemu_xxh32((uint32_t *)&k, sizeof(k) / sizeof(uint32_t), 1);
> 
> I'm less than happy about putting data into a struct and hashing that
> memory. Why don't you just take the ideas from xxh32/xxh64 and perform
> the hash directly right here?
> 
> In particular, you've 3 data elements here, a maximum of 5 32-bit words,
> thus the loop within xxh32 will never roll, so it's mostly dead code.

I think it's fine to use the struct.  The exact size of the struct
varies from 3 to 5 32-bit words, so it's hard to write nice
size-dependent code for the hash.

Paolo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] qht: QEMU's fast, resizable and scalable Hash Table
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 08/10] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
  2016-04-05  9:01   ` Paolo Bonzini
@ 2016-04-05 15:50   ` Richard Henderson
  2016-04-08 10:27   ` Alex Bennée
  2 siblings, 0 replies; 62+ messages in thread
From: Richard Henderson @ 2016-04-05 15:50 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Paolo Bonzini, Sergey Fedorov, Alex Bennée,
	Peter Crosthwaite

On 04/04/2016 10:30 PM, Emilio G. Cota wrote:
> +struct qht_map {
> +    struct qht_bucket *buckets;
> +    uint64_t n;
> +    uint64_t n_items;
> +    uint64_t n_items_threshold;
> +    struct rcu_head rcu;
> +};

There's no point in using 64-bit data for a 32-bit host.
You should be using e.g. size_t for the counts.


r~

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-05 15:48     ` Paolo Bonzini
@ 2016-04-05 16:07       ` Richard Henderson
  2016-04-05 19:40         ` Emilio G. Cota
  0 siblings, 1 reply; 62+ messages in thread
From: Richard Henderson @ 2016-04-05 16:07 UTC (permalink / raw)
  To: Paolo Bonzini, Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Sergey Fedorov, Alex Bennée, Peter Crosthwaite

On 04/05/2016 08:48 AM, Paolo Bonzini wrote:
> I think it's fine to use the struct.  The exact size of the struct
> varies from 3 to 5 32-bit words, so it's hard to write nice
> size-dependent code for the hash.

I don't think it is.  We have 3 integers.  It is trivial to create a simple 
function of 2 multiplies, two adds, and a remainder.

Take the primes from the xxhash.h, for example:

   (phys_pc * PRIME32_2 + pc * PRIME32_3 + flags)
   % PRIME32_1
   & (CODE_GEN_PHYS_HASH_SIZE - 1)

Obviously, some bucket measurements should be taken, but I can well imagine 
that this might perform just as well as the fully generic hasher.


r~

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05 12:58     ` Peter Maydell
  2016-04-05 15:29       ` Paolo Bonzini
@ 2016-04-05 16:23       ` Lluís Vilanova
  2016-04-05 16:31         ` Richard Henderson
  1 sibling, 1 reply; 62+ messages in thread
From: Lluís Vilanova @ 2016-04-05 16:23 UTC (permalink / raw)
  To: Peter Maydell
  Cc: MTTCG Devel, Peter Crosthwaite, QEMU Developers, Emilio G. Cota,
	Paolo Bonzini, Sergey Fedorov, Alex Bennée,
	Richard Henderson

Peter Maydell writes:

> On 5 April 2016 at 13:57, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
>> Emilio G Cota writes:
>> 
>>> I'm assuming windows compilers don't support this attribute.
>>> Signed-off-by: Emilio G. Cota <cota@braap.org>
>>> ---
>>> include/qemu/compiler.h | 10 ++++++++++
>>> 1 file changed, 10 insertions(+)
>> 
>>> diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
>>> index 8f1cc7b..fb946f1 100644
>>> --- a/include/qemu/compiler.h
>>> +++ b/include/qemu/compiler.h
>>> @@ -41,6 +41,16 @@
>>> # define QEMU_PACKED __attribute__((packed))
>>> #endif
>> 
>>> +#define QEMU_CACHELINE (64)
>> [...]
>> 
>> You should make this a value taken from configure. In linux you could use
>> something like:
>> 
>> getconf LEVEL1_DCACHE_LINESIZE
>> 
>> Or an equivalent program using sysconf(3) if getconf is not present.

> The cache line size at compile time is not necessarily the cache
> line size at runtime...

True, but exposing it as a configure flag allows tunning for the target arch
during cross-compilation. It would be bettwe to use the compiler to tell you
about the target cache line size, but I'm not aware of a way to do it.

Got it!

  gcc -march=native --help=params -v 2>&1 | grep "param l1-cache-line-size" | sed -e 's/.* --param l1-cache-line-size=\([0-9]\+\) .*/\1/'


Cheers,
  Lluis

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05 16:23       ` Lluís Vilanova
@ 2016-04-05 16:31         ` Richard Henderson
  2016-04-05 16:56           ` Peter Maydell
  0 siblings, 1 reply; 62+ messages in thread
From: Richard Henderson @ 2016-04-05 16:31 UTC (permalink / raw)
  To: Peter Maydell, Emilio G. Cota, QEMU Developers, MTTCG Devel,
	Peter Crosthwaite, Sergey Fedorov, Paolo Bonzini,
	Alex Bennée

On 04/05/2016 09:23 AM, Lluís Vilanova wrote:
> Got it!
>
>    gcc -march=native --help=params -v 2>&1 | grep "param l1-cache-line-size" | sed -e 's/.* --param l1-cache-line-size=\([0-9]\+\) .*/\1/'

That will only work on x86, where we can be pretty sure it's 64 anyway.


r~

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-05 15:41   ` Richard Henderson
  2016-04-05 15:48     ` Paolo Bonzini
@ 2016-04-05 16:33     ` Laurent Desnogues
  2016-04-05 17:19       ` Richard Henderson
  1 sibling, 1 reply; 62+ messages in thread
From: Laurent Desnogues @ 2016-04-05 16:33 UTC (permalink / raw)
  To: Richard Henderson
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Emilio G. Cota, Paolo Bonzini, Sergey Fedorov, Alex Bennée

On Tue, Apr 5, 2016 at 5:41 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 04/04/2016 10:30 PM, Emilio G. Cota wrote:
>>
>> Tests show that the other element checked for in tb_find_physical,
>> cs_base, is always a match when tb_phys+pc+flags are a match,
>> so hashing cs_base is wasteful. It could be that this is an ARM-only
>> thing, though.
>
>
> The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a
> TB consisting of only a delay slot).
>
> It may well still turn out to be reasonable to ignore cs_base for hashing.
>
>> +static inline
>> +uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint64_t
>> flags)
>>   {
>> -    return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
>> +    struct key {
>> +        tb_page_addr_t phys_pc;
>> +        target_ulong pc;
>> +        uint64_t flags;
>
>
> The flags field should be uint32_t, not uint64_t.
> See its definition in TranslationBlock.

The 'flags' field is 64-bit.  You're thinking of cflags, I guess.


Laurent

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05 16:31         ` Richard Henderson
@ 2016-04-05 16:56           ` Peter Maydell
  2016-04-05 19:02             ` Lluís Vilanova
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Maydell @ 2016-04-05 16:56 UTC (permalink / raw)
  To: Richard Henderson
  Cc: MTTCG Devel, Peter Crosthwaite, QEMU Developers, Emilio G. Cota,
	Paolo Bonzini, Sergey Fedorov, Alex Bennée

On 5 April 2016 at 17:31, Richard Henderson <rth@twiddle.net> wrote:
> On 04/05/2016 09:23 AM, Lluís Vilanova wrote:
>>
>> Got it!
>>
>>    gcc -march=native --help=params -v 2>&1 | grep "param
>> l1-cache-line-size" | sed -e 's/.* --param l1-cache-line-size=\([0-9]\+\)
>> .*/\1/'
>
>
> That will only work on x86, where we can be pretty sure it's 64 anyway.

Doesn't work on OSX x86 either, since the gcc-to-clang wrapper
doesn't support --help=params :-)

thanks
-- PMM

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-05 16:33     ` Laurent Desnogues
@ 2016-04-05 17:19       ` Richard Henderson
  2016-04-06  6:06         ` Laurent Desnogues
  0 siblings, 1 reply; 62+ messages in thread
From: Richard Henderson @ 2016-04-05 17:19 UTC (permalink / raw)
  To: Laurent Desnogues
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Emilio G. Cota, Paolo Bonzini, Sergey Fedorov, Alex Bennée

On 04/05/2016 09:33 AM, Laurent Desnogues wrote:
> The 'flags' field is 64-bit.  You're thinking of cflags, I guess.

Well that's silly.  Since it's filled in via

static inline void cpu_get_tb_cpu_state(CPUMIPSState *env, target_ulong *pc,
                                        target_ulong *cs_base, int *flags)

and passed back in to generate code with

TranslationBlock *tb_gen_code(CPUState *cpu,
                              target_ulong pc, target_ulong cs_base, int flags,
                              int cflags);

So while TranslationBlock stores "uint64_t", the producer and consumer see "int".


r~

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05  7:57   ` Peter Maydell
@ 2016-04-05 17:24     ` Emilio G. Cota
  2016-04-05 18:01       ` Peter Maydell
  0 siblings, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-05 17:24 UTC (permalink / raw)
  To: Peter Maydell, Paolo Bonzini
  Cc: MTTCG Devel, MTTCG Devel, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Alex Bennée, Richard Henderson

On Tue, Apr 05, 2016 at 08:57:45 +0100, Peter Maydell wrote:
> On 5 April 2016 at 06:30, Emilio G. Cota <cota@braap.org> wrote:
> > +#define QEMU_CACHELINE (64)
> 
> Why 64? Does anything bad happen if the host's cache line
> size turns out to be greater than the value here ?

Defining a number lower than the host's cache line could result
in some false sharing. No big deal for most workloads.

Defining a number greater than the host's cache might result in
wasted memory, which we really should avoid.

I used 64 because it is common on x86, which I assume is what most
developers develop on.

On Tue, Apr 05, 2016 at 17:29:05 +0200, Paolo Bonzini wrote:
> But the size of data structures cannot change at run-time.  A bigger
> worry is cross-compilation.
> 
> Linux has these defaults:
> 
> 	Alpha	32 or 64
> 	ARM	configurable, 64 for ARMv7
> 	AArch64	128
> 	PPC     128
> 	s390	256
> 	x86	configurable, default 64
> 
> which should be easy to copy in QEMU too.

So how about this:
we add these defaults, and also add an optional --configure
parameter to override said defaults.

Otherwise I'd just stick to 64.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05 17:24     ` Emilio G. Cota
@ 2016-04-05 18:01       ` Peter Maydell
  2016-04-05 19:13         ` Emilio G. Cota
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Maydell @ 2016-04-05 18:01 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: MTTCG Devel, MTTCG Devel, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Paolo Bonzini, Alex Bennée,
	Richard Henderson

On 5 April 2016 at 18:24, Emilio G. Cota <cota@braap.org> wrote:
> On Tue, Apr 05, 2016 at 08:57:45 +0100, Peter Maydell wrote:
>> On 5 April 2016 at 06:30, Emilio G. Cota <cota@braap.org> wrote:
>> > +#define QEMU_CACHELINE (64)
>>
>> Why 64? Does anything bad happen if the host's cache line
>> size turns out to be greater than the value here ?
>
> Defining a number lower than the host's cache line could result
> in some false sharing. No big deal for most workloads.
>
> Defining a number greater than the host's cache might result in
> wasted memory, which we really should avoid.
>
> I used 64 because it is common on x86, which I assume is what most
> developers develop on.
>
> On Tue, Apr 05, 2016 at 17:29:05 +0200, Paolo Bonzini wrote:
>> But the size of data structures cannot change at run-time.  A bigger
>> worry is cross-compilation.
>>
>> Linux has these defaults:
>>
>>       Alpha   32 or 64
>>       ARM     configurable, 64 for ARMv7
>>       AArch64 128
>>       PPC     128
>>       s390    256
>>       x86     configurable, default 64
>>
>> which should be easy to copy in QEMU too.
>
> So how about this:
> we add these defaults, and also add an optional --configure
> parameter to override said defaults.

I think this definitely doesn't merit a configure parameter.

> Otherwise I'd just stick to 64.

If this is basically just a hashtable semi-tunable parameter knob,
then I don't mind if we use 64 everywhere (or some slightly more
architecture-aware default). But I would prefer that we not make
it a global define QEMU_CACHELINE if we can't actually guarantee
that it's the size of the cacheline (which we can't), because I
think that will be confusing and invite future misuse.

Unless we have another use case in the tree at the moment for
a number which is "probably the cacheline size, but might be
smaller or larger if you're unlucky", in which case we just want
a better name :-)

thanks
-- PMM

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05 16:56           ` Peter Maydell
@ 2016-04-05 19:02             ` Lluís Vilanova
  2016-04-05 19:15               ` Richard Henderson
  0 siblings, 1 reply; 62+ messages in thread
From: Lluís Vilanova @ 2016-04-05 19:02 UTC (permalink / raw)
  To: Peter Maydell
  Cc: MTTCG Devel, Peter Crosthwaite, QEMU Developers, Emilio G. Cota,
	Sergey Fedorov, Paolo Bonzini, Alex Bennée,
	Richard Henderson

Peter Maydell writes:

> On 5 April 2016 at 17:31, Richard Henderson <rth@twiddle.net> wrote:
>> On 04/05/2016 09:23 AM, Lluís Vilanova wrote:
>>> 
>>> Got it!
>>> 
>>> gcc -march=native --help=params -v 2>&1 | grep "param
>>> l1-cache-line-size" | sed -e 's/.* --param l1-cache-line-size=\([0-9]\+\)
>>> .*/\1/'
>> 
>> 
>> That will only work on x86, where we can be pretty sure it's 64 anyway.

Oh, you mean l1-cache-line-size is an x86-only parameter in GCC?


> Doesn't work on OSX x86 either, since the gcc-to-clang wrapper
> doesn't support --help=params :-)

Well, my suggestion of adding a configure flag still applies :)


Cheers,
  Lluis

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05 18:01       ` Peter Maydell
@ 2016-04-05 19:13         ` Emilio G. Cota
  0 siblings, 0 replies; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-05 19:13 UTC (permalink / raw)
  To: Peter Maydell
  Cc: MTTCG Devel, MTTCG Devel, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Paolo Bonzini, Alex Bennée,
	Richard Henderson

On Tue, Apr 05, 2016 at 19:01:07 +0100, Peter Maydell wrote:
> On 5 April 2016 at 18:24, Emilio G. Cota <cota@braap.org> wrote:
> > So how about this:
> > we add these defaults, and also add an optional --configure
> > parameter to override said defaults.
> 
> I think this definitely doesn't merit a configure parameter.

Agreed :)

> > Otherwise I'd just stick to 64.
> 
> If this is basically just a hashtable semi-tunable parameter knob,
> then I don't mind if we use 64 everywhere (or some slightly more
> architecture-aware default). But I would prefer that we not make
> it a global define QEMU_CACHELINE if we can't actually guarantee
> that it's the size of the cacheline (which we can't), because I
> think that will be confusing and invite future misuse.
> 
> Unless we have another use case in the tree at the moment for
> a number which is "probably the cacheline size, but might be
> smaller or larger if you're unlucky", in which case we just want
> a better name :-)

Ok, so for now I'll only leave the ALIGN() macro, and then use
ALIGN(64) in the hash table. We can always reconsider adding a
more proper definition if this gets widespread use.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05 19:02             ` Lluís Vilanova
@ 2016-04-05 19:15               ` Richard Henderson
  2016-04-05 20:09                 ` Lluís Vilanova
  0 siblings, 1 reply; 62+ messages in thread
From: Richard Henderson @ 2016-04-05 19:15 UTC (permalink / raw)
  To: Peter Maydell, MTTCG Devel, Peter Crosthwaite, QEMU Developers,
	Emilio G. Cota, Paolo Bonzini, Sergey Fedorov, Alex Bennée

On 04/05/2016 12:02 PM, Lluís Vilanova wrote:
> Peter Maydell writes:
> 
>> On 5 April 2016 at 17:31, Richard Henderson <rth@twiddle.net> wrote:
>>> On 04/05/2016 09:23 AM, Lluís Vilanova wrote:
>>>>
>>>> Got it!
>>>>
>>>> gcc -march=native --help=params -v 2>&1 | grep "param
>>>> l1-cache-line-size" | sed -e 's/.* --param l1-cache-line-size=\([0-9]\+\)
>>>> .*/\1/'
>>>
>>>
>>> That will only work on x86, where we can be pretty sure it's 64 anyway.
> 
> Oh, you mean l1-cache-line-size is an x86-only parameter in GCC?

No, using -march=native to detect the host line size.


r~

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-05 16:07       ` Richard Henderson
@ 2016-04-05 19:40         ` Emilio G. Cota
  2016-04-05 21:08           ` Richard Henderson
  0 siblings, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-05 19:40 UTC (permalink / raw)
  To: Richard Henderson
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Paolo Bonzini, Alex Bennée

On Tue, Apr 05, 2016 at 09:07:57 -0700, Richard Henderson wrote:
> On 04/05/2016 08:48 AM, Paolo Bonzini wrote:
> >I think it's fine to use the struct.  The exact size of the struct
> >varies from 3 to 5 32-bit words, so it's hard to write nice
> >size-dependent code for the hash.
> 
> I don't think it is.  We have 3 integers.  It is trivial to create a simple
> function of 2 multiplies, two adds, and a remainder.
> 
> Take the primes from the xxhash.h, for example:
> 
>   (phys_pc * PRIME32_2 + pc * PRIME32_3 + flags)
>   % PRIME32_1
>   & (CODE_GEN_PHYS_HASH_SIZE - 1)
> 
> Obviously, some bucket measurements should be taken, but I can well imagine
> that this might perform just as well as the fully generic hasher.

That function doesn't perform well: 25.06s vs. 21.18s with xxh32.

Having the packed struct and passing it to an *inlined* xxhash is
virtually unbeatable; gcc (>=v4.6, dunno about older ones) optimizes the
inline function since it knows the size of the struct.

To show this I'm appending the generated code for tb_hash_func when xxh32
is inlined vs. when it is not, for x86_64-softmmu. Results are similar
for arm-softmmu.

Anyway (for the arm bootup test) we're talking about ~0.50% of runtime spent
in tb_hash_func (with xxh32 inlined), so whatever we did here could not
improve overall performance much.

Thanks,

		Emilio

* no inline:

00000000001a4e60 <qemu_xxh32>:
  1a4e60:	48 83 ec 18          	sub    $0x18,%rsp
  1a4e64:	4c 8d 0c b7          	lea    (%rdi,%rsi,4),%r9
  1a4e68:	64 48 8b 04 25 28 00 	mov    %fs:0x28,%rax
  1a4e6f:	00 00 
  1a4e71:	48 89 44 24 08       	mov    %rax,0x8(%rsp)
  1a4e76:	31 c0                	xor    %eax,%eax
  1a4e78:	48 83 fe 03          	cmp    $0x3,%rsi
  1a4e7c:	8d 82 b1 67 56 16    	lea    0x165667b1(%rdx),%eax
  1a4e82:	0f 86 92 00 00 00    	jbe    1a4f1a <qemu_xxh32+0xba>
  1a4e88:	4d 8d 59 f0          	lea    -0x10(%r9),%r11
  1a4e8c:	44 8d 82 28 44 23 24 	lea    0x24234428(%rdx),%r8d
  1a4e93:	8d 8a 77 ca eb 85    	lea    -0x7a143589(%rdx),%ecx
  1a4e99:	8d 82 4f 86 c8 61    	lea    0x61c8864f(%rdx),%eax
  1a4e9f:	90                   	nop
  1a4ea0:	44 8b 17             	mov    (%rdi),%r10d
  1a4ea3:	45 69 d2 77 ca eb 85 	imul   $0x85ebca77,%r10d,%r10d
  1a4eaa:	45 01 d0             	add    %r10d,%r8d
  1a4ead:	44 8b 57 04          	mov    0x4(%rdi),%r10d
  1a4eb1:	41 c1 c0 0d          	rol    $0xd,%r8d
  1a4eb5:	45 69 c0 b1 79 37 9e 	imul   $0x9e3779b1,%r8d,%r8d
  1a4ebc:	45 69 d2 77 ca eb 85 	imul   $0x85ebca77,%r10d,%r10d
  1a4ec3:	44 01 d1             	add    %r10d,%ecx
  1a4ec6:	44 8b 57 08          	mov    0x8(%rdi),%r10d
  1a4eca:	c1 c1 0d             	rol    $0xd,%ecx
  1a4ecd:	69 c9 b1 79 37 9e    	imul   $0x9e3779b1,%ecx,%ecx
  1a4ed3:	45 69 d2 77 ca eb 85 	imul   $0x85ebca77,%r10d,%r10d
  1a4eda:	44 01 d2             	add    %r10d,%edx
  1a4edd:	44 8b 57 0c          	mov    0xc(%rdi),%r10d
  1a4ee1:	48 83 c7 10          	add    $0x10,%rdi
  1a4ee5:	c1 c2 0d             	rol    $0xd,%edx
  1a4ee8:	69 d2 b1 79 37 9e    	imul   $0x9e3779b1,%edx,%edx
  1a4eee:	45 69 d2 77 ca eb 85 	imul   $0x85ebca77,%r10d,%r10d
  1a4ef5:	44 01 d0             	add    %r10d,%eax
  1a4ef8:	c1 c0 0d             	rol    $0xd,%eax
  1a4efb:	69 c0 b1 79 37 9e    	imul   $0x9e3779b1,%eax,%eax
  1a4f01:	49 39 fb             	cmp    %rdi,%r11
  1a4f04:	73 9a                	jae    1a4ea0 <qemu_xxh32+0x40>
  1a4f06:	c1 c9 19             	ror    $0x19,%ecx
  1a4f09:	41 c1 c8 1f          	ror    $0x1f,%r8d
  1a4f0d:	c1 ca 14             	ror    $0x14,%edx
  1a4f10:	44 01 c1             	add    %r8d,%ecx
  1a4f13:	c1 c8 0e             	ror    $0xe,%eax
  1a4f16:	01 ca                	add    %ecx,%edx
  1a4f18:	01 d0                	add    %edx,%eax
  1a4f1a:	4c 39 cf             	cmp    %r9,%rdi
  1a4f1d:	8d 34 b0             	lea    (%rax,%rsi,4),%esi
  1a4f20:	73 22                	jae    1a4f44 <qemu_xxh32+0xe4>
  1a4f22:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  1a4f28:	8b 17                	mov    (%rdi),%edx
  1a4f2a:	48 83 c7 04          	add    $0x4,%rdi
  1a4f2e:	69 c2 3d ae b2 c2    	imul   $0xc2b2ae3d,%edx,%eax
  1a4f34:	01 c6                	add    %eax,%esi
  1a4f36:	c1 c6 11             	rol    $0x11,%esi
  1a4f39:	69 f6 2f eb d4 27    	imul   $0x27d4eb2f,%esi,%esi
  1a4f3f:	49 39 f9             	cmp    %rdi,%r9
  1a4f42:	77 e4                	ja     1a4f28 <qemu_xxh32+0xc8>
  1a4f44:	89 f0                	mov    %esi,%eax
  1a4f46:	c1 e8 0f             	shr    $0xf,%eax
  1a4f49:	31 f0                	xor    %esi,%eax
  1a4f4b:	69 d0 77 ca eb 85    	imul   $0x85ebca77,%eax,%edx
  1a4f51:	89 d0                	mov    %edx,%eax
  1a4f53:	c1 e8 0d             	shr    $0xd,%eax
  1a4f56:	31 d0                	xor    %edx,%eax
  1a4f58:	69 d0 3d ae b2 c2    	imul   $0xc2b2ae3d,%eax,%edx
  1a4f5e:	89 d0                	mov    %edx,%eax
  1a4f60:	c1 e8 10             	shr    $0x10,%eax
  1a4f63:	31 d0                	xor    %edx,%eax
  1a4f65:	48 8b 54 24 08       	mov    0x8(%rsp),%rdx
  1a4f6a:	64 48 33 14 25 28 00 	xor    %fs:0x28,%rdx
  1a4f71:	00 00 
  1a4f73:	75 05                	jne    1a4f7a <qemu_xxh32+0x11a>
  1a4f75:	48 83 c4 18          	add    $0x18,%rsp
  1a4f79:	c3                   	retq   
  1a4f7a:	e8 f1 7a fe ff       	callq  18ca70 <__stack_chk_fail@plt>
  1a4f7f:	90                   	nop

00000000001a4f80 <tb_hash_func>:
  1a4f80:	48 83 ec 28          	sub    $0x28,%rsp
  1a4f84:	48 89 3c 24          	mov    %rdi,(%rsp)
  1a4f88:	48 89 74 24 08       	mov    %rsi,0x8(%rsp)
  1a4f8d:	48 89 e7             	mov    %rsp,%rdi
  1a4f90:	89 54 24 10          	mov    %edx,0x10(%rsp)
  1a4f94:	be 05 00 00 00       	mov    $0x5,%esi
  1a4f99:	ba 01 00 00 00       	mov    $0x1,%edx
  1a4f9e:	64 48 8b 04 25 28 00 	mov    %fs:0x28,%rax
  1a4fa5:	00 00 
  1a4fa7:	48 89 44 24 18       	mov    %rax,0x18(%rsp)
  1a4fac:	31 c0                	xor    %eax,%eax
  1a4fae:	e8 ad fe ff ff       	callq  1a4e60 <qemu_xxh32>
  1a4fb3:	48 8b 54 24 18       	mov    0x18(%rsp),%rdx
  1a4fb8:	64 48 33 14 25 28 00 	xor    %fs:0x28,%rdx
  1a4fbf:	00 00 
  1a4fc1:	75 05                	jne    1a4fc8 <tb_hash_func+0x48>
  1a4fc3:	48 83 c4 28          	add    $0x28,%rsp
  1a4fc7:	c3                   	retq   
  1a4fc8:	e8 a3 7a fe ff       	callq  18ca70 <__stack_chk_fail@plt>
  1a4fcd:	0f 1f 00             	nopl   (%rax)

* inline:

00000000001a6800 <tb_hash_func>:
  1a6800:	48 83 ec 28          	sub    $0x28,%rsp
  1a6804:	69 cf 77 ca eb 85    	imul   $0x85ebca77,%edi,%ecx
  1a680a:	48 89 3c 24          	mov    %rdi,(%rsp)
  1a680e:	48 c1 ef 20          	shr    $0x20,%rdi
  1a6812:	69 ff 77 ca eb 85    	imul   $0x85ebca77,%edi,%edi
  1a6818:	48 89 74 24 08       	mov    %rsi,0x8(%rsp)
  1a681d:	64 48 8b 04 25 28 00 	mov    %fs:0x28,%rax
  1a6824:	00 00 
  1a6826:	48 89 44 24 18       	mov    %rax,0x18(%rsp)
  1a682b:	31 c0                	xor    %eax,%eax
  1a682d:	81 c1 29 44 23 24    	add    $0x24234429,%ecx
  1a6833:	69 c6 77 ca eb 85    	imul   $0x85ebca77,%esi,%eax
  1a6839:	48 c1 ee 20          	shr    $0x20,%rsi
  1a683d:	81 ef 88 35 14 7a    	sub    $0x7a143588,%edi
  1a6843:	69 f6 77 ca eb 85    	imul   $0x85ebca77,%esi,%esi
  1a6849:	c1 c9 13             	ror    $0x13,%ecx
  1a684c:	c1 cf 13             	ror    $0x13,%edi
  1a684f:	83 c0 01             	add    $0x1,%eax
  1a6852:	69 c9 b1 79 37 9e    	imul   $0x9e3779b1,%ecx,%ecx
  1a6858:	c1 c8 13             	ror    $0x13,%eax
  1a685b:	81 c6 50 86 c8 61    	add    $0x61c88650,%esi
  1a6861:	69 ff b1 79 37 9e    	imul   $0x9e3779b1,%edi,%edi
  1a6867:	c1 ce 13             	ror    $0x13,%esi
  1a686a:	c1 c9 1f             	ror    $0x1f,%ecx
  1a686d:	69 c0 b1 79 37 9e    	imul   $0x9e3779b1,%eax,%eax
  1a6873:	c1 cf 19             	ror    $0x19,%edi
  1a6876:	69 f6 b1 79 37 9e    	imul   $0x9e3779b1,%esi,%esi
  1a687c:	8d 7c 39 14          	lea    0x14(%rcx,%rdi,1),%edi
  1a6880:	c1 c8 14             	ror    $0x14,%eax
  1a6883:	69 d2 3d ae b2 c2    	imul   $0xc2b2ae3d,%edx,%edx
  1a6889:	01 f8                	add    %edi,%eax
  1a688b:	c1 ce 0e             	ror    $0xe,%esi
  1a688e:	01 c6                	add    %eax,%esi
  1a6890:	01 f2                	add    %esi,%edx
  1a6892:	c1 ca 0f             	ror    $0xf,%edx
  1a6895:	69 d2 2f eb d4 27    	imul   $0x27d4eb2f,%edx,%edx
  1a689b:	89 d0                	mov    %edx,%eax
  1a689d:	c1 e8 0f             	shr    $0xf,%eax
  1a68a0:	31 d0                	xor    %edx,%eax
  1a68a2:	69 d0 77 ca eb 85    	imul   $0x85ebca77,%eax,%edx
  1a68a8:	89 d0                	mov    %edx,%eax
  1a68aa:	c1 e8 0d             	shr    $0xd,%eax
  1a68ad:	31 d0                	xor    %edx,%eax
  1a68af:	69 d0 3d ae b2 c2    	imul   $0xc2b2ae3d,%eax,%edx
  1a68b5:	89 d0                	mov    %edx,%eax
  1a68b7:	c1 e8 10             	shr    $0x10,%eax
  1a68ba:	31 d0                	xor    %edx,%eax
  1a68bc:	48 8b 54 24 18       	mov    0x18(%rsp),%rdx
  1a68c1:	64 48 33 14 25 28 00 	xor    %fs:0x28,%rdx
  1a68c8:	00 00 
  1a68ca:	75 05                	jne    1a68d1 <tb_hash_func+0xd1>
  1a68cc:	48 83 c4 28          	add    $0x28,%rsp
  1a68d0:	c3                   	retq   
  1a68d1:	e8 9a 61 fe ff       	callq  18ca70 <__stack_chk_fail@plt>
  1a68d6:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  1a68dd:	00 00 00 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05 19:15               ` Richard Henderson
@ 2016-04-05 20:09                 ` Lluís Vilanova
  2016-04-06 11:44                   ` Paolo Bonzini
  0 siblings, 1 reply; 62+ messages in thread
From: Lluís Vilanova @ 2016-04-05 20:09 UTC (permalink / raw)
  To: Richard Henderson
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Emilio G. Cota, Sergey Fedorov, Paolo Bonzini, Alex Bennée

Richard Henderson writes:

> On 04/05/2016 12:02 PM, Lluís Vilanova wrote:
>> Peter Maydell writes:
>> 
>>> On 5 April 2016 at 17:31, Richard Henderson <rth@twiddle.net> wrote:
>>>> On 04/05/2016 09:23 AM, Lluís Vilanova wrote:
>>>>> 
>>>>> Got it!
>>>>> 
>>>>> gcc -march=native --help=params -v 2>&1 | grep "param
>>>>> l1-cache-line-size" | sed -e 's/.* --param l1-cache-line-size=\([0-9]\+\)
>>>>> .*/\1/'
>>>> 
>>>> 
>>>> That will only work on x86, where we can be pretty sure it's 64 anyway.
>> 
>> Oh, you mean l1-cache-line-size is an x86-only parameter in GCC?

> No, using -march=native to detect the host line size.

Ah. That's just an example. For cross-compilation you would use a different
march argument (or none to use the default target sub-arch) and get the
parameter for the target processor. This should already be known by configure as
part of the arguments to select the cross-compiler and target architecture
(e.g., CC).


Lluis

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-05 19:40         ` Emilio G. Cota
@ 2016-04-05 21:08           ` Richard Henderson
  2016-04-06  0:52             ` Emilio G. Cota
  0 siblings, 1 reply; 62+ messages in thread
From: Richard Henderson @ 2016-04-05 21:08 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Paolo Bonzini, Alex Bennée

On 04/05/2016 12:40 PM, Emilio G. Cota wrote:
> On Tue, Apr 05, 2016 at 09:07:57 -0700, Richard Henderson wrote:
>> On 04/05/2016 08:48 AM, Paolo Bonzini wrote:
>>> I think it's fine to use the struct.  The exact size of the struct
>>> varies from 3 to 5 32-bit words, so it's hard to write nice
>>> size-dependent code for the hash.
>>
>> I don't think it is.  We have 3 integers.  It is trivial to create a simple
>> function of 2 multiplies, two adds, and a remainder.
>>
>> Take the primes from the xxhash.h, for example:
>>
>>   (phys_pc * PRIME32_2 + pc * PRIME32_3 + flags)
>>   % PRIME32_1
>>   & (CODE_GEN_PHYS_HASH_SIZE - 1)
>>
>> Obviously, some bucket measurements should be taken, but I can well imagine
>> that this might perform just as well as the fully generic hasher.
>
> That function doesn't perform well: 25.06s vs. 21.18s with xxh32.

Sure, that's just what came off the top of my head.

But the point is that we can do better than dropping data into memory.
Particularly for those hosts that do not support unaligned data, such as you
created with the packed structure.


> * inline:
>
> 00000000001a6800 <tb_hash_func>:
>   1a6800:	48 83 ec 28          	sub    $0x28,%rsp
>   1a6804:	69 cf 77 ca eb 85    	imul   $0x85ebca77,%edi,%ecx
>   1a680a:	48 89 3c 24          	mov    %rdi,(%rsp)
>   1a680e:	48 c1 ef 20          	shr    $0x20,%rdi
>   1a6812:	69 ff 77 ca eb 85    	imul   $0x85ebca77,%edi,%edi
>   1a6818:	48 89 74 24 08       	mov    %rsi,0x8(%rsp)
>   1a681d:	64 48 8b 04 25 28 00 	mov    %fs:0x28,%rax
>   1a6824:	00 00
>   1a6826:	48 89 44 24 18       	mov    %rax,0x18(%rsp)
>   1a682b:	31 c0                	xor    %eax,%eax
>   1a682d:	81 c1 29 44 23 24    	add    $0x24234429,%ecx
>   1a6833:	69 c6 77 ca eb 85    	imul   $0x85ebca77,%esi,%eax
>   1a6839:	48 c1 ee 20          	shr    $0x20,%rsi
>   1a683d:	81 ef 88 35 14 7a    	sub    $0x7a143588,%edi
>   1a6843:	69 f6 77 ca eb 85    	imul   $0x85ebca77,%esi,%esi
>   1a6849:	c1 c9 13             	ror    $0x13,%ecx
>   1a684c:	c1 cf 13             	ror    $0x13,%edi
>   1a684f:	83 c0 01             	add    $0x1,%eax
>   1a6852:	69 c9 b1 79 37 9e    	imul   $0x9e3779b1,%ecx,%ecx
>   1a6858:	c1 c8 13             	ror    $0x13,%eax
>   1a685b:	81 c6 50 86 c8 61    	add    $0x61c88650,%esi
>   1a6861:	69 ff b1 79 37 9e    	imul   $0x9e3779b1,%edi,%edi
>   1a6867:	c1 ce 13             	ror    $0x13,%esi
>   1a686a:	c1 c9 1f             	ror    $0x1f,%ecx
>   1a686d:	69 c0 b1 79 37 9e    	imul   $0x9e3779b1,%eax,%eax
>   1a6873:	c1 cf 19             	ror    $0x19,%edi
>   1a6876:	69 f6 b1 79 37 9e    	imul   $0x9e3779b1,%esi,%esi
>   1a687c:	8d 7c 39 14          	lea    0x14(%rcx,%rdi,1),%edi
>   1a6880:	c1 c8 14             	ror    $0x14,%eax
>   1a6883:	69 d2 3d ae b2 c2    	imul   $0xc2b2ae3d,%edx,%edx
>   1a6889:	01 f8                	add    %edi,%eax
>   1a688b:	c1 ce 0e             	ror    $0xe,%esi
>   1a688e:	01 c6                	add    %eax,%esi
>   1a6890:	01 f2                	add    %esi,%edx
>   1a6892:	c1 ca 0f             	ror    $0xf,%edx
>   1a6895:	69 d2 2f eb d4 27    	imul   $0x27d4eb2f,%edx,%edx
>   1a689b:	89 d0                	mov    %edx,%eax
>   1a689d:	c1 e8 0f             	shr    $0xf,%eax
>   1a68a0:	31 d0                	xor    %edx,%eax
>   1a68a2:	69 d0 77 ca eb 85    	imul   $0x85ebca77,%eax,%edx
>   1a68a8:	89 d0                	mov    %edx,%eax
>   1a68aa:	c1 e8 0d             	shr    $0xd,%eax
>   1a68ad:	31 d0                	xor    %edx,%eax
>   1a68af:	69 d0 3d ae b2 c2    	imul   $0xc2b2ae3d,%eax,%edx
>   1a68b5:	89 d0                	mov    %edx,%eax
>   1a68b7:	c1 e8 10             	shr    $0x10,%eax
>   1a68ba:	31 d0                	xor    %edx,%eax
>   1a68bc:	48 8b 54 24 18       	mov    0x18(%rsp),%rdx
>   1a68c1:	64 48 33 14 25 28 00 	xor    %fs:0x28,%rdx
>   1a68c8:	00 00
>   1a68ca:	75 05                	jne    1a68d1 <tb_hash_func+0xd1>
>   1a68cc:	48 83 c4 28          	add    $0x28,%rsp
>   1a68d0:	c3                   	retq

If I inline everything explicitly, we do even better.


r~


0000000000000000 <tb_hash_func>:
    0:	44 69 c6 77 ca eb 85 	imul   $0x85ebca77,%esi,%r8d
    7:	48 c1 ee 20          	shr    $0x20,%rsi
    b:	69 cf 77 ca eb 85    	imul   $0x85ebca77,%edi,%ecx
   11:	48 c1 ef 20          	shr    $0x20,%rdi
   15:	41 83 c0 01          	add    $0x1,%r8d
   19:	41 c1 c0 0d          	rol    $0xd,%r8d
   1d:	81 c1 29 44 23 24    	add    $0x24234429,%ecx
   23:	69 c6 77 ca eb 85    	imul   $0x85ebca77,%esi,%eax
   29:	c1 c1 0d             	rol    $0xd,%ecx
   2c:	41 69 f0 b1 79 37 9e 	imul   $0x9e3779b1,%r8d,%esi
   33:	05 50 86 c8 61       	add    $0x61c88650,%eax
   38:	69 ff 77 ca eb 85    	imul   $0x85ebca77,%edi,%edi
   3e:	c1 c6 0c             	rol    $0xc,%esi
   41:	c1 c0 0d             	rol    $0xd,%eax
   44:	69 d2 3d ae b2 c2    	imul   $0xc2b2ae3d,%edx,%edx
   4a:	81 ef 88 35 14 7a    	sub    $0x7a143588,%edi
   50:	69 c9 b1 79 37 9e    	imul   $0x9e3779b1,%ecx,%ecx
   56:	8d 54 16 14          	lea    0x14(%rsi,%rdx,1),%edx
   5a:	c1 c7 0d             	rol    $0xd,%edi
   5d:	69 c0 b1 79 37 9e    	imul   $0x9e3779b1,%eax,%eax
   63:	d1 c1                	rol    %ecx
   65:	01 d1                	add    %edx,%ecx
   67:	c1 c8 0e             	ror    $0xe,%eax
   6a:	69 d7 b1 79 37 9e    	imul   $0x9e3779b1,%edi,%edx
   70:	01 c8                	add    %ecx,%eax
   72:	c1 c2 07             	rol    $0x7,%edx
   75:	01 d0                	add    %edx,%eax
   77:	c1 c8 0f             	ror    $0xf,%eax
   7a:	69 c0 2f eb d4 27    	imul   $0x27d4eb2f,%eax,%eax
   80:	89 c2                	mov    %eax,%edx
   82:	c1 ea 0f             	shr    $0xf,%edx
   85:	31 d0                	xor    %edx,%eax
   87:	69 c0 77 ca eb 85    	imul   $0x85ebca77,%eax,%eax
   8d:	89 c2                	mov    %eax,%edx
   8f:	c1 ea 0d             	shr    $0xd,%edx
   92:	31 d0                	xor    %edx,%eax
   94:	69 c0 3d ae b2 c2    	imul   $0xc2b2ae3d,%eax,%eax
   9a:	89 c2                	mov    %eax,%edx
   9c:	c1 ea 10             	shr    $0x10,%edx
   9f:	31 d0                	xor    %edx,%eax
   a1:	c3                   	retq


#define PRIME32_1   2654435761U
#define PRIME32_2   2246822519U
#define PRIME32_3   3266489917U
#define PRIME32_4    668265263U
#define PRIME32_5    374761393U
#define XXH_rotl32(x, r) ((x << r) | (x >> (32 - r)))

unsigned tb_hash_func(unsigned long phys_pc, unsigned long pc, unsigned flags)
{
   unsigned h32, seed = 1;
   unsigned w0, w1, w2, w3, w4;
   unsigned v1, v2, v3, v4;

   w0 = phys_pc;
   w1 = phys_pc >> 31 >> 1;
   w2 = pc;
   w3 = pc >> 31 >> 1;
   w4 = flags;

   v1 = seed + PRIME32_1 + PRIME32_2;
   v2 = seed + PRIME32_2;
   v3 = seed + 0;
   v4 = seed - PRIME32_1;

   v1 += w0 * PRIME32_2;
   v1 = XXH_rotl32(v1, 13);
   v1 *= PRIME32_1;
   v2 += w1 * PRIME32_2;
   v2 = XXH_rotl32(v2, 13);
   v2 *= PRIME32_1;
   v3 += w2 * PRIME32_2;
   v3 = XXH_rotl32(v3, 13);
   v3 *= PRIME32_1;
   v4 += w3 * PRIME32_2;
   v4 = XXH_rotl32(v4, 13);
   v4 *= PRIME32_1;

   h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7)
	+ XXH_rotl32(v3, 12) + XXH_rotl32(v4, 18);

   h32 += 20;

   h32 += w4 * PRIME32_3;
   h32  = XXH_rotl32(h32, 17) * PRIME32_4;

   h32 ^= h32 >> 15;
   h32 *= PRIME32_2;
   h32 ^= h32 >> 13;
   h32 *= PRIME32_3;
   h32 ^= h32 >> 16;

   return h32;
}

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-05 21:08           ` Richard Henderson
@ 2016-04-06  0:52             ` Emilio G. Cota
  2016-04-06 11:52               ` Paolo Bonzini
  0 siblings, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-06  0:52 UTC (permalink / raw)
  To: Richard Henderson
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Paolo Bonzini, Alex Bennée

On Tue, Apr 05, 2016 at 14:08:13 -0700, Richard Henderson wrote:
> But the point is that we can do better than dropping data into memory.
> Particularly for those hosts that do not support unaligned data, such as you
> created with the packed structure.

If we made sure the fields in the struct were in the right order
(larger fields first), this shouldn't be an issue.

Anyway I took your proposal and implemented the patch below.
FWIW I cannot measure a perf. difference between this and the packed
struct for arm-softmmu (i.e. 16 bytes) on an x86_64 host.

How does the appended look?

Thanks,

		E.


commit af92a0690f49172621cd8b80759e3ca567d43567
Author: Emilio G. Cota <cota@braap.org>
Date:   Tue Apr 5 18:06:21 2016 -0400

    rth
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>

diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 6b97a7c..349a856 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -45,19 +45,124 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
            | (tmp & TB_JMP_ADDR_MASK));
 }
 
-static inline
-uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
+static inline uint32_t h32_finish(uint32_t h32)
 {
-    struct {
-        tb_page_addr_t phys_pc;
-        target_ulong pc;
-        int flags;
-    } QEMU_PACKED k;
-
-    k.phys_pc = phys_pc;
-    k.pc = pc;
-    k.flags = flags;
-    return qemu_xxh32((uint32_t *)&k, sizeof(k) / sizeof(uint32_t), 1);
+    h32 ^= h32 >> 15;
+    h32 *= PRIME32_2;
+    h32 ^= h32 >> 13;
+    h32 *= PRIME32_3;
+    h32 ^= h32 >> 16;
+
+    return h32;
+}
+
+static inline uint32_t tb_hash_func3(uint32_t a, uint32_t b, uint32_t c, int seed)
+{
+    uint32_t h32 = seed + PRIME32_5;
+
+    h32 += 12;
+
+    h32 += a * PRIME32_3;
+    h32  = XXH_rotl32(h32, 17) * PRIME32_4;
+
+    h32 += b * PRIME32_3;
+    h32  = XXH_rotl32(h32, 17) * PRIME32_4;
+
+    h32 += c * PRIME32_3;
+    h32  = XXH_rotl32(h32, 17) * PRIME32_4;
+
+    return h32_finish(h32);
+}
+
+static inline uint32_t tb_hash_func4(uint64_t a0, uint32_t c, uint32_t d, int seed)
+{
+    uint32_t v1 = seed + PRIME32_1 + PRIME32_2;
+    uint32_t v2 = seed + PRIME32_2;
+    uint32_t v3 = seed + 0;
+    uint32_t v4 = seed - PRIME32_1;
+    uint32_t a = a0 >> 31 >> 1;
+    uint32_t b = a0;
+    uint32_t h32;
+
+    v1 += a * PRIME32_2;
+    v1 = XXH_rotl32(v1, 13);
+    v1 *= PRIME32_1;
+
+    v2 += b * PRIME32_2;
+    v2 = XXH_rotl32(v2, 13);
+    v2 *= PRIME32_1;
+
+    v3 += c * PRIME32_2;
+    v3 = XXH_rotl32(v3, 13);
+    v3 *= PRIME32_1;
+
+    v4 += d * PRIME32_2;
+    v4 = XXH_rotl32(v4, 13);
+    v4 *= PRIME32_1;
+
+    h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) +
+          XXH_rotl32(v4, 18);
+    h32 += 16;
+
+    return h32_finish(h32);
+}
+
+static inline uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e, int seed)
+{
+    uint32_t v1 = seed + PRIME32_1 + PRIME32_2;
+    uint32_t v2 = seed + PRIME32_2;
+    uint32_t v3 = seed + 0;
+    uint32_t v4 = seed - PRIME32_1;
+    uint32_t a = a0 >> 31 >> 1;
+    uint32_t b = a0;
+    uint32_t c = b0 >> 31 >> 1;
+    uint32_t d = b0;
+    uint32_t h32;
+
+    v1 += a * PRIME32_2;
+    v1 = XXH_rotl32(v1, 13);
+    v1 *= PRIME32_1;
+
+    v2 += b * PRIME32_2;
+    v2 = XXH_rotl32(v2, 13);
+    v2 *= PRIME32_1;
+
+    v3 += c * PRIME32_2;
+    v3 = XXH_rotl32(v3, 13);
+    v3 *= PRIME32_1;
+
+    v4 += d * PRIME32_2;
+    v4 = XXH_rotl32(v4, 13);
+    v4 *= PRIME32_1;
+
+    h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) +
+          XXH_rotl32(v4, 18);
+    h32 += 20;
+
+    h32 += e * PRIME32_3;
+    h32  = XXH_rotl32(h32, 17) * PRIME32_4;
+
+    return h32_finish(h32);
+}
+
+static __attribute__((noinline))
+unsigned tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
+{
+#if TARGET_LONG_BITS == 64
+
+    if (sizeof(phys_pc) == sizeof(pc)) {
+        return tb_hash_func5(phys_pc, pc, flags, 1);
+    }
+    return tb_hash_func4(pc, phys_pc, flags, 1);
+
+#else /* 32-bit target */
+
+    if (sizeof(phys_pc) > sizeof(pc)) {
+        return tb_hash_func4(phys_pc, pc, flags, 1);
+    }
+    return tb_hash_func3(pc, phys_pc, flags, 1);
+
+#endif /* TARGET_LONG_BITS */
 }
 
 #endif

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-05 17:19       ` Richard Henderson
@ 2016-04-06  6:06         ` Laurent Desnogues
  2016-04-06 17:32           ` Emilio G. Cota
  0 siblings, 1 reply; 62+ messages in thread
From: Laurent Desnogues @ 2016-04-06  6:06 UTC (permalink / raw)
  To: Richard Henderson
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Emilio G. Cota, Paolo Bonzini, Sergey Fedorov, Alex Bennée

On Tue, Apr 5, 2016 at 7:19 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 04/05/2016 09:33 AM, Laurent Desnogues wrote:
>> The 'flags' field is 64-bit.  You're thinking of cflags, I guess.
>
> Well that's silly.  Since it's filled in via
>
> static inline void cpu_get_tb_cpu_state(CPUMIPSState *env, target_ulong *pc,
>                                         target_ulong *cs_base, int *flags)
>
> and passed back in to generate code with
>
> TranslationBlock *tb_gen_code(CPUState *cpu,
>                               target_ulong pc, target_ulong cs_base, int flags,
>                               int cflags);
>
> So while TranslationBlock stores "uint64_t", the producer and consumer see "int".

I agree.  I guess TranslationBlock should be fixed to use uint32_t
(note several functions have to be changed from using int to uint32_t
or aarch64-softmmu will fail).


Laurent

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] seqlock: remove optional mutex
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 03/10] seqlock: remove optional mutex Emilio G. Cota
@ 2016-04-06  8:38   ` Alex Bennée
  0 siblings, 0 replies; 62+ messages in thread
From: Alex Bennée @ 2016-04-06  8:38 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> This option is unused; besides, it bloats the struct when not needed.
> Let's just let writers define their own locks elsewhere.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  cpus.c                 |  2 +-
>  include/qemu/seqlock.h | 10 +---------
>  2 files changed, 2 insertions(+), 10 deletions(-)
>
> diff --git a/cpus.c b/cpus.c
> index 8ae4777..7f550b9 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -611,7 +611,7 @@ int cpu_throttle_get_percentage(void)
>
>  void cpu_ticks_init(void)
>  {
> -    seqlock_init(&timers_state.vm_clock_seqlock, NULL);
> +    seqlock_init(&timers_state.vm_clock_seqlock);
>      vmstate_register(NULL, 0, &vmstate_timers, &timers_state);
>      throttle_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL_RT,
>                                             cpu_throttle_timer_tick, NULL);
> diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
> index 70b01fd..e673482 100644
> --- a/include/qemu/seqlock.h
> +++ b/include/qemu/seqlock.h
> @@ -19,22 +19,17 @@
>  typedef struct QemuSeqLock QemuSeqLock;
>
>  struct QemuSeqLock {
> -    QemuMutex *mutex;
>      unsigned sequence;
>  };
>
> -static inline void seqlock_init(QemuSeqLock *sl, QemuMutex *mutex)
> +static inline void seqlock_init(QemuSeqLock *sl)
>  {
> -    sl->mutex = mutex;
>      sl->sequence = 0;
>  }
>
>  /* Lock out other writers and update the count.  */
>  static inline void seqlock_write_lock(QemuSeqLock *sl)
>  {
> -    if (sl->mutex) {
> -        qemu_mutex_lock(sl->mutex);
> -    }
>      ++sl->sequence;
>
>      /* Write sequence before updating other fields.  */
> @@ -47,9 +42,6 @@ static inline void seqlock_write_unlock(QemuSeqLock *sl)
>      smp_wmb();
>
>      ++sl->sequence;
> -    if (sl->mutex) {
> -        qemu_mutex_unlock(sl->mutex);
> -    }
>  }
>
>  static inline unsigned seqlock_read_begin(QemuSeqLock *sl)


--
Alex Bennée

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 04/10] seqlock: rename write_lock/unlock to write_begin/end
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 04/10] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
@ 2016-04-06  8:42   ` Alex Bennée
  0 siblings, 0 replies; 62+ messages in thread
From: Alex Bennée @ 2016-04-06  8:42 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> It is a more appropriate name, now that the mutex embedded
> in the seqlock is gone.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  cpus.c                 | 28 ++++++++++++++--------------
>  include/qemu/seqlock.h |  4 ++--
>  2 files changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/cpus.c b/cpus.c
> index 7f550b9..7dad2e6 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -247,13 +247,13 @@ int64_t cpu_get_clock(void)
>  void cpu_enable_ticks(void)
>  {
>      /* Here, the really thing protected by seqlock is cpu_clock_offset. */
> -    seqlock_write_lock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_begin(&timers_state.vm_clock_seqlock);
>      if (!timers_state.cpu_ticks_enabled) {
>          timers_state.cpu_ticks_offset -= cpu_get_host_ticks();
>          timers_state.cpu_clock_offset -= get_clock();
>          timers_state.cpu_ticks_enabled = 1;
>      }
> -    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_end(&timers_state.vm_clock_seqlock);
>  }
>
>  /* disable cpu_get_ticks() : the clock is stopped. You must not call
> @@ -263,13 +263,13 @@ void cpu_enable_ticks(void)
>  void cpu_disable_ticks(void)
>  {
>      /* Here, the really thing protected by seqlock is cpu_clock_offset. */
> -    seqlock_write_lock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_begin(&timers_state.vm_clock_seqlock);
>      if (timers_state.cpu_ticks_enabled) {
>          timers_state.cpu_ticks_offset += cpu_get_host_ticks();
>          timers_state.cpu_clock_offset = cpu_get_clock_locked();
>          timers_state.cpu_ticks_enabled = 0;
>      }
> -    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_end(&timers_state.vm_clock_seqlock);
>  }
>
>  /* Correlation between real and virtual time is always going to be
> @@ -292,7 +292,7 @@ static void icount_adjust(void)
>          return;
>      }
>
> -    seqlock_write_lock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_begin(&timers_state.vm_clock_seqlock);
>      cur_time = cpu_get_clock_locked();
>      cur_icount = cpu_get_icount_locked();
>
> @@ -313,7 +313,7 @@ static void icount_adjust(void)
>      last_delta = delta;
>      timers_state.qemu_icount_bias = cur_icount
>                                - (timers_state.qemu_icount << icount_time_shift);
> -    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_end(&timers_state.vm_clock_seqlock);
>  }
>
>  static void icount_adjust_rt(void *opaque)
> @@ -345,7 +345,7 @@ static void icount_warp_rt(void)
>          return;
>      }
>
> -    seqlock_write_lock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_begin(&timers_state.vm_clock_seqlock);
>      if (runstate_is_running()) {
>          int64_t clock = REPLAY_CLOCK(REPLAY_CLOCK_VIRTUAL_RT,
>                                       cpu_get_clock_locked());
> @@ -364,7 +364,7 @@ static void icount_warp_rt(void)
>          timers_state.qemu_icount_bias += warp_delta;
>      }
>      vm_clock_warp_start = -1;
> -    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_end(&timers_state.vm_clock_seqlock);
>
>      if (qemu_clock_expired(QEMU_CLOCK_VIRTUAL)) {
>          qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
> @@ -389,9 +389,9 @@ void qtest_clock_warp(int64_t dest)
>          int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
>          int64_t warp = qemu_soonest_timeout(dest - clock, deadline);
>
> -        seqlock_write_lock(&timers_state.vm_clock_seqlock);
> +        seqlock_write_begin(&timers_state.vm_clock_seqlock);
>          timers_state.qemu_icount_bias += warp;
> -        seqlock_write_unlock(&timers_state.vm_clock_seqlock);
> +        seqlock_write_end(&timers_state.vm_clock_seqlock);
>
>          qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
>          timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]);
> @@ -458,9 +458,9 @@ void qemu_start_warp_timer(void)
>               * It is useful when we want a deterministic execution time,
>               * isolated from host latencies.
>               */
> -            seqlock_write_lock(&timers_state.vm_clock_seqlock);
> +            seqlock_write_begin(&timers_state.vm_clock_seqlock);
>              timers_state.qemu_icount_bias += deadline;
> -            seqlock_write_unlock(&timers_state.vm_clock_seqlock);
> +            seqlock_write_end(&timers_state.vm_clock_seqlock);
>              qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
>          } else {
>              /*
> @@ -471,11 +471,11 @@ void qemu_start_warp_timer(void)
>               * you will not be sending network packets continuously instead of
>               * every 100ms.
>               */
> -            seqlock_write_lock(&timers_state.vm_clock_seqlock);
> +            seqlock_write_begin(&timers_state.vm_clock_seqlock);
>              if (vm_clock_warp_start == -1 || vm_clock_warp_start > clock) {
>                  vm_clock_warp_start = clock;
>              }
> -            seqlock_write_unlock(&timers_state.vm_clock_seqlock);
> +            seqlock_write_end(&timers_state.vm_clock_seqlock);
>              timer_mod_anticipate(icount_warp_timer, clock + deadline);
>          }
>      } else if (deadline == 0) {
> diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
> index e673482..4dfc055 100644
> --- a/include/qemu/seqlock.h
> +++ b/include/qemu/seqlock.h
> @@ -28,7 +28,7 @@ static inline void seqlock_init(QemuSeqLock *sl)
>  }
>
>  /* Lock out other writers and update the count.  */
> -static inline void seqlock_write_lock(QemuSeqLock *sl)
> +static inline void seqlock_write_begin(QemuSeqLock *sl)
>  {
>      ++sl->sequence;
>
> @@ -36,7 +36,7 @@ static inline void seqlock_write_lock(QemuSeqLock *sl)
>      smp_wmb();
>  }
>
> -static inline void seqlock_write_unlock(QemuSeqLock *sl)
> +static inline void seqlock_write_end(QemuSeqLock *sl)
>  {
>      /* Write other fields before finalizing sequence.  */
>      smp_wmb();


--
Alex Bennée

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 06/10] include: add xxhash.h
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 06/10] include: add xxhash.h Emilio G. Cota
@ 2016-04-06 11:39   ` Alex Bennée
  2016-04-06 22:59     ` Emilio G. Cota
  0 siblings, 1 reply; 62+ messages in thread
From: Alex Bennée @ 2016-04-06 11:39 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> xxhash is a fast, high-quality hashing function. The appended
> brings in the 32-bit version of it, with the small modification that
> it assumes the data to be hashed is made of 32-bit chunks; this increases
> speed slightly for the use-case we care about, i.e. tb-hash.
>
> The original algorithm, as well as a 64-bit implementation, can be found at:
>   https://github.com/Cyan4973/xxHash
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/xxhash.h | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 106 insertions(+)
>  create mode 100644 include/qemu/xxhash.h
>
> diff --git a/include/qemu/xxhash.h b/include/qemu/xxhash.h
> new file mode 100644
> index 0000000..a13a665
> --- /dev/null
> +++ b/include/qemu/xxhash.h
> @@ -0,0 +1,106 @@
<snip>
> +
> +/* u32 hash of @n contiguous chunks of u32's */
> +static inline uint32_t qemu_xxh32(const uint32_t *p, size_t n, uint32_t seed)
> +{

What is the point of seed here? I looked on the original site to see if
there was any guidance on tuning seed but couldn't find anything. I
appreciate the compiler can inline the constant away but perhaps we
should #define it and drop the parameter if we are not intending to
modify it?

Also it might be helpful to wrap the call to avoid getting the
boilerplate sizing wrong:

  #define qemu_xxh32(s) qemu_xxh32_impl((const uint32_t *)s, sizeof(*s)/sizeof(uint32_t), 1)

Then calls become a little simpler for the user:

  return qemu_xxh32(&k);

Do we need to include a compile time check for structures that don't
neatly divide into uint32_t chunks?

> +    const uint32_t *end = p + n;
> +    uint32_t h32;
> +
> +    if (n >= 4) {
> +        const uint32_t * const limit = end - 4;
> +        uint32_t v1 = seed + PRIME32_1 + PRIME32_2;
> +        uint32_t v2 = seed + PRIME32_2;
> +        uint32_t v3 = seed + 0;
> +        uint32_t v4 = seed - PRIME32_1;
> +
> +        do {
> +            v1 += *p * PRIME32_2;
> +            v1 = XXH_rotl32(v1, 13);
> +            v1 *= PRIME32_1;
> +            p++;
> +            v2 += *p * PRIME32_2;
> +            v2 = XXH_rotl32(v2, 13);
> +            v2 *= PRIME32_1;
> +            p++;
> +            v3 += *p * PRIME32_2;
> +            v3 = XXH_rotl32(v3, 13);
> +            v3 *= PRIME32_1;
> +            p++;
> +            v4 += *p * PRIME32_2;
> +            v4 = XXH_rotl32(v4, 13);
> +            v4 *= PRIME32_1;
> +            p++;
> +        } while (p <= limit);
> +        h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) +
> +            XXH_rotl32(v4, 18);
> +    } else {
> +        h32  = seed + PRIME32_5;
> +    }

I don't plead any particular knowledge of hashing codes but I note the
test cases we add only exercise the n == 1 path but in actual usage we
exercise n = 5 (at least for arm32, I guess aarch64 would be more).

> +
> +    h32 += n * sizeof(uint32_t);
> +
> +    while (p < end) {
> +        h32 += *p * PRIME32_3;
> +        h32  = XXH_rotl32(h32, 17) * PRIME32_4 ;
> +        p++;
> +    }
> +
> +    h32 ^= h32 >> 15;
> +    h32 *= PRIME32_2;
> +    h32 ^= h32 >> 13;
> +    h32 *= PRIME32_3;
> +    h32 ^= h32 >> 16;
> +
> +    return h32;
> +}
> +
> +#endif /* QEMU_XXHASH_H */


--
Alex Bennée

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-05 20:09                 ` Lluís Vilanova
@ 2016-04-06 11:44                   ` Paolo Bonzini
  2016-04-06 12:02                     ` Laurent Desnogues
  0 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2016-04-06 11:44 UTC (permalink / raw)
  To: Richard Henderson, Peter Maydell, MTTCG Devel, Peter Crosthwaite,
	QEMU Developers, Emilio G. Cota, Sergey Fedorov,
	Alex Bennée



On 05/04/2016 22:09, Lluís Vilanova wrote:
> Ah. That's just an example. For cross-compilation you would use a different
> march argument (or none to use the default target sub-arch) and get the
> parameter for the target processor. This should already be known by configure as
> part of the arguments to select the cross-compiler and target architecture
> (e.g., CC).

I would just use 64, with special cases for PPC and s390... even for
AArch64 128 seems a little wasteful.

Paolo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-06  0:52             ` Emilio G. Cota
@ 2016-04-06 11:52               ` Paolo Bonzini
  2016-04-06 17:44                 ` Emilio G. Cota
  0 siblings, 1 reply; 62+ messages in thread
From: Paolo Bonzini @ 2016-04-06 11:52 UTC (permalink / raw)
  To: Emilio G. Cota, Richard Henderson
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Alex Bennée



On 06/04/2016 02:52, Emilio G. Cota wrote:
> +static inline uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e, int seed)

I would keep just this version and unconditionally zero-extend to
64-bits.  The compiler is able to detect the high 32 bits are zero, drop
the more expensive multiplications and constant fold everything.

For example if you write

unsigned tb_hash_func(uint32_t phys_pc, uint32_t pc, int flags)
{
    return tb_hash_func5(phys_pc, pc, flags, 1);
}

and check the optimized code with -fdump-tree-optimized you'll see that
the rotated v1, the rotated v3 and the 20 merge into a single constant
1733907856.

Thanks,

Paolo

> +{
> +    uint32_t v1 = seed + PRIME32_1 + PRIME32_2;
> +    uint32_t v2 = seed + PRIME32_2;
> +    uint32_t v3 = seed + 0;
> +    uint32_t v4 = seed - PRIME32_1;
> +    uint32_t a = a0 >> 31 >> 1;
> +    uint32_t b = a0;
> +    uint32_t c = b0 >> 31 >> 1;
> +    uint32_t d = b0;
> +    uint32_t h32;
> +
> +    v1 += a * PRIME32_2;
> +    v1 = XXH_rotl32(v1, 13);
> +    v1 *= PRIME32_1;
> +
> +    v2 += b * PRIME32_2;
> +    v2 = XXH_rotl32(v2, 13);
> +    v2 *= PRIME32_1;
> +
> +    v3 += c * PRIME32_2;
> +    v3 = XXH_rotl32(v3, 13);
> +    v3 *= PRIME32_1;
> +
> +    v4 += d * PRIME32_2;
> +    v4 = XXH_rotl32(v4, 13);
> +    v4 *= PRIME32_1;
> +
> +    h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) +
> +          XXH_rotl32(v4, 18);
> +    h32 += 20;
> +
> +    h32 += e * PRIME32_3;
> +    h32  = XXH_rotl32(h32, 17) * PRIME32_4;
> +
> +    return h32_finish(h32);
> +}
> +
> +static __attribute__((noinline))
> +unsigned tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
> +{
> +#if TARGET_LONG_BITS == 64
> +
> +    if (sizeof(phys_pc) == sizeof(pc)) {
> +        return tb_hash_func5(phys_pc, pc, flags, 1);
> +    }

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED
  2016-04-06 11:44                   ` Paolo Bonzini
@ 2016-04-06 12:02                     ` Laurent Desnogues
  0 siblings, 0 replies; 62+ messages in thread
From: Laurent Desnogues @ 2016-04-06 12:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Emilio G. Cota, Sergey Fedorov, Alex Bennée,
	Richard Henderson

On Wed, Apr 6, 2016 at 1:44 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 05/04/2016 22:09, Lluís Vilanova wrote:
>> Ah. That's just an example. For cross-compilation you would use a different
>> march argument (or none to use the default target sub-arch) and get the
>> parameter for the target processor. This should already be known by configure as
>> part of the arguments to select the cross-compiler and target architecture
>> (e.g., CC).
>
> I would just use 64, with special cases for PPC and s390... even for
> AArch64 128 seems a little wasteful.

As far as I know, the only AArch64 CPU to have 128-byte cache line is
Cavium ThunderX (it seems to also be the case for Cavium MIPS-based
CPU).


Laurent

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 05/10] include: add spinlock wrapper
  2016-04-05  8:51   ` Paolo Bonzini
@ 2016-04-06 15:51     ` Alex Bennée
  0 siblings, 0 replies; 62+ messages in thread
From: Alex Bennée @ 2016-04-06 15:51 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Emilio G. Cota, Sergey Fedorov, Richard Henderson


Paolo Bonzini <pbonzini@redhat.com> writes:

> On 05/04/2016 07:30, Emilio G. Cota wrote:
>> Wrap pthread_spin on POSIX, or QemuMutex on Windows.
>>
>> AFAIK there are is no off-the-shelf spinlock implementation for
>> Windows, so we'll just use QemuMutex.
>
> It's much simpler to use a simple test-and-set spinlock.
>
> GitHub is down, but this should be the link
> http://github.com/bonzini/qemu/commit/e1361634 (it's in my mttcg
> branch).

In addition this version breaks on the OSX/clang build:

https://travis-ci.org/stsquad/qemu/jobs/121002991

>
> Paolo
>
>> Signed-off-by: Emilio G. Cota <cota@braap.org>
>> ---
>>  include/qemu/spinlock-posix.h | 60 +++++++++++++++++++++++++++++++++++++++++++
>>  include/qemu/spinlock-win32.h | 33 ++++++++++++++++++++++++
>>  include/qemu/spinlock.h       | 10 ++++++++
>>  3 files changed, 103 insertions(+)
>>  create mode 100644 include/qemu/spinlock-posix.h
>>  create mode 100644 include/qemu/spinlock-win32.h
>>  create mode 100644 include/qemu/spinlock.h
>>
>> diff --git a/include/qemu/spinlock-posix.h b/include/qemu/spinlock-posix.h
>> new file mode 100644
>> index 0000000..51c2c08
>> --- /dev/null
>> +++ b/include/qemu/spinlock-posix.h
>> @@ -0,0 +1,60 @@
>> +#ifndef QEMU_SPINLOCK_POSIX_H
>> +#define QEMU_SPINLOCK_POSIX_H
>> +
>> +#include <qemu/thread.h>
>> +#include <qemu/osdep.h>
>> +
>> +typedef pthread_spinlock_t QemuSpinLock;
>> +
>> +static inline void qemu_spinlock_error_exit(int err, const char *msg)
>> +{
>> +    fprintf(stderr, "qemu: %s: %s\n", msg, strerror(err));
>> +    abort();
>> +}
>> +
>> +static inline void qemu_spinlock_init(QemuSpinLock *lock)
>> +{
>> +    int rc;
>> +
>> +    rc = pthread_spin_init(lock, PTHREAD_PROCESS_SHARED);
>> +    if (unlikely(rc)) {
>> +        qemu_spinlock_error_exit(rc, __func__);
>> +    }
>> +}
>> +
>> +static inline void qemu_spinlock_destroy(QemuSpinLock *lock)
>> +{
>> +    int rc;
>> +
>> +    rc = pthread_spin_destroy(lock);
>> +    if (unlikely(rc)) {
>> +        qemu_spinlock_error_exit(rc, __func__);
>> +    }
>> +}
>> +
>> +static inline void qemu_spinlock_lock(QemuSpinLock *lock)
>> +{
>> +    int rc;
>> +
>> +    rc = pthread_spin_lock(lock);
>> +    if (unlikely(rc)) {
>> +        qemu_spinlock_error_exit(rc, __func__);
>> +    }
>> +}
>> +
>> +static inline int qemu_spinlock_trylock(QemuSpinLock *lock)
>> +{
>> +    return pthread_spin_trylock(lock);
>> +}
>> +
>> +static inline void qemu_spinlock_unlock(QemuSpinLock *lock)
>> +{
>> +    int rc;
>> +
>> +    rc = pthread_spin_unlock(lock);
>> +    if (unlikely(rc)) {
>> +        qemu_spinlock_error_exit(rc, __func__);
>> +    }
>> +}
>> +
>> +#endif /* QEMU_SPINLOCK_POSIX_H */
>> diff --git a/include/qemu/spinlock-win32.h b/include/qemu/spinlock-win32.h
>> new file mode 100644
>> index 0000000..5a105fb
>> --- /dev/null
>> +++ b/include/qemu/spinlock-win32.h
>> @@ -0,0 +1,33 @@
>> +#ifndef QEMU_SPINLOCK_WIN32_H
>> +#define QEMU_SPINLOCK_WIN32_H
>> +
>> +#include <qemu/thread.h>
>> +
>> +typedef QemuMutex QemuSpinLock;
>> +
>> +static inline void qemu_spinlock_init(QemuSpinLock *lock)
>> +{
>> +    qemu_mutex_init(lock);
>> +}
>> +
>> +static inline void qemu_spinlock_destroy(QemuSpinLock *lock)
>> +{
>> +    qemu_mutex_destroy(lock);
>> +}
>> +
>> +static inline void qemu_spinlock_lock(QemuSpinLock *lock)
>> +{
>> +    qemu_mutex_lock(lock);
>> +}
>> +
>> +static inline int qemu_spinlock_trylock(QemuSpinLock *lock)
>> +{
>> +    return qemu_mutex_trylock(lock);
>> +}
>> +
>> +static inline void qemu_spinlock_unlock(QemuSpinLock *lock)
>> +{
>> +    qemu_mutex_unlock(lock);
>> +}
>> +
>> +#endif /* QEMU_SPINLOCK_WIN32_H */
>> diff --git a/include/qemu/spinlock.h b/include/qemu/spinlock.h
>> new file mode 100644
>> index 0000000..001b55e
>> --- /dev/null
>> +++ b/include/qemu/spinlock.h
>> @@ -0,0 +1,10 @@
>> +#ifndef QEMU_SPINLOCK_H
>> +#define QEMU_SPINLOCK_H
>> +
>> +#ifdef _WIN32
>> +#include "spinlock-win32.h"
>> +#else
>> +#include "spinlock-posix.h"
>> +#endif
>> +
>> +#endif /* QEMU_SPINLOCK_H */
>>


--
Alex Bennée

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-06  6:06         ` Laurent Desnogues
@ 2016-04-06 17:32           ` Emilio G. Cota
  2016-04-06 17:42             ` Richard Henderson
  0 siblings, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-06 17:32 UTC (permalink / raw)
  To: Laurent Desnogues
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Paolo Bonzini, Sergey Fedorov, Alex Bennée,
	Richard Henderson

On Wed, Apr 06, 2016 at 08:06:57 +0200, Laurent Desnogues wrote:
> On Tue, Apr 5, 2016 at 7:19 PM, Richard Henderson <rth@twiddle.net> wrote:
> > On 04/05/2016 09:33 AM, Laurent Desnogues wrote:
> >> The 'flags' field is 64-bit.  You're thinking of cflags, I guess.
> >
> > Well that's silly.  Since it's filled in via
> >
> > static inline void cpu_get_tb_cpu_state(CPUMIPSState *env, target_ulong *pc,
> >                                         target_ulong *cs_base, int *flags)
> >
> > and passed back in to generate code with
> >
> > TranslationBlock *tb_gen_code(CPUState *cpu,
> >                               target_ulong pc, target_ulong cs_base, int flags,
> >                               int cflags);
> >
> > So while TranslationBlock stores "uint64_t", the producer and consumer see "int".
> 
> I agree.  I guess TranslationBlock should be fixed to use uint32_t
> (note several functions have to be changed from using int to uint32_t
> or aarch64-softmmu will fail).

Can you please elaborate on this?

FWIW aarch64-softmmu boots OK for me with the patch below. I'm booting
it as per the instructions in
  http://www.bennee.com/~alex/blog/2014/05/09/running-linux-in-qemus-aarch64-system-emulation-mode/

Thanks,

		Emilio

commit e70474788fa37a85df21e1c63101a879103758f5
Author: Emilio G. Cota <cota@braap.org>
Date:   Tue Apr 5 13:55:16 2016 -0400

    tb: consistently use 'int' type for tb->flags
    
    Reported-by: Richard Henderson <rth@twiddle.net>
    Signed-off-by: Emilio G. Cota <cota@braap.org>

diff --git a/cpu-exec.c b/cpu-exec.c
index bbfcbfb..5abbf57 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -220,7 +220,7 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
 static TranslationBlock *tb_find_physical(CPUState *cpu,
                                           target_ulong pc,
                                           target_ulong cs_base,
-                                          uint64_t flags)
+                                          int flags)
 {
     CPUArchState *env = (CPUArchState *)cpu->env_ptr;
     TranslationBlock *tb, **ptb1;
@@ -271,7 +271,7 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
 static TranslationBlock *tb_find_slow(CPUState *cpu,
                                       target_ulong pc,
                                       target_ulong cs_base,
-                                      uint64_t flags)
+                                      int flags)
 {
     TranslationBlock *tb;
 
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 7362095..277e6f1 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -235,7 +235,7 @@ static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
 struct TranslationBlock {
     target_ulong pc;   /* simulated PC corresponding to this block (EIP + CS base) */
     target_ulong cs_base; /* CS base for this block */
-    uint64_t flags; /* flags defining in which context the code was generated */
+    int flags; /* flags defining in which context the code was generated */
     uint16_t size;      /* size of target code for this block (1 <=
                            size <= TARGET_PAGE_SIZE) */
     uint16_t icount;
diff --git a/target-i386/translate.c b/target-i386/translate.c
index 1a1214d..4024ad4 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -8178,7 +8178,7 @@ void gen_intermediate_code(CPUX86State *env, TranslationBlock *tb)
     CPUState *cs = CPU(cpu);
     DisasContext dc1, *dc = &dc1;
     target_ulong pc_ptr;
-    uint64_t flags;
+    int flags;
     target_ulong pc_start;
     target_ulong cs_base;
     int num_insns;
diff --git a/translate-all.c b/translate-all.c
index 8329ea6..27b4d57 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1593,7 +1593,7 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
     TranslationBlock *tb;
     uint32_t n, cflags;
     target_ulong pc, cs_base;
-    uint64_t flags;
+    int flags;
 
     tb = tb_find_pc(retaddr);
     if (!tb) {

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-06 17:32           ` Emilio G. Cota
@ 2016-04-06 17:42             ` Richard Henderson
  2016-04-07  8:12               ` Laurent Desnogues
  0 siblings, 1 reply; 62+ messages in thread
From: Richard Henderson @ 2016-04-06 17:42 UTC (permalink / raw)
  To: Emilio G. Cota, Laurent Desnogues
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Paolo Bonzini, Sergey Fedorov, Alex Bennée

On 04/06/2016 10:32 AM, Emilio G. Cota wrote:
> On Wed, Apr 06, 2016 at 08:06:57 +0200, Laurent Desnogues wrote:
>> On Tue, Apr 5, 2016 at 7:19 PM, Richard Henderson <rth@twiddle.net> wrote:
>>> On 04/05/2016 09:33 AM, Laurent Desnogues wrote:
>>>> The 'flags' field is 64-bit.  You're thinking of cflags, I guess.
>>>
>>> Well that's silly.  Since it's filled in via
>>>
>>> static inline void cpu_get_tb_cpu_state(CPUMIPSState *env, target_ulong *pc,
>>>                                         target_ulong *cs_base, int *flags)
>>>
>>> and passed back in to generate code with
>>>
>>> TranslationBlock *tb_gen_code(CPUState *cpu,
>>>                               target_ulong pc, target_ulong cs_base, int flags,
>>>                               int cflags);
>>>
>>> So while TranslationBlock stores "uint64_t", the producer and consumer see "int".
>>
>> I agree.  I guess TranslationBlock should be fixed to use uint32_t
>> (note several functions have to be changed from using int to uint32_t
>> or aarch64-softmmu will fail).
> 
> Can you please elaborate on this?

The arm port is using some high bits, including

#define ARM_TBFLAG_AARCH64_STATE_SHIFT 31
#define ARM_TBFLAG_AARCH64_STATE_MASK  (1U << ARM_TBFLAG_AARCH64_STATE_SHIFT)

So, I would certainly be ok switching everything to use uint32_t over int.


r~

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-06 11:52               ` Paolo Bonzini
@ 2016-04-06 17:44                 ` Emilio G. Cota
  2016-04-06 18:23                   ` Paolo Bonzini
  0 siblings, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-06 17:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Alex Bennée, Richard Henderson

On Wed, Apr 06, 2016 at 13:52:21 +0200, Paolo Bonzini wrote:
> 
> 
> On 06/04/2016 02:52, Emilio G. Cota wrote:
> > +static inline uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e, int seed)
> 
> I would keep just this version and unconditionally zero-extend to
> 64-bits.  The compiler is able to detect the high 32 bits are zero, drop
> the more expensive multiplications and constant fold everything.
> 
> For example if you write
> 
> unsigned tb_hash_func(uint32_t phys_pc, uint32_t pc, int flags)
> {
>     return tb_hash_func5(phys_pc, pc, flags, 1);
> }
> 
> and check the optimized code with -fdump-tree-optimized you'll see that
> the rotated v1, the rotated v3 and the 20 merge into a single constant
> 1733907856.

I like this idea, because the ugliness of the sizeof checks is significant.
However, the quality of the resulting hash is not as good when always using func5.
For instance, when we'd otherwise use func3, two fifths of every input contain
exactly the same bits: all 0's. This inevitably leads to more collisions.

Performance (for the debian arm bootup test) gets up to 15% worse.

		Emilio

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-06 17:44                 ` Emilio G. Cota
@ 2016-04-06 18:23                   ` Paolo Bonzini
  2016-04-06 18:27                     ` Richard Henderson
  2016-04-07  0:37                     ` Emilio G. Cota
  0 siblings, 2 replies; 62+ messages in thread
From: Paolo Bonzini @ 2016-04-06 18:23 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Alex Bennée, Richard Henderson



On 06/04/2016 19:44, Emilio G. Cota wrote:
> I like this idea, because the ugliness of the sizeof checks is significant.
> However, the quality of the resulting hash is not as good when always using func5.
> For instance, when we'd otherwise use func3, two fifths of every input contain
> exactly the same bits: all 0's. This inevitably leads to more collisions.

It doesn't necessarily lead to more collisions.  For a completely stupid
hash function "a+b+c+d+e", for example, adding zeros doesn't add more
collisions.

What if you rearrange the five words so that the "all 0" parts of the
input are all at the beginning, or all at the end?  Perhaps the problem
is that the odd words at the end are hashed less effectively.

Perhaps better is to always use a three-word xxhash, but pick the 64-bit
version if any of phys_pc and pc are 64-bits.  The unrolling would be
very effective, and the performance penalty not too important (64-bit on
32-bit is very slow anyway).

If you can fix the problems with the collisions, it also means that you
get good performance running 32-bit guests on qemu-system-x86_64 or
qemu-system-aarch64.  So, if you can do it, it's a very desirable property.

Paolo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-06 18:23                   ` Paolo Bonzini
@ 2016-04-06 18:27                     ` Richard Henderson
  2016-04-07  0:37                     ` Emilio G. Cota
  1 sibling, 0 replies; 62+ messages in thread
From: Richard Henderson @ 2016-04-06 18:27 UTC (permalink / raw)
  To: Paolo Bonzini, Emilio G. Cota
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Alex Bennée

On 04/06/2016 11:23 AM, Paolo Bonzini wrote:
> Perhaps better is to always use a three-word xxhash, but pick the 64-bit
> version if any of phys_pc and pc are 64-bits.  The unrolling would be
> very effective, and the performance penalty not too important (64-bit on
> 32-bit is very slow anyway).

True.

It would also be interesting to put an average bucket chain length into the
"info jit" dump, so that it's easy to see how the hashing is performing for a
given guest.


r~

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 06/10] include: add xxhash.h
  2016-04-06 11:39   ` Alex Bennée
@ 2016-04-06 22:59     ` Emilio G. Cota
  0 siblings, 0 replies; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-06 22:59 UTC (permalink / raw)
  To: Alex Bennée
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Paolo Bonzini, Richard Henderson

On Wed, Apr 06, 2016 at 12:39:42 +0100, Alex Bennée wrote:
> 
> Emilio G. Cota <cota@braap.org> writes:
> 
> > xxhash is a fast, high-quality hashing function. The appended
> > brings in the 32-bit version of it, with the small modification that
> > it assumes the data to be hashed is made of 32-bit chunks; this increases
> > speed slightly for the use-case we care about, i.e. tb-hash.
> >
> > The original algorithm, as well as a 64-bit implementation, can be found at:
> >   https://github.com/Cyan4973/xxHash
> >
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > ---
> >  include/qemu/xxhash.h | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 106 insertions(+)
> >  create mode 100644 include/qemu/xxhash.h
> >
> > diff --git a/include/qemu/xxhash.h b/include/qemu/xxhash.h
> > new file mode 100644
> > index 0000000..a13a665
> > --- /dev/null
> > +++ b/include/qemu/xxhash.h
> > @@ -0,0 +1,106 @@
> <snip>
> > +
> > +/* u32 hash of @n contiguous chunks of u32's */
> > +static inline uint32_t qemu_xxh32(const uint32_t *p, size_t n, uint32_t seed)
> > +{
> 
> What is the point of seed here? I looked on the original site to see if
> there was any guidance on tuning seed but couldn't find anything. I
> appreciate the compiler can inline the constant away but perhaps we
> should #define it and drop the parameter if we are not intending to
> modify it?

The seed value would only matter if we needed consistent hashes (we don't).

Any seed value should give similar performance, given xxhash's quality.

I'll set a define for this seed if it is used in more than one place.

> Also it might be helpful to wrap the call to avoid getting the
> boilerplate sizing wrong:
> 
>   #define qemu_xxh32(s) qemu_xxh32_impl((const uint32_t *)s, sizeof(*s)/sizeof(uint32_t), 1)
> 
> Then calls become a little simpler for the user:
> 
>   return qemu_xxh32(&k);
> 
> Do we need to include a compile time check for structures that don't
> neatly divide into uint32_t chunks?

I'll give the original byte-sized function a shot and come back to this
if necessary.

Thanks,

		E.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-06 18:23                   ` Paolo Bonzini
  2016-04-06 18:27                     ` Richard Henderson
@ 2016-04-07  0:37                     ` Emilio G. Cota
  2016-04-07  8:46                       ` Paolo Bonzini
  1 sibling, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-07  0:37 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Alex Bennée, Richard Henderson

On Wed, Apr 06, 2016 at 20:23:42 +0200, Paolo Bonzini wrote:
> On 06/04/2016 19:44, Emilio G. Cota wrote:
> > I like this idea, because the ugliness of the sizeof checks is significant.
> > However, the quality of the resulting hash is not as good when always using func5.
> > For instance, when we'd otherwise use func3, two fifths of every input contain
> > exactly the same bits: all 0's. This inevitably leads to more collisions.

I take this back. I don't know anymore what I measured earlier today--it's
been a long day and was juggling quite a few things.

I essentially see the same chain lengths (within 0.2%) for either function, i.e.
func3 or func5 with the padded 0's when running arm-softmmu. So this
is good news :>

> Perhaps better is to always use a three-word xxhash, but pick the 64-bit
> version if any of phys_pc and pc are 64-bits.  The unrolling would be
> very effective, and the performance penalty not too important (64-bit on
> 32-bit is very slow anyway).

By "the 64-bit version" you mean what I called func5? That is:

if (sizeof(phys_pc) == sizeof(uint64_t) || sizeof(pc) == sizeof(uint64_t))
	return tb_hash_func5();
return tb_hash_func3();

or do you mean xxhash64 (which I did not include in my patchset)?
My tests with xxhash64 suggest that the quality of the results do not
improve over xxhash32, and the computation takes longer (it's more
instructions); not much, but measurable.

So we should probably just go with func5 always, as you suggested
initially. If so, I'm ready to send a v2.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-06 17:42             ` Richard Henderson
@ 2016-04-07  8:12               ` Laurent Desnogues
  0 siblings, 0 replies; 62+ messages in thread
From: Laurent Desnogues @ 2016-04-07  8:12 UTC (permalink / raw)
  To: Richard Henderson
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Emilio G. Cota, Paolo Bonzini, Sergey Fedorov, Alex Bennée

On Wed, Apr 6, 2016 at 7:42 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 04/06/2016 10:32 AM, Emilio G. Cota wrote:
>> On Wed, Apr 06, 2016 at 08:06:57 +0200, Laurent Desnogues wrote:
>>> On Tue, Apr 5, 2016 at 7:19 PM, Richard Henderson <rth@twiddle.net> wrote:
>>>> On 04/05/2016 09:33 AM, Laurent Desnogues wrote:
>>>>> The 'flags' field is 64-bit.  You're thinking of cflags, I guess.
>>>>
>>>> Well that's silly.  Since it's filled in via
>>>>
>>>> static inline void cpu_get_tb_cpu_state(CPUMIPSState *env, target_ulong *pc,
>>>>                                         target_ulong *cs_base, int *flags)
>>>>
>>>> and passed back in to generate code with
>>>>
>>>> TranslationBlock *tb_gen_code(CPUState *cpu,
>>>>                               target_ulong pc, target_ulong cs_base, int flags,
>>>>                               int cflags);
>>>>
>>>> So while TranslationBlock stores "uint64_t", the producer and consumer see "int".
>>>
>>> I agree.  I guess TranslationBlock should be fixed to use uint32_t
>>> (note several functions have to be changed from using int to uint32_t
>>> or aarch64-softmmu will fail).
>>
>> Can you please elaborate on this?
>
> The arm port is using some high bits, including
>
> #define ARM_TBFLAG_AARCH64_STATE_SHIFT 31
> #define ARM_TBFLAG_AARCH64_STATE_MASK  (1U << ARM_TBFLAG_AARCH64_STATE_SHIFT)

Yes, that's why I advocate the use of uint32_t over int.

Emilio, I can't reproduce the failure I had, but fixing the warnings
gcc issues when using uint32_t instead of int for cpu_get_tb_cpu_state
should be enough to make the change safe.  Note this doesn't really
belong to this patch set :-)


Laurent

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-04-07  0:37                     ` Emilio G. Cota
@ 2016-04-07  8:46                       ` Paolo Bonzini
  0 siblings, 0 replies; 62+ messages in thread
From: Paolo Bonzini @ 2016-04-07  8:46 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: MTTCG Devel, Peter Maydell, Peter Crosthwaite, QEMU Developers,
	Sergey Fedorov, Alex Bennée, Richard Henderson



On 07/04/2016 02:37, Emilio G. Cota wrote:
> I take this back. I don't know anymore what I measured earlier today--it's
> been a long day and was juggling quite a few things.
> 
> I essentially see the same chain lengths (within 0.2%) for either function, i.e.
> func3 or func5 with the padded 0's when running arm-softmmu. So this
> is good news :>

It's also much more reasonable for a good hash function .:)

>> Perhaps better is to always use a three-word xxhash, but pick the 64-bit
>> version if any of phys_pc and pc are 64-bits.  The unrolling would be
>> very effective, and the performance penalty not too important (64-bit on
>> 32-bit is very slow anyway).
> 
> By "the 64-bit version" you mean what I called func5? That is:
> 
> if (sizeof(phys_pc) == sizeof(uint64_t) || sizeof(pc) == sizeof(uint64_t))
> 	return tb_hash_func5();
> return tb_hash_func3();
> 
> or do you mean xxhash64 (which I did not include in my patchset)?

I meant xxhash64, but using func5 unconditionally is by far my preferred
choice.

Paolo

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] qht: QEMU's fast, resizable and scalable Hash Table
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 08/10] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
  2016-04-05  9:01   ` Paolo Bonzini
  2016-04-05 15:50   ` Richard Henderson
@ 2016-04-08 10:27   ` Alex Bennée
  2016-04-19 23:03     ` Emilio G. Cota
  2 siblings, 1 reply; 62+ messages in thread
From: Alex Bennée @ 2016-04-08 10:27 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> This is a hash table with optional auto-resizing and MRU promotion for
> reads and writes. Its implementation goal is to stay fast while
> scaling for read-mostly workloads.
>
> A hash table with these features will be necessary for the scalability
> of the ongoing MTTCG work; before those changes arrive we can already
> benefit from the single-threaded speedup that qht also provides.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/qht.h | 150 ++++++++++++++++++
>  util/Makefile.objs |   2 +-
>  util/qht.c         | 458 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 609 insertions(+), 1 deletion(-)
>  create mode 100644 include/qemu/qht.h
>  create mode 100644 util/qht.c
>
> diff --git a/include/qemu/qht.h b/include/qemu/qht.h
> new file mode 100644
> index 0000000..63244e4
> --- /dev/null
> +++ b/include/qemu/qht.h
> @@ -0,0 +1,150 @@
> +/*
> + * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
> + *
> + * License: GNU GPL, version 2 or later.
> + *   See the COPYING file in the top-level directory.
> + */
> +#ifndef QHT_H
> +#define QHT_H
> +
> +#include <stdbool.h>
> +#include <assert.h>
> +#include <stdint.h>
> +#include <stdio.h>
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "qemu/spinlock.h"
> +#include "qemu/seqlock.h"
> +#include "qemu/rcu.h"
> +
> +#if HOST_LONG_BITS == 32
> +#define QHT_BUCKET_ENTRIES 6
> +#else /* 64-bit */
> +#define QHT_BUCKET_ENTRIES 4
> +#endif

Are these purely a function of the total size of qht_bucket bellow and keeping
it nicely cacheline aligned?

> +
> +struct qht_bucket {
> +    QemuSpinLock lock;
> +    QemuSeqLock sequence;
> +    uint32_t hashes[QHT_BUCKET_ENTRIES];
> +    void *pointers[QHT_BUCKET_ENTRIES];
> +    struct qht_bucket *next;
> +} QEMU_CACHELINE_ALIGNED;

For the TLB code we have compile time checks QEMU_BUILD_BUG_ON() to warn
the developers if we go over a boundary. Might be worth adding a similar
here.

> +struct qht_map {
> +    struct qht_bucket *buckets;
> +    uint64_t n;
> +    uint64_t n_items;
> +    uint64_t n_items_threshold;
> +    struct rcu_head rcu;
> +};
> +
> +struct qht {
> +    struct qht_map *map;
> +    unsigned int mode;
> +};
> +
> +typedef bool (*qht_lookup_func_t)(const void *obj, const void *userp);
> +typedef void (*qht_iter_func_t)(struct qht *ht, void *p, uint32_t h, void *up);
> +
> +#define QHT_MODE_MRU_LOOKUP  0x1 /* move looked-up items to head */
> +#define QHT_MODE_MRU_INSERT  0x2 /* insert new elements at the head */
> +#define QHT_MODE_AUTO_RESIZE 0x4 /* auto-resize when heavily loaded */
> +
> +void qht_init(struct qht *ht, uint64_t n_elems, unsigned int mode);
> +
> +/* call only when there are no readers left */
> +void qht_destroy(struct qht *ht);
> +

What's the rationale for making the caller responsible for locking here?
Are we going to be protected by tb_lock in the MTTCG case rather than a
finer grained locking inside the qht?

> +/* call with an external lock held */
> +void qht_reset(struct qht *ht);
> +
> +/* call with an external lock held */
> +void qht_reset_size(struct qht *ht, uint64_t n_elems);
> +
> +/* call with an external lock held */
> +void qht_insert(struct qht *ht, void *p, uint32_t hash);
> +
> +/* call with an external lock held */
> +bool qht_remove(struct qht *ht, const void *p, uint32_t hash);
> +
> +/* call with an external lock held */
> +void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp);
> +
> +/* call with an external lock held */
> +void qht_grow(struct qht *ht);
> +
> +void qht_bucket_mru(struct qht_bucket *b, struct qht_bucket *orig,
> +                    const void *p, int pos);
> +
> +static inline
> +struct qht_bucket *qht_map_to_bucket(struct qht_map *map, uint32_t hash)
> +{
> +    return &map->buckets[hash & (map->n - 1)];
> +}
> +
> +static inline
> +void *__qht_lookup(struct qht_bucket *b, struct qht_bucket **far_bucket,
> +                    int *pos, qht_lookup_func_t func, const void *userp,
> +                    uint32_t hash)

Aside the inline is there any reason to keep such an implementation
detail in the header. I certainly couldn't measure a difference after I
moved them into qht.c.

> +{
> +    unsigned int count = 0;
> +    int i;
> +
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (b->hashes[i] == hash) {
> +                void *p = atomic_read(&b->pointers[i]);
> +
> +                /*
> +                 * p could be NULL if we're racing with a writer. We could use
> +                 * barriers here but for performance we only issue the ones
> +                 * in the seqlock.
> +                 */
> +                if (likely(p && func(p, userp))) {

A comment about saving the position for MRU might be worthwhile here.

> +                    if (unlikely(count)) {
> +                        *far_bucket = b;
> +                        *pos = i;
> +                    }
> +                    return p;
> +                }
> +            }
> +        }
> +        count++;
> +        b = atomic_read(&b->next);
> +        /*
> +         * This barrier guarantees that we will read a properly initialized
> +         * b->next; it is paired with an smp_wmb() before setting b->next.
> +         */
> +        smp_rmb();
> +    } while (b);
> +    return NULL;
> +}
> +
> +static inline void *qht_lookup(struct qht *ht, qht_lookup_func_t func,
> +                                const void *userp, uint32_t hash)
> +{
> +    struct qht_bucket *far_bucket = NULL;
> +    struct qht_bucket *b;
> +    struct qht_map *map;
> +    uint32_t version;
> +    int pos = 0;
> +    void *ret;
> +
> +    map = atomic_read(&ht->map);
> +    /* paired with smp_wmb() before setting ht->map */
> +    smp_rmb();
> +    b = qht_map_to_bucket(map, hash);
> +
> +    do {
> +        version = seqlock_read_begin(&b->sequence);
> +        ret = __qht_lookup(b, &far_bucket, &pos, func, userp, hash);
> +    } while (seqlock_read_retry(&b->sequence, version));

OK I was slightly confused by this until I got further along. Maybe some
comments in the qht_bucket structure to point out the seqlock is
important when a bucket is at the head of the list.

> +    if ((ht->mode & QHT_MODE_MRU_LOOKUP) && unlikely(far_bucket)) {
> +        qht_bucket_mru(b, far_bucket, ret, pos);
> +    }
> +    return ret;
> +}
> +
> +#endif /* QHT_H */
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index a8a777e..ae893dc 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -1,4 +1,4 @@
> -util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
> +util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o qht.o
>  util-obj-$(CONFIG_POSIX) += compatfd.o
>  util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
>  util-obj-$(CONFIG_POSIX) += mmap-alloc.o
> diff --git a/util/qht.c b/util/qht.c
> new file mode 100644
> index 0000000..590cd8a
> --- /dev/null
> +++ b/util/qht.c
> @@ -0,0 +1,458 @@
> +/*
> + * qht.c - QEMU Hash Table, designed to scale for read-mostly workloads.
> + *
> + * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
> + *
> + * License: GNU GPL, version 2 or later.
> + *   See the COPYING file in the top-level directory.
> + *
> + * Assumptions:
> + * - Writers and iterators must take an external lock.
> + * - A hash value of 0 is invalid.
> + * - NULL cannot be inserted as a pointer value.
> + *
> + * Features:
> + * - Optional auto-resizing: the hash table resizes up if the load surpasses
> + *   a certain threshold. Resizing is done concurrently with readers.
> + * - Optional bucket MRU promotion policy.
> + *
> + * The key structure is the bucket, which is cacheline-sized. Buckets
> + * contain a few hash values and pointers; the u32 hash values are stored in
> + * full so that resizing is fast. Having this structure instead of directly
> + * chaining items has three advantages:
> + * - Failed lookups fail fast, and touch a minimum number of cache lines.
> + * - Resizing the hash table with concurrent lookups is easy.
> + * - We can have a few Most-Recently-Used (MRU) hash-pointer pairs in the same
> + *   head bucket. This helps scalability, since MRU promotions (i.e. writes to
> + *   the bucket) become less common.
> + *
> + * For concurrent lookups we use a per-bucket seqlock; per-bucket spinlocks
> + * allow readers (lookups) to upgrade to writers and thus implement an MRU
> + * promotion policy; these MRU-induced writes do not touch the cache lines of
> + * other head buckets.
> + *
> + * Note that there are two types of buckets:
> + * 1. "head" buckets are the ones allocated in the array of buckets in qht_map.
> + * 2. all "non-head" buckets (i.e. all others) are members of a chain that
> + *    starts from a head bucket.
> + * Note that the seqlock and spinlock of a head bucket applies to all buckets
> + * chained to it; these two fields are unused in non-head buckets.
> + *
> + * Resizing is done by taking all spinlocks (so that no readers-turned-writers
> + * can race with us) and then placing all elements into a new hash table. Last,
> + * the ht->map pointer is set, and the old map is freed once no RCU readers can
> + * see it anymore.
> + *
> + * Related Work:
> + * - Idea of cacheline-sized buckets with full hashes taken from:
> + *   David, Guerraoui & Trigonakis, "Asynchronized Concurrency:
> + *   The Secret to Scaling Concurrent Search Data Structures", ASPLOS'15.
> + * - Why not RCU-based hash tables? They would allow us to get rid of the
> + *   seqlock, but resizing would take forever since RCU read critical
> + *   sections in QEMU take quite a long time.
> + *   More info on relativistic hash tables:
> + *   + Triplett, McKenney & Walpole, "Resizable, Scalable, Concurrent Hash
> + *     Tables via Relativistic Programming", USENIX ATC'11.
> + *   + Corbet, "Relativistic hash tables, part 1: Algorithms", @ lwn.net, 2014.
> + *     https://lwn.net/Articles/612021/

Very helpful links, thanks ;-)

> + */
> +#include "qemu/qht.h"
> +#include "qemu/atomic.h"
> +
> +static inline uint64_t qht_elems_to_buckets(uint64_t n_elems)
> +{
> +    return pow2ceil(n_elems / QHT_BUCKET_ENTRIES);
> +}
> +
> +static inline void qht_head_init(struct qht_bucket *b)
> +{
> +    memset(b, 0, sizeof(*b));
> +    qemu_spinlock_init(&b->lock);
> +    seqlock_init(&b->sequence);
> +}
> +
> +static inline void qht_chain_destroy(struct qht_bucket *head)
> +{
> +    struct qht_bucket *curr = head->next;
> +    struct qht_bucket *prev;
> +
> +    while (curr) {
> +        prev = curr;
> +        curr = curr->next;
> +        qemu_vfree(prev);
> +    }
> +}
> +
> +/* pass only an orphan map */
> +static void qht_map_destroy(struct qht_map *map)
> +{
> +    uint64_t i;
> +
> +    for (i = 0; i < map->n; i++) {
> +        qht_chain_destroy(&map->buckets[i]);
> +    }
> +    qemu_vfree(map->buckets);
> +    g_free(map);
> +}
> +
> +static void qht_map_reclaim(struct rcu_head *rcu)
> +{
> +    struct qht_map *map = container_of(rcu, struct qht_map, rcu);
> +
> +    qht_map_destroy(map);
> +}
> +
> +static struct qht_map *qht_map_create(uint64_t n)
> +{
> +    struct qht_map *map;
> +    uint64_t i;
> +
> +    assert(n <= (1ULL << 32)); /* we're using a 32-bit hash func */
> +    map = g_malloc(sizeof(*map));
> +    map->n = n;
> +    map->n_items = 0;
> +    map->n_items_threshold = n * QHT_BUCKET_ENTRIES / 2;
> +    map->buckets = qemu_memalign(QEMU_CACHELINE, sizeof(*map->buckets) * n);
> +    for (i = 0; i < n; i++) {
> +        qht_head_init(&map->buckets[i]);
> +    }
> +    return map;
> +}
> +
> +static inline void qht_publish(struct qht *ht, struct qht_map *new)
> +{
> +    /* Readers should see a properly initialized map; pair with smp_rmb() */
> +    smp_wmb();
> +    atomic_set(&ht->map, new);
> +}
> +
> +void qht_init(struct qht *ht, uint64_t n_elems, unsigned int mode)
> +{
> +    struct qht_map *map;
> +    uint64_t n = qht_elems_to_buckets(n_elems);
> +
> +    map = qht_map_create(n);
> +    ht->mode = mode;
> +    qht_publish(ht, map);
> +}
> +
> +/* call only when there are no readers left */
> +void qht_destroy(struct qht *ht)
> +{
> +    qht_map_destroy(ht->map);
> +    memset(ht, 0, sizeof(*ht));
> +}
> +
> +static void __qht_bucket_reset(struct qht_bucket *b)

I think Paolo has already mentioned __ names.

> +{
> +    int i;
> +
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            atomic_set(&b->hashes[i], 0);
> +            atomic_set(&b->pointers[i], NULL);
> +        }
> +        b = b->next;
> +    } while (b);
> +}
> +
> +static void qht_bucket_reset(struct qht_bucket *b)
> +{
> +    qemu_spinlock_lock(&b->lock);
> +    seqlock_write_begin(&b->sequence);
> +    __qht_bucket_reset(b);

Personally I'd just import the __qht_bucket_reset into here to make it
clear the seqlock protects us from using values halfway from being
reset. Otherwise a quick comment to repeat the head text that effect on
the above function.

> +    seqlock_write_end(&b->sequence);
> +    qemu_spinlock_unlock(&b->lock);
> +}
> +
> +/* call with an external lock held */
> +void qht_reset(struct qht *ht)
> +{
> +    struct qht_map *map = ht->map;
> +    uint64_t i;
> +
> +    for (i = 0; i < map->n; i++) {
> +        qht_bucket_reset(&map->buckets[i]);
> +    }
> +}
> +
> +/* call with an external lock held */
> +void qht_reset_size(struct qht *ht, uint64_t n_elems)
> +{
> +    struct qht_map *old = ht->map;
> +
> +    qht_reset(ht);
> +    if (old->n == qht_elems_to_buckets(n_elems)) {
> +        return;
> +    }
> +    qht_init(ht, n_elems, ht->mode);
> +    call_rcu1(&old->rcu, qht_map_reclaim);
> +}
> +
> +/* Can only be called when at least orig is the 3rd link i.e. head->2nd->orig */
> +static void
> +__qht_move_bucket_to_front(struct qht_bucket *head, struct qht_bucket *orig)
> +{
> +    struct qht_bucket *b = head->next;
> +
> +    for (;;) {
> +        if (b->next == orig) {
> +            atomic_set(&b->next, orig->next);
> +            atomic_set(&orig->next, head->next);
> +            atomic_set(&head->next, orig);
> +            return;
> +        }
> +        b = b->next;
> +    }
> +}
> +
> +static inline void __qht_bucket_mru_head(struct qht_bucket *b, int pos)
> +{
> +    uint32_t orig_hash = b->hashes[pos];
> +    void *orig_p = b->pointers[pos];
> +    int i;
> +
> +    for (i = 0; i < pos; i++) {
> +        atomic_set(&b->hashes[i + 1], b->hashes[i]);
> +        atomic_set(&b->pointers[i + 1], b->pointers[i]);
> +    }
> +    atomic_set(&b->hashes[0], orig_hash);
> +    atomic_set(&b->pointers[0], orig_p);
> +}
> +
> +
> +/* call with head->lock held */
> +static inline void
> +__qht_bucket_mru(struct qht_bucket *head, struct qht_bucket *orig, int pos)
> +{
> +    uint32_t *dest_hash;
> +    void **dest_p;
> +    void *p;
> +    uint32_t hash;
> +    int i;
> +
> +    if (head == orig) {
> +        return __qht_bucket_mru_head(head, pos);
> +    }

OK this is calling out for a comment because I'm having trouble
following the logic. I think we are pushing the last entry in the head
bucket out the bucket the MRU entry is coming from while shuffling the
rest up for the new entry.

> +    for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +        if (i == QHT_BUCKET_ENTRIES - 1) {
> +            dest_hash = &orig->hashes[pos];
> +            dest_p = &o rig->pointers[pos];
> +        } else {
> +            dest_hash = &head->hashes[i + 1];
> +            dest_p = &head->pointers[i + 1];
> +        }
> +        hash = *dest_hash;
> +        p = *dest_p;
> +
> +        atomic_set(dest_hash, head->hashes[i]);
> +        atomic_set(dest_p, head->pointers[i]);
> +
> +        atomic_set(&head->hashes[i], hash);
> +        atomic_set(&head->pointers[i], p);
> +    }

So the bucket that has just swapped the MRU entry for an evicted entry
gets placed after head. Is there a chance we end up just bouncing the
head->next bucket each hit?

> +    if (head->next != orig) {
> +        __qht_move_bucket_to_front(head, orig);
> +    }
> +}
> +
> +void qht_bucket_mru(struct qht_bucket *head, struct qht_bucket *orig,
> +                    const void *p, int pos)
> +{
> +    qemu_spinlock_lock(&head->lock);
> +    if (unlikely(orig->pointers[pos] != p)) {
> +        /* while we acquired the lock, the bucket was updated, so bail out */
> +        goto out;
> +    }
> +    seqlock_write_begin(&head->sequence);
> +    __qht_bucket_mru(head, orig, pos);
> +    seqlock_write_end(&head->sequence);
> + out:
> +    qemu_spinlock_unlock(&head->lock);
> +}
> +
> +/* call with b->lock held */
> +static void __qht_insert(struct qht *ht, struct qht_map *map,
> +                         struct qht_bucket *b, void *p, uint32_t hash)
> +{
> +    struct qht_bucket *head = b;
> +    struct qht_bucket *prev = NULL;
> +    struct qht_bucket *new = NULL;
> +    unsigned int count = 0;
> +    int i;
> +
> +    for (;;) {
> +        if (b == NULL) {
> +            b = qemu_memalign(QEMU_CACHELINE, sizeof(*b));
> +            memset(b, 0, sizeof(*b));
> +            new = b;
> +        }
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (b->hashes[i]) {
> +                continue;
> +            }
> +            /* found an empty key: acquire the seqlock and write */
> +            seqlock_write_begin(&head->sequence);
> +            if (new) {
> +                /*
> +                 * This barrier is paired with smp_rmb() after reading
> +                 * b->next when not holding b->lock.
> +                 */
> +                smp_wmb();
> +                atomic_set(&prev->next, b);
> +            }
> +            atomic_set(&b->hashes[i], hash);
> +            atomic_set(&b->pointers[i], p);
> +            if ((ht->mode & QHT_MODE_MRU_INSERT) && count) {
> +                __qht_bucket_mru(head, b, i);
> +            }
> +            seqlock_write_end(&head->sequence);
> +            map->n_items++;
> +            return;
> +        }
> +        prev = b;
> +        b = b->next;
> +        count++;
> +    }
> +}
> +
> +/* call with an external lock held */
> +void qht_insert(struct qht *ht, void *p, uint32_t hash)
> +{
> +    struct qht_map *map = ht->map;
> +    struct qht_bucket *b = qht_map_to_bucket(map, hash);
> +
> +    qemu_spinlock_lock(&b->lock);
> +    __qht_insert(ht, map, b, p, hash);
> +    qemu_spinlock_unlock(&b->lock);
> +
> +    if ((ht->mode & QHT_MODE_AUTO_RESIZE) &&
> +        unlikely(map->n_items > map->n_items_threshold)) {
> +        qht_grow(ht);
> +    }
> +}
> +
> +/* call with b->lock held */
> +static inline bool __qht_remove(struct qht_map *map, struct qht_bucket *b,
> +                                const void *p, uint32_t hash)
> +{
> +    int i;
> +
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (b->hashes[i] == hash && b->pointers[i] == p) {
> +                atomic_set(&b->hashes[i], 0);
> +                atomic_set(&b->pointers[i], NULL);
> +                map->n_items--;
> +                return true;
> +            }
> +        }
> +        b = b->next;
> +    } while (b);
> +    return false;
> +}
> +
> +/* call with an external lock held */
> +bool qht_remove(struct qht *ht, const void *p, uint32_t hash)
> +{
> +    struct qht_map *map = ht->map;
> +    struct qht_bucket *b = qht_map_to_bucket(map, hash);
> +    bool ret;
> +
> +    qemu_spinlock_lock(&b->lock);
> +    seqlock_write_begin(&b->sequence);
> +    ret = __qht_remove(map, b, p, hash);
> +    seqlock_write_end(&b->sequence);
> +    qemu_spinlock_unlock(&b->lock);
> +    return ret;
> +}
> +
> +/*
> + * acquire all spinlocks from a map, so that writers cannot race with
> + * readers-turned-writers.
> + */
> +static void qht_lock(struct qht *ht)
> +{
> +    struct qht_map *map = ht->map;
> +    uint64_t i;
> +
> +    /* if readers cannot upgrade, do nothing */
> +    if (!(ht->mode & QHT_MODE_MRU_LOOKUP)) {
> +        return;
> +    }
> +    for (i = 0; i < map->n; i++) {
> +        struct qht_bucket *b = &map->buckets[i];
> +
> +        qemu_spinlock_lock(&b->lock);
> +    }
> +}
> +
> +static void qht_unlock(struct qht *ht)
> +{
> +    struct qht_map *map = ht->map;
> +    uint64_t i;
> +
> +    if (!(ht->mode & QHT_MODE_MRU_LOOKUP)) {
> +        return;
> +    }
> +    for (i = 0; i < map->n; i++) {
> +        struct qht_bucket *b = &map->buckets[i];
> +
> +        qemu_spinlock_unlock(&b->lock);
> +    }
> +}
> +
> +/* external lock + all of the map's locks held (if !MRU_LOOKUP) */
> +static inline void __qht_map_iter(struct qht *ht, struct qht_map *map,
> +                                  qht_iter_func_t func, void *userp)
> +{
> +    uint64_t i;
> +
> +    for (i = 0; i < map->n; i++) {
> +        struct qht_bucket *b = &map->buckets[i];
> +        int j;
> +
> +        do {
> +            for (j = 0; j < QHT_BUCKET_ENTRIES; j++) {
> +                if (b->hashes[j]) {
> +                    func(ht, b->pointers[j], b->hashes[j], userp);
> +                }
> +            }
> +            b = b->next;
> +        } while (b);
> +    }
> +}
> +
> +/* call with an external lock held */
> +void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp)
> +{
> +    qht_lock(ht);
> +    __qht_map_iter(ht, ht->map, func, userp);
> +    qht_unlock(ht);
> +}
> +
> +static void qht_map_copy(struct qht *ht, void *p, uint32_t hash, void *userp)
> +{
> +    struct qht_map *new = userp;
> +    struct qht_bucket *b = qht_map_to_bucket(new, hash);
> +
> +    /* no need to acquire b->lock because no thread has seen this map yet */
> +    __qht_insert(ht, new, b, p, hash);
> +}
> +
> +/* call with an external lock held */
> +void qht_grow(struct qht *ht)
> +{
> +    struct qht_map *old = ht->map;
> +    struct qht_map *new;
> +    uint64_t n = old->n * 2;
> +
> +    if (unlikely(n > (1ULL << 32))) {
> +        return;
> +    }
> +    new = qht_map_create(n);
> +    qht_iter(ht, qht_map_copy, new);
> +
> +    qht_publish(ht, new);
> +    call_rcu1(&old->rcu, qht_map_reclaim);
> +}

This looks good and I can see this being a useful utility function
across QEMU. My main problem was following exactly what happens when
entries are moved around for the MRU case. As users don't have to have a
deep knowledge of the implementation details this isn't a major issue
although a few more expansive comments or ASCII diagrams may help if
you are feeling up to it ;-)

--
Alex Bennée

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 09/10] qht: add test program
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 09/10] qht: add test program Emilio G. Cota
@ 2016-04-08 10:45   ` Alex Bennée
  2016-04-19 23:06     ` Emilio G. Cota
  0 siblings, 1 reply; 62+ messages in thread
From: Alex Bennée @ 2016-04-08 10:45 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  tests/.gitignore |   1 +
>  tests/Makefile   |   6 +++-
>  tests/test-qht.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 106 insertions(+), 1 deletion(-)
>  create mode 100644 tests/test-qht.c
>
> diff --git a/tests/.gitignore b/tests/.gitignore
> index b7bf13e..d6d0700 100644
> --- a/tests/.gitignore
> +++ b/tests/.gitignore
> @@ -47,6 +47,7 @@ test-qapi-visit.[ch]
>  test-qdev-global-props
>  test-qemu-opts
>  test-qga
> +test-qht
>  test-qmp-commands
>  test-qmp-commands.h
>  test-qmp-event
> diff --git a/tests/Makefile b/tests/Makefile
> index 45b9048..b5a99d6 100644
> --- a/tests/Makefile
> +++ b/tests/Makefile
> @@ -68,6 +68,8 @@ check-unit-y += tests/rcutorture$(EXESUF)
>  gcov-files-rcutorture-y = util/rcu.c
>  check-unit-y += tests/test-rcu-list$(EXESUF)
>  gcov-files-test-rcu-list-y = util/rcu.c
> +check-unit-y += tests/test-qht$(EXESUF)
> +gcov-files-test-qht-y = util/qht.c
>  check-unit-y += tests/test-bitops$(EXESUF)
>  check-unit-$(CONFIG_HAS_GLIB_SUBPROCESS_TESTS) += tests/test-qdev-global-props$(EXESUF)
>  check-unit-y += tests/check-qom-interface$(EXESUF)
> @@ -387,7 +389,8 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
>  	tests/test-qmp-commands.o tests/test-visitor-serialization.o \
>  	tests/test-x86-cpuid.o tests/test-mul64.o tests/test-int128.o \
>  	tests/test-opts-visitor.o tests/test-qmp-event.o \
> -	tests/rcutorture.o tests/test-rcu-list.o
> +	tests/rcutorture.o tests/test-rcu-list.o \
> +	tests/test-qht.o
>
>  $(test-obj-y): QEMU_INCLUDES += -Itests
>  QEMU_CFLAGS += -I$(SRC_PATH)/tests
> @@ -425,6 +428,7 @@ tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o
>  tests/test-int128$(EXESUF): tests/test-int128.o
>  tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
>  tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
> +tests/test-qht$(EXESUF): tests/test-qht.o $(test-util-obj-y)
>
>  tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
>  	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
> diff --git a/tests/test-qht.c b/tests/test-qht.c
> new file mode 100644
> index 0000000..f553fbc
> --- /dev/null
> +++ b/tests/test-qht.c
> @@ -0,0 +1,100 @@
> +#include "qemu/osdep.h"
> +#include "qemu/xxhash.h"
> +#include "qemu/qht.h"
> +
> +#define N 5000
> +#define SEED 1
> +
> +static struct qht ht;
> +static int32_t arr[N];
> +
> +static bool is_equal(const void *obj, const void *userp)
> +{
> +    const int32_t *a = obj;
> +    const int32_t *b = userp;
> +
> +    return *a == *b;
> +}
> +
> +static void insert(int a, int b)
> +{
> +    int i;
> +
> +    for (i = a; i < b; i++) {
> +        uint32_t hash;
> +
> +        arr[i] = i;
> +        hash = qemu_xxh32((uint32_t *)&arr[i], 1, SEED);
> +
> +        qht_insert(&ht, &arr[i], hash);
> +    }
> +}
> +
> +static void rm(int init, int end)
> +{
> +    int i;
> +
> +    for (i = init; i < end; i++) {
> +        uint32_t hash;
> +
> +        hash = qemu_xxh32((uint32_t *)&arr[i], 1, SEED);
> +        assert(qht_remove(&ht, &arr[i], hash));
> +    }
> +}
> +
> +static void check(int a, int b, bool expected)
> +{
> +    int i;
> +
> +    for (i = a; i < b; i++) {
> +        void *p;
> +        uint32_t hash;
> +        int32_t val;
> +
> +        val = i;
> +        hash = qemu_xxh32((uint32_t *)&val, 1, SEED);
> +        p = qht_lookup(&ht, is_equal, &val, hash);
> +        assert(!!p == expected);
> +    }
> +}
> +
> +static void count_func(struct qht *ht, void *p, uint32_t hash, void *userp)
> +{
> +    unsigned int *curr = userp;
> +
> +    (*curr)++;
> +}
> +
> +static void iter_check(unsigned int count)
> +{
> +    unsigned int curr = 0;
> +
> +    qht_iter(&ht, count_func, &curr);
> +    assert(curr == count);
> +}
> +
> +static void qht_test(unsigned int mode)
> +{
> +    qht_init(&ht, 0, mode);
> +
> +    insert(0, N);
> +    check(0, N, true);
> +    check(-N, -1, false);
> +    iter_check(N);
> +    rm(1, 2);
> +    qht_reset_size(&ht, 0);
> +    check(0, N, false);
> +
> +    qht_destroy(&ht);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +    qht_test(0);
> +    qht_test(QHT_MODE_MRU_LOOKUP);
> +    qht_test(QHT_MODE_MRU_LOOKUP | QHT_MODE_MRU_INSERT);
> +    qht_test(QHT_MODE_MRU_LOOKUP | QHT_MODE_MRU_INSERT | QHT_MODE_AUTO_RESIZE);
> +    qht_test(QHT_MODE_AUTO_RESIZE | QHT_MODE_MRU_INSERT);
> +    qht_test(QHT_MODE_MRU_LOOKUP | QHT_MODE_MRU_INSERT);
> +    return 0;
> +}

A couple of notes:

  - these should use the gtester boiler plate for reporting results
  - AFAICT they are not exercising the multi-element hashing we actually
    use in the main code
  - it would be nice to add a check on the bucket/map distribution to
    defend against the algorithm being accidentally weakened with follow up patches

However tests are good :-)

--
Alex Bennée

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 10/10] tb hash: track translated blocks with qht
  2016-04-05  5:30 ` [Qemu-devel] [PATCH 10/10] tb hash: track translated blocks with qht Emilio G. Cota
@ 2016-04-08 12:39   ` Alex Bennée
  0 siblings, 0 replies; 62+ messages in thread
From: Alex Bennée @ 2016-04-08 12:39 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> Having a fixed-size hash table for keeping track of all translation blocks
> is suboptimal: some workloads are just too big or too small to get maximum
> performance from the hash table. The MRU promotion policy helps improve
> performance when the hash table is a little undersized, but it cannot
> make up for severely undersized hash tables.
>
> Furthermore, frequent MRU promotions result in writes that are a scalability
> bottleneck. For scalability, lookups should only perform reads, not writes.
> This is not a big deal for now, but it will become one once MTTCG matures.
>
> The appended fixes these issues by using qht as the implementation of
> the TB hash table. This solution is superior to other alternatives considered,
> namely:
>
> - master: implementation in QEMU before this patchset
> - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
> - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
>               MRU is implemented here by adding an intermediate struct
>               that contains the u32 hash and a pointer to the TB; this
>               allows us, on an MRU promotion, to copy said struct (that is not
>               at the head), and put this new copy at the head. After a grace
>               period, the original non-head struct can be eliminated, and
>               after another grace period, freed.
> - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
>                    no MRU for lookups; MRU for inserts.
> The appended solution is the following:
> - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
>                  no MRU for lookups; MRU for inserts.
>
> The plots below compare the considered solutions. The Y axis shows the
> boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
> sweeps the number of buckets (or initial number of buckets for qht-autoresize).
> The plots in PNG format (and with errorbars) can be seen here:
>   http://imgur.com/a/Awgnq
>
> Each test runs 5 times, and the entire QEMU process is pinned to a
> single core for repeatability of results.
>
>                             Host: Intel Xeon E5-2690
>
>   28 ++------------+-------------+-------------+-------------+------------++
>      A*****        +             +             +             master **A*** +
>   27 ++    *                                                 xxhash ##B###++
>      |      A******A******                               xxhash-rcu $$C$$$ |
>   26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
>      D%%$$                              A******A******A*qht-dyn-mru A*E****A
>   25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
>      B#####%                                                               |
>   24 ++    #C$$$$$                                                        ++
>      |      B###  $                                                        |
>      |          ## C$$$$$$                                                 |
>   23 ++           #       C$$$$$$                                         ++
>      |             B######       C$$$$$$                                %%%D
>   22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
>      |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
>   21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
>      +             E@@@   F&&&   +      E@     +      F&&&   +             +
>   20 ++------------+-------------+-------------+-------------+------------++
>      14            16            18            20            22            24
>                              log2 number of buckets
>
>                                  Host: Intel i7-4790K
>
>   14.5 ++------------+------------+-------------+------------+------------++
>        A**           +            +             +            master **A*** +
>     14 ++ **                                                 xxhash ##B###++
>   13.5 ++   **                                           xxhash-rcu $$C$$$++
>        |                                            qht-fixed-nomru %%D%%% |
>     13 ++     A******                                   qht-dyn-mru @@E@@@++
>        |             A*****A******A******             qht-dyn-nomru &&F&&& |
>   12.5 C$$                               A******A******A*****A******    ***A
>     12 ++ $$                                                        A***  ++
>        D%%% $$                                                             |
>   11.5 ++  %%                                                             ++
>        B###  %C$$$$$$                                                      |
>     11 ++  ## D%%%%% C$$$$$                                               ++
>        |     #      %      C$$$$$$                                         |
>   10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
>     10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
>        +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
>    9.5 ++------------+------------+-------------+------------+------------++
>        14            16           18            20           22            24
>                               log2 number of buckets
>
> Note that the original point before this patch series is X=15 for "master";
> the little sensitivity to the increased number of buckets is due to the
> poor hashing function in master.
>
> xxhash-rcu has significant overhead due to the constant churn of allocating
> and deallocating intermediate structs for implementing MRU. An alternative
> would be do consider failed lookups as "maybe not there", and then
> acquire the external lock (tb_lock in this case) to really confirm that
> there was indeed a failed lookup. This, however, would not be enough
> to implement dynamic resizing--this is more complex: see
> "Resizable, Scalable, Concurrent Hash Tables via Relativistic
> Programming" by Triplett, McKenney and Walpole. This solution was
> discarded due to the very coarse RCU read critical sections that we have
> in MTTCG; resizing requires waiting for readers after every pointer update,
> and resizes require many pointer updates, so this would quickly become
> prohibitive.
>
> qht-fixed-nomru shows that MRU promotion is advisable for undersized
> hash tables.
>
> However, qht-dyn-mru shows that MRU promotion is not important if the
> hash table is properly sized: there is virtually no difference in
> performance between qht-dyn-nomru and qht-dyn-mru.
>
> Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
> X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
> can achieve with optimum sizing of the hash table, while keeping the hash
> table scalable for readers.
>
> The improvement we get before and after this patch for booting debian jessie
> with arm-softmmu is:
>
> - Intel Xeon E5-2690: 10.5% less time
> - Intel i7-4790K: 5.2% less time
>
> We could get this same improvement _for this particular workload_ by
> statically increasing the size of the hash table. But this would hurt
> workloads that do not need a large hash table. The dynamic (upward)
> resizing allows us to start small and enlarge the hash table as needed.
>
> A quick note on downsizing: the table is resized back to 2**15 buckets
> on every tb_flush; this makes sense because it is not guaranteed that the
> table will reach the same number of TBs later on (e.g. most bootup code is
> thrown away after boot); it makes sense to grow the hash table as
> more code blocks are translated. This also avoids the complication of
> having to build downsizing hysteresis logic into qht.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cpu-exec.c              | 82 ++++++++++++++++++++++++-----------------------
>  include/exec/exec-all.h |  9 +++---
>  include/exec/tb-hash.h  |  5 +--
>  translate-all.c         | 84 ++++++++++++++++++++++---------------------------
>  4 files changed, 85 insertions(+), 95 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 4194a4a..e948ffd 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -217,55 +217,59 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
>      tb_free(tb);
>  }
>
> +struct tb_desc {
> +    target_ulong pc;
> +    target_ulong cs_base;
> +    uint64_t flags;
> +    tb_page_addr_t phys_page1;
> +    CPUArchState *env;
> +};
> +
> +static bool tb_cmp(const void *p, const void *d)
> +{
> +    const TranslationBlock *tb = p;
> +    const struct tb_desc *desc = d;
> +
> +    if (tb->pc == desc->pc &&
> +        tb->page_addr[0] == desc->phys_page1 &&
> +        tb->cs_base == desc->cs_base &&
> +        tb->flags == desc->flags) {
> +        /* check next page if needed */
> +        if (tb->page_addr[1] == -1) {
> +            return true;
> +        } else {
> +            tb_page_addr_t phys_page2;
> +            target_ulong virt_page2;
> +
> +            virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
> +            phys_page2 = get_page_addr_code(desc->env, virt_page2);
> +            if (tb->page_addr[1] == phys_page2) {
> +                return true;
> +            }
> +        }
> +    }
> +    return false;
> +}
> +
>  static TranslationBlock *tb_find_physical(CPUState *cpu,
>                                            target_ulong pc,
>                                            target_ulong cs_base,
>                                            uint64_t flags)
>  {
> -    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
> -    TranslationBlock *tb, **ptb1;
>      unsigned int h;
> -    tb_page_addr_t phys_pc, phys_page1;
> -    target_ulong virt_page2;
> +    tb_page_addr_t phys_pc;
> +    struct tb_desc desc;
>
>      tcg_ctx.tb_ctx.tb_invalidated_flag = 0;
>
> -    /* find translated block using physical mappings */
> -    phys_pc = get_page_addr_code(env, pc);
> -    phys_page1 = phys_pc & TARGET_PAGE_MASK;
> +    desc.env = (CPUArchState *)cpu->env_ptr;
> +    desc.cs_base = cs_base;
> +    desc.flags = flags;
> +    desc.pc = pc;
> +    phys_pc = get_page_addr_code(desc.env, pc);
> +    desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
>      h = tb_hash_func(phys_pc, pc, flags);
> -    ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> -    for(;;) {
> -        tb = *ptb1;
> -        if (!tb) {
> -            return NULL;
> -        }
> -        if (tb->pc == pc &&
> -            tb->page_addr[0] == phys_page1 &&
> -            tb->cs_base == cs_base &&
> -            tb->flags == flags) {
> -            /* check next page if needed */
> -            if (tb->page_addr[1] != -1) {
> -                tb_page_addr_t phys_page2;
> -
> -                virt_page2 = (pc & TARGET_PAGE_MASK) +
> -                    TARGET_PAGE_SIZE;
> -                phys_page2 = get_page_addr_code(env, virt_page2);
> -                if (tb->page_addr[1] == phys_page2) {
> -                    break;
> -                }
> -            } else {
> -                break;
> -            }
> -        }
> -        ptb1 = &tb->phys_hash_next;
> -    }
> -
> -    /* Move the TB to the head of the list */
> -    *ptb1 = tb->phys_hash_next;
> -    tb->phys_hash_next = tcg_ctx.tb_ctx.tb_phys_hash[h];
> -    tcg_ctx.tb_ctx.tb_phys_hash[h] = tb;
> -    return tb;
> +    return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
>  }
>
>  static TranslationBlock *tb_find_slow(CPUState *cpu,
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index 7362095..a894817 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -21,6 +21,7 @@
>  #define _EXEC_ALL_H_
>
>  #include "qemu-common.h"
> +#include "qemu/qht.h"
>
>  /* allow to see translation results - the slowdown should be negligible, so we leave it */
>  #define DEBUG_DISAS
> @@ -211,8 +212,8 @@ static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
>
>  #define CODE_GEN_ALIGN           16 /* must be >= of the size of a icache line */
>
> -#define CODE_GEN_PHYS_HASH_BITS     15
> -#define CODE_GEN_PHYS_HASH_SIZE     (1 << CODE_GEN_PHYS_HASH_BITS)
> +#define CODE_GEN_HTABLE_BITS     15
> +#define CODE_GEN_HTABLE_SIZE     (1 << CODE_GEN_HTABLE_BITS)
>
>  /* Estimated block size for TB allocation.  */
>  /* ??? The following is based on a 2015 survey of x86_64 host output.
> @@ -248,8 +249,6 @@ struct TranslationBlock {
>
>      void *tc_ptr;    /* pointer to the translated code */
>      uint8_t *tc_search;  /* pointer to search data */
> -    /* next matching tb for physical address. */
> -    struct TranslationBlock *phys_hash_next;
>      /* original tb when cflags has CF_NOCACHE */
>      struct TranslationBlock *orig_tb;
>      /* first and second physical page containing code. The lower bit
> @@ -280,7 +279,7 @@ typedef struct TBContext TBContext;
>  struct TBContext {
>
>      TranslationBlock *tbs;
> -    TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
> +    struct qht htable;
>      int nb_tbs;
>      /* any access to the tbs or the page table must use this lock */
>      QemuMutex tb_lock;
> diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
> index c4942e1..fabc936 100644
> --- a/include/exec/tb-hash.h
> +++ b/include/exec/tb-hash.h
> @@ -20,7 +20,6 @@
>  #ifndef EXEC_TB_HASH
>  #define EXEC_TB_HASH
>
> -#include "exec/exec-all.h"
>  #include "qemu/xxhash.h"
>
>  /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
> @@ -54,13 +53,11 @@ uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint64_t flags)
>          target_ulong pc;
>          uint64_t flags;
>      } QEMU_PACKED k;
> -    unsigned int hash;
>
>      k.phys_pc = phys_pc;
>      k.pc = pc;
>      k.flags = flags;
> -    hash = qemu_xxh32((uint32_t *)&k, sizeof(k) / sizeof(uint32_t), 1);
> -    return hash & (CODE_GEN_PHYS_HASH_SIZE - 1);
> +    return qemu_xxh32((uint32_t *)&k, sizeof(k) / sizeof(uint32_t), 1);
>  }
>
>  #endif
> diff --git a/translate-all.c b/translate-all.c
> index 8e7f9a7..f0d7f1a 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -735,6 +735,13 @@ static inline void code_gen_alloc(size_t tb_size)
>      qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
>  }
>
> +static void tb_htable_init(void)
> +{
> +    unsigned int mode = QHT_MODE_AUTO_RESIZE | QHT_MODE_MRU_INSERT;
> +
> +    qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
> +}
> +
>  /* Must be called before using the QEMU cpus. 'tb_size' is the size
>     (in bytes) allocated to the translation buffer. Zero means default
>     size. */
> @@ -742,6 +749,7 @@ void tcg_exec_init(unsigned long tb_size)
>  {
>      cpu_gen_init();
>      page_init();
> +    tb_htable_init();
>      code_gen_alloc(tb_size);
>  #if defined(CONFIG_SOFTMMU)
>      /* There's no guest base to take into account, so go ahead and
> @@ -843,7 +851,7 @@ void tb_flush(CPUState *cpu)
>          memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
>      }
>
> -    memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
> +    qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
>      page_flush_tb();
>
>      tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
> @@ -854,60 +862,45 @@ void tb_flush(CPUState *cpu)
>
>  #ifdef DEBUG_TB_CHECK
>
> -static void tb_invalidate_check(target_ulong address)
> +static void
> +__tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void
> *userp)

No __ functions please.

>  {
> -    TranslationBlock *tb;
> -    int i;
> +    TranslationBlock *tb = p;
> +    target_ulong addr = *(target_ulong *)userp;
>
> -    address &= TARGET_PAGE_MASK;
> -    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
> -        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
> -             tb = tb->phys_hash_next) {
> -            if (!(address + TARGET_PAGE_SIZE <= tb->pc ||
> -                  address >= tb->pc + tb->size)) {
> -                printf("ERROR invalidate: address=" TARGET_FMT_lx
> -                       " PC=%08lx size=%04x\n",
> -                       address, (long)tb->pc, tb->size);
> -            }
> -        }
> +    if (!(addr + TARGET_PAGE_SIZE <= tb->pc || addr >= tb->pc + tb->size)) {
> +        printf("ERROR invalidate: address=" TARGET_FMT_lx
> +               " PC=%08lx size=%04x\n", addr, (long)tb->pc, tb->size);
>      }
>  }
>
> -/* verify that all the pages have correct rights for code */
> -static void tb_page_check(void)
> +static void tb_invalidate_check(target_ulong address)
>  {
> -    TranslationBlock *tb;
> -    int i, flags1, flags2;
> -
> -    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
> -        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
> -                tb = tb->phys_hash_next) {
> -            flags1 = page_get_flags(tb->pc);
> -            flags2 = page_get_flags(tb->pc + tb->size - 1);
> -            if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
> -                printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
> -                       (long)tb->pc, tb->size, flags1, flags2);
> -            }
> -        }
> -    }
> +    address &= TARGET_PAGE_MASK;
> +    qht_iter(&tcg_ctx.tb_ctx.htable, __tb_invalidate_check, &address);
>  }
>
> -#endif
> -
> -static inline void tb_hash_remove(TranslationBlock **ptb, TranslationBlock *tb)
> +static void __tb_page_check(struct qht *ht, void *p, uint32_t hash,
> void *userp)

And here.

>  {
> -    TranslationBlock *tb1;
> +    TranslationBlock *tb = p;
> +    int flags1, flags2;
>
> -    for (;;) {
> -        tb1 = *ptb;
> -        if (tb1 == tb) {
> -            *ptb = tb1->phys_hash_next;
> -            break;
> -        }
> -        ptb = &tb1->phys_hash_next;
> +    flags1 = page_get_flags(tb->pc);
> +    flags2 = page_get_flags(tb->pc + tb->size - 1);
> +    if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
> +        printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
> +               (long)tb->pc, tb->size, flags1, flags2);
>      }
>  }
>
> +/* verify that all the pages have correct rights for code */
> +static void tb_page_check(void)
> +{
> +    qht_iter(&tcg_ctx.tb_ctx.htable, __tb_page_check, NULL);
> +}
> +
> +#endif
> +
>  static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
>  {
>      TranslationBlock *tb1;
> @@ -973,7 +966,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>      /* remove the TB from the hash list */
>      phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
>      h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> -    tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
> +    qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
>
>      /* remove the TB from the page list */
>      if (tb->page_addr[0] != page_addr) {
> @@ -1472,13 +1465,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>                           tb_page_addr_t phys_page2)
>  {
>      unsigned int h;
> -    TranslationBlock **ptb;
>
>      /* add in the hash table */
>      h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> -    ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> -    tb->phys_hash_next = *ptb;
> -    *ptb = tb;
> +    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
>
>      /* add in the page list */
>      tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);

Apart from the static function names it looks good. I particularly like
the net loss of code thanks to using utility code ;-)

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

--
Alex Bennée

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 08/10] qht: QEMU's fast, resizable and scalable Hash Table
  2016-04-08 10:27   ` Alex Bennée
@ 2016-04-19 23:03     ` Emilio G. Cota
  0 siblings, 0 replies; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-19 23:03 UTC (permalink / raw)
  To: Alex Bennée
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

Hi Alex,

I'm sending a v3 in a few minutes. I've addressed all your comments there,
so I won't duplicate them here; please find inline my replies to some
questions you raised.

On Fri, Apr 08, 2016 at 11:27:19 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
(snip)
> > +/* call only when there are no readers left */
> > +void qht_destroy(struct qht *ht);
> > +
> 
> What's the rationale for making the caller responsible for locking here?

Instead of embedding a mutex in struct qht, I decided to let callers
handle that, since most likely write accesses will be part of
other updates that will require their own lock.

> Are we going to be protected by tb_lock in the MTTCG case rather than a
> finer grained locking inside the qht?

Yes, this is an example of the external-lock-that-other-updates-require
that I refer to above.

(snip)
> > +static inline
> > +void *__qht_lookup(struct qht_bucket *b, struct qht_bucket **far_bucket,
> > +                    int *pos, qht_lookup_func_t func, const void *userp,
> > +                    uint32_t hash)
> 
> Aside the inline is there any reason to keep such an implementation
> detail in the header. I certainly couldn't measure a difference after I
> moved them into qht.c.

I thought I measured a small perf. increase in the past with the inline.
But I couldn't replicate it so I've moved this to qht.c in v3.

(snip)
> > +static inline void *qht_lookup(struct qht *ht, qht_lookup_func_t func,
> > +                                const void *userp, uint32_t hash)
> > +{
> > +    struct qht_bucket *far_bucket = NULL;
> > +    struct qht_bucket *b;
> > +    struct qht_map *map;
> > +    uint32_t version;
> > +    int pos = 0;
> > +    void *ret;
> > +
> > +    map = atomic_read(&ht->map);
> > +    /* paired with smp_wmb() before setting ht->map */
> > +    smp_rmb();
> > +    b = qht_map_to_bucket(map, hash);
> > +
> > +    do {
> > +        version = seqlock_read_begin(&b->sequence);
> > +        ret = __qht_lookup(b, &far_bucket, &pos, func, userp, hash);
> > +    } while (seqlock_read_retry(&b->sequence, version));
> 
> OK I was slightly confused by this until I got further along. Maybe some
> comments in the qht_bucket structure to point out the seqlock is
> important when a bucket is at the head of the list.

There is a comment at the top of qht.c that says exactly that. I'd rather
not duplicate it elsewhere.

(snip)
> > +    for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> > +        if (i == QHT_BUCKET_ENTRIES - 1) {
> > +            dest_hash = &orig->hashes[pos];
> > +            dest_p = &o rig->pointers[pos];
> > +        } else {
> > +            dest_hash = &head->hashes[i + 1];
> > +            dest_p = &head->pointers[i + 1];
> > +        }
> > +        hash = *dest_hash;
> > +        p = *dest_p;
> > +
> > +        atomic_set(dest_hash, head->hashes[i]);
> > +        atomic_set(dest_p, head->pointers[i]);
> > +
> > +        atomic_set(&head->hashes[i], hash);
> > +        atomic_set(&head->pointers[i], p);
> > +    }
> 
> So the bucket that has just swapped the MRU entry for an evicted entry
> gets placed after head. Is there a chance we end up just bouncing the
> head->next bucket each hit?

Given a certain sequence of hits, it could happen that head->next
gets updated every time, yes. This is of course unlikely, which is
why I think it's fine. An alternative would be to simply swap
head[last] with orig[pos], and leave it there. I can't measure
much of a difference between the two options.

(snip until end of patch)
> This looks good and I can see this being a useful utility function
> across QEMU. My main problem was following exactly what happens when
> entries are moved around for the MRU case. As users don't have to have a
> deep knowledge of the implementation details this isn't a major issue
> although a few more expansive comments or ASCII diagrams may help if
> you are feeling up to it ;-)

I added comments in v3 to address this.

Thanks a lot for your review!

		Emilio

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 09/10] qht: add test program
  2016-04-08 10:45   ` Alex Bennée
@ 2016-04-19 23:06     ` Emilio G. Cota
  2016-04-20  7:50       ` Alex Bennée
  0 siblings, 1 reply; 62+ messages in thread
From: Emilio G. Cota @ 2016-04-19 23:06 UTC (permalink / raw)
  To: Alex Bennée
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov

On Fri, Apr 08, 2016 at 11:45:41 +0100, Alex Bennée wrote:
(snip the entire patch)
> A couple of notes:
> 
>   - these should use the gtester boiler plate for reporting results

Done in v3.

>   - AFAICT they are not exercising the multi-element hashing we actually
>     use in the main code
>   - it would be nice to add a check on the bucket/map distribution to
>     defend against the algorithm being accidentally weakened with follow up patches

I added tb hash chain info to 'info jit' to keep track of this. My goal
with the test program is to check that the hash table is correct; I'd
rather check performance with QEMU than with a made-up test, since
it is QEMU's performance what we care about.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [Qemu-devel] [PATCH 09/10] qht: add test program
  2016-04-19 23:06     ` Emilio G. Cota
@ 2016-04-20  7:50       ` Alex Bennée
  0 siblings, 0 replies; 62+ messages in thread
From: Alex Bennée @ 2016-04-20  7:50 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
	Richard Henderson, Peter Maydell, Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> On Fri, Apr 08, 2016 at 11:45:41 +0100, Alex Bennée wrote:
> (snip the entire patch)
>> A couple of notes:
>>
>>   - these should use the gtester boiler plate for reporting results
>
> Done in v3.
>
>>   - AFAICT they are not exercising the multi-element hashing we actually
>>     use in the main code
>>   - it would be nice to add a check on the bucket/map distribution to
>>     defend against the algorithm being accidentally weakened with follow up patches
>
> I added tb hash chain info to 'info jit' to keep track of this. My goal
> with the test program is to check that the hash table is correct; I'd
> rather check performance with QEMU than with a made-up test, since
> it is QEMU's performance what we care about.

Fair enough. We could do with more TCG exercising test cases in the make
check but that's a different (and more difficult) problem.
>
> Thanks,
>
> 		Emilio


--
Alex Bennée

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2016-04-20  7:50 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-05  5:30 [Qemu-devel] [PATCH 00/10] tb hash improvements Emilio G. Cota
2016-04-05  5:30 ` [Qemu-devel] [PATCH 01/10] translate-all: add missing fold of tb_ctx into tcg_ctx Emilio G. Cota
2016-04-05  8:49   ` Paolo Bonzini
2016-04-05  5:30 ` [Qemu-devel] [PATCH 02/10] compiler.h: add QEMU_CACHELINE + QEMU_ALIGN() + QEMU_CACHELINE_ALIGNED Emilio G. Cota
2016-04-05  7:57   ` Peter Maydell
2016-04-05 17:24     ` Emilio G. Cota
2016-04-05 18:01       ` Peter Maydell
2016-04-05 19:13         ` Emilio G. Cota
2016-04-05  8:49   ` Paolo Bonzini
2016-04-05 12:57   ` Lluís Vilanova
2016-04-05 12:58     ` Peter Maydell
2016-04-05 15:29       ` Paolo Bonzini
2016-04-05 16:23       ` Lluís Vilanova
2016-04-05 16:31         ` Richard Henderson
2016-04-05 16:56           ` Peter Maydell
2016-04-05 19:02             ` Lluís Vilanova
2016-04-05 19:15               ` Richard Henderson
2016-04-05 20:09                 ` Lluís Vilanova
2016-04-06 11:44                   ` Paolo Bonzini
2016-04-06 12:02                     ` Laurent Desnogues
2016-04-05  5:30 ` [Qemu-devel] [PATCH 03/10] seqlock: remove optional mutex Emilio G. Cota
2016-04-06  8:38   ` Alex Bennée
2016-04-05  5:30 ` [Qemu-devel] [PATCH 04/10] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
2016-04-06  8:42   ` Alex Bennée
2016-04-05  5:30 ` [Qemu-devel] [PATCH 05/10] include: add spinlock wrapper Emilio G. Cota
2016-04-05  8:51   ` Paolo Bonzini
2016-04-06 15:51     ` Alex Bennée
2016-04-05  5:30 ` [Qemu-devel] [PATCH 06/10] include: add xxhash.h Emilio G. Cota
2016-04-06 11:39   ` Alex Bennée
2016-04-06 22:59     ` Emilio G. Cota
2016-04-05  5:30 ` [Qemu-devel] [PATCH 07/10] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
2016-04-05 15:41   ` Richard Henderson
2016-04-05 15:48     ` Paolo Bonzini
2016-04-05 16:07       ` Richard Henderson
2016-04-05 19:40         ` Emilio G. Cota
2016-04-05 21:08           ` Richard Henderson
2016-04-06  0:52             ` Emilio G. Cota
2016-04-06 11:52               ` Paolo Bonzini
2016-04-06 17:44                 ` Emilio G. Cota
2016-04-06 18:23                   ` Paolo Bonzini
2016-04-06 18:27                     ` Richard Henderson
2016-04-07  0:37                     ` Emilio G. Cota
2016-04-07  8:46                       ` Paolo Bonzini
2016-04-05 16:33     ` Laurent Desnogues
2016-04-05 17:19       ` Richard Henderson
2016-04-06  6:06         ` Laurent Desnogues
2016-04-06 17:32           ` Emilio G. Cota
2016-04-06 17:42             ` Richard Henderson
2016-04-07  8:12               ` Laurent Desnogues
2016-04-05  5:30 ` [Qemu-devel] [PATCH 08/10] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
2016-04-05  9:01   ` Paolo Bonzini
2016-04-05 15:50   ` Richard Henderson
2016-04-08 10:27   ` Alex Bennée
2016-04-19 23:03     ` Emilio G. Cota
2016-04-05  5:30 ` [Qemu-devel] [PATCH 09/10] qht: add test program Emilio G. Cota
2016-04-08 10:45   ` Alex Bennée
2016-04-19 23:06     ` Emilio G. Cota
2016-04-20  7:50       ` Alex Bennée
2016-04-05  5:30 ` [Qemu-devel] [PATCH 10/10] tb hash: track translated blocks with qht Emilio G. Cota
2016-04-08 12:39   ` Alex Bennée
2016-04-05  8:47 ` [Qemu-devel] [PATCH 00/10] tb hash improvements Alex Bennée
2016-04-05  9:01 ` Paolo Bonzini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.