* [Qemu-devel] [PATCH v6 00/15] tb hash improvements
@ 2016-05-25  1:13 Emilio G. Cota
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 01/15] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
                   ` (15 more replies)
  0 siblings, 16 replies; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

v5: https://lists.gnu.org/archive/html/qemu-devel/2016-05/msg02366.html

v6 applies cleanly on top of tcg-next (8b1fe3f4 "cpu-exec:
Clean up 'interrupt_request' reloading", tagged "pull-tcg-20160512").

Changes from v5, mostly from Sergey's review:

- processor.h: use #ifdef #elif throughout the file

- tb_hash_func: use uint32 for 'flags' param

- tb_hash_func5: do 'foo >> 32' instead of 'foo >> 31 >> 1', since foo
  is a u64.

- thread.h:
  * qemu_spin_locked: remove acquire semantics; simply use atomic_read().
  * qemu_spin_trylock: return bool instead of 0 or -EBUSY; this saves
    a branch.
  * qemu_spin:
    + use __sync_lock_test_and_set and __sync_lock_release; drop
      the patches touching atomic.h.
    + add unlikely() hint to "while (test_and_set)"; this gives a small
      speedup under no contention.

- qht:
  * merge the parallel-writes patch into the QHT patch.
    [Richard: I dropped your reviewed-by since the patch changed
     quite a bit.]
  * drop unneeded #includes from qht.h
  * document qht.h using kerneldoc.
  * use unsigned int for storing the seqlock version.
  * fix a couple of typos in the comments at the top of qht.c.
  * explain the "no duplicated pointer" policy better: trying to
    insert an already-existing hash-pointer pair is OK (insert will
    just return false), but it is not OK to insert hash-pointer
    pairs that share the same pointer value yet have different hashes.
  * Add comment about lookups having to be done in an RCU read-critical
    section.
  * remove map->stale; simply check ht->map before and after acquiring
    a bucket lock.
  * only use atomic_read/set on bucket pointers, not hashes. Reading
    partially-updated hashes is OK, since we'll retry anyway thanks
    to the seqlock. Add a comment regarding this at the top of struct
    qht_bucket.
  * s/b->n/b->n_buckets/
  * define qht_debug_assert, enabled #ifdef QHT_DEBUG. Use it instead of
    assert(), except in one case (slow path) where g_assert_cmpuint is
    convenient.
  * use a mutex for ht->lock instead of a spinlock. This makes the resize
    code simpler, since holding ht->lock for a bit of time is OK now;
    other threads won't be busy-waiting. Document that ht->lock needs
    to be grabbed before b->lock.
  * use atomic_rcu_read/set instead of open-coding them.
  * qht_remove: only clear out b->hashes[] and b->pointers[] if they belong
                to what was the last entry in the chain.
  * qht_remove: add debug assert against inserting a NULL pointer.

Thanks,

		Emilio

* [Qemu-devel] [PATCH v6 01/15] compiler.h: add QEMU_ALIGNED() to enforce struct alignment
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-27 19:54   ` Sergey Fedorov
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 02/15] seqlock: remove optional mutex Emilio G. Cota
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/compiler.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
index 8f1cc7b..b64f899 100644
--- a/include/qemu/compiler.h
+++ b/include/qemu/compiler.h
@@ -41,6 +41,8 @@
 # define QEMU_PACKED __attribute__((packed))
 #endif
 
+#define QEMU_ALIGNED(X) __attribute__((aligned(X)))
+
 #ifndef glue
 #define xglue(x, y) x ## y
 #define glue(x, y) xglue(x, y)
-- 
2.5.0
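
A minimal usage sketch of the new macro; the struct and the 64-byte
cache-line size are illustrative, not part of the patch:

    /* keep each element on its own cache line (64 bytes assumed) */
    struct CacheLineCounter {
        unsigned long count;
    } QEMU_ALIGNED(64);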

* [Qemu-devel] [PATCH v6 02/15] seqlock: remove optional mutex
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 01/15] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-27 19:55   ` Sergey Fedorov
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 03/15] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

This option is unused; besides, it bloats the struct when not needed.
Let's just let writers define their own locks elsewhere.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpus.c                 |  2 +-
 include/qemu/seqlock.h | 10 +---------
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/cpus.c b/cpus.c
index cbeb1f6..dd86da5 100644
--- a/cpus.c
+++ b/cpus.c
@@ -619,7 +619,7 @@ int cpu_throttle_get_percentage(void)
 
 void cpu_ticks_init(void)
 {
-    seqlock_init(&timers_state.vm_clock_seqlock, NULL);
+    seqlock_init(&timers_state.vm_clock_seqlock);
     vmstate_register(NULL, 0, &vmstate_timers, &timers_state);
     throttle_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL_RT,
                                            cpu_throttle_timer_tick, NULL);
diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index 70b01fd..e673482 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -19,22 +19,17 @@
 typedef struct QemuSeqLock QemuSeqLock;
 
 struct QemuSeqLock {
-    QemuMutex *mutex;
     unsigned sequence;
 };
 
-static inline void seqlock_init(QemuSeqLock *sl, QemuMutex *mutex)
+static inline void seqlock_init(QemuSeqLock *sl)
 {
-    sl->mutex = mutex;
     sl->sequence = 0;
 }
 
 /* Lock out other writers and update the count.  */
 static inline void seqlock_write_lock(QemuSeqLock *sl)
 {
-    if (sl->mutex) {
-        qemu_mutex_lock(sl->mutex);
-    }
     ++sl->sequence;
 
     /* Write sequence before updating other fields.  */
@@ -47,9 +42,6 @@ static inline void seqlock_write_unlock(QemuSeqLock *sl)
     smp_wmb();
 
     ++sl->sequence;
-    if (sl->mutex) {
-        qemu_mutex_unlock(sl->mutex);
-    }
 }
 
 static inline unsigned seqlock_read_begin(QemuSeqLock *sl)
-- 
2.5.0
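
With the embedded mutex gone, a writer pairs the seqlock with whatever
lock already serializes its updates. A sketch of the resulting pattern,
with illustrative names (the write_lock/unlock entry points are renamed
to write_begin/end in the next patch):

    static QemuMutex lock;      /* serializes writers */
    static QemuSeqLock sl;
    static int64_t shared_val;

    static void set_val(int64_t v)
    {
        qemu_mutex_lock(&lock);
        seqlock_write_lock(&sl);
        shared_val = v;
        seqlock_write_unlock(&sl);
        qemu_mutex_unlock(&lock);
    }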

* [Qemu-devel] [PATCH v6 03/15] seqlock: rename write_lock/unlock to write_begin/end
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 01/15] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 02/15] seqlock: remove optional mutex Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-27 19:59   ` Sergey Fedorov
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 04/15] include/processor.h: define cpu_relax() Emilio G. Cota
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

It is a more appropriate name, now that the mutex embedded
in the seqlock is gone.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpus.c                 | 28 ++++++++++++++--------------
 include/qemu/seqlock.h |  4 ++--
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/cpus.c b/cpus.c
index dd86da5..735c9b2 100644
--- a/cpus.c
+++ b/cpus.c
@@ -247,13 +247,13 @@ int64_t cpu_get_clock(void)
 void cpu_enable_ticks(void)
 {
     /* Here, the really thing protected by seqlock is cpu_clock_offset. */
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     if (!timers_state.cpu_ticks_enabled) {
         timers_state.cpu_ticks_offset -= cpu_get_host_ticks();
         timers_state.cpu_clock_offset -= get_clock();
         timers_state.cpu_ticks_enabled = 1;
     }
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 }
 
 /* disable cpu_get_ticks() : the clock is stopped. You must not call
@@ -263,13 +263,13 @@ void cpu_enable_ticks(void)
 void cpu_disable_ticks(void)
 {
     /* Here, the really thing protected by seqlock is cpu_clock_offset. */
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     if (timers_state.cpu_ticks_enabled) {
         timers_state.cpu_ticks_offset += cpu_get_host_ticks();
         timers_state.cpu_clock_offset = cpu_get_clock_locked();
         timers_state.cpu_ticks_enabled = 0;
     }
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 }
 
 /* Correlation between real and virtual time is always going to be
@@ -292,7 +292,7 @@ static void icount_adjust(void)
         return;
     }
 
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     cur_time = cpu_get_clock_locked();
     cur_icount = cpu_get_icount_locked();
 
@@ -313,7 +313,7 @@ static void icount_adjust(void)
     last_delta = delta;
     timers_state.qemu_icount_bias = cur_icount
                               - (timers_state.qemu_icount << icount_time_shift);
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 }
 
 static void icount_adjust_rt(void *opaque)
@@ -353,7 +353,7 @@ static void icount_warp_rt(void)
         return;
     }
 
-    seqlock_write_lock(&timers_state.vm_clock_seqlock);
+    seqlock_write_begin(&timers_state.vm_clock_seqlock);
     if (runstate_is_running()) {
         int64_t clock = REPLAY_CLOCK(REPLAY_CLOCK_VIRTUAL_RT,
                                      cpu_get_clock_locked());
@@ -372,7 +372,7 @@ static void icount_warp_rt(void)
         timers_state.qemu_icount_bias += warp_delta;
     }
     vm_clock_warp_start = -1;
-    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+    seqlock_write_end(&timers_state.vm_clock_seqlock);
 
     if (qemu_clock_expired(QEMU_CLOCK_VIRTUAL)) {
         qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
@@ -397,9 +397,9 @@ void qtest_clock_warp(int64_t dest)
         int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
         int64_t warp = qemu_soonest_timeout(dest - clock, deadline);
 
-        seqlock_write_lock(&timers_state.vm_clock_seqlock);
+        seqlock_write_begin(&timers_state.vm_clock_seqlock);
         timers_state.qemu_icount_bias += warp;
-        seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+        seqlock_write_end(&timers_state.vm_clock_seqlock);
 
         qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
         timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]);
@@ -466,9 +466,9 @@ void qemu_start_warp_timer(void)
              * It is useful when we want a deterministic execution time,
              * isolated from host latencies.
              */
-            seqlock_write_lock(&timers_state.vm_clock_seqlock);
+            seqlock_write_begin(&timers_state.vm_clock_seqlock);
             timers_state.qemu_icount_bias += deadline;
-            seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+            seqlock_write_end(&timers_state.vm_clock_seqlock);
             qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
         } else {
             /*
@@ -479,11 +479,11 @@ void qemu_start_warp_timer(void)
              * you will not be sending network packets continuously instead of
              * every 100ms.
              */
-            seqlock_write_lock(&timers_state.vm_clock_seqlock);
+            seqlock_write_begin(&timers_state.vm_clock_seqlock);
             if (vm_clock_warp_start == -1 || vm_clock_warp_start > clock) {
                 vm_clock_warp_start = clock;
             }
-            seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+            seqlock_write_end(&timers_state.vm_clock_seqlock);
             timer_mod_anticipate(icount_warp_timer, clock + deadline);
         }
     } else if (deadline == 0) {
diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index e673482..4dfc055 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -28,7 +28,7 @@ static inline void seqlock_init(QemuSeqLock *sl)
 }
 
 /* Lock out other writers and update the count.  */
-static inline void seqlock_write_lock(QemuSeqLock *sl)
+static inline void seqlock_write_begin(QemuSeqLock *sl)
 {
     ++sl->sequence;
 
@@ -36,7 +36,7 @@ static inline void seqlock_write_lock(QemuSeqLock *sl)
     smp_wmb();
 }
 
-static inline void seqlock_write_unlock(QemuSeqLock *sl)
+static inline void seqlock_write_end(QemuSeqLock *sl)
 {
     /* Write other fields before finalizing sequence.  */
     smp_wmb();
-- 
2.5.0
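
For completeness, the read side (untouched by this patch) remains the
usual retry loop against the sequence number. A sketch, reusing the
illustrative names from the writer example above:

    static int64_t get_val(void)
    {
        unsigned start;
        int64_t v;

        do {
            start = seqlock_read_begin(&sl);
            v = shared_val;
        } while (seqlock_read_retry(&sl, start));
        return v;
    }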

* [Qemu-devel] [PATCH v6 04/15] include/processor.h: define cpu_relax()
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
                   ` (2 preceding siblings ...)
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 03/15] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-27 20:53   ` Sergey Fedorov
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 05/15] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

Taken from the linux kernel.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/processor.h | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)
 create mode 100644 include/qemu/processor.h

diff --git a/include/qemu/processor.h b/include/qemu/processor.h
new file mode 100644
index 0000000..42bcc99
--- /dev/null
+++ b/include/qemu/processor.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2.
+ *   See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_PROCESSOR_H
+#define QEMU_PROCESSOR_H
+
+#include "qemu/atomic.h"
+
+#if defined(__i386__) || defined(__x86_64__)
+# define cpu_relax() asm volatile("rep; nop" ::: "memory")
+
+#elif defined(__ia64__)
+# define cpu_relax() asm volatile("hint @pause" ::: "memory")
+
+#elif defined(__aarch64__)
+# define cpu_relax() asm volatile("yield" ::: "memory")
+
+#elif defined(__powerpc64__)
+/* set Hardware Multi-Threading (HMT) priority to low; then back to medium */
+# define cpu_relax() asm volatile("or 1, 1, 1;" \
+                                  "or 2, 2, 2;" ::: "memory")
+
+#else
+# define cpu_relax() barrier()
+#endif
+
+#endif /* QEMU_PROCESSOR_H */
-- 
2.5.0

* [Qemu-devel] [PATCH v6 05/15] qemu-thread: add simple test-and-set spinlock
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
                   ` (3 preceding siblings ...)
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 04/15] include/processor.h: define cpu_relax() Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 06/15] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

From: Guillaume Delbergue <guillaume.delbergue@greensocs.com>

Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
[Rewritten. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Emilio's additions: use TAS instead of atomic_xchg; emit acquire/release
 barriers; return bool from trylock; call cpu_relax() while spinning;
 optimize for uncontended locks by acquiring the lock with TAS instead
 of TATAS; add qemu_spin_locked().]
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/thread.h | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index bdae6df..c5d71cf 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -1,6 +1,8 @@
 #ifndef __QEMU_THREAD_H
 #define __QEMU_THREAD_H 1
 
+#include "qemu/processor.h"
+#include "qemu/atomic.h"
 
 typedef struct QemuMutex QemuMutex;
 typedef struct QemuCond QemuCond;
@@ -60,4 +62,37 @@ struct Notifier;
 void qemu_thread_atexit_add(struct Notifier *notifier);
 void qemu_thread_atexit_remove(struct Notifier *notifier);
 
+typedef struct QemuSpin {
+    int value;
+} QemuSpin;
+
+static inline void qemu_spin_init(QemuSpin *spin)
+{
+    __sync_lock_release(&spin->value);
+}
+
+static inline void qemu_spin_lock(QemuSpin *spin)
+{
+    while (unlikely(__sync_lock_test_and_set(&spin->value, true))) {
+        while (atomic_read(&spin->value)) {
+            cpu_relax();
+        }
+    }
+}
+
+static inline bool qemu_spin_trylock(QemuSpin *spin)
+{
+    return __sync_lock_test_and_set(&spin->value, true);
+}
+
+static inline bool qemu_spin_locked(QemuSpin *spin)
+{
+    return atomic_read(&spin->value);
+}
+
+static inline void qemu_spin_unlock(QemuSpin *spin)
+{
+    __sync_lock_release(&spin->value);
+}
+
 #endif
-- 
2.5.0
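
A usage sketch; the counter and its lock are illustrative only:

    static QemuSpin counter_lock;   /* qemu_spin_init() it before use */
    static unsigned long counter;

    static void counter_inc(void)
    {
        qemu_spin_lock(&counter_lock);
        counter++;
        qemu_spin_unlock(&counter_lock);
    }

Note that, as implemented above, qemu_spin_trylock() returns the
previous lock value: false when the lock has just been acquired, true
when it was already taken.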

* [Qemu-devel] [PATCH v6 06/15] exec: add tb_hash_func5, derived from xxhash
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
                   ` (4 preceding siblings ...)
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 05/15] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-28 12:36   ` Sergey Fedorov
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 07/15] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

This will be used by upcoming changes for computing the tb hash.

Add this into a separate file to include the copyright notice from
xxhash.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/tb-hash-xx.h | 94 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 94 insertions(+)
 create mode 100644 include/exec/tb-hash-xx.h

diff --git a/include/exec/tb-hash-xx.h b/include/exec/tb-hash-xx.h
new file mode 100644
index 0000000..9f3fc05
--- /dev/null
+++ b/include/exec/tb-hash-xx.h
@@ -0,0 +1,94 @@
+/*
+ * xxHash - Fast Hash algorithm
+ * Copyright (C) 2012-2016, Yann Collet
+ *
+ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * + Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * + Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * You can contact the author at :
+ * - xxHash source repository : https://github.com/Cyan4973/xxHash
+ */
+#ifndef EXEC_TB_HASH_XX
+#define EXEC_TB_HASH_XX
+
+#include "qemu/bitops.h"
+
+#define PRIME32_1   2654435761U
+#define PRIME32_2   2246822519U
+#define PRIME32_3   3266489917U
+#define PRIME32_4    668265263U
+#define PRIME32_5    374761393U
+
+#define TB_HASH_XX_SEED 1
+
+/*
+ * xxhash32, customized for input variables that are not guaranteed to be
+ * contiguous in memory.
+ */
+static inline
+uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e)
+{
+    uint32_t v1 = TB_HASH_XX_SEED + PRIME32_1 + PRIME32_2;
+    uint32_t v2 = TB_HASH_XX_SEED + PRIME32_2;
+    uint32_t v3 = TB_HASH_XX_SEED + 0;
+    uint32_t v4 = TB_HASH_XX_SEED - PRIME32_1;
+    uint32_t a = a0 >> 32;
+    uint32_t b = a0;
+    uint32_t c = b0 >> 32;
+    uint32_t d = b0;
+    uint32_t h32;
+
+    v1 += a * PRIME32_2;
+    v1 = rol32(v1, 13);
+    v1 *= PRIME32_1;
+
+    v2 += b * PRIME32_2;
+    v2 = rol32(v2, 13);
+    v2 *= PRIME32_1;
+
+    v3 += c * PRIME32_2;
+    v3 = rol32(v3, 13);
+    v3 *= PRIME32_1;
+
+    v4 += d * PRIME32_2;
+    v4 = rol32(v4, 13);
+    v4 *= PRIME32_1;
+
+    h32 = rol32(v1, 1) + rol32(v2, 7) + rol32(v3, 12) + rol32(v4, 18);
+    h32 += 20;
+
+    h32 += e * PRIME32_3;
+    h32  = rol32(h32, 17) * PRIME32_4;
+
+    h32 ^= h32 >> 15;
+    h32 *= PRIME32_2;
+    h32 ^= h32 >> 13;
+    h32 *= PRIME32_3;
+    h32 ^= h32 >> 16;
+
+    return h32;
+}
+
+#endif /* EXEC_TB_HASH_XX */
-- 
2.5.0
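
As the 'h32 += 20' length fold suggests, this mirrors what xxhash32
with seed 1 would produce for a contiguous 20-byte buffer holding the
five 32-bit words {a0 >> 32, (uint32_t)a0, b0 >> 32, (uint32_t)b0, e},
without requiring the inputs to be laid out in memory. A trivial usage
sketch:

    uint32_t h = tb_hash_func5(phys_pc, pc, flags);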

* [Qemu-devel] [PATCH v6 07/15] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
                   ` (5 preceding siblings ...)
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 06/15] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-28 12:39   ` Sergey Fedorov
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data Emilio G. Cota
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

For some workloads such as arm bootup, tb_phys_hash is performance-critical.
This is due to the high frequency of accesses to the hash table, caused
by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
More info:
  https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html

To dig further into this I modified an arm image booting debian jessie to
immediately shut down after boot. Analysis revealed that quite a bit of time
is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
results in very uneven loading of chains in the hash table's buckets;
the longest observed chain had ~550 elements.

The appended addresses this with two changes:

1) Use xxhash as the hash table's hash function. xxhash is a fast,
   high-quality hashing function.

2) Feed the hashing function with not just tb_phys, but also pc and flags.

This improves performance over using just tb_phys for hashing, since that
resulted in some hash buckets having many TBs while others got very few;
with these changes, the longest observed chain on a single hash bucket is
brought down from ~550 to ~40.

Tests show that the other element checked for in tb_find_physical,
cs_base, is always a match when tb_phys+pc+flags are a match,
so hashing cs_base is wasteful. It could be that this is an ARM-only
thing, though. UPDATE:
On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
> The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
> consisting of only a delay slot).
> It may well still turn out to be reasonable to ignore cs_base for hashing.

BTW, after this change the hash table should not be called "tb_phys_hash"
anymore; this is addressed later in this series.

This change gives consistent bootup time improvements. I tested two
host machines:
- Intel Xeon E5-2690: 11.6% less time
- Intel i7-4790K: 19.2% less time

Increasing the number of hash buckets yields further improvements. However,
using a larger, fixed number of buckets can degrade performance for other
workloads that do not translate as many blocks (600K+ for debian-jessie arm
bootup). This is dealt with later in this series.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c             |  4 ++--
 include/exec/tb-hash.h |  8 ++++++--
 translate-all.c        | 10 +++++-----
 3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 14df1aa..1735032 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -231,13 +231,13 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
 {
     CPUArchState *env = (CPUArchState *)cpu->env_ptr;
     TranslationBlock *tb, **tb_hash_head, **ptb1;
-    unsigned int h;
+    uint32_t h;
     tb_page_addr_t phys_pc, phys_page1;
 
     /* find translated block using physical mappings */
     phys_pc = get_page_addr_code(env, pc);
     phys_page1 = phys_pc & TARGET_PAGE_MASK;
-    h = tb_phys_hash_func(phys_pc);
+    h = tb_hash_func(phys_pc, pc, flags);
 
     /* Start at head of the hash entry */
     ptb1 = tb_hash_head = &tcg_ctx.tb_ctx.tb_phys_hash[h];
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 0f4e8a0..88ccfd1 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -20,6 +20,9 @@
 #ifndef EXEC_TB_HASH
 #define EXEC_TB_HASH
 
+#include "exec/exec-all.h"
+#include "exec/tb-hash-xx.h"
+
 /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
    addresses on the same page.  The top bits are the same.  This allows
    TLB invalidation to quickly clear a subset of the hash table.  */
@@ -43,9 +46,10 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
            | (tmp & TB_JMP_ADDR_MASK));
 }
 
-static inline unsigned int tb_phys_hash_func(tb_page_addr_t pc)
+static inline
+uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint32_t flags)
 {
-    return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
+    return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
 }
 
 #endif
diff --git a/translate-all.c b/translate-all.c
index b54f472..c48fccb 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -991,12 +991,12 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 {
     CPUState *cpu;
     PageDesc *p;
-    unsigned int h;
+    uint32_t h;
     tb_page_addr_t phys_pc;
 
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
-    h = tb_phys_hash_func(phys_pc);
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
     tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
 
     /* remove the TB from the page list */
@@ -1126,11 +1126,11 @@ static inline void tb_alloc_page(TranslationBlock *tb,
 static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                          tb_page_addr_t phys_page2)
 {
-    unsigned int h;
+    uint32_t h;
     TranslationBlock **ptb;
 
-    /* add in the physical hash table */
-    h = tb_phys_hash_func(phys_pc);
+    /* add in the hash table */
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
     ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
     tb->phys_hash_next = *ptb;
     *ptb = tb;
-- 
2.5.0

* [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
                   ` (6 preceding siblings ...)
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 07/15] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-28 18:15   ` Sergey Fedorov
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 09/15] qdist: add test program Emilio G. Cota
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

Sometimes it is useful to have a quick histogram to represent a certain
distribution -- for example, when investigating a performance regression
in a hash table due to inadequate hashing.

The appended allows us to easily represent a distribution using Unicode
characters. Further, the data structure keeping track of the distribution
is so simple that obtaining its values for off-line processing is trivial.

Example, taking the last 10 commits to QEMU:

 Characters in commit title  Count
-----------------------------------
                         39      1
                         48      1
                         53      1
                         54      2
                         57      1
                         61      1
                         67      1
                         78      1
                         80      1
qdist_init(&dist);
qdist_inc(&dist, 39);
[...]
qdist_inc(&dist, 80);

char *str = qdist_pr(&dist, 9, QDIST_PR_LABELS);
// -> [39.0,43.6)▂▂ █▂ ▂ ▄[75.4,80.0]
g_free(str);

char *str = qdist_pr(&dist, 4, QDIST_PR_LABELS);
// -> [39.0,49.2)▁█▁▁[69.8,80.0]
g_free(str);

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/qdist.h |  62 +++++++++
 util/Makefile.objs   |   1 +
 util/qdist.c         | 386 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 449 insertions(+)
 create mode 100644 include/qemu/qdist.h
 create mode 100644 util/qdist.c

diff --git a/include/qemu/qdist.h b/include/qemu/qdist.h
new file mode 100644
index 0000000..6d8b701
--- /dev/null
+++ b/include/qemu/qdist.h
@@ -0,0 +1,62 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_QDIST_H
+#define QEMU_QDIST_H
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/bitops.h"
+
+/*
+ * Samples with the same 'x value' end up in the same qdist_entry,
+ * e.g. inc(0.1) and inc(0.1) end up as {x=0.1, count=2}.
+ *
+ * Binning happens only at print time, so that we retain the flexibility to
+ * choose the binning. This might not be ideal for workloads that do not care
+ * much about precision and insert many samples all with different x values;
+ * in that case, pre-binning (e.g. entering both 0.115 and 0.097 as 0.1)
+ * should be considered.
+ */
+struct qdist_entry {
+    double x;
+    unsigned long count;
+};
+
+struct qdist {
+    struct qdist_entry *entries;
+    size_t n;
+};
+
+#define QDIST_PR_BORDER     BIT(0)
+#define QDIST_PR_LABELS     BIT(1)
+/* the remaining options only work if PR_LABELS is set */
+#define QDIST_PR_NODECIMAL  BIT(2)
+#define QDIST_PR_PERCENT    BIT(3)
+#define QDIST_PR_100X       BIT(4)
+#define QDIST_PR_NOBINRANGE BIT(5)
+
+void qdist_init(struct qdist *dist);
+void qdist_destroy(struct qdist *dist);
+
+void qdist_add(struct qdist *dist, double x, long count);
+void qdist_inc(struct qdist *dist, double x);
+double qdist_xmin(const struct qdist *dist);
+double qdist_xmax(const struct qdist *dist);
+double qdist_avg(const struct qdist *dist);
+unsigned long qdist_sample_count(const struct qdist *dist);
+size_t qdist_unique_entries(const struct qdist *dist);
+
+/* callers must free the returned string with g_free() */
+char *qdist_pr_plain(const struct qdist *dist, size_t n_groups);
+
+/* callers must free the returned string with g_free() */
+char *qdist_pr(const struct qdist *dist, size_t n_groups, uint32_t opt);
+
+/* Only qdist code and test code should ever call this function */
+void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n);
+
+#endif /* QEMU_QDIST_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index a8a777e..702435e 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -32,3 +32,4 @@ util-obj-y += buffer.o
 util-obj-y += timed-average.o
 util-obj-y += base64.o
 util-obj-y += log.o
+util-obj-y += qdist.o
diff --git a/util/qdist.c b/util/qdist.c
new file mode 100644
index 0000000..3343640
--- /dev/null
+++ b/util/qdist.c
@@ -0,0 +1,386 @@
+/*
+ * qdist.c - QEMU helpers for handling frequency distributions of data.
+ *
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#include "qemu/qdist.h"
+
+#include <math.h>
+#ifndef NAN
+#define NAN (0.0 / 0.0)
+#endif
+
+void qdist_init(struct qdist *dist)
+{
+    dist->entries = NULL;
+    dist->n = 0;
+}
+
+void qdist_destroy(struct qdist *dist)
+{
+    g_free(dist->entries);
+}
+
+static inline int qdist_cmp_double(double a, double b)
+{
+    if (a > b) {
+        return 1;
+    } else if (a < b) {
+        return -1;
+    }
+    return 0;
+}
+
+static int qdist_cmp(const void *ap, const void *bp)
+{
+    const struct qdist_entry *a = ap;
+    const struct qdist_entry *b = bp;
+
+    return qdist_cmp_double(a->x, b->x);
+}
+
+void qdist_add(struct qdist *dist, double x, long count)
+{
+    struct qdist_entry *entry = NULL;
+
+    if (dist->entries) {
+        struct qdist_entry e;
+
+        e.x = x;
+        entry = bsearch(&e, dist->entries, dist->n, sizeof(e), qdist_cmp);
+    }
+
+    if (entry) {
+        entry->count += count;
+        return;
+    }
+
+    dist->entries = g_realloc(dist->entries,
+                              sizeof(*dist->entries) * (dist->n + 1));
+    dist->n++;
+    entry = &dist->entries[dist->n - 1];
+    entry->x = x;
+    entry->count = count;
+    qsort(dist->entries, dist->n, sizeof(*entry), qdist_cmp);
+}
+
+void qdist_inc(struct qdist *dist, double x)
+{
+    qdist_add(dist, x, 1);
+}
+
+/*
+ * Unicode for block elements. See:
+ *   https://en.wikipedia.org/wiki/Block_Elements
+ */
+static const gunichar qdist_blocks[] = {
+    0x2581,
+    0x2582,
+    0x2583,
+    0x2584,
+    0x2585,
+    0x2586,
+    0x2587,
+    0x2588
+};
+
+#define QDIST_NR_BLOCK_CODES ARRAY_SIZE(qdist_blocks)
+
+/*
+ * Print a distribution into a string.
+ *
+ * This function assumes that appropriate binning has been done on the input;
+ * see qdist_bin__internal() and qdist_pr_plain().
+ *
+ * Callers must free the returned string with g_free().
+ */
+static char *qdist_pr_internal(const struct qdist *dist)
+{
+    double min, max, step;
+    GString *s = g_string_new("");
+    size_t i;
+
+    /* if only one entry, its printout will be either full or empty */
+    if (dist->n == 1) {
+        if (dist->entries[0].count) {
+            g_string_append_unichar(s, qdist_blocks[QDIST_NR_BLOCK_CODES - 1]);
+        } else {
+            g_string_append_c(s, ' ');
+        }
+        goto out;
+    }
+
+    /* get min and max counts */
+    min = dist->entries[0].count;
+    max = min;
+    for (i = 0; i < dist->n; i++) {
+        struct qdist_entry *e = &dist->entries[i];
+
+        if (e->count < min) {
+            min = e->count;
+        }
+        if (e->count > max) {
+            max = e->count;
+        }
+    }
+
+    /* floor((count - min) * step) will give us the block index */
+    step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
+
+    for (i = 0; i < dist->n; i++) {
+        struct qdist_entry *e = &dist->entries[i];
+        int index;
+
+        /* make an exception with 0; instead of using block[0], print a space */
+        if (e->count) {
+            index = (int)((e->count - min) * step);
+            g_string_append_unichar(s, qdist_blocks[index]);
+        } else {
+            g_string_append_c(s, ' ');
+        }
+    }
+ out:
+    return g_string_free(s, FALSE);
+}
+
+/*
+ * Bin the distribution in @from into @n bins of consecutive, non-overlapping
+ * intervals, copying the result to @to.
+ *
+ * This function is internal to qdist: only this file and test code should
+ * ever call it.
+ *
+ * Note: calling this function on an already-binned qdist is a bug.
+ *
+ * If @n == 0 or @from->n == 1, use @from->n.
+ */
+void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
+{
+    double xmin, xmax;
+    double step;
+    size_t i, j, j_min;
+
+    qdist_init(to);
+
+    if (!from->entries) {
+        return;
+    }
+    if (!n || from->n == 1) {
+        n = from->n;
+    }
+
+    /* set equally-sized bins between @from's left and right */
+    xmin = qdist_xmin(from);
+    xmax = qdist_xmax(from);
+    step = (xmax - xmin) / n;
+
+    if (n == from->n) {
+        /* if @from's entries are equally spaced, no need to re-bin */
+        for (i = 0; i < from->n; i++) {
+            if (from->entries[i].x != xmin + i * step) {
+                goto rebin;
+            }
+        }
+        /* they're equally spaced, so copy the dist and bail out */
+        to->entries = g_malloc(sizeof(*to->entries) * from->n);
+        to->n = from->n;
+        memcpy(to->entries, from->entries, sizeof(*to->entries) * to->n);
+        return;
+    }
+
+ rebin:
+    j_min = 0;
+    for (i = 0; i < n; i++) {
+        double x;
+        double left, right;
+
+        left = xmin + i * step;
+        right = xmin + (i + 1) * step;
+
+        /* Add x, even if it might not get any counts later */
+        x = left;
+        qdist_add(to, x, 0);
+
+        /*
+         * To avoid double-counting we capture [left, right) ranges, except for
+         * the rightmost bin, which captures a [left, right] range.
+         */
+        for (j = j_min; j < from->n; j++) {
+            struct qdist_entry *o = &from->entries[j];
+
+            /* entries are ordered so do not check beyond right */
+            if (o->x > right) {
+                break;
+            }
+            if (o->x >= left && (o->x < right ||
+                                   (i == n - 1 && o->x == right))) {
+                qdist_add(to, x, o->count);
+                /* don't check this entry again */
+                j_min = j + 1;
+            }
+        }
+    }
+}
+
+/*
+ * Print @dist into a string, after re-binning it into @n bins of consecutive,
+ * non-overlapping intervals.
+ *
+ * If @n == 0, use @dist->n.
+ *
+ * Callers must free the returned string with g_free().
+ */
+char *qdist_pr_plain(const struct qdist *dist, size_t n)
+{
+    struct qdist binned;
+    char *ret;
+
+    if (!dist->entries) {
+        return NULL;
+    }
+    qdist_bin__internal(&binned, dist, n);
+    ret = qdist_pr_internal(&binned);
+    qdist_destroy(&binned);
+    return ret;
+}
+
+static char *qdist_pr_label(const struct qdist *dist, size_t n_bins,
+                            uint32_t opt, bool is_left)
+{
+    const char *percent;
+    const char *lparen;
+    const char *rparen;
+    GString *s;
+    double x1, x2, step;
+    double x;
+    double n;
+    int dec;
+
+    s = g_string_new("");
+    if (!(opt & QDIST_PR_LABELS)) {
+        goto out;
+    }
+
+    dec = opt & QDIST_PR_NODECIMAL ? 0 : 1;
+    percent = opt & QDIST_PR_PERCENT ? "%" : "";
+
+    n = n_bins ? n_bins : dist->n;
+    x = is_left ? qdist_xmin(dist) : qdist_xmax(dist);
+    step = (qdist_xmax(dist) - qdist_xmin(dist)) / n;
+
+    if (opt & QDIST_PR_100X) {
+        x *= 100.0;
+        step *= 100.0;
+    }
+    if (opt & QDIST_PR_NOBINRANGE) {
+        lparen = rparen = "";
+        x1 = x;
+        x2 = x; /* unnecessary, but a dumb compiler might not figure it out */
+    } else {
+        lparen = "[";
+        rparen = is_left ? ")" : "]";
+        if (is_left) {
+            x1 = x;
+            x2 = x + step;
+        } else {
+            x1 = x - step;
+            x2 = x;
+        }
+    }
+    g_string_append_printf(s, "%s%.*f", lparen, dec, x1);
+    if (!(opt & QDIST_PR_NOBINRANGE)) {
+        g_string_append_printf(s, ",%.*f%s", dec, x2, rparen);
+    }
+    g_string_append(s, percent);
+ out:
+    return g_string_free(s, FALSE);
+}
+
+/*
+ * Print the distribution's histogram into a string.
+ *
+ * See also: qdist_pr_plain().
+ *
+ * Callers must free the returned string with g_free().
+ */
+char *qdist_pr(const struct qdist *dist, size_t n_bins, uint32_t opt)
+{
+    const char *border = opt & QDIST_PR_BORDER ? "|" : "";
+    char *llabel, *rlabel;
+    char *hgram;
+    GString *s;
+
+    if (dist->entries == NULL) {
+        return NULL;
+    }
+
+    s = g_string_new("");
+
+    llabel = qdist_pr_label(dist, n_bins, opt, true);
+    rlabel = qdist_pr_label(dist, n_bins, opt, false);
+    hgram = qdist_pr_plain(dist, n_bins);
+    g_string_append_printf(s, "%s%s%s%s%s",
+                           llabel, border, hgram, border, rlabel);
+    g_free(llabel);
+    g_free(rlabel);
+    g_free(hgram);
+
+    return g_string_free(s, FALSE);
+}
+
+static inline double qdist_x(const struct qdist *dist, int index)
+{
+    if (dist->entries == NULL) {
+        return NAN;
+    }
+    return dist->entries[index].x;
+}
+
+double qdist_xmin(const struct qdist *dist)
+{
+    return qdist_x(dist, 0);
+}
+
+double qdist_xmax(const struct qdist *dist)
+{
+    return qdist_x(dist, dist->n - 1);
+}
+
+size_t qdist_unique_entries(const struct qdist *dist)
+{
+    return dist->n;
+}
+
+unsigned long qdist_sample_count(const struct qdist *dist)
+{
+    unsigned long count = 0;
+    size_t i;
+
+    for (i = 0; i < dist->n; i++) {
+        struct qdist_entry *e = &dist->entries[i];
+
+        count += e->count;
+    }
+    return count;
+}
+
+double qdist_avg(const struct qdist *dist)
+{
+    unsigned long count;
+    size_t i;
+    double ret = 0;
+
+    count = qdist_sample_count(dist);
+    if (!count) {
+        return NAN;
+    }
+    for (i = 0; i < dist->n; i++) {
+        struct qdist_entry *e = &dist->entries[i];
+
+        ret += e->x * e->count / count;
+    }
+    return ret;
+}
-- 
2.5.0

* [Qemu-devel] [PATCH v6 09/15] qdist: add test program
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
                   ` (7 preceding siblings ...)
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-28 18:56   ` Sergey Fedorov
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tests/.gitignore   |   1 +
 tests/Makefile     |   6 +-
 tests/test-qdist.c | 369 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 375 insertions(+), 1 deletion(-)
 create mode 100644 tests/test-qdist.c

diff --git a/tests/.gitignore b/tests/.gitignore
index a06a8ba..7c0d156 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -48,6 +48,7 @@ test-qapi-types.[ch]
 test-qapi-visit.[ch]
 test-qdev-global-props
 test-qemu-opts
+test-qdist
 test-qga
 test-qmp-commands
 test-qmp-commands.h
diff --git a/tests/Makefile b/tests/Makefile
index 9dddde6..a5af20b 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -70,6 +70,8 @@ check-unit-y += tests/rcutorture$(EXESUF)
 gcov-files-rcutorture-y = util/rcu.c
 check-unit-y += tests/test-rcu-list$(EXESUF)
 gcov-files-test-rcu-list-y = util/rcu.c
+check-unit-y += tests/test-qdist$(EXESUF)
+gcov-files-test-qdist-y = util/qdist.c
 check-unit-y += tests/test-bitops$(EXESUF)
 check-unit-$(CONFIG_HAS_GLIB_SUBPROCESS_TESTS) += tests/test-qdev-global-props$(EXESUF)
 check-unit-y += tests/check-qom-interface$(EXESUF)
@@ -392,7 +394,8 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
 	tests/test-qmp-commands.o tests/test-visitor-serialization.o \
 	tests/test-x86-cpuid.o tests/test-mul64.o tests/test-int128.o \
 	tests/test-opts-visitor.o tests/test-qmp-event.o \
-	tests/rcutorture.o tests/test-rcu-list.o
+	tests/rcutorture.o tests/test-rcu-list.o \
+	tests/test-qdist.o
 
 $(test-obj-y): QEMU_INCLUDES += -Itests
 QEMU_CFLAGS += -I$(SRC_PATH)/tests
@@ -431,6 +434,7 @@ tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o
 tests/test-int128$(EXESUF): tests/test-int128.o
 tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
 tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
+tests/test-qdist$(EXESUF): tests/test-qdist.o $(test-util-obj-y)
 
 tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
 	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
diff --git a/tests/test-qdist.c b/tests/test-qdist.c
new file mode 100644
index 0000000..7625a57
--- /dev/null
+++ b/tests/test-qdist.c
@@ -0,0 +1,369 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include <glib.h>
+#include "qemu/qdist.h"
+
+#include <math.h>
+
+struct entry_desc {
+    double x;
+    unsigned long count;
+
+    /* 0 prints a space, 1-8 prints from qdist_blocks[] */
+    int fill_code;
+};
+
+/* See: https://en.wikipedia.org/wiki/Block_Elements */
+static const gunichar qdist_blocks[] = {
+    0x2581,
+    0x2582,
+    0x2583,
+    0x2584,
+    0x2585,
+    0x2586,
+    0x2587,
+    0x2588
+};
+
+#define QDIST_NR_BLOCK_CODES ARRAY_SIZE(qdist_blocks)
+
+static char *pr_hist(const struct entry_desc *darr, size_t n)
+{
+    GString *s = g_string_new("");
+    size_t i;
+
+    for (i = 0; i < n; i++) {
+        int fill = darr[i].fill_code;
+
+        if (fill) {
+            assert(fill <= QDIST_NR_BLOCK_CODES);
+            g_string_append_unichar(s, qdist_blocks[fill - 1]);
+        } else {
+            g_string_append_c(s, ' ');
+        }
+    }
+    return g_string_free(s, FALSE);
+}
+
+static void
+histogram_check(const struct qdist *dist, const struct entry_desc *darr,
+                size_t n, size_t n_bins)
+{
+    char *pr = qdist_pr_plain(dist, n_bins);
+    char *str = pr_hist(darr, n);
+
+    g_assert_cmpstr(pr, ==, str);
+    g_free(pr);
+    g_free(str);
+}
+
+static void histogram_check_single_full(const struct qdist *dist, size_t n_bins)
+{
+    struct entry_desc desc = { .fill_code = 8 };
+
+    histogram_check(dist, &desc, 1, n_bins);
+}
+
+static void
+entries_check(const struct qdist *dist, const struct entry_desc *darr, size_t n)
+{
+    size_t i;
+
+    for (i = 0; i < n; i++) {
+        struct qdist_entry *e = &dist->entries[i];
+
+        g_assert_cmpuint(e->count, ==, darr[i].count);
+    }
+}
+
+static void
+entries_insert(struct qdist *dist, const struct entry_desc *darr, size_t n)
+{
+    size_t i;
+
+    for (i = 0; i < n; i++) {
+        qdist_add(dist, darr[i].x, darr[i].count);
+    }
+}
+
+static void do_test_bin(const struct entry_desc *a, size_t n_a,
+                        const struct entry_desc *b, size_t n_b)
+{
+    struct qdist qda;
+    struct qdist qdb;
+
+    qdist_init(&qda);
+
+    entries_insert(&qda, a, n_a);
+    qdist_inc(&qda, a[0].x);
+    qdist_add(&qda, a[0].x, -1);
+
+    g_assert_cmpuint(qdist_unique_entries(&qda), ==, n_a);
+    g_assert_cmpfloat(qdist_xmin(&qda), ==, a[0].x);
+    g_assert_cmpfloat(qdist_xmax(&qda), ==, a[n_a - 1].x);
+    histogram_check(&qda, a, n_a, 0);
+    histogram_check(&qda, a, n_a, n_a);
+
+    qdist_bin__internal(&qdb, &qda, n_b);
+    g_assert_cmpuint(qdb.n, ==, n_b);
+    entries_check(&qdb, b, n_b);
+    g_assert_cmpuint(qdist_sample_count(&qda), ==, qdist_sample_count(&qdb));
+    /*
+     * No histogram_check() for $qdb, since we'd rebin it and that is a bug.
+     * Instead, regenerate it from $qda.
+     */
+    histogram_check(&qda, b, n_b, n_b);
+
+    qdist_destroy(&qdb);
+    qdist_destroy(&qda);
+}
+
+static void do_test_pr(uint32_t opt)
+{
+    static const struct entry_desc desc[] = {
+        [0] = { 1, 900, 8 },
+        [1] = { 2, 1, 1 },
+        [2] = { 3, 2, 1 }
+    };
+    static const char border[] = "|";
+    const char *llabel = NULL;
+    const char *rlabel = NULL;
+    struct qdist dist;
+    GString *s;
+    char *str;
+    char *pr;
+    size_t n;
+
+    n = ARRAY_SIZE(desc);
+    qdist_init(&dist);
+
+    entries_insert(&dist, desc, n);
+    histogram_check(&dist, desc, n, 0);
+
+    s = g_string_new("");
+
+    if (opt & QDIST_PR_LABELS) {
+        unsigned int lopts = opt & (QDIST_PR_NODECIMAL |
+                                    QDIST_PR_PERCENT |
+                                    QDIST_PR_100X |
+                                    QDIST_PR_NOBINRANGE);
+
+        if (lopts == 0) {
+            llabel = "[1.0,1.7)";
+            rlabel = "[2.3,3.0]";
+        } else if (lopts == QDIST_PR_NODECIMAL) {
+            llabel = "[1,2)";
+            rlabel = "[2,3]";
+        } else if (lopts == (QDIST_PR_PERCENT | QDIST_PR_NODECIMAL)) {
+            llabel = "[1,2)%";
+            rlabel = "[2,3]%";
+        } else if (lopts == QDIST_PR_100X) {
+            llabel = "[100.0,166.7)";
+            rlabel = "[233.3,300.0]";
+        } else if (lopts == (QDIST_PR_NOBINRANGE | QDIST_PR_NODECIMAL)) {
+            llabel = "1";
+            rlabel = "3";
+        } else {
+            g_assert_cmpstr("BUG", ==, "This is not meant to be exhaustive");
+        }
+    }
+
+    if (llabel) {
+        g_string_append(s, llabel);
+    }
+    if (opt & QDIST_PR_BORDER) {
+        g_string_append(s, border);
+    }
+
+    str = pr_hist(desc, n);
+    g_string_append(s, str);
+    g_free(str);
+
+    if (opt & QDIST_PR_BORDER) {
+        g_string_append(s, border);
+    }
+    if (rlabel) {
+        g_string_append(s, rlabel);
+    }
+
+    str = g_string_free(s, FALSE);
+    pr = qdist_pr(&dist, n, opt);
+    g_assert_cmpstr(pr, ==, str);
+    g_free(pr);
+    g_free(str);
+
+    qdist_destroy(&dist);
+}
+
+static inline void do_test_pr_label(uint32_t opt)
+{
+    opt |= QDIST_PR_LABELS;
+    do_test_pr(opt);
+}
+
+static void test_pr(void)
+{
+    do_test_pr(0);
+
+    do_test_pr(QDIST_PR_BORDER);
+
+    /* 100X should be ignored because we're not setting LABELS */
+    do_test_pr(QDIST_PR_100X);
+
+    do_test_pr_label(0);
+    do_test_pr_label(QDIST_PR_NODECIMAL);
+    do_test_pr_label(QDIST_PR_PERCENT | QDIST_PR_NODECIMAL);
+    do_test_pr_label(QDIST_PR_100X);
+    do_test_pr_label(QDIST_PR_NOBINRANGE | QDIST_PR_NODECIMAL);
+}
+
+static void test_bin_shrink(void)
+{
+    static const struct entry_desc a[] = {
+        [0] = { 0.0,   42922, 7 },
+        [1] = { 0.25,  47834, 8 },
+        [2] = { 0.50,  26628, 0 },
+        [3] = { 0.625, 597,   4 },
+        [4] = { 0.75,  10298, 1 },
+        [5] = { 0.875, 22,    2 },
+        [6] = { 1.0,   2771,  1 }
+    };
+    static const struct entry_desc b[] = {
+        [0] = { 0.0, 42922, 7 },
+        [1] = { 0.25, 47834, 8 },
+        [2] = { 0.50, 27225, 3 },
+        [3] = { 0.75, 13091, 1 }
+    };
+
+    return do_test_bin(a, ARRAY_SIZE(a), b, ARRAY_SIZE(b));
+}
+
+static void test_bin_expand(void)
+{
+    static const struct entry_desc a[] = {
+        [0] = { 0.0,   11713, 5 },
+        [1] = { 0.25,  20294, 0 },
+        [2] = { 0.50,  17266, 8 },
+        [3] = { 0.625, 1506,  0 },
+        [4] = { 0.75,  10355, 6 },
+        [5] = { 0.833, 2,     1 },
+        [6] = { 0.875, 99,    4 },
+        [7] = { 1.0,   4301,  2 }
+    };
+    static const struct entry_desc b[] = {
+        [0] = { 0.0, 11713, 5 },
+        [1] = { 0.0, 0,     0 },
+        [2] = { 0.0, 20294, 8 },
+        [3] = { 0.0, 0,     0 },
+        [4] = { 0.0, 0,     0 },
+        [5] = { 0.0, 17266, 6 },
+        [6] = { 0.0, 1506,  1 },
+        [7] = { 0.0, 10355, 4 },
+        [8] = { 0.0, 101,   1 },
+        [9] = { 0.0, 4301,  2 }
+    };
+
+    return do_test_bin(a, ARRAY_SIZE(a), b, ARRAY_SIZE(b));
+}
+
+static void test_bin_simple(void)
+{
+    static const struct entry_desc a[] = {
+        [0] = { 10, 101, 8 },
+        [1] = { 11, 0, 0 },
+        [2] = { 12, 2, 1 }
+    };
+    static const struct entry_desc b[] = {
+        [0] = { 0, 101, 8 },
+        [1] = { 0, 0, 0 },
+        [2] = { 0, 0, 0 },
+        [3] = { 0, 0, 0 },
+        [4] = { 0, 2, 1 }
+    };
+
+    return do_test_bin(a, ARRAY_SIZE(a), b, ARRAY_SIZE(b));
+}
+
+static void test_single_full(void)
+{
+    struct qdist dist;
+
+    qdist_init(&dist);
+
+    qdist_add(&dist, 3, 102);
+    g_assert_cmpfloat(qdist_avg(&dist), ==, 3);
+    g_assert_cmpfloat(qdist_xmin(&dist), ==, 3);
+    g_assert_cmpfloat(qdist_xmax(&dist), ==, 3);
+
+    histogram_check_single_full(&dist, 0);
+    histogram_check_single_full(&dist, 1);
+    histogram_check_single_full(&dist, 10);
+
+    qdist_destroy(&dist);
+}
+
+static void test_single_empty(void)
+{
+    struct qdist dist;
+    char *pr;
+
+    qdist_init(&dist);
+
+    qdist_add(&dist, 3, 0);
+    g_assert_cmpuint(qdist_sample_count(&dist), ==, 0);
+    g_assert(isnan(qdist_avg(&dist)));
+    g_assert_cmpfloat(qdist_xmin(&dist), ==, 3);
+    g_assert_cmpfloat(qdist_xmax(&dist), ==, 3);
+
+    pr = qdist_pr_plain(&dist, 0);
+    g_assert_cmpstr(pr, ==, " ");
+    g_free(pr);
+
+    pr = qdist_pr_plain(&dist, 1);
+    g_assert_cmpstr(pr, ==, " ");
+    g_free(pr);
+
+    pr = qdist_pr_plain(&dist, 2);
+    g_assert_cmpstr(pr, ==, " ");
+    g_free(pr);
+
+    qdist_destroy(&dist);
+}
+
+static void test_none(void)
+{
+    struct qdist dist;
+    char *pr;
+
+    qdist_init(&dist);
+
+    g_assert(isnan(qdist_avg(&dist)));
+    g_assert(isnan(qdist_xmin(&dist)));
+    g_assert(isnan(qdist_xmax(&dist)));
+
+    pr = qdist_pr_plain(&dist, 0);
+    g_assert(pr == NULL);
+
+    pr = qdist_pr_plain(&dist, 2);
+    g_assert(pr == NULL);
+
+    qdist_destroy(&dist);
+}
+
+int main(int argc, char *argv[])
+{
+    g_test_init(&argc, &argv, NULL);
+    g_test_add_func("/qdist/none", test_none);
+    g_test_add_func("/qdist/single/empty", test_single_empty);
+    g_test_add_func("/qdist/single/full", test_single_full);
+    g_test_add_func("/qdist/binning/simple", test_bin_simple);
+    g_test_add_func("/qdist/binning/expand", test_bin_expand);
+    g_test_add_func("/qdist/binning/shrink", test_bin_shrink);
+    g_test_add_func("/qdist/pr", test_pr);
+    return g_test_run();
+}
-- 
2.5.0

* [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
                   ` (8 preceding siblings ...)
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 09/15] qdist: add test program Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-29 19:52   ` Sergey Fedorov
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 11/15] qht: add test program Emilio G. Cota
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

This is a fast, scalable chained hash table with optional auto-resizing, allowing
reads that are concurrent with other reads, and reads/writes that are concurrent
with writes to separate buckets.

A hash table with these features will be necessary for the scalability
of the ongoing MTTCG work; before those changes arrive we can already
benefit from the single-threaded speedup that qht also provides.
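
As an illustration (a minimal sketch, not part of this patch; my_entry,
my_cmp, my_hash, ht and key are hypothetical caller-side names), the
intended calling pattern looks like this:

    struct my_entry {
        long key;
        /* ... user payload ... */
    };

    /* comparison callback, matching qht_lookup_func_t */
    static bool my_cmp(const void *obj, const void *userp)
    {
        const struct my_entry *e = obj;
        const long *key = userp;

        return e->key == *key;
    }

    qht_init(&ht, n_elems, QHT_MODE_AUTO_RESIZE);
    qht_insert(&ht, e, my_hash(e->key));

    /* lookups must run within an RCU read-critical section */
    rcu_read_lock();
    e = qht_lookup(&ht, my_cmp, &key, my_hash(key));
    rcu_read_unlock();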

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/qht.h | 183 ++++++++++++
 util/Makefile.objs |   1 +
 util/qht.c         | 837 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1021 insertions(+)
 create mode 100644 include/qemu/qht.h
 create mode 100644 util/qht.c

diff --git a/include/qemu/qht.h b/include/qemu/qht.h
new file mode 100644
index 0000000..aec60aa
--- /dev/null
+++ b/include/qemu/qht.h
@@ -0,0 +1,183 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_QHT_H
+#define QEMU_QHT_H
+
+#include "qemu/osdep.h"
+#include "qemu/seqlock.h"
+#include "qemu/thread.h"
+#include "qemu/qdist.h"
+
+struct qht {
+    struct qht_map *map;
+    QemuMutex lock; /* serializes setters of ht->map */
+    unsigned int mode;
+};
+
+/**
+ * struct qht_stats - Statistics of a QHT
+ * @head_buckets: number of head buckets
+ * @used_head_buckets: number of non-empty head buckets
+ * @entries: total number of entries
+ * @chain: frequency distribution representing the number of buckets in each
+ *         chain, excluding empty chains.
+ * @occupancy: frequency distribution representing chain occupancy rate.
+ *             Valid range: from 0.0 (empty) to 1.0 (full occupancy).
+ *
+ * An entry is a pointer-hash pair.
+ * Each bucket can host several entries.
+ * Chains are chains of buckets, whose first link is always a head bucket.
+ */
+struct qht_stats {
+    size_t head_buckets;
+    size_t used_head_buckets;
+    size_t entries;
+    struct qdist chain;
+    struct qdist occupancy;
+};
+
+typedef bool (*qht_lookup_func_t)(const void *obj, const void *userp);
+typedef void (*qht_iter_func_t)(struct qht *ht, void *p, uint32_t h, void *up);
+
+#define QHT_MODE_AUTO_RESIZE 0x1 /* auto-resize when heavily loaded */
+
+/**
+ * qht_init - Initialize a QHT
+ * @ht: QHT to be initialized
+ * @n_elems: number of entries the hash table should be optimized for.
+ * @mode: bitmask with OR'ed QHT_MODE_*
+ */
+void qht_init(struct qht *ht, size_t n_elems, unsigned int mode);
+
+/**
+ * qht_destroy - destroy a previously initialized QHT
+ * @ht: QHT to be destroyed
+ *
+ * Call only when there are no readers/writers left.
+ */
+void qht_destroy(struct qht *ht);
+
+/**
+ * qht_insert - Insert a pointer into the hash table
+ * @ht: QHT to insert to
+ * @p: pointer to be inserted
+ * @hash: hash corresponding to @p
+ *
+ * Attempting to insert a NULL @p is a bug.
+ * Inserting the same pointer @p with different @hash values is a bug.
+ *
+ * Returns true on success.
+ * Returns false if the @p-@hash pair already exists in the hash table.
+ */
+bool qht_insert(struct qht *ht, void *p, uint32_t hash);
+
+/**
+ * qht_lookup - Look up a pointer in a QHT
+ * @ht: QHT to be looked up
+ * @func: function to compare existing pointers against @userp
+ * @userp: pointer to pass to @func
+ * @hash: hash of the pointer to be looked up
+ *
+ * Needs to be called under an RCU read-critical section.
+ *
+ * The user-provided @func compares pointers in QHT against @userp.
+ * If the function returns true, a match has been found.
+ *
+ * Returns the corresponding pointer when a match is found.
+ * Returns NULL otherwise.
+ */
+void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
+                 uint32_t hash);
+
+/**
+ * qht_remove - remove a pointer from the hash table
+ * @ht: QHT to remove from
+ * @p: pointer to be removed
+ * @hash: hash corresponding to @p
+ *
+ * Attempting to remove a NULL @p is a bug.
+ *
+ * Just-removed @p pointers cannot be immediately freed; they need to remain
+ * valid until the end of the RCU grace period in which qht_remove() is called.
+ * This guarantees that concurrent lookups will always compare against valid
+ * data.
+ *
+ * Returns true on success.
+ * Returns false if the @p-@hash pair was not found.
+ */
+bool qht_remove(struct qht *ht, const void *p, uint32_t hash);
+
+/**
+ * qht_reset - reset a QHT
+ * @ht: QHT to be reset
+ *
+ * All entries in the hash table are reset. No resizing is performed.
+ *
+ * If concurrent readers may exist, the objects pointed to by the hash table
+ * must remain valid for the existing RCU grace period -- see qht_remove().
+ * See also: qht_reset_size()
+ */
+void qht_reset(struct qht *ht);
+
+/**
+ * qht_reset_size - reset and resize a QHT
+ * @ht: QHT to be reset and resized
+ * @n_elems: number of entries the resized hash table should be optimized for.
+ *
+ * Returns true if the resize was necessary and therefore performed.
+ * Returns false otherwise.
+ *
+ * If concurrent readers may exist, the objects pointed to by the hash table
+ * must remain valid for the existing RCU grace period -- see qht_remove().
+ * See also: qht_reset(), qht_resize().
+ */
+bool qht_reset_size(struct qht *ht, size_t n_elems);
+
+/**
+ * qht_resize - resize a QHT
+ * @ht: QHT to be resized
+ * @n_elems: number of entries the resized hash table should be optimized for
+ *
+ * Returns true on success.
+ * Returns false if the resize was not necessary and therefore not performed.
+ * See also: qht_reset_size().
+ */
+bool qht_resize(struct qht *ht, size_t n_elems);
+
+/**
+ * qht_iter - Iterate over a QHT
+ * @ht: QHT to be iterated over
+ * @func: function to be called for each entry in QHT
+ * @userp: additional pointer to be passed to @func
+ *
+ * Each time it is called, user-provided @func is passed a pointer-hash pair,
+ * plus @userp.
+ */
+void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp);
+
+/**
+ * qht_statistics_init - Gather statistics from a QHT
+ * @ht: QHT to gather statistics from
+ * @stats: pointer to a struct qht_stats to be filled in
+ *
+ * Does NOT need to be called under an RCU read-critical section,
+ * since it does not dereference any pointers stored in the hash table.
+ *
+ * When done with @stats, pass the struct to qht_statistics_destroy().
+ * Failing to do this will leak memory.
+ */
+void qht_statistics_init(struct qht *ht, struct qht_stats *stats);
+
+/**
+ * qht_statistics_destroy - Destroy a struct qht_stats
+ * @stats: struct qht_stats to be destroyed
+ *
+ * See also: qht_statistics_init().
+ */
+void qht_statistics_destroy(struct qht_stats *stats);
+
+#endif /* QEMU_QHT_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 702435e..45f8794 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -33,3 +33,4 @@ util-obj-y += timed-average.o
 util-obj-y += base64.o
 util-obj-y += log.o
 util-obj-y += qdist.o
+util-obj-y += qht.o
diff --git a/util/qht.c b/util/qht.c
new file mode 100644
index 0000000..ca5a620
--- /dev/null
+++ b/util/qht.c
@@ -0,0 +1,837 @@
+/*
+ * qht.c - QEMU Hash Table, designed to scale for read-mostly workloads.
+ *
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ *
+ * Assumptions:
+ * - NULL cannot be inserted/removed as a pointer value.
+ * - Trying to insert an already-existing hash-pointer pair is OK. However,
+ *   it is not OK to insert into the same hash table different hash-pointer
+ *   pairs that have the same pointer value, but not the hashes.
+ * - Lookups are performed under an RCU read-critical section; removals
+ *   must wait for a grace period to elapse before freeing removed objects.
+ *
+ * Features:
+ * - Reads (i.e. lookups and iterators) can be concurrent with other reads.
+ *   Lookups that are concurrent with writes to the same bucket will retry
+ *   via a seqlock; iterators acquire all bucket locks and therefore can be
+ *   concurrent with lookups and are serialized wrt writers.
+ * - Writes (i.e. insertions/removals) can be concurrent with writes to
+ *   different buckets; writes to the same bucket are serialized through a lock.
+ * - Optional auto-resizing: the hash table resizes up if the load surpasses
+ *   a certain threshold. Resizing is done concurrently with readers; writes
+ *   are serialized with the resize operation.
+ *
+ * The key structure is the bucket, which is cacheline-sized. Buckets
+ * contain a few hash values and pointers; the u32 hash values are stored in
+ * full so that resizing is fast. Having this structure instead of directly
+ * chaining items has two advantages:
+ * - Failed lookups fail fast, and touch a minimum number of cache lines.
+ * - Resizing the hash table with concurrent lookups is easy.
+ *
+ * There are two types of buckets:
+ * 1. "head" buckets are the ones allocated in the array of buckets in qht_map.
+ * 2. all "non-head" buckets (i.e. all others) are members of a chain that
+ *    starts from a head bucket.
+ * Note that the seqlock and spinlock of a head bucket apply to all buckets
+ * chained to it; these two fields are unused in non-head buckets.
+ *
+ * On removals, we move the last valid item in the chain to the position of the
+ * just-removed entry. This makes lookups slightly faster, since the moment an
+ * invalid entry is found, the (failed) lookup is over.
+ *
+ * Resizing is done by taking all bucket spinlocks (so that no other writers can
+ * race with us) and then copying all entries into a new hash map. Then, the
+ * ht->map pointer is set, and the old map is freed once no RCU readers can see
+ * it anymore.
+ *
+ * Writers check for concurrent resizes by comparing ht->map before and after
+ * acquiring their bucket lock. If they don't match, a resize has occurred
+ * while the bucket spinlock was being acquired.
+ *
+ * Related Work:
+ * - Idea of cacheline-sized buckets with full hashes taken from:
+ *   David, Guerraoui & Trigonakis, "Asynchronized Concurrency:
+ *   The Secret to Scaling Concurrent Search Data Structures", ASPLOS'15.
+ * - Why not RCU-based hash tables? They would allow us to get rid of the
+ *   seqlock, but resizing would take forever since RCU read critical
+ *   sections in QEMU take quite a long time.
+ *   More info on relativistic hash tables:
+ *   + Triplett, McKenney & Walpole, "Resizable, Scalable, Concurrent Hash
+ *     Tables via Relativistic Programming", USENIX ATC'11.
+ *   + Corbet, "Relativistic hash tables, part 1: Algorithms", @ lwn.net, 2014.
+ *     https://lwn.net/Articles/612021/
+ */
+#include "qemu/qht.h"
+#include "qemu/atomic.h"
+#include "qemu/rcu.h"
+
+//#define QHT_DEBUG
+
+/*
+ * We want to avoid false sharing of cache lines. Most systems have 64-byte
+ * cache lines so we go with it for simplicity.
+ *
+ * Note that systems with smaller cache lines will be fine (the struct is
+ * almost 64 bytes); systems with larger cache lines might suffer from
+ * some false sharing.
+ */
+#define QHT_BUCKET_ALIGN 64
+
+/* define these to keep sizeof(qht_bucket) within QHT_BUCKET_ALIGN */
+#if HOST_LONG_BITS == 32
+#define QHT_BUCKET_ENTRIES 6
+#else /* 64-bit */
+#define QHT_BUCKET_ENTRIES 4
+#endif
+
+/*
+ * Note: reading partially-updated pointers in @pointers could lead to
+ * segfaults. We thus access them with atomic_read/set; this guarantees
+ * that the compiler makes all those accesses atomic. We also need the
+ * volatile-like behavior in atomic_read, since otherwise the compiler
+ * might refetch the pointer.
+ * atomic_read's are of course not necessary when the bucket lock is held.
+ *
+ * If both ht->lock and b->lock are grabbed, ht->lock should always
+ * be grabbed first.
+ */
+struct qht_bucket {
+    QemuSpin lock;
+    QemuSeqLock sequence;
+    uint32_t hashes[QHT_BUCKET_ENTRIES];
+    void *pointers[QHT_BUCKET_ENTRIES];
+    struct qht_bucket *next;
+} QEMU_ALIGNED(QHT_BUCKET_ALIGN);
+
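+/*
+ * Illustrative arithmetic (assuming 4-byte QemuSpin and QemuSeqLock):
+ * 64-bit hosts: 4 + 4 + 4 * 4 (hashes) + 4 * 8 (pointers) + 8 (next) = 64.
+ * 32-bit hosts: 4 + 4 + 6 * 4 (hashes) + 6 * 4 (pointers) + 4 (next) = 60.
+ */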
+QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
+
+/**
+ * struct qht_map - structure to track an array of buckets
+ * @rcu: used by RCU. Keep it as the top field in the struct to help valgrind
+ *       find the whole struct.
+ * @buckets: array of head buckets. It is constant once the map is created.
+ * @n_buckets: number of head buckets. It is constant once the map is created.
+ * @n_added_buckets: number of added (i.e. "non-head") buckets
+ * @n_added_buckets_threshold: threshold to trigger an upward resize once the
+ *                             number of added buckets surpasses it.
+ *
+ * Buckets are tracked in what we call a "map", i.e. this structure.
+ */
+struct qht_map {
+    struct rcu_head rcu;
+    struct qht_bucket *buckets;
+    size_t n_buckets;
+    size_t n_added_buckets;
+    size_t n_added_buckets_threshold;
+};
+
+/* trigger a resize when n_added_buckets > n_buckets / div */
+#define QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV 8
+
+static void qht_do_resize(struct qht *ht, struct qht_map *new);
+static void qht_grow_maybe(struct qht *ht);
+
+#ifdef QHT_DEBUG
+
+#define qht_debug_assert(X) do { assert(X); } while (0)
+
+static void qht_bucket_debug__locked(struct qht_bucket *b)
+{
+    bool seen_empty = false;
+    bool corrupt = false;
+    int i;
+
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->pointers[i] == NULL) {
+                seen_empty = true;
+                continue;
+            }
+            if (seen_empty) {
+                fprintf(stderr, "%s: b: %p, pos: %i, hash: 0x%x, p: %p\n",
+                       __func__, b, i, b->hashes[i], b->pointers[i]);
+                corrupt = true;
+            }
+        }
+        b = b->next;
+    } while (b);
+    qht_debug_assert(!corrupt);
+}
+
+static void qht_map_debug__all_locked(struct qht_map *map)
+{
+    int i;
+
+    for (i = 0; i < map->n_buckets; i++) {
+        qht_bucket_debug__locked(&map->buckets[i]);
+    }
+}
+#else
+
+#define qht_debug_assert(X) do { (void)(X); } while (0)
+
+static inline void qht_bucket_debug__locked(struct qht_bucket *b)
+{ }
+
+static inline void qht_map_debug__all_locked(struct qht_map *map)
+{ }
+#endif /* QHT_DEBUG */
+
+static inline size_t qht_elems_to_buckets(size_t n_elems)
+{
+    return pow2ceil(n_elems / QHT_BUCKET_ENTRIES);
+}
+
+static inline void qht_head_init(struct qht_bucket *b)
+{
+    memset(b, 0, sizeof(*b));
+    qemu_spin_init(&b->lock);
+    seqlock_init(&b->sequence);
+}
+
+static inline
+struct qht_bucket *qht_map_to_bucket(struct qht_map *map, uint32_t hash)
+{
+    return &map->buckets[hash & (map->n_buckets - 1)];
+}
+
+/* acquire all bucket locks from a map */
+static void qht_map_lock_buckets(struct qht_map *map)
+{
+    size_t i;
+
+    for (i = 0; i < map->n_buckets; i++) {
+        struct qht_bucket *b = &map->buckets[i];
+
+        qemu_spin_lock(&b->lock);
+    }
+}
+
+static void qht_map_unlock_buckets(struct qht_map *map)
+{
+    size_t i;
+
+    for (i = 0; i < map->n_buckets; i++) {
+        struct qht_bucket *b = &map->buckets[i];
+
+        qemu_spin_unlock(&b->lock);
+    }
+}
+
+/*
+ * Call with at least a bucket lock held.
+ * @map should be the value read before acquiring the lock (or locks).
+ */
+static inline bool qht_map_is_stale__locked(struct qht *ht, struct qht_map *map)
+{
+    return map != ht->map;
+}
+
+/*
+ * Grab all bucket locks, and set @pmap after making sure the map isn't stale.
+ *
+ * Pairs with qht_map_unlock_buckets(), hence the pass-by-reference.
+ *
+ * Note: callers cannot have ht->lock held.
+ */
+static inline
+void qht_map_lock_buckets__no_stale(struct qht *ht, struct qht_map **pmap)
+{
+    struct qht_map *map;
+
+    map = atomic_rcu_read(&ht->map);
+    qht_map_lock_buckets(map);
+    if (likely(!qht_map_is_stale__locked(ht, map))) {
+        *pmap = map;
+        return;
+    }
+    qht_map_unlock_buckets(map);
+
+    /* we raced with a resize; acquire ht->lock to see the updated ht->map */
+    qemu_mutex_lock(&ht->lock);
+    map = ht->map;
+    qht_map_lock_buckets(map);
+    qemu_mutex_unlock(&ht->lock);
+    *pmap = map;
+    return;
+}
+
+/*
+ * Get a head bucket and lock it, making sure its parent map is not stale.
+ * @pmap is filled with a pointer to the bucket's parent map.
+ *
+ * Unlock with qemu_spin_unlock(&b->lock).
+ *
+ * Note: callers cannot have ht->lock held.
+ */
+static inline
+struct qht_bucket *qht_bucket_lock__no_stale(struct qht *ht, uint32_t hash,
+                                             struct qht_map **pmap)
+{
+    struct qht_bucket *b;
+    struct qht_map *map;
+
+    map = atomic_rcu_read(&ht->map);
+    b = qht_map_to_bucket(map, hash);
+
+    qemu_spin_lock(&b->lock);
+    if (likely(!qht_map_is_stale__locked(ht, map))) {
+        *pmap = map;
+        return b;
+    }
+    qemu_spin_unlock(&b->lock);
+
+    /* we raced with a resize; acquire ht->lock to see the updated ht->map */
+    qemu_mutex_lock(&ht->lock);
+    map = ht->map;
+    b = qht_map_to_bucket(map, hash);
+    qemu_spin_lock(&b->lock);
+    qemu_mutex_unlock(&ht->lock);
+    *pmap = map;
+    return b;
+}
+
+static inline bool qht_map_needs_resize(struct qht_map *map)
+{
+    return atomic_read(&map->n_added_buckets) > map->n_added_buckets_threshold;
+}
+
+static inline void qht_chain_destroy(struct qht_bucket *head)
+{
+    struct qht_bucket *curr = head->next;
+    struct qht_bucket *prev;
+
+    while (curr) {
+        prev = curr;
+        curr = curr->next;
+        qemu_vfree(prev);
+    }
+}
+
+/* pass only an orphan map */
+static void qht_map_destroy(struct qht_map *map)
+{
+    size_t i;
+
+    for (i = 0; i < map->n_buckets; i++) {
+        qht_chain_destroy(&map->buckets[i]);
+    }
+    qemu_vfree(map->buckets);
+    g_free(map);
+}
+
+static void qht_map_reclaim(struct rcu_head *rcu)
+{
+    struct qht_map *map = container_of(rcu, struct qht_map, rcu);
+
+    qht_map_destroy(map);
+}
+
+static struct qht_map *qht_map_create(size_t n_buckets)
+{
+    struct qht_map *map;
+    size_t i;
+
+    map = g_malloc(sizeof(*map));
+    map->n_buckets = n_buckets;
+
+    map->n_added_buckets = 0;
+    map->n_added_buckets_threshold = n_buckets /
+        QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV;
+
+    /* let tiny hash tables add at least one non-head bucket */
+    if (unlikely(map->n_added_buckets_threshold == 0)) {
+        map->n_added_buckets_threshold = 1;
+    }
+
+    map->buckets = qemu_memalign(QHT_BUCKET_ALIGN,
+                                 sizeof(*map->buckets) * n_buckets);
+    for (i = 0; i < n_buckets; i++) {
+        qht_head_init(&map->buckets[i]);
+    }
+    return map;
+}
+
+void qht_init(struct qht *ht, size_t n_elems, unsigned int mode)
+{
+    struct qht_map *map;
+    size_t n_buckets = qht_elems_to_buckets(n_elems);
+
+    ht->mode = mode;
+    qemu_mutex_init(&ht->lock);
+    map = qht_map_create(n_buckets);
+    atomic_rcu_set(&ht->map, map);
+}
+
+/* call only when there are no readers/writers left */
+void qht_destroy(struct qht *ht)
+{
+    qht_map_destroy(ht->map);
+    memset(ht, 0, sizeof(*ht));
+}
+
+static void qht_bucket_reset__locked(struct qht_bucket *head)
+{
+    struct qht_bucket *b = head;
+    int i;
+
+    seqlock_write_begin(&head->sequence);
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->pointers[i] == NULL) {
+                goto done;
+            }
+            b->hashes[i] = 0;
+            atomic_set(&b->pointers[i], NULL);
+        }
+        b = b->next;
+    } while (b);
+ done:
+    seqlock_write_end(&head->sequence);
+}
+
+/* call with all bucket locks held */
+static void qht_map_reset__all_locked(struct qht_map *map)
+{
+    size_t i;
+
+    for (i = 0; i < map->n_buckets; i++) {
+        qht_bucket_reset__locked(&map->buckets[i]);
+    }
+    qht_map_debug__all_locked(map);
+}
+
+void qht_reset(struct qht *ht)
+{
+    struct qht_map *map;
+
+    qht_map_lock_buckets__no_stale(ht, &map);
+    qht_map_reset__all_locked(map);
+    qht_map_unlock_buckets(map);
+}
+
+bool qht_reset_size(struct qht *ht, size_t n_elems)
+{
+    struct qht_map *new;
+    struct qht_map *map;
+    size_t n_buckets;
+    bool resize = false;
+
+    n_buckets = qht_elems_to_buckets(n_elems);
+
+    qemu_mutex_lock(&ht->lock);
+    map = ht->map;
+    if (n_buckets != map->n_buckets) {
+        new = qht_map_create(n_buckets);
+        resize = true;
+    }
+
+    qht_map_lock_buckets(map);
+    qht_map_reset__all_locked(map);
+    if (resize) {
+        qht_do_resize(ht, new);
+    }
+    qht_map_unlock_buckets(map);
+    qemu_mutex_unlock(&ht->lock);
+
+    return resize;
+}
+
+static inline
+void *qht_do_lookup(struct qht_bucket *head, qht_lookup_func_t func,
+                    const void *userp, uint32_t hash)
+{
+    struct qht_bucket *b = head;
+    int i;
+
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->hashes[i] == hash) {
+                void *p = atomic_read(&b->pointers[i]);
+
+                if (likely(p) && likely(func(p, userp))) {
+                    return p;
+                }
+            }
+        }
+        b = atomic_rcu_read(&b->next);
+    } while (b);
+
+    return NULL;
+}
+
+static __attribute__((noinline))
+void *qht_lookup__slowpath(struct qht_bucket *b, qht_lookup_func_t func,
+                           const void *userp, uint32_t hash)
+{
+    unsigned int version;
+    void *ret;
+
+    do {
+        version = seqlock_read_begin(&b->sequence);
+        ret = qht_do_lookup(b, func, userp, hash);
+    } while (seqlock_read_retry(&b->sequence, version));
+    return ret;
+}
+
+void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
+                 uint32_t hash)
+{
+    struct qht_bucket *b;
+    struct qht_map *map;
+    unsigned int version;
+    void *ret;
+
+    map = atomic_rcu_read(&ht->map);
+    b = qht_map_to_bucket(map, hash);
+
+    version = seqlock_read_begin(&b->sequence);
+    ret = qht_do_lookup(b, func, userp, hash);
+    if (likely(!seqlock_read_retry(&b->sequence, version))) {
+        return ret;
+    }
+    /*
+     * Removing the do/while from the fastpath gives a 4% perf. increase when
+     * running a 100%-lookup microbenchmark.
+     */
+    return qht_lookup__slowpath(b, func, userp, hash);
+}
+
+/* call with head->lock held */
+static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
+                               struct qht_bucket *head, void *p, uint32_t hash,
+                               bool *needs_resize)
+{
+    struct qht_bucket *b = head;
+    struct qht_bucket *prev = NULL;
+    struct qht_bucket *new = NULL;
+    int i;
+
+    for (;;) {
+        if (b == NULL) {
+            b = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*b));
+            memset(b, 0, sizeof(*b));
+            new = b;
+            atomic_inc(&map->n_added_buckets);
+            if (unlikely(qht_map_needs_resize(map)) && needs_resize) {
+                *needs_resize = true;
+            }
+        }
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->pointers[i]) {
+                if (unlikely(b->pointers[i] == p)) {
+                    return false;
+                }
+                continue;
+            }
+            /* found an empty key: acquire the seqlock and write */
+            seqlock_write_begin(&head->sequence);
+            if (new) {
+                atomic_rcu_set(&prev->next, b);
+            }
+            b->hashes[i] = hash;
+            atomic_set(&b->pointers[i], p);
+            seqlock_write_end(&head->sequence);
+            return true;
+        }
+        prev = b;
+        b = b->next;
+    }
+}
+
+bool qht_insert(struct qht *ht, void *p, uint32_t hash)
+{
+    struct qht_bucket *b;
+    struct qht_map *map;
+    bool needs_resize = false;
+    bool ret;
+
+    /* NULL pointers are not supported */
+    qht_debug_assert(p);
+
+    b = qht_bucket_lock__no_stale(ht, hash, &map);
+    ret = qht_insert__locked(ht, map, b, p, hash, &needs_resize);
+    qht_bucket_debug__locked(b);
+    qemu_spin_unlock(&b->lock);
+
+    if (unlikely(needs_resize) && ht->mode & QHT_MODE_AUTO_RESIZE) {
+        qht_grow_maybe(ht);
+    }
+    return ret;
+}
+
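+/* return true if the entry at @pos is the last valid entry in the chain */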
+static inline bool qht_entry_is_last(struct qht_bucket *b, int pos)
+{
+    if (pos == QHT_BUCKET_ENTRIES - 1) {
+        if (b->next == NULL) {
+            return true;
+        }
+        return b->next->pointers[0] == NULL;
+    }
+    return b->pointers[pos + 1] == NULL;
+}
+
+static void
+qht_entry_move(struct qht_bucket *to, int i, struct qht_bucket *from, int j)
+{
+    qht_debug_assert(!(to == from && i == j));
+    qht_debug_assert(to->pointers[i]);
+    qht_debug_assert(from->pointers[j]);
+
+    to->hashes[i] = from->hashes[j];
+    atomic_set(&to->pointers[i], from->pointers[j]);
+
+    from->hashes[j] = 0;
+    atomic_set(&from->pointers[j], NULL);
+}
+
+/*
+ * Find the last valid entry in @head, and swap it with @orig[pos], which has
+ * just been invalidated.
+ */
+static inline void qht_bucket_fill_hole(struct qht_bucket *orig, int pos)
+{
+    struct qht_bucket *b = orig;
+    struct qht_bucket *prev = NULL;
+    int i;
+
+    if (qht_entry_is_last(orig, pos)) {
+        orig->hashes[pos] = 0;
+        atomic_set(&orig->pointers[pos], NULL);
+        return;
+    }
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->pointers[i]) {
+                continue;
+            }
+            if (i > 0) {
+                return qht_entry_move(orig, pos, b, i - 1);
+            }
+            qht_debug_assert(prev);
+            return qht_entry_move(orig, pos, prev, QHT_BUCKET_ENTRIES - 1);
+        }
+        prev = b;
+        b = b->next;
+    } while (b);
+    /* no free entries other than orig[pos], so swap it with the last one */
+    qht_entry_move(orig, pos, prev, QHT_BUCKET_ENTRIES - 1);
+}
+
+/* call with b->lock held */
+static inline
+bool qht_remove__locked(struct qht_map *map, struct qht_bucket *head,
+                        const void *p, uint32_t hash)
+{
+    struct qht_bucket *b = head;
+    int i;
+
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            void *q = b->pointers[i];
+
+            if (unlikely(q == NULL)) {
+                return false;
+            }
+            if (q == p) {
+                qht_debug_assert(b->hashes[i] == hash);
+                seqlock_write_begin(&head->sequence);
+                qht_bucket_fill_hole(b, i);
+                seqlock_write_end(&head->sequence);
+                return true;
+            }
+        }
+        b = b->next;
+    } while (b);
+    return false;
+}
+
+bool qht_remove(struct qht *ht, const void *p, uint32_t hash)
+{
+    struct qht_bucket *b;
+    struct qht_map *map;
+    bool ret;
+
+    /* NULL pointers are not supported */
+    qht_debug_assert(p);
+
+    b = qht_bucket_lock__no_stale(ht, hash, &map);
+    ret = qht_remove__locked(map, b, p, hash);
+    qht_bucket_debug__locked(b);
+    qemu_spin_unlock(&b->lock);
+    return ret;
+}
+
+static inline void qht_bucket_iter(struct qht *ht, struct qht_bucket *b,
+                                   qht_iter_func_t func, void *userp)
+{
+    int i;
+
+    do {
+        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+            if (b->pointers[i] == NULL) {
+                return;
+            }
+            func(ht, b->pointers[i], b->hashes[i], userp);
+        }
+        b = b->next;
+    } while (b);
+}
+
+/* call with all of the map's locks held */
+static inline void qht_map_iter__all_locked(struct qht *ht, struct qht_map *map,
+                                            qht_iter_func_t func, void *userp)
+{
+    size_t i;
+
+    for (i = 0; i < map->n_buckets; i++) {
+        qht_bucket_iter(ht, &map->buckets[i], func, userp);
+    }
+}
+
+void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp)
+{
+    struct qht_map *map;
+
+    map = atomic_rcu_read(&ht->map);
+    qht_map_lock_buckets(map);
+    /* Note: ht here is merely for carrying ht->mode; ht->map won't be read */
+    qht_map_iter__all_locked(ht, map, func, userp);
+    qht_map_unlock_buckets(map);
+}
+
+static void qht_map_copy(struct qht *ht, void *p, uint32_t hash, void *userp)
+{
+    struct qht_map *new = userp;
+    struct qht_bucket *b = qht_map_to_bucket(new, hash);
+
+    /* no need to acquire b->lock because no thread has seen this map yet */
+    qht_insert__locked(ht, new, b, p, hash, NULL);
+}
+
+/*
+ * Call with ht->lock and all bucket locks held.
+ *
+ * Creating the @new map here would add unnecessary delay while all the locks
+ * are held--holding up the bucket locks is particularly bad, since no writes
+ * can occur while these are held. Thus, we let callers create the new map,
+ * hopefully without the bucket locks held.
+ */
+static void qht_do_resize(struct qht *ht, struct qht_map *new)
+{
+    struct qht_map *old;
+
+    old = ht->map;
+    g_assert_cmpuint(new->n_buckets, !=, old->n_buckets);
+
+    qht_map_iter__all_locked(ht, old, qht_map_copy, new);
+    qht_map_debug__all_locked(new);
+
+    atomic_rcu_set(&ht->map, new);
+    call_rcu1(&old->rcu, qht_map_reclaim);
+}
+
+bool qht_resize(struct qht *ht, size_t n_elems)
+{
+    size_t n_buckets = qht_elems_to_buckets(n_elems);
+    bool ret = false;
+
+    qemu_mutex_lock(&ht->lock);
+    if (n_buckets != ht->map->n_buckets) {
+        struct qht_map *new;
+        struct qht_map *old = ht->map;
+
+        new = qht_map_create(n_buckets);
+        qht_map_lock_buckets(old);
+        qht_do_resize(ht, new);
+        qht_map_unlock_buckets(old);
+        ret = true;
+    }
+    qemu_mutex_unlock(&ht->lock);
+
+    return ret;
+}
+
+static __attribute__((noinline)) void qht_grow_maybe(struct qht *ht)
+{
+    struct qht_map *map;
+
+    /*
+     * If the lock is taken it probably means there's an ongoing resize,
+     * so bail out.
+     */
+    if (qemu_mutex_trylock(&ht->lock)) {
+        return;
+    }
+    map = ht->map;
+    /* another thread might have just performed the resize we were after */
+    if (qht_map_needs_resize(map)) {
+        struct qht_map *new = qht_map_create(map->n_buckets * 2);
+
+        qht_map_lock_buckets(map);
+        qht_do_resize(ht, new);
+        qht_map_unlock_buckets(map);
+    }
+    qemu_mutex_unlock(&ht->lock);
+}
+
+/* pass @stats to qht_statistics_destroy() when done */
+void qht_statistics_init(struct qht *ht, struct qht_stats *stats)
+{
+    struct qht_map *map;
+    int i;
+
+    map = atomic_rcu_read(&ht->map);
+
+    stats->head_buckets = map->n_buckets;
+    stats->used_head_buckets = 0;
+    stats->entries = 0;
+    qdist_init(&stats->chain);
+    qdist_init(&stats->occupancy);
+
+    for (i = 0; i < map->n_buckets; i++) {
+        struct qht_bucket *head = &map->buckets[i];
+        struct qht_bucket *b;
+        unsigned int version;
+        size_t buckets;
+        size_t entries;
+        int j;
+
+        do {
+            version = seqlock_read_begin(&head->sequence);
+            buckets = 0;
+            entries = 0;
+            b = head;
+            do {
+                for (j = 0; j < QHT_BUCKET_ENTRIES; j++) {
+                    if (atomic_read(&b->pointers[j]) == NULL) {
+                        break;
+                    }
+                    entries++;
+                }
+                buckets++;
+                b = atomic_rcu_read(&b->next);
+            } while (b);
+        } while (seqlock_read_retry(&head->sequence, version));
+
+        if (entries) {
+            qdist_inc(&stats->chain, buckets);
+            qdist_inc(&stats->occupancy,
+                      (double)entries / QHT_BUCKET_ENTRIES / buckets);
+            stats->used_head_buckets++;
+            stats->entries += entries;
+        } else {
+            qdist_inc(&stats->occupancy, 0);
+        }
+    }
+}
+
+void qht_statistics_destroy(struct qht_stats *stats)
+{
+    qdist_destroy(&stats->occupancy);
+    qdist_destroy(&stats->chain);
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Qemu-devel] [PATCH v6 11/15] qht: add test program
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
                   ` (9 preceding siblings ...)
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-29 20:15   ` Sergey Fedorov
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 12/15] qht: add qht-bench, a performance benchmark Emilio G. Cota
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tests/.gitignore |   1 +
 tests/Makefile   |   6 ++-
 tests/test-qht.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 165 insertions(+), 1 deletion(-)
 create mode 100644 tests/test-qht.c

diff --git a/tests/.gitignore b/tests/.gitignore
index 7c0d156..ffde5d2 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -50,6 +50,7 @@ test-qdev-global-props
 test-qemu-opts
 test-qdist
 test-qga
+test-qht
 test-qmp-commands
 test-qmp-commands.h
 test-qmp-event
diff --git a/tests/Makefile b/tests/Makefile
index a5af20b..8589b11 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -72,6 +72,8 @@ check-unit-y += tests/test-rcu-list$(EXESUF)
 gcov-files-test-rcu-list-y = util/rcu.c
 check-unit-y += tests/test-qdist$(EXESUF)
 gcov-files-test-qdist-y = util/qdist.c
+check-unit-y += tests/test-qht$(EXESUF)
+gcov-files-test-qht-y = util/qht.c
 check-unit-y += tests/test-bitops$(EXESUF)
 check-unit-$(CONFIG_HAS_GLIB_SUBPROCESS_TESTS) += tests/test-qdev-global-props$(EXESUF)
 check-unit-y += tests/check-qom-interface$(EXESUF)
@@ -395,7 +397,8 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
 	tests/test-x86-cpuid.o tests/test-mul64.o tests/test-int128.o \
 	tests/test-opts-visitor.o tests/test-qmp-event.o \
 	tests/rcutorture.o tests/test-rcu-list.o \
-	tests/test-qdist.o
+	tests/test-qdist.o \
+	tests/test-qht.o
 
 $(test-obj-y): QEMU_INCLUDES += -Itests
 QEMU_CFLAGS += -I$(SRC_PATH)/tests
@@ -435,6 +438,7 @@ tests/test-int128$(EXESUF): tests/test-int128.o
 tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
 tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
 tests/test-qdist$(EXESUF): tests/test-qdist.o $(test-util-obj-y)
+tests/test-qht$(EXESUF): tests/test-qht.o $(test-util-obj-y)
 
 tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
 	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
diff --git a/tests/test-qht.c b/tests/test-qht.c
new file mode 100644
index 0000000..c8eb930
--- /dev/null
+++ b/tests/test-qht.c
@@ -0,0 +1,159 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include <glib.h>
+#include "qemu/qht.h"
+
+#define N 5000
+
+static struct qht ht;
+static int32_t arr[N * 2];
+
+static bool is_equal(const void *obj, const void *userp)
+{
+    const int32_t *a = obj;
+    const int32_t *b = userp;
+
+    return *a == *b;
+}
+
+static void insert(int a, int b)
+{
+    int i;
+
+    for (i = a; i < b; i++) {
+        uint32_t hash;
+
+        arr[i] = i;
+        hash = i;
+
+        qht_insert(&ht, &arr[i], hash);
+    }
+}
+
+static void rm(int init, int end)
+{
+    int i;
+
+    for (i = init; i < end; i++) {
+        uint32_t hash;
+
+        hash = arr[i];
+        g_assert_true(qht_remove(&ht, &arr[i], hash));
+    }
+}
+
+static void check(int a, int b, bool expected)
+{
+    struct qht_stats stats;
+    int i;
+
+    for (i = a; i < b; i++) {
+        void *p;
+        uint32_t hash;
+        int32_t val;
+
+        val = i;
+        hash = i;
+        p = qht_lookup(&ht, is_equal, &val, hash);
+        g_assert_true(!!p == expected);
+    }
+    qht_statistics_init(&ht, &stats);
+    if (stats.used_head_buckets) {
+        g_assert_cmpfloat(qdist_avg(&stats.chain), >=, 1.0);
+    }
+    g_assert_cmpuint(stats.head_buckets, >, 0);
+    qht_statistics_destroy(&stats);
+}
+
+static void count_func(struct qht *ht, void *p, uint32_t hash, void *userp)
+{
+    unsigned int *curr = userp;
+
+    (*curr)++;
+}
+
+static void check_n(size_t expected)
+{
+    struct qht_stats stats;
+
+    qht_statistics_init(&ht, &stats);
+    g_assert_cmpuint(stats.entries, ==, expected);
+    qht_statistics_destroy(&stats);
+}
+
+static void iter_check(unsigned int count)
+{
+    unsigned int curr = 0;
+
+    qht_iter(&ht, count_func, &curr);
+    g_assert_cmpuint(curr, ==, count);
+}
+
+static void qht_do_test(unsigned int mode, size_t init_entries)
+{
+    qht_init(&ht, init_entries, mode);
+
+    insert(0, N);
+    check(0, N, true);
+    check_n(N);
+    check(-N, -1, false);
+    iter_check(N);
+
+    rm(101, 102);
+    check_n(N - 1);
+    insert(N, N * 2);
+    check_n(N + N - 1);
+    rm(N, N * 2);
+    check_n(N - 1);
+    insert(101, 102);
+    check_n(N);
+
+    rm(10, 200);
+    check_n(N - 190);
+    insert(150, 200);
+    check_n(N - 190 + 50);
+    insert(10, 150);
+    check_n(N);
+
+    rm(1, 2);
+    check_n(N - 1);
+    qht_reset_size(&ht, 0);
+    check_n(0);
+    check(0, N, false);
+
+    qht_destroy(&ht);
+}
+
+static void qht_test(unsigned int mode)
+{
+    qht_do_test(mode, 0);
+    qht_do_test(mode, 1);
+    qht_do_test(mode, 2);
+    qht_do_test(mode, 8);
+    qht_do_test(mode, 16);
+    qht_do_test(mode, 8192);
+    qht_do_test(mode, 16384);
+}
+
+static void test_default(void)
+{
+    qht_test(0);
+}
+
+static void test_resize(void)
+{
+    qht_test(QHT_MODE_AUTO_RESIZE);
+}
+
+int main(int argc, char *argv[])
+{
+    g_test_init(&argc, &argv, NULL);
+    g_test_add_func("/qht/mode/default", test_default);
+    g_test_add_func("/qht/mode/resize", test_resize);
+    return g_test_run();
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Qemu-devel] [PATCH v6 12/15] qht: add qht-bench, a performance benchmark
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
                   ` (10 preceding siblings ...)
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 11/15] qht: add test program Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-29 20:45   ` Sergey Fedorov
  2016-05-31 15:12   ` Alex Bennée
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 13/15] qht: add test-qht-par to invoke qht-bench from 'check' target Emilio G. Cota
                   ` (3 subsequent siblings)
  15 siblings, 2 replies; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

This serves as a performance benchmark as well as a stress test
for QHT. We can tweak quite a number of things, including the
number of resize threads and how frequently resizes are triggered.
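
As an illustration, a hypothetical invocation (flags as documented in the
program's usage text) running 16 read/write threads for 10 seconds at a
20% update rate, with auto-resize enabled and one resize thread:

  $ tests/qht-bench -d 10 -n 16 -u 20 -R -S 1 -N 1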

A performance comparison of QHT vs CLHT[1] and ck_hs[2] using
this same benchmark program can be found here:
  http://imgur.com/a/0Bms4

The tests are run on a 64-core AMD Opteron 6376.

Note that ck_hs's performance drops significantly as writes go
up, since it requires an external lock (I used a ck_spinlock)
around every write.

Also, note that CLHT instead of using a seqlock, relies on an
allocator that does not ever return the same address during the
same read-critical section. This gives it a slight performance
advantage over QHT on read-heavy workloads, since the seqlock
writes aren't there.

[1] CLHT: https://github.com/LPD-EPFL/CLHT
          https://infoscience.epfl.ch/record/207109/files/ascy_asplos15.pdf

[2] ck_hs: http://concurrencykit.org/
           http://backtrace.io/blog/blog/2015/03/13/workload-specialization/

A few of those plots are shown in text here, since that site
might not be online forever. Throughput is on Mops/s on the Y axis.

                             200K keys, 0 % updates

  450 ++--+------+------+-------+-------+-------+-------+------+-------+--++
      |   +      +      +       +       +       +       +      +      +N+  |
  400 ++                                                           ---+E+ ++
      |                                                       +++----      |
  350 ++          9 ++------+------++                       --+E+    -+H+ ++
      |             |      +H+-     |                 -+N+----   ---- +++  |
  300 ++          8 ++     +E+     ++             -----+E+  --+H+         ++
      |             |      +++      |         -+N+-----+H+--               |
  250 ++          7 ++------+------++  +++-----+E+----                    ++
  200 ++                    1         -+E+-----+H+                        ++
      |                           ----                     qht +-E--+      |
  150 ++                      -+E+                        clht +-H--+     ++
      |                   ----                              ck +-N--+      |
  100 ++               +E+                                                ++
      |            ----                                                    |
   50 ++       -+E+                                                       ++
      |   +E+E+  +      +       +       +       +       +      +       +   |
    0 ++--E------+------+-------+-------+-------+-------+------+-------+--++
          1      8      16      24      32      40      48     56      64
                                Number of threads

                             200K keys, 1 % updates

  350 ++--+------+------+-------+-------+-------+-------+------+-------+--++
      |   +      +      +       +       +       +       +      +     -+E+  |
  300 ++                                                         -----+H+ ++
      |                                                       +E+--        |
      |           9 ++------+------++                  +++----             |
  250 ++            |      +E+   -- |                 -+E+                ++
      |           8 ++         --  ++             ----                     |
  200 ++            |      +++-     |  +++  ---+E+                        ++
      |           7 ++------N------++ -+E+--               qht +-E--+      |
      |                     1  +++----                    clht +-H--+      |
  150 ++                      -+E+                          ck +-N--+     ++
      |                   ----                                             |
  100 ++               +E+                                                ++
      |            ----                                                    |
      |        -+E+                                                        |
   50 ++    +H+-+N+----+N+-----+N+------                                  ++
      |   +E+E+  +      +       +      +N+-----+N+-----+N+----+N+-----+N+  |
    0 ++--E------+------+-------+-------+-------+-------+------+-------+--++
          1      8      16      24      32      40      48     56      64
                                Number of threads

                             200K keys, 20 % updates

  300 ++--+------+------+-------+-------+-------+-------+------+-------+--++
      |   +      +      +       +       +       +       +      +       +   |
      |                                                              -+H+  |
  250 ++                                                         ----     ++
      |           9 ++------+------++                       --+H+  ---+E+  |
      |           8 ++     +H+--   ++                 -+H+----+E+--        |
  200 ++            |      +E+    --|             -----+E+--  +++         ++
      |           7 ++      + ---- ++       ---+H+---- +++ qht +-E--+      |
  150 ++          6 ++------N------++ -+H+-----+E+        clht +-H--+     ++
      |                     1     -----+E+--                ck +-N--+      |
      |                       -+H+----                                     |
  100 ++                  -----+E+                                        ++
      |                +E+--                                               |
      |            ----+++                                                 |
   50 ++       -+E+                                                       ++
      |     +E+ +++                                                        |
      |   +E+N+-+N+-----+       +       +       +       +      +       +   |
    0 ++--E------+------N-------N-------N-------N-------N------N-------N--++
          1      8      16      24      32      40      48     56      64
                                Number of threads

                            200K keys, 100 % updates       qht +-E--+
                                                          clht +-H--+
  160 ++--+------+------+-------+-------+-------+-------+---ck-+-N-----+--++
      |   +      +      +       +       +       +       +      +   ----H   |
  140 ++                                                      +H+--  -+E+ ++
      |                                                +++----   ----      |
  120 ++          8 ++------+------++                 -+H+    +E+         ++
      |           7 ++     +H+---- ++             ---- +++----             |
  100 ++            |      +E+      |  +++  ---+H+    -+E+                ++
      |           6 ++     +++     ++ -+H+--   +++----                     |
   80 ++          5 ++------N----------+E+-----+E+                        ++
      |                     1 -+H+---- +++                                 |
      |                   -----+E+                                         |
   60 ++               +H+---- +++                                        ++
      |            ----+E+                                                 |
   40 ++        +H+----                                                   ++
      |       --+E+                                                        |
   20 ++    +E+                                                           ++
      |  +EE+    +      +       +       +       +       +      +       +   |
    0 ++--+N-N---N------N-------N-------N-------N-------N------N-------N--++
          1      8      16      24      32      40      48     56      64
                                Number of threads

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tests/.gitignore  |   1 +
 tests/Makefile    |   3 +-
 tests/qht-bench.c | 474 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 477 insertions(+), 1 deletion(-)
 create mode 100644 tests/qht-bench.c

diff --git a/tests/.gitignore b/tests/.gitignore
index ffde5d2..d19023e 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -7,6 +7,7 @@ check-qnull
 check-qstring
 check-qom-interface
 check-qom-proplist
+qht-bench
 rcutorture
 test-aio
 test-base64
diff --git a/tests/Makefile b/tests/Makefile
index 8589b11..176bbd8 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -398,7 +398,7 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
 	tests/test-opts-visitor.o tests/test-qmp-event.o \
 	tests/rcutorture.o tests/test-rcu-list.o \
 	tests/test-qdist.o \
-	tests/test-qht.o
+	tests/test-qht.o tests/qht-bench.o
 
 $(test-obj-y): QEMU_INCLUDES += -Itests
 QEMU_CFLAGS += -I$(SRC_PATH)/tests
@@ -439,6 +439,7 @@ tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
 tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
 tests/test-qdist$(EXESUF): tests/test-qdist.o $(test-util-obj-y)
 tests/test-qht$(EXESUF): tests/test-qht.o $(test-util-obj-y)
+tests/qht-bench$(EXESUF): tests/qht-bench.o $(test-util-obj-y)
 
 tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
 	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
diff --git a/tests/qht-bench.c b/tests/qht-bench.c
new file mode 100644
index 0000000..30d27c8
--- /dev/null
+++ b/tests/qht-bench.c
@@ -0,0 +1,474 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include <glib.h>
+#include "qemu/processor.h"
+#include "qemu/atomic.h"
+#include "qemu/qht.h"
+#include "qemu/rcu.h"
+#include "exec/tb-hash-xx.h"
+
+struct thread_stats {
+    size_t rd;
+    size_t not_rd;
+    size_t in;
+    size_t not_in;
+    size_t rm;
+    size_t not_rm;
+    size_t rz;
+    size_t not_rz;
+};
+
+struct thread_info {
+    void (*func)(struct thread_info *);
+    struct thread_stats stats;
+    uint64_t r;
+    bool write_op; /* writes alternate between insertions and removals */
+    bool resize_down;
+} QEMU_ALIGNED(64); /* avoid false sharing among threads */
+
+static struct qht ht;
+static QemuThread *rw_threads;
+
+#define DEFAULT_RANGE (4096)
+#define DEFAULT_QHT_N_ELEMS DEFAULT_RANGE
+
+static unsigned int duration = 1;
+static unsigned int n_rw_threads = 1;
+static unsigned long lookup_range = DEFAULT_RANGE;
+static unsigned long update_range = DEFAULT_RANGE;
+static size_t init_range = DEFAULT_RANGE;
+static size_t init_size = DEFAULT_RANGE;
+static long populate_offset;
+static long *keys;
+
+static size_t resize_min;
+static size_t resize_max;
+static struct thread_info *rz_info;
+static unsigned long resize_delay = 1000;
+static double resize_rate; /* 0.0 to 1.0 */
+static unsigned int n_rz_threads = 1;
+static QemuThread *rz_threads;
+
+static double update_rate; /* 0.0 to 1.0 */
+static uint64_t update_threshold;
+static uint64_t resize_threshold;
+
+static size_t qht_n_elems = DEFAULT_QHT_N_ELEMS;
+static int qht_mode;
+
+static bool test_start;
+static bool test_stop;
+
+static struct thread_info *rw_info;
+
+static const char commands_string[] =
+    " -d = duration, in seconds\n"
+    " -n = number of threads\n"
+    "\n"
+    " -k = initial number of keys\n"
+    " -o = offset at which keys start\n"
+    " -K = initial range of keys (will be rounded up to pow2)\n"
+    " -l = lookup range of keys (will be rounded up to pow2)\n"
+    " -r = update range of keys (will be rounded up to pow2)\n"
+    "\n"
+    " -u = update rate (0.0 to 100.0), 50/50 split of insertions/removals\n"
+    "\n"
+    " -s = initial size hint\n"
+    " -R = enable auto-resize\n"
+    " -S = resize rate (0.0 to 100.0)\n"
+    " -D = delay (in us) between potential resizes\n"
+    " -N = number of resize threads";
+
+static void usage_complete(int argc, char *argv[])
+{
+    fprintf(stderr, "Usage: %s [options]\n", argv[0]);
+    fprintf(stderr, "options:\n%s\n", commands_string);
+    exit(-1);
+}
+
+static bool is_equal(const void *obj, const void *userp)
+{
+    const long *a = obj;
+    const long *b = userp;
+
+    return *a == *b;
+}
+
+static inline uint32_t h(unsigned long v)
+{
+    return tb_hash_func5(v, 0, 0);
+}
+
+/*
+ * From: https://en.wikipedia.org/wiki/Xorshift
+ * This is faster than rand_r(), and gives us a wider range (RAND_MAX is only
+ * guaranteed to be >= 32767).
+ */
+static uint64_t xorshift64star(uint64_t x)
+{
+    x ^= x >> 12; /* a */
+    x ^= x << 25; /* b */
+    x ^= x >> 27; /* c */
+    return x * UINT64_C(2685821657736338717);
+}
+
+static void do_rz(struct thread_info *info)
+{
+    struct thread_stats *stats = &info->stats;
+
+    if (info->r < resize_threshold) {
+        size_t size = info->resize_down ? resize_min : resize_max;
+        bool resized;
+
+        resized = qht_resize(&ht, size);
+        info->resize_down = !info->resize_down;
+
+        if (resized) {
+            stats->rz++;
+        } else {
+            stats->not_rz++;
+        }
+    }
+    g_usleep(resize_delay);
+}
+
+static void do_rw(struct thread_info *info)
+{
+    struct thread_stats *stats = &info->stats;
+    uint32_t hash;
+    long *p;
+
+    if (info->r >= update_threshold) {
+        bool read;
+
+        p = &keys[info->r & (lookup_range - 1)];
+        hash = h(*p);
+        read = qht_lookup(&ht, is_equal, p, hash);
+        if (read) {
+            stats->rd++;
+        } else {
+            stats->not_rd++;
+        }
+    } else {
+        p = &keys[info->r & (update_range - 1)];
+        hash = h(*p);
+        if (info->write_op) {
+            bool written = false;
+
+            if (qht_lookup(&ht, is_equal, p, hash) == NULL) {
+                written = qht_insert(&ht, p, hash);
+            }
+            if (written) {
+                stats->in++;
+            } else {
+                stats->not_in++;
+            }
+        } else {
+            bool removed = false;
+
+            if (qht_lookup(&ht, is_equal, p, hash)) {
+                removed = qht_remove(&ht, p, hash);
+            }
+            if (removed) {
+                stats->rm++;
+            } else {
+                stats->not_rm++;
+            }
+        }
+        info->write_op = !info->write_op;
+    }
+}
+
+static void *thread_func(void *p)
+{
+    struct thread_info *info = p;
+
+    while (!atomic_mb_read(&test_start)) {
+        cpu_relax();
+    }
+
+    rcu_register_thread();
+
+    rcu_read_lock();
+    while (!atomic_read(&test_stop)) {
+        info->r = xorshift64star(info->r);
+        info->func(info);
+    }
+    rcu_read_unlock();
+
+    rcu_unregister_thread();
+    return NULL;
+}
+
+/* sets everything except info->func */
+static void prepare_thread_info(struct thread_info *info, int i)
+{
+    /* seed for the RNG; each thread should have a different one */
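+    /* note: the seed should be nonzero, since 0 is a fixed point of the RNG */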
+    info->r = (i + 1) ^ time(NULL);
+    /* the first update will be a write */
+    info->write_op = true;
+    /* the first resize will be down */
+    info->resize_down = true;
+
+    memset(&info->stats, 0, sizeof(info->stats));
+}
+
+static void
+th_create_n(QemuThread **threads, struct thread_info **infos, const char *name,
+            void (*func)(struct thread_info *), int offset, int n)
+{
+    struct thread_info *info;
+    QemuThread *th;
+    int i;
+
+    th = g_malloc(sizeof(*th) * n);
+    *threads = th;
+
+    info = qemu_memalign(64, sizeof(*info) * n);
+    *infos = info;
+
+    for (i = 0; i < n; i++) {
+        prepare_thread_info(&info[i], offset + i);
+        info[i].func = func;
+        qemu_thread_create(&th[i], name, thread_func, &info[i],
+                           QEMU_THREAD_JOINABLE);
+    }
+}
+
+static void create_threads(void)
+{
+    th_create_n(&rw_threads, &rw_info, "rw", do_rw, 0, n_rw_threads);
+    th_create_n(&rz_threads, &rz_info, "rz", do_rz, n_rw_threads, n_rz_threads);
+}
+
+static void pr_params(void)
+{
+    printf("Parameters:\n");
+    printf(" duration:          %d s\n", duration);
+    printf(" # of threads:      %u\n", n_rw_threads);
+    printf(" initial # of keys: %zu\n", init_size);
+    printf(" initial size hint: %zu\n", qht_n_elems);
+    printf(" auto-resize:       %s\n",
+           qht_mode & QHT_MODE_AUTO_RESIZE ? "on" : "off");
+    if (resize_rate) {
+        printf(" resize_rate:       %f%%\n", resize_rate * 100.0);
+        printf(" resize range:      %zu-%zu\n", resize_min, resize_max);
+        printf(" # resize threads   %u\n", n_rz_threads);
+    }
+    printf(" update rate:       %f%%\n", update_rate * 100.0);
+    printf(" offset:            %ld\n", populate_offset);
+    printf(" initial key range: %zu\n", init_range);
+    printf(" lookup range:      %zu\n", lookup_range);
+    printf(" update range:      %zu\n", update_range);
+}
+
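+/*
+ * Convert a rate in [0.0, 1.0] into a threshold over the full uint64_t range.
+ * 1.0 is special-cased: 1.0 * UINT64_MAX, computed in double, rounds up to
+ * 2^64, which would overflow the conversion back to uint64_t.
+ */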
+static void do_threshold(double rate, uint64_t *threshold)
+{
+    if (rate == 1.0) {
+        *threshold = UINT64_MAX;
+    } else {
+        *threshold = rate * UINT64_MAX;
+    }
+}
+
+static void htable_init(void)
+{
+    unsigned long n = MAX(init_range, update_range);
+    uint64_t r = time(NULL);
+    size_t retries = 0;
+    size_t i;
+
+    /* avoid allocating memory later by allocating all the keys now */
+    keys = g_malloc(sizeof(*keys) * n);
+    for (i = 0; i < n; i++) {
+        keys[i] = populate_offset + i;
+    }
+
+    /* some sanity checks */
+    g_assert_cmpuint(lookup_range, <=, n);
+
+    /* compute thresholds */
+    do_threshold(update_rate, &update_threshold);
+    do_threshold(resize_rate, &resize_threshold);
+
+    if (resize_rate) {
+        resize_min = n / 2;
+        resize_max = n;
+        assert(resize_min < resize_max);
+    } else {
+        n_rz_threads = 0;
+    }
+
+    /* initialize the hash table */
+    qht_init(&ht, qht_n_elems, qht_mode);
+    assert(init_size <= init_range);
+
+    pr_params();
+
+    fprintf(stderr, "Initialization: populating %zu items...", init_size);
+    for (i = 0; i < init_size; i++) {
+        for (;;) {
+            uint32_t hash;
+            long *p;
+
+            r = xorshift64star(r);
+            p = &keys[r & (init_range - 1)];
+            hash = h(*p);
+            if (qht_insert(&ht, p, hash)) {
+                break;
+            }
+            retries++;
+        }
+    }
+    fprintf(stderr, " populated after %zu retries\n", retries);
+}
+
+static void add_stats(struct thread_stats *s, struct thread_info *info, int n)
+{
+    int i;
+
+    for (i = 0; i < n; i++) {
+        struct thread_stats *stats = &info[i].stats;
+
+        s->rd += stats->rd;
+        s->not_rd += stats->not_rd;
+
+        s->in += stats->in;
+        s->not_in += stats->not_in;
+
+        s->rm += stats->rm;
+        s->not_rm += stats->not_rm;
+
+        s->rz += stats->rz;
+        s->not_rz += stats->not_rz;
+    }
+}
+
+static void pr_stats(void)
+{
+    struct thread_stats s = {};
+    double tx;
+
+    add_stats(&s, rw_info, n_rw_threads);
+    add_stats(&s, rz_info, n_rz_threads);
+
+    printf("Results:\n");
+
+    if (resize_rate) {
+        printf(" Resizes:           %zu (%.2f%% of %zu)\n",
+               s.rz, (double)s.rz / (s.rz + s.not_rz) * 100, s.rz + s.not_rz);
+    }
+
+    printf(" Read:              %.2f M (%.2f%% of %.2fM)\n",
+           (double)s.rd / 1e6,
+           (double)s.rd / (s.rd + s.not_rd) * 100,
+           (double)(s.rd + s.not_rd) / 1e6);
+    printf(" Inserted:          %.2f M (%.2f%% of %.2fM)\n",
+           (double)s.in / 1e6,
+           (double)s.in / (s.in + s.not_in) * 100,
+           (double)(s.in + s.not_in) / 1e6);
+    printf(" Removed:           %.2f M (%.2f%% of %.2fM)\n",
+           (double)s.rm / 1e6,
+           (double)s.rm / (s.rm + s.not_rm) * 100,
+           (double)(s.rm + s.not_rm) / 1e6);
+
+    tx = (s.rd + s.not_rd + s.in + s.not_in + s.rm + s.not_rm) / 1e6 / duration;
+    printf(" Throughput:        %.2f MT/s\n", tx);
+    printf(" Throughput/thread: %.2f MT/s/thread\n", tx / n_rw_threads);
+}
+
+static void run_test(void)
+{
+    unsigned int remaining;
+    int i;
+
+    atomic_mb_set(&test_start, true);
+    do {
+        remaining = sleep(duration);
+    } while (remaining);
+    atomic_mb_set(&test_stop, true);
+
+    for (i = 0; i < n_rw_threads; i++) {
+        qemu_thread_join(&rw_threads[i]);
+    }
+    for (i = 0; i < n_rz_threads; i++) {
+        qemu_thread_join(&rz_threads[i]);
+    }
+}
+
+static void parse_args(int argc, char *argv[])
+{
+    int c;
+
+    for (;;) {
+        c = getopt(argc, argv, "d:D:k:K:l:hn:N:o:r:Rs:S:u:");
+        if (c < 0) {
+            break;
+        }
+        switch (c) {
+        case 'd':
+            duration = atoi(optarg);
+            break;
+        case 'D':
+            resize_delay = atol(optarg);
+            break;
+        case 'h':
+            usage_complete(argc, argv);
+            exit(0);
+        case 'k':
+            init_size = atol(optarg);
+            break;
+        case 'K':
+            init_range = pow2ceil(atol(optarg));
+            break;
+        case 'l':
+            lookup_range = pow2ceil(atol(optarg));
+            break;
+        case 'n':
+            n_rw_threads = atoi(optarg);
+            break;
+        case 'N':
+            n_rz_threads = atoi(optarg);
+            break;
+        case 'o':
+            populate_offset = atol(optarg);
+            break;
+        case 'r':
+            update_range = pow2ceil(atol(optarg));
+            break;
+        case 'R':
+            qht_mode |= QHT_MODE_AUTO_RESIZE;
+            break;
+        case 's':
+            qht_n_elems = atol(optarg);
+            break;
+        case 'S':
+            resize_rate = atof(optarg) / 100.0;
+            if (resize_rate > 1.0) {
+                resize_rate = 1.0;
+            }
+            break;
+        case 'u':
+            update_rate = atof(optarg) / 100.0;
+            if (update_rate > 1.0) {
+                update_rate = 1.0;
+            }
+            break;
+        }
+    }
+}
+
+int main(int argc, char *argv[])
+{
+    parse_args(argc, argv);
+    htable_init();
+    create_threads();
+    run_test();
+    pr_stats();
+    return 0;
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Qemu-devel] [PATCH v6 13/15] qht: add test-qht-par to invoke qht-bench from 'check' target
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
                   ` (11 preceding siblings ...)
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 12/15] qht: add qht-bench, a performance benchmark Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-29 20:53   ` Sergey Fedorov
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 14/15] tb hash: track translated blocks with qht Emilio G. Cota
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tests/.gitignore     |  1 +
 tests/Makefile       |  5 ++++-
 tests/test-qht-par.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 1 deletion(-)
 create mode 100644 tests/test-qht-par.c

diff --git a/tests/.gitignore b/tests/.gitignore
index d19023e..840ea39 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -52,6 +52,7 @@ test-qemu-opts
 test-qdist
 test-qga
 test-qht
+test-qht-par
 test-qmp-commands
 test-qmp-commands.h
 test-qmp-event
diff --git a/tests/Makefile b/tests/Makefile
index 176bbd8..b4e4e21 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -74,6 +74,8 @@ check-unit-y += tests/test-qdist$(EXESUF)
 gcov-files-test-qdist-y = util/qdist.c
 check-unit-y += tests/test-qht$(EXESUF)
 gcov-files-test-qht-y = util/qht.c
+check-unit-y += tests/test-qht-par$(EXESUF)
+gcov-files-test-qht-par-y = util/qht.c
 check-unit-y += tests/test-bitops$(EXESUF)
 check-unit-$(CONFIG_HAS_GLIB_SUBPROCESS_TESTS) += tests/test-qdev-global-props$(EXESUF)
 check-unit-y += tests/check-qom-interface$(EXESUF)
@@ -398,7 +400,7 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
 	tests/test-opts-visitor.o tests/test-qmp-event.o \
 	tests/rcutorture.o tests/test-rcu-list.o \
 	tests/test-qdist.o \
-	tests/test-qht.o tests/qht-bench.o
+	tests/test-qht.o tests/qht-bench.o tests/test-qht-par.o
 
 $(test-obj-y): QEMU_INCLUDES += -Itests
 QEMU_CFLAGS += -I$(SRC_PATH)/tests
@@ -439,6 +441,7 @@ tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
 tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
 tests/test-qdist$(EXESUF): tests/test-qdist.o $(test-util-obj-y)
 tests/test-qht$(EXESUF): tests/test-qht.o $(test-util-obj-y)
+tests/test-qht-par$(EXESUF): tests/test-qht-par.o tests/qht-bench$(EXESUF) $(test-util-obj-y)
 tests/qht-bench$(EXESUF): tests/qht-bench.o $(test-util-obj-y)
 
 tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
diff --git a/tests/test-qht-par.c b/tests/test-qht-par.c
new file mode 100644
index 0000000..fc0cb23
--- /dev/null
+++ b/tests/test-qht-par.c
@@ -0,0 +1,56 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#include "qemu/osdep.h"
+#include <glib.h>
+
+#define TEST_QHT_STRING "tests/qht-bench 1>/dev/null 2>&1 -R -S0.1 -D10000 -N1 "
+
+static void test_qht(int n_threads, int update_rate, int duration)
+{
+    char *str;
+    int rc;
+
+    str = g_strdup_printf(TEST_QHT_STRING "-n %d -u %d -d %d",
+                          n_threads, update_rate, duration);
+    rc = system(str);
+    g_free(str);
+    g_assert_cmpint(rc, ==, 0);
+}
+
+static void test_2th0u1s(void)
+{
+    test_qht(2, 0, 1);
+}
+
+static void test_2th20u1s(void)
+{
+    test_qht(2, 20, 1);
+}
+
+static void test_2th0u5s(void)
+{
+    test_qht(2, 0, 5);
+}
+
+static void test_2th20u5s(void)
+{
+    test_qht(2, 20, 5);
+}
+
+int main(int argc, char *argv[])
+{
+    g_test_init(&argc, &argv, NULL);
+
+    if (g_test_quick()) {
+        g_test_add_func("/qht/parallel/2threads-0%updates-1s", test_2th0u1s);
+        g_test_add_func("/qht/parallel/2threads-20%updates-1s", test_2th20u1s);
+    } else {
+        g_test_add_func("/qht/parallel/2threads-0%updates-5s", test_2th0u5s);
+        g_test_add_func("/qht/parallel/2threads-20%updates-5s", test_2th20u5s);
+    }
+    return g_test_run();
+}
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Qemu-devel] [PATCH v6 14/15] tb hash: track translated blocks with qht
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
                   ` (12 preceding siblings ...)
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 13/15] qht: add test-qht-par to invoke qht-bench from 'check' target Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-29 21:09   ` Sergey Fedorov
  2016-05-31  8:39   ` Alex Bennée
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 15/15] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
  2016-06-08  6:25 ` [Qemu-devel] [PATCH v6 00/15] tb hash improvements Alex Bennée
  15 siblings, 2 replies; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

Having a fixed-size hash table for keeping track of all translation blocks
is suboptimal: some workloads are just too big or too small to get maximum
performance from the hash table. The MRU promotion policy helps improve
performance when the hash table is a little undersized, but it cannot
make up for severely undersized hash tables.

Furthermore, frequent MRU promotions result in writes that are a scalability
bottleneck. For scalability, lookups should only perform reads, not writes.
This is not a big deal for now, but it will become one once MTTCG matures.
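
For reference, the MRU promotion in master's tb_find_physical is the
write below (copied from the hunk this patch removes):

    if (tb) {
        /* Move the TB to the head of the list */
        *ptb1 = tb->phys_hash_next;
        tb->phys_hash_next = *tb_hash_head;
        *tb_hash_head = tb;
    }

Every lookup hit therefore writes to the shared bucket head; that write
is the scalability bottleneck referred to above.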

The appended patch fixes these issues by using qht as the implementation of
the TB hash table. This solution is superior to the other alternatives
considered, namely:

- master: implementation in QEMU before this patchset
- xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
- xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
              MRU is implemented here by adding an intermediate struct
              that contains the u32 hash and a pointer to the TB; this
              allows us, on an MRU promotion, to copy said struct (that is not
              at the head), and put this new copy at the head. After a grace
              period, the original non-head struct can be eliminated, and
              after another grace period, freed.
- qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
                   no MRU for lookups; MRU for inserts.
The appended solution is the following:
- qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
                 no MRU for lookups; MRU for inserts.
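
A sketch of the intermediate struct used by the xxhash-rcu variant
(names are hypothetical; this variant was only measured for comparison
and is not what this patch implements):

    struct tb_mru_node {              /* illustrative name only */
        struct tb_mru_node *next;     /* RCU-protected chain pointer */
        uint32_t hash;
        TranslationBlock *tb;
    };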

The plots below compare the considered solutions. The Y axis shows the
boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
sweeps the number of buckets (or initial number of buckets for qht-autoresize).
The plots in PNG format (and with errorbars) can be seen here:
  http://imgur.com/a/Awgnq

Each test runs 5 times, and the entire QEMU process is pinned to a
single core for repeatability of results.

                            Host: Intel Xeon E5-2690

  28 ++------------+-------------+-------------+-------------+------------++
     A*****        +             +             +             master **A*** +
  27 ++    *                                                 xxhash ##B###++
     |      A******A******                               xxhash-rcu $$C$$$ |
  26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
     D%%$$                              A******A******A*qht-dyn-mru A*E****A
  25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
     B#####%                                                               |
  24 ++    #C$$$$$                                                        ++
     |      B###  $                                                        |
     |          ## C$$$$$$                                                 |
  23 ++           #       C$$$$$$                                         ++
     |             B######       C$$$$$$                                %%%D
  22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
     |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
  21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
     +             E@@@   F&&&   +      E@     +      F&&&   +             +
  20 ++------------+-------------+-------------+-------------+------------++
     14            16            18            20            22            24
                             log2 number of buckets

                                 Host: Intel i7-4790K

  14.5 ++------------+------------+-------------+------------+------------++
       A**           +            +             +            master **A*** +
    14 ++ **                                                 xxhash ##B###++
  13.5 ++   **                                           xxhash-rcu $$C$$$++
       |                                            qht-fixed-nomru %%D%%% |
    13 ++     A******                                   qht-dyn-mru @@E@@@++
       |             A*****A******A******             qht-dyn-nomru &&F&&& |
  12.5 C$$                               A******A******A*****A******    ***A
    12 ++ $$                                                        A***  ++
       D%%% $$                                                             |
  11.5 ++  %%                                                             ++
       B###  %C$$$$$$                                                      |
    11 ++  ## D%%%%% C$$$$$                                               ++
       |     #      %      C$$$$$$                                         |
  10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
    10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
       +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
   9.5 ++------------+------------+-------------+------------+------------++
       14            16           18            20           22            24
                              log2 number of buckets

Note that the original point before this patch series is X=15 for "master";
the little sensitivity to the increased number of buckets is due to the
poor hashing function in master.

xxhash-rcu has significant overhead due to the constant churn of allocating
and deallocating intermediate structs for implementing MRU. An alternative
would be to consider failed lookups as "maybe not there", and then
acquire the external lock (tb_lock in this case) to confirm that the
lookup really did fail. This, however, would not be enough
to implement dynamic resizing--this is more complex: see
"Resizable, Scalable, Concurrent Hash Tables via Relativistic
Programming" by Triplett, McKenney and Walpole. This solution was
discarded due to the very coarse RCU read critical sections that we have
in MTTCG; resizing requires waiting for readers after every pointer update,
and resizes require many pointer updates, so this would quickly become
prohibitive.

qht-fixed-nomru shows that MRU promotion is advisable for undersized
hash tables.

However, qht-dyn-mru shows that MRU promotion is not important if the
hash table is properly sized: there is virtually no difference in
performance between qht-dyn-nomru and qht-dyn-mru.

Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
can achieve with optimum sizing of the hash table, while keeping the hash
table scalable for readers.

The improvement we get before and after this patch for booting debian jessie
with arm-softmmu is:

- Intel Xeon E5-2690: 10.5% less time
- Intel i7-4790K: 5.2% less time

We could get this same improvement _for this particular workload_ by
statically increasing the size of the hash table. But this would hurt
workloads that do not need a large hash table. The dynamic (upward)
resizing allows us to start small and enlarge the hash table as needed.

A quick note on downsizing: the table is resized back to 2**15 buckets
on every tb_flush; this makes sense because it is not guaranteed that the
table will reach the same number of TBs later on (e.g. most bootup code is
thrown away after boot), so it is better to start small again and grow
the hash table as more code blocks are translated. This also avoids the
complication of having to build downsizing hysteresis logic into qht.
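
Concretely, the policy is a single call on the flush path (as in the
translate-all.c hunk below); growth is then left to QHT_MODE_AUTO_RESIZE
as translation refills the table:

    /* in tb_flush(): all TBs are gone, shrink back to the initial size */
    qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);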

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c              | 86 ++++++++++++++++++++++++-------------------------
 include/exec/exec-all.h |  9 +++---
 include/exec/tb-hash.h  |  3 +-
 translate-all.c         | 85 ++++++++++++++++++++++--------------------------
 4 files changed, 86 insertions(+), 97 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 1735032..6a2350d 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -224,57 +224,57 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
 }
 #endif
 
+struct tb_desc {
+    target_ulong pc;
+    target_ulong cs_base;
+    CPUArchState *env;
+    tb_page_addr_t phys_page1;
+    uint32_t flags;
+};
+
+static bool tb_cmp(const void *p, const void *d)
+{
+    const TranslationBlock *tb = p;
+    const struct tb_desc *desc = d;
+
+    if (tb->pc == desc->pc &&
+        tb->page_addr[0] == desc->phys_page1 &&
+        tb->cs_base == desc->cs_base &&
+        tb->flags == desc->flags) {
+        /* check next page if needed */
+        if (tb->page_addr[1] == -1) {
+            return true;
+        } else {
+            tb_page_addr_t phys_page2;
+            target_ulong virt_page2;
+
+            virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
+            phys_page2 = get_page_addr_code(desc->env, virt_page2);
+            if (tb->page_addr[1] == phys_page2) {
+                return true;
+            }
+        }
+    }
+    return false;
+}
+
 static TranslationBlock *tb_find_physical(CPUState *cpu,
                                           target_ulong pc,
                                           target_ulong cs_base,
                                           uint32_t flags)
 {
-    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
-    TranslationBlock *tb, **tb_hash_head, **ptb1;
+    tb_page_addr_t phys_pc;
+    struct tb_desc desc;
     uint32_t h;
-    tb_page_addr_t phys_pc, phys_page1;
 
-    /* find translated block using physical mappings */
-    phys_pc = get_page_addr_code(env, pc);
-    phys_page1 = phys_pc & TARGET_PAGE_MASK;
+    desc.env = (CPUArchState *)cpu->env_ptr;
+    desc.cs_base = cs_base;
+    desc.flags = flags;
+    desc.pc = pc;
+    phys_pc = get_page_addr_code(desc.env, pc);
+    desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
     h = tb_hash_func(phys_pc, pc, flags);
-
-    /* Start at head of the hash entry */
-    ptb1 = tb_hash_head = &tcg_ctx.tb_ctx.tb_phys_hash[h];
-    tb = *ptb1;
-
-    while (tb) {
-        if (tb->pc == pc &&
-            tb->page_addr[0] == phys_page1 &&
-            tb->cs_base == cs_base &&
-            tb->flags == flags) {
-
-            if (tb->page_addr[1] == -1) {
-                /* done, we have a match */
-                break;
-            } else {
-                /* check next page if needed */
-                target_ulong virt_page2 = (pc & TARGET_PAGE_MASK) +
-                                          TARGET_PAGE_SIZE;
-                tb_page_addr_t phys_page2 = get_page_addr_code(env, virt_page2);
-
-                if (tb->page_addr[1] == phys_page2) {
-                    break;
-                }
-            }
-        }
-
-        ptb1 = &tb->phys_hash_next;
-        tb = *ptb1;
-    }
-
-    if (tb) {
-        /* Move the TB to the head of the list */
-        *ptb1 = tb->phys_hash_next;
-        tb->phys_hash_next = *tb_hash_head;
-        *tb_hash_head = tb;
-    }
-    return tb;
+    return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
 }
 
 static TranslationBlock *tb_find_slow(CPUState *cpu,
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 85528f9..68e73b6 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -21,6 +21,7 @@
 #define _EXEC_ALL_H_
 
 #include "qemu-common.h"
+#include "qemu/qht.h"
 
 /* allow to see translation results - the slowdown should be negligible, so we leave it */
 #define DEBUG_DISAS
@@ -212,8 +213,8 @@ static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
 
 #define CODE_GEN_ALIGN           16 /* must be >= of the size of a icache line */
 
-#define CODE_GEN_PHYS_HASH_BITS     15
-#define CODE_GEN_PHYS_HASH_SIZE     (1 << CODE_GEN_PHYS_HASH_BITS)
+#define CODE_GEN_HTABLE_BITS     15
+#define CODE_GEN_HTABLE_SIZE     (1 << CODE_GEN_HTABLE_BITS)
 
 /* Estimated block size for TB allocation.  */
 /* ??? The following is based on a 2015 survey of x86_64 host output.
@@ -250,8 +251,6 @@ struct TranslationBlock {
 
     void *tc_ptr;    /* pointer to the translated code */
     uint8_t *tc_search;  /* pointer to search data */
-    /* next matching tb for physical address. */
-    struct TranslationBlock *phys_hash_next;
     /* original tb when cflags has CF_NOCACHE */
     struct TranslationBlock *orig_tb;
     /* first and second physical page containing code. The lower bit
@@ -296,7 +295,7 @@ typedef struct TBContext TBContext;
 struct TBContext {
 
     TranslationBlock *tbs;
-    TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
+    struct qht htable;
     int nb_tbs;
     /* any access to the tbs or the page table must use this lock */
     QemuMutex tb_lock;
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 88ccfd1..1d0200b 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -20,7 +20,6 @@
 #ifndef EXEC_TB_HASH
 #define EXEC_TB_HASH
 
-#include "exec/exec-all.h"
 #include "exec/tb-hash-xx.h"
 
 /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
@@ -49,7 +48,7 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
 static inline
 uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint32_t flags)
 {
-    return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
+    return tb_hash_func5(phys_pc, pc, flags);
 }
 
 #endif
diff --git a/translate-all.c b/translate-all.c
index c48fccb..5357737 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -734,6 +734,13 @@ static inline void code_gen_alloc(size_t tb_size)
     qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
 }
 
+static void tb_htable_init(void)
+{
+    unsigned int mode = QHT_MODE_AUTO_RESIZE;
+
+    qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
+}
+
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
    (in bytes) allocated to the translation buffer. Zero means default
    size. */
@@ -741,6 +748,7 @@ void tcg_exec_init(unsigned long tb_size)
 {
     cpu_gen_init();
     page_init();
+    tb_htable_init();
     code_gen_alloc(tb_size);
 #if defined(CONFIG_SOFTMMU)
     /* There's no guest base to take into account, so go ahead and
@@ -845,7 +853,7 @@ void tb_flush(CPUState *cpu)
         cpu->tb_flushed = true;
     }
 
-    memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
+    qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
     tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
@@ -856,60 +864,46 @@ void tb_flush(CPUState *cpu)
 
 #ifdef DEBUG_TB_CHECK
 
-static void tb_invalidate_check(target_ulong address)
+static void
+do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 {
-    TranslationBlock *tb;
-    int i;
+    TranslationBlock *tb = p;
+    target_ulong addr = *(target_ulong *)userp;
 
-    address &= TARGET_PAGE_MASK;
-    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
-        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
-             tb = tb->phys_hash_next) {
-            if (!(address + TARGET_PAGE_SIZE <= tb->pc ||
-                  address >= tb->pc + tb->size)) {
-                printf("ERROR invalidate: address=" TARGET_FMT_lx
-                       " PC=%08lx size=%04x\n",
-                       address, (long)tb->pc, tb->size);
-            }
-        }
+    if (!(addr + TARGET_PAGE_SIZE <= tb->pc || addr >= tb->pc + tb->size)) {
+        printf("ERROR invalidate: address=" TARGET_FMT_lx
+               " PC=%08lx size=%04x\n", addr, (long)tb->pc, tb->size);
     }
 }
 
-/* verify that all the pages have correct rights for code */
-static void tb_page_check(void)
+static void tb_invalidate_check(target_ulong address)
 {
-    TranslationBlock *tb;
-    int i, flags1, flags2;
-
-    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
-        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
-                tb = tb->phys_hash_next) {
-            flags1 = page_get_flags(tb->pc);
-            flags2 = page_get_flags(tb->pc + tb->size - 1);
-            if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
-                printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
-                       (long)tb->pc, tb->size, flags1, flags2);
-            }
-        }
-    }
+    address &= TARGET_PAGE_MASK;
+    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address);
 }
 
-#endif
-
-static inline void tb_hash_remove(TranslationBlock **ptb, TranslationBlock *tb)
+static void
+do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 {
-    TranslationBlock *tb1;
+    TranslationBlock *tb = p;
+    int flags1, flags2;
 
-    for (;;) {
-        tb1 = *ptb;
-        if (tb1 == tb) {
-            *ptb = tb1->phys_hash_next;
-            break;
-        }
-        ptb = &tb1->phys_hash_next;
+    flags1 = page_get_flags(tb->pc);
+    flags2 = page_get_flags(tb->pc + tb->size - 1);
+    if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
+        printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
+               (long)tb->pc, tb->size, flags1, flags2);
     }
 }
 
+/* verify that all the pages have correct rights for code */
+static void tb_page_check(void)
+{
+    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
+}
+
+#endif
+
 static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
 {
     TranslationBlock *tb1;
@@ -997,7 +991,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
     h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
+    qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
 
     /* remove the TB from the page list */
     if (tb->page_addr[0] != page_addr) {
@@ -1127,13 +1121,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                          tb_page_addr_t phys_page2)
 {
     uint32_t h;
-    TranslationBlock **ptb;
 
     /* add in the hash table */
     h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
-    tb->phys_hash_next = *ptb;
-    *ptb = tb;
+    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
 
     /* add in the page list */
     tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Qemu-devel] [PATCH v6 15/15] translate-all: add tb hash bucket info to 'info jit' dump
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
                   ` (13 preceding siblings ...)
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 14/15] tb hash: track translated blocks with qht Emilio G. Cota
@ 2016-05-25  1:13 ` Emilio G. Cota
  2016-05-29 21:14   ` Sergey Fedorov
  2016-06-08  6:25 ` [Qemu-devel] [PATCH v6 00/15] tb hash improvements Alex Bennée
  15 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-25  1:13 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson, Sergey Fedorov

Examples:

- Good hashing, i.e. tb_hash_func5(phys_pc, pc, flags):
TB count            715135/2684354
[...]
TB hash buckets     388775/524288 (74.15% head buckets used)
TB hash occupancy   33.04% avg chain occ. Histogram: [0,10)%|▆ █  ▅▁▃▁▁|[90,100]%
TB hash avg chain   1.017 buckets. Histogram: 1|█▁▁|3

- Not-so-good hashing, i.e. tb_hash_func5(phys_pc, pc, 0):
TB count            712636/2684354
[...]
TB hash buckets     344924/524288 (65.79% head buckets used)
TB hash occupancy   31.64% avg chain occ. Histogram: [0,10)%|█ ▆  ▅▁▃▁▂|[90,100]%
TB hash avg chain   1.047 buckets. Histogram: 1|█▁▁▁|4

- Bad hashing, i.e. tb_hash_func5(phys_pc, 0, 0):
TB count            702818/2684354
[...]
TB hash buckets     112741/524288 (21.50% head buckets used)
TB hash occupancy   10.15% avg chain occ. Histogram: [0,10)%|█ ▁  ▁▁▁▁▁|[90,100]%
TB hash avg chain   2.107 buckets. Histogram: [1.0,10.2)|█▁▁▁▁▁▁▁▁▁|[83.8,93.0]

- Good hashing, but no auto-resize:
TB count            715634/2684354
TB hash buckets     8192/8192 (100.00% head buckets used)
TB hash occupancy   98.30% avg chain occ. Histogram: [95.3,95.8)%|▁▁▃▄▃▄▁▇▁█|[99.5,100.0]%
TB hash avg chain   22.070 buckets. Histogram: [15.0,16.7)|▁▂▅▄█▅▁▁▁▁|[30.3,32.0]

Suggested-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 translate-all.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/translate-all.c b/translate-all.c
index 5357737..c8074cf 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1667,6 +1667,10 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     int i, target_code_size, max_target_code_size;
     int direct_jmp_count, direct_jmp2_count, cross_page;
     TranslationBlock *tb;
+    struct qht_stats hst;
+    uint32_t hgram_opts;
+    size_t hgram_bins;
+    char *hgram;
 
     target_code_size = 0;
     max_target_code_size = 0;
@@ -1717,6 +1721,38 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
                 direct_jmp2_count,
                 tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp2_count * 100) /
                         tcg_ctx.tb_ctx.nb_tbs : 0);
+
+    qht_statistics_init(&tcg_ctx.tb_ctx.htable, &hst);
+
+    cpu_fprintf(f, "TB hash buckets     %zu/%zu (%0.2f%% head buckets used)\n",
+                hst.used_head_buckets, hst.head_buckets,
+                (double)hst.used_head_buckets / hst.head_buckets * 100);
+
+    hgram_opts =  QDIST_PR_BORDER | QDIST_PR_LABELS;
+    hgram_opts |= QDIST_PR_100X   | QDIST_PR_PERCENT;
+    if (qdist_xmax(&hst.occupancy) - qdist_xmin(&hst.occupancy) == 1) {
+        hgram_opts |= QDIST_PR_NODECIMAL;
+    }
+    hgram = qdist_pr(&hst.occupancy, 10, hgram_opts);
+    cpu_fprintf(f, "TB hash occupancy   %0.2f%% avg chain occ. Histogram: %s\n",
+                qdist_avg(&hst.occupancy) * 100, hgram);
+    g_free(hgram);
+
+    hgram_opts = QDIST_PR_BORDER | QDIST_PR_LABELS;
+    hgram_bins = qdist_xmax(&hst.chain) - qdist_xmin(&hst.chain);
+    if (hgram_bins > 10) {
+        hgram_bins = 10;
+    } else {
+        hgram_bins = 0;
+        hgram_opts |= QDIST_PR_NODECIMAL | QDIST_PR_NOBINRANGE;
+    }
+    hgram = qdist_pr(&hst.chain, hgram_bins, hgram_opts);
+    cpu_fprintf(f, "TB hash avg chain   %0.3f buckets. Histogram: %s\n",
+                qdist_avg(&hst.chain), hgram);
+    g_free(hgram);
+
+    qht_statistics_destroy(&hst);
+
     cpu_fprintf(f, "\nStatistics:\n");
     cpu_fprintf(f, "TB flush count      %d\n", tcg_ctx.tb_ctx.tb_flush_count);
     cpu_fprintf(f, "TB invalidate count %d\n",
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 01/15] compiler.h: add QEMU_ALIGNED() to enforce struct alignment
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 01/15] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
@ 2016-05-27 19:54   ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-27 19:54 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson

On 25/05/16 04:13, Emilio G. Cota wrote:
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>

> ---
>  include/qemu/compiler.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
> index 8f1cc7b..b64f899 100644
> --- a/include/qemu/compiler.h
> +++ b/include/qemu/compiler.h
> @@ -41,6 +41,8 @@
>  # define QEMU_PACKED __attribute__((packed))
>  #endif
>  
> +#define QEMU_ALIGNED(X) __attribute__((aligned(X)))
> +
>  #ifndef glue
>  #define xglue(x, y) x ## y
>  #define glue(x, y) xglue(x, y)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 02/15] seqlock: remove optional mutex
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 02/15] seqlock: remove optional mutex Emilio G. Cota
@ 2016-05-27 19:55   ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-27 19:55 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson

On 25/05/16 04:13, Emilio G. Cota wrote:
> This option is unused; besides, it bloats the struct when not needed.
> Let's just let writers define their own locks elsewhere.
>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>

> ---
>  cpus.c                 |  2 +-
>  include/qemu/seqlock.h | 10 +---------
>  2 files changed, 2 insertions(+), 10 deletions(-)
>
> diff --git a/cpus.c b/cpus.c
> index cbeb1f6..dd86da5 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -619,7 +619,7 @@ int cpu_throttle_get_percentage(void)
>  
>  void cpu_ticks_init(void)
>  {
> -    seqlock_init(&timers_state.vm_clock_seqlock, NULL);
> +    seqlock_init(&timers_state.vm_clock_seqlock);
>      vmstate_register(NULL, 0, &vmstate_timers, &timers_state);
>      throttle_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL_RT,
>                                             cpu_throttle_timer_tick, NULL);
> diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
> index 70b01fd..e673482 100644
> --- a/include/qemu/seqlock.h
> +++ b/include/qemu/seqlock.h
> @@ -19,22 +19,17 @@
>  typedef struct QemuSeqLock QemuSeqLock;
>  
>  struct QemuSeqLock {
> -    QemuMutex *mutex;
>      unsigned sequence;
>  };
>  
> -static inline void seqlock_init(QemuSeqLock *sl, QemuMutex *mutex)
> +static inline void seqlock_init(QemuSeqLock *sl)
>  {
> -    sl->mutex = mutex;
>      sl->sequence = 0;
>  }
>  
>  /* Lock out other writers and update the count.  */
>  static inline void seqlock_write_lock(QemuSeqLock *sl)
>  {
> -    if (sl->mutex) {
> -        qemu_mutex_lock(sl->mutex);
> -    }
>      ++sl->sequence;
>  
>      /* Write sequence before updating other fields.  */
> @@ -47,9 +42,6 @@ static inline void seqlock_write_unlock(QemuSeqLock *sl)
>      smp_wmb();
>  
>      ++sl->sequence;
> -    if (sl->mutex) {
> -        qemu_mutex_unlock(sl->mutex);
> -    }
>  }
>  
>  static inline unsigned seqlock_read_begin(QemuSeqLock *sl)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 03/15] seqlock: rename write_lock/unlock to write_begin/end
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 03/15] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
@ 2016-05-27 19:59   ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-27 19:59 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson

On 25/05/16 04:13, Emilio G. Cota wrote:
> It is a more appropriate name, now that the mutex embedded
> in the seqlock is gone.
>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>

> ---
>  cpus.c                 | 28 ++++++++++++++--------------
>  include/qemu/seqlock.h |  4 ++--
>  2 files changed, 16 insertions(+), 16 deletions(-)
>
> diff --git a/cpus.c b/cpus.c
> index dd86da5..735c9b2 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -247,13 +247,13 @@ int64_t cpu_get_clock(void)
>  void cpu_enable_ticks(void)
>  {
>      /* Here, the really thing protected by seqlock is cpu_clock_offset. */
> -    seqlock_write_lock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_begin(&timers_state.vm_clock_seqlock);
>      if (!timers_state.cpu_ticks_enabled) {
>          timers_state.cpu_ticks_offset -= cpu_get_host_ticks();
>          timers_state.cpu_clock_offset -= get_clock();
>          timers_state.cpu_ticks_enabled = 1;
>      }
> -    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_end(&timers_state.vm_clock_seqlock);
>  }
>  
>  /* disable cpu_get_ticks() : the clock is stopped. You must not call
> @@ -263,13 +263,13 @@ void cpu_enable_ticks(void)
>  void cpu_disable_ticks(void)
>  {
>      /* Here, the really thing protected by seqlock is cpu_clock_offset. */
> -    seqlock_write_lock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_begin(&timers_state.vm_clock_seqlock);
>      if (timers_state.cpu_ticks_enabled) {
>          timers_state.cpu_ticks_offset += cpu_get_host_ticks();
>          timers_state.cpu_clock_offset = cpu_get_clock_locked();
>          timers_state.cpu_ticks_enabled = 0;
>      }
> -    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_end(&timers_state.vm_clock_seqlock);
>  }
>  
>  /* Correlation between real and virtual time is always going to be
> @@ -292,7 +292,7 @@ static void icount_adjust(void)
>          return;
>      }
>  
> -    seqlock_write_lock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_begin(&timers_state.vm_clock_seqlock);
>      cur_time = cpu_get_clock_locked();
>      cur_icount = cpu_get_icount_locked();
>  
> @@ -313,7 +313,7 @@ static void icount_adjust(void)
>      last_delta = delta;
>      timers_state.qemu_icount_bias = cur_icount
>                                - (timers_state.qemu_icount << icount_time_shift);
> -    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_end(&timers_state.vm_clock_seqlock);
>  }
>  
>  static void icount_adjust_rt(void *opaque)
> @@ -353,7 +353,7 @@ static void icount_warp_rt(void)
>          return;
>      }
>  
> -    seqlock_write_lock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_begin(&timers_state.vm_clock_seqlock);
>      if (runstate_is_running()) {
>          int64_t clock = REPLAY_CLOCK(REPLAY_CLOCK_VIRTUAL_RT,
>                                       cpu_get_clock_locked());
> @@ -372,7 +372,7 @@ static void icount_warp_rt(void)
>          timers_state.qemu_icount_bias += warp_delta;
>      }
>      vm_clock_warp_start = -1;
> -    seqlock_write_unlock(&timers_state.vm_clock_seqlock);
> +    seqlock_write_end(&timers_state.vm_clock_seqlock);
>  
>      if (qemu_clock_expired(QEMU_CLOCK_VIRTUAL)) {
>          qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
> @@ -397,9 +397,9 @@ void qtest_clock_warp(int64_t dest)
>          int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
>          int64_t warp = qemu_soonest_timeout(dest - clock, deadline);
>  
> -        seqlock_write_lock(&timers_state.vm_clock_seqlock);
> +        seqlock_write_begin(&timers_state.vm_clock_seqlock);
>          timers_state.qemu_icount_bias += warp;
> -        seqlock_write_unlock(&timers_state.vm_clock_seqlock);
> +        seqlock_write_end(&timers_state.vm_clock_seqlock);
>  
>          qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
>          timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]);
> @@ -466,9 +466,9 @@ void qemu_start_warp_timer(void)
>               * It is useful when we want a deterministic execution time,
>               * isolated from host latencies.
>               */
> -            seqlock_write_lock(&timers_state.vm_clock_seqlock);
> +            seqlock_write_begin(&timers_state.vm_clock_seqlock);
>              timers_state.qemu_icount_bias += deadline;
> -            seqlock_write_unlock(&timers_state.vm_clock_seqlock);
> +            seqlock_write_end(&timers_state.vm_clock_seqlock);
>              qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
>          } else {
>              /*
> @@ -479,11 +479,11 @@ void qemu_start_warp_timer(void)
>               * you will not be sending network packets continuously instead of
>               * every 100ms.
>               */
> -            seqlock_write_lock(&timers_state.vm_clock_seqlock);
> +            seqlock_write_begin(&timers_state.vm_clock_seqlock);
>              if (vm_clock_warp_start == -1 || vm_clock_warp_start > clock) {
>                  vm_clock_warp_start = clock;
>              }
> -            seqlock_write_unlock(&timers_state.vm_clock_seqlock);
> +            seqlock_write_end(&timers_state.vm_clock_seqlock);
>              timer_mod_anticipate(icount_warp_timer, clock + deadline);
>          }
>      } else if (deadline == 0) {
> diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
> index e673482..4dfc055 100644
> --- a/include/qemu/seqlock.h
> +++ b/include/qemu/seqlock.h
> @@ -28,7 +28,7 @@ static inline void seqlock_init(QemuSeqLock *sl)
>  }
>  
>  /* Lock out other writers and update the count.  */
> -static inline void seqlock_write_lock(QemuSeqLock *sl)
> +static inline void seqlock_write_begin(QemuSeqLock *sl)
>  {
>      ++sl->sequence;
>  
> @@ -36,7 +36,7 @@ static inline void seqlock_write_lock(QemuSeqLock *sl)
>      smp_wmb();
>  }
>  
> -static inline void seqlock_write_unlock(QemuSeqLock *sl)
> +static inline void seqlock_write_end(QemuSeqLock *sl)
>  {
>      /* Write other fields before finalizing sequence.  */
>      smp_wmb();

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 04/15] include/processor.h: define cpu_relax()
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 04/15] include/processor.h: define cpu_relax() Emilio G. Cota
@ 2016-05-27 20:53   ` Sergey Fedorov
  2016-05-27 21:10     ` Emilio G. Cota
  0 siblings, 1 reply; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-27 20:53 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson

On 25/05/16 04:13, Emilio G. Cota wrote:
> Taken from the linux kernel.
>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/processor.h | 30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
>  create mode 100644 include/qemu/processor.h
>
> diff --git a/include/qemu/processor.h b/include/qemu/processor.h
> new file mode 100644
> index 0000000..42bcc99
> --- /dev/null
> +++ b/include/qemu/processor.h
> @@ -0,0 +1,30 @@
> +/*
> + * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
> + *

If it's taken from the Linux kernel shouldn't it be attributed here?

> + * License: GNU GPL, version 2.
> + *   See the COPYING file in the top-level directory.
> + */
> +#ifndef QEMU_PROCESSOR_H
> +#define QEMU_PROCESSOR_H
> +
> +#include "qemu/atomic.h"
> +
> +#if defined(__i386__) || defined(__x86_64__)
> +# define cpu_relax() asm volatile("rep; nop" ::: "memory")
> +
> +#elif defined(__ia64__)
> +# define cpu_relax() asm volatile("hint @pause" ::: "memory")
> +
> +#elif defined(__aarch64__)
> +# define cpu_relax() asm volatile("yield" ::: "memory")
> +
> +#elif defined(__powerpc64__)
> +/* set Hardware Multi-Threading (HMT) priority to low; then back to medium */
> +# define cpu_relax() asm volatile("or 1, 1, 1;" \
> +                                  "or 2, 2, 2;" ::: "memory")
> +
> +#else
> +# define cpu_relax() barrier()
> +#endif
> +
> +#endif /* QEMU_PROCESSOR_H */
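
For context, the typical user of cpu_relax() is a spin-wait loop, e.g.
from qht-bench earlier in this series:

    while (!atomic_mb_read(&test_start)) {
        cpu_relax();
    }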

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 04/15] include/processor.h: define cpu_relax()
  2016-05-27 20:53   ` Sergey Fedorov
@ 2016-05-27 21:10     ` Emilio G. Cota
  2016-05-28 12:35       ` Sergey Fedorov
  0 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-27 21:10 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On Fri, May 27, 2016 at 23:53:01 +0300, Sergey Fedorov wrote:
> On 25/05/16 04:13, Emilio G. Cota wrote:
> > Taken from the linux kernel.
> >
> > Reviewed-by: Richard Henderson <rth@twiddle.net>
> > Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > ---
> >  include/qemu/processor.h | 30 ++++++++++++++++++++++++++++++
> >  1 file changed, 30 insertions(+)
> >  create mode 100644 include/qemu/processor.h
> >
> > diff --git a/include/qemu/processor.h b/include/qemu/processor.h
> > new file mode 100644
> > index 0000000..42bcc99
> > --- /dev/null
> > +++ b/include/qemu/processor.h
> > @@ -0,0 +1,30 @@
> > +/*
> > + * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
> > + *
> 
> If it's taken from the Linux kernel shouldn't it be attributed here?

It's "taken" as in "we do like the kernel does", not as in
"we're just copying this code".

		Emilio

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 04/15] include/processor.h: define cpu_relax()
  2016-05-27 21:10     ` Emilio G. Cota
@ 2016-05-28 12:35       ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-28 12:35 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On 28/05/16 00:10, Emilio G. Cota wrote:
> On Fri, May 27, 2016 at 23:53:01 +0300, Sergey Fedorov wrote:
>> On 25/05/16 04:13, Emilio G. Cota wrote:
>>> Taken from the linux kernel.
>>>
>>> Reviewed-by: Richard Henderson <rth@twiddle.net>
>>> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
>>> Signed-off-by: Emilio G. Cota <cota@braap.org>
>>> ---
>>>  include/qemu/processor.h | 30 ++++++++++++++++++++++++++++++
>>>  1 file changed, 30 insertions(+)
>>>  create mode 100644 include/qemu/processor.h
>>>
>>> diff --git a/include/qemu/processor.h b/include/qemu/processor.h
>>> new file mode 100644
>>> index 0000000..42bcc99
>>> --- /dev/null
>>> +++ b/include/qemu/processor.h
>>> @@ -0,0 +1,30 @@
>>> +/*
>>> + * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
>>> + *
>> If it's taken from the Linux kernel shouldn't it be attributed here?
> It's "taken" as in "we do like the kernel does", not as in
> "we're just copying this code".
>

Well, then

Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>

Regards,
Sergey

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 06/15] exec: add tb_hash_func5, derived from xxhash
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 06/15] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
@ 2016-05-28 12:36   ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-28 12:36 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson

On 25/05/16 04:13, Emilio G. Cota wrote:
> This will be used by upcoming changes for computing the tb hash.
>
> Add this into a separate file to include the copyright notice from
> xxhash.
>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>

> ---
>  include/exec/tb-hash-xx.h | 94 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 94 insertions(+)
>  create mode 100644 include/exec/tb-hash-xx.h
>
> diff --git a/include/exec/tb-hash-xx.h b/include/exec/tb-hash-xx.h
> new file mode 100644
> index 0000000..9f3fc05
> --- /dev/null
> +++ b/include/exec/tb-hash-xx.h
> @@ -0,0 +1,94 @@
> +/*
> + * xxHash - Fast Hash algorithm
> + * Copyright (C) 2012-2016, Yann Collet
> + *
> + * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions are
> + * met:
> + *
> + * + Redistributions of source code must retain the above copyright
> + * notice, this list of conditions and the following disclaimer.
> + * + Redistributions in binary form must reproduce the above
> + * copyright notice, this list of conditions and the following disclaimer
> + * in the documentation and/or other materials provided with the
> + * distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + *
> + * You can contact the author at :
> + * - xxHash source repository : https://github.com/Cyan4973/xxHash
> + */
> +#ifndef EXEC_TB_HASH_XX
> +#define EXEC_TB_HASH_XX
> +
> +#include <qemu/bitops.h>
> +
> +#define PRIME32_1   2654435761U
> +#define PRIME32_2   2246822519U
> +#define PRIME32_3   3266489917U
> +#define PRIME32_4    668265263U
> +#define PRIME32_5    374761393U
> +
> +#define TB_HASH_XX_SEED 1
> +
> +/*
> + * xxhash32, customized for input variables that are not guaranteed to be
> + * contiguous in memory.
> + */
> +static inline
> +uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e)
> +{
> +    uint32_t v1 = TB_HASH_XX_SEED + PRIME32_1 + PRIME32_2;
> +    uint32_t v2 = TB_HASH_XX_SEED + PRIME32_2;
> +    uint32_t v3 = TB_HASH_XX_SEED + 0;
> +    uint32_t v4 = TB_HASH_XX_SEED - PRIME32_1;
> +    uint32_t a = a0 >> 32;
> +    uint32_t b = a0;
> +    uint32_t c = b0 >> 32;
> +    uint32_t d = b0;
> +    uint32_t h32;
> +
> +    v1 += a * PRIME32_2;
> +    v1 = rol32(v1, 13);
> +    v1 *= PRIME32_1;
> +
> +    v2 += b * PRIME32_2;
> +    v2 = rol32(v2, 13);
> +    v2 *= PRIME32_1;
> +
> +    v3 += c * PRIME32_2;
> +    v3 = rol32(v3, 13);
> +    v3 *= PRIME32_1;
> +
> +    v4 += d * PRIME32_2;
> +    v4 = rol32(v4, 13);
> +    v4 *= PRIME32_1;
> +
> +    h32 = rol32(v1, 1) + rol32(v2, 7) + rol32(v3, 12) + rol32(v4, 18);
> +    h32 += 20;
> +
> +    h32 += e * PRIME32_3;
> +    h32  = rol32(h32, 17) * PRIME32_4;
> +
> +    h32 ^= h32 >> 15;
> +    h32 *= PRIME32_2;
> +    h32 ^= h32 >> 13;
> +    h32 *= PRIME32_3;
> +    h32 ^= h32 >> 16;
> +
> +    return h32;
> +}
> +
> +#endif /* EXEC_TB_HASH_XX */

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 07/15] tb hash: hash phys_pc, pc, and flags with xxhash
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 07/15] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
@ 2016-05-28 12:39   ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-28 12:39 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson

On 25/05/16 04:13, Emilio G. Cota wrote:
> For some workloads such as arm bootup, tb_phys_hash is performance-critical.
> This is due to the high frequency of accesses to the hash table, caused
> by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
> More info:
>   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html
>
> To dig further into this I modified an arm image booting debian jessie to
> immediately shut down after boot. Analysis revealed that quite a bit of time
> is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
> results in very uneven loading of chains in the hash table's buckets;
> the longest observed chain had ~550 elements.
>
> The appended addresses this with two changes:
>
> 1) Use xxhash as the hash table's hash function. xxhash is a fast,
>    high-quality hashing function.
>
> 2) Feed the hashing function with not just tb_phys, but also pc and flags.
>
> This improves performance over using just tb_phys for hashing, since that
> resulted in some hash buckets having many TBs, while others got very few;
> with these changes, the longest observed chain on a single hash bucket is
> brought down from ~550 to ~40.
>
> Tests show that the other element checked for in tb_find_physical,
> cs_base, is always a match when tb_phys+pc+flags are a match,
> so hashing cs_base is wasteful. It could be that this is an ARM-only
> thing, though. UPDATE:
> On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
>> The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
>> consisting of only a delay slot).
>> It may well still turn out to be reasonable to ignore cs_base for hashing.
> BTW, after this change the hash table should not be called "tb_hash_phys"
> anymore; this is addressed later in this series.
>
> This change gives consistent bootup time improvements. I tested two
> host machines:
> - Intel Xeon E5-2690: 11.6% less time
> - Intel i7-4790K: 19.2% less time
>
> Increasing the number of hash buckets yields further improvements. However,
> using a larger, fixed number of buckets can degrade performance for other
> workloads that do not translate as many blocks (600K+ for debian-jessie arm
> bootup). This is dealt with later in this series.
>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>

> ---
>  cpu-exec.c             |  4 ++--
>  include/exec/tb-hash.h |  8 ++++++--
>  translate-all.c        | 10 +++++-----
>  3 files changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 14df1aa..1735032 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -231,13 +231,13 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
>  {
>      CPUArchState *env = (CPUArchState *)cpu->env_ptr;
>      TranslationBlock *tb, **tb_hash_head, **ptb1;
> -    unsigned int h;
> +    uint32_t h;
>      tb_page_addr_t phys_pc, phys_page1;
>  
>      /* find translated block using physical mappings */
>      phys_pc = get_page_addr_code(env, pc);
>      phys_page1 = phys_pc & TARGET_PAGE_MASK;
> -    h = tb_phys_hash_func(phys_pc);
> +    h = tb_hash_func(phys_pc, pc, flags);
>  
>      /* Start at head of the hash entry */
>      ptb1 = tb_hash_head = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
> index 0f4e8a0..88ccfd1 100644
> --- a/include/exec/tb-hash.h
> +++ b/include/exec/tb-hash.h
> @@ -20,6 +20,9 @@
>  #ifndef EXEC_TB_HASH
>  #define EXEC_TB_HASH
>  
> +#include "exec/exec-all.h"
> +#include "exec/tb-hash-xx.h"
> +
>  /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
>     addresses on the same page.  The top bits are the same.  This allows
>     TLB invalidation to quickly clear a subset of the hash table.  */
> @@ -43,9 +46,10 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
>             | (tmp & TB_JMP_ADDR_MASK));
>  }
>  
> -static inline unsigned int tb_phys_hash_func(tb_page_addr_t pc)
> +static inline
> +uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint32_t flags)
>  {
> -    return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
> +    return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
>  }
>  
>  #endif
> diff --git a/translate-all.c b/translate-all.c
> index b54f472..c48fccb 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -991,12 +991,12 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>  {
>      CPUState *cpu;
>      PageDesc *p;
> -    unsigned int h;
> +    uint32_t h;
>      tb_page_addr_t phys_pc;
>  
>      /* remove the TB from the hash list */
>      phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
> -    h = tb_phys_hash_func(phys_pc);
> +    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
>      tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
>  
>      /* remove the TB from the page list */
> @@ -1126,11 +1126,11 @@ static inline void tb_alloc_page(TranslationBlock *tb,
>  static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>                           tb_page_addr_t phys_page2)
>  {
> -    unsigned int h;
> +    uint32_t h;
>      TranslationBlock **ptb;
>  
> -    /* add in the physical hash table */
> -    h = tb_phys_hash_func(phys_pc);
> +    /* add in the hash table */
> +    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
>      ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
>      tb->phys_hash_next = *ptb;
>      *ptb = tb;


* Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data Emilio G. Cota
@ 2016-05-28 18:15   ` Sergey Fedorov
  2016-06-03 17:22     ` Emilio G. Cota
  2016-06-07  1:05     ` Emilio G. Cota
  0 siblings, 2 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-28 18:15 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson

On 25/05/16 04:13, Emilio G. Cota wrote:
> diff --git a/util/qdist.c b/util/qdist.c
> new file mode 100644
> index 0000000..3343640
> --- /dev/null
> +++ b/util/qdist.c
> @@ -0,0 +1,386 @@
(snip)
> +
> +void qdist_add(struct qdist *dist, double x, long count)
> +{
> +    struct qdist_entry *entry = NULL;
> +
> +    if (dist->entries) {
> +        struct qdist_entry e;
> +
> +        e.x = x;
> +        entry = bsearch(&e, dist->entries, dist->n, sizeof(e), qdist_cmp);
> +    }
> +
> +    if (entry) {
> +        entry->count += count;
> +        return;
> +    }
> +
> +    dist->entries = g_realloc(dist->entries,
> +                              sizeof(*dist->entries) * (dist->n + 1));

Repeated doubling?

> +    dist->n++;
> +    entry = &dist->entries[dist->n - 1];

What if we combine the above two lines:

    entry = &dist->entries[dist->n++];

or just reverse them:

    entry = &dist->entries[dist->n];
    dist->n++;


> +    entry->x = x;
> +    entry->count = count;
> +    qsort(dist->entries, dist->n, sizeof(*entry), qdist_cmp);
> +}
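
Following up on the "repeated doubling" question above, a hedged sketch of
what geometric growth could look like; the 'capacity' field is hypothetical
and not part of the posted code (the qsort() at the end would still run as
before):

    /* hypothetical: grow entries[] geometrically for amortized O(1) insert */
    if (dist->n == dist->capacity) {
        dist->capacity = dist->capacity ? dist->capacity * 2 : 16;
        dist->entries = g_realloc(dist->entries,
                                  sizeof(*dist->entries) * dist->capacity);
    }
    dist->entries[dist->n].x = x;
    dist->entries[dist->n].count = count;
    dist->n++;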
> +
(snip)
> +static char *qdist_pr_internal(const struct qdist *dist)
> +{
> +    double min, max, step;
> +    GString *s = g_string_new("");
> +    size_t i;
> +
> +    /* if only one entry, its printout will be either full or empty */
> +    if (dist->n == 1) {
> +        if (dist->entries[0].count) {
> +            g_string_append_unichar(s, qdist_blocks[QDIST_NR_BLOCK_CODES - 1]);
> +        } else {
> +            g_string_append_c(s, ' ');
> +        }
> +        goto out;
> +    }
> +
> +    /* get min and max counts */
> +    min = dist->entries[0].count;
> +    max = min;
> +    for (i = 0; i < dist->n; i++) {
> +        struct qdist_entry *e = &dist->entries[i];
> +
> +        if (e->count < min) {
> +            min = e->count;
> +        }
> +        if (e->count > max) {
> +            max = e->count;
> +        }
> +    }
> +
> +    /* floor((count - min) * step) will give us the block index */
> +    step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
> +
> +    for (i = 0; i < dist->n; i++) {
> +        struct qdist_entry *e = &dist->entries[i];
> +        int index;
> +
> +        /* make an exception with 0; instead of using block[0], print a space */
> +        if (e->count) {
> +            index = (int)((e->count - min) * step);

So "e->count == min" gives us one eighth block instead of just space?

> +            g_string_append_unichar(s, qdist_blocks[index]);
> +        } else {
> +            g_string_append_c(s, ' ');
> +        }
> +    }
> + out:
> +    return g_string_free(s, FALSE);
> +}
> +
> +/*
> + * Bin the distribution in @from into @n bins of consecutive, non-overlapping
> + * intervals, copying the result to @to.
> + *
> + * This function is internal to qdist: only this file and test code should
> + * ever call it.
> + *
> + * Note: calling this function on an already-binned qdist is a bug.
> + *
> + * If @n == 0 or @from->n == 1, use @from->n.
> + */
> +void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
> +{
> +    double xmin, xmax;
> +    double step;
> +    size_t i, j, j_min;
> +
> +    qdist_init(to);
> +
> +    if (!from->entries) {
> +        return;
> +    }
> +    if (!n || from->n == 1) {
> +        n = from->n;
> +    }
> +
> +    /* set equally-sized bins between @from's left and right */
> +    xmin = qdist_xmin(from);
> +    xmax = qdist_xmax(from);
> +    step = (xmax - xmin) / n;
> +
> +    if (n == from->n) {
> +        /* if @from's entries are equally spaced, no need to re-bin */
> +        for (i = 0; i < from->n; i++) {
> +            if (from->entries[i].x != xmin + i * step) {
> +                goto rebin;

static inline function instead of goto?

> +            }
> +        }
> +        /* they're equally spaced, so copy the dist and bail out */
> +        to->entries = g_malloc(sizeof(*to->entries) * from->n);

g_new()?

> +        to->n = from->n;
> +        memcpy(to->entries, from->entries, sizeof(*to->entries) * to->n);
> +        return;
> +    }
> +
> + rebin:
> +    j_min = 0;
> +    for (i = 0; i < n; i++) {
> +        double x;
> +        double left, right;
> +
> +        left = xmin + i * step;
> +        right = xmin + (i + 1) * step;
> +
> +        /* Add x, even if it might not get any counts later */
> +        x = left;

This way we round down to the left margin of each bin like this:

    xmin [*---*---*---*---*] xmax   -- from
          |  /|  /|  /|  /
          | / | / | / | /
          |/  |/  |/  |/
          |   |   |   |
          V   V   V   V
         [*   *   *   *]            -- to


instead of e.g. rounding to the middle of each bin with:

    x = left + step / 2;

which would give the picture like this:


    xmin [*---*---*---*---*] xmax   -- from
          |   |   |   |   |
           \ / \ / \ / \ /
            |   |   |   |
            V   V   V   V
           [*   *   *   *]          -- to

or even:

    left = xmin + (i - 0.5) * step;
    right = left + step;
    x = left + step / 2;

with corresponding changes in the following loop. That would give us
this picture (with the same 'n'):

    xmin [*----*----*----*] xmax    -- from
        \   /\   /\   /\   /
         \ /  \ /  \ /  \ /
          |    |    |    |
          V    V    V    V
         [*    *    *    *]         -- to

I'm not sure which is the more correct option from the mathematical
point of view; but with the last variant of the algorithm, binning
multiple times would still give the same result.

> +        qdist_add(to, x, 0);
> +
> +        /*
> +         * To avoid double-counting we capture [left, right) ranges, except for
> +         * the rightmost bin, which captures a [left, right] range.
> +         */
> +        for (j = j_min; j < from->n; j++) {

Looks like we don't need to keep both 'j' and 'j_min'. We could just use
'j', initialize it before the outer loop, and do the inner loop with
"while".

> +            struct qdist_entry *o = &from->entries[j];
> +
> +            /* entries are ordered so do not check beyond right */
> +            if (o->x > right) {
> +                break;
> +            }
> +            if (o->x >= left && (o->x < right ||
> +                                   (i == n - 1 && o->x == right))) {
> +                qdist_add(to, x, o->count);
> +                /* don't check this entry again */
> +                j_min = j + 1;
> +            }
> +        }
> +    }
> +}
> +
(snip)
> +double qdist_avg(const struct qdist *dist)
> +{
> +    unsigned long count;
> +    size_t i;
> +    double ret = 0;
> +
> +    count = qdist_sample_count(dist);
> +    if (!count) {
> +        return NAN;
> +    }
> +    for (i = 0; i < dist->n; i++) {
> +        struct qdist_entry *e = &dist->entries[i];
> +
> +        ret += e->x * e->count / count;

Please use Welford’s method or something like that, see
http://stackoverflow.com/a/1346890.

> +    }
> +    return ret;
> +}
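
For reference, a hedged sketch of the incremental weighted mean from the
linked answer; the function name is illustrative, not from the patch, and
NAN assumes <math.h>:

    /* streaming weighted mean: avg += count * (x - avg) / samples_so_far,
     * which avoids accumulating per-term divisions by the grand total */
    static double qdist_avg_incremental(const struct qdist *dist)
    {
        unsigned long seen = 0;
        double avg = 0;
        size_t i;

        for (i = 0; i < dist->n; i++) {
            struct qdist_entry *e = &dist->entries[i];

            seen += e->count;
            if (seen) {
                avg += (double)e->count * (e->x - avg) / seen;
            }
        }
        return seen ? avg : NAN;
    }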


* Re: [Qemu-devel] [PATCH v6 09/15] qdist: add test program
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 09/15] qdist: add test program Emilio G. Cota
@ 2016-05-28 18:56   ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-28 18:56 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson

On 25/05/16 04:13, Emilio G. Cota wrote:
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Acked-by: Sergey Fedorov <sergey.fedorov@linaro.org>

> ---
>  tests/.gitignore   |   1 +
>  tests/Makefile     |   6 +-
>  tests/test-qdist.c | 369 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 375 insertions(+), 1 deletion(-)
>  create mode 100644 tests/test-qdist.c
>
> diff --git a/tests/.gitignore b/tests/.gitignore
> index a06a8ba..7c0d156 100644
> --- a/tests/.gitignore
> +++ b/tests/.gitignore
> @@ -48,6 +48,7 @@ test-qapi-types.[ch]
>  test-qapi-visit.[ch]
>  test-qdev-global-props
>  test-qemu-opts
> +test-qdist
>  test-qga
>  test-qmp-commands
>  test-qmp-commands.h
> diff --git a/tests/Makefile b/tests/Makefile
> index 9dddde6..a5af20b 100644
> --- a/tests/Makefile
> +++ b/tests/Makefile
> @@ -70,6 +70,8 @@ check-unit-y += tests/rcutorture$(EXESUF)
>  gcov-files-rcutorture-y = util/rcu.c
>  check-unit-y += tests/test-rcu-list$(EXESUF)
>  gcov-files-test-rcu-list-y = util/rcu.c
> +check-unit-y += tests/test-qdist$(EXESUF)
> +gcov-files-test-qdist-y = util/qdist.c
>  check-unit-y += tests/test-bitops$(EXESUF)
>  check-unit-$(CONFIG_HAS_GLIB_SUBPROCESS_TESTS) += tests/test-qdev-global-props$(EXESUF)
>  check-unit-y += tests/check-qom-interface$(EXESUF)
> @@ -392,7 +394,8 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
>  	tests/test-qmp-commands.o tests/test-visitor-serialization.o \
>  	tests/test-x86-cpuid.o tests/test-mul64.o tests/test-int128.o \
>  	tests/test-opts-visitor.o tests/test-qmp-event.o \
> -	tests/rcutorture.o tests/test-rcu-list.o
> +	tests/rcutorture.o tests/test-rcu-list.o \
> +	tests/test-qdist.o
>  
>  $(test-obj-y): QEMU_INCLUDES += -Itests
>  QEMU_CFLAGS += -I$(SRC_PATH)/tests
> @@ -431,6 +434,7 @@ tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o
>  tests/test-int128$(EXESUF): tests/test-int128.o
>  tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
>  tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
> +tests/test-qdist$(EXESUF): tests/test-qdist.o $(test-util-obj-y)
>  
>  tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
>  	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
> diff --git a/tests/test-qdist.c b/tests/test-qdist.c
> new file mode 100644
> index 0000000..7625a57
> --- /dev/null
> +++ b/tests/test-qdist.c
> @@ -0,0 +1,369 @@
> +/*
> + * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
> + *
> + * License: GNU GPL, version 2 or later.
> + *   See the COPYING file in the top-level directory.
> + */
> +#include "qemu/osdep.h"
> +#include <glib.h>
> +#include "qemu/qdist.h"
> +
> +#include <math.h>
> +
> +struct entry_desc {
> +    double x;
> +    unsigned long count;
> +
> +    /* 0 prints a space, 1-8 prints from qdist_blocks[] */
> +    int fill_code;
> +};
> +
> +/* See: https://en.wikipedia.org/wiki/Block_Elements */
> +static const gunichar qdist_blocks[] = {
> +    0x2581,
> +    0x2582,
> +    0x2583,
> +    0x2584,
> +    0x2585,
> +    0x2586,
> +    0x2587,
> +    0x2588
> +};
> +
> +#define QDIST_NR_BLOCK_CODES ARRAY_SIZE(qdist_blocks)
> +
> +static char *pr_hist(const struct entry_desc *darr, size_t n)
> +{
> +    GString *s = g_string_new("");
> +    size_t i;
> +
> +    for (i = 0; i < n; i++) {
> +        int fill = darr[i].fill_code;
> +
> +        if (fill) {
> +            assert(fill <= QDIST_NR_BLOCK_CODES);
> +            g_string_append_unichar(s, qdist_blocks[fill - 1]);
> +        } else {
> +            g_string_append_c(s, ' ');
> +        }
> +    }
> +    return g_string_free(s, FALSE);
> +}
> +
> +static void
> +histogram_check(const struct qdist *dist, const struct entry_desc *darr,
> +                size_t n, size_t n_bins)
> +{
> +    char *pr = qdist_pr_plain(dist, n_bins);
> +    char *str = pr_hist(darr, n);
> +
> +    g_assert_cmpstr(pr, ==, str);
> +    g_free(pr);
> +    g_free(str);
> +}
> +
> +static void histogram_check_single_full(const struct qdist *dist, size_t n_bins)
> +{
> +    struct entry_desc desc = { .fill_code = 8 };
> +
> +    histogram_check(dist, &desc, 1, n_bins);
> +}
> +
> +static void
> +entries_check(const struct qdist *dist, const struct entry_desc *darr, size_t n)
> +{
> +    size_t i;
> +
> +    for (i = 0; i < n; i++) {
> +        struct qdist_entry *e = &dist->entries[i];
> +
> +        g_assert_cmpuint(e->count, ==, darr[i].count);
> +    }
> +}
> +
> +static void
> +entries_insert(struct qdist *dist, const struct entry_desc *darr, size_t n)
> +{
> +    size_t i;
> +
> +    for (i = 0; i < n; i++) {
> +        qdist_add(dist, darr[i].x, darr[i].count);
> +    }
> +}
> +
> +static void do_test_bin(const struct entry_desc *a, size_t n_a,
> +                        const struct entry_desc *b, size_t n_b)
> +{
> +    struct qdist qda;
> +    struct qdist qdb;
> +
> +    qdist_init(&qda);
> +
> +    entries_insert(&qda, a, n_a);
> +    qdist_inc(&qda, a[0].x);
> +    qdist_add(&qda, a[0].x, -1);
> +
> +    g_assert_cmpuint(qdist_unique_entries(&qda), ==, n_a);
> +    g_assert_cmpfloat(qdist_xmin(&qda), ==, a[0].x);
> +    g_assert_cmpfloat(qdist_xmax(&qda), ==, a[n_a - 1].x);
> +    histogram_check(&qda, a, n_a, 0);
> +    histogram_check(&qda, a, n_a, n_a);
> +
> +    qdist_bin__internal(&qdb, &qda, n_b);
> +    g_assert_cmpuint(qdb.n, ==, n_b);
> +    entries_check(&qdb, b, n_b);
> +    g_assert_cmpuint(qdist_sample_count(&qda), ==, qdist_sample_count(&qdb));
> +    /*
> +     * No histogram_check() for $qdb, since we'd rebin it and that is a bug.
> +     * Instead, regenerate it from $qda.
> +     */
> +    histogram_check(&qda, b, n_b, n_b);
> +
> +    qdist_destroy(&qdb);
> +    qdist_destroy(&qda);
> +}
> +
> +static void do_test_pr(uint32_t opt)
> +{
> +    static const struct entry_desc desc[] = {
> +        [0] = { 1, 900, 8 },
> +        [1] = { 2, 1, 1 },
> +        [2] = { 3, 2, 1 }
> +    };
> +    static const char border[] = "|";
> +    const char *llabel = NULL;
> +    const char *rlabel = NULL;
> +    struct qdist dist;
> +    GString *s;
> +    char *str;
> +    char *pr;
> +    size_t n;
> +
> +    n = ARRAY_SIZE(desc);
> +    qdist_init(&dist);
> +
> +    entries_insert(&dist, desc, n);
> +    histogram_check(&dist, desc, n, 0);
> +
> +    s = g_string_new("");
> +
> +    if (opt & QDIST_PR_LABELS) {
> +        unsigned int lopts = opt & (QDIST_PR_NODECIMAL |
> +                                    QDIST_PR_PERCENT |
> +                                    QDIST_PR_100X |
> +                                    QDIST_PR_NOBINRANGE);
> +
> +        if (lopts == 0) {
> +            llabel = "[1.0,1.7)";
> +            rlabel = "[2.3,3.0]";
> +        } else if (lopts == QDIST_PR_NODECIMAL) {
> +            llabel = "[1,2)";
> +            rlabel = "[2,3]";
> +        } else if (lopts == (QDIST_PR_PERCENT | QDIST_PR_NODECIMAL)) {
> +            llabel = "[1,2)%";
> +            rlabel = "[2,3]%";
> +        } else if (lopts == QDIST_PR_100X) {
> +            llabel = "[100.0,166.7)";
> +            rlabel = "[233.3,300.0]";
> +        } else if (lopts == (QDIST_PR_NOBINRANGE | QDIST_PR_NODECIMAL)) {
> +            llabel = "1";
> +            rlabel = "3";
> +        } else {
> +            g_assert_cmpstr("BUG", ==, "This is not meant to be exhaustive");
> +        }
> +    }
> +
> +    if (llabel) {
> +        g_string_append(s, llabel);
> +    }
> +    if (opt & QDIST_PR_BORDER) {
> +        g_string_append(s, border);
> +    }
> +
> +    str = pr_hist(desc, n);
> +    g_string_append(s, str);
> +    g_free(str);
> +
> +    if (opt & QDIST_PR_BORDER) {
> +        g_string_append(s, border);
> +    }
> +    if (rlabel) {
> +        g_string_append(s, rlabel);
> +    }
> +
> +    str = g_string_free(s, FALSE);
> +    pr = qdist_pr(&dist, n, opt);
> +    g_assert_cmpstr(pr, ==, str);
> +    g_free(pr);
> +    g_free(str);
> +
> +    qdist_destroy(&dist);
> +}
> +
> +static inline void do_test_pr_label(uint32_t opt)
> +{
> +    opt |= QDIST_PR_LABELS;
> +    do_test_pr(opt);
> +}
> +
> +static void test_pr(void)
> +{
> +    do_test_pr(0);
> +
> +    do_test_pr(QDIST_PR_BORDER);
> +
> +    /* 100X should be ignored because we're not setting LABELS */
> +    do_test_pr(QDIST_PR_100X);
> +
> +    do_test_pr_label(0);
> +    do_test_pr_label(QDIST_PR_NODECIMAL);
> +    do_test_pr_label(QDIST_PR_PERCENT | QDIST_PR_NODECIMAL);
> +    do_test_pr_label(QDIST_PR_100X);
> +    do_test_pr_label(QDIST_PR_NOBINRANGE | QDIST_PR_NODECIMAL);
> +}
> +
> +static void test_bin_shrink(void)
> +{
> +    static const struct entry_desc a[] = {
> +        [0] = { 0.0,   42922, 7 },
> +        [1] = { 0.25,  47834, 8 },
> +        [2] = { 0.50,  26628, 0 },
> +        [3] = { 0.625, 597,   4 },
> +        [4] = { 0.75,  10298, 1 },
> +        [5] = { 0.875, 22,    2 },
> +        [6] = { 1.0,   2771,  1 }
> +    };
> +    static const struct entry_desc b[] = {
> +        [0] = { 0.0, 42922, 7 },
> +        [1] = { 0.25, 47834, 8 },
> +        [2] = { 0.50, 27225, 3 },
> +        [3] = { 0.75, 13091, 1 }
> +    };
> +
> +    return do_test_bin(a, ARRAY_SIZE(a), b, ARRAY_SIZE(b));
> +}
> +
> +static void test_bin_expand(void)
> +{
> +    static const struct entry_desc a[] = {
> +        [0] = { 0.0,   11713, 5 },
> +        [1] = { 0.25,  20294, 0 },
> +        [2] = { 0.50,  17266, 8 },
> +        [3] = { 0.625, 1506,  0 },
> +        [4] = { 0.75,  10355, 6 },
> +        [5] = { 0.833, 2,     1 },
> +        [6] = { 0.875, 99,    4 },
> +        [7] = { 1.0,   4301,  2 }
> +    };
> +    static const struct entry_desc b[] = {
> +        [0] = { 0.0, 11713, 5 },
> +        [1] = { 0.0, 0,     0 },
> +        [2] = { 0.0, 20294, 8 },
> +        [3] = { 0.0, 0,     0 },
> +        [4] = { 0.0, 0,     0 },
> +        [5] = { 0.0, 17266, 6 },
> +        [6] = { 0.0, 1506,  1 },
> +        [7] = { 0.0, 10355, 4 },
> +        [8] = { 0.0, 101,   1 },
> +        [9] = { 0.0, 4301,  2 }
> +    };
> +
> +    return do_test_bin(a, ARRAY_SIZE(a), b, ARRAY_SIZE(b));
> +}
> +
> +static void test_bin_simple(void)
> +{
> +    static const struct entry_desc a[] = {
> +        [0] = { 10, 101, 8 },
> +        [1] = { 11, 0, 0 },
> +        [2] = { 12, 2, 1 }
> +    };
> +    static const struct entry_desc b[] = {
> +        [0] = { 0, 101, 8 },
> +        [1] = { 0, 0, 0 },
> +        [2] = { 0, 0, 0 },
> +        [3] = { 0, 0, 0 },
> +        [4] = { 0, 2, 1 }
> +    };
> +
> +    return do_test_bin(a, ARRAY_SIZE(a), b, ARRAY_SIZE(b));
> +}
> +
> +static void test_single_full(void)
> +{
> +    struct qdist dist;
> +
> +    qdist_init(&dist);
> +
> +    qdist_add(&dist, 3, 102);
> +    g_assert_cmpfloat(qdist_avg(&dist), ==, 3);
> +    g_assert_cmpfloat(qdist_xmin(&dist), ==, 3);
> +    g_assert_cmpfloat(qdist_xmax(&dist), ==, 3);
> +
> +    histogram_check_single_full(&dist, 0);
> +    histogram_check_single_full(&dist, 1);
> +    histogram_check_single_full(&dist, 10);
> +
> +    qdist_destroy(&dist);
> +}
> +
> +static void test_single_empty(void)
> +{
> +    struct qdist dist;
> +    char *pr;
> +
> +    qdist_init(&dist);
> +
> +    qdist_add(&dist, 3, 0);
> +    g_assert_cmpuint(qdist_sample_count(&dist), ==, 0);
> +    g_assert(isnan(qdist_avg(&dist)));
> +    g_assert_cmpfloat(qdist_xmin(&dist), ==, 3);
> +    g_assert_cmpfloat(qdist_xmax(&dist), ==, 3);
> +
> +    pr = qdist_pr_plain(&dist, 0);
> +    g_assert_cmpstr(pr, ==, " ");
> +    g_free(pr);
> +
> +    pr = qdist_pr_plain(&dist, 1);
> +    g_assert_cmpstr(pr, ==, " ");
> +    g_free(pr);
> +
> +    pr = qdist_pr_plain(&dist, 2);
> +    g_assert_cmpstr(pr, ==, " ");
> +    g_free(pr);
> +
> +    qdist_destroy(&dist);
> +}
> +
> +static void test_none(void)
> +{
> +    struct qdist dist;
> +    char *pr;
> +
> +    qdist_init(&dist);
> +
> +    g_assert(isnan(qdist_avg(&dist)));
> +    g_assert(isnan(qdist_xmin(&dist)));
> +    g_assert(isnan(qdist_xmax(&dist)));
> +
> +    pr = qdist_pr_plain(&dist, 0);
> +    g_assert(pr == NULL);
> +
> +    pr = qdist_pr_plain(&dist, 2);
> +    g_assert(pr == NULL);
> +
> +    qdist_destroy(&dist);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +    g_test_init(&argc, &argv, NULL);
> +    g_test_add_func("/qdist/none", test_none);
> +    g_test_add_func("/qdist/single/empty", test_single_empty);
> +    g_test_add_func("/qdist/single/full", test_single_full);
> +    g_test_add_func("/qdist/binning/simple", test_bin_simple);
> +    g_test_add_func("/qdist/binning/expand", test_bin_expand);
> +    g_test_add_func("/qdist/binning/shrink", test_bin_shrink);
> +    g_test_add_func("/qdist/pr", test_pr);
> +    return g_test_run();
> +}


* Re: [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
@ 2016-05-29 19:52   ` Sergey Fedorov
  2016-05-29 19:55     ` Sergey Fedorov
                       ` (3 more replies)
  0 siblings, 4 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-29 19:52 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson

On 25/05/16 04:13, Emilio G. Cota wrote:
> diff --git a/include/qemu/qht.h b/include/qemu/qht.h
> new file mode 100644
> index 0000000..aec60aa
> --- /dev/null
> +++ b/include/qemu/qht.h
> @@ -0,0 +1,183 @@
(snip)
> +/**
> + * qht_init - Initialize a QHT
> + * @ht: QHT to be initialized
> + * @n_elems: number of entries the hash table should be optimized for.
> + * @mode: bitmask with OR'ed QHT_MODE_*
> + */
> +void qht_init(struct qht *ht, size_t n_elems, unsigned int mode);

First of all, thank you for spending your time on the documentation of
the API!

I was just wondering if it could be worthwhile to pass a hash function
when initializing a QHT. Then we could have variants of qht_insert(),
qht_remove() and qht_lookup() which do not require a computed hash
value but call the function themselves. This could make sense since a
hash value passed to these functions should always be exactly the same
for the same object.
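
As a sketch of what such an interface might look like (all names below are
hypothetical, layered on top of the posted API):

    typedef uint32_t (*qht_hash_fn)(const void *p);

    /* hypothetical wrapper: remember the hash function at init time */
    struct qht_hashed {
        struct qht ht;
        qht_hash_fn hash;
    };

    static inline bool qht_hashed_insert(struct qht_hashed *h, void *p)
    {
        return qht_insert(&h->ht, p, h->hash(p));
    }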

(snip)
> +/**
> + * qht_remove - remove a pointer from the hash table
> + * @ht: QHT to remove from
> + * @p: pointer to be removed
> + * @hash: hash corresponding to @p
> + *
> + * Attempting to remove a NULL @p is a bug.
> + *
> + * Just-removed @p pointers cannot be immediately freed; they need to remain
> + * valid until the end of the RCU grace period in which qht_remove() is called.
> + * This guarantees that concurrent lookups will always compare against valid
> + * data.

Mention call_rcu1()/call_rcu()/g_free_rcu()?

> + *
> + * Returns true on success.
> + * Returns false if the @p-@hash pair was not found.
> + */
> +bool qht_remove(struct qht *ht, const void *p, uint32_t hash);
> +
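
For example, deferred reclamation with g_free_rcu() could look like the
sketch below; the entry struct and its embedded rcu_head are assumptions
for illustration:

    struct my_entry {
        long key;
        struct rcu_head rcu;    /* assumed embedded rcu_head */
    };

    /* free only after a grace period so concurrent lookups stay valid */
    if (qht_remove(ht, e, hash)) {
        g_free_rcu(e, rcu);
    }
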
(snip)
> diff --git a/util/qht.c b/util/qht.c
> new file mode 100644
> index 0000000..ca5a620
> --- /dev/null
> +++ b/util/qht.c
> @@ -0,0 +1,837 @@
(snip)
> +/* trigger a resize when n_added_buckets > n_buckets / div */
> +#define QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV 8

Just out of curiosity, how did you get this number?
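
(The predicate itself presumably just encodes the comment above; a sketch,
with the field names assumed from the rest of the file:)

    /* sketch: resize once more than 1/8 of the head buckets have chained */
    static bool qht_map_needs_resize(struct qht_map *map)
    {
        return atomic_read(&map->n_added_buckets) >
               map->n_buckets / QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV;
    }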

> +
> +static void qht_do_resize(struct qht *ht, struct qht_map *new);
> +static void qht_grow_maybe(struct qht *ht);

qht_grow_maybe() is used just once. Please consider reordering the
definitions so that this forward declaration can be removed.

(snip)
> +
> +/* call with head->lock held */
> +static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
> +                               struct qht_bucket *head, void *p, uint32_t hash,
> +                               bool *needs_resize)
> +{
> +    struct qht_bucket *b = head;
> +    struct qht_bucket *prev = NULL;
> +    struct qht_bucket *new = NULL;
> +    int i;
> +
> +    for (;;) {
> +        if (b == NULL) {
> +            b = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*b));
> +            memset(b, 0, sizeof(*b));
> +            new = b;
> +            atomic_inc(&map->n_added_buckets);
> +            if (unlikely(qht_map_needs_resize(map)) && needs_resize) {
> +                *needs_resize = true;
> +            }
> +        }
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (b->pointers[i]) {
> +                if (unlikely(b->pointers[i] == p)) {
> +                    return false;
> +                }
> +                continue;
> +            }
> +            /* found an empty key: acquire the seqlock and write */
> +            seqlock_write_begin(&head->sequence);
> +            if (new) {
> +                atomic_rcu_set(&prev->next, b);
> +            }
> +            b->hashes[i] = hash;
> +            atomic_set(&b->pointers[i], p);
> +            seqlock_write_end(&head->sequence);
> +            return true;
> +        }
> +        prev = b;
> +        b = b->next;
> +    }
> +}

Here is my attempt:

static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
                               struct qht_bucket *head, void *p, uint32_t hash,
                               bool *needs_resize)
{
    struct qht_bucket **bpp = &head, *new;
    int i = 0;

    do {
        while (i < QHT_BUCKET_ENTRIES) {
            if ((*bpp)->pointers[i]) {
                if (unlikely((*bpp)->pointers[i] == p)) {
                    return false;
                }
                i++;
                continue;
            }
            goto found;
        }
        bpp = &(*bpp)->next;
        i = 0;
    } while (*bpp);

    new = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*new));
    memset(new, 0, sizeof(*new));
    atomic_rcu_set(bpp, new);
    atomic_inc(&map->n_added_buckets);
    if (unlikely(qht_map_needs_resize(map)) && needs_resize) {
        *needs_resize = true;
    }
found:
    /* found an empty key: acquire the seqlock and write */
    seqlock_write_begin(&head->sequence);
    (*bpp)->hashes[i] = hash;
    atomic_set(&(*bpp)->pointers[i], p);
    seqlock_write_end(&head->sequence);
    return true;
}


Feel free to use it as you wish.

>
(snip)
> +/*
> + * Find the last valid entry in @head, and swap it with @orig[pos], which has
> + * just been invalidated.
> + */
> +static inline void qht_bucket_fill_hole(struct qht_bucket *orig, int pos)
> +{
> +    struct qht_bucket *b = orig;
> +    struct qht_bucket *prev = NULL;
> +    int i;
> +
> +    if (qht_entry_is_last(orig, pos)) {
> +        orig->hashes[pos] = 0;
> +        atomic_set(&orig->pointers[pos], NULL);
> +        return;
> +    }
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            if (b->pointers[i]) {
> +                continue;
> +            }
> +            if (i > 0) {
> +                return qht_entry_move(orig, pos, b, i - 1);
> +            }
> +            qht_debug_assert(prev);

'prev' can be NULL if this is the first iteration.

> +            return qht_entry_move(orig, pos, prev, QHT_BUCKET_ENTRIES - 1);
> +        }
> +        prev = b;
> +        b = b->next;
> +    } while (b);
> +    /* no free entries other than orig[pos], so swap it with the last one */
> +    qht_entry_move(orig, pos, prev, QHT_BUCKET_ENTRIES - 1);
> +}
> +
> +/* call with b->lock held */
> +static inline
> +bool qht_remove__locked(struct qht_map *map, struct qht_bucket *head,
> +                        const void *p, uint32_t hash)
> +{
> +    struct qht_bucket *b = head;
> +    int i;
> +
> +    do {
> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> +            void *q = b->pointers[i];
> +
> +            if (unlikely(q == NULL)) {
> +                return false;
> +            }
> +            if (q == p) {
> +                qht_debug_assert(b->hashes[i] == hash);
> +                seqlock_write_begin(&head->sequence);
> +                qht_bucket_fill_hole(b, i);

"Fill hole" doesn't correspond to the function's new job since there's
no hole. "Remove entry" would make more sense, I think.

> +                seqlock_write_end(&head->sequence);
> +                return true;
> +            }
> +        }
> +        b = b->next;
> +    } while (b);
> +    return false;
> +}
(snip)
> +/*
> + * Call with ht->lock and all bucket locks held.
> + *
> + * Creating the @new map here would add unnecessary delay while all the locks
> + * are held--holding up the bucket locks is particularly bad, since no writes
> + * can occur while these are held. Thus, we let callers create the new map,
> + * hopefully without the bucket locks held.
> + */
> +static void qht_do_resize(struct qht *ht, struct qht_map *new)
> +{
> +    struct qht_map *old;
> +
> +    old = ht->map;
> +    g_assert_cmpuint(new->n_buckets, !=, old->n_buckets);
> +
> +    qht_map_iter__all_locked(ht, old, qht_map_copy, new);
> +    qht_map_debug__all_locked(new);
> +
> +    atomic_rcu_set(&ht->map, new);
> +    call_rcu1(&old->rcu, qht_map_reclaim);

The call_rcu() macro is a more convenient way to do this, and with it
you wouldn't need qht_map_reclaim().

> +}
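
For reference, the call_rcu() form could look like the sketch below,
assuming a qht_map_destroy() that takes the map pointer directly (the
'rcu' field already exists, per the call_rcu1() call above):

    /* call_rcu(ptr, func, rcu_field): func(ptr) runs after a grace period */
    call_rcu(old, qht_map_destroy, rcu);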

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-29 19:52   ` Sergey Fedorov
@ 2016-05-29 19:55     ` Sergey Fedorov
  2016-05-31  7:46     ` Alex Bennée
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-29 19:55 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson

On 29/05/16 22:52, Sergey Fedorov wrote:
> On 25/05/16 04:13, Emilio G. Cota wrote:
>> +
>> +/* call with head->lock held */
>> +static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
>> +                               struct qht_bucket *head, void *p, uint32_t hash,
>> +                               bool *needs_resize)
>> +{
>> +    struct qht_bucket *b = head;
>> +    struct qht_bucket *prev = NULL;
>> +    struct qht_bucket *new = NULL;
>> +    int i;
>> +
>> +    for (;;) {
>> +        if (b == NULL) {
>> +            b = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*b));
>> +            memset(b, 0, sizeof(*b));
>> +            new = b;
>> +            atomic_inc(&map->n_added_buckets);
>> +            if (unlikely(qht_map_needs_resize(map)) && needs_resize) {
>> +                *needs_resize = true;
>> +            }
>> +        }
>> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
>> +            if (b->pointers[i]) {
>> +                if (unlikely(b->pointers[i] == p)) {
>> +                    return false;
>> +                }
>> +                continue;
>> +            }
>> +            /* found an empty key: acquire the seqlock and write */
>> +            seqlock_write_begin(&head->sequence);
>> +            if (new) {
>> +                atomic_rcu_set(&prev->next, b);
>> +            }
>> +            b->hashes[i] = hash;
>> +            atomic_set(&b->pointers[i], p);
>> +            seqlock_write_end(&head->sequence);
>> +            return true;
>> +        }
>> +        prev = b;
>> +        b = b->next;
>> +    }
>> +}
> Here is my attempt:

static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
                               struct qht_bucket *head, void *p, uint32_t hash,
                               bool *needs_resize)
{
    struct qht_bucket **bpp = &head, *new;
    int i = 0;

    do {
        while (i < QHT_BUCKET_ENTRIES) {
            if ((*bpp)->pointers[i]) {
                if (unlikely((*bpp)->pointers[i] == p)) {
                    return false;
                }
                i++;
                continue;
            }
            goto found;
        }
        bpp = &(*bpp)->next;
        i = 0;
    } while (*bpp);

    new = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*new));
    memset(new, 0, sizeof(*new));
    atomic_rcu_set(bpp, new);
    atomic_inc(&map->n_added_buckets);
    if (unlikely(qht_map_needs_resize(map)) && needs_resize) {
        *needs_resize = true;
    }
found:
    /* found an empty key: acquire the seqlock and write */
    seqlock_write_begin(&head->sequence);
    (*bpp)->hashes[i] = hash;
    atomic_set(&(*bpp)->pointers[i], p);
    seqlock_write_end(&head->sequence);
    return true;
}

> Feel free to use it as you wish.

Sorry for the chopped email. Hope this one will be better :)

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v6 11/15] qht: add test program
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 11/15] qht: add test program Emilio G. Cota
@ 2016-05-29 20:15   ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-29 20:15 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson

On 25/05/16 04:13, Emilio G. Cota wrote:
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Acked-by: Sergey Fedorov <sergey.fedorov@linaro.org>

> ---
>  tests/.gitignore |   1 +
>  tests/Makefile   |   6 ++-
>  tests/test-qht.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 165 insertions(+), 1 deletion(-)
>  create mode 100644 tests/test-qht.c
>
> diff --git a/tests/.gitignore b/tests/.gitignore
> index 7c0d156..ffde5d2 100644
> --- a/tests/.gitignore
> +++ b/tests/.gitignore
> @@ -50,6 +50,7 @@ test-qdev-global-props
>  test-qemu-opts
>  test-qdist
>  test-qga
> +test-qht
>  test-qmp-commands
>  test-qmp-commands.h
>  test-qmp-event
> diff --git a/tests/Makefile b/tests/Makefile
> index a5af20b..8589b11 100644
> --- a/tests/Makefile
> +++ b/tests/Makefile
> @@ -72,6 +72,8 @@ check-unit-y += tests/test-rcu-list$(EXESUF)
>  gcov-files-test-rcu-list-y = util/rcu.c
>  check-unit-y += tests/test-qdist$(EXESUF)
>  gcov-files-test-qdist-y = util/qdist.c
> +check-unit-y += tests/test-qht$(EXESUF)
> +gcov-files-test-qht-y = util/qht.c
>  check-unit-y += tests/test-bitops$(EXESUF)
>  check-unit-$(CONFIG_HAS_GLIB_SUBPROCESS_TESTS) += tests/test-qdev-global-props$(EXESUF)
>  check-unit-y += tests/check-qom-interface$(EXESUF)
> @@ -395,7 +397,8 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
>  	tests/test-x86-cpuid.o tests/test-mul64.o tests/test-int128.o \
>  	tests/test-opts-visitor.o tests/test-qmp-event.o \
>  	tests/rcutorture.o tests/test-rcu-list.o \
> -	tests/test-qdist.o
> +	tests/test-qdist.o \
> +	tests/test-qht.o
>  
>  $(test-obj-y): QEMU_INCLUDES += -Itests
>  QEMU_CFLAGS += -I$(SRC_PATH)/tests
> @@ -435,6 +438,7 @@ tests/test-int128$(EXESUF): tests/test-int128.o
>  tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
>  tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
>  tests/test-qdist$(EXESUF): tests/test-qdist.o $(test-util-obj-y)
> +tests/test-qht$(EXESUF): tests/test-qht.o $(test-util-obj-y)
>  
>  tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
>  	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
> diff --git a/tests/test-qht.c b/tests/test-qht.c
> new file mode 100644
> index 0000000..c8eb930
> --- /dev/null
> +++ b/tests/test-qht.c
> @@ -0,0 +1,159 @@
> +/*
> + * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
> + *
> + * License: GNU GPL, version 2 or later.
> + *   See the COPYING file in the top-level directory.
> + */
> +#include "qemu/osdep.h"
> +#include <glib.h>
> +#include "qemu/qht.h"
> +
> +#define N 5000
> +
> +static struct qht ht;
> +static int32_t arr[N * 2];
> +
> +static bool is_equal(const void *obj, const void *userp)
> +{
> +    const int32_t *a = obj;
> +    const int32_t *b = userp;
> +
> +    return *a == *b;
> +}
> +
> +static void insert(int a, int b)
> +{
> +    int i;
> +
> +    for (i = a; i < b; i++) {
> +        uint32_t hash;
> +
> +        arr[i] = i;
> +        hash = i;
> +
> +        qht_insert(&ht, &arr[i], hash);
> +    }
> +}
> +
> +static void rm(int init, int end)
> +{
> +    int i;
> +
> +    for (i = init; i < end; i++) {
> +        uint32_t hash;
> +
> +        hash = arr[i];
> +        g_assert_true(qht_remove(&ht, &arr[i], hash));
> +    }
> +}
> +
> +static void check(int a, int b, bool expected)
> +{
> +    struct qht_stats stats;
> +    int i;
> +
> +    for (i = a; i < b; i++) {
> +        void *p;
> +        uint32_t hash;
> +        int32_t val;
> +
> +        val = i;
> +        hash = i;
> +        p = qht_lookup(&ht, is_equal, &val, hash);
> +        g_assert_true(!!p == expected);
> +    }
> +    qht_statistics_init(&ht, &stats);
> +    if (stats.used_head_buckets) {
> +        g_assert_cmpfloat(qdist_avg(&stats.chain), >=, 1.0);
> +    }
> +    g_assert_cmpuint(stats.head_buckets, >, 0);
> +    qht_statistics_destroy(&stats);
> +}
> +
> +static void count_func(struct qht *ht, void *p, uint32_t hash, void *userp)
> +{
> +    unsigned int *curr = userp;
> +
> +    (*curr)++;
> +}
> +
> +static void check_n(size_t expected)
> +{
> +    struct qht_stats stats;
> +
> +    qht_statistics_init(&ht, &stats);
> +    g_assert_cmpuint(stats.entries, ==, expected);
> +    qht_statistics_destroy(&stats);
> +}
> +
> +static void iter_check(unsigned int count)
> +{
> +    unsigned int curr = 0;
> +
> +    qht_iter(&ht, count_func, &curr);
> +    g_assert_cmpuint(curr, ==, count);
> +}
> +
> +static void qht_do_test(unsigned int mode, size_t init_entries)
> +{
> +    qht_init(&ht, 0, mode);
> +
> +    insert(0, N);
> +    check(0, N, true);
> +    check_n(N);
> +    check(-N, -1, false);
> +    iter_check(N);
> +
> +    rm(101, 102);
> +    check_n(N - 1);
> +    insert(N, N * 2);
> +    check_n(N + N - 1);
> +    rm(N, N * 2);
> +    check_n(N - 1);
> +    insert(101, 102);
> +    check_n(N);
> +
> +    rm(10, 200);
> +    check_n(N - 190);
> +    insert(150, 200);
> +    check_n(N - 190 + 50);
> +    insert(10, 150);
> +    check_n(N);
> +
> +    rm(1, 2);
> +    check_n(N - 1);
> +    qht_reset_size(&ht, 0);
> +    check_n(0);
> +    check(0, N, false);
> +
> +    qht_destroy(&ht);
> +}
> +
> +static void qht_test(unsigned int mode)
> +{
> +    qht_do_test(mode, 0);
> +    qht_do_test(mode, 1);
> +    qht_do_test(mode, 2);
> +    qht_do_test(mode, 8);
> +    qht_do_test(mode, 16);
> +    qht_do_test(mode, 8192);
> +    qht_do_test(mode, 16384);
> +}
> +
> +static void test_default(void)
> +{
> +    qht_test(0);
> +}
> +
> +static void test_resize(void)
> +{
> +    qht_test(QHT_MODE_AUTO_RESIZE);
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +    g_test_init(&argc, &argv, NULL);
> +    g_test_add_func("/qht/mode/default", test_default);
> +    g_test_add_func("/qht/mode/resize", test_resize);
> +    return g_test_run();
> +}


* Re: [Qemu-devel] [PATCH v6 12/15] qht: add qht-bench, a performance benchmark
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 12/15] qht: add qht-bench, a performance benchmark Emilio G. Cota
@ 2016-05-29 20:45   ` Sergey Fedorov
  2016-06-03 11:41     ` Emilio G. Cota
  2016-05-31 15:12   ` Alex Bennée
  1 sibling, 1 reply; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-29 20:45 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson

On 25/05/16 04:13, Emilio G. Cota wrote:
> diff --git a/tests/qht-bench.c b/tests/qht-bench.c
> new file mode 100644
> index 0000000..30d27c8
> --- /dev/null
> +++ b/tests/qht-bench.c
> @@ -0,0 +1,474 @@
(snip)
> +static void do_rw(struct thread_info *info)
> +{
> +    struct thread_stats *stats = &info->stats;
> +    uint32_t hash;
> +    long *p;
> +
> +    if (info->r >= update_threshold) {
> +        bool read;
> +
> +        p = &keys[info->r & (lookup_range - 1)];
> +        hash = h(*p);
> +        read = qht_lookup(&ht, is_equal, p, hash);
> +        if (read) {
> +            stats->rd++;
> +        } else {
> +            stats->not_rd++;
> +        }
> +    } else {
> +        p = &keys[info->r & (update_range - 1)];
> +        hash = h(*p);

The previous two lines are common to both "if" branches. Let's move
them above the "if".

> +        if (info->write_op) {
> +            bool written = false;
> +
> +            if (qht_lookup(&ht, is_equal, p, hash) == NULL) {
> +                written = qht_insert(&ht, p, hash);
> +            }
> +            if (written) {
> +                stats->in++;
> +            } else {
> +                stats->not_in++;
> +            }
> +        } else {
> +            bool removed = false;
> +
> +            if (qht_lookup(&ht, is_equal, p, hash)) {
> +                removed = qht_remove(&ht, p, hash);
> +            }
> +            if (removed) {
> +                stats->rm++;
> +            } else {
> +                stats->not_rm++;
> +            }
> +        }
> +        info->write_op = !info->write_op;
> +    }
> +}
> +
> +static void *thread_func(void *p)
> +{
> +    struct thread_info *info = p;
> +
> +    while (!atomic_mb_read(&test_start)) {
> +        cpu_relax();
> +    }
> +
> +    rcu_register_thread();

Shouldn't we do this before checking for 'test_start'?

> +
> +    rcu_read_lock();

Why don't we do rcu_read_lock()/rcu_read_unlock() inside the loop?
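
That variant would look like this, with a short read-side critical
section per iteration so grace periods can elapse between operations:

    while (!atomic_read(&test_stop)) {
        info->r = xorshift64star(info->r);
        rcu_read_lock();
        info->func(info);
        rcu_read_unlock();
    }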

> +    while (!atomic_read(&test_stop)) {
> +        info->r = xorshift64star(info->r);
> +        info->func(info);
> +    }
> +    rcu_read_unlock();
> +
> +    rcu_unregister_thread();
> +    return NULL;
> +}
> +
> +/* sets everything except info->func */
> +static void prepare_thread_info(struct thread_info *info, int i)
> +{
> +    /* seed for the RNG; each thread should have a different one */
> +    info->r = (i + 1) ^ time(NULL);
> +    /* the first update will be a write */
> +    info->write_op = true;
> +    /* the first resize will be down */
> +    info->resize_down = true;
> +
> +    memset(&info->stats, 0, sizeof(info->stats));
> +}
> +
> +static void
> +th_create_n(QemuThread **threads, struct thread_info **infos, const char *name,
> +            void (*func)(struct thread_info *), int offset, int n)

'offset' is not used in this function.

> +{
> +    struct thread_info *info;
> +    QemuThread *th;
> +    int i;
> +
> +    th = g_malloc(sizeof(*th) * n);
> +    *threads = th;
> +
> +    info = qemu_memalign(64, sizeof(*info) * n);
> +    *infos = info;
> +
> +    for (i = 0; i < n; i++) {
> +        prepare_thread_info(&info[i], i);
> +        info[i].func = func;
> +        qemu_thread_create(&th[i], name, thread_func, &info[i],
> +                           QEMU_THREAD_JOINABLE);
> +    }
> +}
> +
(snip)
> +
> +static void run_test(void)
> +{
> +    unsigned int remaining;
> +    int i;
> +

Are we sure all the threads are ready at this point? Otherwise why
bother with 'test_start' flag?

> +    atomic_mb_set(&test_start, true);
> +    do {
> +        remaining = sleep(duration);
> +    } while (remaining);
> +    atomic_mb_set(&test_stop, true);
> +
> +    for (i = 0; i < n_rw_threads; i++) {
> +        qemu_thread_join(&rw_threads[i]);
> +    }
> +    for (i = 0; i < n_rz_threads; i++) {
> +        qemu_thread_join(&rz_threads[i]);
> +    }
> +}
> +
>

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v6 13/15] qht: add test-qht-par to invoke qht-bench from 'check' target
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 13/15] qht: add test-qht-par to invoke qht-bench from 'check' target Emilio G. Cota
@ 2016-05-29 20:53   ` Sergey Fedorov
  2016-06-03 11:07     ` Emilio G. Cota
  0 siblings, 1 reply; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-29 20:53 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson

On 25/05/16 04:13, Emilio G. Cota wrote:
> diff --git a/tests/test-qht-par.c b/tests/test-qht-par.c
> new file mode 100644
> index 0000000..fc0cb23
> --- /dev/null
> +++ b/tests/test-qht-par.c
> @@ -0,0 +1,56 @@
(snip)
> +
> +#define TEST_QHT_STRING "tests/qht-bench 1>/dev/null 2>&1 -R -S0.1 -D10000 -N1"
> +
> +static void test_qht(int n_threads, int update_rate, int duration)
> +{
> +    char *str;
> +    int rc;
> +
> +    str = g_strdup_printf(TEST_QHT_STRING "-n %d -u %d -d %d",

There needs to be an extra space either at the beginning of the literal
string, or at the end of the string defined by TEST_QHT_STRING, so that
we don't get "... -N1-n ...".
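
E.g., with the trailing space folded into the macro:

    #define TEST_QHT_STRING "tests/qht-bench 1>/dev/null 2>&1 -R -S0.1 -D10000 -N1 "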

> +                          n_threads, update_rate, duration);
> +    rc = system(str);
> +    g_free(str);
> +    g_assert_cmpint(rc, ==, 0);
> +}
> +
>

Kind regards,
Sergey


* Re: [Qemu-devel] [PATCH v6 14/15] tb hash: track translated blocks with qht
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 14/15] tb hash: track translated blocks with qht Emilio G. Cota
@ 2016-05-29 21:09   ` Sergey Fedorov
  2016-05-31  8:39   ` Alex Bennée
  1 sibling, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-29 21:09 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Alex Bennée, Paolo Bonzini, Richard Henderson

On 25/05/16 04:13, Emilio G. Cota wrote:
> Having a fixed-size hash table for keeping track of all translation blocks
> is suboptimal: some workloads are just too big or too small to get maximum
> performance from the hash table. The MRU promotion policy helps improve
> performance when the hash table is a little undersized, but it cannot
> make up for severely undersized hash tables.
>
> Furthermore, frequent MRU promotions result in writes that are a scalability
> bottleneck. For scalability, lookups should only perform reads, not writes.
> This is not a big deal for now, but it will become one once MTTCG matures.
>
> The appended fixes these issues by using qht as the implementation of
> the TB hash table. This solution is superior to other alternatives considered,
> namely:
>
> - master: implementation in QEMU before this patchset
> - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
> - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
>               MRU is implemented here by adding an intermediate struct
>               that contains the u32 hash and a pointer to the TB; this
>               allows us, on an MRU promotion, to copy said struct (that is not
>               at the head), and put this new copy at the head. After a grace
>               period, the original non-head struct can be eliminated, and
>               after another grace period, freed.
> - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
>                    no MRU for lookups; MRU for inserts.
> The appended solution is the following:
> - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
>                  no MRU for lookups; MRU for inserts.
>
> The plots below compare the considered solutions. The Y axis shows the
> boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
> sweeps the number of buckets (or initial number of buckets for qht-autoresize).
> The plots in PNG format (and with errorbars) can be seen here:
>   http://imgur.com/a/Awgnq
>
> Each test runs 5 times, and the entire QEMU process is pinned to a
> single core for repeatability of results.
>
>                             Host: Intel Xeon E5-2690
>
>   28 ++------------+-------------+-------------+-------------+------------++
>      A*****        +             +             +             master **A*** +
>   27 ++    *                                                 xxhash ##B###++
>      |      A******A******                               xxhash-rcu $$C$$$ |
>   26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
>      D%%$$                              A******A******A*qht-dyn-mru A*E****A
>   25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
>      B#####%                                                               |
>   24 ++    #C$$$$$                                                        ++
>      |      B###  $                                                        |
>      |          ## C$$$$$$                                                 |
>   23 ++           #       C$$$$$$                                         ++
>      |             B######       C$$$$$$                                %%%D
>   22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
>      |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
>   21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
>      +             E@@@   F&&&   +      E@     +      F&&&   +             +
>   20 ++------------+-------------+-------------+-------------+------------++
>      14            16            18            20            22            24
>                              log2 number of buckets
>
>                                  Host: Intel i7-4790K
>
>   14.5 ++------------+------------+-------------+------------+------------++
>        A**           +            +             +            master **A*** +
>     14 ++ **                                                 xxhash ##B###++
>   13.5 ++   **                                           xxhash-rcu $$C$$$++
>        |                                            qht-fixed-nomru %%D%%% |
>     13 ++     A******                                   qht-dyn-mru @@E@@@++
>        |             A*****A******A******             qht-dyn-nomru &&F&&& |
>   12.5 C$$                               A******A******A*****A******    ***A
>     12 ++ $$                                                        A***  ++
>        D%%% $$                                                             |
>   11.5 ++  %%                                                             ++
>        B###  %C$$$$$$                                                      |
>     11 ++  ## D%%%%% C$$$$$                                               ++
>        |     #      %      C$$$$$$                                         |
>   10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
>     10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
>        +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
>    9.5 ++------------+------------+-------------+------------+------------++
>        14            16           18            20           22            24
>                               log2 number of buckets
>
> Note that the original point before this patch series is X=15 for "master";
> the little sensitivity to the increased number of buckets is due to the
> poor hashing function in master.
>
> xxhash-rcu has significant overhead due to the constant churn of allocating
> and deallocating intermediate structs for implementing MRU. An alternative
> would be to consider failed lookups as "maybe not there", and then
> acquire the external lock (tb_lock in this case) to really confirm that
> there was indeed a failed lookup. This, however, would not be enough
> to implement dynamic resizing--this is more complex: see
> "Resizable, Scalable, Concurrent Hash Tables via Relativistic
> Programming" by Triplett, McKenney and Walpole. This solution was
> discarded due to the very coarse RCU read critical sections that we have
> in MTTCG; resizing requires waiting for readers after every pointer update,
> and resizes require many pointer updates, so this would quickly become
> prohibitive.
>
> qht-fixed-nomru shows that MRU promotion is advisable for undersized
> hash tables.
>
> However, qht-dyn-mru shows that MRU promotion is not important if the
> hash table is properly sized: there is virtually no difference in
> performance between qht-dyn-nomru and qht-dyn-mru.
>
> Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
> X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
> can achieve with optimum sizing of the hash table, while keeping the hash
> table scalable for readers.
>
> The improvement we get before and after this patch for booting debian jessie
> with arm-softmmu is:
>
> - Intel Xeon E5-2690: 10.5% less time
> - Intel i7-4790K: 5.2% less time
>
> We could get this same improvement _for this particular workload_ by
> statically increasing the size of the hash table. But this would hurt
> workloads that do not need a large hash table. The dynamic (upward)
> resizing allows us to start small and enlarge the hash table as needed.
>
> A quick note on downsizing: the table is resized back to 2**15 buckets
> on every tb_flush; this makes sense because it is not guaranteed that the
> table will reach the same number of TBs later on (e.g. most bootup code is
> thrown away after boot); it makes sense to grow the hash table as
> more code blocks are translated. This also avoids the complication of
> having to build downsizing hysteresis logic into qht.
>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cpu-exec.c              | 86 ++++++++++++++++++++++++-------------------------
>  include/exec/exec-all.h |  9 +++---
>  include/exec/tb-hash.h  |  3 +-
>  translate-all.c         | 85 ++++++++++++++++++++++--------------------------
>  4 files changed, 86 insertions(+), 97 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 1735032..6a2350d 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -224,57 +224,57 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
>  }
>  #endif
>  
> +struct tb_desc {
> +    target_ulong pc;
> +    target_ulong cs_base;
> +    CPUArchState *env;
> +    tb_page_addr_t phys_page1;
> +    uint32_t flags;
> +};
> +
> +static bool tb_cmp(const void *p, const void *d)
> +{
> +    const TranslationBlock *tb = p;
> +    const struct tb_desc *desc = d;
> +
> +    if (tb->pc == desc->pc &&
> +        tb->page_addr[0] == desc->phys_page1 &&
> +        tb->cs_base == desc->cs_base &&
> +        tb->flags == desc->flags) {
> +        /* check next page if needed */
> +        if (tb->page_addr[1] == -1) {
> +            return true;
> +        } else {
> +            tb_page_addr_t phys_page2;
> +            target_ulong virt_page2;
> +
> +            virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
> +            phys_page2 = get_page_addr_code(desc->env, virt_page2);
> +            if (tb->page_addr[1] == phys_page2) {
> +                return true;
> +            }
> +        }
> +    }
> +    return false;
> +}
> +
>  static TranslationBlock *tb_find_physical(CPUState *cpu,
>                                            target_ulong pc,
>                                            target_ulong cs_base,
>                                            uint32_t flags)
>  {
> -    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
> -    TranslationBlock *tb, **tb_hash_head, **ptb1;
> +    tb_page_addr_t phys_pc;
> +    struct tb_desc desc;
>      uint32_t h;
> -    tb_page_addr_t phys_pc, phys_page1;
>  
> -    /* find translated block using physical mappings */
> -    phys_pc = get_page_addr_code(env, pc);
> -    phys_page1 = phys_pc & TARGET_PAGE_MASK;
> +    desc.env = (CPUArchState *)cpu->env_ptr;
> +    desc.cs_base = cs_base;
> +    desc.flags = flags;
> +    desc.pc = pc;
> +    phys_pc = get_page_addr_code(desc.env, pc);
> +    desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
>      h = tb_hash_func(phys_pc, pc, flags);
> -
> -    /* Start at head of the hash entry */
> -    ptb1 = tb_hash_head = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> -    tb = *ptb1;
> -
> -    while (tb) {
> -        if (tb->pc == pc &&
> -            tb->page_addr[0] == phys_page1 &&
> -            tb->cs_base == cs_base &&
> -            tb->flags == flags) {
> -
> -            if (tb->page_addr[1] == -1) {
> -                /* done, we have a match */
> -                break;
> -            } else {
> -                /* check next page if needed */
> -                target_ulong virt_page2 = (pc & TARGET_PAGE_MASK) +
> -                                          TARGET_PAGE_SIZE;
> -                tb_page_addr_t phys_page2 = get_page_addr_code(env, virt_page2);

get_page_addr_code() can trigger a spurious exception here. However,
this patch is not at fault: the same thing could happen before this
patch and even before this patch series.

Reviewed-by: Sergey Fedorov <serge.fedorov@linaro.org>

Kind regards,
Sergey

> -
> -                if (tb->page_addr[1] == phys_page2) {
> -                    break;
> -                }
> -            }
> -        }
> -
> -        ptb1 = &tb->phys_hash_next;
> -        tb = *ptb1;
> -    }
> -
> -    if (tb) {
> -        /* Move the TB to the head of the list */
> -        *ptb1 = tb->phys_hash_next;
> -        tb->phys_hash_next = *tb_hash_head;
> -        *tb_hash_head = tb;
> -    }
> -    return tb;
> +    return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
>  }
>  
>  static TranslationBlock *tb_find_slow(CPUState *cpu,
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index 85528f9..68e73b6 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -21,6 +21,7 @@
>  #define _EXEC_ALL_H_
>  
>  #include "qemu-common.h"
> +#include "qemu/qht.h"
>  
>  /* allow to see translation results - the slowdown should be negligible, so we leave it */
>  #define DEBUG_DISAS
> @@ -212,8 +213,8 @@ static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
>  
>  #define CODE_GEN_ALIGN           16 /* must be >= of the size of a icache line */
>  
> -#define CODE_GEN_PHYS_HASH_BITS     15
> -#define CODE_GEN_PHYS_HASH_SIZE     (1 << CODE_GEN_PHYS_HASH_BITS)
> +#define CODE_GEN_HTABLE_BITS     15
> +#define CODE_GEN_HTABLE_SIZE     (1 << CODE_GEN_HTABLE_BITS)
>  
>  /* Estimated block size for TB allocation.  */
>  /* ??? The following is based on a 2015 survey of x86_64 host output.
> @@ -250,8 +251,6 @@ struct TranslationBlock {
>  
>      void *tc_ptr;    /* pointer to the translated code */
>      uint8_t *tc_search;  /* pointer to search data */
> -    /* next matching tb for physical address. */
> -    struct TranslationBlock *phys_hash_next;
>      /* original tb when cflags has CF_NOCACHE */
>      struct TranslationBlock *orig_tb;
>      /* first and second physical page containing code. The lower bit
> @@ -296,7 +295,7 @@ typedef struct TBContext TBContext;
>  struct TBContext {
>  
>      TranslationBlock *tbs;
> -    TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
> +    struct qht htable;
>      int nb_tbs;
>      /* any access to the tbs or the page table must use this lock */
>      QemuMutex tb_lock;
> diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
> index 88ccfd1..1d0200b 100644
> --- a/include/exec/tb-hash.h
> +++ b/include/exec/tb-hash.h
> @@ -20,7 +20,6 @@
>  #ifndef EXEC_TB_HASH
>  #define EXEC_TB_HASH
>  
> -#include "exec/exec-all.h"
>  #include "exec/tb-hash-xx.h"
>  
>  /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
> @@ -49,7 +48,7 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
>  static inline
>  uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint32_t flags)
>  {
> -    return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
> +    return tb_hash_func5(phys_pc, pc, flags);
>  }
>  
>  #endif
> diff --git a/translate-all.c b/translate-all.c
> index c48fccb..5357737 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -734,6 +734,13 @@ static inline void code_gen_alloc(size_t tb_size)
>      qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
>  }
>  
> +static void tb_htable_init(void)
> +{
> +    unsigned int mode = QHT_MODE_AUTO_RESIZE;
> +
> +    qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
> +}
> +
>  /* Must be called before using the QEMU cpus. 'tb_size' is the size
>     (in bytes) allocated to the translation buffer. Zero means default
>     size. */
> @@ -741,6 +748,7 @@ void tcg_exec_init(unsigned long tb_size)
>  {
>      cpu_gen_init();
>      page_init();
> +    tb_htable_init();
>      code_gen_alloc(tb_size);
>  #if defined(CONFIG_SOFTMMU)
>      /* There's no guest base to take into account, so go ahead and
> @@ -845,7 +853,7 @@ void tb_flush(CPUState *cpu)
>          cpu->tb_flushed = true;
>      }
>  
> -    memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
> +    qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
>      page_flush_tb();
>  
>      tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
> @@ -856,60 +864,46 @@ void tb_flush(CPUState *cpu)
>  
>  #ifdef DEBUG_TB_CHECK
>  
> -static void tb_invalidate_check(target_ulong address)
> +static void
> +do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
>  {
> -    TranslationBlock *tb;
> -    int i;
> +    TranslationBlock *tb = p;
> +    target_ulong addr = *(target_ulong *)userp;
>  
> -    address &= TARGET_PAGE_MASK;
> -    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
> -        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
> -             tb = tb->phys_hash_next) {
> -            if (!(address + TARGET_PAGE_SIZE <= tb->pc ||
> -                  address >= tb->pc + tb->size)) {
> -                printf("ERROR invalidate: address=" TARGET_FMT_lx
> -                       " PC=%08lx size=%04x\n",
> -                       address, (long)tb->pc, tb->size);
> -            }
> -        }
> +    if (!(addr + TARGET_PAGE_SIZE <= tb->pc || addr >= tb->pc + tb->size)) {
> +        printf("ERROR invalidate: address=" TARGET_FMT_lx
> +               " PC=%08lx size=%04x\n", addr, (long)tb->pc, tb->size);
>      }
>  }
>  
> -/* verify that all the pages have correct rights for code */
> -static void tb_page_check(void)
> +static void tb_invalidate_check(target_ulong address)
>  {
> -    TranslationBlock *tb;
> -    int i, flags1, flags2;
> -
> -    for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
> -        for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
> -                tb = tb->phys_hash_next) {
> -            flags1 = page_get_flags(tb->pc);
> -            flags2 = page_get_flags(tb->pc + tb->size - 1);
> -            if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
> -                printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
> -                       (long)tb->pc, tb->size, flags1, flags2);
> -            }
> -        }
> -    }
> +    address &= TARGET_PAGE_MASK;
> +    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address);
>  }
>  
> -#endif
> -
> -static inline void tb_hash_remove(TranslationBlock **ptb, TranslationBlock *tb)
> +static void
> +do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
>  {
> -    TranslationBlock *tb1;
> +    TranslationBlock *tb = p;
> +    int flags1, flags2;
>  
> -    for (;;) {
> -        tb1 = *ptb;
> -        if (tb1 == tb) {
> -            *ptb = tb1->phys_hash_next;
> -            break;
> -        }
> -        ptb = &tb1->phys_hash_next;
> +    flags1 = page_get_flags(tb->pc);
> +    flags2 = page_get_flags(tb->pc + tb->size - 1);
> +    if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
> +        printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
> +               (long)tb->pc, tb->size, flags1, flags2);
>      }
>  }
>  
> +/* verify that all the pages have correct rights for code */
> +static void tb_page_check(void)
> +{
> +    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
> +}
> +
> +#endif
> +
>  static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
>  {
>      TranslationBlock *tb1;
> @@ -997,7 +991,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>      /* remove the TB from the hash list */
>      phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
>      h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> -    tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
> +    qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
>  
>      /* remove the TB from the page list */
>      if (tb->page_addr[0] != page_addr) {
> @@ -1127,13 +1121,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>                           tb_page_addr_t phys_page2)
>  {
>      uint32_t h;
> -    TranslationBlock **ptb;
>  
>      /* add in the hash table */
>      h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> -    ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> -    tb->phys_hash_next = *ptb;
> -    *ptb = tb;
> +    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
>  
>      /* add in the page list */
>      tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 15/15] translate-all: add tb hash bucket info to 'info jit' dump
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 15/15] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
@ 2016-05-29 21:14   ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-05-29 21:14 UTC (permalink / raw)
  To: Emilio G. Cota, QEMU Developers, MTTCG Devel
  Cc: Paolo Bonzini, Alex Bennée, Richard Henderson

On 25/05/16 04:13, Emilio G. Cota wrote:
> Examples:
>
> - Good hashing, i.e. tb_hash_func5(phys_pc, pc, flags):
> TB count            715135/2684354
> [...]
> TB hash buckets     388775/524288 (74.15% head buckets used)
> TB hash occupancy   33.04% avg chain occ. Histogram: [0,10)%|▆ █  ▅▁▃▁▁|[90,100]%
> TB hash avg chain   1.017 buckets. Histogram: 1|█▁▁|3
>
> - Not-so-good hashing, i.e. tb_hash_func5(phys_pc, pc, 0):
> TB count            712636/2684354
> [...]
> TB hash buckets     344924/524288 (65.79% head buckets used)
> TB hash occupancy   31.64% avg chain occ. Histogram: [0,10)%|█ ▆  ▅▁▃▁▂|[90,100]%
> TB hash avg chain   1.047 buckets. Histogram: 1|█▁▁▁|4
>
> - Bad hashing, i.e. tb_hash_func5(phys_pc, 0, 0):
> TB count            702818/2684354
> [...]
> TB hash buckets     112741/524288 (21.50% head buckets used)
> TB hash occupancy   10.15% avg chain occ. Histogram: [0,10)%|█ ▁  ▁▁▁▁▁|[90,100]%
> TB hash avg chain   2.107 buckets. Histogram: [1.0,10.2)|█▁▁▁▁▁▁▁▁▁|[83.8,93.0]
>
> - Good hashing, but no auto-resize:
> TB count            715634/2684354
> TB hash buckets     8192/8192 (100.00% head buckets used)
> TB hash occupancy   98.30% avg chain occ. Histogram: [95.3,95.8)%|▁▁▃▄▃▄▁▇▁█|[99.5,100.0]%
> TB hash avg chain   22.070 buckets. Histogram: [15.0,16.7)|▁▂▅▄█▅▁▁▁▁|[30.3,32.0]

I personally don't see much use of the histogram labels. However,

Acked-by: Sergey Fedorov <sergey.fedorov@linaro.org>

>
> Suggested-by: Richard Henderson <rth@twiddle.net>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  translate-all.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
>
> diff --git a/translate-all.c b/translate-all.c
> index 5357737..c8074cf 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -1667,6 +1667,10 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>      int i, target_code_size, max_target_code_size;
>      int direct_jmp_count, direct_jmp2_count, cross_page;
>      TranslationBlock *tb;
> +    struct qht_stats hst;
> +    uint32_t hgram_opts;
> +    size_t hgram_bins;
> +    char *hgram;
>  
>      target_code_size = 0;
>      max_target_code_size = 0;
> @@ -1717,6 +1721,38 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>                  direct_jmp2_count,
>                  tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp2_count * 100) /
>                          tcg_ctx.tb_ctx.nb_tbs : 0);
> +
> +    qht_statistics_init(&tcg_ctx.tb_ctx.htable, &hst);
> +
> +    cpu_fprintf(f, "TB hash buckets     %zu/%zu (%0.2f%% head buckets used)\n",
> +                hst.used_head_buckets, hst.head_buckets,
> +                (double)hst.used_head_buckets / hst.head_buckets * 100);
> +
> +    hgram_opts =  QDIST_PR_BORDER | QDIST_PR_LABELS;
> +    hgram_opts |= QDIST_PR_100X   | QDIST_PR_PERCENT;
> +    if (qdist_xmax(&hst.occupancy) - qdist_xmin(&hst.occupancy) == 1) {
> +        hgram_opts |= QDIST_PR_NODECIMAL;
> +    }
> +    hgram = qdist_pr(&hst.occupancy, 10, hgram_opts);
> +    cpu_fprintf(f, "TB hash occupancy   %0.2f%% avg chain occ. Histogram: %s\n",
> +                qdist_avg(&hst.occupancy) * 100, hgram);
> +    g_free(hgram);
> +
> +    hgram_opts = QDIST_PR_BORDER | QDIST_PR_LABELS;
> +    hgram_bins = qdist_xmax(&hst.chain) - qdist_xmin(&hst.chain);
> +    if (hgram_bins > 10) {
> +        hgram_bins = 10;
> +    } else {
> +        hgram_bins = 0;
> +        hgram_opts |= QDIST_PR_NODECIMAL | QDIST_PR_NOBINRANGE;
> +    }
> +    hgram = qdist_pr(&hst.chain, hgram_bins, hgram_opts);
> +    cpu_fprintf(f, "TB hash avg chain   %0.3f buckets. Histogram: %s\n",
> +                qdist_avg(&hst.chain), hgram);
> +    g_free(hgram);
> +
> +    qht_statistics_destroy(&hst);
> +
>      cpu_fprintf(f, "\nStatistics:\n");
>      cpu_fprintf(f, "TB flush count      %d\n", tcg_ctx.tb_ctx.tb_flush_count);
>      cpu_fprintf(f, "TB invalidate count %d\n",

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-29 19:52   ` Sergey Fedorov
  2016-05-29 19:55     ` Sergey Fedorov
@ 2016-05-31  7:46     ` Alex Bennée
  2016-06-01 20:53       ` Sergey Fedorov
  2016-06-03  9:18     ` Emilio G. Cota
  2016-06-03 11:01     ` Emilio G. Cota
  3 siblings, 1 reply; 63+ messages in thread
From: Alex Bennée @ 2016-05-31  7:46 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: Emilio G. Cota, QEMU Developers, MTTCG Devel, Paolo Bonzini,
	Richard Henderson


Sergey Fedorov <serge.fdrv@gmail.com> writes:

> On 25/05/16 04:13, Emilio G. Cota wrote:
>> diff --git a/include/qemu/qht.h b/include/qemu/qht.h
>> new file mode 100644
>> index 0000000..aec60aa
>> --- /dev/null
>> +++ b/include/qemu/qht.h
>> @@ -0,0 +1,183 @@
> (snip)
>> +/**
>> + * qht_init - Initialize a QHT
>> + * @ht: QHT to be initialized
>> + * @n_elems: number of entries the hash table should be optimized for.
>> + * @mode: bitmask with OR'ed QHT_MODE_*
>> + */
>> +void qht_init(struct qht *ht, size_t n_elems, unsigned int mode);
>
> First of all, thank you for spending your time on the documentation of
> the API!
>
> I was just wondering if it could be worthwhile to pass a hash function
> when initializing a QHT. Then we could have variants of qht_insert(),
> qht_remove() and qht_lookup() which do not require a computed hash
> value but call the function by themselves. This could make sense since a
> hash value passed to the functions should always be exactly the same
> for the same object.

Wouldn't this be for an expansion of the API when we actually have
something that would use it that way?

>
> (snip)
>> +/**
>> + * qht_remove - remove a pointer from the hash table
>> + * @ht: QHT to remove from
>> + * @p: pointer to be removed
>> + * @hash: hash corresponding to @p
>> + *
>> + * Attempting to remove a NULL @p is a bug.
>> + *
>> + * Just-removed @p pointers cannot be immediately freed; they need to remain
>> + * valid until the end of the RCU grace period in which qht_remove() is called.
>> + * This guarantees that concurrent lookups will always compare against valid
>> + * data.
>
> Mention rcu_call1()/call_rcu()/g_free_rcu()?
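>
> For instance, a caller could defer the free along these lines (a
> minimal sketch; 'struct entry' and its embedded rcu_head field are
> made up for illustration):
>
>     struct entry {
>         struct rcu_head rcu; /* used by g_free_rcu() to defer the free */
>         long key;
>     };
>
>     if (qht_remove(&ht, e, hash)) {
>         /* e stays valid for concurrent readers until the grace period ends */
>         g_free_rcu(e, rcu);
>     }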
>
>> + *
>> + * Returns true on success.
>> + * Returns false if the @p-@hash pair was not found.
>> + */
>> +bool qht_remove(struct qht *ht, const void *p, uint32_t hash);
>> +
> (snip)
>> diff --git a/util/qht.c b/util/qht.c
>> new file mode 100644
>> index 0000000..ca5a620
>> --- /dev/null
>> +++ b/util/qht.c
>> @@ -0,0 +1,837 @@
> (snip)
>> +/* trigger a resize when n_added_buckets > n_buckets / div */
>> +#define QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV 8
>
> Just out of curiosity, how did you get this number?
>
>> +
>> +static void qht_do_resize(struct qht *ht, struct qht_map *new);
>> +static void qht_grow_maybe(struct qht *ht);
>
> qht_grow_maybe() is used just once. Please consider reordering of
> definitions and removing this forward declaration.
>
> (snip)
>> +
>> +/* call with head->lock held */
>> +static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
>> +                               struct qht_bucket *head, void *p, uint32_t hash,
>> +                               bool *needs_resize)
>> +{
>> +    struct qht_bucket *b = head;
>> +    struct qht_bucket *prev = NULL;
>> +    struct qht_bucket *new = NULL;
>> +    int i;
>> +
>> +    for (;;) {
>> +        if (b == NULL) {
>> +            b = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*b));
>> +            memset(b, 0, sizeof(*b));
>> +            new = b;
>> +            atomic_inc(&map->n_added_buckets);
>> +            if (unlikely(qht_map_needs_resize(map)) && needs_resize) {
>> +                *needs_resize = true;
>> +            }
>> +        }
>> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
>> +            if (b->pointers[i]) {
>> +                if (unlikely(b->pointers[i] == p)) {
>> +                    return false;
>> +                }
>> +                continue;
>> +            }
>> +            /* found an empty key: acquire the seqlock and write */
>> +            seqlock_write_begin(&head->sequence);
>> +            if (new) {
>> +                atomic_rcu_set(&prev->next, b);
>> +            }
>> +            b->hashes[i] = hash;
>> +            atomic_set(&b->pointers[i], p);
>> +            seqlock_write_end(&head->sequence);
>> +            return true;
>> +        }
>> +        prev = b;
>> +        b = b->next;
>> +    }
>> +}
>
> Here is my attempt:
>
> static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
>                                struct qht_bucket *head, void *p, uint32_t hash,
>                                bool *needs_resize)
> {
>     struct qht_bucket **bpp = &head, *new;
>     int i = 0;
>
>     do {
>         while (i < QHT_BUCKET_ENTRIES) {
>             if ((*bpp)->pointers[i]) {
>                 if (unlikely((*bpp)->pointers[i] == p)) {
>                     return false;
>                 }
>                 i++;
>                 continue;
>             }
>             goto found;
>         }
>         bpp = &(*bpp)->next;
>         i = 0;
>     } while (*bpp);
>
>     new = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*new));
>     memset(new, 0, sizeof(*new));
>     atomic_rcu_set(bpp, new);
>     atomic_inc(&map->n_added_buckets);
>     if (unlikely(qht_map_needs_resize(map)) && needs_resize) {
>         *needs_resize = true;
>     }
> found:
>     /* found an empty key: acquire the seqlock and write */
>     seqlock_write_begin(&head->sequence);
>     (*bpp)->hashes[i] = hash;
>     atomic_set(&(*bpp)->pointers[i], p);
>     seqlock_write_end(&head->sequence);
>     return true;
> }
>
> Feel free to use it as you wish.
>
>>
> (snip)
>> +/*
>> + * Find the last valid entry in @head, and swap it with @orig[pos], which has
>> + * just been invalidated.
>> + */
>> +static inline void qht_bucket_fill_hole(struct qht_bucket *orig, int pos)
>> +{
>> +    struct qht_bucket *b = orig;
>> +    struct qht_bucket *prev = NULL;
>> +    int i;
>> +
>> +    if (qht_entry_is_last(orig, pos)) {
>> +        orig->hashes[pos] = 0;
>> +        atomic_set(&orig->pointers[pos], NULL);
>> +        return;
>> +    }
>> +    do {
>> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
>> +            if (b->pointers[i]) {
>> +                continue;
>> +            }
>> +            if (i > 0) {
>> +                return qht_entry_move(orig, pos, b, i - 1);
>> +            }
>> +            qht_debug_assert(prev);
>
> 'prev' can be NULL if this is the first iteration.
>
>> +            return qht_entry_move(orig, pos, prev, QHT_BUCKET_ENTRIES - 1);
>> +        }
>> +        prev = b;
>> +        b = b->next;
>> +    } while (b);
>> +    /* no free entries other than orig[pos], so swap it with the last one */
>> +    qht_entry_move(orig, pos, prev, QHT_BUCKET_ENTRIES - 1);
>> +}
>> +
>> +/* call with b->lock held */
>> +static inline
>> +bool qht_remove__locked(struct qht_map *map, struct qht_bucket *head,
>> +                        const void *p, uint32_t hash)
>> +{
>> +    struct qht_bucket *b = head;
>> +    int i;
>> +
>> +    do {
>> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
>> +            void *q = b->pointers[i];
>> +
>> +            if (unlikely(q == NULL)) {
>> +                return false;
>> +            }
>> +            if (q == p) {
>> +                qht_debug_assert(b->hashes[i] == hash);
>> +                seqlock_write_begin(&head->sequence);
>> +                qht_bucket_fill_hole(b, i);
>
> "Fill hole" doesn't correspond to the function's new job since there's
> no hole. "Remove entry" would make more sense, I think.
>
>> +                seqlock_write_end(&head->sequence);
>> +                return true;
>> +            }
>> +        }
>> +        b = b->next;
>> +    } while (b);
>> +    return false;
>> +}
> (snip)
>> +/*
>> + * Call with ht->lock and all bucket locks held.
>> + *
>> + * Creating the @new map here would add unnecessary delay while all the locks
>> + * are held--holding up the bucket locks is particularly bad, since no writes
>> + * can occur while these are held. Thus, we let callers create the new map,
>> + * hopefully without the bucket locks held.
>> + */
>> +static void qht_do_resize(struct qht *ht, struct qht_map *new)
>> +{
>> +    struct qht_map *old;
>> +
>> +    old = ht->map;
>> +    g_assert_cmpuint(new->n_buckets, !=, old->n_buckets);
>> +
>> +    qht_map_iter__all_locked(ht, old, qht_map_copy, new);
>> +    qht_map_debug__all_locked(new);
>> +
>> +    atomic_rcu_set(&ht->map, new);
>> +    call_rcu1(&old->rcu, qht_map_reclaim);
>
> call_rcu() macro is a more convenient way to do this and you wouldn't
> need qht_map_reclaim().
>
>> +}
>
> Kind regards,
> Sergey


--
Alex Bennée

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 14/15] tb hash: track translated blocks with qht
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 14/15] tb hash: track translated blocks with qht Emilio G. Cota
  2016-05-29 21:09   ` Sergey Fedorov
@ 2016-05-31  8:39   ` Alex Bennée
  1 sibling, 0 replies; 63+ messages in thread
From: Alex Bennée @ 2016-05-31  8:39 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> Having a fixed-size hash table for keeping track of all translation blocks
> is suboptimal: some workloads are just too big or too small to get maximum
<snip>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  cpu-exec.c              | 86 ++++++++++++++++++++++++-------------------------
>  include/exec/exec-all.h |  9 +++---
>  include/exec/tb-hash.h  |  3 +-
>  translate-all.c         | 85 ++++++++++++++++++++++--------------------------
>  4 files changed, 86 insertions(+), 97 deletions(-)

There are some conflicts with master here due to the move of TBContext.

--
Alex Bennée

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 12/15] qht: add qht-bench, a performance benchmark
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 12/15] qht: add qht-bench, a performance benchmark Emilio G. Cota
  2016-05-29 20:45   ` Sergey Fedorov
@ 2016-05-31 15:12   ` Alex Bennée
  2016-05-31 16:44     ` Emilio G. Cota
  1 sibling, 1 reply; 63+ messages in thread
From: Alex Bennée @ 2016-05-31 15:12 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> This serves as a performance benchmark as well as a stress test
> for QHT. We can tweak quite a number of things, including the
> number of resize threads and how frequently resizes are triggered.
>
> A performance comparison of QHT vs CLHT[1] and ck_hs[2] using
> this same benchmark program can be found here:
>   http://imgur.com/a/0Bms4
>
> The tests are run on a 64-core AMD Opteron 6376.

It would be useful to include the template of the command line arguments
for these plots here. For example when I run it without any arguments I
get:


Results:
 Read:              34.21 M (100.00% of 34.21M)
 Inserted:          0.00 M (-nan% of 0.00M)
 Removed:           0.00 M (-nan% of 0.00M)
 Throughput:        34.21 MT/s
 Throughput/thread: 34.21 MT/s/thread

Looking at the graph it says 200k keys, so:

$ ./tests/qht-bench -k 200000
qht-bench: tests/qht-bench.c:309: htable_init: Assertion `init_size <= init_range `

So I'm a little confused on how I'm running this benchmark.

>
> Note that ck_hs's performance drops significantly as writes go
> up, since it requires an external lock (I used a ck_spinlock)
> around every write.
>
> Also, note that CLHT instead of using a seqlock, relies on an
> allocator that does not ever return the same address during the
> same read-critical section. This gives it a slight performance
> advantage over QHT on read-heavy workloads, since the seqlock
> writes aren't there.
>
> [1] CLHT: https://github.com/LPD-EPFL/CLHT
>           https://infoscience.epfl.ch/record/207109/files/ascy_asplos15.pdf
>
> [2] ck_hs: http://concurrencykit.org/
>            http://backtrace.io/blog/blog/2015/03/13/workload-specialization/
>
> A few of those plots are shown in text here, since that site
> might not be online forever. Throughput is on Mops/s on the Y axis.
>
>                              200K keys, 0 % updates
>
>   450 ++--+------+------+-------+-------+-------+-------+------+-------+--++
>       |   +      +      +       +       +       +       +      +      +N+  |
>   400 ++                                                           ---+E+ ++
>       |                                                       +++----      |
>   350 ++          9 ++------+------++                       --+E+    -+H+ ++
>       |             |      +H+-     |                 -+N+----   ---- +++  |
>   300 ++          8 ++     +E+     ++             -----+E+  --+H+         ++
>       |             |      +++      |         -+N+-----+H+--               |
>   250 ++          7 ++------+------++  +++-----+E+----                    ++
>   200 ++                    1         -+E+-----+H+                        ++
>       |                           ----                     qht +-E--+      |
>   150 ++                      -+E+                        clht +-H--+     ++
>       |                   ----                              ck +-N--+      |
>   100 ++               +E+                                                ++
>       |            ----                                                    |
>    50 ++       -+E+                                                       ++
>       |   +E+E+  +      +       +       +       +       +      +       +   |
>     0 ++--E------+------+-------+-------+-------+-------+------+-------+--++
>           1      8      16      24      32      40      48     56      64
>                                 Number of threads
>
>                              200K keys, 1 % updates
>
>   350 ++--+------+------+-------+-------+-------+-------+------+-------+--++
>       |   +      +      +       +       +       +       +      +     -+E+  |
>   300 ++                                                         -----+H+ ++
>       |                                                       +E+--        |
>       |           9 ++------+------++                  +++----             |
>   250 ++            |      +E+   -- |                 -+E+                ++
>       |           8 ++         --  ++             ----                     |
>   200 ++            |      +++-     |  +++  ---+E+                        ++
>       |           7 ++------N------++ -+E+--               qht +-E--+      |
>       |                     1  +++----                    clht +-H--+      |
>   150 ++                      -+E+                          ck +-N--+     ++
>       |                   ----                                             |
>   100 ++               +E+                                                ++
>       |            ----                                                    |
>       |        -+E+                                                        |
>    50 ++    +H+-+N+----+N+-----+N+------                                  ++
>       |   +E+E+  +      +       +      +N+-----+N+-----+N+----+N+-----+N+  |
>     0 ++--E------+------+-------+-------+-------+-------+------+-------+--++
>           1      8      16      24      32      40      48     56      64
>                                 Number of threads
>
>                              200K keys, 20 % updates
>
>   300 ++--+------+------+-------+-------+-------+-------+------+-------+--++
>       |   +      +      +       +       +       +       +      +       +   |
>       |                                                              -+H+  |
>   250 ++                                                         ----     ++
>       |           9 ++------+------++                       --+H+  ---+E+  |
>       |           8 ++     +H+--   ++                 -+H+----+E+--        |
>   200 ++            |      +E+    --|             -----+E+--  +++         ++
>       |           7 ++      + ---- ++       ---+H+---- +++ qht +-E--+      |
>   150 ++          6 ++------N------++ -+H+-----+E+        clht +-H--+     ++
>       |                     1     -----+E+--                ck +-N--+      |
>       |                       -+H+----                                     |
>   100 ++                  -----+E+                                        ++
>       |                +E+--                                               |
>       |            ----+++                                                 |
>    50 ++       -+E+                                                       ++
>       |     +E+ +++                                                        |
>       |   +E+N+-+N+-----+       +       +       +       +      +       +   |
>     0 ++--E------+------N-------N-------N-------N-------N------N-------N--++
>           1      8      16      24      32      40      48     56      64
>                                 Number of threads
>
>                             200K keys, 100 % updates       qht +-E--+
>                                                           clht +-H--+
>   160 ++--+------+------+-------+-------+-------+-------+---ck-+-N-----+--++
>       |   +      +      +       +       +       +       +      +   ----H   |
>   140 ++                                                      +H+--  -+E+ ++
>       |                                                +++----   ----      |
>   120 ++          8 ++------+------++                 -+H+    +E+         ++
>       |           7 ++     +H+---- ++             ---- +++----             |
>   100 ++            |      +E+      |  +++  ---+H+    -+E+                ++
>       |           6 ++     +++     ++ -+H+--   +++----                     |
>    80 ++          5 ++------N----------+E+-----+E+                        ++
>       |                     1 -+H+---- +++                                 |
>       |                   -----+E+                                         |
>    60 ++               +H+---- +++                                        ++
>       |            ----+E+                                                 |
>    40 ++        +H+----                                                   ++
>       |       --+E+                                                        |
>    20 ++    +E+                                                           ++
>       |  +EE+    +      +       +       +       +       +      +       +   |
>     0 ++--+N-N---N------N-------N-------N-------N-------N------N-------N--++
>           1      8      16      24      32      40      48     56      64
>                                 Number of threads
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  tests/.gitignore  |   1 +
>  tests/Makefile    |   3 +-
>  tests/qht-bench.c | 474 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 477 insertions(+), 1 deletion(-)
>  create mode 100644 tests/qht-bench.c
>
> diff --git a/tests/.gitignore b/tests/.gitignore
> index ffde5d2..d19023e 100644
> --- a/tests/.gitignore
> +++ b/tests/.gitignore
> @@ -7,6 +7,7 @@ check-qnull
>  check-qstring
>  check-qom-interface
>  check-qom-proplist
> +qht-bench
>  rcutorture
>  test-aio
>  test-base64
> diff --git a/tests/Makefile b/tests/Makefile
> index 8589b11..176bbd8 100644
> --- a/tests/Makefile
> +++ b/tests/Makefile
> @@ -398,7 +398,7 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
>  	tests/test-opts-visitor.o tests/test-qmp-event.o \
>  	tests/rcutorture.o tests/test-rcu-list.o \
>  	tests/test-qdist.o \
> -	tests/test-qht.o
> +	tests/test-qht.o tests/qht-bench.o
>
>  $(test-obj-y): QEMU_INCLUDES += -Itests
>  QEMU_CFLAGS += -I$(SRC_PATH)/tests
> @@ -439,6 +439,7 @@ tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
>  tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
>  tests/test-qdist$(EXESUF): tests/test-qdist.o $(test-util-obj-y)
>  tests/test-qht$(EXESUF): tests/test-qht.o $(test-util-obj-y)
> +tests/qht-bench$(EXESUF): tests/qht-bench.o $(test-util-obj-y)
>
>  tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
>  	hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
> diff --git a/tests/qht-bench.c b/tests/qht-bench.c
> new file mode 100644
> index 0000000..30d27c8
> --- /dev/null
> +++ b/tests/qht-bench.c
> @@ -0,0 +1,474 @@
> +/*
> + * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
> + *
> + * License: GNU GPL, version 2 or later.
> + *   See the COPYING file in the top-level directory.
> + */
> +#include "qemu/osdep.h"
> +#include <glib.h>
> +#include "qemu/processor.h"
> +#include "qemu/atomic.h"
> +#include "qemu/qht.h"
> +#include "qemu/rcu.h"
> +#include "exec/tb-hash-xx.h"
> +
> +struct thread_stats {
> +    size_t rd;
> +    size_t not_rd;
> +    size_t in;
> +    size_t not_in;
> +    size_t rm;
> +    size_t not_rm;
> +    size_t rz;
> +    size_t not_rz;
> +};
> +
> +struct thread_info {
> +    void (*func)(struct thread_info *);
> +    struct thread_stats stats;
> +    uint64_t r;
> +    bool write_op; /* writes alternate between insertions and removals */
> +    bool resize_down;
> +} QEMU_ALIGNED(64); /* avoid false sharing among threads */
> +
> +static struct qht ht;
> +static QemuThread *rw_threads;
> +
> +#define DEFAULT_RANGE (4096)
> +#define DEFAULT_QHT_N_ELEMS DEFAULT_RANGE
> +
> +static unsigned int duration = 1;
> +static unsigned int n_rw_threads = 1;
> +static unsigned long lookup_range = DEFAULT_RANGE;
> +static unsigned long update_range = DEFAULT_RANGE;
> +static size_t init_range = DEFAULT_RANGE;
> +static size_t init_size = DEFAULT_RANGE;
> +static long populate_offset;
> +static long *keys;
> +
> +static size_t resize_min;
> +static size_t resize_max;
> +static struct thread_info *rz_info;
> +static unsigned long resize_delay = 1000;
> +static double resize_rate; /* 0.0 to 1.0 */
> +static unsigned int n_rz_threads = 1;
> +static QemuThread *rz_threads;
> +
> +static double update_rate; /* 0.0 to 1.0 */
> +static uint64_t update_threshold;
> +static uint64_t resize_threshold;
> +
> +static size_t qht_n_elems = DEFAULT_QHT_N_ELEMS;
> +static int qht_mode;
> +
> +static bool test_start;
> +static bool test_stop;
> +
> +static struct thread_info *rw_info;
> +
> +static const char commands_string[] =
> +    " -d = duration, in seconds\n"
> +    " -n = number of threads\n"
> +    "\n"
> +    " -k = initial number of keys\n"
> +    " -o = offset at which keys start\n"
> +    " -K = initial range of keys (will be rounded up to pow2)\n"
> +    " -l = lookup range of keys (will be rounded up to pow2)\n"
> +    " -r = update range of keys (will be rounded up to pow2)\n"
> +    "\n"
> +    " -u = update rate (0.0 to 100.0), 50/50 split of insertions/removals\n"
> +    "\n"
> +    " -s = initial size hint\n"
> +    " -R = enable auto-resize\n"
> +    " -S = resize rate (0.0 to 100.0)\n"
> +    " -D = delay (in us) between potential resizes\n"
> +    " -N = number of resize threads";
> +
> +static void usage_complete(int argc, char *argv[])
> +{
> +    fprintf(stderr, "Usage: %s [options]\n", argv[0]);
> +    fprintf(stderr, "options:\n%s\n", commands_string);
> +    exit(-1);
> +}
> +
> +static bool is_equal(const void *obj, const void *userp)
> +{
> +    const long *a = obj;
> +    const long *b = userp;
> +
> +    return *a == *b;
> +}
> +
> +static inline uint32_t h(unsigned long v)
> +{
> +    return tb_hash_func5(v, 0, 0);
> +}
> +
> +/*
> + * From: https://en.wikipedia.org/wiki/Xorshift
> + * This is faster than rand_r(), and gives us a wider range (RAND_MAX is only
> + * guaranteed to be >= INT_MAX).
> + */
> +static uint64_t xorshift64star(uint64_t x)
> +{
> +    x ^= x >> 12; /* a */
> +    x ^= x << 25; /* b */
> +    x ^= x >> 27; /* c */
> +    return x * UINT64_C(2685821657736338717);
> +}
> +
> +static void do_rz(struct thread_info *info)
> +{
> +    struct thread_stats *stats = &info->stats;
> +
> +    if (info->r < resize_threshold) {
> +        size_t size = info->resize_down ? resize_min : resize_max;
> +        bool resized;
> +
> +        resized = qht_resize(&ht, size);
> +        info->resize_down = !info->resize_down;
> +
> +        if (resized) {
> +            stats->rz++;
> +        } else {
> +            stats->not_rz++;
> +        }
> +    }
> +    g_usleep(resize_delay);
> +}
> +
> +static void do_rw(struct thread_info *info)
> +{
> +    struct thread_stats *stats = &info->stats;
> +    uint32_t hash;
> +    long *p;
> +
> +    if (info->r >= update_threshold) {
> +        bool read;
> +
> +        p = &keys[info->r & (lookup_range - 1)];
> +        hash = h(*p);
> +        read = qht_lookup(&ht, is_equal, p, hash);
> +        if (read) {
> +            stats->rd++;
> +        } else {
> +            stats->not_rd++;
> +        }
> +    } else {
> +        p = &keys[info->r & (update_range - 1)];
> +        hash = h(*p);
> +        if (info->write_op) {
> +            bool written = false;
> +
> +            if (qht_lookup(&ht, is_equal, p, hash) == NULL) {
> +                written = qht_insert(&ht, p, hash);
> +            }
> +            if (written) {
> +                stats->in++;
> +            } else {
> +                stats->not_in++;
> +            }
> +        } else {
> +            bool removed = false;
> +
> +            if (qht_lookup(&ht, is_equal, p, hash)) {
> +                removed = qht_remove(&ht, p, hash);
> +            }
> +            if (removed) {
> +                stats->rm++;
> +            } else {
> +                stats->not_rm++;
> +            }
> +        }
> +        info->write_op = !info->write_op;
> +    }
> +}
> +
> +static void *thread_func(void *p)
> +{
> +    struct thread_info *info = p;
> +
> +    while (!atomic_mb_read(&test_start)) {
> +        cpu_relax();
> +    }
> +
> +    rcu_register_thread();
> +
> +    rcu_read_lock();
> +    while (!atomic_read(&test_stop)) {
> +        info->r = xorshift64star(info->r);
> +        info->func(info);
> +    }
> +    rcu_read_unlock();
> +
> +    rcu_unregister_thread();
> +    return NULL;
> +}
> +
> +/* sets everything except info->func */
> +static void prepare_thread_info(struct thread_info *info, int i)
> +{
> +    /* seed for the RNG; each thread should have a different one */
> +    info->r = (i + 1) ^ time(NULL);
> +    /* the first update will be a write */
> +    info->write_op = true;
> +    /* the first resize will be down */
> +    info->resize_down = true;
> +
> +    memset(&info->stats, 0, sizeof(info->stats));
> +}
> +
> +static void
> +th_create_n(QemuThread **threads, struct thread_info **infos, const char *name,
> +            void (*func)(struct thread_info *), int offset, int n)
> +{
> +    struct thread_info *info;
> +    QemuThread *th;
> +    int i;
> +
> +    th = g_malloc(sizeof(*th) * n);
> +    *threads = th;
> +
> +    info = qemu_memalign(64, sizeof(*info) * n);
> +    *infos = info;
> +
> +    for (i = 0; i < n; i++) {
> +        prepare_thread_info(&info[i], i);
> +        info[i].func = func;
> +        qemu_thread_create(&th[i], name, thread_func, &info[i],
> +                           QEMU_THREAD_JOINABLE);
> +    }
> +}
> +
> +static void create_threads(void)
> +{
> +    th_create_n(&rw_threads, &rw_info, "rw", do_rw, 0, n_rw_threads);
> +    th_create_n(&rz_threads, &rz_info, "rz", do_rz, n_rw_threads, n_rz_threads);
> +}
> +
> +static void pr_params(void)
> +{
> +    printf("Parameters:\n");
> +    printf(" duration:          %d s\n", duration);
> +    printf(" # of threads:      %u\n", n_rw_threads);
> +    printf(" initial # of keys: %zu\n", init_size);
> +    printf(" initial size hint: %zu\n", qht_n_elems);
> +    printf(" auto-resize:       %s\n",
> +           qht_mode & QHT_MODE_AUTO_RESIZE ? "on" : "off");
> +    if (resize_rate) {
> +        printf(" resize_rate:       %f%%\n", resize_rate * 100.0);
> +        printf(" resize range:      %zu-%zu\n", resize_min, resize_max);
> +        printf(" # resize threads   %u\n", n_rz_threads);
> +    }
> +    printf(" update rate:       %f%%\n", update_rate * 100.0);
> +    printf(" offset:            %ld\n", populate_offset);
> +    printf(" initial key range: %zu\n", init_range);
> +    printf(" lookup range:      %zu\n", lookup_range);
> +    printf(" update range:      %zu\n", update_range);
> +}
> +
> +static void do_threshold(double rate, uint64_t *threshold)
> +{
> +    if (rate == 1.0) {
> +        *threshold = UINT64_MAX;
> +    } else {
> +        *threshold = rate * UINT64_MAX;
> +    }
> +}
> +
> +static void htable_init(void)
> +{
> +    unsigned long n = MAX(init_range, update_range);
> +    uint64_t r = time(NULL);
> +    size_t retries = 0;
> +    size_t i;
> +
> +    /* avoid allocating memory later by allocating all the keys now */
> +    keys = g_malloc(sizeof(*keys) * n);
> +    for (i = 0; i < n; i++) {
> +        keys[i] = populate_offset + i;
> +    }
> +
> +    /* some sanity checks */
> +    g_assert_cmpuint(lookup_range, <=, n);
> +
> +    /* compute thresholds */
> +    do_threshold(update_rate, &update_threshold);
> +    do_threshold(resize_rate, &resize_threshold);
> +
> +    if (resize_rate) {
> +        resize_min = n / 2;
> +        resize_max = n;
> +        assert(resize_min < resize_max);
> +    } else {
> +        n_rz_threads = 0;
> +    }
> +
> +    /* initialize the hash table */
> +    qht_init(&ht, qht_n_elems, qht_mode);
> +    assert(init_size <= init_range);
> +
> +    pr_params();
> +
> +    fprintf(stderr, "Initialization: populating %zu items...", init_size);
> +    for (i = 0; i < init_size; i++) {
> +        for (;;) {
> +            uint32_t hash;
> +            long *p;
> +
> +            r = xorshift64star(r);
> +            p = &keys[r & (init_range - 1)];
> +            hash = h(*p);
> +            if (qht_insert(&ht, p, hash)) {
> +                break;
> +            }
> +            retries++;
> +        }
> +    }
> +    fprintf(stderr, " populated after %zu retries\n", retries);
> +}
> +
> +static void add_stats(struct thread_stats *s, struct thread_info *info, int n)
> +{
> +    int i;
> +
> +    for (i = 0; i < n; i++) {
> +        struct thread_stats *stats = &info[i].stats;
> +
> +        s->rd += stats->rd;
> +        s->not_rd += stats->not_rd;
> +
> +        s->in += stats->in;
> +        s->not_in += stats->not_in;
> +
> +        s->rm += stats->rm;
> +        s->not_rm += stats->not_rm;
> +
> +        s->rz += stats->rz;
> +        s->not_rz += stats->not_rz;
> +    }
> +}
> +
> +static void pr_stats(void)
> +{
> +    struct thread_stats s = {};
> +    double tx;
> +
> +    add_stats(&s, rw_info, n_rw_threads);
> +    add_stats(&s, rz_info, n_rz_threads);
> +
> +    printf("Results:\n");
> +
> +    if (resize_rate) {
> +        printf(" Resizes:           %zu (%.2f%% of %zu)\n",
> +               s.rz, (double)s.rz / (s.rz + s.not_rz) * 100, s.rz + s.not_rz);
> +    }
> +
> +    printf(" Read:              %.2f M (%.2f%% of %.2fM)\n",
> +           (double)s.rd / 1e6,
> +           (double)s.rd / (s.rd + s.not_rd) * 100,
> +           (double)(s.rd + s.not_rd) / 1e6);
> +    printf(" Inserted:          %.2f M (%.2f%% of %.2fM)\n",
> +           (double)s.in / 1e6,
> +           (double)s.in / (s.in + s.not_in) * 100,
> +           (double)(s.in + s.not_in) / 1e6);
> +    printf(" Removed:           %.2f M (%.2f%% of %.2fM)\n",
> +           (double)s.rm / 1e6,
> +           (double)s.rm / (s.rm + s.not_rm) * 100,
> +           (double)(s.rm + s.not_rm) / 1e6);
> +
> +    tx = (s.rd + s.not_rd + s.in + s.not_in + s.rm + s.not_rm) / 1e6 / duration;
> +    printf(" Throughput:        %.2f MT/s\n", tx);
> +    printf(" Throughput/thread: %.2f MT/s/thread\n", tx / n_rw_threads);
> +}
> +
> +static void run_test(void)
> +{
> +    unsigned int remaining;
> +    int i;
> +
> +    atomic_mb_set(&test_start, true);
> +    do {
> +        remaining = sleep(duration);
> +    } while (remaining);
> +    atomic_mb_set(&test_stop, true);
> +
> +    for (i = 0; i < n_rw_threads; i++) {
> +        qemu_thread_join(&rw_threads[i]);
> +    }
> +    for (i = 0; i < n_rz_threads; i++) {
> +        qemu_thread_join(&rz_threads[i]);
> +    }
> +}
> +
> +static void parse_args(int argc, char *argv[])
> +{
> +    int c;
> +
> +    for (;;) {
> +        c = getopt(argc, argv, "d:D:k:K:l:hn:N:o:r:Rs:S:u:");
> +        if (c < 0) {
> +            break;
> +        }
> +        switch (c) {
> +        case 'd':
> +            duration = atoi(optarg);
> +            break;
> +        case 'D':
> +            resize_delay = atol(optarg);
> +            break;
> +        case 'h':
> +            usage_complete(argc, argv);
> +            exit(0);
> +        case 'k':
> +            init_size = atol(optarg);
> +            break;
> +        case 'K':
> +            init_range = pow2ceil(atol(optarg));
> +            break;
> +        case 'l':
> +            lookup_range = pow2ceil(atol(optarg));
> +            break;
> +        case 'n':
> +            n_rw_threads = atoi(optarg);
> +            break;
> +        case 'N':
> +            n_rz_threads = atoi(optarg);
> +            break;
> +        case 'o':
> +            populate_offset = atol(optarg);
> +            break;
> +        case 'r':
> +            update_range = pow2ceil(atol(optarg));
> +            break;
> +        case 'R':
> +            qht_mode |= QHT_MODE_AUTO_RESIZE;
> +            break;
> +        case 's':
> +            qht_n_elems = atol(optarg);
> +            break;
> +        case 'S':
> +            resize_rate = atof(optarg) / 100.0;
> +            if (resize_rate > 1.0) {
> +                resize_rate = 1.0;
> +            }
> +            break;
> +        case 'u':
> +            update_rate = atof(optarg) / 100.0;
> +            if (update_rate > 1.0) {
> +                update_rate = 1.0;
> +            }
> +            break;
> +        }
> +    }
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +    parse_args(argc, argv);
> +    htable_init();
> +    create_threads();
> +    run_test();
> +    pr_stats();
> +    return 0;
> +}


--
Alex Bennée

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 12/15] qht: add qht-bench, a performance benchmark
  2016-05-31 15:12   ` Alex Bennée
@ 2016-05-31 16:44     ` Emilio G. Cota
  0 siblings, 0 replies; 63+ messages in thread
From: Emilio G. Cota @ 2016-05-31 16:44 UTC (permalink / raw)
  To: Alex Bennée
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov

On Tue, May 31, 2016 at 16:12:32 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
> > This serves as a performance benchmark as well as a stress test
> > for QHT. We can tweak quite a number of things, including the
> > number of resize threads and how frequently resizes are triggered.
> >
> > A performance comparison of QHT vs CLHT[1] and ck_hs[2] using
> > this same benchmark program can be found here:
> >   http://imgur.com/a/0Bms4
> >
> > The tests are run on a 64-core AMD Opteron 6376.
> 
> It would be useful to include the template of the command line arguments
> for these plots here. For example when I run it without any arguments I
> get:
> 
> 
> Results:
>  Read:              34.21 M (100.00% of 34.21M)
>  Inserted:          0.00 M (-nan% of 0.00M)
>  Removed:           0.00 M (-nan% of 0.00M)
>  Throughput:        34.21 MT/s
>  Throughput/thread: 34.21 MT/s/thread
> 
> Looking at the graph it says 200k keys, so:
> 
> $ ./tests/qht-bench -k 200000
> qht-bench: tests/qht-bench.c:309: htable_init: Assertion `init_size <= init_range `
> 
> So I'm a little confused on how I'm running this benchmark.

./qht-bench -d $duration -n $n -u $u -k $range -K $range -l $range -r $range -s $range

where $duration is in seconds, $n is the number of threads, $u is the update
rate (0.0 to 100.0), and $range is the number of keys.

Most people will want to set all of -k,K,l,r,s to the same value, so it might
be a good idea to add another parameter to set them all at once.
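For example, a run matching the 200K-key plots could look like this
(illustrative values; not necessarily the exact ones used for the
graphs):

  ./tests/qht-bench -d 10 -n 64 -u 20 -k 200000 -K 200000 \
                    -l 200000 -r 200000 -s 200000

i.e. 64 read/write threads for 10 seconds at a 20% update rate over
200K keys; the -K/-l/-r ranges are rounded up to the next power of 2.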

		E.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-31  7:46     ` Alex Bennée
@ 2016-06-01 20:53       ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-06-01 20:53 UTC (permalink / raw)
  To: Alex Bennée
  Cc: Emilio G. Cota, QEMU Developers, MTTCG Devel, Paolo Bonzini,
	Richard Henderson

On 31/05/16 10:46, Alex Bennée wrote:
> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>
>> On 25/05/16 04:13, Emilio G. Cota wrote:
>>> diff --git a/include/qemu/qht.h b/include/qemu/qht.h
>>> new file mode 100644
>>> index 0000000..aec60aa
>>> --- /dev/null
>>> +++ b/include/qemu/qht.h
>>> @@ -0,0 +1,183 @@
>> (snip)
>>> +/**
>>> + * qht_init - Initialize a QHT
>>> + * @ht: QHT to be initialized
>>> + * @n_elems: number of entries the hash table should be optimized for.
>>> + * @mode: bitmask with OR'ed QHT_MODE_*
>>> + */
>>> +void qht_init(struct qht *ht, size_t n_elems, unsigned int mode);
>> First of all, thank you for spending your time on the documentation of
>> the API!
>>
>> I was just wondering if it could be worthwhile to pass a hash function
>> when initializing a QHT. Then we could have variants of qht_insert(),
>> qht_remove() and qht_lookup() which do not require a computed hash
>> value but call the function by themselves. This could make sense since a
>> hash value passed to the functions should always be exactly the same
>> for the same object.
> Wouldn't this be for an expansion of the API when we actually have
> something that would use it that way?
>

Yes, I think in "tb hash: track translated blocks with qht" we could just pass tb_hash_func() to qht_init().

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-29 19:52   ` Sergey Fedorov
  2016-05-29 19:55     ` Sergey Fedorov
  2016-05-31  7:46     ` Alex Bennée
@ 2016-06-03  9:18     ` Emilio G. Cota
  2016-06-03 15:19       ` Sergey Fedorov
  2016-06-03 11:01     ` Emilio G. Cota
  3 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-06-03  9:18 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On Sun, May 29, 2016 at 22:52:27 +0300, Sergey Fedorov wrote:
> I was just wondering if it could be worthwhile to pass a hash function
> when initializing a QHT. Then we could have variants of qht_insert(),
> qht_remove() and qht_lookup() which do not require a computed hash
> value but call the function by themselves. This could make sense since a
> hash value passed to the functions should always be exactly the same
> for the same object.

I considered this when designing the API. I think it's not worth having
in qht; callers could have their own wrapper to do this though.

For the only caller of qht that we have so far I don't see this
as being worth the hassle.

For instance, we couldn't use the same function for lookups and
inserts/removals, since the hash function would look like:

uint32_t hash_func(void *p)
{
	TranslationBlock *tb = p;
	return tb_hash_func(tb->phys_pc, ...);
}

But for lookups we don't yet know *tb (that's what we're looking for!).
All we have is the tb_desc struct that we use for comparisons.
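
In other words, the two sides hash different objects (a condensed
sketch from the "track translated blocks with qht" patch):

    /* insert: the TB exists, so hash its own fields */
    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);

    /* lookup: no TB yet -- hash the lookup key (tb_desc) instead */
    h = tb_hash_func(phys_pc, desc.pc, desc.flags);
    tb = qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);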

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table
  2016-05-29 19:52   ` Sergey Fedorov
                       ` (2 preceding siblings ...)
  2016-06-03  9:18     ` Emilio G. Cota
@ 2016-06-03 11:01     ` Emilio G. Cota
  2016-06-03 15:34       ` Sergey Fedorov
  3 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-06-03 11:01 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On Sun, May 29, 2016 at 22:52:27 +0300, Sergey Fedorov wrote:
> > +/**
> > + * qht_remove - remove a pointer from the hash table
> > + * @ht: QHT to remove from
> > + * @p: pointer to be removed
> > + * @hash: hash corresponding to @p
> > + *
> > + * Attempting to remove a NULL @p is a bug.
> > + *
> > + * Just-removed @p pointers cannot be immediately freed; they need to remain
> > + * valid until the end of the RCU grace period in which qht_remove() is called.
> > + * This guarantees that concurrent lookups will always compare against valid
> > + * data.
> 
> Mention rcu_call1()/call_rcu()/g_free_rcu()?

Probably using 'see also:' for non-qht functions is a little too much.
The pointer to RCU should be enough, me thinks.

> (snip)
> > +/* trigger a resize when n_added_buckets > n_buckets / div */
> > +#define QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV 8
> 
> Just out of curiosity, how did you get this number?

Good question. Tested what gave the best performance for the ARM
bootup test :-) It matched the performance of the old behavior, which
was to double the size when n_entries reached
n_buckets * QHT_BUCKET_ENTRIES / 2.
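
For reference, the condition this constant tunes is essentially the
following (a sketch; the exact field names are assumptions):

    if (map->n_added_buckets >
        map->n_buckets / QHT_NR_ADDED_BUCKETS_THRESHOLD_DIV) {
        /* grow: allocate a map with twice as many head buckets */
    }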

> > +
> > +static void qht_do_resize(struct qht *ht, struct qht_map *new);
> > +static void qht_grow_maybe(struct qht *ht);
> 
> qht_grow_maybe() is used just once. Please consider reordering of
> definitions and removing this forward declaration.

Done.

(snip)
> > +/*
> > + * Find the last valid entry in @head, and swap it with @orig[pos], which has
> > + * just been invalidated.
> > + */
> > +static inline void qht_bucket_fill_hole(struct qht_bucket *orig, int pos)
> > +{
> > +    struct qht_bucket *b = orig;
> > +    struct qht_bucket *prev = NULL;
> > +    int i;
> > +
> > +    if (qht_entry_is_last(orig, pos)) {
> > +        orig->hashes[pos] = 0;
> > +        atomic_set(&orig->pointers[pos], NULL);
> > +        return;
> > +    }
> > +    do {
> > +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> > +            if (b->pointers[i]) {
> > +                continue;
> > +            }
> > +            if (i > 0) {
> > +                return qht_entry_move(orig, pos, b, i - 1);
> > +            }
> > +            qht_debug_assert(prev);
> 
> 'prev' can be NULL if this is the first iteration.

How can prev be NULL here and that not be a bug? NULL here would
mean there was a hole before orig[pos]. Or orig[pos] was
NULL, which is also a bug.

> > +
> > +/* call with b->lock held */
> > +static inline
> > +bool qht_remove__locked(struct qht_map *map, struct qht_bucket *head,
> > +                        const void *p, uint32_t hash)
> > +{
> > +    struct qht_bucket *b = head;
> > +    int i;
> > +
> > +    do {
> > +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
> > +            void *q = b->pointers[i];
> > +
> > +            if (unlikely(q == NULL)) {
> > +                return false;
> > +            }
> > +            if (q == p) {
> > +                qht_debug_assert(b->hashes[i] == hash);
> > +                seqlock_write_begin(&head->sequence);
> > +                qht_bucket_fill_hole(b, i);
> 
> "Fill hole" doesn't correspond to the function's new job since there's
> no hole. "Remove entry" would make more sense, I think.

Changed.

> (snip)
> > +/*
> > + * Call with ht->lock and all bucket locks held.
> > + *
> > + * Creating the @new map here would add unnecessary delay while all the locks
> > + * are held--holding up the bucket locks is particularly bad, since no writes
> > + * can occur while these are held. Thus, we let callers create the new map,
> > + * hopefully without the bucket locks held.
> > + */
> > +static void qht_do_resize(struct qht *ht, struct qht_map *new)
> > +{
> > +    struct qht_map *old;
> > +
> > +    old = ht->map;
> > +    g_assert_cmpuint(new->n_buckets, !=, old->n_buckets);
> > +
> > +    qht_map_iter__all_locked(ht, old, qht_map_copy, new);
> > +    qht_map_debug__all_locked(new);
> > +
> > +    atomic_rcu_set(&ht->map, new);
> > +    call_rcu1(&old->rcu, qht_map_reclaim);
> 
> call_rcu() macro is a more convenient way to do this and you wouldn't
> need qht_map_reclaim().

True. Originally the RCU struct wasn't at the head, then I changed
it to please valgrind. Changed now.
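
That is, roughly (the destructor's name here is an assumption):

    /* call_rcu() takes the struct, the callback and the name of the
     * embedded RCU head field, so qht_map_reclaim() can go away */
    call_rcu(old, qht_map_destroy, rcu);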

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 13/15] qht: add test-qht-par to invoke qht-bench from 'check' target
  2016-05-29 20:53   ` Sergey Fedorov
@ 2016-06-03 11:07     ` Emilio G. Cota
  0 siblings, 0 replies; 63+ messages in thread
From: Emilio G. Cota @ 2016-06-03 11:07 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On Sun, May 29, 2016 at 23:53:42 +0300, Sergey Fedorov wrote:
> On 25/05/16 04:13, Emilio G. Cota wrote:
> (snip)
> > +
> > +#define TEST_QHT_STRING "tests/qht-bench 1>/dev/null 2>&1 -R -S0.1 -D10000 -N1"
> > +
> > +static void test_qht(int n_threads, int update_rate, int duration)
> > +{
> > +    char *str;
> > +    int rc;
> > +
> > +    str = g_strdup_printf(TEST_QHT_STRING "-n %d -u %d -d %d",
> 
> There needs to be an extra space either at the beginning of the literal
> string, or at the end of the string defined by TEST_QHT_STRING, so that
> we don't get "... -N1-n ...".

Good catch! Changed now.
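
E.g. with the trailing space on the macro side (either of the two fixes
works; this is just a sketch):

#define TEST_QHT_STRING "tests/qht-bench 1>/dev/null 2>&1 -R -S0.1 -D10000 -N1 "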

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 12/15] qht: add qht-bench, a performance benchmark
  2016-05-29 20:45   ` Sergey Fedorov
@ 2016-06-03 11:41     ` Emilio G. Cota
  2016-06-03 15:41       ` Sergey Fedorov
  0 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-06-03 11:41 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On Sun, May 29, 2016 at 23:45:23 +0300, Sergey Fedorov wrote:
> On 25/05/16 04:13, Emilio G. Cota wrote:
> > diff --git a/tests/qht-bench.c b/tests/qht-bench.c
> > new file mode 100644
> > index 0000000..30d27c8
> > --- /dev/null
> > +++ b/tests/qht-bench.c
> > @@ -0,0 +1,474 @@
> (snip)
> > +static void do_rw(struct thread_info *info)
> > +{
> > +    struct thread_stats *stats = &info->stats;
> > +    uint32_t hash;
> > +    long *p;
> > +
> > +    if (info->r >= update_threshold) {
> > +        bool read;
> > +
> > +        p = &keys[info->r & (lookup_range - 1)];
> > +        hash = h(*p);
> > +        read = qht_lookup(&ht, is_equal, p, hash);
> > +        if (read) {
> > +            stats->rd++;
> > +        } else {
> > +            stats->not_rd++;
> > +        }
> > +    } else {
> > +        p = &keys[info->r & (update_range - 1)];
> > +        hash = h(*p);
> 
> The previous two lines are common for the both "if" branches. Lets move
> it above the "if".

Not quite. The mask uses lookup_range above, and update_range below.

> > +        if (info->write_op) {
> > +            bool written = false;
> > +
> > +            if (qht_lookup(&ht, is_equal, p, hash) == NULL) {
> > +                written = qht_insert(&ht, p, hash);
> > +            }
> > +            if (written) {
> > +                stats->in++;
> > +            } else {
> > +                stats->not_in++;
> > +            }
> > +        } else {
> > +            bool removed = false;
> > +
> > +            if (qht_lookup(&ht, is_equal, p, hash)) {
> > +                removed = qht_remove(&ht, p, hash);
> > +            }
> > +            if (removed) {
> > +                stats->rm++;
> > +            } else {
> > +                stats->not_rm++;
> > +            }
> > +        }
> > +        info->write_op = !info->write_op;
> > +    }
> > +}
> > +
> > +static void *thread_func(void *p)
> > +{
> > +    struct thread_info *info = p;
> > +
> > +    while (!atomic_mb_read(&test_start)) {
> > +        cpu_relax();
> > +    }
> > +
> > +    rcu_register_thread();
> 
> Shouldn't we do this before checking for 'test_start'?

From a correctness point of view it doesn't matter. But yes, it
is better to do it earlier. Changed.

> > +
> > +    rcu_read_lock();
> 
> Why don't we do rcu_read_lock()/rcu_read_unlock() inside the loop?

Because that will slow down the benchmark unnecessarily (throughput
for single-threaded and default opts goes down from 38M/s to 35M/s).
For this benchmark we want to benchmark QHT's performance, not RCU's.
And really we're not allocating/deallocating elements dynamically,
so from a memory usage viewpoint calling this inside or outside
of the loop doesn't matter.
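
For reference, the per-iteration variant being discussed would be:

    while (!atomic_read(&test_stop)) {
        info->r = xorshift64star(info->r);
        rcu_read_lock();
        info->func(info);
        rcu_read_unlock();
    }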

> > +    while (!atomic_read(&test_stop)) {
> > +        info->r = xorshift64star(info->r);
> > +        info->func(info);
> > +    }
> > +    rcu_read_unlock();
> > +
> > +    rcu_unregister_thread();
> > +    return NULL;
> > +}
> > +
> > +/* sets everything except info->func */
> > +static void prepare_thread_info(struct thread_info *info, int i)
> > +{
> > +    /* seed for the RNG; each thread should have a different one */
> > +    info->r = (i + 1) ^ time(NULL);
> > +    /* the first update will be a write */
> > +    info->write_op = true;
> > +    /* the first resize will be down */
> > +    info->resize_down = true;
> > +
> > +    memset(&info->stats, 0, sizeof(info->stats));
> > +}
> > +
> > +static void
> > +th_create_n(QemuThread **threads, struct thread_info **infos, const char *name,
> > +            void (*func)(struct thread_info *), int offset, int n)
> 
> 'offset' is not used in this function.

Good catch! Changed now:
+    prepare_thread_info(&info[i], offset + i);
The offset is passed so that each created thread has a unique
RNG seed.
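
(Hypothetically, the call sites would then be th_create_n(..., 0,
n_rw_threads) for the readers/writers and th_create_n(..., n_rw_threads,
n_rz_threads) for the resizers, giving the two groups disjoint seeds;
the updated call sites aren't shown here.)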

> > +{
> > +    struct thread_info *info;
> > +    QemuThread *th;
> > +    int i;
> > +
> > +    th = g_malloc(sizeof(*th) * n);
> > +    *threads = th;
> > +
> > +    info = qemu_memalign(64, sizeof(*info) * n);
> > +    *infos = info;
> > +
> > +    for (i = 0; i < n; i++) {
> > +        prepare_thread_info(&info[i], i);
> > +        info[i].func = func;
> > +        qemu_thread_create(&th[i], name, thread_func, &info[i],
> > +                           QEMU_THREAD_JOINABLE);
> > +    }
> > +}
> > +
> (snip)
> > +
> > +static void run_test(void)
> > +{
> > +    unsigned int remaining;
> > +    int i;
> > +
> 
> Are we sure all the threads are ready at this point? Otherwise why
> bother with 'test_start' flag?

Good point. Added the following:

diff --git a/tests/qht-bench.c b/tests/qht-bench.c
index 885da9c..c1ed9b9 100644
--- a/tests/qht-bench.c
+++ b/tests/qht-bench.c
@@ -43,6 +43,7 @@ static unsigned long lookup_range = DEFAULT_RANGE;
 static unsigned long update_range = DEFAULT_RANGE;
 static size_t init_range = DEFAULT_RANGE;
 static size_t init_size = DEFAULT_RANGE;
+static size_t n_ready_threads;
 static long populate_offset;
 static long *keys;
 
@@ -190,6 +191,7 @@ static void *thread_func(void *p)
 
     rcu_register_thread();
 
+    atomic_inc(&n_ready_threads);
     while (!atomic_mb_read(&test_start)) {
         cpu_relax();
     }
@@ -387,6 +389,9 @@ static void run_test(void)
     unsigned int remaining;
     int i;
 
+    while (atomic_read(&n_ready_threads) != n_rw_threads + n_rz_threads) {
+        cpu_relax();
+    }
     atomic_mb_set(&test_start, true);
     do {
         remaining = sleep(duration);

Thanks,

		Emilio

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table
  2016-06-03  9:18     ` Emilio G. Cota
@ 2016-06-03 15:19       ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-06-03 15:19 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On 03/06/16 12:18, Emilio G. Cota wrote:
> On Sun, May 29, 2016 at 22:52:27 +0300, Sergey Fedorov wrote:
>> I was just wondering if it could be worthwhile to pass a hash function
>> when initializing a QHT. Then we could have variants of qht_insert(),
>> qht_remove() and qht_lookup() which do not require a computed hash
>> value but call the function by themselves. This could make sense since a
>> hash value passed to the functions should always be exactly the same
>> for the same object.
> I considered this when designing the API. I think it's not worth having
> in qht; callers could have their own wrapper to do this though.
>
> For the only caller of qht that we have so far I don't see this
> as being worth the hassle.
>
> For instance, we couldn't use the same function for lookups and
> inserts/removals, since the hash function would look like:
>
> uint32_t hash_func(void *p)
> {
> 	TranslationBlock *tb = p;
> 	return tb_hash_func(tb->phys_pc, ...);
> }
>
> But for lookups we don't yet know *tb (that's what we're looking for!).
> All we have is the tb_desc struct that we use for comparisons.

Fair enough.

Thanks,
Sergey

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table
  2016-06-03 11:01     ` Emilio G. Cota
@ 2016-06-03 15:34       ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-06-03 15:34 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On 03/06/16 14:01, Emilio G. Cota wrote:
> On Sun, May 29, 2016 at 22:52:27 +0300, Sergey Fedorov wrote:
>>> +/*
>>> + * Find the last valid entry in @head, and swap it with @orig[pos], which has
>>> + * just been invalidated.
>>> + */
>>> +static inline void qht_bucket_fill_hole(struct qht_bucket *orig, int pos)
>>> +{
>>> +    struct qht_bucket *b = orig;
>>> +    struct qht_bucket *prev = NULL;
>>> +    int i;
>>> +
>>> +    if (qht_entry_is_last(orig, pos)) {
>>> +        orig->hashes[pos] = 0;
>>> +        atomic_set(&orig->pointers[pos], NULL);
>>> +        return;
>>> +    }
>>> +    do {
>>> +        for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
>>> +            if (b->pointers[i]) {
>>> +                continue;
>>> +            }
>>> +            if (i > 0) {
>>> +                return qht_entry_move(orig, pos, b, i - 1);
>>> +            }
>>> +            qht_debug_assert(prev);
>> 'prev' can be NULL if this is the first iteration.
> How can prev be NULL here and that not be a bug? NULL here would
> mean there was a hole before orig[pos]. Or orig[pos] was
> NULL, which is also a bug.

Yeah, you're right.

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 12/15] qht: add qht-bench, a performance benchmark
  2016-06-03 11:41     ` Emilio G. Cota
@ 2016-06-03 15:41       ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-06-03 15:41 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On 03/06/16 14:41, Emilio G. Cota wrote:
> On Sun, May 29, 2016 at 23:45:23 +0300, Sergey Fedorov wrote:
>> On 25/05/16 04:13, Emilio G. Cota wrote:
>>> diff --git a/tests/qht-bench.c b/tests/qht-bench.c
>>> new file mode 100644
>>> index 0000000..30d27c8
>>> --- /dev/null
>>> +++ b/tests/qht-bench.c
>>> @@ -0,0 +1,474 @@
>> (snip)
>>> +static void do_rw(struct thread_info *info)
>>> +{
>>> +    struct thread_stats *stats = &info->stats;
>>> +    uint32_t hash;
>>> +    long *p;
>>> +
>>> +    if (info->r >= update_threshold) {
>>> +        bool read;
>>> +
>>> +        p = &keys[info->r & (lookup_range - 1)];
>>> +        hash = h(*p);
>>> +        read = qht_lookup(&ht, is_equal, p, hash);
>>> +        if (read) {
>>> +            stats->rd++;
>>> +        } else {
>>> +            stats->not_rd++;
>>> +        }
>>> +    } else {
>>> +        p = &keys[info->r & (update_range - 1)];
>>> +        hash = h(*p);
>> The previous two lines are common for the both "if" branches. Lets move
>> it above the "if".
> Not quite. The mask uses lookup_range above, and update_range below.

Ah, yes :)

>
>>> +
>>> +    rcu_read_lock();
>> Why don't we do rcu_read_lock()/rcu_read_unlock() inside the loop?
> Because that will slow down the benchmark unnecessarily (throughput
> for single-threaded and default opts goes down from 38M/s to 35M/s).
> For this benchmark we want to benchmark QHT's performance, not RCU's.
> And really we're not allocating/deallocating elements dynamically,
> so from a memory usage viewpoint calling this inside or outside
> of the loop doesn't matter.

Fair enough.

>
>>> +    while (!atomic_read(&test_stop)) {
>>> +        info->r = xorshift64star(info->r);
>>> +        info->func(info);
>>> +    }
>>> +    rcu_read_unlock();
>>> +
>>> +    rcu_unregister_thread();
>>> +    return NULL;
>>> +}
>> (snip)
>>> +static void run_test(void)
>>> +{
>>> +    unsigned int remaining;
>>> +    int i;
>>> +
>> Are we sure all the threads are ready at this point? Otherwise why
>> bother with 'test_start' flag?
> Good point. Added the following:
>
> diff --git a/tests/qht-bench.c b/tests/qht-bench.c
> index 885da9c..c1ed9b9 100644
> --- a/tests/qht-bench.c
> +++ b/tests/qht-bench.c
> @@ -43,6 +43,7 @@ static unsigned long lookup_range = DEFAULT_RANGE;
>  static unsigned long update_range = DEFAULT_RANGE;
>  static size_t init_range = DEFAULT_RANGE;
>  static size_t init_size = DEFAULT_RANGE;
> +static size_t n_ready_threads;
>  static long populate_offset;
>  static long *keys;
>  
> @@ -190,6 +191,7 @@ static void *thread_func(void *p)
>  
>      rcu_register_thread();
>  
> +    atomic_inc(&n_ready_threads);
>      while (!atomic_mb_read(&test_start)) {
>          cpu_relax();
>      }
> @@ -387,6 +389,9 @@ static void run_test(void)
>      unsigned int remaining;
>      int i;
>  
> +    while (atomic_read(&n_ready_threads) != n_rw_threads + n_rz_threads) {
> +        cpu_relax();
> +    }
>      atomic_mb_set(&test_start, true);
>      do {
>          remaining = sleep(duration);
>

Looks good.

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-05-28 18:15   ` Sergey Fedorov
@ 2016-06-03 17:22     ` Emilio G. Cota
  2016-06-03 17:29       ` Sergey Fedorov
  2016-06-07  1:05     ` Emilio G. Cota
  1 sibling, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-06-03 17:22 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On Sat, May 28, 2016 at 21:15:06 +0300, Sergey Fedorov wrote:
> On 25/05/16 04:13, Emilio G. Cota wrote:
> (snip)
> > +double qdist_avg(const struct qdist *dist)
> > +{
> > +    unsigned long count;
> > +    size_t i;
> > +    double ret = 0;
> > +
> > +    count = qdist_sample_count(dist);
> > +    if (!count) {
> > +        return NAN;
> > +    }
> > +    for (i = 0; i < dist->n; i++) {
> > +        struct qdist_entry *e = &dist->entries[i];
> > +
> > +        ret += e->x * e->count / count;
> 
> Please use Welford’s method or something like that, see
> http://stackoverflow.com/a/1346890.

Yes, the way the mean is computed right now, we might suffer
from underflow if count is huge. But I'd rather take that, than the
perf penalty of an iterative method (such as the one used
in Welford's). Note that we might have huge amounts of
items, e.g. one item per head bucket in qht's occupancy qdist
(and 0.5M head buckets is easy to achieve).

If we were to use an iterative method, we'd need to do something
like:

double qdist_avg(const struct qdist *dist)
{
    size_t i, j, k = 0;
    double ret = 0;

    if (!qdist_sample_count(dist)) {
        return NAN;
    }
    /* compute moving average to prevent under/overflow */
    for (i = 0; i < dist->n; i++) {
        struct qdist_entry *e = &dist->entries[i];

        for (j = 0; j < e->count; j++) {
            ret += (e->x - ret) / ++k;
        }
    }
    return ret;
}

Note that skipping the inner loop would be incorrect.

I measured the time it takes to execute qdist_avg(&hst.occupancy) at the
end of booting debian jessie for ARM. The difference is
significant:

Original:  0.000002 s
Iterative: 0.002846 s

So really I think we should be OK with a potential underflow. If you want
I can add a comment to remind our future selves of these findings.

		Emilio

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-06-03 17:22     ` Emilio G. Cota
@ 2016-06-03 17:29       ` Sergey Fedorov
  2016-06-03 17:46         ` Sergey Fedorov
  0 siblings, 1 reply; 63+ messages in thread
From: Sergey Fedorov @ 2016-06-03 17:29 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On 03/06/16 20:22, Emilio G. Cota wrote:
> On Sat, May 28, 2016 at 21:15:06 +0300, Sergey Fedorov wrote:
>> On 25/05/16 04:13, Emilio G. Cota wrote:
>> (snip)
>>> +double qdist_avg(const struct qdist *dist)
>>> +{
>>> +    unsigned long count;
>>> +    size_t i;
>>> +    double ret = 0;
>>> +
>>> +    count = qdist_sample_count(dist);
>>> +    if (!count) {
>>> +        return NAN;
>>> +    }
>>> +    for (i = 0; i < dist->n; i++) {
>>> +        struct qdist_entry *e = &dist->entries[i];
>>> +
>>> +        ret += e->x * e->count / count;
>> Please use Welford’s method or something like that, see
>> http://stackoverflow.com/a/1346890.
> Yes, the way the mean is computed right now, we might suffer
> from underflow if count is huge. But I'd rather take that, than the
> perf penalty of an iterative method (such as the one used
> in Welford's). Note that we might have huge amounts of
> items, e.g. one item per head bucket in qht's occupancy qdist
> (and 0.5M head buckets is easy to achieve).
>
> If we were to use an iterative method, we'd need to do something
> like:
>
> double qdist_avg(const struct qdist *dist)
> {
>     size_t i, j, k = 0;
>     double ret = 0;
>
>     if (!qdist_sample_count(dist)) {
>         return NAN;
>     }
>     /* compute moving average to prevent under/overflow */
>     for (i = 0; i < dist->n; i++) {
>         struct qdist_entry *e = &dist->entries[i];
>
>         for (j = 0; j < e->count; j++) {
>             ret += (e->x - ret) / ++k;
>         }
>     }
>     return ret;
> }
>
> Note that skipping the inner loop would be incorrect.

Ah, it's a shame. I'm wondering if there is some other algorithm that
could work for us?

> I measured the time it takes to execute qdist_avg(&hst.occupancy) at the
> end of booting debian jessie for ARM. The difference is
> significant:
>
> Original:  0.000002 s
> Iterative: 0.002846 s

Have you compared the results of computing the average as well?

>
> So really I think we should be OK with a potential underflow. If you want
> I can add a comment to remind our future selves of these findings.

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-06-03 17:29       ` Sergey Fedorov
@ 2016-06-03 17:46         ` Sergey Fedorov
  2016-06-06 23:40           ` Emilio G. Cota
  0 siblings, 1 reply; 63+ messages in thread
From: Sergey Fedorov @ 2016-06-03 17:46 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On 03/06/16 20:29, Sergey Fedorov wrote:
> On 03/06/16 20:22, Emilio G. Cota wrote:
>> On Sat, May 28, 2016 at 21:15:06 +0300, Sergey Fedorov wrote:
>>> On 25/05/16 04:13, Emilio G. Cota wrote:
>>> (snip)
>>>> +double qdist_avg(const struct qdist *dist)
>>>> +{
>>>> +    unsigned long count;
>>>> +    size_t i;
>>>> +    double ret = 0;
>>>> +
>>>> +    count = qdist_sample_count(dist);
>>>> +    if (!count) {
>>>> +        return NAN;
>>>> +    }
>>>> +    for (i = 0; i < dist->n; i++) {
>>>> +        struct qdist_entry *e = &dist->entries[i];
>>>> +
>>>> +        ret += e->x * e->count / count;
>>> Please use Welford’s method or something like that, see
>>> http://stackoverflow.com/a/1346890.
>> Yes, the way the mean is computed right now, we might suffer
>> from underflow if count is huge. But I'd rather take that, than the
>> perf penalty of an iterative method (such as the one used
>> in Welford's). Note that we might have huge amounts of
>> items, e.g. one item per head bucket in qht's occupancy qdist
>> (and 0.5M head buckets is easy to achieve).
>>
>> If we were to use an iterative method, we'd need to do something
>> like:
>>
>> double qdist_avg(const struct qdist *dist)
>> {
>>     size_t i, j, k = 0;
>>     double ret = 0;
>>
>>     if (!qdist_sample_count(dist)) {
>>         return NAN;
>>     }
>>     /* compute moving average to prevent under/overflow */
>>     for (i = 0; i < dist->n; i++) {
>>         struct qdist_entry *e = &dist->entries[i];
>>
>>         for (j = 0; j < e->count; j++) {
>>             ret += (e->x - ret) / ++k;
>>         }
>>     }
>>     return ret;
>> }
>>
>> Note that skipping the inner loop would be incorrect.
> Ah, it's a shame. I'm wondering if there is some other algorithm that
> could work for us?

Maybe something like
https://en.wikipedia.org/wiki/Kahan_summation_algorithm could help?

Regards,
Sergey

>
>> I measured the time it takes to execute qdist_avg(&hst.occupancy) at the
>> end of booting debian jessie for ARM. The difference is
>> significant:
>>
>> Original:  0.000002 s
>> Iterative: 0.002846 s
> Have you compared the results of computing the average as well?
>
>> So really I think we should be OK with a potential underflow. If you want
>> I can add a comment to remind our future selves of these findings.
> Kind regards,
> Sergey

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-06-03 17:46         ` Sergey Fedorov
@ 2016-06-06 23:40           ` Emilio G. Cota
  2016-06-07 14:06             ` Sergey Fedorov
  0 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-06-06 23:40 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On Fri, Jun 03, 2016 at 20:46:07 +0300, Sergey Fedorov wrote:
> On 03/06/16 20:29, Sergey Fedorov wrote:
> > On 03/06/16 20:22, Emilio G. Cota wrote:
> >> On Sat, May 28, 2016 at 21:15:06 +0300, Sergey Fedorov wrote:
> >>> On 25/05/16 04:13, Emilio G. Cota wrote:
> >>> (snip)
> >>>> +double qdist_avg(const struct qdist *dist)
> >>>> +{
> >>>> +    unsigned long count;
> >>>> +    size_t i;
> >>>> +    double ret = 0;
> >>>> +
> >>>> +    count = qdist_sample_count(dist);
> >>>> +    if (!count) {
> >>>> +        return NAN;
> >>>> +    }
> >>>> +    for (i = 0; i < dist->n; i++) {
> >>>> +        struct qdist_entry *e = &dist->entries[i];
> >>>> +
> >>>> +        ret += e->x * e->count / count;
> >>> Please use Welford’s method or something like that, see
> >>> http://stackoverflow.com/a/1346890.
> >> Yes, the way the mean is computed right now, we might suffer
> >> from underflow if count is huge. But I'd rather take that, than the
> >> perf penalty of an iterative method (such as the one used
> >> in Welford's). Note that we might have huge amounts of
> >> items, e.g. one item per head bucket in qht's occupancy qdist
> >> (and 0.5M head buckets is easy to achieve).
> >>
> >> If we were to use an iterative method, we'd need to do something
> >> like:
> >>
> >> double qdist_avg(const struct qdist *dist)
> >> {
> >>     size_t i, j, k = 0;
> >>     double ret = 0;
> >>
> >>     if (!qdist_sample_count(dist)) {
> >>         return NAN;
> >>     }
> >>     /* compute moving average to prevent under/overflow */
> >>     for (i = 0; i < dist->n; i++) {
> >>         struct qdist_entry *e = &dist->entries[i];
> >>
> >>         for (j = 0; j < e->count; j++) {
> >>             ret += (e->x - ret) / ++k;
> >>         }
> >>     }
> >>     return ret;
> >> }
> >>
> >> Note that skipping the inner loop would be incorrect.
> > Ah, it's a shame. I'm wondering if there is some other algorithm that
> > could work for us?
> 
> Maybe something like
> https://en.wikipedia.org/wiki/Kahan_summation_algorithm could help?

That algorithm is overkill for what we're doing. Pairwise summation
should suffice:

diff --git a/util/qdist.c b/util/qdist.c
index 3343640..909bd2b 100644
--- a/util/qdist.c
+++ b/util/qdist.c
@@ -367,20 +367,34 @@ unsigned long qdist_sample_count(const struct qdist *dist)
     return count;
 }
 
+static double qdist_pairwise_avg(const struct qdist *dist, size_t index,
+                                 size_t n, unsigned long count)
+{
+    if (n <= 2) {
+        size_t i;
+        double ret = 0;
+
+        for (i = 0; i < n; i++) {
+            struct qdist_entry *e = &dist->entries[index + i];
+
+            ret += e->x * e->count / count;
+        }
+        return ret;
+    } else {
+        size_t n2 = n / 2;
+
+        return qdist_pairwise_avg(dist, index, n2, count) +
+               qdist_pairwise_avg(dist, index + n2, n - n2, count);
+    }
+}
+
 double qdist_avg(const struct qdist *dist)
 {
     unsigned long count;
-    size_t i;
-    double ret = 0;
 
     count = qdist_sample_count(dist);
     if (!count) {
         return NAN;
     }
-    for (i = 0; i < dist->n; i++) {
-        struct qdist_entry *e = &dist->entries[i];
-
-        ret += e->x * e->count / count;
-    }
-    return ret;
+    return qdist_pairwise_avg(dist, 0, dist->n, count);
 }

		Emilio

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-05-28 18:15   ` Sergey Fedorov
  2016-06-03 17:22     ` Emilio G. Cota
@ 2016-06-07  1:05     ` Emilio G. Cota
  2016-06-07 15:56       ` Sergey Fedorov
  1 sibling, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-06-07  1:05 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On Sat, May 28, 2016 at 21:15:06 +0300, Sergey Fedorov wrote:
> On 25/05/16 04:13, Emilio G. Cota wrote:
> > diff --git a/util/qdist.c b/util/qdist.c
> > new file mode 100644
> > index 0000000..3343640
> > --- /dev/null
> > +++ b/util/qdist.c
> > @@ -0,0 +1,386 @@
> (snip)
> > +
> > +void qdist_add(struct qdist *dist, double x, long count)
> > +{
> > +    struct qdist_entry *entry = NULL;
> > +
> > +    if (dist->entries) {
> > +        struct qdist_entry e;
> > +
> > +        e.x = x;
> > +        entry = bsearch(&e, dist->entries, dist->n, sizeof(e), qdist_cmp);
> > +    }
> > +
> > +    if (entry) {
> > +        entry->count += count;
> > +        return;
> > +    }
> > +
> > +    dist->entries = g_realloc(dist->entries,
> > +                              sizeof(*dist->entries) * (dist->n + 1));
> 
> Repeated doubling?

Can you please elaborate?

> > +    dist->n++;
> > +    entry = &dist->entries[dist->n - 1];
> 
> What if we combine the above two lines:
> 
>     entry = &dist->entries[dist->n++];
> 
> or just reverse them:
> 
>     entry = &dist->entries[dist->n];
>     dist->n++;

I have less trouble understanding the original.

> > +    entry->x = x;
> > +    entry->count = count;
> > +    qsort(dist->entries, dist->n, sizeof(*entry), qdist_cmp);
> > +}
> > +
> (snip)
> > +static char *qdist_pr_internal(const struct qdist *dist)
> > +{
> > +    double min, max, step;
> > +    GString *s = g_string_new("");
> > +    size_t i;
> > +
> > +    /* if only one entry, its printout will be either full or empty */
> > +    if (dist->n == 1) {
> > +        if (dist->entries[0].count) {
> > +            g_string_append_unichar(s, qdist_blocks[QDIST_NR_BLOCK_CODES - 1]);
> > +        } else {
> > +            g_string_append_c(s, ' ');
> > +        }
> > +        goto out;
> > +    }
> > +
> > +    /* get min and max counts */
> > +    min = dist->entries[0].count;
> > +    max = min;
> > +    for (i = 0; i < dist->n; i++) {
> > +        struct qdist_entry *e = &dist->entries[i];
> > +
> > +        if (e->count < min) {
> > +            min = e->count;
> > +        }
> > +        if (e->count > max) {
> > +            max = e->count;
> > +        }
> > +    }
> > +
> > +    /* floor((count - min) * step) will give us the block index */
> > +    step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
> > +
> > +    for (i = 0; i < dist->n; i++) {
> > +        struct qdist_entry *e = &dist->entries[i];
> > +        int index;
> > +
> > +        /* make an exception with 0; instead of using block[0], print a space */
> > +        if (e->count) {
> > +            index = (int)((e->count - min) * step);
> 
> So "e->count == min" gives us one eighth block instead of just space?

Yes, only 0 can print a space.
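
(A worked example, assuming eight block glyphs: with min = 1 and max = 8,
step = 7/7 = 1, so count = 1 maps to index 0, the one-eighth block, and
count = 8 maps to index 7, the full block; the space is reserved for
count == 0.)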

> > +            g_string_append_unichar(s, qdist_blocks[index]);
> > +        } else {
> > +            g_string_append_c(s, ' ');
> > +        }
> > +    }
> > + out:
> > +    return g_string_free(s, FALSE);
> > +}
> > +
> > +/*
> > + * Bin the distribution in @from into @n bins of consecutive, non-overlapping
> > + * intervals, copying the result to @to.
> > + *
> > + * This function is internal to qdist: only this file and test code should
> > + * ever call it.
> > + *
> > + * Note: calling this function on an already-binned qdist is a bug.
> > + *
> > + * If @n == 0 or @from->n == 1, use @from->n.
> > + */
> > +void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
> > +{
> > +    double xmin, xmax;
> > +    double step;
> > +    size_t i, j, j_min;
> > +
> > +    qdist_init(to);
> > +
> > +    if (!from->entries) {
> > +        return;
> > +    }
> > +    if (!n || from->n == 1) {
> > +        n = from->n;
> > +    }
> > +
> > +    /* set equally-sized bins between @from's left and right */
> > +    xmin = qdist_xmin(from);
> > +    xmax = qdist_xmax(from);
> > +    step = (xmax - xmin) / n;
> > +
> > +    if (n == from->n) {
> > +        /* if @from's entries are equally spaced, no need to re-bin */
> > +        for (i = 0; i < from->n; i++) {
> > +            if (from->entries[i].x != xmin + i * step) {
> > +                goto rebin;
> 
> static inline function instead of goto?

It would have quite a few arguments, I think the goto is fine.

> > +            }
> > +        }
> > +        /* they're equally spaced, so copy the dist and bail out */
> > +        to->entries = g_malloc(sizeof(*to->entries) * from->n);
> 
> g_new()?

Changed.

> > +        to->n = from->n;
> > +        memcpy(to->entries, from->entries, sizeof(*to->entries) * to->n);
> > +        return;
> > +    }
> > +
> > + rebin:
> > +    j_min = 0;
> > +    for (i = 0; i < n; i++) {
> > +        double x;
> > +        double left, right;
> > +
> > +        left = xmin + i * step;
> > +        right = xmin + (i + 1) * step;
> > +
> > +        /* Add x, even if it might not get any counts later */
> > +        x = left;
> 
> This way we round down to the left margin of each bin like this:
> 
>     xmin [*---*---*---*---*] xmax   -- from
>           |  /|  /|  /|  /
>           | / | / | / | /
>           |/  |/  |/  |/
>           |   |   |   |
>           V   V   V   V
>          [*   *   *   *]            -- to
(snip)
>     xmin [*----*----*----*] xmax    -- from
>         \   /\   /\   /\   /
>          \ /  \ /  \ /  \ /
>           |    |    |    |
>           V    V    V    V
>          [*    *    *    *]         -- to
> 
> I'm not sure which is the more correct option from the mathematical
> point of view; but with the last variant of the algorithm, multiple
> binning would still give the same result.

There's no "right" or "wrong" way as long as we're consistent
and we print the right counts in the right bins. I think the
convention I chose is simple enough, and leads to simple printing
of the labels. But yes other alternatives would be OK here.

> > +        qdist_add(to, x, 0);
> > +
> > +        /*
> > +         * To avoid double-counting we capture [left, right) ranges, except for
> > +         * the rightmost bin, which captures a [left, right] range.
> > +         */
> > +        for (j = j_min; j < from->n; j++) {
> 
> Looks like we don't need to keep both 'j' and 'j_min'. We could just use
> 'j', initialize it before the outer loop, and do the inner loop with
> "while".

I prefer for over while if the for loop looks idiomatic.
wrt j_min, I'd let the compiler deal with it.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-06-06 23:40           ` Emilio G. Cota
@ 2016-06-07 14:06             ` Sergey Fedorov
  2016-06-07 22:53               ` Emilio G. Cota
  0 siblings, 1 reply; 63+ messages in thread
From: Sergey Fedorov @ 2016-06-07 14:06 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On 07/06/16 02:40, Emilio G. Cota wrote:
> On Fri, Jun 03, 2016 at 20:46:07 +0300, Sergey Fedorov wrote:
>> On 03/06/16 20:29, Sergey Fedorov wrote:
>>> On 03/06/16 20:22, Emilio G. Cota wrote:
>>>> On Sat, May 28, 2016 at 21:15:06 +0300, Sergey Fedorov wrote:
>>>>> On 25/05/16 04:13, Emilio G. Cota wrote:
>>>>> (snip)
>>>>>> +double qdist_avg(const struct qdist *dist)
>>>>>> +{
>>>>>> +    unsigned long count;
>>>>>> +    size_t i;
>>>>>> +    double ret = 0;
>>>>>> +
>>>>>> +    count = qdist_sample_count(dist);
>>>>>> +    if (!count) {
>>>>>> +        return NAN;
>>>>>> +    }
>>>>>> +    for (i = 0; i < dist->n; i++) {
>>>>>> +        struct qdist_entry *e = &dist->entries[i];
>>>>>> +
>>>>>> +        ret += e->x * e->count / count;
>>>>> Please use Welford’s method or something like that, see
>>>>> http://stackoverflow.com/a/1346890.
>>>> Yes, the way the mean is computed right now, we might suffer
>>>> from underflow if count is huge. But I'd rather take that, than the
>>>> perf penalty of an iterative method (such as the one used
>>>> in Welford's). Note that we might have huge amounts of
>>>> items, e.g. one item per head bucket in qht's occupancy qdist
>>>> (and 0.5M head buckets is easy to achieve).
>>>>
>>>> If we were to use an iterative method, we'd need to do something
>>>> like:
>>>>
>>>> double qdist_avg(const struct qdist *dist)
>>>> {
>>>>     size_t i, j, k = 0;
>>>>     double ret = 0;
>>>>
>>>>     if (!qdist_sample_count(dist)) {
>>>>         return NAN;
>>>>     }
>>>>     /* compute moving average to prevent under/overflow */
>>>>     for (i = 0; i < dist->n; i++) {
>>>>         struct qdist_entry *e = &dist->entries[i];
>>>>
>>>>         for (j = 0; j < e->count; j++) {
>>>>             ret += (e->x - ret) / ++k;
>>>>         }
>>>>     }
>>>>     return ret;
>>>> }
>>>>
>>>> Note that skipping the inner loop would be incorrect.
>>> Ah, it's a shame. I'm wondering if there is some other algorithm that
>>> could work for us?
>> Maybe something like
>> https://en.wikipedia.org/wiki/Kahan_summation_algorithm could help?
> That algorithm is overkill for what we're doing. Pairwise summation
> should suffice:
>
> diff --git a/util/qdist.c b/util/qdist.c
> index 3343640..909bd2b 100644
> --- a/util/qdist.c
> +++ b/util/qdist.c
> @@ -367,20 +367,34 @@ unsigned long qdist_sample_count(const struct qdist *dist)
>      return count;
>  }
>  
> +static double qdist_pairwise_avg(const struct qdist *dist, size_t index,
> +                                 size_t n, unsigned long count)
> +{
> +    if (n <= 2) {

We would like to amortize the overhead of the recursion by making the
cut-off sufficiently large.

> +        size_t i;
> +        double ret = 0;
> +
> +        for (i = 0; i < n; i++) {
> +            struct qdist_entry *e = &dist->entries[index + i];
> +
> +            ret += e->x * e->count / count;
> +        }
> +        return ret;
> +    } else {
> +        size_t n2 = n / 2;
> +
> +        return qdist_pairwise_avg(dist, index, n2, count) +
> +               qdist_pairwise_avg(dist, index + n2, n - n2, count);
> +    }
> +}
> +
>  double qdist_avg(const struct qdist *dist)
>  {
>      unsigned long count;
> -    size_t i;
> -    double ret = 0;
>  
>      count = qdist_sample_count(dist);
>      if (!count) {
>          return NAN;
>      }
> -    for (i = 0; i < dist->n; i++) {
> -        struct qdist_entry *e = &dist->entries[i];
> -
> -        ret += e->x * e->count / count;
> -    }
> -    return ret;
> +    return qdist_pairwise_avg(dist, 0, dist->n, count);
>  }

Otherwise looks good.

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-06-07  1:05     ` Emilio G. Cota
@ 2016-06-07 15:56       ` Sergey Fedorov
  2016-06-08  0:02         ` Emilio G. Cota
  0 siblings, 1 reply; 63+ messages in thread
From: Sergey Fedorov @ 2016-06-07 15:56 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On 07/06/16 04:05, Emilio G. Cota wrote:
> On Sat, May 28, 2016 at 21:15:06 +0300, Sergey Fedorov wrote:
>> On 25/05/16 04:13, Emilio G. Cota wrote:
>>> diff --git a/util/qdist.c b/util/qdist.c
>>> new file mode 100644
>>> index 0000000..3343640
>>> --- /dev/null
>>> +++ b/util/qdist.c
>>> @@ -0,0 +1,386 @@
>> (snip)
>>> +
>>> +void qdist_add(struct qdist *dist, double x, long count)
>>> +{
>>> +    struct qdist_entry *entry = NULL;
>>> +
>>> +    if (dist->entries) {
>>> +        struct qdist_entry e;
>>> +
>>> +        e.x = x;
>>> +        entry = bsearch(&e, dist->entries, dist->n, sizeof(e), qdist_cmp);
>>> +    }
>>> +
>>> +    if (entry) {
>>> +        entry->count += count;
>>> +        return;
>>> +    }
>>> +
>>> +    dist->entries = g_realloc(dist->entries,
>>> +                              sizeof(*dist->entries) * (dist->n + 1));
>> Repeated doubling?
> Can you please elaborate?

I mean dynamic array with a growth factor of 2
[https://en.wikipedia.org/wiki/Dynamic_array].

>
>>> +    dist->n++;
>>> +    entry = &dist->entries[dist->n - 1];
>> What if we combine the above two lines:
>>
>>     entry = &dist->entries[dist->n++];
>>
>> or just reverse them:
>>
>>     entry = &dist->entries[dist->n];
>>     dist->n++;
> I have less trouble understanding the original.

Okay.

>
>>> +    entry->x = x;
>>> +    entry->count = count;
>>> +    qsort(dist->entries, dist->n, sizeof(*entry), qdist_cmp);
>>> +}
>>> +
>> (snip)
>>> +static char *qdist_pr_internal(const struct qdist *dist)
>>> +{
>>> +    double min, max, step;
>>> +    GString *s = g_string_new("");
>>> +    size_t i;
>>> +
>>> +    /* if only one entry, its printout will be either full or empty */
>>> +    if (dist->n == 1) {
>>> +        if (dist->entries[0].count) {
>>> +            g_string_append_unichar(s, qdist_blocks[QDIST_NR_BLOCK_CODES - 1]);
>>> +        } else {
>>> +            g_string_append_c(s, ' ');
>>> +        }
>>> +        goto out;
>>> +    }
>>> +
>>> +    /* get min and max counts */
>>> +    min = dist->entries[0].count;
>>> +    max = min;
>>> +    for (i = 0; i < dist->n; i++) {
>>> +        struct qdist_entry *e = &dist->entries[i];
>>> +
>>> +        if (e->count < min) {
>>> +            min = e->count;
>>> +        }
>>> +        if (e->count > max) {
>>> +            max = e->count;
>>> +        }
>>> +    }
>>> +
>>> +    /* floor((count - min) * step) will give us the block index */
>>> +    step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
>>> +
>>> +    for (i = 0; i < dist->n; i++) {
>>> +        struct qdist_entry *e = &dist->entries[i];
>>> +        int index;
>>> +
>>> +        /* make an exception with 0; instead of using block[0], print a space */
>>> +        if (e->count) {
>>> +            index = (int)((e->count - min) * step);
>> So "e->count == min" gives us one eighth block instead of just space?
> Yes, only 0 can print a space.

So our scale is not linear. I think some users might get confused by this.

>
>>> +            g_string_append_unichar(s, qdist_blocks[index]);
>>> +        } else {
>>> +            g_string_append_c(s, ' ');
>>> +        }
>>> +    }
>>> + out:
>>> +    return g_string_free(s, FALSE);
>>> +}
>>> +
>>> +/*
>>> + * Bin the distribution in @from into @n bins of consecutive, non-overlapping
>>> + * intervals, copying the result to @to.
>>> + *
>>> + * This function is internal to qdist: only this file and test code should
>>> + * ever call it.
>>> + *
>>> + * Note: calling this function on an already-binned qdist is a bug.
>>> + *
>>> + * If @n == 0 or @from->n == 1, use @from->n.
>>> + */
>>> +void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
>>> +{
>>> +    double xmin, xmax;
>>> +    double step;
>>> +    size_t i, j, j_min;
>>> +
>>> +    qdist_init(to);
>>> +
>>> +    if (!from->entries) {
>>> +        return;
>>> +    }
>>> +    if (!n || from->n == 1) {
>>> +        n = from->n;
>>> +    }
>>> +
>>> +    /* set equally-sized bins between @from's left and right */
>>> +    xmin = qdist_xmin(from);
>>> +    xmax = qdist_xmax(from);
>>> +    step = (xmax - xmin) / n;
>>> +
>>> +    if (n == from->n) {
>>> +        /* if @from's entries are equally spaced, no need to re-bin */
>>> +        for (i = 0; i < from->n; i++) {
>>> +            if (from->entries[i].x != xmin + i * step) {
>>> +                goto rebin;
>> static inline function instead of goto?
> It would have quite a few arguments, I think the goto is fine.

Actually, it would be 'xmin', 'xmax', and 'step' in addition to 'to',
'from', and 'n'. But yes, probably goto is fine here.

>
>>> +            }
>>> +        }
>>> +        /* they're equally spaced, so copy the dist and bail out */
>>> +        to->entries = g_malloc(sizeof(*to->entries) * from->n);
>> g_new()?
> Changed.
>
>>> +        to->n = from->n;
>>> +        memcpy(to->entries, from->entries, sizeof(*to->entries) * to->n);
>>> +        return;
>>> +    }
>>> +
>>> + rebin:

By the way, there's a space before the 'rebin' label.

>>> +    j_min = 0;
>>> +    for (i = 0; i < n; i++) {
>>> +        double x;
>>> +        double left, right;
>>> +
>>> +        left = xmin + i * step;
>>> +        right = xmin + (i + 1) * step;
>>> +
>>> +        /* Add x, even if it might not get any counts later */
>>> +        x = left;
>> This way we round down to the left margin of each bin like this:
>>
>>     xmin [*---*---*---*---*] xmax   -- from
>>           |  /|  /|  /|  /
>>           | / | / | / | /
>>           |/  |/  |/  |/
>>           |   |   |   |
>>           V   V   V   V
>>          [*   *   *   *]            -- to
> (snip)
>>     xmin [*----*----*----*] xmax    -- from
>>         \   /\   /\   /\   /
>>          \ /  \ /  \ /  \ /
>>           |    |    |    |
>>           V    V    V    V
>>          [*    *    *    *]         -- to
>>
>> I'm not sure which is the more correct option from the mathematical
>> point of view; but with the last variant of the algorithm, multiple
>> binning would still give the same result.
> There's no "right" or "wrong" way as long as we're consistent
> and we print the right counts in the right bins. I think the
> convention I chose is simple enough, and leads to simple printing
> of the labels. But yes other alternatives would be OK here.

Well, if we go ahead with my last suggestion the code would look like this:

rebin:
    /* We do the binning using the following scheme:
     *
     *  xmin [*----*----*----*] xmax    -- from
     *      \   /\   /\   /\   /
     *       \ /  \ /  \ /  \ /
     *        |    |    |    |
     *        V    V    V    V
     *       [*    *    *    *]         -- to
     *
     */
    step = (xmax - xmin) / (n - 1);
    j = 0;
    for (i = 0; i < n; i++) {
        double x;
        double right;

        x = xmin + i * step;
        right = x + 0.5 * step;

        /* Add x, even if it might not get any counts later */
        qdist_add(to, x, 0);

        /* To avoid double-counting we capture [left, right) ranges */
        while (j < from->n && from->entries[j].x < right) {
            qdist_add(to, x, from->entries[j].count);
            j++;
        }
    }
    assert(j == from->n);
}

Actually it's simpler than the current version.

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-06-07 14:06             ` Sergey Fedorov
@ 2016-06-07 22:53               ` Emilio G. Cota
  2016-06-08 13:09                 ` Sergey Fedorov
  0 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-06-07 22:53 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On Tue, Jun 07, 2016 at 17:06:16 +0300, Sergey Fedorov wrote:
> On 07/06/16 02:40, Emilio G. Cota wrote:
> > On Fri, Jun 03, 2016 at 20:46:07 +0300, Sergey Fedorov wrote:
> >> Maybe something like
> >> https://en.wikipedia.org/wiki/Kahan_summation_algorithm could help?
> > That algorithm is overkill for what we're doing. Pairwise summation
> > should suffice:
> >
> > diff --git a/util/qdist.c b/util/qdist.c
> > index 3343640..909bd2b 100644
> > --- a/util/qdist.c
> > +++ b/util/qdist.c
> > @@ -367,20 +367,34 @@ unsigned long qdist_sample_count(const struct qdist *dist)
> >      return count;
> >  }
> >  
> > +static double qdist_pairwise_avg(const struct qdist *dist, size_t index,
> > +                                 size_t n, unsigned long count)
> > +{
> > +    if (n <= 2) {
> 
> We would like to amortize the overhead of the recursion by making the
> cut-off sufficiently large.

Yes, this was just for showing what it looked like.

We can use 128 here like JuliaLang does:
  https://github.com/JuliaLang/julia/blob/d98f2c0dcd/base/arraymath.jl#L366
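
I.e. roughly (the constant's name here is made up):

-    if (n <= 2) {
+    if (n <= QDIST_PAIRWISE_CUTOFF) {

with QDIST_PAIRWISE_CUTOFF defined as 128.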

(snip)
> Otherwise looks good.

Thanks!

		Emilio

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-06-07 15:56       ` Sergey Fedorov
@ 2016-06-08  0:02         ` Emilio G. Cota
  2016-06-08 14:10           ` Sergey Fedorov
  0 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-06-08  0:02 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On Tue, Jun 07, 2016 at 18:56:48 +0300, Sergey Fedorov wrote:
> On 07/06/16 04:05, Emilio G. Cota wrote:
> > On Sat, May 28, 2016 at 21:15:06 +0300, Sergey Fedorov wrote:
> >> On 25/05/16 04:13, Emilio G. Cota wrote:
> >>> diff --git a/util/qdist.c b/util/qdist.c
> >>> new file mode 100644
> >>> index 0000000..3343640
> >>> --- /dev/null
> >>> +++ b/util/qdist.c
> >>> @@ -0,0 +1,386 @@
> >> (snip)
> >>> +
> >>> +void qdist_add(struct qdist *dist, double x, long count)
> >>> +{
> >>> +    struct qdist_entry *entry = NULL;
> >>> +
> >>> +    if (dist->entries) {
> >>> +        struct qdist_entry e;
> >>> +
> >>> +        e.x = x;
> >>> +        entry = bsearch(&e, dist->entries, dist->n, sizeof(e), qdist_cmp);
> >>> +    }
> >>> +
> >>> +    if (entry) {
> >>> +        entry->count += count;
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    dist->entries = g_realloc(dist->entries,
> >>> +                              sizeof(*dist->entries) * (dist->n + 1));
> >> Repeated doubling?
> > Can you please elaborate?
> 
> I mean dynamic array with a growth factor of 2
> [https://en.wikipedia.org/wiki/Dynamic_array].

Changed to:

diff --git a/include/qemu/qdist.h b/include/qemu/qdist.h
index 6d8b701..f30050c 100644
--- a/include/qemu/qdist.h
+++ b/include/qemu/qdist.h
@@ -29,6 +29,7 @@ struct qdist_entry {
 struct qdist {
     struct qdist_entry *entries;
     size_t n;
+    size_t size;
 };
 
 #define QDIST_PR_BORDER     BIT(0)
diff --git a/util/qdist.c b/util/qdist.c
index dc9dbd1..3b54354 100644
--- a/util/qdist.c
+++ b/util/qdist.c
@@ -16,6 +16,7 @@
 void qdist_init(struct qdist *dist)
 {
     dist->entries = NULL;
+    dist->size = 0;
     dist->n = 0;
 }
 
@@ -58,8 +59,11 @@ void qdist_add(struct qdist *dist, double x, long count)
         return;
     }
 
-    dist->entries = g_realloc(dist->entries,
-                              sizeof(*dist->entries) * (dist->n + 1));
+    if (unlikely(dist->n == dist->size)) {
+        dist->size = dist->size ? dist->size * 2 : 1;
+        dist->entries = g_realloc(dist->entries,
+                                  sizeof(*dist->entries) * (dist->size));
+    }
     dist->n++;
     entry = &dist->entries[dist->n - 1];
     entry->x = x;
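
(With a growth factor of 2, the g_realloc() cost is amortized O(1) per
added entry, at the price of at most half of the array sitting unused.)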


> >> (snip)
> >>> +static char *qdist_pr_internal(const struct qdist *dist)
> >>> +{
> >>> +    double min, max, step;
> >>> +    GString *s = g_string_new("");
> >>> +    size_t i;
> >>> +
> >>> +    /* if only one entry, its printout will be either full or empty */
> >>> +    if (dist->n == 1) {
> >>> +        if (dist->entries[0].count) {
> >>> +            g_string_append_unichar(s, qdist_blocks[QDIST_NR_BLOCK_CODES - 1]);
> >>> +        } else {
> >>> +            g_string_append_c(s, ' ');
> >>> +        }
> >>> +        goto out;
> >>> +    }
> >>> +
> >>> +    /* get min and max counts */
> >>> +    min = dist->entries[0].count;
> >>> +    max = min;
> >>> +    for (i = 0; i < dist->n; i++) {
> >>> +        struct qdist_entry *e = &dist->entries[i];
> >>> +
> >>> +        if (e->count < min) {
> >>> +            min = e->count;
> >>> +        }
> >>> +        if (e->count > max) {
> >>> +            max = e->count;
> >>> +        }
> >>> +    }
> >>> +
> >>> +    /* floor((count - min) * step) will give us the block index */
> >>> +    step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
> >>> +
> >>> +    for (i = 0; i < dist->n; i++) {
> >>> +        struct qdist_entry *e = &dist->entries[i];
> >>> +        int index;
> >>> +
> >>> +        /* make an exception with 0; instead of using block[0], print a space */
> >>> +        if (e->count) {
> >>> +            index = (int)((e->count - min) * step);
> >> So "e->count == min" gives us one eighth block instead of just space?
> > Yes, only 0 can print a space.
> 
> So our scale is not linear. I think some users might get confused by this.

That's correct. I think special-casing 0 makes sense though, since
it increases the signal-to-noise ratio of the histogram. For example:

1) 0 as ' ':
TB hash occupancy   31.84% avg chain occ. Histogram: [0,10)%|▆ █  ▅▁▃▁▁|[90,100]%
TB hash avg chain   1.015 buckets. Histogram: 1|█▁▁|3

2) 0 as '1/8':
TB hash occupancy   32.07% avg chain occ. Histogram: [0,10)%|▆▁█▁▁▅▁▃▁▁|[90,100]%
TB hash avg chain   1.015 buckets. Histogram: 1|▇▁▁|3

I think in these examples most users would be less confused by 1) than by 2).

(snip)
> >>> +        to->n = from->n;
> >>> +        memcpy(to->entries, from->entries, sizeof(*to->entries) * to->n);
> >>> +        return;
> >>> +    }
> >>> +
> >>> + rebin:
> 
> By the way, here's a space before the 'rebin' label.

Yes, I always do this.
It prevents diff from mistaking the label for a function definition,
and thus wrongly using the label as context. See:
  https://lkml.org/lkml/2010/6/16/312
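
(Concretely: diff's -p heuristic takes, for the @@ hunk header, the most
recent line that starts in column 0 with an identifier character, so an
unindented "rebin:" could show up as "@@ ... @@ rebin:" instead of the
enclosing function name; the leading space keeps the label from matching.)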


> >>> +    j_min = 0;
> >>> +    for (i = 0; i < n; i++) {
> >>> +        double x;
> >>> +        double left, right;
> >>> +
> >>> +        left = xmin + i * step;
> >>> +        right = xmin + (i + 1) * step;
> >>> +
> >>> +        /* Add x, even if it might not get any counts later */
> >>> +        x = left;
> >> This way we round down to the left margin of each bin like this:
> >>
> >>     xmin [*---*---*---*---*] xmax   -- from
> >>           |  /|  /|  /|  /
> >>           | / | / | / | /
> >>           |/  |/  |/  |/
> >>           |   |   |   |
> >>           V   V   V   V
> >>          [*   *   *   *]            -- to
> > (snip)
> >>     xmin [*----*----*----*] xmax    -- from
> >>         \   /\   /\   /\   /
> >>          \ /  \ /  \ /  \ /
> >>           |    |    |    |
> >>           V    V    V    V
> >>          [*    *    *    *]         -- to
> >>
> >> I'm not sure which is the more correct option from the mathematical
> >> point of view; but with the last variant of the algorithm, multiple
> >> binning would still give the same result.
> > There's no "right" or "wrong" way as long as we're consistent
> > and we print the right counts in the right bins. I think the
> > convention I chose is simple enough, and leads to simple printing
> > of the labels. But yes other alternatives would be OK here.
> 
> Well, if we go ahead with my last suggestion the code would look like this:
> 
> rebin:
>     /* We do the binning using the following scheme:
>      *
>      *  xmin [*----*----*----*] xmax    -- from
>      *      \   /\   /\   /\   /
>      *       \ /  \ /  \ /  \ /
>      *        |    |    |    |
>      *        V    V    V    V
>      *       [*    *    *    *]         -- to
>      *
>      */
>     step = (xmax - xmin) / (n - 1);
>     j = 0;
>     for (i = 0; i < n; i++) {
>         double x;
>         double right;
> 
>         x = xmin + i * step;
>         right = x + 0.5 * step;
> 
>         /* Add x, even if it might not get any counts later */
>         qdist_add(to, x, 0);
> 
>         /* To avoid double-counting we capture [left, right) ranges */
>         while (j < from->n && from->entries[j].x < right) {
>             qdist_add(to, x, from->entries[j].count);
>             j++;
>         }
>     }
>     assert(j == from->n);
> }
> 
> Actually it's simpler than current version.

The behaviour isn't the same, though. With this, the two outer bins
(leftmost and rightmost) are unnecessarily large (since they extend
beyond the range of the input data).

For example, assume the data is between 0 and 100 and n=5 (i.e. step=25),
it makes no sense to report the first bin as [-12.5,12.5). If we
then truncate the unnecessary edges, we'd have [0,12.5), but
then the second bin is [12.5,37.5). Bins of unequal size are
possible (although a bit unusual) in histograms, but given
our Unicode-based representation, we're limited to same-width bars.
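
Spelled out with those numbers (assuming the current code bins with
step = (xmax - xmin) / n, as the loop above suggests):

  current scheme,   n = 5, step = 100 / 5 = 20:
      [0,20) [20,40) [40,60) [60,80) [80,100]

  suggested scheme, n = 5, step = 100 / 4 = 25, values at bin centers:
      [-12.5,12.5) [12.5,37.5) [37.5,62.5) [62.5,87.5) [87.5,112.5]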

		Emilio

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 00/15] tb hash improvements
  2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
                   ` (14 preceding siblings ...)
  2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 15/15] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
@ 2016-06-08  6:25 ` Alex Bennée
  2016-06-08 15:16   ` Emilio G. Cota
  2016-06-08 15:35   ` Richard Henderson
  15 siblings, 2 replies; 63+ messages in thread
From: Alex Bennée @ 2016-06-08  6:25 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov


Emilio G. Cota <cota@braap.org> writes:

> v5: https://lists.gnu.org/archive/html/qemu-devel/2016-05/msg02366.html
>
> v6 applies cleanly on top of tcg-next (8b1fe3f4 "cpu-exec:
> Clean up 'interrupt_request' reloading", tagged "pull-tcg-20160512").

Emilio,

How is v7 going? I only mention it because we are starting to discuss
feature freeze for 2.7 and I'm pretty keen to get QHT merged before
then.

Richard,

How happy are you with this series so far? Are you planning to take it in
via your tree when ready?

>
> Changes from v5, mostly from Sergey's review:
>
> - processor.h: use #ifdef #elif throughout the file
>
> - tb_hash_func: use uint32 for 'flags' param
>
> - tb_hash_func5: do 'foo >> 32' instead of 'foo >> 31 >> 1', since foo
>   is a u64.
>
> - thread.h:
>   * qemu_spin_locked: remove acquire semantics; simply use atomic_read().
>   * qemu_spin_trylock: return bool instead of 0 or -EBUSY; this saves
>     a branch.
>   * qemu_spin:
>     + use __sync_lock_test_and_set and __sync_lock_release; drop
>       the patches touching atomic.h.
>     + add unlikely() hint to "while (test_and_set)"; this gives a small
>       speedup under no contention.
>
> - qht:
>   * merge the parallel-writes patch into the QHT patch.
>     [Richard: I dropped your reviewed-by since the patch changed
>      quite a bit.]
>   * drop unneeded #includes from qht.h
>   * document qht.h using kerneldoc.
>   * use unsigned int for storing the seqlock version.
>   * fix a couple of typos in the comments at the top of qht.c.
>   * explain better the "no duplicated pointer" policy: while trying to
>     insert an already-existing hash-pointer pair is OK (insert will
>     just return false), it's not OK to insert different hash-pointer
>     pairs that share the same pointer value, but not the hashes.
>   * Add comment about lookups having to be done in an RCU read-critical
>     section.
>   * remove map->stale; simply check ht->map before and after acquiring
>     a bucket lock.
>   * only use atomic_read/set on bucket pointers, not hashes. Reading
>     partially-updated hashes is OK, since we'll retry anyway thanks
>     to the seqlock. Add a comment regarding this at the top of struct
>     qht_bucket.
>   * s/b->n/b->n_buckets/
>   * define qht_debug_assert, enabled #ifdef QHT_DEBUG. Use it instead of
>     assert(), except in one case (slow path) where g_assert_cmpuint is
>     convenient.
>   * use a mutex for ht->lock instead of a spinlock. This makes the resize
>     code simpler, since holding ht->lock for a bit of time is OK now;
>     other threads won't be busy-waiting. Document that ht->lock needs
>     to be grabbed before b->lock.
>   * use atomic_rcu_read/set instead of open-coding them.
>   * qht_remove: only clear out b->hashes[] and b->pointers[] if they belong
>                 to what was the last entry in the chain.
>   * qht_remove: add debug assert against inserting a NULL pointer.
>
> Thanks,
>
> 		Emilio


--
Alex Bennée

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-06-07 22:53               ` Emilio G. Cota
@ 2016-06-08 13:09                 ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-06-08 13:09 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On 08/06/16 01:53, Emilio G. Cota wrote:
> On Tue, Jun 07, 2016 at 17:06:16 +0300, Sergey Fedorov wrote:
>> On 07/06/16 02:40, Emilio G. Cota wrote:
>>> On Fri, Jun 03, 2016 at 20:46:07 +0300, Sergey Fedorov wrote:
>>>> Maybe something like
>>>> https://en.wikipedia.org/wiki/Kahan_summation_algorithm could help?
>>> That algorithm is overkill for what we're doing. Pairwise summation
>>> should suffice:
>>>
>>> diff --git a/util/qdist.c b/util/qdist.c
>>> index 3343640..909bd2b 100644
>>> --- a/util/qdist.c
>>> +++ b/util/qdist.c
>>> @@ -367,20 +367,34 @@ unsigned long qdist_sample_count(const struct qdist *dist)
>>>      return count;
>>>  }
>>>  
>>> +static double qdist_pairwise_avg(const struct qdist *dist, size_t index,
>>> +                                 size_t n, unsigned long count)
>>> +{
>>> +    if (n <= 2) {
>> We would like to amortize the overhead of the recursion by making the
>> cut-off sufficiently large.
> Yes, this was just for showing what it looked like.
>
> We can use 128 here like JuliaLang does:
>   https://github.com/JuliaLang/julia/blob/d98f2c0dcd/base/arraymath.jl#L366
>
>

Probably a cut-off of ~10 items would be enough to amortize the recursion
in C.
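
Something like the following sketch, with the cut-off as a tunable (the
function name and the value of 8 are illustrative, not from the patch):

#include <stddef.h>

static double pairwise_sum(const double *a, size_t n)
{
    double sum = 0.0;
    size_t i;

    if (n <= 8) {   /* cut-off; the thread discusses ~10 vs. 128 */
        for (i = 0; i < n; i++) {
            sum += a[i];
        }
        return sum;
    }
    /* halve and recurse: the rounding error grows as O(log n) instead
     * of the O(n) of naive left-to-right accumulation */
    return pairwise_sum(a, n / 2) + pairwise_sum(a + n / 2, n - n / 2);
}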

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-06-08  0:02         ` Emilio G. Cota
@ 2016-06-08 14:10           ` Sergey Fedorov
  2016-06-08 18:06             ` Emilio G. Cota
  0 siblings, 1 reply; 63+ messages in thread
From: Sergey Fedorov @ 2016-06-08 14:10 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On 08/06/16 03:02, Emilio G. Cota wrote:
> On Tue, Jun 07, 2016 at 18:56:48 +0300, Sergey Fedorov wrote:
>> On 07/06/16 04:05, Emilio G. Cota wrote:
>>> On Sat, May 28, 2016 at 21:15:06 +0300, Sergey Fedorov wrote:
>>>> On 25/05/16 04:13, Emilio G. Cota wrote:
>>>>> diff --git a/util/qdist.c b/util/qdist.c
>>>>> new file mode 100644
>>>>> index 0000000..3343640
>>>>> --- /dev/null
>>>>> +++ b/util/qdist.c
>>>>> @@ -0,0 +1,386 @@
>>>> (snip)
>>>>> +
>>>>> +void qdist_add(struct qdist *dist, double x, long count)
>>>>> +{
>>>>> +    struct qdist_entry *entry = NULL;
>>>>> +
>>>>> +    if (dist->entries) {
>>>>> +        struct qdist_entry e;
>>>>> +
>>>>> +        e.x = x;
>>>>> +        entry = bsearch(&e, dist->entries, dist->n, sizeof(e), qdist_cmp);
>>>>> +    }
>>>>> +
>>>>> +    if (entry) {
>>>>> +        entry->count += count;
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    dist->entries = g_realloc(dist->entries,
>>>>> +                              sizeof(*dist->entries) * (dist->n + 1));
>>>> Repeated doubling?
>>> Can you please elaborate?
>> I mean dynamic array with a growth factor of 2
>> [https://en.wikipedia.org/wiki/Dynamic_array].
> Changed to:
>
> diff --git a/include/qemu/qdist.h b/include/qemu/qdist.h
> index 6d8b701..f30050c 100644
> --- a/include/qemu/qdist.h
> +++ b/include/qemu/qdist.h
> @@ -29,6 +29,7 @@ struct qdist_entry {
>  struct qdist {
>      struct qdist_entry *entries;
>      size_t n;
> +    size_t size;
>  };
>  
>  #define QDIST_PR_BORDER     BIT(0)
> diff --git a/util/qdist.c b/util/qdist.c
> index dc9dbd1..3b54354 100644
> --- a/util/qdist.c
> +++ b/util/qdist.c
> @@ -16,6 +16,7 @@
>  void qdist_init(struct qdist *dist)
>  {
>      dist->entries = NULL;
> +    dist->size = 0;
>      dist->n = 0;
>  }
>  
> @@ -58,8 +59,11 @@ void qdist_add(struct qdist *dist, double x, long count)
>          return;
>      }
>  
> -    dist->entries = g_realloc(dist->entries,
> -                              sizeof(*dist->entries) * (dist->n + 1));
> +    if (unlikely(dist->n == dist->size)) {
> +        dist->size = dist->size ? dist->size * 2 : 1;

We could initialize 'dist->size' to 1 and allocate a 1-entry
'dist->entries' array in qdist_init() to avoid this ternary operation ;-)

Otherwise looks good.

> +        dist->entries = g_realloc(dist->entries,
> +                                  sizeof(*dist->entries) * (dist->size));
> +    }
>      dist->n++;
>      entry = &dist->entries[dist->n - 1];
>      entry->x = x;
>
>
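
(For what it's worth, the usual argument for the factor-of-two growth:
reaching capacity 1024 from 1 copies 1 + 2 + ... + 512 = 1023 elements
across all the reallocs, i.e. amortized O(1) per append, whereas growing
by one element at a time copies about n^2 / 2 elements in total.)
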
>>>> (snip)
>>>>> +static char *qdist_pr_internal(const struct qdist *dist)
>>>>> +{
>>>>> +    double min, max, step;
>>>>> +    GString *s = g_string_new("");
>>>>> +    size_t i;
>>>>> +
>>>>> +    /* if only one entry, its printout will be either full or empty */
>>>>> +    if (dist->n == 1) {
>>>>> +        if (dist->entries[0].count) {
>>>>> +            g_string_append_unichar(s, qdist_blocks[QDIST_NR_BLOCK_CODES - 1]);
>>>>> +        } else {
>>>>> +            g_string_append_c(s, ' ');
>>>>> +        }
>>>>> +        goto out;
>>>>> +    }
>>>>> +
>>>>> +    /* get min and max counts */
>>>>> +    min = dist->entries[0].count;
>>>>> +    max = min;
>>>>> +    for (i = 0; i < dist->n; i++) {
>>>>> +        struct qdist_entry *e = &dist->entries[i];
>>>>> +
>>>>> +        if (e->count < min) {
>>>>> +            min = e->count;
>>>>> +        }
>>>>> +        if (e->count > max) {
>>>>> +            max = e->count;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    /* floor((count - min) * step) will give us the block index */
>>>>> +    step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
>>>>> +
>>>>> +    for (i = 0; i < dist->n; i++) {
>>>>> +        struct qdist_entry *e = &dist->entries[i];
>>>>> +        int index;
>>>>> +
>>>>> +        /* make an exception with 0; instead of using block[0], print a space */
>>>>> +        if (e->count) {
>>>>> +            index = (int)((e->count - min) * step);
>>>> So "e->count == min" gives us one eighth block instead of just space?
>>> Yes, only 0 can print a space.
>> So our scale is not linear. I think some users might get confused by this.
> That's correct. I think special-casing 0 makes sense though, since
> it increases the signal-to-noise ratio of the histogram. For example:
>
> 1) 0 as ' ':
> TB hash occupancy   31.84% avg chain occ. Histogram: [0,10)%|▆ █  ▅▁▃▁▁|[90,100]%
> TB hash avg chain   1.015 buckets. Histogram: 1|█▁▁|3
>
> 2) 0 as '1/8':
> TB hash occupancy   32.07% avg chain occ. Histogram: [0,10)%|▆▁█▁▁▅▁▃▁▁|[90,100]%
> TB hash avg chain   1.015 buckets. Histogram: 1|▇▁▁|3
>
> I think in these examples most users would be less confused by 1) than by 2).

I meant to represent all bars whose value is < 1/8 as a space, not only
those whose value is exactly zero. Otherwise we can see a 1/8 bar where
the actual value differs negligibly from zero, as in the second example.

>
> (snip)
>>>>> +        to->n = from->n;
>>>>> +        memcpy(to->entries, from->entries, sizeof(*to->entries) * to->n);
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> + rebin:
>> By the way, here's a space before the 'rebin' label.
> Yes, I always do this.
> It prevents diff from mistaking the label for a function definition,
> and thus wrongly using the label as context. See:
>   https://lkml.org/lkml/2010/6/16/312

Cool!

>
>
>>>>> +    j_min = 0;
>>>>> +    for (i = 0; i < n; i++) {
>>>>> +        double x;
>>>>> +        double left, right;
>>>>> +
>>>>> +        left = xmin + i * step;
>>>>> +        right = xmin + (i + 1) * step;
>>>>> +
>>>>> +        /* Add x, even if it might not get any counts later */
>>>>> +        x = left;
>>>> This way we round down to the left margin of each bin like this:
>>>>
>>>>     xmin [*---*---*---*---*] xmax   -- from
>>>>           |  /|  /|  /|  /
>>>>           | / | / | / | /
>>>>           |/  |/  |/  |/
>>>>           |   |   |   |
>>>>           V   V   V   V
>>>>          [*   *   *   *]            -- to
>>> (snip)
>>>>     xmin [*----*----*----*] xmax    -- from
>>>>         \   /\   /\   /\   /
>>>>          \ /  \ /  \ /  \ /
>>>>           |    |    |    |
>>>>           V    V    V    V
>>>>          [*    *    *    *]         -- to
>>>>
>>>> I'm not sure which is the more correct option from the mathematical
>>>> point of view; but multiple-binning with the last variant of the
>>>> algorithm would still give the same result.
>>> There's no "right" or "wrong" way as long as we're consistent
>>> and we print the right counts in the right bins. I think the
>>> convention I chose is simple enough, and leads to simple printing
>>> of the labels. But yes, other alternatives would be OK here.
>> Well, if we go ahead with my last suggestion the code would look like this:
>>
>> rebin:
>>     /* We do the binning using the following scheme:
>>      *
>>      *  xmin [*----*----*----*] xmax    -- from
>>      *      \   /\   /\   /\   /
>>      *       \ /  \ /  \ /  \ /
>>      *        |    |    |    |
>>      *        V    V    V    V
>>      *       [*    *    *    *]         -- to
>>      *
>>      */
>>     step = (xmax - xmin) / (n - 1);
>>     j = 0;
>>     for (i = 0; i < n; i++) {
>>         double x;
>>         double right;
>>
>>         x = xmin + i * step;
>>         right = x + 0.5 * step;
>>
>>         /* Add x, even if it might not get any counts later */
>>         qdist_add(to, x, 0);
>>
>>         /* To avoid double-counting we capture [left, right) ranges */
>>         while (j < from->n && from->entries[j].x < right) {
>>             qdist_add(to, x, from->entries[j].count);
>>             j++;
>>         }
>>     }
>>     assert(j == from->n);
>> }
>>
>> Actually it's simpler than current version.
> The behaviour isn't the same, though. With this, the two outer bins
> (leftmost and rightmost) are unnecessarily large (since they extend
> beyond the range of the input data).
>
> For example, assume the data is between 0 and 100 and n=5 (i.e. step=25),
> it makes no sense to report the first bin as [-12.5,12.5). If we
> then truncate the unnecessary edges, we'd have [0,12.5), but
> then the second bin is [12.5,37.5). Bins of unequal size are
> possible (although a bit unusual) in histograms, but given
> our Unicode-based representation, we're limited to same-width bars.

That is why I noted that I'm not sure which is the most correct from a
mathematical point of view. Maybe consider the second option, i.e.
rounding to the middle of each bin with:

    x = left + step / 2;

which would give the picture like this:


    xmin [*---*---*---*---*] xmax   -- from
          |   |   |   |   |
           \ / \ / \ / \ /
            |   |   |   |
            V   V   V   V
           [*   *   *   *]          -- to

Anyway, if you like, you may consider whether it's possible to apply
some simplifications from my code to the final version.

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 00/15] tb hash improvements
  2016-06-08  6:25 ` [Qemu-devel] [PATCH v6 00/15] tb hash improvements Alex Bennée
@ 2016-06-08 15:16   ` Emilio G. Cota
  2016-06-08 15:35   ` Richard Henderson
  1 sibling, 0 replies; 63+ messages in thread
From: Emilio G. Cota @ 2016-06-08 15:16 UTC (permalink / raw)
  To: Alex Bennée
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Richard Henderson,
	Sergey Fedorov

On Wed, Jun 08, 2016 at 07:25:33 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
> 
> > v5: https://lists.gnu.org/archive/html/qemu-devel/2016-05/msg02366.html
> >
> > v6 applies cleanly on top of tcg-next (8b1fe3f4 "cpu-exec:
> > Clean up 'interrupt_request' reloading", tagged "pull-tcg-20160512").
> 
> How is v7 going? I only mention it because we are starting to discuss
> feature freeze for 2.7 and I'm pretty keen to get QHT merged before
> then.

I'll send it today. v7 applies cleanly on top of the current master--I've
fixed the conflicts you pointed out related to the addition of
include/exec/tb-context.h.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 00/15] tb hash improvements
  2016-06-08  6:25 ` [Qemu-devel] [PATCH v6 00/15] tb hash improvements Alex Bennée
  2016-06-08 15:16   ` Emilio G. Cota
@ 2016-06-08 15:35   ` Richard Henderson
  2016-06-08 15:37     ` Sergey Fedorov
  1 sibling, 1 reply; 63+ messages in thread
From: Richard Henderson @ 2016-06-08 15:35 UTC (permalink / raw)
  To: Alex Bennée, Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Sergey Fedorov

On 06/07/2016 11:25 PM, Alex Bennée wrote:
> Richard,
> 
> How happy are you with this series so far? Are you planning to take it in
> via your tree when ready?

I'm happy with it.  I was all set to merge v6 before Sergey started commenting.


r~

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 00/15] tb hash improvements
  2016-06-08 15:35   ` Richard Henderson
@ 2016-06-08 15:37     ` Sergey Fedorov
  2016-06-08 16:45       ` Alex Bennée
  0 siblings, 1 reply; 63+ messages in thread
From: Sergey Fedorov @ 2016-06-08 15:37 UTC (permalink / raw)
  To: Richard Henderson, Alex Bennée, Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini

On 08/06/16 18:35, Richard Henderson wrote:
> On 06/07/2016 11:25 PM, Alex Bennée wrote:
>> Richard,
>>
>> How happy are you with this series so far? Are you planning to take it in
>> via your tree when ready?
> I'm happy with it.  I was all set to merge v6 before Sergey started commenting.

I think v7 would be fine to merge, though :)

Kind regards,
Sergey

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 00/15] tb hash improvements
  2016-06-08 15:37     ` Sergey Fedorov
@ 2016-06-08 16:45       ` Alex Bennée
  0 siblings, 0 replies; 63+ messages in thread
From: Alex Bennée @ 2016-06-08 16:45 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: Richard Henderson, Emilio G. Cota, QEMU Developers, MTTCG Devel,
	Paolo Bonzini


Sergey Fedorov <serge.fdrv@gmail.com> writes:

> On 08/06/16 18:35, Richard Henderson wrote:
>> On 06/07/2016 11:25 PM, Alex Bennée wrote:
>>> Richard,
>>>
>>> How happy are you with this series so far? Are you planning to take it in
>>> via your tree when ready?
>> I'm happy with it.  I was all set to merge v6 before Sergey started commenting.
>
> I think v7 would be fine to merge, though :)

Cool, we await v7 with bated breath then ;-)

--
Alex Bennée

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-06-08 14:10           ` Sergey Fedorov
@ 2016-06-08 18:06             ` Emilio G. Cota
  2016-06-08 18:18               ` Sergey Fedorov
  0 siblings, 1 reply; 63+ messages in thread
From: Emilio G. Cota @ 2016-06-08 18:06 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On Wed, Jun 08, 2016 at 17:10:03 +0300, Sergey Fedorov wrote:
> On 08/06/16 03:02, Emilio G. Cota wrote:
> > -    dist->entries = g_realloc(dist->entries,
> > -                              sizeof(*dist->entries) * (dist->n + 1));
> > +    if (unlikely(dist->n == dist->size)) {
> > +        dist->size = dist->size ? dist->size * 2 : 1;
> 
> We could initialize 'dist->size' to 1 and allocate a 1-entry
> 'dist->entries' array in qdist_init() to avoid this ternary operation ;-)

Done. This resulted in quite a few modifications, since dist->entries == NULL
had been used as an equivalent to dist->n == 0.
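
qdist_init() now allocates the first entry up front, i.e. something
along these lines (a sketch, not the exact patch):

void qdist_init(struct qdist *dist)
{
    /* sketch: 1-entry array up front, so dist->entries is never NULL */
    dist->entries = g_new(struct qdist_entry, 1);
    dist->size = 1;
    dist->n = 0;
}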

> >>>> (snip)
> >> So our scale is not linear. I think some users might get confused by this.
> > That's correct. I think special-casing 0 makes sense though, since
> > it increases the signal-to-noise ratio of the histogram. For example:
> >
> > 1) 0 as ' ':
> > TB hash occupancy   31.84% avg chain occ. Histogram: [0,10)%|▆ █  ▅▁▃▁▁|[90,100]%
> > TB hash avg chain   1.015 buckets. Histogram: 1|█▁▁|3
> >
> > 2) 0 as '1/8':
> > TB hash occupancy   32.07% avg chain occ. Histogram: [0,10)%|▆▁█▁▁▅▁▃▁▁|[90,100]%
> > TB hash avg chain   1.015 buckets. Histogram: 1|▇▁▁|3
> >
> > I think in these examples most users would be less confused by 1) than by 2).
> 
> I meant to represent all bars whose value is < 1/8 as a space, not only
> those whose value is exactly zero. Otherwise we can see a 1/8 bar where
> the actual value differs negligibly from zero, as in the second example.

I see. That would be (3):

TB hash occupancy   32.79% avg chain occ. Histogram: [0,10)%|▅ █  ▅ ▂  |[90,100]%
TB hash avg chain   1.017 buckets. Histogram: 1|█  |3

I still think (1) is the representation that gives the most information.
IMO it's valuable that "close to zero" and "zero" are represented differently,
in the same way that max and "close to max" are represented differently
as well (only max gets 8/8).

BTW, while looking into this I fixed a bug; sometimes we'd print 7/8 instead
of 8/8 for the max value, due to the ordering of FP computations
[see 1-3 in (2) above; it's 7/8 instead of 8/8]. Fixed with:

diff --git a/util/qdist.c b/util/qdist.c
index cfe09e6..7842d34 100644
--- a/util/qdist.c
+++ b/util/qdist.c
@@ -103,7 +103,7 @@ static const gunichar qdist_blocks[] = {
  */
 static char *qdist_pr_internal(const struct qdist *dist)
 {
-    double min, max, step;
+    double min, max;
     GString *s = g_string_new("");
     size_t i;
 
@@ -131,16 +131,14 @@ static char *qdist_pr_internal(const struct qdist *dist)
         }
     }
 
-    /* floor((count - min) * step) will give us the block index */
-    step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
-
     for (i = 0; i < dist->n; i++) {
         struct qdist_entry *e = &dist->entries[i];
         int index;
 
         /* make an exception with 0; instead of using block[0], print a space */
         if (e->count) {
-            index = (int)((e->count - min) * step);
+            /* divide first to avoid loss of precision when e->count == max */
+            index = (e->count - min) / (max - min) * (QDIST_NR_BLOCK_CODES - 1);
             g_string_append_unichar(s, qdist_blocks[index]);
         } else {
             g_string_append_c(s, ' ');

I also added a test to test-qdist (called "test_bin_precision") that
checks for this.
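
For the curious, the hazard is easy to hunt down in isolation (a
standalone toy, not part of the patch; the ranges are made up):

#include <stdio.h>

int main(void)
{
    int m;

    /* min == 0, count == max == m: the old expression can land just
     * below 7 and truncate; the new one divides first, and m / m is
     * exactly 1.0, so index 7 is guaranteed */
    for (m = 1; m <= 1000; m++) {
        double step = 7 / (double)m;
        int idx_old = (int)(m * step);
        int idx_new = (int)((double)m / m * 7);

        if (idx_old != idx_new) {
            printf("max - min = %d: old index %d, new index %d\n",
                   m, idx_old, idx_new);
            return 0;
        }
    }
    printf("no mismatch up to 1000\n");
    return 0;
}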

> > The behaviour isn't the same, though. With this, the two outer bins
> > (leftmost and rightmost) are unnecessarily large (since they extend
> > beyond the range of the input data).
> >
> > For example, assume the data is between 0 and 100 and n=5 (i.e. step=25),
> > it makes no sense to report the first bin as [-12.5,12.5). If we
> > then truncate the unnecessary edges, we'd have [0,12.5), but
> > then the second bin is [12.5,37.5). Bins of unequal size are
> > possible (although a bit unusual) in histograms, but given
> > our Unicode-based representation, we're limited to same-width bars.
> 
> That is why I noted that I'm not sure which is the most correct from a
> mathematical point of view. Maybe consider the second option, i.e.
> rounding to the middle of each bin with:
> 
>     x = left + step / 2;
> 
> which would give the picture like this:
> 
> 
>     xmin [*---*---*---*---*] xmax   -- from
>           |   |   |   |   |
>            \ / \ / \ / \ /
>             |   |   |   |
>             V   V   V   V
>            [*   *   *   *]          -- to

This binning is equivalent to what we do right now.

The only difference is where the value is set (either at the left
of the bin, or at the center as above); this, however, isn't too
important, since this value is only used when printing
the labels, i.e. we could print [left, left+step) or
[center-step/2, center+step/2) and still get the same results.

> Anyway, if you like, you may consider whether it's possible to apply
> some simplifications from my code to the final version.

OK. This is how it looks:

diff --git a/util/qdist.c b/util/qdist.c
index 7842d34..3ca2227 100644
--- a/util/qdist.c
+++ b/util/qdist.c
@@ -163,7 +163,7 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
 {
     double xmin, xmax;
     double step;
-    size_t i, j, j_min;
+    size_t i, j;
 
     qdist_init(to);
 
@@ -194,7 +194,7 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
     }
 
  rebin:
-    j_min = 0;
+    j = 0;
     for (i = 0; i < n; i++) {
         double x;
         double left, right;
@@ -210,19 +210,13 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
          * To avoid double-counting we capture [left, right) ranges, except for
          * the rightmost bin, which captures a [left, right] range.
          */
-        for (j = j_min; j < from->n; j++) {
+        while (j < from->n &&
+               (from->entries[j].x < right ||
+                (i == n - 1 && from->entries[j].x == right))) {
             struct qdist_entry *o = &from->entries[j];
 
-            /* entries are ordered so do not check beyond right */
-            if (o->x > right) {
-                break;
-            }
-            if (o->x >= left && (o->x < right ||
-                                   (i == n - 1 && o->x == right))) {
-                qdist_add(to, x, o->count);
-                /* don't check this entry again */
-                j_min = j + 1;
-            }
+            qdist_add(to, x, o->count);
+            j++;
         }
     }
 }

Thanks,

		Emilio

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data
  2016-06-08 18:06             ` Emilio G. Cota
@ 2016-06-08 18:18               ` Sergey Fedorov
  0 siblings, 0 replies; 63+ messages in thread
From: Sergey Fedorov @ 2016-06-08 18:18 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: QEMU Developers, MTTCG Devel, Alex Bennée, Paolo Bonzini,
	Richard Henderson

On 08/06/16 21:06, Emilio G. Cota wrote:
> On Wed, Jun 08, 2016 at 17:10:03 +0300, Sergey Fedorov wrote:
>> On 08/06/16 03:02, Emilio G. Cota wrote:
>>> -    dist->entries = g_realloc(dist->entries,
>>> -                              sizeof(*dist->entries) * (dist->n + 1));
>>> +    if (unlikely(dist->n == dist->size)) {
>>> +        dist->size = dist->size ? dist->size * 2 : 1;
>> We could initialize 'dist->size' to 1 and allocate a 1-entry
>> 'dist->entries' array in qdist_init() to avoid this ternary operation ;-)
> Done. This resulted in quite a few modifications, since dist->entries == NULL
> had been used as an equivalent to dist->n == 0.
>
>>>>>> (snip)
>>>> So our scale is not linear. I think some users might get confused by this.
>>> That's correct. I think special-casing 0 makes sense though, since
>>> it increases the signal-to-noise ratio of the histogram. For example:
>>>
>>> 1) 0 as ' ':
>>> TB hash occupancy   31.84% avg chain occ. Histogram: [0,10)%|▆ █  ▅▁▃▁▁|[90,100]%
>>> TB hash avg chain   1.015 buckets. Histogram: 1|█▁▁|3
>>>
>>> 2) 0 as '1/8':
>>> TB hash occupancy   32.07% avg chain occ. Histogram: [0,10)%|▆▁█▁▁▅▁▃▁▁|[90,100]%
>>> TB hash avg chain   1.015 buckets. Histogram: 1|▇▁▁|3
>>>
>>> I think in these examples most users would be less confused by 1) than by 2).
>> I meant to represent all bars whose value is < 1/8 as a space, not only
>> those whose value is exactly zero. Otherwise we can see a 1/8 bar where
>> the actual value differs negligibly from zero, as in the second example.
> I see. That would be (3):
>
> TB hash occupancy   32.79% avg chain occ. Histogram: [0,10)%|▅ █  ▅ ▂  |[90,100]%
> TB hash avg chain   1.017 buckets. Histogram: 1|█  |3
>
> I still think (1) is the representation that gives the most information.
> IMO it's valuable that "close to zero" and "zero" are represented differently,
> in the same way that max and "close to max" are represented differently
> as well (only max gets 8/8).

Anyhow, these histograms are only for rough estimation and to see some
nice Unicode output :)

>
> BTW, while looking into this I fixed a bug; sometimes we'd print 7/8 instead
> of 8/8 for the max value, due to the ordering of FP computations
> [see 1-3 in (2) above; it's 7/8 instead of 8/8]. Fixed with:
>
> diff --git a/util/qdist.c b/util/qdist.c
> index cfe09e6..7842d34 100644
> --- a/util/qdist.c
> +++ b/util/qdist.c
> @@ -103,7 +103,7 @@ static const gunichar qdist_blocks[] = {
>   */
>  static char *qdist_pr_internal(const struct qdist *dist)
>  {
> -    double min, max, step;
> +    double min, max;
>      GString *s = g_string_new("");
>      size_t i;
>  
> @@ -131,16 +131,14 @@ static char *qdist_pr_internal(const struct qdist *dist)
>          }
>      }
>  
> -    /* floor((count - min) * step) will give us the block index */
> -    step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
> -
>      for (i = 0; i < dist->n; i++) {
>          struct qdist_entry *e = &dist->entries[i];
>          int index;
>  
>          /* make an exception with 0; instead of using block[0], print a space */
>          if (e->count) {
> -            index = (int)((e->count - min) * step);
> +            /* divide first to avoid loss of precision when e->count == max */
> +            index = (e->count - min) / (max - min) * (QDIST_NR_BLOCK_CODES - 1);
>              g_string_append_unichar(s, qdist_blocks[index]);
>          } else {
>              g_string_append_c(s, ' ');
>
> I also added a test to test-qdist (called "test_bin_precision") that
> checks for this.
>
>>> The behaviour isn't the same, though. With this, the two outer bins
>>> (leftmost and rightmost) are unnecessarily large (since they extend
>>> beyond the range of the input data).
>>>
>>> For example, assume the data is between 0 and 100 and n=5 (i.e. step=25),
>>> it makes no sense to report the first bin as [-12.5,12.5). If we
>>> then truncate the unnecessary edges, we'd have [0,12.5), but
>>> then the second bin is [12.5,37.5). Bins of unequal size are
>>> possible (although a bit unusual) in histograms, but given
>>> our Unicode-based representation, we're limited to same-width bars.
>> That is why I noted that I'm not sure which is the most correct from a
>> mathematical point of view. Maybe consider the second option, i.e.
>> rounding to the middle of each bin with:
>>
>>     x = left + step / 2;
>>
>> which would give the picture like this:
>>
>>
>>     xmin [*---*---*---*---*] xmax   -- from
>>           |   |   |   |   |
>>            \ / \ / \ / \ /
>>             |   |   |   |
>>             V   V   V   V
>>            [*   *   *   *]          -- to
> This binning is equivalent to what we do right now.
>
> The only difference is where the value is set (either at the left
> of the bin, or at the center as above); this, however, isn't too
> important, since this value is only used when printing
> the labels, i.e. we could print [left, left+step) or
> [center-step/2, center+step/2) and still get the same results.
>
>> Anyway, if you like, you may consider whether it's possible to apply
>> some simplifications from my code to the final version.
> OK. This is how it looks:
>
> diff --git a/util/qdist.c b/util/qdist.c
> index 7842d34..3ca2227 100644
> --- a/util/qdist.c
> +++ b/util/qdist.c
> @@ -163,7 +163,7 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
>  {
>      double xmin, xmax;
>      double step;
> -    size_t i, j, j_min;
> +    size_t i, j;
>  
>      qdist_init(to);
>  
> @@ -194,7 +194,7 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
>      }
>  
>   rebin:
> -    j_min = 0;
> +    j = 0;
>      for (i = 0; i < n; i++) {
>          double x;
>          double left, right;
> @@ -210,19 +210,13 @@ void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
>           * To avoid double-counting we capture [left, right) ranges, except for
>           * the rightmost bin, which captures a [left, right] range.
>           */
> -        for (j = j_min; j < from->n; j++) {
> +        while (j < from->n &&
> +               (from->entries[j].x < right ||
> +                (i == n - 1 && from->entries[j].x == right))) {

If "i == n - 1" then we have to put everything what's left in 'from',
right? If so, then:

while (j < from->n && (from->entries[j].x < right || i == n - 1)) {
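
Putting both pieces together, the loop might then read (an untested
sketch of the combined result):

    j = 0;
    for (i = 0; i < n; i++) {
        double x = xmin + i * step;
        double right = xmin + (i + 1) * step;

        /* add x, even if it might not get any counts later */
        qdist_add(to, x, 0);

        /* capture [x, right) ranges; the last bin also takes x == right */
        while (j < from->n && (from->entries[j].x < right || i == n - 1)) {
            qdist_add(to, x, from->entries[j].count);
            j++;
        }
    }
    assert(j == from->n);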

Kind regards,
Sergey

>              struct qdist_entry *o = &from->entries[j];
>  
> -            /* entries are ordered so do not check beyond right */
> -            if (o->x > right) {
> -                break;
> -            }
> -            if (o->x >= left && (o->x < right ||
> -                                   (i == n - 1 && o->x == right))) {
> -                qdist_add(to, x, o->count);
> -                /* don't check this entry again */
> -                j_min = j + 1;
> -            }
> +            qdist_add(to, x, o->count);
> +            j++;
>          }
>      }
>  }
>
> Thanks,
>
> 		Emilio

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2016-06-08 18:18 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-25  1:13 [Qemu-devel] [PATCH v6 00/15] tb hash improvements Emilio G. Cota
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 01/15] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
2016-05-27 19:54   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 02/15] seqlock: remove optional mutex Emilio G. Cota
2016-05-27 19:55   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 03/15] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
2016-05-27 19:59   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 04/15] include/processor.h: define cpu_relax() Emilio G. Cota
2016-05-27 20:53   ` Sergey Fedorov
2016-05-27 21:10     ` Emilio G. Cota
2016-05-28 12:35       ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 05/15] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 06/15] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
2016-05-28 12:36   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 07/15] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
2016-05-28 12:39   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 08/15] qdist: add module to represent frequency distributions of data Emilio G. Cota
2016-05-28 18:15   ` Sergey Fedorov
2016-06-03 17:22     ` Emilio G. Cota
2016-06-03 17:29       ` Sergey Fedorov
2016-06-03 17:46         ` Sergey Fedorov
2016-06-06 23:40           ` Emilio G. Cota
2016-06-07 14:06             ` Sergey Fedorov
2016-06-07 22:53               ` Emilio G. Cota
2016-06-08 13:09                 ` Sergey Fedorov
2016-06-07  1:05     ` Emilio G. Cota
2016-06-07 15:56       ` Sergey Fedorov
2016-06-08  0:02         ` Emilio G. Cota
2016-06-08 14:10           ` Sergey Fedorov
2016-06-08 18:06             ` Emilio G. Cota
2016-06-08 18:18               ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 09/15] qdist: add test program Emilio G. Cota
2016-05-28 18:56   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 10/15] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
2016-05-29 19:52   ` Sergey Fedorov
2016-05-29 19:55     ` Sergey Fedorov
2016-05-31  7:46     ` Alex Bennée
2016-06-01 20:53       ` Sergey Fedorov
2016-06-03  9:18     ` Emilio G. Cota
2016-06-03 15:19       ` Sergey Fedorov
2016-06-03 11:01     ` Emilio G. Cota
2016-06-03 15:34       ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 11/15] qht: add test program Emilio G. Cota
2016-05-29 20:15   ` Sergey Fedorov
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 12/15] qht: add qht-bench, a performance benchmark Emilio G. Cota
2016-05-29 20:45   ` Sergey Fedorov
2016-06-03 11:41     ` Emilio G. Cota
2016-06-03 15:41       ` Sergey Fedorov
2016-05-31 15:12   ` Alex Bennée
2016-05-31 16:44     ` Emilio G. Cota
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 13/15] qht: add test-qht-par to invoke qht-bench from 'check' target Emilio G. Cota
2016-05-29 20:53   ` Sergey Fedorov
2016-06-03 11:07     ` Emilio G. Cota
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 14/15] tb hash: track translated blocks with qht Emilio G. Cota
2016-05-29 21:09   ` Sergey Fedorov
2016-05-31  8:39   ` Alex Bennée
2016-05-25  1:13 ` [Qemu-devel] [PATCH v6 15/15] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
2016-05-29 21:14   ` Sergey Fedorov
2016-06-08  6:25 ` [Qemu-devel] [PATCH v6 00/15] tb hash improvements Alex Bennée
2016-06-08 15:16   ` Emilio G. Cota
2016-06-08 15:35   ` Richard Henderson
2016-06-08 15:37     ` Sergey Fedorov
2016-06-08 16:45       ` Alex Bennée
