* [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1
@ 2018-02-27  5:39 Emilio G. Cota
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 01/16] qht: require a default comparison function Emilio G. Cota
                   ` (15 more replies)
  0 siblings, 16 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

With this set we finally remove tb_lock. The performance gains
when booting a guest are compelling at low core counts. However,
beyond 8 cores performance doesn't improve due to unrelated
contention--see results in the last patch of the series
("tcg: remove tb_lock").

I have another series that greatly reduces this other contention by
using per-CPU locks instead of the BQL to keep track of a subset
of CPUState. But that series is pretty large so let's deal with
this first.

You can fetch the patches from:
  https://github.com/cota/qemu/tree/tb-lock-removal-redux-v1

Thanks,

		Emilio

* [Qemu-devel] [PATCH 01/16] qht: require a default comparison function
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-02-28 19:02   ` Richard Henderson
  2018-03-28 16:21   ` Alex Bennée
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 02/16] qht: return existing entry when qht_insert fails Emilio G. Cota
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

qht_lookup now uses the default cmp function. qht_lookup_custom is defined
to retain the old behaviour, that is, the caller explicitly provides a cmp
function.

qht_insert will gain use of the default cmp in the next patch.
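
For illustration only (this is not part of the patch): a minimal sketch of
the default-vs-custom lookup split, using a toy fixed-size table rather than
the real QHT. All toy_* names and int_equal are hypothetical; only the shape
of the API mirrors qht_lookup()/qht_lookup_custom().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef bool (*toy_cmp_func_t)(const void *a, const void *b);

#define TOY_ENTRIES 8

/* Toy stand-in for struct qht: a flat array instead of buckets. */
struct toy_ht {
    toy_cmp_func_t cmp;              /* default comparison, set at init */
    void *pointers[TOY_ENTRIES];
    uint32_t hashes[TOY_ENTRIES];
};

void toy_init(struct toy_ht *ht, toy_cmp_func_t cmp)
{
    assert(cmp);                     /* the default cmp cannot be NULL */
    ht->cmp = cmp;
    for (size_t i = 0; i < TOY_ENTRIES; i++) {
        ht->pointers[i] = NULL;
        ht->hashes[i] = 0;
    }
}

/* Old behaviour: the caller supplies the comparison function. */
void *toy_lookup_custom(const struct toy_ht *ht, toy_cmp_func_t func,
                        const void *userp, uint32_t hash)
{
    for (size_t i = 0; i < TOY_ENTRIES; i++) {
        if (ht->pointers[i] && ht->hashes[i] == hash &&
            func(ht->pointers[i], userp)) {
            return ht->pointers[i];
        }
    }
    return NULL;
}

/* New behaviour: fall back to the comparison function given at init. */
void *toy_lookup(const struct toy_ht *ht, const void *userp, uint32_t hash)
{
    return toy_lookup_custom(ht, ht->cmp, userp, hash);
}

/* Example comparison function for int keys. */
bool int_equal(const void *ap, const void *bp)
{
    const int *a = ap;
    const int *b = bp;

    return *a == *b;
}
```

With a table initialized via toy_init(&ht, int_equal), both lookup variants
should return the same pointer for the same key and hash.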

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cpu-exec.c      |  4 ++--
 accel/tcg/translate-all.c | 16 +++++++++++++++-
 include/qemu/qht.h        | 23 +++++++++++++++++++----
 tests/qht-bench.c         | 14 +++++++-------
 tests/test-qht.c          | 15 ++++++++++-----
 util/qht.c                | 14 +++++++++++---
 6 files changed, 64 insertions(+), 22 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 280200f..ec57564 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -293,7 +293,7 @@ struct tb_desc {
     uint32_t trace_vcpu_dstate;
 };
 
-static bool tb_cmp(const void *p, const void *d)
+static bool tb_lookup_cmp(const void *p, const void *d)
 {
     const TranslationBlock *tb = p;
     const struct tb_desc *desc = d;
@@ -338,7 +338,7 @@ TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
     phys_pc = get_page_addr_code(desc.env, pc);
     desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
     h = tb_hash_func(phys_pc, pc, flags, cf_mask, *cpu->trace_dstate);
-    return qht_lookup(&tb_ctx.htable, tb_cmp, &desc, h);
+    return qht_lookup_custom(&tb_ctx.htable, tb_lookup_cmp, &desc, h);
 }
 
 void tb_set_jmp_target(TranslationBlock *tb, int n, uintptr_t addr)
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 67795cd..1cf10f8 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -785,11 +785,25 @@ static inline void code_gen_alloc(size_t tb_size)
     qemu_mutex_init(&tb_ctx.tb_lock);
 }
 
+static bool tb_cmp(const void *ap, const void *bp)
+{
+    const TranslationBlock *a = ap;
+    const TranslationBlock *b = bp;
+
+    return a->pc == b->pc &&
+        a->cs_base == b->cs_base &&
+        a->flags == b->flags &&
+        (tb_cflags(a) & CF_HASH_MASK) == (tb_cflags(b) & CF_HASH_MASK) &&
+        a->trace_vcpu_dstate == b->trace_vcpu_dstate &&
+        a->page_addr[0] == b->page_addr[0] &&
+        a->page_addr[1] == b->page_addr[1];
+}
+
 static void tb_htable_init(void)
 {
     unsigned int mode = QHT_MODE_AUTO_RESIZE;
 
-    qht_init(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
+    qht_init(&tb_ctx.htable, tb_cmp, CODE_GEN_HTABLE_SIZE, mode);
 }
 
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
diff --git a/include/qemu/qht.h b/include/qemu/qht.h
index 531aa95..dd512bf 100644
--- a/include/qemu/qht.h
+++ b/include/qemu/qht.h
@@ -11,8 +11,11 @@
 #include "qemu/thread.h"
 #include "qemu/qdist.h"
 
+typedef bool (*qht_cmp_func_t)(const void *a, const void *b);
+
 struct qht {
     struct qht_map *map;
+    qht_cmp_func_t cmp;
     QemuMutex lock; /* serializes setters of ht->map */
     unsigned int mode;
 };
@@ -47,10 +50,12 @@ typedef void (*qht_iter_func_t)(struct qht *ht, void *p, uint32_t h, void *up);
 /**
  * qht_init - Initialize a QHT
  * @ht: QHT to be initialized
+ * @cmp: default comparison function. Cannot be NULL.
  * @n_elems: number of entries the hash table should be optimized for.
  * @mode: bitmask with OR'ed QHT_MODE_*
  */
-void qht_init(struct qht *ht, size_t n_elems, unsigned int mode);
+void qht_init(struct qht *ht, qht_cmp_func_t cmp, size_t n_elems,
+              unsigned int mode);
 
 /**
  * qht_destroy - destroy a previously initialized QHT
@@ -78,7 +83,7 @@ void qht_destroy(struct qht *ht);
 bool qht_insert(struct qht *ht, void *p, uint32_t hash);
 
 /**
- * qht_lookup - Look up a pointer in a QHT
+ * qht_lookup_custom - Look up a pointer using a custom comparison function.
  * @ht: QHT to be looked up
  * @func: function to compare existing pointers against @userp
  * @userp: pointer to pass to @func
@@ -94,8 +99,18 @@ bool qht_insert(struct qht *ht, void *p, uint32_t hash);
  * Returns the corresponding pointer when a match is found.
  * Returns NULL otherwise.
  */
-void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
-                 uint32_t hash);
+void *qht_lookup_custom(struct qht *ht, qht_lookup_func_t func,
+                        const void *userp, uint32_t hash);
+
+/**
+ * qht_lookup - Look up a pointer in a QHT
+ * @ht: QHT to be looked up
+ * @userp: pointer to pass to @ht's default comparison function
+ * @hash: hash of the pointer to be looked up
+ *
+ * Calls qht_lookup_custom() using @ht's default comparison function.
+ */
+void *qht_lookup(struct qht *ht, const void *userp, uint32_t hash);
 
 /**
  * qht_remove - remove a pointer from the hash table
diff --git a/tests/qht-bench.c b/tests/qht-bench.c
index 4cabdfd..c94ac25 100644
--- a/tests/qht-bench.c
+++ b/tests/qht-bench.c
@@ -93,10 +93,10 @@ static void usage_complete(int argc, char *argv[])
     exit(-1);
 }
 
-static bool is_equal(const void *obj, const void *userp)
+static bool is_equal(const void *ap, const void *bp)
 {
-    const long *a = obj;
-    const long *b = userp;
+    const long *a = ap;
+    const long *b = bp;
 
     return *a == *b;
 }
@@ -150,7 +150,7 @@ static void do_rw(struct thread_info *info)
 
         p = &keys[info->r & (lookup_range - 1)];
         hash = h(*p);
-        read = qht_lookup(&ht, is_equal, p, hash);
+        read = qht_lookup(&ht, p, hash);
         if (read) {
             stats->rd++;
         } else {
@@ -162,7 +162,7 @@ static void do_rw(struct thread_info *info)
         if (info->write_op) {
             bool written = false;
 
-            if (qht_lookup(&ht, is_equal, p, hash) == NULL) {
+            if (qht_lookup(&ht, p, hash) == NULL) {
                 written = qht_insert(&ht, p, hash);
             }
             if (written) {
@@ -173,7 +173,7 @@ static void do_rw(struct thread_info *info)
         } else {
             bool removed = false;
 
-            if (qht_lookup(&ht, is_equal, p, hash)) {
+            if (qht_lookup(&ht, p, hash)) {
                 removed = qht_remove(&ht, p, hash);
             }
             if (removed) {
@@ -308,7 +308,7 @@ static void htable_init(void)
     }
 
     /* initialize the hash table */
-    qht_init(&ht, qht_n_elems, qht_mode);
+    qht_init(&ht, is_equal, qht_n_elems, qht_mode);
     assert(init_size <= init_range);
 
     pr_params();
diff --git a/tests/test-qht.c b/tests/test-qht.c
index 9b7423a..f8f2886 100644
--- a/tests/test-qht.c
+++ b/tests/test-qht.c
@@ -13,10 +13,10 @@
 static struct qht ht;
 static int32_t arr[N * 2];
 
-static bool is_equal(const void *obj, const void *userp)
+static bool is_equal(const void *ap, const void *bp)
 {
-    const int32_t *a = obj;
-    const int32_t *b = userp;
+    const int32_t *a = ap;
+    const int32_t *b = bp;
 
     return *a == *b;
 }
@@ -60,7 +60,12 @@ static void check(int a, int b, bool expected)
 
         val = i;
         hash = i;
-        p = qht_lookup(&ht, is_equal, &val, hash);
+        /* test both lookup variants; results should be the same */
+        if (i % 2) {
+            p = qht_lookup(&ht, &val, hash);
+        } else {
+            p = qht_lookup_custom(&ht, is_equal, &val, hash);
+        }
         g_assert_true(!!p == expected);
     }
     rcu_read_unlock();
@@ -102,7 +107,7 @@ static void qht_do_test(unsigned int mode, size_t init_entries)
     /* under KVM we might fetch stats from an uninitialized qht */
     check_n(0);
 
-    qht_init(&ht, 0, mode);
+    qht_init(&ht, is_equal, 0, mode);
 
     check_n(0);
     insert(0, N);
diff --git a/util/qht.c b/util/qht.c
index ff4d2e6..dcb3ee1 100644
--- a/util/qht.c
+++ b/util/qht.c
@@ -351,11 +351,14 @@ static struct qht_map *qht_map_create(size_t n_buckets)
     return map;
 }
 
-void qht_init(struct qht *ht, size_t n_elems, unsigned int mode)
+void qht_init(struct qht *ht, qht_cmp_func_t cmp, size_t n_elems,
+              unsigned int mode)
 {
     struct qht_map *map;
     size_t n_buckets = qht_elems_to_buckets(n_elems);
 
+    g_assert(cmp);
+    ht->cmp = cmp;
     ht->mode = mode;
     qemu_mutex_init(&ht->lock);
     map = qht_map_create(n_buckets);
@@ -479,8 +482,8 @@ void *qht_lookup__slowpath(struct qht_bucket *b, qht_lookup_func_t func,
     return ret;
 }
 
-void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
-                 uint32_t hash)
+void *qht_lookup_custom(struct qht *ht, qht_lookup_func_t func,
+                        const void *userp, uint32_t hash)
 {
     struct qht_bucket *b;
     struct qht_map *map;
@@ -502,6 +505,11 @@ void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
     return qht_lookup__slowpath(b, func, userp, hash);
 }
 
+void *qht_lookup(struct qht *ht, const void *userp, uint32_t hash)
+{
+    return qht_lookup_custom(ht, ht->cmp, userp, hash);
+}
+
 /* call with head->lock held */
 static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
                                struct qht_bucket *head, void *p, uint32_t hash,
-- 
2.7.4

* [Qemu-devel] [PATCH 02/16] qht: return existing entry when qht_insert fails
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 01/16] qht: require a default comparison function Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-02-28 19:10   ` Richard Henderson
  2018-03-28 16:33   ` Alex Bennée
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 03/16] tcg: track TBs with per-region BST's Emilio G. Cota
                   ` (13 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

An entry now counts as "existing" if its hash matches and ht->cmp reports
it equal to the pointer being inserted. This is saner than just checking
the pointer value.

Note that we now return NULL on insertion success, or the existing
pointer on failure. We can do this because NULL pointers are not
allowed to be inserted in QHT.
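
For illustration only (not part of the patch): a minimal sketch of the new
return convention on a toy fixed-size table. The toy_* names and int_equal
are hypothetical, and the toy table simply asserts when it is full instead
of resizing as QHT does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef bool (*toy_cmp_func_t)(const void *a, const void *b);

#define TOY_ENTRIES 8

struct toy_ht {
    toy_cmp_func_t cmp;
    void *pointers[TOY_ENTRIES];     /* NULL marks a free slot */
    uint32_t hashes[TOY_ENTRIES];
};

/*
 * Returns NULL on success.
 * Returns the already-present pointer if an equivalent entry exists,
 * i.e. same hash and ht->cmp() reports a match.
 */
void *toy_insert(struct toy_ht *ht, void *p, uint32_t hash)
{
    size_t free_slot = TOY_ENTRIES;

    assert(p);                       /* NULL is reserved to signal success */
    for (size_t i = 0; i < TOY_ENTRIES; i++) {
        if (ht->pointers[i] == NULL) {
            if (free_slot == TOY_ENTRIES) {
                free_slot = i;       /* remember the first free slot */
            }
        } else if (ht->hashes[i] == hash && ht->cmp(ht->pointers[i], p)) {
            return ht->pointers[i];  /* equivalent entry already present */
        }
    }
    assert(free_slot < TOY_ENTRIES); /* toy table: no resizing */
    ht->pointers[free_slot] = p;
    ht->hashes[free_slot] = hash;
    return NULL;
}

bool int_equal(const void *ap, const void *bp)
{
    const int *a = ap;
    const int *b = bp;

    return *a == *b;
}
```

Note that inserting a second, distinct pointer whose contents compare equal
is rejected here, whereas a pure pointer-value check would have accepted it.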

Suggested-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/qht.h |  7 ++++---
 tests/qht-bench.c  |  4 ++--
 tests/test-qht.c   |  5 ++++-
 util/qht.c         | 17 +++++++++--------
 4 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/include/qemu/qht.h b/include/qemu/qht.h
index dd512bf..c320cb6 100644
--- a/include/qemu/qht.h
+++ b/include/qemu/qht.h
@@ -77,10 +77,11 @@ void qht_destroy(struct qht *ht);
  * In case of successful operation, smp_wmb() is implied before the pointer is
  * inserted into the hash table.
  *
- * Returns true on success.
- * Returns false if the @p-@hash pair already exists in the hash table.
+ * On success, returns NULL.
+ * On failure, returns the pointer from an entry that is equivalent (i.e.
+ * ht->cmp matches and the hash is the same) to @p-@hash.
  */
-bool qht_insert(struct qht *ht, void *p, uint32_t hash);
+void *qht_insert(struct qht *ht, void *p, uint32_t hash);
 
 /**
  * qht_lookup_custom - Look up a pointer using a custom comparison function.
diff --git a/tests/qht-bench.c b/tests/qht-bench.c
index c94ac25..2f88400 100644
--- a/tests/qht-bench.c
+++ b/tests/qht-bench.c
@@ -163,7 +163,7 @@ static void do_rw(struct thread_info *info)
             bool written = false;
 
             if (qht_lookup(&ht, p, hash) == NULL) {
-                written = qht_insert(&ht, p, hash);
+                written = !qht_insert(&ht, p, hash);
             }
             if (written) {
                 stats->in++;
@@ -322,7 +322,7 @@ static void htable_init(void)
             r = xorshift64star(r);
             p = &keys[r & (init_range - 1)];
             hash = h(*p);
-            if (qht_insert(&ht, p, hash)) {
+            if (qht_insert(&ht, p, hash) == NULL) {
                 break;
             }
             retries++;
diff --git a/tests/test-qht.c b/tests/test-qht.c
index f8f2886..7164ae4 100644
--- a/tests/test-qht.c
+++ b/tests/test-qht.c
@@ -27,11 +27,14 @@ static void insert(int a, int b)
 
     for (i = a; i < b; i++) {
         uint32_t hash;
+        void *existing;
 
         arr[i] = i;
         hash = i;
 
-        qht_insert(&ht, &arr[i], hash);
+        g_assert_true(!qht_insert(&ht, &arr[i], hash));
+        existing = qht_insert(&ht, &arr[i], hash);
+        g_assert_true(existing == &arr[i]);
     }
 }
 
diff --git a/util/qht.c b/util/qht.c
index dcb3ee1..f9f49a9 100644
--- a/util/qht.c
+++ b/util/qht.c
@@ -511,9 +511,9 @@ void *qht_lookup(struct qht *ht, const void *userp, uint32_t hash)
 }
 
 /* call with head->lock held */
-static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
-                               struct qht_bucket *head, void *p, uint32_t hash,
-                               bool *needs_resize)
+static void *qht_insert__locked(struct qht *ht, struct qht_map *map,
+                                struct qht_bucket *head, void *p, uint32_t hash,
+                                bool *needs_resize)
 {
     struct qht_bucket *b = head;
     struct qht_bucket *prev = NULL;
@@ -523,8 +523,9 @@ static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
     do {
         for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
             if (b->pointers[i]) {
-                if (unlikely(b->pointers[i] == p)) {
-                    return false;
+                if (unlikely(b->hashes[i] == hash &&
+                             ht->cmp(b->pointers[i], p))) {
+                    return b->pointers[i];
                 }
             } else {
                 goto found;
@@ -553,7 +554,7 @@ static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
     atomic_set(&b->hashes[i], hash);
     atomic_set(&b->pointers[i], p);
     seqlock_write_end(&head->sequence);
-    return true;
+    return NULL;
 }
 
 static __attribute__((noinline)) void qht_grow_maybe(struct qht *ht)
@@ -577,12 +578,12 @@ static __attribute__((noinline)) void qht_grow_maybe(struct qht *ht)
     qemu_mutex_unlock(&ht->lock);
 }
 
-bool qht_insert(struct qht *ht, void *p, uint32_t hash)
+void *qht_insert(struct qht *ht, void *p, uint32_t hash)
 {
     struct qht_bucket *b;
     struct qht_map *map;
     bool needs_resize = false;
-    bool ret;
+    void *ret;
 
     /* NULL pointers are not supported */
     qht_debug_assert(p);
-- 
2.7.4

* [Qemu-devel] [PATCH 03/16] tcg: track TBs with per-region BST's
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 01/16] qht: require a default comparison function Emilio G. Cota
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 02/16] qht: return existing entry when qht_insert fails Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-02-28 20:53   ` Richard Henderson
  2018-03-29  9:54   ` Alex Bennée
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 04/16] tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx Emilio G. Cota
                   ` (12 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

This paves the way for enabling scalable parallel generation of TCG code.

Instead of tracking TBs with a single binary search tree (BST), use a
BST for each TCG region, protecting it with a lock. This is as scalable
as it gets, since each TCG thread operates on a separate region.

The core of this change is the introduction of struct tcg_region_tree,
which contains a pointer to a GTree and an associated lock to serialize
accesses to it. We then allocate an array of tcg_region_tree's, adding
the appropriate padding to avoid false sharing based on
qemu_dcache_linesize.
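
For illustration only (not part of the patch): a minimal sketch of the
padded-array layout, using C11 aligned_alloc() in place of qemu_memalign()
and a caller-supplied line size in place of qemu_dcache_linesize. The toy_*
names and the placeholder struct members are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Placeholder for struct tcg_region_tree (lock + tree pointer). */
struct toy_region_tree {
    long lock;
    void *tree;
};

struct toy_trees {
    char *base;      /* start of the padded array */
    size_t stride;   /* bytes between consecutive elements */
    size_t n;
};

size_t toy_round_up(size_t n, size_t d)
{
    return (n + d - 1) / d * d;
}

/* linesize must be a power of two; returns 0 on success, -1 on failure. */
int toy_trees_init(struct toy_trees *t, size_t n, size_t linesize)
{
    /* Pad each element to a whole number of cache lines. */
    t->stride = toy_round_up(sizeof(struct toy_region_tree), linesize);
    t->n = n;
    /* n * stride is a multiple of linesize, as aligned_alloc requires. */
    t->base = aligned_alloc(linesize, n * t->stride);
    return t->base ? 0 : -1;
}

/* Element i lives at base + i * stride, not at base + i * sizeof(elem). */
struct toy_region_tree *toy_tree_at(const struct toy_trees *t, size_t i)
{
    return (struct toy_region_tree *)(t->base + i * t->stride);
}
```

Since every element starts on its own cache line, two threads working on
different elements should not share a line.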

Given a tc_ptr, we first find the corresponding region_tree. The
first and last regions are special-cased, since they might be of
size != region.size; otherwise we just divide the offset
by region.stride. I was worried about this division (several dozen
cycles of latency), but profiling shows that this is not a fast path.
Note that region.stride is not required to be a power of two; it
is only required to be a multiple of the host's page size.
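
For illustration only: a minimal standalone sketch of the index computation
described above, mirroring tc_ptr_to_region_tree() in the patch but taking
the region parameters as plain arguments. The function name and the use of
uintptr_t instead of pointers are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Map an address to a region index.
 * Addresses below start_aligned belong to the first region; addresses past
 * the start of the last region belong to the last region. Everything else
 * is found by dividing the offset by the stride, which need not be a power
 * of two.
 */
size_t toy_region_index(uintptr_t p, uintptr_t start_aligned,
                        size_t stride, size_t n)
{
    size_t offset;

    if (p < start_aligned) {
        return 0;
    }
    offset = p - start_aligned;
    if (offset > stride * (n - 1)) {
        return n - 1;
    }
    return offset / stride;
}
```

For example, with start_aligned = 0x10000, stride = 12288 (three 4 KiB
pages) and n = 4, address 0x10000 + 12288 maps to region 1.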

Note that with this design we can also provide consistent snapshots
of all region trees at once; for instance, tcg_tb_foreach
acquires/releases all region_tree locks before/after iterating over them.
For this reason we now drop tb_lock in dump_exec_info().
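
For illustration only (not part of the patch): a minimal sketch of the
lock-all pattern using POSIX mutexes and a per-region counter in place of a
GTree. The toy_* names are hypothetical. Locks are always taken in index
order, so two concurrent snapshot takers cannot deadlock against each other.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define TOY_REGIONS 4

struct toy_region {
    pthread_mutex_t lock;
    size_t nb_items;             /* stands in for the per-region tree */
};

static struct toy_region toy_regions[TOY_REGIONS];

void toy_regions_init(void)
{
    for (size_t i = 0; i < TOY_REGIONS; i++) {
        pthread_mutex_init(&toy_regions[i].lock, NULL);
        toy_regions[i].nb_items = 0;
    }
}

/* Per-region update: only that region's lock is taken. */
void toy_region_add(size_t i)
{
    pthread_mutex_lock(&toy_regions[i].lock);
    toy_regions[i].nb_items++;
    pthread_mutex_unlock(&toy_regions[i].lock);
}

/* Consistent snapshot: hold every lock while reading all regions. */
size_t toy_count_all(void)
{
    size_t total = 0;

    for (size_t i = 0; i < TOY_REGIONS; i++) {
        pthread_mutex_lock(&toy_regions[i].lock);
    }
    for (size_t i = 0; i < TOY_REGIONS; i++) {
        total += toy_regions[i].nb_items;
    }
    for (size_t i = 0; i < TOY_REGIONS; i++) {
        pthread_mutex_unlock(&toy_regions[i].lock);
    }
    return total;
}
```

The trade-off is that a snapshot briefly stalls every region, which seems
acceptable for infrequent operations such as statistics dumps.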

As an alternative I considered implementing a concurrent BST, but this
can be tricky to get right, offers no consistent snapshots of the BST,
and performance and scalability-wise I don't think it could ever beat
having separate GTrees, given that our workload is insert-mostly (all
concurrent BST designs I've seen focus, understandably, on making
lookups fast, which comes at the expense of convoluted, non-wait-free
insertions/removals).

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cpu-exec.c      |   2 +-
 accel/tcg/translate-all.c | 101 ++++--------------------
 include/exec/exec-all.h   |   1 -
 include/exec/tb-context.h |   1 -
 tcg/tcg.c                 | 191 ++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.h                 |   6 ++
 6 files changed, 213 insertions(+), 89 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index ec57564..8c68727 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -222,7 +222,7 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
 
     tb_lock();
     tb_phys_invalidate(tb, -1);
-    tb_remove(tb);
+    tcg_tb_remove(tb);
     tb_unlock();
 }
 #endif
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 1cf10f8..3a51d49 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -205,8 +205,6 @@ void tb_lock_reset(void)
     }
 }
 
-static TranslationBlock *tb_find_pc(uintptr_t tc_ptr);
-
 void cpu_gen_init(void)
 {
     tcg_context_init(&tcg_init_ctx);
@@ -375,13 +373,13 @@ bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc)
 
     if (check_offset < tcg_init_ctx.code_gen_buffer_size) {
         tb_lock();
-        tb = tb_find_pc(host_pc);
+        tb = tcg_tb_lookup(host_pc);
         if (tb) {
             cpu_restore_state_from_tb(cpu, tb, host_pc);
             if (tb->cflags & CF_NOCACHE) {
                 /* one-shot translation, invalidate it immediately */
                 tb_phys_invalidate(tb, -1);
-                tb_remove(tb);
+                tcg_tb_remove(tb);
             }
             r = true;
         }
@@ -731,48 +729,6 @@ static inline void *alloc_code_gen_buffer(void)
 }
 #endif /* USE_STATIC_CODE_GEN_BUFFER, WIN32, POSIX */
 
-/* compare a pointer @ptr and a tb_tc @s */
-static int ptr_cmp_tb_tc(const void *ptr, const struct tb_tc *s)
-{
-    if (ptr >= s->ptr + s->size) {
-        return 1;
-    } else if (ptr < s->ptr) {
-        return -1;
-    }
-    return 0;
-}
-
-static gint tb_tc_cmp(gconstpointer ap, gconstpointer bp)
-{
-    const struct tb_tc *a = ap;
-    const struct tb_tc *b = bp;
-
-    /*
-     * When both sizes are set, we know this isn't a lookup.
-     * This is the most likely case: every TB must be inserted; lookups
-     * are a lot less frequent.
-     */
-    if (likely(a->size && b->size)) {
-        if (a->ptr > b->ptr) {
-            return 1;
-        } else if (a->ptr < b->ptr) {
-            return -1;
-        }
-        /* a->ptr == b->ptr should happen only on deletions */
-        g_assert(a->size == b->size);
-        return 0;
-    }
-    /*
-     * All lookups have either .size field set to 0.
-     * From the glib sources we see that @ap is always the lookup key. However
-     * the docs provide no guarantee, so we just mark this case as likely.
-     */
-    if (likely(a->size == 0)) {
-        return ptr_cmp_tb_tc(a->ptr, b);
-    }
-    return ptr_cmp_tb_tc(b->ptr, a);
-}
-
 static inline void code_gen_alloc(size_t tb_size)
 {
     tcg_ctx->code_gen_buffer_size = size_code_gen_buffer(tb_size);
@@ -781,7 +737,6 @@ static inline void code_gen_alloc(size_t tb_size)
         fprintf(stderr, "Could not allocate dynamic translator buffer\n");
         exit(1);
     }
-    tb_ctx.tb_tree = g_tree_new(tb_tc_cmp);
     qemu_mutex_init(&tb_ctx.tb_lock);
 }
 
@@ -842,14 +797,6 @@ static TranslationBlock *tb_alloc(target_ulong pc)
     return tb;
 }
 
-/* Called with tb_lock held.  */
-void tb_remove(TranslationBlock *tb)
-{
-    assert_tb_locked();
-
-    g_tree_remove(tb_ctx.tb_tree, &tb->tc);
-}
-
 static inline void invalidate_page_bitmap(PageDesc *p)
 {
 #ifdef CONFIG_SOFTMMU
@@ -914,10 +861,10 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     }
 
     if (DEBUG_TB_FLUSH_GATE) {
-        size_t nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
+        size_t nb_tbs = tcg_nb_tbs();
         size_t host_size = 0;
 
-        g_tree_foreach(tb_ctx.tb_tree, tb_host_size_iter, &host_size);
+        tcg_tb_foreach(tb_host_size_iter, &host_size);
         printf("qemu: flush code_size=%zu nb_tbs=%zu avg_tb_size=%zu\n",
                tcg_code_size(), nb_tbs, nb_tbs > 0 ? host_size / nb_tbs : 0);
     }
@@ -926,10 +873,6 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
         cpu_tb_jmp_cache_clear(cpu);
     }
 
-    /* Increment the refcount first so that destroy acts as a reset */
-    g_tree_ref(tb_ctx.tb_tree);
-    g_tree_destroy(tb_ctx.tb_tree);
-
     qht_reset_size(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
@@ -1409,7 +1352,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
      * through the physical hash table and physical page list.
      */
     tb_link_page(tb, phys_pc, phys_page2);
-    g_tree_insert(tb_ctx.tb_tree, &tb->tc, tb);
+    tcg_tb_insert(tb);
     return tb;
 }
 
@@ -1513,7 +1456,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
                 current_tb = NULL;
                 if (cpu->mem_io_pc) {
                     /* now we have a real cpu fault */
-                    current_tb = tb_find_pc(cpu->mem_io_pc);
+                    current_tb = tcg_tb_lookup(cpu->mem_io_pc);
                 }
             }
             if (current_tb == tb &&
@@ -1629,7 +1572,7 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
     tb = p->first_tb;
 #ifdef TARGET_HAS_PRECISE_SMC
     if (tb && pc != 0) {
-        current_tb = tb_find_pc(pc);
+        current_tb = tcg_tb_lookup(pc);
     }
     if (cpu != NULL) {
         env = cpu->env_ptr;
@@ -1672,18 +1615,6 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
 }
 #endif
 
-/*
- * Find the TB 'tb' such that
- * tb->tc.ptr <= tc_ptr < tb->tc.ptr + tb->tc.size
- * Return NULL if not found.
- */
-static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
-{
-    struct tb_tc s = { .ptr = (void *)tc_ptr };
-
-    return g_tree_lookup(tb_ctx.tb_tree, &s);
-}
-
 #if !defined(CONFIG_USER_ONLY)
 void tb_invalidate_phys_addr(AddressSpace *as, hwaddr addr)
 {
@@ -1711,7 +1642,7 @@ void tb_check_watchpoint(CPUState *cpu)
 {
     TranslationBlock *tb;
 
-    tb = tb_find_pc(cpu->mem_io_pc);
+    tb = tcg_tb_lookup(cpu->mem_io_pc);
     if (tb) {
         /* We can use retranslation to find the PC.  */
         cpu_restore_state_from_tb(cpu, tb, cpu->mem_io_pc);
@@ -1745,7 +1676,7 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
     uint32_t n;
 
     tb_lock();
-    tb = tb_find_pc(retaddr);
+    tb = tcg_tb_lookup(retaddr);
     if (!tb) {
         cpu_abort(cpu, "cpu_io_recompile: could not find TB for pc=%p",
                   (void *)retaddr);
@@ -1789,7 +1720,7 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
              * cpu_exec_nocache() */
             tb_phys_invalidate(tb->orig_tb, -1);
         }
-        tb_remove(tb);
+        tcg_tb_remove(tb);
     }
 
     /* TODO: If env->pc != tb->pc (i.e. the faulting instruction was not
@@ -1860,6 +1791,7 @@ static void print_qht_statistics(FILE *f, fprintf_function cpu_fprintf,
 }
 
 struct tb_tree_stats {
+    size_t nb_tbs;
     size_t host_size;
     size_t target_size;
     size_t max_target_size;
@@ -1873,6 +1805,7 @@ static gboolean tb_tree_stats_iter(gpointer key, gpointer value, gpointer data)
     const TranslationBlock *tb = value;
     struct tb_tree_stats *tst = data;
 
+    tst->nb_tbs++;
     tst->host_size += tb->tc.size;
     tst->target_size += tb->size;
     if (tb->size > tst->max_target_size) {
@@ -1896,10 +1829,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     struct qht_stats hst;
     size_t nb_tbs;
 
-    tb_lock();
-
-    nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
-    g_tree_foreach(tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
+    tcg_tb_foreach(tb_tree_stats_iter, &tst);
+    nb_tbs = tst.nb_tbs;
     /* XXX: avoid using doubles ? */
     cpu_fprintf(f, "Translation buffer state:\n");
     /*
@@ -1934,8 +1865,6 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     cpu_fprintf(f, "TB invalidate count %d\n", tb_ctx.tb_phys_invalidate_count);
     cpu_fprintf(f, "TLB flush count     %zu\n", tlb_flush_count());
     tcg_dump_info(f, cpu_fprintf);
-
-    tb_unlock();
 }
 
 void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf)
@@ -2203,7 +2132,7 @@ int page_unprotect(target_ulong address, uintptr_t pc)
              * set the page to PAGE_WRITE and did the TB invalidate for us.
              */
 #ifdef TARGET_HAS_PRECISE_SMC
-            TranslationBlock *current_tb = tb_find_pc(pc);
+            TranslationBlock *current_tb = tcg_tb_lookup(pc);
             if (current_tb) {
                 current_tb_invalidated = tb_cflags(current_tb) & CF_INVALID;
             }
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index e5afd2e..17e08b3 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -401,7 +401,6 @@ static inline uint32_t curr_cflags(void)
          | (use_icount ? CF_USE_ICOUNT : 0);
 }
 
-void tb_remove(TranslationBlock *tb);
 void tb_flush(CPUState *cpu);
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
 TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
index 1d41202..d8472c8 100644
--- a/include/exec/tb-context.h
+++ b/include/exec/tb-context.h
@@ -31,7 +31,6 @@ typedef struct TBContext TBContext;
 
 struct TBContext {
 
-    GTree *tb_tree;
     struct qht htable;
     /* any access to the tbs or the page table must use this lock */
     QemuMutex tb_lock;
diff --git a/tcg/tcg.c b/tcg/tcg.c
index bb24526..b471708 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -135,6 +135,12 @@ static TCGContext **tcg_ctxs;
 static unsigned int n_tcg_ctxs;
 TCGv_env cpu_env = 0;
 
+struct tcg_region_tree {
+    QemuMutex lock;
+    GTree *tree;
+    /* padding to avoid false sharing is computed at run-time */
+};
+
 /*
  * We divide code_gen_buffer into equally-sized "regions" that TCG threads
  * dynamically allocate from as demand dictates. Given appropriate region
@@ -158,6 +164,13 @@ struct tcg_region_state {
 };
 
 static struct tcg_region_state region;
+/*
+ * This is an array of struct tcg_region_tree's, with padding.
+ * We use void * to simplify the computation of region_trees[i]; each
+ * struct is found every tree_size bytes.
+ */
+static void *region_trees;
+static size_t tree_size;
 static TCGRegSet tcg_target_available_regs[TCG_TYPE_COUNT];
 static TCGRegSet tcg_target_call_clobber_regs;
 
@@ -295,6 +308,180 @@ TCGLabel *gen_new_label(void)
 
 #include "tcg-target.inc.c"
 
+/* compare a pointer @ptr and a tb_tc @s */
+static int ptr_cmp_tb_tc(const void *ptr, const struct tb_tc *s)
+{
+    if (ptr >= s->ptr + s->size) {
+        return 1;
+    } else if (ptr < s->ptr) {
+        return -1;
+    }
+    return 0;
+}
+
+static gint tb_tc_cmp(gconstpointer ap, gconstpointer bp)
+{
+    const struct tb_tc *a = ap;
+    const struct tb_tc *b = bp;
+
+    /*
+     * When both sizes are set, we know this isn't a lookup.
+     * This is the most likely case: every TB must be inserted; lookups
+     * are a lot less frequent.
+     */
+    if (likely(a->size && b->size)) {
+        if (a->ptr > b->ptr) {
+            return 1;
+        } else if (a->ptr < b->ptr) {
+            return -1;
+        }
+        /* a->ptr == b->ptr should happen only on deletions */
+        g_assert(a->size == b->size);
+        return 0;
+    }
+    /*
+     * All lookups have one of the two .size fields set to 0.
+     * From the glib sources we see that @ap is always the lookup key. However
+     * the docs provide no guarantee, so we just mark this case as likely.
+     */
+    if (likely(a->size == 0)) {
+        return ptr_cmp_tb_tc(a->ptr, b);
+    }
+    return ptr_cmp_tb_tc(b->ptr, a);
+}
+
+static void tcg_region_trees_init(void)
+{
+    size_t i;
+
+    tree_size = ROUND_UP(sizeof(struct tcg_region_tree), qemu_dcache_linesize);
+    region_trees = qemu_memalign(qemu_dcache_linesize, region.n * tree_size);
+    for (i = 0; i < region.n; i++) {
+        struct tcg_region_tree *rt = region_trees + i * tree_size;
+
+        qemu_mutex_init(&rt->lock);
+        rt->tree = g_tree_new(tb_tc_cmp);
+    }
+}
+
+static struct tcg_region_tree *tc_ptr_to_region_tree(void *p)
+{
+    size_t region_idx;
+
+    if (p < region.start_aligned) {
+        region_idx = 0;
+    } else {
+        ptrdiff_t offset = p - region.start_aligned;
+
+        if (offset > region.stride * (region.n - 1)) {
+            region_idx = region.n - 1;
+        } else {
+            region_idx = offset / region.stride;
+        }
+    }
+    return region_trees + region_idx * tree_size;
+}
+
+void tcg_tb_insert(TranslationBlock *tb)
+{
+    struct tcg_region_tree *rt = tc_ptr_to_region_tree(tb->tc.ptr);
+
+    qemu_mutex_lock(&rt->lock);
+    g_tree_insert(rt->tree, &tb->tc, tb);
+    qemu_mutex_unlock(&rt->lock);
+}
+
+void tcg_tb_remove(TranslationBlock *tb)
+{
+    struct tcg_region_tree *rt = tc_ptr_to_region_tree(tb->tc.ptr);
+
+    qemu_mutex_lock(&rt->lock);
+    g_tree_remove(rt->tree, &tb->tc);
+    qemu_mutex_unlock(&rt->lock);
+}
+
+/*
+ * Find the TB 'tb' such that
+ * tb->tc.ptr <= tc_ptr < tb->tc.ptr + tb->tc.size
+ * Return NULL if not found.
+ */
+TranslationBlock *tcg_tb_lookup(uintptr_t tc_ptr)
+{
+    struct tcg_region_tree *rt = tc_ptr_to_region_tree((void *)tc_ptr);
+    TranslationBlock *tb;
+    struct tb_tc s = { .ptr = (void *)tc_ptr };
+
+    qemu_mutex_lock(&rt->lock);
+    tb = g_tree_lookup(rt->tree, &s);
+    qemu_mutex_unlock(&rt->lock);
+    return tb;
+}
+
+static void tcg_region_tree_lock_all(void)
+{
+    size_t i;
+
+    for (i = 0; i < region.n; i++) {
+        struct tcg_region_tree *rt = region_trees + i * tree_size;
+
+        qemu_mutex_lock(&rt->lock);
+    }
+}
+
+static void tcg_region_tree_unlock_all(void)
+{
+    size_t i;
+
+    for (i = 0; i < region.n; i++) {
+        struct tcg_region_tree *rt = region_trees + i * tree_size;
+
+        qemu_mutex_unlock(&rt->lock);
+    }
+}
+
+void tcg_tb_foreach(GTraverseFunc func, gpointer user_data)
+{
+    size_t i;
+
+    tcg_region_tree_lock_all();
+    for (i = 0; i < region.n; i++) {
+        struct tcg_region_tree *rt = region_trees + i * tree_size;
+
+        g_tree_foreach(rt->tree, func, user_data);
+    }
+    tcg_region_tree_unlock_all();
+}
+
+size_t tcg_nb_tbs(void)
+{
+    size_t nb_tbs = 0;
+    size_t i;
+
+    tcg_region_tree_lock_all();
+    for (i = 0; i < region.n; i++) {
+        struct tcg_region_tree *rt = region_trees + i * tree_size;
+
+        nb_tbs += g_tree_nnodes(rt->tree);
+    }
+    tcg_region_tree_unlock_all();
+    return nb_tbs;
+}
+
+static void tcg_region_tree_reset_all(void)
+{
+    size_t i;
+
+    tcg_region_tree_lock_all();
+    for (i = 0; i < region.n; i++) {
+        struct tcg_region_tree *rt = region_trees + i * tree_size;
+
+        /* Increment the refcount first so that destroy acts as a reset */
+        g_tree_ref(rt->tree);
+        g_tree_destroy(rt->tree);
+    }
+    tcg_region_tree_unlock_all();
+}
+
 static void tcg_region_bounds(size_t curr_region, void **pstart, void **pend)
 {
     void *start, *end;
@@ -380,6 +567,8 @@ void tcg_region_reset_all(void)
         g_assert(!err);
     }
     qemu_mutex_unlock(&region.lock);
+
+    tcg_region_tree_reset_all();
 }
 
 #ifdef CONFIG_USER_ONLY
@@ -496,6 +685,8 @@ void tcg_region_init(void)
         g_assert(!rc);
     }
 
+    tcg_region_trees_init();
+
     /* In user-mode we support only one ctx, so do the initial allocation now */
 #ifdef CONFIG_USER_ONLY
     {
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 9e2d909..8bf29cc 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -850,6 +850,12 @@ void tcg_region_reset_all(void);
 size_t tcg_code_size(void);
 size_t tcg_code_capacity(void);
 
+void tcg_tb_insert(TranslationBlock *tb);
+void tcg_tb_remove(TranslationBlock *tb);
+TranslationBlock *tcg_tb_lookup(uintptr_t tc_ptr);
+void tcg_tb_foreach(GTraverseFunc func, gpointer user_data);
+size_t tcg_nb_tbs(void);
+
 /* user-mode: Called with tb_lock held.  */
 static inline void *tcg_malloc(int size)
 {
-- 
2.7.4

* [Qemu-devel] [PATCH 04/16] tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
                   ` (2 preceding siblings ...)
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 03/16] tcg: track TBs with per-region BST's Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-02-28 20:55   ` Richard Henderson
  2018-03-29 10:06   ` Alex Bennée
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 05/16] translate-all: iterate over TBs in a page with PAGE_FOR_EACH_TB Emilio G. Cota
                   ` (11 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

Move the counter into tcg_ctx, thereby making it per-TCGContext. Once we remove tb_lock, this will
avoid an atomic increment every time a TB is invalidated.
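
The idea can be sketched outside QEMU as follows. This is only an
illustration using C11 atomics rather than QEMU's atomic_set and
atomic_read helpers; N_CTXS, struct ctx, ctx_count_inc and
ctx_count_total are hypothetical names. Each context's counter is
written only by its owning thread, so an increment is a plain load
plus a store instead of an atomic read-modify-write, and readers sum
the per-context values to get a (possibly slightly stale) total.

```c
#include <stdatomic.h>
#include <stddef.h>

#define N_CTXS 4   /* hypothetical number of contexts */

struct ctx {
    /* Written only by the thread that owns this context. */
    _Atomic size_t invalidate_count;
};

static struct ctx ctxs[N_CTXS];

/* Owner-only update: relaxed load + relaxed store, no atomic RMW. */
static void ctx_count_inc(struct ctx *c)
{
    size_t v = atomic_load_explicit(&c->invalidate_count,
                                    memory_order_relaxed);

    atomic_store_explicit(&c->invalidate_count, v + 1,
                          memory_order_relaxed);
}

/* Any thread may call this; the result is a snapshot, not a barrier. */
static size_t ctx_count_total(void)
{
    size_t total = 0;
    size_t i;

    for (i = 0; i < N_CTXS; i++) {
        total += atomic_load_explicit(&ctxs[i].invalidate_count,
                                      memory_order_relaxed);
    }
    return total;
}
```

The trade-off is that the reader does O(number of contexts) work, which
is acceptable for a statistic that is only read when dumping info.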

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c |  5 +++--
 include/exec/tb-context.h |  1 -
 tcg/tcg.c                 | 14 ++++++++++++++
 tcg/tcg.h                 |  3 +++
 4 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 3a51d49..20ad3fc 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1072,7 +1072,8 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     /* suppress any remaining jumps to this TB */
     tb_jmp_unlink(tb);
 
-    tb_ctx.tb_phys_invalidate_count++;
+    atomic_set(&tcg_ctx->tb_phys_invalidate_count,
+               tcg_ctx->tb_phys_invalidate_count + 1);
 }
 
 #ifdef CONFIG_SOFTMMU
@@ -1862,7 +1863,7 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     cpu_fprintf(f, "\nStatistics:\n");
     cpu_fprintf(f, "TB flush count      %u\n",
                 atomic_read(&tb_ctx.tb_flush_count));
-    cpu_fprintf(f, "TB invalidate count %d\n", tb_ctx.tb_phys_invalidate_count);
+    cpu_fprintf(f, "TB invalidate count %zu\n", tcg_tb_phys_invalidate_count());
     cpu_fprintf(f, "TLB flush count     %zu\n", tlb_flush_count());
     tcg_dump_info(f, cpu_fprintf);
 }
diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
index d8472c8..8c9b49c 100644
--- a/include/exec/tb-context.h
+++ b/include/exec/tb-context.h
@@ -37,7 +37,6 @@ struct TBContext {
 
     /* statistics */
     unsigned tb_flush_count;
-    int tb_phys_invalidate_count;
 };
 
 extern TBContext tb_ctx;
diff --git a/tcg/tcg.c b/tcg/tcg.c
index b471708..a7b596e 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -791,6 +791,20 @@ size_t tcg_code_capacity(void)
     return capacity;
 }
 
+size_t tcg_tb_phys_invalidate_count(void)
+{
+    unsigned int n_ctxs = atomic_read(&n_tcg_ctxs);
+    unsigned int i;
+    size_t total = 0;
+
+    for (i = 0; i < n_ctxs; i++) {
+        const TCGContext *s = atomic_read(&tcg_ctxs[i]);
+
+        total += atomic_read(&s->tb_phys_invalidate_count);
+    }
+    return total;
+}
+
 /* pool based memory allocation */
 void *tcg_malloc_internal(TCGContext *s, int size)
 {
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 8bf29cc..9dd9448 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -694,6 +694,8 @@ struct TCGContext {
     /* Threshold to flush the translated code buffer.  */
     void *code_gen_highwater;
 
+    size_t tb_phys_invalidate_count;
+
     /* Track which vCPU triggers events */
     CPUState *cpu;                      /* *_trans */
 
@@ -852,6 +854,7 @@ size_t tcg_code_capacity(void);
 
 void tcg_tb_insert(TranslationBlock *tb);
 void tcg_tb_remove(TranslationBlock *tb);
+size_t tcg_tb_phys_invalidate_count(void);
 TranslationBlock *tcg_tb_lookup(uintptr_t tc_ptr);
 void tcg_tb_foreach(GTraverseFunc func, gpointer user_data);
 size_t tcg_nb_tbs(void);
-- 
2.7.4

* [Qemu-devel] [PATCH 05/16] translate-all: iterate over TBs in a page with PAGE_FOR_EACH_TB
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
                   ` (3 preceding siblings ...)
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 04/16] tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-02-28 21:40   ` Richard Henderson
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 06/16] translate-all: make l1_map lockless Emilio G. Cota
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

This commit does several things, but to avoid churn I merged them all
into the same commit. To wit:

- Use uintptr_t instead of TranslationBlock * for the list of TBs in a page.
  The rationale is the same as in c37e6d7e ("tcg: Use uintptr_t type for
  jmp_list_{next|first} fields of TB"): these are tagged pointers, not
  pointers, so use a more appropriate type.

- Only check the least significant bit of the tagged pointers. Masking
  with 3/~3 is unnecessary and confusing.

- Introduce the TB_FOR_EACH_TAGGED macro, and use it to define
  PAGE_FOR_EACH_TB, which improves readability.

- Update tb_page_remove to use PAGE_FOR_EACH_TB. In case there
  is a bug and we attempt to remove a TB that is not in the list, instead
  of segfaulting (since the list is NULL-terminated) we will reach
  g_assert_not_reached().
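
For readers unfamiliar with the tagged-pointer lists mentioned above,
here is a minimal standalone sketch. It is not the macro from this
patch: struct node, untag, NODE_FOR_EACH, list_len and list_remove are
hypothetical names, and the untagging is done in a helper function
rather than with comma expressions. The low bit of each link says
which next[] slot of the pointed-to node continues the list.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
    int id;
    uintptr_t next[2];   /* tagged links; 0 terminates the list */
};

/* Split a tagged link into the node pointer and the slot index. */
static struct node *untag(uintptr_t v, unsigned *n)
{
    *n = v & 1;
    return (struct node *)(v & ~(uintptr_t)1);
}

#define NODE_FOR_EACH(head, nd, n) \
    for (nd = untag((head), &n); nd; nd = untag(nd->next[n], &n))

static size_t list_len(uintptr_t head)
{
    struct node *nd;
    unsigned n;
    size_t len = 0;

    NODE_FOR_EACH(head, nd, n) {
        len++;
    }
    return len;
}

/* Unlink victim; returns 1 if found, 0 otherwise (the patch asserts). */
static int list_remove(uintptr_t *head, struct node *victim)
{
    uintptr_t *pprev = head;
    struct node *nd;
    unsigned n;

    NODE_FOR_EACH(*head, nd, n) {
        if (nd == victim) {
            *pprev = nd->next[n];
            return 1;
        }
        pprev = &nd->next[n];
    }
    return 0;
}
```

Because the iterator stops at a zero link, a lookup for an element that
is not in the list falls out of the loop instead of dereferencing NULL.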

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 65 +++++++++++++++++++++++------------------------
 include/exec/exec-all.h   |  2 +-
 2 files changed, 33 insertions(+), 34 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 20ad3fc..06aa905 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -103,7 +103,7 @@
 
 typedef struct PageDesc {
     /* list of TBs intersecting this ram page */
-    TranslationBlock *first_tb;
+    uintptr_t first_tb;
 #ifdef CONFIG_SOFTMMU
     /* in order to optimize self modifying code, we count the number
        of lookups we do to a given page to use a bitmap */
@@ -114,6 +114,18 @@ typedef struct PageDesc {
 #endif
 } PageDesc;
 
+/* list iterators for lists of tagged pointers in TranslationBlock */
+#define TB_FOR_EACH_TAGGED(head, tb, n, field)                  \
+    for (n = (head) & 1,                                        \
+             tb = (TranslationBlock *)((head) & ~1);            \
+         tb;                                                    \
+         tb = (TranslationBlock *)tb->field[n],                 \
+             n = (uintptr_t)tb & 1,                             \
+             tb = (TranslationBlock *)((uintptr_t)tb & ~1))
+
+#define PAGE_FOR_EACH_TB(pagedesc, tb, n)                       \
+    TB_FOR_EACH_TAGGED((pagedesc)->first_tb, tb, n, page_next)
+
 /* In system mode we want L1_MAP to be based on ram offsets,
    while in user mode we want it to be based on virtual addresses.  */
 #if !defined(CONFIG_USER_ONLY)
@@ -818,7 +830,7 @@ static void page_flush_tb_1(int level, void **lp)
         PageDesc *pd = *lp;
 
         for (i = 0; i < V_L2_SIZE; ++i) {
-            pd[i].first_tb = NULL;
+            pd[i].first_tb = (uintptr_t)NULL;
             invalidate_page_bitmap(pd + i);
         }
     } else {
@@ -946,21 +958,21 @@ static void tb_page_check(void)
 
 #endif /* CONFIG_USER_ONLY */
 
-static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
+static inline void tb_page_remove(PageDesc *pd, TranslationBlock *tb)
 {
     TranslationBlock *tb1;
+    uintptr_t *pprev;
     unsigned int n1;
 
-    for (;;) {
-        tb1 = *ptb;
-        n1 = (uintptr_t)tb1 & 3;
-        tb1 = (TranslationBlock *)((uintptr_t)tb1 & ~3);
+    pprev = &pd->first_tb;
+    PAGE_FOR_EACH_TB(pd, tb1, n1) {
         if (tb1 == tb) {
-            *ptb = tb1->page_next[n1];
-            break;
+            *pprev = tb1->page_next[n1];
+            return;
         }
-        ptb = &tb1->page_next[n1];
+        pprev = &tb1->page_next[n1];
     }
+    g_assert_not_reached();
 }
 
 /* remove the TB from a list of TBs jumping to the n-th jump target of the TB */
@@ -1048,12 +1060,12 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     /* remove the TB from the page list */
     if (tb->page_addr[0] != page_addr) {
         p = page_find(tb->page_addr[0] >> TARGET_PAGE_BITS);
-        tb_page_remove(&p->first_tb, tb);
+        tb_page_remove(p, tb);
         invalidate_page_bitmap(p);
     }
     if (tb->page_addr[1] != -1 && tb->page_addr[1] != page_addr) {
         p = page_find(tb->page_addr[1] >> TARGET_PAGE_BITS);
-        tb_page_remove(&p->first_tb, tb);
+        tb_page_remove(p, tb);
         invalidate_page_bitmap(p);
     }
 
@@ -1084,10 +1096,7 @@ static void build_page_bitmap(PageDesc *p)
 
     p->code_bitmap = bitmap_new(TARGET_PAGE_SIZE);
 
-    tb = p->first_tb;
-    while (tb != NULL) {
-        n = (uintptr_t)tb & 3;
-        tb = (TranslationBlock *)((uintptr_t)tb & ~3);
+    PAGE_FOR_EACH_TB(p, tb, n) {
         /* NOTE: this is subtle as a TB may span two physical pages */
         if (n == 0) {
             /* NOTE: tb_end may be after the end of the page, but
@@ -1102,7 +1111,6 @@ static void build_page_bitmap(PageDesc *p)
             tb_end = ((tb->pc + tb->size) & ~TARGET_PAGE_MASK);
         }
         bitmap_set(p->code_bitmap, tb_start, tb_end - tb_start);
-        tb = tb->page_next[n];
     }
 }
 #endif
@@ -1125,9 +1133,9 @@ static inline void tb_alloc_page(TranslationBlock *tb,
     p = page_find_alloc(page_addr >> TARGET_PAGE_BITS, 1);
     tb->page_next[n] = p->first_tb;
 #ifndef CONFIG_USER_ONLY
-    page_already_protected = p->first_tb != NULL;
+    page_already_protected = p->first_tb != (uintptr_t)NULL;
 #endif
-    p->first_tb = (TranslationBlock *)((uintptr_t)tb | n);
+    p->first_tb = (uintptr_t)tb | n;
     invalidate_page_bitmap(p);
 
 #if defined(CONFIG_USER_ONLY)
@@ -1404,7 +1412,7 @@ void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
 void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
                                    int is_cpu_write_access)
 {
-    TranslationBlock *tb, *tb_next;
+    TranslationBlock *tb;
     tb_page_addr_t tb_start, tb_end;
     PageDesc *p;
     int n;
@@ -1435,11 +1443,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
     /* we remove all the TBs in the range [start, end[ */
     /* XXX: see if in some cases it could be faster to invalidate all
        the code */
-    tb = p->first_tb;
-    while (tb != NULL) {
-        n = (uintptr_t)tb & 3;
-        tb = (TranslationBlock *)((uintptr_t)tb & ~3);
-        tb_next = tb->page_next[n];
+    PAGE_FOR_EACH_TB(p, tb, n) {
         /* NOTE: this is subtle as a TB may span two physical pages */
         if (n == 0) {
             /* NOTE: tb_end may be after the end of the page, but
@@ -1476,7 +1480,6 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
 #endif /* TARGET_HAS_PRECISE_SMC */
             tb_phys_invalidate(tb, -1);
         }
-        tb = tb_next;
     }
 #if !defined(CONFIG_USER_ONLY)
     /* if no code remaining, no need to continue to use slow writes */
@@ -1570,18 +1573,15 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
     }
 
     tb_lock();
-    tb = p->first_tb;
 #ifdef TARGET_HAS_PRECISE_SMC
-    if (tb && pc != 0) {
+    if (p->first_tb && pc != 0) {
         current_tb = tcg_tb_lookup(pc);
     }
     if (cpu != NULL) {
         env = cpu->env_ptr;
     }
 #endif
-    while (tb != NULL) {
-        n = (uintptr_t)tb & 3;
-        tb = (TranslationBlock *)((uintptr_t)tb & ~3);
+    PAGE_FOR_EACH_TB(p, tb, n) {
 #ifdef TARGET_HAS_PRECISE_SMC
         if (current_tb == tb &&
             (current_tb->cflags & CF_COUNT_MASK) != 1) {
@@ -1598,9 +1598,8 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
         }
 #endif /* TARGET_HAS_PRECISE_SMC */
         tb_phys_invalidate(tb, addr);
-        tb = tb->page_next[n];
     }
-    p->first_tb = NULL;
+    p->first_tb = (uintptr_t)NULL;
 #ifdef TARGET_HAS_PRECISE_SMC
     if (current_tb_modified) {
         /* Force execution of one insn next time.  */
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 17e08b3..5f7e65a 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -356,7 +356,7 @@ struct TranslationBlock {
     struct TranslationBlock *orig_tb;
     /* first and second physical page containing code. The lower bit
        of the pointer tells the index in page_next[] */
-    struct TranslationBlock *page_next[2];
+    uintptr_t page_next[2];
     tb_page_addr_t page_addr[2];
 
     /* The following data are used to directly call another TB from
-- 
2.7.4

* [Qemu-devel] [PATCH 06/16] translate-all: make l1_map lockless
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
                   ` (4 preceding siblings ...)
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 05/16] translate-all: iterate over TBs in a page with PAGE_FOR_EACH_TB Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-02-28 22:15   ` Richard Henderson
  2018-03-29 10:16   ` Alex Bennée
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 07/16] translate-all: remove hole in PageDesc Emilio G. Cota
                   ` (9 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

Groundwork for supporting parallel TCG generation.

We never remove entries from the radix tree, so we can use cmpxchg
to implement lockless insertions.
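
As a rough standalone illustration of the pattern (using C11 atomics
instead of QEMU's atomic_cmpxchg and g_new0; LEVEL_SIZE and
slot_get_or_alloc are hypothetical names): allocate a new level
speculatively, try to publish it with a compare-and-swap against NULL,
and if another thread won the race, free our copy and use theirs.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define LEVEL_SIZE 16   /* hypothetical fan-out of one tree level */

/*
 * Return the table published in *slot, allocating and publishing
 * a zeroed one if the slot is still empty. Returns NULL on OOM.
 */
static void **slot_get_or_alloc(_Atomic(void *) *slot)
{
    void *p = atomic_load(slot);

    if (p == NULL) {
        void *expected = NULL;
        void *fresh = calloc(LEVEL_SIZE, sizeof(void *));

        if (fresh == NULL) {
            return NULL;
        }
        if (atomic_compare_exchange_strong(slot, &expected, fresh)) {
            p = fresh;          /* we published our table */
        } else {
            free(fresh);        /* lost the race: discard ours */
            p = expected;       /* and use the winner's table */
        }
    }
    return p;
}
```

This only works because entries are never removed: once a slot is
non-NULL it stays valid, so no reader can observe a freed table.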

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c       | 24 ++++++++++++++----------
 docs/devel/multi-thread-tcg.txt |  4 ++--
 2 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 06aa905..f2bfa71 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -472,20 +472,12 @@ static void page_init(void)
 #endif
 }
 
-/* If alloc=1:
- * Called with tb_lock held for system emulation.
- * Called with mmap_lock held for user-mode emulation.
- */
 static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
 {
     PageDesc *pd;
     void **lp;
     int i;
 
-    if (alloc) {
-        assert_memory_lock();
-    }
-
     /* Level 1.  Always allocated.  */
     lp = l1_map + ((index >> v_l1_shift) & (v_l1_size - 1));
 
@@ -494,11 +486,17 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
         void **p = atomic_rcu_read(lp);
 
         if (p == NULL) {
+            void *existing;
+
             if (!alloc) {
                 return NULL;
             }
             p = g_new0(void *, V_L2_SIZE);
-            atomic_rcu_set(lp, p);
+            existing = atomic_cmpxchg(lp, NULL, p);
+            if (unlikely(existing)) {
+                g_free(p);
+                p = existing;
+            }
         }
 
         lp = p + ((index >> (i * V_L2_BITS)) & (V_L2_SIZE - 1));
@@ -506,11 +504,17 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
 
     pd = atomic_rcu_read(lp);
     if (pd == NULL) {
+        void *existing;
+
         if (!alloc) {
             return NULL;
         }
         pd = g_new0(PageDesc, V_L2_SIZE);
-        atomic_rcu_set(lp, pd);
+        existing = atomic_cmpxchg(lp, NULL, pd);
+        if (unlikely(existing)) {
+            g_free(pd);
+            pd = existing;
+        }
     }
 
     return pd + (index & (V_L2_SIZE - 1));
diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt
index a99b456..faf8918 100644
--- a/docs/devel/multi-thread-tcg.txt
+++ b/docs/devel/multi-thread-tcg.txt
@@ -134,8 +134,8 @@ tb_set_jmp_target() code. Modification to the linked lists that allow
 searching for linked pages are done under the protect of the
 tb_lock().
 
-The global page table is protected by the tb_lock() in system-mode and
-mmap_lock() in linux-user mode.
+The global page table is a lockless radix tree; cmpxchg is used
+to atomically insert new elements.
 
 The lookup caches are updated atomically and the lookup hash uses QHT
 which is designed for concurrent safe lookup.
-- 
2.7.4

* [Qemu-devel] [PATCH 07/16] translate-all: remove hole in PageDesc
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
                   ` (5 preceding siblings ...)
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 06/16] translate-all: make l1_map lockless Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-02-28 22:17   ` Richard Henderson
  2018-03-29 10:17   ` Alex Bennée
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 08/16] translate-all: work page-by-page in tb_invalidate_phys_range_1 Emilio G. Cota
                   ` (8 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

Groundwork for supporting parallel TCG generation.

Move the hole to the end of the struct, so that a u32
field can be added there without bloating the struct.
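
A small sketch of the layout effect, assuming a typical LP64 ABI
(8-byte pointers, 4-byte unsigned int). The struct and field names
below are hypothetical stand-ins, and 'extra' plays the role of the
u32 field to be added later.

```c
#include <stdint.h>

/* Old order: on LP64 a 4-byte hole follows write_count. */
struct desc_hole_in_middle {
    uintptr_t first;
    unsigned int write_count;
    unsigned long *bitmap;
    uint32_t extra;          /* hypothetical new field */
};

/* New order: the padding is at the tail, where extra fits. */
struct desc_hole_at_end {
    uintptr_t first;
    unsigned long *bitmap;
    unsigned int write_count;
    uint32_t extra;          /* hypothetical new field */
};

_Static_assert(sizeof(struct desc_hole_at_end) <=
               sizeof(struct desc_hole_in_middle),
               "tail padding layout should not be the larger one");
```

On such an ABI the first struct is expected to take 32 bytes and the
second 24; on ABIs with 4-byte pointers the two are the same size.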

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index f2bfa71..816419a 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -107,8 +107,8 @@ typedef struct PageDesc {
 #ifdef CONFIG_SOFTMMU
     /* in order to optimize self modifying code, we count the number
        of lookups we do to a given page to use a bitmap */
-    unsigned int code_write_count;
     unsigned long *code_bitmap;
+    unsigned int code_write_count;
 #else
     unsigned long flags;
 #endif
-- 
2.7.4

* [Qemu-devel] [PATCH 08/16] translate-all: work page-by-page in tb_invalidate_phys_range_1
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
                   ` (6 preceding siblings ...)
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 07/16] translate-all: remove hole in PageDesc Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-02-28 22:23   ` Richard Henderson
                     ` (2 more replies)
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 09/16] translate-all: move tb_invalidate_phys_page_range up in the file Emilio G. Cota
                   ` (7 subsequent siblings)
  15 siblings, 3 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

So that we pass a same-page range to tb_invalidate_phys_page_range,
instead of always passing an end address that could be on a different
page.

As discussed with Peter Maydell on the list [1], tb_invalidate_phys_page_range
doesn't actually do much with 'end', which explains why we have never
hit a bug despite going against what the comment on top of
tb_invalidate_phys_page_range requires:

> * Invalidate all TBs which intersect with the target physical address range
> * [start;end[. NOTE: start and end must refer to the *same* physical page.

The appended patch honours the comment, which avoids confusion.

While at it, rework the loop into a for loop, which is less error prone
(e.g. "continue" won't result in an infinite loop).
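
The splitting logic can be tried in isolation with the sketch below.
It uses the same loop shape as the patch but records the sub-ranges
instead of invalidating anything; PAGE_BITS, struct range and
split_by_page are hypothetical names, and a 4 KiB page is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS 12
#define PAGE_SIZE ((uint64_t)1 << PAGE_BITS)
#define PAGE_MASK (~(PAGE_SIZE - 1))

struct range {
    uint64_t start;
    uint64_t end;    /* exclusive */
};

/*
 * Split [start, end[ into sub-ranges that never cross a page boundary.
 * Stores at most max sub-ranges in out and returns how many it stored.
 */
static size_t split_by_page(uint64_t start, uint64_t end,
                            struct range *out, size_t max)
{
    uint64_t next;
    size_t n = 0;

    for (next = (start & PAGE_MASK) + PAGE_SIZE;
         start < end && n < max;
         start = next, next += PAGE_SIZE) {
        uint64_t bound = next < end ? next : end;

        out[n].start = start;
        out[n].end = bound;
        n++;
    }
    return n;
}
```

For example, the range [0x1800, 0x3100[ is split into [0x1800, 0x2000[,
[0x2000, 0x3000[ and [0x3000, 0x3100[.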

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg09165.html

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 816419a..a98e182 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1381,10 +1381,14 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
  */
 static void tb_invalidate_phys_range_1(tb_page_addr_t start, tb_page_addr_t end)
 {
-    while (start < end) {
-        tb_invalidate_phys_page_range(start, end, 0);
-        start &= TARGET_PAGE_MASK;
-        start += TARGET_PAGE_SIZE;
+    tb_page_addr_t next;
+
+    for (next = (start & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
+         start < end;
+         start = next, next += TARGET_PAGE_SIZE) {
+        tb_page_addr_t bound = MIN(next, end);
+
+        tb_invalidate_phys_page_range(start, bound, 0);
     }
 }
 
-- 
2.7.4

* [Qemu-devel] [PATCH 09/16] translate-all: move tb_invalidate_phys_page_range up in the file
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
                   ` (7 preceding siblings ...)
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 08/16] translate-all: work page-by-page in tb_invalidate_phys_range_1 Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-02-28 22:24   ` Richard Henderson
  2018-03-29 10:08   ` Alex Bennée
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 10/16] translate-all: use per-page locking in !user-mode Emilio G. Cota
                   ` (6 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

This greatly simplifies the next commit's diff.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 77 ++++++++++++++++++++++++-----------------------
 1 file changed, 39 insertions(+), 38 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index a98e182..4cb03f1 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1371,44 +1371,6 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 
 /*
  * Invalidate all TBs which intersect with the target physical address range
- * [start;end[. NOTE: start and end may refer to *different* physical pages.
- * 'is_cpu_write_access' should be true if called from a real cpu write
- * access: the virtual CPU will exit the current TB if code is modified inside
- * this TB.
- *
- * Called with mmap_lock held for user-mode emulation, grabs tb_lock
- * Called with tb_lock held for system-mode emulation
- */
-static void tb_invalidate_phys_range_1(tb_page_addr_t start, tb_page_addr_t end)
-{
-    tb_page_addr_t next;
-
-    for (next = (start & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
-         start < end;
-         start = next, next += TARGET_PAGE_SIZE) {
-        tb_page_addr_t bound = MIN(next, end);
-
-        tb_invalidate_phys_page_range(start, bound, 0);
-    }
-}
-
-#ifdef CONFIG_SOFTMMU
-void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
-{
-    assert_tb_locked();
-    tb_invalidate_phys_range_1(start, end);
-}
-#else
-void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
-{
-    assert_memory_lock();
-    tb_lock();
-    tb_invalidate_phys_range_1(start, end);
-    tb_unlock();
-}
-#endif
-/*
- * Invalidate all TBs which intersect with the target physical address range
  * [start;end[. NOTE: start and end must refer to the *same* physical page.
  * 'is_cpu_write_access' should be true if called from a real cpu write
  * access: the virtual CPU will exit the current TB if code is modified inside
@@ -1505,6 +1467,45 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
 #endif
 }
 
+/*
+ * Invalidate all TBs which intersect with the target physical address range
+ * [start;end[. NOTE: start and end may refer to *different* physical pages.
+ * 'is_cpu_write_access' should be true if called from a real cpu write
+ * access: the virtual CPU will exit the current TB if code is modified inside
+ * this TB.
+ *
+ * Called with mmap_lock held for user-mode emulation, grabs tb_lock
+ * Called with tb_lock held for system-mode emulation
+ */
+static void tb_invalidate_phys_range_1(tb_page_addr_t start, tb_page_addr_t end)
+{
+    tb_page_addr_t next;
+
+    for (next = (start & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
+         start < end;
+         start = next, next += TARGET_PAGE_SIZE) {
+        tb_page_addr_t bound = MIN(next, end);
+
+        tb_invalidate_phys_page_range(start, bound, 0);
+    }
+}
+
+#ifdef CONFIG_SOFTMMU
+void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
+{
+    assert_tb_locked();
+    tb_invalidate_phys_range_1(start, end);
+}
+#else
+void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
+{
+    assert_memory_lock();
+    tb_lock();
+    tb_invalidate_phys_range_1(start, end);
+    tb_unlock();
+}
+#endif
+
 #ifdef CONFIG_SOFTMMU
 /* len must be <= 8 and start must be a multiple of len.
  * Called via softmmu_template.h when code areas are written to with
-- 
2.7.4

* [Qemu-devel] [PATCH 10/16] translate-all: use per-page locking in !user-mode
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
                   ` (8 preceding siblings ...)
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 09/16] translate-all: move tb_invalidate_phys_page_range up in the file Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-03-29 14:55   ` Alex Bennée
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 11/16] translate-all: add page_collection assertions Emilio G. Cota
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

Groundwork for supporting parallel TCG generation.

Instead of using a global lock (tb_lock) to protect changes
to pages, use fine-grained, per-page locks in !user-mode.
User-mode stays with mmap_lock.

Sometimes changes need to happen atomically on more than one
page (e.g. when a TB that spans across two pages is
added/invalidated, or when a range of pages is invalidated).
We therefore introduce struct page_collection, which helps
us keep track of a set of pages that have been locked in
the appropriate locking order (i.e. by ascending page index).
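
The ordering rule for the two-page case can be sketched in isolation
as below. This is only a model: it uses a C11 atomic flag per page as
a toy spinlock instead of QemuSpin, pages are identified directly by
index, and N_PAGES, page_locked, page_lock_pair and page_unlock_pair
are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define N_PAGES 8   /* hypothetical number of pages */

static atomic_bool page_locked[N_PAGES];   /* false = unlocked */

static void page_lock(size_t idx)
{
    /* Spin until we are the one flipping the flag to true. */
    while (atomic_exchange_explicit(&page_locked[idx], true,
                                    memory_order_acquire)) {
    }
}

static void page_unlock(size_t idx)
{
    atomic_store_explicit(&page_locked[idx], false,
                          memory_order_release);
}

/* Lock one or two pages, always taking the lower index first. */
static void page_lock_pair(size_t a, size_t b)
{
    size_t lo = a < b ? a : b;
    size_t hi = a < b ? b : a;

    page_lock(lo);
    if (hi != lo) {
        page_lock(hi);
    }
}

/* Unlock order does not matter for deadlock avoidance. */
static void page_unlock_pair(size_t a, size_t b)
{
    page_unlock(a);
    if (b != a) {
        page_unlock(b);
    }
}
```

Since every thread acquires in ascending index order, two threads that
need the same pair of pages cannot each hold one and wait for the other.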

This commit first introduces the structs and the function helpers,
and then converts the calling code to use per-page locking. Note
that tb_lock is not removed yet.

While at it, rename tb_alloc_page to tb_page_add, which pairs with
tb_page_remove.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 432 +++++++++++++++++++++++++++++++++++++++++-----
 accel/tcg/translate-all.h |   3 +
 include/exec/exec-all.h   |   3 +-
 3 files changed, 396 insertions(+), 42 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 4cb03f1..07527d5 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -112,8 +112,55 @@ typedef struct PageDesc {
 #else
     unsigned long flags;
 #endif
+#ifndef CONFIG_USER_ONLY
+    QemuSpin lock;
+#endif
 } PageDesc;
 
+/**
+ * struct page_entry - page descriptor entry
+ * @pd:     pointer to the &struct PageDesc of the page this entry represents
+ * @index:  page index of the page
+ * @locked: whether the page is locked
+ *
+ * This struct helps us keep track of the locked state of a page, without
+ * bloating &struct PageDesc.
+ *
+ * A page lock protects accesses to all fields of &struct PageDesc.
+ *
+ * See also: &struct page_collection.
+ */
+struct page_entry {
+    PageDesc *pd;
+    tb_page_addr_t index;
+    bool locked;
+};
+
+/**
+ * struct page_collection - tracks a set of pages (i.e. &struct page_entry's)
+ * @tree:   Binary search tree (BST) of the pages, with key == page index
+ * @max:    Pointer to the page in @tree with the highest page index
+ *
+ * To avoid deadlock we lock pages in ascending order of page index.
+ * When operating on a set of pages, we need to keep track of them so that
+ * we can lock them in order and also unlock them later. For this we collect
+ * pages (i.e. &struct page_entry's) in a binary search @tree. Given that the
+ * @tree implementation we use does not provide an O(1) operation to obtain the
+ * highest-ranked element, we use @max to keep track of the inserted page
+ * with the highest index. This is valuable because if a page is not in
+ * the tree and its index is higher than @max's, then we can lock it
+ * without breaking the locking order rule.
+ *
+ * Note on naming: 'struct page_set' would be shorter, but we already have a few
+ * page_set_*() helpers, so page_collection is used instead to avoid confusion.
+ *
+ * See also: page_collection_lock().
+ */
+struct page_collection {
+    GTree *tree;
+    struct page_entry *max;
+};
+
 /* list iterators for lists of tagged pointers in TranslationBlock */
 #define TB_FOR_EACH_TAGGED(head, tb, n, field)                  \
     for (n = (head) & 1,                                        \
@@ -510,6 +557,15 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
             return NULL;
         }
         pd = g_new0(PageDesc, V_L2_SIZE);
+#ifndef CONFIG_USER_ONLY
+        {
+            int i;
+
+            for (i = 0; i < V_L2_SIZE; i++) {
+                qemu_spin_init(&pd[i].lock);
+            }
+        }
+#endif
         existing = atomic_cmpxchg(lp, NULL, pd);
         if (unlikely(existing)) {
             g_free(pd);
@@ -525,6 +581,228 @@ static inline PageDesc *page_find(tb_page_addr_t index)
     return page_find_alloc(index, 0);
 }
 
+/* In user-mode page locks aren't used; mmap_lock is enough */
+#ifdef CONFIG_USER_ONLY
+static inline void page_lock(PageDesc *pd)
+{ }
+
+static inline void page_unlock(PageDesc *pd)
+{ }
+
+static inline void page_lock_tb(const TranslationBlock *tb)
+{ }
+
+static inline void page_unlock_tb(const TranslationBlock *tb)
+{ }
+
+struct page_collection *
+page_collection_lock(tb_page_addr_t start, tb_page_addr_t end)
+{
+    return NULL;
+}
+
+void page_collection_unlock(struct page_collection *set)
+{ }
+#else /* !CONFIG_USER_ONLY */
+
+static inline void page_lock(PageDesc *pd)
+{
+    qemu_spin_lock(&pd->lock);
+}
+
+static inline void page_unlock(PageDesc *pd)
+{
+    qemu_spin_unlock(&pd->lock);
+}
+
+/* lock the page(s) of a TB in the correct acquisition order */
+static inline void page_lock_tb(const TranslationBlock *tb)
+{
+    if (likely(tb->page_addr[1] == -1)) {
+        page_lock(page_find(tb->page_addr[0] >> TARGET_PAGE_BITS));
+        return;
+    }
+    if (tb->page_addr[0] < tb->page_addr[1]) {
+        page_lock(page_find(tb->page_addr[0] >> TARGET_PAGE_BITS));
+        page_lock(page_find(tb->page_addr[1] >> TARGET_PAGE_BITS));
+    } else {
+        page_lock(page_find(tb->page_addr[1] >> TARGET_PAGE_BITS));
+        page_lock(page_find(tb->page_addr[0] >> TARGET_PAGE_BITS));
+    }
+}
+
+static inline void page_unlock_tb(const TranslationBlock *tb)
+{
+    page_unlock(page_find(tb->page_addr[0] >> TARGET_PAGE_BITS));
+    if (unlikely(tb->page_addr[1] != -1)) {
+        page_unlock(page_find(tb->page_addr[1] >> TARGET_PAGE_BITS));
+    }
+}
+
+static inline struct page_entry *
+page_entry_new(PageDesc *pd, tb_page_addr_t index)
+{
+    struct page_entry *pe = g_malloc(sizeof(*pe));
+
+    pe->index = index;
+    pe->pd = pd;
+    pe->locked = false;
+    return pe;
+}
+
+static void page_entry_destroy(gpointer p)
+{
+    struct page_entry *pe = p;
+
+    g_assert(pe->locked);
+    page_unlock(pe->pd);
+    g_free(pe);
+}
+
+/* returns false on success */
+static bool page_entry_trylock(struct page_entry *pe)
+{
+    bool busy;
+
+    busy = qemu_spin_trylock(&pe->pd->lock);
+    if (!busy) {
+        g_assert(!pe->locked);
+        pe->locked = true;
+    }
+    return busy;
+}
+
+static void do_page_entry_lock(struct page_entry *pe)
+{
+    page_lock(pe->pd);
+    g_assert(!pe->locked);
+    pe->locked = true;
+}
+
+static gboolean page_entry_lock(gpointer key, gpointer value, gpointer data)
+{
+    struct page_entry *pe = value;
+
+    do_page_entry_lock(pe);
+    return FALSE;
+}
+
+static gboolean page_entry_unlock(gpointer key, gpointer value, gpointer data)
+{
+    struct page_entry *pe = value;
+
+    if (pe->locked) {
+        pe->locked = false;
+        page_unlock(pe->pd);
+    }
+    return FALSE;
+}
+
+/*
+ * Trylock a page, and if successful, add the page to a collection.
+ * Returns true ("busy") if the page could not be locked; false otherwise.
+ */
+static bool page_trylock_add(struct page_collection *set, tb_page_addr_t addr)
+{
+    tb_page_addr_t index = addr >> TARGET_PAGE_BITS;
+    struct page_entry *pe;
+    PageDesc *pd;
+
+    pe = g_tree_lookup(set->tree, &index);
+    if (pe) {
+        return false;
+    }
+
+    pd = page_find(index);
+    if (pd == NULL) {
+        return false;
+    }
+
+    pe = page_entry_new(pd, index);
+    g_tree_insert(set->tree, &pe->index, pe);
+
+    /*
+     * If this is either (1) the first insertion or (2) a page whose index
+     * is higher than any other so far, just lock the page and move on.
+     */
+    if (set->max == NULL || pe->index > set->max->index) {
+        set->max = pe;
+        do_page_entry_lock(pe);
+        return false;
+    }
+    /*
+     * Try to acquire out-of-order lock; if busy, return busy so that we acquire
+     * locks in order.
+     */
+    return page_entry_trylock(pe);
+}
+
+static gint tb_page_addr_cmp(gconstpointer ap, gconstpointer bp, gpointer udata)
+{
+    tb_page_addr_t a = *(const tb_page_addr_t *)ap;
+    tb_page_addr_t b = *(const tb_page_addr_t *)bp;
+
+    if (a == b) {
+        return 0;
+    } else if (a < b) {
+        return -1;
+    }
+    return 1;
+}
+
+/*
+ * Lock a range of pages ([@start,@end[) as well as the pages of all
+ * intersecting TBs.
+ * Locking order: acquire locks in ascending order of page index.
+ */
+struct page_collection *
+page_collection_lock(tb_page_addr_t start, tb_page_addr_t end)
+{
+    struct page_collection *set = g_malloc(sizeof(*set));
+    tb_page_addr_t index;
+    PageDesc *pd;
+
+    start >>= TARGET_PAGE_BITS;
+    end   >>= TARGET_PAGE_BITS;
+    g_assert(start <= end);
+
+    set->tree = g_tree_new_full(tb_page_addr_cmp, NULL, NULL,
+                                page_entry_destroy);
+    set->max = NULL;
+
+ retry:
+    g_tree_foreach(set->tree, page_entry_lock, NULL);
+
+    for (index = start; index <= end; index++) {
+        TranslationBlock *tb;
+        int n;
+
+        pd = page_find(index);
+        if (pd == NULL) {
+            continue;
+        }
+        PAGE_FOR_EACH_TB(pd, tb, n) {
+            if (page_trylock_add(set, tb->page_addr[0]) ||
+                (tb->page_addr[1] != -1 &&
+                 page_trylock_add(set, tb->page_addr[1]))) {
+                /* drop all locks, and reacquire in order */
+                g_tree_foreach(set->tree, page_entry_unlock, NULL);
+                goto retry;
+            }
+        }
+    }
+    return set;
+}
+
+void page_collection_unlock(struct page_collection *set)
+{
+    /* entries are unlocked and freed via page_entry_destroy */
+    g_tree_destroy(set->tree);
+    g_free(set);
+}
+
+#endif /* !CONFIG_USER_ONLY */
+
 #if defined(CONFIG_USER_ONLY)
 /* Currently it is not recommended to allocate big chunks of data in
    user mode. It will change when a dedicated libc will be used.  */
@@ -813,6 +1091,7 @@ static TranslationBlock *tb_alloc(target_ulong pc)
     return tb;
 }
 
+/* call with @p->lock held */
 static inline void invalidate_page_bitmap(PageDesc *p)
 {
 #ifdef CONFIG_SOFTMMU
@@ -834,8 +1113,10 @@ static void page_flush_tb_1(int level, void **lp)
         PageDesc *pd = *lp;
 
         for (i = 0; i < V_L2_SIZE; ++i) {
+            page_lock(&pd[i]);
             pd[i].first_tb = (uintptr_t)NULL;
             invalidate_page_bitmap(pd + i);
+            page_unlock(&pd[i]);
         }
     } else {
         void **pp = *lp;
@@ -962,6 +1243,7 @@ static void tb_page_check(void)
 
 #endif /* CONFIG_USER_ONLY */
 
+/* call with @pd->lock held */
 static inline void tb_page_remove(PageDesc *pd, TranslationBlock *tb)
 {
     TranslationBlock *tb1;
@@ -1038,11 +1320,8 @@ static inline void tb_jmp_unlink(TranslationBlock *tb)
     }
 }
 
-/* invalidate one TB
- *
- * Called with tb_lock held.
- */
-void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
+/* If @rm_from_page_list is set, call with the TB's pages' locks held */
+static void do_tb_phys_invalidate(TranslationBlock *tb, bool rm_from_page_list)
 {
     CPUState *cpu;
     PageDesc *p;
@@ -1062,15 +1341,15 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     }
 
     /* remove the TB from the page list */
-    if (tb->page_addr[0] != page_addr) {
+    if (rm_from_page_list) {
         p = page_find(tb->page_addr[0] >> TARGET_PAGE_BITS);
         tb_page_remove(p, tb);
         invalidate_page_bitmap(p);
-    }
-    if (tb->page_addr[1] != -1 && tb->page_addr[1] != page_addr) {
-        p = page_find(tb->page_addr[1] >> TARGET_PAGE_BITS);
-        tb_page_remove(p, tb);
-        invalidate_page_bitmap(p);
+        if (tb->page_addr[1] != -1) {
+            p = page_find(tb->page_addr[1] >> TARGET_PAGE_BITS);
+            tb_page_remove(p, tb);
+            invalidate_page_bitmap(p);
+        }
     }
 
     /* remove the TB from the hash list */
@@ -1092,7 +1371,28 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
                tcg_ctx->tb_phys_invalidate_count + 1);
 }
 
+static void tb_phys_invalidate__locked(TranslationBlock *tb)
+{
+    do_tb_phys_invalidate(tb, true);
+}
+
+/* invalidate one TB
+ *
+ * Called with tb_lock held.
+ */
+void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
+{
+    if (page_addr == -1) {
+        page_lock_tb(tb);
+        do_tb_phys_invalidate(tb, true);
+        page_unlock_tb(tb);
+    } else {
+        do_tb_phys_invalidate(tb, false);
+    }
+}
+
 #ifdef CONFIG_SOFTMMU
+/* call with @p->lock held */
 static void build_page_bitmap(PageDesc *p)
 {
     int n, tb_start, tb_end;
@@ -1122,11 +1422,11 @@ static void build_page_bitmap(PageDesc *p)
 /* add the tb in the target page and protect it if necessary
  *
  * Called with mmap_lock held for user-mode emulation.
+ * Called with @p->lock held.
  */
-static inline void tb_alloc_page(TranslationBlock *tb,
-                                 unsigned int n, tb_page_addr_t page_addr)
+static inline void tb_page_add(PageDesc *p, TranslationBlock *tb,
+                               unsigned int n, tb_page_addr_t page_addr)
 {
-    PageDesc *p;
 #ifndef CONFIG_USER_ONLY
     bool page_already_protected;
 #endif
@@ -1134,7 +1434,6 @@ static inline void tb_alloc_page(TranslationBlock *tb,
     assert_memory_lock();
 
     tb->page_addr[n] = page_addr;
-    p = page_find_alloc(page_addr >> TARGET_PAGE_BITS, 1);
     tb->page_next[n] = p->first_tb;
 #ifndef CONFIG_USER_ONLY
     page_already_protected = p->first_tb != (uintptr_t)NULL;
@@ -1186,17 +1485,38 @@ static inline void tb_alloc_page(TranslationBlock *tb,
 static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                          tb_page_addr_t phys_page2)
 {
+    PageDesc *p;
+    PageDesc *p2 = NULL;
     uint32_t h;
 
     assert_memory_lock();
 
-    /* add in the page list */
-    tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);
-    if (phys_page2 != -1) {
-        tb_alloc_page(tb, 1, phys_page2);
-    } else {
+    /*
+     * Add the TB to the page list.
+     * To avoid deadlock, acquire first the lock of the lower-addressed page.
+     */
+    p = page_find_alloc(phys_pc >> TARGET_PAGE_BITS, 1);
+    if (likely(phys_page2 == -1)) {
         tb->page_addr[1] = -1;
+        page_lock(p);
+        tb_page_add(p, tb, 0, phys_pc & TARGET_PAGE_MASK);
+    } else {
+        p2 = page_find_alloc(phys_page2 >> TARGET_PAGE_BITS, 1);
+        if (phys_pc < phys_page2) {
+            page_lock(p);
+            page_lock(p2);
+        } else {
+            page_lock(p2);
+            page_lock(p);
+        }
+        tb_page_add(p, tb, 0, phys_pc & TARGET_PAGE_MASK);
+        tb_page_add(p2, tb, 1, phys_page2);
+    }
+
+    if (p2) {
+        page_unlock(p2);
     }
+    page_unlock(p);
 
     /* add in the hash table */
     h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->cflags & CF_HASH_MASK,
@@ -1370,21 +1690,17 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 }
 
 /*
- * Invalidate all TBs which intersect with the target physical address range
- * [start;end[. NOTE: start and end must refer to the *same* physical page.
- * 'is_cpu_write_access' should be true if called from a real cpu write
- * access: the virtual CPU will exit the current TB if code is modified inside
- * this TB.
- *
- * Called with tb_lock/mmap_lock held for user-mode emulation
- * Called with tb_lock held for system-mode emulation
+ * Call with all @pages locked.
+ * @p must be non-NULL.
  */
-void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
-                                   int is_cpu_write_access)
+static void
+tb_invalidate_phys_page_range__locked(struct page_collection *pages,
+                                      PageDesc *p, tb_page_addr_t start,
+                                      tb_page_addr_t end,
+                                      int is_cpu_write_access)
 {
     TranslationBlock *tb;
     tb_page_addr_t tb_start, tb_end;
-    PageDesc *p;
     int n;
 #ifdef TARGET_HAS_PRECISE_SMC
     CPUState *cpu = current_cpu;
@@ -1400,10 +1716,6 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
     assert_memory_lock();
     assert_tb_locked();
 
-    p = page_find(start >> TARGET_PAGE_BITS);
-    if (!p) {
-        return;
-    }
 #if defined(TARGET_HAS_PRECISE_SMC)
     if (cpu != NULL) {
         env = cpu->env_ptr;
@@ -1448,7 +1760,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
                                      &current_flags);
             }
 #endif /* TARGET_HAS_PRECISE_SMC */
-            tb_phys_invalidate(tb, -1);
+            tb_phys_invalidate__locked(tb);
         }
     }
 #if !defined(CONFIG_USER_ONLY)
@@ -1460,6 +1772,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
 #endif
 #ifdef TARGET_HAS_PRECISE_SMC
     if (current_tb_modified) {
+        page_collection_unlock(pages);
         /* Force execution of one insn next time.  */
         cpu->cflags_next_tb = 1 | curr_cflags();
         cpu_loop_exit_noexc(cpu);
@@ -1469,6 +1782,35 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
 
 /*
  * Invalidate all TBs which intersect with the target physical address range
+ * [start;end[. NOTE: start and end must refer to the *same* physical page.
+ * 'is_cpu_write_access' should be true if called from a real cpu write
+ * access: the virtual CPU will exit the current TB if code is modified inside
+ * this TB.
+ *
+ * Called with tb_lock/mmap_lock held for user-mode emulation
+ * Called with tb_lock held for system-mode emulation
+ */
+void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
+                                   int is_cpu_write_access)
+{
+    struct page_collection *pages;
+    PageDesc *p;
+
+    assert_memory_lock();
+    assert_tb_locked();
+
+    p = page_find(start >> TARGET_PAGE_BITS);
+    if (p == NULL) {
+        return;
+    }
+    pages = page_collection_lock(start, end);
+    tb_invalidate_phys_page_range__locked(pages, p, start, end,
+                                          is_cpu_write_access);
+    page_collection_unlock(pages);
+}
+
+/*
+ * Invalidate all TBs which intersect with the target physical address range
  * [start;end[. NOTE: start and end may refer to *different* physical pages.
  * 'is_cpu_write_access' should be true if called from a real cpu write
  * access: the virtual CPU will exit the current TB if code is modified inside
@@ -1479,15 +1821,22 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
  */
 static void tb_invalidate_phys_range_1(tb_page_addr_t start, tb_page_addr_t end)
 {
+    struct page_collection *pages;
     tb_page_addr_t next;
 
+    pages = page_collection_lock(start, end);
     for (next = (start & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
          start < end;
          start = next, next += TARGET_PAGE_SIZE) {
+        PageDesc *pd = page_find(start >> TARGET_PAGE_BITS);
         tb_page_addr_t bound = MIN(next, end);
 
-        tb_invalidate_phys_page_range(start, bound, 0);
+        if (pd == NULL) {
+            continue;
+        }
+        tb_invalidate_phys_page_range__locked(pages, pd, start, bound, 0);
     }
+    page_collection_unlock(pages);
 }
 
 #ifdef CONFIG_SOFTMMU
@@ -1513,6 +1862,7 @@ void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
  */
 void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
 {
+    struct page_collection *pages;
     PageDesc *p;
 
 #if 0
@@ -1530,11 +1880,10 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
     if (!p) {
         return;
     }
+
+    pages = page_collection_lock(start, start + len);
     if (!p->code_bitmap &&
         ++p->code_write_count >= SMC_BITMAP_USE_THRESHOLD) {
-        /* build code bitmap.  FIXME: writes should be protected by
-         * tb_lock, reads by tb_lock or RCU.
-         */
         build_page_bitmap(p);
     }
     if (p->code_bitmap) {
@@ -1548,8 +1897,9 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
         }
     } else {
     do_invalidate:
-        tb_invalidate_phys_page_range(start, start + len, 1);
+        tb_invalidate_phys_page_range__locked(pages, p, start, start + len, 1);
     }
+    page_collection_unlock(pages);
 }
 #else
 /* Called with mmap_lock held. If pc is not 0 then it indicates the
diff --git a/accel/tcg/translate-all.h b/accel/tcg/translate-all.h
index ba8e4d6..6d1d258 100644
--- a/accel/tcg/translate-all.h
+++ b/accel/tcg/translate-all.h
@@ -23,6 +23,9 @@
 
 
 /* translate-all.c */
+struct page_collection *page_collection_lock(tb_page_addr_t start,
+                                             tb_page_addr_t end);
+void page_collection_unlock(struct page_collection *set);
 void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len);
 void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
                                    int is_cpu_write_access);
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 5f7e65a..aeaa127 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -355,7 +355,8 @@ struct TranslationBlock {
     /* original tb when cflags has CF_NOCACHE */
     struct TranslationBlock *orig_tb;
     /* first and second physical page containing code. The lower bit
-       of the pointer tells the index in page_next[] */
+       of the pointer tells the index in page_next[].
+       The list is protected by the TB's page('s) lock(s) */
     uintptr_t page_next[2];
     tb_page_addr_t page_addr[2];
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Qemu-devel] [PATCH 11/16] translate-all: add page_collection assertions
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
                   ` (9 preceding siblings ...)
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 10/16] translate-all: use per-page locking in !user-mode Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-03-29 15:08   ` Alex Bennée
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 12/16] translate-all: discard TB when tb_link_page returns an existing matching TB Emilio G. Cota
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

This patch adds assertions to make sure we do not longjmp with page
locks held. Some notes:

- user-mode has nothing to check, since page_locks are !user-mode only.

- The checks only apply to page collections, since these have relatively
  complex callers.

- Some simple page_lock/unlock callers have been left unchecked --
  namely page_lock_tb, tb_phys_invalidate and tb_link_page.
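
As a rough illustration of the idea in the notes above, here is a minimal,
single-file sketch of a per-thread "collection locked" flag. All demo_* names
are hypothetical and are not part of this patch; the real code compiles the
flag in only under CONFIG_DEBUG_TCG and uses tcg_debug_assert rather than
assert.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread flag; true while this thread holds a collection. */
static _Thread_local bool demo_collection_locked;

static bool demo_collection_is_locked(void)
{
    return demo_collection_locked;
}

/* Aborts if the flag does not match what the caller expects. */
static void demo_assert_collection_locked(bool val)
{
    assert(demo_collection_locked == val);
}

static void demo_collection_lock(void)
{
    /* collections do not nest in this sketch */
    demo_assert_collection_locked(false);
    /* ... the page locks would be acquired here ... */
    demo_collection_locked = true;
}

static void demo_collection_unlock(void)
{
    /* ... the page locks would be released here ... */
    demo_collection_locked = false;
}
```

A caller that is about to longjmp would call
demo_assert_collection_locked(false) first, which is what the patch does at
the sigsetjmp landing site in cpu_exec_step_atomic.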

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cpu-exec.c      |  1 +
 accel/tcg/translate-all.c | 22 ++++++++++++++++++++++
 include/exec/exec-all.h   |  8 ++++++++
 3 files changed, 31 insertions(+)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 8c68727..7c83887 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -271,6 +271,7 @@ void cpu_exec_step_atomic(CPUState *cpu)
         tcg_debug_assert(!have_mmap_lock());
 #endif
         tb_lock_reset();
+        assert_page_collection_locked(false);
     }
 
     if (in_exclusive_region) {
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 07527d5..82832ef 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -605,6 +605,24 @@ void page_collection_unlock(struct page_collection *set)
 { }
 #else /* !CONFIG_USER_ONLY */
 
+#ifdef CONFIG_DEBUG_TCG
+static __thread bool page_collection_locked;
+
+void assert_page_collection_locked(bool val)
+{
+    tcg_debug_assert(page_collection_locked == val);
+}
+
+static inline void set_page_collection_locked(bool val)
+{
+    page_collection_locked = val;
+}
+#else
+static inline void set_page_collection_locked(bool val)
+{
+}
+#endif /* !CONFIG_DEBUG_TCG */
+
 static inline void page_lock(PageDesc *pd)
 {
     qemu_spin_lock(&pd->lock);
@@ -677,6 +695,7 @@ static void do_page_entry_lock(struct page_entry *pe)
     page_lock(pe->pd);
     g_assert(!pe->locked);
     pe->locked = true;
+    set_page_collection_locked(true);
 }
 
 static gboolean page_entry_lock(gpointer key, gpointer value, gpointer data)
@@ -769,6 +788,7 @@ page_collection_lock(tb_page_addr_t start, tb_page_addr_t end)
     set->tree = g_tree_new_full(tb_page_addr_cmp, NULL, NULL,
                                 page_entry_destroy);
     set->max = NULL;
+    assert_page_collection_locked(false);
 
  retry:
     g_tree_foreach(set->tree, page_entry_lock, NULL);
@@ -787,6 +807,7 @@ page_collection_lock(tb_page_addr_t start, tb_page_addr_t end)
                  page_trylock_add(set, tb->page_addr[1]))) {
                 /* drop all locks, and reacquire in order */
                 g_tree_foreach(set->tree, page_entry_unlock, NULL);
+                set_page_collection_locked(false);
                 goto retry;
             }
         }
@@ -799,6 +820,7 @@ void page_collection_unlock(struct page_collection *set)
     /* entries are unlocked and freed via page_entry_destroy */
     g_tree_destroy(set->tree);
     g_free(set);
+    set_page_collection_locked(false);
 }
 
 #endif /* !CONFIG_USER_ONLY */
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index aeaa127..7911e69 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -431,6 +431,14 @@ void tb_lock(void);
 void tb_unlock(void);
 void tb_lock_reset(void);
 
+#if !defined(CONFIG_USER_ONLY) && defined(CONFIG_DEBUG_TCG)
+void assert_page_collection_locked(bool val);
+#else
+static inline void assert_page_collection_locked(bool val)
+{
+}
+#endif
+
 #if !defined(CONFIG_USER_ONLY)
 
 struct MemoryRegion *iotlb_to_region(CPUState *cpu,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Qemu-devel] [PATCH 12/16] translate-all: discard TB when tb_link_page returns an existing matching TB
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
                   ` (10 preceding siblings ...)
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 11/16] translate-all: add page_collection assertions Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-03-29 15:19   ` Alex Bennée
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 13/16] translate-all: protect TB jumps with a per-destination-TB lock Emilio G. Cota
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

Use the recently-gained QHT feature of returning the matching TB if it
already exists. This allows us to get rid of the lookup we perform
right after acquiring tb_lock.
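
The pattern can be sketched in isolation as follows. This is a
single-threaded toy with hypothetical demo_* names, not QHT: it assumes only
what the diff below shows, namely that the insert returns NULL on success and
the already-present entry otherwise, so the caller can discard its own copy
without a separate lookup.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_SLOTS 8

struct demo_entry {
    unsigned key;
    int payload;
};

static struct demo_entry *demo_table[DEMO_SLOTS];
static size_t demo_count;

/* Returns NULL if @e was inserted, else the existing entry with the same key. */
static struct demo_entry *demo_insert(struct demo_entry *e)
{
    size_t i;

    for (i = 0; i < demo_count; i++) {
        if (demo_table[i]->key == e->key) {
            return demo_table[i];
        }
    }
    assert(demo_count < DEMO_SLOTS);
    demo_table[demo_count++] = e;
    return NULL;
}

/* Returns @e, or the existing entry that matches @e. */
static struct demo_entry *demo_link(struct demo_entry *e)
{
    struct demo_entry *existing = demo_insert(e);

    return existing ? existing : e;
}
```

The caller compares the returned pointer with the one it passed in; if they
differ, it lost the race and can release whatever it allocated.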

Suggested-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cpu-exec.c      | 14 ++------------
 accel/tcg/translate-all.c | 47 ++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 40 insertions(+), 21 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 7c83887..8aed38c 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -243,10 +243,7 @@ void cpu_exec_step_atomic(CPUState *cpu)
         if (tb == NULL) {
             mmap_lock();
             tb_lock();
-            tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
-            if (likely(tb == NULL)) {
-                tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
-            }
+            tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
             tb_unlock();
             mmap_unlock();
         }
@@ -396,14 +393,7 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
         tb_lock();
         acquired_tb_lock = true;
 
-        /* There's a chance that our desired tb has been translated while
-         * taking the locks so we check again inside the lock.
-         */
-        tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
-        if (likely(tb == NULL)) {
-            /* if no translated code available, then translate it now */
-            tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
-        }
+        tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
 
         mmap_unlock();
         /* We add the TB in the virtual pc hash table for the fast lookup */
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 82832ef..dbe6c12 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1503,12 +1503,16 @@ static inline void tb_page_add(PageDesc *p, TranslationBlock *tb,
  * (-1) to indicate that only one page contains the TB.
  *
  * Called with mmap_lock held for user-mode emulation.
+ *
+ * Returns @tb or an existing TB that matches @tb.
  */
-static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
-                         tb_page_addr_t phys_page2)
+static TranslationBlock *
+tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
+             tb_page_addr_t phys_page2)
 {
     PageDesc *p;
     PageDesc *p2 = NULL;
+    void *existing_tb;
     uint32_t h;
 
     assert_memory_lock();
@@ -1516,6 +1520,11 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
     /*
      * Add the TB to the page list.
      * To avoid deadlock, acquire first the lock of the lower-addressed page.
+     * We keep the locks held until after inserting the TB in the hash table,
+     * so that if the insertion fails we know for sure that the TBs are still
+     * in the page descriptors.
+     * Note that inserting into the hash table first isn't an option, since
+     * we can only insert TBs that are fully initialized.
      */
     p = page_find_alloc(phys_pc >> TARGET_PAGE_BITS, 1);
     if (likely(phys_page2 == -1)) {
@@ -1535,21 +1544,33 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
         tb_page_add(p2, tb, 1, phys_page2);
     }
 
+    /* add in the hash table */
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->cflags & CF_HASH_MASK,
+                     tb->trace_vcpu_dstate);
+    existing_tb = qht_insert(&tb_ctx.htable, tb, h);
+
+    /* remove TB from the page(s) if we couldn't insert it */
+    if (unlikely(existing_tb)) {
+        tb_page_remove(p, tb);
+        invalidate_page_bitmap(p);
+        if (p2) {
+            tb_page_remove(p2, tb);
+            invalidate_page_bitmap(p2);
+        }
+        tb = existing_tb;
+    }
+
     if (p2) {
         page_unlock(p2);
     }
     page_unlock(p);
 
-    /* add in the hash table */
-    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->cflags & CF_HASH_MASK,
-                     tb->trace_vcpu_dstate);
-    qht_insert(&tb_ctx.htable, tb, h);
-
 #ifdef CONFIG_USER_ONLY
     if (DEBUG_TB_CHECK_GATE) {
         tb_page_check();
     }
 #endif
+    return tb;
 }
 
 /* Called with mmap_lock held for user mode emulation.  */
@@ -1558,7 +1579,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
                               uint32_t flags, int cflags)
 {
     CPUArchState *env = cpu->env_ptr;
-    TranslationBlock *tb;
+    TranslationBlock *tb, *existing_tb;
     tb_page_addr_t phys_pc, phys_page2;
     target_ulong virt_page2;
     tcg_insn_unit *gen_code_buf;
@@ -1706,7 +1727,15 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
      * memory barrier is required before tb_link_page() makes the TB visible
      * through the physical hash table and physical page list.
      */
-    tb_link_page(tb, phys_pc, phys_page2);
+    existing_tb = tb_link_page(tb, phys_pc, phys_page2);
+    /* if the TB already exists, discard what we just translated */
+    if (unlikely(existing_tb != tb)) {
+        uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
+
+        orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
+        atomic_set(&tcg_ctx->code_gen_ptr, orig_aligned);
+        return existing_tb;
+    }
     tcg_tb_insert(tb);
     return tb;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Qemu-devel] [PATCH 13/16] translate-all: protect TB jumps with a per-destination-TB lock
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
                   ` (11 preceding siblings ...)
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 12/16] translate-all: discard TB when tb_link_page returns an existing matching TB Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-02-27 11:33   ` Paolo Bonzini
  2018-03-28 15:57   ` Alex Bennée
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 14/16] cputlb: remove tb_lock from tlb_flush functions Emilio G. Cota
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

This applies to both user-mode and !user-mode emulation.

Instead of relying on a global lock, protect the list of incoming
jumps with tb->jmp_lock. This lock also protects tb->cflags,
so update all tb->cflags readers outside tb->jmp_lock to use
atomic reads via tb_cflags().

In order to find the destination TB (and therefore its jmp_lock)
from the origin TB, we introduce tb->jmp_dest[].
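
To make the scheme concrete, here is a simplified sketch of linking a jump
under the destination's lock. The demo_* names, the C11 atomic_flag spinlock
and the plain bool standing in for CF_INVALID are assumptions made for
illustration only; the actual patch uses QemuSpin and tb->cflags, and also
patches the native jump address.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_tb {
    atomic_flag jmp_lock;          /* protects jmp_list_head and invalid */
    bool invalid;                  /* stand-in for CF_INVALID */
    uintptr_t jmp_list_head;       /* tagged pointer: origin TB | slot */
    uintptr_t jmp_list_next[2];    /* next incoming jump, per outgoing slot */
    _Atomic uintptr_t jmp_dest[2]; /* destination TB of each outgoing slot */
};

static void demo_tb_init(struct demo_tb *tb)
{
    atomic_flag_clear(&tb->jmp_lock);
    tb->invalid = false;
    tb->jmp_list_head = 0;
    tb->jmp_list_next[0] = tb->jmp_list_next[1] = 0;
    atomic_init(&tb->jmp_dest[0], (uintptr_t)0);
    atomic_init(&tb->jmp_dest[1], (uintptr_t)0);
}

static void demo_spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire)) {
        /* spin */
    }
}

static void demo_spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Returns true if slot @n of @tb now jumps to @tb_next. */
static bool demo_add_jump(struct demo_tb *tb, int n, struct demo_tb *tb_next)
{
    uintptr_t expected = 0;
    bool linked = false;

    assert(n == 0 || n == 1);
    demo_spin_lock(&tb_next->jmp_lock);
    /* only link to a valid destination, and claim the slot only if unset */
    if (!tb_next->invalid &&
        atomic_compare_exchange_strong(&tb->jmp_dest[n], &expected,
                                       (uintptr_t)tb_next)) {
        /* push (tb, n) onto the destination's list of incoming jumps */
        tb->jmp_list_next[n] = tb_next->jmp_list_head;
        tb_next->jmp_list_head = (uintptr_t)tb | (uintptr_t)n;
        linked = true;
    }
    demo_spin_unlock(&tb_next->jmp_lock);
    return linked;
}
```

The compare-and-swap on jmp_dest[n] matters because two threads may try to
link the same origin slot to different destinations, in which case they hold
different locks.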

I considered replacing the linked list of jumps with another data
structure, which would simplify the code and make the struct smaller.
However, doing so unnecessarily increases memory usage, which results
in a performance decrease. See for instance these numbers from
booting+shutting down debian-arm:
                      Time (s)  Rel. err (%)  Abs. err (s)  Rel. slowdown (%)
------------------------------------------------------------------------------
 before                  20.88          0.74      0.154512                 0.
 after                   20.81          0.38      0.079078        -0.33524904
 GTree                   21.02          0.28      0.058856         0.67049808
 GHashTable + xxhash     21.63          1.08      0.233604          3.5919540

Using a hash table or a binary tree to keep track of the jumps
doesn't really pay off, not only due to the increased memory usage,
but also because most TBs have only 0 or 1 jumps to them. The maximum
number of jumps when booting debian-arm that I measured is 35, but
as we can see in the histogram below a TB with that many incoming jumps
is extremely rare; the average TB has 0.80 incoming jumps.

n_jumps: 379208; avg jumps/tb: 0.801099
dist: [0.0,1.0)|▄█▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁ ▁▁▁  ▁▁▁     ▁|[34.0,35.0]

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cpu-exec.c            |  41 +++++++++-----
 accel/tcg/translate-all.c       | 118 ++++++++++++++++++++++++----------------
 docs/devel/multi-thread-tcg.txt |   6 +-
 include/exec/exec-all.h         |  33 +++++++----
 4 files changed, 123 insertions(+), 75 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 8aed38c..20dad1b 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -350,28 +350,43 @@ void tb_set_jmp_target(TranslationBlock *tb, int n, uintptr_t addr)
     }
 }
 
-/* Called with tb_lock held.  */
 static inline void tb_add_jump(TranslationBlock *tb, int n,
                                TranslationBlock *tb_next)
 {
+    uintptr_t old;
+
     assert(n < ARRAY_SIZE(tb->jmp_list_next));
-    if (tb->jmp_list_next[n]) {
-        /* Another thread has already done this while we were
-         * outside of the lock; nothing to do in this case */
-        return;
+    qemu_spin_lock(&tb_next->jmp_lock);
+
+    /* make sure the destination TB is valid */
+    if (tb_next->cflags & CF_INVALID) {
+        goto out_unlock_next;
+    }
+    /* Atomically claim the jump destination slot only if it was NULL */
+    old = atomic_cmpxchg(&tb->jmp_dest[n], (uintptr_t)NULL, (uintptr_t)tb_next);
+    if (old) {
+        goto out_unlock_next;
     }
+
+    /* patch the native jump address */
+    tb_set_jmp_target(tb, n, (uintptr_t)tb_next->tc.ptr);
+
+    /* add in TB jmp list */
+    tb->jmp_list_next[n] = tb_next->jmp_list_head;
+    tb_next->jmp_list_head = (uintptr_t)tb | n;
+
+    qemu_spin_unlock(&tb_next->jmp_lock);
+
     qemu_log_mask_and_addr(CPU_LOG_EXEC, tb->pc,
                            "Linking TBs %p [" TARGET_FMT_lx
                            "] index %d -> %p [" TARGET_FMT_lx "]\n",
                            tb->tc.ptr, tb->pc, n,
                            tb_next->tc.ptr, tb_next->pc);
+    return;
 
-    /* patch the native jump address */
-    tb_set_jmp_target(tb, n, (uintptr_t)tb_next->tc.ptr);
-
-    /* add in TB jmp circular list */
-    tb->jmp_list_next[n] = tb_next->jmp_list_first;
-    tb_next->jmp_list_first = (uintptr_t)tb | n;
+ out_unlock_next:
+    qemu_spin_unlock(&tb_next->jmp_lock);
+    return;
 }
 
 static inline TranslationBlock *tb_find(CPUState *cpu,
@@ -414,9 +429,7 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
             tb_lock();
             acquired_tb_lock = true;
         }
-        if (!(tb->cflags & CF_INVALID)) {
-            tb_add_jump(last_tb, tb_exit, tb);
-        }
+        tb_add_jump(last_tb, tb_exit, tb);
     }
     if (acquired_tb_lock) {
         tb_unlock();
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index dbe6c12..9ab6477 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -173,6 +173,9 @@ struct page_collection {
 #define PAGE_FOR_EACH_TB(pagedesc, tb, n)                       \
     TB_FOR_EACH_TAGGED((pagedesc)->first_tb, tb, n, page_next)
 
+#define TB_FOR_EACH_JMP(head_tb, tb, n)                                 \
+    TB_FOR_EACH_TAGGED((head_tb)->jmp_list_head, tb, n, jmp_list_next)
+
 /* In system mode we want L1_MAP to be based on ram offsets,
    while in user mode we want it to be based on virtual addresses.  */
 #if !defined(CONFIG_USER_ONLY)
@@ -390,7 +393,7 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
     return -1;
 
  found:
-    if (tb->cflags & CF_USE_ICOUNT) {
+    if (tb_cflags(tb) & CF_USE_ICOUNT) {
         assert(use_icount);
         /* Reset the cycle counter to the start of the block.  */
         cpu->icount_decr.u16.low += num_insns;
@@ -435,7 +438,7 @@ bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc)
         tb = tcg_tb_lookup(host_pc);
         if (tb) {
             cpu_restore_state_from_tb(cpu, tb, host_pc);
-            if (tb->cflags & CF_NOCACHE) {
+            if (tb_cflags(tb) & CF_NOCACHE) {
                 /* one-shot translation, invalidate it immediately */
                 tb_phys_invalidate(tb, -1);
                 tcg_tb_remove(tb);
@@ -1283,34 +1286,53 @@ static inline void tb_page_remove(PageDesc *pd, TranslationBlock *tb)
     g_assert_not_reached();
 }
 
-/* remove the TB from a list of TBs jumping to the n-th jump target of the TB */
-static inline void tb_remove_from_jmp_list(TranslationBlock *tb, int n)
+/* remove @orig from its @n_orig-th jump list */
+static inline void tb_remove_from_jmp_list(TranslationBlock *orig, int n_orig)
 {
-    TranslationBlock *tb1;
-    uintptr_t *ptb, ntb;
-    unsigned int n1;
+    uintptr_t ptr, ptr_locked;
+    TranslationBlock *dest;
+    TranslationBlock *tb;
+    uintptr_t *pprev;
+    int n;
 
-    ptb = &tb->jmp_list_next[n];
-    if (*ptb) {
-        /* find tb(n) in circular list */
-        for (;;) {
-            ntb = *ptb;
-            n1 = ntb & 3;
-            tb1 = (TranslationBlock *)(ntb & ~3);
-            if (n1 == n && tb1 == tb) {
-                break;
-            }
-            if (n1 == 2) {
-                ptb = &tb1->jmp_list_first;
-            } else {
-                ptb = &tb1->jmp_list_next[n1];
-            }
-        }
-        /* now we can suppress tb(n) from the list */
-        *ptb = tb->jmp_list_next[n];
+    /* mark the LSB of jmp_dest[] so that no further jumps can be inserted */
+    ptr = atomic_or_fetch(&orig->jmp_dest[n_orig], 1);
+    dest = (TranslationBlock *)(ptr & ~1);
+    if (dest == NULL) {
+        return;
+    }
 
-        tb->jmp_list_next[n] = (uintptr_t)NULL;
+    qemu_spin_lock(&dest->jmp_lock);
+    /*
+     * While acquiring the lock, the jump might have been removed if the
+     * destination TB was invalidated; check again.
+     */
+    ptr_locked = atomic_read(&orig->jmp_dest[n_orig]);
+    if (ptr_locked != ptr) {
+        qemu_spin_unlock(&dest->jmp_lock);
+        /*
+         * The only possibility is that the jump was unlinked via
+         * tb_jump_unlink(dest). Seeing here another destination would be a bug,
+         * because we set the LSB above.
+         */
+        g_assert(ptr_locked == 1 && dest->cflags & CF_INVALID);
+        return;
     }
+    /*
+     * We first acquired the lock, and since the destination pointer matches,
+     * we know for sure that @orig is in the jmp list.
+     */
+    pprev = &dest->jmp_list_head;
+    TB_FOR_EACH_JMP(dest, tb, n) {
+        if (tb == orig && n == n_orig) {
+            *pprev = tb->jmp_list_next[n];
+            /* no need to set orig->jmp_dest[n]; setting the LSB was enough */
+            qemu_spin_unlock(&dest->jmp_lock);
+            return;
+        }
+        pprev = &tb->jmp_list_next[n];
+    }
+    g_assert_not_reached();
 }
 
 /* reset the jump entry 'n' of a TB so that it is not chained to
@@ -1322,24 +1344,21 @@ static inline void tb_reset_jump(TranslationBlock *tb, int n)
 }
 
 /* remove any jumps to the TB */
-static inline void tb_jmp_unlink(TranslationBlock *tb)
+static inline void tb_jmp_unlink(TranslationBlock *dest)
 {
-    TranslationBlock *tb1;
-    uintptr_t *ptb, ntb;
-    unsigned int n1;
+    TranslationBlock *tb;
+    int n;
 
-    ptb = &tb->jmp_list_first;
-    for (;;) {
-        ntb = *ptb;
-        n1 = ntb & 3;
-        tb1 = (TranslationBlock *)(ntb & ~3);
-        if (n1 == 2) {
-            break;
-        }
-        tb_reset_jump(tb1, n1);
-        *ptb = tb1->jmp_list_next[n1];
-        tb1->jmp_list_next[n1] = (uintptr_t)NULL;
+    qemu_spin_lock(&dest->jmp_lock);
+
+    TB_FOR_EACH_JMP(dest, tb, n) {
+        tb_reset_jump(tb, n);
+        atomic_and(&tb->jmp_dest[n], (uintptr_t)NULL | 1);
+        /* No need to clear the list entry; setting the dest ptr is enough */
     }
+    dest->jmp_list_head = (uintptr_t)NULL;
+
+    qemu_spin_unlock(&dest->jmp_lock);
 }
 
 /* If @rm_from_page_list is set, call with the TB's pages' locks held */
@@ -1352,11 +1371,14 @@ static void do_tb_phys_invalidate(TranslationBlock *tb, bool rm_from_page_list)
 
     assert_tb_locked();
 
+    /* make sure no further incoming jumps will be chained to this TB */
+    qemu_spin_lock(&tb->jmp_lock);
     atomic_set(&tb->cflags, tb->cflags | CF_INVALID);
+    qemu_spin_unlock(&tb->jmp_lock);
 
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
-    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->cflags & CF_HASH_MASK,
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb_cflags(tb) & CF_HASH_MASK,
                      tb->trace_vcpu_dstate);
     if (!qht_remove(&tb_ctx.htable, tb, h)) {
         return;
@@ -1703,10 +1725,12 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
                  CODE_GEN_ALIGN));
 
     /* init jump list */
-    assert(((uintptr_t)tb & 3) == 0);
-    tb->jmp_list_first = (uintptr_t)tb | 2;
+    qemu_spin_init(&tb->jmp_lock);
+    tb->jmp_list_head = (uintptr_t)NULL;
     tb->jmp_list_next[0] = (uintptr_t)NULL;
     tb->jmp_list_next[1] = (uintptr_t)NULL;
+    tb->jmp_dest[0] = (uintptr_t)NULL;
+    tb->jmp_dest[1] = (uintptr_t)NULL;
 
     /* init original jump addresses wich has been set during tcg_gen_code() */
     if (tb->jmp_reset_offset[0] != TB_JMP_RESET_OFFSET_INVALID) {
@@ -1798,7 +1822,7 @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
                 }
             }
             if (current_tb == tb &&
-                (current_tb->cflags & CF_COUNT_MASK) != 1) {
+                (tb_cflags(current_tb) & CF_COUNT_MASK) != 1) {
                 /* If we are modifying the current TB, we must stop
                 its execution. We could be more precise by checking
                 that the modification is after the current PC, but it
@@ -1994,7 +2018,7 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
     PAGE_FOR_EACH_TB(p, tb, n) {
 #ifdef TARGET_HAS_PRECISE_SMC
         if (current_tb == tb &&
-            (current_tb->cflags & CF_COUNT_MASK) != 1) {
+            (tb_cflags(current_tb) & CF_COUNT_MASK) != 1) {
                 /* If we are modifying the current TB, we must stop
                    its execution. We could be more precise by checking
                    that the modification is after the current PC, but it
@@ -2124,7 +2148,7 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
     /* Adjust the execution state of the next TB.  */
     cpu->cflags_next_tb = curr_cflags() | CF_LAST_IO | n;
 
-    if (tb->cflags & CF_NOCACHE) {
+    if (tb_cflags(tb) & CF_NOCACHE) {
         if (tb->orig_tb) {
             /* Invalidate original TB if this TB was generated in
              * cpu_exec_nocache() */
diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt
index faf8918..36da1f1 100644
--- a/docs/devel/multi-thread-tcg.txt
+++ b/docs/devel/multi-thread-tcg.txt
@@ -131,8 +131,10 @@ DESIGN REQUIREMENT: Safely handle invalidation of TBs
 
 The direct jump themselves are updated atomically by the TCG
 tb_set_jmp_target() code. Modification to the linked lists that allow
-searching for linked pages are done under the protect of the
-tb_lock().
+searching for linked pages are done under the protection of tb->jmp_lock,
+where tb is the destination block of a jump. Each origin block keeps a
+pointer to its destinations so that the appropriate lock can be acquired before
+iterating over a jump list.
 
 The global page table is a lockless radix tree; cmpxchg is used
 to atomically insert new elements.
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 7911e69..d69b853 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -360,6 +360,9 @@ struct TranslationBlock {
     uintptr_t page_next[2];
     tb_page_addr_t page_addr[2];
 
+    /* jmp_lock placed here to fill a 4-byte hole. Its documentation is below */
+    QemuSpin jmp_lock;
+
     /* The following data are used to directly call another TB from
      * the code of this one. This can be done either by emitting direct or
      * indirect native jump instructions. These jumps are reset so that the TB
@@ -371,20 +374,26 @@ struct TranslationBlock {
 #define TB_JMP_RESET_OFFSET_INVALID 0xffff /* indicates no jump generated */
     uintptr_t jmp_target_arg[2];  /* target address or offset */
 
-    /* Each TB has an associated circular list of TBs jumping to this one.
-     * jmp_list_first points to the first TB jumping to this one.
-     * jmp_list_next is used to point to the next TB in a list.
-     * Since each TB can have two jumps, it can participate in two lists.
-     * jmp_list_first and jmp_list_next are 4-byte aligned pointers to a
-     * TranslationBlock structure, but the two least significant bits of
-     * them are used to encode which data field of the pointed TB should
-     * be used to traverse the list further from that TB:
-     * 0 => jmp_list_next[0], 1 => jmp_list_next[1], 2 => jmp_list_first.
-     * In other words, 0/1 tells which jump is used in the pointed TB,
-     * and 2 means that this is a pointer back to the target TB of this list.
+    /*
+     * Each TB has a NULL-terminated list (jmp_list_head) of incoming jumps.
+     * Each TB can have two outgoing jumps, and therefore can participate
+     * in two lists. The list entries are kept in jmp_list_next[2]. The least
+     * significant bit (LSB) of the pointers in these lists is used to encode
+     * which of the two list entries is to be used in the pointed TB.
+     *
+     * List traversals are protected by jmp_lock. The destination TB of each
+     * outgoing jump is kept in jmp_dest[] so that the appropriate jmp_lock
+     * can be acquired from any origin TB.
+     *
+     * jmp_dest[] are tagged pointers as well. The LSB is set when the TB is
+     * being invalidated, so that no further outgoing jumps from it can be set.
+     *
+     * jmp_lock also protects the CF_INVALID cflag; a jump must not be chained
+     * to a destination TB that has CF_INVALID set.
      */
+    uintptr_t jmp_list_head;
     uintptr_t jmp_list_next[2];
-    uintptr_t jmp_list_first;
+    uintptr_t jmp_dest[2];
 };
 
 extern bool parallel_cpus;
-- 
2.7.4


* [Qemu-devel] [PATCH 14/16] cputlb: remove tb_lock from tlb_flush functions
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
                   ` (12 preceding siblings ...)
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 13/16] translate-all: protect TB jumps with a per-destination-TB lock Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-03-29 15:46   ` Alex Bennée
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 15/16] translate-all: remove tb_lock mention from cpu_restore_state_from_tb Emilio G. Cota
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 16/16] tcg: remove tb_lock Emilio G. Cota
  15 siblings, 1 reply; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

The acquisition of tb_lock was added when the async tlb_flush
was introduced in e3b9ca810 ("cputlb: introduce tlb_flush_* async work.")

tb_lock was there to allow us to do memset() on the tb_jmp_cache's.
However, since f3ced3c5928 ("tcg: consistently access cpu->tb_jmp_cache
atomically") all accesses to tb_jmp_cache are atomic, so tb_lock
is not needed here. Get rid of it.
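
The reasoning can be illustrated with a small sketch in which every access
to the cache, including the clear, is an individual atomic store or load.
This uses C11 atomics instead of QEMU's helpers; FakeCPU, JMP_CACHE_SIZE
and the jmp_cache_* functions are hypothetical, and the relaxed ordering
is an assumption of the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define JMP_CACHE_SIZE 16   /* hypothetical size, chosen for the sketch */

typedef struct {
    _Atomic(void *) jmp_cache[JMP_CACHE_SIZE];
} FakeCPU;

/* Publish @tb in slot @idx with a single atomic store. */
static void jmp_cache_set(FakeCPU *cpu, size_t idx, void *tb)
{
    atomic_store_explicit(&cpu->jmp_cache[idx % JMP_CACHE_SIZE], tb,
                          memory_order_relaxed);
}

/* Readers see either a complete pointer or NULL, never a torn value. */
static void *jmp_cache_lookup(FakeCPU *cpu, size_t idx)
{
    return atomic_load_explicit(&cpu->jmp_cache[idx % JMP_CACHE_SIZE],
                                memory_order_relaxed);
}

/* Clear entry by entry; no lock is needed to exclude readers. */
static void jmp_cache_clear(FakeCPU *cpu)
{
    for (size_t i = 0; i < JMP_CACHE_SIZE; i++) {
        atomic_store_explicit(&cpu->jmp_cache[i], (void *)NULL,
                              memory_order_relaxed);
    }
}

/* Returns 0 if the cache behaves as described, else the failing step. */
int jmp_cache_demo(void)
{
    static FakeCPU cpu;     /* zero-initialized: all entries NULL */
    static int tb;          /* fake translation block */

    jmp_cache_set(&cpu, 3, &tb);
    if (jmp_cache_lookup(&cpu, 3) != (void *)&tb) {
        return 1;
    }
    jmp_cache_clear(&cpu);
    for (size_t i = 0; i < JMP_CACHE_SIZE; i++) {
        if (jmp_cache_lookup(&cpu, i) != NULL) {
            return 2;
        }
    }
    return 0;
}
```

A reader racing with the clear may still observe a stale entry, but it
never observes a partially written pointer, which is the property a
memset() could not guarantee without a lock.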

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cputlb.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 0543903..f5c3a09 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -125,8 +125,6 @@ static void tlb_flush_nocheck(CPUState *cpu)
     atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);
     tlb_debug("(count: %zu)\n", tlb_flush_count());
 
-    tb_lock();
-
     memset(env->tlb_table, -1, sizeof(env->tlb_table));
     memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
     cpu_tb_jmp_cache_clear(cpu);
@@ -135,8 +133,6 @@ static void tlb_flush_nocheck(CPUState *cpu)
     env->tlb_flush_addr = -1;
     env->tlb_flush_mask = 0;
 
-    tb_unlock();
-
     atomic_mb_set(&cpu->pending_tlb_flush, 0);
 }
 
@@ -180,8 +176,6 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
 
     assert_cpu_is_self(cpu);
 
-    tb_lock();
-
     tlb_debug("start: mmu_idx:0x%04lx\n", mmu_idx_bitmask);
 
     for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
@@ -197,8 +191,6 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
     cpu_tb_jmp_cache_clear(cpu);
 
     tlb_debug("done\n");
-
-    tb_unlock();
 }
 
 void tlb_flush_by_mmuidx(CPUState *cpu, uint16_t idxmap)
-- 
2.7.4


* [Qemu-devel] [PATCH 15/16] translate-all: remove tb_lock mention from cpu_restore_state_from_tb
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
                   ` (13 preceding siblings ...)
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 14/16] cputlb: remove tb_lock from tlb_flush functions Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-03-29 16:06   ` Alex Bennée
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 16/16] tcg: remove tb_lock Emilio G. Cota
  15 siblings, 1 reply; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

tb_lock was needed when the function did retranslation. However,
since fca8a500d519 ("tcg: Save insn data and use it in
cpu_restore_state_from_tb") we don't do retranslation.

Get rid of the comment.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 9ab6477..ee49d03 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -357,9 +357,7 @@ static int encode_search(TranslationBlock *tb, uint8_t *block)
     return p - block;
 }
 
-/* The cpu state corresponding to 'searched_pc' is restored.
- * Called with tb_lock held.
- */
+/* The cpu state corresponding to 'searched_pc' is restored */
 static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
                                      uintptr_t searched_pc)
 {
-- 
2.7.4


* [Qemu-devel] [PATCH 16/16] tcg: remove tb_lock
  2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
                   ` (14 preceding siblings ...)
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 15/16] translate-all: remove tb_lock mention from cpu_restore_state_from_tb Emilio G. Cota
@ 2018-02-27  5:39 ` Emilio G. Cota
  2018-03-29 16:15   ` Alex Bennée
  15 siblings, 1 reply; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-27  5:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini

Use mmap_lock in user-mode to protect TCG state and the page
descriptors.
In !user-mode, each vCPU has its own TCG state, so no locks are
needed for it. Per-page locks are used to protect the page descriptors.

Per-TB locks are used in both modes to protect TB jumps.

Some notes:

- tb_lock is removed from notdirty_mem_write by passing a
  locked page_collection to tb_invalidate_phys_page_fast.

- tcg_tb_lookup/remove/insert/etc have their own internal lock(s),
  so there is no need to further serialize access to them.

- do_tb_flush is run in a safe async context, meaning no other
  vCPU threads are running. Therefore acquiring mmap_lock there
  is just to please tools such as thread sanitizer.

- Not visible in the diff, but tb_invalidate_phys_page already
  has an assert_memory_lock.

- cpu_io_recompile is !user-only, so no mmap_lock there.

- Added mmap_unlock()'s before all siglongjmp's that could
  be called in user-mode while mmap_lock is held.
  + Added an assert for !have_mmap_lock() after returning from
    the longjmp in cpu_exec, just like we do in cpu_exec_step_atomic.
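
The last note describes a general pattern: release the lock before
jumping out, then assert on the landing side that nothing is still held.
A minimal sketch follows. It uses plain setjmp/longjmp as a stand-in for
sigsetjmp/siglongjmp to stay within ISO C, and the fake_* helpers,
lock_count and exec_loop_demo are hypothetical names, not QEMU functions.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

/* Hypothetical per-thread lock depth, in the spirit of have_mmap_lock(). */
static _Thread_local int lock_count;
static jmp_buf jmp_env;

static void fake_mmap_lock(void)
{
    lock_count++;
}

static void fake_mmap_unlock(void)
{
    lock_count--;
}

static bool fake_have_mmap_lock(void)
{
    return lock_count > 0;
}

/* Work done under the lock that has to bail out to the main loop. */
static void translate_and_bail(void)
{
    fake_mmap_lock();
    /* ... decide that execution must restart from the main loop ... */
    fake_mmap_unlock();       /* drop the lock before jumping out */
    longjmp(jmp_env, 1);
}

/* Returns the lock depth seen after landing; 0 means nothing leaked. */
int exec_loop_demo(void)
{
    if (setjmp(jmp_env) == 0) {
        translate_and_bail();
        return -1;            /* not reached */
    }
    /* Landing side: the jump must not have carried a held lock with it. */
    assert(!fake_have_mmap_lock());
    return lock_count;
}
```

The assertion on the landing side turns a forgotten unlock into an
immediate failure instead of a later deadlock on the next lock attempt.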

Performance numbers before/after:

Host: AMD Opteron(tm) Processor 6376

                 ubuntu 17.04 ppc64 bootup+shutdown time

  700 +-+--+----+------+------------+-----------+------------*--+-+
      |    +    +      +            +           +           *B    |
      |         before ***B***                            ** *    |
      |tb lock removal ###D###                         ***        |
  600 +-+                                           ***         +-+
      |                                           **         #    |
      |                                        *B*          #D    |
      |                                     *** *         ##      |
  500 +-+                                ***           ###      +-+
      |                             * ***           ###           |
      |                            *B*          # ##              |
      |                          ** *          #D#                |
  400 +-+                      **            ##                 +-+
      |                      **           ###                     |
      |                    **           ##                        |
      |                  **         # ##                          |
  300 +-+  *           B*          #D#                          +-+
      |    B         ***        ###                               |
      |    *       **       ####                                  |
      |     *   ***      ###                                      |
  200 +-+   B  *B     #D#                                       +-+
      |     #B* *   ## #                                          |
      |     #*    ##                                              |
      |    + D##D#     +            +           +            +    |
  100 +-+--+----+------+------------+-----------+------------+--+-+
           1    8      16      Guest CPUs       48           64
  png: https://imgur.com/HwmBHXe

              debian jessie aarch64 bootup+shutdown time

  90 +-+--+-----+-----+------------+------------+------------+--+-+
     |    +     +     +            +            +            +    |
     |         before ***B***                                B    |
  80 +tb lock removal ###D###                              **D  +-+
     |                                                   **###    |
     |                                                 **##       |
  70 +-+                                             ** #       +-+
     |                                             ** ##          |
     |                                           **  #            |
  60 +-+                                       *B  ##           +-+
     |                                       **  ##               |
     |                                    ***  #D                 |
  50 +-+                               ***   ##                 +-+
     |                             * **   ###                     |
     |                           **B*  ###                        |
  40 +-+                     ****  # ##                         +-+
     |                   ****     #D#                             |
     |             ***B**      ###                                |
  30 +-+    B***B**        ####                                 +-+
     |    B *   *     # ###                                       |
     |     B       ###D#                                          |
  20 +-+   D  ##D##                                             +-+
     |      D#                                                    |
     |    +     +     +            +            +            +    |
  10 +-+--+-----+-----+------------+------------+------------+--+-+
          1     8     16      Guest CPUs        48           64
  png: https://imgur.com/iGpGFtv

The gains are high for 4-8 CPUs. Beyond that point, however, unrelated
lock contention significantly hurts scalability.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cpu-exec.c            |  34 +++--------
 accel/tcg/translate-all.c       | 130 ++++++++++++----------------------------
 accel/tcg/translate-all.h       |   3 +-
 docs/devel/multi-thread-tcg.txt |  11 ++--
 exec.c                          |  25 ++++----
 include/exec/cpu-common.h       |   2 +-
 include/exec/exec-all.h         |   4 --
 include/exec/memory-internal.h  |   6 +-
 include/exec/tb-context.h       |   2 -
 linux-user/main.c               |   3 -
 tcg/tcg.h                       |   4 +-
 11 files changed, 73 insertions(+), 151 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 20dad1b..e7a602b 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -210,20 +210,20 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
        We only end up here when an existing TB is too long.  */
     cflags |= MIN(max_cycles, CF_COUNT_MASK);
 
-    tb_lock();
+    mmap_lock();
     tb = tb_gen_code(cpu, orig_tb->pc, orig_tb->cs_base,
                      orig_tb->flags, cflags);
     tb->orig_tb = orig_tb;
-    tb_unlock();
+    mmap_unlock();
 
     /* execute the generated code */
     trace_exec_tb_nocache(tb, tb->pc);
     cpu_tb_exec(cpu, tb);
 
-    tb_lock();
+    mmap_lock();
     tb_phys_invalidate(tb, -1);
+    mmap_unlock();
     tcg_tb_remove(tb);
-    tb_unlock();
 }
 #endif
 
@@ -242,9 +242,7 @@ void cpu_exec_step_atomic(CPUState *cpu)
         tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, cf_mask);
         if (tb == NULL) {
             mmap_lock();
-            tb_lock();
             tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
-            tb_unlock();
             mmap_unlock();
         }
 
@@ -259,15 +257,13 @@ void cpu_exec_step_atomic(CPUState *cpu)
         cpu_tb_exec(cpu, tb);
         cc->cpu_exec_exit(cpu);
     } else {
-        /* We may have exited due to another problem here, so we need
-         * to reset any tb_locks we may have taken but didn't release.
+        /*
          * The mmap_lock is dropped by tb_gen_code if it runs out of
          * memory.
          */
 #ifndef CONFIG_SOFTMMU
         tcg_debug_assert(!have_mmap_lock());
 #endif
-        tb_lock_reset();
         assert_page_collection_locked(false);
     }
 
@@ -396,20 +392,11 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
     TranslationBlock *tb;
     target_ulong cs_base, pc;
     uint32_t flags;
-    bool acquired_tb_lock = false;
 
     tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, cf_mask);
     if (tb == NULL) {
-        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
-         * taken outside tb_lock. As system emulation is currently
-         * single threaded the locks are NOPs.
-         */
         mmap_lock();
-        tb_lock();
-        acquired_tb_lock = true;
-
         tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
-
         mmap_unlock();
         /* We add the TB in the virtual pc hash table for the fast lookup */
         atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
@@ -425,15 +412,8 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
 #endif
     /* See if we can patch the calling TB. */
     if (last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
-        if (!acquired_tb_lock) {
-            tb_lock();
-            acquired_tb_lock = true;
-        }
         tb_add_jump(last_tb, tb_exit, tb);
     }
-    if (acquired_tb_lock) {
-        tb_unlock();
-    }
     return tb;
 }
 
@@ -706,7 +686,9 @@ int cpu_exec(CPUState *cpu)
         g_assert(cc == CPU_GET_CLASS(cpu));
 #endif /* buggy compiler */
         cpu->can_do_io = 1;
-        tb_lock_reset();
+#ifndef CONFIG_SOFTMMU
+        tcg_debug_assert(!have_mmap_lock());
+#endif
         if (qemu_mutex_iothread_locked()) {
             qemu_mutex_unlock_iothread();
         }
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index ee49d03..8b3673c 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -88,13 +88,13 @@
 #endif
 
 /* Access to the various translations structures need to be serialised via locks
- * for consistency. This is automatic for SoftMMU based system
- * emulation due to its single threaded nature. In user-mode emulation
- * access to the memory related structures are protected with the
- * mmap_lock.
+ * for consistency.
+ * In user-mode emulation access to the memory related structures are protected
+ * with mmap_lock.
+ * In !user-mode we use per-page locks.
  */
 #ifdef CONFIG_SOFTMMU
-#define assert_memory_lock() tcg_debug_assert(have_tb_lock)
+#define assert_memory_lock()
 #else
 #define assert_memory_lock() tcg_debug_assert(have_mmap_lock())
 #endif
@@ -219,9 +219,6 @@ __thread TCGContext *tcg_ctx;
 TBContext tb_ctx;
 bool parallel_cpus;
 
-/* translation block context */
-static __thread int have_tb_lock;
-
 static void page_table_config_init(void)
 {
     uint32_t v_l1_bits;
@@ -242,31 +239,6 @@ static void page_table_config_init(void)
     assert(v_l2_levels >= 0);
 }
 
-#define assert_tb_locked() tcg_debug_assert(have_tb_lock)
-#define assert_tb_unlocked() tcg_debug_assert(!have_tb_lock)
-
-void tb_lock(void)
-{
-    assert_tb_unlocked();
-    qemu_mutex_lock(&tb_ctx.tb_lock);
-    have_tb_lock++;
-}
-
-void tb_unlock(void)
-{
-    assert_tb_locked();
-    have_tb_lock--;
-    qemu_mutex_unlock(&tb_ctx.tb_lock);
-}
-
-void tb_lock_reset(void)
-{
-    if (have_tb_lock) {
-        qemu_mutex_unlock(&tb_ctx.tb_lock);
-        have_tb_lock = 0;
-    }
-}
-
 void cpu_gen_init(void)
 {
     tcg_context_init(&tcg_init_ctx);
@@ -432,7 +404,6 @@ bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc)
     check_offset = host_pc - (uintptr_t) tcg_init_ctx.code_gen_buffer;
 
     if (check_offset < tcg_init_ctx.code_gen_buffer_size) {
-        tb_lock();
         tb = tcg_tb_lookup(host_pc);
         if (tb) {
             cpu_restore_state_from_tb(cpu, tb, host_pc);
@@ -443,7 +414,6 @@ bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc)
             }
             r = true;
         }
-        tb_unlock();
     }
 
     return r;
@@ -1054,7 +1024,6 @@ static inline void code_gen_alloc(size_t tb_size)
         fprintf(stderr, "Could not allocate dynamic translator buffer\n");
         exit(1);
     }
-    qemu_mutex_init(&tb_ctx.tb_lock);
 }
 
 static bool tb_cmp(const void *ap, const void *bp)
@@ -1098,14 +1067,12 @@ void tcg_exec_init(unsigned long tb_size)
 /*
  * Allocate a new translation block. Flush the translation buffer if
  * too many translation blocks or too much generated code.
- *
- * Called with tb_lock held.
  */
 static TranslationBlock *tb_alloc(target_ulong pc)
 {
     TranslationBlock *tb;
 
-    assert_tb_locked();
+    assert_memory_lock();
 
     tb = tcg_tb_alloc(tcg_ctx);
     if (unlikely(tb == NULL)) {
@@ -1171,8 +1138,7 @@ static gboolean tb_host_size_iter(gpointer key, gpointer value, gpointer data)
 /* flush all the translation blocks */
 static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
 {
-    tb_lock();
-
+    mmap_lock();
     /* If it is already been done on request of another CPU,
      * just retry.
      */
@@ -1202,7 +1168,7 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     atomic_mb_set(&tb_ctx.tb_flush_count, tb_ctx.tb_flush_count + 1);
 
 done:
-    tb_unlock();
+    mmap_unlock();
 }
 
 void tb_flush(CPUState *cpu)
@@ -1236,7 +1202,7 @@ do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 
 /* verify that all the pages have correct rights for code
  *
- * Called with tb_lock held.
+ * Called with mmap_lock held.
  */
 static void tb_invalidate_check(target_ulong address)
 {
@@ -1266,7 +1232,10 @@ static void tb_page_check(void)
 
 #endif /* CONFIG_USER_ONLY */
 
-/* call with @pd->lock held */
+/*
+ * user-mode: call with mmap_lock held
+ * !user-mode: call with @pd->lock held
+ */
 static inline void tb_page_remove(PageDesc *pd, TranslationBlock *tb)
 {
     TranslationBlock *tb1;
@@ -1359,7 +1328,11 @@ static inline void tb_jmp_unlink(TranslationBlock *dest)
     qemu_spin_unlock(&dest->jmp_lock);
 }
 
-/* If @rm_from_page_list is set, call with the TB's pages' locks held */
+/*
+ * In user-mode, call with mmap_lock held.
+ * In !user-mode, if @rm_from_page_list is set, call with the TB's pages'
+ * locks held.
+ */
 static void do_tb_phys_invalidate(TranslationBlock *tb, bool rm_from_page_list)
 {
     CPUState *cpu;
@@ -1367,7 +1340,7 @@ static void do_tb_phys_invalidate(TranslationBlock *tb, bool rm_from_page_list)
     uint32_t h;
     tb_page_addr_t phys_pc;
 
-    assert_tb_locked();
+    assert_memory_lock();
 
     /* make sure no further incoming jumps will be chained to this TB */
     qemu_spin_lock(&tb->jmp_lock);
@@ -1420,7 +1393,7 @@ static void tb_phys_invalidate__locked(TranslationBlock *tb)
 
 /* invalidate one TB
  *
- * Called with tb_lock held.
+ * Called with mmap_lock held in user-mode.
  */
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 {
@@ -1464,7 +1437,7 @@ static void build_page_bitmap(PageDesc *p)
 /* add the tb in the target page and protect it if necessary
  *
  * Called with mmap_lock held for user-mode emulation.
- * Called with @p->lock held.
+ * Called with @p->lock held in !user-mode.
  */
 static inline void tb_page_add(PageDesc *p, TranslationBlock *tb,
                                unsigned int n, tb_page_addr_t page_addr)
@@ -1744,10 +1717,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     if ((pc & TARGET_PAGE_MASK) != virt_page2) {
         phys_page2 = get_page_addr_code(env, virt_page2);
     }
-    /* As long as consistency of the TB stuff is provided by tb_lock in user
-     * mode and is implicit in single-threaded softmmu emulation, no explicit
-     * memory barrier is required before tb_link_page() makes the TB visible
-     * through the physical hash table and physical page list.
+    /*
+     * No explicit memory barrier is required -- tb_link_page() makes the
+     * TB visible in a consistent state.
      */
     existing_tb = tb_link_page(tb, phys_pc, phys_page2);
     /* if the TB already exists, discard what we just translated */
@@ -1763,8 +1735,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 }
 
 /*
- * Call with all @pages locked.
  * @p must be non-NULL.
+ * user-mode: call with mmap_lock held.
+ * !user-mode: call with all @pages locked.
  */
 static void
 tb_invalidate_phys_page_range__locked(struct page_collection *pages,
@@ -1787,7 +1760,6 @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
 #endif /* TARGET_HAS_PRECISE_SMC */
 
     assert_memory_lock();
-    assert_tb_locked();
 
 #if defined(TARGET_HAS_PRECISE_SMC)
     if (cpu != NULL) {
@@ -1848,6 +1820,7 @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
         page_collection_unlock(pages);
         /* Force execution of one insn next time.  */
         cpu->cflags_next_tb = 1 | curr_cflags();
+        mmap_unlock();
         cpu_loop_exit_noexc(cpu);
     }
 #endif
@@ -1860,8 +1833,7 @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
  * access: the virtual CPU will exit the current TB if code is modified inside
  * this TB.
  *
- * Called with tb_lock/mmap_lock held for user-mode emulation
- * Called with tb_lock held for system-mode emulation
+ * Called with mmap_lock held for user-mode emulation
  */
 void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
                                    int is_cpu_write_access)
@@ -1870,7 +1842,6 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
     PageDesc *p;
 
     assert_memory_lock();
-    assert_tb_locked();
 
     p = page_find(start >> TARGET_PAGE_BITS);
     if (p == NULL) {
@@ -1889,14 +1860,15 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
  * access: the virtual CPU will exit the current TB if code is modified inside
  * this TB.
  *
- * Called with mmap_lock held for user-mode emulation, grabs tb_lock
- * Called with tb_lock held for system-mode emulation
+ * Called with mmap_lock held for user-mode emulation.
  */
-static void tb_invalidate_phys_range_1(tb_page_addr_t start, tb_page_addr_t end)
+void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
 {
     struct page_collection *pages;
     tb_page_addr_t next;
 
+    assert_memory_lock();
+
     pages = page_collection_lock(start, end);
     for (next = (start & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
          start < end;
@@ -1913,29 +1885,15 @@ static void tb_invalidate_phys_range_1(tb_page_addr_t start, tb_page_addr_t end)
 }
 
 #ifdef CONFIG_SOFTMMU
-void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
-{
-    assert_tb_locked();
-    tb_invalidate_phys_range_1(start, end);
-}
-#else
-void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
-{
-    assert_memory_lock();
-    tb_lock();
-    tb_invalidate_phys_range_1(start, end);
-    tb_unlock();
-}
-#endif
-
-#ifdef CONFIG_SOFTMMU
 /* len must be <= 8 and start must be a multiple of len.
  * Called via softmmu_template.h when code areas are written to with
  * iothread mutex not held.
+ *
+ * Call with all @pages in the range [@start, @start + len[ locked.
  */
-void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
+void tb_invalidate_phys_page_fast(struct page_collection *pages,
+                                  tb_page_addr_t start, int len)
 {
-    struct page_collection *pages;
     PageDesc *p;
 
 #if 0
@@ -1954,7 +1912,6 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
         return;
     }
 
-    pages = page_collection_lock(start, start + len);
     if (!p->code_bitmap &&
         ++p->code_write_count >= SMC_BITMAP_USE_THRESHOLD) {
         build_page_bitmap(p);
@@ -1972,7 +1929,6 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
     do_invalidate:
         tb_invalidate_phys_page_range__locked(pages, p, start, start + len, 1);
     }
-    page_collection_unlock(pages);
 }
 #else
 /* Called with mmap_lock held. If pc is not 0 then it indicates the
@@ -2004,7 +1960,6 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
         return false;
     }
 
-    tb_lock();
 #ifdef TARGET_HAS_PRECISE_SMC
     if (p->first_tb && pc != 0) {
         current_tb = tcg_tb_lookup(pc);
@@ -2036,12 +1991,9 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
     if (current_tb_modified) {
         /* Force execution of one insn next time.  */
         cpu->cflags_next_tb = 1 | curr_cflags();
-        /* tb_lock will be reset after cpu_loop_exit_noexc longjmps
-         * back into the cpu_exec loop. */
         return true;
     }
 #endif
-    tb_unlock();
 
     return false;
 }
@@ -2062,18 +2014,18 @@ void tb_invalidate_phys_addr(AddressSpace *as, hwaddr addr)
         return;
     }
     ram_addr = memory_region_get_ram_addr(mr) + addr;
-    tb_lock();
     tb_invalidate_phys_page_range(ram_addr, ram_addr + 1, 0);
-    tb_unlock();
     rcu_read_unlock();
 }
 #endif /* !defined(CONFIG_USER_ONLY) */
 
-/* Called with tb_lock held.  */
+/* user-mode: call with mmap_lock held */
 void tb_check_watchpoint(CPUState *cpu)
 {
     TranslationBlock *tb;
 
+    assert_memory_lock();
+
     tb = tcg_tb_lookup(cpu->mem_io_pc);
     if (tb) {
         /* We can use retranslation to find the PC.  */
@@ -2107,7 +2059,6 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
     TranslationBlock *tb;
     uint32_t n;
 
-    tb_lock();
     tb = tcg_tb_lookup(retaddr);
     if (!tb) {
         cpu_abort(cpu, "cpu_io_recompile: could not find TB for pc=%p",
@@ -2160,9 +2111,6 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
      *  repeating the fault, which is horribly inefficient.
      *  Better would be to execute just this insn uncached, or generate a
      *  second new TB.
-     *
-     * cpu_loop_exit_noexc will longjmp back to cpu_exec where the
-     * tb_lock gets reset.
      */
     cpu_loop_exit_noexc(cpu);
 }
diff --git a/accel/tcg/translate-all.h b/accel/tcg/translate-all.h
index 6d1d258..e6cb963 100644
--- a/accel/tcg/translate-all.h
+++ b/accel/tcg/translate-all.h
@@ -26,7 +26,8 @@
 struct page_collection *page_collection_lock(tb_page_addr_t start,
                                              tb_page_addr_t end);
 void page_collection_unlock(struct page_collection *set);
-void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len);
+void tb_invalidate_phys_page_fast(struct page_collection *pages,
+                                  tb_page_addr_t start, int len);
 void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
                                    int is_cpu_write_access);
 void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end);
diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt
index 36da1f1..e1e002b 100644
--- a/docs/devel/multi-thread-tcg.txt
+++ b/docs/devel/multi-thread-tcg.txt
@@ -61,6 +61,7 @@ have their block-to-block jumps patched.
 Global TCG State
 ----------------
 
+### User-mode emulation
 We need to protect the entire code generation cycle including any post
 generation patching of the translated code. This also implies a shared
 translation buffer which contains code running on all cores. Any
@@ -75,9 +76,11 @@ patching.
 
 (Current solution)
 
-Mainly as part of the linux-user work all code generation is
-serialised with a tb_lock(). For the SoftMMU tb_lock() also takes the
-place of mmap_lock() in linux-user.
+Code generation is serialised with mmap_lock().
+
+### !User-mode emulation
+Each vCPU has its own TCG context and associated TCG region, thereby
+requiring no locking.
 
 Translation Blocks
 ------------------
@@ -192,7 +195,7 @@ work as "safe work" and exiting the cpu run loop. This ensure by the
 time execution restarts all flush operations have completed.
 
 TLB flag updates are all done atomically and are also protected by the
-tb_lock() which is used by the functions that update the TLB in bulk.
+corresponding page lock.
 
 (Known limitation)
 
diff --git a/exec.c b/exec.c
index 4d8addb..ed6ef05 100644
--- a/exec.c
+++ b/exec.c
@@ -821,9 +821,7 @@ void cpu_exec_realizefn(CPUState *cpu, Error **errp)
 static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
 {
     mmap_lock();
-    tb_lock();
     tb_invalidate_phys_page_range(pc, pc + 1, 0);
-    tb_unlock();
     mmap_unlock();
 }
 #else
@@ -2406,21 +2404,20 @@ void memory_notdirty_write_prepare(NotDirtyInfo *ndi,
     ndi->ram_addr = ram_addr;
     ndi->mem_vaddr = mem_vaddr;
     ndi->size = size;
-    ndi->locked = false;
+    ndi->pages = NULL;
 
     assert(tcg_enabled());
     if (!cpu_physical_memory_get_dirty_flag(ram_addr, DIRTY_MEMORY_CODE)) {
-        ndi->locked = true;
-        tb_lock();
-        tb_invalidate_phys_page_fast(ram_addr, size);
+        ndi->pages = page_collection_lock(ram_addr, ram_addr + size);
+        tb_invalidate_phys_page_fast(ndi->pages, ram_addr, size);
     }
 }
 
 /* Called within RCU critical section. */
 void memory_notdirty_write_complete(NotDirtyInfo *ndi)
 {
-    if (ndi->locked) {
-        tb_unlock();
+    if (ndi->pages) {
+        page_collection_unlock(ndi->pages);
     }
 
     /* Set both VGA and migration bits for simplicity and to remove
@@ -2521,18 +2518,16 @@ static void check_watchpoint(int offset, int len, MemTxAttrs attrs, int flags)
                 }
                 cpu->watchpoint_hit = wp;
 
-                /* Both tb_lock and iothread_mutex will be reset when
-                 * cpu_loop_exit or cpu_loop_exit_noexc longjmp
-                 * back into the cpu_exec main loop.
-                 */
-                tb_lock();
+                mmap_lock();
                 tb_check_watchpoint(cpu);
                 if (wp->flags & BP_STOP_BEFORE_ACCESS) {
                     cpu->exception_index = EXCP_DEBUG;
+                    mmap_unlock();
                     cpu_loop_exit(cpu);
                 } else {
                     /* Force execution of one insn next time.  */
                     cpu->cflags_next_tb = 1 | curr_cflags();
+                    mmap_unlock();
                     cpu_loop_exit_noexc(cpu);
                 }
             }
@@ -2947,9 +2942,9 @@ static void invalidate_and_set_dirty(MemoryRegion *mr, hwaddr addr,
     }
     if (dirty_log_mask & (1 << DIRTY_MEMORY_CODE)) {
         assert(tcg_enabled());
-        tb_lock();
+        mmap_lock();
         tb_invalidate_phys_range(addr, addr + length);
-        tb_unlock();
+        mmap_unlock();
         dirty_log_mask &= ~(1 << DIRTY_MEMORY_CODE);
     }
     cpu_physical_memory_set_dirty_range(addr, length, dirty_log_mask);
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 74341b1..ae25cc3 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -23,7 +23,7 @@ typedef struct CPUListState {
     FILE *file;
 } CPUListState;
 
-/* The CPU list lock nests outside tb_lock/tb_unlock.  */
+/* The CPU list lock nests outside page_(un)lock or mmap_(un)lock */
 void qemu_init_cpu_list(void);
 void cpu_list_lock(void);
 void cpu_list_unlock(void);
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index d69b853..eac027c 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -436,10 +436,6 @@ extern uintptr_t tci_tb_ptr;
    smaller than 4 bytes, so we don't worry about special-casing this.  */
 #define GETPC_ADJ   2
 
-void tb_lock(void);
-void tb_unlock(void);
-void tb_lock_reset(void);
-
 #if !defined(CONFIG_USER_ONLY) && defined(CONFIG_DEBUG_TCG)
 void assert_page_collection_locked(bool val);
 #else
diff --git a/include/exec/memory-internal.h b/include/exec/memory-internal.h
index 4162474..9a48974 100644
--- a/include/exec/memory-internal.h
+++ b/include/exec/memory-internal.h
@@ -40,6 +40,8 @@ void mtree_print_dispatch(fprintf_function mon, void *f,
                           struct AddressSpaceDispatch *d,
                           MemoryRegion *root);
 
+struct page_collection;
+
 /* Opaque struct for passing info from memory_notdirty_write_prepare()
  * to memory_notdirty_write_complete(). Callers should treat all fields
  * as private, with the exception of @active.
@@ -51,10 +53,10 @@ void mtree_print_dispatch(fprintf_function mon, void *f,
  */
 typedef struct {
     CPUState *cpu;
+    struct page_collection *pages;
     ram_addr_t ram_addr;
     vaddr mem_vaddr;
     unsigned size;
-    bool locked;
     bool active;
 } NotDirtyInfo;
 
@@ -82,7 +84,7 @@ typedef struct {
  *
  * This must only be called if we are using TCG; it will assert otherwise.
  *
- * We may take a lock in the prepare call, so callers must ensure that
+ * We may take locks in the prepare call, so callers must ensure that
  * they don't exit (via longjump or otherwise) without calling complete.
  *
  * This call must only be made inside an RCU critical section.
diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
index 8c9b49c..feb585e 100644
--- a/include/exec/tb-context.h
+++ b/include/exec/tb-context.h
@@ -32,8 +32,6 @@ typedef struct TBContext TBContext;
 struct TBContext {
 
     struct qht htable;
-    /* any access to the tbs or the page table must use this lock */
-    QemuMutex tb_lock;
 
     /* statistics */
     unsigned tb_flush_count;
diff --git a/linux-user/main.c b/linux-user/main.c
index fd79006..d4379fe 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -129,7 +129,6 @@ void fork_start(void)
 {
     start_exclusive();
     mmap_fork_start();
-    qemu_mutex_lock(&tb_ctx.tb_lock);
     cpu_list_lock();
 }
 
@@ -145,14 +144,12 @@ void fork_end(int child)
                 QTAILQ_REMOVE(&cpus, cpu, node);
             }
         }
-        qemu_mutex_init(&tb_ctx.tb_lock);
         qemu_init_cpu_list();
         gdbserver_fork(thread_cpu);
         /* qemu_init_cpu_list() takes care of reinitializing the
          * exclusive state, so we don't need to end_exclusive() here.
          */
     } else {
-        qemu_mutex_unlock(&tb_ctx.tb_lock);
         cpu_list_unlock();
         end_exclusive();
     }
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 9dd9448..c411bf5 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -841,7 +841,7 @@ static inline bool tcg_op_buf_full(void)
 
 /* pool based memory allocation */
 
-/* user-mode: tb_lock must be held for tcg_malloc_internal. */
+/* user-mode: mmap_lock must be held for tcg_malloc_internal. */
 void *tcg_malloc_internal(TCGContext *s, int size);
 void tcg_pool_reset(TCGContext *s);
 TranslationBlock *tcg_tb_alloc(TCGContext *s);
@@ -859,7 +859,7 @@ TranslationBlock *tcg_tb_lookup(uintptr_t tc_ptr);
 void tcg_tb_foreach(GTraverseFunc func, gpointer user_data);
 size_t tcg_nb_tbs(void);
 
-/* user-mode: Called with tb_lock held.  */
+/* user-mode: Called with mmap_lock held.  */
 static inline void *tcg_malloc(int size)
 {
     TCGContext *s = tcg_ctx;
-- 
2.7.4

* Re: [Qemu-devel] [PATCH 13/16] translate-all: protect TB jumps with a per-destination-TB lock
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 13/16] translate-all: protect TB jumps with a per-destination-TB lock Emilio G. Cota
@ 2018-02-27 11:33   ` Paolo Bonzini
  2018-02-27 11:43     ` Laurent Desnogues
  2018-03-28 15:57   ` Alex Bennée
  1 sibling, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2018-02-27 11:33 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Richard Henderson

On 27/02/2018 06:39, Emilio G. Cota wrote:
> Using a hash table or a binary tree to keep track of the jumps
> doesn't really pay off, not only due to the increased memory usage,
> but also because most TBs have only 0 or 1 jumps to them. The maximum
> number of jumps when booting debian-arm that I measured is 35, but
> as we can see in the histogram below a TB with that many incoming jumps
> is extremely rare; the average TB has 0.80 incoming jumps.
> 
> n_jumps: 379208; avg jumps/tb: 0.801099
> dist: [0.0,1.0)|▄█▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁ ▁▁▁  ▁▁▁     ▁|[34.0,35.0]

This makes sense, for example:

   while(...) {
   }

2 basic blocks, 0 and 1 incoming jumps (avg 0.5)

   if(...) {
   }

2 basic blocks, 0 and 1 incoming jumps (avg 0.5)

   if(...) {
   } else {
   }

3 basic blocks, 0, 1 and 1 incoming jumps (avg 0.66)

So 0.8 is actually a lot. :)  The long tail is probably for switch
statements.
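
The averages above can be checked with a small sketch. It is only an
illustration: avg_incoming_jumps() is a hypothetical helper, not QEMU code,
and the arrays simply encode the per-block counts listed above.

```c
#include <assert.h>
#include <stddef.h>

/* Mean number of incoming jumps over n basic blocks (hypothetical helper). */
static double avg_incoming_jumps(const int *incoming, size_t n)
{
    long total = 0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        total += incoming[i];
    }
    return (double)total / (double)n;
}
```

Feeding it {0, 1} gives 0.5 (the while and if cases), and {0, 1, 1} gives
roughly 0.66 (the if/else case).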

Paolo

* Re: [Qemu-devel] [PATCH 13/16] translate-all: protect TB jumps with a per-destination-TB lock
  2018-02-27 11:33   ` Paolo Bonzini
@ 2018-02-27 11:43     ` Laurent Desnogues
  2018-02-27 14:31       ` Paolo Bonzini
  0 siblings, 1 reply; 52+ messages in thread
From: Laurent Desnogues @ 2018-02-27 11:43 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Emilio G. Cota, qemu-devel, Richard Henderson

On Tue, Feb 27, 2018 at 12:33 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 27/02/2018 06:39, Emilio G. Cota wrote:
>> Using a hash table or a binary tree to keep track of the jumps
>> doesn't really pay off, not only due to the increased memory usage,
>> but also because most TBs have only 0 or 1 jumps to them. The maximum
>> number of jumps when booting debian-arm that I measured is 35, but
>> as we can see in the histogram below a TB with that many incoming jumps
>> is extremely rare; the average TB has 0.80 incoming jumps.
>>
>> n_jumps: 379208; avg jumps/tb: 0.801099
>> dist: [0.0,1.0)|▄█▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁ ▁▁▁  ▁▁▁     ▁|[34.0,35.0]
>
> This makes sense, for example:
>
>    while(...) {
>    }
>
> 2 basic blocks, 0 and 1 incoming jumps (avg 0.5)
>
>    if(...) {
>    }
>
> 2 basic blocks, 0 and 1 incoming jumps (avg 0.5)
>
>    if(...) {
>    } else {
>    }
>
> 3 basic blocks, 0, 1 and 1 incoming jumps (avg 0.66)
>
> So 0.8 is actually a lot. :)  The long tail is probably for switch
> statements.

And calls too :-)


Laurent

> Paolo
>

* Re: [Qemu-devel] [PATCH 13/16] translate-all: protect TB jumps with a per-destination-TB lock
  2018-02-27 11:43     ` Laurent Desnogues
@ 2018-02-27 14:31       ` Paolo Bonzini
  0 siblings, 0 replies; 52+ messages in thread
From: Paolo Bonzini @ 2018-02-27 14:31 UTC (permalink / raw)
  To: Laurent Desnogues; +Cc: Emilio G. Cota, qemu-devel, Richard Henderson

On 27/02/2018 12:43, Laurent Desnogues wrote:
> On Tue, Feb 27, 2018 at 12:33 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 27/02/2018 06:39, Emilio G. Cota wrote:
>>> Using a hash table or a binary tree to keep track of the jumps
>>> doesn't really pay off, not only due to the increased memory usage,
>>> but also because most TBs have only 0 or 1 jumps to them. The maximum
>>> number of jumps when booting debian-arm that I measured is 35, but
>>> as we can see in the histogram below a TB with that many incoming jumps
>>> is extremely rare; the average TB has 0.80 incoming jumps.
>>>
>>> n_jumps: 379208; avg jumps/tb: 0.801099
>>> dist: [0.0,1.0)|▄█▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁ ▁▁▁  ▁▁▁     ▁|[34.0,35.0]
>>
>> This makes sense, for example:
>>
>>    while(...) {
>>    }
>>
>> 2 basic blocks, 0 and 1 incoming jumps (avg 0.5)
>>
>>    if(...) {
>>    }
>>
>> 2 basic blocks, 0 and 1 incoming jumps (avg 0.5)
>>
>>    if(...) {
>>    } else {
>>    }
>>
>> 3 basic blocks, 0, 1 and 1 incoming jumps (avg 0.66)
>>
>> So 0.8 is actually a lot. :)  The long tail is probably for switch
>> statements.
> 
> And calls too :-)

Jumps are only chained within the same page, so probably only the
smaller buckets can be for calls.

Paolo

* Re: [Qemu-devel] [PATCH 01/16] qht: require a default comparison function
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 01/16] qht: require a default comparison function Emilio G. Cota
@ 2018-02-28 19:02   ` Richard Henderson
  2018-03-28 16:21   ` Alex Bennée
  1 sibling, 0 replies; 52+ messages in thread
From: Richard Henderson @ 2018-02-28 19:02 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Paolo Bonzini

On 02/26/2018 09:39 PM, Emilio G. Cota wrote:
> qht_lookup now uses the default cmp function. qht_lookup_custom is defined
> to retain the old behaviour, that is a cmp function is explicitly provided.
> 
> qht_insert will gain use of the default cmp in the next patch.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/cpu-exec.c      |  4 ++--
>  accel/tcg/translate-all.c | 16 +++++++++++++++-
>  include/qemu/qht.h        | 23 +++++++++++++++++++----
>  tests/qht-bench.c         | 14 +++++++-------
>  tests/test-qht.c          | 15 ++++++++++-----
>  util/qht.c                | 14 +++++++++++---
>  6 files changed, 64 insertions(+), 22 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

* Re: [Qemu-devel] [PATCH 02/16] qht: return existing entry when qht_insert fails
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 02/16] qht: return existing entry when qht_insert fails Emilio G. Cota
@ 2018-02-28 19:10   ` Richard Henderson
  2018-03-28 16:33   ` Alex Bennée
  1 sibling, 0 replies; 52+ messages in thread
From: Richard Henderson @ 2018-02-28 19:10 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Paolo Bonzini

On 02/26/2018 09:39 PM, Emilio G. Cota wrote:
> The meaning of "existing" is now changed to "matches in hash and
> ht->cmp result". This is saner than just checking the pointer value.
> 
> Note that we now return NULL on insertion success, or the existing
> pointer on failure. We can do this because NULL pointers are not
> allowed to be inserted in QHT.
> 
> Suggested-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/qht.h |  7 ++++---
>  tests/qht-bench.c  |  4 ++--
>  tests/test-qht.c   |  5 ++++-
>  util/qht.c         | 17 +++++++++--------
>  4 files changed, 19 insertions(+), 14 deletions(-)
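
The return convention described in the commit message (NULL on success, the
existing pointer on failure) can be sketched with a toy table. Everything
named toy_* below is hypothetical and is not the QHT implementation: the toy
uses fixed-size linear probing with no resizing or removal, and it asserts if
the table fills up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS 8

typedef bool (*toy_cmp_fn)(const void *a, const void *b);

struct toy_entry {
    void *p;
    uint32_t hash;
};

struct toy_table {
    toy_cmp_fn cmp;                 /* default comparison function */
    struct toy_entry slots[TOY_SLOTS];
};

static void toy_init(struct toy_table *t, toy_cmp_fn cmp)
{
    t->cmp = cmp;
    for (size_t i = 0; i < TOY_SLOTS; i++) {
        t->slots[i].p = NULL;
        t->slots[i].hash = 0;
    }
}

/*
 * Returns NULL if @p was inserted, or the already-present pointer that
 * matches @p in both hash and cmp result. NULL can signal success only
 * because NULL pointers may not be inserted.
 */
static void *toy_insert(struct toy_table *t, void *p, uint32_t hash)
{
    assert(p != NULL);
    for (size_t i = 0; i < TOY_SLOTS; i++) {
        struct toy_entry *e = &t->slots[(hash + i) % TOY_SLOTS];

        if (e->p == NULL) {
            e->p = p;
            e->hash = hash;
            return NULL;
        }
        if (e->hash == hash && t->cmp(e->p, p)) {
            return e->p;
        }
    }
    assert(!"toy table is full");
    return NULL;
}

/* Example comparison: entries are equal if the ints they point to match. */
static bool toy_int_eq(const void *a, const void *b)
{
    return *(const int *)a == *(const int *)b;
}
```

With this convention a caller that loses an insertion race gets the winner's
entry back directly, without a second lookup.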

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

* Re: [Qemu-devel] [PATCH 03/16] tcg: track TBs with per-region BST's
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 03/16] tcg: track TBs with per-region BST's Emilio G. Cota
@ 2018-02-28 20:53   ` Richard Henderson
  2018-03-29  9:54   ` Alex Bennée
  1 sibling, 0 replies; 52+ messages in thread
From: Richard Henderson @ 2018-02-28 20:53 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Paolo Bonzini

On 02/26/2018 09:39 PM, Emilio G. Cota wrote:
> This paves the way for enabling scalable parallel generation of TCG code.
> 
> Instead of tracking TBs with a single binary search tree (BST), use a
> BST for each TCG region, protecting it with a lock. This is as scalable
> as it gets, since each TCG thread operates on a separate region.
> 
> The core of this change is the introduction of struct tcg_region_tree,
> which contains a pointer to a GTree and an associated lock to serialize
> accesses to it. We then allocate an array of tcg_region_tree's, adding
> the appropriate padding to avoid false sharing based on
> qemu_dcache_linesize.
> 
> Given a tc_ptr, we first find the corresponding region_tree. This
> is done by special-casing the first and last regions first, since they
> might be of size != region.size; otherwise we just divide the offset
> by region.stride. I was worried about this division (several dozen
> cycles of latency), but profiling shows that this is not a fast path.
> Note that region.stride is not required to be a power of two; it
> is only required to be a multiple of the host's page size.
> 
> Note that with this design we can also provide consistent snapshots
> about all region trees at once; for instance, tcg_tb_foreach
> acquires/releases all region_tree locks before/after iterating over them.
> For this reason we now drop tb_lock in dump_exec_info().
> 
> As an alternative I considered implementing a concurrent BST, but this
> can be tricky to get right, offers no consistent snapshots of the BST,
> and performance and scalability-wise I don't think it could ever beat
> having separate GTrees, given that our workload is insert-mostly (all
> concurrent BST designs I've seen focus, understandably, on making
> lookups fast, which comes at the expense of convoluted, non-wait-free
> insertions/removals).
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/cpu-exec.c      |   2 +-
>  accel/tcg/translate-all.c | 101 ++++--------------------
>  include/exec/exec-all.h   |   1 -
>  include/exec/tb-context.h |   1 -
>  tcg/tcg.c                 | 191 ++++++++++++++++++++++++++++++++++++++++++++++
>  tcg/tcg.h                 |   6 ++
>  6 files changed, 213 insertions(+), 89 deletions(-)
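
A minimal sketch of the tc_ptr-to-region mapping described above, under
stated assumptions: struct toy_regions and toy_region_index() are
hypothetical names, the first region is assumed to start before the aligned
base, and the last region is assumed to absorb any tail. The patch itself may
structure this differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: n regions carved out of [start, end). */
struct toy_regions {
    uintptr_t start;          /* start of the whole buffer */
    uintptr_t start_aligned;  /* where regular striding begins */
    uintptr_t end;            /* one past the last usable byte */
    size_t stride;            /* need not be a power of two */
    size_t n;                 /* number of regions */
};

/* Map a code pointer to the index of the region (and so the tree) it is in. */
static size_t toy_region_index(const struct toy_regions *r, uintptr_t p)
{
    size_t idx;

    assert(p >= r->start && p < r->end);
    if (p < r->start_aligned) {
        return 0;             /* first region may be larger than a stride */
    }
    idx = (p - r->start_aligned) / r->stride;
    if (idx >= r->n) {
        idx = r->n - 1;       /* last region absorbs the tail */
    }
    return idx;
}
```

Each index would then select one tree plus its lock, so threads working in
different regions never contend on the same lock.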

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

* Re: [Qemu-devel] [PATCH 04/16] tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 04/16] tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx Emilio G. Cota
@ 2018-02-28 20:55   ` Richard Henderson
  2018-03-29 10:06   ` Alex Bennée
  1 sibling, 0 replies; 52+ messages in thread
From: Richard Henderson @ 2018-02-28 20:55 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Paolo Bonzini

On 02/26/2018 09:39 PM, Emilio G. Cota wrote:
> Thereby making it per-TCGContext. Once we remove tb_lock, this will
> avoid an atomic increment every time a TB is invalidated.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/translate-all.c |  5 +++--
>  include/exec/tb-context.h |  1 -
>  tcg/tcg.c                 | 14 ++++++++++++++
>  tcg/tcg.h                 |  3 +++
>  4 files changed, 20 insertions(+), 3 deletions(-)
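
The idea of trading one shared counter for per-context counters can be
sketched as follows. This is an illustration in C11 atomics, not the patch:
toy_ctx and the toy_* functions are hypothetical, and the sketch assumes each
counter has exactly one writer (its owning thread), so a relaxed load plus a
relaxed store is enough and no atomic read-modify-write is needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_N_CTX 4

/* One counter per context; only the owning thread ever writes it. */
struct toy_ctx {
    atomic_size_t inval_count;
};

static struct toy_ctx toy_ctxs[TOY_N_CTX];

/* Owner-only update: plain load and store, no atomic increment. */
static void toy_count_invalidation(struct toy_ctx *c)
{
    size_t v = atomic_load_explicit(&c->inval_count, memory_order_relaxed);

    atomic_store_explicit(&c->inval_count, v + 1, memory_order_relaxed);
}

/* Readers (e.g. a statistics dump) sum all per-context counters. */
static size_t toy_total_invalidations(void)
{
    size_t total = 0;

    for (size_t i = 0; i < TOY_N_CTX; i++) {
        total += atomic_load_explicit(&toy_ctxs[i].inval_count,
                                      memory_order_relaxed);
    }
    return total;
}
```

The total read this way is only approximate while writers are running, which
is usually acceptable for a statistic.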

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

* Re: [Qemu-devel] [PATCH 05/16] translate-all: iterate over TBs in a page with PAGE_FOR_EACH_TB
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 05/16] translate-all: iterate over TBs in a page with PAGE_FOR_EACH_TB Emilio G. Cota
@ 2018-02-28 21:40   ` Richard Henderson
  2018-02-28 22:50     ` Emilio G. Cota
  0 siblings, 1 reply; 52+ messages in thread
From: Richard Henderson @ 2018-02-28 21:40 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Paolo Bonzini

On 02/26/2018 09:39 PM, Emilio G. Cota wrote:
> +/* list iterators for lists of tagged pointers in TranslationBlock */
> +#define TB_FOR_EACH_TAGGED(head, tb, n, field)                  \
> +    for (n = (head) & 1,                                        \
> +             tb = (TranslationBlock *)((head) & ~1);            \
> +         tb;                                                    \
> +         tb = (TranslationBlock *)tb->field[n],                 \
> +             n = (uintptr_t)tb & 1,                             \
> +             tb = (TranslationBlock *)((uintptr_t)tb & ~1))
> +
> +#define PAGE_FOR_EACH_TB(pagedesc, tb, n)                       \
> +    TB_FOR_EACH_TAGGED((pagedesc)->first_tb, tb, n, page_next)
> +
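
For context, a self-contained sketch of the tagged-pointer walk that the
quoted macro performs. It assumes the low bit of each link selects which of
the block's two next-pointers to follow; struct toy_tb and the toy_* helpers
are hypothetical stand-ins, not the QEMU definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block: may sit on two lists, one link slot per list. */
struct toy_tb {
    int id;
    uintptr_t page_next[2];   /* tagged links; 0 terminates a list */
};

/* Pack a block pointer and the slot index to follow next into one word. */
static uintptr_t toy_tag(struct toy_tb *tb, unsigned n)
{
    assert(n < 2);
    return (uintptr_t)tb | n;
}

/* Walk one list starting at @head and sum the ids of the blocks on it. */
static int toy_sum_ids(uintptr_t head)
{
    int sum = 0;
    uintptr_t link = head;

    while (link & ~(uintptr_t)1) {
        unsigned n = link & 1;                            /* slot to follow */
        struct toy_tb *tb = (struct toy_tb *)(link & ~(uintptr_t)1);

        sum += tb->id;
        link = tb->page_next[n];
    }
    return sum;
}
```

The tag works because the struct is at least 2-byte aligned, so the low bit
of a real pointer is always zero.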

I'm not sure I like the generalization of TB_FOR_EACH_TAGGED.  Do you use it
for anything besides PAGE_FOR_EACH_TB?

Weird indentation in the clauses.

Otherwise,
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

* Re: [Qemu-devel] [PATCH 06/16] translate-all: make l1_map lockless
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 06/16] translate-all: make l1_map lockless Emilio G. Cota
@ 2018-02-28 22:15   ` Richard Henderson
  2018-03-29 10:16   ` Alex Bennée
  1 sibling, 0 replies; 52+ messages in thread
From: Richard Henderson @ 2018-02-28 22:15 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Paolo Bonzini

On 02/26/2018 09:39 PM, Emilio G. Cota wrote:
> Groundwork for supporting parallel TCG generation.
> 
> We never remove entries from the radix tree, so we can use cmpxchg
> to implement lockless insertions.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/translate-all.c       | 24 ++++++++++++++----------
>  docs/devel/multi-thread-tcg.txt |  4 ++--
>  2 files changed, 16 insertions(+), 12 deletions(-)
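
A minimal sketch of insert-only, lockless slot population with
compare-and-swap, written with C11 atomics rather than QEMU's atomic helpers.
The names toy_node and toy_get_or_alloc() are hypothetical. The scheme relies
on the property stated above: entries are never removed, so a published
pointer stays valid.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct toy_node {
    int payload;
};

/*
 * Return the node stored in @slot, allocating and publishing one if the
 * slot is still empty. If two threads race, exactly one cmpxchg succeeds;
 * the loser frees its allocation and uses the winner's node.
 */
static struct toy_node *toy_get_or_alloc(_Atomic(struct toy_node *) *slot)
{
    struct toy_node *n = atomic_load_explicit(slot, memory_order_acquire);

    if (n == NULL) {
        struct toy_node *expected = NULL;
        struct toy_node *fresh = calloc(1, sizeof(*fresh));

        if (fresh == NULL) {
            abort();
        }
        if (atomic_compare_exchange_strong(slot, &expected, fresh)) {
            n = fresh;
        } else {
            free(fresh);      /* someone else published first */
            n = expected;     /* cmpxchg stored the current value here */
        }
    }
    return n;
}
```

Because nothing is ever unlinked, readers need no lock and no deferred
reclamation to dereference what they load.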

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

* Re: [Qemu-devel] [PATCH 07/16] translate-all: remove hole in PageDesc
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 07/16] translate-all: remove hole in PageDesc Emilio G. Cota
@ 2018-02-28 22:17   ` Richard Henderson
  2018-03-29 10:17   ` Alex Bennée
  1 sibling, 0 replies; 52+ messages in thread
From: Richard Henderson @ 2018-02-28 22:17 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Paolo Bonzini

On 02/26/2018 09:39 PM, Emilio G. Cota wrote:
> Groundwork for supporting parallel TCG generation.
> 
> Move the hole to the end of the struct, so that a u32
> field can be added there without bloating the struct.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/translate-all.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

* Re: [Qemu-devel] [PATCH 08/16] translate-all: work page-by-page in tb_invalidate_phys_range_1
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 08/16] translate-all: work page-by-page in tb_invalidate_phys_range_1 Emilio G. Cota
@ 2018-02-28 22:23   ` Richard Henderson
  2018-03-29 10:10   ` Alex Bennée
  2018-03-29 10:17   ` Alex Bennée
  2 siblings, 0 replies; 52+ messages in thread
From: Richard Henderson @ 2018-02-28 22:23 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Paolo Bonzini

On 02/26/2018 09:39 PM, Emilio G. Cota wrote:
> So that we pass a same-page range to tb_invalidate_phys_page_range,
> instead of always passing an end address that could be on a different
> page.
> 
> As discussed with Peter Maydell on the list [1], tb_invalidate_phys_page_range
> doesn't actually do much with 'end', which explains why we have never
> hit a bug despite going against what the comment on top of
> tb_invalidate_phys_page_range requires:
> 
>> * Invalidate all TBs which intersect with the target physical address range
>> * [start;end[. NOTE: start and end must refer to the *same* physical page.
> 
> The appended patch honours the comment, which avoids confusion.
> 
> While at it, rework the loop into a for loop, which is less error prone
> (e.g. "continue" won't result in an infinite loop).
> 
> [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg09165.html
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~
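
For readers following the commit message above, here is a minimal,
self-contained sketch of the "work page-by-page" for loop it describes.
It is not the patch's code: the 4 KiB page size, the SKETCH_* names and
the callback type are assumptions made for illustration, and 'end' is
treated as exclusive, as in the quoted [start;end[ comment. Overflow at
the top of the address space is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical page geometry; QEMU derives the real values per target. */
#define SKETCH_PAGE_BITS 12
#define SKETCH_PAGE_SIZE ((uint64_t)1 << SKETCH_PAGE_BITS)
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

/* Stand-in for the per-page invalidation routine. */
typedef void (*sketch_page_fn)(uint64_t start, uint64_t end, void *opaque);

/*
 * Split [start, end) into chunks that never cross a page boundary and
 * hand each chunk to @fn. Returns the number of chunks visited.
 */
static unsigned sketch_for_each_page(uint64_t start, uint64_t end,
                                     sketch_page_fn fn, void *opaque)
{
    unsigned n = 0;
    uint64_t next;

    for (next = (start & SKETCH_PAGE_MASK) + SKETCH_PAGE_SIZE;
         start < end;
         start = next, next += SKETCH_PAGE_SIZE) {
        /* Clamp the chunk to the end of the requested range. */
        uint64_t bound = next < end ? next : end;

        fn(start, bound, opaque);
        n++;
    }
    return n;
}

/* Example callback: clears *opaque if a chunk spans two pages. */
static void sketch_check_same_page(uint64_t start, uint64_t end, void *opaque)
{
    int *ok = opaque;

    if ((start & SKETCH_PAGE_MASK) != ((end - 1) & SKETCH_PAGE_MASK)) {
        *ok = 0;
    }
}
```

Because the increment lives in the for header, a "continue" in the body
still advances to the next page, which is the property the commit
message points out.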


* Re: [Qemu-devel] [PATCH 09/16] translate-all: move tb_invalidate_phys_page_range up in the file
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 09/16] translate-all: move tb_invalidate_phys_page_range up in the file Emilio G. Cota
@ 2018-02-28 22:24   ` Richard Henderson
  2018-03-29 10:08   ` Alex Bennée
  1 sibling, 0 replies; 52+ messages in thread
From: Richard Henderson @ 2018-02-28 22:24 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Paolo Bonzini

On 02/26/2018 09:39 PM, Emilio G. Cota wrote:
> This greatly simplifies next commit's diff.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/translate-all.c | 77 ++++++++++++++++++++++++-----------------------
>  1 file changed, 39 insertions(+), 38 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~


* Re: [Qemu-devel] [PATCH 05/16] translate-all: iterate over TBs in a page with PAGE_FOR_EACH_TB
  2018-02-28 21:40   ` Richard Henderson
@ 2018-02-28 22:50     ` Emilio G. Cota
  2018-02-28 22:53       ` Richard Henderson
  0 siblings, 1 reply; 52+ messages in thread
From: Emilio G. Cota @ 2018-02-28 22:50 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, Paolo Bonzini

On Wed, Feb 28, 2018 at 13:40:15 -0800, Richard Henderson wrote:
> On 02/26/2018 09:39 PM, Emilio G. Cota wrote:
> > +/* list iterators for lists of tagged pointers in TranslationBlock */
> > +#define TB_FOR_EACH_TAGGED(head, tb, n, field)                  \
> > +    for (n = (head) & 1,                                        \
> > +             tb = (TranslationBlock *)((head) & ~1);            \
> > +         tb;                                                    \
> > +         tb = (TranslationBlock *)tb->field[n],                 \
> > +             n = (uintptr_t)tb & 1,                             \
> > +             tb = (TranslationBlock *)((uintptr_t)tb & ~1))
> > +
> > +#define PAGE_FOR_EACH_TB(pagedesc, tb, n)                       \
> > +    TB_FOR_EACH_TAGGED((pagedesc)->first_tb, tb, n, page_next)
> > +
> 
> I'm not sure I like the generalization of TB_FOR_EACH_TAGGED.  Do you use it
> for anything besides PAGE_FOR_EACH_TB?

Yes, see patch 13. I've added the following comment to the commit log:
 - Introduce the TB_FOR_EACH_TAGGED macro, and use it to define
   PAGE_FOR_EACH_TB, which improves readability. Note that
   TB_FOR_EACH_TAGGED will gain another user in a subsequent patch.

> Weird indentation in the clauses.

Is this any better?

#define TB_FOR_EACH_TAGGED(head, tb, n, field)                          \
    for (n = (head) & 1, tb = (TranslationBlock *)((head) & ~1);        \
         tb; tb = (TranslationBlock *)tb->field[n], n = (uintptr_t)tb & 1, \
             tb = (TranslationBlock *)((uintptr_t)tb & ~1))

> Otherwise,
> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

Thanks,

		Emilio
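
As a standalone illustration of what the macro above iterates over, the
sketch below builds a short list of tagged pointers and walks it. The
struct, the sketch_* names and the helper functions are hypothetical;
only the shape of the for loop follows the macro quoted in this message.
It assumes nodes are at least 2-byte aligned so that bit 0 is free to
carry the slot index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node with two "next" slots, like a TB spanning two pages. */
struct sketch_node {
    int id;
    uintptr_t next[2];   /* each entry: pointer to next node | slot index */
};

/* Pack a node pointer and a slot index (0 or 1) into one word. */
static uintptr_t sketch_tag(struct sketch_node *node, unsigned slot)
{
    return (uintptr_t)node | (slot & 1);
}

/* Same loop shape as the macro under review, on the sketch type. */
#define SKETCH_FOR_EACH_TAGGED(head, node, n, field)                    \
    for (n = (head) & 1,                                                \
         node = (struct sketch_node *)((head) & ~(uintptr_t)1);         \
         node;                                                          \
         node = (struct sketch_node *)node->field[n],                   \
         n = (uintptr_t)node & 1,                                       \
         node = (struct sketch_node *)((uintptr_t)node & ~(uintptr_t)1))

/* Sum the ids reachable from @head and count the visited nodes. */
static int sketch_sum_ids(uintptr_t head, unsigned *count)
{
    struct sketch_node *node;
    unsigned n;
    int sum = 0;

    *count = 0;
    SKETCH_FOR_EACH_TAGGED(head, node, n, next) {
        sum += node->id;
        (*count)++;
    }
    return sum;
}
```

The tag tells the iterator which of the two next[] slots to follow in
the node it has just reached, so one node can sit on two lists at once
without extra bookkeeping.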


* Re: [Qemu-devel] [PATCH 05/16] translate-all: iterate over TBs in a page with PAGE_FOR_EACH_TB
  2018-02-28 22:50     ` Emilio G. Cota
@ 2018-02-28 22:53       ` Richard Henderson
  0 siblings, 0 replies; 52+ messages in thread
From: Richard Henderson @ 2018-02-28 22:53 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini

On 02/28/2018 02:50 PM, Emilio G. Cota wrote:
> Is this any better?
> 
> #define TB_FOR_EACH_TAGGED(head, tb, n, field)                          \
>     for (n = (head) & 1, tb = (TranslationBlock *)((head) & ~1);        \
>          tb; tb = (TranslationBlock *)tb->field[n], n = (uintptr_t)tb & 1, \
>              tb = (TranslationBlock *)((uintptr_t)tb & ~1))

Yes, thanks.


r~


* Re: [Qemu-devel] [PATCH 13/16] translate-all: protect TB jumps with a per-destination-TB lock
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 13/16] translate-all: protect TB jumps with a per-destination-TB lock Emilio G. Cota
  2018-02-27 11:33   ` Paolo Bonzini
@ 2018-03-28 15:57   ` Alex Bennée
  1 sibling, 0 replies; 52+ messages in thread
From: Alex Bennée @ 2018-03-28 15:57 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> This applies to both user-mode and !user-mode emulation.
>
<snip>
> @@ -2124,7 +2148,7 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
>      /* Adjust the execution state of the next TB.  */
>      cpu->cflags_next_tb = curr_cflags() | CF_LAST_IO | n;
>
> -    if (tb->cflags & CF_NOCACHE) {
> +    if (tb_cflags(tb) & CF_NOCACHE) {
>          if (tb->orig_tb) {
>              /* Invalidate original TB if this TB was generated in
>               * cpu_exec_nocache() */

Heads up, this fails to apply on master, most likely due to commit
87f963be66a32453e001d1052b000f1653605caa

--
Alex Bennée


* Re: [Qemu-devel] [PATCH 01/16] qht: require a default comparison function
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 01/16] qht: require a default comparison function Emilio G. Cota
  2018-02-28 19:02   ` Richard Henderson
@ 2018-03-28 16:21   ` Alex Bennée
  1 sibling, 0 replies; 52+ messages in thread
From: Alex Bennée @ 2018-03-28 16:21 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> qht_lookup now uses the default cmp function. qht_lookup_custom is defined
> to retain the old behaviour, that is a cmp function is explicitly provided.
>
> qht_insert will gain use of the default cmp in the next patch.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
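
A toy model of the API split described in the commit message may help
when reading the diff below: the table stores a default comparison
function at init time, the plain lookup uses it, and the _custom variant
takes an explicit one. Everything here (the sketch_* names, the
fixed-size linear-probing table, sketch_ht_add) is an assumption for
illustration and says nothing about QHT's real buckets, locking or
resizing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef bool (*sketch_cmp_fn)(const void *a, const void *b);

#define SKETCH_SLOTS 8

/* Toy linear-probing table; not QHT's layout. */
struct sketch_ht {
    sketch_cmp_fn cmp;                 /* default comparison, set at init */
    void *pointers[SKETCH_SLOTS];
    uint32_t hashes[SKETCH_SLOTS];
};

static void sketch_ht_init(struct sketch_ht *ht, sketch_cmp_fn cmp)
{
    size_t i;

    assert(cmp);                       /* a default cmp is mandatory */
    ht->cmp = cmp;
    for (i = 0; i < SKETCH_SLOTS; i++) {
        ht->pointers[i] = NULL;
        ht->hashes[i] = 0;
    }
}

/* Store @p in the first free slot at or after its home slot. */
static bool sketch_ht_add(struct sketch_ht *ht, void *p, uint32_t hash)
{
    size_t i;

    for (i = 0; i < SKETCH_SLOTS; i++) {
        size_t slot = (hash + i) % SKETCH_SLOTS;

        if (ht->pointers[slot] == NULL) {
            ht->hashes[slot] = hash;
            ht->pointers[slot] = p;
            return true;
        }
    }
    return false;                      /* table full */
}

/* Lookup with an explicitly provided comparison function. */
static void *sketch_ht_lookup_custom(struct sketch_ht *ht, sketch_cmp_fn func,
                                     const void *userp, uint32_t hash)
{
    size_t i;

    for (i = 0; i < SKETCH_SLOTS; i++) {
        size_t slot = (hash + i) % SKETCH_SLOTS;

        if (ht->pointers[slot] == NULL) {
            return NULL;               /* an empty slot ends the probe */
        }
        if (ht->hashes[slot] == hash && func(ht->pointers[slot], userp)) {
            return ht->pointers[slot];
        }
    }
    return NULL;
}

/* Lookup with the table's default comparison function. */
static void *sketch_ht_lookup(struct sketch_ht *ht, const void *userp,
                              uint32_t hash)
{
    return sketch_ht_lookup_custom(ht, ht->cmp, userp, hash);
}

/* Example comparison: both arguments point to int. */
static bool sketch_int_equal(const void *a, const void *b)
{
    return *(const int *)a == *(const int *)b;
}
```

With a default comparison in place, callers that compare like with like
no longer pass a function on every lookup, while a caller that looks up
by a different descriptor type can keep doing so through the _custom
entry point.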

> ---
>  accel/tcg/cpu-exec.c      |  4 ++--
>  accel/tcg/translate-all.c | 16 +++++++++++++++-
>  include/qemu/qht.h        | 23 +++++++++++++++++++----
>  tests/qht-bench.c         | 14 +++++++-------
>  tests/test-qht.c          | 15 ++++++++++-----
>  util/qht.c                | 14 +++++++++++---
>  6 files changed, 64 insertions(+), 22 deletions(-)
>
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index 280200f..ec57564 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -293,7 +293,7 @@ struct tb_desc {
>      uint32_t trace_vcpu_dstate;
>  };
>
> -static bool tb_cmp(const void *p, const void *d)
> +static bool tb_lookup_cmp(const void *p, const void *d)
>  {
>      const TranslationBlock *tb = p;
>      const struct tb_desc *desc = d;
> @@ -338,7 +338,7 @@ TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
>      phys_pc = get_page_addr_code(desc.env, pc);
>      desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
>      h = tb_hash_func(phys_pc, pc, flags, cf_mask, *cpu->trace_dstate);
> -    return qht_lookup(&tb_ctx.htable, tb_cmp, &desc, h);
> +    return qht_lookup_custom(&tb_ctx.htable, tb_lookup_cmp, &desc, h);
>  }
>
>  void tb_set_jmp_target(TranslationBlock *tb, int n, uintptr_t addr)
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 67795cd..1cf10f8 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -785,11 +785,25 @@ static inline void code_gen_alloc(size_t tb_size)
>      qemu_mutex_init(&tb_ctx.tb_lock);
>  }
>
> +static bool tb_cmp(const void *ap, const void *bp)
> +{
> +    const TranslationBlock *a = ap;
> +    const TranslationBlock *b = bp;
> +
> +    return a->pc == b->pc &&
> +        a->cs_base == b->cs_base &&
> +        a->flags == b->flags &&
> +        (tb_cflags(a) & CF_HASH_MASK) == (tb_cflags(b) & CF_HASH_MASK) &&
> +        a->trace_vcpu_dstate == b->trace_vcpu_dstate &&
> +        a->page_addr[0] == b->page_addr[0] &&
> +        a->page_addr[1] == b->page_addr[1];
> +}
> +
>  static void tb_htable_init(void)
>  {
>      unsigned int mode = QHT_MODE_AUTO_RESIZE;
>
> -    qht_init(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
> +    qht_init(&tb_ctx.htable, tb_cmp, CODE_GEN_HTABLE_SIZE, mode);
>  }
>
>  /* Must be called before using the QEMU cpus. 'tb_size' is the size
> diff --git a/include/qemu/qht.h b/include/qemu/qht.h
> index 531aa95..dd512bf 100644
> --- a/include/qemu/qht.h
> +++ b/include/qemu/qht.h
> @@ -11,8 +11,11 @@
>  #include "qemu/thread.h"
>  #include "qemu/qdist.h"
>
> +typedef bool (*qht_cmp_func_t)(const void *a, const void *b);
> +
>  struct qht {
>      struct qht_map *map;
> +    qht_cmp_func_t cmp;
>      QemuMutex lock; /* serializes setters of ht->map */
>      unsigned int mode;
>  };
> @@ -47,10 +50,12 @@ typedef void (*qht_iter_func_t)(struct qht *ht, void *p, uint32_t h, void *up);
>  /**
>   * qht_init - Initialize a QHT
>   * @ht: QHT to be initialized
> + * @cmp: default comparison function. Cannot be NULL.
>   * @n_elems: number of entries the hash table should be optimized for.
>   * @mode: bitmask with OR'ed QHT_MODE_*
>   */
> -void qht_init(struct qht *ht, size_t n_elems, unsigned int mode);
> +void qht_init(struct qht *ht, qht_cmp_func_t cmp, size_t n_elems,
> +              unsigned int mode);
>
>  /**
>   * qht_destroy - destroy a previously initialized QHT
> @@ -78,7 +83,7 @@ void qht_destroy(struct qht *ht);
>  bool qht_insert(struct qht *ht, void *p, uint32_t hash);
>
>  /**
> - * qht_lookup - Look up a pointer in a QHT
> + * qht_lookup_custom - Look up a pointer using a custom comparison function.
>   * @ht: QHT to be looked up
>   * @func: function to compare existing pointers against @userp
>   * @userp: pointer to pass to @func
> @@ -94,8 +99,18 @@ bool qht_insert(struct qht *ht, void *p, uint32_t hash);
>   * Returns the corresponding pointer when a match is found.
>   * Returns NULL otherwise.
>   */
> -void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
> -                 uint32_t hash);
> +void *qht_lookup_custom(struct qht *ht, qht_lookup_func_t func,
> +                        const void *userp, uint32_t hash);
> +
> +/**
> + * qht_lookup - Look up a pointer in a QHT
> + * @ht: QHT to be looked up
> + * @userp: pointer to pass to @func
> + * @hash: hash of the pointer to be looked up
> + *
> + * Calls qht_lookup_custom() using @ht's default comparison function.
> + */
> +void *qht_lookup(struct qht *ht, const void *userp, uint32_t hash);
>
>  /**
>   * qht_remove - remove a pointer from the hash table
> diff --git a/tests/qht-bench.c b/tests/qht-bench.c
> index 4cabdfd..c94ac25 100644
> --- a/tests/qht-bench.c
> +++ b/tests/qht-bench.c
> @@ -93,10 +93,10 @@ static void usage_complete(int argc, char *argv[])
>      exit(-1);
>  }
>
> -static bool is_equal(const void *obj, const void *userp)
> +static bool is_equal(const void *ap, const void *bp)
>  {
> -    const long *a = obj;
> -    const long *b = userp;
> +    const long *a = ap;
> +    const long *b = bp;
>
>      return *a == *b;
>  }
> @@ -150,7 +150,7 @@ static void do_rw(struct thread_info *info)
>
>          p = &keys[info->r & (lookup_range - 1)];
>          hash = h(*p);
> -        read = qht_lookup(&ht, is_equal, p, hash);
> +        read = qht_lookup(&ht, p, hash);
>          if (read) {
>              stats->rd++;
>          } else {
> @@ -162,7 +162,7 @@ static void do_rw(struct thread_info *info)
>          if (info->write_op) {
>              bool written = false;
>
> -            if (qht_lookup(&ht, is_equal, p, hash) == NULL) {
> +            if (qht_lookup(&ht, p, hash) == NULL) {
>                  written = qht_insert(&ht, p, hash);
>              }
>              if (written) {
> @@ -173,7 +173,7 @@ static void do_rw(struct thread_info *info)
>          } else {
>              bool removed = false;
>
> -            if (qht_lookup(&ht, is_equal, p, hash)) {
> +            if (qht_lookup(&ht, p, hash)) {
>                  removed = qht_remove(&ht, p, hash);
>              }
>              if (removed) {
> @@ -308,7 +308,7 @@ static void htable_init(void)
>      }
>
>      /* initialize the hash table */
> -    qht_init(&ht, qht_n_elems, qht_mode);
> +    qht_init(&ht, is_equal, qht_n_elems, qht_mode);
>      assert(init_size <= init_range);
>
>      pr_params();
> diff --git a/tests/test-qht.c b/tests/test-qht.c
> index 9b7423a..f8f2886 100644
> --- a/tests/test-qht.c
> +++ b/tests/test-qht.c
> @@ -13,10 +13,10 @@
>  static struct qht ht;
>  static int32_t arr[N * 2];
>
> -static bool is_equal(const void *obj, const void *userp)
> +static bool is_equal(const void *ap, const void *bp)
>  {
> -    const int32_t *a = obj;
> -    const int32_t *b = userp;
> +    const int32_t *a = ap;
> +    const int32_t *b = bp;
>
>      return *a == *b;
>  }
> @@ -60,7 +60,12 @@ static void check(int a, int b, bool expected)
>
>          val = i;
>          hash = i;
> -        p = qht_lookup(&ht, is_equal, &val, hash);
> +        /* test both lookup variants; results should be the same */
> +        if (i % 2) {
> +            p = qht_lookup(&ht, &val, hash);
> +        } else {
> +            p = qht_lookup_custom(&ht, is_equal, &val, hash);
> +        }
>          g_assert_true(!!p == expected);
>      }
>      rcu_read_unlock();
> @@ -102,7 +107,7 @@ static void qht_do_test(unsigned int mode, size_t init_entries)
>      /* under KVM we might fetch stats from an uninitialized qht */
>      check_n(0);
>
> -    qht_init(&ht, 0, mode);
> +    qht_init(&ht, is_equal, 0, mode);
>
>      check_n(0);
>      insert(0, N);
> diff --git a/util/qht.c b/util/qht.c
> index ff4d2e6..dcb3ee1 100644
> --- a/util/qht.c
> +++ b/util/qht.c
> @@ -351,11 +351,14 @@ static struct qht_map *qht_map_create(size_t n_buckets)
>      return map;
>  }
>
> -void qht_init(struct qht *ht, size_t n_elems, unsigned int mode)
> +void qht_init(struct qht *ht, qht_cmp_func_t cmp, size_t n_elems,
> +              unsigned int mode)
>  {
>      struct qht_map *map;
>      size_t n_buckets = qht_elems_to_buckets(n_elems);
>
> +    g_assert(cmp);
> +    ht->cmp = cmp;
>      ht->mode = mode;
>      qemu_mutex_init(&ht->lock);
>      map = qht_map_create(n_buckets);
> @@ -479,8 +482,8 @@ void *qht_lookup__slowpath(struct qht_bucket *b, qht_lookup_func_t func,
>      return ret;
>  }
>
> -void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
> -                 uint32_t hash)
> +void *qht_lookup_custom(struct qht *ht, qht_lookup_func_t func,
> +                        const void *userp, uint32_t hash)
>  {
>      struct qht_bucket *b;
>      struct qht_map *map;
> @@ -502,6 +505,11 @@ void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
>      return qht_lookup__slowpath(b, func, userp, hash);
>  }
>
> +void *qht_lookup(struct qht *ht, const void *userp, uint32_t hash)
> +{
> +    return qht_lookup_custom(ht, ht->cmp, userp, hash);
> +}
> +
>  /* call with head->lock held */
>  static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
>                                 struct qht_bucket *head, void *p, uint32_t hash,


--
Alex Bennée


* Re: [Qemu-devel] [PATCH 02/16] qht: return existing entry when qht_insert fails
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 02/16] qht: return existing entry when qht_insert fails Emilio G. Cota
  2018-02-28 19:10   ` Richard Henderson
@ 2018-03-28 16:33   ` Alex Bennée
  2018-04-05 17:10     ` Emilio G. Cota
  1 sibling, 1 reply; 52+ messages in thread
From: Alex Bennée @ 2018-03-28 16:33 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> The meaning of "existing" is now changed to "matches in hash and
> ht->cmp result". This is saner than just checking the pointer value.
>
> Note that we now return NULL on insertion success, or the existing
> pointer on failure. We can do this because NULL pointers are not
> allowed to be inserted in QHT.
>
> Suggested-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  include/qemu/qht.h |  7 ++++---
>  tests/qht-bench.c  |  4 ++--
>  tests/test-qht.c   |  5 ++++-
>  util/qht.c         | 17 +++++++++--------
>  4 files changed, 19 insertions(+), 14 deletions(-)
>
> diff --git a/include/qemu/qht.h b/include/qemu/qht.h
> index dd512bf..c320cb6 100644
> --- a/include/qemu/qht.h
> +++ b/include/qemu/qht.h
> @@ -77,10 +77,11 @@ void qht_destroy(struct qht *ht);
>   * In case of successful operation, smp_wmb() is implied before the pointer is
>   * inserted into the hash table.
>   *
> - * Returns true on success.
> - * Returns false if the @p-@hash pair already exists in the hash table.
> + * On success, returns NULL.
> + * On failure, returns the pointer from an entry that is equivalent (i.e.
> + * ht->cmp matches and the hash is the same) to @p-@h.
>   */
> -bool qht_insert(struct qht *ht, void *p, uint32_t hash);
> +void *qht_insert(struct qht *ht, void *p, uint32_t hash);

Hmm, this seems needlessly counter-intuitive. I realise the potential
efficiency in overloading the return value for success/failure, but wouldn't a:

  bool qht_insert(struct qht *ht, void *p, uint32_t hash, void **existing);

be conceptually nicer?
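
To make the two shapes concrete, here is a rough sketch on a toy
single-bucket table. The sketch_* names and the table itself are
hypothetical and leave out QHT's chaining, seqlocks and resizing; the
bool variant is simply written as a wrapper around the pointer-returning
one, which is one possible way to layer it, not a claim about how the
series does it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef bool (*sketch_cmp_fn)(const void *a, const void *b);

#define SKETCH_ENTRIES 4

/* Toy single bucket; not QHT's layout. */
struct sketch_bucket {
    sketch_cmp_fn cmp;
    void *pointers[SKETCH_ENTRIES];
    uint32_t hashes[SKETCH_ENTRIES];
};

/* Patch shape: NULL on success, the equivalent existing entry otherwise. */
static void *sketch_insert(struct sketch_bucket *b, void *p, uint32_t hash)
{
    size_t i;

    assert(p);                         /* NULL is reserved for "inserted" */
    for (i = 0; i < SKETCH_ENTRIES; i++) {
        if (b->pointers[i] == NULL) {
            b->hashes[i] = hash;
            b->pointers[i] = p;
            return NULL;
        }
        if (b->hashes[i] == hash && b->cmp(b->pointers[i], p)) {
            return b->pointers[i];
        }
    }
    abort();                           /* sketch only: no overflow handling */
}

/* Review suggestion: bool result plus an optional out-parameter. */
static bool sketch_insert_bool(struct sketch_bucket *b, void *p, uint32_t hash,
                               void **existing)
{
    void *prev = sketch_insert(b, p, hash);

    if (existing) {
        *existing = prev;
    }
    return prev == NULL;
}

/* Example comparison: both arguments point to int. */
static bool sketch_int_equal(const void *a, const void *b)
{
    return *(const int *)a == *(const int *)b;
}
```

Note that in both shapes a second insert of an equal value through a
different pointer reports the first pointer, which is the "matches in
hash and ht->cmp result" meaning of "existing" from the commit message.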

>
>  /**
>   * qht_lookup_custom - Look up a pointer using a custom comparison function.
> diff --git a/tests/qht-bench.c b/tests/qht-bench.c
> index c94ac25..2f88400 100644
> --- a/tests/qht-bench.c
> +++ b/tests/qht-bench.c
> @@ -163,7 +163,7 @@ static void do_rw(struct thread_info *info)
>              bool written = false;
>
>              if (qht_lookup(&ht, p, hash) == NULL) {
> -                written = qht_insert(&ht, p, hash);
> +                written = !qht_insert(&ht, p, hash);
>              }
>              if (written) {
>                  stats->in++;
> @@ -322,7 +322,7 @@ static void htable_init(void)
>              r = xorshift64star(r);
>              p = &keys[r & (init_range - 1)];
>              hash = h(*p);
> -            if (qht_insert(&ht, p, hash)) {
> +            if (qht_insert(&ht, p, hash) == NULL) {
>                  break;
>              }
>              retries++;
> diff --git a/tests/test-qht.c b/tests/test-qht.c
> index f8f2886..7164ae4 100644
> --- a/tests/test-qht.c
> +++ b/tests/test-qht.c
> @@ -27,11 +27,14 @@ static void insert(int a, int b)
>
>      for (i = a; i < b; i++) {
>          uint32_t hash;
> +        void *existing;
>
>          arr[i] = i;
>          hash = i;
>
> -        qht_insert(&ht, &arr[i], hash);
> +        g_assert_true(!qht_insert(&ht, &arr[i], hash));
> +        existing = qht_insert(&ht, &arr[i], hash);
> +        g_assert_true(existing == &arr[i]);
>      }
>  }
>
> diff --git a/util/qht.c b/util/qht.c
> index dcb3ee1..f9f49a9 100644
> --- a/util/qht.c
> +++ b/util/qht.c
> @@ -511,9 +511,9 @@ void *qht_lookup(struct qht *ht, const void *userp, uint32_t hash)
>  }
>
>  /* call with head->lock held */
> -static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
> -                               struct qht_bucket *head, void *p, uint32_t hash,
> -                               bool *needs_resize)
> +static void *qht_insert__locked(struct qht *ht, struct qht_map *map,
> +                                struct qht_bucket *head, void *p, uint32_t hash,
> +                                bool *needs_resize)
>  {
>      struct qht_bucket *b = head;
>      struct qht_bucket *prev = NULL;
> @@ -523,8 +523,9 @@ static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
>      do {
>          for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
>              if (b->pointers[i]) {
> -                if (unlikely(b->pointers[i] == p)) {
> -                    return false;
> +                if (unlikely(b->hashes[i] == hash &&
> +                             ht->cmp(b->pointers[i], p))) {
> +                    return b->pointers[i];
>                  }
>              } else {
>                  goto found;
> @@ -553,7 +554,7 @@ static bool qht_insert__locked(struct qht *ht, struct qht_map *map,
>      atomic_set(&b->hashes[i], hash);
>      atomic_set(&b->pointers[i], p);
>      seqlock_write_end(&head->sequence);
> -    return true;
> +    return NULL;
>  }
>
>  static __attribute__((noinline)) void qht_grow_maybe(struct qht *ht)
> @@ -577,12 +578,12 @@ static __attribute__((noinline)) void qht_grow_maybe(struct qht *ht)
>      qemu_mutex_unlock(&ht->lock);
>  }
>
> -bool qht_insert(struct qht *ht, void *p, uint32_t hash)
> +void *qht_insert(struct qht *ht, void *p, uint32_t hash)
>  {
>      struct qht_bucket *b;
>      struct qht_map *map;
>      bool needs_resize = false;
> -    bool ret;
> +    void *ret;
>
>      /* NULL pointers are not supported */
>      qht_debug_assert(p);


--
Alex Bennée


* Re: [Qemu-devel] [PATCH 03/16] tcg: track TBs with per-region BST's
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 03/16] tcg: track TBs with per-region BST's Emilio G. Cota
  2018-02-28 20:53   ` Richard Henderson
@ 2018-03-29  9:54   ` Alex Bennée
  1 sibling, 0 replies; 52+ messages in thread
From: Alex Bennée @ 2018-03-29  9:54 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> This paves the way for enabling scalable parallel generation of TCG code.
>
> Instead of tracking TBs with a single binary search tree (BST), use a
> BST for each TCG region, protecting it with a lock. This is as scalable
> as it gets, since each TCG thread operates on a separate region.
>
> The core of this change is the introduction of struct tcg_region_tree,
> which contains a pointer to a GTree and an associated lock to serialize
> accesses to it. We then allocate an array of tcg_region_tree's, adding
> the appropriate padding to avoid false sharing based on
> qemu_dcache_linesize.
>
> Given a tc_ptr, we first find the corresponding region_tree. This
> is done by special-casing the first and last regions first, since they
> might be of size != region.size; otherwise we just divide the offset
> by region.stride. I was worried about this division (several dozen
> cycles of latency), but profiling shows that this is not a fast path.
> Note that region.stride is not required to be a power of two; it
> is only required to be a multiple of the host's page size.
>
> Note that with this design we can also provide consistent snapshots
> about all region trees at once; for instance, tcg_tb_foreach
> acquires/releases all region_tree locks before/after iterating over them.
> For this reason we now drop tb_lock in dump_exec_info().
>
> As an alternative I considered implementing a concurrent BST, but this
> can be tricky to get right, offers no consistent snapshots of the BST,
> and performance and scalability-wise I don't think it could ever beat
> having separate GTrees, given that our workload is insert-mostly (all
> concurrent BST designs I've seen focus, understandably, on making
> lookups fast, which comes at the expense of convoluted, non-wait-free
> insertions/removals).
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
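
For reference while reading the tcg.c hunk, the pointer-to-region
mapping described in the commit message can be sketched roughly as
follows. The struct and its field names (base, stride, n) are
assumptions for illustration rather than the series' definitions, and
the padding helper only shows the round-up-to-cache-line arithmetic that
the "padding to avoid false sharing" remark implies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of how the code buffer is split into regions. */
struct sketch_regions {
    uintptr_t base;    /* where the regular grid of regions begins */
    size_t stride;     /* distance between consecutive region starts */
    size_t n;          /* number of regions, at least 1 */
};

/*
 * Map a code pointer to a region index. The first region may start
 * before @base and the last one may extend past the regular grid, so
 * both are special-cased before falling back to a division.
 */
static size_t sketch_region_index(const struct sketch_regions *r, uintptr_t p)
{
    uintptr_t offset;

    if (p < r->base) {
        return 0;
    }
    offset = p - r->base;
    if (offset > r->stride * (r->n - 1)) {
        return r->n - 1;
    }
    return offset / r->stride;         /* stride need not be a power of two */
}

/* Round @size up to a multiple of @line, e.g. the data cache line size. */
static size_t sketch_padded_size(size_t size, size_t line)
{
    return (size + line - 1) / line * line;
}
```

Each per-region tree and its lock would then live at index times the
padded size into one allocation, so that two threads working on
neighbouring regions do not share a cache line.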

> ---
>  accel/tcg/cpu-exec.c      |   2 +-
>  accel/tcg/translate-all.c | 101 ++++--------------------
>  include/exec/exec-all.h   |   1 -
>  include/exec/tb-context.h |   1 -
>  tcg/tcg.c                 | 191 ++++++++++++++++++++++++++++++++++++++++++++++
>  tcg/tcg.h                 |   6 ++
>  6 files changed, 213 insertions(+), 89 deletions(-)
>
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index ec57564..8c68727 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -222,7 +222,7 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
>
>      tb_lock();
>      tb_phys_invalidate(tb, -1);
> -    tb_remove(tb);
> +    tcg_tb_remove(tb);
>      tb_unlock();
>  }
>  #endif
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 1cf10f8..3a51d49 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -205,8 +205,6 @@ void tb_lock_reset(void)
>      }
>  }
>
> -static TranslationBlock *tb_find_pc(uintptr_t tc_ptr);
> -
>  void cpu_gen_init(void)
>  {
>      tcg_context_init(&tcg_init_ctx);
> @@ -375,13 +373,13 @@ bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc)
>
>      if (check_offset < tcg_init_ctx.code_gen_buffer_size) {
>          tb_lock();
> -        tb = tb_find_pc(host_pc);
> +        tb = tcg_tb_lookup(host_pc);
>          if (tb) {
>              cpu_restore_state_from_tb(cpu, tb, host_pc);
>              if (tb->cflags & CF_NOCACHE) {
>                  /* one-shot translation, invalidate it immediately */
>                  tb_phys_invalidate(tb, -1);
> -                tb_remove(tb);
> +                tcg_tb_remove(tb);
>              }
>              r = true;
>          }
> @@ -731,48 +729,6 @@ static inline void *alloc_code_gen_buffer(void)
>  }
>  #endif /* USE_STATIC_CODE_GEN_BUFFER, WIN32, POSIX */
>
> -/* compare a pointer @ptr and a tb_tc @s */
> -static int ptr_cmp_tb_tc(const void *ptr, const struct tb_tc *s)
> -{
> -    if (ptr >= s->ptr + s->size) {
> -        return 1;
> -    } else if (ptr < s->ptr) {
> -        return -1;
> -    }
> -    return 0;
> -}
> -
> -static gint tb_tc_cmp(gconstpointer ap, gconstpointer bp)
> -{
> -    const struct tb_tc *a = ap;
> -    const struct tb_tc *b = bp;
> -
> -    /*
> -     * When both sizes are set, we know this isn't a lookup.
> -     * This is the most likely case: every TB must be inserted; lookups
> -     * are a lot less frequent.
> -     */
> -    if (likely(a->size && b->size)) {
> -        if (a->ptr > b->ptr) {
> -            return 1;
> -        } else if (a->ptr < b->ptr) {
> -            return -1;
> -        }
> -        /* a->ptr == b->ptr should happen only on deletions */
> -        g_assert(a->size == b->size);
> -        return 0;
> -    }
> -    /*
> -     * All lookups have either .size field set to 0.
> -     * From the glib sources we see that @ap is always the lookup key. However
> -     * the docs provide no guarantee, so we just mark this case as likely.
> -     */
> -    if (likely(a->size == 0)) {
> -        return ptr_cmp_tb_tc(a->ptr, b);
> -    }
> -    return ptr_cmp_tb_tc(b->ptr, a);
> -}
> -
>  static inline void code_gen_alloc(size_t tb_size)
>  {
>      tcg_ctx->code_gen_buffer_size = size_code_gen_buffer(tb_size);
> @@ -781,7 +737,6 @@ static inline void code_gen_alloc(size_t tb_size)
>          fprintf(stderr, "Could not allocate dynamic translator buffer\n");
>          exit(1);
>      }
> -    tb_ctx.tb_tree = g_tree_new(tb_tc_cmp);
>      qemu_mutex_init(&tb_ctx.tb_lock);
>  }
>
> @@ -842,14 +797,6 @@ static TranslationBlock *tb_alloc(target_ulong pc)
>      return tb;
>  }
>
> -/* Called with tb_lock held.  */
> -void tb_remove(TranslationBlock *tb)
> -{
> -    assert_tb_locked();
> -
> -    g_tree_remove(tb_ctx.tb_tree, &tb->tc);
> -}
> -
>  static inline void invalidate_page_bitmap(PageDesc *p)
>  {
>  #ifdef CONFIG_SOFTMMU
> @@ -914,10 +861,10 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
>      }
>
>      if (DEBUG_TB_FLUSH_GATE) {
> -        size_t nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
> +        size_t nb_tbs = tcg_nb_tbs();
>          size_t host_size = 0;
>
> -        g_tree_foreach(tb_ctx.tb_tree, tb_host_size_iter, &host_size);
> +        tcg_tb_foreach(tb_host_size_iter, &host_size);
>          printf("qemu: flush code_size=%zu nb_tbs=%zu avg_tb_size=%zu\n",
>                 tcg_code_size(), nb_tbs, nb_tbs > 0 ? host_size / nb_tbs : 0);
>      }
> @@ -926,10 +873,6 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
>          cpu_tb_jmp_cache_clear(cpu);
>      }
>
> -    /* Increment the refcount first so that destroy acts as a reset */
> -    g_tree_ref(tb_ctx.tb_tree);
> -    g_tree_destroy(tb_ctx.tb_tree);
> -
>      qht_reset_size(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
>      page_flush_tb();
>
> @@ -1409,7 +1352,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>       * through the physical hash table and physical page list.
>       */
>      tb_link_page(tb, phys_pc, phys_page2);
> -    g_tree_insert(tb_ctx.tb_tree, &tb->tc, tb);
> +    tcg_tb_insert(tb);
>      return tb;
>  }
>
> @@ -1513,7 +1456,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>                  current_tb = NULL;
>                  if (cpu->mem_io_pc) {
>                      /* now we have a real cpu fault */
> -                    current_tb = tb_find_pc(cpu->mem_io_pc);
> +                    current_tb = tcg_tb_lookup(cpu->mem_io_pc);
>                  }
>              }
>              if (current_tb == tb &&
> @@ -1629,7 +1572,7 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
>      tb = p->first_tb;
>  #ifdef TARGET_HAS_PRECISE_SMC
>      if (tb && pc != 0) {
> -        current_tb = tb_find_pc(pc);
> +        current_tb = tcg_tb_lookup(pc);
>      }
>      if (cpu != NULL) {
>          env = cpu->env_ptr;
> @@ -1672,18 +1615,6 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
>  }
>  #endif
>
> -/*
> - * Find the TB 'tb' such that
> - * tb->tc.ptr <= tc_ptr < tb->tc.ptr + tb->tc.size
> - * Return NULL if not found.
> - */
> -static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
> -{
> -    struct tb_tc s = { .ptr = (void *)tc_ptr };
> -
> -    return g_tree_lookup(tb_ctx.tb_tree, &s);
> -}
> -
>  #if !defined(CONFIG_USER_ONLY)
>  void tb_invalidate_phys_addr(AddressSpace *as, hwaddr addr)
>  {
> @@ -1711,7 +1642,7 @@ void tb_check_watchpoint(CPUState *cpu)
>  {
>      TranslationBlock *tb;
>
> -    tb = tb_find_pc(cpu->mem_io_pc);
> +    tb = tcg_tb_lookup(cpu->mem_io_pc);
>      if (tb) {
>          /* We can use retranslation to find the PC.  */
>          cpu_restore_state_from_tb(cpu, tb, cpu->mem_io_pc);
> @@ -1745,7 +1676,7 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
>      uint32_t n;
>
>      tb_lock();
> -    tb = tb_find_pc(retaddr);
> +    tb = tcg_tb_lookup(retaddr);
>      if (!tb) {
>          cpu_abort(cpu, "cpu_io_recompile: could not find TB for pc=%p",
>                    (void *)retaddr);
> @@ -1789,7 +1720,7 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
>               * cpu_exec_nocache() */
>              tb_phys_invalidate(tb->orig_tb, -1);
>          }
> -        tb_remove(tb);
> +        tcg_tb_remove(tb);
>      }
>
>      /* TODO: If env->pc != tb->pc (i.e. the faulting instruction was not
> @@ -1860,6 +1791,7 @@ static void print_qht_statistics(FILE *f, fprintf_function cpu_fprintf,
>  }
>
>  struct tb_tree_stats {
> +    size_t nb_tbs;
>      size_t host_size;
>      size_t target_size;
>      size_t max_target_size;
> @@ -1873,6 +1805,7 @@ static gboolean tb_tree_stats_iter(gpointer key, gpointer value, gpointer data)
>      const TranslationBlock *tb = value;
>      struct tb_tree_stats *tst = data;
>
> +    tst->nb_tbs++;
>      tst->host_size += tb->tc.size;
>      tst->target_size += tb->size;
>      if (tb->size > tst->max_target_size) {
> @@ -1896,10 +1829,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>      struct qht_stats hst;
>      size_t nb_tbs;
>
> -    tb_lock();
> -
> -    nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
> -    g_tree_foreach(tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
> +    tcg_tb_foreach(tb_tree_stats_iter, &tst);
> +    nb_tbs = tst.nb_tbs;
>      /* XXX: avoid using doubles ? */
>      cpu_fprintf(f, "Translation buffer state:\n");
>      /*
> @@ -1934,8 +1865,6 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>      cpu_fprintf(f, "TB invalidate count %d\n", tb_ctx.tb_phys_invalidate_count);
>      cpu_fprintf(f, "TLB flush count     %zu\n", tlb_flush_count());
>      tcg_dump_info(f, cpu_fprintf);
> -
> -    tb_unlock();
>  }
>
>  void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf)
> @@ -2203,7 +2132,7 @@ int page_unprotect(target_ulong address, uintptr_t pc)
>               * set the page to PAGE_WRITE and did the TB invalidate for us.
>               */
>  #ifdef TARGET_HAS_PRECISE_SMC
> -            TranslationBlock *current_tb = tb_find_pc(pc);
> +            TranslationBlock *current_tb = tcg_tb_lookup(pc);
>              if (current_tb) {
>                  current_tb_invalidated = tb_cflags(current_tb) & CF_INVALID;
>              }
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index e5afd2e..17e08b3 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -401,7 +401,6 @@ static inline uint32_t curr_cflags(void)
>           | (use_icount ? CF_USE_ICOUNT : 0);
>  }
>
> -void tb_remove(TranslationBlock *tb);
>  void tb_flush(CPUState *cpu);
>  void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
>  TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
> diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
> index 1d41202..d8472c8 100644
> --- a/include/exec/tb-context.h
> +++ b/include/exec/tb-context.h
> @@ -31,7 +31,6 @@ typedef struct TBContext TBContext;
>
>  struct TBContext {
>
> -    GTree *tb_tree;
>      struct qht htable;
>      /* any access to the tbs or the page table must use this lock */
>      QemuMutex tb_lock;
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index bb24526..b471708 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -135,6 +135,12 @@ static TCGContext **tcg_ctxs;
>  static unsigned int n_tcg_ctxs;
>  TCGv_env cpu_env = 0;
>
> +struct tcg_region_tree {
> +    QemuMutex lock;
> +    GTree *tree;
> +    /* padding to avoid false sharing is computed at run-time */
> +};
> +
>  /*
>   * We divide code_gen_buffer into equally-sized "regions" that TCG threads
>   * dynamically allocate from as demand dictates. Given appropriate region
> @@ -158,6 +164,13 @@ struct tcg_region_state {
>  };
>
>  static struct tcg_region_state region;
> +/*
> + * This is an array of struct tcg_region_tree's, with padding.
> + * We use void * to simplify the computation of region_trees[i]; each
> + * struct is found every tree_size bytes.
> + */
> +static void *region_trees;
> +static size_t tree_size;
>  static TCGRegSet tcg_target_available_regs[TCG_TYPE_COUNT];
>  static TCGRegSet tcg_target_call_clobber_regs;
>
> @@ -295,6 +308,180 @@ TCGLabel *gen_new_label(void)
>
>  #include "tcg-target.inc.c"
>
> +/* compare a pointer @ptr and a tb_tc @s */
> +static int ptr_cmp_tb_tc(const void *ptr, const struct tb_tc *s)
> +{
> +    if (ptr >= s->ptr + s->size) {
> +        return 1;
> +    } else if (ptr < s->ptr) {
> +        return -1;
> +    }
> +    return 0;
> +}
> +
> +static gint tb_tc_cmp(gconstpointer ap, gconstpointer bp)
> +{
> +    const struct tb_tc *a = ap;
> +    const struct tb_tc *b = bp;
> +
> +    /*
> +     * When both sizes are set, we know this isn't a lookup.
> +     * This is the most likely case: every TB must be inserted; lookups
> +     * are a lot less frequent.
> +     */
> +    if (likely(a->size && b->size)) {
> +        if (a->ptr > b->ptr) {
> +            return 1;
> +        } else if (a->ptr < b->ptr) {
> +            return -1;
> +        }
> +        /* a->ptr == b->ptr should happen only on deletions */
> +        g_assert(a->size == b->size);
> +        return 0;
> +    }
> +    /*
> +     * All lookups have either .size field set to 0.
> +     * From the glib sources we see that @ap is always the lookup key. However
> +     * the docs provide no guarantee, so we just mark this case as likely.
> +     */
> +    if (likely(a->size == 0)) {
> +        return ptr_cmp_tb_tc(a->ptr, b);
> +    }
> +    return ptr_cmp_tb_tc(b->ptr, a);
> +}
> +
> +static void tcg_region_trees_init(void)
> +{
> +    size_t i;
> +
> +    tree_size = ROUND_UP(sizeof(struct tcg_region_tree), qemu_dcache_linesize);
> +    region_trees = qemu_memalign(qemu_dcache_linesize, region.n * tree_size);
> +    for (i = 0; i < region.n; i++) {
> +        struct tcg_region_tree *rt = region_trees + i * tree_size;
> +
> +        qemu_mutex_init(&rt->lock);
> +        rt->tree = g_tree_new(tb_tc_cmp);
> +    }
> +}
> +
> +static struct tcg_region_tree *tc_ptr_to_region_tree(void *p)
> +{
> +    size_t region_idx;
> +
> +    if (p < region.start_aligned) {
> +        region_idx = 0;
> +    } else {
> +        ptrdiff_t offset = p - region.start_aligned;
> +
> +        if (offset > region.stride * (region.n - 1)) {
> +            region_idx = region.n - 1;
> +        } else {
> +            region_idx = offset / region.stride;
> +        }
> +    }
> +    return region_trees + region_idx * tree_size;
> +}
> +
> +void tcg_tb_insert(TranslationBlock *tb)
> +{
> +    struct tcg_region_tree *rt = tc_ptr_to_region_tree(tb->tc.ptr);
> +
> +    qemu_mutex_lock(&rt->lock);
> +    g_tree_insert(rt->tree, &tb->tc, tb);
> +    qemu_mutex_unlock(&rt->lock);
> +}
> +
> +void tcg_tb_remove(TranslationBlock *tb)
> +{
> +    struct tcg_region_tree *rt = tc_ptr_to_region_tree(tb->tc.ptr);
> +
> +    qemu_mutex_lock(&rt->lock);
> +    g_tree_remove(rt->tree, &tb->tc);
> +    qemu_mutex_unlock(&rt->lock);
> +}
> +
> +/*
> + * Find the TB 'tb' such that
> + * tb->tc.ptr <= tc_ptr < tb->tc.ptr + tb->tc.size
> + * Return NULL if not found.
> + */
> +TranslationBlock *tcg_tb_lookup(uintptr_t tc_ptr)
> +{
> +    struct tcg_region_tree *rt = tc_ptr_to_region_tree((void *)tc_ptr);
> +    TranslationBlock *tb;
> +    struct tb_tc s = { .ptr = (void *)tc_ptr };
> +
> +    qemu_mutex_lock(&rt->lock);
> +    tb = g_tree_lookup(rt->tree, &s);
> +    qemu_mutex_unlock(&rt->lock);
> +    return tb;
> +}
> +
> +static void tcg_region_tree_lock_all(void)
> +{
> +    size_t i;
> +
> +    for (i = 0; i < region.n; i++) {
> +        struct tcg_region_tree *rt = region_trees + i * tree_size;
> +
> +        qemu_mutex_lock(&rt->lock);
> +    }
> +}
> +
> +static void tcg_region_tree_unlock_all(void)
> +{
> +    size_t i;
> +
> +    for (i = 0; i < region.n; i++) {
> +        struct tcg_region_tree *rt = region_trees + i * tree_size;
> +
> +        qemu_mutex_unlock(&rt->lock);
> +    }
> +}
> +
> +void tcg_tb_foreach(GTraverseFunc func, gpointer user_data)
> +{
> +    size_t i;
> +
> +    tcg_region_tree_lock_all();
> +    for (i = 0; i < region.n; i++) {
> +        struct tcg_region_tree *rt = region_trees + i * tree_size;
> +
> +        g_tree_foreach(rt->tree, func, user_data);
> +    }
> +    tcg_region_tree_unlock_all();
> +}
> +
> +size_t tcg_nb_tbs(void)
> +{
> +    size_t nb_tbs = 0;
> +    size_t i;
> +
> +    tcg_region_tree_lock_all();
> +    for (i = 0; i < region.n; i++) {
> +        struct tcg_region_tree *rt = region_trees + i * tree_size;
> +
> +        nb_tbs += g_tree_nnodes(rt->tree);
> +    }
> +    tcg_region_tree_unlock_all();
> +    return nb_tbs;
> +}
> +
> +static void tcg_region_tree_reset_all(void)
> +{
> +    size_t i;
> +
> +    tcg_region_tree_lock_all();
> +    for (i = 0; i < region.n; i++) {
> +        struct tcg_region_tree *rt = region_trees + i * tree_size;
> +
> +        /* Increment the refcount first so that destroy acts as a reset */
> +        g_tree_ref(rt->tree);
> +        g_tree_destroy(rt->tree);
> +    }
> +    tcg_region_tree_unlock_all();
> +}
> +
>  static void tcg_region_bounds(size_t curr_region, void **pstart, void **pend)
>  {
>      void *start, *end;
> @@ -380,6 +567,8 @@ void tcg_region_reset_all(void)
>          g_assert(!err);
>      }
>      qemu_mutex_unlock(&region.lock);
> +
> +    tcg_region_tree_reset_all();
>  }
>
>  #ifdef CONFIG_USER_ONLY
> @@ -496,6 +685,8 @@ void tcg_region_init(void)
>          g_assert(!rc);
>      }
>
> +    tcg_region_trees_init();
> +
>      /* In user-mode we support only one ctx, so do the initial allocation now */
>  #ifdef CONFIG_USER_ONLY
>      {
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 9e2d909..8bf29cc 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -850,6 +850,12 @@ void tcg_region_reset_all(void);
>  size_t tcg_code_size(void);
>  size_t tcg_code_capacity(void);
>
> +void tcg_tb_insert(TranslationBlock *tb);
> +void tcg_tb_remove(TranslationBlock *tb);
> +TranslationBlock *tcg_tb_lookup(uintptr_t tc_ptr);
> +void tcg_tb_foreach(GTraverseFunc func, gpointer user_data);
> +size_t tcg_nb_tbs(void);
> +
>  /* user-mode: Called with tb_lock held.  */
>  static inline void *tcg_malloc(int size)
>  {


--
Alex Bennée

* Re: [Qemu-devel] [PATCH 04/16] tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 04/16] tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx Emilio G. Cota
  2018-02-28 20:55   ` Richard Henderson
@ 2018-03-29 10:06   ` Alex Bennée
  2018-04-05 17:18     ` Emilio G. Cota
  1 sibling, 1 reply; 52+ messages in thread
From: Alex Bennée @ 2018-03-29 10:06 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Thereby making it per-TCGContext. Once we remove tb_lock, this will
> avoid an atomic increment every time a TB is invalidated.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/translate-all.c |  5 +++--
>  include/exec/tb-context.h |  1 -
>  tcg/tcg.c                 | 14 ++++++++++++++
>  tcg/tcg.h                 |  3 +++
>  4 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 3a51d49..20ad3fc 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1072,7 +1072,8 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>      /* suppress any remaining jumps to this TB */
>      tb_jmp_unlink(tb);
>
> -    tb_ctx.tb_phys_invalidate_count++;
> +    atomic_set(&tcg_ctx->tb_phys_invalidate_count,
> +               tcg_ctx->tb_phys_invalidate_count + 1);

We do have an atomic_inc helper for this. Failing that, we need a comment
explaining that there is a single writer and we only have atomic_read()s
to worry about, hence no races.

Otherwise:

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
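
For illustration, here is a minimal sketch of the two increment styles
under discussion. It uses C11 atomics rather than QEMU's atomic_set() and
atomic_read() helpers, and the struct and function names are hypothetical.
The load-then-store form is only safe when a single thread ever writes a
given counter, which appears to be the assumption behind making the
counter per-TCGContext; readers on other threads then see a possibly
stale but never torn value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-context statistics counter. */
struct ctx_stats {
    _Atomic size_t invalidate_count;
};

/*
 * Single-writer increment: a relaxed load followed by a relaxed store.
 * This is not a read-modify-write, so it is only correct when exactly
 * one thread ever writes this particular counter.
 */
static void count_inc_single_writer(struct ctx_stats *s)
{
    size_t old = atomic_load_explicit(&s->invalidate_count,
                                      memory_order_relaxed);

    atomic_store_explicit(&s->invalidate_count, old + 1,
                          memory_order_relaxed);
}

/* Multi-writer increment: a true atomic read-modify-write. */
static void count_inc_any_writer(struct ctx_stats *s)
{
    atomic_fetch_add_explicit(&s->invalidate_count, 1,
                              memory_order_relaxed);
}

/* Reader side: sum the counters of all contexts. */
static size_t count_total(struct ctx_stats *ctxs, size_t n)
{
    size_t total = 0;
    size_t i;

    assert(ctxs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        total += atomic_load_explicit(&ctxs[i].invalidate_count,
                                      memory_order_relaxed);
    }
    return total;
}
```

The single-writer form avoids the cost of a locked instruction on the
writer's path. The price is that its correctness depends on an invariant
the code does not state, which is why a comment would help.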

>  }
>
>  #ifdef CONFIG_SOFTMMU
> @@ -1862,7 +1863,7 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>      cpu_fprintf(f, "\nStatistics:\n");
>      cpu_fprintf(f, "TB flush count      %u\n",
>                  atomic_read(&tb_ctx.tb_flush_count));
> -    cpu_fprintf(f, "TB invalidate count %d\n", tb_ctx.tb_phys_invalidate_count);
> +    cpu_fprintf(f, "TB invalidate count %zu\n", tcg_tb_phys_invalidate_count());
>      cpu_fprintf(f, "TLB flush count     %zu\n", tlb_flush_count());
>      tcg_dump_info(f, cpu_fprintf);
>  }
> diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
> index d8472c8..8c9b49c 100644
> --- a/include/exec/tb-context.h
> +++ b/include/exec/tb-context.h
> @@ -37,7 +37,6 @@ struct TBContext {
>
>      /* statistics */
>      unsigned tb_flush_count;
> -    int tb_phys_invalidate_count;
>  };
>
>  extern TBContext tb_ctx;
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index b471708..a7b596e 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -791,6 +791,20 @@ size_t tcg_code_capacity(void)
>      return capacity;
>  }
>
> +size_t tcg_tb_phys_invalidate_count(void)
> +{
> +    unsigned int n_ctxs = atomic_read(&n_tcg_ctxs);
> +    unsigned int i;
> +    size_t total = 0;
> +
> +    for (i = 0; i < n_ctxs; i++) {
> +        const TCGContext *s = atomic_read(&tcg_ctxs[i]);
> +
> +        total += atomic_read(&s->tb_phys_invalidate_count);
> +    }
> +    return total;
> +}
> +
>  /* pool based memory allocation */
>  void *tcg_malloc_internal(TCGContext *s, int size)
>  {
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 8bf29cc..9dd9448 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -694,6 +694,8 @@ struct TCGContext {
>      /* Threshold to flush the translated code buffer.  */
>      void *code_gen_highwater;
>
> +    size_t tb_phys_invalidate_count;
> +
>      /* Track which vCPU triggers events */
>      CPUState *cpu;                      /* *_trans */
>
> @@ -852,6 +854,7 @@ size_t tcg_code_capacity(void);
>
>  void tcg_tb_insert(TranslationBlock *tb);
>  void tcg_tb_remove(TranslationBlock *tb);
> +size_t tcg_tb_phys_invalidate_count(void);
>  TranslationBlock *tcg_tb_lookup(uintptr_t tc_ptr);
>  void tcg_tb_foreach(GTraverseFunc func, gpointer user_data);
>  size_t tcg_nb_tbs(void);


--
Alex Bennée

* Re: [Qemu-devel] [PATCH 09/16] translate-all: move tb_invalidate_phys_page_range up in the file
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 09/16] translate-all: move tb_invalidate_phys_page_range up in the file Emilio G. Cota
  2018-02-28 22:24   ` Richard Henderson
@ 2018-03-29 10:08   ` Alex Bennée
  1 sibling, 0 replies; 52+ messages in thread
From: Alex Bennée @ 2018-03-29 10:08 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> This greatly simplifies next commit's diff.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  accel/tcg/translate-all.c | 77 ++++++++++++++++++++++++-----------------------
>  1 file changed, 39 insertions(+), 38 deletions(-)
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index a98e182..4cb03f1 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1371,44 +1371,6 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>
>  /*
>   * Invalidate all TBs which intersect with the target physical address range
> - * [start;end[. NOTE: start and end may refer to *different* physical pages.
> - * 'is_cpu_write_access' should be true if called from a real cpu write
> - * access: the virtual CPU will exit the current TB if code is modified inside
> - * this TB.
> - *
> - * Called with mmap_lock held for user-mode emulation, grabs tb_lock
> - * Called with tb_lock held for system-mode emulation
> - */
> -static void tb_invalidate_phys_range_1(tb_page_addr_t start, tb_page_addr_t end)
> -{
> -    tb_page_addr_t next;
> -
> -    for (next = (start & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
> -         start < end;
> -         start = next, next += TARGET_PAGE_SIZE) {
> -        tb_page_addr_t bound = MIN(next, end);
> -
> -        tb_invalidate_phys_page_range(start, bound, 0);
> -    }
> -}
> -
> -#ifdef CONFIG_SOFTMMU
> -void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
> -{
> -    assert_tb_locked();
> -    tb_invalidate_phys_range_1(start, end);
> -}
> -#else
> -void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
> -{
> -    assert_memory_lock();
> -    tb_lock();
> -    tb_invalidate_phys_range_1(start, end);
> -    tb_unlock();
> -}
> -#endif
> -/*
> - * Invalidate all TBs which intersect with the target physical address range
>   * [start;end[. NOTE: start and end must refer to the *same* physical page.
>   * 'is_cpu_write_access' should be true if called from a real cpu write
>   * access: the virtual CPU will exit the current TB if code is modified inside
> @@ -1505,6 +1467,45 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>  #endif
>  }
>
> +/*
> + * Invalidate all TBs which intersect with the target physical address range
> + * [start;end[. NOTE: start and end may refer to *different* physical pages.
> + * 'is_cpu_write_access' should be true if called from a real cpu write
> + * access: the virtual CPU will exit the current TB if code is modified inside
> + * this TB.
> + *
> + * Called with mmap_lock held for user-mode emulation, grabs tb_lock
> + * Called with tb_lock held for system-mode emulation
> + */
> +static void tb_invalidate_phys_range_1(tb_page_addr_t start, tb_page_addr_t end)
> +{
> +    tb_page_addr_t next;
> +
> +    for (next = (start & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
> +         start < end;
> +         start = next, next += TARGET_PAGE_SIZE) {
> +        tb_page_addr_t bound = MIN(next, end);
> +
> +        tb_invalidate_phys_page_range(start, bound, 0);
> +    }
> +}
> +
> +#ifdef CONFIG_SOFTMMU
> +void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
> +{
> +    assert_tb_locked();
> +    tb_invalidate_phys_range_1(start, end);
> +}
> +#else
> +void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
> +{
> +    assert_memory_lock();
> +    tb_lock();
> +    tb_invalidate_phys_range_1(start, end);
> +    tb_unlock();
> +}
> +#endif
> +
>  #ifdef CONFIG_SOFTMMU
>  /* len must be <= 8 and start must be a multiple of len.
>   * Called via softmmu_template.h when code areas are written to with


--
Alex Bennée

* Re: [Qemu-devel] [PATCH 08/16] translate-all: work page-by-page in tb_invalidate_phys_range_1
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 08/16] translate-all: work page-by-page in tb_invalidate_phys_range_1 Emilio G. Cota
  2018-02-28 22:23   ` Richard Henderson
@ 2018-03-29 10:10   ` Alex Bennée
  2018-03-29 10:17   ` Alex Bennée
  2 siblings, 0 replies; 52+ messages in thread
From: Alex Bennée @ 2018-03-29 10:10 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> So that we pass a same-page range to tb_invalidate_phys_page_range,
> instead of always passing an end address that could be on a different
> page.
>
> As discussed with Peter Maydell on the list [1], tb_invalidate_phys_page_range
> doesn't actually do much with 'end', which explains why we have never
> hit a bug despite going against what the comment on top of
> tb_invalidate_phys_page_range requires:
>
>> * Invalidate all TBs which intersect with the target physical address range
>> * [start;end[. NOTE: start and end must refer to the *same* physical page.
>
> The appended honours the comment, which avoids confusion.
>
> While at it, rework the loop into a for loop, which is less error prone
> (e.g. "continue" won't result in an infinite loop).
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg09165.html
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
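
As a worked example of the new loop shape, here is a standalone sketch
that splits an address range into chunks that never cross a page
boundary. The 4 KiB page size, the chunk struct and the function name
are assumptions made for the example and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS 12                          /* assumed 4 KiB pages */
#define PAGE_SIZE ((uint64_t)1 << PAGE_BITS)
#define PAGE_MASK (~(PAGE_SIZE - 1))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical record of one same-page sub-range [start, end). */
struct chunk {
    uint64_t start;
    uint64_t end;
};

/*
 * Split [start, end) into sub-ranges that each stay within one page.
 * Stores at most max chunks in out and returns the number of chunks
 * the range contains. Ranges reaching the very top of the address
 * space are not handled here.
 */
static size_t split_by_page(uint64_t start, uint64_t end,
                            struct chunk *out, size_t max)
{
    size_t n = 0;
    uint64_t next;

    for (next = (start & PAGE_MASK) + PAGE_SIZE;
         start < end;
         start = next, next += PAGE_SIZE) {
        uint64_t bound = MIN(next, end);

        /* first and last byte of the chunk share a page */
        assert((start >> PAGE_BITS) == ((bound - 1) >> PAGE_BITS));
        if (n < max) {
            out[n].start = start;
            out[n].end = bound;
        }
        n++;
    }
    return n;
}
```

With these definitions, the range [0x1ff0, 0x3010) yields three chunks:
[0x1ff0, 0x2000), [0x2000, 0x3000) and [0x3000, 0x3010). Because the
step lives in the for header, a "continue" in the body still advances
to the next page.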

> ---
>  accel/tcg/translate-all.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 816419a..a98e182 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1381,10 +1381,14 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>   */
>  static void tb_invalidate_phys_range_1(tb_page_addr_t start, tb_page_addr_t end)
>  {
> -    while (start < end) {
> -        tb_invalidate_phys_page_range(start, end, 0);
> -        start &= TARGET_PAGE_MASK;
> -        start += TARGET_PAGE_SIZE;
> +    tb_page_addr_t next;
> +
> +    for (next = (start & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
> +         start < end;
> +         start = next, next += TARGET_PAGE_SIZE) {
> +        tb_page_addr_t bound = MIN(next, end);
> +
> +        tb_invalidate_phys_page_range(start, bound, 0);
>      }
>  }


--
Alex Bennée

* Re: [Qemu-devel] [PATCH 06/16] translate-all: make l1_map lockless
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 06/16] translate-all: make l1_map lockless Emilio G. Cota
  2018-02-28 22:15   ` Richard Henderson
@ 2018-03-29 10:16   ` Alex Bennée
  1 sibling, 0 replies; 52+ messages in thread
From: Alex Bennée @ 2018-03-29 10:16 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Groundwork for supporting parallel TCG generation.
>
> We never remove entries from the radix tree, so we can use cmpxchg
> to implement lockless insertions.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
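
The insert-only pattern described in the commit message can be sketched
with C11 atomics as below. This is an illustration rather than the
patch's code: SLOTS and slot_get_or_alloc() are hypothetical names, and
C11's compare-exchange reports the previous value through its "expected"
argument instead of returning it as QEMU's atomic_cmpxchg() does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define SLOTS 16   /* hypothetical fan-out of one tree level */

/*
 * Return the table stored in *slot, allocating and publishing a new
 * one if the slot is still empty. Entries are never removed, so a
 * failed compare-exchange simply means another thread published its
 * table first: we free ours and adopt the winner's.
 */
static void **slot_get_or_alloc(_Atomic(void **) *slot)
{
    void **p = atomic_load_explicit(slot, memory_order_acquire);

    if (p == NULL) {
        void **expected = NULL;
        void **fresh = calloc(SLOTS, sizeof(void *));

        assert(fresh != NULL);
        if (atomic_compare_exchange_strong(slot, &expected, fresh)) {
            p = fresh;        /* we won the race */
        } else {
            free(fresh);      /* somebody else inserted first */
            p = expected;     /* now holds the published table */
        }
    }
    return p;
}
```

The losing thread pays for one wasted allocation, which should be rare.
In exchange, no lock is needed on either the lookup or the insertion
path.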

> ---
>  accel/tcg/translate-all.c       | 24 ++++++++++++++----------
>  docs/devel/multi-thread-tcg.txt |  4 ++--
>  2 files changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 06aa905..f2bfa71 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -472,20 +472,12 @@ static void page_init(void)
>  #endif
>  }
>
> -/* If alloc=1:
> - * Called with tb_lock held for system emulation.
> - * Called with mmap_lock held for user-mode emulation.
> - */
>  static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
>  {
>      PageDesc *pd;
>      void **lp;
>      int i;
>
> -    if (alloc) {
> -        assert_memory_lock();
> -    }
> -
>      /* Level 1.  Always allocated.  */
>      lp = l1_map + ((index >> v_l1_shift) & (v_l1_size - 1));
>
> @@ -494,11 +486,17 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
>          void **p = atomic_rcu_read(lp);
>
>          if (p == NULL) {
> +            void *existing;
> +
>              if (!alloc) {
>                  return NULL;
>              }
>              p = g_new0(void *, V_L2_SIZE);
> -            atomic_rcu_set(lp, p);
> +            existing = atomic_cmpxchg(lp, NULL, p);
> +            if (unlikely(existing)) {
> +                g_free(p);
> +                p = existing;
> +            }
>          }
>
>          lp = p + ((index >> (i * V_L2_BITS)) & (V_L2_SIZE - 1));
> @@ -506,11 +504,17 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
>
>      pd = atomic_rcu_read(lp);
>      if (pd == NULL) {
> +        void *existing;
> +
>          if (!alloc) {
>              return NULL;
>          }
>          pd = g_new0(PageDesc, V_L2_SIZE);
> -        atomic_rcu_set(lp, pd);
> +        existing = atomic_cmpxchg(lp, NULL, pd);
> +        if (unlikely(existing)) {
> +            g_free(pd);
> +            pd = existing;
> +        }
>      }
>
>      return pd + (index & (V_L2_SIZE - 1));
> diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt
> index a99b456..faf8918 100644
> --- a/docs/devel/multi-thread-tcg.txt
> +++ b/docs/devel/multi-thread-tcg.txt
> @@ -134,8 +134,8 @@ tb_set_jmp_target() code. Modification to the linked lists that allow
>  searching for linked pages are done under the protect of the
>  tb_lock().
>
> -The global page table is protected by the tb_lock() in system-mode and
> -mmap_lock() in linux-user mode.
> +The global page table is a lockless radix tree; cmpxchg is used
> +to atomically insert new elements.
>
>  The lookup caches are updated atomically and the lookup hash uses QHT
>  which is designed for concurrent safe lookup.


--
Alex Bennée

* Re: [Qemu-devel] [PATCH 07/16] translate-all: remove hole in PageDesc
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 07/16] translate-all: remove hole in PageDesc Emilio G. Cota
  2018-02-28 22:17   ` Richard Henderson
@ 2018-03-29 10:17   ` Alex Bennée
  1 sibling, 0 replies; 52+ messages in thread
From: Alex Bennée @ 2018-03-29 10:17 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Groundwork for supporting parallel TCG generation.
>
> Move the hole to the end of the struct, so that a u32
> field can be added there without bloating the struct.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
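
To show why the field order matters, here is a small sketch with two
hypothetical structs that hold the same members in different orders,
each with a 32-bit field appended. The member names are made up for the
example and do not mirror PageDesc exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit counter in the middle: leaves a hole before the pointer. */
struct desc_before {
    void *first;
    unsigned int write_count;
    unsigned long *bitmap;
    uint32_t lock;              /* newly added field */
};

/* 32-bit counter last: the new field can sit in the tail padding. */
struct desc_after {
    void *first;
    unsigned long *bitmap;
    unsigned int write_count;
    uint32_t lock;              /* newly added field */
};

/* Reordering must never make the struct larger. */
static_assert(sizeof(struct desc_after) <= sizeof(struct desc_before),
              "moving the hole to the end should not grow the struct");
```

On a typical LP64 target the first layout takes 32 bytes and the second
24. On targets with 4-byte pointers the two are the same size, so the
reordering costs nothing there.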

> ---
>  accel/tcg/translate-all.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index f2bfa71..816419a 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -107,8 +107,8 @@ typedef struct PageDesc {
>  #ifdef CONFIG_SOFTMMU
>      /* in order to optimize self modifying code, we count the number
>         of lookups we do to a given page to use a bitmap */
> -    unsigned int code_write_count;
>      unsigned long *code_bitmap;
> +    unsigned int code_write_count;
>  #else
>      unsigned long flags;
>  #endif


--
Alex Bennée

* Re: [Qemu-devel] [PATCH 10/16] translate-all: use per-page locking in !user-mode
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 10/16] translate-all: use per-page locking in !user-mode Emilio G. Cota
@ 2018-03-29 14:55   ` Alex Bennée
  2018-04-06  0:43     ` Emilio G. Cota
  0 siblings, 1 reply; 52+ messages in thread
From: Alex Bennée @ 2018-03-29 14:55 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Groundwork for supporting parallel TCG generation.
>
> Instead of using a global lock (tb_lock) to protect changes
> to pages, use fine-grained, per-page locks in !user-mode.
> User-mode stays with mmap_lock.
>
> Sometimes changes need to happen atomically on more than one
> page (e.g. when a TB that spans across two pages is
> added/invalidated, or when a range of pages is invalidated).
> We therefore introduce struct page_collection, which helps
> us keep track of a set of pages that have been locked in
> the appropriate locking order (i.e. by ascending page index).
>
> This commit first introduces the structs and the function helpers,
> to then convert the calling code to use per-page locking. Note
> that tb_lock is not removed yet.
>
> While at it, rename tb_alloc_page to tb_page_add, which pairs with
> tb_page_remove.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
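
The ordering rule described above can be sketched in isolation as
follows. This is only an illustration under stated assumptions: it uses
a C11 atomic_flag as a stand-in spinlock instead of QemuSpin, and
struct page and the *_pair helpers are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical page descriptor with its own spinlock. */
struct page {
    atomic_flag lock;
    size_t index;
};

static void page_lock(struct page *p)
{
    while (atomic_flag_test_and_set_explicit(&p->lock,
                                             memory_order_acquire)) {
        /* spin until the current holder releases the lock */
    }
}

static void page_unlock(struct page *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/*
 * Lock one or two pages, always taking the lower index first.
 * If every thread follows the same order, two threads that need the
 * same pair of pages cannot each hold one lock while waiting for the
 * other. Pass b == NULL when only one page is involved.
 */
static void page_lock_pair(struct page *a, struct page *b)
{
    assert(a != NULL);
    if (b == NULL || b == a) {
        page_lock(a);
    } else if (a->index < b->index) {
        page_lock(a);
        page_lock(b);
    } else {
        page_lock(b);
        page_lock(a);
    }
}

/* The unlock order does not matter for deadlock avoidance. */
static void page_unlock_pair(struct page *a, struct page *b)
{
    assert(a != NULL);
    page_unlock(a);
    if (b != NULL && b != a) {
        page_unlock(b);
    }
}
```

A pair covers the common case of a TB that spans two pages. For an
arbitrary set of pages, the same rule needs something that keeps the
set sorted by index, which is the role the commit message gives to
struct page_collection.
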
> ---
>  accel/tcg/translate-all.c | 432 +++++++++++++++++++++++++++++++++++++++++-----
>  accel/tcg/translate-all.h |   3 +
>  include/exec/exec-all.h   |   3 +-
>  3 files changed, 396 insertions(+), 42 deletions(-)
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 4cb03f1..07527d5 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -112,8 +112,55 @@ typedef struct PageDesc {
>  #else
>      unsigned long flags;
>  #endif
> +#ifndef CONFIG_USER_ONLY
> +    QemuSpin lock;
> +#endif
>  } PageDesc;
>
> +/**
> + * struct page_entry - page descriptor entry
> + * @pd:     pointer to the &struct PageDesc of the page this entry represents
> + * @index:  page index of the page
> + * @locked: whether the page is locked
> + *
> + * This struct helps us keep track of the locked state of a page, without
> + * bloating &struct PageDesc.
> + *
> + * A page lock protects accesses to all fields of &struct PageDesc.
> + *
> + * See also: &struct page_collection.
> + */
> +struct page_entry {
> +    PageDesc *pd;
> +    tb_page_addr_t index;
> +    bool locked;
> +};
> +
> +/**
> + * struct page_collection - tracks a set of pages (i.e. &struct page_entry's)
> + * @tree:   Binary search tree (BST) of the pages, with key == page index
> + * @max:    Pointer to the page in @tree with the highest page index
> + *
> + * To avoid deadlock we lock pages in ascending order of page index.
> + * When operating on a set of pages, we need to keep track of them so that
> + * we can lock them in order and also unlock them later. For this we collect
> + * pages (i.e. &struct page_entry's) in a binary search @tree. Given that the
> + * @tree implementation we use does not provide an O(1) operation to obtain the
> + * highest-ranked element, we use @max to keep track of the inserted page
> + * with the highest index. This is valuable because if a page is not in
> + * the tree and its index is higher than @max's, then we can lock it
> + * without breaking the locking order rule.
> + *
> + * Note on naming: 'struct page_set' would be shorter, but we already have a few
> + * page_set_*() helpers, so page_collection is used instead to avoid confusion.
> + *
> + * See also: page_collection_lock().
> + */
> +struct page_collection {
> +    GTree *tree;
> +    struct page_entry *max;
> +};
> +
>  /* list iterators for lists of tagged pointers in TranslationBlock */
>  #define TB_FOR_EACH_TAGGED(head, tb, n, field)                  \
>      for (n = (head) & 1,                                        \
> @@ -510,6 +557,15 @@ static PageDesc *page_find_alloc(tb_page_addr_t index, int alloc)
>              return NULL;
>          }
>          pd = g_new0(PageDesc, V_L2_SIZE);
> +#ifndef CONFIG_USER_ONLY
> +        {
> +            int i;
> +
> +            for (i = 0; i < V_L2_SIZE; i++) {
> +                qemu_spin_init(&pd[i].lock);
> +            }
> +        }
> +#endif
>          existing = atomic_cmpxchg(lp, NULL, pd);
>          if (unlikely(existing)) {
>              g_free(pd);
> @@ -525,6 +581,228 @@ static inline PageDesc *page_find(tb_page_addr_t index)
>      return page_find_alloc(index, 0);
>  }
>
> +/* In user-mode page locks aren't used; mmap_lock is enough */
> +#ifdef CONFIG_USER_ONLY
> +static inline void page_lock(PageDesc *pd)
> +{ }
> +
> +static inline void page_unlock(PageDesc *pd)
> +{ }
> +
> +static inline void page_lock_tb(const TranslationBlock *tb)
> +{ }
> +
> +static inline void page_unlock_tb(const TranslationBlock *tb)
> +{ }
> +
> +struct page_collection *
> +page_collection_lock(tb_page_addr_t start, tb_page_addr_t end)
> +{
> +    return NULL;
> +}
> +
> +void page_collection_unlock(struct page_collection *set)
> +{ }
> +#else /* !CONFIG_USER_ONLY */
> +
> +static inline void page_lock(PageDesc *pd)
> +{
> +    qemu_spin_lock(&pd->lock);
> +}
> +
> +static inline void page_unlock(PageDesc *pd)
> +{
> +    qemu_spin_unlock(&pd->lock);
> +}
> +
> +/* lock the page(s) of a TB in the correct acquisition order */
> +static inline void page_lock_tb(const TranslationBlock *tb)
> +{
> +    if (likely(tb->page_addr[1] == -1)) {
> +        page_lock(page_find(tb->page_addr[0] >> TARGET_PAGE_BITS));
> +        return;
> +    }
> +    if (tb->page_addr[0] < tb->page_addr[1]) {
> +        page_lock(page_find(tb->page_addr[0] >> TARGET_PAGE_BITS));
> +        page_lock(page_find(tb->page_addr[1] >> TARGET_PAGE_BITS));
> +    } else {
> +        page_lock(page_find(tb->page_addr[1] >> TARGET_PAGE_BITS));
> +        page_lock(page_find(tb->page_addr[0] >> TARGET_PAGE_BITS));
> +    }
> +}
> +
> +static inline void page_unlock_tb(const TranslationBlock *tb)
> +{
> +    page_unlock(page_find(tb->page_addr[0] >> TARGET_PAGE_BITS));
> +    if (unlikely(tb->page_addr[1] != -1)) {
> +        page_unlock(page_find(tb->page_addr[1] >> TARGET_PAGE_BITS));
> +    }
> +}
> +
> +static inline struct page_entry *
> +page_entry_new(PageDesc *pd, tb_page_addr_t index)
> +{
> +    struct page_entry *pe = g_malloc(sizeof(*pe));
> +
> +    pe->index = index;
> +    pe->pd = pd;
> +    pe->locked = false;
> +    return pe;
> +}
> +
> +static void page_entry_destroy(gpointer p)
> +{
> +    struct page_entry *pe = p;
> +
> +    g_assert(pe->locked);
> +    page_unlock(pe->pd);
> +    g_free(pe);
> +}
> +
> +/* returns false on success */
> +static bool page_entry_trylock(struct page_entry *pe)
> +{
> +    bool busy;
> +
> +    busy = qemu_spin_trylock(&pe->pd->lock);
> +    if (!busy) {
> +        g_assert(!pe->locked);
> +        pe->locked = true;
> +    }
> +    return busy;
> +}
> +
> +static void do_page_entry_lock(struct page_entry *pe)
> +{
> +    page_lock(pe->pd);
> +    g_assert(!pe->locked);
> +    pe->locked = true;
> +}
> +
> +static gboolean page_entry_lock(gpointer key, gpointer value, gpointer data)
> +{
> +    struct page_entry *pe = value;
> +
> +    do_page_entry_lock(pe);
> +    return FALSE;
> +}
> +
> +static gboolean page_entry_unlock(gpointer key, gpointer value, gpointer data)
> +{
> +    struct page_entry *pe = value;
> +
> +    if (pe->locked) {
> +        pe->locked = false;
> +        page_unlock(pe->pd);
> +    }
> +    return FALSE;
> +}
> +
> +/*
> + * Trylock a page, and if successful, add the page to a collection.
> + * Returns true ("busy") if the page could not be locked; false otherwise.
> + */
> +static bool page_trylock_add(struct page_collection *set, tb_page_addr_t addr)
> +{
> +    tb_page_addr_t index = addr >> TARGET_PAGE_BITS;
> +    struct page_entry *pe;
> +    PageDesc *pd;
> +
> +    pe = g_tree_lookup(set->tree, &index);
> +    if (pe) {
> +        return false;
> +    }
> +
> +    pd = page_find(index);
> +    if (pd == NULL) {
> +        return false;
> +    }
> +
> +    pe = page_entry_new(pd, index);
> +    g_tree_insert(set->tree, &pe->index, pe);
> +
> +    /*
> +     * If this is either (1) the first insertion or (2) a page whose index
> +     * is higher than any other so far, just lock the page and move on.
> +     */
> +    if (set->max == NULL || pe->index > set->max->index) {
> +        set->max = pe;
> +        do_page_entry_lock(pe);
> +        return false;
> +    }
> +    /*
> +     * Try to acquire out-of-order lock; if busy, return busy so that we acquire
> +     * locks in order.
> +     */
> +    return page_entry_trylock(pe);
> +}
> +
> +static gint tb_page_addr_cmp(gconstpointer ap, gconstpointer bp, gpointer udata)
> +{
> +    tb_page_addr_t a = *(const tb_page_addr_t *)ap;
> +    tb_page_addr_t b = *(const tb_page_addr_t *)bp;
> +
> +    if (a == b) {
> +        return 0;
> +    } else if (a < b) {
> +        return -1;
> +    }
> +    return 1;
> +}
> +
> +/*
> + * Lock a range of pages ([@start,@end[) as well as the pages of all
> + * intersecting TBs.
> + * Locking order: acquire locks in ascending order of page index.
> + */
> +struct page_collection *
> +page_collection_lock(tb_page_addr_t start, tb_page_addr_t end)
> +{
> +    struct page_collection *set = g_malloc(sizeof(*set));
> +    tb_page_addr_t index;
> +    PageDesc *pd;
> +
> +    start >>= TARGET_PAGE_BITS;
> +    end   >>= TARGET_PAGE_BITS;
> +    g_assert(start <= end);
> +
> +    set->tree = g_tree_new_full(tb_page_addr_cmp, NULL, NULL,
> +                                page_entry_destroy);
> +    set->max = NULL;
> +
> + retry:
> +    g_tree_foreach(set->tree, page_entry_lock, NULL);
> +
> +    for (index = start; index <= end; index++) {
> +        TranslationBlock *tb;
> +        int n;
> +
> +        pd = page_find(index);
> +        if (pd == NULL) {
> +            continue;
> +        }
> +        PAGE_FOR_EACH_TB(pd, tb, n) {
> +            if (page_trylock_add(set, tb->page_addr[0]) ||
> +                (tb->page_addr[1] != -1 &&
> +                 page_trylock_add(set, tb->page_addr[1]))) {
> +                /* drop all locks, and reacquire in order */
> +                g_tree_foreach(set->tree, page_entry_unlock, NULL);
> +                goto retry;
> +            }
> +        }
> +    }
> +    return set;
> +}
> +
> +void page_collection_unlock(struct page_collection *set)
> +{
> +    /* entries are unlocked and freed via page_entry_destroy */
> +    g_tree_destroy(set->tree);
> +    g_free(set);
> +}
> +
> +#endif /* !CONFIG_USER_ONLY */
> +
>  #if defined(CONFIG_USER_ONLY)
>  /* Currently it is not recommended to allocate big chunks of data in
>     user mode. It will change when a dedicated libc will be used.  */
> @@ -813,6 +1091,7 @@ static TranslationBlock *tb_alloc(target_ulong pc)
>      return tb;
>  }
>
> +/* call with @p->lock held */
>  static inline void invalidate_page_bitmap(PageDesc *p)
>  {
>  #ifdef CONFIG_SOFTMMU
> @@ -834,8 +1113,10 @@ static void page_flush_tb_1(int level, void **lp)
>          PageDesc *pd = *lp;
>
>          for (i = 0; i < V_L2_SIZE; ++i) {
> +            page_lock(&pd[i]);
>              pd[i].first_tb = (uintptr_t)NULL;
>              invalidate_page_bitmap(pd + i);
> +            page_unlock(&pd[i]);
>          }
>      } else {
>          void **pp = *lp;
> @@ -962,6 +1243,7 @@ static void tb_page_check(void)
>
>  #endif /* CONFIG_USER_ONLY */
>
> +/* call with @pd->lock held */
>  static inline void tb_page_remove(PageDesc *pd, TranslationBlock *tb)
>  {
>      TranslationBlock *tb1;
> @@ -1038,11 +1320,8 @@ static inline void tb_jmp_unlink(TranslationBlock *tb)
>      }
>  }
>
> -/* invalidate one TB
> - *
> - * Called with tb_lock held.
> - */
> -void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
> +/* If @rm_from_page_list is set, call with the TB's pages' locks held */
> +static void do_tb_phys_invalidate(TranslationBlock *tb, bool rm_from_page_list)
>  {
>      CPUState *cpu;
>      PageDesc *p;
> @@ -1062,15 +1341,15 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>      }
>
>      /* remove the TB from the page list */
> -    if (tb->page_addr[0] != page_addr) {
> +    if (rm_from_page_list) {
>          p = page_find(tb->page_addr[0] >> TARGET_PAGE_BITS);
>          tb_page_remove(p, tb);
>          invalidate_page_bitmap(p);
> -    }
> -    if (tb->page_addr[1] != -1 && tb->page_addr[1] != page_addr) {
> -        p = page_find(tb->page_addr[1] >> TARGET_PAGE_BITS);
> -        tb_page_remove(p, tb);
> -        invalidate_page_bitmap(p);
> +        if (tb->page_addr[1] != -1) {
> +            p = page_find(tb->page_addr[1] >> TARGET_PAGE_BITS);
> +            tb_page_remove(p, tb);
> +            invalidate_page_bitmap(p);
> +        }
>      }
>
>      /* remove the TB from the hash list */
> @@ -1092,7 +1371,28 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>                 tcg_ctx->tb_phys_invalidate_count + 1);
>  }
>
> +static void tb_phys_invalidate__locked(TranslationBlock *tb)
> +{
> +    do_tb_phys_invalidate(tb, true);
> +}
> +
> +/* invalidate one TB
> + *
> + * Called with tb_lock held.
> + */
> +void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
> +{
> +    if (page_addr == -1) {
> +        page_lock_tb(tb);
> +        do_tb_phys_invalidate(tb, true);
> +        page_unlock_tb(tb);
> +    } else {
> +        do_tb_phys_invalidate(tb, false);
> +    }
> +}
> +
>  #ifdef CONFIG_SOFTMMU
> +/* call with @p->lock held */
>  static void build_page_bitmap(PageDesc *p)
>  {
>      int n, tb_start, tb_end;
> @@ -1122,11 +1422,11 @@ static void build_page_bitmap(PageDesc *p)
>  /* add the tb in the target page and protect it if necessary
>   *
>   * Called with mmap_lock held for user-mode emulation.
> + * Called with @p->lock held.
>   */
> -static inline void tb_alloc_page(TranslationBlock *tb,
> -                                 unsigned int n, tb_page_addr_t page_addr)
> +static inline void tb_page_add(PageDesc *p, TranslationBlock *tb,
> +                               unsigned int n, tb_page_addr_t page_addr)
>  {
> -    PageDesc *p;
>  #ifndef CONFIG_USER_ONLY
>      bool page_already_protected;
>  #endif
> @@ -1134,7 +1434,6 @@ static inline void tb_alloc_page(TranslationBlock *tb,
>      assert_memory_lock();
>
>      tb->page_addr[n] = page_addr;
> -    p = page_find_alloc(page_addr >> TARGET_PAGE_BITS, 1);
>      tb->page_next[n] = p->first_tb;
>  #ifndef CONFIG_USER_ONLY
>      page_already_protected = p->first_tb != (uintptr_t)NULL;
> @@ -1186,17 +1485,38 @@ static inline void tb_alloc_page(TranslationBlock *tb,
>  static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>                           tb_page_addr_t phys_page2)
>  {
> +    PageDesc *p;
> +    PageDesc *p2 = NULL;
>      uint32_t h;
>
>      assert_memory_lock();
>
> -    /* add in the page list */
> -    tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);
> -    if (phys_page2 != -1) {
> -        tb_alloc_page(tb, 1, phys_page2);
> -    } else {
> +    /*
> +     * Add the TB to the page list.
> +     * To avoid deadlock, acquire first the lock of the lower-addressed page.
> +     */
> +    p = page_find_alloc(phys_pc >> TARGET_PAGE_BITS, 1);
> +    if (likely(phys_page2 == -1)) {
>          tb->page_addr[1] = -1;
> +        page_lock(p);
> +        tb_page_add(p, tb, 0, phys_pc & TARGET_PAGE_MASK);
> +    } else {
> +        p2 = page_find_alloc(phys_page2 >> TARGET_PAGE_BITS, 1);
> +        if (phys_pc < phys_page2) {
> +            page_lock(p);
> +            page_lock(p2);
> +        } else {
> +            page_lock(p2);
> +            page_lock(p);
> +        }

Given that we repeat this check further up, perhaps add a helper:

  page_lock_pair(PageDesc *p1, tb_page_addr_t phys1, PageDesc *p2, tb_page_addr_t phys2)
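
As a rough sketch of what I mean (stand-in types and a fake lock so it
compiles on its own; the NULL convention for @p2 and the lock_seq field
are my assumptions, not anything in the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the QEMU types; assumptions for illustration only. */
typedef uint64_t tb_page_addr_t;

typedef struct PageDesc {
    bool locked;        /* placeholder for the real spinlock */
    unsigned lock_seq;  /* records acquisition order, demo only */
} PageDesc;

static unsigned lock_counter;

static void page_lock(PageDesc *pd)
{
    assert(!pd->locked);
    pd->locked = true;
    pd->lock_seq = ++lock_counter;
}

static void page_unlock(PageDesc *pd)
{
    assert(pd->locked);
    pd->locked = false;
}

/*
 * Hypothetical helper: lock one or two pages, lower address first.
 * @p2 may be NULL when the TB fits in a single page.
 */
static void page_lock_pair(PageDesc *p1, tb_page_addr_t phys1,
                           PageDesc *p2, tb_page_addr_t phys2)
{
    if (p2 == NULL) {
        page_lock(p1);
    } else if (phys1 < phys2) {
        page_lock(p1);
        page_lock(p2);
    } else {
        page_lock(p2);
        page_lock(p1);
    }
}
```

Both page_lock_tb() and tb_link_page() could then call the same helper,
so the ordering rule lives in one place.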


> +        tb_page_add(p, tb, 0, phys_pc & TARGET_PAGE_MASK);
> +        tb_page_add(p2, tb, 1, phys_page2);
> +    }
> +
> +    if (p2) {
> +        page_unlock(p2);
>      }
> +    page_unlock(p);
>
>      /* add in the hash table */
>      h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->cflags & CF_HASH_MASK,
> @@ -1370,21 +1690,17 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>  }
>
>  /*
> - * Invalidate all TBs which intersect with the target physical address range
> - * [start;end[. NOTE: start and end must refer to the *same* physical page.
> - * 'is_cpu_write_access' should be true if called from a real cpu write
> - * access: the virtual CPU will exit the current TB if code is modified inside
> - * this TB.
> - *
> - * Called with tb_lock/mmap_lock held for user-mode emulation
> - * Called with tb_lock held for system-mode emulation
> + * Call with all @pages locked.
> + * @p must be non-NULL.
>   */
> -void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
> -                                   int is_cpu_write_access)
> +static void
> +tb_invalidate_phys_page_range__locked(struct page_collection *pages,
> +                                      PageDesc *p, tb_page_addr_t start,
> +                                      tb_page_addr_t end,
> +                                      int is_cpu_write_access)
>  {
>      TranslationBlock *tb;
>      tb_page_addr_t tb_start, tb_end;
> -    PageDesc *p;
>      int n;
>  #ifdef TARGET_HAS_PRECISE_SMC
>      CPUState *cpu = current_cpu;
> @@ -1400,10 +1716,6 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>      assert_memory_lock();
>      assert_tb_locked();
>
> -    p = page_find(start >> TARGET_PAGE_BITS);
> -    if (!p) {
> -        return;
> -    }
>  #if defined(TARGET_HAS_PRECISE_SMC)
>      if (cpu != NULL) {
>          env = cpu->env_ptr;
> @@ -1448,7 +1760,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>                                       &current_flags);
>              }
>  #endif /* TARGET_HAS_PRECISE_SMC */
> -            tb_phys_invalidate(tb, -1);
> +            tb_phys_invalidate__locked(tb);
>          }
>      }
>  #if !defined(CONFIG_USER_ONLY)
> @@ -1460,6 +1772,7 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>  #endif
>  #ifdef TARGET_HAS_PRECISE_SMC
>      if (current_tb_modified) {
> +        page_collection_unlock(pages);
>          /* Force execution of one insn next time.  */
>          cpu->cflags_next_tb = 1 | curr_cflags();
>          cpu_loop_exit_noexc(cpu);
> @@ -1469,6 +1782,35 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>
>  /*
>   * Invalidate all TBs which intersect with the target physical address range
> + * [start;end[. NOTE: start and end must refer to the *same* physical page.
> + * 'is_cpu_write_access' should be true if called from a real cpu write
> + * access: the virtual CPU will exit the current TB if code is modified inside
> + * this TB.
> + *
> + * Called with tb_lock/mmap_lock held for user-mode emulation
> + * Called with tb_lock held for system-mode emulation
> + */
> +void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
> +                                   int is_cpu_write_access)
> +{
> +    struct page_collection *pages;
> +    PageDesc *p;
> +
> +    assert_memory_lock();
> +    assert_tb_locked();
> +
> +    p = page_find(start >> TARGET_PAGE_BITS);
> +    if (p == NULL) {
> +        return;
> +    }
> +    pages = page_collection_lock(start, end);
> +    tb_invalidate_phys_page_range__locked(pages, p, start, end,
> +                                          is_cpu_write_access);
> +    page_collection_unlock(pages);
> +}
> +
> +/*
> + * Invalidate all TBs which intersect with the target physical address range
>   * [start;end[. NOTE: start and end may refer to *different* physical pages.
>   * 'is_cpu_write_access' should be true if called from a real cpu write
>   * access: the virtual CPU will exit the current TB if code is modified inside
> @@ -1479,15 +1821,22 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>   */
>  static void tb_invalidate_phys_range_1(tb_page_addr_t start, tb_page_addr_t end)
>  {
> +    struct page_collection *pages;
>      tb_page_addr_t next;
>
> +    pages = page_collection_lock(start, end);
>      for (next = (start & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
>           start < end;
>           start = next, next += TARGET_PAGE_SIZE) {
> +        PageDesc *pd = page_find(start >> TARGET_PAGE_BITS);
>          tb_page_addr_t bound = MIN(next, end);
>
> -        tb_invalidate_phys_page_range(start, bound, 0);
> +        if (pd == NULL) {
> +            continue;
> +        }
> +        tb_invalidate_phys_page_range__locked(pages, pd, start, bound, 0);
>      }
> +    page_collection_unlock(pages);
>  }
>
>  #ifdef CONFIG_SOFTMMU
> @@ -1513,6 +1862,7 @@ void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
>   */
>  void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
>  {
> +    struct page_collection *pages;
>      PageDesc *p;
>
>  #if 0
> @@ -1530,11 +1880,10 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
>      if (!p) {
>          return;
>      }
> +
> +    pages = page_collection_lock(start, start + len);
>      if (!p->code_bitmap &&
>          ++p->code_write_count >= SMC_BITMAP_USE_THRESHOLD) {
> -        /* build code bitmap.  FIXME: writes should be protected by
> -         * tb_lock, reads by tb_lock or RCU.
> -         */
>          build_page_bitmap(p);
>      }
>      if (p->code_bitmap) {
> @@ -1548,8 +1897,9 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
>          }
>      } else {
>      do_invalidate:
> -        tb_invalidate_phys_page_range(start, start + len, 1);
> +        tb_invalidate_phys_page_range__locked(pages, p, start, start + len, 1);
>      }
> +    page_collection_unlock(pages);
>  }
>  #else
>  /* Called with mmap_lock held. If pc is not 0 then it indicates the
> diff --git a/accel/tcg/translate-all.h b/accel/tcg/translate-all.h
> index ba8e4d6..6d1d258 100644
> --- a/accel/tcg/translate-all.h
> +++ b/accel/tcg/translate-all.h
> @@ -23,6 +23,9 @@
>
>
>  /* translate-all.c */
> +struct page_collection *page_collection_lock(tb_page_addr_t start,
> +                                             tb_page_addr_t end);
> +void page_collection_unlock(struct page_collection *set);
>  void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len);
>  void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>                                     int is_cpu_write_access);
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index 5f7e65a..aeaa127 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -355,7 +355,8 @@ struct TranslationBlock {
>      /* original tb when cflags has CF_NOCACHE */
>      struct TranslationBlock *orig_tb;
>      /* first and second physical page containing code. The lower bit
> -       of the pointer tells the index in page_next[] */
> +       of the pointer tells the index in page_next[].
> +       The list is protected by the TB's page('s) lock(s) */
>      uintptr_t page_next[2];
>      tb_page_addr_t page_addr[2];

The diff is a little messy around tb_page_add, but I think we need an
assert_page_lock(p), which compiles to a check of mmap_lock for
CONFIG_USER_ONLY, instead of the assert_memory_lock().

Then we can be clear about tb, memory and page locks.
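
To illustrate, here is a minimal sketch of a per-thread ownership check,
assuming a small fixed-size thread-local array as the bookkeeping. The
held_pages array, page_locked_by_self() and the empty PageDesc stand-in
are hypothetical; the real lock/unlock calls are only indicated in
comments:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the real PageDesc; only its identity matters here. */
typedef struct PageDesc {
    int unused;
} PageDesc;

/* Hypothetical debug bookkeeping: pages locked by the current thread. */
#define MAX_HELD_PAGES 8
static _Thread_local const PageDesc *held_pages[MAX_HELD_PAGES];
static _Thread_local size_t n_held_pages;

static bool page_locked_by_self(const PageDesc *pd)
{
    for (size_t i = 0; i < n_held_pages; i++) {
        if (held_pages[i] == pd) {
            return true;
        }
    }
    return false;
}

static void page_lock(PageDesc *pd)
{
    /* the real code would take pd->lock here */
    assert(n_held_pages < MAX_HELD_PAGES);
    held_pages[n_held_pages++] = pd;
}

static void page_unlock(PageDesc *pd)
{
    for (size_t i = 0; i < n_held_pages; i++) {
        if (held_pages[i] == pd) {
            /* swap-remove; the real code would release pd->lock here */
            held_pages[i] = held_pages[--n_held_pages];
            return;
        }
    }
    assert(false && "unlocking a page this thread does not hold");
}

/*
 * In a user-mode build this would instead assert that mmap_lock is
 * held (e.g. via have_mmap_lock()); in a non-debug build it would
 * compile to nothing.
 */
#define assert_page_lock(pd) assert(page_locked_by_self(pd))
```

The bookkeeping is per thread on purpose: a flag stored in the PageDesc
itself could not tell "locked by me" from "locked by someone else".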

--
Alex Bennée

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 11/16] translate-all: add page_collection assertions
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 11/16] translate-all: add page_collection assertions Emilio G. Cota
@ 2018-03-29 15:08   ` Alex Bennée
  0 siblings, 0 replies; 52+ messages in thread
From: Alex Bennée @ 2018-03-29 15:08 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> The appended adds assertions to make sure we do not longjmp with page
> locks held. Some notes:
>
> - user-mode has nothing to check, since page_locks are !user-mode only.
>
> - The checks only apply to page collections, since these have relatively
>   complex callers.
>
> - Some simple page_lock/unlock callers have been left unchecked --
>   namely page_lock_tb, tb_phys_invalidate and tb_link_page.

As mentioned in the previous email, I think there is a need for
assert_page_locked(), at least for places currently still using
assert_memory_lock(). It could certainly be a DEBUG_TCG-only check,
though.

Otherwise:

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/cpu-exec.c      |  1 +
>  accel/tcg/translate-all.c | 22 ++++++++++++++++++++++
>  include/exec/exec-all.h   |  8 ++++++++
>  3 files changed, 31 insertions(+)
>
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index 8c68727..7c83887 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -271,6 +271,7 @@ void cpu_exec_step_atomic(CPUState *cpu)
>          tcg_debug_assert(!have_mmap_lock());
>  #endif
>          tb_lock_reset();
> +        assert_page_collection_locked(false);
>      }
>
>      if (in_exclusive_region) {
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 07527d5..82832ef 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -605,6 +605,24 @@ void page_collection_unlock(struct page_collection *set)
>  { }
>  #else /* !CONFIG_USER_ONLY */
>
> +#ifdef CONFIG_DEBUG_TCG
> +static __thread bool page_collection_locked;
> +
> +void assert_page_collection_locked(bool val)
> +{
> +    tcg_debug_assert(page_collection_locked == val);
> +}
> +
> +static inline void set_page_collection_locked(bool val)
> +{
> +    page_collection_locked = val;
> +}
> +#else
> +static inline void set_page_collection_locked(bool val)
> +{
> +}
> +#endif /* !CONFIG_DEBUG_TCG */
> +
>  static inline void page_lock(PageDesc *pd)
>  {
>      qemu_spin_lock(&pd->lock);
> @@ -677,6 +695,7 @@ static void do_page_entry_lock(struct page_entry *pe)
>      page_lock(pe->pd);
>      g_assert(!pe->locked);
>      pe->locked = true;
> +    set_page_collection_locked(true);
>  }
>
>  static gboolean page_entry_lock(gpointer key, gpointer value, gpointer data)
> @@ -769,6 +788,7 @@ page_collection_lock(tb_page_addr_t start, tb_page_addr_t end)
>      set->tree = g_tree_new_full(tb_page_addr_cmp, NULL, NULL,
>                                  page_entry_destroy);
>      set->max = NULL;
> +    assert_page_collection_locked(false);
>
>   retry:
>      g_tree_foreach(set->tree, page_entry_lock, NULL);
> @@ -787,6 +807,7 @@ page_collection_lock(tb_page_addr_t start, tb_page_addr_t end)
>                   page_trylock_add(set, tb->page_addr[1]))) {
>                  /* drop all locks, and reacquire in order */
>                  g_tree_foreach(set->tree, page_entry_unlock, NULL);
> +                set_page_collection_locked(false);
>                  goto retry;
>              }
>          }
> @@ -799,6 +820,7 @@ void page_collection_unlock(struct page_collection *set)
>      /* entries are unlocked and freed via page_entry_destroy */
>      g_tree_destroy(set->tree);
>      g_free(set);
> +    set_page_collection_locked(false);
>  }
>
>  #endif /* !CONFIG_USER_ONLY */
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index aeaa127..7911e69 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -431,6 +431,14 @@ void tb_lock(void);
>  void tb_unlock(void);
>  void tb_lock_reset(void);
>
> +#if !defined(CONFIG_USER_ONLY) && defined(CONFIG_DEBUG_TCG)
> +void assert_page_collection_locked(bool val);
> +#else
> +static inline void assert_page_collection_locked(bool val)
> +{
> +}
> +#endif
> +
>  #if !defined(CONFIG_USER_ONLY)
>
>  struct MemoryRegion *iotlb_to_region(CPUState *cpu,


--
Alex Bennée

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 12/16] translate-all: discard TB when tb_link_page returns an existing matching TB
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 12/16] translate-all: discard TB when tb_link_page returns an existing matching TB Emilio G. Cota
@ 2018-03-29 15:19   ` Alex Bennée
  2018-04-06  1:23     ` Emilio G. Cota
  0 siblings, 1 reply; 52+ messages in thread
From: Alex Bennée @ 2018-03-29 15:19 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Use the recently-gained QHT feature of returning the matching TB if it
> already exists. This allows us to get rid of the lookup we perform
> right after acquiring tb_lock.
>
> Suggested-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/cpu-exec.c      | 14 ++------------
>  accel/tcg/translate-all.c | 47 ++++++++++++++++++++++++++++++++++++++---------
>  2 files changed, 40 insertions(+), 21 deletions(-)
>
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index 7c83887..8aed38c 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -243,10 +243,7 @@ void cpu_exec_step_atomic(CPUState *cpu)
>          if (tb == NULL) {
>              mmap_lock();
>              tb_lock();
> -            tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
> -            if (likely(tb == NULL)) {
> -                tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
> -            }
> +            tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);

tb_gen_code needs to be renamed to reflect its semantics:
tb_get_or_gen_code? Or maybe tb_get_code with a sub-helper to do the
generation.
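
Roughly the split I have in mind, as a toy sketch. The table, the pool
and all the names (tb_do_gen_code, table_insert, translations) are made
up for illustration and are not QEMU's; the pool rewind only stands in
for resetting code_gen_ptr:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct TB {
    uint64_t pc;
} TB;

#define MAX_TBS 16
static TB pool[MAX_TBS];        /* stand-in for the code buffer */
static size_t pool_used;
static TB *entries[MAX_TBS];    /* stand-in for the hash table */
static size_t n_entries;
static int translations;        /* demo only: counts generation calls */

/* Insert @tb unless a TB with the same pc exists; return that one or NULL. */
static TB *table_insert(TB *tb)
{
    for (size_t i = 0; i < n_entries; i++) {
        if (entries[i]->pc == tb->pc) {
            return entries[i];
        }
    }
    assert(n_entries < MAX_TBS);
    entries[n_entries++] = tb;
    return NULL;
}

/* Sub-helper: always translates, never looks anything up. */
static TB *tb_do_gen_code(uint64_t pc)
{
    assert(pool_used < MAX_TBS);
    TB *tb = &pool[pool_used++];
    tb->pc = pc;
    translations++;
    return tb;
}

/* Returns the TB for @pc, which may not be the one we just generated. */
static TB *tb_get_code(uint64_t pc)
{
    TB *tb = tb_do_gen_code(pc);
    TB *existing = table_insert(tb);

    if (existing) {
        pool_used--;            /* discard what we just translated */
        return existing;
    }
    return tb;
}
```

With that naming the caller can see from the signature that the returned
TB is "the TB for this pc", not necessarily the fresh translation.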

>              tb_unlock();
>              mmap_unlock();
>          }
> @@ -396,14 +393,7 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
>          tb_lock();
>          acquired_tb_lock = true;
>
> -        /* There's a chance that our desired tb has been translated while
> -         * taking the locks so we check again inside the lock.
> -         */
> -        tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
> -        if (likely(tb == NULL)) {
> -            /* if no translated code available, then translate it now */
> -            tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
> -        }
> +        tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
>
>          mmap_unlock();
>          /* We add the TB in the virtual pc hash table for the fast lookup */
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 82832ef..dbe6c12 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1503,12 +1503,16 @@ static inline void tb_page_add(PageDesc *p, TranslationBlock *tb,
>   * (-1) to indicate that only one page contains the TB.
>   *
>   * Called with mmap_lock held for user-mode emulation.
> + *
> + * Returns @tb or an existing TB that matches @tb.

That's just confusing to read. So this returns a TB like the @tb we
passed in but actually a different one matching the same conditions?

>   */
> -static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
> -                         tb_page_addr_t phys_page2)
> +static TranslationBlock *
> +tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
> +             tb_page_addr_t phys_page2)
>  {
>      PageDesc *p;
>      PageDesc *p2 = NULL;
> +    void *existing_tb;
>      uint32_t h;
>
>      assert_memory_lock();
> @@ -1516,6 +1520,11 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>      /*
>       * Add the TB to the page list.
>       * To avoid deadlock, acquire first the lock of the lower-addressed page.
> +     * We keep the locks held until after inserting the TB in the hash table,
> +     * so that if the insertion fails we know for sure that the TBs are still
> +     * in the page descriptors.
> +     * Note that inserting into the hash table first isn't an option, since
> +     * we can only insert TBs that are fully initialized.
>       */
>      p = page_find_alloc(phys_pc >> TARGET_PAGE_BITS, 1);
>      if (likely(phys_page2 == -1)) {
> @@ -1535,21 +1544,33 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>          tb_page_add(p2, tb, 1, phys_page2);
>      }
>
> +    /* add in the hash table */
> +    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->cflags & CF_HASH_MASK,
> +                     tb->trace_vcpu_dstate);
> +    existing_tb = qht_insert(&tb_ctx.htable, tb, h);

Fine, modulo my comments about the qht_insert API earlier in the series.

> +
> +    /* remove TB from the page(s) if we couldn't insert it */
> +    if (unlikely(existing_tb)) {
> +        tb_page_remove(p, tb);
> +        invalidate_page_bitmap(p);
> +        if (p2) {
> +            tb_page_remove(p2, tb);
> +            invalidate_page_bitmap(p2);
> +        }
> +        tb = existing_tb;
> +    }
> +
>      if (p2) {
>          page_unlock(p2);
>      }
>      page_unlock(p);
>
> -    /* add in the hash table */
> -    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->cflags & CF_HASH_MASK,
> -                     tb->trace_vcpu_dstate);
> -    qht_insert(&tb_ctx.htable, tb, h);
> -
>  #ifdef CONFIG_USER_ONLY
>      if (DEBUG_TB_CHECK_GATE) {
>          tb_page_check();
>      }
>  #endif
> +    return tb;
>  }
>
>  /* Called with mmap_lock held for user mode emulation.  */
> @@ -1558,7 +1579,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>                                uint32_t flags, int cflags)
>  {
>      CPUArchState *env = cpu->env_ptr;
> -    TranslationBlock *tb;
> +    TranslationBlock *tb, *existing_tb;
>      tb_page_addr_t phys_pc, phys_page2;
>      target_ulong virt_page2;
>      tcg_insn_unit *gen_code_buf;
> @@ -1706,7 +1727,15 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>       * memory barrier is required before tb_link_page() makes the TB visible
>       * through the physical hash table and physical page list.
>       */
> -    tb_link_page(tb, phys_pc, phys_page2);
> +    existing_tb = tb_link_page(tb, phys_pc, phys_page2);
> +    /* if the TB already exists, discard what we just translated */

So are we in the position now that we could potentially do a translation
but be beaten by another thread generating the same code? I suspect we could
do with a bit of explanatory commentary for the tb_gen_code functions.

Also I think the "Translation Blocks" section needs updating in the
MTTCG design document to make this clear.

I'm curious whether we should be counting unused translations somewhere in
the JIT stats. I'm guessing you need to construct a pathological case to
hit this much?
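
For the stats idea, something as small as this would do. It is a sketch
using plain C11 atomics on a single slot rather than QHT, and
tb_discard_count and publish_or_discard are hypothetical names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct TB {
    unsigned long pc;
} TB;

/* One slot standing in for the hash table bucket of a given pc. */
static _Atomic(TB *) slot;

/* Hypothetical JIT statistic: translations thrown away after a lost race. */
static atomic_ulong tb_discard_count;

/*
 * Try to publish @tb. If another thread published first, count the
 * wasted translation and return the winner instead.
 */
static TB *publish_or_discard(TB *tb)
{
    TB *expected = NULL;

    if (atomic_compare_exchange_strong(&slot, &expected, tb)) {
        return tb;
    }
    atomic_fetch_add(&tb_discard_count, 1);
    return expected;
}
```

A counter like this would at least tell us whether the lost-race path is
ever hit outside of contrived workloads.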

> +    if (unlikely(existing_tb != tb)) {
> +        uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
> +
> +        orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
> +        atomic_set(&tcg_ctx->code_gen_ptr, orig_aligned);
> +        return existing_tb;
> +    }
>      tcg_tb_insert(tb);
>      return tb;
>  }


--
Alex Bennée

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [PATCH 14/16] cputlb: remove tb_lock from tlb_flush functions
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 14/16] cputlb: remove tb_lock from tlb_flush functions Emilio G. Cota
@ 2018-03-29 15:46   ` Alex Bennée
  0 siblings, 0 replies; 52+ messages in thread
From: Alex Bennée @ 2018-03-29 15:46 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> The acquisition of tb_lock was added when the async tlb_flush
> was introduced in e3b9ca810 ("cputlb: introduce tlb_flush_* async work.")
>
> tb_lock was there to allow us to do memset() on the tb_jmp_cache's.
> However, since f3ced3c5928 ("tcg: consistently access cpu->tb_jmp_cache
> atomically") all accesses to tb_jmp_cache are atomic, so tb_lock
> is not needed here. Get rid of it.

\o/

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/cputlb.c | 8 --------
>  1 file changed, 8 deletions(-)
>
> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> index 0543903..f5c3a09 100644
> --- a/accel/tcg/cputlb.c
> +++ b/accel/tcg/cputlb.c
> @@ -125,8 +125,6 @@ static void tlb_flush_nocheck(CPUState *cpu)
>      atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);
>      tlb_debug("(count: %zu)\n", tlb_flush_count());
>
> -    tb_lock();
> -
>      memset(env->tlb_table, -1, sizeof(env->tlb_table));
>      memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
>      cpu_tb_jmp_cache_clear(cpu);
> @@ -135,8 +133,6 @@ static void tlb_flush_nocheck(CPUState *cpu)
>      env->tlb_flush_addr = -1;
>      env->tlb_flush_mask = 0;
>
> -    tb_unlock();
> -
>      atomic_mb_set(&cpu->pending_tlb_flush, 0);
>  }
>
> @@ -180,8 +176,6 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
>
>      assert_cpu_is_self(cpu);
>
> -    tb_lock();
> -
>      tlb_debug("start: mmu_idx:0x%04lx\n", mmu_idx_bitmask);
>
>      for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
> @@ -197,8 +191,6 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
>      cpu_tb_jmp_cache_clear(cpu);
>
>      tlb_debug("done\n");
> -
> -    tb_unlock();
>  }
>
>  void tlb_flush_by_mmuidx(CPUState *cpu, uint16_t idxmap)


--
Alex Bennée


* Re: [Qemu-devel] [PATCH 15/16] translate-all: remove tb_lock mention from cpu_restore_state_from_tb
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 15/16] translate-all: remove tb_lock mention from cpu_restore_state_from_tb Emilio G. Cota
@ 2018-03-29 16:06   ` Alex Bennée
  2018-04-06  1:40     ` Emilio G. Cota
  0 siblings, 1 reply; 52+ messages in thread
From: Alex Bennée @ 2018-03-29 16:06 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> tb_lock was needed when the function did retranslation. However,
> since fca8a500d519 ("tcg: Save insn data and use it in
> cpu_restore_state_from_tb") we don't do retranslation.
>
> Get rid of the comment.

I think we need to modify the comment in cpu_restore_state as well:

  Either way we need to return early as we can't resolve it here.
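
As a rough illustration of why no lock is needed without retranslation:
a minimal sketch, assuming the per-instruction data saved at
translation time can be modelled as a plain read-only array. The
struct and function names are hypothetical, and the real code stores
this data in a compact encoded form rather than as an array. The
lookup only reads data that was written before the TB became visible,
so it has nothing to serialize against.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified record saved once per guest instruction. */
struct insn_record {
    uint64_t guest_pc;     /* guest PC of this instruction */
    uintptr_t host_end;    /* host code offset just past this instruction */
};

struct tb_sketch {         /* stand-in for TranslationBlock */
    const struct insn_record *insns;
    size_t num_insns;
};

/*
 * Find the guest instruction whose host code contains offset 'searched'.
 * Returns 0 and stores its guest PC on success, -1 if the offset lies
 * beyond the end of the TB.
 */
static int restore_guest_pc(const struct tb_sketch *tb, uintptr_t searched,
                            uint64_t *guest_pc)
{
    assert(tb != NULL && guest_pc != NULL);

    for (size_t i = 0; i < tb->num_insns; i++) {
        if (tb->insns[i].host_end > searched) {
            *guest_pc = tb->insns[i].guest_pc;
            return 0;
        }
    }
    return -1;
}
```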

>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/translate-all.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 9ab6477..ee49d03 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -357,9 +357,7 @@ static int encode_search(TranslationBlock *tb, uint8_t *block)
>      return p - block;
>  }
>
> -/* The cpu state corresponding to 'searched_pc' is restored.
> - * Called with tb_lock held.
> - */
> +/* The cpu state corresponding to 'searched_pc' is restored */
>  static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
>                                       uintptr_t searched_pc)
>  {


--
Alex Bennée


* Re: [Qemu-devel] [PATCH 16/16] tcg: remove tb_lock
  2018-02-27  5:39 ` [Qemu-devel] [PATCH 16/16] tcg: remove tb_lock Emilio G. Cota
@ 2018-03-29 16:15   ` Alex Bennée
  0 siblings, 0 replies; 52+ messages in thread
From: Alex Bennée @ 2018-03-29 16:15 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Use mmap_lock in user-mode to protect TCG state and the page
> descriptors.
> In !user-mode, each vCPU has its own TCG state, so no locks
> needed. Per-page locks are used to protect the page descriptors.
>
> Per-TB locks are used in both modes to protect TB jumps.
>
> Some notes:
>
> - tb_lock is removed from notdirty_mem_write by passing a
>   locked page_collection to tb_invalidate_phys_page_fast.
>
> - tcg_tb_lookup/remove/insert/etc have their own internal lock(s),
>   so there is no need to further serialize access to them.
>
> - do_tb_flush is run in a safe async context, meaning no other
>   vCPU threads are running. Therefore acquiring mmap_lock there
>   is just to please tools such as thread sanitizer.
>
> - Not visible in the diff, but tb_invalidate_phys_page already
>   has an assert_memory_lock.
>
> - cpu_io_recompile is !user-only, so no mmap_lock there.
>
> - Added mmap_unlock()'s before all siglongjmp's that could
>   be called in user-mode while mmap_lock is held.
>   + Added an assert for !have_mmap_lock() after returning from
>     the longjmp in cpu_exec, just like we do in cpu_exec_step_atomic.
>
> Performance numbers before/after:
>
> Host: AMD Opteron(tm) Processor 6376
>
>                  ubuntu 17.04 ppc64 bootup+shutdown time
>
>   700 +-+--+----+------+------------+-----------+------------*--+-+
>       |    +    +      +            +           +           *B    |
>       |         before ***B***                            ** *    |
>       |tb lock removal ###D###                         ***        |
>   600 +-+                                           ***         +-+
>       |                                           **         #    |
>       |                                        *B*          #D    |
>       |                                     *** *         ##      |
>   500 +-+                                ***           ###      +-+
>       |                             * ***           ###           |
>       |                            *B*          # ##              |
>       |                          ** *          #D#                |
>   400 +-+                      **            ##                 +-+
>       |                      **           ###                     |
>       |                    **           ##                        |
>       |                  **         # ##                          |
>   300 +-+  *           B*          #D#                          +-+
>       |    B         ***        ###                               |
>       |    *       **       ####                                  |
>       |     *   ***      ###                                      |
>   200 +-+   B  *B     #D#                                       +-+
>       |     #B* *   ## #                                          |
>       |     #*    ##                                              |
>       |    + D##D#     +            +           +            +    |
>   100 +-+--+----+------+------------+-----------+------------+--+-+
>            1    8      16      Guest CPUs       48           64
>   png: https://imgur.com/HwmBHXe
>
>               debian jessie aarch64 bootup+shutdown time
>
>   90 +-+--+-----+-----+------------+------------+------------+--+-+
>      |    +     +     +            +            +            +    |
>      |         before ***B***                                B    |
>   80 +tb lock removal ###D###                              **D  +-+
>      |                                                   **###    |
>      |                                                 **##       |
>   70 +-+                                             ** #       +-+
>      |                                             ** ##          |
>      |                                           **  #            |
>   60 +-+                                       *B  ##           +-+
>      |                                       **  ##               |
>      |                                    ***  #D                 |
>   50 +-+                               ***   ##                 +-+
>      |                             * **   ###                     |
>      |                           **B*  ###                        |
>   40 +-+                     ****  # ##                         +-+
>      |                   ****     #D#                             |
>      |             ***B**      ###                                |
>   30 +-+    B***B**        ####                                 +-+
>      |    B *   *     # ###                                       |
>      |     B       ###D#                                          |
>   20 +-+   D  ##D##                                             +-+
>      |      D#                                                    |
>      |    +     +     +            +            +            +    |
>   10 +-+--+-----+-----+------------+------------+------------+--+-+
>           1     8     16      Guest CPUs        48           64
>   png: https://imgur.com/iGpGFtv
>
> The gains are high for 4-8 CPUs. Beyond that point, however, unrelated
> lock contention significantly hurts scalability.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
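
For anyone reading the thread later, the "unlock before the longjmp,
assert after it" pattern from the notes can be sketched as below. This
is a minimal sketch under stated assumptions, not the QEMU code: it
uses plain setjmp()/longjmp() instead of sigsetjmp()/siglongjmp(), an
atomic_flag spinlock as a stand-in for mmap_lock, and hypothetical
names throughout (sketch_lock_*, run_loop_once).

```c
#include <assert.h>
#include <setjmp.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag sketch_lock = ATOMIC_FLAG_INIT; /* stand-in for mmap_lock */
static _Thread_local int lock_depth;   /* per-thread count of acquisitions */
static jmp_buf exec_env;               /* stand-in for the cpu_exec jmp env */

static void sketch_lock_acquire(void)
{
    if (lock_depth++ == 0) {
        while (atomic_flag_test_and_set_explicit(&sketch_lock,
                                                 memory_order_acquire)) {
            /* spin until the lock is free */
        }
    }
}

static void sketch_lock_release(void)
{
    assert(lock_depth > 0);
    if (--lock_depth == 0) {
        atomic_flag_clear_explicit(&sketch_lock, memory_order_release);
    }
}

/* Analogue of have_mmap_lock(): does this thread hold the lock? */
static bool sketch_have_lock(void)
{
    return lock_depth > 0;
}

/* A path that must leave the execution loop while holding the lock. */
static void work_that_bails_out(void)
{
    sketch_lock_acquire();
    /* ... decide that the loop has to be restarted ... */
    sketch_lock_release();   /* drop the lock *before* jumping out */
    longjmp(exec_env, 1);
}

/* Returns 1 if control came back through the longjmp path. */
static int run_loop_once(void)
{
    if (setjmp(exec_env) == 0) {
        work_that_bails_out();
        return 0;            /* not reached in this sketch */
    }
    /* Nothing resets the lock here, so just check it was released. */
    assert(!sketch_have_lock());
    return 1;
}
```

The point of the assert is that the landing site no longer cleans up
after its callers, as tb_lock_reset() used to; every jump site is
responsible for releasing the lock itself.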

> ---
>  accel/tcg/cpu-exec.c            |  34 +++--------
>  accel/tcg/translate-all.c       | 130 ++++++++++++----------------------------
>  accel/tcg/translate-all.h       |   3 +-
>  docs/devel/multi-thread-tcg.txt |  11 ++--
>  exec.c                          |  25 ++++----
>  include/exec/cpu-common.h       |   2 +-
>  include/exec/exec-all.h         |   4 --
>  include/exec/memory-internal.h  |   6 +-
>  include/exec/tb-context.h       |   2 -
>  linux-user/main.c               |   3 -
>  tcg/tcg.h                       |   4 +-
>  11 files changed, 73 insertions(+), 151 deletions(-)
>
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index 20dad1b..e7a602b 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -210,20 +210,20 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
>         We only end up here when an existing TB is too long.  */
>      cflags |= MIN(max_cycles, CF_COUNT_MASK);
>
> -    tb_lock();
> +    mmap_lock();
>      tb = tb_gen_code(cpu, orig_tb->pc, orig_tb->cs_base,
>                       orig_tb->flags, cflags);
>      tb->orig_tb = orig_tb;
> -    tb_unlock();
> +    mmap_unlock();
>
>      /* execute the generated code */
>      trace_exec_tb_nocache(tb, tb->pc);
>      cpu_tb_exec(cpu, tb);
>
> -    tb_lock();
> +    mmap_lock();
>      tb_phys_invalidate(tb, -1);
> +    mmap_unlock();
>      tcg_tb_remove(tb);
> -    tb_unlock();
>  }
>  #endif
>
> @@ -242,9 +242,7 @@ void cpu_exec_step_atomic(CPUState *cpu)
>          tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, cf_mask);
>          if (tb == NULL) {
>              mmap_lock();
> -            tb_lock();
>              tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
> -            tb_unlock();
>              mmap_unlock();
>          }
>
> @@ -259,15 +257,13 @@ void cpu_exec_step_atomic(CPUState *cpu)
>          cpu_tb_exec(cpu, tb);
>          cc->cpu_exec_exit(cpu);
>      } else {
> -        /* We may have exited due to another problem here, so we need
> -         * to reset any tb_locks we may have taken but didn't release.
> +        /*
>           * The mmap_lock is dropped by tb_gen_code if it runs out of
>           * memory.
>           */
>  #ifndef CONFIG_SOFTMMU
>          tcg_debug_assert(!have_mmap_lock());
>  #endif
> -        tb_lock_reset();
>          assert_page_collection_locked(false);
>      }
>
> @@ -396,20 +392,11 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
>      TranslationBlock *tb;
>      target_ulong cs_base, pc;
>      uint32_t flags;
> -    bool acquired_tb_lock = false;
>
>      tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, cf_mask);
>      if (tb == NULL) {
> -        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
> -         * taken outside tb_lock. As system emulation is currently
> -         * single threaded the locks are NOPs.
> -         */
>          mmap_lock();
> -        tb_lock();
> -        acquired_tb_lock = true;
> -
>          tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
> -
>          mmap_unlock();
>          /* We add the TB in the virtual pc hash table for the fast lookup */
>          atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
> @@ -425,15 +412,8 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
>  #endif
>      /* See if we can patch the calling TB. */
>      if (last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
> -        if (!acquired_tb_lock) {
> -            tb_lock();
> -            acquired_tb_lock = true;
> -        }
>          tb_add_jump(last_tb, tb_exit, tb);
>      }
> -    if (acquired_tb_lock) {
> -        tb_unlock();
> -    }
>      return tb;
>  }
>
> @@ -706,7 +686,9 @@ int cpu_exec(CPUState *cpu)
>          g_assert(cc == CPU_GET_CLASS(cpu));
>  #endif /* buggy compiler */
>          cpu->can_do_io = 1;
> -        tb_lock_reset();
> +#ifndef CONFIG_SOFTMMU
> +        tcg_debug_assert(!have_mmap_lock());
> +#endif
>          if (qemu_mutex_iothread_locked()) {
>              qemu_mutex_unlock_iothread();
>          }
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index ee49d03..8b3673c 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -88,13 +88,13 @@
>  #endif
>
>  /* Access to the various translations structures need to be serialised via locks
> - * for consistency. This is automatic for SoftMMU based system
> - * emulation due to its single threaded nature. In user-mode emulation
> - * access to the memory related structures are protected with the
> - * mmap_lock.
> + * for consistency.
> + * In user-mode emulation access to the memory related structures are protected
> + * with mmap_lock.
> + * In !user-mode we use per-page locks.
>   */
>  #ifdef CONFIG_SOFTMMU
> -#define assert_memory_lock() tcg_debug_assert(have_tb_lock)
> +#define assert_memory_lock()
>  #else
>  #define assert_memory_lock() tcg_debug_assert(have_mmap_lock())
>  #endif
> @@ -219,9 +219,6 @@ __thread TCGContext *tcg_ctx;
>  TBContext tb_ctx;
>  bool parallel_cpus;
>
> -/* translation block context */
> -static __thread int have_tb_lock;
> -
>  static void page_table_config_init(void)
>  {
>      uint32_t v_l1_bits;
> @@ -242,31 +239,6 @@ static void page_table_config_init(void)
>      assert(v_l2_levels >= 0);
>  }
>
> -#define assert_tb_locked() tcg_debug_assert(have_tb_lock)
> -#define assert_tb_unlocked() tcg_debug_assert(!have_tb_lock)
> -
> -void tb_lock(void)
> -{
> -    assert_tb_unlocked();
> -    qemu_mutex_lock(&tb_ctx.tb_lock);
> -    have_tb_lock++;
> -}
> -
> -void tb_unlock(void)
> -{
> -    assert_tb_locked();
> -    have_tb_lock--;
> -    qemu_mutex_unlock(&tb_ctx.tb_lock);
> -}
> -
> -void tb_lock_reset(void)
> -{
> -    if (have_tb_lock) {
> -        qemu_mutex_unlock(&tb_ctx.tb_lock);
> -        have_tb_lock = 0;
> -    }
> -}
> -
>  void cpu_gen_init(void)
>  {
>      tcg_context_init(&tcg_init_ctx);
> @@ -432,7 +404,6 @@ bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc)
>      check_offset = host_pc - (uintptr_t) tcg_init_ctx.code_gen_buffer;
>
>      if (check_offset < tcg_init_ctx.code_gen_buffer_size) {
> -        tb_lock();
>          tb = tcg_tb_lookup(host_pc);
>          if (tb) {
>              cpu_restore_state_from_tb(cpu, tb, host_pc);
> @@ -443,7 +414,6 @@ bool cpu_restore_state(CPUState *cpu, uintptr_t host_pc)
>              }
>              r = true;
>          }
> -        tb_unlock();
>      }
>
>      return r;
> @@ -1054,7 +1024,6 @@ static inline void code_gen_alloc(size_t tb_size)
>          fprintf(stderr, "Could not allocate dynamic translator buffer\n");
>          exit(1);
>      }
> -    qemu_mutex_init(&tb_ctx.tb_lock);
>  }
>
>  static bool tb_cmp(const void *ap, const void *bp)
> @@ -1098,14 +1067,12 @@ void tcg_exec_init(unsigned long tb_size)
>  /*
>   * Allocate a new translation block. Flush the translation buffer if
>   * too many translation blocks or too much generated code.
> - *
> - * Called with tb_lock held.
>   */
>  static TranslationBlock *tb_alloc(target_ulong pc)
>  {
>      TranslationBlock *tb;
>
> -    assert_tb_locked();
> +    assert_memory_lock();
>
>      tb = tcg_tb_alloc(tcg_ctx);
>      if (unlikely(tb == NULL)) {
> @@ -1171,8 +1138,7 @@ static gboolean tb_host_size_iter(gpointer key, gpointer value, gpointer data)
>  /* flush all the translation blocks */
>  static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
>  {
> -    tb_lock();
> -
> +    mmap_lock();
>      /* If it is already been done on request of another CPU,
>       * just retry.
>       */
> @@ -1202,7 +1168,7 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
>      atomic_mb_set(&tb_ctx.tb_flush_count, tb_ctx.tb_flush_count + 1);
>
>  done:
> -    tb_unlock();
> +    mmap_unlock();
>  }
>
>  void tb_flush(CPUState *cpu)
> @@ -1236,7 +1202,7 @@ do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
>
>  /* verify that all the pages have correct rights for code
>   *
> - * Called with tb_lock held.
> + * Called with mmap_lock held.
>   */
>  static void tb_invalidate_check(target_ulong address)
>  {
> @@ -1266,7 +1232,10 @@ static void tb_page_check(void)
>
>  #endif /* CONFIG_USER_ONLY */
>
> -/* call with @pd->lock held */
> +/*
> + * user-mode: call with mmap_lock held
> + * !user-mode: call with @pd->lock held
> + */
>  static inline void tb_page_remove(PageDesc *pd, TranslationBlock *tb)
>  {
>      TranslationBlock *tb1;
> @@ -1359,7 +1328,11 @@ static inline void tb_jmp_unlink(TranslationBlock *dest)
>      qemu_spin_unlock(&dest->jmp_lock);
>  }
>
> -/* If @rm_from_page_list is set, call with the TB's pages' locks held */
> +/*
> + * In user-mode, call with mmap_lock held.
> + * In !user-mode, if @rm_from_page_list is set, call with the TB's pages'
> + * locks held.
> + */
>  static void do_tb_phys_invalidate(TranslationBlock *tb, bool rm_from_page_list)
>  {
>      CPUState *cpu;
> @@ -1367,7 +1340,7 @@ static void do_tb_phys_invalidate(TranslationBlock *tb, bool rm_from_page_list)
>      uint32_t h;
>      tb_page_addr_t phys_pc;
>
> -    assert_tb_locked();
> +    assert_memory_lock();
>
>      /* make sure no further incoming jumps will be chained to this TB */
>      qemu_spin_lock(&tb->jmp_lock);
> @@ -1420,7 +1393,7 @@ static void tb_phys_invalidate__locked(TranslationBlock *tb)
>
>  /* invalidate one TB
>   *
> - * Called with tb_lock held.
> + * Called with mmap_lock held in user-mode.
>   */
>  void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>  {
> @@ -1464,7 +1437,7 @@ static void build_page_bitmap(PageDesc *p)
>  /* add the tb in the target page and protect it if necessary
>   *
>   * Called with mmap_lock held for user-mode emulation.
> - * Called with @p->lock held.
> + * Called with @p->lock held in !user-mode.
>   */
>  static inline void tb_page_add(PageDesc *p, TranslationBlock *tb,
>                                 unsigned int n, tb_page_addr_t page_addr)
> @@ -1744,10 +1717,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>      if ((pc & TARGET_PAGE_MASK) != virt_page2) {
>          phys_page2 = get_page_addr_code(env, virt_page2);
>      }
> -    /* As long as consistency of the TB stuff is provided by tb_lock in user
> -     * mode and is implicit in single-threaded softmmu emulation, no explicit
> -     * memory barrier is required before tb_link_page() makes the TB visible
> -     * through the physical hash table and physical page list.
> +    /*
> +     * No explicit memory barrier is required -- tb_link_page() makes the
> +     * TB visible in a consistent state.
>       */
>      existing_tb = tb_link_page(tb, phys_pc, phys_page2);
>      /* if the TB already exists, discard what we just translated */
> @@ -1763,8 +1735,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>  }
>
>  /*
> - * Call with all @pages locked.
>   * @p must be non-NULL.
> + * user-mode: call with mmap_lock held.
> + * !user-mode: call with all @pages locked.
>   */
>  static void
>  tb_invalidate_phys_page_range__locked(struct page_collection *pages,
> @@ -1787,7 +1760,6 @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
>  #endif /* TARGET_HAS_PRECISE_SMC */
>
>      assert_memory_lock();
> -    assert_tb_locked();
>
>  #if defined(TARGET_HAS_PRECISE_SMC)
>      if (cpu != NULL) {
> @@ -1848,6 +1820,7 @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
>          page_collection_unlock(pages);
>          /* Force execution of one insn next time.  */
>          cpu->cflags_next_tb = 1 | curr_cflags();
> +        mmap_unlock();
>          cpu_loop_exit_noexc(cpu);
>      }
>  #endif
> @@ -1860,8 +1833,7 @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
>   * access: the virtual CPU will exit the current TB if code is modified inside
>   * this TB.
>   *
> - * Called with tb_lock/mmap_lock held for user-mode emulation
> - * Called with tb_lock held for system-mode emulation
> + * Called with mmap_lock held for user-mode emulation
>   */
>  void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>                                     int is_cpu_write_access)
> @@ -1870,7 +1842,6 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>      PageDesc *p;
>
>      assert_memory_lock();
> -    assert_tb_locked();
>
>      p = page_find(start >> TARGET_PAGE_BITS);
>      if (p == NULL) {
> @@ -1889,14 +1860,15 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>   * access: the virtual CPU will exit the current TB if code is modified inside
>   * this TB.
>   *
> - * Called with mmap_lock held for user-mode emulation, grabs tb_lock
> - * Called with tb_lock held for system-mode emulation
> + * Called with mmap_lock held for user-mode emulation.
>   */
> -static void tb_invalidate_phys_range_1(tb_page_addr_t start, tb_page_addr_t end)
> +void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
>  {
>      struct page_collection *pages;
>      tb_page_addr_t next;
>
> +    assert_memory_lock();
> +
>      pages = page_collection_lock(start, end);
>      for (next = (start & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
>           start < end;
> @@ -1913,29 +1885,15 @@ static void tb_invalidate_phys_range_1(tb_page_addr_t start, tb_page_addr_t end)
>  }
>
>  #ifdef CONFIG_SOFTMMU
> -void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
> -{
> -    assert_tb_locked();
> -    tb_invalidate_phys_range_1(start, end);
> -}
> -#else
> -void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end)
> -{
> -    assert_memory_lock();
> -    tb_lock();
> -    tb_invalidate_phys_range_1(start, end);
> -    tb_unlock();
> -}
> -#endif
> -
> -#ifdef CONFIG_SOFTMMU
>  /* len must be <= 8 and start must be a multiple of len.
>   * Called via softmmu_template.h when code areas are written to with
>   * iothread mutex not held.
> + *
> + * Call with all @pages in the range [@start, @start + len[ locked.
>   */
> -void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
> +void tb_invalidate_phys_page_fast(struct page_collection *pages,
> +                                  tb_page_addr_t start, int len)
>  {
> -    struct page_collection *pages;
>      PageDesc *p;
>
>  #if 0
> @@ -1954,7 +1912,6 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
>          return;
>      }
>
> -    pages = page_collection_lock(start, start + len);
>      if (!p->code_bitmap &&
>          ++p->code_write_count >= SMC_BITMAP_USE_THRESHOLD) {
>          build_page_bitmap(p);
> @@ -1972,7 +1929,6 @@ void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len)
>      do_invalidate:
>          tb_invalidate_phys_page_range__locked(pages, p, start, start + len, 1);
>      }
> -    page_collection_unlock(pages);
>  }
>  #else
>  /* Called with mmap_lock held. If pc is not 0 then it indicates the
> @@ -2004,7 +1960,6 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
>          return false;
>      }
>
> -    tb_lock();
>  #ifdef TARGET_HAS_PRECISE_SMC
>      if (p->first_tb && pc != 0) {
>          current_tb = tcg_tb_lookup(pc);
> @@ -2036,12 +1991,9 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
>      if (current_tb_modified) {
>          /* Force execution of one insn next time.  */
>          cpu->cflags_next_tb = 1 | curr_cflags();
> -        /* tb_lock will be reset after cpu_loop_exit_noexc longjmps
> -         * back into the cpu_exec loop. */
>          return true;
>      }
>  #endif
> -    tb_unlock();
>
>      return false;
>  }
> @@ -2062,18 +2014,18 @@ void tb_invalidate_phys_addr(AddressSpace *as, hwaddr addr)
>          return;
>      }
>      ram_addr = memory_region_get_ram_addr(mr) + addr;
> -    tb_lock();
>      tb_invalidate_phys_page_range(ram_addr, ram_addr + 1, 0);
> -    tb_unlock();
>      rcu_read_unlock();
>  }
>  #endif /* !defined(CONFIG_USER_ONLY) */
>
> -/* Called with tb_lock held.  */
> +/* user-mode: call with mmap_lock held */
>  void tb_check_watchpoint(CPUState *cpu)
>  {
>      TranslationBlock *tb;
>
> +    assert_memory_lock();
> +
>      tb = tcg_tb_lookup(cpu->mem_io_pc);
>      if (tb) {
>          /* We can use retranslation to find the PC.  */
> @@ -2107,7 +2059,6 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
>      TranslationBlock *tb;
>      uint32_t n;
>
> -    tb_lock();
>      tb = tcg_tb_lookup(retaddr);
>      if (!tb) {
>          cpu_abort(cpu, "cpu_io_recompile: could not find TB for pc=%p",
> @@ -2160,9 +2111,6 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
>       *  repeating the fault, which is horribly inefficient.
>       *  Better would be to execute just this insn uncached, or generate a
>       *  second new TB.
> -     *
> -     * cpu_loop_exit_noexc will longjmp back to cpu_exec where the
> -     * tb_lock gets reset.
>       */
>      cpu_loop_exit_noexc(cpu);
>  }
> diff --git a/accel/tcg/translate-all.h b/accel/tcg/translate-all.h
> index 6d1d258..e6cb963 100644
> --- a/accel/tcg/translate-all.h
> +++ b/accel/tcg/translate-all.h
> @@ -26,7 +26,8 @@
>  struct page_collection *page_collection_lock(tb_page_addr_t start,
>                                               tb_page_addr_t end);
>  void page_collection_unlock(struct page_collection *set);
> -void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len);
> +void tb_invalidate_phys_page_fast(struct page_collection *pages,
> +                                  tb_page_addr_t start, int len);
>  void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
>                                     int is_cpu_write_access);
>  void tb_invalidate_phys_range(tb_page_addr_t start, tb_page_addr_t end);
> diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt
> index 36da1f1..e1e002b 100644
> --- a/docs/devel/multi-thread-tcg.txt
> +++ b/docs/devel/multi-thread-tcg.txt
> @@ -61,6 +61,7 @@ have their block-to-block jumps patched.
>  Global TCG State
>  ----------------
>
> +### User-mode emulation
>  We need to protect the entire code generation cycle including any post
>  generation patching of the translated code. This also implies a shared
>  translation buffer which contains code running on all cores. Any
> @@ -75,9 +76,11 @@ patching.
>
>  (Current solution)
>
> -Mainly as part of the linux-user work all code generation is
> -serialised with a tb_lock(). For the SoftMMU tb_lock() also takes the
> -place of mmap_lock() in linux-user.
> +Code generation is serialised with mmap_lock().
> +
> +### !User-mode emulation
> +Each vCPU has its own TCG context and associated TCG region, thereby
> +requiring no locking.
>
>  Translation Blocks
>  ------------------
> @@ -192,7 +195,7 @@ work as "safe work" and exiting the cpu run loop. This ensure by the
>  time execution restarts all flush operations have completed.
>
>  TLB flag updates are all done atomically and are also protected by the
> -tb_lock() which is used by the functions that update the TLB in bulk.
> +corresponding page lock.
>
>  (Known limitation)
>
> diff --git a/exec.c b/exec.c
> index 4d8addb..ed6ef05 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -821,9 +821,7 @@ void cpu_exec_realizefn(CPUState *cpu, Error **errp)
>  static void breakpoint_invalidate(CPUState *cpu, target_ulong pc)
>  {
>      mmap_lock();
> -    tb_lock();
>      tb_invalidate_phys_page_range(pc, pc + 1, 0);
> -    tb_unlock();
>      mmap_unlock();
>  }
>  #else
> @@ -2406,21 +2404,20 @@ void memory_notdirty_write_prepare(NotDirtyInfo *ndi,
>      ndi->ram_addr = ram_addr;
>      ndi->mem_vaddr = mem_vaddr;
>      ndi->size = size;
> -    ndi->locked = false;
> +    ndi->pages = NULL;
>
>      assert(tcg_enabled());
>      if (!cpu_physical_memory_get_dirty_flag(ram_addr, DIRTY_MEMORY_CODE)) {
> -        ndi->locked = true;
> -        tb_lock();
> -        tb_invalidate_phys_page_fast(ram_addr, size);
> +        ndi->pages = page_collection_lock(ram_addr, ram_addr + size);
> +        tb_invalidate_phys_page_fast(ndi->pages, ram_addr, size);
>      }
>  }
>
>  /* Called within RCU critical section. */
>  void memory_notdirty_write_complete(NotDirtyInfo *ndi)
>  {
> -    if (ndi->locked) {
> -        tb_unlock();
> +    if (ndi->pages) {
> +        page_collection_unlock(ndi->pages);
>      }
>
>      /* Set both VGA and migration bits for simplicity and to remove
> @@ -2521,18 +2518,16 @@ static void check_watchpoint(int offset, int len, MemTxAttrs attrs, int flags)
>                  }
>                  cpu->watchpoint_hit = wp;
>
> -                /* Both tb_lock and iothread_mutex will be reset when
> -                 * cpu_loop_exit or cpu_loop_exit_noexc longjmp
> -                 * back into the cpu_exec main loop.
> -                 */
> -                tb_lock();
> +                mmap_lock();
>                  tb_check_watchpoint(cpu);
>                  if (wp->flags & BP_STOP_BEFORE_ACCESS) {
>                      cpu->exception_index = EXCP_DEBUG;
> +                    mmap_unlock();
>                      cpu_loop_exit(cpu);
>                  } else {
>                      /* Force execution of one insn next time.  */
>                      cpu->cflags_next_tb = 1 | curr_cflags();
> +                    mmap_unlock();
>                      cpu_loop_exit_noexc(cpu);
>                  }
>              }
> @@ -2947,9 +2942,9 @@ static void invalidate_and_set_dirty(MemoryRegion *mr, hwaddr addr,
>      }
>      if (dirty_log_mask & (1 << DIRTY_MEMORY_CODE)) {
>          assert(tcg_enabled());
> -        tb_lock();
> +        mmap_lock();
>          tb_invalidate_phys_range(addr, addr + length);
> -        tb_unlock();
> +        mmap_unlock();
>          dirty_log_mask &= ~(1 << DIRTY_MEMORY_CODE);
>      }
>      cpu_physical_memory_set_dirty_range(addr, length, dirty_log_mask);
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 74341b1..ae25cc3 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -23,7 +23,7 @@ typedef struct CPUListState {
>      FILE *file;
>  } CPUListState;
>
> -/* The CPU list lock nests outside tb_lock/tb_unlock.  */
> +/* The CPU list lock nests outside page_(un)lock or mmap_(un)lock */
>  void qemu_init_cpu_list(void);
>  void cpu_list_lock(void);
>  void cpu_list_unlock(void);
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index d69b853..eac027c 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -436,10 +436,6 @@ extern uintptr_t tci_tb_ptr;
>     smaller than 4 bytes, so we don't worry about special-casing this.  */
>  #define GETPC_ADJ   2
>
> -void tb_lock(void);
> -void tb_unlock(void);
> -void tb_lock_reset(void);
> -
>  #if !defined(CONFIG_USER_ONLY) && defined(CONFIG_DEBUG_TCG)
>  void assert_page_collection_locked(bool val);
>  #else
> diff --git a/include/exec/memory-internal.h b/include/exec/memory-internal.h
> index 4162474..9a48974 100644
> --- a/include/exec/memory-internal.h
> +++ b/include/exec/memory-internal.h
> @@ -40,6 +40,8 @@ void mtree_print_dispatch(fprintf_function mon, void *f,
>                            struct AddressSpaceDispatch *d,
>                            MemoryRegion *root);
>
> +struct page_collection;
> +
>  /* Opaque struct for passing info from memory_notdirty_write_prepare()
>   * to memory_notdirty_write_complete(). Callers should treat all fields
>   * as private, with the exception of @active.
> @@ -51,10 +53,10 @@ void mtree_print_dispatch(fprintf_function mon, void *f,
>   */
>  typedef struct {
>      CPUState *cpu;
> +    struct page_collection *pages;
>      ram_addr_t ram_addr;
>      vaddr mem_vaddr;
>      unsigned size;
> -    bool locked;
>      bool active;
>  } NotDirtyInfo;
>
> @@ -82,7 +84,7 @@ typedef struct {
>   *
>   * This must only be called if we are using TCG; it will assert otherwise.
>   *
> - * We may take a lock in the prepare call, so callers must ensure that
> + * We may take locks in the prepare call, so callers must ensure that
>   * they don't exit (via longjump or otherwise) without calling complete.
>   *
>   * This call must only be made inside an RCU critical section.
> diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
> index 8c9b49c..feb585e 100644
> --- a/include/exec/tb-context.h
> +++ b/include/exec/tb-context.h
> @@ -32,8 +32,6 @@ typedef struct TBContext TBContext;
>  struct TBContext {
>
>      struct qht htable;
> -    /* any access to the tbs or the page table must use this lock */
> -    QemuMutex tb_lock;
>
>      /* statistics */
>      unsigned tb_flush_count;
> diff --git a/linux-user/main.c b/linux-user/main.c
> index fd79006..d4379fe 100644
> --- a/linux-user/main.c
> +++ b/linux-user/main.c
> @@ -129,7 +129,6 @@ void fork_start(void)
>  {
>      start_exclusive();
>      mmap_fork_start();
> -    qemu_mutex_lock(&tb_ctx.tb_lock);
>      cpu_list_lock();
>  }
>
> @@ -145,14 +144,12 @@ void fork_end(int child)
>                  QTAILQ_REMOVE(&cpus, cpu, node);
>              }
>          }
> -        qemu_mutex_init(&tb_ctx.tb_lock);
>          qemu_init_cpu_list();
>          gdbserver_fork(thread_cpu);
>          /* qemu_init_cpu_list() takes care of reinitializing the
>           * exclusive state, so we don't need to end_exclusive() here.
>           */
>      } else {
> -        qemu_mutex_unlock(&tb_ctx.tb_lock);
>          cpu_list_unlock();
>          end_exclusive();
>      }
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 9dd9448..c411bf5 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -841,7 +841,7 @@ static inline bool tcg_op_buf_full(void)
>
>  /* pool based memory allocation */
>
> -/* user-mode: tb_lock must be held for tcg_malloc_internal. */
> +/* user-mode: mmap_lock must be held for tcg_malloc_internal. */
>  void *tcg_malloc_internal(TCGContext *s, int size);
>  void tcg_pool_reset(TCGContext *s);
>  TranslationBlock *tcg_tb_alloc(TCGContext *s);
> @@ -859,7 +859,7 @@ TranslationBlock *tcg_tb_lookup(uintptr_t tc_ptr);
>  void tcg_tb_foreach(GTraverseFunc func, gpointer user_data);
>  size_t tcg_nb_tbs(void);
>
> -/* user-mode: Called with tb_lock held.  */
> +/* user-mode: Called with mmap_lock held.  */
>  static inline void *tcg_malloc(int size)
>  {
>      TCGContext *s = tcg_ctx;


--
Alex Bennée


* Re: [Qemu-devel] [PATCH 02/16] qht: return existing entry when qht_insert fails
  2018-03-28 16:33   ` Alex Bennée
@ 2018-04-05 17:10     ` Emilio G. Cota
  0 siblings, 0 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-04-05 17:10 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson

On Wed, Mar 28, 2018 at 17:33:09 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
> > -bool qht_insert(struct qht *ht, void *p, uint32_t hash);
> > +void *qht_insert(struct qht *ht, void *p, uint32_t hash);
> 
> Hmm this seems needlessly counter intuitive. I realise the potential
> efficiency in overloading success/fail but wouldn't a:
> 
>   bool qht_insert(struct qht *ht, void *p, uint32_t hash, void **existing);
> 
> be conceptually nicer?

Good point, fixed in v2.
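
For illustration, here is a minimal sketch of the calling convention
suggested above: a bool result for success or failure, plus an optional
out parameter that receives the entry already present. Everything named
toy_* is hypothetical and exists only for this example; this is a tiny
fixed-size linear-probing table, not QHT's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS 8 /* hypothetical fixed capacity; must be a power of two */

struct toy_ht {
    void *slots[TOY_SLOTS];
    bool (*cmp)(const void *a, const void *b); /* default comparison */
};

/*
 * Returns true if @p was inserted.
 * Returns false if an equal entry is already present, or if the table is
 * full. When an equal entry is found and @existing is non-NULL, *existing
 * is set to that entry; otherwise *existing is left as NULL.
 */
static bool toy_ht_insert(struct toy_ht *ht, void *p, uint32_t hash,
                          void **existing)
{
    assert(p != NULL);
    if (existing) {
        *existing = NULL;
    }
    for (size_t i = 0; i < TOY_SLOTS; i++) {
        size_t idx = (hash + i) & (TOY_SLOTS - 1);

        if (ht->slots[idx] == NULL) {
            ht->slots[idx] = p; /* free slot: insert here */
            return true;
        }
        if (ht->cmp(ht->slots[idx], p)) {
            if (existing) {
                *existing = ht->slots[idx]; /* report the match */
            }
            return false;
        }
    }
    return false; /* table full */
}

/* Example comparison function: entries are pointers to int. */
static bool toy_int_cmp(const void *a, const void *b)
{
    return *(const int *)a == *(const int *)b;
}
```

With this shape, callers that only care about success can pass NULL for
the last argument, while callers that need the matching entry get it
without a second lookup.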

Thanks,

		E.


* Re: [Qemu-devel] [PATCH 04/16] tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx
  2018-03-29 10:06   ` Alex Bennée
@ 2018-04-05 17:18     ` Emilio G. Cota
  0 siblings, 0 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-04-05 17:18 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson

On Thu, Mar 29, 2018 at 11:06:07 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
(snip)
> > diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> > index 3a51d49..20ad3fc 100644
> > --- a/accel/tcg/translate-all.c
> > +++ b/accel/tcg/translate-all.c
> > @@ -1072,7 +1072,8 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
> >      /* suppress any remaining jumps to this TB */
> >      tb_jmp_unlink(tb);
> >
> > -    tb_ctx.tb_phys_invalidate_count++;
> > +    atomic_set(&tcg_ctx->tb_phys_invalidate_count,
> > +               tcg_ctx->tb_phys_invalidate_count + 1);
> 
> We do have an atomic_inc helper for this, or we need a comment that we
> only have atomic_reads() to worry about, hence no races.
> 
> Otherwise:
> 
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

tcg_ctx is per-thread (or otherwise tb_lock is held here), so
yes we only worry here about concurrent atomic_reads. This
is common enough not to need a comment, methinks.
[we have quite a few more instances of atomic_set(foo, foo + 1),
for instance when incrementing TCGProfile counts.]
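
As a rough sketch of the single-writer pattern described here: the
owning thread does a plain load followed by a store, and other threads
only ever read. C11 relaxed atomics stand in for QEMU's atomic_set and
atomic_read (that mapping is an assumption of this example), and the
toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-thread context holding one statistics counter. */
struct toy_ctx {
    atomic_size_t invalidate_count;
};

/*
 * Must be called only by the thread that owns @ctx.
 * The load and the store are separate operations, so this is NOT an
 * atomic read-modify-write: with two writers, increments could be lost.
 * With a single writer it is safe, and readers never see a torn value.
 */
static void toy_count_inc(struct toy_ctx *ctx)
{
    assert(ctx != NULL);
    size_t cur = atomic_load_explicit(&ctx->invalidate_count,
                                      memory_order_relaxed);
    atomic_store_explicit(&ctx->invalidate_count, cur + 1,
                          memory_order_relaxed);
}

/* May be called from any thread: sums the counters of all contexts. */
static size_t toy_count_total(struct toy_ctx *ctxs, size_t n)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++) {
        total += atomic_load_explicit(&ctxs[i].invalidate_count,
                                      memory_order_relaxed);
    }
    return total;
}
```

The total read this way is only a snapshot, which is usually fine for
statistics; the benefit is that the writer avoids a locked
read-modify-write instruction.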

Thanks,

		Emilio


* Re: [Qemu-devel] [PATCH 10/16] translate-all: use per-page locking in !user-mode
  2018-03-29 14:55   ` Alex Bennée
@ 2018-04-06  0:43     ` Emilio G. Cota
  0 siblings, 0 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-04-06  0:43 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson

On Thu, Mar 29, 2018 at 15:55:13 +0100, Alex Bennée wrote:
> 
> Emilio G. Cota <cota@braap.org> writes:
(snip)
> > +/* lock the page(s) of a TB in the correct acquisition order */
> > +static inline void page_lock_tb(const TranslationBlock *tb)
> > +{
> > +    if (likely(tb->page_addr[1] == -1)) {
> > +        page_lock(page_find(tb->page_addr[0] >> TARGET_PAGE_BITS));
> > +        return;
> > +    }
> > +    if (tb->page_addr[0] < tb->page_addr[1]) {
> > +        page_lock(page_find(tb->page_addr[0] >> TARGET_PAGE_BITS));
> > +        page_lock(page_find(tb->page_addr[1] >> TARGET_PAGE_BITS));
> > +    } else {
> > +        page_lock(page_find(tb->page_addr[1] >> TARGET_PAGE_BITS));
> > +        page_lock(page_find(tb->page_addr[0] >> TARGET_PAGE_BITS));
> > +    }
> > +}
(snip)
> > +    /*
> > +     * Add the TB to the page list.
> > +     * To avoid deadlock, acquire first the lock of the lower-addressed page.
> > +     */
> > +    p = page_find_alloc(phys_pc >> TARGET_PAGE_BITS, 1);
> > +    if (likely(phys_page2 == -1)) {
> >          tb->page_addr[1] = -1;
> > +        page_lock(p);
> > +        tb_page_add(p, tb, 0, phys_pc & TARGET_PAGE_MASK);
> > +    } else {
> > +        p2 = page_find_alloc(phys_page2 >> TARGET_PAGE_BITS, 1);
> > +        if (phys_pc < phys_page2) {
> > +            page_lock(p);
> > +            page_lock(p2);
> > +        } else {
> > +            page_lock(p2);
> > +            page_lock(p);
> > +        }
> 
> Given that we repeat this check further up, perhaps a:
> 
>   page_lock_pair(PageDesc *p1, tb_page_addr_t phys1, PageDesc *p2, tb_page_addr_t phys2)

After trying, I don't think it's worth the trouble.

Note that page_lock_tb expands to nothing in user-mode,
whereas the latter snippet is shared by user-mode and
!user-mode. Dealing with that gets ugly quickly;
besides, we'd have to optionally return *p1 and *p2,
plus choose whether to use page_find or page_find_alloc.
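
To show the ordering rule in isolation (this is not a proposed QEMU
helper), here is a minimal sketch that always takes the lock of the
lower-addressed page first. It assumes POSIX mutexes; struct toy_page
and the toy_* functions are hypothetical names for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page descriptor: a lock plus the page's address. */
struct toy_page {
    pthread_mutex_t lock;
    uintptr_t addr;
};

/*
 * Lock one or two pages in ascending address order. @p2 may be NULL
 * (or equal to @p1) when only one page is involved.
 * Returns the page whose lock was taken first.
 *
 * Because every thread uses the same global order, two threads locking
 * the same pair can never each hold one lock while waiting for the
 * other.
 */
static struct toy_page *toy_page_lock_pair(struct toy_page *p1,
                                           struct toy_page *p2)
{
    assert(p1 != NULL);
    if (p2 == NULL || p2 == p1) {
        pthread_mutex_lock(&p1->lock);
        return p1;
    }
    struct toy_page *lo = p1->addr < p2->addr ? p1 : p2;
    struct toy_page *hi = lo == p1 ? p2 : p1;

    pthread_mutex_lock(&lo->lock);
    pthread_mutex_lock(&hi->lock);
    return lo;
}

/* Release the lock(s) taken by toy_page_lock_pair(). */
static void toy_page_unlock_pair(struct toy_page *p1, struct toy_page *p2)
{
    pthread_mutex_unlock(&p1->lock);
    if (p2 != NULL && p2 != p1) {
        pthread_mutex_unlock(&p2->lock);
    }
}
```

The argument order does not matter to the caller: passing (b, a) or
(a, b) results in the same acquisition order, which is the property the
two snippets quoted above both rely on.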

(snip)
> The diff is a little messy around tb_page_add but I think we need an
> assert_page_lock(p) which compiles to check mmap_lock in CONFIG_USER
> instead of the assert_memory_lock().
> 
> Then we can be clear about tb, memory and page locks.

I've added an extra patch to v2 to do this, since this patch is
already huge.

Thanks,

		Emilio


* Re: [Qemu-devel] [PATCH 12/16] translate-all: discard TB when tb_link_page returns an existing matching TB
  2018-03-29 15:19   ` Alex Bennée
@ 2018-04-06  1:23     ` Emilio G. Cota
  0 siblings, 0 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-04-06  1:23 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson

On Thu, Mar 29, 2018 at 16:19:39 +0100, Alex Bennée wrote:
> 
> Emilio G. Cota <cota@braap.org> writes:
> 
> > Use the recently-gained QHT feature of returning the matching TB if it
> > already exists. This allows us to get rid of the lookup we perform
> > right after acquiring tb_lock.
> >
> > Suggested-by: Richard Henderson <rth@twiddle.net>
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > ---
> >  accel/tcg/cpu-exec.c      | 14 ++------------
> >  accel/tcg/translate-all.c | 47 ++++++++++++++++++++++++++++++++++++++---------
> >  2 files changed, 40 insertions(+), 21 deletions(-)
> >
> > diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> > index 7c83887..8aed38c 100644
> > --- a/accel/tcg/cpu-exec.c
> > +++ b/accel/tcg/cpu-exec.c
> > @@ -243,10 +243,7 @@ void cpu_exec_step_atomic(CPUState *cpu)
> >          if (tb == NULL) {
> >              mmap_lock();
> >              tb_lock();
> > -            tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
> > -            if (likely(tb == NULL)) {
> > -                tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
> > -            }
> > +            tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
> 
> tb_gen_code needs to be renamed to reflect its semantics.
> tb_get_or_gen_code? Or maybe tb_get_code with a sub-helper to do the
> generation.

I think it can remain as tb_gen_code. The caller still gets
a TB, and whether that TB has been generated by this thread or
any other thread is irrelevant.

(snip)
> > --- a/accel/tcg/translate-all.c
> > +++ b/accel/tcg/translate-all.c
> > @@ -1503,12 +1503,16 @@ static inline void tb_page_add(PageDesc *p, TranslationBlock *tb,
> >   * (-1) to indicate that only one page contains the TB.
> >   *
> >   * Called with mmap_lock held for user-mode emulation.
> > + *
> > + * Returns @tb or an existing TB that matches @tb.
> 
> That's just confusing to read. So this returns a TB like the @tb we
> passed in but actually a different one matching the same conditions?

Good point. Here tb_link_page is not a great name, but instead
of adding a long name such as tb_link_page_or_get_existing, in
v2 I've expanded the above comment. It now looks as follows:

 * Returns a pointer to @tb, or a pointer to an existing TB that matches @tb.
 * Note that in !user-mode, another thread might have already added a TB
 * for the same block of guest code that @tb corresponds to. In that case,
 * the caller should discard the original @tb, and use instead the returned TB.
 
> > @@ -1706,7 +1727,15 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
> >       * memory barrier is required before tb_link_page() makes the TB visible
> >       * through the physical hash table and physical page list.
> >       */
> > -    tb_link_page(tb, phys_pc, phys_page2);
> > +    existing_tb = tb_link_page(tb, phys_pc, phys_page2);
> > +    /* if the TB already exists, discard what we just translated */
> 
> So are we in the position now that we could potentially do a translation
> but be beaten by another thread generating the same code?

Exactly.

> I suspect we could
> do with a bit of explanatory commentary for the tb_gen_code functions.

As I said above, I don't think tb_gen_code changes at all from its
callers' point of view, since the caller still gets a TB pointer that it
did not have before.

tb_link_page is the key here -- I hope the updated comment
I quoted above is enough to make things clear.

> Also I think the "Translation Blocks" section needs updating in the
> MTTCG design document to make this clear.

I've added a comment at the bottom of that section:

Parallel code generation is supported. QHT is used at insertion time
as the synchronization point across threads, thereby ensuring that we only
keep track of a single TranslationBlock for each guest code block.

> I'm curious if we should be counting unused translations somewhere in
> the JIT stats. I'm guessing you need to work at a pathological case to
> hit this much?

This should be extremely rare on most workloads. Given that and the
fact that we won't have unused translated code (we discard it by
resetting code_gen_ptr), I wouldn't worry too much about this.
In the unlikely case that it ever became a problem, TCG profiling
time would account for it, and on a perf profile we'd see the slow
path in tb_link_page being taken.
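
A simplified, single-threaded sketch of the discard step follows. The
race is simulated by linking a TB for the same pc beforehand, as if
another thread had won. The names (toy_jit, toy_link, toy_gen_code) and
the flat array standing in for the hash table are assumptions of this
example, with gen_off playing the role of code_gen_ptr.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BUF_SIZE 256 /* hypothetical code buffer size */
#define TOY_MAX_TBS  4   /* hypothetical capacity of the TB table */

struct toy_tb {
    unsigned long pc; /* guest address this block translates */
    size_t code_off;  /* offset of the generated code in the buffer */
    size_t code_len;
};

struct toy_jit {
    unsigned char buf[TOY_BUF_SIZE];
    size_t gen_off;                     /* next free byte in buf */
    struct toy_tb *linked[TOY_MAX_TBS]; /* stand-in for the hash table */
    size_t n_linked;
};

/* Link @tb, or return the already-linked TB for the same pc. */
static struct toy_tb *toy_link(struct toy_jit *jit, struct toy_tb *tb)
{
    for (size_t i = 0; i < jit->n_linked; i++) {
        if (jit->linked[i]->pc == tb->pc) {
            return jit->linked[i];
        }
    }
    assert(jit->n_linked < TOY_MAX_TBS);
    jit->linked[jit->n_linked++] = tb;
    return tb;
}

/*
 * "Translate" @len bytes for @pc into the buffer, then try to link.
 * If a matching TB already exists, hand the bytes back by resetting
 * gen_off, so no unused translated code stays in the buffer.
 * Returns the TB the caller should use.
 */
static struct toy_tb *toy_gen_code(struct toy_jit *jit, struct toy_tb *tb,
                                   unsigned long pc, size_t len)
{
    size_t start = jit->gen_off;

    assert(start + len <= TOY_BUF_SIZE);
    tb->pc = pc;
    tb->code_off = start;
    tb->code_len = len;
    jit->gen_off += len;

    struct toy_tb *existing = toy_link(jit, tb);
    if (existing != tb) {
        jit->gen_off = start; /* discard what we just translated */
    }
    return existing;
}
```

From the caller's side nothing changes: it gets back a usable TB either
way, and only the translation time is wasted when the race is lost.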

Thanks,

		Emilio


* Re: [Qemu-devel] [PATCH 15/16] translate-all: remove tb_lock mention from cpu_restore_state_from_tb
  2018-03-29 16:06   ` Alex Bennée
@ 2018-04-06  1:40     ` Emilio G. Cota
  0 siblings, 0 replies; 52+ messages in thread
From: Emilio G. Cota @ 2018-04-06  1:40 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, Paolo Bonzini, Richard Henderson

On Thu, Mar 29, 2018 at 17:06:56 +0100, Alex Bennée wrote:
> 
> Emilio G. Cota <cota@braap.org> writes:
> 
> > tb_lock was needed when the function did retranslation. However,
> > since fca8a500d519 ("tcg: Save insn data and use it in
> > cpu_restore_state_from_tb") we don't do retranslation.
> >
> > Get rid of the comment.
> 
> I think we need to modify the comment in cpu_restore_state as well:
> 
>   Either way we need to return early as we can't resolve it here.

Thanks, I've added this suggestion to patch 16/16.

		E.


end of thread, other threads:[~2018-04-06  1:40 UTC | newest]

Thread overview: 52+ messages
2018-02-27  5:39 [Qemu-devel] [PATCH 00/16] tcg: tb_lock removal redux v1 Emilio G. Cota
2018-02-27  5:39 ` [Qemu-devel] [PATCH 01/16] qht: require a default comparison function Emilio G. Cota
2018-02-28 19:02   ` Richard Henderson
2018-03-28 16:21   ` Alex Bennée
2018-02-27  5:39 ` [Qemu-devel] [PATCH 02/16] qht: return existing entry when qht_insert fails Emilio G. Cota
2018-02-28 19:10   ` Richard Henderson
2018-03-28 16:33   ` Alex Bennée
2018-04-05 17:10     ` Emilio G. Cota
2018-02-27  5:39 ` [Qemu-devel] [PATCH 03/16] tcg: track TBs with per-region BST's Emilio G. Cota
2018-02-28 20:53   ` Richard Henderson
2018-03-29  9:54   ` Alex Bennée
2018-02-27  5:39 ` [Qemu-devel] [PATCH 04/16] tcg: move tb_ctx.tb_phys_invalidate_count to tcg_ctx Emilio G. Cota
2018-02-28 20:55   ` Richard Henderson
2018-03-29 10:06   ` Alex Bennée
2018-04-05 17:18     ` Emilio G. Cota
2018-02-27  5:39 ` [Qemu-devel] [PATCH 05/16] translate-all: iterate over TBs in a page with PAGE_FOR_EACH_TB Emilio G. Cota
2018-02-28 21:40   ` Richard Henderson
2018-02-28 22:50     ` Emilio G. Cota
2018-02-28 22:53       ` Richard Henderson
2018-02-27  5:39 ` [Qemu-devel] [PATCH 06/16] translate-all: make l1_map lockless Emilio G. Cota
2018-02-28 22:15   ` Richard Henderson
2018-03-29 10:16   ` Alex Bennée
2018-02-27  5:39 ` [Qemu-devel] [PATCH 07/16] translate-all: remove hole in PageDesc Emilio G. Cota
2018-02-28 22:17   ` Richard Henderson
2018-03-29 10:17   ` Alex Bennée
2018-02-27  5:39 ` [Qemu-devel] [PATCH 08/16] translate-all: work page-by-page in tb_invalidate_phys_range_1 Emilio G. Cota
2018-02-28 22:23   ` Richard Henderson
2018-03-29 10:10   ` Alex Bennée
2018-03-29 10:17   ` Alex Bennée
2018-02-27  5:39 ` [Qemu-devel] [PATCH 09/16] translate-all: move tb_invalidate_phys_page_range up in the file Emilio G. Cota
2018-02-28 22:24   ` Richard Henderson
2018-03-29 10:08   ` Alex Bennée
2018-02-27  5:39 ` [Qemu-devel] [PATCH 10/16] translate-all: use per-page locking in !user-mode Emilio G. Cota
2018-03-29 14:55   ` Alex Bennée
2018-04-06  0:43     ` Emilio G. Cota
2018-02-27  5:39 ` [Qemu-devel] [PATCH 11/16] translate-all: add page_collection assertions Emilio G. Cota
2018-03-29 15:08   ` Alex Bennée
2018-02-27  5:39 ` [Qemu-devel] [PATCH 12/16] translate-all: discard TB when tb_link_page returns an existing matching TB Emilio G. Cota
2018-03-29 15:19   ` Alex Bennée
2018-04-06  1:23     ` Emilio G. Cota
2018-02-27  5:39 ` [Qemu-devel] [PATCH 13/16] translate-all: protect TB jumps with a per-destination-TB lock Emilio G. Cota
2018-02-27 11:33   ` Paolo Bonzini
2018-02-27 11:43     ` Laurent Desnogues
2018-02-27 14:31       ` Paolo Bonzini
2018-03-28 15:57   ` Alex Bennée
2018-02-27  5:39 ` [Qemu-devel] [PATCH 14/16] cputlb: remove tb_lock from tlb_flush functions Emilio G. Cota
2018-03-29 15:46   ` Alex Bennée
2018-02-27  5:39 ` [Qemu-devel] [PATCH 15/16] translate-all: remove tb_lock mention from cpu_restore_state_from_tb Emilio G. Cota
2018-03-29 16:06   ` Alex Bennée
2018-04-06  1:40     ` Emilio G. Cota
2018-02-27  5:39 ` [Qemu-devel] [PATCH 16/16] tcg: remove tb_lock Emilio G. Cota
2018-03-29 16:15   ` Alex Bennée
