* [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress)
@ 2017-06-29 20:28 Emilio G. Cota
  2017-06-29 20:28 ` [Qemu-devel] [RFC 1/7] exec-all: fix typos in TranslationBlock's documentation Emilio G. Cota
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-06-29 20:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This is a request for comments as well as a request for help :-)

I've been experimenting with making TCGContext per-thread, so that we can
run most of tcg_gen_code in parallel. I've made some progress,
but haven't yet got it to work.

My guess is that the TCG stack is still global instead of per-vCPU
(it's been global since tmp_buf was removed from CPUState, right?),
but I'm having trouble following that code so most likely I'm wrong.

Any help would be appreciated--please disregard minor nits; I want
to see whether I can make this work, then take measurements
to decide whether it is worth the trouble.

- Patch 1 is a trivial doc fixup, feel free to pick it up

- Patches 2-3 remove *tbs[] to use a binary search tree instead.
  This removes the assumption in tb_find_pc that *tbs[] are ordered
  by tc_ptr, thereby allowing us to generate code regardless of
  its location on the host (as we do after patch 6).

- Patch 4 addresses a reporting issue: ever since we embedded the
  struct TB's in code_gen_buffer (6e3b2bfd6), we have been
  misreporting the size of the generated code. Not a huge deal,
  but I noticed it while working on this.

- Patches 5-7 make TCGContext per-thread in softmmu. I have put there
  some XXX's to note that I'm aware of those issues, so don't worry
  too much about those--except of course if you have any input on
  what the cause of the race(s) might be.

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Qemu-devel] [RFC 1/7] exec-all: fix typos in TranslationBlock's documentation
  2017-06-29 20:28 [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress) Emilio G. Cota
@ 2017-06-29 20:28 ` Emilio G. Cota
  2017-06-30  6:17   ` Richard Henderson
  2017-06-29 20:28 ` [Qemu-devel] [RFC 2/7] translate-all: add out_size field to TranslationBlock Emilio G. Cota
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Emilio G. Cota @ 2017-06-29 20:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 724ec73..35a75f1 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -338,7 +338,7 @@ struct TranslationBlock {
     /* The following data are used to directly call another TB from
      * the code of this one. This can be done either by emitting direct or
      * indirect native jump instructions. These jumps are reset so that the TB
-     * just continue its execution. The TB can be linked to another one by
+     * just continues its execution. The TB can be linked to another one by
      * setting one of the jump targets (or patching the jump instruction). Only
      * two of such jumps are supported.
      */
@@ -349,7 +349,7 @@ struct TranslationBlock {
 #else
     uintptr_t jmp_target_addr[2]; /* target address for indirect jump */
 #endif
-    /* Each TB has an assosiated circular list of TBs jumping to this one.
+    /* Each TB has an associated circular list of TBs jumping to this one.
      * jmp_list_first points to the first TB jumping to this one.
      * jmp_list_next is used to point to the next TB in a list.
      * Since each TB can have two jumps, it can participate in two lists.
-- 
2.7.4

* [Qemu-devel] [RFC 2/7] translate-all: add out_size field to TranslationBlock
  2017-06-29 20:28 [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress) Emilio G. Cota
  2017-06-29 20:28 ` [Qemu-devel] [RFC 1/7] exec-all: fix typos in TranslationBlock's documentation Emilio G. Cota
@ 2017-06-29 20:28 ` Emilio G. Cota
  2017-06-30  6:31   ` Richard Henderson
  2017-06-29 20:28 ` [Qemu-devel] [RFC 3/7] translate-all: use a binary search tree to track TBs in TBContext Emilio G. Cota
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Emilio G. Cota @ 2017-06-29 20:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This paves the way for upcoming work: we need tb->out_size for
tb_find_pc to work with a binary search tree.

Note that thanks to the cacheline padding we are using, this will
not waste any additional memory on hosts with 64-byte cache lines.
Using an s16 would be ideal, since that would plug an existing hole
in the struct, but I see no guarantee that a TB won't overflow it.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h   |  1 +
 accel/tcg/translate-all.c | 16 ++++++++--------
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 35a75f1..df12338 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -363,6 +363,7 @@ struct TranslationBlock {
      */
     uintptr_t jmp_list_next[2];
     uintptr_t jmp_list_first;
+    int32_t out_size; /* size of host code for this block */
 };
 
 void tb_free(TranslationBlock *tb);
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index f6ad46b..da91482 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1260,7 +1260,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tb_page_addr_t phys_pc, phys_page2;
     target_ulong virt_page2;
     tcg_insn_unit *gen_code_buf;
-    int gen_code_size, search_size;
+    int search_size;
 #ifdef CONFIG_PROFILER
     int64_t ti;
 #endif
@@ -1327,11 +1327,11 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
        the tcg optimization currently hidden inside tcg_gen_code.  All
        that should be required is to flush the TBs, allocate a new TB,
        re-initialize it per above, and re-do the actual code generation.  */
-    gen_code_size = tcg_gen_code(&tcg_ctx, tb);
-    if (unlikely(gen_code_size < 0)) {
+    tb->out_size = tcg_gen_code(&tcg_ctx, tb);
+    if (unlikely(tb->out_size < 0)) {
         goto buffer_overflow;
     }
-    search_size = encode_search(tb, (void *)gen_code_buf + gen_code_size);
+    search_size = encode_search(tb, (void *)gen_code_buf + tb->out_size);
     if (unlikely(search_size < 0)) {
         goto buffer_overflow;
     }
@@ -1339,7 +1339,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 #ifdef CONFIG_PROFILER
     tcg_ctx.code_time += profile_getclock();
     tcg_ctx.code_in_len += tb->size;
-    tcg_ctx.code_out_len += gen_code_size;
+    tcg_ctx.code_out_len += tb->out_size;
     tcg_ctx.search_out_len += search_size;
 #endif
 
@@ -1347,8 +1347,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     if (qemu_loglevel_mask(CPU_LOG_TB_OUT_ASM) &&
         qemu_log_in_addr_range(tb->pc)) {
         qemu_log_lock();
-        qemu_log("OUT: [size=%d]\n", gen_code_size);
-        log_disas(tb->tc_ptr, gen_code_size);
+        qemu_log("OUT: [size=%d]\n", tb->out_size);
+        log_disas(tb->tc_ptr, tb->out_size);
         qemu_log("\n");
         qemu_log_flush();
         qemu_log_unlock();
@@ -1356,7 +1356,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 #endif
 
     tcg_ctx.code_gen_ptr = (void *)
-        ROUND_UP((uintptr_t)gen_code_buf + gen_code_size + search_size,
+        ROUND_UP((uintptr_t)gen_code_buf + tb->out_size + search_size,
                  CODE_GEN_ALIGN);
 
     /* init jump list */
-- 
2.7.4

* [Qemu-devel] [RFC 3/7] translate-all: use a binary search tree to track TBs in TBContext
  2017-06-29 20:28 [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress) Emilio G. Cota
  2017-06-29 20:28 ` [Qemu-devel] [RFC 1/7] exec-all: fix typos in TranslationBlock's documentation Emilio G. Cota
  2017-06-29 20:28 ` [Qemu-devel] [RFC 2/7] translate-all: add out_size field to TranslationBlock Emilio G. Cota
@ 2017-06-29 20:28 ` Emilio G. Cota
  2017-06-30  7:41   ` Richard Henderson
  2017-06-29 20:28 ` [Qemu-devel] [RFC 4/7] translate-all: report correct avg host TB size Emilio G. Cota
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Emilio G. Cota @ 2017-06-29 20:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This is a prerequisite for having threads generate code on separate
buffers, which will help scalability when booting multiple cores
under MTTCG.

Note that tb_free does not free space in the code_gen_buffer anymore,
since we cannot easily know whether the tb is the last one inserted
in code_gen_buffer.

Performance-wise, lookups in tb_find_pc are the same as before:
O(log n). However, insertions are O(log n) instead of O(1), which results
in a small slowdown when booting debian-arm:

Performance counter stats for 'build/arm-softmmu/qemu-system-arm \
	-machine type=virt -nographic -smp 1 -m 4096 \
	-netdev user,id=unet,hostfwd=tcp::2222-:22 \
	-device virtio-net-device,netdev=unet \
	-drive file=img/arm/jessie-arm32.qcow2,id=myblock,index=0,if=none \
	-device virtio-blk-device,drive=myblock \
	-kernel img/arm/aarch32-current-linux-kernel-only.img \
	-append console=ttyAMA0 root=/dev/vda1 \
	-name arm,debug-threads=on -smp 1' (10 runs):

- Before:
      10289.389753      task-clock (msec)         #    0.952 CPUs utilized            ( +-  0.13% )
            18,238      context-switches          #    0.002 M/sec                    ( +-  0.73% )
                 0      cpu-migrations            #    0.000 K/sec
            86,555      page-faults               #    0.008 M/sec                    ( +-  0.49% )
    45,079,926,395      cycles                    #    4.381 GHz                      ( +-  0.14% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    84,582,463,603      instructions              #    1.88  insns per cycle          ( +-  0.26% )
    14,964,335,400      branches                  # 1454.346 M/sec                    ( +-  0.29% )
       288,324,215      branch-misses             #    1.93% of all branches          ( +-  0.34% )

      10.813687279 seconds time elapsed                                          ( +-  0.42% )

- After:
      10333.181473      task-clock (msec)         #    0.944 CPUs utilized            ( +-  0.27% )
            18,167      context-switches          #    0.002 M/sec                    ( +-  0.20% )
                 0      cpu-migrations            #    0.000 K/sec
            83,354      page-faults               #    0.008 M/sec                    ( +-  0.92% )
    45,247,697,926      cycles                    #    4.379 GHz                      ( +-  0.23% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    84,537,657,945      instructions              #    1.87  insns per cycle          ( +-  0.18% )
    14,988,568,500      branches                  # 1450.528 M/sec                    ( +-  0.21% )
       294,765,097      branch-misses             #    1.97% of all branches          ( +-  0.43% )

      10.946641611 seconds time elapsed                                          ( +-  0.79% )

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/tb-context.h |   4 +-
 accel/tcg/translate-all.c | 181 ++++++++++++++++++++--------------------------
 2 files changed, 81 insertions(+), 104 deletions(-)

diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
index 25c2afe..1fa8dcc 100644
--- a/include/exec/tb-context.h
+++ b/include/exec/tb-context.h
@@ -31,10 +31,8 @@ typedef struct TBContext TBContext;
 
 struct TBContext {
 
-    TranslationBlock **tbs;
+    GTree *tb_tree;
     struct qht htable;
-    size_t tbs_size;
-    int nb_tbs;
     /* any access to the tbs or the page table must use this lock */
     QemuMutex tb_lock;
 
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index da91482..a18fbf7 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -770,6 +770,21 @@ static inline void *alloc_code_gen_buffer(void)
 }
 #endif /* USE_STATIC_CODE_GEN_BUFFER, WIN32, POSIX */
 
+/* @key is already in the tree so it's safe to use container_of on it */
+static gint tc_ptr_cmp(gconstpointer candidate, gconstpointer key)
+{
+    uintptr_t a = *(uintptr_t *)candidate;
+    const TranslationBlock *tb = container_of(key, TranslationBlock, tc_ptr);
+    uintptr_t b = (uintptr_t)tb->tc_ptr;
+
+    if (a >= b + tb->out_size) {
+        return 1;
+    } else if (a < b) {
+        return -1;
+    }
+    return 0;
+}
+
 static inline void code_gen_alloc(size_t tb_size)
 {
     tcg_ctx.code_gen_buffer_size = size_code_gen_buffer(tb_size);
@@ -778,15 +793,7 @@ static inline void code_gen_alloc(size_t tb_size)
         fprintf(stderr, "Could not allocate dynamic translator buffer\n");
         exit(1);
     }
-
-    /* size this conservatively -- realloc later if needed */
-    tcg_ctx.tb_ctx.tbs_size =
-        tcg_ctx.code_gen_buffer_size / CODE_GEN_AVG_BLOCK_SIZE / 8;
-    if (unlikely(!tcg_ctx.tb_ctx.tbs_size)) {
-        tcg_ctx.tb_ctx.tbs_size = 64 * 1024;
-    }
-    tcg_ctx.tb_ctx.tbs = g_new(TranslationBlock *, tcg_ctx.tb_ctx.tbs_size);
-
+    tcg_ctx.tb_ctx.tb_tree = g_tree_new(tc_ptr_cmp);
     qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
 }
 
@@ -827,7 +834,6 @@ bool tcg_enabled(void)
 static TranslationBlock *tb_alloc(target_ulong pc)
 {
     TranslationBlock *tb;
-    TBContext *ctx;
 
     assert_tb_locked();
 
@@ -835,12 +841,6 @@ static TranslationBlock *tb_alloc(target_ulong pc)
     if (unlikely(tb == NULL)) {
         return NULL;
     }
-    ctx = &tcg_ctx.tb_ctx;
-    if (unlikely(ctx->nb_tbs == ctx->tbs_size)) {
-        ctx->tbs_size *= 2;
-        ctx->tbs = g_renew(TranslationBlock *, ctx->tbs, ctx->tbs_size);
-    }
-    ctx->tbs[ctx->nb_tbs++] = tb;
     return tb;
 }
 
@@ -849,16 +849,7 @@ void tb_free(TranslationBlock *tb)
 {
     assert_tb_locked();
 
-    /* In practice this is mostly used for single use temporary TB
-       Ignore the hard cases and just back up if this TB happens to
-       be the last one generated.  */
-    if (tcg_ctx.tb_ctx.nb_tbs > 0 &&
-            tb == tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) {
-        size_t struct_size = ROUND_UP(sizeof(*tb), qemu_icache_linesize);
-
-        tcg_ctx.code_gen_ptr = tb->tc_ptr - struct_size;
-        tcg_ctx.tb_ctx.nb_tbs--;
-    }
+    g_tree_remove(tcg_ctx.tb_ctx.tb_tree, &tb->tc_ptr);
 }
 
 static inline void invalidate_page_bitmap(PageDesc *p)
@@ -906,6 +897,8 @@ static void page_flush_tb(void)
 /* flush all the translation blocks */
 static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
 {
+    int nb_tbs __attribute__((unused));
+
     tb_lock();
 
     /* If it is already been done on request of another CPU,
@@ -916,11 +909,12 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     }
 
 #if defined(DEBUG_TB_FLUSH)
+    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
     printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
            (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
-           tcg_ctx.tb_ctx.nb_tbs, tcg_ctx.tb_ctx.nb_tbs > 0 ?
+           nb_tbs, nb_tbs > 0 ?
            ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)) /
-           tcg_ctx.tb_ctx.nb_tbs : 0);
+           nb_tbs : 0);
 #endif
     if ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)
         > tcg_ctx.code_gen_buffer_size) {
@@ -935,7 +929,10 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
         }
     }
 
-    tcg_ctx.tb_ctx.nb_tbs = 0;
+    /* Increment the refcount first so that destroy acts as a reset */
+    g_tree_ref(tcg_ctx.tb_ctx.tb_tree);
+    g_tree_destroy(tcg_ctx.tb_ctx.tb_tree);
+
     qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
@@ -1385,6 +1382,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
      * through the physical hash table and physical page list.
      */
     tb_link_page(tb, phys_pc, phys_page2);
+    g_tree_insert(tcg_ctx.tb_ctx.tb_tree, &tb->tc_ptr, tb);
     return tb;
 }
 
@@ -1653,37 +1651,14 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
 }
 #endif
 
-/* find the TB 'tb' such that tb[0].tc_ptr <= tc_ptr <
-   tb[1].tc_ptr. Return NULL if not found */
+/*
+ * Find the TB 'tb' such that
+ * tb->tc_ptr <= tc_ptr < tb->tc_ptr + tb->out_size
+ * Return NULL if not found.
+ */
 static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
 {
-    int m_min, m_max, m;
-    uintptr_t v;
-    TranslationBlock *tb;
-
-    if (tcg_ctx.tb_ctx.nb_tbs <= 0) {
-        return NULL;
-    }
-    if (tc_ptr < (uintptr_t)tcg_ctx.code_gen_buffer ||
-        tc_ptr >= (uintptr_t)tcg_ctx.code_gen_ptr) {
-        return NULL;
-    }
-    /* binary search (cf Knuth) */
-    m_min = 0;
-    m_max = tcg_ctx.tb_ctx.nb_tbs - 1;
-    while (m_min <= m_max) {
-        m = (m_min + m_max) >> 1;
-        tb = tcg_ctx.tb_ctx.tbs[m];
-        v = (uintptr_t)tb->tc_ptr;
-        if (v == tc_ptr) {
-            return tb;
-        } else if (tc_ptr < v) {
-            m_max = m - 1;
-        } else {
-            m_min = m + 1;
-        }
-    }
-    return tcg_ctx.tb_ctx.tbs[m_max];
+    return g_tree_lookup(tcg_ctx.tb_ctx.tb_tree, &tc_ptr);
 }
 
 #if !defined(CONFIG_USER_ONLY)
@@ -1866,63 +1841,67 @@ static void print_qht_statistics(FILE *f, fprintf_function cpu_fprintf,
     g_free(hgram);
 }
 
+struct tb_tree_stats {
+    int target_size;
+    int max_target_size;
+    int direct_jmp_count;
+    int direct_jmp2_count;
+    int cross_page;
+};
+
+static gboolean tb_tree_stats_iter(gpointer key, gpointer value, gpointer data)
+{
+    const TranslationBlock *tb = value;
+    struct tb_tree_stats *tst = data;
+
+    tst->target_size += tb->size;
+    if (tb->size > tst->max_target_size) {
+        tst->max_target_size = tb->size;
+    }
+    if (tb->page_addr[1] != -1) {
+        tst->cross_page++;
+    }
+    if (tb->jmp_reset_offset[0] != TB_JMP_RESET_OFFSET_INVALID) {
+        tst->direct_jmp_count++;
+        if (tb->jmp_reset_offset[1] != TB_JMP_RESET_OFFSET_INVALID) {
+            tst->direct_jmp2_count++;
+        }
+    }
+    return false;
+}
+
 void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
 {
-    int i, target_code_size, max_target_code_size;
-    int direct_jmp_count, direct_jmp2_count, cross_page;
-    TranslationBlock *tb;
+    struct tb_tree_stats tst = {};
     struct qht_stats hst;
+    int nb_tbs;
 
     tb_lock();
 
-    target_code_size = 0;
-    max_target_code_size = 0;
-    cross_page = 0;
-    direct_jmp_count = 0;
-    direct_jmp2_count = 0;
-    for (i = 0; i < tcg_ctx.tb_ctx.nb_tbs; i++) {
-        tb = tcg_ctx.tb_ctx.tbs[i];
-        target_code_size += tb->size;
-        if (tb->size > max_target_code_size) {
-            max_target_code_size = tb->size;
-        }
-        if (tb->page_addr[1] != -1) {
-            cross_page++;
-        }
-        if (tb->jmp_reset_offset[0] != TB_JMP_RESET_OFFSET_INVALID) {
-            direct_jmp_count++;
-            if (tb->jmp_reset_offset[1] != TB_JMP_RESET_OFFSET_INVALID) {
-                direct_jmp2_count++;
-            }
-        }
-    }
+    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
+    g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
     /* XXX: avoid using doubles ? */
     cpu_fprintf(f, "Translation buffer state:\n");
     cpu_fprintf(f, "gen code size       %td/%zd\n",
                 tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
                 tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer);
-    cpu_fprintf(f, "TB count            %d\n", tcg_ctx.tb_ctx.nb_tbs);
+    cpu_fprintf(f, "TB count            %d\n", nb_tbs);
     cpu_fprintf(f, "TB avg target size  %d max=%d bytes\n",
-            tcg_ctx.tb_ctx.nb_tbs ? target_code_size /
-                    tcg_ctx.tb_ctx.nb_tbs : 0,
-            max_target_code_size);
+                nb_tbs ? tst.target_size / nb_tbs : 0,
+                tst.max_target_size);
     cpu_fprintf(f, "TB avg host size    %td bytes (expansion ratio: %0.1f)\n",
-            tcg_ctx.tb_ctx.nb_tbs ? (tcg_ctx.code_gen_ptr -
-                                     tcg_ctx.code_gen_buffer) /
-                                     tcg_ctx.tb_ctx.nb_tbs : 0,
-                target_code_size ? (double) (tcg_ctx.code_gen_ptr -
-                                             tcg_ctx.code_gen_buffer) /
-                                             target_code_size : 0);
-    cpu_fprintf(f, "cross page TB count %d (%d%%)\n", cross_page,
-            tcg_ctx.tb_ctx.nb_tbs ? (cross_page * 100) /
-                                    tcg_ctx.tb_ctx.nb_tbs : 0);
+                nb_tbs ? (tcg_ctx.code_gen_ptr -
+                          tcg_ctx.code_gen_buffer) / nb_tbs : 0,
+                tst.target_size ? (double) (tcg_ctx.code_gen_ptr -
+                                            tcg_ctx.code_gen_buffer) /
+                                            tst.target_size : 0);
+    cpu_fprintf(f, "cross page TB count %d (%d%%)\n", tst.cross_page,
+            nb_tbs ? (tst.cross_page * 100) / nb_tbs : 0);
     cpu_fprintf(f, "direct jump count   %d (%d%%) (2 jumps=%d %d%%)\n",
-                direct_jmp_count,
-                tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp_count * 100) /
-                        tcg_ctx.tb_ctx.nb_tbs : 0,
-                direct_jmp2_count,
-                tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp2_count * 100) /
-                        tcg_ctx.tb_ctx.nb_tbs : 0);
+                tst.direct_jmp_count,
+                nb_tbs ? (tst.direct_jmp_count * 100) / nb_tbs : 0,
+                tst.direct_jmp2_count,
+                nb_tbs ? (tst.direct_jmp2_count * 100) / nb_tbs : 0);
 
     qht_statistics_init(&tcg_ctx.tb_ctx.htable, &hst);
     print_qht_statistics(f, cpu_fprintf, hst);
-- 
2.7.4

* [Qemu-devel] [RFC 4/7] translate-all: report correct avg host TB size
  2017-06-29 20:28 [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress) Emilio G. Cota
                   ` (2 preceding siblings ...)
  2017-06-29 20:28 ` [Qemu-devel] [RFC 3/7] translate-all: use a binary search tree to track TBs in TBContext Emilio G. Cota
@ 2017-06-29 20:28 ` Emilio G. Cota
  2017-06-30  7:50   ` Richard Henderson
  2017-06-29 20:28 ` [Qemu-devel] [RFC 5/7] tcg: take tb_ctx out of TCGContext Emilio G. Cota
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Emilio G. Cota @ 2017-06-29 20:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Since commit 6e3b2bfd6 ("tcg: allocate TB structs before the
corresponding translated code") we are not fully utilizing
code_gen_buffer for translated code, and therefore are
incorrectly reporting the amount of translated code as well as
the average host TB size. Address this by:

- Deliberately continuing to misreport the total translated code size;
  reporting the accurate number would mislead users into thinking
  "tb-size" is not honoured.

- Expanding tb_tree_stats to accurately count the bytes of translated code on
  the host, and using this for reporting the average tb host size,
  as well as the expansion ratio.

In the future we might want to consider reporting the accurate numbers for
the total translated code, together with a "bookkeeping/overhead" field to
account for the TB structs.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index a18fbf7..9714777 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -894,9 +894,20 @@ static void page_flush_tb(void)
     }
 }
 
+static __attribute__((unused))
+gboolean tb_host_size_iter(gpointer key, gpointer value, gpointer data)
+{
+    const TranslationBlock *tb = value;
+    int *size = data;
+
+    *size += tb->out_size;
+    return false;
+}
+
 /* flush all the translation blocks */
 static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
 {
+    int host_size __attribute__((unused));
     int nb_tbs __attribute__((unused));
 
     tb_lock();
@@ -909,12 +920,11 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     }
 
 #if defined(DEBUG_TB_FLUSH)
+    g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_host_size_iter, &host_size);
     nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
-    printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
+    printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%d\n",
            (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
-           nb_tbs, nb_tbs > 0 ?
-           ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)) /
-           nb_tbs : 0);
+           nb_tbs, nb_tbs > 0 ? host_size / nb_tbs : 0);
 #endif
     if ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)
         > tcg_ctx.code_gen_buffer_size) {
@@ -1842,6 +1852,7 @@ static void print_qht_statistics(FILE *f, fprintf_function cpu_fprintf,
 }
 
 struct tb_tree_stats {
+    int host_size;
     int target_size;
     int max_target_size;
     int direct_jmp_count;
@@ -1854,6 +1865,7 @@ static gboolean tb_tree_stats_iter(gpointer key, gpointer value, gpointer data)
     const TranslationBlock *tb = value;
     struct tb_tree_stats *tst = data;
 
+    tst->host_size += tb->out_size;
     tst->target_size += tb->size;
     if (tb->size > tst->max_target_size) {
         tst->max_target_size = tb->size;
@@ -1882,6 +1894,11 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
     /* XXX: avoid using doubles ? */
     cpu_fprintf(f, "Translation buffer state:\n");
+    /*
+     * Report total code size including the padding and TB structs;
+     * otherwise users might think "-tb-size" is not honoured.
+     * For avg host size we use the precise numbers from tb_tree_stats though.
+     */
     cpu_fprintf(f, "gen code size       %td/%zd\n",
                 tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
                 tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer);
@@ -1889,12 +1906,9 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     cpu_fprintf(f, "TB avg target size  %d max=%d bytes\n",
                 nb_tbs ? tst.target_size / nb_tbs : 0,
                 tst.max_target_size);
-    cpu_fprintf(f, "TB avg host size    %td bytes (expansion ratio: %0.1f)\n",
-                nb_tbs ? (tcg_ctx.code_gen_ptr -
-                          tcg_ctx.code_gen_buffer) / nb_tbs : 0,
-                tst.target_size ? (double) (tcg_ctx.code_gen_ptr -
-                                            tcg_ctx.code_gen_buffer) /
-                                            tst.target_size : 0);
+    cpu_fprintf(f, "TB avg host size    %d bytes (expansion ratio: %0.1f)\n",
+                nb_tbs ? tst.host_size / nb_tbs : 0,
+                tst.target_size ? (double)tst.host_size / tst.target_size : 0);
     cpu_fprintf(f, "cross page TB count %d (%d%%)\n", tst.cross_page,
             nb_tbs ? (tst.cross_page * 100) / nb_tbs : 0);
     cpu_fprintf(f, "direct jump count   %d (%d%%) (2 jumps=%d %d%%)\n",
-- 
2.7.4

* [Qemu-devel] [RFC 5/7] tcg: take tb_ctx out of TCGContext
  2017-06-29 20:28 [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress) Emilio G. Cota
                   ` (3 preceding siblings ...)
  2017-06-29 20:28 ` [Qemu-devel] [RFC 4/7] translate-all: report correct avg host TB size Emilio G. Cota
@ 2017-06-29 20:28 ` Emilio G. Cota
  2017-06-30  7:58   ` Richard Henderson
  2017-06-29 20:28 ` [Qemu-devel] [RFC 6/7] [XXX] tcg: make TCGContext thread-local for softmmu Emilio G. Cota
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Emilio G. Cota @ 2017-06-29 20:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This is groundwork for the next patch, in which TCGContext is made
thread-local.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/tb-context.h |  2 ++
 tcg/tcg.h                 |  2 --
 accel/tcg/cpu-exec.c      |  2 +-
 accel/tcg/translate-all.c | 57 ++++++++++++++++++++++++-----------------------
 linux-user/main.c         |  6 ++---
 5 files changed, 35 insertions(+), 34 deletions(-)

diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
index 1fa8dcc..1d41202 100644
--- a/include/exec/tb-context.h
+++ b/include/exec/tb-context.h
@@ -41,4 +41,6 @@ struct TBContext {
     int tb_phys_invalidate_count;
 };
 
+extern TBContext tb_ctx;
+
 #endif
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 9e37722..3b3359c 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -706,8 +706,6 @@ struct TCGContext {
     /* Threshold to flush the translated code buffer.  */
     void *code_gen_highwater;
 
-    TBContext tb_ctx;
-
     /* Track which vCPU triggers events */
     CPUState *cpu;                      /* *_trans */
     TCGv_env tcg_env;                   /* *_exec  */
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 3581618..54ecae2 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -323,7 +323,7 @@ TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
     phys_pc = get_page_addr_code(desc.env, pc);
     desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
     h = tb_hash_func(phys_pc, pc, flags);
-    return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
+    return qht_lookup(&tb_ctx.htable, tb_cmp, &desc, h);
 }
 
 static inline TranslationBlock *tb_find(CPUState *cpu,
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 9714777..2869c79 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -133,6 +133,7 @@ static void *l1_map[V_L1_MAX_SIZE];
 
 /* code generation context */
 TCGContext tcg_ctx;
+TBContext tb_ctx;
 bool parallel_cpus;
 
 /* translation block context */
@@ -164,7 +165,7 @@ static void page_table_config_init(void)
 void tb_lock(void)
 {
     assert_tb_unlocked();
-    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
+    qemu_mutex_lock(&tb_ctx.tb_lock);
     have_tb_lock++;
 }
 
@@ -172,13 +173,13 @@ void tb_unlock(void)
 {
     assert_tb_locked();
     have_tb_lock--;
-    qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+    qemu_mutex_unlock(&tb_ctx.tb_lock);
 }
 
 void tb_lock_reset(void)
 {
     if (have_tb_lock) {
-        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+        qemu_mutex_unlock(&tb_ctx.tb_lock);
         have_tb_lock = 0;
     }
 }
@@ -793,15 +794,15 @@ static inline void code_gen_alloc(size_t tb_size)
         fprintf(stderr, "Could not allocate dynamic translator buffer\n");
         exit(1);
     }
-    tcg_ctx.tb_ctx.tb_tree = g_tree_new(tc_ptr_cmp);
-    qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
+    tb_ctx.tb_tree = g_tree_new(tc_ptr_cmp);
+    qemu_mutex_init(&tb_ctx.tb_lock);
 }
 
 static void tb_htable_init(void)
 {
     unsigned int mode = QHT_MODE_AUTO_RESIZE;
 
-    qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
+    qht_init(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
 }
 
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
@@ -849,7 +850,7 @@ void tb_free(TranslationBlock *tb)
 {
     assert_tb_locked();
 
-    g_tree_remove(tcg_ctx.tb_ctx.tb_tree, &tb->tc_ptr);
+    g_tree_remove(tb_ctx.tb_tree, &tb->tc_ptr);
 }
 
 static inline void invalidate_page_bitmap(PageDesc *p)
@@ -915,13 +916,13 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     /* If it is already been done on request of another CPU,
      * just retry.
      */
-    if (tcg_ctx.tb_ctx.tb_flush_count != tb_flush_count.host_int) {
+    if (tb_ctx.tb_flush_count != tb_flush_count.host_int) {
         goto done;
     }
 
 #if defined(DEBUG_TB_FLUSH)
-    g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_host_size_iter, &host_size);
-    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
+    g_tree_foreach(tb_ctx.tb_tree, tb_host_size_iter, &host_size);
+    nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
     printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%d\n",
            (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
            nb_tbs, nb_tbs > 0 ? host_size / nb_tbs : 0);
@@ -940,17 +941,17 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     }
 
     /* Increment the refcount first so that destroy acts as a reset */
-    g_tree_ref(tcg_ctx.tb_ctx.tb_tree);
-    g_tree_destroy(tcg_ctx.tb_ctx.tb_tree);
+    g_tree_ref(tb_ctx.tb_tree);
+    g_tree_destroy(tb_ctx.tb_tree);
 
-    qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
+    qht_reset_size(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
     tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
     /* XXX: flush processor icache at this point if cache flush is
        expensive */
-    atomic_mb_set(&tcg_ctx.tb_ctx.tb_flush_count,
-                  tcg_ctx.tb_ctx.tb_flush_count + 1);
+    atomic_mb_set(&tb_ctx.tb_flush_count,
+                  tb_ctx.tb_flush_count + 1);
 
 done:
     tb_unlock();
@@ -959,7 +960,7 @@ done:
 void tb_flush(CPUState *cpu)
 {
     if (tcg_enabled()) {
-        unsigned tb_flush_count = atomic_mb_read(&tcg_ctx.tb_ctx.tb_flush_count);
+        unsigned tb_flush_count = atomic_mb_read(&tb_ctx.tb_flush_count);
         async_safe_run_on_cpu(cpu, do_tb_flush,
                               RUN_ON_CPU_HOST_INT(tb_flush_count));
     }
@@ -986,7 +987,7 @@ do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 static void tb_invalidate_check(target_ulong address)
 {
     address &= TARGET_PAGE_MASK;
-    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address);
+    qht_iter(&tb_ctx.htable, do_tb_invalidate_check, &address);
 }
 
 static void
@@ -1006,7 +1007,7 @@ do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 /* verify that all the pages have correct rights for code */
 static void tb_page_check(void)
 {
-    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
+    qht_iter(&tb_ctx.htable, do_tb_page_check, NULL);
 }
 
 #endif
@@ -1105,7 +1106,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
     h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
+    qht_remove(&tb_ctx.htable, tb, h);
 
     /* remove the TB from the page list */
     if (tb->page_addr[0] != page_addr) {
@@ -1134,7 +1135,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     /* suppress any remaining jumps to this TB */
     tb_jmp_unlink(tb);
 
-    tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
+    tb_ctx.tb_phys_invalidate_count++;
 }
 
 #ifdef CONFIG_SOFTMMU
@@ -1250,7 +1251,7 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
 
     /* add in the hash table */
     h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
+    qht_insert(&tb_ctx.htable, tb, h);
 
 #ifdef DEBUG_TB_CHECK
     tb_page_check();
@@ -1392,7 +1393,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
      * through the physical hash table and physical page list.
      */
     tb_link_page(tb, phys_pc, phys_page2);
-    g_tree_insert(tcg_ctx.tb_ctx.tb_tree, &tb->tc_ptr, tb);
+    g_tree_insert(tb_ctx.tb_tree, &tb->tc_ptr, tb);
     return tb;
 }
 
@@ -1668,7 +1669,7 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
  */
 static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
 {
-    return g_tree_lookup(tcg_ctx.tb_ctx.tb_tree, &tc_ptr);
+    return g_tree_lookup(tb_ctx.tb_tree, &tc_ptr);
 }
 
 #if !defined(CONFIG_USER_ONLY)
@@ -1890,8 +1891,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
 
     tb_lock();
 
-    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
-    g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
+    nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
+    g_tree_foreach(tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
     /* XXX: avoid using doubles ? */
     cpu_fprintf(f, "Translation buffer state:\n");
     /*
@@ -1917,15 +1918,15 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
                 tst.direct_jmp2_count,
                 nb_tbs ? (tst.direct_jmp2_count * 100) / nb_tbs : 0);
 
-    qht_statistics_init(&tcg_ctx.tb_ctx.htable, &hst);
+    qht_statistics_init(&tb_ctx.htable, &hst);
     print_qht_statistics(f, cpu_fprintf, hst);
     qht_statistics_destroy(&hst);
 
     cpu_fprintf(f, "\nStatistics:\n");
     cpu_fprintf(f, "TB flush count      %u\n",
-            atomic_read(&tcg_ctx.tb_ctx.tb_flush_count));
+            atomic_read(&tb_ctx.tb_flush_count));
     cpu_fprintf(f, "TB invalidate count %d\n",
-            tcg_ctx.tb_ctx.tb_phys_invalidate_count);
+            tb_ctx.tb_phys_invalidate_count);
     cpu_fprintf(f, "TLB flush count     %d\n", tlb_flush_count);
     tcg_dump_info(f, cpu_fprintf);
 
diff --git a/linux-user/main.c b/linux-user/main.c
index ad03c9e..630c73d 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -114,7 +114,7 @@ int cpu_get_pic_interrupt(CPUX86State *env)
 void fork_start(void)
 {
     cpu_list_lock();
-    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
+    qemu_mutex_lock(&tb_ctx.tb_lock);
     mmap_fork_start();
 }
 
@@ -130,11 +130,11 @@ void fork_end(int child)
                 QTAILQ_REMOVE(&cpus, cpu, node);
             }
         }
-        qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
+        qemu_mutex_init(&tb_ctx.tb_lock);
         qemu_init_cpu_list();
         gdbserver_fork(thread_cpu);
     } else {
-        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+        qemu_mutex_unlock(&tb_ctx.tb_lock);
         cpu_list_unlock();
     }
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [RFC 6/7] [XXX] tcg: make TCGContext thread-local for softmmu
  2017-06-29 20:28 [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress) Emilio G. Cota
                   ` (4 preceding siblings ...)
  2017-06-29 20:28 ` [Qemu-devel] [RFC 5/7] tcg: take tb_ctx out of TCGContext Emilio G. Cota
@ 2017-06-29 20:28 ` Emilio G. Cota
  2017-06-30  8:18   ` Richard Henderson
  2017-07-01  1:47   ` [Qemu-devel] [PATCH] fixup: missed some TLS variables Emilio G. Cota
  2017-06-29 20:28 ` [Qemu-devel] [RFC 7/7] [XXX] translate-all: generate TCG code without holding tb_lock Emilio G. Cota
  2017-06-30  8:25 ` [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress) Richard Henderson
  7 siblings, 2 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-06-29 20:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This will allow us to generate TCG code in parallel.

User-mode is kept out of this: contention due to concurrent translation
is more commonly found in full-system mode (e.g. booting a many-core guest).

XXX: For now, only convert arm/a64, since these are the only guests that
have proper MTTCG support.

XXX: arm_translate_init needs to be called from a proper place.

XXX: TCG profiling info and statistics are broken by this

XXX: This is calling prologue_init once per vCPU, i.e. each TCGContext
     gets a different prologue/epilogue (all of them with the same
     contents though). Far from ideal, but for an experiment it
     "should" work, right?

XXX: Giving the same amount of code_gen_buffer to each vCPU is certainly
     a bad idea. A "page-like" allocation policy would be better, e.g.
     give chunks of 1MB to each vCPU as they need it. But for now I'm
     just trying to see whether this can ever work.

XXX: After allowing tb_gen_code to run in parallel (see next patch),
     crashes due to races in TCG code are found very quickly with -smp > 1
     (e.g. "tcg/tcg.c:233: tcg_out_label: Assertion `!l->has_value' failed.")
     Note that with -smp 1 it works fine; with -smp > 1 I can make it
     fail later with "taskset -c 0", so clearly there is a race going on.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h    |  4 +++-
 target/arm/translate.h     |  8 +++----
 tcg/tcg.h                  | 10 +++++++--
 accel/tcg/translate-all.c  | 56 ++++++++++++++++++++++++++++++++++++++++------
 cpus.c                     |  2 ++
 target/arm/cpu.c           |  4 +++-
 target/arm/translate-a64.c |  6 ++---
 target/arm/translate.c     | 16 ++++++-------
 tcg/i386/tcg-target.inc.c  |  2 +-
 tcg/tcg-common.c           |  2 +-
 tcg/tcg.c                  |  6 ++---
 11 files changed, 85 insertions(+), 31 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index df12338..4b4c143 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -47,7 +47,9 @@ void gen_intermediate_code(CPUArchState *env, struct TranslationBlock *tb);
 void restore_state_to_opc(CPUArchState *env, struct TranslationBlock *tb,
                           target_ulong *data);
 
-void cpu_gen_init(void);
+#ifdef CONFIG_SOFTMMU
+void cpu_gen_init(int cpu_index);
+#endif
 bool cpu_restore_state(CPUState *cpu, uintptr_t searched_pc);
 
 void QEMU_NORETURN cpu_loop_exit_noexc(CPUState *cpu);
diff --git a/target/arm/translate.h b/target/arm/translate.h
index 15d383d..8f04c57 100644
--- a/target/arm/translate.h
+++ b/target/arm/translate.h
@@ -76,10 +76,10 @@ typedef struct DisasCompare {
 } DisasCompare;
 
 /* Share the TCG temporaries common between 32 and 64 bit modes.  */
-extern TCGv_env cpu_env;
-extern TCGv_i32 cpu_NF, cpu_ZF, cpu_CF, cpu_VF;
-extern TCGv_i64 cpu_exclusive_addr;
-extern TCGv_i64 cpu_exclusive_val;
+extern TCG_THREAD TCGv_env cpu_env;
+extern TCG_THREAD TCGv_i32 cpu_NF, cpu_ZF, cpu_CF, cpu_VF;
+extern TCG_THREAD TCGv_i64 cpu_exclusive_addr;
+extern TCG_THREAD TCGv_i64 cpu_exclusive_val;
 
 static inline int arm_dc_feature(DisasContext *dc, int feature)
 {
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 3b3359c..01cf21f 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -727,7 +727,13 @@ struct TCGContext {
     target_ulong gen_insn_data[TCG_MAX_INSNS][TARGET_INSN_START_WORDS];
 };
 
-extern TCGContext tcg_ctx;
+#ifdef CONFIG_SOFTMMU
+#define TCG_THREAD __thread
+#else
+#define TCG_THREAD
+#endif
+
+extern TCG_THREAD TCGContext tcg_ctx;
 extern bool parallel_cpus;
 
 static inline void tcg_set_insn_param(int op_idx, int arg, TCGArg v)
@@ -887,7 +893,7 @@ typedef struct TCGOpDef {
 #endif
 } TCGOpDef;
 
-extern TCGOpDef tcg_op_defs[];
+extern TCG_THREAD TCGOpDef tcg_op_defs[];
 extern const size_t tcg_op_defs_max;
 
 typedef struct TCGTargetOpDef {
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 2869c79..125b1a8 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -58,6 +58,7 @@
 #include "qemu/main-loop.h"
 #include "exec/log.h"
 #include "sysemu/cpus.h"
+#include "sysemu/sysemu.h"
 
 /* #define DEBUG_TB_INVALIDATE */
 /* #define DEBUG_TB_FLUSH */
@@ -132,9 +133,12 @@ static int v_l2_levels;
 static void *l1_map[V_L1_MAX_SIZE];
 
 /* code generation context */
-TCGContext tcg_ctx;
+TCG_THREAD TCGContext tcg_ctx;
 TBContext tb_ctx;
 bool parallel_cpus;
+#ifdef CONFIG_SOFTMMU
+static TCGContext *tcg_common_ctx;
+#endif
 
 /* translation block context */
 __thread int have_tb_lock;
@@ -186,10 +190,35 @@ void tb_lock_reset(void)
 
 static TranslationBlock *tb_find_pc(uintptr_t tc_ptr);
 
-void cpu_gen_init(void)
+#ifdef CONFIG_SOFTMMU
+
+/* XXX, see below */
+void arm_translate_init(void);
+
+void cpu_gen_init(int cpu_index)
 {
-    tcg_context_init(&tcg_ctx); 
+    uintptr_t addr;
+    size_t size;
+
+    tcg_context_init(&tcg_ctx);
+    size = tcg_common_ctx->code_gen_buffer_size / smp_cpus;
+    assert(!(tcg_common_ctx->code_gen_buffer_size % smp_cpus));
+    addr = (uintptr_t)tcg_common_ctx->code_gen_buffer;
+    addr += size * cpu_index;
+    tcg_ctx.code_gen_buffer = (void *)addr;
+    tcg_ctx.code_gen_buffer_size = size;
+    tcg_prologue_init(&tcg_ctx);
+    /*
+     * XXX find a proper place to init the TCG globals. This should be trivial
+     * once when the "generic translation loop" work is finished.
+     *
+     * Note that initialising the TCG globals (that are __thread variables
+     * in full-system mode) from a *_cpu_initfn is not a viable option, since
+     * this function is called before the vCPU threads are created.
+     */
+    arm_translate_init();
 }
+#endif
 
 /* Encode VAL as a signed leb128 sequence at P.
    Return P incremented past the encoded value.  */
@@ -561,6 +590,18 @@ static inline size_t size_code_gen_buffer(size_t tb_size)
     if (tb_size > MAX_CODE_GEN_BUFFER_SIZE) {
         tb_size = MAX_CODE_GEN_BUFFER_SIZE;
     }
+#ifdef CONFIG_SOFTMMU
+    {
+        size_t per_cpu = tb_size / smp_cpus;
+
+        if (per_cpu < MIN_CODE_GEN_BUFFER_SIZE) {
+            tb_size = MIN_CODE_GEN_BUFFER_SIZE * smp_cpus;
+            per_cpu = MIN_CODE_GEN_BUFFER_SIZE;
+        }
+        /* make sure tb_size divides smp_cpus evenly */
+        tb_size = per_cpu * smp_cpus;
+    }
+#endif
     return tb_size;
 }
 
@@ -810,20 +851,21 @@ static void tb_htable_init(void)
    size. */
 void tcg_exec_init(unsigned long tb_size)
 {
-    cpu_gen_init();
     page_init();
     tb_htable_init();
     code_gen_alloc(tb_size);
 #if defined(CONFIG_SOFTMMU)
-    /* There's no guest base to take into account, so go ahead and
-       initialize the prologue now.  */
-    tcg_prologue_init(&tcg_ctx);
+    tcg_common_ctx = &tcg_ctx;
 #endif
 }
 
 bool tcg_enabled(void)
 {
+#ifdef CONFIG_SOFTMMU
+    return tcg_common_ctx->code_gen_buffer != NULL;
+#else
     return tcg_ctx.code_gen_buffer != NULL;
+#endif
 }
 
 /*
diff --git a/cpus.c b/cpus.c
index 14bb8d5..1a5437b 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1307,6 +1307,7 @@ static void *qemu_tcg_rr_cpu_thread_fn(void *arg)
     CPUState *cpu = arg;
 
     rcu_register_thread();
+    cpu_gen_init(cpu->cpu_index);
 
     qemu_mutex_lock_iothread();
     qemu_thread_get_self(cpu->thread);
@@ -1454,6 +1455,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
     g_assert(!use_icount);
 
     rcu_register_thread();
+    cpu_gen_init(cpu->cpu_index);
 
     qemu_mutex_lock_iothread();
     qemu_thread_get_self(cpu->thread);
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 28a9141..43948ef 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -469,7 +469,7 @@ static void arm_cpu_initfn(Object *obj)
 {
     CPUState *cs = CPU(obj);
     ARMCPU *cpu = ARM_CPU(obj);
-    static bool inited;
+    static bool inited __attribute__((unused));
 
     cs->env_ptr = &cpu->env;
     cpu->cp_regs = g_hash_table_new_full(g_int_hash, g_int_equal,
@@ -511,10 +511,12 @@ static void arm_cpu_initfn(Object *obj)
 
     if (tcg_enabled()) {
         cpu->psci_version = 2; /* TCG implements PSCI 0.2 */
+#ifndef CONFIG_SOFTMMU
         if (!inited) {
             inited = true;
             arm_translate_init();
         }
+#endif
     }
 }
 
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index e55547d..9450551 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -36,11 +36,11 @@
 
 #include "trace-tcg.h"
 
-static TCGv_i64 cpu_X[32];
-static TCGv_i64 cpu_pc;
+static TCG_THREAD TCGv_i64 cpu_X[32];
+static TCG_THREAD TCGv_i64 cpu_pc;
 
 /* Load/store exclusive handling */
-static TCGv_i64 cpu_exclusive_high;
+static TCG_THREAD TCGv_i64 cpu_exclusive_high;
 static TCGv_i64 cpu_reg(DisasContext *s, int reg);
 
 static const char *regnames[] = {
diff --git a/target/arm/translate.c b/target/arm/translate.c
index 0862f9e..9ad4bbb 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -58,17 +58,17 @@
 #define IS_USER(s) (s->user)
 #endif
 
-TCGv_env cpu_env;
+TCG_THREAD TCGv_env cpu_env;
 /* We reuse the same 64-bit temporaries for efficiency.  */
-static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
-static TCGv_i32 cpu_R[16];
-TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
-TCGv_i64 cpu_exclusive_addr;
-TCGv_i64 cpu_exclusive_val;
+static TCG_THREAD TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
+static TCG_THREAD TCGv_i32 cpu_R[16];
+TCG_THREAD TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
+TCG_THREAD TCGv_i64 cpu_exclusive_addr;
+TCG_THREAD TCGv_i64 cpu_exclusive_val;
 
 /* FIXME:  These should be removed.  */
-static TCGv_i32 cpu_F0s, cpu_F1s;
-static TCGv_i64 cpu_F0d, cpu_F1d;
+static TCG_THREAD TCGv_i32 cpu_F0s, cpu_F1s;
+static TCG_THREAD TCGv_i64 cpu_F0d, cpu_F1d;
 
 #include "exec/gen-icount.h"
 
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 01e3b4e..608264a 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -146,7 +146,7 @@ static bool have_lzcnt;
 # define have_lzcnt 0
 #endif
 
-static tcg_insn_unit *tb_ret_addr;
+static TCG_THREAD tcg_insn_unit *tb_ret_addr;
 
 static void patch_reloc(tcg_insn_unit *code_ptr, int type,
                         intptr_t value, intptr_t addend)
diff --git a/tcg/tcg-common.c b/tcg/tcg-common.c
index 2f139de..7f6edb8 100644
--- a/tcg/tcg-common.c
+++ b/tcg/tcg-common.c
@@ -31,7 +31,7 @@
 uintptr_t tci_tb_ptr;
 #endif
 
-TCGOpDef tcg_op_defs[] = {
+TCG_THREAD TCGOpDef tcg_op_defs[] = {
 #define DEF(s, oargs, iargs, cargs, flags) \
          { #s, oargs, iargs, cargs, iargs + oargs + cargs, flags },
 #include "tcg-opc.h"
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 3559829..326c25a 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -117,8 +117,8 @@ static bool tcg_out_tb_finalize(TCGContext *s);
 
 
 
-static TCGRegSet tcg_target_available_regs[2];
-static TCGRegSet tcg_target_call_clobber_regs;
+static TCG_THREAD TCGRegSet tcg_target_available_regs[2];
+static TCG_THREAD TCGRegSet tcg_target_call_clobber_regs;
 
 #if TCG_TARGET_INSN_UNIT_SIZE == 1
 static __attribute__((unused)) inline void tcg_out8(TCGContext *s, uint8_t v)
@@ -320,7 +320,7 @@ static const TCGHelperInfo all_helpers[] = {
 #include "exec/helper-tcg.h"
 };
 
-static int indirect_reg_alloc_order[ARRAY_SIZE(tcg_target_reg_alloc_order)];
+static TCG_THREAD int indirect_reg_alloc_order[ARRAY_SIZE(tcg_target_reg_alloc_order)];
 static void process_op_defs(TCGContext *s);
 
 void tcg_context_init(TCGContext *s)
-- 
2.7.4


* [Qemu-devel] [RFC 7/7] [XXX] translate-all: generate TCG code without holding tb_lock
  2017-06-29 20:28 [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress) Emilio G. Cota
                   ` (5 preceding siblings ...)
  2017-06-29 20:28 ` [Qemu-devel] [RFC 6/7] [XXX] tcg: make TCGContext thread-local for softmmu Emilio G. Cota
@ 2017-06-29 20:28 ` Emilio G. Cota
  2017-06-30  8:25 ` [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress) Richard Henderson
  7 siblings, 0 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-06-29 20:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Generate new TBs without holding tb_lock, which should increase
scalability when running MTTCG for workloads that generate a lot
of code (e.g. booting linux).

XXX: Doesn't work :-) See the commit log from the previous patch.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cpu-exec.c      |  8 +++++++-
 accel/tcg/translate-all.c | 18 ++++++++++++++++--
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 54ecae2..47f0882 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -351,6 +351,7 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
              * single threaded the locks are NOPs.
              */
             mmap_lock();
+#ifdef CONFIG_USER_ONLY
             tb_lock();
             have_tb_lock = true;
 
@@ -362,7 +363,12 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
                 /* if no translated code available, then translate it now */
                 tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
             }
-
+#else
+            /* tb_gen_code will acquire tb_lock.
+             * Just for the diff: note that have_tb_lock is local to tb_find! */
+            have_tb_lock = true;
+            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+#endif
             mmap_unlock();
         }
 
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 125b1a8..da29bcc 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -878,8 +878,6 @@ static TranslationBlock *tb_alloc(target_ulong pc)
 {
     TranslationBlock *tb;
 
-    assert_tb_locked();
-
     tb = tcg_tb_alloc(&tcg_ctx);
     if (unlikely(tb == NULL)) {
         return NULL;
@@ -1314,7 +1312,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 #ifdef CONFIG_PROFILER
     int64_t ti;
 #endif
+#ifdef CONFIG_USER_ONLY
     assert_memory_lock();
+#endif
 
     phys_pc = get_page_addr_code(env, pc);
     if (use_icount && !(cflags & CF_IGNORE_ICOUNT)) {
@@ -1429,6 +1429,20 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     if ((pc & TARGET_PAGE_MASK) != virt_page2) {
         phys_page2 = get_page_addr_code(env, virt_page2);
     }
+    if (!have_tb_lock) {
+        TranslationBlock *t;
+
+        tb_lock();
+        /*
+         * There's a chance that our desired tb has been translated while
+         * we were translating it.
+         */
+        t = tb_htable_lookup(cpu, pc, cs_base, flags);
+        if (unlikely(t)) {
+            /* this is very unlikely so just waste the TB space we just used */
+            return t;
+        }
+    }
     /* As long as consistency of the TB stuff is provided by tb_lock in user
      * mode and is implicit in single-threaded softmmu emulation, no explicit
      * memory barrier is required before tb_link_page() makes the TB visible
-- 
2.7.4


* Re: [Qemu-devel] [RFC 1/7] exec-all: fix typos in TranslationBlock's documentation
  2017-06-29 20:28 ` [Qemu-devel] [RFC 1/7] exec-all: fix typos in TranslationBlock's documentation Emilio G. Cota
@ 2017-06-30  6:17   ` Richard Henderson
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2017-06-30  6:17 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 06/29/2017 01:28 PM, Emilio G. Cota wrote:
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   include/exec/exec-all.h | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~


* Re: [Qemu-devel] [RFC 2/7] translate-all: add out_size field to TranslationBlock
  2017-06-29 20:28 ` [Qemu-devel] [RFC 2/7] translate-all: add out_size field to TranslationBlock Emilio G. Cota
@ 2017-06-30  6:31   ` Richard Henderson
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2017-06-30  6:31 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 06/29/2017 01:28 PM, Emilio G. Cota wrote:
> Note that due to the cacheline padding we are using, for
> hosts with 64-byte cache lines this will not waste any
> additional memory. Using a s16 would be ideal, since that
> would plug an existing hole in the struct, but I see no
> guarantee that a TB won't overflow it.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   include/exec/exec-all.h   |  1 +
>   accel/tcg/translate-all.c | 16 ++++++++--------
>   2 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index 35a75f1..df12338 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -363,6 +363,7 @@ struct TranslationBlock {
>        */
>       uintptr_t jmp_list_next[2];
>       uintptr_t jmp_list_first;
> +    int32_t out_size; /* size of host code for this block */

unsigned probably better.

I do wonder about putting it in the hole after invalid.
Which itself could be shrunk to bool.

I don't believe there's much chance of an overflow of uint16_t.  The limit of 
OPC_BUF_SIZE = 640 fairly well limits the practical size.  And, honestly, it 
doesn't matter if we saturate to 0xffff, so long as you retain the full-size 
gen_code_size local variable.


r~


* Re: [Qemu-devel] [RFC 3/7] translate-all: use a binary search tree to track TBs in TBContext
  2017-06-29 20:28 ` [Qemu-devel] [RFC 3/7] translate-all: use a binary search tree to track TBs in TBContext Emilio G. Cota
@ 2017-06-30  7:41   ` Richard Henderson
  2017-06-30  7:49     ` Richard Henderson
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2017-06-30  7:41 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 06/29/2017 01:28 PM, Emilio G. Cota wrote:
> +/* @key is already in the tree so it's safe to use container_of on it */
> +static gint tc_ptr_cmp(gconstpointer candidate, gconstpointer key)
> +{
> +    uintptr_t a = *(uintptr_t *)candidate;
> +    const TranslationBlock *tb = container_of(key, TranslationBlock, tc_ptr);

This I'm not keen on.  It'd be one thing if it was our own datastructure, but I 
see nothing in the GTree documentation that says that the comparison must 
always be done this way.


r~


* Re: [Qemu-devel] [RFC 3/7] translate-all: use a binary search tree to track TBs in TBContext
  2017-06-30  7:41   ` Richard Henderson
@ 2017-06-30  7:49     ` Richard Henderson
  2017-06-30 14:55       ` Emilio G. Cota
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2017-06-30  7:49 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 06/30/2017 12:41 AM, Richard Henderson wrote:
> On 06/29/2017 01:28 PM, Emilio G. Cota wrote:
>> +/* @key is already in the tree so it's safe to use container_of on it */
>> +static gint tc_ptr_cmp(gconstpointer candidate, gconstpointer key)
>> +{
>> +    uintptr_t a = *(uintptr_t *)candidate;
>> +    const TranslationBlock *tb = container_of(key, TranslationBlock, tc_ptr);
> 
> This I'm not keen on.  It'd be one thing if it was our own datastructure, but I 
> see nothing in the GTree documentation that says that the comparison must 
> always be done this way.

What if we bundle tc_ptr + tc_size into a struct and only reference that?  We'd 
embed that struct into the TB.  In tb_find_pc, create that struct on the stack, 
setting tc_size = 0.


r~


* Re: [Qemu-devel] [RFC 4/7] translate-all: report correct avg host TB size
  2017-06-29 20:28 ` [Qemu-devel] [RFC 4/7] translate-all: report correct avg host TB size Emilio G. Cota
@ 2017-06-30  7:50   ` Richard Henderson
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2017-06-30  7:50 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 06/29/2017 01:28 PM, Emilio G. Cota wrote:
> +    int *size = data;

I'd really prefer you use size_t for all of these.


r~


* Re: [Qemu-devel] [RFC 5/7] tcg: take tb_ctx out of TCGContext
  2017-06-29 20:28 ` [Qemu-devel] [RFC 5/7] tcg: take tb_ctx out of TCGContext Emilio G. Cota
@ 2017-06-30  7:58   ` Richard Henderson
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2017-06-30  7:58 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 06/29/2017 01:28 PM, Emilio G. Cota wrote:
> Before TCGContext is made thread-local.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   include/exec/tb-context.h |  2 ++
>   tcg/tcg.h                 |  2 --
>   accel/tcg/cpu-exec.c      |  2 +-
>   accel/tcg/translate-all.c | 57 ++++++++++++++++++++++++-----------------------
>   linux-user/main.c         |  6 ++---
>   5 files changed, 35 insertions(+), 34 deletions(-)

There's a bit more of TCGContext that shouldn't be thread-local.  But I guess 
this is the one that absolutely must be shared.

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~


* Re: [Qemu-devel] [RFC 6/7] [XXX] tcg: make TCGContext thread-local for softmmu
  2017-06-29 20:28 ` [Qemu-devel] [RFC 6/7] [XXX] tcg: make TCGContext thread-local for softmmu Emilio G. Cota
@ 2017-06-30  8:18   ` Richard Henderson
  2017-07-01  1:54     ` Emilio G. Cota
  2017-07-01  1:47   ` [Qemu-devel] [PATCH] fixup: missed some TLS variables Emilio G. Cota
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2017-06-30  8:18 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 06/29/2017 01:28 PM, Emilio G. Cota wrote:
> This will allow us to generate TCG code in parallel.
> 
> User-mode is kept out of this: contention due to concurrent translation
> is more commonly found in full-system mode (e.g. booting a many-core guest).
> 
> XXX: For now, only convert arm/a64, since these are the only guests that
> have proper MTTCG support.

Not true.  TCG_GUEST_DEFAULT_MO suggests we have 4.

> XXX: This is calling prologue_init once per vCPU, i.e. each TCGContext
>       gets a different prologue/epilogue (all of them with the same
>       contents though). Far from ideal, but for an experiment it
>       "should" work, right?

Um... sure.  But surely there's a better way.  Perhaps copying the main 
thread's tcg_ctx?  That'd be after prologue, and target globals are created. 
That would avoid the need to make all the target cpu_* globals be thread-local, 
for instance.

Of course, now that I think of it, this implies that my tcg patches to use 
pointers instead of indicies is not the Way Forward...

>   /* Share the TCG temporaries common between 32 and 64 bit modes.  */
> -extern TCGv_env cpu_env;
> -extern TCGv_i32 cpu_NF, cpu_ZF, cpu_CF, cpu_VF;
> -extern TCGv_i64 cpu_exclusive_addr;
> -extern TCGv_i64 cpu_exclusive_val;
> +extern TCG_THREAD TCGv_env cpu_env;
> +extern TCG_THREAD TCGv_i32 cpu_NF, cpu_ZF, cpu_CF, cpu_VF;
> +extern TCG_THREAD TCGv_i64 cpu_exclusive_addr;
> +extern TCG_THREAD TCGv_i64 cpu_exclusive_val;

It would be Really Good if we could avoid this in the end.
I see that it's the quickest way to test for now though.

> @@ -887,7 +893,7 @@ typedef struct TCGOpDef {
>   #endif
>   } TCGOpDef;
>   
> -extern TCGOpDef tcg_op_defs[];
> +extern TCG_THREAD TCGOpDef tcg_op_defs[];

Why?  It's constant after startup.


r~


* Re: [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress)
  2017-06-29 20:28 [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress) Emilio G. Cota
                   ` (6 preceding siblings ...)
  2017-06-29 20:28 ` [Qemu-devel] [RFC 7/7] [XXX] translate-all: generate TCG code without holding tb_lock Emilio G. Cota
@ 2017-06-30  8:25 ` Richard Henderson
  2017-07-01  2:00   ` Emilio G. Cota
  7 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2017-06-30  8:25 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 06/29/2017 01:28 PM, Emilio G. Cota wrote:
> - Patches 2-3 remove *tbs[] to use a binary search tree instead.
>    This removes the assumption in tb_find_pc that *tbs[] are ordered
>    by tc_ptr, thereby allowing us to generate code regardless of
>    its location on the host (as we do after patch 6).

Have you considered a scheme by which the front end translation and tcg 
optimization are done outside the lock, but final code generation is done 
inside the lock?

It would put at least half of the translation time in the parallel space 
without requiring changes to code_buffer allocation.


r~


* Re: [Qemu-devel] [RFC 3/7] translate-all: use a binary search tree to track TBs in TBContext
  2017-06-30  7:49     ` Richard Henderson
@ 2017-06-30 14:55       ` Emilio G. Cota
  0 siblings, 0 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-06-30 14:55 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Fri, Jun 30, 2017 at 00:49:37 -0700, Richard Henderson wrote:
> On 06/30/2017 12:41 AM, Richard Henderson wrote:
> >On 06/29/2017 01:28 PM, Emilio G. Cota wrote:
> >>+/* @key is already in the tree so it's safe to use container_of on it */
> >>+static gint tc_ptr_cmp(gconstpointer candidate, gconstpointer key)
> >>+{
> >>+    uintptr_t a = *(uintptr_t *)candidate;
> >>+    const TranslationBlock *tb = container_of(key, TranslationBlock, tc_ptr);
> >
> >This I'm not keen on.  It'd be one thing if it was our own datastructure,
> >but I see nothing in the GTree documentation that says that the comparison
> >must always be done this way.

I also checked for this. Couldn't find anything in the docs -- the
only guarantee I could find is the implicit one that, since the g_tree
module was created, the 2nd pointer ("b" above) has consistently been
the in-node one. I don't think they'd ever change this, but yes,
relying on this assumption is a bit risky.

If we prefer using our own we could bring the AVL tree from CCAN:
  http://git.ozlabs.org/?p=ccan;a=tree;f=ccan/avl

> What if we bundle tc_ptr + tc_size into a struct and only reference that?
> We'd embed that struct into the TB.  In tb_find_pc, create that struct on
> the stack, setting tc_size = 0.

This is clean and safe, but we'd add a 4-byte hole for 64-bit-on-64-bit builds.
However, we could bring other fields into the embedded struct to plug
the hole. Also, using an anonymous struct would hide this from the
calling code:

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 4b4c143..07f1f50 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -319,16 +319,18 @@ struct TranslationBlock {
     uint16_t size;      /* size of target code for this block (1 <=
                            size <= TARGET_PAGE_SIZE) */
     uint16_t icount;
-    uint32_t cflags;    /* compile flags */
+
+    struct {
+        void *tc_ptr;    /* pointer to the translated code */
+        int32_t out_size; /* size of host code for this block */
+        uint32_t cflags;    /* compile flags */
 #define CF_COUNT_MASK  0x7fff
 #define CF_LAST_IO     0x8000 /* Last insn may be an IO access.  */
 #define CF_NOCACHE     0x10000 /* To be freed after execution */
 #define CF_USE_ICOUNT  0x20000
 #define CF_IGNORE_ICOUNT 0x40000 /* Do not generate icount code */
+    };
 
-    uint16_t invalid;
-
-    void *tc_ptr;    /* pointer to the translated code */
     uint8_t *tc_search;  /* pointer to search data */
     /* original tb when cflags has CF_NOCACHE */
     struct TranslationBlock *orig_tb;
@@ -365,7 +367,7 @@ struct TranslationBlock {
      */
     uintptr_t jmp_list_next[2];
     uintptr_t jmp_list_first;
-    int32_t out_size; /* size of host code for this block */
+    uint16_t invalid;
 };

That is 122 bytes, with all 6 bytes of padding at the end.
We also move invalid to the 2nd cache line; I'm not sure
that would matter much (I liked having out_size there because
it's on a fairly slow path).

Also I'd rename out_size to tc_size.
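A minimal sketch of the range comparator this bundling would enable,
with the lookup key being a stack-allocated struct whose size is 0, as
you suggested (untested, names hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bundle of translated-code pointer + size, to be
 * embedded in the TB and used as the tree key. */
struct tb_tc {
    void *tc_ptr;      /* pointer to the translated code */
    uint32_t tc_size;  /* size of host code for this block */
};

/*
 * Compare two tb_tc's. A lookup key has tc_size == 0, so a search PC
 * matches the in-tree node whose [tc_ptr, tc_ptr + tc_size) range
 * contains it; either argument may be the key, so we never rely on
 * GTree's (undocumented) argument ordering.
 */
static int tb_tc_cmp(const void *ap, const void *bp)
{
    const struct tb_tc *a = ap;
    const struct tb_tc *b = bp;
    uintptr_t pa = (uintptr_t)a->tc_ptr;
    uintptr_t pb = (uintptr_t)b->tc_ptr;

    if (a->tc_size == 0 && pa >= pb && pa < pb + b->tc_size) {
        return 0;  /* search PC falls inside node b's range */
    }
    if (b->tc_size == 0 && pb >= pa && pb < pa + a->tc_size) {
        return 0;  /* search PC falls inside node a's range */
    }
    if (pa < pb) {
        return -1;
    }
    return pa > pb;
}
```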

		E.


* [Qemu-devel] [PATCH] fixup: missed some TLS variables
  2017-06-29 20:28 ` [Qemu-devel] [RFC 6/7] [XXX] tcg: make TCGContext thread-local for softmmu Emilio G. Cota
  2017-06-30  8:18   ` Richard Henderson
@ 2017-07-01  1:47   ` Emilio G. Cota
  1 sibling, 0 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-07-01  1:47 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

On Thu, Jun 29, 2017 at 16:28:28 -0400, Emilio G. Cota wrote:
> XXX: After allowing tb_gen_code to run in parallel (see next patch),
>      crashes due to races in TCG code are found very quickly with -smp > 1
>      (e.g. "tcg/tcg.c:233: tcg_out_label: Assertion `!l->has_value' failed.")
>      Note that with -smp 1 it works fine; with smp > 1 I can make it
>      fail later with "taskset -c 0", so clearly there is a race going on.

Fixed! I missed a few __thread's -- see below.

I found the missing ones by moving tb_lock around and using
'valgrind --tool=drd' (it's not ThreadSanitizer, but it doesn't
choke on our coroutines and it's reasonably fast).

Preliminary tests show that even with patch 6 we gain almost nothing
when booting 64 cores (on a 64-core host), which is what I expected
(we still have to acquire tb_lock on each translation, even if only
for a few instructions, which is a scalability killer). But I'd
worry about optimisations later; for now I'll focus on cleaning up
the patchset, so my next steps are:

- Apply Richard's feedback so far
- Consolidate TLS variables into TCGContext, so that we'll have
  as few TLS variables as possible.

Cheers,

		E.

--- 8< ----

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/gen-icount.h | 2 +-
 tcg/optimize.c            | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/exec/gen-icount.h b/include/exec/gen-icount.h
index 62d462e..2e2bc7b 100644
--- a/include/exec/gen-icount.h
+++ b/include/exec/gen-icount.h
@@ -6,7 +6,7 @@
 /* Helpers for instruction counting code generation.  */
 
 static int icount_start_insn_idx;
-static TCGLabel *exitreq_label;
+static TCG_THREAD TCGLabel *exitreq_label;
 
 static inline void gen_tb_start(TranslationBlock *tb)
 {
diff --git a/tcg/optimize.c b/tcg/optimize.c
index adfc56c..71af19b 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -40,8 +40,8 @@ struct tcg_temp_info {
     tcg_target_ulong mask;
 };
 
-static struct tcg_temp_info temps[TCG_MAX_TEMPS];
-static TCGTempSet temps_used;
+static TCG_THREAD struct tcg_temp_info temps[TCG_MAX_TEMPS];
+static TCG_THREAD TCGTempSet temps_used;
 
 static inline bool temp_is_const(TCGArg arg)
 {
-- 
2.7.4


* Re: [Qemu-devel] [RFC 6/7] [XXX] tcg: make TCGContext thread-local for softmmu
  2017-06-30  8:18   ` Richard Henderson
@ 2017-07-01  1:54     ` Emilio G. Cota
  0 siblings, 0 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-07-01  1:54 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Fri, Jun 30, 2017 at 01:18:58 -0700, Richard Henderson wrote:
> On 06/29/2017 01:28 PM, Emilio G. Cota wrote:
> >This will allow us to generate TCG code in parallel.
> >
> >User-mode is kept out of this: contention due to concurrent translation
> >is more commonly found in full-system mode (e.g. booting a many-core guest).
> >
> >XXX: For now, only convert arm/a64, since these are the only guests that
> >have proper MTTCG support.
> 
> Not true.  TCG_GUEST_DEFAULT_MO suggests we have 4.

Yes, sorry that was outdated. I'm testing arm/a64 because I can boot
many cores on a64.

> >XXX: This is calling prologue_init once per vCPU, i.e. each TCGContext
> >      gets a different prologue/epilogue (all of them with the same
> >      contents though). Far from ideal, but for an experiment it
> >      "should" work, right?
> 
> Um... sure.  But surely there's a better way.  Perhaps copying the main
> thread's tcg_ctx?  That'd be after prologue, and target globals are created.
> That would avoid the need to make all the target cpu_* globals be
> thread-local, for instance.
> 
> Of course, now that I think of it, this implies that my tcg patches to use
> pointers instead of indices are not the Way Forward...

Didn't consider this, but after looking at the code (sans your patchset)
I don't see why this wouldn't work -- would be great to avoid touching
target code, as you point out below.

> >  /* Share the TCG temporaries common between 32 and 64 bit modes.  */
> >-extern TCGv_env cpu_env;
> >-extern TCGv_i32 cpu_NF, cpu_ZF, cpu_CF, cpu_VF;
> >-extern TCGv_i64 cpu_exclusive_addr;
> >-extern TCGv_i64 cpu_exclusive_val;
> >+extern TCG_THREAD TCGv_env cpu_env;
> >+extern TCG_THREAD TCGv_i32 cpu_NF, cpu_ZF, cpu_CF, cpu_VF;
> >+extern TCG_THREAD TCGv_i64 cpu_exclusive_addr;
> >+extern TCG_THREAD TCGv_i64 cpu_exclusive_val;
> 
> It would be Really Good if we could avoid this in the end.
> I see that it's the quickest way to test for now though.
> 
> >@@ -887,7 +893,7 @@ typedef struct TCGOpDef {
> >  #endif
> >  } TCGOpDef;
> >-extern TCGOpDef tcg_op_defs[];
> >+extern TCG_THREAD TCGOpDef tcg_op_defs[];
> 
> Why?  It's constant after startup.

I was just going for a mass conversion first to see whether this
could ever work. Will refine it.

Thanks,

		E.


* Re: [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress)
  2017-06-30  8:25 ` [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress) Richard Henderson
@ 2017-07-01  2:00   ` Emilio G. Cota
  0 siblings, 0 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-07-01  2:00 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Fri, Jun 30, 2017 at 01:25:54 -0700, Richard Henderson wrote:
> On 06/29/2017 01:28 PM, Emilio G. Cota wrote:
> >- Patches 2-3 remove *tbs[] to use a binary search tree instead.
> >   This removes the assumption in tb_find_pc that *tbs[] are ordered
> >   by tc_ptr, thereby allowing us to generate code regardless of
> >   its location on the host (as we do after patch 6).
> 
> Have you considered a scheme by which the front end translation and tcg
> optimization are done outside the lock, but final code generation is done
> inside the lock?
> 
> It would put at least half of the translation time in the parallel space
> without requiring changes to code_buffer allocation.

I don't think that would save much, because the performance issue comes
from the fact that we have to grab the lock, regardless of how long we hold
it. So even if we did nothing inside the lock, scalability when
translating a lot of code (e.g. booting) would still be quite bad.

So we either get rid of the lock altogether, or use a more scalable lock.
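For the "get rid of the lock" route, the direction I'm imagining is to
carve code_gen_buffer into per-thread regions, so that allocating host
code space needs no lock at all in the common case. A rough sketch
(names, split policy and full-region handling purely illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Per-thread slice of the shared code_gen_buffer. */
struct code_region {
    uint8_t *base;
    size_t size;
    size_t used;
};

/* Split a parent buffer evenly among n translator threads. */
static void region_init(struct code_region *r, uint8_t *buf,
                        size_t total, unsigned idx, unsigned n)
{
    size_t per_thread = total / n;

    r->base = buf + (size_t)idx * per_thread;
    r->size = per_thread;
    r->used = 0;
}

/* Lock-free within a thread: bump-pointer allocation of code space. */
static void *region_alloc(struct code_region *r, size_t len)
{
    void *ret;

    if (r->used + len > r->size) {
        /* Region full: only now would we need synchronization,
         * e.g. to trigger a global tb_flush. */
        return NULL;
    }
    ret = r->base + r->used;
    r->used += len;
    return ret;
}
```

The cost is some internal fragmentation (a thread can exhaust its
region while others have space left), but translation itself would
stop serializing on tb_lock.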

		E.


end of thread, other threads:[~2017-07-01  2:00 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-29 20:28 [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress) Emilio G. Cota
2017-06-29 20:28 ` [Qemu-devel] [RFC 1/7] exec-all: fix typos in TranslationBlock's documentation Emilio G. Cota
2017-06-30  6:17   ` Richard Henderson
2017-06-29 20:28 ` [Qemu-devel] [RFC 2/7] translate-all: add out_size field to TranslationBlock Emilio G. Cota
2017-06-30  6:31   ` Richard Henderson
2017-06-29 20:28 ` [Qemu-devel] [RFC 3/7] translate-all: use a binary search tree to track TBs in TBContext Emilio G. Cota
2017-06-30  7:41   ` Richard Henderson
2017-06-30  7:49     ` Richard Henderson
2017-06-30 14:55       ` Emilio G. Cota
2017-06-29 20:28 ` [Qemu-devel] [RFC 4/7] translate-all: report correct avg host TB size Emilio G. Cota
2017-06-30  7:50   ` Richard Henderson
2017-06-29 20:28 ` [Qemu-devel] [RFC 5/7] tcg: take tb_ctx out of TCGContext Emilio G. Cota
2017-06-30  7:58   ` Richard Henderson
2017-06-29 20:28 ` [Qemu-devel] [RFC 6/7] [XXX] tcg: make TCGContext thread-local for softmmu Emilio G. Cota
2017-06-30  8:18   ` Richard Henderson
2017-07-01  1:54     ` Emilio G. Cota
2017-07-01  1:47   ` [Qemu-devel] [PATCH] fixup: missed some TLS variables Emilio G. Cota
2017-06-29 20:28 ` [Qemu-devel] [RFC 7/7] [XXX] translate-all: generate TCG code without holding tb_lock Emilio G. Cota
2017-06-30  8:25 ` [Qemu-devel] [RFC 0/7] tcg: parallel code generation (Work in Progress) Richard Henderson
2017-07-01  2:00   ` Emilio G. Cota
