* [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

v1:
  https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg02059.html

Thanks all for your comments on v1.

This v2 patchset applies on top of stefanha's tracing tree (9212a18e371):
  https://github.com/stefanha/qemu/tree/tracing
I based the series on that tree because its changes (per-vcpu TCG
tracing) would otherwise conflict with many of the patches in this
series.

To ease review/testing, you can pull this series from:
  https://github.com/cota/qemu/tree/multi-tcg-v2

Note: patches 1 and 2 are already on master, but not yet on stefanha's tree.
So I'm leaving them here.

Note: I cannot even compile-test the _WIN32 bits; help appreciated! See
patches 40-41.

Changes from v1:
- Added R-b tags
- Added comments to the commit logs about the use of atomic_set/atomic_read.
- Renamed have_tb_lock to acquired_tb_lock in tb_find
- Merged tb->invalid into tb->cflags
  - Cleaned up the checking of the tb->invalid field
- Consolidated TB lookups into a common tb_lookup__cpu_state function
  - Removed addr argument from lookup_tb_ptr
- Defined CF_PARALLEL, and used it for hashing. Incorporated Richard's
  feedback on the previous patch, including:
  - Removed use of parallel_cpus from target/*
  - Removed use of parallel_cpus from tcg/*
  - Moved down the exclusive region in cpu_exec_step_atomic
    - Brought cpu_exec_step into cpu_exec_step_atomic
- Defined and used DEBUG_*_GATE in translate-all
  - Introduced TB_PAGE_ADDR_FMT
- Defined struct tb_tc to bring together tb->tc_{ptr,search,size}
  - Used the struct for g_tree comparisons
  - The struct now has a 4-byte hole, but given the added
    tb->trace_vcpu_dstate field (a u32) we can probably just live
    with it.
- Renamed tb_free to tb_remove
- Used size_t everywhere when counting TBs and code size
- Moved tci_regs to tcg_qemu_tb_exec's stack
- Defined tcg_init_ctx and made tcg_ctx a pointer
- Switched to dynamic allocation of TCG optimizer globals
  - Folded them into TCGContext
- Introduced an array of *tcg_ctx's (instead of a list) to keep track
  of TCGContexts.
- Wrapped a macro with do..while(0) in the TCGProf patch to please checkpatch
- Moved qemu_real_host_page_size/mask to osdep
  - Introduced qemu_mprotect_rwx/none in osdep
    - Used these helpers instead of local inlines in translate-all.c
- TCG regions:
  - tcg_region_init takes a desired number of regions, not a desired
    region size.
      - TCG region sizes are a multiple of the host's page size
  - Added a guard page at the end of each region
    - Stopped allocating a guard page when allocating code_gen_buffer
  - Switched tcg_region_alloc to positive logic (return true on error)
  - Documented non-trivial functions (N.B. some documentation is added
    in the region patch, but quite a bit more in the "multiple TCG
    contexts" patch)
  - Simplified initialization: child TCG threads just have to call
    tcg_register_thread(); all other initialization is done by the
    parent thread (see the sketch after this changelog).
  - Changed the place at which we call tcg_region_init in softmmu,
    so that we can check whether mttcg is enabled when deciding
    how many regions to have.
    - Use 1 region when !mttcg.
- Dropped the "do not hold tb_lock" patch for now; the patchset is
  already too long, and to do a good job there takes more than just
  one patch. I have already started working on that though, based
  on the feedback from v1.
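
For illustration, here is a minimal sketch of the resulting
initialization flow. The function names are the ones used in this
series, but the exact signatures and the vCPU thread wrapper are
assumptions, not code from the patches:

    /* Parent thread: run once, before any child TCG thread is created.
     * (Sketch; exact signatures are assumptions.) */
    static void parent_tcg_init(size_t nr_regions)
    {
        tcg_context_init(&tcg_init_ctx); /* tcg_ctx points here initially */
        tcg_region_init(nr_regions);     /* split code_gen_buffer into regions */
    }

    /* Child (vCPU) thread -- hypothetical wrapper */
    static void *tcg_cpu_thread_fn(void *arg)
    {
        tcg_register_thread(); /* the only per-thread TCG setup required */
        /* ... translation/execution loop ... */
        return NULL;
    }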

Thanks,

		Emilio


* [Qemu-devel] [PATCH v2 01/45] vl: fix breakage of -tb-size
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

Commit e7b161d573 ("vl: add tcg_enabled() for tcg related code") adds
a check to exit the program when !tcg_enabled() while parsing the -tb-size
flag.

It turns out that when the -tb-size flag is evaluated, tcg_enabled() can
only return 0, since TCG is enabled (or not) much later by
configure_accelerator().

Fix it by unconditionally exiting if the flag is passed to a QEMU binary
built with !CONFIG_TCG.
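
For context, this is roughly the ordering in main() that makes the
original check dead code (simplified sketch; parse_options is a
stand-in for vl.c's option-parsing loop, not a real function name):

    int main(int argc, char **argv)
    {
        /* ... */
        parse_options(argc, argv);      /* -tb-size handled here, but
                                           tcg_enabled() still returns 0 */
        /* ... */
        configure_accelerator(machine); /* TCG enabled (or not) only here */
        /* ... */
    }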

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 vl.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/vl.c b/vl.c
index d17c863..9ece570 100644
--- a/vl.c
+++ b/vl.c
@@ -3933,10 +3933,10 @@ int main(int argc, char **argv, char **envp)
                 configure_rtc(opts);
                 break;
             case QEMU_OPTION_tb_size:
-                if (!tcg_enabled()) {
-                    error_report("TCG is disabled");
-                    exit(1);
-                }
+#ifndef CONFIG_TCG
+                error_report("TCG is disabled");
+                exit(1);
+#endif
                 if (qemu_strtoul(optarg, NULL, 0, &tcg_tb_size) < 0) {
                     error_report("Invalid argument to -tb-size");
                     exit(1);
-- 
2.7.4


* [Qemu-devel] [PATCH v2 02/45] translate-all: remove redundant !tcg_enabled check in dump_exec_info
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

This check is redundant because it is already performed by the only
caller of dump_exec_info -- the caller was updated by b7da97eef
("monitor: Check whether TCG is enabled before running the "info jit"
code").

Checking twice wouldn't necessarily be too bad, but here the early
return also leaves tb_lock held. We could either do the check before
tb_lock is acquired, or just get rid of the check. Given that it is
redundant, I am going for the latter option.
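
To make the lock leak concrete, this is the shape of the function
before this patch (sketch; the full context is in the diff below):

    void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
    {
        tb_lock();

        if (!tcg_enabled()) {
            cpu_fprintf(f, "TCG not enabled\n");
            return;         /* bug: tb_lock is never released */
        }
        /* ... gather and print statistics ... */
        tb_unlock();
    }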

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index e09bd43..090ebad 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1858,11 +1858,6 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
 
     tb_lock();
 
-    if (!tcg_enabled()) {
-        cpu_fprintf(f, "TCG not enabled\n");
-        return;
-    }
-
     target_code_size = 0;
     max_target_code_size = 0;
     cross_page = 0;
-- 
2.7.4


* [Qemu-devel] [PATCH v2 03/45] cputlb: bring back tlb_flush_count under !TLB_DEBUG
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

Commit f0aff0f124 ("cputlb: add assert_cpu_is_self checks") buried
the increment of tlb_flush_count under TLB_DEBUG. This results in
"info jit" always (mis)reporting 0 TLB flushes when !TLB_DEBUG.

Besides, under MTTCG tlb_flush_count is updated by several threads,
so in order not to lose counts we'd have to either use atomic ops
or distribute the counter; the latter is more scalable.

This patch does the latter by embedding tlb_flush_count in CPUArchState.
The global count is then easily obtained by iterating over the CPU list.

Note that this change also requires updating the accessors to
tlb_flush_count to use atomic_read/set whenever there may be conflicting
accesses (as defined in C11) to it.
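
The resulting single-writer/many-readers scheme, annotated (the code
itself is in the diff below):

    /* Writer (tlb_flush_nocheck): only the owning vCPU thread updates
     * its counter, so reading the old value without atomics is fine;
     * atomic_set keeps concurrent readers from observing a torn write. */
    atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);

    /* Readers (tlb_flush_count): sum the distributed counters, pairing
     * atomic_read with the writers' atomic_set. */
    CPU_FOREACH(cpu) {
        CPUArchState *env = cpu->env_ptr;
        count += atomic_read(&env->tlb_flush_count);
    }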

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/cpu-defs.h   |  1 +
 include/exec/cputlb.h     |  3 +--
 accel/tcg/cputlb.c        | 17 ++++++++++++++---
 accel/tcg/translate-all.c |  2 +-
 4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index bc8e7f8..e43ff83 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -137,6 +137,7 @@ typedef struct CPUIOTLBEntry {
     CPUTLBEntry tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE];               \
     CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE];                    \
     CPUIOTLBEntry iotlb_v[NB_MMU_MODES][CPU_VTLB_SIZE];                 \
+    size_t tlb_flush_count;                                             \
     target_ulong tlb_flush_addr;                                        \
     target_ulong tlb_flush_mask;                                        \
     target_ulong vtlb_index;                                            \
diff --git a/include/exec/cputlb.h b/include/exec/cputlb.h
index 3f94178..c91db21 100644
--- a/include/exec/cputlb.h
+++ b/include/exec/cputlb.h
@@ -23,7 +23,6 @@
 /* cputlb.c */
 void tlb_protect_code(ram_addr_t ram_addr);
 void tlb_unprotect_code(ram_addr_t ram_addr);
-extern int tlb_flush_count;
-
+size_t tlb_flush_count(void);
 #endif
 #endif
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 85635ae..9377110 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -92,8 +92,18 @@ static void flush_all_helper(CPUState *src, run_on_cpu_func fn,
     }
 }
 
-/* statistics */
-int tlb_flush_count;
+size_t tlb_flush_count(void)
+{
+    CPUState *cpu;
+    size_t count = 0;
+
+    CPU_FOREACH(cpu) {
+        CPUArchState *env = cpu->env_ptr;
+
+        count += atomic_read(&env->tlb_flush_count);
+    }
+    return count;
+}
 
 /* This is OK because CPU architectures generally permit an
  * implementation to drop entries from the TLB at any time, so
@@ -112,7 +122,8 @@ static void tlb_flush_nocheck(CPUState *cpu)
     }
 
     assert_cpu_is_self(cpu);
-    tlb_debug("(count: %d)\n", tlb_flush_count++);
+    atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);
+    tlb_debug("(count: %zu)\n", tlb_flush_count());
 
     tb_lock();
 
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 090ebad..3ee69e5 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1916,7 +1916,7 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
             atomic_read(&tcg_ctx.tb_ctx.tb_flush_count));
     cpu_fprintf(f, "TB invalidate count %d\n",
             tcg_ctx.tb_ctx.tb_phys_invalidate_count);
-    cpu_fprintf(f, "TLB flush count     %d\n", tlb_flush_count);
+    cpu_fprintf(f, "TLB flush count     %zu\n", tlb_flush_count());
     tcg_dump_info(f, cpu_fprintf);
 
     tb_unlock();
-- 
2.7.4


* [Qemu-devel] [PATCH v2 04/45] tcg: fix corruption of code_time profiling counter upon tb_flush
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

Whenever there is an overflow in code_gen_buffer (e.g. we run out
of space in it and have to flush it), the code_time profiling counter
ends up with an invalid value: code_time -= profile_getclock() has
executed, but the matching code_time += profile_getclock() is skipped
because of the goto.

Fix it by using the ti variable, so that we only update code_time
when there is no overflow. Note that in case of an overflow we fail
to account for the elapsed code generation time, but overflows are
quite rare so we can probably live with it.
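
In other words, the buggy pattern and the fix look as follows (sketch
of tb_gen_code's control flow around the diff below):

    /* before: */
    tcg_ctx.code_time -= profile_getclock();   /* counter goes negative */
    gen_code_size = tcg_gen_code(&tcg_ctx, tb);
    if (unlikely(gen_code_size < 0)) {
        goto buffer_overflow;                  /* skips the += below,
                                                  leaving code_time negative */
    }
    tcg_ctx.code_time += profile_getclock();

    /* after: record the start time in the local ti instead, so that
     * code_time is only touched on the success path */
    ti = profile_getclock();
    gen_code_size = tcg_gen_code(&tcg_ctx, tb);
    if (unlikely(gen_code_size < 0)) {
        goto buffer_overflow;                  /* code_time simply not updated */
    }
    tcg_ctx.code_time += profile_getclock() - ti;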

"info jit" before/after, roughly at the same time during debian-arm bootup:

- before:
Statistics:
TB flush count      1
TB invalidate count 4665
TLB flush count     998
JIT cycles          -615191529184601 (-256329.804 s at 2.4 GHz)
translated TBs      302310 (aborted=0 0.0%)
avg ops/TB          48.4 max=438
deleted ops/TB      8.54
avg temps/TB        32.31 max=38
avg host code/TB    361.5
avg search data/TB  24.5
cycles/op           -42014693.0
cycles/in byte      -121444900.2
cycles/out byte     -5629031.1
cycles/search byte     -83114481.0
  gen_interm time   -0.0%
  gen_code time     100.0%
optim./code time    -0.0%
liveness/code time  -0.0%
cpu_restore count   6236
  avg cycles        110.4

- after:
Statistics:
TB flush count      1
TB invalidate count 4665
TLB flush count     1010
JIT cycles          1996899624 (0.832 s at 2.4 GHz)
translated TBs      297961 (aborted=0 0.0%)
avg ops/TB          48.5 max=438
deleted ops/TB      8.56
avg temps/TB        32.31 max=38
avg host code/TB    361.8
avg search data/TB  24.5
cycles/op           138.2
cycles/in byte      398.4
cycles/out byte     18.5
cycles/search byte     273.1
  gen_interm time   14.0%
  gen_code time     86.0%
optim./code time    19.4%
liveness/code time  10.3%
cpu_restore count   6372
  avg cycles        111.0

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 3ee69e5..63f8538 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1300,7 +1300,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 #ifdef CONFIG_PROFILER
     tcg_ctx.tb_count++;
     tcg_ctx.interm_time += profile_getclock() - ti;
-    tcg_ctx.code_time -= profile_getclock();
+    ti = profile_getclock();
 #endif
 
     /* ??? Overflow could be handled better here.  In particular, we
@@ -1318,7 +1318,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     }
 
 #ifdef CONFIG_PROFILER
-    tcg_ctx.code_time += profile_getclock();
+    tcg_ctx.code_time += profile_getclock() - ti;
     tcg_ctx.code_in_len += tb->size;
     tcg_ctx.code_out_len += gen_code_size;
     tcg_ctx.search_out_len += search_size;
-- 
2.7.4


* [Qemu-devel] [PATCH v2 05/45] exec-all: fix typos in TranslationBlock's documentation
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 887d7b3..28e3a24 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -344,7 +344,7 @@ struct TranslationBlock {
     /* The following data are used to directly call another TB from
      * the code of this one. This can be done either by emitting direct or
      * indirect native jump instructions. These jumps are reset so that the TB
-     * just continue its execution. The TB can be linked to another one by
+     * just continues its execution. The TB can be linked to another one by
      * setting one of the jump targets (or patching the jump instruction). Only
      * two of such jumps are supported.
      */
@@ -355,7 +355,7 @@ struct TranslationBlock {
 #else
     uintptr_t jmp_target_addr[2]; /* target address for indirect jump */
 #endif
-    /* Each TB has an assosiated circular list of TBs jumping to this one.
+    /* Each TB has an associated circular list of TBs jumping to this one.
      * jmp_list_first points to the first TB jumping to this one.
      * jmp_list_next is used to point to the next TB in a list.
      * Since each TB can have two jumps, it can participate in two lists.
-- 
2.7.4


* [Qemu-devel] [PATCH v2 06/45] translate-all: make have_tb_lock static
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

It is only used by this object file, and is not exported to any other.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 63f8538..a124181 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -139,7 +139,7 @@ TCGContext tcg_ctx;
 bool parallel_cpus;
 
 /* translation block context */
-__thread int have_tb_lock;
+static __thread int have_tb_lock;
 
 static void page_table_config_init(void)
 {
-- 
2.7.4


* [Qemu-devel] [PATCH v2 07/45] cpu-exec: rename have_tb_lock to acquired_tb_lock in tb_find
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

Reusing the have_tb_lock name, which is also defined in translate-all.c,
makes code review unnecessarily hard.

Avoid potential confusion by renaming the local have_tb_lock variable
to acquired_tb_lock.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cpu-exec.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index d84b01d..c4c289b 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -337,7 +337,7 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
     TranslationBlock *tb;
     target_ulong cs_base, pc;
     uint32_t flags;
-    bool have_tb_lock = false;
+    bool acquired_tb_lock = false;
 
     /* we record a subset of the CPU state. It will
        always be the same before a given translated block
@@ -356,7 +356,7 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
              */
             mmap_lock();
             tb_lock();
-            have_tb_lock = true;
+            acquired_tb_lock = true;
 
             /* There's a chance that our desired tb has been translated while
              * taking the locks so we check again inside the lock.
@@ -384,15 +384,15 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
 #endif
     /* See if we can patch the calling TB. */
     if (last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
-        if (!have_tb_lock) {
+        if (!acquired_tb_lock) {
             tb_lock();
-            have_tb_lock = true;
+            acquired_tb_lock = true;
         }
         if (!tb->invalid) {
             tb_add_jump(last_tb, tb_exit, tb);
         }
     }
-    if (have_tb_lock) {
+    if (acquired_tb_lock) {
         tb_unlock();
     }
     return tb;
-- 
2.7.4


* [Qemu-devel] [PATCH v2 08/45] tcg/i386: constify tcg_target_callee_save_regs
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/i386/tcg-target.inc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 01e3b4e..06df01a 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -2514,7 +2514,7 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
     return NULL;
 }
 
-static int tcg_target_callee_save_regs[] = {
+static const int tcg_target_callee_save_regs[] = {
 #if TCG_TARGET_REG_BITS == 64
     TCG_REG_RBP,
     TCG_REG_RBX,
-- 
2.7.4


* [Qemu-devel] [PATCH v2 09/45] tcg/mips: constify tcg_target_callee_save_regs
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/mips/tcg-target.inc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcg/mips/tcg-target.inc.c b/tcg/mips/tcg-target.inc.c
index 85756b8..56db228 100644
--- a/tcg/mips/tcg-target.inc.c
+++ b/tcg/mips/tcg-target.inc.c
@@ -2323,7 +2323,7 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
     return NULL;
 }
 
-static int tcg_target_callee_save_regs[] = {
+static const int tcg_target_callee_save_regs[] = {
     TCG_REG_S0,       /* used for the global env (TCG_AREG0) */
     TCG_REG_S1,
     TCG_REG_S2,
-- 
2.7.4


* [Qemu-devel] [PATCH v2 10/45] translate-all: guarantee that tb_hash only holds valid TBs
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

This gets rid of the need to check the tb->invalid bit during lookups.

After this change we do not need atomics to operate on tb->invalid: setting
and checking its value is serialised with tb_lock.
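
The ordering that makes this work, all under tb_lock, is (sketch of
tb_phys_invalidate, matching the diff below):

    /* 1. unpublish: once removed from tb_hash, lookups can no longer
     *    find this TB */
    qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
    /* 2. only then mark it invalid; both this store and the remaining
     *    readers (tb_find's patching path) run under tb_lock, so no
     *    atomics are needed */
    tb->invalid = true;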

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cpu-exec.c      | 3 +--
 accel/tcg/translate-all.c | 8 ++++++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index c4c289b..9b5ce13 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -292,8 +292,7 @@ static bool tb_cmp(const void *p, const void *d)
         tb->page_addr[0] == desc->phys_page1 &&
         tb->cs_base == desc->cs_base &&
         tb->flags == desc->flags &&
-        tb->trace_vcpu_dstate == desc->trace_vcpu_dstate &&
-        !atomic_read(&tb->invalid)) {
+        tb->trace_vcpu_dstate == desc->trace_vcpu_dstate) {
         /* check next page if needed */
         if (tb->page_addr[1] == -1) {
             return true;
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index a124181..6d4c05f 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1073,13 +1073,17 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 
     assert_tb_locked();
 
-    atomic_set(&tb->invalid, true);
-
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
     h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->trace_vcpu_dstate);
     qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
 
+    /*
+     * Mark the TB as invalid *after* it's been removed from tb_hash, which
+     * eliminates the need to check this bit on lookups.
+     */
+    tb->invalid = true;
+
     /* remove the TB from the page list */
     if (tb->page_addr[0] != page_addr) {
         p = page_find(tb->page_addr[0] >> TARGET_PAGE_BITS);
-- 
2.7.4


* [Qemu-devel] [PATCH v2 11/45] exec-all: bring tb->invalid into tb->cflags
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

This gets rid of a hole in struct TranslationBlock.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h   | 3 +--
 accel/tcg/cpu-exec.c      | 2 +-
 accel/tcg/translate-all.c | 3 +--
 3 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 28e3a24..78a1714 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -326,12 +326,11 @@ struct TranslationBlock {
 #define CF_NOCACHE     0x10000 /* To be freed after execution */
 #define CF_USE_ICOUNT  0x20000
 #define CF_IGNORE_ICOUNT 0x40000 /* Do not generate icount code */
+#define CF_INVALID     0x80000 /* Protected by tb_lock */
 
     /* Per-vCPU dynamic tracing state used to generate this TB */
     uint32_t trace_vcpu_dstate;
 
-    uint16_t invalid;
-
     void *tc_ptr;    /* pointer to the translated code */
     uint8_t *tc_search;  /* pointer to search data */
     /* original tb when cflags has CF_NOCACHE */
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 9b5ce13..34841cd 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -387,7 +387,7 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
             tb_lock();
             acquired_tb_lock = true;
         }
-        if (!tb->invalid) {
+        if (!(tb->cflags & CF_INVALID)) {
             tb_add_jump(last_tb, tb_exit, tb);
         }
     }
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 6d4c05f..53fbb06 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1082,7 +1082,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
      * Mark the TB as invalid *after* it's been removed from tb_hash, which
      * eliminates the need to check this bit on lookups.
      */
-    tb->invalid = true;
+    tb->cflags |= CF_INVALID;
 
     /* remove the TB from the page list */
     if (tb->page_addr[0] != page_addr) {
@@ -1273,7 +1273,6 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tb->flags = flags;
     tb->cflags = cflags;
     tb->trace_vcpu_dstate = *cpu->trace_dstate;
-    tb->invalid = false;
 
 #ifdef CONFIG_PROFILER
     tcg_ctx.tb_count1++; /* includes aborted translations because of
-- 
2.7.4


* [Qemu-devel] [PATCH v2 12/45] tcg: remove addr argument from lookup_tb_ptr
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

It is unlikely that we will ever want to call this helper passing
an argument other than the current PC. So just remove the argument,
and use the pc we already get from cpu_get_tb_cpu_state.

This change paves the way to having a common "tb_lookup" function.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg-op.h               |  4 ++--
 tcg/tcg-runtime.h          |  2 +-
 target/alpha/translate.c   |  2 +-
 target/arm/translate-a64.c |  4 ++--
 target/arm/translate.c     |  5 +----
 target/hppa/translate.c    |  6 +++---
 target/i386/translate.c    | 17 +++++------------
 target/mips/translate.c    |  4 ++--
 target/s390x/translate.c   |  2 +-
 tcg/tcg-op.c               |  4 ++--
 tcg/tcg-runtime.c          | 20 ++++++++++----------
 11 files changed, 30 insertions(+), 40 deletions(-)

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 5d3278f..18d01b2 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -797,7 +797,7 @@ static inline void tcg_gen_exit_tb(uintptr_t val)
 void tcg_gen_goto_tb(unsigned idx);
 
 /**
- * tcg_gen_lookup_and_goto_ptr() - look up a TB and jump to it if valid
+ * tcg_gen_lookup_and_goto_ptr() - look up the current TB, jump to it if valid
  * @addr: Guest address of the target TB
  *
  * If the TB is not valid, jump to the epilogue.
@@ -805,7 +805,7 @@ void tcg_gen_goto_tb(unsigned idx);
  * This operation is optional. If the TCG backend does not implement goto_ptr,
  * this op is equivalent to calling tcg_gen_exit_tb() with 0 as the argument.
  */
-void tcg_gen_lookup_and_goto_ptr(TCGv addr);
+void tcg_gen_lookup_and_goto_ptr(void);
 
 #if TARGET_LONG_BITS == 32
 #define tcg_temp_new() tcg_temp_new_i32()
diff --git a/tcg/tcg-runtime.h b/tcg/tcg-runtime.h
index c41d38a..1df17d0 100644
--- a/tcg/tcg-runtime.h
+++ b/tcg/tcg-runtime.h
@@ -24,7 +24,7 @@ DEF_HELPER_FLAGS_1(clrsb_i64, TCG_CALL_NO_RWG_SE, i64, i64)
 DEF_HELPER_FLAGS_1(ctpop_i32, TCG_CALL_NO_RWG_SE, i32, i32)
 DEF_HELPER_FLAGS_1(ctpop_i64, TCG_CALL_NO_RWG_SE, i64, i64)
 
-DEF_HELPER_FLAGS_2(lookup_tb_ptr, TCG_CALL_NO_WG_SE, ptr, env, tl)
+DEF_HELPER_FLAGS_1(lookup_tb_ptr, TCG_CALL_NO_WG_SE, ptr, env)
 
 DEF_HELPER_FLAGS_1(exit_atomic, TCG_CALL_NO_WG, noreturn, env)
 
diff --git a/target/alpha/translate.c b/target/alpha/translate.c
index 232af9e..96c527b 100644
--- a/target/alpha/translate.c
+++ b/target/alpha/translate.c
@@ -3022,7 +3022,7 @@ void gen_intermediate_code(CPUAlphaState *env, struct TranslationBlock *tb)
         /* FALLTHRU */
     case EXIT_PC_UPDATED:
         if (!use_exit_tb(&ctx)) {
-            tcg_gen_lookup_and_goto_ptr(cpu_pc);
+            tcg_gen_lookup_and_goto_ptr();
             break;
         }
         /* FALLTHRU */
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index e55547d..49d35c2 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -379,7 +379,7 @@ static inline void gen_goto_tb(DisasContext *s, int n, uint64_t dest)
         } else if (s->singlestep_enabled) {
             gen_exception_internal(EXCP_DEBUG);
         } else {
-            tcg_gen_lookup_and_goto_ptr(cpu_pc);
+            tcg_gen_lookup_and_goto_ptr();
             s->is_jmp = DISAS_TB_JUMP;
         }
     }
@@ -11369,7 +11369,7 @@ void gen_intermediate_code_a64(ARMCPU *cpu, TranslationBlock *tb)
             gen_a64_set_pc_im(dc->pc);
             /* fall through */
         case DISAS_JUMP:
-            tcg_gen_lookup_and_goto_ptr(cpu_pc);
+            tcg_gen_lookup_and_goto_ptr();
             break;
         case DISAS_EXIT:
             tcg_gen_exit_tb(0);
diff --git a/target/arm/translate.c b/target/arm/translate.c
index 0862f9e..ebbe407 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -4152,10 +4152,7 @@ static inline bool use_goto_tb(DisasContext *s, target_ulong dest)
 
 static void gen_goto_ptr(void)
 {
-    TCGv addr = tcg_temp_new();
-    tcg_gen_extu_i32_tl(addr, cpu_R[15]);
-    tcg_gen_lookup_and_goto_ptr(addr);
-    tcg_temp_free(addr);
+    tcg_gen_lookup_and_goto_ptr();
 }
 
 static void gen_goto_tb(DisasContext *s, int n, target_ulong dest)
diff --git a/target/hppa/translate.c b/target/hppa/translate.c
index e10abc5..91053e2 100644
--- a/target/hppa/translate.c
+++ b/target/hppa/translate.c
@@ -517,7 +517,7 @@ static void gen_goto_tb(DisasContext *ctx, int which,
         if (ctx->singlestep_enabled) {
             gen_excp_1(EXCP_DEBUG);
         } else {
-            tcg_gen_lookup_and_goto_ptr(cpu_iaoq_f);
+            tcg_gen_lookup_and_goto_ptr();
         }
     }
 }
@@ -1527,7 +1527,7 @@ static ExitStatus do_ibranch(DisasContext *ctx, TCGv dest,
         if (link != 0) {
             tcg_gen_movi_tl(cpu_gr[link], ctx->iaoq_n);
         }
-        tcg_gen_lookup_and_goto_ptr(cpu_iaoq_f);
+        tcg_gen_lookup_and_goto_ptr();
         return nullify_end(ctx, NO_EXIT);
     } else {
         cond_prep(&ctx->null_cond);
@@ -3885,7 +3885,7 @@ void gen_intermediate_code(CPUHPPAState *env, struct TranslationBlock *tb)
         if (ctx.singlestep_enabled) {
             gen_excp_1(EXCP_DEBUG);
         } else {
-            tcg_gen_lookup_and_goto_ptr(cpu_iaoq_f);
+            tcg_gen_lookup_and_goto_ptr();
         }
         break;
     default:
diff --git a/target/i386/translate.c b/target/i386/translate.c
index ed3b896..291c577 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -2511,7 +2511,7 @@ static void gen_bnd_jmp(DisasContext *s)
    If RECHECK_TF, emit a rechecking helper for #DB, ignoring the state of
    S->TF.  This is used by the syscall/sysret insns.  */
 static void
-do_gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf, TCGv jr)
+do_gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf, bool jr)
 {
     gen_update_cc_op(s);
 
@@ -2532,12 +2532,8 @@ do_gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf, TCGv jr)
         tcg_gen_exit_tb(0);
     } else if (s->tf) {
         gen_helper_single_step(cpu_env);
-    } else if (!TCGV_IS_UNUSED(jr)) {
-        TCGv vaddr = tcg_temp_new();
-
-        tcg_gen_add_tl(vaddr, jr, cpu_seg_base[R_CS]);
-        tcg_gen_lookup_and_goto_ptr(vaddr);
-        tcg_temp_free(vaddr);
+    } else if (jr) {
+        tcg_gen_lookup_and_goto_ptr();
     } else {
         tcg_gen_exit_tb(0);
     }
@@ -2547,10 +2543,7 @@ do_gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf, TCGv jr)
 static inline void
 gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf)
 {
-    TCGv unused;
-
-    TCGV_UNUSED(unused);
-    do_gen_eob_worker(s, inhibit, recheck_tf, unused);
+    do_gen_eob_worker(s, inhibit, recheck_tf, false);
 }
 
 /* End of block.
@@ -2569,7 +2562,7 @@ static void gen_eob(DisasContext *s)
 /* Jump to register */
 static void gen_jr(DisasContext *s, TCGv dest)
 {
-    do_gen_eob_worker(s, false, false, dest);
+    do_gen_eob_worker(s, false, false, true);
 }
 
 /* generate a jump to eip. No segment change must happen before as a
diff --git a/target/mips/translate.c b/target/mips/translate.c
index 559f8fe..a2f5385 100644
--- a/target/mips/translate.c
+++ b/target/mips/translate.c
@@ -4233,7 +4233,7 @@ static inline void gen_goto_tb(DisasContext *ctx, int n, target_ulong dest)
             save_cpu_state(ctx, 0);
             gen_helper_raise_exception_debug(cpu_env);
         }
-        tcg_gen_lookup_and_goto_ptr(cpu_PC);
+        tcg_gen_lookup_and_goto_ptr();
     }
 }
 
@@ -10725,7 +10725,7 @@ static void gen_branch(DisasContext *ctx, int insn_bytes)
                 save_cpu_state(ctx, 0);
                 gen_helper_raise_exception_debug(cpu_env);
             }
-            tcg_gen_lookup_and_goto_ptr(cpu_PC);
+            tcg_gen_lookup_and_goto_ptr();
             break;
         default:
             fprintf(stderr, "unknown branch 0x%x\n", proc_hflags);
diff --git a/target/s390x/translate.c b/target/s390x/translate.c
index 592d6b0..b503c2c 100644
--- a/target/s390x/translate.c
+++ b/target/s390x/translate.c
@@ -5859,7 +5859,7 @@ void gen_intermediate_code(CPUS390XState *env, struct TranslationBlock *tb)
         } else if (use_exit_tb(&dc) || status == EXIT_PC_STALE_NOCHAIN) {
             tcg_gen_exit_tb(0);
         } else {
-            tcg_gen_lookup_and_goto_ptr(psw_addr);
+            tcg_gen_lookup_and_goto_ptr();
         }
         break;
     default:
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 87f673e..205d07f 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -2587,11 +2587,11 @@ void tcg_gen_goto_tb(unsigned idx)
     tcg_gen_op1i(INDEX_op_goto_tb, idx);
 }
 
-void tcg_gen_lookup_and_goto_ptr(TCGv addr)
+void tcg_gen_lookup_and_goto_ptr(void)
 {
     if (TCG_TARGET_HAS_goto_ptr && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
         TCGv_ptr ptr = tcg_temp_new_ptr();
-        gen_helper_lookup_tb_ptr(ptr, tcg_ctx.tcg_env, addr);
+        gen_helper_lookup_tb_ptr(ptr, tcg_ctx.tcg_env);
         tcg_gen_op1i(INDEX_op_goto_ptr, GET_TCGV_PTR(ptr));
         tcg_temp_free_ptr(ptr);
     } else {
diff --git a/tcg/tcg-runtime.c b/tcg/tcg-runtime.c
index 3e23649..e85a042 100644
--- a/tcg/tcg-runtime.c
+++ b/tcg/tcg-runtime.c
@@ -144,33 +144,33 @@ uint64_t HELPER(ctpop_i64)(uint64_t arg)
     return ctpop64(arg);
 }
 
-void *HELPER(lookup_tb_ptr)(CPUArchState *env, target_ulong addr)
+void *HELPER(lookup_tb_ptr)(CPUArchState *env)
 {
     CPUState *cpu = ENV_GET_CPU(env);
     TranslationBlock *tb;
     target_ulong cs_base, pc;
-    uint32_t flags, addr_hash;
+    uint32_t flags, hash;
 
-    addr_hash = tb_jmp_cache_hash_func(addr);
-    tb = atomic_rcu_read(&cpu->tb_jmp_cache[addr_hash]);
     cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
+    hash = tb_jmp_cache_hash_func(pc);
+    tb = atomic_rcu_read(&cpu->tb_jmp_cache[hash]);
 
     if (unlikely(!(tb
-                   && tb->pc == addr
+                   && tb->pc == pc
                    && tb->cs_base == cs_base
                    && tb->flags == flags
                    && tb->trace_vcpu_dstate == *cpu->trace_dstate))) {
-        tb = tb_htable_lookup(cpu, addr, cs_base, flags);
+        tb = tb_htable_lookup(cpu, pc, cs_base, flags);
         if (!tb) {
             return tcg_ctx.code_gen_epilogue;
         }
-        atomic_set(&cpu->tb_jmp_cache[addr_hash], tb);
+        atomic_set(&cpu->tb_jmp_cache[hash], tb);
     }
 
-    qemu_log_mask_and_addr(CPU_LOG_EXEC, addr,
+    qemu_log_mask_and_addr(CPU_LOG_EXEC, pc,
                            "Chain %p [%d: " TARGET_FMT_lx "] %s\n",
-                           tb->tc_ptr, cpu->cpu_index, addr,
-                           lookup_symbol(addr));
+                           tb->tc_ptr, cpu->cpu_index, pc,
+                           lookup_symbol(pc));
     return tb->tc_ptr;
 }
 
-- 
2.7.4


* [Qemu-devel] [PATCH v2 13/45] tcg: consolidate TB lookups in tb_lookup__cpu_state
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

This avoids duplicating code. cpu_exec_step will also use the
new common function once we integrate parallel_cpus into tb->cflags.

Performance-wise, I measured a small improvement when booting debian-arm.
Note that inlining pays off:

 Performance counter stats for 'taskset -c 0 qemu-system-arm \
	-machine type=virt -nographic -smp 1 -m 4096 \
	-netdev user,id=unet,hostfwd=tcp::2222-:22 \
	-device virtio-net-device,netdev=unet \
	-drive file=jessie.qcow2,id=myblock,index=0,if=none \
	-device virtio-blk-device,drive=myblock \
	-kernel kernel.img -append console=ttyAMA0 root=/dev/vda1 \
	-name arm,debug-threads=on -smp 1' (10 runs):

Before:
      18714.917392 task-clock                #    0.952 CPUs utilized            ( +-  0.95% )
            23,142 context-switches          #    0.001 M/sec                    ( +-  0.50% )
                 1 CPU-migrations            #    0.000 M/sec
            10,558 page-faults               #    0.001 M/sec                    ( +-  0.95% )
    53,957,727,252 cycles                    #    2.883 GHz                      ( +-  0.91% ) [83.33%]
    24,440,599,852 stalled-cycles-frontend   #   45.30% frontend cycles idle     ( +-  1.20% ) [83.33%]
    16,495,714,424 stalled-cycles-backend    #   30.57% backend  cycles idle     ( +-  0.95% ) [66.66%]
    76,267,572,582 instructions              #    1.41  insns per cycle
                                             #    0.32  stalled cycles per insn  ( +-  0.87% ) [83.34%]
    12,692,186,323 branches                  #  678.186 M/sec                    ( +-  0.92% ) [83.35%]
       263,486,879 branch-misses             #    2.08% of all branches          ( +-  0.73% ) [83.34%]

      19.648474449 seconds time elapsed                                          ( +-  0.82% )

After, w/ inline (this patch):
      18471.376627 task-clock                #    0.955 CPUs utilized            ( +-  0.96% )
            23,048 context-switches          #    0.001 M/sec                    ( +-  0.48% )
                 1 CPU-migrations            #    0.000 M/sec
            10,708 page-faults               #    0.001 M/sec                    ( +-  0.81% )
    53,208,990,796 cycles                    #    2.881 GHz                      ( +-  0.98% ) [83.34%]
    23,941,071,673 stalled-cycles-frontend   #   44.99% frontend cycles idle     ( +-  0.95% ) [83.34%]
    16,161,773,848 stalled-cycles-backend    #   30.37% backend  cycles idle     ( +-  0.76% ) [66.67%]
    75,786,269,766 instructions              #    1.42  insns per cycle
                                             #    0.32  stalled cycles per insn  ( +-  1.24% ) [83.34%]
    12,573,617,143 branches                  #  680.708 M/sec                    ( +-  1.34% ) [83.33%]
       260,235,550 branch-misses             #    2.07% of all branches          ( +-  0.66% ) [83.33%]

      19.340502161 seconds time elapsed                                          ( +-  0.56% )

After, w/o inline:
      18791.253967 task-clock                #    0.954 CPUs utilized            ( +-  0.78% )
            23,230 context-switches          #    0.001 M/sec                    ( +-  0.42% )
                 1 CPU-migrations            #    0.000 M/sec
            10,563 page-faults               #    0.001 M/sec                    ( +-  1.27% )
    54,168,674,622 cycles                    #    2.883 GHz                      ( +-  0.80% ) [83.34%]
    24,244,712,629 stalled-cycles-frontend   #   44.76% frontend cycles idle     ( +-  1.37% ) [83.33%]
    16,288,648,572 stalled-cycles-backend    #   30.07% backend  cycles idle     ( +-  0.95% ) [66.66%]
    77,659,755,503 instructions              #    1.43  insns per cycle
                                             #    0.31  stalled cycles per insn  ( +-  0.97% ) [83.34%]
    12,922,780,045 branches                  #  687.702 M/sec                    ( +-  1.06% ) [83.34%]
       261,962,386 branch-misses             #    2.03% of all branches          ( +-  0.71% ) [83.35%]

      19.700174670 seconds time elapsed                                          ( +-  0.56% )

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/tb-lookup.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 accel/tcg/cpu-exec.c     | 47 ++++++++++++++++++-----------------------------
 tcg/tcg-runtime.c        | 24 ++++++------------------
 3 files changed, 72 insertions(+), 47 deletions(-)
 create mode 100644 include/exec/tb-lookup.h

diff --git a/include/exec/tb-lookup.h b/include/exec/tb-lookup.h
new file mode 100644
index 0000000..5e3f104
--- /dev/null
+++ b/include/exec/tb-lookup.h
@@ -0,0 +1,48 @@
+/*
+ * Copyright (C) 2017, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ *   See the COPYING file in the top-level directory.
+ */
+#ifndef EXEC_TB_LOOKUP_H
+#define EXEC_TB_LOOKUP_H
+
+#include "qemu/osdep.h"
+
+#ifdef NEED_CPU_H
+#include "cpu.h"
+#else
+#include "exec/poison.h"
+#endif
+
+#include "exec/exec-all.h"
+#include "exec/tb-hash.h"
+
+/* Might cause an exception, so have a longjmp destination ready */
+static inline TranslationBlock *
+tb_lookup__cpu_state(CPUState *cpu, target_ulong *pc, target_ulong *cs_base,
+                     uint32_t *flags)
+{
+    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
+    TranslationBlock *tb;
+    uint32_t hash;
+
+    cpu_get_tb_cpu_state(env, pc, cs_base, flags);
+    hash = tb_jmp_cache_hash_func(*pc);
+    tb = atomic_rcu_read(&cpu->tb_jmp_cache[hash]);
+    if (likely(tb &&
+               tb->pc == *pc &&
+               tb->cs_base == *cs_base &&
+               tb->flags == *flags &&
+               tb->trace_vcpu_dstate == *cpu->trace_dstate)) {
+        return tb;
+    }
+    tb = tb_htable_lookup(cpu, *pc, *cs_base, *flags);
+    if (tb == NULL) {
+        return NULL;
+    }
+    atomic_set(&cpu->tb_jmp_cache[hash], tb);
+    return tb;
+}
+
+#endif /* EXEC_TB_LOOKUP_H */
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 34841cd..3a08ad0 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -28,6 +28,7 @@
 #include "exec/address-spaces.h"
 #include "qemu/rcu.h"
 #include "exec/tb-hash.h"
+#include "exec/tb-lookup.h"
 #include "exec/log.h"
 #include "qemu/main-loop.h"
 #if defined(TARGET_I386) && !defined(CONFIG_USER_ONLY)
@@ -332,43 +333,31 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
                                         TranslationBlock *last_tb,
                                         int tb_exit)
 {
-    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
     TranslationBlock *tb;
     target_ulong cs_base, pc;
     uint32_t flags;
     bool acquired_tb_lock = false;
 
-    /* we record a subset of the CPU state. It will
-       always be the same before a given translated block
-       is executed. */
-    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
-    tb = atomic_rcu_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
-    if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
-                 tb->flags != flags ||
-                 tb->trace_vcpu_dstate != *cpu->trace_dstate)) {
-        tb = tb_htable_lookup(cpu, pc, cs_base, flags);
-        if (!tb) {
-
-            /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
-             * taken outside tb_lock. As system emulation is currently
-             * single threaded the locks are NOPs.
-             */
-            mmap_lock();
-            tb_lock();
-            acquired_tb_lock = true;
-
-            /* There's a chance that our desired tb has been translated while
-             * taking the locks so we check again inside the lock.
-             */
-            tb = tb_htable_lookup(cpu, pc, cs_base, flags);
-            if (!tb) {
-                /* if no translated code available, then translate it now */
-                tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
-            }
+    tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags);
+    if (tb == NULL) {
+        /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
+         * taken outside tb_lock. As system emulation is currently
+         * single threaded the locks are NOPs.
+         */
+        mmap_lock();
+        tb_lock();
+        acquired_tb_lock = true;
 
-            mmap_unlock();
+        /* There's a chance that our desired tb has been translated while
+         * taking the locks so we check again inside the lock.
+         */
+        tb = tb_htable_lookup(cpu, pc, cs_base, flags);
+        if (likely(tb == NULL)) {
+            /* if no translated code available, then translate it now */
+            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
         }
 
+        mmap_unlock();
         /* We add the TB in the virtual pc hash table for the fast lookup */
         atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
     }
diff --git a/tcg/tcg-runtime.c b/tcg/tcg-runtime.c
index e85a042..7100339 100644
--- a/tcg/tcg-runtime.c
+++ b/tcg/tcg-runtime.c
@@ -27,7 +27,7 @@
 #include "exec/helper-proto.h"
 #include "exec/cpu_ldst.h"
 #include "exec/exec-all.h"
-#include "exec/tb-hash.h"
+#include "exec/tb-lookup.h"
 #include "disas/disas.h"
 #include "exec/log.h"
 
@@ -149,24 +149,12 @@ void *HELPER(lookup_tb_ptr)(CPUArchState *env)
     CPUState *cpu = ENV_GET_CPU(env);
     TranslationBlock *tb;
     target_ulong cs_base, pc;
-    uint32_t flags, hash;
-
-    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
-    hash = tb_jmp_cache_hash_func(pc);
-    tb = atomic_rcu_read(&cpu->tb_jmp_cache[hash]);
-
-    if (unlikely(!(tb
-                   && tb->pc == pc
-                   && tb->cs_base == cs_base
-                   && tb->flags == flags
-                   && tb->trace_vcpu_dstate == *cpu->trace_dstate))) {
-        tb = tb_htable_lookup(cpu, pc, cs_base, flags);
-        if (!tb) {
-            return tcg_ctx.code_gen_epilogue;
-        }
-        atomic_set(&cpu->tb_jmp_cache[hash], tb);
-    }
+    uint32_t flags;
 
+    tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags);
+    if (tb == NULL) {
+        return tcg_ctx.code_gen_epilogue;
+    }
     qemu_log_mask_and_addr(CPU_LOG_EXEC, pc,
                            "Chain %p [%d: " TARGET_FMT_lx "] %s\n",
                            tb->tc_ptr, cpu->cpu_index, pc,
-- 
2.7.4


* [Qemu-devel] [PATCH v2 14/45] tcg: define CF_PARALLEL and use it for TB hashing
From: Emilio G. Cota @ 2017-07-16 20:03 UTC
  To: qemu-devel; +Cc: Richard Henderson

This will enable us to decouple code translation from the value
of parallel_cpus at any given time. It will also help us minimize
TB flushes when generating code via EXCP_ATOMIC.

Note that the declaration of parallel_cpus is moved to exec-all.h so
that the inlines can be defined there. The inlines use an otherwise
unnecessary temp variable just to make it easier to add more bits to
the mask in the future.
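
For example, with this patch a hash computation or lookup passes the
masked cflags alongside the existing keys (matching the diff below):

    h = tb_hash_func(phys_pc, pc, flags, curr_cf_mask(), *cpu->trace_dstate);

so a TB generated for a parallel context and one generated for a serial
context never compare equal, even for the same guest pc.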

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h   | 26 ++++++++++++++++++++++-
 include/exec/tb-hash-xx.h |  9 +++++---
 include/exec/tb-hash.h    |  4 ++--
 include/exec/tb-lookup.h  |  5 +++--
 tcg/tcg.h                 |  1 -
 accel/tcg/cpu-exec.c      | 53 +++++++++++++++++++++++++++--------------------
 accel/tcg/translate-all.c | 23 ++++++++++++++++----
 exec.c                    |  7 ++++++-
 hw/i386/kvmvapic.c        |  7 ++++++-
 tcg/tcg-runtime.c         |  2 +-
 tests/qht-bench.c         |  2 +-
 11 files changed, 100 insertions(+), 39 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 78a1714..b3f04c3 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -327,6 +327,7 @@ struct TranslationBlock {
 #define CF_USE_ICOUNT  0x20000
 #define CF_IGNORE_ICOUNT 0x40000 /* Do not generate icount code */
 #define CF_INVALID     0x80000 /* Protected by tb_lock */
+#define CF_PARALLEL    0x100000 /* Generate code for a parallel context */
 
     /* Per-vCPU dynamic tracing state used to generate this TB */
     uint32_t trace_vcpu_dstate;
@@ -370,11 +371,34 @@ struct TranslationBlock {
     uintptr_t jmp_list_first;
 };
 
+extern bool parallel_cpus;
+
+/* mask cflags for hashing/comparison */
+static inline uint32_t mask_cf(uint32_t cflags)
+{
+    uint32_t mask = 0;
+
+    mask |= CF_PARALLEL;
+    return cflags & mask;
+}
+
+/* current cflags, masked for hashing/comparison */
+static inline uint32_t curr_cf_mask(void)
+{
+    uint32_t val = 0;
+
+    if (parallel_cpus) {
+        val |= CF_PARALLEL;
+    }
+    return val;
+}
+
 void tb_free(TranslationBlock *tb);
 void tb_flush(CPUState *cpu);
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
 TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
-                                   target_ulong cs_base, uint32_t flags);
+                                   target_ulong cs_base, uint32_t flags,
+                                   uint32_t cf_mask);
 
 #if defined(USE_DIRECT_JUMP)
 
diff --git a/include/exec/tb-hash-xx.h b/include/exec/tb-hash-xx.h
index 6cd3022..747a9a6 100644
--- a/include/exec/tb-hash-xx.h
+++ b/include/exec/tb-hash-xx.h
@@ -48,8 +48,8 @@
  * xxhash32, customized for input variables that are not guaranteed to be
  * contiguous in memory.
  */
-static inline
-uint32_t tb_hash_func6(uint64_t a0, uint64_t b0, uint32_t e, uint32_t f)
+static inline uint32_t
+tb_hash_func7(uint64_t a0, uint64_t b0, uint32_t e, uint32_t f, uint32_t g)
 {
     uint32_t v1 = TB_HASH_XX_SEED + PRIME32_1 + PRIME32_2;
     uint32_t v2 = TB_HASH_XX_SEED + PRIME32_2;
@@ -78,7 +78,7 @@ uint32_t tb_hash_func6(uint64_t a0, uint64_t b0, uint32_t e, uint32_t f)
     v4 *= PRIME32_1;
 
     h32 = rol32(v1, 1) + rol32(v2, 7) + rol32(v3, 12) + rol32(v4, 18);
-    h32 += 24;
+    h32 += 28;
 
     h32 += e * PRIME32_3;
     h32  = rol32(h32, 17) * PRIME32_4;
@@ -86,6 +86,9 @@ uint32_t tb_hash_func6(uint64_t a0, uint64_t b0, uint32_t e, uint32_t f)
     h32 += f * PRIME32_3;
     h32  = rol32(h32, 17) * PRIME32_4;
 
+    h32 += g * PRIME32_3;
+    h32  = rol32(h32, 17) * PRIME32_4;
+
     h32 ^= h32 >> 15;
     h32 *= PRIME32_2;
     h32 ^= h32 >> 13;
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 17b5ee0..0526c4f 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -59,9 +59,9 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
 
 static inline
 uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint32_t flags,
-                      uint32_t trace_vcpu_dstate)
+                      uint32_t cf_mask, uint32_t trace_vcpu_dstate)
 {
-    return tb_hash_func6(phys_pc, pc, flags, trace_vcpu_dstate);
+    return tb_hash_func7(phys_pc, pc, flags, cf_mask, trace_vcpu_dstate);
 }
 
 #endif
diff --git a/include/exec/tb-lookup.h b/include/exec/tb-lookup.h
index 5e3f104..9948b67 100644
--- a/include/exec/tb-lookup.h
+++ b/include/exec/tb-lookup.h
@@ -21,7 +21,7 @@
 /* Might cause an exception, so have a longjmp destination ready */
 static inline TranslationBlock *
 tb_lookup__cpu_state(CPUState *cpu, target_ulong *pc, target_ulong *cs_base,
-                     uint32_t *flags)
+                     uint32_t *flags, uint32_t cf_mask)
 {
     CPUArchState *env = (CPUArchState *)cpu->env_ptr;
     TranslationBlock *tb;
@@ -34,10 +34,11 @@ tb_lookup__cpu_state(CPUState *cpu, target_ulong *pc, target_ulong *cs_base,
                tb->pc == *pc &&
                tb->cs_base == *cs_base &&
                tb->flags == *flags &&
+               mask_cf(tb->cflags) == cf_mask &&
                tb->trace_vcpu_dstate == *cpu->trace_dstate)) {
         return tb;
     }
-    tb = tb_htable_lookup(cpu, *pc, *cs_base, *flags);
+    tb = tb_htable_lookup(cpu, *pc, *cs_base, *flags, cf_mask);
     if (tb == NULL) {
         return NULL;
     }
diff --git a/tcg/tcg.h b/tcg/tcg.h
index da78721..96872f8 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -730,7 +730,6 @@ struct TCGContext {
 };
 
 extern TCGContext tcg_ctx;
-extern bool parallel_cpus;
 
 static inline void tcg_set_insn_param(int op_idx, int arg, TCGArg v)
 {
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 3a08ad0..efe5c85 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -198,16 +198,20 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
                              TranslationBlock *orig_tb, bool ignore_icount)
 {
     TranslationBlock *tb;
+    uint32_t cflags;
 
     /* Should never happen.
        We only end up here when an existing TB is too long.  */
     if (max_cycles > CF_COUNT_MASK)
         max_cycles = CF_COUNT_MASK;
 
+    cflags = max_cycles | CF_NOCACHE | (ignore_icount ? CF_IGNORE_ICOUNT : 0);
+    if (parallel_cpus) {
+        cflags |= CF_PARALLEL;
+    }
     tb_lock();
     tb = tb_gen_code(cpu, orig_tb->pc, orig_tb->cs_base, orig_tb->flags,
-                     max_cycles | CF_NOCACHE
-                         | (ignore_icount ? CF_IGNORE_ICOUNT : 0));
+                     cflags);
     tb->orig_tb = orig_tb;
     tb_unlock();
 
@@ -225,31 +229,26 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
 static void cpu_exec_step(CPUState *cpu)
 {
     CPUClass *cc = CPU_GET_CLASS(cpu);
-    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
     TranslationBlock *tb;
     target_ulong cs_base, pc;
     uint32_t flags;
+    uint32_t cflags = 1 | CF_IGNORE_ICOUNT;
 
-    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
     if (sigsetjmp(cpu->jmp_env, 0) == 0) {
-        mmap_lock();
-        tb_lock();
-        tb = tb_gen_code(cpu, pc, cs_base, flags,
-                         1 | CF_NOCACHE | CF_IGNORE_ICOUNT);
-        tb->orig_tb = NULL;
-        tb_unlock();
-        mmap_unlock();
+        tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, mask_cf(cflags));
+        if (tb == NULL) {
+            mmap_lock();
+            tb_lock();
+            tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
+            tb_unlock();
+            mmap_unlock();
+        }
 
         cc->cpu_exec_enter(cpu);
         /* execute the generated code */
-        trace_exec_tb_nocache(tb, pc);
+        trace_exec_tb(tb, pc);
         cpu_tb_exec(cpu, tb);
         cc->cpu_exec_exit(cpu);
-
-        tb_lock();
-        tb_phys_invalidate(tb, -1);
-        tb_free(tb);
-        tb_unlock();
     } else {
         /* We may have exited due to another problem here, so we need
          * to reset any tb_locks we may have taken but didn't release.
@@ -281,6 +280,7 @@ struct tb_desc {
     CPUArchState *env;
     tb_page_addr_t phys_page1;
     uint32_t flags;
+    uint32_t cf_mask;
     uint32_t trace_vcpu_dstate;
 };
 
@@ -293,6 +293,7 @@ static bool tb_cmp(const void *p, const void *d)
         tb->page_addr[0] == desc->phys_page1 &&
         tb->cs_base == desc->cs_base &&
         tb->flags == desc->flags &&
+        mask_cf(tb->cflags) == desc->cf_mask &&
         tb->trace_vcpu_dstate == desc->trace_vcpu_dstate) {
         /* check next page if needed */
         if (tb->page_addr[1] == -1) {
@@ -312,7 +313,8 @@ static bool tb_cmp(const void *p, const void *d)
 }
 
 TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
-                                   target_ulong cs_base, uint32_t flags)
+                                   target_ulong cs_base, uint32_t flags,
+                                   uint32_t cf_mask)
 {
     tb_page_addr_t phys_pc;
     struct tb_desc desc;
@@ -321,11 +323,12 @@ TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
     desc.env = (CPUArchState *)cpu->env_ptr;
     desc.cs_base = cs_base;
     desc.flags = flags;
+    desc.cf_mask = cf_mask;
     desc.trace_vcpu_dstate = *cpu->trace_dstate;
     desc.pc = pc;
     phys_pc = get_page_addr_code(desc.env, pc);
     desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
-    h = tb_hash_func(phys_pc, pc, flags, *cpu->trace_dstate);
+    h = tb_hash_func(phys_pc, pc, flags, cf_mask, *cpu->trace_dstate);
     return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
 }
 
@@ -337,8 +340,9 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
     target_ulong cs_base, pc;
     uint32_t flags;
     bool acquired_tb_lock = false;
+    uint32_t cf_mask = curr_cf_mask();
 
-    tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags);
+    tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, cf_mask);
     if (tb == NULL) {
         /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
          * taken outside tb_lock. As system emulation is currently
@@ -351,10 +355,15 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
         /* There's a chance that our desired tb has been translated while
          * taking the locks so we check again inside the lock.
          */
-        tb = tb_htable_lookup(cpu, pc, cs_base, flags);
+        tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
         if (likely(tb == NULL)) {
+            uint32_t cflags = 0;
+
+            if (parallel_cpus) {
+                cflags |= CF_PARALLEL;
+            }
             /* if no translated code available, then translate it now */
-            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+            tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
         }
 
         mmap_unlock();
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 53fbb06..483248f 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1075,7 +1075,8 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
-    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->trace_vcpu_dstate);
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags, mask_cf(tb->cflags),
+                     tb->trace_vcpu_dstate);
     qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
 
     /*
@@ -1226,7 +1227,8 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
     }
 
     /* add in the hash table */
-    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->trace_vcpu_dstate);
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags, mask_cf(tb->cflags),
+                     tb->trace_vcpu_dstate);
     qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
 
 #ifdef DEBUG_TB_CHECK
@@ -1504,10 +1506,15 @@ void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,
 #endif
 #ifdef TARGET_HAS_PRECISE_SMC
     if (current_tb_modified) {
+        uint32_t cflags = 1;
+
+        if (parallel_cpus) {
+            cflags |= CF_PARALLEL;
+        }
         /* we generate a block containing just the instruction
            modifying the memory. It will ensure that it cannot modify
            itself */
-        tb_gen_code(cpu, current_pc, current_cs_base, current_flags, 1);
+        tb_gen_code(cpu, current_pc, current_cs_base, current_flags, cflags);
         cpu_loop_exit_noexc(cpu);
     }
 #endif
@@ -1622,10 +1629,15 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
     p->first_tb = NULL;
 #ifdef TARGET_HAS_PRECISE_SMC
     if (current_tb_modified) {
+        uint32_t cflags = 1;
+
+        if (parallel_cpus) {
+            cflags |= CF_PARALLEL;
+        }
         /* we generate a block containing just the instruction
            modifying the memory. It will ensure that it cannot modify
            itself */
-        tb_gen_code(cpu, current_pc, current_cs_base, current_flags, 1);
+        tb_gen_code(cpu, current_pc, current_cs_base, current_flags, cflags);
         /* tb_lock will be reset after cpu_loop_exit_noexc longjmps
          * back into the cpu_exec loop. */
         return true;
@@ -1769,6 +1781,9 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
     }
 
     cflags = n | CF_LAST_IO;
+    if (parallel_cpus) {
+        cflags |= CF_PARALLEL;
+    }
     pc = tb->pc;
     cs_base = tb->cs_base;
     flags = tb->flags;
diff --git a/exec.c b/exec.c
index a083ff8..adc160f 100644
--- a/exec.c
+++ b/exec.c
@@ -2414,8 +2414,13 @@ static void check_watchpoint(int offset, int len, MemTxAttrs attrs, int flags)
                     cpu->exception_index = EXCP_DEBUG;
                     cpu_loop_exit(cpu);
                 } else {
+                    uint32_t cflags = 1;
+
+                    if (parallel_cpus) {
+                        cflags |= CF_PARALLEL;
+                    }
                     cpu_get_tb_cpu_state(env, &pc, &cs_base, &cpu_flags);
-                    tb_gen_code(cpu, pc, cs_base, cpu_flags, 1);
+                    tb_gen_code(cpu, pc, cs_base, cpu_flags, cflags);
                     cpu_loop_exit_noexc(cpu);
                 }
             }
diff --git a/hw/i386/kvmvapic.c b/hw/i386/kvmvapic.c
index 0d9ef77..22e151d 100644
--- a/hw/i386/kvmvapic.c
+++ b/hw/i386/kvmvapic.c
@@ -458,10 +458,15 @@ static void patch_instruction(VAPICROMState *s, X86CPU *cpu, target_ulong ip)
     resume_all_vcpus();
 
     if (tcg_enabled()) {
+        uint32_t cflags = 1;
+
+        if (parallel_cpus) {
+            cflags |= CF_PARALLEL;
+        }
         /* Both tb_lock and iothread_mutex will be reset when
          *  longjmps back into the cpu_exec loop. */
         tb_lock();
-        tb_gen_code(cs, current_pc, current_cs_base, current_flags, 1);
+        tb_gen_code(cs, current_pc, current_cs_base, current_flags, cflags);
         cpu_loop_exit_noexc(cs);
     }
 }
diff --git a/tcg/tcg-runtime.c b/tcg/tcg-runtime.c
index 7100339..bf6f248 100644
--- a/tcg/tcg-runtime.c
+++ b/tcg/tcg-runtime.c
@@ -151,7 +151,7 @@ void *HELPER(lookup_tb_ptr)(CPUArchState *env)
     target_ulong cs_base, pc;
     uint32_t flags;
 
-    tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags);
+    tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, curr_cf_mask());
     if (tb == NULL) {
         return tcg_ctx.code_gen_epilogue;
     }
diff --git a/tests/qht-bench.c b/tests/qht-bench.c
index 11c1cec..4cabdfd 100644
--- a/tests/qht-bench.c
+++ b/tests/qht-bench.c
@@ -103,7 +103,7 @@ static bool is_equal(const void *obj, const void *userp)
 
 static inline uint32_t h(unsigned long v)
 {
-    return tb_hash_func6(v, 0, 0, 0);
+    return tb_hash_func7(v, 0, 0, 0, 0);
 }
 
 /*
-- 
2.7.4

* [Qemu-devel] [PATCH v2 15/45] target/arm: check CF_PARALLEL instead of parallel_cpus
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (13 preceding siblings ...)
  2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 14/45] tcg: define CF_PARALLEL and use it for TB hashing Emilio G. Cota
@ 2017-07-16 20:03 ` Emilio G. Cota
  2017-07-17 23:46   ` Richard Henderson
  2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 16/45] target/hppa: " Emilio G. Cota
                   ` (29 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Thereby decoupling the resulting translated code from the current state
of the system.
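
The same mechanical transformation is applied in this and the following
target patches. As a rough sketch of the pattern -- all names below are
invented for illustration, not QEMU APIs:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Before: the helper consulted the global parallel_cpus at run time.
     * After: one implementation takes the flag as a parameter... */
    static uint64_t do_op(uint64_t a, uint64_t b, bool parallel)
    {
        if (parallel) {
            /* the atomic sequence would go here */
            return a + b;
        }
        /* the non-atomic path would go here */
        return a + b;
    }

    /* ...and two entry points are exposed, so the translator can pick
     * one based on tb->cflags & CF_PARALLEL at code-generation time. */
    static uint64_t helper_op(uint64_t a, uint64_t b)
    {
        return do_op(a, b, false);
    }

    static uint64_t helper_op_parallel(uint64_t a, uint64_t b)
    {
        return do_op(a, b, true);
    }

    int main(void)
    {
        printf("%d %d\n", (int)helper_op(1, 2), (int)helper_op_parallel(1, 2));
        return 0;
    }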

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/arm/helper-a64.h    |  4 ++++
 target/arm/helper-a64.c    | 38 ++++++++++++++++++++++++++++++++------
 target/arm/op_helper.c     |  7 -------
 target/arm/translate-a64.c | 31 +++++++++++++++++++++++++------
 target/arm/translate.c     |  9 +++++++--
 5 files changed, 68 insertions(+), 21 deletions(-)

diff --git a/target/arm/helper-a64.h b/target/arm/helper-a64.h
index 6f9eaba..85d8674 100644
--- a/target/arm/helper-a64.h
+++ b/target/arm/helper-a64.h
@@ -43,4 +43,8 @@ DEF_HELPER_FLAGS_2(fcvtx_f64_to_f32, TCG_CALL_NO_RWG, f32, f64, env)
 DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
+DEF_HELPER_FLAGS_4(paired_cmpxchg64_le_parallel, TCG_CALL_NO_WG,
+                   i64, env, i64, i64, i64)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
+DEF_HELPER_FLAGS_4(paired_cmpxchg64_be_parallel, TCG_CALL_NO_WG,
+                   i64, env, i64, i64, i64)
diff --git a/target/arm/helper-a64.c b/target/arm/helper-a64.c
index d9df82c..d0e435c 100644
--- a/target/arm/helper-a64.c
+++ b/target/arm/helper-a64.c
@@ -430,8 +430,9 @@ uint64_t HELPER(crc32c_64)(uint64_t acc, uint64_t val, uint32_t bytes)
 }
 
 /* Returns 0 on success; 1 otherwise.  */
-uint64_t HELPER(paired_cmpxchg64_le)(CPUARMState *env, uint64_t addr,
-                                     uint64_t new_lo, uint64_t new_hi)
+static uint64_t do_paired_cmpxchg64_le(CPUARMState *env, uint64_t addr,
+                                       uint64_t new_lo, uint64_t new_hi,
+                                       bool parallel)
 {
     uintptr_t ra = GETPC();
     Int128 oldv, cmpv, newv;
@@ -440,7 +441,7 @@ uint64_t HELPER(paired_cmpxchg64_le)(CPUARMState *env, uint64_t addr,
     cmpv = int128_make128(env->exclusive_val, env->exclusive_high);
     newv = int128_make128(new_lo, new_hi);
 
-    if (parallel_cpus) {
+    if (parallel) {
 #ifndef CONFIG_ATOMIC128
         cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
 #else
@@ -484,8 +485,21 @@ uint64_t HELPER(paired_cmpxchg64_le)(CPUARMState *env, uint64_t addr,
     return !success;
 }
 
-uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
-                                     uint64_t new_lo, uint64_t new_hi)
+uint64_t HELPER(paired_cmpxchg64_le)(CPUARMState *env, uint64_t addr,
+                                              uint64_t new_lo, uint64_t new_hi)
+{
+    return do_paired_cmpxchg64_le(env, addr, new_lo, new_hi, false);
+}
+
+uint64_t HELPER(paired_cmpxchg64_le_parallel)(CPUARMState *env, uint64_t addr,
+                                              uint64_t new_lo, uint64_t new_hi)
+{
+    return do_paired_cmpxchg64_le(env, addr, new_lo, new_hi, true);
+}
+
+static uint64_t do_paired_cmpxchg64_be(CPUARMState *env, uint64_t addr,
+                                       uint64_t new_lo, uint64_t new_hi,
+                                       bool parallel)
 {
     uintptr_t ra = GETPC();
     Int128 oldv, cmpv, newv;
@@ -494,7 +508,7 @@ uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
     cmpv = int128_make128(env->exclusive_val, env->exclusive_high);
     newv = int128_make128(new_lo, new_hi);
 
-    if (parallel_cpus) {
+    if (parallel) {
 #ifndef CONFIG_ATOMIC128
         cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
 #else
@@ -537,3 +551,15 @@ uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
 
     return !success;
 }
+
+uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
+                                     uint64_t new_lo, uint64_t new_hi)
+{
+    return do_paired_cmpxchg64_be(env, addr, new_lo, new_hi, false);
+}
+
+uint64_t HELPER(paired_cmpxchg64_be_parallel)(CPUARMState *env, uint64_t addr,
+                                     uint64_t new_lo, uint64_t new_hi)
+{
+    return do_paired_cmpxchg64_be(env, addr, new_lo, new_hi, true);
+}
diff --git a/target/arm/op_helper.c b/target/arm/op_helper.c
index 2a85666..a28f254 100644
--- a/target/arm/op_helper.c
+++ b/target/arm/op_helper.c
@@ -450,13 +450,6 @@ void HELPER(yield)(CPUARMState *env)
     ARMCPU *cpu = arm_env_get_cpu(env);
     CPUState *cs = CPU(cpu);
 
-    /* When running in MTTCG we don't generate jumps to the yield and
-     * WFE helpers as it won't affect the scheduling of other vCPUs.
-     * If we wanted to more completely model WFE/SEV so we don't busy
-     * spin unnecessarily we would need to do something more involved.
-     */
-    g_assert(!parallel_cpus);
-
     /* This is a non-trappable hint instruction that generally indicates
      * that the guest is currently busy-looping. Yield control back to the
      * top level loop so that a more deserving VCPU has a chance to run.
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 49d35c2..af28b26 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -1333,13 +1333,18 @@ static void handle_hint(DisasContext *s, uint32_t insn,
     case 3: /* WFI */
         s->is_jmp = DISAS_WFI;
         return;
+        /* When running in MTTCG we don't generate jumps to the yield and
+         * WFE helpers as it won't affect the scheduling of other vCPUs.
+         * If we wanted to more completely model WFE/SEV so we don't busy
+         * spin unnecessarily we would need to do something more involved.
+         */
     case 1: /* YIELD */
-        if (!parallel_cpus) {
+        if (!(s->tb->cflags & CF_PARALLEL)) {
             s->is_jmp = DISAS_YIELD;
         }
         return;
     case 2: /* WFE */
-        if (!parallel_cpus) {
+        if (!(s->tb->cflags & CF_PARALLEL)) {
             s->is_jmp = DISAS_WFE;
         }
         return;
@@ -1916,11 +1921,25 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
             tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, val);
             tcg_temp_free_i64(val);
         } else if (s->be_data == MO_LE) {
-            gen_helper_paired_cmpxchg64_le(tmp, cpu_env, addr, cpu_reg(s, rt),
-                                           cpu_reg(s, rt2));
+            if (s->tb->cflags & CF_PARALLEL) {
+                gen_helper_paired_cmpxchg64_le_parallel(tmp, cpu_env, addr,
+                                                        cpu_reg(s, rt),
+                                                        cpu_reg(s, rt2));
+            } else {
+                gen_helper_paired_cmpxchg64_le(tmp, cpu_env, addr,
+                                               cpu_reg(s, rt),
+                                               cpu_reg(s, rt2));
+            }
         } else {
-            gen_helper_paired_cmpxchg64_be(tmp, cpu_env, addr, cpu_reg(s, rt),
-                                           cpu_reg(s, rt2));
+            if (s->tb->cflags & CF_PARALLEL) {
+                gen_helper_paired_cmpxchg64_be_parallel(tmp, cpu_env, addr,
+                                                        cpu_reg(s, rt),
+                                                        cpu_reg(s, rt2));
+            } else {
+                gen_helper_paired_cmpxchg64_be(tmp, cpu_env, addr,
+                                               cpu_reg(s, rt),
+                                               cpu_reg(s, rt2));
+            }
         }
     } else {
         TCGv_i64 val = cpu_reg(s, rt);
diff --git a/target/arm/translate.c b/target/arm/translate.c
index ebbe407..34aa95d 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -4492,8 +4492,13 @@ static void gen_exception_return(DisasContext *s, TCGv_i32 pc)
 static void gen_nop_hint(DisasContext *s, int val)
 {
     switch (val) {
+        /* When running in MTTCG we don't generate jumps to the yield and
+         * WFE helpers as it won't affect the scheduling of other vCPUs.
+         * If we wanted to more completely model WFE/SEV so we don't busy
+         * spin unnecessarily we would need to do something more involved.
+         */
     case 1: /* yield */
-        if (!parallel_cpus) {
+        if (!(s->tb->cflags & CF_PARALLEL)) {
             gen_set_pc_im(s, s->pc);
             s->is_jmp = DISAS_YIELD;
         }
@@ -4503,7 +4508,7 @@ static void gen_nop_hint(DisasContext *s, int val)
         s->is_jmp = DISAS_WFI;
         break;
     case 2: /* wfe */
-        if (!parallel_cpus) {
+        if (!(s->tb->cflags & CF_PARALLEL)) {
             gen_set_pc_im(s, s->pc);
             s->is_jmp = DISAS_WFE;
         }
-- 
2.7.4

* [Qemu-devel] [PATCH v2 16/45] target/hppa: check CF_PARALLEL instead of parallel_cpus
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (14 preceding siblings ...)
  2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 15/45] target/arm: check CF_PARALLEL instead of parallel_cpus Emilio G. Cota
@ 2017-07-16 20:03 ` Emilio G. Cota
  2017-07-17 23:47   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 17/45] target/i386: " Emilio G. Cota
                   ` (28 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Thereby decoupling the resulting translated code from the current state
of the system.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/hppa/helper.h    |  2 ++
 target/hppa/op_helper.c | 32 ++++++++++++++++++++++++++++----
 target/hppa/translate.c | 12 ++++++++++--
 3 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/target/hppa/helper.h b/target/hppa/helper.h
index 789f07f..0a6b900 100644
--- a/target/hppa/helper.h
+++ b/target/hppa/helper.h
@@ -3,7 +3,9 @@ DEF_HELPER_FLAGS_2(tsv, TCG_CALL_NO_WG, void, env, tl)
 DEF_HELPER_FLAGS_2(tcond, TCG_CALL_NO_WG, void, env, tl)
 
 DEF_HELPER_FLAGS_3(stby_b, TCG_CALL_NO_WG, void, env, tl, tl)
+DEF_HELPER_FLAGS_3(stby_b_parallel, TCG_CALL_NO_WG, void, env, tl, tl)
 DEF_HELPER_FLAGS_3(stby_e, TCG_CALL_NO_WG, void, env, tl, tl)
+DEF_HELPER_FLAGS_3(stby_e_parallel, TCG_CALL_NO_WG, void, env, tl, tl)
 
 DEF_HELPER_FLAGS_1(probe_r, TCG_CALL_NO_RWG_SE, tl, tl)
 DEF_HELPER_FLAGS_1(probe_w, TCG_CALL_NO_RWG_SE, tl, tl)
diff --git a/target/hppa/op_helper.c b/target/hppa/op_helper.c
index c05c0d5..3104404 100644
--- a/target/hppa/op_helper.c
+++ b/target/hppa/op_helper.c
@@ -76,7 +76,8 @@ static void atomic_store_3(CPUHPPAState *env, target_ulong addr, uint32_t val,
 #endif
 }
 
-void HELPER(stby_b)(CPUHPPAState *env, target_ulong addr, target_ulong val)
+static void do_stby_b(CPUHPPAState *env, target_ulong addr, target_ulong val,
+                      bool parallel)
 {
     uintptr_t ra = GETPC();
 
@@ -89,7 +90,7 @@ void HELPER(stby_b)(CPUHPPAState *env, target_ulong addr, target_ulong val)
         break;
     case 1:
         /* The 3 byte store must appear atomic.  */
-        if (parallel_cpus) {
+        if (parallel) {
             atomic_store_3(env, addr, val, 0x00ffffffu, ra);
         } else {
             cpu_stb_data_ra(env, addr, val >> 16, ra);
@@ -102,14 +103,26 @@ void HELPER(stby_b)(CPUHPPAState *env, target_ulong addr, target_ulong val)
     }
 }
 
-void HELPER(stby_e)(CPUHPPAState *env, target_ulong addr, target_ulong val)
+void HELPER(stby_b)(CPUHPPAState *env, target_ulong addr, target_ulong val)
+{
+    do_stby_b(env, addr, val, false);
+}
+
+void HELPER(stby_b_parallel)(CPUHPPAState *env, target_ulong addr,
+                             target_ulong val)
+{
+    do_stby_b(env, addr, val, true);
+}
+
+static void do_stby_e(CPUHPPAState *env, target_ulong addr, target_ulong val,
+                      bool parallel)
 {
     uintptr_t ra = GETPC();
 
     switch (addr & 3) {
     case 3:
         /* The 3 byte store must appear atomic.  */
-        if (parallel_cpus) {
+        if (parallel) {
             atomic_store_3(env, addr - 3, val, 0xffffff00u, ra);
         } else {
             cpu_stw_data_ra(env, addr - 3, val >> 16, ra);
@@ -132,6 +145,17 @@ void HELPER(stby_e)(CPUHPPAState *env, target_ulong addr, target_ulong val)
     }
 }
 
+void HELPER(stby_e)(CPUHPPAState *env, target_ulong addr, target_ulong val)
+{
+    do_stby_e(env, addr, val, false);
+}
+
+void HELPER(stby_e_parallel)(CPUHPPAState *env, target_ulong addr,
+                             target_ulong val)
+{
+    do_stby_e(env, addr, val, true);
+}
+
 target_ulong HELPER(probe_r)(target_ulong addr)
 {
     return page_check_range(addr, 1, PAGE_READ);
diff --git a/target/hppa/translate.c b/target/hppa/translate.c
index 91053e2..fde3dba 100644
--- a/target/hppa/translate.c
+++ b/target/hppa/translate.c
@@ -2309,9 +2309,17 @@ static ExitStatus trans_stby(DisasContext *ctx, uint32_t insn,
     val = load_gpr(ctx, rt);
 
     if (a) {
-        gen_helper_stby_e(cpu_env, addr, val);
+        if (ctx->tb->cflags & CF_PARALLEL) {
+            gen_helper_stby_e_parallel(cpu_env, addr, val);
+        } else {
+            gen_helper_stby_e(cpu_env, addr, val);
+        }
     } else {
-        gen_helper_stby_b(cpu_env, addr, val);
+        if (ctx->tb->cflags & CF_PARALLEL) {
+            gen_helper_stby_b_parallel(cpu_env, addr, val);
+        } else {
+            gen_helper_stby_b(cpu_env, addr, val);
+        }
     }
 
     if (m) {
-- 
2.7.4

* [Qemu-devel] [PATCH v2 17/45] target/i386: check CF_PARALLEL instead of parallel_cpus
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (15 preceding siblings ...)
  2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 16/45] target/hppa: " Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-17 23:47   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 18/45] target/m68k: " Emilio G. Cota
                   ` (27 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Thereby decoupling the resulting translated code from the current state
of the system.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/i386/translate.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target/i386/translate.c b/target/i386/translate.c
index 291c577..c5e4d77 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -5263,7 +5263,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             if (!(s->cpuid_ext_features & CPUID_EXT_CX16))
                 goto illegal_op;
             gen_lea_modrm(env, s, modrm);
-            if ((s->prefix & PREFIX_LOCK) && parallel_cpus) {
+            if ((s->prefix & PREFIX_LOCK) && s->tb->cflags & CF_PARALLEL) {
                 gen_helper_cmpxchg16b(cpu_env, cpu_A0);
             } else {
                 gen_helper_cmpxchg16b_unlocked(cpu_env, cpu_A0);
@@ -5274,7 +5274,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             if (!(s->cpuid_features & CPUID_CX8))
                 goto illegal_op;
             gen_lea_modrm(env, s, modrm);
-            if ((s->prefix & PREFIX_LOCK) && parallel_cpus) {
+            if ((s->prefix & PREFIX_LOCK) && s->tb->cflags & CF_PARALLEL) {
                 gen_helper_cmpxchg8b(cpu_env, cpu_A0);
             } else {
                 gen_helper_cmpxchg8b_unlocked(cpu_env, cpu_A0);
-- 
2.7.4

* [Qemu-devel] [PATCH v2 18/45] target/m68k: check CF_PARALLEL instead of parallel_cpus
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (16 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 17/45] target/i386: " Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-17 23:52   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 19/45] target/s390x: " Emilio G. Cota
                   ` (26 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Thereby decoupling the resulting translated code from the current state
of the system.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/m68k/helper.h    |  2 ++
 target/m68k/op_helper.c | 32 ++++++++++++++++++++++++++++----
 target/m68k/translate.c | 12 ++++++++++--
 3 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/target/m68k/helper.h b/target/m68k/helper.h
index 475a1f2..137ef48 100644
--- a/target/m68k/helper.h
+++ b/target/m68k/helper.h
@@ -10,7 +10,9 @@ DEF_HELPER_4(divsll, void, env, int, int, s32)
 DEF_HELPER_2(set_sr, void, env, i32)
 DEF_HELPER_3(movec, void, env, i32, i32)
 DEF_HELPER_4(cas2w, void, env, i32, i32, i32)
+DEF_HELPER_4(cas2w_parallel, void, env, i32, i32, i32)
 DEF_HELPER_4(cas2l, void, env, i32, i32, i32)
+DEF_HELPER_4(cas2l_parallel, void, env, i32, i32, i32)
 
 #define dh_alias_fp ptr
 #define dh_ctype_fp FPReg *
diff --git a/target/m68k/op_helper.c b/target/m68k/op_helper.c
index 7b5126c..061d468 100644
--- a/target/m68k/op_helper.c
+++ b/target/m68k/op_helper.c
@@ -361,7 +361,8 @@ void HELPER(divsll)(CPUM68KState *env, int numr, int regr, int32_t den)
     env->dregs[numr] = quot;
 }
 
-void HELPER(cas2w)(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2)
+static void do_cas2w(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2,
+                     bool parallel)
 {
     uint32_t Dc1 = extract32(regs, 9, 3);
     uint32_t Dc2 = extract32(regs, 6, 3);
@@ -374,7 +375,7 @@ void HELPER(cas2w)(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2)
     int16_t l1, l2;
     uintptr_t ra = GETPC();
 
-    if (parallel_cpus) {
+    if (parallel) {
         /* Tell the main loop we need to serialize this insn.  */
         cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
     } else {
@@ -399,7 +400,19 @@ void HELPER(cas2w)(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2)
     env->dregs[Dc2] = deposit32(env->dregs[Dc2], 0, 16, l2);
 }
 
-void HELPER(cas2l)(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2)
+void HELPER(cas2w)(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2)
+{
+    do_cas2w(env, regs, a1, a2, false);
+}
+
+void HELPER(cas2w_parallel)(CPUM68KState *env, uint32_t regs, uint32_t a1,
+                            uint32_t a2)
+{
+    do_cas2w(env, regs, a1, a2, true);
+}
+
+static void do_cas2l(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2,
+                     bool parallel)
 {
     uint32_t Dc1 = extract32(regs, 9, 3);
     uint32_t Dc2 = extract32(regs, 6, 3);
@@ -416,7 +429,7 @@ void HELPER(cas2l)(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2)
     TCGMemOpIdx oi;
 #endif
 
-    if (parallel_cpus) {
+    if (parallel) {
         /* We're executing in a parallel context -- must be atomic.  */
 #ifdef CONFIG_ATOMIC64
         uint64_t c, u, l;
@@ -470,6 +483,17 @@ void HELPER(cas2l)(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2)
     env->dregs[Dc2] = l2;
 }
 
+void HELPER(cas2l)(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2)
+{
+    do_cas2l(env, regs, a1, a2, false);
+}
+
+void HELPER(cas2l_parallel)(CPUM68KState *env, uint32_t regs, uint32_t a1,
+                            uint32_t a2)
+{
+    do_cas2l(env, regs, a1, a2, true);
+}
+
 struct bf_data {
     uint32_t addr;
     uint32_t bofs;
diff --git a/target/m68k/translate.c b/target/m68k/translate.c
index 3a519b7..5cfa25f 100644
--- a/target/m68k/translate.c
+++ b/target/m68k/translate.c
@@ -2308,7 +2308,11 @@ DISAS_INSN(cas2w)
                          (REG(ext1, 6) << 3) |
                          (REG(ext2, 0) << 6) |
                          (REG(ext1, 0) << 9));
-    gen_helper_cas2w(cpu_env, regs, addr1, addr2);
+    if (s->tb->cflags & CF_PARALLEL) {
+        gen_helper_cas2w_parallel(cpu_env, regs, addr1, addr2);
+    } else {
+        gen_helper_cas2w(cpu_env, regs, addr1, addr2);
+    }
     tcg_temp_free(regs);
 
     /* Note that cas2w also assigned to env->cc_op.  */
@@ -2354,7 +2358,11 @@ DISAS_INSN(cas2l)
                          (REG(ext1, 6) << 3) |
                          (REG(ext2, 0) << 6) |
                          (REG(ext1, 0) << 9));
-    gen_helper_cas2l(cpu_env, regs, addr1, addr2);
+    if (s->tb->cflags & CF_PARALLEL) {
+        gen_helper_cas2l_parallel(cpu_env, regs, addr1, addr2);
+    } else {
+        gen_helper_cas2l(cpu_env, regs, addr1, addr2);
+    }
     tcg_temp_free(regs);
 
     /* Note that cas2l also assigned to env->cc_op.  */
-- 
2.7.4

* [Qemu-devel] [PATCH v2 19/45] target/s390x: check CF_PARALLEL instead of parallel_cpus
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (17 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 18/45] target/m68k: " Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-17 23:53   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 20/45] target/sparc: " Emilio G. Cota
                   ` (25 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Thereby decoupling the resulting translated code from the current state
of the system.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/s390x/helper.h     |  3 +++
 target/s390x/mem_helper.c | 50 +++++++++++++++++++++++++++++++++++++++--------
 target/s390x/translate.c  | 20 +++++++++++++++----
 3 files changed, 61 insertions(+), 12 deletions(-)

diff --git a/target/s390x/helper.h b/target/s390x/helper.h
index 964097b..db697d9 100644
--- a/target/s390x/helper.h
+++ b/target/s390x/helper.h
@@ -33,6 +33,7 @@ DEF_HELPER_3(celgb, i64, env, i64, i32)
 DEF_HELPER_3(cdlgb, i64, env, i64, i32)
 DEF_HELPER_3(cxlgb, i64, env, i64, i32)
 DEF_HELPER_4(cdsg, void, env, i64, i32, i32)
+DEF_HELPER_4(cdsg_parallel, void, env, i64, i32, i32)
 DEF_HELPER_FLAGS_3(aeb, TCG_CALL_NO_WG, i64, env, i64, i64)
 DEF_HELPER_FLAGS_3(adb, TCG_CALL_NO_WG, i64, env, i64, i64)
 DEF_HELPER_FLAGS_5(axb, TCG_CALL_NO_WG, i64, env, i64, i64, i64, i64)
@@ -104,7 +105,9 @@ DEF_HELPER_FLAGS_1(popcnt, TCG_CALL_NO_RWG_SE, i64, i64)
 DEF_HELPER_FLAGS_1(stfl, TCG_CALL_NO_RWG, void, env)
 DEF_HELPER_2(stfle, i32, env, i64)
 DEF_HELPER_FLAGS_2(lpq, TCG_CALL_NO_WG, i64, env, i64)
+DEF_HELPER_FLAGS_2(lpq_parallel, TCG_CALL_NO_WG, i64, env, i64)
 DEF_HELPER_FLAGS_4(stpq, TCG_CALL_NO_WG, void, env, i64, i64, i64)
+DEF_HELPER_FLAGS_4(stpq_parallel, TCG_CALL_NO_WG, void, env, i64, i64, i64)
 DEF_HELPER_4(mvcos, i32, env, i64, i64, i64)
 
 #ifndef CONFIG_USER_ONLY
diff --git a/target/s390x/mem_helper.c b/target/s390x/mem_helper.c
index ede8471..2d9cdb1 100644
--- a/target/s390x/mem_helper.c
+++ b/target/s390x/mem_helper.c
@@ -1312,8 +1312,8 @@ uint32_t HELPER(trXX)(CPUS390XState *env, uint32_t r1, uint32_t r2,
     return cc;
 }
 
-void HELPER(cdsg)(CPUS390XState *env, uint64_t addr,
-                  uint32_t r1, uint32_t r3)
+static void do_cdsg(CPUS390XState *env, uint64_t addr,
+                    uint32_t r1, uint32_t r3, bool parallel)
 {
     uintptr_t ra = GETPC();
     Int128 cmpv = int128_make128(env->regs[r1 + 1], env->regs[r1]);
@@ -1321,7 +1321,7 @@ void HELPER(cdsg)(CPUS390XState *env, uint64_t addr,
     Int128 oldv;
     bool fail;
 
-    if (parallel_cpus) {
+    if (parallel) {
 #ifndef CONFIG_ATOMIC128
         cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
 #else
@@ -1353,6 +1353,18 @@ void HELPER(cdsg)(CPUS390XState *env, uint64_t addr,
     env->regs[r1 + 1] = int128_getlo(oldv);
 }
 
+void HELPER(cdsg)(CPUS390XState *env, uint64_t addr,
+                  uint32_t r1, uint32_t r3)
+{
+    do_cdsg(env, addr, r1, r3, false);
+}
+
+void HELPER(cdsg_parallel)(CPUS390XState *env, uint64_t addr,
+                           uint32_t r1, uint32_t r3)
+{
+    do_cdsg(env, addr, r1, r3, true);
+}
+
 #if !defined(CONFIG_USER_ONLY)
 void HELPER(lctlg)(CPUS390XState *env, uint32_t r1, uint64_t a2, uint32_t r3)
 {
@@ -1795,12 +1807,12 @@ uint64_t HELPER(lra)(CPUS390XState *env, uint64_t addr)
 #endif
 
 /* load pair from quadword */
-uint64_t HELPER(lpq)(CPUS390XState *env, uint64_t addr)
+static uint64_t do_lpq(CPUS390XState *env, uint64_t addr, bool parallel)
 {
     uintptr_t ra = GETPC();
     uint64_t hi, lo;
 
-    if (parallel_cpus) {
+    if (parallel) {
 #ifndef CONFIG_ATOMIC128
         cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
 #else
@@ -1821,13 +1833,23 @@ uint64_t HELPER(lpq)(CPUS390XState *env, uint64_t addr)
     return hi;
 }
 
+uint64_t HELPER(lpq)(CPUS390XState *env, uint64_t addr)
+{
+    return do_lpq(env, addr, false);
+}
+
+uint64_t HELPER(lpq_parallel)(CPUS390XState *env, uint64_t addr)
+{
+    return do_lpq(env, addr, true);
+}
+
 /* store pair to quadword */
-void HELPER(stpq)(CPUS390XState *env, uint64_t addr,
-                  uint64_t low, uint64_t high)
+static void do_stpq(CPUS390XState *env, uint64_t addr,
+                    uint64_t low, uint64_t high, bool parallel)
 {
     uintptr_t ra = GETPC();
 
-    if (parallel_cpus) {
+    if (parallel) {
 #ifndef CONFIG_ATOMIC128
         cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
 #else
@@ -1845,6 +1867,18 @@ void HELPER(stpq)(CPUS390XState *env, uint64_t addr,
     }
 }
 
+void HELPER(stpq)(CPUS390XState *env, uint64_t addr,
+                  uint64_t low, uint64_t high)
+{
+    do_stpq(env, addr, low, high, false);
+}
+
+void HELPER(stpq_parallel)(CPUS390XState *env, uint64_t addr,
+                           uint64_t low, uint64_t high)
+{
+    do_stpq(env, addr, low, high, true);
+}
+
 /* Execute instruction.  This instruction executes an insn modified with
    the contents of r1.  It does not change the executed instruction in memory;
    it does not change the program counter.
diff --git a/target/s390x/translate.c b/target/s390x/translate.c
index b503c2c..6535f6c 100644
--- a/target/s390x/translate.c
+++ b/target/s390x/translate.c
@@ -2024,7 +2024,11 @@ static ExitStatus op_cdsg(DisasContext *s, DisasOps *o)
     addr = get_address(s, 0, b2, d2);
     t_r1 = tcg_const_i32(r1);
     t_r3 = tcg_const_i32(r3);
-    gen_helper_cdsg(cpu_env, addr, t_r1, t_r3);
+    if (s->tb->cflags & CF_PARALLEL) {
+        gen_helper_cdsg_parallel(cpu_env, addr, t_r1, t_r3);
+    } else {
+        gen_helper_cdsg(cpu_env, addr, t_r1, t_r3);
+    }
     tcg_temp_free_i64(addr);
     tcg_temp_free_i32(t_r1);
     tcg_temp_free_i32(t_r3);
@@ -2881,7 +2885,7 @@ static ExitStatus op_lpd(DisasContext *s, DisasOps *o)
     TCGMemOp mop = s->insn->data;
 
     /* In a parallel context, stop the world and single step.  */
-    if (parallel_cpus) {
+    if (s->tb->cflags & CF_PARALLEL) {
         potential_page_fault(s);
         gen_exception(EXCP_ATOMIC);
         return EXIT_NORETURN;
@@ -2902,7 +2906,11 @@ static ExitStatus op_lpd(DisasContext *s, DisasOps *o)
 
 static ExitStatus op_lpq(DisasContext *s, DisasOps *o)
 {
-    gen_helper_lpq(o->out, cpu_env, o->in2);
+    if (s->tb->cflags & CF_PARALLEL) {
+        gen_helper_lpq_parallel(o->out, cpu_env, o->in2);
+    } else {
+        gen_helper_lpq(o->out, cpu_env, o->in2);
+    }
     return_low128(o->out2);
     return NO_EXIT;
 }
@@ -4219,7 +4227,11 @@ static ExitStatus op_stmh(DisasContext *s, DisasOps *o)
 
 static ExitStatus op_stpq(DisasContext *s, DisasOps *o)
 {
-    gen_helper_stpq(cpu_env, o->in2, o->out2, o->out);
+    if (s->tb->cflags & CF_PARALLEL) {
+        gen_helper_stpq_parallel(cpu_env, o->in2, o->out2, o->out);
+    } else {
+        gen_helper_stpq(cpu_env, o->in2, o->out2, o->out);
+    }
     return NO_EXIT;
 }
 
-- 
2.7.4

* [Qemu-devel] [PATCH v2 20/45] target/sparc: check CF_PARALLEL instead of parallel_cpus
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (18 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 19/45] target/s390x: " Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-17 23:54   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 21/45] tcg: " Emilio G. Cota
                   ` (24 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Thereby decoupling the resulting translated code from the current state
of the system.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/sparc/translate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/sparc/translate.c b/target/sparc/translate.c
index aa6734d..0274e83 100644
--- a/target/sparc/translate.c
+++ b/target/sparc/translate.c
@@ -2450,7 +2450,7 @@ static void gen_ldstub_asi(DisasContext *dc, TCGv dst, TCGv addr, int insn)
     default:
         /* ??? In theory, this should be raise DAE_invalid_asi.
            But the SS-20 roms do ldstuba [%l0] #ASI_M_CTL, %o1.  */
-        if (parallel_cpus) {
+        if (dc->tb->cflags & CF_PARALLEL) {
             gen_helper_exit_atomic(cpu_env);
         } else {
             TCGv_i32 r_asi = tcg_const_i32(da.asi);
-- 
2.7.4

* [Qemu-devel] [PATCH v2 21/45] tcg: check CF_PARALLEL instead of parallel_cpus
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (19 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 20/45] target/sparc: " Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-17 23:55   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 22/45] cpu-exec: lookup/generate TB outside exclusive region during step_atomic Emilio G. Cota
                   ` (23 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Thereby decoupling the resulting translated code from the current state
of the system.

The tb->cflags field is not passed to the TCG code generation functions,
so we add a bit to TCGContext and store in it, before translating each
TB, whether CF_PARALLEL is set.

Most architectures have <= 32 registers, which results in a 4-byte hole
in TCGContext. Use this hole for the bit we need; use a uint8_t instead
of a bool, since a bool might take more than one byte on some systems.
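
A standalone sketch of the layout argument (field names are invented;
the sizes assume an LP64 host with 8-byte pointers):

    #include <stdint.h>
    #include <stdio.h>

    struct before {
        void *frame;            /* 8 bytes */
        uint32_t nb_regs;       /* 4 bytes -> 4-byte hole follows */
        void *code_ptr;         /* 8-byte alignment forces padding */
    };

    struct after {
        void *frame;
        uint32_t nb_regs;
        uint8_t cf_parallel;    /* lands in the former hole */
        void *code_ptr;
    };

    int main(void)
    {
        /* Both print 24 on LP64: the new field costs no extra space. */
        printf("%zu %zu\n", sizeof(struct before), sizeof(struct after));
        return 0;
    }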

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.h                 |  1 +
 accel/tcg/translate-all.c |  1 +
 tcg/tcg-op.c              | 10 +++++-----
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 96872f8..bd1fdfa 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -656,6 +656,7 @@ struct TCGContext {
     uintptr_t *tb_jmp_target_addr; /* tb->jmp_target_addr if !USE_DIRECT_JUMP */
 
     TCGRegSet reserved_regs;
+    uint8_t cf_parallel; /* whether CF_PARALLEL is set in tb->cflags */
     intptr_t current_frame_offset;
     intptr_t frame_start;
     intptr_t frame_end;
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 483248f..80ac85a 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1275,6 +1275,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tb->flags = flags;
     tb->cflags = cflags;
     tb->trace_vcpu_dstate = *cpu->trace_dstate;
+    tcg_ctx.cf_parallel = !!(cflags & CF_PARALLEL);
 
 #ifdef CONFIG_PROFILER
     tcg_ctx.tb_count1++; /* includes aborted translations because of
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 205d07f..ef420d4 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -150,7 +150,7 @@ void tcg_gen_op6(TCGContext *ctx, TCGOpcode opc, TCGArg a1, TCGArg a2,
 
 void tcg_gen_mb(TCGBar mb_type)
 {
-    if (parallel_cpus) {
+    if (tcg_ctx.cf_parallel) {
         tcg_gen_op1(&tcg_ctx, INDEX_op_mb, mb_type);
     }
 }
@@ -2794,7 +2794,7 @@ void tcg_gen_atomic_cmpxchg_i32(TCGv_i32 retv, TCGv addr, TCGv_i32 cmpv,
 {
     memop = tcg_canonicalize_memop(memop, 0, 0);
 
-    if (!parallel_cpus) {
+    if (!tcg_ctx.cf_parallel) {
         TCGv_i32 t1 = tcg_temp_new_i32();
         TCGv_i32 t2 = tcg_temp_new_i32();
 
@@ -2838,7 +2838,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, TCGv_i64 cmpv,
 {
     memop = tcg_canonicalize_memop(memop, 1, 0);
 
-    if (!parallel_cpus) {
+    if (!tcg_ctx.cf_parallel) {
         TCGv_i64 t1 = tcg_temp_new_i64();
         TCGv_i64 t2 = tcg_temp_new_i64();
 
@@ -3015,7 +3015,7 @@ static void * const table_##NAME[16] = {                                \
 void tcg_gen_atomic_##NAME##_i32                                        \
     (TCGv_i32 ret, TCGv addr, TCGv_i32 val, TCGArg idx, TCGMemOp memop) \
 {                                                                       \
-    if (parallel_cpus) {                                                \
+    if (tcg_ctx.cf_parallel) {                                          \
         do_atomic_op_i32(ret, addr, val, idx, memop, table_##NAME);     \
     } else {                                                            \
         do_nonatomic_op_i32(ret, addr, val, idx, memop, NEW,            \
@@ -3025,7 +3025,7 @@ void tcg_gen_atomic_##NAME##_i32                                        \
 void tcg_gen_atomic_##NAME##_i64                                        \
     (TCGv_i64 ret, TCGv addr, TCGv_i64 val, TCGArg idx, TCGMemOp memop) \
 {                                                                       \
-    if (parallel_cpus) {                                                \
+    if (tcg_ctx.cf_parallel) {                                          \
         do_atomic_op_i64(ret, addr, val, idx, memop, table_##NAME);     \
     } else {                                                            \
         do_nonatomic_op_i64(ret, addr, val, idx, memop, NEW,            \
-- 
2.7.4

* [Qemu-devel] [PATCH v2 22/45] cpu-exec: lookup/generate TB outside exclusive region during step_atomic
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (20 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 21/45] tcg: " Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  0:01   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 23/45] translate-all: define and use DEBUG_TB_FLUSH_GATE Emilio G. Cota
                   ` (22 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Now that all code generation has been converted to check CF_PARALLEL, we
can generate !CF_PARALLEL code without first having to clear
parallel_cpus -- and therefore without having to be in the exclusive
region during cpu_exec_step_atomic.

While at it, merge cpu_exec_step into cpu_exec_step_atomic.
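
A sketch of the resulting control flow -- start_exclusive/end_exclusive
mirror the names used below, but here they are empty stubs and every
other name is a placeholder:

    /* Placeholder types and stubs so the sketch stands alone. */
    typedef struct TB TB;
    static TB *lookup_or_translate(void) { return 0; }
    static void execute(TB *tb) { (void)tb; }
    static void start_exclusive(void) { }
    static void end_exclusive(void) { }

    static void step_atomic_sketch(void)
    {
        /* Translation no longer depends on parallel_cpus, so the
         * (potentially slow) lookup/generation happens outside the
         * exclusive region... */
        TB *tb = lookup_or_translate();

        /* ...and only the actual execution stops the other vCPUs. */
        start_exclusive();
        execute(tb);
        end_exclusive();
    }

    int main(void)
    {
        step_atomic_sketch();
        return 0;
    }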

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cpu-exec.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index efe5c85..23e6d2c 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -226,7 +226,7 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
 }
 #endif
 
-static void cpu_exec_step(CPUState *cpu)
+void cpu_exec_step_atomic(CPUState *cpu)
 {
     CPUClass *cc = CPU_GET_CLASS(cpu);
     TranslationBlock *tb;
@@ -239,16 +239,26 @@ static void cpu_exec_step(CPUState *cpu)
         if (tb == NULL) {
             mmap_lock();
             tb_lock();
-            tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
+            tb = tb_htable_lookup(cpu, pc, cs_base, flags, mask_cf(cflags));
+            if (likely(tb == NULL)) {
+                tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
+            }
             tb_unlock();
             mmap_unlock();
         }
 
+        start_exclusive();
+
+        /* Since we got here, we know that parallel_cpus must be true.  */
+        parallel_cpus = false;
         cc->cpu_exec_enter(cpu);
         /* execute the generated code */
         trace_exec_tb(tb, pc);
         cpu_tb_exec(cpu, tb);
         cc->cpu_exec_exit(cpu);
+        parallel_cpus = true;
+
+        end_exclusive();
     } else {
         /* We may have exited due to another problem here, so we need
          * to reset any tb_locks we may have taken but didn't release.
@@ -262,18 +272,6 @@ static void cpu_exec_step(CPUState *cpu)
     }
 }
 
-void cpu_exec_step_atomic(CPUState *cpu)
-{
-    start_exclusive();
-
-    /* Since we got here, we know that parallel_cpus must be true.  */
-    parallel_cpus = false;
-    cpu_exec_step(cpu);
-    parallel_cpus = true;
-
-    end_exclusive();
-}
-
 struct tb_desc {
     target_ulong pc;
     target_ulong cs_base;
-- 
2.7.4

* [Qemu-devel] [PATCH v2 23/45] translate-all: define and use DEBUG_TB_FLUSH_GATE
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (21 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 22/45] cpu-exec: lookup/generate TB outside exclusive region during step_atomic Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  0:01   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 24/45] exec-all: introduce TB_PAGE_ADDR_FMT Emilio G. Cota
                   ` (21 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This gets rid of some ifdef checks while ensuring that the debug code
is compiled, which prevents bit rot.
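
The gate pattern itself, as a standalone sketch (DEBUG_FOO is an
invented example name):

    #include <stdio.h>

    /* #define DEBUG_FOO */
    #ifdef DEBUG_FOO
    #define DEBUG_FOO_GATE 1
    #else
    #define DEBUG_FOO_GATE 0
    #endif

    int main(void)
    {
        /* The body is always parsed and type-checked (no bit rot),
         * but the compiler removes it as dead code when the gate is 0. */
        if (DEBUG_FOO_GATE) {
            printf("debug: would dump flush statistics here\n");
        }
        return 0;
    }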

Suggested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 80ac85a..c38448c 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -65,6 +65,12 @@
 /* make various TB consistency checks */
 /* #define DEBUG_TB_CHECK */
 
+#ifdef DEBUG_TB_FLUSH
+#define DEBUG_TB_FLUSH_GATE 1
+#else
+#define DEBUG_TB_FLUSH_GATE 0
+#endif
+
 #if !defined(CONFIG_USER_ONLY)
 /* TB consistency checks only implemented for usermode emulation.  */
 #undef DEBUG_TB_CHECK
@@ -899,13 +905,13 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
         goto done;
     }
 
-#if defined(DEBUG_TB_FLUSH)
-    printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
-           (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
-           tcg_ctx.tb_ctx.nb_tbs, tcg_ctx.tb_ctx.nb_tbs > 0 ?
-           ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)) /
-           tcg_ctx.tb_ctx.nb_tbs : 0);
-#endif
+    if (DEBUG_TB_FLUSH_GATE) {
+        printf("qemu: flush code_size=%td nb_tbs=%d avg_tb_size=%td\n",
+               tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
+               tcg_ctx.tb_ctx.nb_tbs, tcg_ctx.tb_ctx.nb_tbs > 0 ?
+               (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) /
+               tcg_ctx.tb_ctx.nb_tbs : 0);
+    }
     if ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)
         > tcg_ctx.code_gen_buffer_size) {
         cpu_abort(cpu, "Internal error: code buffer overflow\n");
-- 
2.7.4

* [Qemu-devel] [PATCH v2 24/45] exec-all: introduce TB_PAGE_ADDR_FMT
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (22 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 23/45] translate-all: define and use DEBUG_TB_FLUSH_GATE Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  0:02   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 25/45] translate-all: define and use DEBUG_TB_INVALIDATE_GATE Emilio G. Cota
                   ` (20 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

And fix the following warning when DEBUG_TB_INVALIDATE is enabled
in translate-all.c:

  CC      mipsn32-linux-user/accel/tcg/translate-all.o
/data/src/qemu/accel/tcg/translate-all.c: In function ‘tb_alloc_page’:
/data/src/qemu/accel/tcg/translate-all.c:1201:16: error: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘tb_page_addr_t {aka unsigned int}’ [-Werror=format=]
         printf("protecting code page: 0x" TARGET_FMT_lx "\n",
                ^
cc1: all warnings being treated as errors
/data/src/qemu/rules.mak:66: recipe for target 'accel/tcg/translate-all.o' failed
make[1]: *** [accel/tcg/translate-all.o] Error 1
Makefile:328: recipe for target 'subdir-mipsn32-linux-user' failed
make: *** [subdir-mipsn32-linux-user] Error 2
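The fix follows the usual width-matched format-macro idiom; as a
standalone sketch, with page_addr_t and PAGE_ADDR_FMT invented as
stand-ins for tb_page_addr_t and TB_PAGE_ADDR_FMT:

    #include <inttypes.h>
    #include <stdio.h>

    /* Pick the format that matches however the type is defined,
     * instead of hard-coding a %lx that only fits some targets. */
    typedef uint32_t page_addr_t;
    #define PAGE_ADDR_FMT PRIx32

    int main(void)
    {
        page_addr_t addr = 0xcafe0000u;
        printf("protecting code page: 0x%" PAGE_ADDR_FMT "\n", addr);
        return 0;
    }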

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h   | 2 ++
 accel/tcg/translate-all.c | 3 +--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index b3f04c3..13975af 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -31,8 +31,10 @@
    type.  */
 #if defined(CONFIG_USER_ONLY)
 typedef abi_ulong tb_page_addr_t;
+#define TB_PAGE_ADDR_FMT TARGET_ABI_FMT_lx
 #else
 typedef ram_addr_t tb_page_addr_t;
+#define TB_PAGE_ADDR_FMT RAM_ADDR_FMT
 #endif
 
 /* is_jmp field values */
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index c38448c..c8fa86a 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1198,8 +1198,7 @@ static inline void tb_alloc_page(TranslationBlock *tb,
         mprotect(g2h(page_addr), qemu_host_page_size,
                  (prot & PAGE_BITS) & ~PAGE_WRITE);
 #ifdef DEBUG_TB_INVALIDATE
-        printf("protecting code page: 0x" TARGET_FMT_lx "\n",
-               page_addr);
+        printf("protecting code page: 0x" TB_PAGE_ADDR_FMT "\n", page_addr);
 #endif
     }
 #else
-- 
2.7.4

* [Qemu-devel] [PATCH v2 25/45] translate-all: define and use DEBUG_TB_INVALIDATE_GATE
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (23 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 24/45] exec-all: introduce TB_PAGE_ADDR_FMT Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  0:02   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 26/45] translate-all: define and use DEBUG_TB_CHECK_GATE Emilio G. Cota
                   ` (19 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This gets rid of an ifdef check while ensuring that the debug code
is compiled, which prevents bit rot.
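
The pattern, as a self-contained sketch with illustrative names
(DEBUG_FOO and foo() are not from the patch):

    #include <stdio.h>

    /* #define DEBUG_FOO */

    #ifdef DEBUG_FOO
    #define DEBUG_FOO_GATE 1
    #else
    #define DEBUG_FOO_GATE 0
    #endif

    static void foo(void)
    {
        /* The body is always parsed and type-checked, and the compiler
         * eliminates it as dead code when the gate is 0; an #ifdef'd
         * block would instead silently bit-rot. */
        if (DEBUG_FOO_GATE) {
            printf("debug output\n");
        }
    }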

Suggested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index c8fa86a..aaf3993 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -65,6 +65,12 @@
 /* make various TB consistency checks */
 /* #define DEBUG_TB_CHECK */
 
+#ifdef DEBUG_TB_INVALIDATE
+#define DEBUG_TB_INVALIDATE_GATE 1
+#else
+#define DEBUG_TB_INVALIDATE_GATE 0
+#endif
+
 #ifdef DEBUG_TB_FLUSH
 #define DEBUG_TB_FLUSH_GATE 1
 #else
@@ -1197,9 +1203,9 @@ static inline void tb_alloc_page(TranslationBlock *tb,
           }
         mprotect(g2h(page_addr), qemu_host_page_size,
                  (prot & PAGE_BITS) & ~PAGE_WRITE);
-#ifdef DEBUG_TB_INVALIDATE
-        printf("protecting code page: 0x" TB_PAGE_ADDR_FMT "\n", page_addr);
-#endif
+        if (DEBUG_TB_INVALIDATE_GATE) {
+            printf("protecting code page: 0x" TB_PAGE_ADDR_FMT "\n", page_addr);
+        }
     }
 #else
     /* if some code is already present, then the pages are already
-- 
2.7.4

* [Qemu-devel] [PATCH v2 26/45] translate-all: define and use DEBUG_TB_CHECK_GATE
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (24 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 25/45] translate-all: define and use DEBUG_TB_INVALIDATE_GATE Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  0:03   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 27/45] exec-all: extract tb->tc_* into a separate struct tb_tc Emilio G. Cota
                   ` (18 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This prevents bit rot by ensuring the debug code is compiled when
building a user-mode target.

Unfortunately the helpers are user-mode-only, so we cannot fully
get rid of the ifdef checks. Add a comment to explain this.
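
In sketch form, using the same gate idiom as the previous patches
(names taken from the hunks below):

    /* The helper is compiled on every user-mode build: */
    #ifdef CONFIG_USER_ONLY
    static void tb_page_check(void) { /* consistency checks */ }
    #endif

    /* Callers keep a (now smaller) ifdef, plus the gate, so that the
     * optimizer drops the call when DEBUG_TB_CHECK is not defined: */
    #ifdef CONFIG_USER_ONLY
        if (DEBUG_TB_CHECK_GATE) {
            tb_page_check();
        }
    #endif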

Suggested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index aaf3993..df1ccbf 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -82,6 +82,12 @@
 #undef DEBUG_TB_CHECK
 #endif
 
+#ifdef DEBUG_TB_CHECK
+#define DEBUG_TB_CHECK_GATE 1
+#else
+#define DEBUG_TB_CHECK_GATE 0
+#endif
+
 /* Access to the various translations structures need to be serialised via locks
  * for consistency. This is automatic for SoftMMU based system
  * emulation due to its single threaded nature. In user-mode emulation
@@ -950,7 +956,13 @@ void tb_flush(CPUState *cpu)
     }
 }
 
-#ifdef DEBUG_TB_CHECK
+/*
+ * Formerly ifdef DEBUG_TB_CHECK. These debug functions are user-mode-only,
+ * so in order to prevent bit rot we compile them unconditionally in user-mode,
+ * and let the optimizer get rid of them by wrapping their user-only callers
+ * with if (DEBUG_TB_CHECK_GATE).
+ */
+#ifdef CONFIG_USER_ONLY
 
 static void
 do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
@@ -994,7 +1006,7 @@ static void tb_page_check(void)
     qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
 }
 
-#endif
+#endif /* CONFIG_USER_ONLY */
 
 static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
 {
@@ -1242,8 +1254,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                      tb->trace_vcpu_dstate);
     qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
 
-#ifdef DEBUG_TB_CHECK
-    tb_page_check();
+#ifdef CONFIG_USER_ONLY
+    if (DEBUG_TB_CHECK_GATE) {
+        tb_page_check();
+    }
 #endif
 }
 
@@ -2223,8 +2237,10 @@ int page_unprotect(target_ulong address, uintptr_t pc)
             /* and since the content will be modified, we must invalidate
                the corresponding translated code. */
             current_tb_invalidated |= tb_invalidate_phys_page(addr, pc);
-#ifdef DEBUG_TB_CHECK
-            tb_invalidate_check(addr);
+#ifdef CONFIG_USER_ONLY
+            if (DEBUG_TB_CHECK_GATE) {
+                tb_invalidate_check(addr);
+            }
 #endif
         }
         mprotect((void *)g2h(host_start), qemu_host_page_size,
-- 
2.7.4

* [Qemu-devel] [PATCH v2 27/45] exec-all: extract tb->tc_* into a separate struct tb_tc
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (25 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 26/45] translate-all: define and use DEBUG_TB_CHECK_GATE Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  0:04   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 28/45] translate-all: use a binary search tree to track TBs in TBContext Emilio G. Cota
                   ` (17 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

In preparation for adding tc.size, which will let us keep track of
TB's using the binary search tree implementation from glib.
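
In condensed form (a sketch of the hunks below):

    struct tb_tc {
        void *ptr;       /* pointer to the translated code */
        uint8_t *search; /* pointer to search data */
        /* the 'size' field lands here in the next patch */
    };

    /* field accesses change mechanically, e.g.: */
    uintptr_t host_pc = (uintptr_t)tb->tc.ptr;   /* was: tb->tc_ptr */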

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h   | 20 ++++++++++++++------
 accel/tcg/cpu-exec.c      |  6 +++---
 accel/tcg/translate-all.c | 20 ++++++++++----------
 tcg/tcg-runtime.c         |  4 ++--
 tcg/tcg.c                 |  4 ++--
 5 files changed, 31 insertions(+), 23 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 13975af..7356c3e 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -315,6 +315,14 @@ static inline void tlb_flush_by_mmuidx_all_cpus_synced(CPUState *cpu,
 #define USE_DIRECT_JUMP
 #endif
 
+/*
+ * Translation Cache-related fields of a TB.
+ */
+struct tb_tc {
+    void *ptr;    /* pointer to the translated code */
+    uint8_t *search;  /* pointer to search data */
+};
+
 struct TranslationBlock {
     target_ulong pc;   /* simulated PC corresponding to this block (EIP + CS base) */
     target_ulong cs_base; /* CS base for this block */
@@ -334,8 +342,8 @@ struct TranslationBlock {
     /* Per-vCPU dynamic tracing state used to generate this TB */
     uint32_t trace_vcpu_dstate;
 
-    void *tc_ptr;    /* pointer to the translated code */
-    uint8_t *tc_search;  /* pointer to search data */
+    struct tb_tc tc;
+
     /* original tb when cflags has CF_NOCACHE */
     struct TranslationBlock *orig_tb;
     /* first and second physical page containing code. The lower bit
@@ -442,7 +450,7 @@ static inline void tb_set_jmp_target(TranslationBlock *tb,
                                      int n, uintptr_t addr)
 {
     uint16_t offset = tb->jmp_insn_offset[n];
-    tb_set_jmp_target1((uintptr_t)(tb->tc_ptr + offset), addr);
+    tb_set_jmp_target1((uintptr_t)(tb->tc.ptr + offset), addr);
 }
 
 #else
@@ -469,11 +477,11 @@ static inline void tb_add_jump(TranslationBlock *tb, int n,
     qemu_log_mask_and_addr(CPU_LOG_EXEC, tb->pc,
                            "Linking TBs %p [" TARGET_FMT_lx
                            "] index %d -> %p [" TARGET_FMT_lx "]\n",
-                           tb->tc_ptr, tb->pc, n,
-                           tb_next->tc_ptr, tb_next->pc);
+                           tb->tc.ptr, tb->pc, n,
+                           tb_next->tc.ptr, tb_next->pc);
 
     /* patch the native jump address */
-    tb_set_jmp_target(tb, n, (uintptr_t)tb_next->tc_ptr);
+    tb_set_jmp_target(tb, n, (uintptr_t)tb_next->tc.ptr);
 
     /* add in TB jmp circular list */
     tb->jmp_list_next[n] = tb_next->jmp_list_first;
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 23e6d2c..ba36f83 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -143,11 +143,11 @@ static inline tcg_target_ulong cpu_tb_exec(CPUState *cpu, TranslationBlock *itb)
     uintptr_t ret;
     TranslationBlock *last_tb;
     int tb_exit;
-    uint8_t *tb_ptr = itb->tc_ptr;
+    uint8_t *tb_ptr = itb->tc.ptr;
 
     qemu_log_mask_and_addr(CPU_LOG_EXEC, itb->pc,
                            "Trace %p [%d: " TARGET_FMT_lx "] %s\n",
-                           itb->tc_ptr, cpu->cpu_index, itb->pc,
+                           itb->tc.ptr, cpu->cpu_index, itb->pc,
                            lookup_symbol(itb->pc));
 
 #if defined(DEBUG_DISAS)
@@ -179,7 +179,7 @@ static inline tcg_target_ulong cpu_tb_exec(CPUState *cpu, TranslationBlock *itb)
         qemu_log_mask_and_addr(CPU_LOG_EXEC, last_tb->pc,
                                "Stopped execution of TB chain before %p ["
                                TARGET_FMT_lx "] %s\n",
-                               last_tb->tc_ptr, last_tb->pc,
+                               last_tb->tc.ptr, last_tb->pc,
                                lookup_symbol(last_tb->pc));
         if (cc->synchronize_from_tb) {
             cc->synchronize_from_tb(cpu, last_tb);
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index df1ccbf..cfef6da 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -260,7 +260,7 @@ static target_long decode_sleb128(uint8_t **pp)
    which comes from the host pc of the end of the code implementing the insn.
 
    Each line of the table is encoded as sleb128 deltas from the previous
-   line.  The seed for the first line is { tb->pc, 0..., tb->tc_ptr }.
+   line.  The seed for the first line is { tb->pc, 0..., tb->tc.ptr }.
    That is, the first column is seeded with the guest pc, the last column
    with the host pc, and the middle columns with zeros.  */
 
@@ -270,7 +270,7 @@ static int encode_search(TranslationBlock *tb, uint8_t *block)
     uint8_t *p = block;
     int i, j, n;
 
-    tb->tc_search = block;
+    tb->tc.search = block;
 
     for (i = 0, n = tb->icount; i < n; ++i) {
         target_ulong prev;
@@ -305,9 +305,9 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
                                      uintptr_t searched_pc)
 {
     target_ulong data[TARGET_INSN_START_WORDS] = { tb->pc };
-    uintptr_t host_pc = (uintptr_t)tb->tc_ptr;
+    uintptr_t host_pc = (uintptr_t)tb->tc.ptr;
     CPUArchState *env = cpu->env_ptr;
-    uint8_t *p = tb->tc_search;
+    uint8_t *p = tb->tc.search;
     int i, j, num_insns = tb->icount;
 #ifdef CONFIG_PROFILER
     int64_t ti = profile_getclock();
@@ -858,7 +858,7 @@ void tb_free(TranslationBlock *tb)
             tb == tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) {
         size_t struct_size = ROUND_UP(sizeof(*tb), qemu_icache_linesize);
 
-        tcg_ctx.code_gen_ptr = tb->tc_ptr - struct_size;
+        tcg_ctx.code_gen_ptr = tb->tc.ptr - struct_size;
         tcg_ctx.tb_ctx.nb_tbs--;
     }
 }
@@ -1059,7 +1059,7 @@ static inline void tb_remove_from_jmp_list(TranslationBlock *tb, int n)
    another TB */
 static inline void tb_reset_jump(TranslationBlock *tb, int n)
 {
-    uintptr_t addr = (uintptr_t)(tb->tc_ptr + tb->jmp_reset_offset[n]);
+    uintptr_t addr = (uintptr_t)(tb->tc.ptr + tb->jmp_reset_offset[n]);
     tb_set_jmp_target(tb, n, addr);
 }
 
@@ -1294,7 +1294,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     }
 
     gen_code_buf = tcg_ctx.code_gen_ptr;
-    tb->tc_ptr = gen_code_buf;
+    tb->tc.ptr = gen_code_buf;
     tb->pc = pc;
     tb->cs_base = cs_base;
     tb->flags = flags;
@@ -1314,7 +1314,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     gen_intermediate_code(env, tb);
     tcg_ctx.cpu = NULL;
 
-    trace_translate_block(tb, tb->pc, tb->tc_ptr);
+    trace_translate_block(tb, tb->pc, tb->tc.ptr);
 
     /* generate machine code */
     tb->jmp_reset_offset[0] = TB_JMP_RESET_OFFSET_INVALID;
@@ -1360,7 +1360,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
         qemu_log_in_addr_range(tb->pc)) {
         qemu_log_lock();
         qemu_log("OUT: [size=%d]\n", gen_code_size);
-        log_disas(tb->tc_ptr, gen_code_size);
+        log_disas(tb->tc.ptr, gen_code_size);
         qemu_log("\n");
         qemu_log_flush();
         qemu_log_unlock();
@@ -1696,7 +1696,7 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
     while (m_min <= m_max) {
         m = (m_min + m_max) >> 1;
         tb = tcg_ctx.tb_ctx.tbs[m];
-        v = (uintptr_t)tb->tc_ptr;
+        v = (uintptr_t)tb->tc.ptr;
         if (v == tc_ptr) {
             return tb;
         } else if (tc_ptr < v) {
diff --git a/tcg/tcg-runtime.c b/tcg/tcg-runtime.c
index bf6f248..08fe077 100644
--- a/tcg/tcg-runtime.c
+++ b/tcg/tcg-runtime.c
@@ -157,9 +157,9 @@ void *HELPER(lookup_tb_ptr)(CPUArchState *env)
     }
     qemu_log_mask_and_addr(CPU_LOG_EXEC, pc,
                            "Chain %p [%d: " TARGET_FMT_lx "] %s\n",
-                           tb->tc_ptr, cpu->cpu_index, pc,
+                           tb->tc.ptr, cpu->cpu_index, pc,
                            lookup_symbol(pc));
-    return tb->tc_ptr;
+    return tb->tc.ptr;
 }
 
 void HELPER(exit_atomic)(CPUArchState *env)
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 3559829..28c1b94 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -2616,8 +2616,8 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
 
     tcg_reg_alloc_start(s);
 
-    s->code_buf = tb->tc_ptr;
-    s->code_ptr = tb->tc_ptr;
+    s->code_buf = tb->tc.ptr;
+    s->code_ptr = tb->tc.ptr;
 
     tcg_out_tb_init(s);
 
-- 
2.7.4

* [Qemu-devel] [PATCH v2 28/45] translate-all: use a binary search tree to track TBs in TBContext
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (26 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 27/45] exec-all: extract tb->tc_* into a separate struct tb_tc Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  0:05   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 29/45] exec-all: rename tb_free to tb_remove Emilio G. Cota
                   ` (16 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This is a prerequisite for supporting multiple TCG contexts, since
we will have threads generating code in separate regions of
code_gen_buffer.

For this we need a new field (.size) in struct tb_tc to keep
track of the size of the translated code. This field adds a 4-byte
hole to the struct (and therefore to TranslationBlock), but we can
live with that.

The comparison function we use is optimized for the common case:
insertions. Profiling shows that upon booting debian-arm, 98%
of comparisons are between existing tb's (i.e. a->size and b->size
are both !0), which happens during insertions (and removals, but
those are rare). The remaining cases are lookups. From reading the glib
sources we see that the first key is always the lookup key; however,
the code does not rely on this, since the glib docs do not guarantee
that behaviour. We only embed this knowledge in the code as a branch
hint for the compiler.

Note that tb_free does not free space in the code_gen_buffer anymore,
since we cannot easily know whether the tb is the last one inserted
in code_gen_buffer. The next patch in this series renames tb_free
to tb_remove to reflect this.
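
For reference, a condensed sketch of the GTree operations introduced
below (not a complete program; tb_tc_cmp and the tb variables are as
in the patch):

    GTree *tree = g_tree_new(tb_tc_cmp);

    /* insertion: the embedded tb_tc struct is the key */
    g_tree_insert(tree, &tb->tc, tb);

    /* lookup by host PC: a key with .size == 0 matches the TB whose
     * [ptr, ptr + size) range contains the pointer */
    struct tb_tc key = { .ptr = (void *)tc_ptr };
    TranslationBlock *found = g_tree_lookup(tree, &key);

    /* removal */
    g_tree_remove(tree, &tb->tc);

    /* flush: taking a reference first turns destroy into a reset */
    g_tree_ref(tree);
    g_tree_destroy(tree);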

Performance-wise, lookups in tb_find_pc are the same as before:
O(log n). However, insertions are O(log n) instead of O(1), which
results in a small slowdown when booting debian-arm:

Performance counter stats for 'build/arm-softmmu/qemu-system-arm \
	-machine type=virt -nographic -smp 1 -m 4096 \
	-netdev user,id=unet,hostfwd=tcp::2222-:22 \
	-device virtio-net-device,netdev=unet \
	-drive file=img/arm/jessie-arm32.qcow2,id=myblock,index=0,if=none \
	-device virtio-blk-device,drive=myblock \
	-kernel img/arm/aarch32-current-linux-kernel-only.img \
	-append console=ttyAMA0 root=/dev/vda1 \
	-name arm,debug-threads=on -smp 1' (10 runs):

- Before:

       8048.598422      task-clock (msec)         #    0.931 CPUs utilized            ( +-  0.28% )
            16,974      context-switches          #    0.002 M/sec                    ( +-  0.12% )
                 0      cpu-migrations            #    0.000 K/sec
            10,125      page-faults               #    0.001 M/sec                    ( +-  1.23% )
    35,144,901,879      cycles                    #    4.367 GHz                      ( +-  0.14% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    65,758,252,643      instructions              #    1.87  insns per cycle          ( +-  0.33% )
    10,871,298,668      branches                  # 1350.707 M/sec                    ( +-  0.41% )
       192,322,212      branch-misses             #    1.77% of all branches          ( +-  0.32% )

       8.640869419 seconds time elapsed                                          ( +-  0.57% )

- After:
       8146.242027      task-clock (msec)         #    0.923 CPUs utilized            ( +-  1.23% )
            17,016      context-switches          #    0.002 M/sec                    ( +-  0.40% )
                 0      cpu-migrations            #    0.000 K/sec
            18,769      page-faults               #    0.002 M/sec                    ( +-  0.45% )
    35,660,956,120      cycles                    #    4.378 GHz                      ( +-  1.22% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    65,095,366,607      instructions              #    1.83  insns per cycle          ( +-  1.73% )
    10,803,480,261      branches                  # 1326.192 M/sec                    ( +-  1.95% )
       195,601,289      branch-misses             #    1.81% of all branches          ( +-  0.39% )

       8.828660235 seconds time elapsed                                          ( +-  0.38% )

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h   |   5 ++
 include/exec/tb-context.h |   4 +-
 accel/tcg/translate-all.c | 217 ++++++++++++++++++++++++----------------------
 3 files changed, 118 insertions(+), 108 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 7356c3e..c7bf683 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -317,10 +317,15 @@ static inline void tlb_flush_by_mmuidx_all_cpus_synced(CPUState *cpu,
 
 /*
  * Translation Cache-related fields of a TB.
+ * This struct exists just for convenience; we keep track of TB's in a binary
+ * search tree, and the only fields needed to compare TB's in the tree are
+ * @ptr and @size. @search is brought here for consistency, since it is also
+ * a TC-related field.
  */
 struct tb_tc {
     void *ptr;    /* pointer to the translated code */
     uint8_t *search;  /* pointer to search data */
+    unsigned int size;
 };
 
 struct TranslationBlock {
diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
index 25c2afe..1fa8dcc 100644
--- a/include/exec/tb-context.h
+++ b/include/exec/tb-context.h
@@ -31,10 +31,8 @@ typedef struct TBContext TBContext;
 
 struct TBContext {
 
-    TranslationBlock **tbs;
+    GTree *tb_tree;
     struct qht htable;
-    size_t tbs_size;
-    int nb_tbs;
     /* any access to the tbs or the page table must use this lock */
     QemuMutex tb_lock;
 
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index cfef6da..7a01af0 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -776,6 +776,48 @@ static inline void *alloc_code_gen_buffer(void)
 }
 #endif /* USE_STATIC_CODE_GEN_BUFFER, WIN32, POSIX */
 
+/* compare a pointer @ptr and a tb_tc @s */
+static int ptr_cmp_tb_tc(const void *ptr, const struct tb_tc *s)
+{
+    if (ptr >= s->ptr + s->size) {
+        return 1;
+    } else if (ptr < s->ptr) {
+        return -1;
+    }
+    return 0;
+}
+
+static gint tb_tc_cmp(gconstpointer ap, gconstpointer bp)
+{
+    const struct tb_tc *a = ap;
+    const struct tb_tc *b = bp;
+
+    /*
+     * When both sizes are set, we know this isn't a lookup and therefore
+     * the two buffers are non-overlapping.
+     * This is the most likely case: every TB must be inserted; lookups
+     * are a lot less frequent.
+     */
+    if (likely(a->size && b->size)) {
+        /* a->ptr == b->ptr would mean the buffers overlap */
+        g_assert(a->ptr != b->ptr);
+
+        if (a->ptr > b->ptr) {
+            return 1;
+        }
+        return -1;
+    }
+    /*
+     * In all lookups, one of the two .size fields is 0.
+     * From the glib sources we see that @ap is always the lookup key; however,
+     * the docs provide no guarantee, so we just mark this case as likely.
+     */
+    if (likely(a->size == 0)) {
+        return ptr_cmp_tb_tc(a->ptr, b);
+    }
+    return ptr_cmp_tb_tc(b->ptr, a);
+}
+
 static inline void code_gen_alloc(size_t tb_size)
 {
     tcg_ctx.code_gen_buffer_size = size_code_gen_buffer(tb_size);
@@ -784,15 +826,7 @@ static inline void code_gen_alloc(size_t tb_size)
         fprintf(stderr, "Could not allocate dynamic translator buffer\n");
         exit(1);
     }
-
-    /* size this conservatively -- realloc later if needed */
-    tcg_ctx.tb_ctx.tbs_size =
-        tcg_ctx.code_gen_buffer_size / CODE_GEN_AVG_BLOCK_SIZE / 8;
-    if (unlikely(!tcg_ctx.tb_ctx.tbs_size)) {
-        tcg_ctx.tb_ctx.tbs_size = 64 * 1024;
-    }
-    tcg_ctx.tb_ctx.tbs = g_new(TranslationBlock *, tcg_ctx.tb_ctx.tbs_size);
-
+    tcg_ctx.tb_ctx.tb_tree = g_tree_new(tb_tc_cmp);
     qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
 }
 
@@ -829,7 +863,6 @@ void tcg_exec_init(unsigned long tb_size)
 static TranslationBlock *tb_alloc(target_ulong pc)
 {
     TranslationBlock *tb;
-    TBContext *ctx;
 
     assert_tb_locked();
 
@@ -837,12 +870,6 @@ static TranslationBlock *tb_alloc(target_ulong pc)
     if (unlikely(tb == NULL)) {
         return NULL;
     }
-    ctx = &tcg_ctx.tb_ctx;
-    if (unlikely(ctx->nb_tbs == ctx->tbs_size)) {
-        ctx->tbs_size *= 2;
-        ctx->tbs = g_renew(TranslationBlock *, ctx->tbs, ctx->tbs_size);
-    }
-    ctx->tbs[ctx->nb_tbs++] = tb;
     return tb;
 }
 
@@ -851,16 +878,7 @@ void tb_free(TranslationBlock *tb)
 {
     assert_tb_locked();
 
-    /* In practice this is mostly used for single use temporary TB
-       Ignore the hard cases and just back up if this TB happens to
-       be the last one generated.  */
-    if (tcg_ctx.tb_ctx.nb_tbs > 0 &&
-            tb == tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) {
-        size_t struct_size = ROUND_UP(sizeof(*tb), qemu_icache_linesize);
-
-        tcg_ctx.code_gen_ptr = tb->tc.ptr - struct_size;
-        tcg_ctx.tb_ctx.nb_tbs--;
-    }
+    g_tree_remove(tcg_ctx.tb_ctx.tb_tree, &tb->tc);
 }
 
 static inline void invalidate_page_bitmap(PageDesc *p)
@@ -918,11 +936,12 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     }
 
     if (DEBUG_TB_FLUSH_GATE) {
-        printf("qemu: flush code_size=%td nb_tbs=%d avg_tb_size=%td\n",
-               tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
-               tcg_ctx.tb_ctx.nb_tbs, tcg_ctx.tb_ctx.nb_tbs > 0 ?
-               (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) /
-               tcg_ctx.tb_ctx.nb_tbs : 0);
+        size_t nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
+
+        printf("qemu: flush code_size=%td nb_tbs=%zu avg_tb_size=%td\n",
+               tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer, nb_tbs,
+               nb_tbs > 0 ?
+               (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) / nb_tbs : 0);
     }
     if ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)
         > tcg_ctx.code_gen_buffer_size) {
@@ -933,7 +952,10 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
         cpu_tb_jmp_cache_clear(cpu);
     }
 
-    tcg_ctx.tb_ctx.nb_tbs = 0;
+    /* Increment the refcount first so that destroy acts as a reset */
+    g_tree_ref(tcg_ctx.tb_ctx.tb_tree);
+    g_tree_destroy(tcg_ctx.tb_ctx.tb_tree);
+
     qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
@@ -1347,6 +1369,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     if (unlikely(search_size < 0)) {
         goto buffer_overflow;
     }
+    tb->tc.size = gen_code_size;
 
 #ifdef CONFIG_PROFILER
     tcg_ctx.code_time += profile_getclock() - ti;
@@ -1397,6 +1420,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
      * through the physical hash table and physical page list.
      */
     tb_link_page(tb, phys_pc, phys_page2);
+    g_tree_insert(tcg_ctx.tb_ctx.tb_tree, &tb->tc, tb);
     return tb;
 }
 
@@ -1675,37 +1699,16 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
 }
 #endif
 
-/* find the TB 'tb' such that tb[0].tc_ptr <= tc_ptr <
-   tb[1].tc_ptr. Return NULL if not found */
+/*
+ * Find the TB 'tb' such that
+ * tb->tc.ptr <= tc_ptr < tb->tc.ptr + tb->tc.size
+ * Return NULL if not found.
+ */
 static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
 {
-    int m_min, m_max, m;
-    uintptr_t v;
-    TranslationBlock *tb;
+    struct tb_tc s = { .ptr = (void *)tc_ptr };
 
-    if (tcg_ctx.tb_ctx.nb_tbs <= 0) {
-        return NULL;
-    }
-    if (tc_ptr < (uintptr_t)tcg_ctx.code_gen_buffer ||
-        tc_ptr >= (uintptr_t)tcg_ctx.code_gen_ptr) {
-        return NULL;
-    }
-    /* binary search (cf Knuth) */
-    m_min = 0;
-    m_max = tcg_ctx.tb_ctx.nb_tbs - 1;
-    while (m_min <= m_max) {
-        m = (m_min + m_max) >> 1;
-        tb = tcg_ctx.tb_ctx.tbs[m];
-        v = (uintptr_t)tb->tc.ptr;
-        if (v == tc_ptr) {
-            return tb;
-        } else if (tc_ptr < v) {
-            m_max = m - 1;
-        } else {
-            m_min = m + 1;
-        }
-    }
-    return tcg_ctx.tb_ctx.tbs[m_max];
+    return g_tree_lookup(tcg_ctx.tb_ctx.tb_tree, &s);
 }
 
 #if !defined(CONFIG_USER_ONLY)
@@ -1893,63 +1896,67 @@ static void print_qht_statistics(FILE *f, fprintf_function cpu_fprintf,
     g_free(hgram);
 }
 
+struct tb_tree_stats {
+    size_t target_size;
+    size_t max_target_size;
+    size_t direct_jmp_count;
+    size_t direct_jmp2_count;
+    size_t cross_page;
+};
+
+static gboolean tb_tree_stats_iter(gpointer key, gpointer value, gpointer data)
+{
+    const TranslationBlock *tb = value;
+    struct tb_tree_stats *tst = data;
+
+    tst->target_size += tb->size;
+    if (tb->size > tst->max_target_size) {
+        tst->max_target_size = tb->size;
+    }
+    if (tb->page_addr[1] != -1) {
+        tst->cross_page++;
+    }
+    if (tb->jmp_reset_offset[0] != TB_JMP_RESET_OFFSET_INVALID) {
+        tst->direct_jmp_count++;
+        if (tb->jmp_reset_offset[1] != TB_JMP_RESET_OFFSET_INVALID) {
+            tst->direct_jmp2_count++;
+        }
+    }
+    return false;
+}
+
 void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
 {
-    int i, target_code_size, max_target_code_size;
-    int direct_jmp_count, direct_jmp2_count, cross_page;
-    TranslationBlock *tb;
+    struct tb_tree_stats tst = {};
     struct qht_stats hst;
+    size_t nb_tbs;
 
     tb_lock();
 
-    target_code_size = 0;
-    max_target_code_size = 0;
-    cross_page = 0;
-    direct_jmp_count = 0;
-    direct_jmp2_count = 0;
-    for (i = 0; i < tcg_ctx.tb_ctx.nb_tbs; i++) {
-        tb = tcg_ctx.tb_ctx.tbs[i];
-        target_code_size += tb->size;
-        if (tb->size > max_target_code_size) {
-            max_target_code_size = tb->size;
-        }
-        if (tb->page_addr[1] != -1) {
-            cross_page++;
-        }
-        if (tb->jmp_reset_offset[0] != TB_JMP_RESET_OFFSET_INVALID) {
-            direct_jmp_count++;
-            if (tb->jmp_reset_offset[1] != TB_JMP_RESET_OFFSET_INVALID) {
-                direct_jmp2_count++;
-            }
-        }
-    }
+    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
+    g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
     /* XXX: avoid using doubles ? */
     cpu_fprintf(f, "Translation buffer state:\n");
     cpu_fprintf(f, "gen code size       %td/%zd\n",
                 tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
                 tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer);
-    cpu_fprintf(f, "TB count            %d\n", tcg_ctx.tb_ctx.nb_tbs);
-    cpu_fprintf(f, "TB avg target size  %d max=%d bytes\n",
-            tcg_ctx.tb_ctx.nb_tbs ? target_code_size /
-                    tcg_ctx.tb_ctx.nb_tbs : 0,
-            max_target_code_size);
+    cpu_fprintf(f, "TB count            %zu\n", nb_tbs);
+    cpu_fprintf(f, "TB avg target size  %zu max=%zu bytes\n",
+                nb_tbs ? tst.target_size / nb_tbs : 0,
+                tst.max_target_size);
     cpu_fprintf(f, "TB avg host size    %td bytes (expansion ratio: %0.1f)\n",
-            tcg_ctx.tb_ctx.nb_tbs ? (tcg_ctx.code_gen_ptr -
-                                     tcg_ctx.code_gen_buffer) /
-                                     tcg_ctx.tb_ctx.nb_tbs : 0,
-                target_code_size ? (double) (tcg_ctx.code_gen_ptr -
-                                             tcg_ctx.code_gen_buffer) /
-                                             target_code_size : 0);
-    cpu_fprintf(f, "cross page TB count %d (%d%%)\n", cross_page,
-            tcg_ctx.tb_ctx.nb_tbs ? (cross_page * 100) /
-                                    tcg_ctx.tb_ctx.nb_tbs : 0);
-    cpu_fprintf(f, "direct jump count   %d (%d%%) (2 jumps=%d %d%%)\n",
-                direct_jmp_count,
-                tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp_count * 100) /
-                        tcg_ctx.tb_ctx.nb_tbs : 0,
-                direct_jmp2_count,
-                tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp2_count * 100) /
-                        tcg_ctx.tb_ctx.nb_tbs : 0);
+                nb_tbs ? (tcg_ctx.code_gen_ptr -
+                          tcg_ctx.code_gen_buffer) / nb_tbs : 0,
+                tst.target_size ? (double) (tcg_ctx.code_gen_ptr -
+                                            tcg_ctx.code_gen_buffer) /
+                                            tst.target_size : 0);
+    cpu_fprintf(f, "cross page TB count %zu (%zu%%)\n", tst.cross_page,
+            nb_tbs ? (tst.cross_page * 100) / nb_tbs : 0);
+    cpu_fprintf(f, "direct jump count   %zu (%zu%%) (2 jumps=%zu %zu%%)\n",
+                tst.direct_jmp_count,
+                nb_tbs ? (tst.direct_jmp_count * 100) / nb_tbs : 0,
+                tst.direct_jmp2_count,
+                nb_tbs ? (tst.direct_jmp2_count * 100) / nb_tbs : 0);
 
     qht_statistics_init(&tcg_ctx.tb_ctx.htable, &hst);
     print_qht_statistics(f, cpu_fprintf, hst);
-- 
2.7.4

* [Qemu-devel] [PATCH v2 29/45] exec-all: rename tb_free to tb_remove
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (27 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 28/45] translate-all: use a binary search tree to track TBs in TBContext Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  0:05   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 30/45] translate-all: report correct avg host TB size Emilio G. Cota
                   ` (15 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

We don't really free anything in this function anymore; we just remove
the TB from the binary search tree.

Suggested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h   | 2 +-
 accel/tcg/cpu-exec.c      | 2 +-
 accel/tcg/translate-all.c | 6 +++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index c7bf683..37487d7 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -408,7 +408,7 @@ static inline uint32_t curr_cf_mask(void)
     return val;
 }
 
-void tb_free(TranslationBlock *tb);
+void tb_remove(TranslationBlock *tb);
 void tb_flush(CPUState *cpu);
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
 TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index ba36f83..604fee2 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -221,7 +221,7 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
 
     tb_lock();
     tb_phys_invalidate(tb, -1);
-    tb_free(tb);
+    tb_remove(tb);
     tb_unlock();
 }
 #endif
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 7a01af0..7c6e401 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -375,7 +375,7 @@ bool cpu_restore_state(CPUState *cpu, uintptr_t retaddr)
         if (tb->cflags & CF_NOCACHE) {
             /* one-shot translation, invalidate it immediately */
             tb_phys_invalidate(tb, -1);
-            tb_free(tb);
+            tb_remove(tb);
         }
         r = true;
     }
@@ -874,7 +874,7 @@ static TranslationBlock *tb_alloc(target_ulong pc)
 }
 
 /* Called with tb_lock held.  */
-void tb_free(TranslationBlock *tb)
+void tb_remove(TranslationBlock *tb)
 {
     assert_tb_locked();
 
@@ -1823,7 +1823,7 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
              * cpu_exec_nocache() */
             tb_phys_invalidate(tb->orig_tb, -1);
         }
-        tb_free(tb);
+        tb_remove(tb);
     }
     /* FIXME: In theory this could raise an exception.  In practice
        we have already translated the block once so it's probably ok.  */
-- 
2.7.4

* [Qemu-devel] [PATCH v2 30/45] translate-all: report correct avg host TB size
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (28 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 29/45] exec-all: rename tb_free to tb_remove Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  0:06   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 31/45] tci: move tci_regs to tcg_qemu_tb_exec's stack Emilio G. Cota
                   ` (14 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Since commit 6e3b2bfd6 ("tcg: allocate TB structs before the
corresponding translated code") we are not fully utilizing
code_gen_buffer for translated code, and therefore are
incorrectly reporting the amount of translated code as well as
the average host TB size. Address this by:

- Making the conscious choice of misreporting the total translated code;
  doing otherwise would mislead users into thinking "-tb-size" is not
  honoured.

- Expanding tb_tree_stats to accurately count the bytes of translated code on
  the host, and using this for reporting the average tb host size,
  as well as the expansion ratio.

In the future we might want to consider reporting the accurate numbers for
the total translated code, together with a "bookkeeping/overhead" field to
account for the TB structs.
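
As a worked example of the corrected reporting (numbers are made up):
1000 TBs totalling 64 KiB of guest code that expanded into 512 KiB of
host code (the sum of tb->tc.size) are reported as an avg host size of
524 bytes and an expansion ratio of 8.0, independently of the padding
and TB structs interleaved in code_gen_buffer:

    avg_host_size   = nb_tbs ? tst.host_size / nb_tbs : 0;
    expansion_ratio = tst.target_size
                      ? (double)tst.host_size / tst.target_size : 0;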

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 32 +++++++++++++++++++++++---------
 1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 7c6e401..b655931 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -923,6 +923,15 @@ static void page_flush_tb(void)
     }
 }
 
+static gboolean tb_host_size_iter(gpointer key, gpointer value, gpointer data)
+{
+    const TranslationBlock *tb = value;
+    size_t *size = data;
+
+    *size += tb->tc.size;
+    return false;
+}
+
 /* flush all the translation blocks */
 static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
 {
@@ -937,11 +946,12 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
 
     if (DEBUG_TB_FLUSH_GATE) {
         size_t nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
+        size_t host_size = 0;
 
-        printf("qemu: flush code_size=%td nb_tbs=%zu avg_tb_size=%td\n",
+        g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_host_size_iter, &host_size);
+        printf("qemu: flush code_size=%td nb_tbs=%zu avg_tb_size=%zu\n",
                tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer, nb_tbs,
-               nb_tbs > 0 ?
-               (tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer) / nb_tbs : 0);
+               nb_tbs > 0 ? host_size / nb_tbs : 0);
     }
     if ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)
         > tcg_ctx.code_gen_buffer_size) {
@@ -1897,6 +1907,7 @@ static void print_qht_statistics(FILE *f, fprintf_function cpu_fprintf,
 }
 
 struct tb_tree_stats {
+    size_t host_size;
     size_t target_size;
     size_t max_target_size;
     size_t direct_jmp_count;
@@ -1909,6 +1920,7 @@ static gboolean tb_tree_stats_iter(gpointer key, gpointer value, gpointer data)
     const TranslationBlock *tb = value;
     struct tb_tree_stats *tst = data;
 
+    tst->host_size += tb->tc.size;
     tst->target_size += tb->size;
     if (tb->size > tst->max_target_size) {
         tst->max_target_size = tb->size;
@@ -1937,6 +1949,11 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
     /* XXX: avoid using doubles ? */
     cpu_fprintf(f, "Translation buffer state:\n");
+    /*
+     * Report total code size including the padding and TB structs;
+     * otherwise users might think "-tb-size" is not honoured.
+     * For avg host size we use the precise numbers from tb_tree_stats though.
+     */
     cpu_fprintf(f, "gen code size       %td/%zd\n",
                 tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
                 tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer);
@@ -1944,12 +1961,9 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     cpu_fprintf(f, "TB avg target size  %zu max=%zu bytes\n",
                 nb_tbs ? tst.target_size / nb_tbs : 0,
                 tst.max_target_size);
-    cpu_fprintf(f, "TB avg host size    %td bytes (expansion ratio: %0.1f)\n",
-                nb_tbs ? (tcg_ctx.code_gen_ptr -
-                          tcg_ctx.code_gen_buffer) / nb_tbs : 0,
-                tst.target_size ? (double) (tcg_ctx.code_gen_ptr -
-                                            tcg_ctx.code_gen_buffer) /
-                                            tst.target_size : 0);
+    cpu_fprintf(f, "TB avg host size    %zu bytes (expansion ratio: %0.1f)\n",
+                nb_tbs ? tst.host_size / nb_tbs : 0,
+                tst.target_size ? (double)tst.host_size / tst.target_size : 0);
     cpu_fprintf(f, "cross page TB count %zu (%zu%%)\n", tst.cross_page,
             nb_tbs ? (tst.cross_page * 100) / nb_tbs : 0);
     cpu_fprintf(f, "direct jump count   %zu (%zu%%) (2 jumps=%zu %zu%%)\n",
-- 
2.7.4

* [Qemu-devel] [PATCH v2 31/45] tci: move tci_regs to tcg_qemu_tb_exec's stack
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (29 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 30/45] translate-all: report correct avg host TB size Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  0:08   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 32/45] tcg: take tb_ctx out of TCGContext Emilio G. Cota
                   ` (13 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Groundwork for supporting multiple TCG contexts.

Compile-tested for all targets on an x86_64 host.
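
The core of the change, as an abbreviated sketch of the hunks below:

    /* before: one register file shared by all threads */
    static tcg_target_ulong tci_reg[TCG_TARGET_NB_REGS];

    /* after: each invocation owns its register file, and every
     * tci_read/tci_write helper takes it as an explicit parameter,
     * so concurrent vCPU threads no longer share interpreter state */
    uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
    {
        tcg_target_ulong regs[TCG_TARGET_NB_REGS];

        regs[TCG_AREG0] = (tcg_target_ulong)env;
        /* interpreter loop follows, passing 'regs' everywhere */
    }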

Suggested-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tci.c | 552 +++++++++++++++++++++++++++++++-------------------------------
 1 file changed, 279 insertions(+), 273 deletions(-)

diff --git a/tcg/tci.c b/tcg/tci.c
index 4bdc645..f3216c1 100644
--- a/tcg/tci.c
+++ b/tcg/tci.c
@@ -55,93 +55,95 @@ typedef uint64_t (*helper_function)(tcg_target_ulong, tcg_target_ulong,
                                     tcg_target_ulong);
 #endif
 
-static tcg_target_ulong tci_reg[TCG_TARGET_NB_REGS];
-
-static tcg_target_ulong tci_read_reg(TCGReg index)
+static tcg_target_ulong tci_read_reg(const tcg_target_ulong *regs, TCGReg index)
 {
-    tci_assert(index < ARRAY_SIZE(tci_reg));
-    return tci_reg[index];
+    tci_assert(index < TCG_TARGET_NB_REGS);
+    return regs[index];
 }
 
 #if TCG_TARGET_HAS_ext8s_i32 || TCG_TARGET_HAS_ext8s_i64
-static int8_t tci_read_reg8s(TCGReg index)
+static int8_t tci_read_reg8s(const tcg_target_ulong *regs, TCGReg index)
 {
-    return (int8_t)tci_read_reg(index);
+    return (int8_t)tci_read_reg(regs, index);
 }
 #endif
 
 #if TCG_TARGET_HAS_ext16s_i32 || TCG_TARGET_HAS_ext16s_i64
-static int16_t tci_read_reg16s(TCGReg index)
+static int16_t tci_read_reg16s(const tcg_target_ulong *regs, TCGReg index)
 {
-    return (int16_t)tci_read_reg(index);
+    return (int16_t)tci_read_reg(regs, index);
 }
 #endif
 
 #if TCG_TARGET_REG_BITS == 64
-static int32_t tci_read_reg32s(TCGReg index)
+static int32_t tci_read_reg32s(const tcg_target_ulong *regs, TCGReg index)
 {
-    return (int32_t)tci_read_reg(index);
+    return (int32_t)tci_read_reg(regs, index);
 }
 #endif
 
-static uint8_t tci_read_reg8(TCGReg index)
+static uint8_t tci_read_reg8(const tcg_target_ulong *regs, TCGReg index)
 {
-    return (uint8_t)tci_read_reg(index);
+    return (uint8_t)tci_read_reg(regs, index);
 }
 
-static uint16_t tci_read_reg16(TCGReg index)
+static uint16_t tci_read_reg16(const tcg_target_ulong *regs, TCGReg index)
 {
-    return (uint16_t)tci_read_reg(index);
+    return (uint16_t)tci_read_reg(regs, index);
 }
 
-static uint32_t tci_read_reg32(TCGReg index)
+static uint32_t tci_read_reg32(const tcg_target_ulong *regs, TCGReg index)
 {
-    return (uint32_t)tci_read_reg(index);
+    return (uint32_t)tci_read_reg(regs, index);
 }
 
 #if TCG_TARGET_REG_BITS == 64
-static uint64_t tci_read_reg64(TCGReg index)
+static uint64_t tci_read_reg64(const tcg_target_ulong *regs, TCGReg index)
 {
-    return tci_read_reg(index);
+    return tci_read_reg(regs, index);
 }
 #endif
 
-static void tci_write_reg(TCGReg index, tcg_target_ulong value)
+static void
+tci_write_reg(tcg_target_ulong *regs, TCGReg index, tcg_target_ulong value)
 {
-    tci_assert(index < ARRAY_SIZE(tci_reg));
+    tci_assert(index < TCG_TARGET_NB_REGS);
     tci_assert(index != TCG_AREG0);
     tci_assert(index != TCG_REG_CALL_STACK);
-    tci_reg[index] = value;
+    regs[index] = value;
 }
 
 #if TCG_TARGET_REG_BITS == 64
-static void tci_write_reg32s(TCGReg index, int32_t value)
+static void
+tci_write_reg32s(tcg_target_ulong *regs, TCGReg index, int32_t value)
 {
-    tci_write_reg(index, value);
+    tci_write_reg(regs, index, value);
 }
 #endif
 
-static void tci_write_reg8(TCGReg index, uint8_t value)
+static void tci_write_reg8(tcg_target_ulong *regs, TCGReg index, uint8_t value)
 {
-    tci_write_reg(index, value);
+    tci_write_reg(regs, index, value);
 }
 
-static void tci_write_reg32(TCGReg index, uint32_t value)
+static void
+tci_write_reg32(tcg_target_ulong *regs, TCGReg index, uint32_t value)
 {
-    tci_write_reg(index, value);
+    tci_write_reg(regs, index, value);
 }
 
 #if TCG_TARGET_REG_BITS == 32
-static void tci_write_reg64(uint32_t high_index, uint32_t low_index,
-                            uint64_t value)
+static void tci_write_reg64(tcg_target_ulong *regs, uint32_t high_index,
+                            uint32_t low_index, uint64_t value)
 {
-    tci_write_reg(low_index, value);
-    tci_write_reg(high_index, value >> 32);
+    tci_write_reg(regs, low_index, value);
+    tci_write_reg(regs, high_index, value >> 32);
 }
 #elif TCG_TARGET_REG_BITS == 64
-static void tci_write_reg64(TCGReg index, uint64_t value)
+static void
+tci_write_reg64(tcg_target_ulong *regs, TCGReg index, uint64_t value)
 {
-    tci_write_reg(index, value);
+    tci_write_reg(regs, index, value);
 }
 #endif
 
@@ -188,94 +190,97 @@ static uint64_t tci_read_i64(uint8_t **tb_ptr)
 #endif
 
 /* Read indexed register (native size) from bytecode. */
-static tcg_target_ulong tci_read_r(uint8_t **tb_ptr)
+static tcg_target_ulong
+tci_read_r(const tcg_target_ulong *regs, uint8_t **tb_ptr)
 {
-    tcg_target_ulong value = tci_read_reg(**tb_ptr);
+    tcg_target_ulong value = tci_read_reg(regs, **tb_ptr);
     *tb_ptr += 1;
     return value;
 }
 
 /* Read indexed register (8 bit) from bytecode. */
-static uint8_t tci_read_r8(uint8_t **tb_ptr)
+static uint8_t tci_read_r8(const tcg_target_ulong *regs, uint8_t **tb_ptr)
 {
-    uint8_t value = tci_read_reg8(**tb_ptr);
+    uint8_t value = tci_read_reg8(regs, **tb_ptr);
     *tb_ptr += 1;
     return value;
 }
 
 #if TCG_TARGET_HAS_ext8s_i32 || TCG_TARGET_HAS_ext8s_i64
 /* Read indexed register (8 bit signed) from bytecode. */
-static int8_t tci_read_r8s(uint8_t **tb_ptr)
+static int8_t tci_read_r8s(const tcg_target_ulong *regs, uint8_t **tb_ptr)
 {
-    int8_t value = tci_read_reg8s(**tb_ptr);
+    int8_t value = tci_read_reg8s(regs, **tb_ptr);
     *tb_ptr += 1;
     return value;
 }
 #endif
 
 /* Read indexed register (16 bit) from bytecode. */
-static uint16_t tci_read_r16(uint8_t **tb_ptr)
+static uint16_t tci_read_r16(const tcg_target_ulong *regs, uint8_t **tb_ptr)
 {
-    uint16_t value = tci_read_reg16(**tb_ptr);
+    uint16_t value = tci_read_reg16(regs, **tb_ptr);
     *tb_ptr += 1;
     return value;
 }
 
 #if TCG_TARGET_HAS_ext16s_i32 || TCG_TARGET_HAS_ext16s_i64
 /* Read indexed register (16 bit signed) from bytecode. */
-static int16_t tci_read_r16s(uint8_t **tb_ptr)
+static int16_t tci_read_r16s(const tcg_target_ulong *regs, uint8_t **tb_ptr)
 {
-    int16_t value = tci_read_reg16s(**tb_ptr);
+    int16_t value = tci_read_reg16s(regs, **tb_ptr);
     *tb_ptr += 1;
     return value;
 }
 #endif
 
 /* Read indexed register (32 bit) from bytecode. */
-static uint32_t tci_read_r32(uint8_t **tb_ptr)
+static uint32_t tci_read_r32(const tcg_target_ulong *regs, uint8_t **tb_ptr)
 {
-    uint32_t value = tci_read_reg32(**tb_ptr);
+    uint32_t value = tci_read_reg32(regs, **tb_ptr);
     *tb_ptr += 1;
     return value;
 }
 
 #if TCG_TARGET_REG_BITS == 32
 /* Read two indexed registers (2 * 32 bit) from bytecode. */
-static uint64_t tci_read_r64(uint8_t **tb_ptr)
+static uint64_t tci_read_r64(const tcg_target_ulong *regs, uint8_t **tb_ptr)
 {
-    uint32_t low = tci_read_r32(tb_ptr);
-    return tci_uint64(tci_read_r32(tb_ptr), low);
+    uint32_t low = tci_read_r32(regs, tb_ptr);
+    return tci_uint64(tci_read_r32(regs, tb_ptr), low);
 }
 #elif TCG_TARGET_REG_BITS == 64
 /* Read indexed register (32 bit signed) from bytecode. */
-static int32_t tci_read_r32s(uint8_t **tb_ptr)
+static int32_t tci_read_r32s(const tcg_target_ulong *regs, uint8_t **tb_ptr)
 {
-    int32_t value = tci_read_reg32s(**tb_ptr);
+    int32_t value = tci_read_reg32s(regs, **tb_ptr);
     *tb_ptr += 1;
     return value;
 }
 
 /* Read indexed register (64 bit) from bytecode. */
-static uint64_t tci_read_r64(uint8_t **tb_ptr)
+static uint64_t tci_read_r64(const tcg_target_ulong *regs, uint8_t **tb_ptr)
 {
-    uint64_t value = tci_read_reg64(**tb_ptr);
+    uint64_t value = tci_read_reg64(regs, **tb_ptr);
     *tb_ptr += 1;
     return value;
 }
 #endif
 
 /* Read indexed register(s) with target address from bytecode. */
-static target_ulong tci_read_ulong(uint8_t **tb_ptr)
+static target_ulong
+tci_read_ulong(const tcg_target_ulong *regs, uint8_t **tb_ptr)
 {
-    target_ulong taddr = tci_read_r(tb_ptr);
+    target_ulong taddr = tci_read_r(regs, tb_ptr);
 #if TARGET_LONG_BITS > TCG_TARGET_REG_BITS
-    taddr += (uint64_t)tci_read_r(tb_ptr) << 32;
+    taddr += (uint64_t)tci_read_r(regs, tb_ptr) << 32;
 #endif
     return taddr;
 }
 
 /* Read indexed register or constant (native size) from bytecode. */
-static tcg_target_ulong tci_read_ri(uint8_t **tb_ptr)
+static tcg_target_ulong
+tci_read_ri(const tcg_target_ulong *regs, uint8_t **tb_ptr)
 {
     tcg_target_ulong value;
     TCGReg r = **tb_ptr;
@@ -283,13 +288,13 @@ static tcg_target_ulong tci_read_ri(uint8_t **tb_ptr)
     if (r == TCG_CONST) {
         value = tci_read_i(tb_ptr);
     } else {
-        value = tci_read_reg(r);
+        value = tci_read_reg(regs, r);
     }
     return value;
 }
 
 /* Read indexed register or constant (32 bit) from bytecode. */
-static uint32_t tci_read_ri32(uint8_t **tb_ptr)
+static uint32_t tci_read_ri32(const tcg_target_ulong *regs, uint8_t **tb_ptr)
 {
     uint32_t value;
     TCGReg r = **tb_ptr;
@@ -297,21 +302,21 @@ static uint32_t tci_read_ri32(uint8_t **tb_ptr)
     if (r == TCG_CONST) {
         value = tci_read_i32(tb_ptr);
     } else {
-        value = tci_read_reg32(r);
+        value = tci_read_reg32(regs, r);
     }
     return value;
 }
 
 #if TCG_TARGET_REG_BITS == 32
 /* Read two indexed registers or constants (2 * 32 bit) from bytecode. */
-static uint64_t tci_read_ri64(uint8_t **tb_ptr)
+static uint64_t tci_read_ri64(const tcg_target_ulong *regs, uint8_t **tb_ptr)
 {
-    uint32_t low = tci_read_ri32(tb_ptr);
-    return tci_uint64(tci_read_ri32(tb_ptr), low);
+    uint32_t low = tci_read_ri32(regs, tb_ptr);
+    return tci_uint64(tci_read_ri32(regs, tb_ptr), low);
 }
 #elif TCG_TARGET_REG_BITS == 64
 /* Read indexed register or constant (64 bit) from bytecode. */
-static uint64_t tci_read_ri64(uint8_t **tb_ptr)
+static uint64_t tci_read_ri64(const tcg_target_ulong *regs, uint8_t **tb_ptr)
 {
     uint64_t value;
     TCGReg r = **tb_ptr;
@@ -319,7 +324,7 @@ static uint64_t tci_read_ri64(uint8_t **tb_ptr)
     if (r == TCG_CONST) {
         value = tci_read_i64(tb_ptr);
     } else {
-        value = tci_read_reg64(r);
+        value = tci_read_reg64(regs, r);
     }
     return value;
 }
@@ -465,12 +470,13 @@ static bool tci_compare64(uint64_t u0, uint64_t u1, TCGCond condition)
 /* Interpret pseudo code in tb. */
 uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
 {
+    tcg_target_ulong regs[TCG_TARGET_NB_REGS];
     long tcg_temps[CPU_TEMP_BUF_NLONGS];
     uintptr_t sp_value = (uintptr_t)(tcg_temps + CPU_TEMP_BUF_NLONGS);
     uintptr_t ret = 0;
 
-    tci_reg[TCG_AREG0] = (tcg_target_ulong)env;
-    tci_reg[TCG_REG_CALL_STACK] = sp_value;
+    regs[TCG_AREG0] = (tcg_target_ulong)env;
+    regs[TCG_REG_CALL_STACK] = sp_value;
     tci_assert(tb_ptr);
 
     for (;;) {
@@ -503,27 +509,27 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
 
         switch (opc) {
         case INDEX_op_call:
-            t0 = tci_read_ri(&tb_ptr);
+            t0 = tci_read_ri(regs, &tb_ptr);
 #if TCG_TARGET_REG_BITS == 32
-            tmp64 = ((helper_function)t0)(tci_read_reg(TCG_REG_R0),
-                                          tci_read_reg(TCG_REG_R1),
-                                          tci_read_reg(TCG_REG_R2),
-                                          tci_read_reg(TCG_REG_R3),
-                                          tci_read_reg(TCG_REG_R5),
-                                          tci_read_reg(TCG_REG_R6),
-                                          tci_read_reg(TCG_REG_R7),
-                                          tci_read_reg(TCG_REG_R8),
-                                          tci_read_reg(TCG_REG_R9),
-                                          tci_read_reg(TCG_REG_R10));
-            tci_write_reg(TCG_REG_R0, tmp64);
-            tci_write_reg(TCG_REG_R1, tmp64 >> 32);
+            tmp64 = ((helper_function)t0)(tci_read_reg(regs, TCG_REG_R0),
+                                          tci_read_reg(regs, TCG_REG_R1),
+                                          tci_read_reg(regs, TCG_REG_R2),
+                                          tci_read_reg(regs, TCG_REG_R3),
+                                          tci_read_reg(regs, TCG_REG_R5),
+                                          tci_read_reg(regs, TCG_REG_R6),
+                                          tci_read_reg(regs, TCG_REG_R7),
+                                          tci_read_reg(regs, TCG_REG_R8),
+                                          tci_read_reg(regs, TCG_REG_R9),
+                                          tci_read_reg(regs, TCG_REG_R10));
+            tci_write_reg(regs, TCG_REG_R0, tmp64);
+            tci_write_reg(regs, TCG_REG_R1, tmp64 >> 32);
 #else
-            tmp64 = ((helper_function)t0)(tci_read_reg(TCG_REG_R0),
-                                          tci_read_reg(TCG_REG_R1),
-                                          tci_read_reg(TCG_REG_R2),
-                                          tci_read_reg(TCG_REG_R3),
-                                          tci_read_reg(TCG_REG_R5));
-            tci_write_reg(TCG_REG_R0, tmp64);
+            tmp64 = ((helper_function)t0)(tci_read_reg(regs, TCG_REG_R0),
+                                          tci_read_reg(regs, TCG_REG_R1),
+                                          tci_read_reg(regs, TCG_REG_R2),
+                                          tci_read_reg(regs, TCG_REG_R3),
+                                          tci_read_reg(regs, TCG_REG_R5));
+            tci_write_reg(regs, TCG_REG_R0, tmp64);
 #endif
             break;
         case INDEX_op_br:
@@ -533,46 +539,46 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
             continue;
         case INDEX_op_setcond_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_r32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
+            t1 = tci_read_r32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
             condition = *tb_ptr++;
-            tci_write_reg32(t0, tci_compare32(t1, t2, condition));
+            tci_write_reg32(regs, t0, tci_compare32(t1, t2, condition));
             break;
 #if TCG_TARGET_REG_BITS == 32
         case INDEX_op_setcond2_i32:
             t0 = *tb_ptr++;
-            tmp64 = tci_read_r64(&tb_ptr);
-            v64 = tci_read_ri64(&tb_ptr);
+            tmp64 = tci_read_r64(regs, &tb_ptr);
+            v64 = tci_read_ri64(regs, &tb_ptr);
             condition = *tb_ptr++;
-            tci_write_reg32(t0, tci_compare64(tmp64, v64, condition));
+            tci_write_reg32(regs, t0, tci_compare64(tmp64, v64, condition));
             break;
 #elif TCG_TARGET_REG_BITS == 64
         case INDEX_op_setcond_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r64(&tb_ptr);
-            t2 = tci_read_ri64(&tb_ptr);
+            t1 = tci_read_r64(regs, &tb_ptr);
+            t2 = tci_read_ri64(regs, &tb_ptr);
             condition = *tb_ptr++;
-            tci_write_reg64(t0, tci_compare64(t1, t2, condition));
+            tci_write_reg64(regs, t0, tci_compare64(t1, t2, condition));
             break;
 #endif
         case INDEX_op_mov_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_r32(&tb_ptr);
-            tci_write_reg32(t0, t1);
+            t1 = tci_read_r32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1);
             break;
         case INDEX_op_movi_i32:
             t0 = *tb_ptr++;
             t1 = tci_read_i32(&tb_ptr);
-            tci_write_reg32(t0, t1);
+            tci_write_reg32(regs, t0, t1);
             break;
 
             /* Load/store operations (32 bit). */
 
         case INDEX_op_ld8u_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_r(&tb_ptr);
+            t1 = tci_read_r(regs, &tb_ptr);
             t2 = tci_read_s32(&tb_ptr);
-            tci_write_reg8(t0, *(uint8_t *)(t1 + t2));
+            tci_write_reg8(regs, t0, *(uint8_t *)(t1 + t2));
             break;
         case INDEX_op_ld8s_i32:
         case INDEX_op_ld16u_i32:
@@ -583,25 +589,25 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
             break;
         case INDEX_op_ld_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_r(&tb_ptr);
+            t1 = tci_read_r(regs, &tb_ptr);
             t2 = tci_read_s32(&tb_ptr);
-            tci_write_reg32(t0, *(uint32_t *)(t1 + t2));
+            tci_write_reg32(regs, t0, *(uint32_t *)(t1 + t2));
             break;
         case INDEX_op_st8_i32:
-            t0 = tci_read_r8(&tb_ptr);
-            t1 = tci_read_r(&tb_ptr);
+            t0 = tci_read_r8(regs, &tb_ptr);
+            t1 = tci_read_r(regs, &tb_ptr);
             t2 = tci_read_s32(&tb_ptr);
             *(uint8_t *)(t1 + t2) = t0;
             break;
         case INDEX_op_st16_i32:
-            t0 = tci_read_r16(&tb_ptr);
-            t1 = tci_read_r(&tb_ptr);
+            t0 = tci_read_r16(regs, &tb_ptr);
+            t1 = tci_read_r(regs, &tb_ptr);
             t2 = tci_read_s32(&tb_ptr);
             *(uint16_t *)(t1 + t2) = t0;
             break;
         case INDEX_op_st_i32:
-            t0 = tci_read_r32(&tb_ptr);
-            t1 = tci_read_r(&tb_ptr);
+            t0 = tci_read_r32(regs, &tb_ptr);
+            t1 = tci_read_r(regs, &tb_ptr);
             t2 = tci_read_s32(&tb_ptr);
             tci_assert(t1 != sp_value || (int32_t)t2 < 0);
             *(uint32_t *)(t1 + t2) = t0;
@@ -611,46 +617,46 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
 
         case INDEX_op_add_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, t1 + t2);
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1 + t2);
             break;
         case INDEX_op_sub_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, t1 - t2);
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1 - t2);
             break;
         case INDEX_op_mul_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, t1 * t2);
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1 * t2);
             break;
 #if TCG_TARGET_HAS_div_i32
         case INDEX_op_div_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, (int32_t)t1 / (int32_t)t2);
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, (int32_t)t1 / (int32_t)t2);
             break;
         case INDEX_op_divu_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, t1 / t2);
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1 / t2);
             break;
         case INDEX_op_rem_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, (int32_t)t1 % (int32_t)t2);
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, (int32_t)t1 % (int32_t)t2);
             break;
         case INDEX_op_remu_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, t1 % t2);
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1 % t2);
             break;
 #elif TCG_TARGET_HAS_div2_i32
         case INDEX_op_div2_i32:
@@ -660,71 +666,71 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
 #endif
         case INDEX_op_and_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, t1 & t2);
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1 & t2);
             break;
         case INDEX_op_or_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, t1 | t2);
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1 | t2);
             break;
         case INDEX_op_xor_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, t1 ^ t2);
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1 ^ t2);
             break;
 
             /* Shift/rotate operations (32 bit). */
 
         case INDEX_op_shl_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, t1 << (t2 & 31));
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1 << (t2 & 31));
             break;
         case INDEX_op_shr_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, t1 >> (t2 & 31));
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1 >> (t2 & 31));
             break;
         case INDEX_op_sar_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, ((int32_t)t1 >> (t2 & 31)));
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, ((int32_t)t1 >> (t2 & 31)));
             break;
 #if TCG_TARGET_HAS_rot_i32
         case INDEX_op_rotl_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, rol32(t1, t2 & 31));
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, rol32(t1, t2 & 31));
             break;
         case INDEX_op_rotr_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri32(&tb_ptr);
-            t2 = tci_read_ri32(&tb_ptr);
-            tci_write_reg32(t0, ror32(t1, t2 & 31));
+            t1 = tci_read_ri32(regs, &tb_ptr);
+            t2 = tci_read_ri32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, ror32(t1, t2 & 31));
             break;
 #endif
 #if TCG_TARGET_HAS_deposit_i32
         case INDEX_op_deposit_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_r32(&tb_ptr);
-            t2 = tci_read_r32(&tb_ptr);
+            t1 = tci_read_r32(regs, &tb_ptr);
+            t2 = tci_read_r32(regs, &tb_ptr);
             tmp16 = *tb_ptr++;
             tmp8 = *tb_ptr++;
             tmp32 = (((1 << tmp8) - 1) << tmp16);
-            tci_write_reg32(t0, (t1 & ~tmp32) | ((t2 << tmp16) & tmp32));
+            tci_write_reg32(regs, t0, (t1 & ~tmp32) | ((t2 << tmp16) & tmp32));
             break;
 #endif
         case INDEX_op_brcond_i32:
-            t0 = tci_read_r32(&tb_ptr);
-            t1 = tci_read_ri32(&tb_ptr);
+            t0 = tci_read_r32(regs, &tb_ptr);
+            t1 = tci_read_ri32(regs, &tb_ptr);
             condition = *tb_ptr++;
             label = tci_read_label(&tb_ptr);
             if (tci_compare32(t0, t1, condition)) {
@@ -737,20 +743,20 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
         case INDEX_op_add2_i32:
             t0 = *tb_ptr++;
             t1 = *tb_ptr++;
-            tmp64 = tci_read_r64(&tb_ptr);
-            tmp64 += tci_read_r64(&tb_ptr);
-            tci_write_reg64(t1, t0, tmp64);
+            tmp64 = tci_read_r64(regs, &tb_ptr);
+            tmp64 += tci_read_r64(regs, &tb_ptr);
+            tci_write_reg64(regs, t1, t0, tmp64);
             break;
         case INDEX_op_sub2_i32:
             t0 = *tb_ptr++;
             t1 = *tb_ptr++;
-            tmp64 = tci_read_r64(&tb_ptr);
-            tmp64 -= tci_read_r64(&tb_ptr);
-            tci_write_reg64(t1, t0, tmp64);
+            tmp64 = tci_read_r64(regs, &tb_ptr);
+            tmp64 -= tci_read_r64(regs, &tb_ptr);
+            tci_write_reg64(regs, t1, t0, tmp64);
             break;
         case INDEX_op_brcond2_i32:
-            tmp64 = tci_read_r64(&tb_ptr);
-            v64 = tci_read_ri64(&tb_ptr);
+            tmp64 = tci_read_r64(regs, &tb_ptr);
+            v64 = tci_read_ri64(regs, &tb_ptr);
             condition = *tb_ptr++;
             label = tci_read_label(&tb_ptr);
             if (tci_compare64(tmp64, v64, condition)) {
@@ -762,86 +768,86 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
         case INDEX_op_mulu2_i32:
             t0 = *tb_ptr++;
             t1 = *tb_ptr++;
-            t2 = tci_read_r32(&tb_ptr);
-            tmp64 = tci_read_r32(&tb_ptr);
-            tci_write_reg64(t1, t0, t2 * tmp64);
+            t2 = tci_read_r32(regs, &tb_ptr);
+            tmp64 = tci_read_r32(regs, &tb_ptr);
+            tci_write_reg64(regs, t1, t0, t2 * tmp64);
             break;
 #endif /* TCG_TARGET_REG_BITS == 32 */
 #if TCG_TARGET_HAS_ext8s_i32
         case INDEX_op_ext8s_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_r8s(&tb_ptr);
-            tci_write_reg32(t0, t1);
+            t1 = tci_read_r8s(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1);
             break;
 #endif
 #if TCG_TARGET_HAS_ext16s_i32
         case INDEX_op_ext16s_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_r16s(&tb_ptr);
-            tci_write_reg32(t0, t1);
+            t1 = tci_read_r16s(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1);
             break;
 #endif
 #if TCG_TARGET_HAS_ext8u_i32
         case INDEX_op_ext8u_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_r8(&tb_ptr);
-            tci_write_reg32(t0, t1);
+            t1 = tci_read_r8(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1);
             break;
 #endif
 #if TCG_TARGET_HAS_ext16u_i32
         case INDEX_op_ext16u_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_r16(&tb_ptr);
-            tci_write_reg32(t0, t1);
+            t1 = tci_read_r16(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, t1);
             break;
 #endif
 #if TCG_TARGET_HAS_bswap16_i32
         case INDEX_op_bswap16_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_r16(&tb_ptr);
-            tci_write_reg32(t0, bswap16(t1));
+            t1 = tci_read_r16(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, bswap16(t1));
             break;
 #endif
 #if TCG_TARGET_HAS_bswap32_i32
         case INDEX_op_bswap32_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_r32(&tb_ptr);
-            tci_write_reg32(t0, bswap32(t1));
+            t1 = tci_read_r32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, bswap32(t1));
             break;
 #endif
 #if TCG_TARGET_HAS_not_i32
         case INDEX_op_not_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_r32(&tb_ptr);
-            tci_write_reg32(t0, ~t1);
+            t1 = tci_read_r32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, ~t1);
             break;
 #endif
 #if TCG_TARGET_HAS_neg_i32
         case INDEX_op_neg_i32:
             t0 = *tb_ptr++;
-            t1 = tci_read_r32(&tb_ptr);
-            tci_write_reg32(t0, -t1);
+            t1 = tci_read_r32(regs, &tb_ptr);
+            tci_write_reg32(regs, t0, -t1);
             break;
 #endif
 #if TCG_TARGET_REG_BITS == 64
         case INDEX_op_mov_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r64(&tb_ptr);
-            tci_write_reg64(t0, t1);
+            t1 = tci_read_r64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1);
             break;
         case INDEX_op_movi_i64:
             t0 = *tb_ptr++;
             t1 = tci_read_i64(&tb_ptr);
-            tci_write_reg64(t0, t1);
+            tci_write_reg64(regs, t0, t1);
             break;
 
             /* Load/store operations (64 bit). */
 
         case INDEX_op_ld8u_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r(&tb_ptr);
+            t1 = tci_read_r(regs, &tb_ptr);
             t2 = tci_read_s32(&tb_ptr);
-            tci_write_reg8(t0, *(uint8_t *)(t1 + t2));
+            tci_write_reg8(regs, t0, *(uint8_t *)(t1 + t2));
             break;
         case INDEX_op_ld8s_i64:
         case INDEX_op_ld16u_i64:
@@ -850,43 +856,43 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
             break;
         case INDEX_op_ld32u_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r(&tb_ptr);
+            t1 = tci_read_r(regs, &tb_ptr);
             t2 = tci_read_s32(&tb_ptr);
-            tci_write_reg32(t0, *(uint32_t *)(t1 + t2));
+            tci_write_reg32(regs, t0, *(uint32_t *)(t1 + t2));
             break;
         case INDEX_op_ld32s_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r(&tb_ptr);
+            t1 = tci_read_r(regs, &tb_ptr);
             t2 = tci_read_s32(&tb_ptr);
-            tci_write_reg32s(t0, *(int32_t *)(t1 + t2));
+            tci_write_reg32s(regs, t0, *(int32_t *)(t1 + t2));
             break;
         case INDEX_op_ld_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r(&tb_ptr);
+            t1 = tci_read_r(regs, &tb_ptr);
             t2 = tci_read_s32(&tb_ptr);
-            tci_write_reg64(t0, *(uint64_t *)(t1 + t2));
+            tci_write_reg64(regs, t0, *(uint64_t *)(t1 + t2));
             break;
         case INDEX_op_st8_i64:
-            t0 = tci_read_r8(&tb_ptr);
-            t1 = tci_read_r(&tb_ptr);
+            t0 = tci_read_r8(regs, &tb_ptr);
+            t1 = tci_read_r(regs, &tb_ptr);
             t2 = tci_read_s32(&tb_ptr);
             *(uint8_t *)(t1 + t2) = t0;
             break;
         case INDEX_op_st16_i64:
-            t0 = tci_read_r16(&tb_ptr);
-            t1 = tci_read_r(&tb_ptr);
+            t0 = tci_read_r16(regs, &tb_ptr);
+            t1 = tci_read_r(regs, &tb_ptr);
             t2 = tci_read_s32(&tb_ptr);
             *(uint16_t *)(t1 + t2) = t0;
             break;
         case INDEX_op_st32_i64:
-            t0 = tci_read_r32(&tb_ptr);
-            t1 = tci_read_r(&tb_ptr);
+            t0 = tci_read_r32(regs, &tb_ptr);
+            t1 = tci_read_r(regs, &tb_ptr);
             t2 = tci_read_s32(&tb_ptr);
             *(uint32_t *)(t1 + t2) = t0;
             break;
         case INDEX_op_st_i64:
-            t0 = tci_read_r64(&tb_ptr);
-            t1 = tci_read_r(&tb_ptr);
+            t0 = tci_read_r64(regs, &tb_ptr);
+            t1 = tci_read_r(regs, &tb_ptr);
             t2 = tci_read_s32(&tb_ptr);
             tci_assert(t1 != sp_value || (int32_t)t2 < 0);
             *(uint64_t *)(t1 + t2) = t0;
@@ -896,21 +902,21 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
 
         case INDEX_op_add_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri64(&tb_ptr);
-            t2 = tci_read_ri64(&tb_ptr);
-            tci_write_reg64(t0, t1 + t2);
+            t1 = tci_read_ri64(regs, &tb_ptr);
+            t2 = tci_read_ri64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1 + t2);
             break;
         case INDEX_op_sub_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri64(&tb_ptr);
-            t2 = tci_read_ri64(&tb_ptr);
-            tci_write_reg64(t0, t1 - t2);
+            t1 = tci_read_ri64(regs, &tb_ptr);
+            t2 = tci_read_ri64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1 - t2);
             break;
         case INDEX_op_mul_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri64(&tb_ptr);
-            t2 = tci_read_ri64(&tb_ptr);
-            tci_write_reg64(t0, t1 * t2);
+            t1 = tci_read_ri64(regs, &tb_ptr);
+            t2 = tci_read_ri64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1 * t2);
             break;
 #if TCG_TARGET_HAS_div_i64
         case INDEX_op_div_i64:
@@ -927,71 +933,71 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
 #endif
         case INDEX_op_and_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri64(&tb_ptr);
-            t2 = tci_read_ri64(&tb_ptr);
-            tci_write_reg64(t0, t1 & t2);
+            t1 = tci_read_ri64(regs, &tb_ptr);
+            t2 = tci_read_ri64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1 & t2);
             break;
         case INDEX_op_or_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri64(&tb_ptr);
-            t2 = tci_read_ri64(&tb_ptr);
-            tci_write_reg64(t0, t1 | t2);
+            t1 = tci_read_ri64(regs, &tb_ptr);
+            t2 = tci_read_ri64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1 | t2);
             break;
         case INDEX_op_xor_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri64(&tb_ptr);
-            t2 = tci_read_ri64(&tb_ptr);
-            tci_write_reg64(t0, t1 ^ t2);
+            t1 = tci_read_ri64(regs, &tb_ptr);
+            t2 = tci_read_ri64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1 ^ t2);
             break;
 
             /* Shift/rotate operations (64 bit). */
 
         case INDEX_op_shl_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri64(&tb_ptr);
-            t2 = tci_read_ri64(&tb_ptr);
-            tci_write_reg64(t0, t1 << (t2 & 63));
+            t1 = tci_read_ri64(regs, &tb_ptr);
+            t2 = tci_read_ri64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1 << (t2 & 63));
             break;
         case INDEX_op_shr_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri64(&tb_ptr);
-            t2 = tci_read_ri64(&tb_ptr);
-            tci_write_reg64(t0, t1 >> (t2 & 63));
+            t1 = tci_read_ri64(regs, &tb_ptr);
+            t2 = tci_read_ri64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1 >> (t2 & 63));
             break;
         case INDEX_op_sar_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri64(&tb_ptr);
-            t2 = tci_read_ri64(&tb_ptr);
-            tci_write_reg64(t0, ((int64_t)t1 >> (t2 & 63)));
+            t1 = tci_read_ri64(regs, &tb_ptr);
+            t2 = tci_read_ri64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, ((int64_t)t1 >> (t2 & 63)));
             break;
 #if TCG_TARGET_HAS_rot_i64
         case INDEX_op_rotl_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri64(&tb_ptr);
-            t2 = tci_read_ri64(&tb_ptr);
-            tci_write_reg64(t0, rol64(t1, t2 & 63));
+            t1 = tci_read_ri64(regs, &tb_ptr);
+            t2 = tci_read_ri64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, rol64(t1, t2 & 63));
             break;
         case INDEX_op_rotr_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_ri64(&tb_ptr);
-            t2 = tci_read_ri64(&tb_ptr);
-            tci_write_reg64(t0, ror64(t1, t2 & 63));
+            t1 = tci_read_ri64(regs, &tb_ptr);
+            t2 = tci_read_ri64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, ror64(t1, t2 & 63));
             break;
 #endif
 #if TCG_TARGET_HAS_deposit_i64
         case INDEX_op_deposit_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r64(&tb_ptr);
-            t2 = tci_read_r64(&tb_ptr);
+            t1 = tci_read_r64(regs, &tb_ptr);
+            t2 = tci_read_r64(regs, &tb_ptr);
             tmp16 = *tb_ptr++;
             tmp8 = *tb_ptr++;
             tmp64 = (((1ULL << tmp8) - 1) << tmp16);
-            tci_write_reg64(t0, (t1 & ~tmp64) | ((t2 << tmp16) & tmp64));
+            tci_write_reg64(regs, t0, (t1 & ~tmp64) | ((t2 << tmp16) & tmp64));
             break;
 #endif
         case INDEX_op_brcond_i64:
-            t0 = tci_read_r64(&tb_ptr);
-            t1 = tci_read_ri64(&tb_ptr);
+            t0 = tci_read_r64(regs, &tb_ptr);
+            t1 = tci_read_ri64(regs, &tb_ptr);
             condition = *tb_ptr++;
             label = tci_read_label(&tb_ptr);
             if (tci_compare64(t0, t1, condition)) {
@@ -1003,29 +1009,29 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
 #if TCG_TARGET_HAS_ext8u_i64
         case INDEX_op_ext8u_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r8(&tb_ptr);
-            tci_write_reg64(t0, t1);
+            t1 = tci_read_r8(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1);
             break;
 #endif
 #if TCG_TARGET_HAS_ext8s_i64
         case INDEX_op_ext8s_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r8s(&tb_ptr);
-            tci_write_reg64(t0, t1);
+            t1 = tci_read_r8s(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1);
             break;
 #endif
 #if TCG_TARGET_HAS_ext16s_i64
         case INDEX_op_ext16s_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r16s(&tb_ptr);
-            tci_write_reg64(t0, t1);
+            t1 = tci_read_r16s(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1);
             break;
 #endif
 #if TCG_TARGET_HAS_ext16u_i64
         case INDEX_op_ext16u_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r16(&tb_ptr);
-            tci_write_reg64(t0, t1);
+            t1 = tci_read_r16(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1);
             break;
 #endif
 #if TCG_TARGET_HAS_ext32s_i64
@@ -1033,51 +1039,51 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
 #endif
         case INDEX_op_ext_i32_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r32s(&tb_ptr);
-            tci_write_reg64(t0, t1);
+            t1 = tci_read_r32s(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1);
             break;
 #if TCG_TARGET_HAS_ext32u_i64
         case INDEX_op_ext32u_i64:
 #endif
         case INDEX_op_extu_i32_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r32(&tb_ptr);
-            tci_write_reg64(t0, t1);
+            t1 = tci_read_r32(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, t1);
             break;
 #if TCG_TARGET_HAS_bswap16_i64
         case INDEX_op_bswap16_i64:
             TODO();
             t0 = *tb_ptr++;
-            t1 = tci_read_r16(&tb_ptr);
-            tci_write_reg64(t0, bswap16(t1));
+            t1 = tci_read_r16(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, bswap16(t1));
             break;
 #endif
 #if TCG_TARGET_HAS_bswap32_i64
         case INDEX_op_bswap32_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r32(&tb_ptr);
-            tci_write_reg64(t0, bswap32(t1));
+            t1 = tci_read_r32(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, bswap32(t1));
             break;
 #endif
 #if TCG_TARGET_HAS_bswap64_i64
         case INDEX_op_bswap64_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r64(&tb_ptr);
-            tci_write_reg64(t0, bswap64(t1));
+            t1 = tci_read_r64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, bswap64(t1));
             break;
 #endif
 #if TCG_TARGET_HAS_not_i64
         case INDEX_op_not_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r64(&tb_ptr);
-            tci_write_reg64(t0, ~t1);
+            t1 = tci_read_r64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, ~t1);
             break;
 #endif
 #if TCG_TARGET_HAS_neg_i64
         case INDEX_op_neg_i64:
             t0 = *tb_ptr++;
-            t1 = tci_read_r64(&tb_ptr);
-            tci_write_reg64(t0, -t1);
+            t1 = tci_read_r64(regs, &tb_ptr);
+            tci_write_reg64(regs, t0, -t1);
             break;
 #endif
 #endif /* TCG_TARGET_REG_BITS == 64 */
@@ -1098,7 +1104,7 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
             continue;
         case INDEX_op_qemu_ld_i32:
             t0 = *tb_ptr++;
-            taddr = tci_read_ulong(&tb_ptr);
+            taddr = tci_read_ulong(regs, &tb_ptr);
             oi = tci_read_i(&tb_ptr);
             switch (get_memop(oi) & (MO_BSWAP | MO_SSIZE)) {
             case MO_UB:
@@ -1128,14 +1134,14 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
             default:
                 tcg_abort();
             }
-            tci_write_reg(t0, tmp32);
+            tci_write_reg(regs, t0, tmp32);
             break;
         case INDEX_op_qemu_ld_i64:
             t0 = *tb_ptr++;
             if (TCG_TARGET_REG_BITS == 32) {
                 t1 = *tb_ptr++;
             }
-            taddr = tci_read_ulong(&tb_ptr);
+            taddr = tci_read_ulong(regs, &tb_ptr);
             oi = tci_read_i(&tb_ptr);
             switch (get_memop(oi) & (MO_BSWAP | MO_SSIZE)) {
             case MO_UB:
@@ -1177,14 +1183,14 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
             default:
                 tcg_abort();
             }
-            tci_write_reg(t0, tmp64);
+            tci_write_reg(regs, t0, tmp64);
             if (TCG_TARGET_REG_BITS == 32) {
-                tci_write_reg(t1, tmp64 >> 32);
+                tci_write_reg(regs, t1, tmp64 >> 32);
             }
             break;
         case INDEX_op_qemu_st_i32:
-            t0 = tci_read_r(&tb_ptr);
-            taddr = tci_read_ulong(&tb_ptr);
+            t0 = tci_read_r(regs, &tb_ptr);
+            taddr = tci_read_ulong(regs, &tb_ptr);
             oi = tci_read_i(&tb_ptr);
             switch (get_memop(oi) & (MO_BSWAP | MO_SIZE)) {
             case MO_UB:
@@ -1207,8 +1213,8 @@ uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr)
             }
             break;
         case INDEX_op_qemu_st_i64:
-            tmp64 = tci_read_r64(&tb_ptr);
-            taddr = tci_read_ulong(&tb_ptr);
+            tmp64 = tci_read_r64(regs, &tb_ptr);
+            taddr = tci_read_ulong(regs, &tb_ptr);
             oi = tci_read_i(&tb_ptr);
             switch (get_memop(oi) & (MO_BSWAP | MO_SIZE)) {
             case MO_UB:
-- 
2.7.4


* [Qemu-devel] [PATCH v2 32/45] tcg: take tb_ctx out of TCGContext
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Groundwork for supporting multiple TCG contexts.
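
As an aside, here is a minimal, self-contained analogue of the layout
this moves towards (all names below are made up for the demo; this is
not QEMU code): TB bookkeeping that must stay process-wide lives in one
global with its own lock, and the soon-to-be per-thread codegen context
keeps no copy of it.

  #include <pthread.h>
  #include <stddef.h>

  typedef struct TBContextDemo {
      pthread_mutex_t lock;   /* serializes TB insertion/removal */
      size_t tb_count;        /* stand-in for the TB hash table/tree */
  } TBContextDemo;

  typedef struct TCGContextDemo {
      void *code_ptr;         /* per-thread codegen state only */
  } TCGContextDemo;           /* note: no TBContextDemo member */

  static TBContextDemo tb_ctx_demo = { PTHREAD_MUTEX_INITIALIZER, 0 };

  static void tb_register_demo(TCGContextDemo *s)
  {
      (void)s;                /* shared TB state is not per-context */
      pthread_mutex_lock(&tb_ctx_demo.lock);
      tb_ctx_demo.tb_count++; /* all translator threads share this */
      pthread_mutex_unlock(&tb_ctx_demo.lock);
  }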

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/tb-context.h |  2 ++
 tcg/tcg.h                 |  2 --
 accel/tcg/cpu-exec.c      |  2 +-
 accel/tcg/translate-all.c | 57 +++++++++++++++++++++++------------------------
 linux-user/main.c         |  6 ++---
 5 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
index 1fa8dcc..1d41202 100644
--- a/include/exec/tb-context.h
+++ b/include/exec/tb-context.h
@@ -41,4 +41,6 @@ struct TBContext {
     int tb_phys_invalidate_count;
 };
 
+extern TBContext tb_ctx;
+
 #endif
diff --git a/tcg/tcg.h b/tcg/tcg.h
index bd1fdfa..1090285 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -707,8 +707,6 @@ struct TCGContext {
     /* Threshold to flush the translated code buffer.  */
     void *code_gen_highwater;
 
-    TBContext tb_ctx;
-
     /* Track which vCPU triggers events */
     CPUState *cpu;                      /* *_trans */
     TCGv_env tcg_env;                   /* *_exec  */
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 604fee2..0799b16 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -327,7 +327,7 @@ TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
     phys_pc = get_page_addr_code(desc.env, pc);
     desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
     h = tb_hash_func(phys_pc, pc, flags, cf_mask, *cpu->trace_dstate);
-    return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
+    return qht_lookup(&tb_ctx.htable, tb_cmp, &desc, h);
 }
 
 static inline TranslationBlock *tb_find(CPUState *cpu,
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index b655931..919ef6b 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -154,6 +154,7 @@ static void *l1_map[V_L1_MAX_SIZE];
 
 /* code generation context */
 TCGContext tcg_ctx;
+TBContext tb_ctx;
 bool parallel_cpus;
 
 /* translation block context */
@@ -185,7 +186,7 @@ static void page_table_config_init(void)
 void tb_lock(void)
 {
     assert_tb_unlocked();
-    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
+    qemu_mutex_lock(&tb_ctx.tb_lock);
     have_tb_lock++;
 }
 
@@ -193,13 +194,13 @@ void tb_unlock(void)
 {
     assert_tb_locked();
     have_tb_lock--;
-    qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+    qemu_mutex_unlock(&tb_ctx.tb_lock);
 }
 
 void tb_lock_reset(void)
 {
     if (have_tb_lock) {
-        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+        qemu_mutex_unlock(&tb_ctx.tb_lock);
         have_tb_lock = 0;
     }
 }
@@ -826,15 +827,15 @@ static inline void code_gen_alloc(size_t tb_size)
         fprintf(stderr, "Could not allocate dynamic translator buffer\n");
         exit(1);
     }
-    tcg_ctx.tb_ctx.tb_tree = g_tree_new(tb_tc_cmp);
-    qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
+    tb_ctx.tb_tree = g_tree_new(tb_tc_cmp);
+    qemu_mutex_init(&tb_ctx.tb_lock);
 }
 
 static void tb_htable_init(void)
 {
     unsigned int mode = QHT_MODE_AUTO_RESIZE;
 
-    qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
+    qht_init(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
 }
 
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
@@ -878,7 +879,7 @@ void tb_remove(TranslationBlock *tb)
 {
     assert_tb_locked();
 
-    g_tree_remove(tcg_ctx.tb_ctx.tb_tree, &tb->tc);
+    g_tree_remove(tb_ctx.tb_tree, &tb->tc);
 }
 
 static inline void invalidate_page_bitmap(PageDesc *p)
@@ -940,15 +941,15 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     /* If it is already been done on request of another CPU,
      * just retry.
      */
-    if (tcg_ctx.tb_ctx.tb_flush_count != tb_flush_count.host_int) {
+    if (tb_ctx.tb_flush_count != tb_flush_count.host_int) {
         goto done;
     }
 
     if (DEBUG_TB_FLUSH_GATE) {
-        size_t nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
+        size_t nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
         size_t host_size = 0;
 
-        g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_host_size_iter, &host_size);
+        g_tree_foreach(tb_ctx.tb_tree, tb_host_size_iter, &host_size);
         printf("qemu: flush code_size=%td nb_tbs=%zu avg_tb_size=%zu\n",
                tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer, nb_tbs,
                nb_tbs > 0 ? host_size / nb_tbs : 0);
@@ -963,17 +964,16 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     }
 
     /* Increment the refcount first so that destroy acts as a reset */
-    g_tree_ref(tcg_ctx.tb_ctx.tb_tree);
-    g_tree_destroy(tcg_ctx.tb_ctx.tb_tree);
+    g_tree_ref(tb_ctx.tb_tree);
+    g_tree_destroy(tb_ctx.tb_tree);
 
-    qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
+    qht_reset_size(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
     tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
     /* XXX: flush processor icache at this point if cache flush is
        expensive */
-    atomic_mb_set(&tcg_ctx.tb_ctx.tb_flush_count,
-                  tcg_ctx.tb_ctx.tb_flush_count + 1);
+    atomic_mb_set(&tb_ctx.tb_flush_count, tb_ctx.tb_flush_count + 1);
 
 done:
     tb_unlock();
@@ -982,7 +982,7 @@ done:
 void tb_flush(CPUState *cpu)
 {
     if (tcg_enabled()) {
-        unsigned tb_flush_count = atomic_mb_read(&tcg_ctx.tb_ctx.tb_flush_count);
+        unsigned tb_flush_count = atomic_mb_read(&tb_ctx.tb_flush_count);
         async_safe_run_on_cpu(cpu, do_tb_flush,
                               RUN_ON_CPU_HOST_INT(tb_flush_count));
     }
@@ -1015,7 +1015,7 @@ do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 static void tb_invalidate_check(target_ulong address)
 {
     address &= TARGET_PAGE_MASK;
-    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address);
+    qht_iter(&tb_ctx.htable, do_tb_invalidate_check, &address);
 }
 
 static void
@@ -1035,7 +1035,7 @@ do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 /* verify that all the pages have correct rights for code */
 static void tb_page_check(void)
 {
-    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
+    qht_iter(&tb_ctx.htable, do_tb_page_check, NULL);
 }
 
 #endif /* CONFIG_USER_ONLY */
@@ -1133,7 +1133,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
     h = tb_hash_func(phys_pc, tb->pc, tb->flags, mask_cf(tb->cflags),
                      tb->trace_vcpu_dstate);
-    qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
+    qht_remove(&tb_ctx.htable, tb, h);
 
     /*
      * Mark the TB as invalid *after* it's been removed from tb_hash, which
@@ -1168,7 +1168,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     /* suppress any remaining jumps to this TB */
     tb_jmp_unlink(tb);
 
-    tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
+    tb_ctx.tb_phys_invalidate_count++;
 }
 
 #ifdef CONFIG_SOFTMMU
@@ -1284,7 +1284,7 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
     /* add in the hash table */
     h = tb_hash_func(phys_pc, tb->pc, tb->flags, mask_cf(tb->cflags),
                      tb->trace_vcpu_dstate);
-    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
+    qht_insert(&tb_ctx.htable, tb, h);
 
 #ifdef CONFIG_USER_ONLY
     if (DEBUG_TB_CHECK_GATE) {
@@ -1430,7 +1430,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
      * through the physical hash table and physical page list.
      */
     tb_link_page(tb, phys_pc, phys_page2);
-    g_tree_insert(tcg_ctx.tb_ctx.tb_tree, &tb->tc, tb);
+    g_tree_insert(tb_ctx.tb_tree, &tb->tc, tb);
     return tb;
 }
 
@@ -1718,7 +1718,7 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
 {
     struct tb_tc s = { .ptr = (void *)tc_ptr };
 
-    return g_tree_lookup(tcg_ctx.tb_ctx.tb_tree, &s);
+    return g_tree_lookup(tb_ctx.tb_tree, &s);
 }
 
 #if !defined(CONFIG_USER_ONLY)
@@ -1945,8 +1945,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
 
     tb_lock();
 
-    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
-    g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
+    nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
+    g_tree_foreach(tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
     /* XXX: avoid using doubles ? */
     cpu_fprintf(f, "Translation buffer state:\n");
     /*
@@ -1972,15 +1972,14 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
                 tst.direct_jmp2_count,
                 nb_tbs ? (tst.direct_jmp2_count * 100) / nb_tbs : 0);
 
-    qht_statistics_init(&tcg_ctx.tb_ctx.htable, &hst);
+    qht_statistics_init(&tb_ctx.htable, &hst);
     print_qht_statistics(f, cpu_fprintf, hst);
     qht_statistics_destroy(&hst);
 
     cpu_fprintf(f, "\nStatistics:\n");
     cpu_fprintf(f, "TB flush count      %u\n",
-            atomic_read(&tcg_ctx.tb_ctx.tb_flush_count));
-    cpu_fprintf(f, "TB invalidate count %d\n",
-            tcg_ctx.tb_ctx.tb_phys_invalidate_count);
+                atomic_read(&tb_ctx.tb_flush_count));
+    cpu_fprintf(f, "TB invalidate count %d\n", tb_ctx.tb_phys_invalidate_count);
     cpu_fprintf(f, "TLB flush count     %zu\n", tlb_flush_count());
     tcg_dump_info(f, cpu_fprintf);
 
diff --git a/linux-user/main.c b/linux-user/main.c
index ad03c9e..630c73d 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -114,7 +114,7 @@ int cpu_get_pic_interrupt(CPUX86State *env)
 void fork_start(void)
 {
     cpu_list_lock();
-    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
+    qemu_mutex_lock(&tb_ctx.tb_lock);
     mmap_fork_start();
 }
 
@@ -130,11 +130,11 @@ void fork_end(int child)
                 QTAILQ_REMOVE(&cpus, cpu, node);
             }
         }
-        qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
+        qemu_mutex_init(&tb_ctx.tb_lock);
         qemu_init_cpu_list();
         gdbserver_fork(thread_cpu);
     } else {
-        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+        qemu_mutex_unlock(&tb_ctx.tb_lock);
         cpu_list_unlock();
     }
 }
-- 
2.7.4


* [Qemu-devel] [PATCH v2 33/45] tcg: take .helpers out of TCGContext
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Groundwork for supporting multiple TCG contexts.

The hash table becomes read-only after it is filled in,
so we can save space by keeping just a global pointer to it.
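
For illustration, a self-contained sketch of that fill-once/read-many
pattern in plain glib (the *_demo names are made up; only the
g_hash_table_* calls are real API): a table populated by a single
thread before any lookups happen needs no locking afterwards, so a
global pointer is enough.

  #include <glib.h>
  #include <stddef.h>

  typedef struct HelperInfoDemo {
      void *func;
      const char *name;
  } HelperInfoDemo;

  static GHashTable *helper_table_demo;  /* written once, then read-only */

  static void helpers_init_demo(HelperInfoDemo *all, size_t n)
  {
      size_t i;

      /* Direct pointer hashing/equality, as with g_direct_hash/equal. */
      helper_table_demo = g_hash_table_new(NULL, NULL);
      for (i = 0; i < n; i++) {
          g_hash_table_insert(helper_table_demo, all[i].func, &all[i]);
      }
      /* No inserts past this point: concurrent lookups need no lock. */
  }

  static const char *helper_name_demo(void *func)
  {
      HelperInfoDemo *info = g_hash_table_lookup(helper_table_demo, func);

      return info ? info->name : NULL;
  }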

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.h |  2 --
 tcg/tcg.c | 10 +++++-----
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 1090285..7cbe802 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -664,8 +664,6 @@ struct TCGContext {
 
     tcg_insn_unit *code_ptr;
 
-    GHashTable *helpers;
-
 #ifdef CONFIG_PROFILER
     /* profiling info */
     int64_t tb_count1;
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 28c1b94..c0c2d6c 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -319,6 +319,7 @@ typedef struct TCGHelperInfo {
 static const TCGHelperInfo all_helpers[] = {
 #include "exec/helper-tcg.h"
 };
+static GHashTable *helper_table;
 
 static int indirect_reg_alloc_order[ARRAY_SIZE(tcg_target_reg_alloc_order)];
 static void process_op_defs(TCGContext *s);
@@ -329,7 +330,6 @@ void tcg_context_init(TCGContext *s)
     TCGOpDef *def;
     TCGArgConstraint *args_ct;
     int *sorted_args;
-    GHashTable *helper_table;
 
     memset(s, 0, sizeof(*s));
     s->nb_globals = 0;
@@ -357,7 +357,7 @@ void tcg_context_init(TCGContext *s)
 
     /* Register helpers.  */
     /* Use g_direct_hash/equal for direct pointer comparisons on func.  */
-    s->helpers = helper_table = g_hash_table_new(NULL, NULL);
+    helper_table = g_hash_table_new(NULL, NULL);
 
     for (i = 0; i < ARRAY_SIZE(all_helpers); ++i) {
         g_hash_table_insert(helper_table, (gpointer)all_helpers[i].func,
@@ -761,7 +761,7 @@ void tcg_gen_callN(TCGContext *s, void *func, TCGArg ret,
     unsigned sizemask, flags;
     TCGHelperInfo *info;
 
-    info = g_hash_table_lookup(s->helpers, (gpointer)func);
+    info = g_hash_table_lookup(helper_table, (gpointer)func);
     flags = info->flags;
     sizemask = info->sizemask;
 
@@ -990,8 +990,8 @@ static char *tcg_get_arg_str_idx(TCGContext *s, char *buf,
 static inline const char *tcg_find_helper(TCGContext *s, uintptr_t val)
 {
     const char *ret = NULL;
-    if (s->helpers) {
-        TCGHelperInfo *info = g_hash_table_lookup(s->helpers, (gpointer)val);
+    if (helper_table) {
+        TCGHelperInfo *info = g_hash_table_lookup(helper_table, (gpointer)val);
         if (info) {
             ret = info->name;
         }
-- 
2.7.4


* [Qemu-devel] [PATCH v2 34/45] tcg: define tcg_init_ctx and make tcg_ctx a pointer
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Groundwork for supporting multiple TCG contexts.

The core of this patch is this change to tcg/tcg.h:

> -extern TCGContext tcg_ctx;
> +extern TCGContext tcg_init_ctx;
> +extern TCGContext *tcg_ctx;

Note that for now we point tcg_ctx at whatever TCGContext is passed
to tcg_context_init -- in this case &tcg_init_ctx.

To avoid diff churn we could do something like
> TCGContext *tcg_ctx_ptr;
> #define tcg_ctx (*tcg_ctx_ptr)
as Richard suggested during review, but sooner or later
we'd end up doing the conversion anyway, so do it now.
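
A rough, self-contained sketch of the indirection (the *_demo names
are made up; this is not QEMU code): call sites always go through the
pointer, so a later patch can repoint it at a per-thread context
without touching them again.

  #include <stdio.h>

  typedef struct CtxDemo {
      int nb_ops;                 /* stand-in for codegen state */
  } CtxDemo;

  static CtxDemo init_ctx_demo;   /* analogue of tcg_init_ctx */
  static CtxDemo *ctx_demo;       /* analogue of the tcg_ctx pointer */

  static void ctx_init_demo(CtxDemo *s)
  {
      s->nb_ops = 0;
      ctx_demo = s;               /* point at whatever is passed in */
  }

  int main(void)
  {
      ctx_init_demo(&init_ctx_demo);
      ctx_demo->nb_ops++;         /* ctx_demo-> rather than ctx_demo. */
      printf("ops: %d\n", ctx_demo->nb_ops);
      return 0;
  }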

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/gen-icount.h     | 10 ++---
 include/exec/helper-gen.h     | 12 +++---
 tcg/tcg-op.h                  | 80 +++++++++++++++++------------------
 tcg/tcg.h                     | 15 +++----
 accel/tcg/translate-all.c     | 97 ++++++++++++++++++++++---------------------
 bsd-user/main.c               |  2 +-
 linux-user/main.c             |  2 +-
 target/alpha/translate.c      |  2 +-
 target/arm/translate.c        |  2 +-
 target/cris/translate.c       |  2 +-
 target/cris/translate_v10.c   |  2 +-
 target/hppa/translate.c       |  2 +-
 target/i386/translate.c       |  2 +-
 target/lm32/translate.c       |  2 +-
 target/m68k/translate.c       |  2 +-
 target/microblaze/translate.c |  2 +-
 target/mips/translate.c       |  2 +-
 target/moxie/translate.c      |  2 +-
 target/openrisc/translate.c   |  2 +-
 target/ppc/translate.c        |  2 +-
 target/s390x/translate.c      |  2 +-
 target/sh4/translate.c        |  2 +-
 target/sparc/translate.c      |  2 +-
 target/tilegx/translate.c     |  2 +-
 target/tricore/translate.c    |  2 +-
 target/unicore32/translate.c  |  2 +-
 target/xtensa/translate.c     |  2 +-
 tcg/tcg-op.c                  | 58 +++++++++++++-------------
 tcg/tcg-runtime.c             |  2 +-
 tcg/tcg.c                     | 21 +++++-----
 30 files changed, 171 insertions(+), 168 deletions(-)

diff --git a/include/exec/gen-icount.h b/include/exec/gen-icount.h
index 9b3cb14..4a55da8 100644
--- a/include/exec/gen-icount.h
+++ b/include/exec/gen-icount.h
@@ -19,7 +19,7 @@ static inline void gen_tb_start(TranslationBlock *tb)
         count = tcg_temp_new_i32();
     }
 
-    tcg_gen_ld_i32(count, tcg_ctx.tcg_env,
+    tcg_gen_ld_i32(count, tcg_ctx->tcg_env,
                    -ENV_OFFSET + offsetof(CPUState, icount_decr.u32));
 
     if (tb->cflags & CF_USE_ICOUNT) {
@@ -37,7 +37,7 @@ static inline void gen_tb_start(TranslationBlock *tb)
     tcg_gen_brcondi_i32(TCG_COND_LT, count, 0, exitreq_label);
 
     if (tb->cflags & CF_USE_ICOUNT) {
-        tcg_gen_st16_i32(count, tcg_ctx.tcg_env,
+        tcg_gen_st16_i32(count, tcg_ctx->tcg_env,
                          -ENV_OFFSET + offsetof(CPUState, icount_decr.u16.low));
     }
 
@@ -56,13 +56,13 @@ static inline void gen_tb_end(TranslationBlock *tb, int num_insns)
     tcg_gen_exit_tb((uintptr_t)tb + TB_EXIT_REQUESTED);
 
     /* Terminate the linked list.  */
-    tcg_ctx.gen_op_buf[tcg_ctx.gen_op_buf[0].prev].next = 0;
+    tcg_ctx->gen_op_buf[tcg_ctx->gen_op_buf[0].prev].next = 0;
 }
 
 static inline void gen_io_start(void)
 {
     TCGv_i32 tmp = tcg_const_i32(1);
-    tcg_gen_st_i32(tmp, tcg_ctx.tcg_env,
+    tcg_gen_st_i32(tmp, tcg_ctx->tcg_env,
                    -ENV_OFFSET + offsetof(CPUState, can_do_io));
     tcg_temp_free_i32(tmp);
 }
@@ -70,7 +70,7 @@ static inline void gen_io_start(void)
 static inline void gen_io_end(void)
 {
     TCGv_i32 tmp = tcg_const_i32(0);
-    tcg_gen_st_i32(tmp, tcg_ctx.tcg_env,
+    tcg_gen_st_i32(tmp, tcg_ctx->tcg_env,
                    -ENV_OFFSET + offsetof(CPUState, can_do_io));
     tcg_temp_free_i32(tmp);
 }
diff --git a/include/exec/helper-gen.h b/include/exec/helper-gen.h
index 8239ffc..3bcb901 100644
--- a/include/exec/helper-gen.h
+++ b/include/exec/helper-gen.h
@@ -9,7 +9,7 @@
 #define DEF_HELPER_FLAGS_0(name, flags, ret)                            \
 static inline void glue(gen_helper_, name)(dh_retvar_decl0(ret))        \
 {                                                                       \
-  tcg_gen_callN(&tcg_ctx, HELPER(name), dh_retvar(ret), 0, NULL);       \
+  tcg_gen_callN(tcg_ctx, HELPER(name), dh_retvar(ret), 0, NULL);        \
 }
 
 #define DEF_HELPER_FLAGS_1(name, flags, ret, t1)                        \
@@ -17,7 +17,7 @@ static inline void glue(gen_helper_, name)(dh_retvar_decl(ret)          \
     dh_arg_decl(t1, 1))                                                 \
 {                                                                       \
   TCGArg args[1] = { dh_arg(t1, 1) };                                   \
-  tcg_gen_callN(&tcg_ctx, HELPER(name), dh_retvar(ret), 1, args);       \
+  tcg_gen_callN(tcg_ctx, HELPER(name), dh_retvar(ret), 1, args);        \
 }
 
 #define DEF_HELPER_FLAGS_2(name, flags, ret, t1, t2)                    \
@@ -25,7 +25,7 @@ static inline void glue(gen_helper_, name)(dh_retvar_decl(ret)          \
     dh_arg_decl(t1, 1), dh_arg_decl(t2, 2))                             \
 {                                                                       \
   TCGArg args[2] = { dh_arg(t1, 1), dh_arg(t2, 2) };                    \
-  tcg_gen_callN(&tcg_ctx, HELPER(name), dh_retvar(ret), 2, args);       \
+  tcg_gen_callN(tcg_ctx, HELPER(name), dh_retvar(ret), 2, args);        \
 }
 
 #define DEF_HELPER_FLAGS_3(name, flags, ret, t1, t2, t3)                \
@@ -33,7 +33,7 @@ static inline void glue(gen_helper_, name)(dh_retvar_decl(ret)          \
     dh_arg_decl(t1, 1), dh_arg_decl(t2, 2), dh_arg_decl(t3, 3))         \
 {                                                                       \
   TCGArg args[3] = { dh_arg(t1, 1), dh_arg(t2, 2), dh_arg(t3, 3) };     \
-  tcg_gen_callN(&tcg_ctx, HELPER(name), dh_retvar(ret), 3, args);       \
+  tcg_gen_callN(tcg_ctx, HELPER(name), dh_retvar(ret), 3, args);        \
 }
 
 #define DEF_HELPER_FLAGS_4(name, flags, ret, t1, t2, t3, t4)            \
@@ -43,7 +43,7 @@ static inline void glue(gen_helper_, name)(dh_retvar_decl(ret)          \
 {                                                                       \
   TCGArg args[4] = { dh_arg(t1, 1), dh_arg(t2, 2),                      \
                      dh_arg(t3, 3), dh_arg(t4, 4) };                    \
-  tcg_gen_callN(&tcg_ctx, HELPER(name), dh_retvar(ret), 4, args);       \
+  tcg_gen_callN(tcg_ctx, HELPER(name), dh_retvar(ret), 4, args);        \
 }
 
 #define DEF_HELPER_FLAGS_5(name, flags, ret, t1, t2, t3, t4, t5)        \
@@ -53,7 +53,7 @@ static inline void glue(gen_helper_, name)(dh_retvar_decl(ret)          \
 {                                                                       \
   TCGArg args[5] = { dh_arg(t1, 1), dh_arg(t2, 2), dh_arg(t3, 3),       \
                      dh_arg(t4, 4), dh_arg(t5, 5) };                    \
-  tcg_gen_callN(&tcg_ctx, HELPER(name), dh_retvar(ret), 5, args);       \
+  tcg_gen_callN(tcg_ctx, HELPER(name), dh_retvar(ret), 5, args);        \
 }
 
 #include "helper.h"
diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 18d01b2..75c15cc 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -40,161 +40,161 @@ void tcg_gen_op6(TCGContext *, TCGOpcode, TCGArg, TCGArg, TCGArg,
 
 static inline void tcg_gen_op1_i32(TCGOpcode opc, TCGv_i32 a1)
 {
-    tcg_gen_op1(&tcg_ctx, opc, GET_TCGV_I32(a1));
+    tcg_gen_op1(tcg_ctx, opc, GET_TCGV_I32(a1));
 }
 
 static inline void tcg_gen_op1_i64(TCGOpcode opc, TCGv_i64 a1)
 {
-    tcg_gen_op1(&tcg_ctx, opc, GET_TCGV_I64(a1));
+    tcg_gen_op1(tcg_ctx, opc, GET_TCGV_I64(a1));
 }
 
 static inline void tcg_gen_op1i(TCGOpcode opc, TCGArg a1)
 {
-    tcg_gen_op1(&tcg_ctx, opc, a1);
+    tcg_gen_op1(tcg_ctx, opc, a1);
 }
 
 static inline void tcg_gen_op2_i32(TCGOpcode opc, TCGv_i32 a1, TCGv_i32 a2)
 {
-    tcg_gen_op2(&tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2));
+    tcg_gen_op2(tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2));
 }
 
 static inline void tcg_gen_op2_i64(TCGOpcode opc, TCGv_i64 a1, TCGv_i64 a2)
 {
-    tcg_gen_op2(&tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2));
+    tcg_gen_op2(tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2));
 }
 
 static inline void tcg_gen_op2i_i32(TCGOpcode opc, TCGv_i32 a1, TCGArg a2)
 {
-    tcg_gen_op2(&tcg_ctx, opc, GET_TCGV_I32(a1), a2);
+    tcg_gen_op2(tcg_ctx, opc, GET_TCGV_I32(a1), a2);
 }
 
 static inline void tcg_gen_op2i_i64(TCGOpcode opc, TCGv_i64 a1, TCGArg a2)
 {
-    tcg_gen_op2(&tcg_ctx, opc, GET_TCGV_I64(a1), a2);
+    tcg_gen_op2(tcg_ctx, opc, GET_TCGV_I64(a1), a2);
 }
 
 static inline void tcg_gen_op2ii(TCGOpcode opc, TCGArg a1, TCGArg a2)
 {
-    tcg_gen_op2(&tcg_ctx, opc, a1, a2);
+    tcg_gen_op2(tcg_ctx, opc, a1, a2);
 }
 
 static inline void tcg_gen_op3_i32(TCGOpcode opc, TCGv_i32 a1,
                                    TCGv_i32 a2, TCGv_i32 a3)
 {
-    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_I32(a1),
+    tcg_gen_op3(tcg_ctx, opc, GET_TCGV_I32(a1),
                 GET_TCGV_I32(a2), GET_TCGV_I32(a3));
 }
 
 static inline void tcg_gen_op3_i64(TCGOpcode opc, TCGv_i64 a1,
                                    TCGv_i64 a2, TCGv_i64 a3)
 {
-    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_I64(a1),
+    tcg_gen_op3(tcg_ctx, opc, GET_TCGV_I64(a1),
                 GET_TCGV_I64(a2), GET_TCGV_I64(a3));
 }
 
 static inline void tcg_gen_op3i_i32(TCGOpcode opc, TCGv_i32 a1,
                                     TCGv_i32 a2, TCGArg a3)
 {
-    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2), a3);
+    tcg_gen_op3(tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2), a3);
 }
 
 static inline void tcg_gen_op3i_i64(TCGOpcode opc, TCGv_i64 a1,
                                     TCGv_i64 a2, TCGArg a3)
 {
-    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2), a3);
+    tcg_gen_op3(tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2), a3);
 }
 
 static inline void tcg_gen_ldst_op_i32(TCGOpcode opc, TCGv_i32 val,
                                        TCGv_ptr base, TCGArg offset)
 {
-    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_I32(val), GET_TCGV_PTR(base), offset);
+    tcg_gen_op3(tcg_ctx, opc, GET_TCGV_I32(val), GET_TCGV_PTR(base), offset);
 }
 
 static inline void tcg_gen_ldst_op_i64(TCGOpcode opc, TCGv_i64 val,
                                        TCGv_ptr base, TCGArg offset)
 {
-    tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_I64(val), GET_TCGV_PTR(base), offset);
+    tcg_gen_op3(tcg_ctx, opc, GET_TCGV_I64(val), GET_TCGV_PTR(base), offset);
 }
 
 static inline void tcg_gen_op4_i32(TCGOpcode opc, TCGv_i32 a1, TCGv_i32 a2,
                                    TCGv_i32 a3, TCGv_i32 a4)
 {
-    tcg_gen_op4(&tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
+    tcg_gen_op4(tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
                 GET_TCGV_I32(a3), GET_TCGV_I32(a4));
 }
 
 static inline void tcg_gen_op4_i64(TCGOpcode opc, TCGv_i64 a1, TCGv_i64 a2,
                                    TCGv_i64 a3, TCGv_i64 a4)
 {
-    tcg_gen_op4(&tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
+    tcg_gen_op4(tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
                 GET_TCGV_I64(a3), GET_TCGV_I64(a4));
 }
 
 static inline void tcg_gen_op4i_i32(TCGOpcode opc, TCGv_i32 a1, TCGv_i32 a2,
                                     TCGv_i32 a3, TCGArg a4)
 {
-    tcg_gen_op4(&tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
+    tcg_gen_op4(tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
                 GET_TCGV_I32(a3), a4);
 }
 
 static inline void tcg_gen_op4i_i64(TCGOpcode opc, TCGv_i64 a1, TCGv_i64 a2,
                                     TCGv_i64 a3, TCGArg a4)
 {
-    tcg_gen_op4(&tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
+    tcg_gen_op4(tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
                 GET_TCGV_I64(a3), a4);
 }
 
 static inline void tcg_gen_op4ii_i32(TCGOpcode opc, TCGv_i32 a1, TCGv_i32 a2,
                                      TCGArg a3, TCGArg a4)
 {
-    tcg_gen_op4(&tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2), a3, a4);
+    tcg_gen_op4(tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2), a3, a4);
 }
 
 static inline void tcg_gen_op4ii_i64(TCGOpcode opc, TCGv_i64 a1, TCGv_i64 a2,
                                      TCGArg a3, TCGArg a4)
 {
-    tcg_gen_op4(&tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2), a3, a4);
+    tcg_gen_op4(tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2), a3, a4);
 }
 
 static inline void tcg_gen_op5_i32(TCGOpcode opc, TCGv_i32 a1, TCGv_i32 a2,
                                    TCGv_i32 a3, TCGv_i32 a4, TCGv_i32 a5)
 {
-    tcg_gen_op5(&tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
+    tcg_gen_op5(tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
                 GET_TCGV_I32(a3), GET_TCGV_I32(a4), GET_TCGV_I32(a5));
 }
 
 static inline void tcg_gen_op5_i64(TCGOpcode opc, TCGv_i64 a1, TCGv_i64 a2,
                                    TCGv_i64 a3, TCGv_i64 a4, TCGv_i64 a5)
 {
-    tcg_gen_op5(&tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
+    tcg_gen_op5(tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
                 GET_TCGV_I64(a3), GET_TCGV_I64(a4), GET_TCGV_I64(a5));
 }
 
 static inline void tcg_gen_op5i_i32(TCGOpcode opc, TCGv_i32 a1, TCGv_i32 a2,
                                     TCGv_i32 a3, TCGv_i32 a4, TCGArg a5)
 {
-    tcg_gen_op5(&tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
+    tcg_gen_op5(tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
                 GET_TCGV_I32(a3), GET_TCGV_I32(a4), a5);
 }
 
 static inline void tcg_gen_op5i_i64(TCGOpcode opc, TCGv_i64 a1, TCGv_i64 a2,
                                     TCGv_i64 a3, TCGv_i64 a4, TCGArg a5)
 {
-    tcg_gen_op5(&tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
+    tcg_gen_op5(tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
                 GET_TCGV_I64(a3), GET_TCGV_I64(a4), a5);
 }
 
 static inline void tcg_gen_op5ii_i32(TCGOpcode opc, TCGv_i32 a1, TCGv_i32 a2,
                                      TCGv_i32 a3, TCGArg a4, TCGArg a5)
 {
-    tcg_gen_op5(&tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
+    tcg_gen_op5(tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
                 GET_TCGV_I32(a3), a4, a5);
 }
 
 static inline void tcg_gen_op5ii_i64(TCGOpcode opc, TCGv_i64 a1, TCGv_i64 a2,
                                      TCGv_i64 a3, TCGArg a4, TCGArg a5)
 {
-    tcg_gen_op5(&tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
+    tcg_gen_op5(tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
                 GET_TCGV_I64(a3), a4, a5);
 }
 
@@ -202,7 +202,7 @@ static inline void tcg_gen_op6_i32(TCGOpcode opc, TCGv_i32 a1, TCGv_i32 a2,
                                    TCGv_i32 a3, TCGv_i32 a4,
                                    TCGv_i32 a5, TCGv_i32 a6)
 {
-    tcg_gen_op6(&tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
+    tcg_gen_op6(tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
                 GET_TCGV_I32(a3), GET_TCGV_I32(a4), GET_TCGV_I32(a5),
                 GET_TCGV_I32(a6));
 }
@@ -211,7 +211,7 @@ static inline void tcg_gen_op6_i64(TCGOpcode opc, TCGv_i64 a1, TCGv_i64 a2,
                                    TCGv_i64 a3, TCGv_i64 a4,
                                    TCGv_i64 a5, TCGv_i64 a6)
 {
-    tcg_gen_op6(&tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
+    tcg_gen_op6(tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
                 GET_TCGV_I64(a3), GET_TCGV_I64(a4), GET_TCGV_I64(a5),
                 GET_TCGV_I64(a6));
 }
@@ -220,7 +220,7 @@ static inline void tcg_gen_op6i_i32(TCGOpcode opc, TCGv_i32 a1, TCGv_i32 a2,
                                     TCGv_i32 a3, TCGv_i32 a4,
                                     TCGv_i32 a5, TCGArg a6)
 {
-    tcg_gen_op6(&tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
+    tcg_gen_op6(tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
                 GET_TCGV_I32(a3), GET_TCGV_I32(a4), GET_TCGV_I32(a5), a6);
 }
 
@@ -228,7 +228,7 @@ static inline void tcg_gen_op6i_i64(TCGOpcode opc, TCGv_i64 a1, TCGv_i64 a2,
                                     TCGv_i64 a3, TCGv_i64 a4,
                                     TCGv_i64 a5, TCGArg a6)
 {
-    tcg_gen_op6(&tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
+    tcg_gen_op6(tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
                 GET_TCGV_I64(a3), GET_TCGV_I64(a4), GET_TCGV_I64(a5), a6);
 }
 
@@ -236,7 +236,7 @@ static inline void tcg_gen_op6ii_i32(TCGOpcode opc, TCGv_i32 a1, TCGv_i32 a2,
                                      TCGv_i32 a3, TCGv_i32 a4,
                                      TCGArg a5, TCGArg a6)
 {
-    tcg_gen_op6(&tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
+    tcg_gen_op6(tcg_ctx, opc, GET_TCGV_I32(a1), GET_TCGV_I32(a2),
                 GET_TCGV_I32(a3), GET_TCGV_I32(a4), a5, a6);
 }
 
@@ -244,7 +244,7 @@ static inline void tcg_gen_op6ii_i64(TCGOpcode opc, TCGv_i64 a1, TCGv_i64 a2,
                                      TCGv_i64 a3, TCGv_i64 a4,
                                      TCGArg a5, TCGArg a6)
 {
-    tcg_gen_op6(&tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
+    tcg_gen_op6(tcg_ctx, opc, GET_TCGV_I64(a1), GET_TCGV_I64(a2),
                 GET_TCGV_I64(a3), GET_TCGV_I64(a4), a5, a6);
 }
 
@@ -253,12 +253,12 @@ static inline void tcg_gen_op6ii_i64(TCGOpcode opc, TCGv_i64 a1, TCGv_i64 a2,
 
 static inline void gen_set_label(TCGLabel *l)
 {
-    tcg_gen_op1(&tcg_ctx, INDEX_op_set_label, label_arg(l));
+    tcg_gen_op1(tcg_ctx, INDEX_op_set_label, label_arg(l));
 }
 
 static inline void tcg_gen_br(TCGLabel *l)
 {
-    tcg_gen_op1(&tcg_ctx, INDEX_op_br, label_arg(l));
+    tcg_gen_op1(tcg_ctx, INDEX_op_br, label_arg(l));
 }
 
 void tcg_gen_mb(TCGBar);
@@ -732,12 +732,12 @@ static inline void tcg_gen_concat32_i64(TCGv_i64 ret, TCGv_i64 lo, TCGv_i64 hi)
 # if TARGET_LONG_BITS <= TCG_TARGET_REG_BITS
 static inline void tcg_gen_insn_start(target_ulong pc)
 {
-    tcg_gen_op1(&tcg_ctx, INDEX_op_insn_start, pc);
+    tcg_gen_op1(tcg_ctx, INDEX_op_insn_start, pc);
 }
 # else
 static inline void tcg_gen_insn_start(target_ulong pc)
 {
-    tcg_gen_op2(&tcg_ctx, INDEX_op_insn_start,
+    tcg_gen_op2(tcg_ctx, INDEX_op_insn_start,
                 (uint32_t)pc, (uint32_t)(pc >> 32));
 }
 # endif
@@ -745,12 +745,12 @@ static inline void tcg_gen_insn_start(target_ulong pc)
 # if TARGET_LONG_BITS <= TCG_TARGET_REG_BITS
 static inline void tcg_gen_insn_start(target_ulong pc, target_ulong a1)
 {
-    tcg_gen_op2(&tcg_ctx, INDEX_op_insn_start, pc, a1);
+    tcg_gen_op2(tcg_ctx, INDEX_op_insn_start, pc, a1);
 }
 # else
 static inline void tcg_gen_insn_start(target_ulong pc, target_ulong a1)
 {
-    tcg_gen_op4(&tcg_ctx, INDEX_op_insn_start,
+    tcg_gen_op4(tcg_ctx, INDEX_op_insn_start,
                 (uint32_t)pc, (uint32_t)(pc >> 32),
                 (uint32_t)a1, (uint32_t)(a1 >> 32));
 }
@@ -760,13 +760,13 @@ static inline void tcg_gen_insn_start(target_ulong pc, target_ulong a1)
 static inline void tcg_gen_insn_start(target_ulong pc, target_ulong a1,
                                       target_ulong a2)
 {
-    tcg_gen_op3(&tcg_ctx, INDEX_op_insn_start, pc, a1, a2);
+    tcg_gen_op3(tcg_ctx, INDEX_op_insn_start, pc, a1, a2);
 }
 # else
 static inline void tcg_gen_insn_start(target_ulong pc, target_ulong a1,
                                       target_ulong a2)
 {
-    tcg_gen_op6(&tcg_ctx, INDEX_op_insn_start,
+    tcg_gen_op6(tcg_ctx, INDEX_op_insn_start,
                 (uint32_t)pc, (uint32_t)(pc >> 32),
                 (uint32_t)a1, (uint32_t)(a1 >> 32),
                 (uint32_t)a2, (uint32_t)(a2 >> 32));
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 7cbe802..6913d4b 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -726,18 +726,19 @@ struct TCGContext {
     target_ulong gen_insn_data[TCG_MAX_INSNS][TARGET_INSN_START_WORDS];
 };
 
-extern TCGContext tcg_ctx;
+extern TCGContext tcg_init_ctx;
+extern TCGContext *tcg_ctx;
 
 static inline void tcg_set_insn_param(int op_idx, int arg, TCGArg v)
 {
-    int op_argi = tcg_ctx.gen_op_buf[op_idx].args;
-    tcg_ctx.gen_opparam_buf[op_argi + arg] = v;
+    int op_argi = tcg_ctx->gen_op_buf[op_idx].args;
+    tcg_ctx->gen_opparam_buf[op_argi + arg] = v;
 }
 
 /* The number of opcodes emitted so far.  */
 static inline int tcg_op_buf_count(void)
 {
-    return tcg_ctx.gen_next_op_idx;
+    return tcg_ctx->gen_next_op_idx;
 }
 
 /* Test for whether to terminate the TB for using too many opcodes.  */
@@ -756,13 +757,13 @@ TranslationBlock *tcg_tb_alloc(TCGContext *s);
 /* Called with tb_lock held.  */
 static inline void *tcg_malloc(int size)
 {
-    TCGContext *s = &tcg_ctx;
+    TCGContext *s = tcg_ctx;
     uint8_t *ptr, *ptr_end;
     size = (size + sizeof(long) - 1) & ~(sizeof(long) - 1);
     ptr = s->pool_cur;
     ptr_end = ptr + size;
     if (unlikely(ptr_end > s->pool_end)) {
-        return tcg_malloc_internal(&tcg_ctx, size);
+        return tcg_malloc_internal(tcg_ctx, size);
     } else {
         s->pool_cur = ptr_end;
         return ptr;
@@ -1100,7 +1101,7 @@ static inline unsigned get_mmuidx(TCGMemOpIdx oi)
 uintptr_t tcg_qemu_tb_exec(CPUArchState *env, uint8_t *tb_ptr);
 #else
 # define tcg_qemu_tb_exec(env, tb_ptr) \
-    ((uintptr_t (*)(void *, void *))tcg_ctx.code_gen_prologue)(env, tb_ptr)
+    ((uintptr_t (*)(void *, void *))tcg_ctx->code_gen_prologue)(env, tb_ptr)
 #endif
 
 void tcg_register_jit(void *buf, size_t buf_size);
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 919ef6b..961e357 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -153,7 +153,8 @@ static int v_l2_levels;
 static void *l1_map[V_L1_MAX_SIZE];
 
 /* code generation context */
-TCGContext tcg_ctx;
+TCGContext tcg_init_ctx;
+TCGContext *tcg_ctx;
 TBContext tb_ctx;
 bool parallel_cpus;
 
@@ -209,7 +210,7 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr);
 
 void cpu_gen_init(void)
 {
-    tcg_context_init(&tcg_ctx); 
+    tcg_context_init(&tcg_init_ctx);
 }
 
 /* Encode VAL as a signed leb128 sequence at P.
@@ -267,7 +268,7 @@ static target_long decode_sleb128(uint8_t **pp)
 
 static int encode_search(TranslationBlock *tb, uint8_t *block)
 {
-    uint8_t *highwater = tcg_ctx.code_gen_highwater;
+    uint8_t *highwater = tcg_ctx->code_gen_highwater;
     uint8_t *p = block;
     int i, j, n;
 
@@ -280,12 +281,12 @@ static int encode_search(TranslationBlock *tb, uint8_t *block)
             if (i == 0) {
                 prev = (j == 0 ? tb->pc : 0);
             } else {
-                prev = tcg_ctx.gen_insn_data[i - 1][j];
+                prev = tcg_ctx->gen_insn_data[i - 1][j];
             }
-            p = encode_sleb128(p, tcg_ctx.gen_insn_data[i][j] - prev);
+            p = encode_sleb128(p, tcg_ctx->gen_insn_data[i][j] - prev);
         }
-        prev = (i == 0 ? 0 : tcg_ctx.gen_insn_end_off[i - 1]);
-        p = encode_sleb128(p, tcg_ctx.gen_insn_end_off[i] - prev);
+        prev = (i == 0 ? 0 : tcg_ctx->gen_insn_end_off[i - 1]);
+        p = encode_sleb128(p, tcg_ctx->gen_insn_end_off[i] - prev);
 
         /* Test for (pending) buffer overflow.  The assumption is that any
            one row beginning below the high water mark cannot overrun
@@ -345,8 +346,8 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
     restore_state_to_opc(env, tb, data);
 
 #ifdef CONFIG_PROFILER
-    tcg_ctx.restore_time += profile_getclock() - ti;
-    tcg_ctx.restore_count++;
+    tcg_ctx->restore_time += profile_getclock() - ti;
+    tcg_ctx->restore_count++;
 #endif
     return 0;
 }
@@ -592,7 +593,7 @@ static inline void *split_cross_256mb(void *buf1, size_t size1)
         buf1 = buf2;
     }
 
-    tcg_ctx.code_gen_buffer_size = size1;
+    tcg_ctx->code_gen_buffer_size = size1;
     return buf1;
 }
 #endif
@@ -655,16 +656,16 @@ static inline void *alloc_code_gen_buffer(void)
     size = full_size - qemu_real_host_page_size;
 
     /* Honor a command-line option limiting the size of the buffer.  */
-    if (size > tcg_ctx.code_gen_buffer_size) {
-        size = (((uintptr_t)buf + tcg_ctx.code_gen_buffer_size)
+    if (size > tcg_ctx->code_gen_buffer_size) {
+        size = (((uintptr_t)buf + tcg_ctx->code_gen_buffer_size)
                 & qemu_real_host_page_mask) - (uintptr_t)buf;
     }
-    tcg_ctx.code_gen_buffer_size = size;
+    tcg_ctx->code_gen_buffer_size = size;
 
 #ifdef __mips__
     if (cross_256mb(buf, size)) {
         buf = split_cross_256mb(buf, size);
-        size = tcg_ctx.code_gen_buffer_size;
+        size = tcg_ctx->code_gen_buffer_size;
     }
 #endif
 
@@ -677,7 +678,7 @@ static inline void *alloc_code_gen_buffer(void)
 #elif defined(_WIN32)
 static inline void *alloc_code_gen_buffer(void)
 {
-    size_t size = tcg_ctx.code_gen_buffer_size;
+    size_t size = tcg_ctx->code_gen_buffer_size;
     void *buf1, *buf2;
 
     /* Perform the allocation in two steps, so that the guard page
@@ -696,7 +697,7 @@ static inline void *alloc_code_gen_buffer(void)
 {
     int flags = MAP_PRIVATE | MAP_ANONYMOUS;
     uintptr_t start = 0;
-    size_t size = tcg_ctx.code_gen_buffer_size;
+    size_t size = tcg_ctx->code_gen_buffer_size;
     void *buf;
 
     /* Constrain the position of the buffer based on the host cpu.
@@ -713,7 +714,7 @@ static inline void *alloc_code_gen_buffer(void)
     flags |= MAP_32BIT;
     /* Cannot expect to map more than 800MB in low memory.  */
     if (size > 800u * 1024 * 1024) {
-        tcg_ctx.code_gen_buffer_size = size = 800u * 1024 * 1024;
+        tcg_ctx->code_gen_buffer_size = size = 800u * 1024 * 1024;
     }
 # elif defined(__sparc__)
     start = 0x40000000ul;
@@ -753,7 +754,7 @@ static inline void *alloc_code_gen_buffer(void)
         default:
             /* Split the original buffer.  Free the smaller half.  */
             buf2 = split_cross_256mb(buf, size);
-            size2 = tcg_ctx.code_gen_buffer_size;
+            size2 = tcg_ctx->code_gen_buffer_size;
             if (buf == buf2) {
                 munmap(buf + size2 + qemu_real_host_page_size, size - size2);
             } else {
@@ -821,9 +822,9 @@ static gint tb_tc_cmp(gconstpointer ap, gconstpointer bp)
 
 static inline void code_gen_alloc(size_t tb_size)
 {
-    tcg_ctx.code_gen_buffer_size = size_code_gen_buffer(tb_size);
-    tcg_ctx.code_gen_buffer = alloc_code_gen_buffer();
-    if (tcg_ctx.code_gen_buffer == NULL) {
+    tcg_ctx->code_gen_buffer_size = size_code_gen_buffer(tb_size);
+    tcg_ctx->code_gen_buffer = alloc_code_gen_buffer();
+    if (tcg_ctx->code_gen_buffer == NULL) {
         fprintf(stderr, "Could not allocate dynamic translator buffer\n");
         exit(1);
     }
@@ -851,7 +852,7 @@ void tcg_exec_init(unsigned long tb_size)
 #if defined(CONFIG_SOFTMMU)
     /* There's no guest base to take into account, so go ahead and
        initialize the prologue now.  */
-    tcg_prologue_init(&tcg_ctx);
+    tcg_prologue_init(tcg_ctx);
 #endif
 }
 
@@ -867,7 +868,7 @@ static TranslationBlock *tb_alloc(target_ulong pc)
 
     assert_tb_locked();
 
-    tb = tcg_tb_alloc(&tcg_ctx);
+    tb = tcg_tb_alloc(tcg_ctx);
     if (unlikely(tb == NULL)) {
         return NULL;
     }
@@ -951,11 +952,11 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
 
         g_tree_foreach(tb_ctx.tb_tree, tb_host_size_iter, &host_size);
         printf("qemu: flush code_size=%td nb_tbs=%zu avg_tb_size=%zu\n",
-               tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer, nb_tbs,
+               tcg_ctx->code_gen_ptr - tcg_ctx->code_gen_buffer, nb_tbs,
                nb_tbs > 0 ? host_size / nb_tbs : 0);
     }
-    if ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)
-        > tcg_ctx.code_gen_buffer_size) {
+    if ((unsigned long)(tcg_ctx->code_gen_ptr - tcg_ctx->code_gen_buffer)
+        > tcg_ctx->code_gen_buffer_size) {
         cpu_abort(cpu, "Internal error: code buffer overflow\n");
     }
 
@@ -970,7 +971,7 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     qht_reset_size(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
-    tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
+    tcg_ctx->code_gen_ptr = tcg_ctx->code_gen_buffer;
     /* XXX: flush processor icache at this point if cache flush is
        expensive */
     atomic_mb_set(&tb_ctx.tb_flush_count, tb_ctx.tb_flush_count + 1);
@@ -1325,44 +1326,44 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
         cpu_loop_exit(cpu);
     }
 
-    gen_code_buf = tcg_ctx.code_gen_ptr;
+    gen_code_buf = tcg_ctx->code_gen_ptr;
     tb->tc.ptr = gen_code_buf;
     tb->pc = pc;
     tb->cs_base = cs_base;
     tb->flags = flags;
     tb->cflags = cflags;
     tb->trace_vcpu_dstate = *cpu->trace_dstate;
-    tcg_ctx.cf_parallel = !!(cflags & CF_PARALLEL);
+    tcg_ctx->cf_parallel = !!(cflags & CF_PARALLEL);
 
 #ifdef CONFIG_PROFILER
-    tcg_ctx.tb_count1++; /* includes aborted translations because of
+    tcg_ctx->tb_count1++; /* includes aborted translations because of
                        exceptions */
     ti = profile_getclock();
 #endif
 
-    tcg_func_start(&tcg_ctx);
+    tcg_func_start(tcg_ctx);
 
-    tcg_ctx.cpu = ENV_GET_CPU(env);
+    tcg_ctx->cpu = ENV_GET_CPU(env);
     gen_intermediate_code(env, tb);
-    tcg_ctx.cpu = NULL;
+    tcg_ctx->cpu = NULL;
 
     trace_translate_block(tb, tb->pc, tb->tc.ptr);
 
     /* generate machine code */
     tb->jmp_reset_offset[0] = TB_JMP_RESET_OFFSET_INVALID;
     tb->jmp_reset_offset[1] = TB_JMP_RESET_OFFSET_INVALID;
-    tcg_ctx.tb_jmp_reset_offset = tb->jmp_reset_offset;
+    tcg_ctx->tb_jmp_reset_offset = tb->jmp_reset_offset;
 #ifdef USE_DIRECT_JUMP
-    tcg_ctx.tb_jmp_insn_offset = tb->jmp_insn_offset;
-    tcg_ctx.tb_jmp_target_addr = NULL;
+    tcg_ctx->tb_jmp_insn_offset = tb->jmp_insn_offset;
+    tcg_ctx->tb_jmp_target_addr = NULL;
 #else
-    tcg_ctx.tb_jmp_insn_offset = NULL;
-    tcg_ctx.tb_jmp_target_addr = tb->jmp_target_addr;
+    tcg_ctx->tb_jmp_insn_offset = NULL;
+    tcg_ctx->tb_jmp_target_addr = tb->jmp_target_addr;
 #endif
 
 #ifdef CONFIG_PROFILER
-    tcg_ctx.tb_count++;
-    tcg_ctx.interm_time += profile_getclock() - ti;
+    tcg_ctx->tb_count++;
+    tcg_ctx->interm_time += profile_getclock() - ti;
     ti = profile_getclock();
 #endif
 
@@ -1371,7 +1372,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
        the tcg optimization currently hidden inside tcg_gen_code.  All
        that should be required is to flush the TBs, allocate a new TB,
        re-initialize it per above, and re-do the actual code generation.  */
-    gen_code_size = tcg_gen_code(&tcg_ctx, tb);
+    gen_code_size = tcg_gen_code(tcg_ctx, tb);
     if (unlikely(gen_code_size < 0)) {
         goto buffer_overflow;
     }
@@ -1382,10 +1383,10 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tb->tc.size = gen_code_size;
 
 #ifdef CONFIG_PROFILER
-    tcg_ctx.code_time += profile_getclock() - ti;
-    tcg_ctx.code_in_len += tb->size;
-    tcg_ctx.code_out_len += gen_code_size;
-    tcg_ctx.search_out_len += search_size;
+    tcg_ctx->code_time += profile_getclock() - ti;
+    tcg_ctx->code_in_len += tb->size;
+    tcg_ctx->code_out_len += gen_code_size;
+    tcg_ctx->search_out_len += search_size;
 #endif
 
 #ifdef DEBUG_DISAS
@@ -1400,7 +1401,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     }
 #endif
 
-    tcg_ctx.code_gen_ptr = (void *)
+    tcg_ctx->code_gen_ptr = (void *)
         ROUND_UP((uintptr_t)gen_code_buf + gen_code_size + search_size,
                  CODE_GEN_ALIGN);
 
@@ -1955,8 +1956,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
      * For avg host size we use the precise numbers from tb_tree_stats though.
      */
     cpu_fprintf(f, "gen code size       %td/%zd\n",
-                tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
-                tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer);
+                tcg_ctx->code_gen_ptr - tcg_ctx->code_gen_buffer,
+                tcg_ctx->code_gen_highwater - tcg_ctx->code_gen_buffer);
     cpu_fprintf(f, "TB count            %zu\n", nb_tbs);
     cpu_fprintf(f, "TB avg target size  %zu max=%zu bytes\n",
                 nb_tbs ? tst.target_size / nb_tbs : 0,
diff --git a/bsd-user/main.c b/bsd-user/main.c
index fa9c012..7a8b29e 100644
--- a/bsd-user/main.c
+++ b/bsd-user/main.c
@@ -978,7 +978,7 @@ int main(int argc, char **argv)
     /* Now that we've loaded the binary, GUEST_BASE is fixed.  Delay
        generating the prologue until now so that the prologue can take
        the real value of GUEST_BASE into account.  */
-    tcg_prologue_init(&tcg_ctx);
+    tcg_prologue_init(tcg_ctx);
 
     /* build Task State */
     memset(ts, 0, sizeof(TaskState));
diff --git a/linux-user/main.c b/linux-user/main.c
index 630c73d..ad4c6f5 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -4456,7 +4456,7 @@ int main(int argc, char **argv, char **envp)
     /* Now that we've loaded the binary, GUEST_BASE is fixed.  Delay
        generating the prologue until now so that the prologue can take
        the real value of GUEST_BASE into account.  */
-    tcg_prologue_init(&tcg_ctx);
+    tcg_prologue_init(tcg_ctx);
 
 #if defined(TARGET_I386)
     env->cr[0] = CR0_PG_MASK | CR0_WP_MASK | CR0_PE_MASK;
diff --git a/target/alpha/translate.c b/target/alpha/translate.c
index 96c527b..ee309bb 100644
--- a/target/alpha/translate.c
+++ b/target/alpha/translate.c
@@ -154,7 +154,7 @@ void alpha_translate_init(void)
     done_init = 1;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
 
     for (i = 0; i < 31; i++) {
         cpu_std_ir[i] = tcg_global_mem_new_i64(cpu_env,
diff --git a/target/arm/translate.c b/target/arm/translate.c
index 34aa95d..4fbbd71 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -82,7 +82,7 @@ void arm_translate_init(void)
     int i;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
 
     for (i = 0; i < 16; i++) {
         cpu_R[i] = tcg_global_mem_new_i32(cpu_env,
diff --git a/target/cris/translate.c b/target/cris/translate.c
index 0ee05ca..a503b96 100644
--- a/target/cris/translate.c
+++ b/target/cris/translate.c
@@ -3365,7 +3365,7 @@ void cris_initialize_tcg(void)
     int i;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
     cc_x = tcg_global_mem_new(cpu_env,
                               offsetof(CPUCRISState, cc_x), "cc_x");
     cc_src = tcg_global_mem_new(cpu_env,
diff --git a/target/cris/translate_v10.c b/target/cris/translate_v10.c
index 4a0b485..5d48920 100644
--- a/target/cris/translate_v10.c
+++ b/target/cris/translate_v10.c
@@ -1273,7 +1273,7 @@ void cris_initialize_crisv10_tcg(void)
     int i;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
     cc_x = tcg_global_mem_new(cpu_env,
                               offsetof(CPUCRISState, cc_x), "cc_x");
     cc_src = tcg_global_mem_new(cpu_env,
diff --git a/target/hppa/translate.c b/target/hppa/translate.c
index fde3dba..6f476db 100644
--- a/target/hppa/translate.c
+++ b/target/hppa/translate.c
@@ -145,7 +145,7 @@ void hppa_translate_init(void)
     done_init = 1;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
 
     TCGV_UNUSED(cpu_gr[0]);
     for (i = 1; i < 32; i++) {
diff --git a/target/i386/translate.c b/target/i386/translate.c
index c5e4d77..4da5a8f 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -8335,7 +8335,7 @@ void tcg_x86_init(void)
     initialized = true;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
     cpu_cc_op = tcg_global_mem_new_i32(cpu_env,
                                        offsetof(CPUX86State, cc_op), "cc_op");
     cpu_cc_dst = tcg_global_mem_new(cpu_env, offsetof(CPUX86State, cc_dst),
diff --git a/target/lm32/translate.c b/target/lm32/translate.c
index 692882f..b38e1e5 100644
--- a/target/lm32/translate.c
+++ b/target/lm32/translate.c
@@ -1203,7 +1203,7 @@ void lm32_translate_init(void)
     int i;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
 
     for (i = 0; i < ARRAY_SIZE(cpu_R); i++) {
         cpu_R[i] = tcg_global_mem_new(cpu_env,
diff --git a/target/m68k/translate.c b/target/m68k/translate.c
index 5cfa25f..6dd72bc 100644
--- a/target/m68k/translate.c
+++ b/target/m68k/translate.c
@@ -69,7 +69,7 @@ void m68k_tcg_init(void)
     int i;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
 
 #define DEFO32(name, offset) \
     QREG_##name = tcg_global_mem_new_i32(cpu_env, \
diff --git a/target/microblaze/translate.c b/target/microblaze/translate.c
index cb65d1e..cb2ef50 100644
--- a/target/microblaze/translate.c
+++ b/target/microblaze/translate.c
@@ -1861,7 +1861,7 @@ void mb_tcg_init(void)
     int i;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
 
     env_debug = tcg_global_mem_new(cpu_env,
                     offsetof(CPUMBState, debug),
diff --git a/target/mips/translate.c b/target/mips/translate.c
index a2f5385..2929e61 100644
--- a/target/mips/translate.c
+++ b/target/mips/translate.c
@@ -20139,7 +20139,7 @@ void mips_tcg_init(void)
         return;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
 
     TCGV_UNUSED(cpu_gpr[0]);
     for (i = 1; i < 32; i++)
diff --git a/target/moxie/translate.c b/target/moxie/translate.c
index 0660b44..5a5f62f 100644
--- a/target/moxie/translate.c
+++ b/target/moxie/translate.c
@@ -106,7 +106,7 @@ void moxie_translate_init(void)
         return;
     }
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
     cpu_pc = tcg_global_mem_new_i32(cpu_env,
                                     offsetof(CPUMoxieState, pc), "$pc");
     for (i = 0; i < 16; i++)
diff --git a/target/openrisc/translate.c b/target/openrisc/translate.c
index e49518e..27714f6 100644
--- a/target/openrisc/translate.c
+++ b/target/openrisc/translate.c
@@ -75,7 +75,7 @@ void openrisc_translate_init(void)
     int i;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
     cpu_sr = tcg_global_mem_new(cpu_env,
                                 offsetof(CPUOpenRISCState, sr), "sr");
     cpu_dflag = tcg_global_mem_new_i32(cpu_env,
diff --git a/target/ppc/translate.c b/target/ppc/translate.c
index c0cd64d..b842bd5 100644
--- a/target/ppc/translate.c
+++ b/target/ppc/translate.c
@@ -90,7 +90,7 @@ void ppc_translate_init(void)
         return;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
 
     p = cpu_reg_names;
     cpu_reg_names_size = sizeof(cpu_reg_names);
diff --git a/target/s390x/translate.c b/target/s390x/translate.c
index 6535f6c..b23356b 100644
--- a/target/s390x/translate.c
+++ b/target/s390x/translate.c
@@ -171,7 +171,7 @@ void s390x_translate_init(void)
     int i;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
     psw_addr = tcg_global_mem_new_i64(cpu_env,
                                       offsetof(CPUS390XState, psw.addr),
                                       "psw_addr");
diff --git a/target/sh4/translate.c b/target/sh4/translate.c
index 8bc132b..f745eb2 100644
--- a/target/sh4/translate.c
+++ b/target/sh4/translate.c
@@ -102,7 +102,7 @@ void sh4_translate_init(void)
         return;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
 
     for (i = 0; i < 24; i++)
         cpu_gregs[i] = tcg_global_mem_new_i32(cpu_env,
diff --git a/target/sparc/translate.c b/target/sparc/translate.c
index 0274e83..0b4ab3d 100644
--- a/target/sparc/translate.c
+++ b/target/sparc/translate.c
@@ -5933,7 +5933,7 @@ void gen_intermediate_code_init(CPUSPARCState *env)
     inited = 1;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
 
     cpu_regwptr = tcg_global_mem_new_ptr(cpu_env,
                                          offsetof(CPUSPARCState, regwptr),
diff --git a/target/tilegx/translate.c b/target/tilegx/translate.c
index ff2ef7b..913cbe4 100644
--- a/target/tilegx/translate.c
+++ b/target/tilegx/translate.c
@@ -2447,7 +2447,7 @@ void tilegx_tcg_init(void)
     int i;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
     cpu_pc = tcg_global_mem_new_i64(cpu_env, offsetof(CPUTLGState, pc), "pc");
     for (i = 0; i < TILEGX_R_COUNT; i++) {
         cpu_regs[i] = tcg_global_mem_new_i64(cpu_env,
diff --git a/target/tricore/translate.c b/target/tricore/translate.c
index ddd2dd0..bb56f03 100644
--- a/target/tricore/translate.c
+++ b/target/tricore/translate.c
@@ -8886,7 +8886,7 @@ void tricore_tcg_init(void)
         return;
     }
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
     /* reg init */
     for (i = 0 ; i < 16 ; i++) {
         cpu_gpr_a[i] = tcg_global_mem_new(cpu_env,
diff --git a/target/unicore32/translate.c b/target/unicore32/translate.c
index 666a201..5f36fb3 100644
--- a/target/unicore32/translate.c
+++ b/target/unicore32/translate.c
@@ -70,7 +70,7 @@ void uc32_translate_init(void)
     int i;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
 
     for (i = 0; i < 32; i++) {
         cpu_R[i] = tcg_global_mem_new_i32(cpu_env,
diff --git a/target/xtensa/translate.c b/target/xtensa/translate.c
index 2630024..c984fb4 100644
--- a/target/xtensa/translate.c
+++ b/target/xtensa/translate.c
@@ -218,7 +218,7 @@ void xtensa_translate_init(void)
     int i;
 
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
-    tcg_ctx.tcg_env = cpu_env;
+    tcg_ctx->tcg_env = cpu_env;
     cpu_pc = tcg_global_mem_new_i32(cpu_env,
             offsetof(CPUXtensaState, pc), "pc");
 
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index ef420d4..4a7057e 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -150,8 +150,8 @@ void tcg_gen_op6(TCGContext *ctx, TCGOpcode opc, TCGArg a1, TCGArg a2,
 
 void tcg_gen_mb(TCGBar mb_type)
 {
-    if (tcg_ctx.cf_parallel) {
-        tcg_gen_op1(&tcg_ctx, INDEX_op_mb, mb_type);
+    if (tcg_ctx->cf_parallel) {
+        tcg_gen_op1(tcg_ctx, INDEX_op_mb, mb_type);
     }
 }
 
@@ -2486,7 +2486,7 @@ void tcg_gen_extrl_i64_i32(TCGv_i32 ret, TCGv_i64 arg)
     if (TCG_TARGET_REG_BITS == 32) {
         tcg_gen_mov_i32(ret, TCGV_LOW(arg));
     } else if (TCG_TARGET_HAS_extrl_i64_i32) {
-        tcg_gen_op2(&tcg_ctx, INDEX_op_extrl_i64_i32,
+        tcg_gen_op2(tcg_ctx, INDEX_op_extrl_i64_i32,
                     GET_TCGV_I32(ret), GET_TCGV_I64(arg));
     } else {
         tcg_gen_mov_i32(ret, MAKE_TCGV_I32(GET_TCGV_I64(arg)));
@@ -2498,7 +2498,7 @@ void tcg_gen_extrh_i64_i32(TCGv_i32 ret, TCGv_i64 arg)
     if (TCG_TARGET_REG_BITS == 32) {
         tcg_gen_mov_i32(ret, TCGV_HIGH(arg));
     } else if (TCG_TARGET_HAS_extrh_i64_i32) {
-        tcg_gen_op2(&tcg_ctx, INDEX_op_extrh_i64_i32,
+        tcg_gen_op2(tcg_ctx, INDEX_op_extrh_i64_i32,
                     GET_TCGV_I32(ret), GET_TCGV_I64(arg));
     } else {
         TCGv_i64 t = tcg_temp_new_i64();
@@ -2514,7 +2514,7 @@ void tcg_gen_extu_i32_i64(TCGv_i64 ret, TCGv_i32 arg)
         tcg_gen_mov_i32(TCGV_LOW(ret), arg);
         tcg_gen_movi_i32(TCGV_HIGH(ret), 0);
     } else {
-        tcg_gen_op2(&tcg_ctx, INDEX_op_extu_i32_i64,
+        tcg_gen_op2(tcg_ctx, INDEX_op_extu_i32_i64,
                     GET_TCGV_I64(ret), GET_TCGV_I32(arg));
     }
 }
@@ -2525,7 +2525,7 @@ void tcg_gen_ext_i32_i64(TCGv_i64 ret, TCGv_i32 arg)
         tcg_gen_mov_i32(TCGV_LOW(ret), arg);
         tcg_gen_sari_i32(TCGV_HIGH(ret), TCGV_LOW(ret), 31);
     } else {
-        tcg_gen_op2(&tcg_ctx, INDEX_op_ext_i32_i64,
+        tcg_gen_op2(tcg_ctx, INDEX_op_ext_i32_i64,
                     GET_TCGV_I64(ret), GET_TCGV_I32(arg));
     }
 }
@@ -2581,8 +2581,8 @@ void tcg_gen_goto_tb(unsigned idx)
     tcg_debug_assert(idx <= 1);
 #ifdef CONFIG_DEBUG_TCG
     /* Verify that we haven't seen this numbered exit before.  */
-    tcg_debug_assert((tcg_ctx.goto_tb_issue_mask & (1 << idx)) == 0);
-    tcg_ctx.goto_tb_issue_mask |= 1 << idx;
+    tcg_debug_assert((tcg_ctx->goto_tb_issue_mask & (1 << idx)) == 0);
+    tcg_ctx->goto_tb_issue_mask |= 1 << idx;
 #endif
     tcg_gen_op1i(INDEX_op_goto_tb, idx);
 }
@@ -2591,7 +2591,7 @@ void tcg_gen_lookup_and_goto_ptr(void)
 {
     if (TCG_TARGET_HAS_goto_ptr && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
         TCGv_ptr ptr = tcg_temp_new_ptr();
-        gen_helper_lookup_tb_ptr(ptr, tcg_ctx.tcg_env);
+        gen_helper_lookup_tb_ptr(ptr, tcg_ctx->tcg_env);
         tcg_gen_op1i(INDEX_op_goto_ptr, GET_TCGV_PTR(ptr));
         tcg_temp_free_ptr(ptr);
     } else {
@@ -2637,7 +2637,7 @@ static void gen_ldst_i32(TCGOpcode opc, TCGv_i32 val, TCGv addr,
     if (TCG_TARGET_REG_BITS == 32) {
         tcg_gen_op4i_i32(opc, val, TCGV_LOW(addr), TCGV_HIGH(addr), oi);
     } else {
-        tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_I32(val), GET_TCGV_I64(addr), oi);
+        tcg_gen_op3(tcg_ctx, opc, GET_TCGV_I32(val), GET_TCGV_I64(addr), oi);
     }
 #endif
 }
@@ -2650,7 +2650,7 @@ static void gen_ldst_i64(TCGOpcode opc, TCGv_i64 val, TCGv addr,
     if (TCG_TARGET_REG_BITS == 32) {
         tcg_gen_op4i_i32(opc, TCGV_LOW(val), TCGV_HIGH(val), addr, oi);
     } else {
-        tcg_gen_op3(&tcg_ctx, opc, GET_TCGV_I64(val), GET_TCGV_I32(addr), oi);
+        tcg_gen_op3(tcg_ctx, opc, GET_TCGV_I64(val), GET_TCGV_I32(addr), oi);
     }
 #else
     if (TCG_TARGET_REG_BITS == 32) {
@@ -2665,7 +2665,7 @@ static void gen_ldst_i64(TCGOpcode opc, TCGv_i64 val, TCGv addr,
 void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
     memop = tcg_canonicalize_memop(memop, 0, 0);
-    trace_guest_mem_before_tcg(tcg_ctx.cpu, tcg_ctx.tcg_env,
+    trace_guest_mem_before_tcg(tcg_ctx->cpu, tcg_ctx->tcg_env,
                                addr, trace_mem_get_info(memop, 0));
     gen_ldst_i32(INDEX_op_qemu_ld_i32, val, addr, memop, idx);
 }
@@ -2673,7 +2673,7 @@ void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
     memop = tcg_canonicalize_memop(memop, 0, 1);
-    trace_guest_mem_before_tcg(tcg_ctx.cpu, tcg_ctx.tcg_env,
+    trace_guest_mem_before_tcg(tcg_ctx->cpu, tcg_ctx->tcg_env,
                                addr, trace_mem_get_info(memop, 1));
     gen_ldst_i32(INDEX_op_qemu_st_i32, val, addr, memop, idx);
 }
@@ -2691,7 +2691,7 @@ void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
     }
 
     memop = tcg_canonicalize_memop(memop, 1, 0);
-    trace_guest_mem_before_tcg(tcg_ctx.cpu, tcg_ctx.tcg_env,
+    trace_guest_mem_before_tcg(tcg_ctx->cpu, tcg_ctx->tcg_env,
                                addr, trace_mem_get_info(memop, 0));
     gen_ldst_i64(INDEX_op_qemu_ld_i64, val, addr, memop, idx);
 }
@@ -2704,7 +2704,7 @@ void tcg_gen_qemu_st_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
     }
 
     memop = tcg_canonicalize_memop(memop, 1, 1);
-    trace_guest_mem_before_tcg(tcg_ctx.cpu, tcg_ctx.tcg_env,
+    trace_guest_mem_before_tcg(tcg_ctx->cpu, tcg_ctx->tcg_env,
                                addr, trace_mem_get_info(memop, 1));
     gen_ldst_i64(INDEX_op_qemu_st_i64, val, addr, memop, idx);
 }
@@ -2794,7 +2794,7 @@ void tcg_gen_atomic_cmpxchg_i32(TCGv_i32 retv, TCGv addr, TCGv_i32 cmpv,
 {
     memop = tcg_canonicalize_memop(memop, 0, 0);
 
-    if (!tcg_ctx.cf_parallel) {
+    if (!tcg_ctx->cf_parallel) {
         TCGv_i32 t1 = tcg_temp_new_i32();
         TCGv_i32 t2 = tcg_temp_new_i32();
 
@@ -2820,11 +2820,11 @@ void tcg_gen_atomic_cmpxchg_i32(TCGv_i32 retv, TCGv addr, TCGv_i32 cmpv,
 #ifdef CONFIG_SOFTMMU
         {
             TCGv_i32 oi = tcg_const_i32(make_memop_idx(memop & ~MO_SIGN, idx));
-            gen(retv, tcg_ctx.tcg_env, addr, cmpv, newv, oi);
+            gen(retv, tcg_ctx->tcg_env, addr, cmpv, newv, oi);
             tcg_temp_free_i32(oi);
         }
 #else
-        gen(retv, tcg_ctx.tcg_env, addr, cmpv, newv);
+        gen(retv, tcg_ctx->tcg_env, addr, cmpv, newv);
 #endif
 
         if (memop & MO_SIGN) {
@@ -2838,7 +2838,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, TCGv_i64 cmpv,
 {
     memop = tcg_canonicalize_memop(memop, 1, 0);
 
-    if (!tcg_ctx.cf_parallel) {
+    if (!tcg_ctx->cf_parallel) {
         TCGv_i64 t1 = tcg_temp_new_i64();
         TCGv_i64 t2 = tcg_temp_new_i64();
 
@@ -2865,14 +2865,14 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, TCGv_i64 cmpv,
 #ifdef CONFIG_SOFTMMU
         {
             TCGv_i32 oi = tcg_const_i32(make_memop_idx(memop, idx));
-            gen(retv, tcg_ctx.tcg_env, addr, cmpv, newv, oi);
+            gen(retv, tcg_ctx->tcg_env, addr, cmpv, newv, oi);
             tcg_temp_free_i32(oi);
         }
 #else
-        gen(retv, tcg_ctx.tcg_env, addr, cmpv, newv);
+        gen(retv, tcg_ctx->tcg_env, addr, cmpv, newv);
 #endif
 #else
-        gen_helper_exit_atomic(tcg_ctx.tcg_env);
+        gen_helper_exit_atomic(tcg_ctx->tcg_env);
         /* Produce a result, so that we have a well-formed opcode stream
            with respect to uses of the result in the (dead) code following.  */
         tcg_gen_movi_i64(retv, 0);
@@ -2928,11 +2928,11 @@ static void do_atomic_op_i32(TCGv_i32 ret, TCGv addr, TCGv_i32 val,
 #ifdef CONFIG_SOFTMMU
     {
         TCGv_i32 oi = tcg_const_i32(make_memop_idx(memop & ~MO_SIGN, idx));
-        gen(ret, tcg_ctx.tcg_env, addr, val, oi);
+        gen(ret, tcg_ctx->tcg_env, addr, val, oi);
         tcg_temp_free_i32(oi);
     }
 #else
-    gen(ret, tcg_ctx.tcg_env, addr, val);
+    gen(ret, tcg_ctx->tcg_env, addr, val);
 #endif
 
     if (memop & MO_SIGN) {
@@ -2973,14 +2973,14 @@ static void do_atomic_op_i64(TCGv_i64 ret, TCGv addr, TCGv_i64 val,
 #ifdef CONFIG_SOFTMMU
         {
             TCGv_i32 oi = tcg_const_i32(make_memop_idx(memop & ~MO_SIGN, idx));
-            gen(ret, tcg_ctx.tcg_env, addr, val, oi);
+            gen(ret, tcg_ctx->tcg_env, addr, val, oi);
             tcg_temp_free_i32(oi);
         }
 #else
-        gen(ret, tcg_ctx.tcg_env, addr, val);
+        gen(ret, tcg_ctx->tcg_env, addr, val);
 #endif
 #else
-        gen_helper_exit_atomic(tcg_ctx.tcg_env);
+        gen_helper_exit_atomic(tcg_ctx->tcg_env);
         /* Produce a result, so that we have a well-formed opcode stream
            with respect to uses of the result in the (dead) code following.  */
         tcg_gen_movi_i64(ret, 0);
@@ -3015,7 +3015,7 @@ static void * const table_##NAME[16] = {                                \
 void tcg_gen_atomic_##NAME##_i32                                        \
     (TCGv_i32 ret, TCGv addr, TCGv_i32 val, TCGArg idx, TCGMemOp memop) \
 {                                                                       \
-    if (tcg_ctx.cf_parallel) {                                          \
+    if (tcg_ctx->cf_parallel) {                                         \
         do_atomic_op_i32(ret, addr, val, idx, memop, table_##NAME);     \
     } else {                                                            \
         do_nonatomic_op_i32(ret, addr, val, idx, memop, NEW,            \
@@ -3025,7 +3025,7 @@ void tcg_gen_atomic_##NAME##_i32                                        \
 void tcg_gen_atomic_##NAME##_i64                                        \
     (TCGv_i64 ret, TCGv addr, TCGv_i64 val, TCGArg idx, TCGMemOp memop) \
 {                                                                       \
-    if (tcg_ctx.cf_parallel) {                                          \
+    if (tcg_ctx->cf_parallel) {                                         \
         do_atomic_op_i64(ret, addr, val, idx, memop, table_##NAME);     \
     } else {                                                            \
         do_nonatomic_op_i64(ret, addr, val, idx, memop, NEW,            \
diff --git a/tcg/tcg-runtime.c b/tcg/tcg-runtime.c
index 08fe077..2f80b19 100644
--- a/tcg/tcg-runtime.c
+++ b/tcg/tcg-runtime.c
@@ -153,7 +153,7 @@ void *HELPER(lookup_tb_ptr)(CPUArchState *env)
 
     tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, curr_cf_mask());
     if (tb == NULL) {
-        return tcg_ctx.code_gen_epilogue;
+        return tcg_ctx->code_gen_epilogue;
     }
     qemu_log_mask_and_addr(CPU_LOG_EXEC, pc,
                            "Chain %p [%d: " TARGET_FMT_lx "] %s\n",
diff --git a/tcg/tcg.c b/tcg/tcg.c
index c0c2d6c..f907c47 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -116,7 +116,6 @@ static void tcg_out_tb_init(TCGContext *s);
 static bool tcg_out_tb_finalize(TCGContext *s);
 
 
-
 static TCGRegSet tcg_target_available_regs[2];
 static TCGRegSet tcg_target_call_clobber_regs;
 
@@ -242,7 +241,7 @@ static void tcg_out_label(TCGContext *s, TCGLabel *l, tcg_insn_unit *ptr)
 
 TCGLabel *gen_new_label(void)
 {
-    TCGContext *s = &tcg_ctx;
+    TCGContext *s = tcg_ctx;
     TCGLabel *l = tcg_malloc(sizeof(TCGLabel));
 
     *l = (TCGLabel){
@@ -381,6 +380,8 @@ void tcg_context_init(TCGContext *s)
     for (; i < ARRAY_SIZE(tcg_target_reg_alloc_order); ++i) {
         indirect_reg_alloc_order[i] = tcg_target_reg_alloc_order[i];
     }
+
+    tcg_ctx = s;
 }
 
 /*
@@ -526,7 +527,7 @@ void tcg_set_frame(TCGContext *s, TCGReg reg, intptr_t start, intptr_t size)
 
 TCGv_i32 tcg_global_reg_new_i32(TCGReg reg, const char *name)
 {
-    TCGContext *s = &tcg_ctx;
+    TCGContext *s = tcg_ctx;
     int idx;
 
     if (tcg_regset_test_reg(s->reserved_regs, reg)) {
@@ -538,7 +539,7 @@ TCGv_i32 tcg_global_reg_new_i32(TCGReg reg, const char *name)
 
 TCGv_i64 tcg_global_reg_new_i64(TCGReg reg, const char *name)
 {
-    TCGContext *s = &tcg_ctx;
+    TCGContext *s = tcg_ctx;
     int idx;
 
     if (tcg_regset_test_reg(s->reserved_regs, reg)) {
@@ -551,7 +552,7 @@ TCGv_i64 tcg_global_reg_new_i64(TCGReg reg, const char *name)
 int tcg_global_mem_new_internal(TCGType type, TCGv_ptr base,
                                 intptr_t offset, const char *name)
 {
-    TCGContext *s = &tcg_ctx;
+    TCGContext *s = tcg_ctx;
     TCGTemp *base_ts = &s->temps[GET_TCGV_PTR(base)];
     TCGTemp *ts = tcg_global_alloc(s);
     int indirect_reg = 0, bigendian = 0;
@@ -606,7 +607,7 @@ int tcg_global_mem_new_internal(TCGType type, TCGv_ptr base,
 
 static int tcg_temp_new_internal(TCGType type, int temp_local)
 {
-    TCGContext *s = &tcg_ctx;
+    TCGContext *s = tcg_ctx;
     TCGTemp *ts;
     int idx, k;
 
@@ -668,7 +669,7 @@ TCGv_i64 tcg_temp_new_internal_i64(int temp_local)
 
 static void tcg_temp_free_internal(int idx)
 {
-    TCGContext *s = &tcg_ctx;
+    TCGContext *s = tcg_ctx;
     TCGTemp *ts;
     int k;
 
@@ -733,13 +734,13 @@ TCGv_i64 tcg_const_local_i64(int64_t val)
 #if defined(CONFIG_DEBUG_TCG)
 void tcg_clear_temp_count(void)
 {
-    TCGContext *s = &tcg_ctx;
+    TCGContext *s = tcg_ctx;
     s->temps_in_use = 0;
 }
 
 int tcg_check_temp_count(void)
 {
-    TCGContext *s = &tcg_ctx;
+    TCGContext *s = tcg_ctx;
     if (s->temps_in_use) {
         /* Clear the count so that we don't give another
          * warning immediately next time around.
@@ -2707,7 +2708,7 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
 #ifdef CONFIG_PROFILER
 void tcg_dump_info(FILE *f, fprintf_function cpu_fprintf)
 {
-    TCGContext *s = &tcg_ctx;
+    TCGContext *s = tcg_ctx;
     int64_t tb_count = s->tb_count;
     int64_t tb_div_count = tb_count ? tb_count : 1;
     int64_t tot = s->interm_time + s->code_time;
-- 
2.7.4


* [Qemu-devel] [PATCH v2 35/45] gen-icount: fold exitreq_label into TCGContext
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (33 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 34/45] tcg: define tcg_init_ctx and make tcg_ctx a pointer Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 36/45] tcg: dynamically allocate optimizer globals + fold " Emilio G. Cota
                   ` (9 subsequent siblings)
  44 siblings, 0 replies; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Groundwork for supporting multiple TCG contexts.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/gen-icount.h | 7 +++----
 tcg/tcg.h                 | 2 ++
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/exec/gen-icount.h b/include/exec/gen-icount.h
index 4a55da8..7723aa0 100644
--- a/include/exec/gen-icount.h
+++ b/include/exec/gen-icount.h
@@ -6,13 +6,12 @@
 /* Helpers for instruction counting code generation.  */
 
 static int icount_start_insn_idx;
-static TCGLabel *exitreq_label;
 
 static inline void gen_tb_start(TranslationBlock *tb)
 {
     TCGv_i32 count, imm;
 
-    exitreq_label = gen_new_label();
+    tcg_ctx->exitreq_label = gen_new_label();
     if (tb->cflags & CF_USE_ICOUNT) {
         count = tcg_temp_local_new_i32();
     } else {
@@ -34,7 +33,7 @@ static inline void gen_tb_start(TranslationBlock *tb)
         tcg_temp_free_i32(imm);
     }
 
-    tcg_gen_brcondi_i32(TCG_COND_LT, count, 0, exitreq_label);
+    tcg_gen_brcondi_i32(TCG_COND_LT, count, 0, tcg_ctx->exitreq_label);
 
     if (tb->cflags & CF_USE_ICOUNT) {
         tcg_gen_st16_i32(count, tcg_ctx->tcg_env,
@@ -52,7 +51,7 @@ static inline void gen_tb_end(TranslationBlock *tb, int num_insns)
         tcg_set_insn_param(icount_start_insn_idx, 1, num_insns);
     }
 
-    gen_set_label(exitreq_label);
+    gen_set_label(tcg_ctx->exitreq_label);
     tcg_gen_exit_tb((uintptr_t)tb + TB_EXIT_REQUESTED);
 
     /* Terminate the linked list.  */
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 6913d4b..569f823 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -712,6 +712,8 @@ struct TCGContext {
     /* The TCGBackendData structure is private to tcg-target.inc.c.  */
     struct TCGBackendData *be;
 
+    TCGLabel *exitreq_label;
+
     TCGTempSet free_temps[TCG_TYPE_COUNT * 2];
     TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [Qemu-devel] [PATCH v2 36/45] tcg: dynamically allocate optimizer globals + fold into TCGContext
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (34 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 35/45] gen-icount: fold exitreq_label into TCGContext Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  3:53   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 37/45] tcg: introduce **tcg_ctxs to keep track of all TCGContext's Emilio G. Cota
                   ` (8 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Groundwork for supporting multiple TCG contexts.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.h      | 12 ++++++++++++
 tcg/optimize.c | 40 +++++++++++++++++++++++-----------------
 2 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 569f823..175d4de 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -641,6 +641,14 @@ QEMU_BUILD_BUG_ON(OPPARAM_BUF_SIZE > (1 << 14));
 /* Make sure that we don't overflow 64 bits without noticing.  */
 QEMU_BUILD_BUG_ON(sizeof(TCGOp) > 8);
 
+struct tcg_temp_info {
+    bool is_const;
+    uint16_t prev_copy;
+    uint16_t next_copy;
+    tcg_target_ulong val;
+    tcg_target_ulong mask;
+};
+
 struct TCGContext {
     uint8_t *pool_cur, *pool_end;
     TCGPool *pool_first, *pool_current, *pool_first_large;
@@ -717,6 +725,10 @@ struct TCGContext {
     TCGTempSet free_temps[TCG_TYPE_COUNT * 2];
     TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */
 
+    /* optimizer */
+    struct tcg_temp_info *opt_temps;
+    TCGTempSet opt_temps_used;
+
     /* Tells which temporary holds a given register.
        It does not take into account fixed registers */
     TCGTemp *reg_to_temp[TCG_TARGET_NB_REGS];
diff --git a/tcg/optimize.c b/tcg/optimize.c
index adfc56c..61ca870 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -32,30 +32,21 @@
         glue(glue(case INDEX_op_, x), _i32):    \
         glue(glue(case INDEX_op_, x), _i64)
 
-struct tcg_temp_info {
-    bool is_const;
-    uint16_t prev_copy;
-    uint16_t next_copy;
-    tcg_target_ulong val;
-    tcg_target_ulong mask;
-};
-
-static struct tcg_temp_info temps[TCG_MAX_TEMPS];
-static TCGTempSet temps_used;
-
 static inline bool temp_is_const(TCGArg arg)
 {
-    return temps[arg].is_const;
+    return tcg_ctx->opt_temps[arg].is_const;
 }
 
 static inline bool temp_is_copy(TCGArg arg)
 {
-    return temps[arg].next_copy != arg;
+    return tcg_ctx->opt_temps[arg].next_copy != arg;
 }
 
 /* Reset TEMP's state, possibly removing the temp for the list of copies.  */
 static void reset_temp(TCGArg temp)
 {
+    struct tcg_temp_info *temps = tcg_ctx->opt_temps;
+
     temps[temps[temp].next_copy].prev_copy = temps[temp].prev_copy;
     temps[temps[temp].prev_copy].next_copy = temps[temp].next_copy;
     temps[temp].next_copy = temp;
@@ -67,18 +58,20 @@ static void reset_temp(TCGArg temp)
 /* Reset all temporaries, given that there are NB_TEMPS of them.  */
 static void reset_all_temps(int nb_temps)
 {
-    bitmap_zero(temps_used.l, nb_temps);
+    bitmap_zero(tcg_ctx->opt_temps_used.l, nb_temps);
 }
 
 /* Initialize and activate a temporary.  */
 static void init_temp_info(TCGArg temp)
 {
-    if (!test_bit(temp, temps_used.l)) {
+    struct tcg_temp_info *temps = tcg_ctx->opt_temps;
+
+    if (!test_bit(temp, tcg_ctx->opt_temps_used.l)) {
         temps[temp].next_copy = temp;
         temps[temp].prev_copy = temp;
         temps[temp].is_const = false;
         temps[temp].mask = -1;
-        set_bit(temp, temps_used.l);
+        set_bit(temp, tcg_ctx->opt_temps_used.l);
     }
 }
 
@@ -118,6 +111,7 @@ static TCGOpcode op_to_movi(TCGOpcode op)
 
 static TCGArg find_better_copy(TCGContext *s, TCGArg temp)
 {
+    struct tcg_temp_info *temps = tcg_ctx->opt_temps;
     TCGArg i;
 
     /* If this is already a global, we can't do better. */
@@ -147,6 +141,7 @@ static TCGArg find_better_copy(TCGContext *s, TCGArg temp)
 
 static bool temps_are_copies(TCGArg arg1, TCGArg arg2)
 {
+    struct tcg_temp_info *temps = tcg_ctx->opt_temps;
     TCGArg i;
 
     if (arg1 == arg2) {
@@ -169,6 +164,7 @@ static bool temps_are_copies(TCGArg arg1, TCGArg arg2)
 static void tcg_opt_gen_movi(TCGContext *s, TCGOp *op, TCGArg *args,
                              TCGArg dst, TCGArg val)
 {
+    struct tcg_temp_info *temps = tcg_ctx->opt_temps;
     TCGOpcode new_op = op_to_movi(op->opc);
     tcg_target_ulong mask;
 
@@ -196,6 +192,7 @@ static void tcg_opt_gen_mov(TCGContext *s, TCGOp *op, TCGArg *args,
         return;
     }
 
+    struct tcg_temp_info *temps = tcg_ctx->opt_temps;
     TCGOpcode new_op = op_to_mov(op->opc);
     tcg_target_ulong mask;
 
@@ -466,6 +463,8 @@ static bool do_constant_folding_cond_eq(TCGCond c)
 static TCGArg do_constant_folding_cond(TCGOpcode op, TCGArg x,
                                        TCGArg y, TCGCond c)
 {
+    struct tcg_temp_info *temps = tcg_ctx->opt_temps;
+
     if (temp_is_const(x) && temp_is_const(y)) {
         switch (op_bits(op)) {
         case 32:
@@ -494,6 +493,7 @@ static TCGArg do_constant_folding_cond(TCGOpcode op, TCGArg x,
    of the condition (0 or 1) if it can */
 static TCGArg do_constant_folding_cond2(TCGArg *p1, TCGArg *p2, TCGCond c)
 {
+    struct tcg_temp_info *temps = tcg_ctx->opt_temps;
     TCGArg al = p1[0], ah = p1[1];
     TCGArg bl = p2[0], bh = p2[1];
 
@@ -558,9 +558,15 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2)
 /* Propagate constants and copies, fold constant expressions. */
 void tcg_optimize(TCGContext *s)
 {
+    struct tcg_temp_info *temps;
     int oi, oi_next, nb_temps, nb_globals;
     TCGArg *prev_mb_args = NULL;
 
+    if (tcg_ctx->opt_temps == NULL) {
+        tcg_ctx->opt_temps = g_new(struct tcg_temp_info, TCG_MAX_TEMPS);
+    }
+    temps = tcg_ctx->opt_temps;
+
     /* Array VALS has an element for each temp.
        If this temp holds a constant then its value is kept in VALS' element.
        If this temp is a copy of other ones then the other copies are
@@ -1360,7 +1366,7 @@ void tcg_optimize(TCGContext *s)
             if (!(args[nb_oargs + nb_iargs + 1]
                   & (TCG_CALL_NO_READ_GLOBALS | TCG_CALL_NO_WRITE_GLOBALS))) {
                 for (i = 0; i < nb_globals; i++) {
-                    if (test_bit(i, temps_used.l)) {
+                    if (test_bit(i, tcg_ctx->opt_temps_used.l)) {
                         reset_temp(i);
                     }
                 }
-- 
2.7.4


* [Qemu-devel] [PATCH v2 37/45] tcg: introduce **tcg_ctxs to keep track of all TCGContext's
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (35 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 36/45] tcg: dynamically allocate optimizer globals + fold " Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  4:17   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 38/45] tcg: distribute profiling counters across TCGContext's Emilio G. Cota
                   ` (7 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Groundwork for supporting multiple TCG contexts.

Note that n_tcg_ctxs is not strictly necessary. However, it is
convenient to have, since it simplifies iterating over the array:
a plain for loop instead of walking a NULL-terminated array (which
would require n+1 elements) or checking with ifdefs for
usermode/softmmu. A sketch of the two styles follows.
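
A minimal standalone sketch of the two iteration styles (illustrative
only; the real tcg_ctxs/n_tcg_ctxs are static to tcg.c, and
visit_all_ctxs is a made-up helper):

    /* hedged sketch, not part of the patch */
    typedef struct TCGContext TCGContext;

    static TCGContext **tcg_ctxs;
    static unsigned int n_tcg_ctxs;

    static void visit_all_ctxs(void (*f)(TCGContext *))
    {
        unsigned int i;

        /* with an explicit count: a plain for loop */
        for (i = 0; i < n_tcg_ctxs; i++) {
            f(tcg_ctxs[i]);
        }
        /*
         * Without the count we would need a NULL-terminated array of
         * n + 1 elements, or #ifdef'd code to special-case usermode
         * (one context) vs softmmu (one context per vCPU thread).
         */
    }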

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index f907c47..8094278 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -115,6 +115,8 @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
 static void tcg_out_tb_init(TCGContext *s);
 static bool tcg_out_tb_finalize(TCGContext *s);
 
+static TCGContext **tcg_ctxs;
+static unsigned int n_tcg_ctxs;
 
 static TCGRegSet tcg_target_available_regs[2];
 static TCGRegSet tcg_target_call_clobber_regs;
@@ -323,6 +325,13 @@ static GHashTable *helper_table;
 static int indirect_reg_alloc_order[ARRAY_SIZE(tcg_target_reg_alloc_order)];
 static void process_op_defs(TCGContext *s);
 
+static void tcg_ctxs_init(TCGContext *s)
+{
+    tcg_ctxs = g_new(TCGContext *, 1);
+    tcg_ctxs[0] = s;
+    n_tcg_ctxs = 1;
+}
+
 void tcg_context_init(TCGContext *s)
 {
     int op, total_args, n, i;
@@ -381,6 +390,7 @@ void tcg_context_init(TCGContext *s)
         indirect_reg_alloc_order[i] = tcg_target_reg_alloc_order[i];
     }
 
+    tcg_ctxs_init(s);
     tcg_ctx = s;
 }
 
-- 
2.7.4


* [Qemu-devel] [PATCH v2 38/45] tcg: distribute profiling counters across TCGContext's
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (36 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 37/45] tcg: introduce **tcg_ctxs to keep track of all TCGContext's Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  4:20   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 39/45] osdep: move qemu_real_host_page_size/mask to osdep Emilio G. Cota
                   ` (6 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This is groundwork for supporting multiple TCG contexts.

To avoid scalability issues when profiling info is enabled, this patch
distributes the profiling counters across TCG contexts via the
following changes:

1) Consolidate profile info into its own struct, TCGProfile, which
   TCGContext also includes. Note that tcg_table_op_count is brought
   into TCGProfile after dropping the tcg_ prefix.
2) Iterate over the TCG contexts in the system to obtain the total counts.

This change also requires updating accesses to TCGProfile fields to
use atomic_read/set wherever conflicting accesses (as defined in C11)
may occur; a sketch of the accessor pattern follows.
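
As a rough sketch of that pattern (using GCC/Clang atomic builtins to
stand in for QEMU's atomic_read/atomic_set macros, which compile down
to relaxed loads/stores; prof_add64 is a made-up name):

    #include <stdint.h>

    /*
     * Each counter has a single writer (the vCPU thread owning the
     * TCGContext), so a racy read-modify-write is fine here; the
     * atomics only guarantee tear-free loads and stores for
     * concurrent readers, e.g. code summing counts across contexts.
     */
    static inline void prof_add64(int64_t *counter, int64_t delta)
    {
        int64_t old = __atomic_load_n(counter, __ATOMIC_RELAXED);

        __atomic_store_n(counter, old + delta, __ATOMIC_RELAXED);
    }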

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.h                 |  38 +++++++++-------
 accel/tcg/translate-all.c |  23 +++++-----
 tcg/tcg.c                 | 110 ++++++++++++++++++++++++++++++++++++++--------
 3 files changed, 126 insertions(+), 45 deletions(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 175d4de..9d17584 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -649,6 +649,26 @@ struct tcg_temp_info {
     tcg_target_ulong mask;
 };
 
+typedef struct TCGProfile {
+    int64_t tb_count1;
+    int64_t tb_count;
+    int64_t op_count; /* total insn count */
+    int op_count_max; /* max insn per TB */
+    int64_t temp_count;
+    int temp_count_max;
+    int64_t del_op_count;
+    int64_t code_in_len;
+    int64_t code_out_len;
+    int64_t search_out_len;
+    int64_t interm_time;
+    int64_t code_time;
+    int64_t la_time;
+    int64_t opt_time;
+    int64_t restore_count;
+    int64_t restore_time;
+    int64_t table_op_count[NB_OPS];
+} TCGProfile;
+
 struct TCGContext {
     uint8_t *pool_cur, *pool_end;
     TCGPool *pool_first, *pool_current, *pool_first_large;
@@ -673,23 +693,7 @@ struct TCGContext {
     tcg_insn_unit *code_ptr;
 
 #ifdef CONFIG_PROFILER
-    /* profiling info */
-    int64_t tb_count1;
-    int64_t tb_count;
-    int64_t op_count; /* total insn count */
-    int op_count_max; /* max insn per TB */
-    int64_t temp_count;
-    int temp_count_max;
-    int64_t del_op_count;
-    int64_t code_in_len;
-    int64_t code_out_len;
-    int64_t search_out_len;
-    int64_t interm_time;
-    int64_t code_time;
-    int64_t la_time;
-    int64_t opt_time;
-    int64_t restore_count;
-    int64_t restore_time;
+    TCGProfile prof;
 #endif
 
 #ifdef CONFIG_DEBUG_TCG
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 961e357..fd3e4a0 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -312,6 +312,7 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
     uint8_t *p = tb->tc.search;
     int i, j, num_insns = tb->icount;
 #ifdef CONFIG_PROFILER
+    TCGProfile *prof = &tcg_ctx->prof;
     int64_t ti = profile_getclock();
 #endif
 
@@ -346,8 +347,9 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
     restore_state_to_opc(env, tb, data);
 
 #ifdef CONFIG_PROFILER
-    tcg_ctx->restore_time += profile_getclock() - ti;
-    tcg_ctx->restore_count++;
+    atomic_set(&prof->restore_time,
+                prof->restore_time + profile_getclock() - ti);
+    atomic_set(&prof->restore_count, prof->restore_count + 1);
 #endif
     return 0;
 }
@@ -1306,6 +1308,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tcg_insn_unit *gen_code_buf;
     int gen_code_size, search_size;
 #ifdef CONFIG_PROFILER
+    TCGProfile *prof = &tcg_ctx->prof;
     int64_t ti;
 #endif
     assert_memory_lock();
@@ -1336,8 +1339,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tcg_ctx->cf_parallel = !!(cflags & CF_PARALLEL);
 
 #ifdef CONFIG_PROFILER
-    tcg_ctx->tb_count1++; /* includes aborted translations because of
-                       exceptions */
+    /* includes aborted translations because of exceptions */
+    atomic_set(&prof->tb_count1, prof->tb_count1 + 1);
     ti = profile_getclock();
 #endif
 
@@ -1362,8 +1365,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 #endif
 
 #ifdef CONFIG_PROFILER
-    tcg_ctx->tb_count++;
-    tcg_ctx->interm_time += profile_getclock() - ti;
+    atomic_set(&prof->tb_count, prof->tb_count + 1);
+    atomic_set(&prof->interm_time, prof->interm_time + profile_getclock() - ti);
     ti = profile_getclock();
 #endif
 
@@ -1383,10 +1386,10 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tb->tc.size = gen_code_size;
 
 #ifdef CONFIG_PROFILER
-    tcg_ctx->code_time += profile_getclock() - ti;
-    tcg_ctx->code_in_len += tb->size;
-    tcg_ctx->code_out_len += gen_code_size;
-    tcg_ctx->search_out_len += search_size;
+    atomic_set(&prof->code_time, prof->code_time + profile_getclock() - ti);
+    atomic_set(&prof->code_in_len, prof->code_in_len + tb->size);
+    atomic_set(&prof->code_out_len, prof->code_out_len + gen_code_size);
+    atomic_set(&prof->search_out_len, prof->search_out_len + search_size);
 #endif
 
 #ifdef DEBUG_DISAS
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 8094278..5afb80a 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -1350,7 +1350,7 @@ void tcg_op_remove(TCGContext *s, TCGOp *op)
     memset(op, 0, sizeof(*op));
 
 #ifdef CONFIG_PROFILER
-    s->del_op_count++;
+    atomic_set(&s->prof.del_op_count, s->prof.del_op_count + 1);
 #endif
 }
 
@@ -2521,15 +2521,79 @@ static void tcg_reg_alloc_call(TCGContext *s, int nb_oargs, int nb_iargs,
 
 #ifdef CONFIG_PROFILER
 
-static int64_t tcg_table_op_count[NB_OPS];
+/* avoid copy/paste errors */
+#define PROF_ADD(to, from, field)                       \
+    do {                                                \
+        (to)->field += atomic_read(&((from)->field));   \
+    } while (0)
+
+#define PROF_ADD_MAX(to, from, field)                                   \
+    do {                                                                \
+        typeof((from)->field) val__ = atomic_read(&((from)->field));    \
+        if (val__ > (to)->field) {                                      \
+            (to)->field = val__;                                        \
+        }                                                               \
+    } while (0)
+
+/* Pass in a zeroed @prof */
+static inline
+void tcg_profile_snapshot(TCGProfile *prof, bool counters, bool table)
+{
+    unsigned int i;
+
+    for (i = 0; i < n_tcg_ctxs; i++) {
+        const TCGProfile *orig = &tcg_ctxs[i]->prof;
+
+        if (counters) {
+            PROF_ADD(prof, orig, tb_count1);
+            PROF_ADD(prof, orig, tb_count);
+            PROF_ADD(prof, orig, op_count);
+            PROF_ADD_MAX(prof, orig, op_count_max);
+            PROF_ADD(prof, orig, temp_count);
+            PROF_ADD_MAX(prof, orig, temp_count_max);
+            PROF_ADD(prof, orig, del_op_count);
+            PROF_ADD(prof, orig, code_in_len);
+            PROF_ADD(prof, orig, code_out_len);
+            PROF_ADD(prof, orig, search_out_len);
+            PROF_ADD(prof, orig, interm_time);
+            PROF_ADD(prof, orig, code_time);
+            PROF_ADD(prof, orig, la_time);
+            PROF_ADD(prof, orig, opt_time);
+            PROF_ADD(prof, orig, restore_count);
+            PROF_ADD(prof, orig, restore_time);
+        }
+        if (table) {
+            int i;
+
+            for (i = 0; i < NB_OPS; i++) {
+                PROF_ADD(prof, orig, table_op_count[i]);
+            }
+        }
+    }
+}
+
+#undef PROF_ADD
+#undef PROF_ADD_MAX
+
+static void tcg_profile_snapshot_counters(TCGProfile *prof)
+{
+    tcg_profile_snapshot(prof, true, false);
+}
+
+static void tcg_profile_snapshot_table(TCGProfile *prof)
+{
+    tcg_profile_snapshot(prof, false, true);
+}
 
 void tcg_dump_op_count(FILE *f, fprintf_function cpu_fprintf)
 {
+    TCGProfile prof = {};
     int i;
 
+    tcg_profile_snapshot_table(&prof);
     for (i = 0; i < NB_OPS; i++) {
         cpu_fprintf(f, "%s %" PRId64 "\n", tcg_op_defs[i].name,
-                    tcg_table_op_count[i]);
+                    prof.table_op_count[i]);
     }
 }
 #else
@@ -2542,6 +2606,9 @@ void tcg_dump_op_count(FILE *f, fprintf_function cpu_fprintf)
 
 int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
 {
+#ifdef CONFIG_PROFILER
+    TCGProfile *prof = &s->prof;
+#endif
     int i, oi, oi_next, num_insns;
 
 #ifdef CONFIG_PROFILER
@@ -2549,15 +2616,15 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
         int n;
 
         n = s->gen_op_buf[0].prev + 1;
-        s->op_count += n;
-        if (n > s->op_count_max) {
-            s->op_count_max = n;
+        atomic_set(&prof->op_count, prof->op_count + n);
+        if (n > prof->op_count_max) {
+            atomic_set(&prof->op_count_max, n);
         }
 
         n = s->nb_temps;
-        s->temp_count += n;
-        if (n > s->temp_count_max) {
-            s->temp_count_max = n;
+        atomic_set(&prof->temp_count, prof->temp_count + n);
+        if (n > prof->temp_count_max) {
+            atomic_set(&prof->temp_count_max, n);
         }
     }
 #endif
@@ -2574,7 +2641,7 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
 #endif
 
 #ifdef CONFIG_PROFILER
-    s->opt_time -= profile_getclock();
+    atomic_set(&prof->opt_time, prof->opt_time - profile_getclock());
 #endif
 
 #ifdef USE_TCG_OPTIMIZATIONS
@@ -2582,8 +2649,8 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
 #endif
 
 #ifdef CONFIG_PROFILER
-    s->opt_time += profile_getclock();
-    s->la_time -= profile_getclock();
+    atomic_set(&prof->opt_time, prof->opt_time + profile_getclock());
+    atomic_set(&prof->la_time, prof->la_time - profile_getclock());
 #endif
 
     {
@@ -2611,7 +2678,7 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
     }
 
 #ifdef CONFIG_PROFILER
-    s->la_time += profile_getclock();
+    atomic_set(&prof->la_time, prof->la_time + profile_getclock());
 #endif
 
 #ifdef DEBUG_DISAS
@@ -2642,7 +2709,7 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
 
         oi_next = op->next;
 #ifdef CONFIG_PROFILER
-        tcg_table_op_count[opc]++;
+        atomic_set(&prof->table_op_count[opc], prof->table_op_count[opc] + 1);
 #endif
 
         switch (opc) {
@@ -2718,10 +2785,17 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
 #ifdef CONFIG_PROFILER
 void tcg_dump_info(FILE *f, fprintf_function cpu_fprintf)
 {
-    TCGContext *s = tcg_ctx;
-    int64_t tb_count = s->tb_count;
-    int64_t tb_div_count = tb_count ? tb_count : 1;
-    int64_t tot = s->interm_time + s->code_time;
+    TCGProfile prof = {};
+    const TCGProfile *s;
+    int64_t tb_count;
+    int64_t tb_div_count;
+    int64_t tot;
+
+    tcg_profile_snapshot_counters(&prof);
+    s = &prof;
+    tb_count = s->tb_count;
+    tb_div_count = tb_count ? tb_count : 1;
+    tot = s->interm_time + s->code_time;
 
     cpu_fprintf(f, "JIT cycles          %" PRId64 " (%0.3f s at 2.4 GHz)\n",
                 tot, tot / 2.4e9);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [Qemu-devel] [PATCH v2 39/45] osdep: move qemu_real_host_page_size/mask to osdep
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (37 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 38/45] tcg: distribute profiling counters across TCGContext's Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  4:22   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 40/45] osdep: introduce qemu_mprotect_rwx/none Emilio G. Cota
                   ` (5 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

These only depend on the host and therefore belong in the common
osdep, not in a target-dependent object.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/cpu-all.h | 2 --
 include/qemu/osdep.h   | 8 ++++++++
 exec.c                 | 5 +----
 util/osdep.c           | 9 +++++++++
 4 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
index ffe43d5..778031c 100644
--- a/include/exec/cpu-all.h
+++ b/include/exec/cpu-all.h
@@ -229,8 +229,6 @@ extern int target_page_bits;
 /* Using intptr_t ensures that qemu_*_page_mask is sign-extended even
  * when intptr_t is 32-bit and we are aligning a long long.
  */
-extern uintptr_t qemu_real_host_page_size;
-extern intptr_t qemu_real_host_page_mask;
 extern uintptr_t qemu_host_page_size;
 extern intptr_t qemu_host_page_mask;
 
diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 8559634..3cb36e6 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -483,6 +483,14 @@ char *qemu_get_pid_name(pid_t pid);
  */
 pid_t qemu_fork(Error **errp);
 
+void real_host_page_size_init(void);
+
+/* Using intptr_t ensures that qemu_*_page_mask is sign-extended even
+ * when intptr_t is 32-bit and we are aligning a long long.
+ */
+extern uintptr_t qemu_real_host_page_size;
+extern intptr_t qemu_real_host_page_mask;
+
 extern int qemu_icache_linesize;
 extern int qemu_dcache_linesize;
 
diff --git a/exec.c b/exec.c
index adc160f..135dcbc 100644
--- a/exec.c
+++ b/exec.c
@@ -120,8 +120,6 @@ int use_icount;
 
 uintptr_t qemu_host_page_size;
 intptr_t qemu_host_page_mask;
-uintptr_t qemu_real_host_page_size;
-intptr_t qemu_real_host_page_mask;
 
 bool set_preferred_target_page_bits(int bits)
 {
@@ -3608,8 +3606,7 @@ void page_size_init(void)
 {
     /* NOTE: we can always suppose that qemu_host_page_size >=
        TARGET_PAGE_SIZE */
-    qemu_real_host_page_size = getpagesize();
-    qemu_real_host_page_mask = -(intptr_t)qemu_real_host_page_size;
+    real_host_page_size_init();
     if (qemu_host_page_size == 0) {
         qemu_host_page_size = qemu_real_host_page_size;
     }
diff --git a/util/osdep.c b/util/osdep.c
index a2863c8..90f4f11 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -46,6 +46,9 @@ extern int madvise(caddr_t, size_t, int);
 #define QEMU_GETLK F_GETLK
 #endif
 
+uintptr_t qemu_real_host_page_size;
+intptr_t qemu_real_host_page_mask;
+
 static bool fips_enabled = false;
 
 static const char *hw_version = QEMU_HW_VERSION;
@@ -65,6 +68,12 @@ int socket_set_nodelay(int fd)
     return qemu_setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &v, sizeof(v));
 }
 
+void real_host_page_size_init(void)
+{
+    qemu_real_host_page_size = getpagesize();
+    qemu_real_host_page_mask = -(intptr_t)qemu_real_host_page_size;
+}
+
 int qemu_madvise(void *addr, size_t len, int advice)
 {
     if (advice == QEMU_MADV_INVALID) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [Qemu-devel] [PATCH v2 40/45] osdep: introduce qemu_mprotect_rwx/none
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (38 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 39/45] osdep: move qemu_real_host_page_size/mask to osdep Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  4:26   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 41/45] translate-all: use qemu_mprotect_rwx/none helpers Emilio G. Cota
                   ` (4 subsequent siblings)
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/osdep.h |  2 ++
 util/osdep.c         | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 3cb36e6..dcecfbc 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -348,6 +348,8 @@ void sigaction_invoke(struct sigaction *action,
 #endif
 
 int qemu_madvise(void *addr, size_t len, int advice);
+int qemu_mprotect_rwx(void *addr, size_t size);
+int qemu_mprotect_none(void *addr, size_t size);
 
 int qemu_open(const char *name, int flags, ...);
 int qemu_close(int fd);
diff --git a/util/osdep.c b/util/osdep.c
index 90f4f11..85df97e 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -90,6 +90,46 @@ int qemu_madvise(void *addr, size_t len, int advice)
 #endif
 }
 
+static int qemu_mprotect__osdep(void *addr, size_t size, int prot)
+{
+    void *start = QEMU_ALIGN_PTR_DOWN(addr, qemu_real_host_page_size);
+    void *end = QEMU_ALIGN_PTR_UP(addr + size, qemu_real_host_page_size);
+#ifdef _WIN32
+    DWORD old_protect;
+
+    if (!VirtualProtect(start, end - start, prot, &old_protect)) {
+        error_report("%s: VirtualProtect failed with error code %d",
+                     __func__, GetLastError());
+        return -1;
+    }
+    return 0;
+#else
+    if (mprotect(start, end - start, prot)) {
+        error_report("%s: mprotect failed: %s", __func__, strerror(errno));
+        return -1;
+    }
+    return 0;
+#endif
+}
+
+int qemu_mprotect_rwx(void *addr, size_t size)
+{
+#ifdef _WIN32
+    return qemu_mprotect__osdep(addr, size, PAGE_EXECUTE_READWRITE);
+#else
+    return qemu_mprotect__osdep(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
+#endif
+}
+
+int qemu_mprotect_none(void *addr, size_t size)
+{
+#ifdef _WIN32
+    return qemu_mprotect__osdep(addr, size, PAGE_NOACCESS);
+#else
+    return qemu_mprotect__osdep(addr, size, PROT_NONE);
+#endif
+}
+
 #ifndef _WIN32
 /*
  * Dups an fd and sets the flags
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [Qemu-devel] [PATCH v2 41/45] translate-all: use qemu_mprotect_rwx/none helpers
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (39 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 40/45] osdep: introduce qemu_mprotect_rwx/none Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 42/45] tcg: define TCG_HIGHWATER Emilio G. Cota
                   ` (3 subsequent siblings)
  44 siblings, 0 replies; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 49 ++++++-----------------------------------------
 1 file changed, 6 insertions(+), 43 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index fd3e4a0..913b1c5 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -604,47 +604,6 @@ static inline void *split_cross_256mb(void *buf1, size_t size1)
 static uint8_t static_code_gen_buffer[DEFAULT_CODE_GEN_BUFFER_SIZE]
     __attribute__((aligned(CODE_GEN_ALIGN)));
 
-# ifdef _WIN32
-static inline void do_protect(void *addr, long size, int prot)
-{
-    DWORD old_protect;
-    VirtualProtect(addr, size, prot, &old_protect);
-}
-
-static inline void map_exec(void *addr, long size)
-{
-    do_protect(addr, size, PAGE_EXECUTE_READWRITE);
-}
-
-static inline void map_none(void *addr, long size)
-{
-    do_protect(addr, size, PAGE_NOACCESS);
-}
-# else
-static inline void do_protect(void *addr, long size, int prot)
-{
-    uintptr_t start, end;
-
-    start = (uintptr_t)addr;
-    start &= qemu_real_host_page_mask;
-
-    end = (uintptr_t)addr + size;
-    end = ROUND_UP(end, qemu_real_host_page_size);
-
-    mprotect((void *)start, end - start, prot);
-}
-
-static inline void map_exec(void *addr, long size)
-{
-    do_protect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
-}
-
-static inline void map_none(void *addr, long size)
-{
-    do_protect(addr, size, PROT_NONE);
-}
-# endif /* WIN32 */
-
 static inline void *alloc_code_gen_buffer(void)
 {
     void *buf = static_code_gen_buffer;
@@ -671,8 +630,12 @@ static inline void *alloc_code_gen_buffer(void)
     }
 #endif
 
-    map_exec(buf, size);
-    map_none(buf + size, qemu_real_host_page_size);
+    if (qemu_mprotect_rwx(buf, size)) {
+        abort();
+    }
+    if (qemu_mprotect_none(buf + size, qemu_real_host_page_size)) {
+        abort();
+    }
     qemu_madvise(buf, size, QEMU_MADV_HUGEPAGE);
 
     return buf;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [Qemu-devel] [PATCH v2 42/45] tcg: define TCG_HIGHWATER
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (40 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 41/45] translate-all: use qemu_mprotect_rwx/none helpers Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 43/45] tcg: introduce regions to split code_gen_buffer Emilio G. Cota
                   ` (2 subsequent siblings)
  44 siblings, 0 replies; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Will come in handy very soon.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index 5afb80a..e8aae1f 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -115,6 +115,8 @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
 static void tcg_out_tb_init(TCGContext *s);
 static bool tcg_out_tb_finalize(TCGContext *s);
 
+#define TCG_HIGHWATER 1024
+
 static TCGContext **tcg_ctxs;
 static unsigned int n_tcg_ctxs;
 
@@ -441,7 +443,7 @@ void tcg_prologue_init(TCGContext *s)
     /* Compute a high-water mark, at which we voluntarily flush the buffer
        and start over.  The size here is arbitrary, significantly larger
        than we expect the code generation for any one opcode to require.  */
-    s->code_gen_highwater = s->code_gen_buffer + (total_size - 1024);
+    s->code_gen_highwater = s->code_gen_buffer + (total_size - TCG_HIGHWATER);
 
     tcg_register_jit(s->code_gen_buffer, total_size);
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [Qemu-devel] [PATCH v2 43/45] tcg: introduce regions to split code_gen_buffer
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (41 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 42/45] tcg: define TCG_HIGHWATER Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  5:09   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 44/45] translate-all: do not allocate a guard page for code_gen_buffer Emilio G. Cota
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 45/45] tcg: enable multiple TCG contexts in softmmu Emilio G. Cota
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This is groundwork for supporting multiple TCG contexts.

The naive solution here is to split code_gen_buffer statically
among the TCG threads; however, this results in poor utilization
if translation needs differ across TCG threads.

What we do here is to add an extra layer of indirection, assigning
regions that act just like pages do in virtual memory allocation.
(BTW if you are wondering about the chosen naming, I did not want
to use blocks or pages because those are already heavily used in QEMU).
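
As a sketch of that indirection, the address of region slot @i is just
an offset into code_gen_buffer (this mirrors tcg_region_assign in the
diff below; names abbreviated, guard pages included in the stride):

    #include <stddef.h>

    /* each slot holds one usable region followed by one guard page */
    static void *region_base(void *buf, size_t region_size,
                             size_t guard_size, size_t i)
    {
        return (char *)buf + i * (region_size + guard_size);
    }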

We use a global lock to serialize allocations as well as statistics
reporting (we now export the size of the used code_gen_buffer with
tcg_code_size()). Note that for the allocator we could just use
a counter and atomic_inc; however, that would complicate the gathering
of tcg_code_size()-like stats. So given that the region operations are
not a fast path, a lock seems the most reasonable choice.

The effectiveness of this approach is clear after seeing some numbers.
I used the bootup+shutdown of debian-arm with '-tb-size 80' as a benchmark.
Note that I'm evaluating this after enabling per-thread TCG (which
is done by a subsequent commit).

* -smp 1, 1 region (entire buffer):
    qemu: flush code_size=83885014 nb_tbs=154739 avg_tb_size=357
    qemu: flush code_size=83884902 nb_tbs=153136 avg_tb_size=363
    qemu: flush code_size=83885014 nb_tbs=152777 avg_tb_size=364
    qemu: flush code_size=83884950 nb_tbs=150057 avg_tb_size=373
    qemu: flush code_size=83884998 nb_tbs=150234 avg_tb_size=373
    qemu: flush code_size=83885014 nb_tbs=154009 avg_tb_size=360
    qemu: flush code_size=83885014 nb_tbs=151007 avg_tb_size=370
    qemu: flush code_size=83885014 nb_tbs=151816 avg_tb_size=367

That is, 8 flushes.

* -smp 8, 32 regions (80/32 MB per region) [i.e. this patch]:

    qemu: flush code_size=76328008 nb_tbs=141040 avg_tb_size=356
    qemu: flush code_size=75366534 nb_tbs=138000 avg_tb_size=361
    qemu: flush code_size=76864546 nb_tbs=140653 avg_tb_size=361
    qemu: flush code_size=76309084 nb_tbs=135945 avg_tb_size=375
    qemu: flush code_size=74581856 nb_tbs=132909 avg_tb_size=375
    qemu: flush code_size=73927256 nb_tbs=135616 avg_tb_size=360
    qemu: flush code_size=78629426 nb_tbs=142896 avg_tb_size=365
    qemu: flush code_size=76667052 nb_tbs=138508 avg_tb_size=368

Again, 8 flushes. Note how buffer utilization is not 100%, but it
is close. Smaller region sizes would yield higher utilization,
but we want region allocation to be rare (it acquires a lock), so
we do not want to go too small.
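
For concreteness, the sizing math behind the run above works out as
follows (a sketch assuming a 4 KiB host page; the real computation is
in tcg_region_init):

    #include <stdio.h>
    #include <stddef.h>

    int main(void)
    {
        size_t buf_size    = 80u << 20;   /* -tb-size 80 */
        size_t page_size   = 4096;        /* assumed 4 KiB host page */
        size_t n_regions   = 32;          /* 8 vCPUs, 4 regions each */
        /* round each region down to a page boundary... */
        size_t region_size = (buf_size / n_regions) & ~(page_size - 1);
        /* ...and dedicate its last page to the guard */
        size_t usable      = region_size - page_size;

        /* prints: 2621440 bytes/region, 2617344 usable */
        printf("%zu bytes/region, %zu usable\n", region_size, usable);
        return 0;
    }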

* -smp 8, static partitioning of 8 regions (10 MB per region):
    qemu: flush code_size=21936504 nb_tbs=40570 avg_tb_size=354
    qemu: flush code_size=11472174 nb_tbs=20633 avg_tb_size=370
    qemu: flush code_size=11603976 nb_tbs=21059 avg_tb_size=365
    qemu: flush code_size=23254872 nb_tbs=41243 avg_tb_size=377
    qemu: flush code_size=28289496 nb_tbs=52057 avg_tb_size=358
    qemu: flush code_size=43605160 nb_tbs=78896 avg_tb_size=367
    qemu: flush code_size=45166552 nb_tbs=82158 avg_tb_size=364
    qemu: flush code_size=63289640 nb_tbs=116494 avg_tb_size=358
    qemu: flush code_size=51389960 nb_tbs=93937 avg_tb_size=362
    qemu: flush code_size=59665928 nb_tbs=107063 avg_tb_size=372
    qemu: flush code_size=38380824 nb_tbs=68597 avg_tb_size=374
    qemu: flush code_size=44884568 nb_tbs=79901 avg_tb_size=376
    qemu: flush code_size=50782632 nb_tbs=90681 avg_tb_size=374
    qemu: flush code_size=39848888 nb_tbs=71433 avg_tb_size=372
    qemu: flush code_size=64708840 nb_tbs=119052 avg_tb_size=359
    qemu: flush code_size=49830008 nb_tbs=90992 avg_tb_size=362
    qemu: flush code_size=68372408 nb_tbs=123442 avg_tb_size=368
    qemu: flush code_size=33555560 nb_tbs=59514 avg_tb_size=378
    qemu: flush code_size=44748344 nb_tbs=80974 avg_tb_size=367
    qemu: flush code_size=37104248 nb_tbs=67609 avg_tb_size=364

That is, 20 flushes. Note how a static partitioning approach uses
the code buffer poorly, leading to many unnecessary flushes.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h   |   3 +
 tcg/tcg.h                 |   6 ++
 accel/tcg/translate-all.c |  56 +++++++++----
 bsd-user/main.c           |   1 +
 cpus.c                    |  12 +++
 linux-user/main.c         |   1 +
 tcg/tcg.c                 | 197 +++++++++++++++++++++++++++++++++++++++++++++-
 7 files changed, 260 insertions(+), 16 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 37487d7..69a2a21 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -49,6 +49,9 @@ void gen_intermediate_code(CPUArchState *env, struct TranslationBlock *tb);
 void restore_state_to_opc(CPUArchState *env, struct TranslationBlock *tb,
                           target_ulong *data);
 
+#ifdef CONFIG_SOFTMMU
+void softmmu_tcg_region_init(void);
+#endif
 void cpu_gen_init(void);
 bool cpu_restore_state(CPUState *cpu, uintptr_t searched_pc);
 
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 9d17584..6f6720b 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -772,6 +772,12 @@ void *tcg_malloc_internal(TCGContext *s, int size);
 void tcg_pool_reset(TCGContext *s);
 TranslationBlock *tcg_tb_alloc(TCGContext *s);
 
+void tcg_region_init(size_t n_regions);
+void tcg_region_reset_all(void);
+
+size_t tcg_code_size(void);
+size_t tcg_code_capacity(void);
+
 /* Called with tb_lock held.  */
 static inline void *tcg_malloc(int size)
 {
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 913b1c5..c30d400 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -59,6 +59,7 @@
 #include "qemu/main-loop.h"
 #include "exec/log.h"
 #include "sysemu/cpus.h"
+#include "sysemu/sysemu.h"
 
 /* #define DEBUG_TB_INVALIDATE */
 /* #define DEBUG_TB_FLUSH */
@@ -797,6 +798,39 @@ static inline void code_gen_alloc(size_t tb_size)
     qemu_mutex_init(&tb_ctx.tb_lock);
 }
 
+#ifdef CONFIG_SOFTMMU
+/*
+ * It is likely that some vCPUs will translate more code than others, so we
+ * first try to set more regions than smp_cpus, with those regions being
+ * larger than the minimum code_gen_buffer size. If that's not possible we
+ * make do by evenly dividing the code_gen_buffer among the vCPUs.
+ */
+void softmmu_tcg_region_init(void)
+{
+    size_t i;
+
+    /* Use a single region if all we have is one vCPU thread */
+    if (smp_cpus == 1 || !qemu_tcg_mttcg_enabled()) {
+        tcg_region_init(0);
+        return;
+    }
+
+    for (i = 8; i > 0; i--) {
+        size_t regions_per_thread = i;
+        size_t region_size;
+
+        region_size = tcg_init_ctx.code_gen_buffer_size;
+        region_size /= smp_cpus * regions_per_thread;
+
+        if (region_size >= 2 * MIN_CODE_GEN_BUFFER_SIZE) {
+            tcg_region_init(smp_cpus * regions_per_thread);
+            return;
+        }
+    }
+    tcg_region_init(smp_cpus);
+}
+#endif
+
 static void tb_htable_init(void)
 {
     unsigned int mode = QHT_MODE_AUTO_RESIZE;
@@ -916,13 +950,8 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
         size_t host_size = 0;
 
         g_tree_foreach(tb_ctx.tb_tree, tb_host_size_iter, &host_size);
-        printf("qemu: flush code_size=%td nb_tbs=%zu avg_tb_size=%zu\n",
-               tcg_ctx->code_gen_ptr - tcg_ctx->code_gen_buffer, nb_tbs,
-               nb_tbs > 0 ? host_size / nb_tbs : 0);
-    }
-    if ((unsigned long)(tcg_ctx->code_gen_ptr - tcg_ctx->code_gen_buffer)
-        > tcg_ctx->code_gen_buffer_size) {
-        cpu_abort(cpu, "Internal error: code buffer overflow\n");
+        printf("qemu: flush code_size=%zu nb_tbs=%zu avg_tb_size=%zu\n",
+               tcg_code_size(), nb_tbs, nb_tbs > 0 ? host_size / nb_tbs : 0);
     }
 
     CPU_FOREACH(cpu) {
@@ -936,7 +965,7 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     qht_reset_size(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
-    tcg_ctx->code_gen_ptr = tcg_ctx->code_gen_buffer;
+    tcg_region_reset_all();
     /* XXX: flush processor icache at this point if cache flush is
        expensive */
     atomic_mb_set(&tb_ctx.tb_flush_count, tb_ctx.tb_flush_count + 1);
@@ -1281,9 +1310,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
         cflags |= CF_USE_ICOUNT;
     }
 
+ buffer_overflow:
     tb = tb_alloc(pc);
     if (unlikely(!tb)) {
- buffer_overflow:
         /* flush must be done */
         tb_flush(cpu);
         mmap_unlock();
@@ -1367,9 +1396,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     }
 #endif
 
-    tcg_ctx->code_gen_ptr = (void *)
+    atomic_set(&tcg_ctx->code_gen_ptr, (void *)
         ROUND_UP((uintptr_t)gen_code_buf + gen_code_size + search_size,
-                 CODE_GEN_ALIGN);
+                 CODE_GEN_ALIGN));
 
     /* init jump list */
     assert(((uintptr_t)tb & 3) == 0);
@@ -1921,9 +1950,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
      * otherwise users might think "-tb-size" is not honoured.
      * For avg host size we use the precise numbers from tb_tree_stats though.
      */
-    cpu_fprintf(f, "gen code size       %td/%zd\n",
-                tcg_ctx->code_gen_ptr - tcg_ctx->code_gen_buffer,
-                tcg_ctx->code_gen_highwater - tcg_ctx->code_gen_buffer);
+    cpu_fprintf(f, "gen code size       %zu/%zu\n",
+                tcg_code_size(), tcg_code_capacity());
     cpu_fprintf(f, "TB count            %zu\n", nb_tbs);
     cpu_fprintf(f, "TB avg target size  %zu max=%zu bytes\n",
                 nb_tbs ? tst.target_size / nb_tbs : 0,
diff --git a/bsd-user/main.c b/bsd-user/main.c
index 7a8b29e..bc06c1c 100644
--- a/bsd-user/main.c
+++ b/bsd-user/main.c
@@ -979,6 +979,7 @@ int main(int argc, char **argv)
        generating the prologue until now so that the prologue can take
        the real value of GUEST_BASE into account.  */
     tcg_prologue_init(tcg_ctx);
+    tcg_region_init(0);
 
     /* build Task State */
     memset(ts, 0, sizeof(TaskState));
diff --git a/cpus.c b/cpus.c
index 14bb8d5..5455819 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1664,6 +1664,18 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
     char thread_name[VCPU_THREAD_NAME_SIZE];
     static QemuCond *single_tcg_halt_cond;
     static QemuThread *single_tcg_cpu_thread;
+    static int tcg_region_inited;
+
+    /*
+     * Initialize TCG regions--once, of course. Now is a good time, because:
+     * (1) TCG's init context, prologue and target globals have been set up.
+     * (2) qemu_tcg_mttcg_enabled() works now (TCG init code runs before the
+     *     -accel flag is processed, so the check doesn't work then).
+     */
+    if (!tcg_region_inited) {
+        softmmu_tcg_region_init();
+        tcg_region_inited = 1;
+    }
 
     if (qemu_tcg_mttcg_enabled() || !single_tcg_cpu_thread) {
         cpu->thread = g_malloc0(sizeof(QemuThread));
diff --git a/linux-user/main.c b/linux-user/main.c
index ad4c6f5..0500628 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -4457,6 +4457,7 @@ int main(int argc, char **argv, char **envp)
        generating the prologue until now so that the prologue can take
        the real value of GUEST_BASE into account.  */
     tcg_prologue_init(tcg_ctx);
+    tcg_region_init(0);
 
 #if defined(TARGET_I386)
     env->cr[0] = CR0_PG_MASK | CR0_WP_MASK | CR0_PE_MASK;
diff --git a/tcg/tcg.c b/tcg/tcg.c
index e8aae1f..daec7d1 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -33,6 +33,7 @@
 #include "qemu/cutils.h"
 #include "qemu/host-utils.h"
 #include "qemu/timer.h"
+#include "qemu/osdep.h"
 
 /* Note: the long term plan is to reduce the dependencies on the QEMU
    CPU definitions. Currently they are used for qemu_ld/st
@@ -120,6 +121,23 @@ static bool tcg_out_tb_finalize(TCGContext *s);
 static TCGContext **tcg_ctxs;
 static unsigned int n_tcg_ctxs;
 
+/*
+ * We divide code_gen_buffer into equally-sized "regions" that TCG threads
+ * dynamically allocate from as demand dictates. Given appropriate region
+ * sizing, this minimizes flushes even when some TCG threads generate a lot
+ * more code than others.
+ */
+struct tcg_region_state {
+    QemuMutex lock;
+    void *buf;      /* set at init time */
+    size_t n;       /* set at init time */
+    size_t size;    /* size of one region; set at init time */
+    size_t current; /* protected by the lock */
+    size_t n_full;  /* protected by the lock */
+};
+
+static struct tcg_region_state region;
+
 static TCGRegSet tcg_target_available_regs[2];
 static TCGRegSet tcg_target_call_clobber_regs;
 
@@ -257,6 +275,177 @@ TCGLabel *gen_new_label(void)
 
 #include "tcg-target.inc.c"
 
+static void tcg_region_assign(TCGContext *s, size_t curr_region)
+{
+    size_t guard_size = qemu_real_host_page_size;
+    void *buf = region.buf + curr_region * (region.size + guard_size);
+
+    s->code_gen_buffer = buf;
+    s->code_gen_ptr = buf;
+    s->code_gen_buffer_size = region.size;
+    s->code_gen_highwater = buf + region.size - TCG_HIGHWATER;
+}
+
+static bool tcg_region_alloc__locked(TCGContext *s)
+{
+    if (region.current == region.n) {
+        return true;
+    }
+    tcg_region_assign(s, region.current);
+    region.current++;
+    return false;
+}
+
+/*
+ * Request a new region once the one in use has filled up.
+ * Returns true on error.
+ */
+static bool tcg_region_alloc(TCGContext *s)
+{
+    bool err;
+
+    qemu_mutex_lock(&region.lock);
+    err = tcg_region_alloc__locked(s);
+    if (!err) {
+        region.n_full++;
+    }
+    qemu_mutex_unlock(&region.lock);
+    return err;
+}
+
+/*
+ * Perform a context's first region allocation.
+ * This function does _not_ increment region.n_full.
+ */
+static inline bool tcg_region_initial_alloc__locked(TCGContext *s)
+{
+    return tcg_region_alloc__locked(s);
+}
+
+/* Call from a safe-work context */
+void tcg_region_reset_all(void)
+{
+    unsigned int i;
+
+    qemu_mutex_lock(&region.lock);
+    region.current = 0;
+    region.n_full = 0;
+
+    for (i = 0; i < n_tcg_ctxs; i++) {
+        if (unlikely(tcg_region_initial_alloc__locked(tcg_ctxs[i]))) {
+            tcg_abort();
+        }
+    }
+    qemu_mutex_unlock(&region.lock);
+}
+
+static void tcg_region_set_guard_pages(void)
+{
+    size_t guard_size = qemu_real_host_page_size;
+    size_t i;
+
+    for (i = 0; i < region.n; i++) {
+        void *guard = region.buf + region.size + i * (region.size + guard_size);
+
+        if (qemu_mprotect_none(guard, qemu_real_host_page_size)) {
+            tcg_abort();
+        }
+    }
+}
+
+/*
+ * Initializes region partitioning, setting the number of regions via
+ * @n_regions.
+ * Set @n_regions to 0 or 1 to use a single region that uses all of
+ * code_gen_buffer.
+ *
+ * Called at init time from the parent thread (i.e. the one calling
+ * tcg_context_init), after the target's TCG globals have been set.
+ *
+ * Region partitioning works by splitting code_gen_buffer into separate regions,
+ * and then assigning regions to TCG threads so that the threads can translate
+ * code in parallel without synchronization.
+ */
+void tcg_region_init(size_t n_regions)
+{
+    void *buf = tcg_init_ctx.code_gen_buffer;
+    size_t size = tcg_init_ctx.code_gen_buffer_size;
+
+    if (!n_regions) {
+        n_regions = 1;
+    }
+
+    /* start on a page-aligned address */
+    buf = QEMU_ALIGN_PTR_UP(buf, qemu_real_host_page_size);
+    if (unlikely(buf > tcg_init_ctx.code_gen_buffer + size)) {
+        tcg_abort();
+    }
+    /* discard that initial portion */
+    size -= buf - tcg_init_ctx.code_gen_buffer;
+
+    /* make region.size a multiple of page_size */
+    region.size = size / n_regions;
+    region.size &= qemu_real_host_page_mask;
+
+    /* A region must have at least 2 pages; one code, one guard */
+    if (unlikely(region.size < 2 * qemu_real_host_page_size)) {
+        tcg_abort();
+    }
+
+    /* do not count the guard page in region.size */
+    region.size -= qemu_real_host_page_size;
+    region.n = n_regions;
+    region.buf = buf;
+    tcg_region_set_guard_pages();
+    qemu_mutex_init(&region.lock);
+    /*
+     * We do not yet support multiple TCG contexts, so do the initial
+     * allocation now.
+     */
+    if (unlikely(tcg_region_initial_alloc__locked(tcg_ctx))) {
+        tcg_abort();
+    }
+}
+
+/*
+ * Returns the size (in bytes) of all translated code (i.e. from all regions)
+ * currently in the cache.
+ * See also: tcg_code_capacity()
+ * Do not confuse with tcg_current_code_size(); that one applies to a single
+ * TCG context.
+ */
+size_t tcg_code_size(void)
+{
+    unsigned int i;
+    size_t total;
+
+    qemu_mutex_lock(&region.lock);
+    total = region.n_full * (region.size - TCG_HIGHWATER);
+    for (i = 0; i < n_tcg_ctxs; i++) {
+        const TCGContext *s = tcg_ctxs[i];
+        size_t size;
+
+        size = atomic_read(&s->code_gen_ptr) - s->code_gen_buffer;
+        if (unlikely(size > s->code_gen_buffer_size)) {
+            tcg_abort();
+        }
+        total += size;
+    }
+    qemu_mutex_unlock(&region.lock);
+    return total;
+}
+
+/*
+ * Returns the code capacity (in bytes) of the entire cache, i.e. including all
+ * regions.
+ * See also: tcg_code_size()
+ */
+size_t tcg_code_capacity(void)
+{
+    /* no need for synchronization; these variables are set at init time */
+    return region.n * (region.size - TCG_HIGHWATER);
+}
+
 /* pool based memory allocation */
 void *tcg_malloc_internal(TCGContext *s, int size)
 {
@@ -406,13 +595,17 @@ TranslationBlock *tcg_tb_alloc(TCGContext *s)
     TranslationBlock *tb;
     void *next;
 
+ retry:
     tb = (void *)ROUND_UP((uintptr_t)s->code_gen_ptr, align);
     next = (void *)ROUND_UP((uintptr_t)(tb + 1), align);
 
     if (unlikely(next > s->code_gen_highwater)) {
-        return NULL;
+        if (tcg_region_alloc(s)) {
+            return NULL;
+        }
+        goto retry;
     }
-    s->code_gen_ptr = next;
+    atomic_set(&s->code_gen_ptr, next);
     return tb;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [Qemu-devel] [PATCH v2 44/45] translate-all: do not allocate a guard page for code_gen_buffer
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (42 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 43/45] tcg: introduce regions to split code_gen_buffer Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  4:35   ` Richard Henderson
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 45/45] tcg: enable multiple TCG contexts in softmmu Emilio G. Cota
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

TCG regions already have a guard page.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 47 ++++++++++++-----------------------------------
 1 file changed, 12 insertions(+), 35 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index c30d400..98aa63e 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -608,19 +608,11 @@ static uint8_t static_code_gen_buffer[DEFAULT_CODE_GEN_BUFFER_SIZE]
 static inline void *alloc_code_gen_buffer(void)
 {
     void *buf = static_code_gen_buffer;
-    size_t full_size, size;
-
-    /* The size of the buffer, rounded down to end on a page boundary.  */
-    full_size = (((uintptr_t)buf + sizeof(static_code_gen_buffer))
-                 & qemu_real_host_page_mask) - (uintptr_t)buf;
-
-    /* Reserve a guard page.  */
-    size = full_size - qemu_real_host_page_size;
+    size_t size = sizeof(static_code_gen_buffer);
 
     /* Honor a command-line option limiting the size of the buffer.  */
     if (size > tcg_ctx->code_gen_buffer_size) {
-        size = (((uintptr_t)buf + tcg_ctx->code_gen_buffer_size)
-                & qemu_real_host_page_mask) - (uintptr_t)buf;
+        size = tcg_ctx->code_gen_buffer_size;
     }
     tcg_ctx->code_gen_buffer_size = size;
 
@@ -634,9 +626,6 @@ static inline void *alloc_code_gen_buffer(void)
     if (qemu_mprotect_rwx(buf, size)) {
         abort();
     }
-    if (qemu_mprotect_none(buf + size, qemu_real_host_page_size)) {
-        abort();
-    }
     qemu_madvise(buf, size, QEMU_MADV_HUGEPAGE);
 
     return buf;
@@ -645,22 +634,16 @@ static inline void *alloc_code_gen_buffer(void)
 static inline void *alloc_code_gen_buffer(void)
 {
     size_t size = tcg_ctx->code_gen_buffer_size;
-    void *buf1, *buf2;
-
-    /* Perform the allocation in two steps, so that the guard page
-       is reserved but uncommitted.  */
-    buf1 = VirtualAlloc(NULL, size + qemu_real_host_page_size,
-                        MEM_RESERVE, PAGE_NOACCESS);
-    if (buf1 != NULL) {
-        buf2 = VirtualAlloc(buf1, size, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
-        assert(buf1 == buf2);
-    }
+    void *buf;
 
-    return buf1;
+    buf = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT,
+                        PAGE_EXECUTE_READWRITE);
+    return buf;
 }
 #else
 static inline void *alloc_code_gen_buffer(void)
 {
+    int prot = PROT_WRITE | PROT_READ | PROT_EXEC;
     int flags = MAP_PRIVATE | MAP_ANONYMOUS;
     uintptr_t start = 0;
     size_t size = tcg_ctx->code_gen_buffer_size;
@@ -694,8 +677,7 @@ static inline void *alloc_code_gen_buffer(void)
 #  endif
 # endif
 
-    buf = mmap((void *)start, size + qemu_real_host_page_size,
-               PROT_NONE, flags, -1, 0);
+    buf = mmap((void *)start, size, prot, flags, -1, 0);
     if (buf == MAP_FAILED) {
         return NULL;
     }
@@ -705,24 +687,23 @@ static inline void *alloc_code_gen_buffer(void)
         /* Try again, with the original still mapped, to avoid re-acquiring
            that 256mb crossing.  This time don't specify an address.  */
         size_t size2;
-        void *buf2 = mmap(NULL, size + qemu_real_host_page_size,
-                          PROT_NONE, flags, -1, 0);
+        void *buf2 = mmap(NULL, size, prot, flags, -1, 0);
         switch ((int)(buf2 != MAP_FAILED)) {
         case 1:
             if (!cross_256mb(buf2, size)) {
                 /* Success!  Use the new buffer.  */
-                munmap(buf, size + qemu_real_host_page_size);
+                munmap(buf, size);
                 break;
             }
             /* Failure.  Work with what we had.  */
-            munmap(buf2, size + qemu_real_host_page_size);
+            munmap(buf2, size);
             /* fallthru */
         default:
             /* Split the original buffer.  Free the smaller half.  */
             buf2 = split_cross_256mb(buf, size);
             size2 = tcg_ctx->code_gen_buffer_size;
             if (buf == buf2) {
-                munmap(buf + size2 + qemu_real_host_page_size, size - size2);
+                munmap(buf + size2, size - size2);
             } else {
                 munmap(buf, size - size2);
             }
@@ -733,10 +714,6 @@ static inline void *alloc_code_gen_buffer(void)
     }
 #endif
 
-    /* Make the final buffer accessible.  The guard page at the end
-       will remain inaccessible with PROT_NONE.  */
-    mprotect(buf, size, PROT_WRITE | PROT_READ | PROT_EXEC);
-
     /* Request large pages for the buffer.  */
     qemu_madvise(buf, size, QEMU_MADV_HUGEPAGE);
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* [Qemu-devel] [PATCH v2 45/45] tcg: enable multiple TCG contexts in softmmu
  2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
                   ` (43 preceding siblings ...)
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 44/45] translate-all: do not allocate a guard page for code_gen_buffer Emilio G. Cota
@ 2017-07-16 20:04 ` Emilio G. Cota
  2017-07-18  5:25   ` Richard Henderson
  44 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-16 20:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This enables parallel TCG code generation. However, we do not take
advantage of it yet since tb_lock is still held during tb_gen_code.

In user-mode we use a single TCG context; see the documentation
added to tcg_region_init for the rationale.

Note that targets do not need any conversion: targets initialize a
TCGContext (e.g. defining TCG globals), and after this initialization
has finished, the context is cloned by the vCPU threads, each of
them keeping a separate copy.

TCG threads claim one entry in tcg_ctxs[] atomically using cmpxchg.
They also increment n_tcg_ctxs atomically. Do not be too annoyed
by the subsequent atomic_read's of that variable; they are there just
to play nice with analysis tools such as thread sanitizer.
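
A minimal sketch of that claiming scheme (C11 atomics standing in for
QEMU's atomic_cmpxchg/atomic_fetch_inc; MAX_THREADS stands in for
smp_cpus, and the types are made up):

    #include <stdatomic.h>
    #include <assert.h>
    #include <stddef.h>

    #define MAX_THREADS 8

    typedef struct Ctx Ctx;

    static Ctx *_Atomic ctxs[MAX_THREADS];
    static atomic_uint n_ctxs;

    static void register_thread(Ctx *s)
    {
        for (size_t i = 0; i < MAX_THREADS; i++) {
            Ctx *expected = NULL;

            /* at most one thread can win each slot */
            if (atomic_compare_exchange_strong(&ctxs[i], &expected, s)) {
                atomic_fetch_add(&n_ctxs, 1);
                return;
            }
        }
        assert(!"more threads than slots");
    }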

Previous patches folded some TCG globals into TCGContext. The non-const
globals remaining are only set at init time, i.e. before the TCG
threads are spawned. Here is a list of these set-at-init-time globals
under tcg/:

Only written by tcg_context_init:
- indirect_reg_alloc_order
- tcg_op_defs
Only written by tcg_target_init (called from tcg_context_init):
- tcg_target_available_regs
- tcg_target_call_clobber_regs
- arm: arm_arch, use_idiv_instructions
- i386: have_cmov, have_bmi1, have_bmi2, have_lzcnt,
        have_movbe, have_popcnt
- mips: use_movnz_instructions, use_mips32_instructions,
        use_mips32r2_instructions, got_sigill (tcg_target_detect_isa)
- ppc: have_isa_2_06, have_isa_3_00, tb_ret_addr
- s390: tb_ret_addr, s390_facilities
- sparc: qemu_ld_trampoline, qemu_st_trampoline (build_trampolines),
         use_vis3_instructions

Only written by tcg_prologue_init:
- 'struct jit_code_entry one_entry'
- aarch64: tb_ret_addr
- arm: tb_ret_addr
- i386: tb_ret_addr, guest_base_flags
- ia64: tb_ret_addr
- mips: tb_ret_addr, bswap32_addr, bswap32u_addr, bswap64_addr

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.h                 |   7 ++--
 accel/tcg/translate-all.c |   2 +-
 cpus.c                    |   2 +
 linux-user/syscall.c      |   1 +
 tcg/tcg.c                 | 103 ++++++++++++++++++++++++++++++++++++++++++----
 5 files changed, 104 insertions(+), 11 deletions(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 6f6720b..2f7661b 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -745,7 +745,7 @@ struct TCGContext {
 };
 
 extern TCGContext tcg_init_ctx;
-extern TCGContext *tcg_ctx;
+extern __thread TCGContext *tcg_ctx;
 
 static inline void tcg_set_insn_param(int op_idx, int arg, TCGArg v)
 {
@@ -767,7 +767,7 @@ static inline bool tcg_op_buf_full(void)
 
 /* pool based memory allocation */
 
-/* tb_lock must be held for tcg_malloc_internal. */
+/* user-mode: tb_lock must be held for tcg_malloc_internal. */
 void *tcg_malloc_internal(TCGContext *s, int size);
 void tcg_pool_reset(TCGContext *s);
 TranslationBlock *tcg_tb_alloc(TCGContext *s);
@@ -778,7 +778,7 @@ void tcg_region_reset_all(void);
 size_t tcg_code_size(void);
 size_t tcg_code_capacity(void);
 
-/* Called with tb_lock held.  */
+/* user-mode: Called with tb_lock held.  */
 static inline void *tcg_malloc(int size)
 {
     TCGContext *s = tcg_ctx;
@@ -795,6 +795,7 @@ static inline void *tcg_malloc(int size)
 }
 
 void tcg_context_init(TCGContext *s);
+void tcg_register_thread(void);
 void tcg_prologue_init(TCGContext *s);
 void tcg_func_start(TCGContext *s);
 
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 98aa63e..78457a4 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -155,7 +155,7 @@ static void *l1_map[V_L1_MAX_SIZE];
 
 /* code generation context */
 TCGContext tcg_init_ctx;
-TCGContext *tcg_ctx;
+__thread TCGContext *tcg_ctx;
 TBContext tb_ctx;
 bool parallel_cpus;
 
diff --git a/cpus.c b/cpus.c
index 5455819..170071c 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1307,6 +1307,7 @@ static void *qemu_tcg_rr_cpu_thread_fn(void *arg)
     CPUState *cpu = arg;
 
     rcu_register_thread();
+    tcg_register_thread();
 
     qemu_mutex_lock_iothread();
     qemu_thread_get_self(cpu->thread);
@@ -1454,6 +1455,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
     g_assert(!use_icount);
 
     rcu_register_thread();
+    tcg_register_thread();
 
     qemu_mutex_lock_iothread();
     qemu_thread_get_self(cpu->thread);
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 925ae11..1beb11c 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -6214,6 +6214,7 @@ static void *clone_func(void *arg)
     TaskState *ts;
 
     rcu_register_thread();
+    tcg_register_thread();
     env = info->env;
     cpu = ENV_GET_CPU(env);
     thread_cpu = cpu;
diff --git a/tcg/tcg.c b/tcg/tcg.c
index daec7d1..f56ab44 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -59,6 +59,7 @@
 
 #include "elf.h"
 #include "exec/log.h"
+#include "sysemu/sysemu.h"
 
 /* Forward declarations for functions declared in tcg-target.inc.c and
    used here. */
@@ -325,13 +326,14 @@ static inline bool tcg_region_initial_alloc__locked(TCGContext *s)
 /* Call from a safe-work context */
 void tcg_region_reset_all(void)
 {
+    unsigned int n_ctxs = atomic_read(&n_tcg_ctxs);
     unsigned int i;
 
     qemu_mutex_lock(&region.lock);
     region.current = 0;
     region.n_full = 0;
 
-    for (i = 0; i < n_tcg_ctxs; i++) {
+    for (i = 0; i < n_ctxs; i++) {
         if (unlikely(tcg_region_initial_alloc__locked(tcg_ctxs[i]))) {
             tcg_abort();
         }
@@ -365,6 +367,23 @@ static void tcg_region_set_guard_pages(void)
  * Region partitioning works by splitting code_gen_buffer into separate regions,
  * and then assigning regions to TCG threads so that the threads can translate
  * code in parallel without synchronization.
+ *
+ * In softmmu the number of TCG threads is bounded by smp_cpus, so
+ * tcg_region_init callers must ensure that @n_regions is set so that there will
+ * be at least as many regions as TCG threads.
+ *
+ * User-mode callers must set @n_regions to 0/1, thereby using a single region.
+ * Having multiple regions in user-mode is not supported, since the number of
+ * vCPU threads (recall that each thread spawned by the guest corresponds to
+ * a vCPU thread) is only bounded by the OS, and usually this number is huge
+ * (tens of thousands is not uncommon).  Thus, given this large bound on the
+ * number of vCPU threads and the fact that code_gen_buffer is allocated at
+ * compile-time, we cannot guarantee the availability of at least one
+ * region per vCPU thread.
+ *
+ * However, this user-mode limitation is unlikely to be a significant problem
+ * in practice. Multi-threaded guests share most if not all of their translated
+ * code, which makes parallel code generation less appealing than in softmmu.
  */
 void tcg_region_init(size_t n_regions)
 {
@@ -398,13 +417,71 @@ void tcg_region_init(size_t n_regions)
     region.buf = buf;
     tcg_region_set_guard_pages();
     qemu_mutex_init(&region.lock);
-    /*
-     * We do not yet support multiple TCG contexts, so do the initial
-     * allocation now.
-     */
+#ifdef CONFIG_USER_ONLY
+    /* In user-mode we support only one ctx, so do the initial allocation now */
     if (unlikely(tcg_region_initial_alloc__locked(tcg_ctx))) {
         tcg_abort();
     }
+#endif
+}
+
+/*
+ * All TCG threads except the parent (i.e. the one that called tcg_context_init
+ * and registered the target's TCG globals) must register with this function
+ * before initiating translation.
+ *
+ * In user-mode we just point tcg_ctx to tcg_init_ctx. See the documentation
+ * of tcg_region_init() for the reasoning behind this.
+ *
+ * In softmmu each caller registers its context in tcg_ctxs[]. Note that in
+ * softmmu tcg_ctxs[] does not track tcg_init_ctx, since the initial context
+ * is not used anymore for translation once this function is called.
+ *
+ * Not tracking tcg_init_ctx in tcg_ctxs[] in softmmu keeps code that iterates
+ * over the array (e.g. tcg_code_size()) the same for both softmmu and user-mode.
+ */
+void tcg_register_thread(void)
+{
+#ifdef CONFIG_USER_ONLY
+    tcg_ctx = &tcg_init_ctx;
+#else
+    TCGContext *s = g_malloc(sizeof(*s));
+    int i;
+
+    memcpy(s, &tcg_init_ctx, sizeof(*s));
+    /* tcg_optimize will allocate a new opt_temps array for this ctx */
+    s->opt_temps = NULL;
+
+    /* claim the first free pointer in tcg_ctxs and increment n_tcg_ctxs */
+    for (i = 0; i < smp_cpus; i++) {
+        if (atomic_cmpxchg(&tcg_ctxs[i], NULL, s) == NULL) {
+            unsigned int n;
+
+            n = atomic_fetch_inc(&n_tcg_ctxs);
+            /*
+             * Zero out s->prof in all contexts but the first.
+             * This ensures that we correctly account for the profiling info
+             * generated during initialization, since tcg_init_ctx is not
+             * tracked by the array.
+             */
+            if (n != 0) {
+#ifdef CONFIG_PROFILER
+                memset(&s->prof, 0, sizeof(s->prof));
+#endif
+            }
+            break;
+        }
+    }
+    /* Only vCPU threads can call this function */
+    g_assert(i < smp_cpus);
+
+    tcg_ctx = s;
+    qemu_mutex_lock(&region.lock);
+    if (unlikely(tcg_region_initial_alloc__locked(tcg_ctx))) {
+        tcg_abort();
+    }
+    qemu_mutex_unlock(&region.lock);
+#endif
 }
 
 /*
@@ -416,12 +493,13 @@ void tcg_region_init(size_t n_regions)
  */
 size_t tcg_code_size(void)
 {
+    unsigned int n_ctxs = atomic_read(&n_tcg_ctxs);
     unsigned int i;
     size_t total;
 
     qemu_mutex_lock(&region.lock);
     total = region.n_full * (region.size - TCG_HIGHWATER);
-    for (i = 0; i < n_tcg_ctxs; i++) {
+    for (i = 0; i < n_ctxs; i++) {
         const TCGContext *s = tcg_ctxs[i];
         size_t size;
 
@@ -516,11 +594,21 @@ static GHashTable *helper_table;
 static int indirect_reg_alloc_order[ARRAY_SIZE(tcg_target_reg_alloc_order)];
 static void process_op_defs(TCGContext *s);
 
+/*
+ * In user-mode we simply share the init context among threads, since we
+ * use a single region. See the documentation tcg_region_init() for the
+ * reasoning behind this.
+ * In softmmu we will have at most smp_cpus TCG threads.
+ */
 static void tcg_ctxs_init(TCGContext *s)
 {
+#ifdef CONFIG_USER_ONLY
     tcg_ctxs = g_new(TCGContext *, 1);
     tcg_ctxs[0] = s;
     n_tcg_ctxs = 1;
+#else
+    tcg_ctxs = g_new0(TCGContext *, smp_cpus);
+#endif
 }
 
 void tcg_context_init(TCGContext *s)
@@ -2734,9 +2822,10 @@ static void tcg_reg_alloc_call(TCGContext *s, int nb_oargs, int nb_iargs,
 static inline
 void tcg_profile_snapshot(TCGProfile *prof, bool counters, bool table)
 {
+    unsigned int n_ctxs = atomic_read(&n_tcg_ctxs);
     unsigned int i;
 
-    for (i = 0; i < n_tcg_ctxs; i++) {
+    for (i = 0; i < n_ctxs; i++) {
         const TCGProfile *orig = &tcg_ctxs[i]->prof;
 
         if (counters) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 07/45] cpu-exec: rename have_tb_lock to acquired_tb_lock in tb_find
  2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 07/45] cpu-exec: rename have_tb_lock to acquired_tb_lock in tb_find Emilio G. Cota
@ 2017-07-17 22:39   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-17 22:39 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:03 AM, Emilio G. Cota wrote:
> Reusing the have_tb_lock name, which is also defined in translate-all.c,
> makes code reviewing unnecessarily harder.
> 
> Avoid potential confusion by renaming the local have_tb_lock variable
> to something else.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   accel/tcg/cpu-exec.c | 10 +++++-----
>   1 file changed, 5 insertions(+), 5 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 10/45] translate-all: guarantee that tb_hash only holds valid TBs
  2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 10/45] translate-all: guarantee that tb_hash only holds valid TBs Emilio G. Cota
@ 2017-07-17 22:55   ` Richard Henderson
  2017-07-18  0:27     ` Emilio G. Cota
  0 siblings, 1 reply; 93+ messages in thread
From: Richard Henderson @ 2017-07-17 22:55 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:03 AM, Emilio G. Cota wrote:
> @@ -1073,13 +1073,17 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>   
>       assert_tb_locked();
>   
> -    atomic_set(&tb->invalid, true);
> -
>       /* remove the TB from the hash list */
>       phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
>       h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->trace_vcpu_dstate);
>       qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
>   
> +    /*
> +     * Mark the TB as invalid *after* it's been removed from tb_hash, which
> +     * eliminates the need to check this bit on lookups.
> +     */
> +    tb->invalid = true;

I believe you need atomic_store_release here.  Previously we were relying on 
the lock acquisition in qht_remove to provide the required memory barrier.

We definitely need to make sure this reaches memory before we zap the TB in the 
CPU_FOREACH loop.
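
In C11 terms the store would look something like this (a sketch with a
stand-in flag; the actual code would go through the matching atomic.h
wrapper):

    #include <stdatomic.h>
    #include <stdbool.h>

    static _Atomic bool tb_invalid;   /* stand-in for tb->invalid */

    static void tb_mark_invalid(void)
    {
        /* release: the preceding qht removal is ordered before the
         * flag becoming visible as true to other threads */
        atomic_store_explicit(&tb_invalid, true, memory_order_release);
    }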


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 11/45] exec-all: bring tb->invalid into tb->cflags
  2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 11/45] exec-all: bring tb->invalid into tb->cflags Emilio G. Cota
@ 2017-07-17 23:07   ` Richard Henderson
  2017-07-18  0:28     ` Emilio G. Cota
  0 siblings, 1 reply; 93+ messages in thread
From: Richard Henderson @ 2017-07-17 23:07 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:03 AM, Emilio G. Cota wrote:
> -    tb->invalid = true;
> +    tb->cflags |= CF_INVALID;

Modulo the store_release comment for the last patch,

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 12/45] tcg: remove addr argument from lookup_tb_ptr
  2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 12/45] tcg: remove addr argument from lookup_tb_ptr Emilio G. Cota
@ 2017-07-17 23:25   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-17 23:25 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:03 AM, Emilio G. Cota wrote:
> It is unlikely that we will ever want to call this helper passing
> an argument other than the current PC. So just remove the argument,
> and use the pc we already get from cpu_get_tb_cpu_state.
> 
> This change paves the way to having a common "tb_lookup" function.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 13/45] tcg: consolidate TB lookups in tb_lookup__cpu_state
  2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 13/45] tcg: consolidate TB lookups in tb_lookup__cpu_state Emilio G. Cota
@ 2017-07-17 23:41   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-17 23:41 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:03 AM, Emilio G. Cota wrote:
> +    if (likely(tb &&
> +               tb->pc == *pc &&
> +               tb->cs_base == *cs_base &&
> +               tb->flags == *flags &&
> +               tb->trace_vcpu_dstate == *cpu->trace_dstate)) {

I'll just mention something I noticed while looking at perf data for Alpha. 
It's a tiny thing, however: we should order these by likelihood of mismatch.

Almost no targets really use cs_base.  X86 won't use it when in a 32-bit or 
64-bit flat address space.  Sparc stuffs NPC into cs_base, but you really have 
to explore the limits of branch delay slots in order to see NPC != PC + 4, so 
NPC very rarely differs when PC is equal.  Similarly for HPPA.

Which suggests checking flags before cs_base.

That said, I've also considered rewriting this without short-circuit ands 
(after verifying TB != NULL) so that the compiler is free to perform all of the 
subsequent loads early.
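
A standalone model of that variant, just to show the idea -- untested, and
with stand-in types rather than the actual QEMU structures:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t target_ulong;   /* stand-in for the real typedef */

    struct TB {
        target_ulong pc, cs_base;
        uint32_t flags, trace_vcpu_dstate;
    };

    static bool tb_match(const struct TB *tb, target_ulong pc,
                         target_ulong cs_base, uint32_t flags, uint32_t ds)
    {
        if (tb == NULL) {
            return false;
        }
        /* Bitwise & instead of &&: no short-circuiting, so the compiler
         * is free to issue all four loads early. flags goes first since
         * it is the field most likely to differ, per the above. */
        return (tb->flags == flags) & (tb->pc == pc) &
               (tb->cs_base == cs_base) & (tb->trace_vcpu_dstate == ds);
    }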

That said, this is correct as-is.

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 14/45] tcg: define CF_PARALLEL and use it for TB hashing
  2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 14/45] tcg: define CF_PARALLEL and use it for TB hashing Emilio G. Cota
@ 2017-07-17 23:46   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-17 23:46 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:03 AM, Emilio G. Cota wrote:
> +/* mask cflags for hashing/comparison */
> +static inline uint32_t mask_cf(uint32_t cflags)
> +{
> +    uint32_t mask = 0;
> +
> +    mask |= CF_PARALLEL;
> +    return cflags & mask;
> +}

Surely we don't need a function for this, just a define near all the other CF_ 
definitions.

> +
> +/* current cflags, masked for hashing/comparison */
> +static inline uint32_t curr_cf_mask(void)
> +{
> +    uint32_t val = 0;
> +
> +    if (parallel_cpus) {
> +        val |= CF_PARALLEL;
> +    }
> +    return val;
> +}

Better as curr_cflags?  What's the "mask" part of this?

Also, let's write this more directly, e.g.

   return parallel_cpus ? CF_PARALLEL : 0;

until we have something more to put here.
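
Putting the two suggestions together, a sketch -- the bit value and the
mask name here are illustrative, not taken from the series:

    #define CF_PARALLEL 0x0008      /* generate code for parallel execution */
    #define CF_MASK     CF_PARALLEL /* cflags bits used for hashing/compare */

    static inline uint32_t curr_cflags(void)
    {
        return parallel_cpus ? CF_PARALLEL : 0;
    }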


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 15/45] target/arm: check CF_PARALLEL instead of parallel_cpus
  2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 15/45] target/arm: check CF_PARALLEL instead of parallel_cpus Emilio G. Cota
@ 2017-07-17 23:46   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-17 23:46 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:03 AM, Emilio G. Cota wrote:
> Thereby decoupling the resulting translated code from the current state
> of the system.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   target/arm/helper-a64.h    |  4 ++++
>   target/arm/helper-a64.c    | 38 ++++++++++++++++++++++++++++++++------
>   target/arm/op_helper.c     |  7 -------
>   target/arm/translate-a64.c | 31 +++++++++++++++++++++++++------
>   target/arm/translate.c     |  9 +++++++--
>   5 files changed, 68 insertions(+), 21 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 16/45] target/hppa: check CF_PARALLEL instead of parallel_cpus
  2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 16/45] target/hppa: " Emilio G. Cota
@ 2017-07-17 23:47   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-17 23:47 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:03 AM, Emilio G. Cota wrote:
> Thereby decoupling the resulting translated code from the current state
> of the system.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   target/hppa/helper.h    |  2 ++
>   target/hppa/op_helper.c | 32 ++++++++++++++++++++++++++++----
>   target/hppa/translate.c | 12 ++++++++++--
>   3 files changed, 40 insertions(+), 6 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 17/45] target/i386: check CF_PARALLEL instead of parallel_cpus
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 17/45] target/i386: " Emilio G. Cota
@ 2017-07-17 23:47   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-17 23:47 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> Thereby decoupling the resulting translated code from the current state
> of the system.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   target/i386/translate.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 18/45] target/m68k: check CF_PARALLEL instead of parallel_cpus
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 18/45] target/m68k: " Emilio G. Cota
@ 2017-07-17 23:52   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-17 23:52 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> Thereby decoupling the resulting translated code from the current state
> of the system.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>   target/m68k/helper.h    |  2 ++
>   target/m68k/op_helper.c | 32 ++++++++++++++++++++++++++++----
>   target/m68k/translate.c | 12 ++++++++++--
>   3 files changed, 40 insertions(+), 6 deletions(-)
> 
> diff --git a/target/m68k/helper.h b/target/m68k/helper.h
> index 475a1f2..137ef48 100644
> --- a/target/m68k/helper.h
> +++ b/target/m68k/helper.h
> @@ -10,7 +10,9 @@ DEF_HELPER_4(divsll, void, env, int, int, s32)
>   DEF_HELPER_2(set_sr, void, env, i32)
>   DEF_HELPER_3(movec, void, env, i32, i32)
>   DEF_HELPER_4(cas2w, void, env, i32, i32, i32)
> +DEF_HELPER_4(cas2w_parallel, void, env, i32, i32, i32)
>   DEF_HELPER_4(cas2l, void, env, i32, i32, i32)
> +DEF_HELPER_4(cas2l_parallel, void, env, i32, i32, i32)
>   
>   #define dh_alias_fp ptr
>   #define dh_ctype_fp FPReg *
> diff --git a/target/m68k/op_helper.c b/target/m68k/op_helper.c
> index 7b5126c..061d468 100644
> --- a/target/m68k/op_helper.c
> +++ b/target/m68k/op_helper.c
> @@ -361,7 +361,8 @@ void HELPER(divsll)(CPUM68KState *env, int numr, int regr, int32_t den)
>       env->dregs[numr] = quot;
>   }
>   
> -void HELPER(cas2w)(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2)
> +static void do_cas2w(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2,
> +                     bool parallel)
>   {
>       uint32_t Dc1 = extract32(regs, 9, 3);
>       uint32_t Dc2 = extract32(regs, 6, 3);
> @@ -374,7 +375,7 @@ void HELPER(cas2w)(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2)
>       int16_t l1, l2;
>       uintptr_t ra = GETPC();
>   
> -    if (parallel_cpus) {
> +    if (parallel) {
>           /* Tell the main loop we need to serialize this insn.  */
>           cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
>       } else {
> @@ -399,7 +400,19 @@ void HELPER(cas2w)(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2)
>       env->dregs[Dc2] = deposit32(env->dregs[Dc2], 0, 16, l2);
>   }
>   
> -void HELPER(cas2l)(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2)
> +void HELPER(cas2w)(CPUM68KState *env, uint32_t regs, uint32_t a1, uint32_t a2)
> +{
> +    do_cas2w(env, regs, a1, a2, false);
> +}
> +
> +void HELPER(cas2w_parallel)(CPUM68KState *env, uint32_t regs, uint32_t a1,
> +                            uint32_t a2)
> +{
> +    do_cas2w(env, regs, a1, a2, true);

Well, cas2w_parallel is now exactly equivalent to gen_helper_exit_atomic.
I probably should have done that parallel check in the translator to begin with.
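
i.e. at translation time, something like this (sketch; operand names
assumed -- and with this series the condition would come from the TB's
CF_PARALLEL rather than the global):

    if (parallel_cpus) {
        gen_helper_exit_atomic(cpu_env);
    } else {
        gen_helper_cas2w(cpu_env, regs, addr1, addr2);
    }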


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 19/45] target/s390x: check CF_PARALLEL instead of parallel_cpus
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 19/45] target/s390x: " Emilio G. Cota
@ 2017-07-17 23:53   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-17 23:53 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> Thereby decoupling the resulting translated code from the current state
> of the system.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>   target/s390x/helper.h     |  3 +++
>   target/s390x/mem_helper.c | 50 +++++++++++++++++++++++++++++++++++++++--------
>   target/s390x/translate.c  | 20 +++++++++++++++----
>   3 files changed, 61 insertions(+), 12 deletions(-)


Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 20/45] target/sparc: check CF_PARALLEL instead of parallel_cpus
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 20/45] target/sparc: " Emilio G. Cota
@ 2017-07-17 23:54   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-17 23:54 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> Thereby decoupling the resulting translated code from the current state
> of the system.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   target/sparc/translate.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 21/45] tcg: check CF_PARALLEL instead of parallel_cpus
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 21/45] tcg: " Emilio G. Cota
@ 2017-07-17 23:55   ` Richard Henderson
  2017-07-18  0:34     ` Emilio G. Cota
  0 siblings, 1 reply; 93+ messages in thread
From: Richard Henderson @ 2017-07-17 23:55 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> Thereby decoupling the resulting translated code from the current state
> of the system.
> 
> The tb->cflags field is not passed to tcg generation functions. So
> we add a bit to TCGContext, storing there whether CF_PARALLEL is set
> before translating every TB.
> 
> Most architectures have <= 32 registers, which results in a 4-byte hole
> in TCGContext. Use this hole for the bit we need; use a uint8_t instead
> of a bool, since a bool might take more than one byte in some systems.

I would much rather use bool.

(1) I don't care about OSX and its broken ABI,
(2) Even then OSX still *works*.

Otherwise,


> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>   tcg/tcg.h                 |  1 +
>   accel/tcg/translate-all.c |  1 +
>   tcg/tcg-op.c              | 10 +++++-----
>   3 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 96872f8..bd1fdfa 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -656,6 +656,7 @@ struct TCGContext {
>       uintptr_t *tb_jmp_target_addr; /* tb->jmp_target_addr if !USE_DIRECT_JUMP */
>   
>       TCGRegSet reserved_regs;
> +    uint8_t cf_parallel; /* whether CF_PARALLEL is set in tb->cflags */
>       intptr_t current_frame_offset;
>       intptr_t frame_start;
>       intptr_t frame_end;
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 483248f..80ac85a 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1275,6 +1275,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>       tb->flags = flags;
>       tb->cflags = cflags;
>       tb->trace_vcpu_dstate = *cpu->trace_dstate;
> +    tcg_ctx.cf_parallel = !!(cflags & CF_PARALLEL);
>   
>   #ifdef CONFIG_PROFILER
>       tcg_ctx.tb_count1++; /* includes aborted translations because of
> diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
> index 205d07f..ef420d4 100644
> --- a/tcg/tcg-op.c
> +++ b/tcg/tcg-op.c
> @@ -150,7 +150,7 @@ void tcg_gen_op6(TCGContext *ctx, TCGOpcode opc, TCGArg a1, TCGArg a2,
>   
>   void tcg_gen_mb(TCGBar mb_type)
>   {
> -    if (parallel_cpus) {
> +    if (tcg_ctx.cf_parallel) {
>           tcg_gen_op1(&tcg_ctx, INDEX_op_mb, mb_type);
>       }
>   }
> @@ -2794,7 +2794,7 @@ void tcg_gen_atomic_cmpxchg_i32(TCGv_i32 retv, TCGv addr, TCGv_i32 cmpv,
>   {
>       memop = tcg_canonicalize_memop(memop, 0, 0);
>   
> -    if (!parallel_cpus) {
> +    if (!tcg_ctx.cf_parallel) {
>           TCGv_i32 t1 = tcg_temp_new_i32();
>           TCGv_i32 t2 = tcg_temp_new_i32();
>   
> @@ -2838,7 +2838,7 @@ void tcg_gen_atomic_cmpxchg_i64(TCGv_i64 retv, TCGv addr, TCGv_i64 cmpv,
>   {
>       memop = tcg_canonicalize_memop(memop, 1, 0);
>   
> -    if (!parallel_cpus) {
> +    if (!tcg_ctx.cf_parallel) {
>           TCGv_i64 t1 = tcg_temp_new_i64();
>           TCGv_i64 t2 = tcg_temp_new_i64();
>   
> @@ -3015,7 +3015,7 @@ static void * const table_##NAME[16] = {                                \
>   void tcg_gen_atomic_##NAME##_i32                                        \
>       (TCGv_i32 ret, TCGv addr, TCGv_i32 val, TCGArg idx, TCGMemOp memop) \
>   {                                                                       \
> -    if (parallel_cpus) {                                                \
> +    if (tcg_ctx.cf_parallel) {                                          \
>           do_atomic_op_i32(ret, addr, val, idx, memop, table_##NAME);     \
>       } else {                                                            \
>           do_nonatomic_op_i32(ret, addr, val, idx, memop, NEW,            \
> @@ -3025,7 +3025,7 @@ void tcg_gen_atomic_##NAME##_i32                                        \
>   void tcg_gen_atomic_##NAME##_i64                                        \
>       (TCGv_i64 ret, TCGv addr, TCGv_i64 val, TCGArg idx, TCGMemOp memop) \
>   {                                                                       \
> -    if (parallel_cpus) {                                                \
> +    if (tcg_ctx.cf_parallel) {                                          \
>           do_atomic_op_i64(ret, addr, val, idx, memop, table_##NAME);     \
>       } else {                                                            \
>           do_nonatomic_op_i64(ret, addr, val, idx, memop, NEW,            \
> 

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 22/45] cpu-exec: lookup/generate TB outside exclusive region during step_atomic
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 22/45] cpu-exec: lookup/generate TB outside exclusive region during step_atomic Emilio G. Cota
@ 2017-07-18  0:01   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  0:01 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> Now that all code generation has been converted to check CF_PARALLEL, we can
> generate !CF_PARALLEL code without having yet set !parallel_cpus --
> and therefore without having to be in the exclusive region during
> cpu_exec_step_atomic.
> 
> While at it, merge cpu_exec_step into cpu_exec_step_atomic.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>   accel/tcg/cpu-exec.c | 26 ++++++++++++--------------
>   1 file changed, 12 insertions(+), 14 deletions(-)
> 
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index efe5c85..23e6d2c 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -226,7 +226,7 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
>   }
>   #endif
>   
> -static void cpu_exec_step(CPUState *cpu)
> +void cpu_exec_step_atomic(CPUState *cpu)
>   {
>       CPUClass *cc = CPU_GET_CLASS(cpu);
>       TranslationBlock *tb;
> @@ -239,16 +239,26 @@ static void cpu_exec_step(CPUState *cpu)
>           if (tb == NULL) {
>               mmap_lock();
>               tb_lock();
> -            tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
> +            tb = tb_htable_lookup(cpu, pc, cs_base, flags, mask_cf(cflags));
> +            if (likely(tb == NULL)) {
> +                tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
> +            }
>               tb_unlock();
>               mmap_unlock();
>           }
>   
> +        start_exclusive();
> +
> +        /* Since we got here, we know that parallel_cpus must be true.  */
> +        parallel_cpus = false;

Well, since we've moved parallel_cpus completely out of target/*, we no longer 
have to set this false, right?

I wonder how hard it would be to completely hide this variable now...
That said, even that would probably be better as a follow-on cleanup.
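
If so, the exclusive region could presumably shrink to roughly this
(sketch):

    start_exclusive();
    /* The TB was generated with !CF_PARALLEL, so there is no need to
     * flip parallel_cpus around the execution anymore. */
    cpu_tb_exec(cpu, tb);
    end_exclusive();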

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 23/45] translate-all: define and use DEBUG_TB_FLUSH_GATE
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 23/45] translate-all: define and use DEBUG_TB_FLUSH_GATE Emilio G. Cota
@ 2017-07-18  0:01   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  0:01 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> This gets rid of some ifdef checks while ensuring that the debug code
> is compiled, which prevents bit rot.
> 
> Suggested-by: Alex Bennée<alex.bennee@linaro.org>
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   accel/tcg/translate-all.c | 20 +++++++++++++-------
>   1 file changed, 13 insertions(+), 7 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 24/45] exec-all: introduce TB_PAGE_ADDR_FMT
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 24/45] exec-all: introduce TB_PAGE_ADDR_FMT Emilio G. Cota
@ 2017-07-18  0:02   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  0:02 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> And fix the following warning when DEBUG_TB_INVALIDATE is enabled
> in translate-all.c:
> 
>    CC      mipsn32-linux-user/accel/tcg/translate-all.o
> /data/src/qemu/accel/tcg/translate-all.c: In function ‘tb_alloc_page’:
> /data/src/qemu/accel/tcg/translate-all.c:1201:16: error: format ‘%lx’ expects argument of type ‘long unsigned int’, but argument 2 has type ‘tb_page_addr_t {aka unsigned int}’ [-Werror=format=]
>           printf("protecting code page: 0x" TARGET_FMT_lx "\n",
>                  ^
> cc1: all warnings being treated as errors
> /data/src/qemu/rules.mak:66: recipe for target 'accel/tcg/translate-all.o' failed
> make[1]: *** [accel/tcg/translate-all.o] Error 1
> Makefile:328: recipe for target 'subdir-mipsn32-linux-user' failed
> make: *** [subdir-mipsn32-linux-user] Error 2
> cota@flamenco:/data/src/qemu/build ((18f3fe1...) *$)$
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   include/exec/exec-all.h   | 2 ++
>   accel/tcg/translate-all.c | 3 +--
>   2 files changed, 3 insertions(+), 2 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 25/45] translate-all: define and use DEBUG_TB_INVALIDATE_GATE
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 25/45] translate-all: define and use DEBUG_TB_INVALIDATE_GATE Emilio G. Cota
@ 2017-07-18  0:02   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  0:02 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> This gets rid of an ifdef check while ensuring that the debug code
> is compiled, which prevents bit rot.
> 
> Suggested-by: Alex Bennée<alex.bennee@linaro.org>
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   accel/tcg/translate-all.c | 12 +++++++++---
>   1 file changed, 9 insertions(+), 3 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 26/45] translate-all: define and use DEBUG_TB_CHECK_GATE
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 26/45] translate-all: define and use DEBUG_TB_CHECK_GATE Emilio G. Cota
@ 2017-07-18  0:03   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  0:03 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> This prevents bit rot by ensuring the debug code is compiled when
> building a user-mode target.
> 
> Unfortunately the helpers are user-mode-only so we cannot fully
> get rid of the ifdef checks. Add a comment to explain this.
> 
> Suggested-by: Alex Bennée<alex.bennee@linaro.org>
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   accel/tcg/translate-all.c | 28 ++++++++++++++++++++++------
>   1 file changed, 22 insertions(+), 6 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 27/45] exec-all: extract tb->tc_* into a separate struct tc_tb
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 27/45] exec-all: extract tb->tc_* into a separate struct tc_tb Emilio G. Cota
@ 2017-07-18  0:04   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  0:04 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> In preparation for adding tc.size to be able to keep track of
> TB's using the binary search tree implementation from glib.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   include/exec/exec-all.h   | 20 ++++++++++++++------
>   accel/tcg/cpu-exec.c      |  6 +++---
>   accel/tcg/translate-all.c | 20 ++++++++++----------
>   tcg/tcg-runtime.c         |  4 ++--
>   tcg/tcg.c                 |  4 ++--
>   5 files changed, 31 insertions(+), 23 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 28/45] translate-all: use a binary search tree to track TBs in TBContext
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 28/45] translate-all: use a binary search tree to track TBs in TBContext Emilio G. Cota
@ 2017-07-18  0:05   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  0:05 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> This is a prerequisite for supporting multiple TCG contexts, since
> we will have threads generating code in separate regions of
> code_gen_buffer.
> 
> For this we need a new field (.size) in struct tb_tc to keep
> track of the size of the translated code. This field adds a 4-byte
> hole to the struct (and therefore to TranslationBlock), but we can
> live with that.
> 
> The comparison function we use is optimized for the common case:
> insertions. Profiling shows that upon booting debian-arm, 98%
> of comparisons are between existing tb's (i.e. a->size and b->size
> are both !0), which happens during insertions (and removals, but
> those are rare). The remaining cases are lookups. From reading the glib
> sources we see that the first key is always the lookup key. However,
> the code does not assume this is always the case, because this
> behaviour is not guaranteed in the glib docs. Still, we embed
> this knowledge in the code as a branch hint for the compiler.
> 
> Note that tb_free does not free space in the code_gen_buffer anymore,
> since we cannot easily know whether the tb is the last one inserted
> in code_gen_buffer. The next patch in this series renames tb_free
> to tb_remove to reflect this.
> 
> Performance-wise, lookups in tb_find_pc are the same as before:
> O(log n). However, insertions are O(log n) instead of O(1), which
> results in a small slowdown when booting debian-arm:
> 
> Performance counter stats for 'build/arm-softmmu/qemu-system-arm \
> 	-machine type=virt -nographic -smp 1 -m 4096 \
> 	-netdev user,id=unet,hostfwd=tcp::2222-:22 \
> 	-device virtio-net-device,netdev=unet \
> 	-drive file=img/arm/jessie-arm32.qcow2,id=myblock,index=0,if=none \
> 	-device virtio-blk-device,drive=myblock \
> 	-kernel img/arm/aarch32-current-linux-kernel-only.img \
> 	-append console=ttyAMA0 root=/dev/vda1 \
> 	-name arm,debug-threads=on -smp 1' (10 runs):
> 
> - Before:
> 
>         8048.598422      task-clock (msec)         #    0.931 CPUs utilized            ( +-  0.28% )
>              16,974      context-switches          #    0.002 M/sec                    ( +-  0.12% )
>                   0      cpu-migrations            #    0.000 K/sec
>              10,125      page-faults               #    0.001 M/sec                    ( +-  1.23% )
>      35,144,901,879      cycles                    #    4.367 GHz                      ( +-  0.14% )
>     <not supported>      stalled-cycles-frontend
>     <not supported>      stalled-cycles-backend
>      65,758,252,643      instructions              #    1.87  insns per cycle          ( +-  0.33% )
>      10,871,298,668      branches                  # 1350.707 M/sec                    ( +-  0.41% )
>         192,322,212      branch-misses             #    1.77% of all branches          ( +-  0.32% )
> 
>         8.640869419 seconds time elapsed                                          ( +-  0.57% )
> 
> - After:
>         8146.242027      task-clock (msec)         #    0.923 CPUs utilized            ( +-  1.23% )
>              17,016      context-switches          #    0.002 M/sec                    ( +-  0.40% )
>                   0      cpu-migrations            #    0.000 K/sec
>              18,769      page-faults               #    0.002 M/sec                    ( +-  0.45% )
>      35,660,956,120      cycles                    #    4.378 GHz                      ( +-  1.22% )
>     <not supported>      stalled-cycles-frontend
>     <not supported>      stalled-cycles-backend
>      65,095,366,607      instructions              #    1.83  insns per cycle          ( +-  1.73% )
>      10,803,480,261      branches                  # 1326.192 M/sec                    ( +-  1.95% )
>         195,601,289      branch-misses             #    1.81% of all branches          ( +-  0.39% )
> 
>         8.828660235 seconds time elapsed                                          ( +-  0.38% )
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>   include/exec/exec-all.h   |   5 ++
>   include/exec/tb-context.h |   4 +-
>   accel/tcg/translate-all.c | 217 ++++++++++++++++++++++++----------------------
>   3 files changed, 118 insertions(+), 108 deletions(-)
> 
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index 7356c3e..c7bf683 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -317,10 +317,15 @@ static inline void tlb_flush_by_mmuidx_all_cpus_synced(CPUState *cpu,
>   
>   /*
>    * Translation Cache-related fields of a TB.
> + * This struct exists just for convenience; we keep track of TB's in a binary
> + * search tree, and the only fields needed to compare TB's in the tree are
> + * @ptr and @size. @search is brought here for consistency, since it is also
> + * a TC-related field.
>    */
>   struct tb_tc {
>       void *ptr;    /* pointer to the translated code */
>       uint8_t *search;  /* pointer to search data */
> +    unsigned int size;

You might as well just use size_t and avoid the hole.

Otherwise,

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

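For illustration, such a comparator might look like this -- untested, and
only a sketch of the containment logic described in the commit message:

    #include <glib.h>

    struct tb_tc {
        void *ptr;           /* start of translated code */
        unsigned int size;   /* 0 for lookup keys */
    };

    static gint tb_tc_cmp(gconstpointer ap, gconstpointer bp)
    {
        const struct tb_tc *a = ap;
        const struct tb_tc *b = bp;

        /* Common case: both TBs exist, i.e. insertion or removal. */
        if (G_LIKELY(a->size && b->size)) {
            const char *pa = a->ptr, *pb = b->ptr;

            return pa < pb ? -1 : pa > pb;
        }
        /* Lookup: the key (size == 0) usually comes first, but glib does
         * not guarantee it, so handle both orders. A key matches the TB
         * whose code range contains it. */
        if (a->size == 0) {
            const char *p = a->ptr, *lo = b->ptr;

            return p < lo ? -1 : p >= lo + b->size ? 1 : 0;
        } else {
            const char *p = b->ptr, *lo = a->ptr;

            return p < lo ? 1 : p >= lo + a->size ? -1 : 0;
        }
    }
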
^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 29/45] exec-all: rename tb_free to tb_remove
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 29/45] exec-all: rename tb_free to tb_remove Emilio G. Cota
@ 2017-07-18  0:05   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  0:05 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> We don't really free anything in this function anymore; we just remove
> the TB from the binary search tree.
> 
> Suggested-by: Alex Bennée<alex.bennee@linaro.org>
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   include/exec/exec-all.h   | 2 +-
>   accel/tcg/cpu-exec.c      | 2 +-
>   accel/tcg/translate-all.c | 6 +++---
>   3 files changed, 5 insertions(+), 5 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 30/45] translate-all: report correct avg host TB size
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 30/45] translate-all: report correct avg host TB size Emilio G. Cota
@ 2017-07-18  0:06   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  0:06 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> Since commit 6e3b2bfd6 ("tcg: allocate TB structs before the
> corresponding translated code") we are not fully utilizing
> code_gen_buffer for translated code, and therefore are
> incorrectly reporting the amount of translated code as well as
> the average host TB size. Address this by:
> 
> - Making the conscious choice of misreporting the total translated code;
>    doing otherwise would mislead users into thinking "-tb-size" is not
>    honoured.
> 
> - Expanding tb_tree_stats to accurately count the bytes of translated code on
>    the host, and using this for reporting the average tb host size,
>    as well as the expansion ratio.
> 
> In the future we might want to consider reporting the accurate numbers for
> the total translated code, together with a "bookkeeping/overhead" field to
> account for the TB structs.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   accel/tcg/translate-all.c | 32 +++++++++++++++++++++++---------
>   1 file changed, 23 insertions(+), 9 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 31/45] tci: move tci_regs to tcg_qemu_tb_exec's stack
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 31/45] tci: move tci_regs to tcg_qemu_tb_exec's stack Emilio G. Cota
@ 2017-07-18  0:08   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  0:08 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> Groundwork for supporting multiple TCG contexts.
> 
> Compile-tested for all targets on an x86_64 host.
> 
> Suggested-by: Richard Henderson<rth@twiddle.net>
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   tcg/tci.c | 552 +++++++++++++++++++++++++++++++-------------------------------
>   1 file changed, 279 insertions(+), 273 deletions(-)

Acked-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 34/45] tcg: define tcg_init_ctx and make tcg_ctx a pointer
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 34/45] tcg: define tcg_init_ctx and make tcg_ctx a pointer Emilio G. Cota
@ 2017-07-18  0:09   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  0:09 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> Groundwork for supporting multiple TCG contexts.
> 
> The core of this patch is this change to tcg/tcg.h:
> 
>> -extern TCGContext tcg_ctx;
>> +extern TCGContext tcg_init_ctx;
>> +extern TCGContext *tcg_ctx;
> Note that for now we set *tcg_ctx to whatever TCGContext is passed
> to tcg_context_init -- in this case &tcg_init_ctx.
> 
> To avoid diff churn we could do something like
>> TCGContext *tcg_ctx_ptr;
>> #define tcg_ctx (*tcg_ctx_ptr)
> as Richard suggested during review, but sooner or later
> we'd end up doing the conversion anyway, so do it now.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---

That is indeed fewer instances than I would have guessed.

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 10/45] translate-all: guarantee that tb_hash only holds valid TBs
  2017-07-17 22:55   ` Richard Henderson
@ 2017-07-18  0:27     ` Emilio G. Cota
  2017-07-18  3:40       ` Richard Henderson
  0 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-18  0:27 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Mon, Jul 17, 2017 at 12:55:03 -1000, Richard Henderson wrote:
> On 07/16/2017 10:03 AM, Emilio G. Cota wrote:
> >@@ -1073,13 +1073,17 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
> >      assert_tb_locked();
> >-    atomic_set(&tb->invalid, true);
> >-
> >      /* remove the TB from the hash list */
> >      phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
> >      h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->trace_vcpu_dstate);
> >      qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
> >+    /*
> >+     * Mark the TB as invalid *after* it's been removed from tb_hash, which
> >+     * eliminates the need to check this bit on lookups.
> >+     */
> >+    tb->invalid = true;
> 
> I believe you need atomic_store_release here.  Previously we were relying on
> the lock acquisition in qht_remove to provide the required memory barrier.
> 
> We definitely need to make sure this reaches memory before we zap the TB in
> the CPU_FOREACH loop.

After this patch tb->invalid is only read/set with tb_lock held, so no need for
atomics while accessing it.

		E.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 11/45] exec-all: bring tb->invalid into tb->cflags
  2017-07-17 23:07   ` Richard Henderson
@ 2017-07-18  0:28     ` Emilio G. Cota
  0 siblings, 0 replies; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-18  0:28 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Mon, Jul 17, 2017 at 13:07:53 -1000, Richard Henderson wrote:
> On 07/16/2017 10:03 AM, Emilio G. Cota wrote:
> >-    tb->invalid = true;
> >+    tb->cflags |= CF_INVALID;
> 
> Modulo the store_release comment for the last patch,

same thing: note tb_lock is always held when checking the flag, and also
on removal.

		E.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 21/45] tcg: check CF_PARALLEL instead of parallel_cpus
  2017-07-17 23:55   ` Richard Henderson
@ 2017-07-18  0:34     ` Emilio G. Cota
  2017-07-18  3:42       ` Richard Henderson
  0 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-18  0:34 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Mon, Jul 17, 2017 at 13:55:42 -1000, Richard Henderson wrote:
> On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> >Thereby decoupling the resulting translated code from the current state
> >of the system.
> >
> >The tb->cflags field is not passed to tcg generation functions. So
> >we add a bit to TCGContext, storing there whether CF_PARALLEL is set
> >before translating every TB.
> >
> >Most architectures have <= 32 registers, which results in a 4-byte hole
> >in TCGContext. Use this hole for the bit we need; use a uint8_t instead
> >of a bool, since a bool might take more than one byte in some systems.
> 
> I would much rather use bool.
> 
> (1) I don't care about OSX and its broken ABI,
> (2) Even then OSX still *works*.

Will do.

> Otherwise,

Missing R-b tag?

		E.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 10/45] translate-all: guarantee that tb_hash only holds valid TBs
  2017-07-18  0:27     ` Emilio G. Cota
@ 2017-07-18  3:40       ` Richard Henderson
  2017-07-18  4:54         ` Emilio G. Cota
  0 siblings, 1 reply; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  3:40 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel

On 07/17/2017 02:27 PM, Emilio G. Cota wrote:
> On Mon, Jul 17, 2017 at 12:55:03 -1000, Richard Henderson wrote:
>> On 07/16/2017 10:03 AM, Emilio G. Cota wrote:
>>> @@ -1073,13 +1073,17 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>>>       assert_tb_locked();
>>> -    atomic_set(&tb->invalid, true);
>>> -
>>>       /* remove the TB from the hash list */
>>>       phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
>>>       h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->trace_vcpu_dstate);
>>>       qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
>>> +    /*
>>> +     * Mark the TB as invalid *after* it's been removed from tb_hash, which
>>> +     * eliminates the need to check this bit on lookups.
>>> +     */
>>> +    tb->invalid = true;
>>
>> I believe you need atomic_store_release here.  Previously we were relying on
>> the lock acquisition in qht_remove to provide the required memory barrier.
>>
>> We definitely need to make sure this reaches memory before we zap the TB in
>> the CPU_FOREACH loop.
> 
> After this patch tb->invalid is only read/set with tb_lock held, so no need for
> atomics while accessing it.

I think there's a path by which we do get stale data.
For threads A and B,

   (A) Lookup succeeds for TB in hash without tb_lock
        (B) Removes TB from hash
        (B) Sets tb->invalid
        (B) Clears FORALL_CPU jmp_cache
   (A) Store TB into local jmp_cache

... and since we never check for invalid again, there's nothing to evict TB 
from the jmp_cache again.

Here's a plan that will make me happy:

(1) Drop this patch, leaving the set of tb->invalid (or CF_INVALID) in place.
(2) Include CF_INVALID in the mask of bits compared in tb_lookup__cpu_state.
     (a) At this point in the patch set that's just

	(tb->cflags & CF_INVALID) == 0

     (b) Later in the patch series when CF_PARALLEL is introduced
         (and CF_HASH_MASK, let's call it, instead of the cf_mask
         function you have now), this becomes

         (tb->cflags & (CF_HASH_MASK | CF_INVALID)) == cf_mask

So that we continue to check CF_INVALID each time we lookup a TB, but now we 
get it for free as a part of the other flags check.
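
In other words, as a minimal standalone model (flag values illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define CF_PARALLEL  0x0008u   /* illustrative bit assignments */
    #define CF_INVALID   0x0010u
    #define CF_HASH_MASK CF_PARALLEL

    /* cf_mask is built from CF_HASH_MASK bits only, so CF_INVALID is
     * always clear in it; an invalidated TB can therefore never compare
     * equal, without a separate tb->invalid check. */
    static inline bool tb_cflags_match(uint32_t tb_cflags, uint32_t cf_mask)
    {
        return (tb_cflags & (CF_HASH_MASK | CF_INVALID)) == cf_mask;
    }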


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 21/45] tcg: check CF_PARALLEL instead of parallel_cpus
  2017-07-18  0:34     ` Emilio G. Cota
@ 2017-07-18  3:42       ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  3:42 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel

On 07/17/2017 02:34 PM, Emilio G. Cota wrote:
> On Mon, Jul 17, 2017 at 13:55:42 -1000, Richard Henderson wrote:
>> On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
>>> Thereby decoupling the resulting translated code from the current state
>>> of the system.
>>>
>>> The tb->cflags field is not passed to tcg generation functions. So
>>> we add a bit to TCGContext, storing there whether CF_PARALLEL is set
>>> before translating every TB.
>>>
>>> Most architectures have <= 32 registers, which results in a 4-byte hole
>>> in TCGContext. Use this hole for the bit we need; use a uint8_t instead
>>> of a bool, since a bool might take more than one byte in some systems.
>>
>> I would much rather use bool.
>>
>> (1) I don't care about OSX and its broken ABI,
>> (2) Even then OSX still *works*.
> 
> Will do.
> 
>> Otherwise,
> 
> Missing R-b tag?

Oops, yes.  Must have fat-fingered the ctrl-paste.

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 36/45] tcg: dynamically allocate optimizer globals + fold into TCGContext
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 36/45] tcg: dynamically allocate optimizer globals + fold " Emilio G. Cota
@ 2017-07-18  3:53   ` Richard Henderson
  2017-07-18  4:33     ` Emilio G. Cota
  0 siblings, 1 reply; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  3:53 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> Groundwork for supporting multiple TCG contexts.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>   tcg/tcg.h      | 12 ++++++++++++
>   tcg/optimize.c | 40 +++++++++++++++++++++++-----------------
>   2 files changed, 35 insertions(+), 17 deletions(-)
> 
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 569f823..175d4de 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -641,6 +641,14 @@ QEMU_BUILD_BUG_ON(OPPARAM_BUF_SIZE > (1 << 14));
>   /* Make sure that we don't overflow 64 bits without noticing.  */
>   QEMU_BUILD_BUG_ON(sizeof(TCGOp) > 8);
>   
> +struct tcg_temp_info {
> +    bool is_const;
> +    uint16_t prev_copy;
> +    uint16_t next_copy;
> +    tcg_target_ulong val;
> +    tcg_target_ulong mask;
> +};
> +
>   struct TCGContext {
>       uint8_t *pool_cur, *pool_end;
>       TCGPool *pool_first, *pool_current, *pool_first_large;
> @@ -717,6 +725,10 @@ struct TCGContext {
>       TCGTempSet free_temps[TCG_TYPE_COUNT * 2];
>       TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */
>   
> +    /* optimizer */
> +    struct tcg_temp_info *opt_temps;
> +    TCGTempSet opt_temps_used;

I would prefer either

   (1) Dynamic allocation.  I know we eschew that in most places,
       but surely this is the exact situation for which it's handy.

   (2) Make opt_temps an array of TCG_MAX_TEMPS and drop the pointer.

I think the TCGTempSet should be a local within tcg_optimize.


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 37/45] tcg: introduce **tcg_ctxs to keep track of all TCGContext's
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 37/45] tcg: introduce **tcg_ctxs to keep track of all TCGContext's Emilio G. Cota
@ 2017-07-18  4:17   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  4:17 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> Groundwork for supporting multiple TCG contexts.
> 
> Note that having n_tcg_ctxs is unnecessary. However, it is
> convenient to have it, since it will simplify iterating over the
> array: we'll have just a for loop instead of having to iterate
> over a NULL-terminated array (which would require n+1 elems)
> or having to check with ifdef's for usermode/softmmu.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>   tcg/tcg.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index f907c47..8094278 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -115,6 +115,8 @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
>   static void tcg_out_tb_init(TCGContext *s);
>   static bool tcg_out_tb_finalize(TCGContext *s);
>   
> +static TCGContext **tcg_ctxs;
> +static unsigned int n_tcg_ctxs;

I'm perfectly happy introducing these now, and converting stuff to use them.

> +static void tcg_ctxs_init(TCGContext *s)
> +{
> +    tcg_ctxs = g_new(TCGContext *, 1);
> +    tcg_ctxs[0] = s;
> +    n_tcg_ctxs = 1;
> +}

This was confusing to me, trying to figure out how this function would be 
extended for multi-threading.  But it turns out it isn't -- it just goes away.

> @@ -381,6 +390,7 @@ void tcg_context_init(TCGContext *s)
>           indirect_reg_alloc_order[i] = tcg_target_reg_alloc_order[i];
>       }
>   
> +    tcg_ctxs_init(s);
>       tcg_ctx = s;
>   }

Thus I think it would be simpler for the interim to do

     tcg_ctx = s;
     tcg_ctxs = &tcg_ctx;
     n_tcg_ctxs = 1;


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 38/45] tcg: distribute profiling counters across TCGContext's
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 38/45] tcg: distribute profiling counters across TCGContext's Emilio G. Cota
@ 2017-07-18  4:20   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  4:20 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> +#define PROF_ADD_MAX(to, from, field)                                   \
> +    do {                                                                \
> +        typeof((from)->field) val__ = atomic_read(&((from)->field));    \
> +        if (val__ > (to)->field) {                                      \
> +            (to)->field = val__;                                        \
> +        }                                                               \
> +    } while (0)

PROF_MAX?  There's no addition involved.

Otherwise,

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

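For context, a sketch of the plain accumulating counterpart this macro
presumably sits next to:

    #define PROF_ADD(to, from, field)                                   \
        do {                                                            \
            (to)->field += atomic_read(&((from)->field));               \
        } while (0)
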
^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 39/45] osdep: move qemu_real_host_page_size/mask to osdep
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 39/45] osdep: move qemu_real_host_page_size/mask to osdep Emilio G. Cota
@ 2017-07-18  4:22   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  4:22 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> These only depend on the host and therefore belong in the common
> osdep, not in a target-dependent object.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   include/exec/cpu-all.h | 2 --
>   include/qemu/osdep.h   | 8 ++++++++
>   exec.c                 | 5 +----
>   util/osdep.c           | 9 +++++++++
>   4 files changed, 18 insertions(+), 6 deletions(-)

I do wonder if a new file in util/ with a constructor to do the init would be cleaner.

But, ok I guess,

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 40/45] osdep: introduce qemu_mprotect_rwx/none
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 40/45] osdep: introduce qemu_mprotect_rwx/none Emilio G. Cota
@ 2017-07-18  4:26   ` Richard Henderson
  2017-07-18  4:57     ` Emilio G. Cota
  0 siblings, 1 reply; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  4:26 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> +static int qemu_mprotect__osdep(void *addr, size_t size, int prot)
> +{
> +    void *start = QEMU_ALIGN_PTR_DOWN(addr, qemu_real_host_page_size);
> +    void *end = QEMU_ALIGN_PTR_UP(addr + size, qemu_real_host_page_size);

I'm not keen on this.  Any good reason for it as opposed to asserting that the 
inputs are already page aligned?


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 36/45] tcg: dynamically allocate optimizer globals + fold into TCGContext
  2017-07-18  3:53   ` Richard Henderson
@ 2017-07-18  4:33     ` Emilio G. Cota
  2017-07-18  4:38       ` Richard Henderson
  0 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-18  4:33 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Mon, Jul 17, 2017 at 17:53:33 -1000, Richard Henderson wrote:
> On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> >Groundwork for supporting multiple TCG contexts.
(snip)
> >  struct TCGContext {
> >      uint8_t *pool_cur, *pool_end;
> >      TCGPool *pool_first, *pool_current, *pool_first_large;
> >@@ -717,6 +725,10 @@ struct TCGContext {
> >      TCGTempSet free_temps[TCG_TYPE_COUNT * 2];
> >      TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */
> >+    /* optimizer */
> >+    struct tcg_temp_info *opt_temps;
> >+    TCGTempSet opt_temps_used;
> 
> I would prefer either
> 
>   (1) Dynamic allocation.  I know we eschew that in most places,
>       but surely this is the exact situation for which it's handy.
> 
>   (2) Make opt_temps an array of TCG_MAX_TEMPS and drop the pointer.

Originally I implemented (2). But the array is pretty large, and I
realised that the init ctx doesn't use it at all. So I made
the allocation dynamic, i.e. tcg_optimize will allocate the
array if the ctx doesn't have it yet.

But I guess that's not what you mean by (1)? You mean to allocate
every single time we call tcg_optimize, allocating only the space
we need on each call?
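
Something like this, I suppose (sketch):

    void tcg_optimize(TCGContext *s)
    {
        int nb_temps = s->nb_temps;
        struct tcg_temp_info *infos = g_new(struct tcg_temp_info, nb_temps);

        /* ... the passes track per-temp state in infos[] ... */

        g_free(infos);
    }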

> I think the TCGTempSet should be a local within tcg_optimize.

Will do.

		E.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 44/45] translate-all: do not allocate a guard page for code_gen_buffer
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 44/45] translate-all: do not allocate a guard page for code_gen_buffer Emilio G. Cota
@ 2017-07-18  4:35   ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  4:35 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> TCG regions already have a guard page.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   accel/tcg/translate-all.c | 47 ++++++++++++-----------------------------------
>   1 file changed, 12 insertions(+), 35 deletions(-)

This should just be folded into the previous patch that creates TCG Regions.


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 36/45] tcg: dynamically allocate optimizer globals + fold into TCGContext
  2017-07-18  4:33     ` Emilio G. Cota
@ 2017-07-18  4:38       ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  4:38 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel

On 07/17/2017 06:33 PM, Emilio G. Cota wrote:
>> I would prefer either
>>
>>    (1) Dynamic allocation.  I know we eschew that in most places,
>>        but surely this is the exact situation for which it's handy.
...
> But I guess that's not what you mean with (1)? You mean to allocate
> every single time we call tcg_optimize, allocating only the space
> we need on each call?

Yes.

r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 10/45] translate-all: guarantee that tb_hash only holds valid TBs
  2017-07-18  3:40       ` Richard Henderson
@ 2017-07-18  4:54         ` Emilio G. Cota
  2017-07-18  5:29           ` Richard Henderson
  0 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-18  4:54 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Mon, Jul 17, 2017 at 17:40:29 -1000, Richard Henderson wrote:
> On 07/17/2017 02:27 PM, Emilio G. Cota wrote:
> >On Mon, Jul 17, 2017 at 12:55:03 -1000, Richard Henderson wrote:
> >>On 07/16/2017 10:03 AM, Emilio G. Cota wrote:
> >>>@@ -1073,13 +1073,17 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
> >>>      assert_tb_locked();
> >>>-    atomic_set(&tb->invalid, true);
> >>>-
> >>>      /* remove the TB from the hash list */
> >>>      phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
> >>>      h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->trace_vcpu_dstate);
> >>>      qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
> >>>+    /*
> >>>+     * Mark the TB as invalid *after* it's been removed from tb_hash, which
> >>>+     * eliminates the need to check this bit on lookups.
> >>>+     */
> >>>+    tb->invalid = true;
> >>
> >>I believe you need atomic_store_release here.  Previously we were relying on
> >>the lock acquisition in qht_remove to provide the required memory barrier.
> >>
> >>We definitely need to make sure this reaches memory before we zap the TB in
> >>the CPU_FOREACH loop.
> >
> >After this patch tb->invalid is only read/set with tb_lock held, so no need for
> >atomics while accessing it.
> 
> I think there's a path by which we do get stale data.
> For threads A and B,
> 
>   (A) Lookup succeeds for TB in hash without tb_lock
>        (B) Removes TB from hash
>        (B) Sets tb->invalid
>        (B) Clears FORALL_CPU jmp_cache
>   (A) Store TB into local jmp_cache
> 
> ... and since we never check for invalid again, there's nothing to evict TB
> from the jmp_cache again.

Ouch. Yes I see it now.

What threw me off was that in lookup_tb_ptr we're not checking tb->invalid,
and that biased me into thinking that it's not needed. But I should have
tried harder. Also, that's a bug, and yet another reason to have tb_lookup,
so that we fix these things at once in one place.

> Here's a plan that will make me happy:
> 
> (1) Drop this patch, leaving the set of tb->invalid (or CF_INVALID) in place.
> (2) Include CF_INVALID in the mask of bits compared in tb_lookup__cpu_state.
>     (a) At this point in the patch set that's just
> 
> 	(tb->cflags & CF_INVALID) == 0
> 
>     (b) Later in the patch series when CF_PARALLEL is introduced
>         (and CF_HASH_MASK, lets call it, instead of the cf_mask
>         function you have now), this becomes
> 
>         (tb->cflags & (CF_HASH_MASK | CF_INVALID)) == cf_mask
> 
> So that we continue to check CF_INVALID each time we lookup a TB, but now we
> get it for free as a part of the other flags check.

With the annoying atomic_read thrown in there :-) but yes, will do.

		E.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 40/45] osdep: introduce qemu_mprotect_rwx/none
  2017-07-18  4:26   ` Richard Henderson
@ 2017-07-18  4:57     ` Emilio G. Cota
  0 siblings, 0 replies; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-18  4:57 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Mon, Jul 17, 2017 at 18:26:09 -1000, Richard Henderson wrote:
> On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> >+static int qemu_mprotect__osdep(void *addr, size_t size, int prot)
> >+{
> >+    void *start = QEMU_ALIGN_PTR_DOWN(addr, qemu_real_host_page_size);
> >+    void *end = QEMU_ALIGN_PTR_UP(addr + size, qemu_real_host_page_size);
> 
> I'm not keen on this.  Any good reason for it as opposed to asserting that
> the inputs are already page aligned?

No particular reason other than "kept the same behaviour we had".

Let's go with asserts, I like that approach much better actually.
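
i.e. something like this (untested; POSIX side only):

    static int qemu_mprotect__osdep(void *addr, size_t size, int prot)
    {
        g_assert(!((uintptr_t)addr & ~qemu_real_host_page_mask));
        g_assert(!(size & ~qemu_real_host_page_mask));

        return mprotect(addr, size, prot);
    }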

		E.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 43/45] tcg: introduce regions to split code_gen_buffer
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 43/45] tcg: introduce regions to split code_gen_buffer Emilio G. Cota
@ 2017-07-18  5:09   ` Richard Henderson
  2017-07-18 17:44     ` Emilio G. Cota
  0 siblings, 1 reply; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  5:09 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> +#ifdef CONFIG_SOFTMMU
> +/*
> + * It is likely that some vCPUs will translate more code than others, so we
> + * first try to set more regions than smp_cpus, with those regions being
> + * larger than the minimum code_gen_buffer size. If that's not possible we
> + * make do by evenly dividing the code_gen_buffer among the vCPUs.
> + */
> +void softmmu_tcg_region_init(void)
> +{
> +    size_t i;
> +
> +    /* Use a single region if all we have is one vCPU thread */
> +    if (smp_cpus == 1 || !qemu_tcg_mttcg_enabled()) {
> +        tcg_region_init(0);
> +        return;
> +    }
> +
> +    for (i = 8; i > 0; i--) {
> +        size_t regions_per_thread = i;
> +        size_t region_size;
> +
> +        region_size = tcg_init_ctx.code_gen_buffer_size;
> +        region_size /= smp_cpus * regions_per_thread;
> +
> +        if (region_size >= 2 * MIN_CODE_GEN_BUFFER_SIZE) {
> +            tcg_region_init(smp_cpus * regions_per_thread);
> +            return;
> +        }
> +    }
> +    tcg_region_init(smp_cpus);
> +}
> +#endif

Any reason this code wouldn't just live in tcg_region_init?
It would certainly simplify the interface.

In particular it appears to be a mistake to ever call with n_regions == 0, 
since it's just as easy to call with n_regions == 1.

> diff --git a/cpus.c b/cpus.c
> index 14bb8d5..5455819 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -1664,6 +1664,18 @@ static void qemu_tcg_init_vcpu(CPUState *cpu)
>       char thread_name[VCPU_THREAD_NAME_SIZE];
>       static QemuCond *single_tcg_halt_cond;
>       static QemuThread *single_tcg_cpu_thread;
> +    static int tcg_region_inited;
> +
> +    /*
> +     * Initialize TCG regions--once, of course. Now is a good time, because:
> +     * (1) TCG's init context, prologue and target globals have been set up.
> +     * (2) qemu_tcg_mttcg_enabled() works now (TCG init code runs before the
> +     *     -accel flag is processed, so the check doesn't work then).
> +     */
> +    if (!tcg_region_inited) {
> +        softmmu_tcg_region_init();
> +        tcg_region_inited = 1;
> +    }

Nit: Do not require the compiler to hold the address of the global across 
another call, or recompute it.  Generally, this pattern is better as

   if (!flag) {
     flag = true;
     do_init();
   }

unless there's some compelling threading reason to delay setting the flag,
which is not the case here.

> +/* Call from a safe-work context */
> +void tcg_region_reset_all(void)
> +{
> +    unsigned int i;
> +
> +    qemu_mutex_lock(&region.lock);
> +    region.current = 0;
> +    region.n_full = 0;
> +
> +    for (i = 0; i < n_tcg_ctxs; i++) {
> +        if (unlikely(tcg_region_initial_alloc__locked(tcg_ctxs[i]))) {
> +            tcg_abort();
> +        }

Nit: I prefer

   bool ok = foo();
   assert(ok);

over

   if (!foo())
     abort();

> +static void tcg_region_set_guard_pages(void)
> +{
> +    size_t guard_size = qemu_real_host_page_size;
> +    size_t i;
> +
> +    for (i = 0; i < region.n; i++) {
> +        void *guard = region.buf + region.size + i * (region.size + guard_size);
> +
> +        if (qemu_mprotect_none(guard, qemu_real_host_page_size)) {

If you're going to have the local variable at all, guard_size here too.

> +            tcg_abort();
> +        }
> +    }
> +}
> +
> +/*
> + * Initializes region partitioning, setting the number of regions via
> + * @n_regions.
> + * Set @n_regions to 0 or 1 to use a single region that uses all of
> + * code_gen_buffer.
> + *
> + * Called at init time from the parent thread (i.e. the one calling
> + * tcg_context_init), after the target's TCG globals have been set.
> + *
> + * Region partitioning works by splitting code_gen_buffer into separate regions,
> + * and then assigning regions to TCG threads so that the threads can translate
> + * code in parallel without synchronization.
> + */
> +void tcg_region_init(size_t n_regions)
> +{
> +    void *buf = tcg_init_ctx.code_gen_buffer;
> +    size_t size = tcg_init_ctx.code_gen_buffer_size;
> +
> +    if (!n_regions) {
> +        n_regions = 1;
> +    }
> +
> +    /* start on a page-aligned address */
> +    buf = QEMU_ALIGN_PTR_UP(buf, qemu_real_host_page_size);
> +    if (unlikely(buf > tcg_init_ctx.code_gen_buffer + size)) {
> +        tcg_abort();
> +    }

assert.

> +    /* discard that initial portion */
> +    size -= buf - tcg_init_ctx.code_gen_buffer;
> +
> +    /* make region.size a multiple of page_size */
> +    region.size = size / n_regions;
> +    region.size &= qemu_real_host_page_mask;

QEMU_ALIGN_DOWN.

> +
> +    /* A region must have at least 2 pages; one code, one guard */
> +    if (unlikely(region.size < 2 * qemu_real_host_page_size)) {
> +        tcg_abort();
> +    }

assert.

> +
> +    /* do not count the guard page in region.size */
> +    region.size -= qemu_real_host_page_size;
> +    region.n = n_regions;
> +    region.buf = buf;
> +    tcg_region_set_guard_pages();

I think it would be clearer to inline the subroutine.  I was asking myself why 
we weren't subtracting the guard_size from region->size.
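
(For reference, the layout as I read it, with [g] = one guard page:

   buf: [ region 0 ][g][ region 1 ][g] ... [ region n-1 ][g]

i.e. each slot occupies region.size + guard_size bytes, and
region.size counts only the usable code bytes.)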


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 45/45] tcg: enable multiple TCG contexts in softmmu
  2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 45/45] tcg: enable multiple TCG contexts in softmmu Emilio G. Cota
@ 2017-07-18  5:25   ` Richard Henderson
  2017-07-18 17:52     ` Emilio G. Cota
  0 siblings, 1 reply; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  5:25 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> +
> +    /* claim the first free pointer in tcg_ctxs and increment n_tcg_ctxs */
> +    for (i = 0; i < smp_cpus; i++) {
> +        if (atomic_cmpxchg(&tcg_ctxs[i], NULL, s) == NULL) {
> +            unsigned int n;
> +
> +            n = atomic_fetch_inc(&n_tcg_ctxs);

Surely this is too much effort.  The increment on n_tcg_ctxs is sufficient to 
produce an index for assignment.  We never free the contexts...

Which also suggests that it might be better to avoid an indirection in tcg_ctxs 
and allocate all of the structures in one big block?  I.e.

TCGContext *tcg_ctxs;

// At the end of tcg_context_init.
#ifdef CONFIG_USER_ONLY
     tcg_ctxs = s;
#else
     // No need to zero; we'll completely overwrite each structure
     // during tcg_register_thread.
     tcg_ctxs = g_new(TCGContext, smp_cpus);
#endif


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 10/45] translate-all: guarantee that tb_hash only holds valid TBs
  2017-07-18  4:54         ` Emilio G. Cota
@ 2017-07-18  5:29           ` Richard Henderson
  2017-07-18 23:30             ` Emilio G. Cota
  0 siblings, 1 reply; 93+ messages in thread
From: Richard Henderson @ 2017-07-18  5:29 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel

On 07/17/2017 06:54 PM, Emilio G. Cota wrote:
> What threw me off was that in lookup_tb_ptr we're not checking tb->invalid,
> and that biased me into thinking that it's not needed. But I should have
> tried harder. Also, that's a bug, and yet another reason to have tb_lookup,
> so that we fix these things at once in one place.

Yes, me as well.  Quite right, we need only one copy of this code.

>>          (tb->cflags & (CF_HASH_MASK | CF_INVALID)) == cf_mask
>>
>> So that we continue to check CF_INVALID each time we lookup a TB, but now we
>> get it for free as a part of the other flags check.
> 
> With the annoying atomic_read thrown in there :-) but yes, will do.

Yes of course.  ;-)


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 43/45] tcg: introduce regions to split code_gen_buffer
  2017-07-18  5:09   ` Richard Henderson
@ 2017-07-18 17:44     ` Emilio G. Cota
  0 siblings, 0 replies; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-18 17:44 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Mon, Jul 17, 2017 at 19:09:28 -1000, Richard Henderson wrote:
> On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> >+#ifdef CONFIG_SOFTMMU
> >+/*
> >+ * It is likely that some vCPUs will translate more code than others, so we
> >+ * first try to set more regions than smp_cpus, with those regions being
> >+ * larger than the minimum code_gen_buffer size. If that's not possible we
> >+ * make do by evenly dividing the code_gen_buffer among the vCPUs.
> >+ */
> >+void softmmu_tcg_region_init(void)
> >+{
> >+    size_t i;
> >+
> >+    /* Use a single region if all we have is one vCPU thread */
> >+    if (smp_cpus == 1 || !qemu_tcg_mttcg_enabled()) {
> >+        tcg_region_init(0);
> >+        return;
> >+    }
> >+
> >+    for (i = 8; i > 0; i--) {
> >+        size_t regions_per_thread = i;
> >+        size_t region_size;
> >+
> >+        region_size = tcg_init_ctx.code_gen_buffer_size;
> >+        region_size /= smp_cpus * regions_per_thread;
> >+
> >+        if (region_size >= 2 * MIN_CODE_GEN_BUFFER_SIZE) {
> >+            tcg_region_init(smp_cpus * regions_per_thread);
> >+            return;
> >+        }
> >+    }
> >+    tcg_region_init(smp_cpus);
> >+}
> >+#endif
> 
> Any reason this code wouldn't just live in tcg_region_init?
> It would certainly simplify the interface.

Good point. Will move it there, adding a comment to make clear that the
function must be called only after qemu_tcg_mttcg_enabled() has been set up.
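
Roughly like this (sketch only; same 8..1 heuristic, and the helper
name is made up):

#ifdef CONFIG_SOFTMMU
/*
 * It is likely that some vCPUs will translate more code than others,
 * so first try to set more regions than smp_cpus, with those regions
 * being larger than the minimum code_gen_buffer size. If that's not
 * possible, make do by evenly dividing the buffer among the vCPUs.
 */
static size_t tcg_n_regions(void)
{
    size_t i;

    /* Use a single region if all we have is one vCPU thread */
    if (smp_cpus == 1 || !qemu_tcg_mttcg_enabled()) {
        return 1;
    }

    for (i = 8; i > 0; i--) {
        size_t region_size = tcg_init_ctx.code_gen_buffer_size;

        region_size /= smp_cpus * i;
        if (region_size >= 2 * MIN_CODE_GEN_BUFFER_SIZE) {
            return smp_cpus * i;
        }
    }
    return smp_cpus;
}
#else
static size_t tcg_n_regions(void)
{
    return 1;
}
#endif

tcg_region_init() would then start with n_regions = tcg_n_regions(),
and the n_regions == 0 special case goes away entirely.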

		E.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 45/45] tcg: enable multiple TCG contexts in softmmu
  2017-07-18  5:25   ` Richard Henderson
@ 2017-07-18 17:52     ` Emilio G. Cota
  2017-07-18 18:26       ` Richard Henderson
  0 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-18 17:52 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Mon, Jul 17, 2017 at 19:25:14 -1000, Richard Henderson wrote:
> On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
> >+
> >+    /* claim the first free pointer in tcg_ctxs and increment n_tcg_ctxs */
> >+    for (i = 0; i < smp_cpus; i++) {
> >+        if (atomic_cmpxchg(&tcg_ctxs[i], NULL, s) == NULL) {
> >+            unsigned int n;
> >+
> >+            n = atomic_fetch_inc(&n_tcg_ctxs);
> 
> Surely this is too much effort.  The increment on n_tcg_ctxs is sufficient
> to produce an index for assignment.  We never free the contexts...
> 
> Which also suggests that it might be better to avoid an indirection in
> tcg_ctxs and allocate all of the structures in one big block?  I.e.
> 
> TCGContext *tcg_ctxs;
> 
> // At the end of tcg_context_init.
> #ifdef CONFIG_USER_ONLY
>     tcg_ctxs = s;
> #else
>     // No need to zero; we'll completely overwrite each structure
>     // during tcg_register_thread.
>     tcg_ctxs = g_new(TCGContext, smp_cpus);
> #endif

The primary reason is that at this point we don't know yet whether
we'll need smp_cpus contexts or just one context (!mttcg), since
qemu_tcg_mttcg_enabled() is set up after this function is called
(the path is: vl.c -> configure_accelerator() -> tcg_context_init ->
this function; quite a bit later, qemu_tcg_configure() is
called, setting up the bool behind qemu_tcg_mttcg_enabled().)

A secondary reason is to avoid false sharing of cachelines. But
it seems quite unlikely that the last cacheline of TCGContext
will ever be used, so this isn't really an issue.
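
(FWIW, if we drop the cmpxchg as you suggest, the claim reduces to
something like this sketch:

    unsigned int n = atomic_fetch_inc(&n_tcg_ctxs);

    g_assert(n < smp_cpus);
    atomic_set(&tcg_ctxs[n], s);

keeping the pointer array so that each TCGContext can still be
allocated lazily, one per registered thread.)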

		E.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 45/45] tcg: enable multiple TCG contexts in softmmu
  2017-07-18 17:52     ` Emilio G. Cota
@ 2017-07-18 18:26       ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18 18:26 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel

On 07/18/2017 07:52 AM, Emilio G. Cota wrote:
> On Mon, Jul 17, 2017 at 19:25:14 -1000, Richard Henderson wrote:
>> On 07/16/2017 10:04 AM, Emilio G. Cota wrote:
>>> +
>>> +    /* claim the first free pointer in tcg_ctxs and increment n_tcg_ctxs */
>>> +    for (i = 0; i < smp_cpus; i++) {
>>> +        if (atomic_cmpxchg(&tcg_ctxs[i], NULL, s) == NULL) {
>>> +            unsigned int n;
>>> +
>>> +            n = atomic_fetch_inc(&n_tcg_ctxs);
>>
>> Surely this is too much effort.  The increment on n_tcg_ctxs is sufficient
>> to produce an index for assignment.  We never free the contexts...
>>
>> Which also suggests that it might be better to avoid an indirection in
>> tcg_ctxs and allocate all of the structures in one big block?  I.e.
>>
>> TCGContext *tcg_ctxs;
>>
>> // At the end of tcg_context_init.
>> #ifdef CONFIG_USER_ONLY
>>      tcg_ctxs = s;
>> #else
>>      // No need to zero; we'll completely overwrite each structure
>>      // during tcg_register_thread.
>>      tcg_ctxs = g_new(TCGContext, smp_cpus);
>> #endif
> 
> The primary reason is that at this point we don't know yet whether
> we'll need smp_cpus contexts or just one context (!mttcg), since
> qemu_tcg_mttcg_enabled() is set up after this function is called
> (the path is: vl.c -> configure_accelerator() -> tcg_context_init ->
> this function; quite a bit later, qemu_tcg_configure() is
> called, setting up the bool behind qemu_tcg_mttcg_enabled().)

Ok, leave it as-is for now.  But what I take from what you're saying is that 
startup is ripe for a bit of tidying up.  :-)


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 10/45] translate-all: guarantee that tb_hash only holds valid TBs
  2017-07-18  5:29           ` Richard Henderson
@ 2017-07-18 23:30             ` Emilio G. Cota
  2017-07-18 23:43               ` Richard Henderson
  0 siblings, 1 reply; 93+ messages in thread
From: Emilio G. Cota @ 2017-07-18 23:30 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Mon, Jul 17, 2017 at 19:29:57 -1000, Richard Henderson wrote:
> On 07/17/2017 06:54 PM, Emilio G. Cota wrote:
> >What threw me off was that in lookup_tb_ptr we're not checking tb->invalid,
> >and that biased me into thinking that it's not needed. But I should have
> >tried harder. Also, that's a bug, and yet another reason to have tb_lookup,
> >so that we fix these things at once in one place.
> 
> Yes, me as well.  Quite right, we need only one copy of this code.
> 
> >>         (tb->cflags & (CF_HASH_MASK | CF_INVALID)) == cf_mask
> >>
> >>So that we continue to check CF_INVALID each time we lookup a TB, but now we
> >>get it for free as a part of the other flags check.
> >
> >With the annoying atomic_read thrown in there :-) but yes, will do.
> 
> Yes of course.  ;-)

Gaah, we'll need to update all readers of tb->cflags, of which we have
plenty (~145, mostly in target code), to avoid C11 undefined behaviour
and make Paolo happy.

Should I do those updates in the same patch where tb->invalid is brought
over to cflags? Alternatives: have a later patch where all readers
are converted to atomic_read, or keep tb->invalid as a separate field (we
could use that 4-byte hole in struct tb_tc..)

		E.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: [Qemu-devel] [PATCH v2 10/45] translate-all: guarantee that tb_hash only holds valid TBs
  2017-07-18 23:30             ` Emilio G. Cota
@ 2017-07-18 23:43               ` Richard Henderson
  0 siblings, 0 replies; 93+ messages in thread
From: Richard Henderson @ 2017-07-18 23:43 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel

On 07/18/2017 01:30 PM, Emilio G. Cota wrote:
> Should I do those updates in the same patch where tb->invalid is brought
> over to cflags? Alternatives: have a later patch where all readers
> are converted to atomic_read, or keep tb->invalid as a separate field (we
> could use that 4-byte hole in struct tb_tc..)

I would prefer the readers be converted in a separate patch.
I wonder if an accessor inline might be in order?

static inline uint32_t tb_cflags(TranslationBlock *tb)
{
     return atomic_read(&tb->cflags);
}
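
e.g. a call site like

     if (atomic_read(&tb->cflags) & CF_PARALLEL) {

would then read

     if (tb_cflags(tb) & CF_PARALLEL) {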

That might keep line lengths from expanding too much...

Please don't try to do anything clever to re-use padding holes.
I think that's just confusing and probably premature optimization.


r~

^ permalink raw reply	[flat|nested] 93+ messages in thread

end of thread, other threads:[~2017-07-18 23:43 UTC | newest]

Thread overview: 93+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-16 20:03 [Qemu-devel] [PATCH v2 00/45] tcg: support for multiple TCG contexts Emilio G. Cota
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 01/45] vl: fix breakage of -tb-size Emilio G. Cota
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 02/45] translate-all: remove redundant !tcg_enabled check in dump_exec_info Emilio G. Cota
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 03/45] cputlb: bring back tlb_flush_count under !TLB_DEBUG Emilio G. Cota
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 04/45] tcg: fix corruption of code_time profiling counter upon tb_flush Emilio G. Cota
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 05/45] exec-all: fix typos in TranslationBlock's documentation Emilio G. Cota
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 06/45] translate-all: make have_tb_lock static Emilio G. Cota
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 07/45] cpu-exec: rename have_tb_lock to acquired_tb_lock in tb_find Emilio G. Cota
2017-07-17 22:39   ` Richard Henderson
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 08/45] tcg/i386: constify tcg_target_callee_save_regs Emilio G. Cota
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 09/45] tcg/mips: " Emilio G. Cota
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 10/45] translate-all: guarantee that tb_hash only holds valid TBs Emilio G. Cota
2017-07-17 22:55   ` Richard Henderson
2017-07-18  0:27     ` Emilio G. Cota
2017-07-18  3:40       ` Richard Henderson
2017-07-18  4:54         ` Emilio G. Cota
2017-07-18  5:29           ` Richard Henderson
2017-07-18 23:30             ` Emilio G. Cota
2017-07-18 23:43               ` Richard Henderson
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 11/45] exec-all: bring tb->invalid into tb->cflags Emilio G. Cota
2017-07-17 23:07   ` Richard Henderson
2017-07-18  0:28     ` Emilio G. Cota
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 12/45] tcg: remove addr argument from lookup_tb_ptr Emilio G. Cota
2017-07-17 23:25   ` Richard Henderson
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 13/45] tcg: consolidate TB lookups in tb_lookup__cpu_state Emilio G. Cota
2017-07-17 23:41   ` Richard Henderson
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 14/45] tcg: define CF_PARALLEL and use it for TB hashing Emilio G. Cota
2017-07-17 23:46   ` Richard Henderson
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 15/45] target/arm: check CF_PARALLEL instead of parallel_cpus Emilio G. Cota
2017-07-17 23:46   ` Richard Henderson
2017-07-16 20:03 ` [Qemu-devel] [PATCH v2 16/45] target/hppa: " Emilio G. Cota
2017-07-17 23:47   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 17/45] target/i386: " Emilio G. Cota
2017-07-17 23:47   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 18/45] target/m68k: " Emilio G. Cota
2017-07-17 23:52   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 19/45] target/s390x: " Emilio G. Cota
2017-07-17 23:53   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 20/45] target/sparc: " Emilio G. Cota
2017-07-17 23:54   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 21/45] tcg: " Emilio G. Cota
2017-07-17 23:55   ` Richard Henderson
2017-07-18  0:34     ` Emilio G. Cota
2017-07-18  3:42       ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 22/45] cpu-exec: lookup/generate TB outside exclusive region during step_atomic Emilio G. Cota
2017-07-18  0:01   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 23/45] translate-all: define and use DEBUG_TB_FLUSH_GATE Emilio G. Cota
2017-07-18  0:01   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 24/45] exec-all: introduce TB_PAGE_ADDR_FMT Emilio G. Cota
2017-07-18  0:02   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 25/45] translate-all: define and use DEBUG_TB_INVALIDATE_GATE Emilio G. Cota
2017-07-18  0:02   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 26/45] translate-all: define and use DEBUG_TB_CHECK_GATE Emilio G. Cota
2017-07-18  0:03   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 27/45] exec-all: extract tb->tc_* into a separate struct tc_tb Emilio G. Cota
2017-07-18  0:04   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 28/45] translate-all: use a binary search tree to track TBs in TBContext Emilio G. Cota
2017-07-18  0:05   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 29/45] exec-all: rename tb_free to tb_remove Emilio G. Cota
2017-07-18  0:05   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 30/45] translate-all: report correct avg host TB size Emilio G. Cota
2017-07-18  0:06   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 31/45] tci: move tci_regs to tcg_qemu_tb_exec's stack Emilio G. Cota
2017-07-18  0:08   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 32/45] tcg: take tb_ctx out of TCGContext Emilio G. Cota
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 33/45] tcg: take .helpers " Emilio G. Cota
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 34/45] tcg: define tcg_init_ctx and make tcg_ctx a pointer Emilio G. Cota
2017-07-18  0:09   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 35/45] gen-icount: fold exitreq_label into TCGContext Emilio G. Cota
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 36/45] tcg: dynamically allocate optimizer globals + fold " Emilio G. Cota
2017-07-18  3:53   ` Richard Henderson
2017-07-18  4:33     ` Emilio G. Cota
2017-07-18  4:38       ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 37/45] tcg: introduce **tcg_ctxs to keep track of all TCGContext's Emilio G. Cota
2017-07-18  4:17   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 38/45] tcg: distribute profiling counters across TCGContext's Emilio G. Cota
2017-07-18  4:20   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 39/45] osdep: move qemu_real_host_page_size/mask to osdep Emilio G. Cota
2017-07-18  4:22   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 40/45] osdep: introduce qemu_mprotect_rwx/none Emilio G. Cota
2017-07-18  4:26   ` Richard Henderson
2017-07-18  4:57     ` Emilio G. Cota
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 41/45] translate-all: use qemu_protect_rwx/none helpers Emilio G. Cota
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 42/45] tcg: define TCG_HIGHWATER Emilio G. Cota
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 43/45] tcg: introduce regions to split code_gen_buffer Emilio G. Cota
2017-07-18  5:09   ` Richard Henderson
2017-07-18 17:44     ` Emilio G. Cota
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 44/45] translate-all: do not allocate a guard page for code_gen_buffer Emilio G. Cota
2017-07-18  4:35   ` Richard Henderson
2017-07-16 20:04 ` [Qemu-devel] [PATCH v2 45/45] tcg: enable multiple TCG contexts in softmmu Emilio G. Cota
2017-07-18  5:25   ` Richard Henderson
2017-07-18 17:52     ` Emilio G. Cota
2017-07-18 18:26       ` Richard Henderson
