* [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG
@ 2017-07-09  7:49 Emilio G. Cota
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 01/22] vl: fix breakage of -tb-size Emilio G. Cota
                   ` (23 more replies)
  0 siblings, 24 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Original RFC here:
  https://lists.nongnu.org/archive/html/qemu-devel/2017-06/msg06874.html

I included Richard's feedback (Thanks!) from the original RFC, and
added quite a few things. This is now a proper PATCHset since it is
a lot more mature.

Highlights:
- It works! I tested single/multi-threaded arm, aarch64 and alpha softmmu
  with various -smp's (up to 120 on aarch64) and -tb-size's.
  I also tested x86_64-linux-user with multi-threaded code. valgrind's
  drd shows no obvious issues (although it doesn't understand C11
  atomics, so it spits out a lot of false positives). I have not tested
  on a non-x86 host, but given the audit I did of global non-const
  variables (see the commit message in patch 21), it should be OK.

- Region-based allocation to maximize code_gen_buffer utilization.
  See patch 20.

- Patches 1-8 are unrelated fixes, but I'm keeping them as part of this
  series to avoid merge headaches later on.

- Performance-wise, we get a 20% improvement when booting and shutting down
  debian-arm with MTTCG and -smp 8 (see patch 22). Not bad! This is due
  to not holding tb_lock during code translation, although the fact that
  we still have to take it after every translation remains a scalability
  issue. But before focusing on that, I'd like to get this reviewed.

I broke down features as much as possible, so that we do not end up
with a "per-thread TCG" megapatch.

The series applies on top of the current master (b11365867568).

Thanks,

		Emilio


* [Qemu-devel] [PATCH 01/22] vl: fix breakage of -tb-size
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
@ 2017-07-09  7:49 ` Emilio G. Cota
  2017-07-09 19:56   ` Richard Henderson
  2017-07-11 15:37   ` Alex Bennée
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 02/22] translate-all: remove redundant !tcg_enabled check in dump_exec_info Emilio G. Cota
                   ` (22 subsequent siblings)
  23 siblings, 2 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Commit e7b161d573 ("vl: add tcg_enabled() for tcg related code") adds
a check to exit the program when !tcg_enabled() while parsing the -tb-size
flag.

It turns out that when the -tb-size flag is evaluated, tcg_enabled() can
only return 0, since it is set (or not) much later by configure_accelerator().

Fix it by unconditionally exiting if the flag is passed to a QEMU binary
built with !CONFIG_TCG.
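
For reference, an abridged sketch of the ordering in vl.c's main()
(simplified; only the relevant lines are shown):

    /* abridged flow of main() in vl.c */
    for (;;) {
        switch (popt->index) {
        case QEMU_OPTION_tb_size:
            /* tcg_enabled() is always false at this point: no
             * accelerator has been configured yet */
            break;
        /* ... */
        }
    }
    configure_accelerator(current_machine); /* only here can
                                             * tcg_enabled() become true */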

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 vl.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/vl.c b/vl.c
index d17c863..9ece570 100644
--- a/vl.c
+++ b/vl.c
@@ -3933,10 +3933,10 @@ int main(int argc, char **argv, char **envp)
                 configure_rtc(opts);
                 break;
             case QEMU_OPTION_tb_size:
-                if (!tcg_enabled()) {
-                    error_report("TCG is disabled");
-                    exit(1);
-                }
+#ifndef CONFIG_TCG
+                error_report("TCG is disabled");
+                exit(1);
+#endif
                 if (qemu_strtoul(optarg, NULL, 0, &tcg_tb_size) < 0) {
                     error_report("Invalid argument to -tb-size");
                     exit(1);
-- 
2.7.4


* [Qemu-devel] [PATCH 02/22] translate-all: remove redundant !tcg_enabled check in dump_exec_info
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 01/22] vl: fix breakage of -tb-size Emilio G. Cota
@ 2017-07-09  7:49 ` Emilio G. Cota
  2017-07-09 19:57   ` Richard Henderson
                     ` (2 more replies)
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 03/22] cputlb: bring back tlb_flush_count under !TLB_DEBUG Emilio G. Cota
                   ` (21 subsequent siblings)
  23 siblings, 3 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This check is redundant because it is already performed by the only
caller of dump_exec_info -- the caller was updated by b7da97eef
("monitor: Check whether TCG is enabled before running the "info jit"
code").

Checking twice wouldn't necessarily be too bad, but here the early return
is also taken with tb_lock held, so the lock is never released. We can
either do the check before tb_lock is acquired, or just get rid of it.
Given that it is redundant, I am going for the latter option.
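
Spelled out, the removed code path looked like this:

    tb_lock();

    if (!tcg_enabled()) {
        cpu_fprintf(f, "TCG not enabled\n");
        return;    /* returns with tb_lock still held */
    }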

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index dfb9f0d..f768681 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1851,11 +1851,6 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
 
     tb_lock();
 
-    if (!tcg_enabled()) {
-        cpu_fprintf(f, "TCG not enabled\n");
-        return;
-    }
-
     target_code_size = 0;
     max_target_code_size = 0;
     cross_page = 0;
-- 
2.7.4


* [Qemu-devel] [PATCH 03/22] cputlb: bring back tlb_flush_count under !TLB_DEBUG
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 01/22] vl: fix breakage of -tb-size Emilio G. Cota
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 02/22] translate-all: remove redundant !tcg_enabled check in dump_exec_info Emilio G. Cota
@ 2017-07-09  7:49 ` Emilio G. Cota
  2017-07-09 20:00   ` Richard Henderson
  2017-07-12 13:26   ` Alex Bennée
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 04/22] tcg: fix corruption of code_time profiling counter upon tb_flush Emilio G. Cota
                   ` (20 subsequent siblings)
  23 siblings, 2 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Commit f0aff0f124 ("cputlb: add assert_cpu_is_self checks") buried
the increment of tlb_flush_count under TLB_DEBUG. This results in
"info jit" always (mis)reporting 0 TLB flushes when !TLB_DEBUG.

Besides, under MTTCG tlb_flush_count is updated by several threads, so
to avoid losing counts we'd have to either use atomic ops or distribute
the counter; the latter is more scalable.

This patch does the latter by embedding tlb_flush_count in CPUArchState.
The global count is then easily obtained by iterating over the CPU list.
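
The pattern is a single-writer counter: each vCPU increments only its
own field, and only from its own thread, so QEMU's relaxed
atomic_set()/atomic_read() pair suffices to avoid torn accesses -- no
read-modify-write is needed. A sketch of both sides:

    /* writer: runs only on the vCPU's own thread */
    atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);

    /* reader: any thread; sums the per-vCPU counts */
    CPU_FOREACH(cpu) {
        CPUArchState *env = cpu->env_ptr;

        count += atomic_read(&env->tlb_flush_count);
    }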

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/cpu-defs.h   |  1 +
 include/exec/cputlb.h     |  3 +--
 accel/tcg/cputlb.c        | 17 ++++++++++++++---
 accel/tcg/translate-all.c |  2 +-
 4 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index bc8e7f8..e43ff83 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -137,6 +137,7 @@ typedef struct CPUIOTLBEntry {
     CPUTLBEntry tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE];               \
     CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE];                    \
     CPUIOTLBEntry iotlb_v[NB_MMU_MODES][CPU_VTLB_SIZE];                 \
+    size_t tlb_flush_count;                                             \
     target_ulong tlb_flush_addr;                                        \
     target_ulong tlb_flush_mask;                                        \
     target_ulong vtlb_index;                                            \
diff --git a/include/exec/cputlb.h b/include/exec/cputlb.h
index 3f94178..c91db21 100644
--- a/include/exec/cputlb.h
+++ b/include/exec/cputlb.h
@@ -23,7 +23,6 @@
 /* cputlb.c */
 void tlb_protect_code(ram_addr_t ram_addr);
 void tlb_unprotect_code(ram_addr_t ram_addr);
-extern int tlb_flush_count;
-
+size_t tlb_flush_count(void);
 #endif
 #endif
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 85635ae..9377110 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -92,8 +92,18 @@ static void flush_all_helper(CPUState *src, run_on_cpu_func fn,
     }
 }
 
-/* statistics */
-int tlb_flush_count;
+size_t tlb_flush_count(void)
+{
+    CPUState *cpu;
+    size_t count = 0;
+
+    CPU_FOREACH(cpu) {
+        CPUArchState *env = cpu->env_ptr;
+
+        count += atomic_read(&env->tlb_flush_count);
+    }
+    return count;
+}
 
 /* This is OK because CPU architectures generally permit an
  * implementation to drop entries from the TLB at any time, so
@@ -112,7 +122,8 @@ static void tlb_flush_nocheck(CPUState *cpu)
     }
 
     assert_cpu_is_self(cpu);
-    tlb_debug("(count: %d)\n", tlb_flush_count++);
+    atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);
+    tlb_debug("(count: %zu)\n", tlb_flush_count());
 
     tb_lock();
 
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index f768681..a936a5f 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1909,7 +1909,7 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
             atomic_read(&tcg_ctx.tb_ctx.tb_flush_count));
     cpu_fprintf(f, "TB invalidate count %d\n",
             tcg_ctx.tb_ctx.tb_phys_invalidate_count);
-    cpu_fprintf(f, "TLB flush count     %d\n", tlb_flush_count);
+    cpu_fprintf(f, "TLB flush count     %zu\n", tlb_flush_count());
     tcg_dump_info(f, cpu_fprintf);
 
     tb_unlock();
-- 
2.7.4


* [Qemu-devel] [PATCH 04/22] tcg: fix corruption of code_time profiling counter upon tb_flush
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (2 preceding siblings ...)
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 03/22] cputlb: bring back tlb_flush_count under !TLB_DEBUG Emilio G. Cota
@ 2017-07-09  7:49 ` Emilio G. Cota
  2017-07-09 20:01   ` Richard Henderson
                     ` (2 more replies)
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 05/22] exec-all: fix typos in TranslationBlock's documentation Emilio G. Cota
                   ` (19 subsequent siblings)
  23 siblings, 3 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Whenever code_gen_buffer overflows (i.e. we run out of space in it and
have to flush it), the code_time profiling counter ends up with an
invalid value: code_time -= profile_getclock() is applied, but the
matching += profile_getclock() is skipped because of the goto.

Fix it by using the ti variable, so that we only update code_time
when there is no overflow. Note that when there is an overflow we fail
to account for the elapsed code generation time, but overflows are rare
enough that we can live with it.
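
In other words (abridged from tb_gen_code):

    /* before: the two updates can get decoupled */
    tcg_ctx.code_time -= profile_getclock();
    gen_code_size = tcg_gen_code(&tcg_ctx, tb);
    if (unlikely(gen_code_size < 0)) {
        goto buffer_overflow;               /* skips the += below */
    }
    tcg_ctx.code_time += profile_getclock();

    /* after: a single update, performed only on success */
    ti = profile_getclock();
    /* ... generate code; overflow paths goto buffer_overflow ... */
    tcg_ctx.code_time += profile_getclock() - ti;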

"info jit" before/after, roughly at the same time during debian-arm bootup:

- before:
Statistics:
TB flush count      1
TB invalidate count 4665
TLB flush count     998
JIT cycles          -615191529184601 (-256329.804 s at 2.4 GHz)
translated TBs      302310 (aborted=0 0.0%)
avg ops/TB          48.4 max=438
deleted ops/TB      8.54
avg temps/TB        32.31 max=38
avg host code/TB    361.5
avg search data/TB  24.5
cycles/op           -42014693.0
cycles/in byte      -121444900.2
cycles/out byte     -5629031.1
cycles/search byte     -83114481.0
  gen_interm time   -0.0%
  gen_code time     100.0%
optim./code time    -0.0%
liveness/code time  -0.0%
cpu_restore count   6236
  avg cycles        110.4

- after:
Statistics:
TB flush count      1
TB invalidate count 4665
TLB flush count     1010
JIT cycles          1996899624 (0.832 s at 2.4 GHz)
translated TBs      297961 (aborted=0 0.0%)
avg ops/TB          48.5 max=438
deleted ops/TB      8.56
avg temps/TB        32.31 max=38
avg host code/TB    361.8
avg search data/TB  24.5
cycles/op           138.2
cycles/in byte      398.4
cycles/out byte     18.5
cycles/search byte     273.1
  gen_interm time   14.0%
  gen_code time     86.0%
optim./code time    19.4%
liveness/code time  10.3%
cpu_restore count   6372
  avg cycles        111.0

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index a936a5f..72ce445 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1293,7 +1293,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 #ifdef CONFIG_PROFILER
     tcg_ctx.tb_count++;
     tcg_ctx.interm_time += profile_getclock() - ti;
-    tcg_ctx.code_time -= profile_getclock();
+    ti = profile_getclock();
 #endif
 
     /* ??? Overflow could be handled better here.  In particular, we
@@ -1311,7 +1311,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     }
 
 #ifdef CONFIG_PROFILER
-    tcg_ctx.code_time += profile_getclock();
+    tcg_ctx.code_time += profile_getclock() - ti;
     tcg_ctx.code_in_len += tb->size;
     tcg_ctx.code_out_len += gen_code_size;
     tcg_ctx.search_out_len += search_size;
-- 
2.7.4


* [Qemu-devel] [PATCH 05/22] exec-all: fix typos in TranslationBlock's documentation
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (3 preceding siblings ...)
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 04/22] tcg: fix corruption of code_time profiling counter upon tb_flush Emilio G. Cota
@ 2017-07-09  7:49 ` Emilio G. Cota
  2017-07-12 14:37   ` Alex Bennée
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 06/22] translate-all: make have_tb_lock static Emilio G. Cota
                   ` (18 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 8096d64..8326e7d 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -341,7 +341,7 @@ struct TranslationBlock {
     /* The following data are used to directly call another TB from
      * the code of this one. This can be done either by emitting direct or
      * indirect native jump instructions. These jumps are reset so that the TB
-     * just continue its execution. The TB can be linked to another one by
+     * just continues its execution. The TB can be linked to another one by
      * setting one of the jump targets (or patching the jump instruction). Only
      * two of such jumps are supported.
      */
@@ -352,7 +352,7 @@ struct TranslationBlock {
 #else
     uintptr_t jmp_target_addr[2]; /* target address for indirect jump */
 #endif
-    /* Each TB has an assosiated circular list of TBs jumping to this one.
+    /* Each TB has an associated circular list of TBs jumping to this one.
      * jmp_list_first points to the first TB jumping to this one.
      * jmp_list_next is used to point to the next TB in a list.
      * Since each TB can have two jumps, it can participate in two lists.
-- 
2.7.4


* [Qemu-devel] [PATCH 06/22] translate-all: make have_tb_lock static
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (4 preceding siblings ...)
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 05/22] exec-all: fix typos in TranslationBlock's documentation Emilio G. Cota
@ 2017-07-09  7:49 ` Emilio G. Cota
  2017-07-09 20:02   ` Richard Henderson
  2017-07-12 14:38   ` Alex Bennée
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 07/22] tcg/i386: constify tcg_target_callee_save_regs Emilio G. Cota
                   ` (17 subsequent siblings)
  23 siblings, 2 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

It is only used by this object, and it's not exported to any other.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 72ce445..2fa9f65 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -133,7 +133,7 @@ TCGContext tcg_ctx;
 bool parallel_cpus;
 
 /* translation block context */
-__thread int have_tb_lock;
+static __thread int have_tb_lock;
 
 static void page_table_config_init(void)
 {
-- 
2.7.4


* [Qemu-devel] [PATCH 07/22] tcg/i386: constify tcg_target_callee_save_regs
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (5 preceding siblings ...)
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 06/22] translate-all: make have_tb_lock static Emilio G. Cota
@ 2017-07-09  7:49 ` Emilio G. Cota
  2017-07-09 20:02   ` Richard Henderson
                     ` (2 more replies)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 08/22] tcg/mips: " Emilio G. Cota
                   ` (16 subsequent siblings)
  23 siblings, 3 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/i386/tcg-target.inc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 01e3b4e..06df01a 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -2514,7 +2514,7 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
     return NULL;
 }
 
-static int tcg_target_callee_save_regs[] = {
+static const int tcg_target_callee_save_regs[] = {
 #if TCG_TARGET_REG_BITS == 64
     TCG_REG_RBP,
     TCG_REG_RBX,
-- 
2.7.4


* [Qemu-devel] [PATCH 08/22] tcg/mips: constify tcg_target_callee_save_regs
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (6 preceding siblings ...)
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 07/22] tcg/i386: constify tcg_target_callee_save_regs Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-09 20:02   ` Richard Henderson
                     ` (2 more replies)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 09/22] exec-all: shrink tb->invalid to uint8_t Emilio G. Cota
                   ` (15 subsequent siblings)
  23 siblings, 3 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/mips/tcg-target.inc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcg/mips/tcg-target.inc.c b/tcg/mips/tcg-target.inc.c
index 8cff9a6..790b4fc 100644
--- a/tcg/mips/tcg-target.inc.c
+++ b/tcg/mips/tcg-target.inc.c
@@ -2323,7 +2323,7 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
     return NULL;
 }
 
-static int tcg_target_callee_save_regs[] = {
+static const int tcg_target_callee_save_regs[] = {
     TCG_REG_S0,       /* used for the global env (TCG_AREG0) */
     TCG_REG_S1,
     TCG_REG_S2,
-- 
2.7.4


* [Qemu-devel] [PATCH 09/22] exec-all: shrink tb->invalid to uint8_t
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (7 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 08/22] tcg/mips: " Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-09 20:11   ` Richard Henderson
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 10/22] exec-all: move tb->invalid to the end of the struct Emilio G. Cota
                   ` (14 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

To avoid wasting a byte. I don't have any use in mind for the freed byte,
but I think it's good to leave it explicitly free for future use.
See this discussion for how the u16 came to be:
  https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04564.html
We could use a bool, but on some systems that would take more than 1 byte.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 8326e7d..a388756 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -327,7 +327,7 @@ struct TranslationBlock {
 #define CF_USE_ICOUNT  0x20000
 #define CF_IGNORE_ICOUNT 0x40000 /* Do not generate icount code */
 
-    uint16_t invalid;
+    uint8_t invalid;
 
     void *tc_ptr;    /* pointer to the translated code */
     uint8_t *tc_search;  /* pointer to search data */
-- 
2.7.4


* [Qemu-devel] [PATCH 10/22] exec-all: move tb->invalid to the end of the struct
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (8 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 09/22] exec-all: shrink tb->invalid to uint8_t Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 11/22] translate-all: use a binary search tree to track TBs in TBContext Emilio G. Cota
                   ` (13 subsequent siblings)
  23 siblings, 0 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This opens up a 4-byte hole to be used by upcoming work.

Note that moving this field to the 2nd cache line of the struct
does not affect performance: tb->page_addr is in the 2nd cache
line as well, and both are accessed during code lookup. Besides,
the tb->invalid check is easily predicted.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index a388756..fd20bca 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -327,8 +327,6 @@ struct TranslationBlock {
 #define CF_USE_ICOUNT  0x20000
 #define CF_IGNORE_ICOUNT 0x40000 /* Do not generate icount code */
 
-    uint8_t invalid;
-
     void *tc_ptr;    /* pointer to the translated code */
     uint8_t *tc_search;  /* pointer to search data */
     /* original tb when cflags has CF_NOCACHE */
@@ -366,6 +364,7 @@ struct TranslationBlock {
      */
     uintptr_t jmp_list_next[2];
     uintptr_t jmp_list_first;
+    uint8_t invalid;
 };
 
 void tb_free(TranslationBlock *tb);
-- 
2.7.4


* [Qemu-devel] [PATCH 11/22] translate-all: use a binary search tree to track TBs in TBContext
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (9 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 10/22] exec-all: move tb->invalid to the end of the struct Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-09 20:33   ` Richard Henderson
  2017-07-12 15:10   ` Alex Bennée
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 12/22] translate-all: report correct avg host TB size Emilio G. Cota
                   ` (12 subsequent siblings)
  23 siblings, 2 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This is a prerequisite for having threads generate code on separate
buffers, which will help scalability when booting multiple cores
under MTTCG.

For this we need a new field (.tc_size) in TranslationBlock to keep
track of the size of the translated code. This field is added into
a 4-byte hole that the previous commit created.

In order to use glib's binary search tree we embed a helper struct
in TranslationBlock to allow us to compare tb's based on their
tc_ptr as well as their tc_size fields. We use an anonymous struct
in TranslationBlock to minimize churn; the alternatives I can
see are to (a) just add a comment and cross our fingers, (b) use
-fms-extensions, and (c) embed the struct and update all calling
code. I think using an anonymous struct is superior, but I can be
persuaded otherwise.

The comparison function we use is optimized for the common case:
insertions. Profiling shows that upon booting debian-arm, 98%
of comparisons are between existing tb's (i.e. a->size and b->size
are both !0), which happens during insertions (and removals, but
those are rare). The remaining cases are lookups. From reading the glib
sources we see that the first key is always the lookup key; however,
the glib docs do not guarantee this behaviour, so the code does not
assume it always holds. We do, though, embed this knowledge in the
code as a branch hint for the compiler.
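
Concretely, a lookup key is built with .size == 0, mirroring
tb_find_pc() in this patch:

    struct ptr_size s = { .ptr = (void *)tc_ptr };  /* .size == 0: lookup */
    TranslationBlock *tb = g_tree_lookup(tcg_ctx.tb_ctx.tb_tree, &s);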

Note that tb_free does not free space in the code_gen_buffer anymore,
since we cannot easily know whether the tb is the last one inserted
in code_gen_buffer.

Performance-wise, lookups in tb_find_pc are the same as before:
O(log n). However, insertions are O(log n) instead of O(1), which
results in a small slowdown when booting debian-arm:

Performance counter stats for 'build/arm-softmmu/qemu-system-arm \
	-machine type=virt -nographic -smp 1 -m 4096 \
	-netdev user,id=unet,hostfwd=tcp::2222-:22 \
	-device virtio-net-device,netdev=unet \
	-drive file=img/arm/jessie-arm32.qcow2,id=myblock,index=0,if=none \
	-device virtio-blk-device,drive=myblock \
	-kernel img/arm/aarch32-current-linux-kernel-only.img \
	-append console=ttyAMA0 root=/dev/vda1 \
	-name arm,debug-threads=on -smp 1' (10 runs):

- Before:

       8048.598422      task-clock (msec)         #    0.931 CPUs utilized            ( +-  0.28% )
            16,974      context-switches          #    0.002 M/sec                    ( +-  0.12% )
                 0      cpu-migrations            #    0.000 K/sec
            10,125      page-faults               #    0.001 M/sec                    ( +-  1.23% )
    35,144,901,879      cycles                    #    4.367 GHz                      ( +-  0.14% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    65,758,252,643      instructions              #    1.87  insns per cycle          ( +-  0.33% )
    10,871,298,668      branches                  # 1350.707 M/sec                    ( +-  0.41% )
       192,322,212      branch-misses             #    1.77% of all branches          ( +-  0.32% )

       8.640869419 seconds time elapsed                                          ( +-  0.57% )

- After:
       8146.242027      task-clock (msec)         #    0.923 CPUs utilized            ( +-  1.23% )
            17,016      context-switches          #    0.002 M/sec                    ( +-  0.40% )
                 0      cpu-migrations            #    0.000 K/sec
            18,769      page-faults               #    0.002 M/sec                    ( +-  0.45% )
    35,660,956,120      cycles                    #    4.378 GHz                      ( +-  1.22% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    65,095,366,607      instructions              #    1.83  insns per cycle          ( +-  1.73% )
    10,803,480,261      branches                  # 1326.192 M/sec                    ( +-  1.95% )
       195,601,289      branch-misses             #    1.81% of all branches          ( +-  0.39% )

       8.828660235 seconds time elapsed                                          ( +-  0.38% )

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h   |  17 +++-
 include/exec/tb-context.h |   4 +-
 accel/tcg/translate-all.c | 212 ++++++++++++++++++++++++----------------------
 3 files changed, 125 insertions(+), 108 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index fd20bca..673b26d 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -320,14 +320,25 @@ struct TranslationBlock {
     uint16_t size;      /* size of target code for this block (1 <=
                            size <= TARGET_PAGE_SIZE) */
     uint16_t icount;
-    uint32_t cflags;    /* compile flags */
+    /*
+     * @tc_size must be kept right after @tc_ptr to facilitate TB lookups in a
+     * binary search tree -- see struct ptr_size.
+     * We use an anonymous struct here to avoid updating all calling code,
+     * which would be quite a lot of churn.
+     * The only reason to bring @cflags into the anonymous struct is to
+     * avoid inducing a hole in TranslationBlock.
+     */
+    struct {
+        void *tc_ptr;    /* pointer to the translated code */
+        uint32_t tc_size; /* size of translated code for this block */
+
+        uint32_t cflags;    /* compile flags */
 #define CF_COUNT_MASK  0x7fff
 #define CF_LAST_IO     0x8000 /* Last insn may be an IO access.  */
 #define CF_NOCACHE     0x10000 /* To be freed after execution */
 #define CF_USE_ICOUNT  0x20000
 #define CF_IGNORE_ICOUNT 0x40000 /* Do not generate icount code */
-
-    void *tc_ptr;    /* pointer to the translated code */
+    };
     uint8_t *tc_search;  /* pointer to search data */
     /* original tb when cflags has CF_NOCACHE */
     struct TranslationBlock *orig_tb;
diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
index 25c2afe..1fa8dcc 100644
--- a/include/exec/tb-context.h
+++ b/include/exec/tb-context.h
@@ -31,10 +31,8 @@ typedef struct TBContext TBContext;
 
 struct TBContext {
 
-    TranslationBlock **tbs;
+    GTree *tb_tree;
     struct qht htable;
-    size_t tbs_size;
-    int nb_tbs;
     /* any access to the tbs or the page table must use this lock */
     QemuMutex tb_lock;
 
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 2fa9f65..aa3a08b 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -752,6 +752,47 @@ static inline void *alloc_code_gen_buffer(void)
 }
 #endif /* USE_STATIC_CODE_GEN_BUFFER, WIN32, POSIX */
 
+struct ptr_size {
+    void *ptr;
+    uint32_t size;
+};
+
+/* compare a single @ptr and a ptr_size @s */
+static int ptr_size_cmp(const void *ptr, const struct ptr_size *s)
+{
+    if (ptr >= s->ptr + s->size) {
+        return 1;
+    } else if (ptr < s->ptr) {
+        return -1;
+    }
+    return 0;
+}
+
+static gint tc_ptr_cmp(gconstpointer ap, gconstpointer bp)
+{
+    const struct ptr_size *a = ap;
+    const struct ptr_size *b = bp;
+
+    /*
+     * When both sizes are set, we know this isn't a lookup and therefore
+     * the two buffers are non-overlapping: a pointer comparison will do.
+     * This is the most likely case: every TB must be inserted; lookups
+     * are a lot less frequent.
+     */
+    if (likely(a->size && b->size)) {
+        return a->ptr - b->ptr;
+    }
+    /*
+     * A lookup key has its .size field set to 0.
+     * From the glib sources we see that @ap is always the lookup key. However
+     * the docs provide no guarantee, so we just mark this case as likely.
+     */
+    if (likely(a->size == 0)) {
+        return ptr_size_cmp(a->ptr, b);
+    }
+    return ptr_size_cmp(b->ptr, a);
+}
+
 static inline void code_gen_alloc(size_t tb_size)
 {
     tcg_ctx.code_gen_buffer_size = size_code_gen_buffer(tb_size);
@@ -760,15 +801,7 @@ static inline void code_gen_alloc(size_t tb_size)
         fprintf(stderr, "Could not allocate dynamic translator buffer\n");
         exit(1);
     }
-
-    /* size this conservatively -- realloc later if needed */
-    tcg_ctx.tb_ctx.tbs_size =
-        tcg_ctx.code_gen_buffer_size / CODE_GEN_AVG_BLOCK_SIZE / 8;
-    if (unlikely(!tcg_ctx.tb_ctx.tbs_size)) {
-        tcg_ctx.tb_ctx.tbs_size = 64 * 1024;
-    }
-    tcg_ctx.tb_ctx.tbs = g_new(TranslationBlock *, tcg_ctx.tb_ctx.tbs_size);
-
+    tcg_ctx.tb_ctx.tb_tree = g_tree_new(tc_ptr_cmp);
     qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
 }
 
@@ -805,7 +838,6 @@ void tcg_exec_init(unsigned long tb_size)
 static TranslationBlock *tb_alloc(target_ulong pc)
 {
     TranslationBlock *tb;
-    TBContext *ctx;
 
     assert_tb_locked();
 
@@ -813,12 +845,6 @@ static TranslationBlock *tb_alloc(target_ulong pc)
     if (unlikely(tb == NULL)) {
         return NULL;
     }
-    ctx = &tcg_ctx.tb_ctx;
-    if (unlikely(ctx->nb_tbs == ctx->tbs_size)) {
-        ctx->tbs_size *= 2;
-        ctx->tbs = g_renew(TranslationBlock *, ctx->tbs, ctx->tbs_size);
-    }
-    ctx->tbs[ctx->nb_tbs++] = tb;
     return tb;
 }
 
@@ -827,16 +853,7 @@ void tb_free(TranslationBlock *tb)
 {
     assert_tb_locked();
 
-    /* In practice this is mostly used for single use temporary TB
-       Ignore the hard cases and just back up if this TB happens to
-       be the last one generated.  */
-    if (tcg_ctx.tb_ctx.nb_tbs > 0 &&
-            tb == tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) {
-        size_t struct_size = ROUND_UP(sizeof(*tb), qemu_icache_linesize);
-
-        tcg_ctx.code_gen_ptr = tb->tc_ptr - struct_size;
-        tcg_ctx.tb_ctx.nb_tbs--;
-    }
+    g_tree_remove(tcg_ctx.tb_ctx.tb_tree, &tb->tc_ptr);
 }
 
 static inline void invalidate_page_bitmap(PageDesc *p)
@@ -884,6 +901,8 @@ static void page_flush_tb(void)
 /* flush all the translation blocks */
 static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
 {
+    int nb_tbs __attribute__((unused));
+
     tb_lock();
 
     /* If it is already been done on request of another CPU,
@@ -894,11 +913,12 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     }
 
 #if defined(DEBUG_TB_FLUSH)
+    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
     printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
            (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
-           tcg_ctx.tb_ctx.nb_tbs, tcg_ctx.tb_ctx.nb_tbs > 0 ?
+           nb_tbs, nb_tbs > 0 ?
            ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)) /
-           tcg_ctx.tb_ctx.nb_tbs : 0);
+           nb_tbs : 0);
 #endif
     if ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)
         > tcg_ctx.code_gen_buffer_size) {
@@ -909,7 +929,10 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
         cpu_tb_jmp_cache_clear(cpu);
     }
 
-    tcg_ctx.tb_ctx.nb_tbs = 0;
+    /* Increment the refcount first so that destroy acts as a reset */
+    g_tree_ref(tcg_ctx.tb_ctx.tb_tree);
+    g_tree_destroy(tcg_ctx.tb_ctx.tb_tree);
+
     qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
@@ -1309,6 +1332,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     if (unlikely(search_size < 0)) {
         goto buffer_overflow;
     }
+    tb->tc_size = gen_code_size;
 
 #ifdef CONFIG_PROFILER
     tcg_ctx.code_time += profile_getclock() - ti;
@@ -1359,6 +1383,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
      * through the physical hash table and physical page list.
      */
     tb_link_page(tb, phys_pc, phys_page2);
+    g_tree_insert(tcg_ctx.tb_ctx.tb_tree, &tb->tc_ptr, tb);
     return tb;
 }
 
@@ -1627,37 +1652,16 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
 }
 #endif
 
-/* find the TB 'tb' such that tb[0].tc_ptr <= tc_ptr <
-   tb[1].tc_ptr. Return NULL if not found */
+/*
+ * Find the TB 'tb' such that
+ * tb->tc_ptr <= tc_ptr < tb->tc_ptr + tb->tc_size
+ * Return NULL if not found.
+ */
 static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
 {
-    int m_min, m_max, m;
-    uintptr_t v;
-    TranslationBlock *tb;
+    struct ptr_size s = { .ptr = (void *)tc_ptr };
 
-    if (tcg_ctx.tb_ctx.nb_tbs <= 0) {
-        return NULL;
-    }
-    if (tc_ptr < (uintptr_t)tcg_ctx.code_gen_buffer ||
-        tc_ptr >= (uintptr_t)tcg_ctx.code_gen_ptr) {
-        return NULL;
-    }
-    /* binary search (cf Knuth) */
-    m_min = 0;
-    m_max = tcg_ctx.tb_ctx.nb_tbs - 1;
-    while (m_min <= m_max) {
-        m = (m_min + m_max) >> 1;
-        tb = tcg_ctx.tb_ctx.tbs[m];
-        v = (uintptr_t)tb->tc_ptr;
-        if (v == tc_ptr) {
-            return tb;
-        } else if (tc_ptr < v) {
-            m_max = m - 1;
-        } else {
-            m_min = m + 1;
-        }
-    }
-    return tcg_ctx.tb_ctx.tbs[m_max];
+    return g_tree_lookup(tcg_ctx.tb_ctx.tb_tree, &s);
 }
 
 #if !defined(CONFIG_USER_ONLY)
@@ -1842,63 +1846,67 @@ static void print_qht_statistics(FILE *f, fprintf_function cpu_fprintf,
     g_free(hgram);
 }
 
+struct tb_tree_stats {
+    size_t target_size;
+    size_t max_target_size;
+    size_t direct_jmp_count;
+    size_t direct_jmp2_count;
+    size_t cross_page;
+};
+
+static gboolean tb_tree_stats_iter(gpointer key, gpointer value, gpointer data)
+{
+    const TranslationBlock *tb = value;
+    struct tb_tree_stats *tst = data;
+
+    tst->target_size += tb->size;
+    if (tb->size > tst->max_target_size) {
+        tst->max_target_size = tb->size;
+    }
+    if (tb->page_addr[1] != -1) {
+        tst->cross_page++;
+    }
+    if (tb->jmp_reset_offset[0] != TB_JMP_RESET_OFFSET_INVALID) {
+        tst->direct_jmp_count++;
+        if (tb->jmp_reset_offset[1] != TB_JMP_RESET_OFFSET_INVALID) {
+            tst->direct_jmp2_count++;
+        }
+    }
+    return false;
+}
+
 void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
 {
-    int i, target_code_size, max_target_code_size;
-    int direct_jmp_count, direct_jmp2_count, cross_page;
-    TranslationBlock *tb;
+    struct tb_tree_stats tst = {};
     struct qht_stats hst;
+    int nb_tbs;
 
     tb_lock();
 
-    target_code_size = 0;
-    max_target_code_size = 0;
-    cross_page = 0;
-    direct_jmp_count = 0;
-    direct_jmp2_count = 0;
-    for (i = 0; i < tcg_ctx.tb_ctx.nb_tbs; i++) {
-        tb = tcg_ctx.tb_ctx.tbs[i];
-        target_code_size += tb->size;
-        if (tb->size > max_target_code_size) {
-            max_target_code_size = tb->size;
-        }
-        if (tb->page_addr[1] != -1) {
-            cross_page++;
-        }
-        if (tb->jmp_reset_offset[0] != TB_JMP_RESET_OFFSET_INVALID) {
-            direct_jmp_count++;
-            if (tb->jmp_reset_offset[1] != TB_JMP_RESET_OFFSET_INVALID) {
-                direct_jmp2_count++;
-            }
-        }
-    }
+    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
+    g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
     /* XXX: avoid using doubles ? */
     cpu_fprintf(f, "Translation buffer state:\n");
     cpu_fprintf(f, "gen code size       %td/%zd\n",
                 tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
                 tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer);
-    cpu_fprintf(f, "TB count            %d\n", tcg_ctx.tb_ctx.nb_tbs);
-    cpu_fprintf(f, "TB avg target size  %d max=%d bytes\n",
-            tcg_ctx.tb_ctx.nb_tbs ? target_code_size /
-                    tcg_ctx.tb_ctx.nb_tbs : 0,
-            max_target_code_size);
+    cpu_fprintf(f, "TB count            %d\n", nb_tbs);
+    cpu_fprintf(f, "TB avg target size  %zu max=%zu bytes\n",
+                nb_tbs ? tst.target_size / nb_tbs : 0,
+                tst.max_target_size);
     cpu_fprintf(f, "TB avg host size    %td bytes (expansion ratio: %0.1f)\n",
-            tcg_ctx.tb_ctx.nb_tbs ? (tcg_ctx.code_gen_ptr -
-                                     tcg_ctx.code_gen_buffer) /
-                                     tcg_ctx.tb_ctx.nb_tbs : 0,
-                target_code_size ? (double) (tcg_ctx.code_gen_ptr -
-                                             tcg_ctx.code_gen_buffer) /
-                                             target_code_size : 0);
-    cpu_fprintf(f, "cross page TB count %d (%d%%)\n", cross_page,
-            tcg_ctx.tb_ctx.nb_tbs ? (cross_page * 100) /
-                                    tcg_ctx.tb_ctx.nb_tbs : 0);
-    cpu_fprintf(f, "direct jump count   %d (%d%%) (2 jumps=%d %d%%)\n",
-                direct_jmp_count,
-                tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp_count * 100) /
-                        tcg_ctx.tb_ctx.nb_tbs : 0,
-                direct_jmp2_count,
-                tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp2_count * 100) /
-                        tcg_ctx.tb_ctx.nb_tbs : 0);
+                nb_tbs ? (tcg_ctx.code_gen_ptr -
+                          tcg_ctx.code_gen_buffer) / nb_tbs : 0,
+                tst.target_size ? (double) (tcg_ctx.code_gen_ptr -
+                                            tcg_ctx.code_gen_buffer) /
+                                            tst.target_size : 0);
+    cpu_fprintf(f, "cross page TB count %zu (%zu%%)\n", tst.cross_page,
+            nb_tbs ? (tst.cross_page * 100) / nb_tbs : 0);
+    cpu_fprintf(f, "direct jump count   %zu (%zu%%) (2 jumps=%zu %zu%%)\n",
+                tst.direct_jmp_count,
+                nb_tbs ? (tst.direct_jmp_count * 100) / nb_tbs : 0,
+                tst.direct_jmp2_count,
+                nb_tbs ? (tst.direct_jmp2_count * 100) / nb_tbs : 0);
 
     qht_statistics_init(&tcg_ctx.tb_ctx.htable, &hst);
     print_qht_statistics(f, cpu_fprintf, hst);
-- 
2.7.4


* [Qemu-devel] [PATCH 12/22] translate-all: report correct avg host TB size
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (10 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 11/22] translate-all: use a binary search tree to track TBs in TBContext Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-12 15:25   ` Alex Bennée
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 13/22] tcg: take tb_ctx out of TCGContext Emilio G. Cota
                   ` (11 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Since commit 6e3b2bfd6 ("tcg: allocate TB structs before the
corresponding translated code") we are not fully utilizing
code_gen_buffer for translated code, and therefore are
incorrectly reporting the amount of translated code as well as
the average host TB size. Address this by:

- Making the conscious choice of misreporting the total translated code;
  doing otherwise would mislead users into thinking "-tb-size" is not
  honoured.

- Expanding tb_tree_stats to accurately count the bytes of translated code on
  the host, and using this for reporting the average tb host size,
  as well as the expansion ratio.

In the future we might want to consider reporting the accurate numbers for
the total translated code, together with a "bookkeeping/overhead" field to
account for the TB structs.
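
A toy example with made-up numbers: with 1000 TBs, sum(tb->tc_size) =
360 KB, and code_gen_ptr - code_gen_buffer = 500 KB (the difference
being TB structs and padding):

    gen code size     -> 500 KB           (keeps -tb-size accounting honest)
    TB avg host size  -> 360 KB / 1000 TBs = 360 bytes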

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/translate-all.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index aa3a08b..aa71292 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -898,9 +898,20 @@ static void page_flush_tb(void)
     }
 }
 
+static __attribute__((unused))
+gboolean tb_host_size_iter(gpointer key, gpointer value, gpointer data)
+{
+    const TranslationBlock *tb = value;
+    size_t *size = data;
+
+    *size += tb->tc_size;
+    return false;
+}
+
 /* flush all the translation blocks */
 static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
 {
+    size_t host_size __attribute__((unused)) = 0;
     int nb_tbs __attribute__((unused));
 
     tb_lock();
@@ -913,12 +924,11 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     }
 
 #if defined(DEBUG_TB_FLUSH)
+    g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_host_size_iter, &host_size);
     nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
-    printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
+    printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%zu\n",
            (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
-           nb_tbs, nb_tbs > 0 ?
-           ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)) /
-           nb_tbs : 0);
+           nb_tbs, nb_tbs > 0 ? host_size / nb_tbs : 0);
 #endif
     if ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)
         > tcg_ctx.code_gen_buffer_size) {
@@ -1847,6 +1857,7 @@ static void print_qht_statistics(FILE *f, fprintf_function cpu_fprintf,
 }
 
 struct tb_tree_stats {
+    size_t host_size;
     size_t target_size;
     size_t max_target_size;
     size_t direct_jmp_count;
@@ -1859,6 +1870,7 @@ static gboolean tb_tree_stats_iter(gpointer key, gpointer value, gpointer data)
     const TranslationBlock *tb = value;
     struct tb_tree_stats *tst = data;
 
+    tst->host_size += tb->tc_size;
     tst->target_size += tb->size;
     if (tb->size > tst->max_target_size) {
         tst->max_target_size = tb->size;
@@ -1887,6 +1899,11 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
     /* XXX: avoid using doubles ? */
     cpu_fprintf(f, "Translation buffer state:\n");
+    /*
+     * Report total code size including the padding and TB structs;
+     * otherwise users might think "-tb-size" is not honoured.
+     * For avg host size we use the precise numbers from tb_tree_stats though.
+     */
     cpu_fprintf(f, "gen code size       %td/%zd\n",
                 tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
                 tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer);
@@ -1894,12 +1911,9 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     cpu_fprintf(f, "TB avg target size  %zu max=%zu bytes\n",
                 nb_tbs ? tst.target_size / nb_tbs : 0,
                 tst.max_target_size);
-    cpu_fprintf(f, "TB avg host size    %td bytes (expansion ratio: %0.1f)\n",
-                nb_tbs ? (tcg_ctx.code_gen_ptr -
-                          tcg_ctx.code_gen_buffer) / nb_tbs : 0,
-                tst.target_size ? (double) (tcg_ctx.code_gen_ptr -
-                                            tcg_ctx.code_gen_buffer) /
-                                            tst.target_size : 0);
+    cpu_fprintf(f, "TB avg host size    %zu bytes (expansion ratio: %0.1f)\n",
+                nb_tbs ? tst.host_size / nb_tbs : 0,
+                tst.target_size ? (double)tst.host_size / tst.target_size : 0);
     cpu_fprintf(f, "cross page TB count %zu (%zu%%)\n", tst.cross_page,
             nb_tbs ? (tst.cross_page * 100) / nb_tbs : 0);
     cpu_fprintf(f, "direct jump count   %zu (%zu%%) (2 jumps=%zu %zu%%)\n",
-- 
2.7.4


* [Qemu-devel] [PATCH 13/22] tcg: take tb_ctx out of TCGContext
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (11 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 12/22] translate-all: report correct avg host TB size Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-12 15:27   ` Alex Bennée
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 14/22] tcg: take .helpers " Emilio G. Cota
                   ` (10 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This is groundwork for making TCGContext thread-local.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/tb-context.h |  2 ++
 tcg/tcg.h                 |  2 --
 accel/tcg/cpu-exec.c      |  2 +-
 accel/tcg/translate-all.c | 57 +++++++++++++++++++++++------------------------
 linux-user/main.c         |  6 ++---
 5 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
index 1fa8dcc..1d41202 100644
--- a/include/exec/tb-context.h
+++ b/include/exec/tb-context.h
@@ -41,4 +41,6 @@ struct TBContext {
     int tb_phys_invalidate_count;
 };
 
+extern TBContext tb_ctx;
+
 #endif
diff --git a/tcg/tcg.h b/tcg/tcg.h
index da78721..ad2d959 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -706,8 +706,6 @@ struct TCGContext {
     /* Threshold to flush the translated code buffer.  */
     void *code_gen_highwater;
 
-    TBContext tb_ctx;
-
     /* Track which vCPU triggers events */
     CPUState *cpu;                      /* *_trans */
     TCGv_env tcg_env;                   /* *_exec  */
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 3581618..54ecae2 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -323,7 +323,7 @@ TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
     phys_pc = get_page_addr_code(desc.env, pc);
     desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
     h = tb_hash_func(phys_pc, pc, flags);
-    return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
+    return qht_lookup(&tb_ctx.htable, tb_cmp, &desc, h);
 }
 
 static inline TranslationBlock *tb_find(CPUState *cpu,
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index aa71292..84e19d9 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -130,6 +130,7 @@ static void *l1_map[V_L1_MAX_SIZE];
 
 /* code generation context */
 TCGContext tcg_ctx;
+TBContext tb_ctx;
 bool parallel_cpus;
 
 /* translation block context */
@@ -161,7 +162,7 @@ static void page_table_config_init(void)
 void tb_lock(void)
 {
     assert_tb_unlocked();
-    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
+    qemu_mutex_lock(&tb_ctx.tb_lock);
     have_tb_lock++;
 }
 
@@ -169,13 +170,13 @@ void tb_unlock(void)
 {
     assert_tb_locked();
     have_tb_lock--;
-    qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+    qemu_mutex_unlock(&tb_ctx.tb_lock);
 }
 
 void tb_lock_reset(void)
 {
     if (have_tb_lock) {
-        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+        qemu_mutex_unlock(&tb_ctx.tb_lock);
         have_tb_lock = 0;
     }
 }
@@ -801,15 +802,15 @@ static inline void code_gen_alloc(size_t tb_size)
         fprintf(stderr, "Could not allocate dynamic translator buffer\n");
         exit(1);
     }
-    tcg_ctx.tb_ctx.tb_tree = g_tree_new(tc_ptr_cmp);
-    qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
+    tb_ctx.tb_tree = g_tree_new(tc_ptr_cmp);
+    qemu_mutex_init(&tb_ctx.tb_lock);
 }
 
 static void tb_htable_init(void)
 {
     unsigned int mode = QHT_MODE_AUTO_RESIZE;
 
-    qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
+    qht_init(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
 }
 
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
@@ -853,7 +854,7 @@ void tb_free(TranslationBlock *tb)
 {
     assert_tb_locked();
 
-    g_tree_remove(tcg_ctx.tb_ctx.tb_tree, &tb->tc_ptr);
+    g_tree_remove(tb_ctx.tb_tree, &tb->tc_ptr);
 }
 
 static inline void invalidate_page_bitmap(PageDesc *p)
@@ -919,13 +920,13 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     /* If it is already been done on request of another CPU,
      * just retry.
      */
-    if (tcg_ctx.tb_ctx.tb_flush_count != tb_flush_count.host_int) {
+    if (tb_ctx.tb_flush_count != tb_flush_count.host_int) {
         goto done;
     }
 
 #if defined(DEBUG_TB_FLUSH)
-    g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_host_size_iter, &host_size);
-    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
+    g_tree_foreach(tb_ctx.tb_tree, tb_host_size_iter, &host_size);
+    nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
     printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%zu\n",
            (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
            nb_tbs, nb_tbs > 0 ? host_size / nb_tbs : 0);
@@ -940,17 +941,16 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     }
 
     /* Increment the refcount first so that destroy acts as a reset */
-    g_tree_ref(tcg_ctx.tb_ctx.tb_tree);
-    g_tree_destroy(tcg_ctx.tb_ctx.tb_tree);
+    g_tree_ref(tb_ctx.tb_tree);
+    g_tree_destroy(tb_ctx.tb_tree);
 
-    qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
+    qht_reset_size(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
     tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
     /* XXX: flush processor icache at this point if cache flush is
        expensive */
-    atomic_mb_set(&tcg_ctx.tb_ctx.tb_flush_count,
-                  tcg_ctx.tb_ctx.tb_flush_count + 1);
+    atomic_mb_set(&tb_ctx.tb_flush_count, tb_ctx.tb_flush_count + 1);
 
 done:
     tb_unlock();
@@ -959,7 +959,7 @@ done:
 void tb_flush(CPUState *cpu)
 {
     if (tcg_enabled()) {
-        unsigned tb_flush_count = atomic_mb_read(&tcg_ctx.tb_ctx.tb_flush_count);
+        unsigned tb_flush_count = atomic_mb_read(&tb_ctx.tb_flush_count);
         async_safe_run_on_cpu(cpu, do_tb_flush,
                               RUN_ON_CPU_HOST_INT(tb_flush_count));
     }
@@ -986,7 +986,7 @@ do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 static void tb_invalidate_check(target_ulong address)
 {
     address &= TARGET_PAGE_MASK;
-    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address);
+    qht_iter(&tb_ctx.htable, do_tb_invalidate_check, &address);
 }
 
 static void
@@ -1006,7 +1006,7 @@ do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 /* verify that all the pages have correct rights for code */
 static void tb_page_check(void)
 {
-    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
+    qht_iter(&tb_ctx.htable, do_tb_page_check, NULL);
 }
 
 #endif
@@ -1105,7 +1105,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
     h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
+    qht_remove(&tb_ctx.htable, tb, h);
 
     /* remove the TB from the page list */
     if (tb->page_addr[0] != page_addr) {
@@ -1134,7 +1134,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     /* suppress any remaining jumps to this TB */
     tb_jmp_unlink(tb);
 
-    tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
+    tb_ctx.tb_phys_invalidate_count++;
 }
 
 #ifdef CONFIG_SOFTMMU
@@ -1250,7 +1250,7 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
 
     /* add in the hash table */
     h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
+    qht_insert(&tb_ctx.htable, tb, h);
 
 #ifdef DEBUG_TB_CHECK
     tb_page_check();
@@ -1393,7 +1393,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
      * through the physical hash table and physical page list.
      */
     tb_link_page(tb, phys_pc, phys_page2);
-    g_tree_insert(tcg_ctx.tb_ctx.tb_tree, &tb->tc_ptr, tb);
+    g_tree_insert(tb_ctx.tb_tree, &tb->tc_ptr, tb);
     return tb;
 }
 
@@ -1671,7 +1671,7 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
 {
     struct ptr_size s = { .ptr = (void *)tc_ptr };
 
-    return g_tree_lookup(tcg_ctx.tb_ctx.tb_tree, &s);
+    return g_tree_lookup(tb_ctx.tb_tree, &s);
 }
 
 #if !defined(CONFIG_USER_ONLY)
@@ -1895,8 +1895,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
 
     tb_lock();
 
-    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
-    g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
+    nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
+    g_tree_foreach(tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
     /* XXX: avoid using doubles ? */
     cpu_fprintf(f, "Translation buffer state:\n");
     /*
@@ -1922,15 +1922,14 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
                 tst.direct_jmp2_count,
                 nb_tbs ? (tst.direct_jmp2_count * 100) / nb_tbs : 0);
 
-    qht_statistics_init(&tcg_ctx.tb_ctx.htable, &hst);
+    qht_statistics_init(&tb_ctx.htable, &hst);
     print_qht_statistics(f, cpu_fprintf, hst);
     qht_statistics_destroy(&hst);
 
     cpu_fprintf(f, "\nStatistics:\n");
     cpu_fprintf(f, "TB flush count      %u\n",
-            atomic_read(&tcg_ctx.tb_ctx.tb_flush_count));
-    cpu_fprintf(f, "TB invalidate count %d\n",
-            tcg_ctx.tb_ctx.tb_phys_invalidate_count);
+                atomic_read(&tb_ctx.tb_flush_count));
+    cpu_fprintf(f, "TB invalidate count %d\n", tb_ctx.tb_phys_invalidate_count);
     cpu_fprintf(f, "TLB flush count     %zu\n", tlb_flush_count());
     tcg_dump_info(f, cpu_fprintf);
 
diff --git a/linux-user/main.c b/linux-user/main.c
index ad03c9e..630c73d 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -114,7 +114,7 @@ int cpu_get_pic_interrupt(CPUX86State *env)
 void fork_start(void)
 {
     cpu_list_lock();
-    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
+    qemu_mutex_lock(&tb_ctx.tb_lock);
     mmap_fork_start();
 }
 
@@ -130,11 +130,11 @@ void fork_end(int child)
                 QTAILQ_REMOVE(&cpus, cpu, node);
             }
         }
-        qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
+        qemu_mutex_init(&tb_ctx.tb_lock);
         qemu_init_cpu_list();
         gdbserver_fork(thread_cpu);
     } else {
-        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
+        qemu_mutex_unlock(&tb_ctx.tb_lock);
         cpu_list_unlock();
     }
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [Qemu-devel] [PATCH 14/22] tcg: take .helpers out of TCGContext
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (12 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 13/22] tcg: take tb_ctx out of TCGContext Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-09 20:35   ` Richard Henderson
  2017-07-12 15:28   ` Alex Bennée
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 15/22] gen-icount: fold exitreq_label into TCGContext Emilio G. Cota
                   ` (9 subsequent siblings)
  23 siblings, 2 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Before TCGContext is made thread-local.

The hash table becomes read-only after it is filled in,
so we can save space by keeping just a global pointer to it.
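
For illustration, here is a standalone sketch (plain GLib, not the patch
itself) of the fill-once, read-many pattern, using direct pointer hashing
as above:

    #include <glib.h>

    static GHashTable *table;   /* filled once at init, read-only afterwards */

    static void table_init(void)
    {
        /* NULL, NULL => g_direct_hash/g_direct_equal: keys are raw pointers */
        table = g_hash_table_new(NULL, NULL);
        g_hash_table_insert(table, (gpointer)table_init,
                            (gpointer)"table_init");
    }

    static const char *table_name(gpointer func)
    {
        /* safe from any thread once the table is no longer being modified */
        return g_hash_table_lookup(table, func);
    }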

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.h |  2 --
 tcg/tcg.c | 10 +++++-----
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index ad2d959..4f57878 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -663,8 +663,6 @@ struct TCGContext {
 
     tcg_insn_unit *code_ptr;
 
-    GHashTable *helpers;
-
 #ifdef CONFIG_PROFILER
     /* profiling info */
     int64_t tb_count1;
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 3559829..d9b083a 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -319,6 +319,7 @@ typedef struct TCGHelperInfo {
 static const TCGHelperInfo all_helpers[] = {
 #include "exec/helper-tcg.h"
 };
+static GHashTable *helper_table;
 
 static int indirect_reg_alloc_order[ARRAY_SIZE(tcg_target_reg_alloc_order)];
 static void process_op_defs(TCGContext *s);
@@ -329,7 +330,6 @@ void tcg_context_init(TCGContext *s)
     TCGOpDef *def;
     TCGArgConstraint *args_ct;
     int *sorted_args;
-    GHashTable *helper_table;
 
     memset(s, 0, sizeof(*s));
     s->nb_globals = 0;
@@ -357,7 +357,7 @@ void tcg_context_init(TCGContext *s)
 
     /* Register helpers.  */
     /* Use g_direct_hash/equal for direct pointer comparisons on func.  */
-    s->helpers = helper_table = g_hash_table_new(NULL, NULL);
+    helper_table = g_hash_table_new(NULL, NULL);
 
     for (i = 0; i < ARRAY_SIZE(all_helpers); ++i) {
         g_hash_table_insert(helper_table, (gpointer)all_helpers[i].func,
@@ -761,7 +761,7 @@ void tcg_gen_callN(TCGContext *s, void *func, TCGArg ret,
     unsigned sizemask, flags;
     TCGHelperInfo *info;
 
-    info = g_hash_table_lookup(s->helpers, (gpointer)func);
+    info = g_hash_table_lookup(helper_table, (gpointer)func);
     flags = info->flags;
     sizemask = info->sizemask;
 
@@ -990,8 +990,8 @@ static char *tcg_get_arg_str_idx(TCGContext *s, char *buf,
 static inline const char *tcg_find_helper(TCGContext *s, uintptr_t val)
 {
     const char *ret = NULL;
-    if (s->helpers) {
-        TCGHelperInfo *info = g_hash_table_lookup(s->helpers, (gpointer)val);
+    if (helper_table) {
+        TCGHelperInfo *info = g_hash_table_lookup(helper_table, (gpointer)val);
         if (info) {
             ret = info->name;
         }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [Qemu-devel] [PATCH 15/22] gen-icount: fold exitreq_label into TCGContext
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (13 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 14/22] tcg: take .helpers " Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-09 20:36   ` Richard Henderson
  2017-07-12 15:29   ` Alex Bennée
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 16/22] tcg: keep a list of TCGContext's Emilio G. Cota
                   ` (8 subsequent siblings)
  23 siblings, 2 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Before we make TCGContext thread-local.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/gen-icount.h | 7 +++----
 tcg/tcg.h                 | 2 ++
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/exec/gen-icount.h b/include/exec/gen-icount.h
index 9b3cb14..489aff7 100644
--- a/include/exec/gen-icount.h
+++ b/include/exec/gen-icount.h
@@ -6,13 +6,12 @@
 /* Helpers for instruction counting code generation.  */
 
 static int icount_start_insn_idx;
-static TCGLabel *exitreq_label;
 
 static inline void gen_tb_start(TranslationBlock *tb)
 {
     TCGv_i32 count, imm;
 
-    exitreq_label = gen_new_label();
+    tcg_ctx.exitreq_label = gen_new_label();
     if (tb->cflags & CF_USE_ICOUNT) {
         count = tcg_temp_local_new_i32();
     } else {
@@ -34,7 +33,7 @@ static inline void gen_tb_start(TranslationBlock *tb)
         tcg_temp_free_i32(imm);
     }
 
-    tcg_gen_brcondi_i32(TCG_COND_LT, count, 0, exitreq_label);
+    tcg_gen_brcondi_i32(TCG_COND_LT, count, 0, tcg_ctx.exitreq_label);
 
     if (tb->cflags & CF_USE_ICOUNT) {
         tcg_gen_st16_i32(count, tcg_ctx.tcg_env,
@@ -52,7 +51,7 @@ static inline void gen_tb_end(TranslationBlock *tb, int num_insns)
         tcg_set_insn_param(icount_start_insn_idx, 1, num_insns);
     }
 
-    gen_set_label(exitreq_label);
+    gen_set_label(tcg_ctx.exitreq_label);
     tcg_gen_exit_tb((uintptr_t)tb + TB_EXIT_REQUESTED);
 
     /* Terminate the linked list.  */
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 4f57878..534ead5 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -711,6 +711,8 @@ struct TCGContext {
     /* The TCGBackendData structure is private to tcg-target.inc.c.  */
     struct TCGBackendData *be;
 
+    TCGLabel *exitreq_label;
+
     TCGTempSet free_temps[TCG_TYPE_COUNT * 2];
     TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [Qemu-devel] [PATCH 16/22] tcg: keep a list of TCGContext's
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (14 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 15/22] gen-icount: fold exitreq_label into TCGContext Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-09 20:43   ` Richard Henderson
  2017-07-12 15:32   ` Alex Bennée
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 17/22] tcg: distribute profiling counters across TCGContext's Emilio G. Cota
                   ` (7 subsequent siblings)
  23 siblings, 2 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Before we make TCGContext thread-local. Once that is done, iterating
over all TCG contexts will be quite useful; for instance we
will need it to gather profiling info from each TCGContext.

A possible alternative would be to keep an array of TCGContext pointers.
However, this option is not trivial, because vCPUs are spawned in
parallel. So let's just keep it simple and use a list protected by a lock.

Note that this lock will soon be used for other purposes, hence the
generic "tcg_lock" name.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.h |  3 +++
 tcg/tcg.c | 23 +++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 534ead5..8e1cd45 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -725,6 +725,8 @@ struct TCGContext {
 
     uint16_t gen_insn_end_off[TCG_MAX_INSNS];
     target_ulong gen_insn_data[TCG_MAX_INSNS][TARGET_INSN_START_WORDS];
+
+    QSIMPLEQ_ENTRY(TCGContext) entry;
 };
 
 extern TCGContext tcg_ctx;
@@ -773,6 +775,7 @@ static inline void *tcg_malloc(int size)
 
 void tcg_context_init(TCGContext *s);
 void tcg_prologue_init(TCGContext *s);
+void tcg_register_thread(void);
 void tcg_func_start(TCGContext *s);
 
 int tcg_gen_code(TCGContext *s, TranslationBlock *tb);
diff --git a/tcg/tcg.c b/tcg/tcg.c
index d9b083a..0da7c61 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -115,7 +115,16 @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
 static void tcg_out_tb_init(TCGContext *s);
 static bool tcg_out_tb_finalize(TCGContext *s);
 
+static QemuMutex tcg_lock;
 
+/*
+ * List of TCGContext's in the system. Protected by tcg_lock.
+ * Once vcpu threads have been initialized, there will be no further modifications
+ * to the list (vcpu threads never return) so we can safely traverse the list
+ * without synchronization.
+ */
+static QSIMPLEQ_HEAD(, TCGContext) ctx_list =
+    QSIMPLEQ_HEAD_INITIALIZER(ctx_list);
 
 static TCGRegSet tcg_target_available_regs[2];
 static TCGRegSet tcg_target_call_clobber_regs;
@@ -324,6 +333,17 @@ static GHashTable *helper_table;
 static int indirect_reg_alloc_order[ARRAY_SIZE(tcg_target_reg_alloc_order)];
 static void process_op_defs(TCGContext *s);
 
+/*
+ * Child TCG threads, i.e. the ones that do not call tcg_context_init, must call
+ * this function before initiating translation.
+ */
+void tcg_register_thread(void)
+{
+    qemu_mutex_lock(&tcg_lock);
+    QSIMPLEQ_INSERT_TAIL(&ctx_list, &tcg_ctx, entry);
+    qemu_mutex_unlock(&tcg_lock);
+}
+
 void tcg_context_init(TCGContext *s)
 {
     int op, total_args, n, i;
@@ -381,6 +401,9 @@ void tcg_context_init(TCGContext *s)
     for (; i < ARRAY_SIZE(tcg_target_reg_alloc_order); ++i) {
         indirect_reg_alloc_order[i] = tcg_target_reg_alloc_order[i];
     }
+
+    qemu_mutex_init(&tcg_lock);
+    tcg_register_thread();
 }
 
 /*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [Qemu-devel] [PATCH 17/22] tcg: distribute profiling counters across TCGContext's
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (15 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 16/22] tcg: keep a list of TCGContext's Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-09 20:45   ` Richard Henderson
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 18/22] tcg: define TCG_HIGHWATER Emilio G. Cota
                   ` (6 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

TCGContext is about to be made thread-local. To avoid scalability issues
when profiling info is enabled, this patch distributes the profiling
counters via the following changes:

1) Consolidate profile info into its own struct, TCGProfile, which
   TCGContext also includes. Note that tcg_table_op_count is brought
   into TCGProfile after dropping the tcg_ prefix.
2) Iterate over the TCG contexts in the system to obtain the total counts.

Note that this change also requires updating the accessors to TCGProfile
fields to use atomic_read/set whenever there may be concurrent accesses
to them.
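
For reference, a standalone sketch (C11 atomics instead of QEMU's
atomic_read/set) of the single-writer counter pattern this relies on:

    #include <stdatomic.h>
    #include <stdint.h>

    #define N_THREADS 8
    static _Atomic int64_t tb_count[N_THREADS];

    /* each thread increments only its own slot, so a plain load plus an
     * atomic store avoids torn values without needing an atomic RMW */
    static void count_tb(int me)
    {
        int64_t v = atomic_load_explicit(&tb_count[me], memory_order_relaxed);
        atomic_store_explicit(&tb_count[me], v + 1, memory_order_relaxed);
    }

    /* a reader (e.g. "info jit") sums a snapshot across all threads */
    static int64_t total_tb_count(void)
    {
        int64_t total = 0;

        for (int i = 0; i < N_THREADS; i++) {
            total += atomic_load_explicit(&tb_count[i], memory_order_relaxed);
        }
        return total;
    }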

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.h                 |  38 ++++++++--------
 accel/tcg/translate-all.c |  23 +++++-----
 tcg/tcg.c                 | 108 ++++++++++++++++++++++++++++++++++++++--------
 3 files changed, 124 insertions(+), 45 deletions(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 8e1cd45..2a64ee2 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -641,6 +641,26 @@ QEMU_BUILD_BUG_ON(OPPARAM_BUF_SIZE > (1 << 14));
 /* Make sure that we don't overflow 64 bits without noticing.  */
 QEMU_BUILD_BUG_ON(sizeof(TCGOp) > 8);
 
+typedef struct TCGProfile {
+    int64_t tb_count1;
+    int64_t tb_count;
+    int64_t op_count; /* total insn count */
+    int op_count_max; /* max insn per TB */
+    int64_t temp_count;
+    int temp_count_max;
+    int64_t del_op_count;
+    int64_t code_in_len;
+    int64_t code_out_len;
+    int64_t search_out_len;
+    int64_t interm_time;
+    int64_t code_time;
+    int64_t la_time;
+    int64_t opt_time;
+    int64_t restore_count;
+    int64_t restore_time;
+    int64_t table_op_count[NB_OPS];
+} TCGProfile;
+
 struct TCGContext {
     uint8_t *pool_cur, *pool_end;
     TCGPool *pool_first, *pool_current, *pool_first_large;
@@ -664,23 +684,7 @@ struct TCGContext {
     tcg_insn_unit *code_ptr;
 
 #ifdef CONFIG_PROFILER
-    /* profiling info */
-    int64_t tb_count1;
-    int64_t tb_count;
-    int64_t op_count; /* total insn count */
-    int op_count_max; /* max insn per TB */
-    int64_t temp_count;
-    int temp_count_max;
-    int64_t del_op_count;
-    int64_t code_in_len;
-    int64_t code_out_len;
-    int64_t search_out_len;
-    int64_t interm_time;
-    int64_t code_time;
-    int64_t la_time;
-    int64_t opt_time;
-    int64_t restore_count;
-    int64_t restore_time;
+    TCGProfile prof;
 #endif
 
 #ifdef CONFIG_DEBUG_TCG
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 84e19d9..31a9d42 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -287,6 +287,7 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
     uint8_t *p = tb->tc_search;
     int i, j, num_insns = tb->icount;
 #ifdef CONFIG_PROFILER
+    TCGProfile *prof = &tcg_ctx.prof;
     int64_t ti = profile_getclock();
 #endif
 
@@ -321,8 +322,9 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
     restore_state_to_opc(env, tb, data);
 
 #ifdef CONFIG_PROFILER
-    tcg_ctx.restore_time += profile_getclock() - ti;
-    tcg_ctx.restore_count++;
+    atomic_set(&prof->restore_time,
+                prof->restore_time + profile_getclock() - ti);
+    atomic_set(&prof->restore_count, prof->restore_count + 1);
 #endif
     return 0;
 }
@@ -1269,6 +1271,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tcg_insn_unit *gen_code_buf;
     int gen_code_size, search_size;
 #ifdef CONFIG_PROFILER
+    TCGProfile *prof = &tcg_ctx.prof;
     int64_t ti;
 #endif
     assert_memory_lock();
@@ -1298,8 +1301,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tb->invalid = false;
 
 #ifdef CONFIG_PROFILER
-    tcg_ctx.tb_count1++; /* includes aborted translations because of
-                       exceptions */
+    /* includes aborted translations because of exceptions */
+    atomic_set(&prof->tb_count1, prof->tb_count1 + 1);
     ti = profile_getclock();
 #endif
 
@@ -1324,8 +1327,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
 #endif
 
 #ifdef CONFIG_PROFILER
-    tcg_ctx.tb_count++;
-    tcg_ctx.interm_time += profile_getclock() - ti;
+    atomic_set(&prof->tb_count, prof->tb_count + 1);
+    atomic_set(&prof->interm_time, prof->interm_time + profile_getclock() - ti);
     ti = profile_getclock();
 #endif
 
@@ -1345,10 +1348,10 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tb->tc_size = gen_code_size;
 
 #ifdef CONFIG_PROFILER
-    tcg_ctx.code_time += profile_getclock() - ti;
-    tcg_ctx.code_in_len += tb->size;
-    tcg_ctx.code_out_len += gen_code_size;
-    tcg_ctx.search_out_len += search_size;
+    atomic_set(&prof->code_time, prof->code_time + profile_getclock() - ti);
+    atomic_set(&prof->code_in_len, prof->code_in_len + tb->size);
+    atomic_set(&prof->code_out_len, prof->code_out_len + gen_code_size);
+    atomic_set(&prof->search_out_len, prof->search_out_len + search_size);
 #endif
 
 #ifdef DEBUG_DISAS
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 0da7c61..c19c473 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -1362,7 +1362,7 @@ void tcg_op_remove(TCGContext *s, TCGOp *op)
     memset(op, 0, sizeof(*op));
 
 #ifdef CONFIG_PROFILER
-    s->del_op_count++;
+    atomic_set(&s->prof.del_op_count, s->prof.del_op_count + 1);
 #endif
 }
 
@@ -2533,15 +2533,77 @@ static void tcg_reg_alloc_call(TCGContext *s, int nb_oargs, int nb_iargs,
 
 #ifdef CONFIG_PROFILER
 
-static int64_t tcg_table_op_count[NB_OPS];
+/* avoid copy/paste errors */
+#define PROF_ADD(to, from, field)                       \
+    (to)->field += atomic_read(&((from)->field))
+
+#define PROF_ADD_MAX(to, from, field)                                   \
+    do {                                                                \
+        typeof((from)->field) val__ = atomic_read(&((from)->field));    \
+        if (val__ > (to)->field) {                                      \
+            (to)->field = val__;                                        \
+        }                                                               \
+    } while (0)
+
+/* Pass in a zeroed @prof */
+static inline
+void tcg_profile_snapshot(TCGProfile *prof, bool counters, bool table)
+{
+    const TCGContext *s;
+
+    QSIMPLEQ_FOREACH(s, &ctx_list, entry) {
+        const TCGProfile *orig = &s->prof;
+
+        if (counters) {
+            PROF_ADD(prof, orig, tb_count1);
+            PROF_ADD(prof, orig, tb_count);
+            PROF_ADD(prof, orig, op_count);
+            PROF_ADD_MAX(prof, orig, op_count_max);
+            PROF_ADD(prof, orig, temp_count);
+            PROF_ADD_MAX(prof, orig, temp_count_max);
+            PROF_ADD(prof, orig, del_op_count);
+            PROF_ADD(prof, orig, code_in_len);
+            PROF_ADD(prof, orig, code_out_len);
+            PROF_ADD(prof, orig, search_out_len);
+            PROF_ADD(prof, orig, interm_time);
+            PROF_ADD(prof, orig, code_time);
+            PROF_ADD(prof, orig, la_time);
+            PROF_ADD(prof, orig, opt_time);
+            PROF_ADD(prof, orig, restore_count);
+            PROF_ADD(prof, orig, restore_time);
+        }
+        if (table) {
+            int i;
+
+            for (i = 0; i < NB_OPS; i++) {
+                PROF_ADD(prof, orig, table_op_count[i]);
+            }
+        }
+    }
+}
+
+#undef PROF_ADD
+#undef PROF_ADD_MAX
+
+static void tcg_profile_snapshot_counters(TCGProfile *prof)
+{
+    tcg_profile_snapshot(prof, true, false);
+}
+
+static void tcg_profile_snapshot_table(TCGProfile *prof)
+{
+    tcg_profile_snapshot(prof, false, true);
+}
 
 void tcg_dump_op_count(FILE *f, fprintf_function cpu_fprintf)
 {
+    TCGProfile prof = {};
     int i;
 
+    tcg_profile_snapshot_table(&prof);
     for (i = 0; i < NB_OPS; i++) {
         cpu_fprintf(f, "%s %" PRId64 "\n", tcg_op_defs[i].name,
-                    tcg_table_op_count[i]);
+                    prof.table_op_count[i]);
     }
 }
 #else
@@ -2554,6 +2616,9 @@ void tcg_dump_op_count(FILE *f, fprintf_function cpu_fprintf)
 
 int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
 {
+#ifdef CONFIG_PROFILER
+    TCGProfile *prof = &s->prof;
+#endif
     int i, oi, oi_next, num_insns;
 
 #ifdef CONFIG_PROFILER
@@ -2561,15 +2626,15 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
         int n;
 
         n = s->gen_op_buf[0].prev + 1;
-        s->op_count += n;
-        if (n > s->op_count_max) {
-            s->op_count_max = n;
+        atomic_set(&prof->op_count, prof->op_count + n);
+        if (n > prof->op_count_max) {
+            atomic_set(&prof->op_count_max, n);
         }
 
         n = s->nb_temps;
-        s->temp_count += n;
-        if (n > s->temp_count_max) {
-            s->temp_count_max = n;
+        atomic_set(&prof->temp_count, prof->temp_count + n);
+        if (n > prof->temp_count_max) {
+            atomic_set(&prof->temp_count_max, n);
         }
     }
 #endif
@@ -2586,7 +2651,7 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
 #endif
 
 #ifdef CONFIG_PROFILER
-    s->opt_time -= profile_getclock();
+    atomic_set(&prof->opt_time, prof->opt_time - profile_getclock());
 #endif
 
 #ifdef USE_TCG_OPTIMIZATIONS
@@ -2594,8 +2659,8 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
 #endif
 
 #ifdef CONFIG_PROFILER
-    s->opt_time += profile_getclock();
-    s->la_time -= profile_getclock();
+    atomic_set(&prof->opt_time, prof->opt_time + profile_getclock());
+    atomic_set(&prof->la_time, prof->la_time - profile_getclock());
 #endif
 
     {
@@ -2623,7 +2688,7 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
     }
 
 #ifdef CONFIG_PROFILER
-    s->la_time += profile_getclock();
+    atomic_set(&prof->la_time, prof->la_time + profile_getclock());
 #endif
 
 #ifdef DEBUG_DISAS
@@ -2654,7 +2719,7 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
 
         oi_next = op->next;
 #ifdef CONFIG_PROFILER
-        tcg_table_op_count[opc]++;
+        atomic_set(&prof->table_op_count[opc], prof->table_op_count[opc] + 1);
 #endif
 
         switch (opc) {
@@ -2730,10 +2795,17 @@ int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
 #ifdef CONFIG_PROFILER
 void tcg_dump_info(FILE *f, fprintf_function cpu_fprintf)
 {
-    TCGContext *s = &tcg_ctx;
-    int64_t tb_count = s->tb_count;
-    int64_t tb_div_count = tb_count ? tb_count : 1;
-    int64_t tot = s->interm_time + s->code_time;
+    TCGProfile prof = {};
+    const TCGProfile *s;
+    int64_t tb_count;
+    int64_t tb_div_count;
+    int64_t tot;
+
+    tcg_profile_snapshot_counters(&prof);
+    s = &prof;
+    tb_count = s->tb_count;
+    tb_div_count = tb_count ? tb_count : 1;
+    tot = s->interm_time + s->code_time;
 
     cpu_fprintf(f, "JIT cycles          %" PRId64 " (%0.3f s at 2.4 GHz)\n",
                 tot, tot / 2.4e9);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [Qemu-devel] [PATCH 18/22] tcg: define TCG_HIGHWATER
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (16 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 17/22] tcg: distribute profiling counters across TCGContext's Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-09 20:46   ` Richard Henderson
  2017-07-12 15:33   ` Alex Bennée
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 19/22] tcg: introduce tcg_context_clone Emilio G. Cota
                   ` (5 subsequent siblings)
  23 siblings, 2 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Will come in handy very soon.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index c19c473..2f003a0 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -115,6 +115,8 @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
 static void tcg_out_tb_init(TCGContext *s);
 static bool tcg_out_tb_finalize(TCGContext *s);
 
+#define TCG_HIGHWATER 1024
+
 static QemuMutex tcg_lock;
 
 /*
@@ -453,7 +455,7 @@ void tcg_prologue_init(TCGContext *s)
     /* Compute a high-water mark, at which we voluntarily flush the buffer
        and start over.  The size here is arbitrary, significantly larger
        than we expect the code generation for any one opcode to require.  */
-    s->code_gen_highwater = s->code_gen_buffer + (total_size - 1024);
+    s->code_gen_highwater = s->code_gen_buffer + (total_size - TCG_HIGHWATER);
 
     tcg_register_jit(s->code_gen_buffer, total_size);
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [Qemu-devel] [PATCH 19/22] tcg: introduce tcg_context_clone
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (17 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 18/22] tcg: define TCG_HIGHWATER Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-09 20:48   ` Richard Henderson
  2017-07-12 16:02   ` Alex Bennée
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 20/22] tcg: dynamically allocate from code_gen_buffer using equally-sized regions Emilio G. Cota
                   ` (4 subsequent siblings)
  23 siblings, 2 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Before we make TCGContext thread-local.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.h |  1 +
 tcg/tcg.c | 14 ++++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 2a64ee2..be5f3fd 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -778,6 +778,7 @@ static inline void *tcg_malloc(int size)
 }
 
 void tcg_context_init(TCGContext *s);
+void tcg_context_clone(TCGContext *s);
 void tcg_prologue_init(TCGContext *s);
 void tcg_register_thread(void);
 void tcg_func_start(TCGContext *s);
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 2f003a0..8febf53 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -117,6 +117,7 @@ static bool tcg_out_tb_finalize(TCGContext *s);
 
 #define TCG_HIGHWATER 1024
 
+static const TCGContext *tcg_init_ctx;
 static QemuMutex tcg_lock;
 
 /*
@@ -353,6 +354,7 @@ void tcg_context_init(TCGContext *s)
     TCGArgConstraint *args_ct;
     int *sorted_args;
 
+    tcg_init_ctx = s;
     memset(s, 0, sizeof(*s));
     s->nb_globals = 0;
 
@@ -409,6 +411,18 @@ void tcg_context_init(TCGContext *s)
 }
 
 /*
+ * Clone the initial TCGContext. Used by TCG threads to copy the TCGContext
+ * set up by their parent thread via tcg_context_init().
+ */
+void tcg_context_clone(TCGContext *s)
+{
+    if (unlikely(tcg_init_ctx == NULL || tcg_init_ctx == s)) {
+        tcg_abort();
+    }
+    memcpy(s, tcg_init_ctx, sizeof(*s));
+}
+
+/*
  * Allocate TBs right before their corresponding translated code, making
  * sure that TBs and code are on different cache lines.
  */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [Qemu-devel] [PATCH 20/22] tcg: dynamically allocate from code_gen_buffer using equally-sized regions
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (18 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 19/22] tcg: introduce tcg_context_clone Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-09 21:03   ` Richard Henderson
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu Emilio G. Cota
                   ` (3 subsequent siblings)
  23 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

In preparation for having multiple TCG threads.

The naive solution here is to split code_gen_buffer statically
among the TCG threads; this however results in poor utilization
if translation needs are different across TCG threads.

What we do here is to add an extra layer of indirection, assigning
regions that act just like pages do in virtual memory allocation.
(BTW if you are wondering about the chosen naming, I did not want
to use blocks or pages because those are already heavily used in QEMU).
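
Before the numbers, a standalone sketch (hypothetical names, not the
patch itself) of the hand-out scheme: threads grab the next free slice
of the shared buffer under a lock, and a flush just resets the counter:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct regions {
        pthread_mutex_t lock;
        char *buf;       /* the shared code_gen_buffer */
        size_t size;     /* size of one region */
        size_t n;        /* number of regions */
        size_t current;  /* next region to hand out */
    };

    /* returns false when the buffer is exhausted: time for a full tb_flush */
    static bool region_alloc(struct regions *r, char **out)
    {
        bool ok = false;

        pthread_mutex_lock(&r->lock);
        if (r->current < r->n) {
            *out = r->buf + r->size * r->current++;
            ok = true;
        }
        pthread_mutex_unlock(&r->lock);
        return ok;
    }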

The effectiveness of this approach is clear after seeing some numbers.
I used the bootup+shutdown of debian-arm with '-tb-size 80' as a benchmark.
Note that I'm evaluating this after enabling per-thread TCG (which
is done by a subsequent commit).

* -smp 1, 1 region (entire buffer):
    qemu: flush code_size=83885014 nb_tbs=154739 avg_tb_size=357
    qemu: flush code_size=83884902 nb_tbs=153136 avg_tb_size=363
    qemu: flush code_size=83885014 nb_tbs=152777 avg_tb_size=364
    qemu: flush code_size=83884950 nb_tbs=150057 avg_tb_size=373
    qemu: flush code_size=83884998 nb_tbs=150234 avg_tb_size=373
    qemu: flush code_size=83885014 nb_tbs=154009 avg_tb_size=360
    qemu: flush code_size=83885014 nb_tbs=151007 avg_tb_size=370
    qemu: flush code_size=83885014 nb_tbs=151816 avg_tb_size=367

That is, 8 flushes.

* -smp 8, 32 regions (80/32 MB per region) [i.e. this patch]:

    qemu: flush code_size=76328008 nb_tbs=141040 avg_tb_size=356
    qemu: flush code_size=75366534 nb_tbs=138000 avg_tb_size=361
    qemu: flush code_size=76864546 nb_tbs=140653 avg_tb_size=361
    qemu: flush code_size=76309084 nb_tbs=135945 avg_tb_size=375
    qemu: flush code_size=74581856 nb_tbs=132909 avg_tb_size=375
    qemu: flush code_size=73927256 nb_tbs=135616 avg_tb_size=360
    qemu: flush code_size=78629426 nb_tbs=142896 avg_tb_size=365
    qemu: flush code_size=76667052 nb_tbs=138508 avg_tb_size=368

Again, 8 flushes. Note how buffer utilization is not 100%, but it
is close (e.g. ~76.3 MB out of 83.9 MB, i.e. ~91%, in the first flush
above). Smaller region sizes would yield higher utilization, but we
want region allocation to be rare (it acquires a lock), so we do not
want to go too small.

* -smp 8, static partitioning of 8 regions (10 MB per region):
    qemu: flush code_size=21936504 nb_tbs=40570 avg_tb_size=354
    qemu: flush code_size=11472174 nb_tbs=20633 avg_tb_size=370
    qemu: flush code_size=11603976 nb_tbs=21059 avg_tb_size=365
    qemu: flush code_size=23254872 nb_tbs=41243 avg_tb_size=377
    qemu: flush code_size=28289496 nb_tbs=52057 avg_tb_size=358
    qemu: flush code_size=43605160 nb_tbs=78896 avg_tb_size=367
    qemu: flush code_size=45166552 nb_tbs=82158 avg_tb_size=364
    qemu: flush code_size=63289640 nb_tbs=116494 avg_tb_size=358
    qemu: flush code_size=51389960 nb_tbs=93937 avg_tb_size=362
    qemu: flush code_size=59665928 nb_tbs=107063 avg_tb_size=372
    qemu: flush code_size=38380824 nb_tbs=68597 avg_tb_size=374
    qemu: flush code_size=44884568 nb_tbs=79901 avg_tb_size=376
    qemu: flush code_size=50782632 nb_tbs=90681 avg_tb_size=374
    qemu: flush code_size=39848888 nb_tbs=71433 avg_tb_size=372
    qemu: flush code_size=64708840 nb_tbs=119052 avg_tb_size=359
    qemu: flush code_size=49830008 nb_tbs=90992 avg_tb_size=362
    qemu: flush code_size=68372408 nb_tbs=123442 avg_tb_size=368
    qemu: flush code_size=33555560 nb_tbs=59514 avg_tb_size=378
    qemu: flush code_size=44748344 nb_tbs=80974 avg_tb_size=367
    qemu: flush code_size=37104248 nb_tbs=67609 avg_tb_size=364

That is, 20 flushes. Note how a static partitioning approach uses
the code buffer poorly, leading to many unnecessary flushes.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.h                 |   8 +++
 accel/tcg/translate-all.c |  61 ++++++++++++----
 bsd-user/main.c           |   1 +
 linux-user/main.c         |   1 +
 tcg/tcg.c                 | 175 +++++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 230 insertions(+), 16 deletions(-)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index be5f3fd..a767a33 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -761,6 +761,14 @@ void *tcg_malloc_internal(TCGContext *s, int size);
 void tcg_pool_reset(TCGContext *s);
 TranslationBlock *tcg_tb_alloc(TCGContext *s);
 
+void tcg_region_init(TCGContext *s);
+bool tcg_region_alloc(TCGContext *s);
+void tcg_region_set_size(size_t size);
+void tcg_region_reset_all(void);
+
+size_t tcg_code_size(void);
+size_t tcg_code_capacity(void);
+
 /* Called with tb_lock held.  */
 static inline void *tcg_malloc(int size)
 {
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 31a9d42..ce9d746 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -53,11 +53,13 @@
 #include "exec/cputlb.h"
 #include "exec/tb-hash.h"
 #include "translate-all.h"
+#include "qemu/error-report.h"
 #include "qemu/bitmap.h"
 #include "qemu/timer.h"
 #include "qemu/main-loop.h"
 #include "exec/log.h"
 #include "sysemu/cpus.h"
+#include "sysemu/sysemu.h"
 
 /* #define DEBUG_TB_INVALIDATE */
 /* #define DEBUG_TB_FLUSH */
@@ -808,6 +810,41 @@ static inline void code_gen_alloc(size_t tb_size)
     qemu_mutex_init(&tb_ctx.tb_lock);
 }
 
+#ifdef CONFIG_SOFTMMU
+/*
+ * It is likely that some vCPUs will translate more code than others, so we
+ * first try to set more regions than smp_cpus, with those regions being
+ * larger than the minimum code_gen_buffer size. If that's not possible we
+ * make do by evenly dividing the code_gen_buffer among the vCPUs.
+ */
+static void code_gen_set_region_size(TCGContext *s)
+{
+    size_t per_cpu = s->code_gen_buffer_size / smp_cpus;
+    size_t div;
+
+    assert(per_cpu);
+    /*
+     * Use a single region if all we have is one vCPU.
+     * We could also use a single region with !mttcg, but at this time we have
+     * not yet processed the thread=single|multi flag.
+     */
+    if (smp_cpus == 1) {
+        tcg_region_set_size(0);
+        return;
+    }
+
+    for (div = 8; div > 0; div--) {
+        size_t region_size = per_cpu / div;
+
+        if (region_size >= 2 * MIN_CODE_GEN_BUFFER_SIZE) {
+            tcg_region_set_size(region_size);
+            return;
+        }
+    }
+    tcg_region_set_size(per_cpu);
+}
+#endif
+
 static void tb_htable_init(void)
 {
     unsigned int mode = QHT_MODE_AUTO_RESIZE;
@@ -829,6 +866,8 @@ void tcg_exec_init(unsigned long tb_size)
     /* There's no guest base to take into account, so go ahead and
        initialize the prologue now.  */
     tcg_prologue_init(&tcg_ctx);
+    code_gen_set_region_size(&tcg_ctx);
+    tcg_region_init(&tcg_ctx);
 #endif
 }
 
@@ -929,14 +968,9 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
 #if defined(DEBUG_TB_FLUSH)
     g_tree_foreach(tb_ctx.tb_tree, tb_host_size_iter, &host_size);
     nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
-    printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%zu\n",
-           (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
-           nb_tbs, nb_tbs > 0 ? host_size / nb_tbs : 0);
+    fprintf(stderr, "qemu: flush code_size=%zu nb_tbs=%d avg_tb_size=%zu\n",
+           tcg_code_size(), nb_tbs, nb_tbs > 0 ? host_size / nb_tbs : 0);
 #endif
-    if ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)
-        > tcg_ctx.code_gen_buffer_size) {
-        cpu_abort(cpu, "Internal error: code buffer overflow\n");
-    }
 
     CPU_FOREACH(cpu) {
         cpu_tb_jmp_cache_clear(cpu);
@@ -949,7 +983,7 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     qht_reset_size(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
     page_flush_tb();
 
-    tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
+    tcg_region_reset_all();
     /* XXX: flush processor icache at this point if cache flush is
        expensive */
     atomic_mb_set(&tb_ctx.tb_flush_count, tb_ctx.tb_flush_count + 1);
@@ -1281,9 +1315,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
         cflags |= CF_USE_ICOUNT;
     }
 
+ buffer_overflow:
     tb = tb_alloc(pc);
     if (unlikely(!tb)) {
- buffer_overflow:
         /* flush must be done */
         tb_flush(cpu);
         mmap_unlock();
@@ -1366,9 +1400,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     }
 #endif
 
-    tcg_ctx.code_gen_ptr = (void *)
+    atomic_set(&tcg_ctx.code_gen_ptr, (void *)
         ROUND_UP((uintptr_t)gen_code_buf + gen_code_size + search_size,
-                 CODE_GEN_ALIGN);
+                 CODE_GEN_ALIGN));
 
     /* init jump list */
     assert(((uintptr_t)tb & 3) == 0);
@@ -1907,9 +1941,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
      * otherwise users might think "-tb-size" is not honoured.
      * For avg host size we use the precise numbers from tb_tree_stats though.
      */
-    cpu_fprintf(f, "gen code size       %td/%zd\n",
-                tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
-                tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer);
+    cpu_fprintf(f, "gen code size       %zu/%zd\n",
+                tcg_code_size(), tcg_code_capacity());
     cpu_fprintf(f, "TB count            %d\n", nb_tbs);
     cpu_fprintf(f, "TB avg target size  %zu max=%zu bytes\n",
                 nb_tbs ? tst.target_size / nb_tbs : 0,
diff --git a/bsd-user/main.c b/bsd-user/main.c
index fa9c012..1a16052 100644
--- a/bsd-user/main.c
+++ b/bsd-user/main.c
@@ -979,6 +979,7 @@ int main(int argc, char **argv)
        generating the prologue until now so that the prologue can take
        the real value of GUEST_BASE into account.  */
     tcg_prologue_init(&tcg_ctx);
+    tcg_region_init(&tcg_ctx);
 
     /* build Task State */
     memset(ts, 0, sizeof(TaskState));
diff --git a/linux-user/main.c b/linux-user/main.c
index 630c73d..b73759c 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -4457,6 +4457,7 @@ int main(int argc, char **argv, char **envp)
        generating the prologue until now so that the prologue can take
        the real value of GUEST_BASE into account.  */
     tcg_prologue_init(&tcg_ctx);
+    tcg_region_init(&tcg_ctx);
 
 #if defined(TARGET_I386)
     env->cr[0] = CR0_PG_MASK | CR0_WP_MASK | CR0_PE_MASK;
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 8febf53..03ebc8c 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -129,6 +129,23 @@ static QemuMutex tcg_lock;
 static QSIMPLEQ_HEAD(, TCGContext) ctx_list =
     QSIMPLEQ_HEAD_INITIALIZER(ctx_list);
 
+/*
+ * We divide code_gen_buffer into equally-sized "regions" that TCG threads
+ * dynamically allocate from as demand dictates. Given appropriate region
+ * sizing, this minimizes flushes even when some TCG threads generate a lot
+ * more code than others.
+ */
+struct tcg_region_state {
+    void *buf;
+    size_t n;
+    size_t current;
+    size_t n_full;
+    size_t size; /* size of one region */
+};
+
+/* protected by tcg_lock */
+static struct tcg_region_state region;
+
 static TCGRegSet tcg_target_available_regs[2];
 static TCGRegSet tcg_target_call_clobber_regs;
 
@@ -410,6 +427,156 @@ void tcg_context_init(TCGContext *s)
     tcg_register_thread();
 }
 
+static void tcg_region_set_size__locked(size_t size)
+{
+    if (!size) {
+        region.size = tcg_init_ctx->code_gen_buffer_size;
+        region.n = 1;
+    } else {
+        region.size = size;
+        region.n = tcg_init_ctx->code_gen_buffer_size / size;
+    }
+    if (unlikely(region.size < TCG_HIGHWATER)) {
+        tcg_abort();
+    }
+}
+
+/*
+ * Call this function at init time (i.e. only once). Calling this function is
+ * optional: if no region size is set, a single region will be used.
+ *
+ * Note: calling this function *after* calling tcg_region_init() is a bug.
+ */
+void tcg_region_set_size(size_t size)
+{
+    tcg_debug_assert(!region.size);
+
+    qemu_mutex_lock(&tcg_lock);
+    tcg_region_set_size__locked(size);
+    qemu_mutex_unlock(&tcg_lock);
+}
+
+static void tcg_region_assign__locked(TCGContext *s)
+{
+    void *buf = region.buf + region.size * region.current;
+
+    s->code_gen_buffer = buf;
+    s->code_gen_ptr = buf;
+    s->code_gen_buffer_size = region.size;
+    s->code_gen_highwater = buf + region.size - TCG_HIGHWATER;
+}
+
+static bool tcg_region_alloc__locked(TCGContext *s)
+{
+    if (region.current == region.n) {
+        return false;
+    }
+    tcg_region_assign__locked(s);
+    region.current++;
+    return true;
+}
+
+/*
+ * Request a new region once the one in use has filled up.
+ * Note: upon initializing a TCG thread, allocate a new region with
+ * tcg_region_init() instead.
+ * Returns true on success.
+ */
+bool tcg_region_alloc(TCGContext *s)
+{
+    bool success;
+
+    qemu_mutex_lock(&tcg_lock);
+    success = tcg_region_alloc__locked(s);
+    if (success) {
+        region.n_full++;
+    }
+    qemu_mutex_unlock(&tcg_lock);
+    return success;
+}
+
+/*
+ * Allocate an initial region.
+ * All TCG threads must have called this function before any of them initiates
+ * translation.
+ *
+ * The region size might have previously been set by tcg_region_set_size();
+ * otherwise a single region will be used on the entire code_gen_buffer.
+ *
+ * Note: allocate subsequent regions with tcg_region_alloc().
+ */
+void tcg_region_init(TCGContext *s)
+{
+    qemu_mutex_lock(&tcg_lock);
+    if (region.buf == NULL) {
+        region.buf = tcg_init_ctx->code_gen_buffer;
+    }
+    if (!region.size) {
+        tcg_region_set_size__locked(0);
+    }
+    /* if we cannot allocate on init, then we did something wrong */
+    if (!tcg_region_alloc__locked(s)) {
+        tcg_abort();
+    }
+    qemu_mutex_unlock(&tcg_lock);
+}
+
+/* Call from a safe-work context */
+void tcg_region_reset_all(void)
+{
+    TCGContext *s;
+
+    qemu_mutex_lock(&tcg_lock);
+    region.current = 0;
+    region.n_full = 0;
+
+    QSIMPLEQ_FOREACH(s, &ctx_list, entry) {
+        if (unlikely(!tcg_region_alloc__locked(s))) {
+            tcg_abort();
+        }
+    }
+    qemu_mutex_unlock(&tcg_lock);
+}
+
+/*
+ * Returns the size (in bytes) of all translated code (i.e. from all regions)
+ * currently in the cache.
+ * See also: tcg_code_capacity()
+ * Do not confuse with tcg_current_code_size(); that one applies to a single
+ * TCG context.
+ */
+size_t tcg_code_size(void)
+{
+    const TCGContext *s;
+    size_t total;
+
+    qemu_mutex_lock(&tcg_lock);
+    total = region.n_full * (region.size - TCG_HIGHWATER);
+    QSIMPLEQ_FOREACH(s, &ctx_list, entry) {
+        size_t size;
+
+        size = atomic_read(&s->code_gen_ptr) - s->code_gen_buffer;
+        if (unlikely(size > s->code_gen_buffer_size)) {
+            tcg_abort();
+        }
+        total += size;
+    }
+    qemu_mutex_unlock(&tcg_lock);
+    return total;
+}
+
+/*
+ * Returns the code capacity (in bytes) of the entire cache, i.e. including all
+ * regions.
+ * See also: tcg_code_size()
+ */
+size_t tcg_code_capacity(void)
+{
+    /* no need for synchronization; these variables are set at init time */
+    return region.n * (region.size - TCG_HIGHWATER);
+}
+
 /*
  * Clone the initial TCGContext. Used by TCG threads to copy the TCGContext
  * set up by their parent thread via tcg_context_init().
@@ -432,13 +599,17 @@ TranslationBlock *tcg_tb_alloc(TCGContext *s)
     TranslationBlock *tb;
     void *next;
 
+ retry:
     tb = (void *)ROUND_UP((uintptr_t)s->code_gen_ptr, align);
     next = (void *)ROUND_UP((uintptr_t)(tb + 1), align);
 
     if (unlikely(next > s->code_gen_highwater)) {
-        return NULL;
+        if (!tcg_region_alloc(s)) {
+            return NULL;
+        }
+        goto retry;
     }
-    s->code_gen_ptr = next;
+    atomic_set(&s->code_gen_ptr, next);
     return tb;
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (19 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 20/22] tcg: dynamically allocate from code_gen_buffer using equally-sized regions Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-09 21:07   ` Richard Henderson
                     ` (2 more replies)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 22/22] translate-all: do not hold tb_lock during code generation in softmmu Emilio G. Cota
                   ` (2 subsequent siblings)
  23 siblings, 3 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

This allows us to generate TCG code in parallel. MTTCG already uses
it, although the next commit pushes down a lock to actually
perform parallel generation.

User-mode is kept out of this: contention due to concurrent translation
is more commonly found in full-system mode.

This patch is fairly small due to the preparation work done in previous
patches.

Note that targets do not need any conversion: the TCGContext set up
during initialization (i.e. where globals are set) is then cloned
by the vCPU threads, which also double as TCG threads.
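
A standalone sketch (hypothetical names, not the patch itself) of that
init-then-clone pattern:

    #include <string.h>

    typedef struct Ctx {
        int nb_globals;
        /* ... */
    } Ctx;

    static Ctx init_ctx;          /* set up once, with globals registered */
    static __thread Ctx my_ctx;   /* each vCPU thread's working copy */

    /* called by each vCPU thread before it translates anything */
    static void thread_ctx_init(void)
    {
        memcpy(&my_ctx, &init_ctx, sizeof(my_ctx));
    }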

I searched for globals under tcg/ that might have to be converted
to thread-local. I converted the ones I found, and wrote down the
remaining non-const globals that are only set at init time:

Only written by tcg_context_init:
- indirect_reg_alloc_order
- tcg_op_defs
Only written by tcg_target_init (called from tcg_context_init):
- tcg_target_available_regs
- tcg_target_call_clobber_regs
- arm: arm_arch, use_idiv_instructions
- i386: have_cmov, have_bmi1, have_bmi2, have_lzcnt,
        have_movbe, have_popcnt
- mips: use_movnz_instructions, use_mips32_instructions,
        use_mips32r2_instructions, got_sigill (tcg_target_detect_isa)
- ppc: have_isa_2_06, have_isa_3_00, tb_ret_addr
- s390: tb_ret_addr, s390_facilities
- sparc: qemu_ld_trampoline, qemu_st_trampoline (build_trampolines),
         use_vis3_instructions

Only written by tcg_prologue_init:
- 'struct jit_code_entry one_entry'
- aarch64: tb_ret_addr
- arm: tb_ret_addr
- i386: tb_ret_addr, guest_base_flags
- ia64: tb_ret_addr
- mips: tb_ret_addr, bswap32_addr, bswap32u_addr, bswap64_addr

I was not sure about tci_reg. From code inspection it seems that
it has to be per-thread, so I converted it, but I do not think
anyone has ever tried to get MTTCG working with TCI.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h   |  4 +++-
 tcg/tcg.h                 | 12 +++++++++---
 accel/tcg/translate-all.c | 20 +++++++++++++-------
 cpus.c                    |  3 +++
 tcg/optimize.c            |  4 ++--
 tcg/tcg.c                 | 10 ++++++++++
 tcg/tci.c                 |  2 +-
 7 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 673b26d..5334b7a 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -47,7 +47,9 @@ void gen_intermediate_code(CPUArchState *env, struct TranslationBlock *tb);
 void restore_state_to_opc(CPUArchState *env, struct TranslationBlock *tb,
                           target_ulong *data);
 
-void cpu_gen_init(void);
+#ifdef CONFIG_SOFTMMU
+void cpu_thread_tcg_init(void);
+#endif
 bool cpu_restore_state(CPUState *cpu, uintptr_t searched_pc);
 
 void QEMU_NORETURN cpu_loop_exit_noexc(CPUState *cpu);
diff --git a/tcg/tcg.h b/tcg/tcg.h
index a767a33..0cc2cab 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -733,7 +733,13 @@ struct TCGContext {
     QSIMPLEQ_ENTRY(TCGContext) entry;
 };
 
-extern TCGContext tcg_ctx;
+#ifdef CONFIG_SOFTMMU
+#define TCG_THREAD __thread
+#else
+#define TCG_THREAD
+#endif
+
+extern TCG_THREAD TCGContext tcg_ctx;
 extern bool parallel_cpus;
 
 static inline void tcg_set_insn_param(int op_idx, int arg, TCGArg v)
@@ -756,7 +762,7 @@ static inline bool tcg_op_buf_full(void)
 
 /* pool based memory allocation */
 
-/* tb_lock must be held for tcg_malloc_internal. */
+/* user-mode: tb_lock must be held for tcg_malloc_internal. */
 void *tcg_malloc_internal(TCGContext *s, int size);
 void tcg_pool_reset(TCGContext *s);
 TranslationBlock *tcg_tb_alloc(TCGContext *s);
@@ -769,7 +775,7 @@ void tcg_region_reset_all(void);
 size_t tcg_code_size(void);
 size_t tcg_code_capacity(void);
 
-/* Called with tb_lock held.  */
+/* user-mode: Called with tb_lock held.  */
 static inline void *tcg_malloc(int size)
 {
     TCGContext *s = &tcg_ctx;
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index ce9d746..17b18a9 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -131,7 +131,7 @@ static int v_l2_levels;
 static void *l1_map[V_L1_MAX_SIZE];
 
 /* code generation context */
-TCGContext tcg_ctx;
+TCG_THREAD TCGContext tcg_ctx;
 TBContext tb_ctx;
 bool parallel_cpus;
 
@@ -185,10 +185,6 @@ void tb_lock_reset(void)
 
 static TranslationBlock *tb_find_pc(uintptr_t tc_ptr);
 
-void cpu_gen_init(void)
-{
-    tcg_context_init(&tcg_ctx); 
-}
 
 /* Encode VAL as a signed leb128 sequence at P.
    Return P incremented past the encoded value.  */
@@ -812,6 +808,17 @@ static inline void code_gen_alloc(size_t tb_size)
 
 #ifdef CONFIG_SOFTMMU
 /*
+ * Threads calling this function must be the TCG threads, i.e. they
+ * have their own tcg_ctx.
+ */
+void cpu_thread_tcg_init(void)
+{
+    tcg_context_clone(&tcg_ctx);
+    tcg_register_thread();
+    tcg_region_init(&tcg_ctx);
+}
+
+/*
  * It is likely that some vCPUs will translate more code than others, so we
  * first try to set more regions than smp_cpus, with those regions being
  * larger than the minimum code_gen_buffer size. If that's not possible we
@@ -858,7 +865,7 @@ static void tb_htable_init(void)
 void tcg_exec_init(unsigned long tb_size)
 {
     tcg_allowed = true;
-    cpu_gen_init();
+    tcg_context_init(&tcg_ctx);
     page_init();
     tb_htable_init();
     code_gen_alloc(tb_size);
@@ -867,7 +874,6 @@ void tcg_exec_init(unsigned long tb_size)
        initialize the prologue now.  */
     tcg_prologue_init(&tcg_ctx);
     code_gen_set_region_size(&tcg_ctx);
-    tcg_region_init(&tcg_ctx);
 #endif
 }
 
diff --git a/cpus.c b/cpus.c
index 14bb8d5..58efc95 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1307,6 +1307,8 @@ static void *qemu_tcg_rr_cpu_thread_fn(void *arg)
     CPUState *cpu = arg;
 
     rcu_register_thread();
+    /* For single-threaded TCG we just need to initialize one tcg_ctx */
+    cpu_thread_tcg_init();
 
     qemu_mutex_lock_iothread();
     qemu_thread_get_self(cpu->thread);
@@ -1454,6 +1456,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
     g_assert(!use_icount);
 
     rcu_register_thread();
+    cpu_thread_tcg_init();
 
     qemu_mutex_lock_iothread();
     qemu_thread_get_self(cpu->thread);
diff --git a/tcg/optimize.c b/tcg/optimize.c
index adfc56c..71af19b 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -40,8 +40,8 @@ struct tcg_temp_info {
     tcg_target_ulong mask;
 };
 
-static struct tcg_temp_info temps[TCG_MAX_TEMPS];
-static TCGTempSet temps_used;
+static TCG_THREAD struct tcg_temp_info temps[TCG_MAX_TEMPS];
+static TCG_THREAD TCGTempSet temps_used;
 
 static inline bool temp_is_const(TCGArg arg)
 {
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 03ebc8c..0ba61ea 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -532,6 +532,11 @@ void tcg_region_reset_all(void)
     region.n_full = 0;
 
     QSIMPLEQ_FOREACH(s, &ctx_list, entry) {
+#ifdef CONFIG_SOFTMMU
+        if (s == tcg_init_ctx) {
+            continue;
+        }
+#endif
         if (unlikely(!tcg_region_alloc__locked(s))) {
             tcg_abort();
         }
@@ -556,6 +561,11 @@ size_t tcg_code_size(void)
     QSIMPLEQ_FOREACH(s, &ctx_list, entry) {
         size_t size;
 
+#ifdef CONFIG_SOFTMMU
+        if (s == tcg_init_ctx) {
+            continue;
+        }
+#endif
         size = atomic_read(&s->code_gen_ptr) - s->code_gen_buffer;
         if (unlikely(size > s->code_gen_buffer_size)) {
             tcg_abort();
diff --git a/tcg/tci.c b/tcg/tci.c
index 4bdc645..d374ddc 100644
--- a/tcg/tci.c
+++ b/tcg/tci.c
@@ -55,7 +55,7 @@ typedef uint64_t (*helper_function)(tcg_target_ulong, tcg_target_ulong,
                                     tcg_target_ulong);
 #endif
 
-static tcg_target_ulong tci_reg[TCG_TARGET_NB_REGS];
+static TCG_THREAD tcg_target_ulong tci_reg[TCG_TARGET_NB_REGS];
 
 static tcg_target_ulong tci_read_reg(TCGReg index)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [Qemu-devel] [PATCH 22/22] translate-all: do not hold tb_lock during code generation in softmmu
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (20 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu Emilio G. Cota
@ 2017-07-09  7:50 ` Emilio G. Cota
  2017-07-09 21:38   ` Richard Henderson
  2017-07-09 18:27 ` [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
  2017-07-10  9:50 ` Alex Bennée
  23 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09  7:50 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

Each vCPU can now generate code with TCG in parallel. Thus,
drop tb_lock around code generation in softmmu.

Note that we still have to take tb_lock after code translation,
since there is global state that we have to update.
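
In outline, the resulting scheme looks like this (standalone sketch with
hypothetical helper names, not the actual code):

    typedef struct TB TB;
    typedef struct CPU CPU;

    /* hypothetical stand-ins for the real qht/tb machinery */
    TB *translate(CPU *cpu, unsigned long pc);  /* into our own region */
    TB *htable_lookup(unsigned long pc);
    void htable_insert(unsigned long pc, TB *tb);
    void discard(TB *tb);                       /* roll back code_gen_ptr */
    void tb_lock(void);
    void tb_unlock(void);

    static TB *gen_code(CPU *cpu, unsigned long pc)
    {
        TB *tb = translate(cpu, pc);   /* no lock held while translating */
        TB *t;

        tb_lock();
        t = htable_lookup(pc);         /* did another vCPU beat us to it? */
        if (t != NULL) {
            discard(tb);               /* drop our copy, reuse theirs */
            tb = t;
        } else {
            htable_insert(pc, tb);
        }
        tb_unlock();
        return tb;
    }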

Nonetheless holding tb_lock for less time provides significant performance
improvements to workloads that are translation-heavy. A good example
of this is booting Linux; in my measurements, bootup+shutdown time of
debian-arm is reduced by 20% with this entire patchset applied, when
using -smp 8 and MTTCG on a machine with >= 8 real cores:

 Host: Intel(R) Xeon(R) CPU E5-2690 @ 2.90GHz
 Performance counter stats for 'qemu/build/arm-softmmu/qemu-system-arm \
	-machine type=virt -nographic -smp 1 -m 4096 \
	-netdev user,id=unet,hostfwd=tcp::2222-:22 \
	-device virtio-net-device,netdev=unet \
	-drive file=foobar.qcow2,id=myblock,index=0,if=none \
	-device virtio-blk-device,drive=myblock \
	-kernel /foobar.img -append console=ttyAMA0 root=/dev/vda1 \
	-name arm,debug-threads=on -smp 8' (3 runs):

Before:
      28764.018852 task-clock                #    1.663 CPUs utilized            ( +-  0.30% )
           727,490 context-switches          #    0.025 M/sec                    ( +-  0.68% )
             2,429 CPU-migrations            #    0.000 M/sec                    ( +- 11.36% )
            14,042 page-faults               #    0.000 M/sec                    ( +-  1.00% )
    70,644,349,920 cycles                    #    2.456 GHz                      ( +-  0.96% ) [83.42%]
    37,129,806,098 stalled-cycles-frontend   #   52.56% frontend cycles idle     ( +-  1.27% ) [83.20%]
    26,620,190,524 stalled-cycles-backend    #   37.68% backend  cycles idle     ( +-  1.29% ) [66.50%]
    85,528,287,892 instructions              #    1.21  insns per cycle
                                             #    0.43  stalled cycles per insn  ( +-  0.62% ) [83.40%]
    14,417,482,689 branches                  #  501.233 M/sec                    ( +-  0.49% ) [83.36%]
       321,182,192 branch-misses             #    2.23% of all branches          ( +-  1.17% ) [83.53%]

      17.297750583 seconds time elapsed                                          ( +-  1.08% )

After:
      28690.888633 task-clock                #    2.069 CPUs utilized            ( +-  1.54% )
           473,947 context-switches          #    0.017 M/sec                    ( +-  1.32% )
             2,793 CPU-migrations            #    0.000 M/sec                    ( +- 18.74% )
            22,634 page-faults               #    0.001 M/sec                    ( +-  1.20% )
    69,314,663,510 cycles                    #    2.416 GHz                      ( +-  1.08% ) [83.50%]
    36,114,710,208 stalled-cycles-frontend   #   52.10% frontend cycles idle     ( +-  1.64% ) [83.26%]
    25,519,842,658 stalled-cycles-backend    #   36.82% backend  cycles idle     ( +-  1.70% ) [66.77%]
    84,588,443,638 instructions              #    1.22  insns per cycle
                                             #    0.43  stalled cycles per insn  ( +-  0.78% ) [83.44%]
    14,258,100,183 branches                  #  496.956 M/sec                    ( +-  0.87% ) [83.32%]
       324,984,804 branch-misses             #    2.28% of all branches          ( +-  0.51% ) [83.17%]

      13.870347754 seconds time elapsed                                          ( +-  1.65% )

That is, a speedup of 17.29/13.87=1.24X.

Similar numbers on a slower machine:

Host: AMD Opteron(tm) Processor 6376:

Before:
      74765.850569      task-clock (msec)         #    1.956 CPUs utilized            ( +-  1.42% )
           841,430      context-switches          #    0.011 M/sec                    ( +-  2.50% )
            18,228      cpu-migrations            #    0.244 K/sec                    ( +-  2.87% )
            26,565      page-faults               #    0.355 K/sec                    ( +-  9.19% )
    98,775,815,944      cycles                    #    1.321 GHz                      ( +-  1.40% )  (83.44%)
    26,325,365,757      stalled-cycles-frontend   #   26.65% frontend cycles idle     ( +-  1.96% )  (83.26%)
    17,270,620,447      stalled-cycles-backend    #   17.48% backend  cycles idle     ( +-  3.45% )  (33.32%)
    82,998,905,540      instructions              #    0.84  insns per cycle
                                                  #    0.32  stalled cycles per insn  ( +-  0.71% )  (50.06%)
    14,209,593,402      branches                  #  190.055 M/sec                    ( +-  1.01% )  (66.74%)
       571,258,648      branch-misses             #    4.02% of all branches          ( +-  0.20% )  (83.40%)

      38.220740889 seconds time elapsed                                          ( +-  0.72% )

After:
      73281.226761      task-clock (msec)         #    2.415 CPUs utilized            ( +-  0.29% )
           571,984      context-switches          #    0.008 M/sec                    ( +-  1.11% )
            14,301      cpu-migrations            #    0.195 K/sec                    ( +-  2.90% )
            42,635      page-faults               #    0.582 K/sec                    ( +-  7.76% )
    98,478,185,775      cycles                    #    1.344 GHz                      ( +-  0.32% )  (83.39%)
    25,555,945,935      stalled-cycles-frontend   #   25.95% frontend cycles idle     ( +-  0.47% )  (83.37%)
    15,174,223,390      stalled-cycles-backend    #   15.41% backend  cycles idle     ( +-  0.83% )  (33.26%)
    81,939,511,983      instructions              #    0.83  insns per cycle
                                                  #    0.31  stalled cycles per insn  ( +-  0.12% )  (49.95%)
    13,992,075,918      branches                  #  190.937 M/sec                    ( +-  0.16% )  (66.65%)
       580,790,655      branch-misses             #    4.15% of all branches          ( +-  0.20% )  (83.26%)

      30.340574988 seconds time elapsed                                          ( +-  0.39% )

That is, a speedup of 1.25X.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cpu-exec.c      |  7 ++++++-
 accel/tcg/translate-all.c | 22 ++++++++++++++++++++++
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 54ecae2..2b34d58 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -351,6 +351,7 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
              * single threaded the locks are NOPs.
              */
             mmap_lock();
+#ifdef CONFIG_USER_ONLY
             tb_lock();
             have_tb_lock = true;
 
@@ -362,7 +363,11 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
                 /* if no translated code available, then translate it now */
                 tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
             }
-
+#else
+            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+            /* tb_gen_code returns with tb_lock acquired */
+            have_tb_lock = true;
+#endif
             mmap_unlock();
         }
 
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 17b18a9..6cab609 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -887,7 +887,9 @@ static TranslationBlock *tb_alloc(target_ulong pc)
 {
     TranslationBlock *tb;
 
+#ifdef CONFIG_USER_ONLY
     assert_tb_locked();
+#endif
 
     tb = tcg_tb_alloc(&tcg_ctx);
     if (unlikely(tb == NULL)) {
@@ -1314,7 +1316,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     TCGProfile *prof = &tcg_ctx.prof;
     int64_t ti;
 #endif
+#ifdef CONFIG_USER_ONLY
     assert_memory_lock();
+#endif
 
     phys_pc = get_page_addr_code(env, pc);
     if (use_icount && !(cflags & CF_IGNORE_ICOUNT)) {
@@ -1430,6 +1434,24 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     if ((pc & TARGET_PAGE_MASK) != virt_page2) {
         phys_page2 = get_page_addr_code(env, virt_page2);
     }
+    if (!have_tb_lock) {
+        TranslationBlock *t;
+
+        tb_lock();
+        /*
+         * There's a chance that our desired tb has been translated while
+         * we were translating it.
+         */
+        t = tb_htable_lookup(cpu, pc, cs_base, flags);
+        if (unlikely(t)) {
+            /* discard what we just translated */
+            uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
+
+            orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
+            atomic_set(&tcg_ctx.code_gen_ptr, orig_aligned);
+            return t;
+        }
+    }
     /* As long as consistency of the TB stuff is provided by tb_lock in user
      * mode and is implicit in single-threaded softmmu emulation, no explicit
      * memory barrier is required before tb_link_page() makes the TB visible
-- 
2.7.4


* Re: [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (21 preceding siblings ...)
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 22/22] translate-all: do not hold tb_lock during code generation in softmmu Emilio G. Cota
@ 2017-07-09 18:27 ` Emilio G. Cota
  2017-07-10  9:50 ` Alex Bennée
  23 siblings, 0 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09 18:27 UTC (permalink / raw)
  To: qemu-devel; +Cc: Richard Henderson

On Sun, Jul 09, 2017 at 03:49:52 -0400, Emilio G. Cota wrote:
> The series applies on top of the current master (b11365867568).

It's a lot of patches -- you can fetch them from:
  https://github.com/cota/qemu/commits/multi-tcg

Note that there's a patch in the branch there that is not part
of the patchset ("scripts: add "git.orderfile" for ordering ...")

		E.


* Re: [Qemu-devel] [PATCH 01/22] vl: fix breakage of -tb-size
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 01/22] vl: fix breakage of -tb-size Emilio G. Cota
@ 2017-07-09 19:56   ` Richard Henderson
  2017-07-11 15:37   ` Alex Bennée
  1 sibling, 0 replies; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 19:56 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:49 PM, Emilio G. Cota wrote:
> Commit e7b161d573 ("vl: add tcg_enabled() for tcg related code") adds
> a check to exit the program when !tcg_enabled() while parsing the -tb-size
> flag.
> 
> It turns out that when the -tb-size flag is evaluated, tcg_enabled() can
> only return 0, since it is set (or not) much later by configure_accelerator().
> 
> Fix it by unconditionally exiting if the flag is passed to a QEMU binary
> built with !CONFIG_TCG.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   vl.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~


* Re: [Qemu-devel] [PATCH 02/22] translate-all: remove redundant !tcg_enabled check in dump_exec_info
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 02/22] translate-all: remove redundant !tcg_enabled check in dump_exec_info Emilio G. Cota
@ 2017-07-09 19:57   ` Richard Henderson
  2017-07-10  6:15   ` Thomas Huth
  2017-07-12 12:32   ` Alex Bennée
  2 siblings, 0 replies; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 19:57 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:49 PM, Emilio G. Cota wrote:
> This check is redundant because it is already performed by the only
> caller of dump_exec_info -- the caller was updated by b7da97eef
> ("monitor: Check whether TCG is enabled before running the "info jit"
> code").
> 
> Checking twice wouldn't necessarily be too bad, but here the check also
> returns with tb_lock held. So we can either do the check before tb_lock is
> acquired, or just get rid of it. Given that it is redundant, I am going
> for the latter option.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~


* Re: [Qemu-devel] [PATCH 03/22] cputlb: bring back tlb_flush_count under !TLB_DEBUG
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 03/22] cputlb: bring back tlb_flush_count under !TLB_DEBUG Emilio G. Cota
@ 2017-07-09 20:00   ` Richard Henderson
  2017-07-09 20:56     ` Emilio G. Cota
  2017-07-12 13:26   ` Alex Bennée
  1 sibling, 1 reply; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 20:00 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:49 PM, Emilio G. Cota wrote:
> +    atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);

Want atomic_read here, so they're all the same.

Otherwise,

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~


* Re: [Qemu-devel] [PATCH 04/22] tcg: fix corruption of code_time profiling counter upon tb_flush
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 04/22] tcg: fix corruption of code_time profiling counter upon tb_flush Emilio G. Cota
@ 2017-07-09 20:01   ` Richard Henderson
  2017-07-12 14:36   ` Alex Bennée
  2017-07-12 17:09   ` Philippe Mathieu-Daudé
  2 siblings, 0 replies; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 20:01 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:49 PM, Emilio G. Cota wrote:
> Whenever there is an overflow in code_gen_buffer (e.g. we run out
> of space in it and have to flush it), the code_time profiling counter
> ends up with an invalid value (that is, code_time -= profile_getclock(),
> without later on getting += profile_getclock() due to the goto).
> 
> Fix it by using the ti variable, so that we only update code_time
> when there is no overflow. Note that in case there is an overflow
> we fail to account for the elapsed coding time, but this is quite rare
> so we can probably live with it.

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~


* Re: [Qemu-devel] [PATCH 06/22] translate-all: make have_tb_lock static
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 06/22] translate-all: make have_tb_lock static Emilio G. Cota
@ 2017-07-09 20:02   ` Richard Henderson
  2017-07-12 14:38   ` Alex Bennée
  1 sibling, 0 replies; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 20:02 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:49 PM, Emilio G. Cota wrote:
> It is only used by this object, and it's not exported to any other.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   accel/tcg/translate-all.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~


* Re: [Qemu-devel] [PATCH 07/22] tcg/i386: constify tcg_target_callee_save_regs
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 07/22] tcg/i386: constify tcg_target_callee_save_regs Emilio G. Cota
@ 2017-07-09 20:02   ` Richard Henderson
  2017-07-12 14:39   ` Alex Bennée
  2017-07-12 17:00   ` Philippe Mathieu-Daudé
  2 siblings, 0 replies; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 20:02 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:49 PM, Emilio G. Cota wrote:
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   tcg/i386/tcg-target.inc.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~


* Re: [Qemu-devel] [PATCH 08/22] tcg/mips: constify tcg_target_callee_save_regs
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 08/22] tcg/mips: " Emilio G. Cota
@ 2017-07-09 20:02   ` Richard Henderson
  2017-07-12 14:39   ` Alex Bennée
  2017-07-12 17:01   ` Philippe Mathieu-Daudé
  2 siblings, 0 replies; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 20:02 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   tcg/mips/tcg-target.inc.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~


* Re: [Qemu-devel] [PATCH 09/22] exec-all: shrink tb->invalid to uint8_t
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 09/22] exec-all: shrink tb->invalid to uint8_t Emilio G. Cota
@ 2017-07-09 20:11   ` Richard Henderson
  2017-07-10 23:57     ` Emilio G. Cota
  0 siblings, 1 reply; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 20:11 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> To avoid wasting a byte. I don't have any use in mind for this byte,
> but I think it's good to leave this byte explicitly free for future use.
> See this discussion for how the u16 came to be:
>    https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04564.html
> We could use a bool but in some systems that would take > 1 byte.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   include/exec/exec-all.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)

What I would prefer to do is generalize tb->cflags.  Those values *do* affect 
how we compile the TB, and yet we don't take them into account.  So I think it 
would be a good idea to feed that into the TB hash.

At present we use 18 bits of the uint32_t.  That leaves plenty of room for:

* the compile-time value of parallel_cpus, so we don't have to keep flushing 
the TBs that we create for cpu_exec_step_atomic.

* invalid, which no TB within the hashtable would have set.

* other stuff which may become relevant in the future.
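
For illustration, that could mean feeding the relevant cflags bits into the
TB hash as an extra input, along these lines (hypothetical: CF_HASH_MASK and
the extra tb_hash_func() argument don't exist today):

    h = tb_hash_func(phys_pc, pc, flags, cflags & CF_HASH_MASK);

with the lookup comparison then checking the same masked bits.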


r~


* Re: [Qemu-devel] [PATCH 11/22] translate-all: use a binary search tree to track TBs in TBContext
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 11/22] translate-all: use a binary search tree to track TBs in TBContext Emilio G. Cota
@ 2017-07-09 20:33   ` Richard Henderson
  2017-07-09 21:01     ` Emilio G. Cota
  2017-07-12 15:10   ` Alex Bennée
  1 sibling, 1 reply; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 20:33 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> In order to use glib's binary search tree we embed a helper struct
> in TranslationBlock to allow us to compare tb's based on their
> tc_ptr as well as their tc_size fields.

Using an anon struct really doesn't help.  You're effectively using two 
different structs when you pass them in.  It "works" simply because in both 
cases we happen to know that a given pointer is followed by a uint32_t.

But with that in mind, and knowing that qemu builds with -fno-strict-aliasing, 
you could just as well *not* use the anon struct and address the pointer and 
the uint32_t within the un-partitioned TranslationBlock.  With appropriate 
comments in both places, of course.

> +    /*
> +     * When both sizes are set, we know this isn't a lookup and therefore
> +     * the two buffers are non-overlapping: a pointer comparison will do.
> +     * This is the most likely case: every TB must be inserted; lookups
> +     * are a lot less frequent.
> +     */
> +    if (likely(a->size && b->size)) {
> +        return a->ptr - b->ptr;
> +    }

sizeof(gint) < sizeof(ptrdiff_t) on 64-bit hosts.  You wouldn't have seen a 
problem on x86_64 because of MAX_CODE_GEN_BUFFER_SIZE.  I think it would be 
better to reduce this to -1/0/1.
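
For illustration, a -1/0/1 comparator would be along these lines (a sketch:
"struct tb_tc" stands in for the helper struct embedded in TranslationBlock,
and the size-based lookup path is omitted):

    static gint tb_tc_cmp(gconstpointer ap, gconstpointer bp)
    {
        uintptr_t a = (uintptr_t)((const struct tb_tc *)ap)->ptr;
        uintptr_t b = (uintptr_t)((const struct tb_tc *)bp)->ptr;

        /* strictly -1/0/1: cannot truncate, whatever the pointer distance */
        return a < b ? -1 : a > b ? 1 : 0;
    }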

>  static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
>  {
> +    int nb_tbs __attribute__((unused));
> +
>      tb_lock();
>  
>      /* If it is already been done on request of another CPU,
> @@ -894,11 +913,12 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
>      }
>  
>  #if defined(DEBUG_TB_FLUSH)
> +    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
>      printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
>             (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
> -           tcg_ctx.tb_ctx.nb_tbs, tcg_ctx.tb_ctx.nb_tbs > 0 ?
> +           nb_tbs, nb_tbs > 0 ?
>             ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)) /
> -           tcg_ctx.tb_ctx.nb_tbs : 0);
> +           nb_tbs : 0);
>  #endif

Variable declaration within braces within the ifdef.  Better as size_t or 
unsigned long.  Using int to count things on 64-bit hosts always seems like a bug.


r~


* Re: [Qemu-devel] [PATCH 14/22] tcg: take .helpers out of TCGContext
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 14/22] tcg: take .helpers " Emilio G. Cota
@ 2017-07-09 20:35   ` Richard Henderson
  2017-07-12 15:28   ` Alex Bennée
  1 sibling, 0 replies; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 20:35 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> Before TCGContext is made thread-local.
> 
> The hash table becomes read-only after it is filled in,
> so we can save space by keeping just a global pointer to it.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   tcg/tcg.h |  2 --
>   tcg/tcg.c | 10 +++++-----
>   2 files changed, 5 insertions(+), 7 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~


* Re: [Qemu-devel] [PATCH 15/22] gen-icount: fold exitreq_label into TCGContext
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 15/22] gen-icount: fold exitreq_label into TCGContext Emilio G. Cota
@ 2017-07-09 20:36   ` Richard Henderson
  2017-07-12 15:29   ` Alex Bennée
  1 sibling, 0 replies; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 20:36 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> Before we make TCGContext thread-local.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   include/exec/gen-icount.h | 7 +++----
>   tcg/tcg.h                 | 2 ++
>   2 files changed, 5 insertions(+), 4 deletions(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~


* Re: [Qemu-devel] [PATCH 16/22] tcg: keep a list of TCGContext's
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 16/22] tcg: keep a list of TCGContext's Emilio G. Cota
@ 2017-07-09 20:43   ` Richard Henderson
  2017-07-12 15:32   ` Alex Bennée
  1 sibling, 0 replies; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 20:43 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> Before we make TCGContext thread-local. Once that is done, iterating
> over all TCG contexts will be quite useful; for instance we
> will need it to gather profiling info from each TCGContext.
> 
> A possible alternative would be to keep an array of TCGContext pointers.
> However, this option is not that trivial, because vCPUs are spawned in
> parallel. So let's just keep it simple and use a list protected by a lock.

Do we not know the number of cpus?  I would have thought that we did, via
-smp.  Even cpu-hotplug would still have some sort of bounds?

Which really suggests an array would be better.
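
Something like the following would do (a sketch, with made-up names):

    /* bounded by max_cpus; slots are filled as vCPU threads start up */
    TCGContext **tcg_ctxs;
    unsigned int n_tcg_ctxs;

Readers could then iterate up to atomic_read(&n_tcg_ctxs) without a lock,
provided each slot is published with atomic_set() after initialization.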


r~


* Re: [Qemu-devel] [PATCH 17/22] tcg: distribute profiling counters across TCGContext's
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 17/22] tcg: distribute profiling counters across TCGContext's Emilio G. Cota
@ 2017-07-09 20:45   ` Richard Henderson
  2017-07-09 21:14     ` Emilio G. Cota
  0 siblings, 1 reply; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 20:45 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> +    /* includes aborted translations because of exceptions */
> +    atomic_set(&prof->tb_count1, prof->tb_count1 + 1);

Again, atomic_set without atomic_read is pointless.
Either you're trying to give the compiler extra information, or you aren't.

As always, it won't ever matter in practice because aligned native types never 
tear.  This is all about markup for compiler tools.


r~


* Re: [Qemu-devel] [PATCH 18/22] tcg: define TCG_HIGHWATER
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 18/22] tcg: define TCG_HIGHWATER Emilio G. Cota
@ 2017-07-09 20:46   ` Richard Henderson
  2017-07-12 15:33   ` Alex Bennée
  1 sibling, 0 replies; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 20:46 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> Will come in handy very soon.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   tcg/tcg.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~


* Re: [Qemu-devel] [PATCH 19/22] tcg: introduce tcg_context_clone
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 19/22] tcg: introduce tcg_context_clone Emilio G. Cota
@ 2017-07-09 20:48   ` Richard Henderson
  2017-07-09 21:04     ` Emilio G. Cota
  2017-07-12 16:02   ` Alex Bennée
  1 sibling, 1 reply; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 20:48 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> @@ -409,6 +411,18 @@ void tcg_context_init(TCGContext *s)
>   }
>   
>   /*
> + * Clone the initial TCGContext. Used by TCG threads to copy the TCGContext
> + * set up by their parent thread via tcg_context_init().
> + */
> +void tcg_context_clone(TCGContext *s)
> +{
> +    if (unlikely(tcg_init_ctx == NULL || tcg_init_ctx == s)) {
> +        tcg_abort();
> +    }
> +    memcpy(s, tcg_init_ctx, sizeof(*s));
> +}

Under what conditions will this be called?  How much of this might you need to 
zero out again after the fact?


r~


* Re: [Qemu-devel] [PATCH 03/22] cputlb: bring back tlb_flush_count under !TLB_DEBUG
  2017-07-09 20:00   ` Richard Henderson
@ 2017-07-09 20:56     ` Emilio G. Cota
  2017-07-09 21:20       ` Emilio G. Cota
  0 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09 20:56 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Sun, Jul 09, 2017 at 10:00:01 -1000, Richard Henderson wrote:
> On 07/08/2017 09:49 PM, Emilio G. Cota wrote:
> >+    atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);
> 
> Want atomic_read here, so they're all the same.

It's not needed. Note that this thread is the only one ever writing
to env->tlb_flush_count, so the thread can read this value without
atomic accesses.

You'll see this pattern all across the patchset.
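
In other words, the pattern is (env being owned by the current vCPU thread):

    /* owner thread: plain read of its own counter, non-tearing write */
    atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);

    /* any other thread pairs that with an atomic read */
    flushes = atomic_read(&env->tlb_flush_count);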

Thanks,

		E.


* Re: [Qemu-devel] [PATCH 11/22] translate-all: use a binary search tree to track TBs in TBContext
  2017-07-09 20:33   ` Richard Henderson
@ 2017-07-09 21:01     ` Emilio G. Cota
  0 siblings, 0 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09 21:01 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Sun, Jul 09, 2017 at 10:33:41 -1000, Richard Henderson wrote:
> On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> > #if defined(DEBUG_TB_FLUSH)
> >+    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
> >     printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
> >            (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
> >-           tcg_ctx.tb_ctx.nb_tbs, tcg_ctx.tb_ctx.nb_tbs > 0 ?
> >+           nb_tbs, nb_tbs > 0 ?
> >            ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)) /
> >-           tcg_ctx.tb_ctx.nb_tbs : 0);
> >+           nb_tbs : 0);
> > #endif
> 
> Variable declaration within braces within the ifdef.  Better as size_t or
> unsigned long.  Using int to count thing on 64-bit hosts always seems like a
> bug.

g_tree_nnodes returns a gint, which is documented to be a typedef for 'int'.
So I went with that.

But yes, a size_t here is better.

		E.


* Re: [Qemu-devel] [PATCH 20/22] tcg: dynamically allocate from code_gen_buffer using equally-sized regions
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 20/22] tcg: dynamically allocate from code_gen_buffer using equally-sized regions Emilio G. Cota
@ 2017-07-09 21:03   ` Richard Henderson
  0 siblings, 0 replies; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 21:03 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> +static void code_gen_set_region_size(TCGContext *s)
> +{
> +    size_t per_cpu = s->code_gen_buffer_size / smp_cpus;
> +    size_t div;
> +
> +    assert(per_cpu);
> +    /*
> +     * Use a single region if all we have is one vCPU.
> +     * We could also use a single region with !mttcg, but at this time we have
> +     * not yet processed the thread=single|multi flag.
> +     */
> +    if (smp_cpus == 1) {
> +        tcg_region_set_size(0);
> +        return;
> +    }
> +
> +    for (div = 8; div > 0; div--) {
> +        size_t region_size = per_cpu / div;
> +
> +        if (region_size >= 2 * MIN_CODE_GEN_BUFFER_SIZE) {
> +            tcg_region_set_size(region_size);
> +            return;
> +        }
> +    }
> +    tcg_region_set_size(per_cpu);
> +}

Is it worth setting a guard page after each of these regions?  The guard page 
on the main buffer has caught bugs before (although not in a while now).  If 
not, then we might drop the guard page on the main buffer too.
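
If we did want one, a guard could be set per region with something like this
sketch (assuming regions end on a host page boundary; "region_end" is a
placeholder):

    /* trap code generation that runs off the end of the region */
    if (mprotect(region_end - qemu_real_host_page_size,
                 qemu_real_host_page_size, PROT_NONE)) {
        abort();
    }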

> +static void tcg_region_set_size__locked(size_t size)
> +{
> +    if (!size) {
> +        region.size = tcg_init_ctx->code_gen_buffer_size;
> +        region.n = 1;
> +    } else {
> +        region.size = size;
> +        region.n = tcg_init_ctx->code_gen_buffer_size / size;
> +    }
> +    if (unlikely(region.size < TCG_HIGHWATER)) {
> +        tcg_abort();
> +    }
> +}
> +
> +/*
> + * Call this function at init time (i.e. only once). Calling this function is
> + * optional: if no region size is set, a single region will be used.
> + *
> + * Note: calling this function *after* calling tcg_region_init() is a bug.
> + */
> +void tcg_region_set_size(size_t size)
> +{
> +    tcg_debug_assert(!region.size);
> +
> +    qemu_mutex_lock(&tcg_lock);
> +    tcg_region_set_size__locked(size);
> +    qemu_mutex_unlock(&tcg_lock);
> +}

If this is called during init, then why does it need a lock?  Surely this is 
before we spawn threads.

>      if (unlikely(next > s->code_gen_highwater)) {
> -        return NULL;
> +        if (!tcg_region_alloc(s)) {
> +            return NULL;
> +        }
> +        goto retry;
>      }

Nit: positive logic is almost always clearer: if (region_alloc) goto retry;
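
i.e., assuming tcg_region_alloc() returns nonzero on success as in the hunk
above:

    if (unlikely(next > s->code_gen_highwater)) {
        if (tcg_region_alloc(s)) {
            goto retry;
        }
        return NULL;
    }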


r~


* Re: [Qemu-devel] [PATCH 19/22] tcg: introduce tcg_context_clone
  2017-07-09 20:48   ` Richard Henderson
@ 2017-07-09 21:04     ` Emilio G. Cota
  0 siblings, 0 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09 21:04 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Sun, Jul 09, 2017 at 10:48:27 -1000, Richard Henderson wrote:
> On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> >@@ -409,6 +411,18 @@ void tcg_context_init(TCGContext *s)
> >  }
> >  /*
> >+ * Clone the initial TCGContext. Used by TCG threads to copy the TCGContext
> >+ * set up by their parent thread via tcg_context_init().
> >+ */
> >+void tcg_context_clone(TCGContext *s)
> >+{
> >+    if (unlikely(tcg_init_ctx == NULL || tcg_init_ctx == s)) {
> >+        tcg_abort();
> >+    }
> >+    memcpy(s, tcg_init_ctx, sizeof(*s));
> >+}
> 
> Under what conditions will this be called?  How much of this might you need
> to zero out again after the fact?

I checked the profile/tb counts and all of them are zero when this
is called, which is right after the thread has been created.
But it is conceivable that those counts might be !0 for some targets,
so yes, it'd be better to actively zero those out.

I don't think there are any other fields that would have to be zeroed
out.

		E.


* Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu Emilio G. Cota
@ 2017-07-09 21:07   ` Richard Henderson
  2017-07-09 21:19   ` Richard Henderson
  2017-07-10 12:05   ` Paolo Bonzini
  2 siblings, 0 replies; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 21:07 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> I was not sure about tci_regs. From code inspection it seems that
> they have to be per-thread, so I converted them, but I do not think
> anyone has ever tried to get MTTCG working with TCI.

Yes, those should be per-thread.

Really, they should be on the stack in tcg_qemu_tb_exec, but the helpers aren't 
really structured to support that.


r~


* Re: [Qemu-devel] [PATCH 17/22] tcg: distribute profiling counters across TCGContext's
  2017-07-09 20:45   ` Richard Henderson
@ 2017-07-09 21:14     ` Emilio G. Cota
  2017-07-09 21:44       ` Richard Henderson
  0 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09 21:14 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Sun, Jul 09, 2017 at 10:45:55 -1000, Richard Henderson wrote:
> On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> >+    /* includes aborted translations because of exceptions */
> >+    atomic_set(&prof->tb_count1, prof->tb_count1 + 1);
> 
> Again, atomic_set without atomic_read is pointless.
> Either you're trying to give the compiler extra information, or you aren't.

See my comment to patch 3.

> As always, it won't ever matter in practice because aligned native types
> never tear.  This is all about markup for compiler tools.

I do it mostly to avoid undefined behaviour under C11. Pleasing
(some) tools is a nice side effect though.

		E.


* Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu Emilio G. Cota
  2017-07-09 21:07   ` Richard Henderson
@ 2017-07-09 21:19   ` Richard Henderson
  2017-07-09 21:29     ` Emilio G. Cota
  2017-07-10 12:05   ` Paolo Bonzini
  2 siblings, 1 reply; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 21:19 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> This allows us to generate TCG code in parallel. MTTCG already uses
> it, although the next commit pushes down a lock to actually
> perform parallel generation.
> 
> User-mode is kept out of this: contention due to concurrent translation
> is more commonly found in full-system mode.

Um, why do you believe that?  Are you suggesting that a multi-threaded 
user-only guest is much more likely to share TBs and do much less code 
generation total?

At the moment I think it's just a confusing distinction.  As proven by some of 
the comment adjustments you made.

> -TCGContext tcg_ctx;
> +TCG_THREAD TCGContext tcg_ctx;

This is a really large structure, and it's not needed by any of the I/O 
threads.  We're probably better off dynamically allocating this ourselves and
doing something like

__thread TCGContext *tcg_ctx_ptr;
#define tcg_ctx (*tcg_ctx_ptr)

You could then even have tcg_init_ctx be the object instead of a pointer to
the object for the main thread, so you don't have to have an extra pointer to
this; just reference it by symbol.
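
Each vCPU thread would then set it up with something like this sketch (the
helper name is made up; tcg_context_clone is from patch 19):

    void tcg_register_thread(void)
    {
        tcg_ctx_ptr = g_new(TCGContext, 1);
        tcg_context_clone(tcg_ctx_ptr);
    }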

> -static struct tcg_temp_info temps[TCG_MAX_TEMPS];
> -static TCGTempSet temps_used;
> +static TCG_THREAD struct tcg_temp_info temps[TCG_MAX_TEMPS];
> +static TCG_THREAD TCGTempSet temps_used;


I've been meaning to dynamically allocate these for a while now.  I think that 
would be better than putting this large array in TLS.  Which, again, is not 
needed for I/O threads.


r~


* Re: [Qemu-devel] [PATCH 03/22] cputlb: bring back tlb_flush_count under !TLB_DEBUG
  2017-07-09 20:56     ` Emilio G. Cota
@ 2017-07-09 21:20       ` Emilio G. Cota
  0 siblings, 0 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09 21:20 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Sun, Jul 09, 2017 at 16:56:23 -0400, Emilio G. Cota wrote:
> On Sun, Jul 09, 2017 at 10:00:01 -1000, Richard Henderson wrote:
> > On 07/08/2017 09:49 PM, Emilio G. Cota wrote:
> > >+    atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);
> > 
> > Want atomic_read here, so they're all the same.
> 
> It's not needed. Note that this thread is the only one ever writing
> to env->tlb_flush_count, so the thread can read this value without
> atomic accesses.
> 
> You'll see this pattern all across the patchset.

We already have this kind of pattern in QEMU. See this patch and
related discussion:
  https://patchwork.kernel.org/patch/9358939/

		E.


* Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
  2017-07-09 21:19   ` Richard Henderson
@ 2017-07-09 21:29     ` Emilio G. Cota
  2017-07-09 21:48       ` Richard Henderson
  0 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-09 21:29 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Sun, Jul 09, 2017 at 11:19:37 -1000, Richard Henderson wrote:
> On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> >This allows us to generate TCG code in parallel. MTTCG already uses
> >it, although the next commit pushes down a lock to actually
> >perform parallel generation.
> >
> >User-mode is kept out of this: contention due to concurrent translation
> >is more commonly found in full-system mode.
> 
> Um, why do you believe that?  Are you suggesting that a multi-threaded
> user-only guest is much more likely to share TBs and do much less code
> generation total?

Exactly. Also, in user-mode "vCPU threads" (i.e. host threads) come and
go all the time, so this doesn't work well with having a single
code_gen_buffer, which I assumed was non-negotiable.

> At the moment I think it's just a confusing distinction.  As proven by some
> of the comment adjustments you made.
> 
> >-TCGContext tcg_ctx;
> >+TCG_THREAD TCGContext tcg_ctx;
> 
> This is a really large structure, and it's not needed by any of the I/O
> threads.  We're probably better off dynamically allocating this ourselves
> and doing something like
(snip)

Agreed, will look into this.

		E.


* Re: [Qemu-devel] [PATCH 22/22] translate-all: do not hold tb_lock during code generation in softmmu
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 22/22] translate-all: do not hold tb_lock during code generation in softmmu Emilio G. Cota
@ 2017-07-09 21:38   ` Richard Henderson
  2017-07-10  3:51     ` Emilio G. Cota
  0 siblings, 1 reply; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 21:38 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> +    if (!have_tb_lock) {
> +        TranslationBlock *t;
> +
> +        tb_lock();
> +        /*
> +         * There's a chance that our desired tb has been translated while
> +         * we were translating it.
> +         */
> +        t = tb_htable_lookup(cpu, pc, cs_base, flags);
> +        if (unlikely(t)) {
> +            /* discard what we just translated */
> +            uintptr_t orig_aligned = (uintptr_t)gen_code_buf;
> +
> +            orig_aligned -= ROUND_UP(sizeof(*tb), qemu_icache_linesize);
> +            atomic_set(&tcg_ctx.code_gen_ptr, orig_aligned);
> +            return t;
> +        }
> +    }
>       /* As long as consistency of the TB stuff is provided by tb_lock in user
>        * mode and is implicit in single-threaded softmmu emulation, no explicit
>        * memory barrier is required before tb_link_page() makes the TB visible

I think it would be better to have a tb_htable_lookup_or_insert function, which 
performs the insert iff a matching object isn't already there, returning the 
entry which *is* there in either case.

The hash table already has per-bucket locking.  So here you're taking 3 locks 
(tb_lock, bucket lock for lookup, bucket lock for insert) instead of just 
taking the bucket lock once.  And, recall, the htable is designed such that the 
buckets do not contend often, so you ought to be eliminating most of the 
contention that you're seeing on tb_lock.

It might also be helpful to merge tb_link_page into its one caller in order to 
make the locking issues within that subroutine less muddled.

Anyway, you'd wind up with

   TranslationBlock *ret;

   ret = tb_htable_lookup_or_insert(arguments);
   if (unlikely(ret != tb)) {
      /* discard what we just translated */
   }

   return ret;


r~


* Re: [Qemu-devel] [PATCH 17/22] tcg: distribute profiling counters across TCGContext's
  2017-07-09 21:14     ` Emilio G. Cota
@ 2017-07-09 21:44       ` Richard Henderson
  2017-07-10 16:00         ` Emilio G. Cota
  0 siblings, 1 reply; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 21:44 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel

On 07/09/2017 11:14 AM, Emilio G. Cota wrote:
> On Sun, Jul 09, 2017 at 10:45:55 -1000, Richard Henderson wrote:
>> On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
>>> +    /* includes aborted translations because of exceptions */
>>> +    atomic_set(&prof->tb_count1, prof->tb_count1 + 1);
>>
>> Again, atomic_set without atomic_read is pointless.
>> Either you're trying to give the compiler extra information, or you aren't.
> 
> See my comment to patch 3.

I still disagree.  It's Just Plain Confusing.

You'll continue to get questions like this from me and other reviewers in 
future.  And it's not like avoiding atomic_read here makes anything faster. 
Both forms will compile to the same assembler.


r~


* Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
  2017-07-09 21:29     ` Emilio G. Cota
@ 2017-07-09 21:48       ` Richard Henderson
  2017-07-10  3:54         ` Emilio G. Cota
  0 siblings, 1 reply; 95+ messages in thread
From: Richard Henderson @ 2017-07-09 21:48 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel

On 07/09/2017 11:29 AM, Emilio G. Cota wrote:
> On Sun, Jul 09, 2017 at 11:19:37 -1000, Richard Henderson wrote:
>> On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
>>> This allows us to generate TCG code in parallel. MTTCG already uses
>>> it, although the next commit pushes down a lock to actually
>>> perform parallel generation.
>>>
>>> User-mode is kept out of this: contention due to concurrent translation
>>> is more commonly found in full-system mode.
>>
>> Um, why do you believe that?  Are you suggesting that a multi-threaded
>> user-only guest is much more likely to share TBs and do much less code
>> generation total?
> 
> Exactly. Also, in user-mode "vCPU threads" (i.e. host threads) come and
> go all the time, so this doesn't work well with having a single
> code_gen_buffer, which I assumed was non-negotiable.

Ah, yes.  For any subdivision N of code_gen_buffer that we choose, at some 
point we may have N+1 threads and need to do Something Else.

That's probably something worth commenting somewhere with the "first" #ifndef 
CONFIG_USER_ONLY.


r~


* Re: [Qemu-devel] [PATCH 22/22] translate-all: do not hold tb_lock during code generation in softmmu
  2017-07-09 21:38   ` Richard Henderson
@ 2017-07-10  3:51     ` Emilio G. Cota
  2017-07-10  5:59       ` Richard Henderson
  0 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-10  3:51 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Sun, Jul 09, 2017 at 11:38:50 -1000, Richard Henderson wrote:
> On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
(snip)
> I think it would be better to have a tb_htable_lookup_or_insert function,
> which performs the insert iff a matching object isn't already there,
> returning the entry which *is* there in either case.

qht_insert behaves exactly like this, except that it returns a bool.
But we could make it return a void *.
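
e.g. a variant along these lines (hypothetical signature):

    /* Returns NULL if @p was inserted, or the pre-existing entry that
     * func() considers equal to @p, in which case nothing is inserted. */
    void *qht_insert_or_lookup(struct qht *ht, void *p, uint32_t hash,
                               qht_lookup_func_t func, const void *userp);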

> The hash table already has per-bucket locking.  So here you're taking 3
> locks (tb_lock, bucket lock for lookup, bucket lock for insert) instead of
> just taking the bucket lock once.  And, recall, the htable is designed such
> that the buckets do not contend often, so you ought to be eliminating most
> of the contention that you're seeing on tb_lock.

Minor nit: the lookup is just a seqlock_read, so it's not really a lock.

Your point is correct though. Performing a lookup here (that qht_insert
will do anyway) is wasteful. I didn't look further into this because I
thought getting rid of tb_lock here (due to tb_link_page) wasn't trivial.
But I'll look into it; if we manage to just have the lock in qht_insert,
we'll get great scalability.

		E.


* Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
  2017-07-09 21:48       ` Richard Henderson
@ 2017-07-10  3:54         ` Emilio G. Cota
  0 siblings, 0 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-10  3:54 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Sun, Jul 09, 2017 at 11:48:53 -1000, Richard Henderson wrote:
> On 07/09/2017 11:29 AM, Emilio G. Cota wrote:
(snip)
> >Exactly. Also, in user-mode "vCPU threads" (i.e. host threads) come and
> >go all the time, so this doesn't work well with having a single
> >code_gen_buffer, which I assumed was non-negotiable.
> 
> Ah, yes.  For any subdivision N of code_gen_buffer that we choose, at some
> point we may have N+1 threads and need to do Something Else.
> 
> That's probably something worth commenting somewhere with the "first"
> #ifndef CONFIG_USER_ONLY.

Yep, will do.

Thanks,

		Emilio


* Re: [Qemu-devel] [PATCH 22/22] translate-all: do not hold tb_lock during code generation in softmmu
  2017-07-10  3:51     ` Emilio G. Cota
@ 2017-07-10  5:59       ` Richard Henderson
  2017-07-10 15:28         ` Emilio G. Cota
  0 siblings, 1 reply; 95+ messages in thread
From: Richard Henderson @ 2017-07-10  5:59 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel

On 07/09/2017 05:51 PM, Emilio G. Cota wrote:
> On Sun, Jul 09, 2017 at 11:38:50 -1000, Richard Henderson wrote:
>> On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> (snip)
>> I think it would be better to have a tb_htable_lookup_or_insert function,
>> which performs the insert iff a matching object isn't already there,
>> returning the entry which *is* there in either case.
> 
> qht_insert behaves exactly like this, except that it returns a bool.
> But we could make it return a void *.

Err.. no it doesn't.  It returns false if the *exact same object* is inserted 
twice.  That's not the same as being passed a qht_lookup_func_t to see if two 
different objects compare equal.


r~


* Re: [Qemu-devel] [PATCH 02/22] translate-all: remove redundant !tcg_enabled check in dump_exec_info
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 02/22] translate-all: remove redundant !tcg_enabled check in dump_exec_info Emilio G. Cota
  2017-07-09 19:57   ` Richard Henderson
@ 2017-07-10  6:15   ` Thomas Huth
  2017-07-12 12:32   ` Alex Bennée
  2 siblings, 0 replies; 95+ messages in thread
From: Thomas Huth @ 2017-07-10  6:15 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Richard Henderson

On 09.07.2017 09:49, Emilio G. Cota wrote:
> This check is redundant because it is already performed by the only
> caller of dump_exec_info -- the caller was updated by b7da97eef
> ("monitor: Check whether TCG is enabled before running the "info jit"
> code").
> 
> Checking twice wouldn't necessarily be too bad, but here the check also
> returns with tb_lock held. So we can either do the check before tb_lock is
> acquired, or just get rid of it. Given that it is redundant, I am going
> for the latter option.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/translate-all.c | 5 -----
>  1 file changed, 5 deletions(-)
> 
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index dfb9f0d..f768681 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1851,11 +1851,6 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>  
>      tb_lock();
>  
> -    if (!tcg_enabled()) {
> -        cpu_fprintf(f, "TCG not enabled\n");
> -        return;
> -    }
> -
>      target_code_size = 0;
>      max_target_code_size = 0;
>      cross_page = 0;
> 

Reviewed-by: Thomas Huth <thuth@redhat.com>


* Re: [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG
  2017-07-09  7:49 [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
                   ` (22 preceding siblings ...)
  2017-07-09 18:27 ` [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG Emilio G. Cota
@ 2017-07-10  9:50 ` Alex Bennée
  2017-07-10 17:04   ` Richard Henderson
  23 siblings, 1 reply; 95+ messages in thread
From: Alex Bennée @ 2017-07-10  9:50 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Original RFC here:
>   https://lists.nongnu.org/archive/html/qemu-devel/2017-06/msg06874.html
>
> I included Richard's feedback (Thanks!) from the original RFC, and
> added quite a few things. This is now a proper PATCHset since it is
> a lot more mature.
>
> Highlights:
> - It works! I tested single/multi-threaded arm, aarch64 and alpha softmmu
>   with various -smp's (up to 120 on aarch64) and -tb-size's.
>   Also tested x86_64-linux-user with multi-threaded code. valgrind's
>   drd shows no obvious issues (it doesn't swallow C11 atomics, so it
>   spits out a lot of false positives though). Have not tested on a
>   non-x86 host, but given the audit I did of global non-const variables
>   (see commit message in patch 21), it should be OK.

It would be really nice if we could get ThreadSanitizer to support our
setcontext() co-routines. It was very useful during the original MTTCG
work and is a lot faster than Valgrind. There was some discussion on the
sanitizer lists, and a basic plan of what is needed is known, but it's
unlikely to get done by the project itself.

>
> - Region-based allocation to maximize code_gen_buffer utilization.
>   See patch 20.
>
> - Patches 1-8 are unrelated fixes, but I'm keeping them as part of this
>   series to avoid merge headaches later on.
>
> - Performance-wise we get a 20% improvement when booting+shutting down
>   debian-arm with MTTCG and -smp 8 (see patch 22). Not bad! This is due
>   to not holding tb_lock during code translation, although the fact that
>   we still have to take it after every translation remains a scalability
>   issue. But before focusing on that, I'd like to get this reviewed.

Side issue. Have we considered the impact on codegen buffer utilisation
by doing an "off-code_gen_buffer" no-cache translation the first time we
ever see a TB?

>
> I broke down features as much as possible, so that we do not end up
> with a "per-thread TCG" megapatch.
>
> The series applies on top of the current master (b11365867568).
>
> Thanks,
>
> 		Emilio


--
Alex Bennée


* Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu Emilio G. Cota
  2017-07-09 21:07   ` Richard Henderson
  2017-07-09 21:19   ` Richard Henderson
@ 2017-07-10 12:05   ` Paolo Bonzini
  2017-07-10 21:14     ` Emilio G. Cota
  2 siblings, 1 reply; 95+ messages in thread
From: Paolo Bonzini @ 2017-07-10 12:05 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel; +Cc: Richard Henderson

On 09/07/2017 09:50, Emilio G. Cota wrote:
> User-mode is kept out of this: contention due to concurrent translation
> is more commonly found in full-system mode.

Out of curiosity, is it harder, or did you just not try?  It would be nice
if the commit message mentioned the problems (if any) in addition to the
reason why you didn't do it.

Having similar policies for user and softmmu emulation is much more
maintainable (for an earlier example, see the unification of user mode
emulation's start/end_exclusive logic with softmmu's "safe work").

Paolo


* Re: [Qemu-devel] [PATCH 22/22] translate-all: do not hold tb_lock during code generation in softmmu
  2017-07-10  5:59       ` Richard Henderson
@ 2017-07-10 15:28         ` Emilio G. Cota
  0 siblings, 0 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-10 15:28 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Sun, Jul 09, 2017 at 19:59:47 -1000, Richard Henderson wrote:
> On 07/09/2017 05:51 PM, Emilio G. Cota wrote:
> >On Sun, Jul 09, 2017 at 11:38:50 -1000, Richard Henderson wrote:
> >>On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> >(snip)
> >>I think it would be better to have a tb_htable_lookup_or_insert function,
> >>which performs the insert iff a matching object isn't already there,
> >>returning the entry which *is* there in either case.
> >
> >qht_insert behaves exactly like this, except that it returns a bool.
> >But we could make it return a void *.
> 
> Err.. no it doesn't.  It returns false if the *exact same object* is
> inserted twice.  That's not the same as being passed a qht_lookup_func_t to
> see if two different objects compare equal.

True, I misremembered. My original implementation worked like that though,
then I figured for our use case we could skip calling the comparison function.

		E.


* Re: [Qemu-devel] [PATCH 17/22] tcg: distribute profiling counters across TCGContext's
  2017-07-09 21:44       ` Richard Henderson
@ 2017-07-10 16:00         ` Emilio G. Cota
  0 siblings, 0 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-10 16:00 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Sun, Jul 09, 2017 at 11:44:10 -1000, Richard Henderson wrote:
> On 07/09/2017 11:14 AM, Emilio G. Cota wrote:
> >On Sun, Jul 09, 2017 at 10:45:55 -1000, Richard Henderson wrote:
> >>On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> >>>+    /* includes aborted translations because of exceptions */
> >>>+    atomic_set(&prof->tb_count1, prof->tb_count1 + 1);
> >>
> >>Again, atomic_set without atomic_read is pointless.
> >>Either you're trying to give the compiler extra information, or you aren't.
> >
> >See my comment to patch 3.
> 
> I still disagree.  It's Just Plain Confusing.

To me,
  atomic_set(&foo, foo + 1)
and
  atomic_set(&foo, atomic_read(&foo) + 1)
read differently.

The former tells me that no other thread can update the value.

The latter makes me presume that other threads *can* update the value,
which makes me wonder whether this is a bug and we should use atomic_inc(),
or perhaps we are OK with missing counts.
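
Side by side:

    atomic_set(&foo, foo + 1);                /* single writer */
    atomic_set(&foo, atomic_read(&foo) + 1);  /* racy: increments may be lost */
    atomic_inc(&foo);                         /* atomic read-modify-write */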

		Emilio


* Re: [Qemu-devel] [PATCH 00/22] tcg: per-thread TCG
  2017-07-10  9:50 ` Alex Bennée
@ 2017-07-10 17:04   ` Richard Henderson
  0 siblings, 0 replies; 95+ messages in thread
From: Richard Henderson @ 2017-07-10 17:04 UTC (permalink / raw)
  To: Alex Bennée, Emilio G. Cota; +Cc: qemu-devel

On 07/09/2017 11:50 PM, Alex Bennée wrote:
> Side issue. Have we considered the impact on codegen buffer utilisation
> by doing an "off-code_gen_buffer" no-cache translation the first time we
> ever see a TB?

No, we haven't.  Possibly because we'd need additional infrastructure to do even 
that -- we'd want to record somehow that we've seen a given TB before, so that 
we *could* generate permanent code for it in future.


r~


* Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
  2017-07-10 12:05   ` Paolo Bonzini
@ 2017-07-10 21:14     ` Emilio G. Cota
  2017-07-10 21:33       ` Paolo Bonzini
  0 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-10 21:14 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, Richard Henderson

On Mon, Jul 10, 2017 at 14:05:01 +0200, Paolo Bonzini wrote:
> On 09/07/2017 09:50, Emilio G. Cota wrote:
> > User-mode is kept out of this: contention due to concurrent translation
> > is more commonly found in full-system mode.
> 
> Out of curiosity, is it harder, or did you just not try?  It would be nice
> if the commit message mentioned the problems (if any) in addition to the
> reason why you didn't do it.
> 
> Having similar policies for user and softmmu emulation is much more
> maintainable (for an earlier example, see the unification of user mode
> emulation's start/end_exclusive logic with softmmu's "safe work").

I agree that it would be nice to have the same mechanism for all.

The main hurdle I see is how to allow for concurrent code generation while
minimizing flushes of the single, fixed-size[*] code_gen_buffer.
In user-mode this is tricky because there is no way to bound the number
of threads that might be spawned by the guest code (I don't think reading
/proc/sys/kernel/threads-max is a viable solution here).

Switching to a "__thread *tcg_ctx_ptr" model will help minimize
user-mode/softmmu differences though. The only remaining difference would be
that user-mode would need tb_lock() around tb_gen_code, whereas softmmu
wouldn't, but everything else would be the same.

		E.

[*] Note that in user-mode we use code_gen_buffer defined at compile-time
as a static buffer[].


* Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
  2017-07-10 21:14     ` Emilio G. Cota
@ 2017-07-10 21:33       ` Paolo Bonzini
  2017-07-10 22:13         ` Emilio G. Cota
  0 siblings, 1 reply; 95+ messages in thread
From: Paolo Bonzini @ 2017-07-10 21:33 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


> I agree that it would be nice to have the same mechanism for all.
> 
> The main hurdle I see is how to allow for concurrent code generation while
> minimizing flushes of the single, fixed-size[*] code_gen_buffer.
> In user-mode this is tricky because there is no way to bound the number
> of threads that might be spawned by the guest code (I don't think reading
> /proc/sys/kernel/threads-max is a viable solution here).
> 
> Switching to a "__thread *tcg_ctx_ptr" model will help minimize
> user-mode/softmmu differences though. The only remaining difference would be
> that user-mode would need tb_lock() around tb_gen_code, whereas softmmu
> wouldn't, but everything else would be the same.

Hmm, tb_gen_code is already protected by mmap_lock in linux-user, so you wouldn't
get any parallelism.  On the other hand, you could just say that the fixed-size
code_gen_buffer is protected by mmap_lock, which doesn't exist for softmmu.

Paolo


* Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
  2017-07-10 21:33       ` Paolo Bonzini
@ 2017-07-10 22:13         ` Emilio G. Cota
  2017-07-11  8:02           ` Paolo Bonzini
  0 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-10 22:13 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, Richard Henderson

On Mon, Jul 10, 2017 at 17:33:07 -0400, Paolo Bonzini wrote:
> 
> > I agree that it would be nice to have the same mechanism for all.
> > 
> > The main hurdle I see is how to allow for concurrent code generation while
> > minimizing flushes of the single, fixed-size[*] code_gen_buffer.
> > In user-mode this is tricky because there is no way to bound the number
> > of threads that might be spawned by the guest code (I don't think reading
> > /proc/sys/kernel/threads-max is a viable solution here).
> > 
> > Switching to a "__thread *tcg_ctx_ptr" model will help minimize
> > user-mode/softmmu differences though. The only remaining difference would be
> > that user-mode would need tb_lock() around tb_gen_code, whereas softmmu
> > wouldn't, but everything else would be the same.
> 
> Hmm, tb_gen_code is already protected by mmap_lock in linux-user, so you wouldn't
> get any parallelism.  On the other hand, you could just say that the fixed-size
> code_gen_buffer is protected by mmap_lock, which doesn't exist for softmmu.

Yes. tb_lock/mmap_lock, or, as they're called in some asserts, memory_lock.

A way to get some parallelism in user-mode given the constraints
would be to share regions among TCG threads. Threads would still need to take
a per-region lock, but it wouldn't be a global lock so that would scale better.
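
Roughly (a sketch, with made-up names):

    struct tcg_region {
        QemuMutex lock;    /* shared only by threads bound to this region */
        void *base;
        size_t size;
        void *alloc_ptr;   /* next free byte; protected by @lock */
    };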

I'm not sure we really need that much parallelism for code generation in user-mode,
though. So I wouldn't focus on this until seeing benchmarks that have a clear
bottleneck due to "memory_lock".

		E.


* Re: [Qemu-devel] [PATCH 09/22] exec-all: shrink tb->invalid to uint8_t
  2017-07-09 20:11   ` Richard Henderson
@ 2017-07-10 23:57     ` Emilio G. Cota
  2017-07-12  0:53       ` Richard Henderson
  0 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-10 23:57 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Sun, Jul 09, 2017 at 10:11:21 -1000, Richard Henderson wrote:
> On 07/08/2017 09:50 PM, Emilio G. Cota wrote:
> >To avoid wasting a byte. I don't have any use in mind for this byte,
> >but I think it's good to leave this byte explicitly free for future use.
> >See this discussion for how the u16 came to be:
> >   https://lists.gnu.org/archive/html/qemu-devel/2016-07/msg04564.html
> >We could use a bool but in some systems that would take > 1 byte.
> >
> >Signed-off-by: Emilio G. Cota<cota@braap.org>
> >---
> >  include/exec/exec-all.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> What I would prefer to do is generalize tb->cflags.  Those values *do*
> affect how we compile the TB, and yet we don't take them into account.  So I
> think it would be a good idea to feed that into the TB hash.

I'm having trouble seeing how this could work.
Where do we get the "current" values from the current state, i.e.
the ones we need to generate the hash and perform comparisons?
In particular:
- CF_COUNT_MASK: just use CF_COUNT_MASK?
- CF_LAST_IO: ?
- CF_NOCACHE: always 0 I guess
- CF_USE/IGNORE_ICOUNT: ?

Or should we just mask these out for hashing/cmp purposes?
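
That is, something like the below -- a sketch only; CF_HASH_MASK is a
name I'm making up, and tb_hash_func would have to grow an extra
argument:

    /* sketch: drop per-instance bits before hashing/comparing cflags */
    #define CF_HASH_MASK \
        (~(CF_COUNT_MASK | CF_LAST_IO | CF_NOCACHE | CF_IGNORE_ICOUNT))

    h = tb_hash_func(phys_pc, pc, flags, cflags & CF_HASH_MASK);

so that only the bits that actually change the generated code would
participate in the hash and in the comparison function.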

> At present we use 18 bits of the uint32_t.  That leaves plenty of room for:
> 
> * the compile-time value of parallel_cpus, so we don't have to keep flushing
> the TBs that we create for cpu_exec_step_atomic.

This makes sense, yes.

> * invalid, which no TB within the hashtable would have set.

Agreed.

		E.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 21/22] tcg: enable per-thread TCG for softmmu
  2017-07-10 22:13         ` Emilio G. Cota
@ 2017-07-11  8:02           ` Paolo Bonzini
  0 siblings, 0 replies; 95+ messages in thread
From: Paolo Bonzini @ 2017-07-11  8:02 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson

On 11/07/2017 00:13, Emilio G. Cota wrote:
> On Mon, Jul 10, 2017 at 17:33:07 -0400, Paolo Bonzini wrote:
>>
>>> I agree that it would be nice to have the same mechanism for all.
>>>
>>> The main hurdle I see is how to allow for concurrent code generation while
>>> minimizing flushes of the single, fixed-size[*] code_gen_buffer.
>>> In user-mode this is tricky because there is no way to bound the number
>>> of threads that might be spawned by the guest code (I don't think reading
>>> /proc/sys/kernel/threads-max is a viable solution here).
>>>
>>> Switching to a "__thread *tcg_ctx_ptr" model will help minimize
>>> user-mode/softmmu differences though. The only remaining difference would be
>>> that user-mode would need tb_lock() around tb_gen_code, whereas softmmu
>>> wouldn't, but everything else would be the same.
>>
>> Hmm, tb_gen_code is already protected by mmap_lock in linux-user, so you wouldn't
>> get any parallelism.  On the other hand, you could just say that the fixed-size
>> code_gen_buffer is protected by mmap_lock, which doesn't exist for softmmu.
> 
> Yes. tb_lock/mmap_lock, or as they're called in some asserts, memory_lock.
> 
> A way to get some parallelism in user-mode given the constraints
> would be to share regions among TCG threads. Threads would still need to take
> a per-region lock, but it wouldn't be a global lock so that would scale better.
> 
> I'm not sure we really need that much parallelism for code generation in user-mode,
> though. So I wouldn't focus on this until seeing benchmarks that have a clear
> bottleneck due to "memory_lock".

I agree.  Still, we could minimize the differences by protecting
tb_gen_code only with mmap_lock, instead of mmap_lock+tb_lock.

Paolo

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 01/22] vl: fix breakage of -tb-size
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 01/22] vl: fix breakage of -tb-size Emilio G. Cota
  2017-07-09 19:56   ` Richard Henderson
@ 2017-07-11 15:37   ` Alex Bennée
  1 sibling, 0 replies; 95+ messages in thread
From: Alex Bennée @ 2017-07-11 15:37 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Commit e7b161d573 ("vl: add tcg_enabled() for tcg related code") adds
> a check to exit the program when !tcg_enabled() while parsing the -tb-size
> flag.
>
> It turns out that when the -tb-size flag is evaluated, tcg_enabled() can
> only return 0, since it is set (or not) much later by configure_accelerator().
>
> Fix it by unconditionally exiting if the flag is passed to a QEMU binary
> built with !CONFIG_TCG.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  vl.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/vl.c b/vl.c
> index d17c863..9ece570 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -3933,10 +3933,10 @@ int main(int argc, char **argv, char **envp)
>                  configure_rtc(opts);
>                  break;
>              case QEMU_OPTION_tb_size:
> -                if (!tcg_enabled()) {
> -                    error_report("TCG is disabled");
> -                    exit(1);
> -                }
> +#ifndef CONFIG_TCG
> +                error_report("TCG is disabled");
> +                exit(1);
> +#endif
>                  if (qemu_strtoul(optarg, NULL, 0, &tcg_tb_size) < 0) {
>                      error_report("Invalid argument to -tb-size");
>                      exit(1);


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 09/22] exec-all: shrink tb->invalid to uint8_t
  2017-07-10 23:57     ` Emilio G. Cota
@ 2017-07-12  0:53       ` Richard Henderson
  2017-07-12 20:48         ` Emilio G. Cota
  0 siblings, 1 reply; 95+ messages in thread
From: Richard Henderson @ 2017-07-12  0:53 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel

On 07/10/2017 01:57 PM, Emilio G. Cota wrote:
>> What I would prefer to do is generalize tb->cflags.  Those values *do*
>> affect how we compile the TB, and yet we don't take them into account.  So I
>> think it would be a good idea to feed that into the TB hash.
> 
> I'm having trouble seeing how this could work.
> Where do we get the "current" values from the current state, i.e.
> the ones we need to generate the hash and perform comparisons?
> In particular:
> - CF_COUNT_MASK: just use CF_COUNT_MASK?
> - CF_LAST_IO: ?
> - CF_NOCACHE: always 0 I guess

All of these are set by cpu_io_recompile as needed.
They are all clear for normal TBs.

> - CF_USE/IGNORE_ICOUNT: ?

CF_IGNORE_ICOUNT probably shouldn't exist.  The callers of tb_gen_code
should simply set CF_USE_ICOUNT properly if use_icount is true, rather
than having two flags control the same feature.

At which point CF_USE_ICOUNT should be set iff use_icount is true.

Likewise CF_PARALLEL would be set iff parallel_cpus is true, except for within 
cpu_exec_step_atomic where we would always use 0 (because that's the whole 
point of that function).
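
I.e., roughly -- a sketch only, with curr_cflags being a made-up helper
name:

    /* sketch: the cflags every "normal" TB would be compiled with */
    static inline uint32_t curr_cflags(void)
    {
        return (parallel_cpus ? CF_PARALLEL : 0)
             | (use_icount ? CF_USE_ICOUNT : 0);
    }

Lookups and tb_gen_code would use this value, with cpu_exec_step_atomic
masking CF_PARALLEL out for its own TBs.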


r~

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 02/22] translate-all: remove redundant !tcg_enabled check in dump_exec_info
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 02/22] translate-all: remove redundant !tcg_enabled check in dump_exec_info Emilio G. Cota
  2017-07-09 19:57   ` Richard Henderson
  2017-07-10  6:15   ` Thomas Huth
@ 2017-07-12 12:32   ` Alex Bennée
  2 siblings, 0 replies; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 12:32 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> This check is redundant because it is already performed by the only
> caller of dump_exec_info -- the caller was updated by b7da97eef
> ("monitor: Check whether TCG is enabled before running the "info jit"
> code").
>
> Checking twice wouldn't necessarily be too bad, but here the check also
> returns with tb_lock held. So we can either do the check before tb_lock is
> acquired, or just get rid of it. Given that it is redundant, I am going
> for the latter option.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  accel/tcg/translate-all.c | 5 -----
>  1 file changed, 5 deletions(-)
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index dfb9f0d..f768681 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1851,11 +1851,6 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>
>      tb_lock();
>
> -    if (!tcg_enabled()) {
> -        cpu_fprintf(f, "TCG not enabled\n");
> -        return;
> -    }
> -
>      target_code_size = 0;
>      max_target_code_size = 0;
>      cross_page = 0;


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 03/22] cputlb: bring back tlb_flush_count under !TLB_DEBUG
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 03/22] cputlb: bring back tlb_flush_count under !TLB_DEBUG Emilio G. Cota
  2017-07-09 20:00   ` Richard Henderson
@ 2017-07-12 13:26   ` Alex Bennée
  2017-07-12 18:19     ` Emilio G. Cota
  1 sibling, 1 reply; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 13:26 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Commit f0aff0f124 ("cputlb: add assert_cpu_is_self checks") buried
> the increment of tlb_flush_count under TLB_DEBUG. This results in
> "info jit" always (mis)reporting 0 TLB flushes when !TLB_DEBUG.
>
> Besides, under MTTCG tlb_flush_count is updated by several threads,
> so in order not to lose counts we'd either have to use atomic ops
> or distribute the counter, which is more scalable.
>
> This patch does the latter by embedding tlb_flush_count in CPUArchState.
> The global count is then easily obtained by iterating over the CPU list.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

As it actually fixes unintentional breakage:

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

That said I'm not sure this number alone is helpful given the range
of flushes we have. From a performance point of view we should really
differentiate between inline per-vCPU flushes and the cross-vCPU
flushes of both the asynchronous and synced varieties.

I had a go at this using QEMU's tracing infrastructure:

  https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg04076.html

But I guess the ideal way would be something that both keeps counters
and optionally enables tracepoints.
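
Something along these lines, say -- a sketch only, where the
TLBFlushKind enum and the trace_tlb_flush_kind tracepoint are made-up
names rather than existing APIs:

    /* sketch: one helper that both counts and (optionally) traces */
    typedef enum {
        TLB_FLUSH_SELF,
        TLB_FLUSH_ASYNC,
        TLB_FLUSH_SYNCED,
    } TLBFlushKind;

    static inline void tlb_flush_account(CPUState *cpu, TLBFlushKind kind)
    {
        CPUArchState *env = cpu->env_ptr;

        atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);
        trace_tlb_flush_kind(cpu->cpu_index, kind);
    }

The counter stays cheap and always-on, while the tracepoint only costs
anything when it is enabled.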

> ---
>  include/exec/cpu-defs.h   |  1 +
>  include/exec/cputlb.h     |  3 +--
>  accel/tcg/cputlb.c        | 17 ++++++++++++++---
>  accel/tcg/translate-all.c |  2 +-
>  4 files changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index bc8e7f8..e43ff83 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -137,6 +137,7 @@ typedef struct CPUIOTLBEntry {
>      CPUTLBEntry tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE];               \
>      CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE];                    \
>      CPUIOTLBEntry iotlb_v[NB_MMU_MODES][CPU_VTLB_SIZE];                 \
> +    size_t tlb_flush_count;                                             \
>      target_ulong tlb_flush_addr;                                        \
>      target_ulong tlb_flush_mask;                                        \
>      target_ulong vtlb_index;                                            \
> diff --git a/include/exec/cputlb.h b/include/exec/cputlb.h
> index 3f94178..c91db21 100644
> --- a/include/exec/cputlb.h
> +++ b/include/exec/cputlb.h
> @@ -23,7 +23,6 @@
>  /* cputlb.c */
>  void tlb_protect_code(ram_addr_t ram_addr);
>  void tlb_unprotect_code(ram_addr_t ram_addr);
> -extern int tlb_flush_count;
> -
> +size_t tlb_flush_count(void);
>  #endif
>  #endif
> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> index 85635ae..9377110 100644
> --- a/accel/tcg/cputlb.c
> +++ b/accel/tcg/cputlb.c
> @@ -92,8 +92,18 @@ static void flush_all_helper(CPUState *src, run_on_cpu_func fn,
>      }
>  }
>
> -/* statistics */
> -int tlb_flush_count;
> +size_t tlb_flush_count(void)
> +{
> +    CPUState *cpu;
> +    size_t count = 0;
> +
> +    CPU_FOREACH(cpu) {
> +        CPUArchState *env = cpu->env_ptr;
> +
> +        count += atomic_read(&env->tlb_flush_count);
> +    }
> +    return count;
> +}
>
>  /* This is OK because CPU architectures generally permit an
>   * implementation to drop entries from the TLB at any time, so
> @@ -112,7 +122,8 @@ static void tlb_flush_nocheck(CPUState *cpu)
>      }
>
>      assert_cpu_is_self(cpu);
> -    tlb_debug("(count: %d)\n", tlb_flush_count++);
> +    atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);
> +    tlb_debug("(count: %zu)\n", tlb_flush_count());
>
>      tb_lock();
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index f768681..a936a5f 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1909,7 +1909,7 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>              atomic_read(&tcg_ctx.tb_ctx.tb_flush_count));
>      cpu_fprintf(f, "TB invalidate count %d\n",
>              tcg_ctx.tb_ctx.tb_phys_invalidate_count);
> -    cpu_fprintf(f, "TLB flush count     %d\n", tlb_flush_count);
> +    cpu_fprintf(f, "TLB flush count     %zu\n", tlb_flush_count());
>      tcg_dump_info(f, cpu_fprintf);
>
>      tb_unlock();


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 04/22] tcg: fix corruption of code_time profiling counter upon tb_flush
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 04/22] tcg: fix corruption of code_time profiling counter upon tb_flush Emilio G. Cota
  2017-07-09 20:01   ` Richard Henderson
@ 2017-07-12 14:36   ` Alex Bennée
  2017-07-12 17:09   ` Philippe Mathieu-Daudé
  2 siblings, 0 replies; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 14:36 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Whenever there is an overflow in code_gen_buffer (e.g. we run out
> of space in it and have to flush it), the code_time profiling counter
> ends up with an invalid value (that is, code_time -= profile_getclock(),
> without later on getting += profile_getclock() due to the goto).
>
> Fix it by using the ti variable, so that we only update code_time
> when there is no overflow. Note that in case there is an overflow
> we fail to account for the elapsed code generation time, but this is quite rare
> so we can probably live with it.
>
<snip>
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  accel/tcg/translate-all.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index a936a5f..72ce445 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1293,7 +1293,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>  #ifdef CONFIG_PROFILER
>      tcg_ctx.tb_count++;
>      tcg_ctx.interm_time += profile_getclock() - ti;
> -    tcg_ctx.code_time -= profile_getclock();
> +    ti = profile_getclock();
>  #endif
>
>      /* ??? Overflow could be handled better here.  In particular, we
> @@ -1311,7 +1311,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>      }
>
>  #ifdef CONFIG_PROFILER
> -    tcg_ctx.code_time += profile_getclock();
> +    tcg_ctx.code_time += profile_getclock() - ti;
>      tcg_ctx.code_in_len += tb->size;
>      tcg_ctx.code_out_len += gen_code_size;
>      tcg_ctx.search_out_len += search_size;


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 05/22] exec-all: fix typos in TranslationBlock's documentation
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 05/22] exec-all: fix typos in TranslationBlock's documentation Emilio G. Cota
@ 2017-07-12 14:37   ` Alex Bennée
  0 siblings, 0 replies; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 14:37 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  include/exec/exec-all.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index 8096d64..8326e7d 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -341,7 +341,7 @@ struct TranslationBlock {
>      /* The following data are used to directly call another TB from
>       * the code of this one. This can be done either by emitting direct or
>       * indirect native jump instructions. These jumps are reset so that the TB
> -     * just continue its execution. The TB can be linked to another one by
> +     * just continues its execution. The TB can be linked to another one by
>       * setting one of the jump targets (or patching the jump instruction). Only
>       * two of such jumps are supported.
>       */
> @@ -352,7 +352,7 @@ struct TranslationBlock {
>  #else
>      uintptr_t jmp_target_addr[2]; /* target address for indirect jump */
>  #endif
> -    /* Each TB has an assosiated circular list of TBs jumping to this one.
> +    /* Each TB has an associated circular list of TBs jumping to this one.
>       * jmp_list_first points to the first TB jumping to this one.
>       * jmp_list_next is used to point to the next TB in a list.
>       * Since each TB can have two jumps, it can participate in two lists.


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 06/22] translate-all: make have_tb_lock static
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 06/22] translate-all: make have_tb_lock static Emilio G. Cota
  2017-07-09 20:02   ` Richard Henderson
@ 2017-07-12 14:38   ` Alex Bennée
  2017-07-12 18:22     ` Emilio G. Cota
  1 sibling, 1 reply; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 14:38 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> It is only used by this object, and it's not exported to any other.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

I was almost caught out by the name re-use in cpu-exec.c ;-)

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  accel/tcg/translate-all.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 72ce445..2fa9f65 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -133,7 +133,7 @@ TCGContext tcg_ctx;
>  bool parallel_cpus;
>
>  /* translation block context */
> -__thread int have_tb_lock;
> +static __thread int have_tb_lock;
>
>  static void page_table_config_init(void)
>  {


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 07/22] tcg/i386: constify tcg_target_callee_save_regs
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 07/22] tcg/i386: constify tcg_target_callee_save_regs Emilio G. Cota
  2017-07-09 20:02   ` Richard Henderson
@ 2017-07-12 14:39   ` Alex Bennée
  2017-07-12 17:00   ` Philippe Mathieu-Daudé
  2 siblings, 0 replies; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 14:39 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  tcg/i386/tcg-target.inc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
> index 01e3b4e..06df01a 100644
> --- a/tcg/i386/tcg-target.inc.c
> +++ b/tcg/i386/tcg-target.inc.c
> @@ -2514,7 +2514,7 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
>      return NULL;
>  }
>
> -static int tcg_target_callee_save_regs[] = {
> +static const int tcg_target_callee_save_regs[] = {
>  #if TCG_TARGET_REG_BITS == 64
>      TCG_REG_RBP,
>      TCG_REG_RBX,


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 08/22] tcg/mips: constify tcg_target_callee_save_regs
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 08/22] tcg/mips: " Emilio G. Cota
  2017-07-09 20:02   ` Richard Henderson
@ 2017-07-12 14:39   ` Alex Bennée
  2017-07-12 17:01   ` Philippe Mathieu-Daudé
  2 siblings, 0 replies; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 14:39 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  tcg/mips/tcg-target.inc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tcg/mips/tcg-target.inc.c b/tcg/mips/tcg-target.inc.c
> index 8cff9a6..790b4fc 100644
> --- a/tcg/mips/tcg-target.inc.c
> +++ b/tcg/mips/tcg-target.inc.c
> @@ -2323,7 +2323,7 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
>      return NULL;
>  }
>
> -static int tcg_target_callee_save_regs[] = {
> +static const int tcg_target_callee_save_regs[] = {
>      TCG_REG_S0,       /* used for the global env (TCG_AREG0) */
>      TCG_REG_S1,
>      TCG_REG_S2,


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 11/22] translate-all: use a binary search tree to track TBs in TBContext
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 11/22] translate-all: use a binary search tree to track TBs in TBContext Emilio G. Cota
  2017-07-09 20:33   ` Richard Henderson
@ 2017-07-12 15:10   ` Alex Bennée
  2017-07-12 18:38     ` Emilio G. Cota
  1 sibling, 1 reply; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 15:10 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> This is a prerequisite for having threads generate code on separate
> buffers, which will help scalability when booting multiple cores
> under MTTCG.
>
> For this we need a new field (.tc_size) in TranslationBlock to keep
> track of the size of the translated code. This field is added into
> a 4-byte hole that the previous commit created.
>
<snip>
>
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index fd20bca..673b26d 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -320,14 +320,25 @@ struct TranslationBlock {
>      uint16_t size;      /* size of target code for this block (1 <=
>                             size <= TARGET_PAGE_SIZE) */
>      uint16_t icount;
> -    uint32_t cflags;    /* compile flags */
> +    /*
> +     * @tc_size must be kept right after @tc_ptr to facilitate TB lookups in a
> +     * binary search tree -- see struct ptr_size.
> +     * We use an anonymous struct here to avoid updating all calling code,
> +     * which would be quite a lot of churn.
> +     * The only reason to bring @cflags into the anonymous struct is to
> +     * avoid inducing a hole in TranslationBlock.
> +     */
> +    struct {
> +        void *tc_ptr;    /* pointer to the translated code */
> +        uint32_t tc_size; /* size of translated code for this block */
> +
> +        uint32_t cflags;    /* compile flags */
>  #define CF_COUNT_MASK  0x7fff
>  #define CF_LAST_IO     0x8000 /* Last insn may be an IO access.  */
>  #define CF_NOCACHE     0x10000 /* To be freed after execution */
>  #define CF_USE_ICOUNT  0x20000
>  #define CF_IGNORE_ICOUNT 0x40000 /* Do not generate icount code */
> -
> -    void *tc_ptr;    /* pointer to the translated code */
> +    };

Why not just have a named structure for this so there isn't ambiguity
between struct ptr_size and this thing?
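
For example -- just a sketch of what I mean, with tb_tc as a made-up
name:

    /* sketch: name the pointer+size pair that both users share */
    struct tb_tc {
        void *ptr;       /* pointer to the translated code */
        uint32_t size;   /* size of the translated code */
    };

TranslationBlock would then embed a "struct tb_tc tc;" member, and the
tree comparator and tb_find_pc could build a struct tb_tc key directly,
instead of relying on two structs happening to share a layout.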

>      uint8_t *tc_search;  /* pointer to search data */
>      /* original tb when cflags has CF_NOCACHE */
>      struct TranslationBlock *orig_tb;
> diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
> index 25c2afe..1fa8dcc 100644
> --- a/include/exec/tb-context.h
> +++ b/include/exec/tb-context.h
> @@ -31,10 +31,8 @@ typedef struct TBContext TBContext;
>
>  struct TBContext {
>
> -    TranslationBlock **tbs;
> +    GTree *tb_tree;
>      struct qht htable;
> -    size_t tbs_size;
> -    int nb_tbs;
>      /* any access to the tbs or the page table must use this lock */
>      QemuMutex tb_lock;
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index 2fa9f65..aa3a08b 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -752,6 +752,47 @@ static inline void *alloc_code_gen_buffer(void)
>  }
>  #endif /* USE_STATIC_CODE_GEN_BUFFER, WIN32, POSIX */
>
> +struct ptr_size {
> +    void *ptr;
> +    uint32_t size;
> +};

If we must have this you need a comment here alluding to the layout of
TranslationBlock.

> +
> +/* compare a single @ptr and a ptr_size @s */
> +static int ptr_size_cmp(const void *ptr, const struct ptr_size *s)
> +{
> +    if (ptr >= s->ptr + s->size) {
> +        return 1;
> +    } else if (ptr < s->ptr) {
> +        return -1;
> +    }
> +    return 0;
> +}
> +
> +static gint tc_ptr_cmp(gconstpointer ap, gconstpointer bp)
> +{
> +    const struct ptr_size *a = ap;
> +    const struct ptr_size *b = bp;
> +
> +    /*
> +     * When both sizes are set, we know this isn't a lookup and therefore
> +     * the two buffers are non-overlapping: a pointer comparison will do.
> +     * This is the most likely case: every TB must be inserted; lookups
> +     * are a lot less frequent.
> +     */
> +    if (likely(a->size && b->size)) {
> +        return a->ptr - b->ptr;
> +    }
> +    /*
> +     * In a lookup, one of the two .size fields is set to 0.
> +     * From the glib sources we see that @ap is always the lookup key. However
> +     * the docs provide no guarantee, so we just mark this case as likely.
> +     */
> +    if (likely(a->size == 0)) {
> +        return ptr_size_cmp(a->ptr, b);
> +    }
> +    return ptr_size_cmp(b->ptr, a);
> +}
> +
>  static inline void code_gen_alloc(size_t tb_size)
>  {
>      tcg_ctx.code_gen_buffer_size = size_code_gen_buffer(tb_size);
> @@ -760,15 +801,7 @@ static inline void code_gen_alloc(size_t tb_size)
>          fprintf(stderr, "Could not allocate dynamic translator buffer\n");
>          exit(1);
>      }
> -
> -    /* size this conservatively -- realloc later if needed */
> -    tcg_ctx.tb_ctx.tbs_size =
> -        tcg_ctx.code_gen_buffer_size / CODE_GEN_AVG_BLOCK_SIZE / 8;
> -    if (unlikely(!tcg_ctx.tb_ctx.tbs_size)) {
> -        tcg_ctx.tb_ctx.tbs_size = 64 * 1024;
> -    }
> -    tcg_ctx.tb_ctx.tbs = g_new(TranslationBlock *, tcg_ctx.tb_ctx.tbs_size);
> -
> +    tcg_ctx.tb_ctx.tb_tree = g_tree_new(tc_ptr_cmp);
>      qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
>  }
>
> @@ -805,7 +838,6 @@ void tcg_exec_init(unsigned long tb_size)
>  static TranslationBlock *tb_alloc(target_ulong pc)
>  {
>      TranslationBlock *tb;
> -    TBContext *ctx;
>
>      assert_tb_locked();
>
> @@ -813,12 +845,6 @@ static TranslationBlock *tb_alloc(target_ulong pc)
>      if (unlikely(tb == NULL)) {
>          return NULL;
>      }
> -    ctx = &tcg_ctx.tb_ctx;
> -    if (unlikely(ctx->nb_tbs == ctx->tbs_size)) {
> -        ctx->tbs_size *= 2;
> -        ctx->tbs = g_renew(TranslationBlock *, ctx->tbs, ctx->tbs_size);
> -    }
> -    ctx->tbs[ctx->nb_tbs++] = tb;
>      return tb;
>  }
>
> @@ -827,16 +853,7 @@ void tb_free(TranslationBlock *tb)
>  {
>      assert_tb_locked();
>
> -    /* In practice this is mostly used for single use temporary TB
> -       Ignore the hard cases and just back up if this TB happens to
> -       be the last one generated.  */
> -    if (tcg_ctx.tb_ctx.nb_tbs > 0 &&
> -            tb == tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) {
> -        size_t struct_size = ROUND_UP(sizeof(*tb), qemu_icache_linesize);
> -
> -        tcg_ctx.code_gen_ptr = tb->tc_ptr - struct_size;
> -        tcg_ctx.tb_ctx.nb_tbs--;
> -    }
> +    g_tree_remove(tcg_ctx.tb_ctx.tb_tree, &tb->tc_ptr);
>  }

This function should be renamed as we never attempt to free (and it was
pretty half-hearted before). Maybe tb_remove or tb_deref?

>
>  static inline void invalidate_page_bitmap(PageDesc *p)
> @@ -884,6 +901,8 @@ static void page_flush_tb(void)
>  /* flush all the translation blocks */
>  static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
>  {
> +    int nb_tbs __attribute__((unused));
> +
>      tb_lock();
>
>      /* If it is already been done on request of another CPU,
> @@ -894,11 +913,12 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
>      }
>
>  #if defined(DEBUG_TB_FLUSH)
> +    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
>      printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%ld\n",
>             (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
> -           tcg_ctx.tb_ctx.nb_tbs, tcg_ctx.tb_ctx.nb_tbs > 0 ?
> +           nb_tbs, nb_tbs > 0 ?
>             ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)) /
> -           tcg_ctx.tb_ctx.nb_tbs : 0);
> +           nb_tbs : 0);
>  #endif
>      if ((unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer)
>          > tcg_ctx.code_gen_buffer_size) {
> @@ -909,7 +929,10 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
>          cpu_tb_jmp_cache_clear(cpu);
>      }
>
> -    tcg_ctx.tb_ctx.nb_tbs = 0;
> +    /* Increment the refcount first so that destroy acts as a reset */
> +    g_tree_ref(tcg_ctx.tb_ctx.tb_tree);
> +    g_tree_destroy(tcg_ctx.tb_ctx.tb_tree);
> +
>      qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
>      page_flush_tb();
>
> @@ -1309,6 +1332,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>      if (unlikely(search_size < 0)) {
>          goto buffer_overflow;
>      }
> +    tb->tc_size = gen_code_size;
>
>  #ifdef CONFIG_PROFILER
>      tcg_ctx.code_time += profile_getclock() - ti;
> @@ -1359,6 +1383,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>       * through the physical hash table and physical page list.
>       */
>      tb_link_page(tb, phys_pc, phys_page2);
> +    g_tree_insert(tcg_ctx.tb_ctx.tb_tree, &tb->tc_ptr, tb);
>      return tb;
>  }
>
> @@ -1627,37 +1652,16 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
>  }
>  #endif
>
> -/* find the TB 'tb' such that tb[0].tc_ptr <= tc_ptr <
> -   tb[1].tc_ptr. Return NULL if not found */
> +/*
> + * Find the TB 'tb' such that
> + * tb->tc_ptr <= tc_ptr < tb->tc_ptr + tb->tc_size
> + * Return NULL if not found.
> + */
>  static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
>  {
> -    int m_min, m_max, m;
> -    uintptr_t v;
> -    TranslationBlock *tb;
> +    struct ptr_size s = { .ptr = (void *)tc_ptr };
>
> -    if (tcg_ctx.tb_ctx.nb_tbs <= 0) {
> -        return NULL;
> -    }
> -    if (tc_ptr < (uintptr_t)tcg_ctx.code_gen_buffer ||
> -        tc_ptr >= (uintptr_t)tcg_ctx.code_gen_ptr) {
> -        return NULL;
> -    }
> -    /* binary search (cf Knuth) */
> -    m_min = 0;
> -    m_max = tcg_ctx.tb_ctx.nb_tbs - 1;
> -    while (m_min <= m_max) {
> -        m = (m_min + m_max) >> 1;
> -        tb = tcg_ctx.tb_ctx.tbs[m];
> -        v = (uintptr_t)tb->tc_ptr;
> -        if (v == tc_ptr) {
> -            return tb;
> -        } else if (tc_ptr < v) {
> -            m_max = m - 1;
> -        } else {
> -            m_min = m + 1;
> -        }
> -    }
> -    return tcg_ctx.tb_ctx.tbs[m_max];
> +    return g_tree_lookup(tcg_ctx.tb_ctx.tb_tree, &s);

Other than the anonymous struct ickiness I'm all for using library code here.

>  }
>
>  #if !defined(CONFIG_USER_ONLY)
> @@ -1842,63 +1846,67 @@ static void print_qht_statistics(FILE *f, fprintf_function cpu_fprintf,
>      g_free(hgram);
>  }
>
> +struct tb_tree_stats {
> +    size_t target_size;
> +    size_t max_target_size;
> +    size_t direct_jmp_count;
> +    size_t direct_jmp2_count;
> +    size_t cross_page;
> +};
> +
> +static gboolean tb_tree_stats_iter(gpointer key, gpointer value, gpointer data)
> +{
> +    const TranslationBlock *tb = value;
> +    struct tb_tree_stats *tst = data;
> +
> +    tst->target_size += tb->size;
> +    if (tb->size > tst->max_target_size) {
> +        tst->max_target_size = tb->size;
> +    }
> +    if (tb->page_addr[1] != -1) {
> +        tst->cross_page++;
> +    }
> +    if (tb->jmp_reset_offset[0] != TB_JMP_RESET_OFFSET_INVALID) {
> +        tst->direct_jmp_count++;
> +        if (tb->jmp_reset_offset[1] != TB_JMP_RESET_OFFSET_INVALID) {
> +            tst->direct_jmp2_count++;
> +        }
> +    }
> +    return false;
> +}
> +
>  void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>  {
> -    int i, target_code_size, max_target_code_size;
> -    int direct_jmp_count, direct_jmp2_count, cross_page;
> -    TranslationBlock *tb;
> +    struct tb_tree_stats tst = {};
>      struct qht_stats hst;
> +    int nb_tbs;
>
>      tb_lock();
>
> -    target_code_size = 0;
> -    max_target_code_size = 0;
> -    cross_page = 0;
> -    direct_jmp_count = 0;
> -    direct_jmp2_count = 0;
> -    for (i = 0; i < tcg_ctx.tb_ctx.nb_tbs; i++) {
> -        tb = tcg_ctx.tb_ctx.tbs[i];
> -        target_code_size += tb->size;
> -        if (tb->size > max_target_code_size) {
> -            max_target_code_size = tb->size;
> -        }
> -        if (tb->page_addr[1] != -1) {
> -            cross_page++;
> -        }
> -        if (tb->jmp_reset_offset[0] != TB_JMP_RESET_OFFSET_INVALID) {
> -            direct_jmp_count++;
> -            if (tb->jmp_reset_offset[1] != TB_JMP_RESET_OFFSET_INVALID) {
> -                direct_jmp2_count++;
> -            }
> -        }
> -    }
> +    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
> +    g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
>      /* XXX: avoid using doubles ? */
>      cpu_fprintf(f, "Translation buffer state:\n");
>      cpu_fprintf(f, "gen code size       %td/%zd\n",
>                  tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
>                  tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer);
> -    cpu_fprintf(f, "TB count            %d\n", tcg_ctx.tb_ctx.nb_tbs);
> -    cpu_fprintf(f, "TB avg target size  %d max=%d bytes\n",
> -            tcg_ctx.tb_ctx.nb_tbs ? target_code_size /
> -                    tcg_ctx.tb_ctx.nb_tbs : 0,
> -            max_target_code_size);
> +    cpu_fprintf(f, "TB count            %d\n", nb_tbs);
> +    cpu_fprintf(f, "TB avg target size  %zu max=%zu bytes\n",
> +                nb_tbs ? tst.target_size / nb_tbs : 0,
> +                tst.max_target_size);
>      cpu_fprintf(f, "TB avg host size    %td bytes (expansion ratio: %0.1f)\n",
> -            tcg_ctx.tb_ctx.nb_tbs ? (tcg_ctx.code_gen_ptr -
> -                                     tcg_ctx.code_gen_buffer) /
> -                                     tcg_ctx.tb_ctx.nb_tbs : 0,
> -                target_code_size ? (double) (tcg_ctx.code_gen_ptr -
> -                                             tcg_ctx.code_gen_buffer) /
> -                                             target_code_size : 0);
> -    cpu_fprintf(f, "cross page TB count %d (%d%%)\n", cross_page,
> -            tcg_ctx.tb_ctx.nb_tbs ? (cross_page * 100) /
> -                                    tcg_ctx.tb_ctx.nb_tbs : 0);
> -    cpu_fprintf(f, "direct jump count   %d (%d%%) (2 jumps=%d %d%%)\n",
> -                direct_jmp_count,
> -                tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp_count * 100) /
> -                        tcg_ctx.tb_ctx.nb_tbs : 0,
> -                direct_jmp2_count,
> -                tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp2_count * 100) /
> -                        tcg_ctx.tb_ctx.nb_tbs : 0);
> +                nb_tbs ? (tcg_ctx.code_gen_ptr -
> +                          tcg_ctx.code_gen_buffer) / nb_tbs : 0,
> +                tst.target_size ? (double) (tcg_ctx.code_gen_ptr -
> +                                            tcg_ctx.code_gen_buffer) /
> +                                            tst.target_size : 0);
> +    cpu_fprintf(f, "cross page TB count %zu (%zu%%)\n", tst.cross_page,
> +            nb_tbs ? (tst.cross_page * 100) / nb_tbs : 0);
> +    cpu_fprintf(f, "direct jump count   %zu (%zu%%) (2 jumps=%zu %zu%%)\n",
> +                tst.direct_jmp_count,
> +                nb_tbs ? (tst.direct_jmp_count * 100) / nb_tbs : 0,
> +                tst.direct_jmp2_count,
> +                nb_tbs ? (tst.direct_jmp2_count * 100) / nb_tbs : 0);
>
>      qht_statistics_init(&tcg_ctx.tb_ctx.htable, &hst);
>      print_qht_statistics(f, cpu_fprintf, hst);


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 12/22] translate-all: report correct avg host TB size
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 12/22] translate-all: report correct avg host TB size Emilio G. Cota
@ 2017-07-12 15:25   ` Alex Bennée
  2017-07-12 18:45     ` Emilio G. Cota
  0 siblings, 1 reply; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 15:25 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Since commit 6e3b2bfd6 ("tcg: allocate TB structs before the
> corresponding translated code") we are not fully utilizing
> code_gen_buffer for translated code, and therefore are
> incorrectly reporting the amount of translated code as well as
> the average host TB size. Address this by:
>
> - Making the conscious choice of misreporting the total translated code;
>   doing otherwise would mislead users into thinking "-tb-size" is not
>   honoured.
>
> - Expanding tb_tree_stats to accurately count the bytes of translated code on
>   the host, and using this for reporting the average tb host size,
>   as well as the expansion ratio.
>
> In the future we might want to consider reporting the accurate numbers for
> the total translated code, together with a "bookkeeping/overhead" field to
> account for the TB structs.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/translate-all.c | 34 ++++++++++++++++++++++++----------
>  1 file changed, 24 insertions(+), 10 deletions(-)
>
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index aa3a08b..aa71292 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -898,9 +898,20 @@ static void page_flush_tb(void)
>      }
>  }
>
> +static __attribute__((unused))
> +gboolean tb_host_size_iter(gpointer key, gpointer value, gpointer data)
> +{
> +    const TranslationBlock *tb = value;
> +    size_t *size = data;
> +
> +    *size += tb->tc_size;
> +    return false;
> +}
> +

I think having the __attribute__ stuff is confusing. Why don't we just
do what the newer debug stuff does:

modified   accel/tcg/translate-all.c
@@ -66,6 +66,12 @@
 /* make various TB consistency checks */
 /* #define DEBUG_TB_CHECK */

+#if defined(DEBUG_TB_FLUSH)
+#define DEBUG_TB_FLUSH_GATE 1
+#else
+#define DEBUG_TB_FLUSH_GATE 0
+#endif
+
 #if !defined(CONFIG_USER_ONLY)
 /* TB consistency checks only implemented for usermode emulation.  */
 #undef DEBUG_TB_CHECK
@@ -948,8 +954,7 @@ static void page_flush_tb(void)
     }
 }

-static __attribute__((unused))
-gboolean tb_host_size_iter(gpointer key, gpointer value, gpointer data)
+static gboolean tb_host_size_iter(gpointer key, gpointer value, gpointer data)
 {
     const TranslationBlock *tb = value;
     size_t *size = data;
@@ -958,11 +963,22 @@ gboolean tb_host_size_iter(gpointer key, gpointer value, gpointer data)
     return false;
 }

+static void dump_tb_sizes(void)
+{
+    if (DEBUG_TB_FLUSH_GATE) {
+        size_t host_size = 0;
+        int nb_tbs;
+
+        g_tree_foreach(tb_ctx.tb_tree, tb_host_size_iter, &host_size);
+        nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
+        fprintf(stderr, "qemu: flush code_size=%zu nb_tbs=%d avg_tb_size=%zu\n",
+                tcg_code_size(), nb_tbs, nb_tbs > 0 ? host_size / nb_tbs : 0);
+    }
+}
+
 /* flush all the translation blocks */
 static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
 {
-    size_t host_size __attribute__((unused)) = 0;
-    int nb_tbs __attribute__((unused));

     tb_lock();

@@ -973,12 +989,7 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
         goto done;
     }

-#if defined(DEBUG_TB_FLUSH)
-    g_tree_foreach(tb_ctx.tb_tree, tb_host_size_iter, &host_size);
-    nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
-    fprintf(stderr, "qemu: flush code_size=%zu nb_tbs=%d avg_tb_size=%zu\n",
-           tcg_code_size(), nb_tbs, nb_tbs > 0 ? host_size / nb_tbs : 0);
-#endif
+    dump_tb_sizes();


Which will a) ensure all the debug code is compiled even when not
enabled, and b) stop the compiler bitching at you when it optimises
stuff away.

Better?

--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 13/22] tcg: take tb_ctx out of TCGContext
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 13/22] tcg: take tb_ctx out of TCGContext Emilio G. Cota
@ 2017-07-12 15:27   ` Alex Bennée
  0 siblings, 0 replies; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 15:27 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Before TCGContext is made thread-local.
>
> Reviewed-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  include/exec/tb-context.h |  2 ++
>  tcg/tcg.h                 |  2 --
>  accel/tcg/cpu-exec.c      |  2 +-
>  accel/tcg/translate-all.c | 57 +++++++++++++++++++++++------------------------
>  linux-user/main.c         |  6 ++---
>  5 files changed, 34 insertions(+), 35 deletions(-)
>
> diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
> index 1fa8dcc..1d41202 100644
> --- a/include/exec/tb-context.h
> +++ b/include/exec/tb-context.h
> @@ -41,4 +41,6 @@ struct TBContext {
>      int tb_phys_invalidate_count;
>  };
>
> +extern TBContext tb_ctx;
> +
>  #endif
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index da78721..ad2d959 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -706,8 +706,6 @@ struct TCGContext {
>      /* Threshold to flush the translated code buffer.  */
>      void *code_gen_highwater;
>
> -    TBContext tb_ctx;
> -
>      /* Track which vCPU triggers events */
>      CPUState *cpu;                      /* *_trans */
>      TCGv_env tcg_env;                   /* *_exec  */
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index 3581618..54ecae2 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -323,7 +323,7 @@ TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
>      phys_pc = get_page_addr_code(desc.env, pc);
>      desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
>      h = tb_hash_func(phys_pc, pc, flags);
> -    return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
> +    return qht_lookup(&tb_ctx.htable, tb_cmp, &desc, h);
>  }
>
>  static inline TranslationBlock *tb_find(CPUState *cpu,
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index aa71292..84e19d9 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -130,6 +130,7 @@ static void *l1_map[V_L1_MAX_SIZE];
>
>  /* code generation context */
>  TCGContext tcg_ctx;
> +TBContext tb_ctx;
>  bool parallel_cpus;
>
>  /* translation block context */
> @@ -161,7 +162,7 @@ static void page_table_config_init(void)
>  void tb_lock(void)
>  {
>      assert_tb_unlocked();
> -    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
> +    qemu_mutex_lock(&tb_ctx.tb_lock);
>      have_tb_lock++;
>  }
>
> @@ -169,13 +170,13 @@ void tb_unlock(void)
>  {
>      assert_tb_locked();
>      have_tb_lock--;
> -    qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
> +    qemu_mutex_unlock(&tb_ctx.tb_lock);
>  }
>
>  void tb_lock_reset(void)
>  {
>      if (have_tb_lock) {
> -        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
> +        qemu_mutex_unlock(&tb_ctx.tb_lock);
>          have_tb_lock = 0;
>      }
>  }
> @@ -801,15 +802,15 @@ static inline void code_gen_alloc(size_t tb_size)
>          fprintf(stderr, "Could not allocate dynamic translator buffer\n");
>          exit(1);
>      }
> -    tcg_ctx.tb_ctx.tb_tree = g_tree_new(tc_ptr_cmp);
> -    qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
> +    tb_ctx.tb_tree = g_tree_new(tc_ptr_cmp);
> +    qemu_mutex_init(&tb_ctx.tb_lock);
>  }
>
>  static void tb_htable_init(void)
>  {
>      unsigned int mode = QHT_MODE_AUTO_RESIZE;
>
> -    qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
> +    qht_init(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
>  }
>
>  /* Must be called before using the QEMU cpus. 'tb_size' is the size
> @@ -853,7 +854,7 @@ void tb_free(TranslationBlock *tb)
>  {
>      assert_tb_locked();
>
> -    g_tree_remove(tcg_ctx.tb_ctx.tb_tree, &tb->tc_ptr);
> +    g_tree_remove(tb_ctx.tb_tree, &tb->tc_ptr);
>  }
>
>  static inline void invalidate_page_bitmap(PageDesc *p)
> @@ -919,13 +920,13 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
>      /* If it is already been done on request of another CPU,
>       * just retry.
>       */
> -    if (tcg_ctx.tb_ctx.tb_flush_count != tb_flush_count.host_int) {
> +    if (tb_ctx.tb_flush_count != tb_flush_count.host_int) {
>          goto done;
>      }
>
>  #if defined(DEBUG_TB_FLUSH)
> -    g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_host_size_iter, &host_size);
> -    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
> +    g_tree_foreach(tb_ctx.tb_tree, tb_host_size_iter, &host_size);
> +    nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
>      printf("qemu: flush code_size=%ld nb_tbs=%d avg_tb_size=%zu\n",
>             (unsigned long)(tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer),
>             nb_tbs, nb_tbs > 0 ? host_size / nb_tbs : 0);
> @@ -940,17 +941,16 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
>      }
>
>      /* Increment the refcount first so that destroy acts as a reset */
> -    g_tree_ref(tcg_ctx.tb_ctx.tb_tree);
> -    g_tree_destroy(tcg_ctx.tb_ctx.tb_tree);
> +    g_tree_ref(tb_ctx.tb_tree);
> +    g_tree_destroy(tb_ctx.tb_tree);
>
> -    qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
> +    qht_reset_size(&tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
>      page_flush_tb();
>
>      tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
>      /* XXX: flush processor icache at this point if cache flush is
>         expensive */
> -    atomic_mb_set(&tcg_ctx.tb_ctx.tb_flush_count,
> -                  tcg_ctx.tb_ctx.tb_flush_count + 1);
> +    atomic_mb_set(&tb_ctx.tb_flush_count, tb_ctx.tb_flush_count + 1);
>
>  done:
>      tb_unlock();
> @@ -959,7 +959,7 @@ done:
>  void tb_flush(CPUState *cpu)
>  {
>      if (tcg_enabled()) {
> -        unsigned tb_flush_count = atomic_mb_read(&tcg_ctx.tb_ctx.tb_flush_count);
> +        unsigned tb_flush_count = atomic_mb_read(&tb_ctx.tb_flush_count);
>          async_safe_run_on_cpu(cpu, do_tb_flush,
>                                RUN_ON_CPU_HOST_INT(tb_flush_count));
>      }
> @@ -986,7 +986,7 @@ do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
>  static void tb_invalidate_check(target_ulong address)
>  {
>      address &= TARGET_PAGE_MASK;
> -    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address);
> +    qht_iter(&tb_ctx.htable, do_tb_invalidate_check, &address);
>  }
>
>  static void
> @@ -1006,7 +1006,7 @@ do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
>  /* verify that all the pages have correct rights for code */
>  static void tb_page_check(void)
>  {
> -    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
> +    qht_iter(&tb_ctx.htable, do_tb_page_check, NULL);
>  }
>
>  #endif
> @@ -1105,7 +1105,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>      /* remove the TB from the hash list */
>      phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
>      h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> -    qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
> +    qht_remove(&tb_ctx.htable, tb, h);
>
>      /* remove the TB from the page list */
>      if (tb->page_addr[0] != page_addr) {
> @@ -1134,7 +1134,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
>      /* suppress any remaining jumps to this TB */
>      tb_jmp_unlink(tb);
>
> -    tcg_ctx.tb_ctx.tb_phys_invalidate_count++;
> +    tb_ctx.tb_phys_invalidate_count++;
>  }
>
>  #ifdef CONFIG_SOFTMMU
> @@ -1250,7 +1250,7 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
>
>      /* add in the hash table */
>      h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> -    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
> +    qht_insert(&tb_ctx.htable, tb, h);
>
>  #ifdef DEBUG_TB_CHECK
>      tb_page_check();
> @@ -1393,7 +1393,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>       * through the physical hash table and physical page list.
>       */
>      tb_link_page(tb, phys_pc, phys_page2);
> -    g_tree_insert(tcg_ctx.tb_ctx.tb_tree, &tb->tc_ptr, tb);
> +    g_tree_insert(tb_ctx.tb_tree, &tb->tc_ptr, tb);
>      return tb;
>  }
>
> @@ -1671,7 +1671,7 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
>  {
>      struct ptr_size s = { .ptr = (void *)tc_ptr };
>
> -    return g_tree_lookup(tcg_ctx.tb_ctx.tb_tree, &s);
> +    return g_tree_lookup(tb_ctx.tb_tree, &s);
>  }
>
>  #if !defined(CONFIG_USER_ONLY)
> @@ -1895,8 +1895,8 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>
>      tb_lock();
>
> -    nb_tbs = g_tree_nnodes(tcg_ctx.tb_ctx.tb_tree);
> -    g_tree_foreach(tcg_ctx.tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
> +    nb_tbs = g_tree_nnodes(tb_ctx.tb_tree);
> +    g_tree_foreach(tb_ctx.tb_tree, tb_tree_stats_iter, &tst);
>      /* XXX: avoid using doubles ? */
>      cpu_fprintf(f, "Translation buffer state:\n");
>      /*
> @@ -1922,15 +1922,14 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
>                  tst.direct_jmp2_count,
>                  nb_tbs ? (tst.direct_jmp2_count * 100) / nb_tbs : 0);
>
> -    qht_statistics_init(&tcg_ctx.tb_ctx.htable, &hst);
> +    qht_statistics_init(&tb_ctx.htable, &hst);
>      print_qht_statistics(f, cpu_fprintf, hst);
>      qht_statistics_destroy(&hst);
>
>      cpu_fprintf(f, "\nStatistics:\n");
>      cpu_fprintf(f, "TB flush count      %u\n",
> -            atomic_read(&tcg_ctx.tb_ctx.tb_flush_count));
> -    cpu_fprintf(f, "TB invalidate count %d\n",
> -            tcg_ctx.tb_ctx.tb_phys_invalidate_count);
> +                atomic_read(&tb_ctx.tb_flush_count));
> +    cpu_fprintf(f, "TB invalidate count %d\n", tb_ctx.tb_phys_invalidate_count);
>      cpu_fprintf(f, "TLB flush count     %zu\n", tlb_flush_count());
>      tcg_dump_info(f, cpu_fprintf);
>
> diff --git a/linux-user/main.c b/linux-user/main.c
> index ad03c9e..630c73d 100644
> --- a/linux-user/main.c
> +++ b/linux-user/main.c
> @@ -114,7 +114,7 @@ int cpu_get_pic_interrupt(CPUX86State *env)
>  void fork_start(void)
>  {
>      cpu_list_lock();
> -    qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
> +    qemu_mutex_lock(&tb_ctx.tb_lock);
>      mmap_fork_start();
>  }
>
> @@ -130,11 +130,11 @@ void fork_end(int child)
>                  QTAILQ_REMOVE(&cpus, cpu, node);
>              }
>          }
> -        qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
> +        qemu_mutex_init(&tb_ctx.tb_lock);
>          qemu_init_cpu_list();
>          gdbserver_fork(thread_cpu);
>      } else {
> -        qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
> +        qemu_mutex_unlock(&tb_ctx.tb_lock);
>          cpu_list_unlock();
>      }
>  }


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 14/22] tcg: take .helpers out of TCGContext
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 14/22] tcg: take .helpers " Emilio G. Cota
  2017-07-09 20:35   ` Richard Henderson
@ 2017-07-12 15:28   ` Alex Bennée
  1 sibling, 0 replies; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 15:28 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Before TCGContext is made thread-local.
>
> The hash table becomes read-only after it is filled in,
> so we can save space by keeping just a global pointer to it.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  tcg/tcg.h |  2 --
>  tcg/tcg.c | 10 +++++-----
>  2 files changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index ad2d959..4f57878 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -663,8 +663,6 @@ struct TCGContext {
>
>      tcg_insn_unit *code_ptr;
>
> -    GHashTable *helpers;
> -
>  #ifdef CONFIG_PROFILER
>      /* profiling info */
>      int64_t tb_count1;
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 3559829..d9b083a 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -319,6 +319,7 @@ typedef struct TCGHelperInfo {
>  static const TCGHelperInfo all_helpers[] = {
>  #include "exec/helper-tcg.h"
>  };
> +static GHashTable *helper_table;
>
>  static int indirect_reg_alloc_order[ARRAY_SIZE(tcg_target_reg_alloc_order)];
>  static void process_op_defs(TCGContext *s);
> @@ -329,7 +330,6 @@ void tcg_context_init(TCGContext *s)
>      TCGOpDef *def;
>      TCGArgConstraint *args_ct;
>      int *sorted_args;
> -    GHashTable *helper_table;
>
>      memset(s, 0, sizeof(*s));
>      s->nb_globals = 0;
> @@ -357,7 +357,7 @@ void tcg_context_init(TCGContext *s)
>
>      /* Register helpers.  */
>      /* Use g_direct_hash/equal for direct pointer comparisons on func.  */
> -    s->helpers = helper_table = g_hash_table_new(NULL, NULL);
> +    helper_table = g_hash_table_new(NULL, NULL);
>
>      for (i = 0; i < ARRAY_SIZE(all_helpers); ++i) {
>          g_hash_table_insert(helper_table, (gpointer)all_helpers[i].func,
> @@ -761,7 +761,7 @@ void tcg_gen_callN(TCGContext *s, void *func, TCGArg ret,
>      unsigned sizemask, flags;
>      TCGHelperInfo *info;
>
> -    info = g_hash_table_lookup(s->helpers, (gpointer)func);
> +    info = g_hash_table_lookup(helper_table, (gpointer)func);
>      flags = info->flags;
>      sizemask = info->sizemask;
>
> @@ -990,8 +990,8 @@ static char *tcg_get_arg_str_idx(TCGContext *s, char *buf,
>  static inline const char *tcg_find_helper(TCGContext *s, uintptr_t val)
>  {
>      const char *ret = NULL;
> -    if (s->helpers) {
> -        TCGHelperInfo *info = g_hash_table_lookup(s->helpers, (gpointer)val);
> +    if (helper_table) {
> +        TCGHelperInfo *info = g_hash_table_lookup(helper_table, (gpointer)val);
>          if (info) {
>              ret = info->name;
>          }


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 15/22] gen-icount: fold exitreq_label into TCGContext
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 15/22] gen-icount: fold exitreq_label into TCGContext Emilio G. Cota
  2017-07-09 20:36   ` Richard Henderson
@ 2017-07-12 15:29   ` Alex Bennée
  1 sibling, 0 replies; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 15:29 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Before we make TCGContext thread-local.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  include/exec/gen-icount.h | 7 +++----
>  tcg/tcg.h                 | 2 ++
>  2 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/include/exec/gen-icount.h b/include/exec/gen-icount.h
> index 9b3cb14..489aff7 100644
> --- a/include/exec/gen-icount.h
> +++ b/include/exec/gen-icount.h
> @@ -6,13 +6,12 @@
>  /* Helpers for instruction counting code generation.  */
>
>  static int icount_start_insn_idx;
> -static TCGLabel *exitreq_label;
>
>  static inline void gen_tb_start(TranslationBlock *tb)
>  {
>      TCGv_i32 count, imm;
>
> -    exitreq_label = gen_new_label();
> +    tcg_ctx.exitreq_label = gen_new_label();
>      if (tb->cflags & CF_USE_ICOUNT) {
>          count = tcg_temp_local_new_i32();
>      } else {
> @@ -34,7 +33,7 @@ static inline void gen_tb_start(TranslationBlock *tb)
>          tcg_temp_free_i32(imm);
>      }
>
> -    tcg_gen_brcondi_i32(TCG_COND_LT, count, 0, exitreq_label);
> +    tcg_gen_brcondi_i32(TCG_COND_LT, count, 0, tcg_ctx.exitreq_label);
>
>      if (tb->cflags & CF_USE_ICOUNT) {
>          tcg_gen_st16_i32(count, tcg_ctx.tcg_env,
> @@ -52,7 +51,7 @@ static inline void gen_tb_end(TranslationBlock *tb, int num_insns)
>          tcg_set_insn_param(icount_start_insn_idx, 1, num_insns);
>      }
>
> -    gen_set_label(exitreq_label);
> +    gen_set_label(tcg_ctx.exitreq_label);
>      tcg_gen_exit_tb((uintptr_t)tb + TB_EXIT_REQUESTED);
>
>      /* Terminate the linked list.  */
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 4f57878..534ead5 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -711,6 +711,8 @@ struct TCGContext {
>      /* The TCGBackendData structure is private to tcg-target.inc.c.  */
>      struct TCGBackendData *be;
>
> +    TCGLabel *exitreq_label;
> +
>      TCGTempSet free_temps[TCG_TYPE_COUNT * 2];
>      TCGTemp temps[TCG_MAX_TEMPS]; /* globals first, temps after */


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 16/22] tcg: keep a list of TCGContext's
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 16/22] tcg: keep a list of TCGContext's Emilio G. Cota
  2017-07-09 20:43   ` Richard Henderson
@ 2017-07-12 15:32   ` Alex Bennée
  1 sibling, 0 replies; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 15:32 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Before we make TCGContext thread-local.

Really? Not allocated for each thread that needs it?

> Once that is done, iterating
> over all TCG contexts will be quite useful; for instance we
> will need it to gather profiling info from each TCGContext.

How often will we need to do this, and how performance-sensitive will it be?

>
> A possible alternative would be to keep an array of TCGContext pointers.
> However, this option is not that trivial, because vCPUs are spawned in
> parallel. So let's just keep it simple and use a list protected by a
> lock.

Given we are protecting it with a lock, I don't see why it wouldn't be
equally trivial.
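For instance, an untested sketch (the array bound and names here are
invented) that claims slots with an atomic counter, so parallel vCPU
spawning needs no lock at all:

  static TCGContext *tcg_ctxs[64]; /* made-up bound */
  static unsigned int n_tcg_ctxs;

  void tcg_register_thread(void)
  {
      unsigned int n = atomic_fetch_inc(&n_tcg_ctxs);

      g_assert(n < ARRAY_SIZE(tcg_ctxs));
      atomic_set(&tcg_ctxs[n], &tcg_ctx);
  }

Readers would then walk tcg_ctxs[0..n_tcg_ctxs) without synchronization,
just as the comment on the list version describes.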

>
> Note that this lock will soon be used for other purposes, hence the
> generic "tcg_lock" name.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  tcg/tcg.h |  3 +++
>  tcg/tcg.c | 23 +++++++++++++++++++++++
>  2 files changed, 26 insertions(+)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 534ead5..8e1cd45 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -725,6 +725,8 @@ struct TCGContext {
>
>      uint16_t gen_insn_end_off[TCG_MAX_INSNS];
>      target_ulong gen_insn_data[TCG_MAX_INSNS][TARGET_INSN_START_WORDS];
> +
> +    QSIMPLEQ_ENTRY(TCGContext) entry;
>  };
>
>  extern TCGContext tcg_ctx;
> @@ -773,6 +775,7 @@ static inline void *tcg_malloc(int size)
>
>  void tcg_context_init(TCGContext *s);
>  void tcg_prologue_init(TCGContext *s);
> +void tcg_register_thread(void);
>  void tcg_func_start(TCGContext *s);
>
>  int tcg_gen_code(TCGContext *s, TranslationBlock *tb);
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index d9b083a..0da7c61 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -115,7 +115,16 @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
>  static void tcg_out_tb_init(TCGContext *s);
>  static bool tcg_out_tb_finalize(TCGContext *s);
>
> +static QemuMutex tcg_lock;
>
> +/*
> + * List of TCGContext's in the system. Protected by tcg_lock.
> + * Once vcpu threads have been inited, there will be no further modifications
> + * to the list (vcpu threads never return) so we can safely traverse the list
> + * without synchronization.
> + */
> +static QSIMPLEQ_HEAD(, TCGContext) ctx_list =
> +    QSIMPLEQ_HEAD_INITIALIZER(ctx_list);
>
>  static TCGRegSet tcg_target_available_regs[2];
>  static TCGRegSet tcg_target_call_clobber_regs;
> @@ -324,6 +333,17 @@ static GHashTable *helper_table;
>  static int indirect_reg_alloc_order[ARRAY_SIZE(tcg_target_reg_alloc_order)];
>  static void process_op_defs(TCGContext *s);
>
> +/*
> + * Child TCG threads, i.e. the ones that do not call tcg_context_init, must call
> + * this function before initiating translation.
> + */
> +void tcg_register_thread(void)
> +{
> +    qemu_mutex_lock(&tcg_lock);
> +    QSIMPLEQ_INSERT_TAIL(&ctx_list, &tcg_ctx, entry);
> +    qemu_mutex_unlock(&tcg_lock);
> +}
> +
>  void tcg_context_init(TCGContext *s)
>  {
>      int op, total_args, n, i;
> @@ -381,6 +401,9 @@ void tcg_context_init(TCGContext *s)
>      for (; i < ARRAY_SIZE(tcg_target_reg_alloc_order); ++i) {
>          indirect_reg_alloc_order[i] = tcg_target_reg_alloc_order[i];
>      }
> +
> +    qemu_mutex_init(&tcg_lock);
> +    tcg_register_thread();
>  }
>
>  /*


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 18/22] tcg: define TCG_HIGHWATER
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 18/22] tcg: define TCG_HIGHWATER Emilio G. Cota
  2017-07-09 20:46   ` Richard Henderson
@ 2017-07-12 15:33   ` Alex Bennée
  1 sibling, 0 replies; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 15:33 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Will come in handy very soon.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  tcg/tcg.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index c19c473..2f003a0 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -115,6 +115,8 @@ static int tcg_target_const_match(tcg_target_long val, TCGType type,
>  static void tcg_out_tb_init(TCGContext *s);
>  static bool tcg_out_tb_finalize(TCGContext *s);
>
> +#define TCG_HIGHWATER 1024
> +
>  static QemuMutex tcg_lock;
>
>  /*
> @@ -453,7 +455,7 @@ void tcg_prologue_init(TCGContext *s)
>      /* Compute a high-water mark, at which we voluntarily flush the buffer
>         and start over.  The size here is arbitrary, significantly larger
>         than we expect the code generation for any one opcode to require.  */
> -    s->code_gen_highwater = s->code_gen_buffer + (total_size - 1024);
> +    s->code_gen_highwater = s->code_gen_buffer + (total_size - TCG_HIGHWATER);
>
>      tcg_register_jit(s->code_gen_buffer, total_size);
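For reference, the mark is consumed per-op in tcg_gen_code(), roughly
(a sketch from memory, not the exact code):

  /* any op that starts below the high-water mark cannot overrun the
   * buffer, so one cheap pointer check per op suffices */
  if (unlikely((void *)s->code_ptr > s->code_gen_highwater)) {
      return -1;  /* caller flushes code_gen_buffer and retries */
  }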


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 19/22] tcg: introduce tcg_context_clone
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 19/22] tcg: introduce tcg_context_clone Emilio G. Cota
  2017-07-09 20:48   ` Richard Henderson
@ 2017-07-12 16:02   ` Alex Bennée
  2017-07-12 17:25     ` Richard Henderson
  1 sibling, 1 reply; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 16:02 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> Before we make TCGContext thread-local.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  tcg/tcg.h |  1 +
>  tcg/tcg.c | 14 ++++++++++++++
>  2 files changed, 15 insertions(+)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 2a64ee2..be5f3fd 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -778,6 +778,7 @@ static inline void *tcg_malloc(int size)
>  }
>
>  void tcg_context_init(TCGContext *s);
> +void tcg_context_clone(TCGContext *s);
>  void tcg_prologue_init(TCGContext *s);
>  void tcg_register_thread(void);
>  void tcg_func_start(TCGContext *s);
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 2f003a0..8febf53 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -117,6 +117,7 @@ static bool tcg_out_tb_finalize(TCGContext *s);
>
>  #define TCG_HIGHWATER 1024
>
> +static const TCGContext *tcg_init_ctx;
>  static QemuMutex tcg_lock;
>
>  /*
> @@ -353,6 +354,7 @@ void tcg_context_init(TCGContext *s)
>      TCGArgConstraint *args_ct;
>      int *sorted_args;
>
> +    tcg_init_ctx = s;
>      memset(s, 0, sizeof(*s));
>      s->nb_globals = 0;
>
> @@ -409,6 +411,18 @@ void tcg_context_init(TCGContext *s)
>  }
>
>  /*
> + * Clone the initial TCGContext. Used by TCG threads to copy the TCGContext
> + * set up by their parent thread via tcg_context_init().
> + */
> +void tcg_context_clone(TCGContext *s)
> +{
> +    if (unlikely(tcg_init_ctx == NULL || tcg_init_ctx == s)) {
> +        tcg_abort();
> +    }
> +    memcpy(s, tcg_init_ctx, sizeof(*s));
> +}
> +

Why a copy approach as opposed to a plain tcg_context_new() with the
appropriate init cleanup? Is it just because of the extra shared
stuff in:

#if defined(CONFIG_SOFTMMU)
    /* There's no guest base to take into account, so go ahead and
       initialize the prologue now.  */
    tcg_prologue_init(&tcg_ctx);
    code_gen_set_region_size(&tcg_ctx);
#endif

?

> +/*
>   * Allocate TBs right before their corresponding translated code, making
>   * sure that TBs and code are on different cache lines.
>   */


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 07/22] tcg/i386: constify tcg_target_callee_save_regs
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 07/22] tcg/i386: constify tcg_target_callee_save_regs Emilio G. Cota
  2017-07-09 20:02   ` Richard Henderson
  2017-07-12 14:39   ` Alex Bennée
@ 2017-07-12 17:00   ` Philippe Mathieu-Daudé
  2 siblings, 0 replies; 95+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-07-12 17:00 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/09/2017 04:49 AM, Emilio G. Cota wrote:
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

> ---
>   tcg/i386/tcg-target.inc.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
> index 01e3b4e..06df01a 100644
> --- a/tcg/i386/tcg-target.inc.c
> +++ b/tcg/i386/tcg-target.inc.c
> @@ -2514,7 +2514,7 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
>       return NULL;
>   }
>   
> -static int tcg_target_callee_save_regs[] = {
> +static const int tcg_target_callee_save_regs[] = {
>   #if TCG_TARGET_REG_BITS == 64
>       TCG_REG_RBP,
>       TCG_REG_RBX,
> 

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 08/22] tcg/mips: constify tcg_target_callee_save_regs
  2017-07-09  7:50 ` [Qemu-devel] [PATCH 08/22] tcg/mips: " Emilio G. Cota
  2017-07-09 20:02   ` Richard Henderson
  2017-07-12 14:39   ` Alex Bennée
@ 2017-07-12 17:01   ` Philippe Mathieu-Daudé
  2 siblings, 0 replies; 95+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-07-12 17:01 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/09/2017 04:50 AM, Emilio G. Cota wrote:
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

> ---
>   tcg/mips/tcg-target.inc.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tcg/mips/tcg-target.inc.c b/tcg/mips/tcg-target.inc.c
> index 8cff9a6..790b4fc 100644
> --- a/tcg/mips/tcg-target.inc.c
> +++ b/tcg/mips/tcg-target.inc.c
> @@ -2323,7 +2323,7 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
>       return NULL;
>   }
>   
> -static int tcg_target_callee_save_regs[] = {
> +static const int tcg_target_callee_save_regs[] = {
>       TCG_REG_S0,       /* used for the global env (TCG_AREG0) */
>       TCG_REG_S1,
>       TCG_REG_S2,
> 

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 04/22] tcg: fix corruption of code_time profiling counter upon tb_flush
  2017-07-09  7:49 ` [Qemu-devel] [PATCH 04/22] tcg: fix corruption of code_time profiling counter upon tb_flush Emilio G. Cota
  2017-07-09 20:01   ` Richard Henderson
  2017-07-12 14:36   ` Alex Bennée
@ 2017-07-12 17:09   ` Philippe Mathieu-Daudé
  2 siblings, 0 replies; 95+ messages in thread
From: Philippe Mathieu-Daudé @ 2017-07-12 17:09 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel

On 07/09/2017 04:49 AM, Emilio G. Cota wrote:
> Whenever there is an overflow in code_gen_buffer (e.g. we run out
> of space in it and have to flush it), the code_time profiling counter
> ends up with an invalid value (that is, code_time -= profile_getclock()
> runs, but the matching += profile_getclock() is skipped due to the goto).
[...]
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>

Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>

> ---
>   accel/tcg/translate-all.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
> index a936a5f..72ce445 100644
> --- a/accel/tcg/translate-all.c
> +++ b/accel/tcg/translate-all.c
> @@ -1293,7 +1293,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>   #ifdef CONFIG_PROFILER
>       tcg_ctx.tb_count++;
>       tcg_ctx.interm_time += profile_getclock() - ti;
> -    tcg_ctx.code_time -= profile_getclock();
> +    ti = profile_getclock();
>   #endif
>   
>       /* ??? Overflow could be handled better here.  In particular, we
> @@ -1311,7 +1311,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>       }
>   
>   #ifdef CONFIG_PROFILER
> -    tcg_ctx.code_time += profile_getclock();
> +    tcg_ctx.code_time += profile_getclock() - ti;
>       tcg_ctx.code_in_len += tb->size;
>       tcg_ctx.code_out_len += gen_code_size;
>       tcg_ctx.search_out_len += search_size;
> 

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 19/22] tcg: introduce tcg_context_clone
  2017-07-12 16:02   ` Alex Bennée
@ 2017-07-12 17:25     ` Richard Henderson
  2017-07-12 17:47       ` Alex Bennée
  0 siblings, 1 reply; 95+ messages in thread
From: Richard Henderson @ 2017-07-12 17:25 UTC (permalink / raw)
  To: Alex Bennée, Emilio G. Cota; +Cc: qemu-devel

On 07/12/2017 06:02 AM, Alex Bennée wrote:
> Why a copy approach as opposed to a plain tcg_context_new() with the
> appropriate init cleanup?

We need the globals that were set up by the target/foo/translate.c code to be 
the same for each thread.  That's what makes it easier to copy the struct.
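For example, target/arm/translate.c does roughly this at init time (a
from-memory sketch, not the exact code):

  for (i = 0; i < 16; i++) {
      cpu_R[i] = tcg_global_mem_new_i32(cpu_env,
                                        offsetof(CPUARMState, regs[i]),
                                        regnames[i]);
  }

A byte copy of the parent context hands every thread those same global
temps without re-running the per-target setup.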


r~

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 19/22] tcg: introduce tcg_context_clone
  2017-07-12 17:25     ` Richard Henderson
@ 2017-07-12 17:47       ` Alex Bennée
  0 siblings, 0 replies; 95+ messages in thread
From: Alex Bennée @ 2017-07-12 17:47 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Emilio G. Cota, qemu-devel


Richard Henderson <rth@twiddle.net> writes:

> On 07/12/2017 06:02 AM, Alex Bennée wrote:
>> Why a copy approach as opposed to a plain tcg_context_new() with the
>> appropriate init cleanup?
>
> We need the globals that were set up by the target/foo/translate.c
> code to be the same for each thread.  That's what makes it easier to
> copy the struct.

Ahh I see. Maybe expand the comment for the function a little then?


--
Alex Bennée

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 03/22] cputlb: bring back tlb_flush_count under !TLB_DEBUG
  2017-07-12 13:26   ` Alex Bennée
@ 2017-07-12 18:19     ` Emilio G. Cota
  0 siblings, 0 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-12 18:19 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, Richard Henderson

On Wed, Jul 12, 2017 at 14:26:36 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
(snip)
> > This patch does the latter by embedding tlb_flush_count in CPUArchState.
> > The global count is then easily obtained by iterating over the CPU list.
> >
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> 
> As it actually fixes unintentional breakage:
> 
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> 
> That said I'm not sure if this number alone is helpful given the range
> of flushes we have. Really from a performance point of view we should
> differentiate between inline per-vCPU flushes and the cross-vCPU
> flushes of both asynchronous and synced varieties.
> 
> I had a go at this using QEMUs tracing infrastructure:
> 
>   https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg04076.html
> 
> But I guess the ideal way would be something that both keeps counters
> and optionally enables tracepoints.

Yeah the counters in my patch are there to fix the breakage while
not hurting scalability in MTTCG.

Having those counters always on + the tracers in your patchset
for more detailed info seems reasonable to me.
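Something like this, say (a sketch only; the trace event name is
invented):

  void tlb_flush(CPUState *cpu)
  {
      CPUArchState *env = cpu->env_ptr;

      /* always-on, per-vCPU counter: cheap, no cross-vCPU contention */
      atomic_set(&env->tlb_flush_count, env->tlb_flush_count + 1);
      /* optional detail via tracing; a no-op unless the event is enabled */
      trace_tlb_flush_self(cpu->cpu_index);
      /* ... perform the actual flush ... */
  }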

Maybe it's time to push to get those tracer changes in?

		Emilio

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 06/22] translate-all: make have_tb_lock static
  2017-07-12 14:38   ` Alex Bennée
@ 2017-07-12 18:22     ` Emilio G. Cota
  0 siblings, 0 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-12 18:22 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, Richard Henderson

On Wed, Jul 12, 2017 at 15:38:28 +0100, Alex Bennée wrote:
> 
> Emilio G. Cota <cota@braap.org> writes:
> 
> > It is only used by this object, and it's not exported to any other.
> >
> > Signed-off-by: Emilio G. Cota <cota@braap.org>
> 
> I was almost caught out by the name re-use in cpu-exec.c ;-)
> 
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

Yes that's very unfortunate. In v2 I'll add a patch to rename it.

Thanks,

		E.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 11/22] translate-all: use a binary search tree to track TBs in TBContext
  2017-07-12 15:10   ` Alex Bennée
@ 2017-07-12 18:38     ` Emilio G. Cota
  0 siblings, 0 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-12 18:38 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, Richard Henderson

On Wed, Jul 12, 2017 at 16:10:15 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
(snip)
> > diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> > index fd20bca..673b26d 100644
> > --- a/include/exec/exec-all.h
> > +++ b/include/exec/exec-all.h
> > @@ -320,14 +320,25 @@ struct TranslationBlock {
> >      uint16_t size;      /* size of target code for this block (1 <=
> >                             size <= TARGET_PAGE_SIZE) */
> >      uint16_t icount;
> > -    uint32_t cflags;    /* compile flags */
> > +    /*
> > +     * @tc_size must be kept right after @tc_ptr to facilitate TB lookups in a
> > +     * binary search tree -- see struct ptr_size.
> > +     * We use an anonymous struct here to avoid updating all calling code,
> > +     * which would be quite a lot of churn.
> > +     * The only reason to bring @cflags into the anonymous struct is to
> > +     * avoid inducing a hole in TranslationBlock.
> > +     */
> > +    struct {
> > +        void *tc_ptr;    /* pointer to the translated code */
> > +        uint32_t tc_size; /* size of translated code for this block */
> > +
> > +        uint32_t cflags;    /* compile flags */
> >  #define CF_COUNT_MASK  0x7fff
> >  #define CF_LAST_IO     0x8000 /* Last insn may be an IO access.  */
> >  #define CF_NOCACHE     0x10000 /* To be freed after execution */
> >  #define CF_USE_ICOUNT  0x20000
> >  #define CF_IGNORE_ICOUNT 0x40000 /* Do not generate icount code */
> > -
> > -    void *tc_ptr;    /* pointer to the translated code */
> > +    };
> 
> Why not just have a named structure for this so there isn't ambiguity
> between struct ptr_size and this thing?

Yeah I did v2 of this patch yesterday. Turns out using an intermediate
struct here (and in the comparison code) doesn't result in as much churn
as I expected, so I went with that.
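For reference, the intermediate struct looks something like this (field
names may still change in v2):

  /* the (ptr, size) pair the GTree is keyed on */
  struct tb_tc {
      void *ptr;      /* pointer to the translated code */
      uint32_t size;  /* size of the translated code */
  };

with a 'struct tb_tc tc;' member embedded in TranslationBlock, so the
tree's comparison function can work on (ptr, size) directly.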

(snip)
> > @@ -827,16 +853,7 @@ void tb_free(TranslationBlock *tb)
> >  {
> >      assert_tb_locked();
> >
> > -    /* In practice this is mostly used for single use temporary TB
> > -       Ignore the hard cases and just back up if this TB happens to
> > -       be the last one generated.  */
> > -    if (tcg_ctx.tb_ctx.nb_tbs > 0 &&
> > -            tb == tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) {
> > -        size_t struct_size = ROUND_UP(sizeof(*tb), qemu_icache_linesize);
> > -
> > -        tcg_ctx.code_gen_ptr = tb->tc_ptr - struct_size;
> > -        tcg_ctx.tb_ctx.nb_tbs--;
> > -    }
> > +    g_tree_remove(tcg_ctx.tb_ctx.tb_tree, &tb->tc_ptr);
> >  }
> 
> This function should be renamed as we never attempt to free (and it was
> pretty half-hearted before). Maybe tb_remove or tb_deref?

Good point. I like tb_remove.

		Emilio

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 12/22] translate-all: report correct avg host TB size
  2017-07-12 15:25   ` Alex Bennée
@ 2017-07-12 18:45     ` Emilio G. Cota
  0 siblings, 0 replies; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-12 18:45 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, Richard Henderson

On Wed, Jul 12, 2017 at 16:25:45 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
(snip)
> I think having the __attribute__ stuff is confusing. Why don't we just
> do what the newer debug stuff does:
> 
> modified   accel/tcg/translate-all.c
> @@ -66,6 +66,12 @@
>  /* make various TB consistency checks */
>  /* #define DEBUG_TB_CHECK */
> 
> +#if defined(DEBUG_TB_FLUSH)
> +#define DEBUG_TB_FLUSH_GATE 1
> +#else
> +#define DEBUG_TB_FLUSH_GATE 0
> +#endif
(snip)
> Which will a) ensure all the debug code is compiled even when not
> enabled and b) ensure the compiler won't bitch at you when it optimises stuff
> away.
> 
> Better?

Much better! The unused attribute was there to achieve this, but what
you propose is cleaner.
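For the record, usage then becomes a plain C if that the compiler folds
away when the gate is 0, while still type-checking the body:

  if (DEBUG_TB_FLUSH_GATE) {
      printf("qemu: flush code_size=%td\n",
             tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer);
  }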

		E.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 09/22] exec-all: shrink tb->invalid to uint8_t
  2017-07-12  0:53       ` Richard Henderson
@ 2017-07-12 20:48         ` Emilio G. Cota
  2017-07-12 23:06           ` Richard Henderson
  0 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-12 20:48 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Tue, Jul 11, 2017 at 14:53:00 -1000, Richard Henderson wrote:
> On 07/10/2017 01:57 PM, Emilio G. Cota wrote:
> >>What I would prefer to do is generalize tb->cflags.  Those values *do*
> >>affect how we compile the TB, and yet we don't take them into account.  So I
> >>think it would be a good idea to feed that into the TB hash.
> >
> >I'm having trouble seeing how this could work.
> >Where do we get the "current" values from the current state, i.e.
> >the ones we need to generate the hash and perform comparisons?
> >In particular:
> >- CF_COUNT_MASK: just use CF_COUNT_MASK?
> >- CF_LAST_IO: ?
> >- CF_NOCACHE: always 0 I guess
> 
> All of these are set by cpu_io_recompile as needed.
> They are all clear for normal TBs.
> 
> >- CF_USE/IGNORE_ICOUNT: ?
> CF_IGNORE_ICOUNT probably shouldn't exist.  Probably the callers of
> tb_gen_code should simply set CF_USE_ICOUNT properly if use_icount is true,
> rather than having two flags control the same feature.
> 
> At which point CF_USE_ICOUNT should be set iff use_icount is true.
> 
> Likewise CF_PARALLEL would be set iff parallel_cpus is true, except for
> within cpu_exec_step_atomic where we would always use 0 (because that's the
> whole point of that function).

Would it be OK for this series to just start with CF_PARALLEL? I'm not
too familiar with how icount mode recompiles code, and I'm now on
patch 27 of v2 and still have quite a few patches to go through.

In v2 I have a helper function to mask which bits from cflags to
use--it would be easy to add more flags there. See a preview below
(tb_lookup__cpu_state is introduced earlier in the series.)
The current WIP v2 tree is here:
  https://github.com/cota/qemu/commits/multi-tcg-v2-2017-07-12

Thanks,

		Emilio


commit 6a55d5225a708f1c8eea263a71c8ca3cb5d40bf0
Author: Emilio G. Cota <cota@braap.org>
Date:   Tue Jul 11 14:29:37 2017 -0400

    tcg: bring parallel_cpus to tb->cflags and use it for TB hashing
    
    This allows us to avoid flushing TB's when parallel_cpus is set.
    
    Note that the declaration of parallel_cpus is brought to exec-all.h
    so that the inlines can be defined there. The inlines use an unnecessary
    temp variable that is there just to make it easier to add more bits
    to the mask in the future.
    
    Signed-off-by: Emilio G. Cota <cota@braap.org>

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index c9f27f9..f770e15 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -327,6 +327,7 @@ struct TranslationBlock {
 #define CF_USE_ICOUNT  0x20000
 #define CF_IGNORE_ICOUNT 0x40000 /* Do not generate icount code */
 #define CF_INVALID     0x80000 /* Protected by tb_lock */
+#define CF_PARALLEL    0x100000 /* matches the parallel_cpus global */
 
     /* Per-vCPU dynamic tracing state used to generate this TB */
     uint32_t trace_vcpu_dstate;
@@ -370,6 +371,28 @@ struct TranslationBlock {
     uintptr_t jmp_list_first;
 };
 
+extern bool parallel_cpus;
+
+/* tb->cflags, masked for hashing/comparison */
+static inline uint32_t tb_cf_mask(const TranslationBlock *tb)
+{
+    uint32_t mask = 0;
+
+    mask |= CF_PARALLEL;
+    return tb->cflags & mask;
+}
+
+/* current cflags, masked for hashing/comparison */
+static inline uint32_t curr_cf_mask(void)
+{
+    uint32_t val = 0;
+
+    if (parallel_cpus) {
+        val |= CF_PARALLEL;
+    }
+    return val;
+}
+
 void tb_free(TranslationBlock *tb);
 void tb_flush(CPUState *cpu);
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
diff --git a/include/exec/tb-hash-xx.h b/include/exec/tb-hash-xx.h
index 6cd3022..747a9a6 100644
--- a/include/exec/tb-hash-xx.h
+++ b/include/exec/tb-hash-xx.h
@@ -48,8 +48,8 @@
  * xxhash32, customized for input variables that are not guaranteed to be
  * contiguous in memory.
  */
-static inline
-uint32_t tb_hash_func6(uint64_t a0, uint64_t b0, uint32_t e, uint32_t f)
+static inline uint32_t
+tb_hash_func7(uint64_t a0, uint64_t b0, uint32_t e, uint32_t f, uint32_t g)
 {
     uint32_t v1 = TB_HASH_XX_SEED + PRIME32_1 + PRIME32_2;
     uint32_t v2 = TB_HASH_XX_SEED + PRIME32_2;
@@ -78,7 +78,7 @@ uint32_t tb_hash_func6(uint64_t a0, uint64_t b0, uint32_t e, uint32_t f)
     v4 *= PRIME32_1;
 
     h32 = rol32(v1, 1) + rol32(v2, 7) + rol32(v3, 12) + rol32(v4, 18);
-    h32 += 24;
+    h32 += 28;
 
     h32 += e * PRIME32_3;
     h32  = rol32(h32, 17) * PRIME32_4;
@@ -86,6 +86,9 @@ uint32_t tb_hash_func6(uint64_t a0, uint64_t b0, uint32_t e, uint32_t f)
     h32 += f * PRIME32_3;
     h32  = rol32(h32, 17) * PRIME32_4;
 
+    h32 += g * PRIME32_3;
+    h32  = rol32(h32, 17) * PRIME32_4;
+
     h32 ^= h32 >> 15;
     h32 *= PRIME32_2;
     h32 ^= h32 >> 13;
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 17b5ee0..0526c4f 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -59,9 +59,9 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
 
 static inline
 uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint32_t flags,
-                      uint32_t trace_vcpu_dstate)
+                      uint32_t cf_mask, uint32_t trace_vcpu_dstate)
 {
-    return tb_hash_func6(phys_pc, pc, flags, trace_vcpu_dstate);
+    return tb_hash_func7(phys_pc, pc, flags, cf_mask, trace_vcpu_dstate);
 }
 
 #endif
diff --git a/tcg/tcg.h b/tcg/tcg.h
index da78721..96872f8 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -730,7 +730,6 @@ struct TCGContext {
 };
 
 extern TCGContext tcg_ctx;
-extern bool parallel_cpus;
 
 static inline void tcg_set_insn_param(int op_idx, int arg, TCGArg v)
 {
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 49c1ecf..2531b73 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -224,31 +224,27 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
 static void cpu_exec_step(CPUState *cpu)
 {
     CPUClass *cc = CPU_GET_CLASS(cpu);
-    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
     TranslationBlock *tb;
     target_ulong cs_base, pc;
     uint32_t flags;
 
-    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
     if (sigsetjmp(cpu->jmp_env, 0) == 0) {
-        mmap_lock();
-        tb_lock();
-        tb = tb_gen_code(cpu, pc, cs_base, flags,
-                         1 | CF_NOCACHE | CF_IGNORE_ICOUNT);
-        tb->orig_tb = NULL;
-        tb_unlock();
-        mmap_unlock();
+        tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags);
+        if (tb == NULL) {
+            mmap_lock();
+            tb_lock();
+            tb = tb_gen_code(cpu, pc, cs_base, flags,
+                             1 | CF_IGNORE_ICOUNT);
+            tb->orig_tb = NULL;
+            tb_unlock();
+            mmap_unlock();
+        }
 
         cc->cpu_exec_enter(cpu);
         /* execute the generated code */
-        trace_exec_tb_nocache(tb, pc);
+        trace_exec_tb(tb, pc);
         cpu_tb_exec(cpu, tb);
         cc->cpu_exec_exit(cpu);
-
-        tb_lock();
-        tb_phys_invalidate(tb, -1);
-        tb_free(tb);
-        tb_unlock();
     } else {
         /* We may have exited due to another problem here, so we need
          * to reset any tb_locks we may have taken but didn't release.
@@ -280,6 +276,7 @@ struct tb_desc {
     CPUArchState *env;
     tb_page_addr_t phys_page1;
     uint32_t flags;
+    uint32_t cf_mask;
     uint32_t trace_vcpu_dstate;
 };
 
@@ -292,6 +289,7 @@ static bool tb_cmp(const void *p, const void *d)
         tb->page_addr[0] == desc->phys_page1 &&
         tb->cs_base == desc->cs_base &&
         tb->flags == desc->flags &&
+        tb_cf_mask(tb) == desc->cf_mask &&
         tb->trace_vcpu_dstate == desc->trace_vcpu_dstate) {
         /* check next page if needed */
         if (tb->page_addr[1] == -1) {
@@ -320,11 +318,12 @@ static TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
     desc.env = (CPUArchState *)cpu->env_ptr;
     desc.cs_base = cs_base;
     desc.flags = flags;
+    desc.cf_mask = curr_cf_mask();
     desc.trace_vcpu_dstate = *cpu->trace_dstate;
     desc.pc = pc;
     phys_pc = get_page_addr_code(desc.env, pc);
     desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
-    h = tb_hash_func(phys_pc, pc, flags, *cpu->trace_dstate);
+    h = tb_hash_func(phys_pc, pc, flags, curr_cf_mask(), *cpu->trace_dstate);
     return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
 }
 
@@ -345,6 +344,7 @@ TranslationBlock *tb_lookup__cpu_state(CPUState *cpu, target_ulong *pc,
                tb->pc == *pc &&
                tb->cs_base == *cs_base &&
                tb->flags == *flags &&
+               tb_cf_mask(tb) == curr_cf_mask() &&
                tb->trace_vcpu_dstate == *cpu->trace_dstate)) {
         return tb;
     }
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 53fbb06..c9e8c1d 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1075,7 +1075,8 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
-    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->trace_vcpu_dstate);
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb_cf_mask(tb),
+                     tb->trace_vcpu_dstate);
     qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
 
     /*
@@ -1226,7 +1227,8 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
     }
 
     /* add in the hash table */
-    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->trace_vcpu_dstate);
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb_cf_mask(tb),
+                     tb->trace_vcpu_dstate);
     qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
 
 #ifdef DEBUG_TB_CHECK
@@ -1254,6 +1256,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     if (use_icount && !(cflags & CF_IGNORE_ICOUNT)) {
         cflags |= CF_USE_ICOUNT;
     }
+    if (parallel_cpus) {
+        cflags |= CF_PARALLEL;
+    }
 
     tb = tb_alloc(pc);
     if (unlikely(!tb)) {
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 925ae11..fa40f6c 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -6312,11 +6312,10 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
         sigprocmask(SIG_BLOCK, &sigmask, &info.sigmask);
 
         /* If this is our first additional thread, we need to ensure we
-         * generate code for parallel execution and flush old translations.
+         * generate code for parallel execution.
          */
         if (!parallel_cpus) {
             parallel_cpus = true;
-            tb_flush(cpu);
         }
 
         ret = pthread_create(&info.thread, &attr, clone_func, &info);
diff --git a/tests/qht-bench.c b/tests/qht-bench.c
index 11c1cec..4cabdfd 100644
--- a/tests/qht-bench.c
+++ b/tests/qht-bench.c
@@ -103,7 +103,7 @@ static bool is_equal(const void *obj, const void *userp)
 
 static inline uint32_t h(unsigned long v)
 {
-    return tb_hash_func6(v, 0, 0, 0);
+    return tb_hash_func7(v, 0, 0, 0, 0);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 09/22] exec-all: shrink tb->invalid to uint8_t
  2017-07-12 20:48         ` Emilio G. Cota
@ 2017-07-12 23:06           ` Richard Henderson
  2017-07-16  1:43             ` Emilio G. Cota
  0 siblings, 1 reply; 95+ messages in thread
From: Richard Henderson @ 2017-07-12 23:06 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel

On 07/12/2017 10:48 AM, Emilio G. Cota wrote:
> Would it be OK for this series to just start with CF_PARALLEL? I'm not
> too familiar with how icount mode recompiles code, and I'm now on
> patch 27 of v2 and still have quite a few patches to go through.

Certainly.

> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index 49c1ecf..2531b73 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -224,31 +224,27 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
>   static void cpu_exec_step(CPUState *cpu)
>   {
>       CPUClass *cc = CPU_GET_CLASS(cpu);
> -    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
>       TranslationBlock *tb;
>       target_ulong cs_base, pc;
>       uint32_t flags;
>   
> -    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
>       if (sigsetjmp(cpu->jmp_env, 0) == 0) {
> -        mmap_lock();
> -        tb_lock();
> -        tb = tb_gen_code(cpu, pc, cs_base, flags,
> -                         1 | CF_NOCACHE | CF_IGNORE_ICOUNT);
> -        tb->orig_tb = NULL;
> -        tb_unlock();
> -        mmap_unlock();
> +        tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags);
> +        if (tb == NULL) {
> +            mmap_lock();
> +            tb_lock();
> +            tb = tb_gen_code(cpu, pc, cs_base, flags,
> +                             1 | CF_IGNORE_ICOUNT);

You've got a problem here in that you're not including CF_COUNT_MASK in the 
hash and you dropped the flush when changing to parallel_cpus = true.  That 
means you could find an old TB with CF_COUNT > 1.
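Concretely, folding those bits into the mask would prevent such a stale
TB from ever matching a normal lookup, e.g. (sketch):

  static inline uint32_t tb_cf_mask(const TranslationBlock *tb)
  {
      return tb->cflags & (CF_PARALLEL | CF_COUNT_MASK);
  }

since a normal lookup compares with count bits of zero, while the stale
TB has a non-zero count.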

Not required for this patch set, but what I'd like to see eventually is

   (1) cpu_exec_step merged into cpu_exec_step_atomic for clarity.
   (2) callers of tb_gen_code add in CF_PARALLEL as needed; do not
       pick it up from parallel_cpus within tb_gen_code.
   (3) target/*/translate.c uses CF_PARALLEL instead of parallel_cpus.
   (4) cpu_exec_step_atomic does the tb lookup and code gen outside
       of the start_exclusive/end_exclusive lock.

And to that end I think there are some slightly different choices you can make 
now in order to reduce churn for that later.
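E.g. for (2), call sites would end up looking roughly like (sketch):

  /* cpu_exec_step_atomic(): never set CF_PARALLEL, whatever
   * parallel_cpus says -- that's the whole point of the function */
  tb = tb_gen_code(cpu, pc, cs_base, flags, 1 | CF_NOCACHE);

  /* ordinary callers pass the current mask explicitly */
  tb = tb_gen_code(cpu, pc, cs_base, flags, curr_cf_mask());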

> @@ -320,11 +318,12 @@ static TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
>       desc.env = (CPUArchState *)cpu->env_ptr;
>       desc.cs_base = cs_base;
>       desc.flags = flags;
> +    desc.cf_mask = curr_cf_mask();
>       desc.trace_vcpu_dstate = *cpu->trace_dstate;
>       desc.pc = pc;
>       phys_pc = get_page_addr_code(desc.env, pc);
>       desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
> -    h = tb_hash_func(phys_pc, pc, flags, *cpu->trace_dstate);
> +    h = tb_hash_func(phys_pc, pc, flags, curr_cf_mask(), *cpu->trace_dstate);
>       return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
>   }

E.g. this fundamental lookup function should have cf_mask passed in.

> @@ -1254,6 +1256,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
>       if (use_icount && !(cflags & CF_IGNORE_ICOUNT)) {
>           cflags |= CF_USE_ICOUNT;
>       }
> +    if (parallel_cpus) {
> +        cflags |= CF_PARALLEL;
> +    }

E.g. pass this in.  Callers using curr_cf_mask() should suffice where it's not 
obvious.

> diff --git a/linux-user/syscall.c b/linux-user/syscall.c
> index 925ae11..fa40f6c 100644
> --- a/linux-user/syscall.c
> +++ b/linux-user/syscall.c
> @@ -6312,11 +6312,10 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
>           sigprocmask(SIG_BLOCK, &sigmask, &info.sigmask);
>   
>           /* If this is our first additional thread, we need to ensure we
> -         * generate code for parallel execution and flush old translations.
> +         * generate code for parallel execution.
>            */
>           if (!parallel_cpus) {
>               parallel_cpus = true;
> -            tb_flush(cpu);

As per above, I think you must retain this for now.

I strongly suspect that it will be worthwhile forever, since we're pretty much 
guaranteed that none of the existing TBs will ever be used again.



r~

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 09/22] exec-all: shrink tb->invalid to uint8_t
  2017-07-12 23:06           ` Richard Henderson
@ 2017-07-16  1:43             ` Emilio G. Cota
  2017-07-16  7:22               ` Richard Henderson
  0 siblings, 1 reply; 95+ messages in thread
From: Emilio G. Cota @ 2017-07-16  1:43 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On Wed, Jul 12, 2017 at 13:06:23 -1000, Richard Henderson wrote:
> You've got a problem here in that you're not including CF_COUNT_MASK in the
> hash and you dropped the flush when changing to parallel_cpus = true.  That
> means you could find an old TB with CF_COUNT > 1.
> 
> Not required for this patch set, but what I'd like to see eventually is
> 
>   (1) cpu_exec_step merged into cpu_exec_step_atomic for clarity.
>   (2) callers of tb_gen_code add in CF_PARALLEL as needed; do not
>       pick it up from parallel_cpus within tb_gen_code.
>   (3) target/*/translate.c uses CF_PARALLEL instead of parallel_cpus.
>   (4) cpu_exec_step_atomic does the tb lookup and code gen outside
>       of the start_exclusive/end_exclusive lock.

I have implemented these for v2, which is almost ready to go. However,
I just noticed that tcg-op.c also checks parallel_cpus to decide whether
to emit a real atomic or a non-atomic op. Should we export the two
flavours of these ops to targets, since targets are the ones that can
check CF_PARALLEL? Or perhaps set a bit in the now-per-thread *tcg_ctx?

You can see the current v2 here:
  https://github.com/cota/qemu/tree/multi-tcg-v2-2017-07-15

Thanks,

		Emilio

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [Qemu-devel] [PATCH 09/22] exec-all: shrink tb->invalid to uint8_t
  2017-07-16  1:43             ` Emilio G. Cota
@ 2017-07-16  7:22               ` Richard Henderson
  0 siblings, 0 replies; 95+ messages in thread
From: Richard Henderson @ 2017-07-16  7:22 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel

On 07/15/2017 03:43 PM, Emilio G. Cota wrote:
> On Wed, Jul 12, 2017 at 13:06:23 -1000, Richard Henderson wrote:
>> You've got a problem here in that you're not including CF_COUNT_MASK in the
>> hash and you dropped the flush when changing to parallel_cpus = true.  That
>> means you could find an old TB with CF_COUNT > 1.
>>
>> Not required for this patch set, but what I'd like to see eventually is
>>
>>    (1) cpu_exec_step merged into cpu_exec_step_atomic for clarity.
>>    (2) callers of tb_gen_code add in CF_PARALLEL as needed; do not
>>        pick it up from parallel_cpus within tb_gen_code.
>>    (3) target/*/translate.c uses CF_PARALLEL instead of parallel_cpus.
>>    (4) cpu_exec_step_atomic does the tb lookup and code gen outside
>>        of the start_exclusive/end_exclusive lock.
> 
> I have implemented these for v2, which is almost ready to go. However,
> I just noticed that tcg-op.c also checks parallel_cpus to decide whether
> to emit a real atomic or a non-atomic op. Should we export the two
> flavours of these ops to targets, since targets are the ones that can
> check CF_PARALLEL? Or perhaps set a bit in the now-per-thread *tcg_ctx?

Jeez.  I forgot about that one.  A bit in tcg_ctx seems the best.
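Something along these lines, perhaps (the field name is just a
placeholder):

  /* at translation start, e.g. in tb_gen_code(): */
  tcg_ctx.tb_cflags = tb->cflags;

  /* in tcg/tcg-op.c, instead of testing the parallel_cpus global: */
  if (tcg_ctx.tb_cflags & CF_PARALLEL) {
      /* emit a call to the atomic helper */
  } else {
      /* emit the inline non-atomic sequence */
  }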


r~

^ permalink raw reply	[flat|nested] 95+ messages in thread
