* [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10
@ 2017-04-27  3:29 Emilio G. Cota
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 01/11] exec-all: export tb_htable_lookup Emilio G. Cota
                   ` (12 more replies)
  0 siblings, 13 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-04-27  3:29 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar

v3 for context: https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04795.html

Changes from v3:

- Added reviewed-by tags.

- Added a couple of suggested-by tags that I forgot to add in v3
  regarding lookup_and_goto_ptr and i386's implementation of goto_ptr.

- lookup_tb_ptr
  + Dropped the unnecessary exit_request check, as suggested by Paolo and
    Richard.
  + Only get the CPU state if we get a tb from the jmp_cache, as suggested
    by Richard.
  + Added tb_htable_lookup if we miss in tb_jmp_cache, as suggested by
    Richard. This requires an extra patch to export tb_htable_lookup.

- goto_ptr: add IMPL(has_goto_ptr), as pointed out by Richard.

- target/arm: added a comment about gen_jr. See the v3 thread for why
  it is needed.

- target/i386: use TCGV_UNUSED instead of (ab)using NULL on a TCGv,
  as suggested by Richard. Also took his suggestion to simplify
  the addition of jr + cs_base.
  To minimize churn I renamed gen_eob_worker to do_gen_eob_worker,
  which takes the newly added argument.

I have *not* re-run all experiments, because it takes several hours and
performance hasn't changed much from v3, as can be seen in these two charts:
* spec06int user-mode, test input, v2.9.0 baseline: http://imgur.com/ME2eMq1
* spec06int softmmu, test input, v3 baseline: http://imgur.com/Clolu9Z
The perf differences are mostly due to adding the htable check. Note that
its impact is small, since tb_jmp_cache has a hit rate in the high 90s (%).
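
As a rough illustration of why the extra htable check is cheap, here is a
minimal, self-contained sketch of a two-level lookup: a small direct-mapped
cache in front of a slower table. It is not QEMU code. struct block,
jmp_cache, slow_lookup() and the sizes are made-up stand-ins, a linear scan
replaces the hash table, and unlike the helper in this series the sketch
consults the slow path on every cache miss, including an empty slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a translated block; not QEMU's TranslationBlock. */
struct block {
    uint64_t pc;
    uint32_t flags;
};

#define JMP_CACHE_SIZE 16u /* illustrative; must be a power of two */
#define TABLE_SIZE 4u

/* Slow-path store. A linear scan stands in for the real hash table. */
static struct block table[TABLE_SIZE] = {
    { 0x1000, 0 }, { 0x1010, 0 }, { 0x2000, 1 }, { 0x2010, 1 },
};
static struct block *jmp_cache[JMP_CACHE_SIZE];
static unsigned slow_lookups; /* counts how often the slow path runs */

static unsigned cache_index(uint64_t pc)
{
    return (unsigned)(pc >> 4) & (JMP_CACHE_SIZE - 1);
}

static struct block *slow_lookup(uint64_t pc, uint32_t flags)
{
    slow_lookups++;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (table[i].pc == pc && table[i].flags == flags) {
            return &table[i];
        }
    }
    return NULL;
}

/* Fast path first; on a miss, fall back and refill the cache slot. */
static struct block *lookup(uint64_t pc, uint32_t flags)
{
    unsigned idx = cache_index(pc);
    struct block *b = jmp_cache[idx];

    assert(idx < JMP_CACHE_SIZE);
    if (b && b->pc == pc && b->flags == flags) {
        return b;
    }
    b = slow_lookup(pc, flags);
    if (b) {
        jmp_cache[idx] = b;
    }
    return b;
}
```

Calling lookup(0x1000, 0) twice takes the slow path only the first time.
The NULL return plays the role that the epilogue pointer plays in the helper.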

You can inspect/fetch the changes at:
  https://github.com/cota/qemu/tree/tcg-opt-v4

Thanks,

		Emilio


* [Qemu-devel] [PATCH v4 01/11] exec-all: export tb_htable_lookup
  2017-04-27  3:29 [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
@ 2017-04-27  3:29 ` Emilio G. Cota
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 02/11] tcg-runtime: add lookup_tb_ptr helper Emilio G. Cota
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-04-27  3:29 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c              | 6 ++----
 include/exec/exec-all.h | 2 ++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 63a56d0..5b181c1 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -309,10 +309,8 @@ static bool tb_cmp(const void *p, const void *d)
     return false;
 }
 
-static TranslationBlock *tb_htable_lookup(CPUState *cpu,
-                                          target_ulong pc,
-                                          target_ulong cs_base,
-                                          uint32_t flags)
+TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
+                                   target_ulong cs_base, uint32_t flags)
 {
     tb_page_addr_t phys_pc;
     struct tb_desc desc;
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index bcde1e6..87ae10b 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -368,6 +368,8 @@ struct TranslationBlock {
 void tb_free(TranslationBlock *tb);
 void tb_flush(CPUState *cpu);
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
+TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
+                                   target_ulong cs_base, uint32_t flags);
 
 #if defined(USE_DIRECT_JUMP)
 
-- 
2.7.4


* [Qemu-devel] [PATCH v4 02/11] tcg-runtime: add lookup_tb_ptr helper
  2017-04-27  3:29 [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 01/11] exec-all: export tb_htable_lookup Emilio G. Cota
@ 2017-04-27  3:29 ` Emilio G. Cota
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 03/11] tcg: introduce goto_ptr opcode Emilio G. Cota
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-04-27  3:29 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar

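Add a helper that looks up the TB for a guest address and returns its host
code pointer, or the TCG epilogue pointer if no matching TB is found.
This paves the way for upcoming work.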

Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg-runtime.c     | 24 ++++++++++++++++++++++++
 tcg/tcg-runtime.h |  2 ++
 tcg/tcg.h         |  1 +
 3 files changed, 27 insertions(+)

diff --git a/tcg-runtime.c b/tcg-runtime.c
index 4c60c96..8a24bdd 100644
--- a/tcg-runtime.c
+++ b/tcg-runtime.c
@@ -27,6 +27,7 @@
 #include "exec/helper-proto.h"
 #include "exec/cpu_ldst.h"
 #include "exec/exec-all.h"
+#include "exec/tb-hash.h"
 
 /* 32-bit helpers */
 
@@ -141,6 +142,29 @@ uint64_t HELPER(ctpop_i64)(uint64_t arg)
     return ctpop64(arg);
 }
 
+void *HELPER(lookup_tb_ptr)(CPUArchState *env, target_ulong addr)
+{
+    CPUState *cpu = ENV_GET_CPU(env);
+    TranslationBlock *tb;
+    target_ulong cs_base, pc;
+    uint32_t flags;
+
+    tb = atomic_rcu_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(addr)]);
+    if (likely(tb)) {
+        cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
+        if (likely(tb->pc == addr && tb->cs_base == cs_base &&
+                   tb->flags == flags)) {
+            return tb->tc_ptr;
+        }
+        tb = tb_htable_lookup(cpu, pc, cs_base, flags);
+        if (likely(tb)) {
+            atomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(addr)], tb);
+            return tb->tc_ptr;
+        }
+    }
+    return tcg_ctx.code_gen_epilogue;
+}
+
 void HELPER(exit_atomic)(CPUArchState *env)
 {
     cpu_loop_exit_atomic(ENV_GET_CPU(env), GETPC());
diff --git a/tcg/tcg-runtime.h b/tcg/tcg-runtime.h
index 114ea6f..c41d38a 100644
--- a/tcg/tcg-runtime.h
+++ b/tcg/tcg-runtime.h
@@ -24,6 +24,8 @@ DEF_HELPER_FLAGS_1(clrsb_i64, TCG_CALL_NO_RWG_SE, i64, i64)
 DEF_HELPER_FLAGS_1(ctpop_i32, TCG_CALL_NO_RWG_SE, i32, i32)
 DEF_HELPER_FLAGS_1(ctpop_i64, TCG_CALL_NO_RWG_SE, i64, i64)
 
+DEF_HELPER_FLAGS_2(lookup_tb_ptr, TCG_CALL_NO_WG_SE, ptr, env, tl)
+
 DEF_HELPER_FLAGS_1(exit_atomic, TCG_CALL_NO_WG, noreturn, env)
 
 #ifdef CONFIG_SOFTMMU
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 6c216bb..5ec48d1 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -699,6 +699,7 @@ struct TCGContext {
        extension that allows arithmetic on void*.  */
     int code_gen_max_blocks;
     void *code_gen_prologue;
+    void *code_gen_epilogue;
     void *code_gen_buffer;
     size_t code_gen_buffer_size;
     void *code_gen_ptr;
-- 
2.7.4


* [Qemu-devel] [PATCH v4 03/11] tcg: introduce goto_ptr opcode
  2017-04-27  3:29 [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 01/11] exec-all: export tb_htable_lookup Emilio G. Cota
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 02/11] tcg-runtime: add lookup_tb_ptr helper Emilio G. Cota
@ 2017-04-27  3:29 ` Emilio G. Cota
  2017-04-27  8:09   ` Richard Henderson
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 04/11] tcg: export tcg_gen_lookup_and_goto_ptr Emilio G. Cota
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 20+ messages in thread
From: Emilio G. Cota @ 2017-04-27  3:29 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/aarch64/tcg-target.h | 1 +
 tcg/arm/tcg-target.h     | 1 +
 tcg/i386/tcg-target.h    | 1 +
 tcg/ia64/tcg-target.h    | 1 +
 tcg/mips/tcg-target.h    | 1 +
 tcg/ppc/tcg-target.h     | 1 +
 tcg/s390/tcg-target.h    | 1 +
 tcg/sparc/tcg-target.h   | 1 +
 tcg/tcg-opc.h            | 1 +
 tcg/tci/tcg-target.h     | 1 +
 10 files changed, 10 insertions(+)

diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 1a5ea23..b82eac4 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -77,6 +77,7 @@ typedef enum {
 #define TCG_TARGET_HAS_mulsh_i32        0
 #define TCG_TARGET_HAS_extrl_i64_i32    0
 #define TCG_TARGET_HAS_extrh_i64_i32    0
+#define TCG_TARGET_HAS_goto_ptr         0
 
 #define TCG_TARGET_HAS_div_i64          1
 #define TCG_TARGET_HAS_rem_i64          1
diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index 09a19c6..2f3ecfd 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -123,6 +123,7 @@ extern bool use_idiv_instructions;
 #define TCG_TARGET_HAS_mulsh_i32        0
 #define TCG_TARGET_HAS_div_i32          use_idiv_instructions
 #define TCG_TARGET_HAS_rem_i32          0
+#define TCG_TARGET_HAS_goto_ptr         0
 
 enum {
     TCG_AREG0 = TCG_REG_R6,
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 4275787..59d9835 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -107,6 +107,7 @@ extern bool have_popcnt;
 #define TCG_TARGET_HAS_muls2_i32        1
 #define TCG_TARGET_HAS_muluh_i32        0
 #define TCG_TARGET_HAS_mulsh_i32        0
+#define TCG_TARGET_HAS_goto_ptr         0
 
 #if TCG_TARGET_REG_BITS == 64
 #define TCG_TARGET_HAS_extrl_i64_i32    0
diff --git a/tcg/ia64/tcg-target.h b/tcg/ia64/tcg-target.h
index 42aea03..901bb75 100644
--- a/tcg/ia64/tcg-target.h
+++ b/tcg/ia64/tcg-target.h
@@ -173,6 +173,7 @@ typedef enum {
 #define TCG_TARGET_HAS_mulsh_i64        0
 #define TCG_TARGET_HAS_extrl_i64_i32    0
 #define TCG_TARGET_HAS_extrh_i64_i32    0
+#define TCG_TARGET_HAS_goto_ptr         0
 
 #define TCG_TARGET_deposit_i32_valid(ofs, len) ((len) <= 16)
 #define TCG_TARGET_deposit_i64_valid(ofs, len) ((len) <= 16)
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index f46d64a..e3240cf 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -130,6 +130,7 @@ extern bool use_mips32r2_instructions;
 #define TCG_TARGET_HAS_muluh_i32        1
 #define TCG_TARGET_HAS_mulsh_i32        1
 #define TCG_TARGET_HAS_bswap32_i32      1
+#define TCG_TARGET_HAS_goto_ptr         0
 
 #if TCG_TARGET_REG_BITS == 64
 #define TCG_TARGET_HAS_add2_i32         0
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index abd8b3d..a9aa974 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -82,6 +82,7 @@ extern bool have_isa_3_00;
 #define TCG_TARGET_HAS_muls2_i32        0
 #define TCG_TARGET_HAS_muluh_i32        1
 #define TCG_TARGET_HAS_mulsh_i32        1
+#define TCG_TARGET_HAS_goto_ptr         0
 
 #if TCG_TARGET_REG_BITS == 64
 #define TCG_TARGET_HAS_add2_i32         0
diff --git a/tcg/s390/tcg-target.h b/tcg/s390/tcg-target.h
index cbdd2a6..6b7bcfb 100644
--- a/tcg/s390/tcg-target.h
+++ b/tcg/s390/tcg-target.h
@@ -92,6 +92,7 @@ extern uint64_t s390_facilities;
 #define TCG_TARGET_HAS_mulsh_i32      0
 #define TCG_TARGET_HAS_extrl_i64_i32  0
 #define TCG_TARGET_HAS_extrh_i64_i32  0
+#define TCG_TARGET_HAS_goto_ptr       0
 
 #define TCG_TARGET_HAS_div2_i64       1
 #define TCG_TARGET_HAS_rot_i64        1
diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc/tcg-target.h
index b8b74f9..9348ddd 100644
--- a/tcg/sparc/tcg-target.h
+++ b/tcg/sparc/tcg-target.h
@@ -123,6 +123,7 @@ extern bool use_vis3_instructions;
 #define TCG_TARGET_HAS_muls2_i32        1
 #define TCG_TARGET_HAS_muluh_i32        0
 #define TCG_TARGET_HAS_mulsh_i32        0
+#define TCG_TARGET_HAS_goto_ptr         0
 
 #define TCG_TARGET_HAS_extrl_i64_i32    1
 #define TCG_TARGET_HAS_extrh_i64_i32    1
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index f06f894..956fb1e 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -193,6 +193,7 @@ DEF(insn_start, 0, 0, TLADDR_ARGS * TARGET_INSN_START_WORDS,
     TCG_OPF_NOT_PRESENT)
 DEF(exit_tb, 0, 0, 1, TCG_OPF_BB_END)
 DEF(goto_tb, 0, 0, 1, TCG_OPF_BB_END)
+DEF(goto_ptr, 0, 1, 0, TCG_OPF_BB_END | IMPL(TCG_TARGET_HAS_goto_ptr))
 
 DEF(qemu_ld_i32, 1, TLADDR_ARGS, 1,
     TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
diff --git a/tcg/tci/tcg-target.h b/tcg/tci/tcg-target.h
index 838bf3a..0696328 100644
--- a/tcg/tci/tcg-target.h
+++ b/tcg/tci/tcg-target.h
@@ -85,6 +85,7 @@
 #define TCG_TARGET_HAS_muls2_i32        0
 #define TCG_TARGET_HAS_muluh_i32        0
 #define TCG_TARGET_HAS_mulsh_i32        0
+#define TCG_TARGET_HAS_goto_ptr         0
 
 #if TCG_TARGET_REG_BITS == 64
 #define TCG_TARGET_HAS_extrl_i64_i32    0
-- 
2.7.4


* [Qemu-devel] [PATCH v4 04/11] tcg: export tcg_gen_lookup_and_goto_ptr
  2017-04-27  3:29 [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
                   ` (2 preceding siblings ...)
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 03/11] tcg: introduce goto_ptr opcode Emilio G. Cota
@ 2017-04-27  3:29 ` Emilio G. Cota
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 05/11] tcg/i386: implement goto_ptr op Emilio G. Cota
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-04-27  3:29 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar

Instead of exporting goto_ptr directly to TCG frontends, export
tcg_gen_lookup_and_goto_ptr(), which calls goto_ptr with the pointer
returned by the lookup_tb_ptr() helper. This is the only use case
we have for goto_ptr and lookup_tb_ptr, so having this function is
very convenient. Furthermore, it trivially allows us to avoid calling
the lookup helper if goto_ptr is not implemented by the backend.
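
The fallback described above can be sketched outside of TCG as follows.
This is only a model under stated assumptions: enum op, struct op_buf and
emit() are hypothetical stand-ins for the TCG op stream, and the backend
capability is passed as a runtime bool so both paths can be exercised,
whereas the patch tests the compile-time TCG_TARGET_HAS_goto_ptr macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical op kinds; they only name the steps, they are not TCG opcodes. */
enum op { OP_CALL_LOOKUP, OP_GOTO_PTR, OP_EXIT_TB };

#define OP_BUF_MAX 8

/* Hypothetical recorder standing in for the op stream being generated. */
struct op_buf {
    enum op ops[OP_BUF_MAX];
    size_t n;
};

static void emit(struct op_buf *buf, enum op op)
{
    assert(buf->n < OP_BUF_MAX);
    buf->ops[buf->n++] = op;
}

/*
 * With goto_ptr support: call the lookup helper, then jump to its result.
 * Without it: skip the helper entirely and exit to the exec loop.
 */
static void gen_lookup_and_goto_ptr(struct op_buf *buf, bool backend_has_goto_ptr)
{
    if (backend_has_goto_ptr) {
        emit(buf, OP_CALL_LOOKUP);
        emit(buf, OP_GOTO_PTR);
    } else {
        emit(buf, OP_EXIT_TB);
    }
}
```

Keeping the check inside one exported function means frontends never have
to test the backend capability themselves.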

Suggested-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/README   |  8 ++++++++
 tcg/tcg-op.c | 13 +++++++++++++
 tcg/tcg-op.h | 11 +++++++++++
 3 files changed, 32 insertions(+)

diff --git a/tcg/README b/tcg/README
index a9858c2..bf49e82 100644
--- a/tcg/README
+++ b/tcg/README
@@ -477,6 +477,14 @@ current TB was linked to this TB. Otherwise execute the next
 instructions. Only indices 0 and 1 are valid and tcg_gen_goto_tb may be issued
 at most once with each slot index per TB.
 
+* lookup_and_goto_ptr tb_addr
+
+Look up a TB address ('tb_addr') and jump to it if valid. If not valid,
+jump to the TCG epilogue to go back to the exec loop.
+
+This operation is optional. If the TCG backend does not implement the
+goto_ptr opcode, emitting this op is equivalent to emitting exit_tb(0).
+
 * qemu_ld_i32/i64 t0, t1, flags, memidx
 * qemu_st_i32/i64 t0, t1, flags, memidx
 
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 95a39b7..8ff1eaf 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -2587,6 +2587,19 @@ void tcg_gen_goto_tb(unsigned idx)
     tcg_gen_op1i(INDEX_op_goto_tb, idx);
 }
 
+void tcg_gen_lookup_and_goto_ptr(TCGv addr)
+{
+    if (TCG_TARGET_HAS_goto_ptr) {
+        TCGv_ptr ptr = tcg_temp_new_ptr();
+
+        gen_helper_lookup_tb_ptr(ptr, tcg_ctx.tcg_env, addr);
+        tcg_gen_op1i(INDEX_op_goto_ptr, GET_TCGV_PTR(ptr));
+        tcg_temp_free_ptr(ptr);
+    } else {
+        tcg_gen_exit_tb(0);
+    }
+}
+
 static inline TCGMemOp tcg_canonicalize_memop(TCGMemOp op, bool is64, bool st)
 {
     /* Trigger the asserts within as early as possible.  */
diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index c68e300..5d3278f 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -796,6 +796,17 @@ static inline void tcg_gen_exit_tb(uintptr_t val)
  */
 void tcg_gen_goto_tb(unsigned idx);
 
+/**
+ * tcg_gen_lookup_and_goto_ptr() - look up a TB and jump to it if valid
+ * @addr: Guest address of the target TB
+ *
+ * If the TB is not valid, jump to the epilogue.
+ *
+ * This operation is optional. If the TCG backend does not implement goto_ptr,
+ * this op is equivalent to calling tcg_gen_exit_tb() with 0 as the argument.
+ */
+void tcg_gen_lookup_and_goto_ptr(TCGv addr);
+
 #if TARGET_LONG_BITS == 32
 #define tcg_temp_new() tcg_temp_new_i32()
 #define tcg_global_reg_new tcg_global_reg_new_i32
-- 
2.7.4


* [Qemu-devel] [PATCH v4 05/11] tcg/i386: implement goto_ptr op
  2017-04-27  3:29 [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
                   ` (3 preceding siblings ...)
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 04/11] tcg: export tcg_gen_lookup_and_goto_ptr Emilio G. Cota
@ 2017-04-27  3:29 ` Emilio G. Cota
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 06/11] target/arm: optimize cross-page direct jumps in softmmu Emilio G. Cota
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-04-27  3:29 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar

Suggested-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/i386/tcg-target.h     |  2 +-
 tcg/i386/tcg-target.inc.c | 15 +++++++++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 59d9835..73a15f7 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -107,7 +107,7 @@ extern bool have_popcnt;
 #define TCG_TARGET_HAS_muls2_i32        1
 #define TCG_TARGET_HAS_muluh_i32        0
 #define TCG_TARGET_HAS_mulsh_i32        0
-#define TCG_TARGET_HAS_goto_ptr         0
+#define TCG_TARGET_HAS_goto_ptr         1
 
 #if TCG_TARGET_REG_BITS == 64
 #define TCG_TARGET_HAS_extrl_i64_i32    0
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 5918008..d0bf53a 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -1906,6 +1906,10 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
         }
         s->tb_jmp_reset_offset[a0] = tcg_current_code_size(s);
         break;
+    case INDEX_op_goto_ptr:
+        /* jmp to the given host address (could be epilogue) */
+        tcg_out_modrm(s, OPC_GRP5, EXT5_JMPN_Ev, a0);
+        break;
     case INDEX_op_br:
         tcg_out_jxx(s, JCC_JMP, arg_label(a0), 0);
         break;
@@ -2277,6 +2281,7 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
 
 static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
 {
+    static const TCGTargetOpDef r = { .args_ct_str = { "r" } };
     static const TCGTargetOpDef ri_r = { .args_ct_str = { "ri", "r" } };
     static const TCGTargetOpDef re_r = { .args_ct_str = { "re", "r" } };
     static const TCGTargetOpDef qi_r = { .args_ct_str = { "qi", "r" } };
@@ -2299,6 +2304,9 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
         = { .args_ct_str = { "L", "L", "L", "L" } };
 
     switch (op) {
+    case INDEX_op_goto_ptr:
+        return &r;
+
     case INDEX_op_ld8u_i32:
     case INDEX_op_ld8u_i64:
     case INDEX_op_ld8s_i32:
@@ -2567,6 +2575,13 @@ static void tcg_target_qemu_prologue(TCGContext *s)
     tcg_out_modrm(s, OPC_GRP5, EXT5_JMPN_Ev, tcg_target_call_iarg_regs[1]);
 #endif
 
+    /*
+     * Return path for goto_ptr. Set return value to 0, a-la exit_tb,
+     * and fall through to the rest of the epilogue.
+     */
+    s->code_gen_epilogue = s->code_ptr;
+    tcg_out_movi(s, TCG_TYPE_REG, TCG_REG_EAX, 0);
+
     /* TB epilogue */
     tb_ret_addr = s->code_ptr;
 
-- 
2.7.4


* [Qemu-devel] [PATCH v4 06/11] target/arm: optimize cross-page direct jumps in softmmu
  2017-04-27  3:29 [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
                   ` (4 preceding siblings ...)
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 05/11] tcg/i386: implement goto_ptr op Emilio G. Cota
@ 2017-04-27  3:29 ` Emilio G. Cota
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 07/11] target/arm: optimize indirect branches Emilio G. Cota
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-04-27  3:29 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar

Instead of unconditionally exiting to the exec loop, use the
lookup_and_goto_ptr helper to jump to the target if it is valid.

Perf impact: see next commit's log.
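
A minimal sketch of the decision this patch changes, assuming that direct
TB chaining is used only when the jump target lies in the same guest page
as the current TB, and that every other direct jump previously exited to
the exec loop. The 4 KiB page size, enum jump_kind and the function names
are illustrative assumptions, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS 12u /* assumption: 4 KiB guest pages */
#define PAGE_MASK (~((UINT32_C(1) << PAGE_BITS) - 1))

enum jump_kind {
    JUMP_CHAINED, /* same page: patchable direct jump between TBs */
    JUMP_LOOKUP   /* cross page: look the target up, jump if it is valid */
};

static uint32_t page_base(uint32_t addr)
{
    uint32_t base = addr & PAGE_MASK;

    assert(base <= addr);
    return base;
}

/* Hypothetical classifier for a direct jump from tb_pc to dest. */
static enum jump_kind classify_direct_jump(uint32_t tb_pc, uint32_t dest)
{
    return page_base(tb_pc) == page_base(dest) ? JUMP_CHAINED : JUMP_LOOKUP;
}
```

For example, a jump from 0x1ffc to 0x2000 crosses a page boundary and
therefore takes the lookup path rather than being chained.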

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/arm/translate.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/target/arm/translate.c b/target/arm/translate.c
index e32e38c..02cad96 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -4085,8 +4085,12 @@ static inline void gen_goto_tb(DisasContext *s, int n, target_ulong dest)
         gen_set_pc_im(s, dest);
         tcg_gen_exit_tb((uintptr_t)s->tb + n);
     } else {
+        TCGv addr = tcg_temp_new();
+
         gen_set_pc_im(s, dest);
-        tcg_gen_exit_tb(0);
+        tcg_gen_extu_i32_tl(addr, cpu_R[15]);
+        tcg_gen_lookup_and_goto_ptr(addr);
+        tcg_temp_free(addr);
     }
 }
 
-- 
2.7.4


* [Qemu-devel] [PATCH v4 07/11] target/arm: optimize indirect branches
  2017-04-27  3:29 [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
                   ` (5 preceding siblings ...)
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 06/11] target/arm: optimize cross-page direct jumps in softmmu Emilio G. Cota
@ 2017-04-27  3:29 ` Emilio G. Cota
  2017-04-27  9:36   ` Aurelien Jarno
  2017-04-27  9:41   ` Alex Bennée
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 08/11] target/i386: introduce gen_jr helper to generate lookup_and_goto_ptr Emilio G. Cota
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-04-27  3:29 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar

Speed up indirect branches by jumping to the target if it is valid.

Softmmu measurements (see later commit for user-mode results):

Note: baseline (i.e. speedup == 1x) is QEMU v2.9.0.

- Impact on Boot time

| setup  | ARM debian jessie boot+shutdown time | stddev |
|--------+--------------------------------------+--------|
| v2.9.0 |                                 8.84 |   0.07 |
| +cross |                                 8.85 |   0.03 |
| +jr    |                                 8.83 |   0.06 |
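
One rough way to read the table is to turn each row into a speedup over
the baseline and to check whether the difference is smaller than the
combined standard deviations. The sketch below does just that. It is a
crude heuristic, not a statistical test, and the function names are made
up for illustration.

```c
#include <assert.h>

/* Speedup of a measurement over the baseline; above 1.0 means faster. */
static double speedup(double baseline_time, double measured_time)
{
    assert(measured_time > 0.0);
    return baseline_time / measured_time;
}

/*
 * Crude noise check: treat two means as indistinguishable when they
 * differ by no more than the sum of their standard deviations.
 */
static int within_noise(double a, double sd_a, double b, double sd_b)
{
    double diff = a > b ? a - b : b - a;

    return diff <= sd_a + sd_b;
}
```

With the numbers above, within_noise(8.84, 0.07, 8.83, 0.06) holds for the
+jr row and within_noise(8.84, 0.07, 8.85, 0.03) holds for the +cross row,
so by this measure neither change moves boot+shutdown time beyond the
run-to-run variation.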

-                            NBench, arm-softmmu (debian jessie guest). Host: Intel i7-4790K @ 4.00GHz

  1.3x +-+-------------------------------------------------------------------------------------------------------------+-+
       |                                                                                                                 |
       |   cross                                                          ####                                           |
 1.25x +cross+jr..........................................................#++#.........................................+-+
       |                                                        ####      #  #                                           |
       |                                                     +++#  #      #  #                                           |
       |                                      +++            ****  #      #  #                                           |
  1.2x +-+...................................####............*..*..#......#..#.........................................+-+
       |                                  ****  #            *  *  #      #  #     ####                                  |
       |                                  *  *  #            *  *  #      #  #     #  #                                  |
 1.15x +-+................................*..*..#............*..*..#......#..#.....#..#................................+-+
       |                                  *  *  #            *  *  #      #  #     #  #                                  |
       |                                  *  *  #      ####  *  *  #      #  #     #  #                                  |
       |                                  *  *  #      #  #  *  *  #      #  #     #  #                         ####     |
  1.1x +-+................................*..*..#......#..#..*..*..#......#..#.....#..#.........................#..#...+-+
       |                                  *  *  #      #  #  *  *  #      #  #     #  #                         #  #     |
       |                                  *  *  #      #  #  *  *  #      #  #     #  #                         #  #     |
 1.05x +-+..........................####..*..*..#......#..#..*..*..#......#..#.....#..#......+++............*****..#...+-+
       |                        *****  #  *  *  #      #  #  *  *  #  *****  #     #  #   +++ |    ****###  *   *  #     |
       |                        *+++*  #  *  *  #      #  #  *  *  #  *+++*  #  ****  #  *****###  *  *  #  *   *  #     |
       |     *****###  +++####  *   *  #  *  *  #  *****  #  *  *  #  *   *  #  *  *  #  * | *++#  *  *  #  *   *  #     |
    1x +-++-+*+++*-+#++****++#++*+-+*++#+-*++*++#-+*+++*-+#++*++*++#++*+-+*++#+-*++*++#-+*+++*-+#++*++*++#++*+-+*++#+-++-+
       |     *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
       |     *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
 0.95x +-+---*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###---+-+
       x-axis, left to right: ASSIGNMENT, BITFIELD, FOURIER, FP EMULATION, HUFFMAN, IDEA,
       LU DECOMPOSITION, NEURAL NET, NUMERIC SORT, STRING SORT, hmean
  png: http://imgur.com/eOLmZNR

NB. 'cross' represents the previous commit.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/arm/translate.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/target/arm/translate.c b/target/arm/translate.c
index 02cad96..d46a576 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -65,6 +65,7 @@ static TCGv_i32 cpu_R[16];
 TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
 TCGv_i64 cpu_exclusive_addr;
 TCGv_i64 cpu_exclusive_val;
+static bool gen_jr;
 
 /* FIXME:  These should be removed.  */
 static TCGv_i32 cpu_F0s, cpu_F1s;
@@ -221,6 +222,7 @@ static void store_reg(DisasContext *s, int reg, TCGv_i32 var)
          */
         tcg_gen_andi_i32(var, var, s->thumb ? ~1 : ~3);
         s->is_jmp = DISAS_JUMP;
+        gen_jr = true;
     }
     tcg_gen_mov_i32(cpu_R[reg], var);
     tcg_temp_free_i32(var);
@@ -893,6 +895,7 @@ static inline void gen_bx_im(DisasContext *s, uint32_t addr)
         tcg_temp_free_i32(tmp);
     }
     tcg_gen_movi_i32(cpu_R[15], addr & ~1);
+    gen_jr = true;
 }
 
 /* Set PC and Thumb state from var.  var is marked as dead.  */
@@ -902,6 +905,7 @@ static inline void gen_bx(DisasContext *s, TCGv_i32 var)
     tcg_gen_andi_i32(cpu_R[15], var, ~1);
     tcg_gen_andi_i32(var, var, 1);
     store_cpu_field(var, thumb);
+    gen_jr = true;
 }
 
 /* Variant of store_reg which uses branch&exchange logic when storing
@@ -12034,6 +12038,20 @@ void gen_intermediate_code(CPUARMState *env, TranslationBlock *tb)
             gen_set_pc_im(dc, dc->pc);
             /* fall through */
         case DISAS_JUMP:
+            /*
+             * gen_jr is not set on every DISAS_JUMP because for some of those
+             * we do want to exit to the exec loop.
+             */
+            if (gen_jr) {
+                TCGv addr = tcg_temp_new();
+
+                gen_jr = false;
+                tcg_gen_extu_i32_tl(addr, cpu_R[15]);
+                tcg_gen_lookup_and_goto_ptr(addr);
+                tcg_temp_free(addr);
+                break;
+            }
+            /* fall through */
         default:
             /* indicate that the hash table must be used to find the next TB */
             tcg_gen_exit_tb(0);
-- 
2.7.4


* [Qemu-devel] [PATCH v4 08/11] target/i386: introduce gen_jr helper to generate lookup_and_goto_ptr
  2017-04-27  3:29 [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
                   ` (6 preceding siblings ...)
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 07/11] target/arm: optimize indirect branches Emilio G. Cota
@ 2017-04-27  3:29 ` Emilio G. Cota
  2017-04-27  8:12   ` Richard Henderson
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 09/11] target/i386: optimize cross-page direct jumps in softmmu Emilio G. Cota
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 20+ messages in thread
From: Emilio G. Cota @ 2017-04-27  3:29 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar

This helper will be used by subsequent changes.
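
The shape of the change can be sketched in plain C: one worker takes an
optional jump target, the existing entry point passes "unused", and the
new entry point passes a real destination. Everything here is a
simplified model under stated assumptions. struct opt_addr stands in for
a TCGv that may be marked unused, enum exit_kind names the outcomes, and
the inhibit, recheck_tf and debug paths of the real worker are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum exit_kind { EXIT_TO_LOOP, EXIT_SINGLE_STEP, EXIT_LOOKUP_AND_JUMP };

/* Hypothetical optional address; models a TCGv that may be unused. */
struct opt_addr {
    bool used;
    uint32_t value;
};

struct block_end {
    enum exit_kind kind;
    uint32_t vaddr; /* only meaningful for EXIT_LOOKUP_AND_JUMP */
};

/* Worker: single-stepping wins, then the optional jump, else plain exit. */
static struct block_end do_end_block(bool single_step, struct opt_addr jr,
                                     uint32_t cs_base)
{
    struct block_end end = { EXIT_TO_LOOP, 0 };

    assert(jr.used || jr.value == 0); /* an unused address carries no value */
    if (single_step) {
        end.kind = EXIT_SINGLE_STEP;
    } else if (jr.used) {
        end.kind = EXIT_LOOKUP_AND_JUMP;
        end.vaddr = jr.value + cs_base; /* jump target plus CS base */
    }
    return end;
}

/* Existing entry point: no jump target, behaviour unchanged. */
static struct block_end end_block(bool single_step)
{
    struct opt_addr unused = { false, 0 };

    return do_end_block(single_step, unused, 0);
}

/* New entry point: end the block with a jump to 'dest'. */
static struct block_end end_block_jr(bool single_step, uint32_t dest,
                                     uint32_t cs_base)
{
    struct opt_addr jr = { true, dest };

    return do_end_block(single_step, jr, cs_base);
}
```

Existing callers keep their two-flag signature, which is what keeps the
churn low.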

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/i386/translate.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/target/i386/translate.c b/target/i386/translate.c
index 1d1372f..f0e48dc 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -141,6 +141,7 @@ typedef struct DisasContext {
 } DisasContext;
 
 static void gen_eob(DisasContext *s);
+static void gen_jr(DisasContext *s, TCGv dest);
 static void gen_jmp(DisasContext *s, target_ulong eip);
 static void gen_jmp_tb(DisasContext *s, target_ulong eip, int tb_num);
 static void gen_op(DisasContext *s1, int op, TCGMemOp ot, int d);
@@ -2509,7 +2510,8 @@ static void gen_bnd_jmp(DisasContext *s)
    If INHIBIT, set HF_INHIBIT_IRQ_MASK if it isn't already set.
    If RECHECK_TF, emit a rechecking helper for #DB, ignoring the state of
    S->TF.  This is used by the syscall/sysret insns.  */
-static void gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf)
+static void
+do_gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf, TCGv jr)
 {
     gen_update_cc_op(s);
 
@@ -2530,12 +2532,27 @@ static void gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf)
         tcg_gen_exit_tb(0);
     } else if (s->tf) {
         gen_helper_single_step(cpu_env);
+    } else if (!TCGV_IS_UNUSED(jr)) {
+        TCGv vaddr = tcg_temp_new();
+
+        tcg_gen_add_tl(vaddr, jr, cpu_seg_base[R_CS]);
+        tcg_gen_lookup_and_goto_ptr(vaddr);
+        tcg_temp_free(vaddr);
     } else {
         tcg_gen_exit_tb(0);
     }
     s->is_jmp = DISAS_TB_JUMP;
 }
 
+static inline void
+gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf)
+{
+    TCGv unused;
+
+    TCGV_UNUSED(unused);
+    do_gen_eob_worker(s, inhibit, recheck_tf, unused);
+}
+
 /* End of block.
    If INHIBIT, set HF_INHIBIT_IRQ_MASK if it isn't already set.  */
 static void gen_eob_inhibit_irq(DisasContext *s, bool inhibit)
@@ -2549,6 +2566,12 @@ static void gen_eob(DisasContext *s)
     gen_eob_worker(s, false, false);
 }
 
+/* Jump to register */
+static void gen_jr(DisasContext *s, TCGv dest)
+{
+    do_gen_eob_worker(s, false, false, dest);
+}
+
 /* generate a jump to eip. No segment change must happen before as a
    direct call to the next block may occur */
 static void gen_jmp_tb(DisasContext *s, target_ulong eip, int tb_num)
-- 
2.7.4


* [Qemu-devel] [PATCH v4 09/11] target/i386: optimize cross-page direct jumps in softmmu
  2017-04-27  3:29 [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
                   ` (7 preceding siblings ...)
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 08/11] target/i386: introduce gen_jr helper to generate lookup_and_goto_ptr Emilio G. Cota
@ 2017-04-27  3:29 ` Emilio G. Cota
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 10/11] target/i386: optimize indirect branches Emilio G. Cota
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-04-27  3:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar

Instead of unconditionally exiting to the exec loop, use the
gen_jr helper to jump to the target if it is valid.

Perf impact: see next commit's log.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/i386/translate.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target/i386/translate.c b/target/i386/translate.c
index f0e48dc..ea113fe 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -2154,9 +2154,9 @@ static inline void gen_goto_tb(DisasContext *s, int tb_num, target_ulong eip)
         gen_jmp_im(eip);
         tcg_gen_exit_tb((uintptr_t)s->tb + tb_num);
     } else {
-        /* jump to another page: currently not optimized */
+        /* jump to another page */
         gen_jmp_im(eip);
-        gen_eob(s);
+        gen_jr(s, cpu_tmp0);
     }
 }
 
-- 
2.7.4


* [Qemu-devel] [PATCH v4 10/11] target/i386: optimize indirect branches
  2017-04-27  3:29 [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
                   ` (8 preceding siblings ...)
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 09/11] target/i386: optimize cross-page direct jumps in softmmu Emilio G. Cota
@ 2017-04-27  3:29 ` Emilio G. Cota
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 11/11] tb-hash: improve tb_jmp_cache hash function in user mode Emilio G. Cota
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-04-27  3:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar

Speed up indirect branches by jumping to the target if it is valid.

Softmmu measurements (see later commit for user-mode numbers):

Note: baseline (i.e. speedup == 1x) is QEMU v2.9.0.

-                  SPECint06 (test set), x86_64-softmmu (Ubuntu 16.04 guest). Host: Intel i7-4790K @ 4.00GHz

 2.4x +-+--------------------------------------------------------------------------------------------------------------+-+
      |                                                                                                                  |
      |   cross                                                                                                          |
 2.2x +cross+jr..........................................................................+++...........................+-+
      |                                                                                   |                              |
      |                                                                               +++ |                              |
   2x +-+..............................................................................|..|............................+-+
      |                                                                                |  |                              |
      |                                                                                |  |                              |
 1.8x +-+..............................................................................|####...........................+-+
      |                                                                                |# |#                             |
      |                                                                              **** |#                             |
 1.6x +-+............................................................................*.|*.|#...........................+-+
      |                                                                              * |* |#                             |
      |                                                                              * |* |#                             |
 1.4x +-+.......................................................................+++..*.|*.|#...........................+-+
      |                                                      ++++++             #### * |*++#             +++             |
      |                        +++                            |  |              #++# *++*  #          +++ |              |
 1.2x +-+......................###.....####....+++............|..|...........****..#.*..*..#....####...|.###.....####..+-+
      |        +++          **** #  ****  #    ####          ***###          *++*  # *  *  #    #++#  ****|#  +++#++#    |
      |    ****###     +++  *++* #  *++*  #  ++#  #    ####  *|* |#     +++  *  *  # *  *  #  ***  #  *| *|#  ****  #    |
   1x +-++-*++*++#++***###++*++*+#++*+-*++#+****++#++***++#+-*+*++#-+****##++*++*-+#+*++*-+#++*+*++#++*-+*+#++*++*++#-++-+
      |    *  *  #  * *  #  *  * #  *  *  # *  *  #  * *  #  *|* |#  *++* #  *  *  # *  *  #  * *  #  *  * #  *  *  #    |
      |    *  *  #  * *  #  *  * #  *  *  # *  *  #  * *  #  *+*++#  *  * #  *  *  # *  *  #  * *  #  *  * #  *  *  #    |
 0.8x +-+--****###--***###--****##--****###-****###--***###--***###--****##--****###-****###--***###--****##--****###--+-+
  x-axis, left to right: astar, bzip2, gcc, gobmk, h264ref, hmmer, libquantum, mcf, omnetpp, perlbench, sjeng, xalancbmk, hmean
  png: http://imgur.com/DU36YFU

NB. 'cross' represents the previous commit.

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/i386/translate.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/target/i386/translate.c b/target/i386/translate.c
index ea113fe..674ec96 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -4996,7 +4996,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             gen_push_v(s, cpu_T1);
             gen_op_jmp_v(cpu_T0);
             gen_bnd_jmp(s);
-            gen_eob(s);
+            gen_jr(s, cpu_T0);
             break;
         case 3: /* lcall Ev */
             gen_op_ld_v(s, ot, cpu_T1, cpu_A0);
@@ -5014,7 +5014,8 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                                       tcg_const_i32(dflag - 1),
                                       tcg_const_i32(s->pc - s->cs_base));
             }
-            gen_eob(s);
+            tcg_gen_ld_tl(cpu_tmp4, cpu_env, offsetof(CPUX86State, eip));
+            gen_jr(s, cpu_tmp4);
             break;
         case 4: /* jmp Ev */
             if (dflag == MO_16) {
@@ -5022,7 +5023,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             }
             gen_op_jmp_v(cpu_T0);
             gen_bnd_jmp(s);
-            gen_eob(s);
+            gen_jr(s, cpu_T0);
             break;
         case 5: /* ljmp Ev */
             gen_op_ld_v(s, ot, cpu_T1, cpu_A0);
@@ -5037,7 +5038,8 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                 gen_op_movl_seg_T0_vm(R_CS);
                 gen_op_jmp_v(cpu_T1);
             }
-            gen_eob(s);
+            tcg_gen_ld_tl(cpu_tmp4, cpu_env, offsetof(CPUX86State, eip));
+            gen_jr(s, cpu_tmp4);
             break;
         case 6: /* push Ev */
             gen_push_v(s, cpu_T0);
@@ -6417,7 +6419,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
         /* Note that gen_pop_T0 uses a zero-extending load.  */
         gen_op_jmp_v(cpu_T0);
         gen_bnd_jmp(s);
-        gen_eob(s);
+        gen_jr(s, cpu_T0);
         break;
     case 0xc3: /* ret */
         ot = gen_pop_T0(s);
@@ -6425,7 +6427,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
         /* Note that gen_pop_T0 uses a zero-extending load.  */
         gen_op_jmp_v(cpu_T0);
         gen_bnd_jmp(s);
-        gen_eob(s);
+        gen_jr(s, cpu_T0);
         break;
     case 0xca: /* lret im */
         val = cpu_ldsw_code(env, s->pc);
-- 
2.7.4


* [Qemu-devel] [PATCH v4 11/11] tb-hash: improve tb_jmp_cache hash function in user mode
  2017-04-27  3:29 [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
                   ` (9 preceding siblings ...)
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 10/11] target/i386: optimize indirect branches Emilio G. Cota
@ 2017-04-27  3:29 ` Emilio G. Cota
  2017-04-27  3:32 ` [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
  2017-04-27  9:39 ` Aurelien Jarno
  12 siblings, 0 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-04-27  3:29 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar

Optimizations to cross-page chaining and indirect branches make
performance more sensitive to the hit rate of tb_jmp_cache.
The constraint of reserving some bits for the page number
lowers the achievable quality of the hashing function.

However, user-mode does not have this requirement. Thus,
with this change user-mode uses a hashing function that is
both faster and of better quality than the previous one.

Measurements:

Note: baseline (i.e. speedup == 1x) is QEMU v2.9.0.

-                           SPECint06 (test set), x86_64-linux-user. Host: Intel i7-6700K @ 4.00GHz

 2.2x +-+--------------------------------------------------------------------------------------------------------------+-+
      |                                                                                                                  |
      |         jr                                                                                                       |
   2x +jr+multhash        +....................................................+++++...................................+-+
      |    jr+hash                                                              |$$$                                     |
      |                                                                         |$+$                                     |
      |                                                                        ### $                                     |
 1.8x +-+......................................................................#|#.$...................................+-+
      |                                                                      ++#+# $                                     |
      |                                                                       |# # $                                     |
 1.6x +-+....................................................................***.#.$....................++$$$..........+-+
      |                                         $$$                          *+* # $                     |$+$            |
      |                       ++$$$           ### $                          * * # $                  +++|$ $            |
      |                     ++###+$           # # $                          * * # $           ###   ****## $            |
 1.4x +-+...................***+#.$.........***.#.$..........................*.*.#.$...........#+#$$.*++*|#.$..........+-+
      |                     *+* # $         * * # $                          * * # $           # # $ *  *+# $            |
      |                     * * # $   +++++ * * # $                          * * # $         *** # $ *  * # $   ###$$    |
 1.2x +-+...................*.*.#.$.***##$$.*.*.#.$..........................*.*.#.$.........*.*.#.$.*..*.#.$.***+#+$..+-+
      |                     * * # $ *+* # $ * * # $   +++                    * * # $ ++###$$ * * # $ *  * # $ * * # $    |
      |    ***##$$          * * # $ * * # $ * * # $ ***##$$          ++###   * * # $ *** #+$ * * # $ *  * # $ * * # $    |
      |    *+*+#+$ ***##$$$ * * # $ * * # $ * * # $ *+* # $ ++####$$ ***+#   * * # $ * * # $ * * # $ *  * # $ * * # $    |
   1x +-++-*+*+#+$+*+*+#-+$+*+*-#+$+*+*+#+$+*+*+#+$+*-*+#+$+***++#+$+*+*+#$$+*+*+#+$+*+*+#+$+*+*-#+$+*+-*+#+$+*+*+#+$-++-+
      |    * * # $ * * #  $ * * # $ * * # $ * * # $ * * # $ * *  # $ * * # $ * * # $ * * # $ * * # $ *  * # $ * * # $    |
      |    * * # $ * * #  $ * * # $ * * # $ * * # $ * * # $ * *  # $ * * # $ * * # $ * * # $ * * # $ *  * # $ * * # $    |
 0.8x +-+--***##$$-***##$$$-***##$$-***##$$-***##$$-***##$$-***###$$-***##$$-***##$$-***##$$-***##$$-****##$$-***##$$--+-+
  x-axis, left to right: astar, bzip2, gcc, gobmk, h264ref, hmmer, libquantum, mcf, omnetpp, perlbench, sjeng, xalancbmk, hmean
  png: http://imgur.com/4UXTrEc

Here I also tried the hash function suggested by Paolo ("multhash"):

  return ((uint64_t) (pc * 2654435761) >> 32) & (TB_JMP_CACHE_SIZE - 1);

As you can see it is just as good as the other new function ("hash"),
which is what I ended up going with.

-                          SPECint06 (train set), x86_64-linux-user. Host: Intel i7-6700K @ 4.00GHz

 2.6x +-+--------------------------------------------------------------------------------------------------------------+-+
      |                                                                                                                  |
      |     jr                                                                                           ###             |
 2.4x +jr+hash...........................................................................................#.#...........+-+
      |                                                                                                  # #             |
      |                                                                                                  # #             |
 2.2x +-+................................................................................................#.#...........+-+
      |                                                                                                  # #             |
      |                                                                                                  # #             |
   2x +-+................................................................................................#.#...........+-+
      |                                                                                               **** #             |
      |                                                                                               *  * #             |
 1.8x +-+.............................................................................................*..*.#...........+-+
      |                                                                         +++                   *  * #             |
      |                                                                         ####    ####          *  * #             |
 1.6x +-+......................................####.............................#..#.****..#..........*..*.#...........+-+
      |                        +++             #++#                          ****  # *  *  #    ####  *  * #             |
      |                        ###             #  #                          *  *  # *  *  #    #  #  *  * #             |
 1.4x +-+...................****+#..........****..#..........................*..*..#.*..*..#....#..#..*..*.#...........+-+
      |                     *++* #          *  *  #                          *  *  # *  *  #  ***  #  *  * #     ####    |
      |                     *  * #     #### *  *  #                          *  *  # *  *  #  * *  #  *  * #  ****  #    |
 1.2x +-+...................*..*.#..****++#.*..*..#..........................*..*..#.*..*..#..*.*..#..*..*.#..*..*..#..+-+
      |    ****###          *  * #  *  *  # *  *  #                          *  *  # *  *  #  * *  #  *  * #  *  *  #    |
      |    *  *  #  ***###  *  * #  *  *  # *  *  #                  ****##  *  *  # *  *  #  * *  #  *  * #  *  *  #    |
   1x +-+--****###--***###--****##--****###-****###--***###--***###--****##--****###-****###--***###--****##--****###--+-+
  x-axis, left to right: astar, bzip2, gcc, gobmk, h264ref, hmmer, libquantum, mcf, omnetpp, perlbench, sjeng, xalancbmk, hmean
  png: http://imgur.com/ArCbHqo

-                                    NBench, x86_64-linux-user. Host: Intel i7-6700K @ 4.00GHz

 1.12x +-+-------------------------------------------------------------------------------------------------------------+-+
       |                                                                                                                 |
       |     jr                                                           +++                                            |
  1.1x +jr+hash...........................................................####.........................................+-+
       |                                                               +++#| #                                           |
       |                                                                | #++#                                           |
 1.08x +-+................................+++................+++.+++..*****..#.........................................+-+
       |                                   |  +++             |   |   * | *  #                                           |
       |                                   |   |              |   |   *+++*  #                                           |
 1.06x +-+................................****###.............|...|...*...*..#.........................+++.............+-+
       |                                  *| * |#            ****###  *   *  #                          |                |
       |                                  *| *++#            *| * |#  *   *  #                        ####               |
 1.04x +-+................................*++*..#............*|.*.|#..*...*..#........................#.|#.............+-+
       |                                  *  *  #            *++*++#  *   *  #                     +++#++#               |
       |                                  *  *  #            *  *  #  *   *  #                      | #  #   +++####     |
 1.02x +-+................................*..*..#......+++...*..*..#..*...*..#.....................****..#..*****++#...+-+
       |         +++                      *  *  #   +++ |    *  *  #  *   *  #  +++                *| *  #  *+++*  #     |
       |      +++ |    +++ +++   ++++++   *  *  #  *****###  *  *  #  *   *  #   |  +++   ++++++   *++*  #  *   *  #     |
    1x +-++-+++++####++****###++++-+####+-*++*++#-+*+++*-+#++*++*++#++*+-+*++#+-+++####-+*****###++*++*++#++*+-+*++#+-++-+
       |     *****| #  *++* |#  *****| #  *  *  #  *   *++#  *  *  #  *   *  #  **** |#  *   *  #  *  *  #  *   *  #     |
       |     * | *| #  *  *++#  * | *++#  *  *  #  *   *  #  *  *  #  *   *  #  *| *++#  *   *  #  *  *  #  *   *  #     |
 0.98x +-+...*.|.*++#..*..*..#..*+++*..#..*..*..#..*...*..#..*..*..#..*...*..#..*++*..#..*...*..#..*..*..#..*...*..#...+-+
       |     *+++*  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
       |     *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
 0.96x +-+---*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###---+-+
  x-axis, left to right: ASSIGNMENT, BITFIELD, FOURIER, FP EMULATION, HUFFMAN, IDEA, LU DECOMPOSITION, NEURAL NET, NUMERIC SORT, STRING SORT, hmean
  png: http://imgur.com/ZXFX0hJ

-                                   NBench, arm-linux-user. Host: Intel i7-4790K @ 4.00GHz

  1.3x +-+-------------------------------------------------------------------------------------------------------------+-+
       |                            ####                                                                                 |
       |     jr                     #  #                                            +++                                  |
 1.25x +jr+hash.....................#..#...........................................####................................+-+
       |                            #  #                                           #  #                                  |
       |                            #  #                                           #  #                                  |
  1.2x +-+..........................#..#...........................................#..#................................+-+
       |                            #  #                                           #  #                                  |
       |                            #  #                                           #  #                                  |
 1.15x +-+..........................#..#...........................................#..#................................+-+
       |                            #  #                                  ####     #  #                                  |
       |                            #  #                                  #  #     #  #                                  |
  1.1x +-+..........................#..#..................................#..#.....#..#................................+-+
       |                            #  #                                  #  #     #  #                         +++      |
       |                            #  #               ####               #  #     #  #                         ####     |
 1.05x +-+..........................#..#...............#..#.....####......#..#.....#..#.........................#..#...+-+
       |                            #  #               #  #     #  #      #  #     #  #                +++      #  #     |
       |                   +++  *****  #     ####  *****  #     #  #   +++#  #  ****  #            ****###      #  #     |
    1x +-++-+*****###++****+++++*+-+*++#+-****++#-+*+++*-+#+++++#++#++*****++#+-*++*++#-+*****-++++*++*++#++*****++#+-++-+
       |     *   *  #  *  * |   *   *  #  *  *  #  *   *  #  ****  #  *   *  #  *  *  #  *   *###  *  *++#  *   *  #     |
       |     *   *  #  *  *###  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
 0.95x +-+...*...*..#..*..*.|#..*...*..#..*..*..#..*...*..#..*..*..#..*...*..#..*..*..#..*...*..#..*..*..#..*...*..#...+-+
       |     *   *  #  *  * |#  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
       |     *   *  #  *  * |#  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
  0.9x +-+---*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###---+-+
  x-axis, left to right: ASSIGNMENT, BITFIELD, FOURIER, FP EMULATION, HUFFMAN, IDEA, LU DECOMPOSITION, NEURAL NET, NUMERIC SORT, STRING SORT, hmean
  png: http://imgur.com/FfD27ey

Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/tb-hash.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 2c27490..b1fe2d0 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -22,6 +22,8 @@
 
 #include "exec/tb-hash-xx.h"
 
+#ifdef CONFIG_SOFTMMU
+
 /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
    addresses on the same page.  The top bits are the same.  This allows
    TLB invalidation to quickly clear a subset of the hash table.  */
@@ -45,6 +47,16 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
            | (tmp & TB_JMP_ADDR_MASK));
 }
 
+#else
+
+/* In user-mode we can get better hashing because we do not have a TLB */
+static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
+{
+    return (pc ^ (pc >> TB_JMP_CACHE_BITS)) & (TB_JMP_CACHE_SIZE - 1);
+}
+
+#endif /* CONFIG_SOFTMMU */
+
 static inline
 uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint32_t flags)
 {
-- 
2.7.4


* Re: [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10
  2017-04-27  3:29 [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
                   ` (10 preceding siblings ...)
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 11/11] tb-hash: improve tb_jmp_cache hash function in user mode Emilio G. Cota
@ 2017-04-27  3:32 ` Emilio G. Cota
  2017-04-27  9:39 ` Aurelien Jarno
  12 siblings, 0 replies; 20+ messages in thread
From: Emilio G. Cota @ 2017-04-27  3:32 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar

Just to avoid confusion,

On Wed, Apr 26, 2017 at 23:29:13 -0400, Emilio G. Cota wrote:
> I have *not* re-run all experiments, because it takes several hours and
> performance hasn't changed much from v3, as can be seen in these two charts:
> * spec06int user-mode, test input, v2.9.0 baseline: http://imgur.com/ME2eMq1

In this chart, "v4" does not include the htable lookup ("v4+htable" does).

> * spec06int softmmu, test input, v3 baseline: http://imgur.com/Clolu9Z

Here v4 does have it.

		E.


* Re: [Qemu-devel] [PATCH v4 03/11] tcg: introduce goto_ptr opcode
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 03/11] tcg: introduce goto_ptr opcode Emilio G. Cota
@ 2017-04-27  8:09   ` Richard Henderson
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2017-04-27  8:09 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Peter Maydell, Eduardo Habkost,
	Andrzej Zaborowski, Aurelien Jarno, Alexander Graf, Stefan Weil,
	qemu-arm, alex.bennee, Pranith Kumar

On 04/27/2017 05:29 AM, Emilio G. Cota wrote:
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   tcg/aarch64/tcg-target.h | 1 +
>   tcg/arm/tcg-target.h     | 1 +
>   tcg/i386/tcg-target.h    | 1 +
>   tcg/ia64/tcg-target.h    | 1 +
>   tcg/mips/tcg-target.h    | 1 +
>   tcg/ppc/tcg-target.h     | 1 +
>   tcg/s390/tcg-target.h    | 1 +
>   tcg/sparc/tcg-target.h   | 1 +
>   tcg/tcg-opc.h            | 1 +
>   tcg/tci/tcg-target.h     | 1 +
>   10 files changed, 10 insertions(+)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~


* Re: [Qemu-devel] [PATCH v4 08/11] target/i386: introduce gen_jr helper to generate lookup_and_goto_ptr
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 08/11] target/i386: introduce gen_jr helper to generate lookup_and_goto_ptr Emilio G. Cota
@ 2017-04-27  8:12   ` Richard Henderson
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2017-04-27  8:12 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Peter Maydell, Eduardo Habkost,
	Andrzej Zaborowski, Aurelien Jarno, Alexander Graf, Stefan Weil,
	qemu-arm, alex.bennee, Pranith Kumar

On 04/27/2017 05:29 AM, Emilio G. Cota wrote:
> This helper will be used by subsequent changes.
> 
> Signed-off-by: Emilio G. Cota<cota@braap.org>
> ---
>   target/i386/translate.c | 25 ++++++++++++++++++++++++-
>   1 file changed, 24 insertions(+), 1 deletion(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>


r~


* Re: [Qemu-devel] [PATCH v4 07/11] target/arm: optimize indirect branches
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 07/11] target/arm: optimize indirect branches Emilio G. Cota
@ 2017-04-27  9:36   ` Aurelien Jarno
  2017-04-27  9:42     ` Richard Henderson
  2017-04-27  9:41   ` Alex Bennée
  1 sibling, 1 reply; 20+ messages in thread
From: Aurelien Jarno @ 2017-04-27  9:36 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: qemu-devel, Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Alexander Graf, Stefan Weil, qemu-arm, alex.bennee,
	Pranith Kumar

On 2017-04-26 23:29, Emilio G. Cota wrote:
> Speed up indirect branches by jumping to the target if it is valid.
> 
> Softmmu measurements (see later commit for user-mode results):
> 
> Note: baseline (i.e. speedup == 1x) is QEMU v2.9.0.
> 
> - Impact on Boot time
> 
> | setup  | ARM debian jessie boot+shutdown time | stddev |
> |--------+--------------------------------------+--------|
> | v2.9.0 |                                 8.84 |   0.07 |
> | +cross |                                 8.85 |   0.03 |
> | +jr    |                                 8.83 |   0.06 |
> 
> -                            NBench, arm-softmmu (debian jessie guest). Host: Intel i7-4790K @ 4.00GHz
> 
>   1.3x +-+-------------------------------------------------------------------------------------------------------------+-+
>        |                                                                                                                 |
>        |   cross                                                          ####                                           |
>  1.25x +cross+jr..........................................................#++#.........................................+-+
>        |                                                        ####      #  #                                           |
>        |                                                     +++#  #      #  #                                           |
>        |                                      +++            ****  #      #  #                                           |
>   1.2x +-+...................................####............*..*..#......#..#.........................................+-+
>        |                                  ****  #            *  *  #      #  #     ####                                  |
>        |                                  *  *  #            *  *  #      #  #     #  #                                  |
>  1.15x +-+................................*..*..#............*..*..#......#..#.....#..#................................+-+
>        |                                  *  *  #            *  *  #      #  #     #  #                                  |
>        |                                  *  *  #      ####  *  *  #      #  #     #  #                                  |
>        |                                  *  *  #      #  #  *  *  #      #  #     #  #                         ####     |
>   1.1x +-+................................*..*..#......#..#..*..*..#......#..#.....#..#.........................#..#...+-+
>        |                                  *  *  #      #  #  *  *  #      #  #     #  #                         #  #     |
>        |                                  *  *  #      #  #  *  *  #      #  #     #  #                         #  #     |
>  1.05x +-+..........................####..*..*..#......#..#..*..*..#......#..#.....#..#......+++............*****..#...+-+
>        |                        *****  #  *  *  #      #  #  *  *  #  *****  #     #  #   +++ |    ****###  *   *  #     |
>        |                        *+++*  #  *  *  #      #  #  *  *  #  *+++*  #  ****  #  *****###  *  *  #  *   *  #     |
>        |     *****###  +++####  *   *  #  *  *  #  *****  #  *  *  #  *   *  #  *  *  #  * | *++#  *  *  #  *   *  #     |
>     1x +-++-+*+++*-+#++****++#++*+-+*++#+-*++*++#-+*+++*-+#++*++*++#++*+-+*++#+-*++*++#-+*+++*-+#++*++*++#++*+-+*++#+-++-+
>        |     *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
>        |     *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
>  0.95x +-+---*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###---+-+
>        ASSIGNMENT BITFIELD   FOURFP EMULATION   HUFFMAN   LU DECOMPOSITIONEURAL NNUMERIC SOSTRING SORT     hmean
>   png: http://imgur.com/eOLmZNR
> 
> NB. 'cross' represents the previous commit.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  target/arm/translate.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/target/arm/translate.c b/target/arm/translate.c
> index 02cad96..d46a576 100644
> --- a/target/arm/translate.c
> +++ b/target/arm/translate.c
> @@ -65,6 +65,7 @@ static TCGv_i32 cpu_R[16];
>  TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
>  TCGv_i64 cpu_exclusive_addr;
>  TCGv_i64 cpu_exclusive_val;
> +static bool gen_jr;
>  
>  /* FIXME:  These should be removed.  */
>  static TCGv_i32 cpu_F0s, cpu_F1s;
> @@ -221,6 +222,7 @@ static void store_reg(DisasContext *s, int reg, TCGv_i32 var)
>           */
>          tcg_gen_andi_i32(var, var, s->thumb ? ~1 : ~3);
>          s->is_jmp = DISAS_JUMP;
> +        gen_jr = true;
>      }
>      tcg_gen_mov_i32(cpu_R[reg], var);
>      tcg_temp_free_i32(var);
> @@ -893,6 +895,7 @@ static inline void gen_bx_im(DisasContext *s, uint32_t addr)
>          tcg_temp_free_i32(tmp);
>      }
>      tcg_gen_movi_i32(cpu_R[15], addr & ~1);
> +    gen_jr = true;
>  }
>  
>  /* Set PC and Thumb state from var.  var is marked as dead.  */
> @@ -902,6 +905,7 @@ static inline void gen_bx(DisasContext *s, TCGv_i32 var)
>      tcg_gen_andi_i32(cpu_R[15], var, ~1);
>      tcg_gen_andi_i32(var, var, 1);
>      store_cpu_field(var, thumb);
> +    gen_jr = true;
>  }
>  
>  /* Variant of store_reg which uses branch&exchange logic when storing
> @@ -12034,6 +12038,20 @@ void gen_intermediate_code(CPUARMState *env, TranslationBlock *tb)
>              gen_set_pc_im(dc, dc->pc);
>              /* fall through */
>          case DISAS_JUMP:
> +            /*
> +             * gen_jr is not set on every DISAS_JUMP because for some of those
> +             * we do want to exit to the exec loop.
> +             */

What would be the reason for that? IIUC the lookup_tb_ptr helper calls
cpu_get_tb_cpu_state to get the TB flags to look up from the current
CPU state. This means it can handle, for example, a transition from
user to privileged mode. Also, the exit_req flag or its new equivalent
is tested at the beginning of each TB in case there is an interrupt.

It therefore seems to me that we can replace all calls to
tcg_gen_exit_tb with tcg_gen_lookup_and_goto_ptr, passing the program
counter as the argument.
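
To make the argument concrete, here is a toy model of the lookup as I
understand it. Every name in it (ToyTB, ToyCPU, toy_hash,
toy_lookup_tb_ptr) is made up for illustration and is not QEMU's code;
it also leaves out the tb_htable_lookup fallback on a jmp_cache miss.
The point is only that the match is done on pc, cs_base and flags
together, so a TB translated for another mode is never chained to:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for illustration only; not QEMU's real types. */
typedef struct ToyTB {
    uint32_t pc;
    uint32_t cs_base;
    uint32_t flags;      /* e.g. bit 0 set = privileged mode */
    const void *code;    /* host code generated for this block */
} ToyTB;

typedef struct ToyCPU {
    uint32_t pc;
    uint32_t cs_base;
    uint32_t flags;
    ToyTB *jmp_cache[16];   /* direct-mapped cache indexed by a pc hash */
} ToyCPU;

static unsigned toy_hash(uint32_t pc)
{
    return (pc >> 2) & 15;
}

/*
 * Return the host code to jump to, or NULL to mean "go back to the
 * exec loop". The hash-table fallback on a cache miss is omitted.
 */
const void *toy_lookup_tb_ptr(const ToyCPU *cpu)
{
    assert(cpu != NULL);
    const ToyTB *tb = cpu->jmp_cache[toy_hash(cpu->pc)];

    if (tb && tb->pc == cpu->pc &&
        tb->cs_base == cpu->cs_base &&
        tb->flags == cpu->flags) {
        return tb->code;
    }
    return NULL;
}
```

With a cached user-mode TB at some pc, the lookup hits while the CPU
flags say user mode and misses as soon as they say privileged mode,
even though the pc is unchanged.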

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@aurel32.net                 http://www.aurel32.net

* Re: [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10
  2017-04-27  3:29 [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
                   ` (11 preceding siblings ...)
  2017-04-27  3:32 ` [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
@ 2017-04-27  9:39 ` Aurelien Jarno
  12 siblings, 0 replies; 20+ messages in thread
From: Aurelien Jarno @ 2017-04-27  9:39 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: qemu-devel, Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Alexander Graf, Stefan Weil, qemu-arm, alex.bennee,
	Pranith Kumar

On 2017-04-26 23:29, Emilio G. Cota wrote:
> v3 for context: https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04795.html
> 
> Changes from v3:
> 
> - Added reviewed-by tags.
> 
> - Added a couple of suggested-by tags that I forgot to add in v3
>   regarding lookup_and_goto_ptr and i386's implementation of goto_ptr.
> 
> - lookup_tb_ptr
>   + Dropped the unnecessary exit_request check, as suggested by Paolo and
>     Richard.
>   + Only get the CPU state if we get a tb from the jmp_cache, as suggested
>     by Richard.
>   + Added tb_htable_lookup if we miss in tb_jmp_cache, as suggested by
>     Richard. This requires an extra patch to export tb_htable_lookup.
> 
> - goto_ptr: add IMPL(has_goto_ptr), as pointed out by Richard.
> 
> - target/arm: added a comment about gen_jr. See the v3 thread for why
>   it is needed.
> 
> - target/i386: use TCGV_UNUSED instead of (ab)using NULL on a TCGv,
>   as suggested by Richard. Also took his suggestion to simplify
>   the addition of jr + cs_base.
>   To minimize churn I renamed gen_eob_worker to do_gen_eob_worker,
>   which takes the newly added argument.
> 
> I have *not* re-run all experiments, because it takes several hours and
> performance hasn't changed much from v3, as can be seen in these two charts:
> * spec06int user-mode, test input, v2.9.0 baseline: http://imgur.com/ME2eMq1
> * spec06int softmmu, test input, v3 baseline: http://imgur.com/Clolu9Z
> The perf differences are mostly due to adding the htable check. Note that
> its impact is small, since tb_jmp_cache has a %hit rate in the high 90's.
> 
> You can inspect/fetch the changes at:
>   https://github.com/cota/qemu/tree/tcg-opt-v4

Thanks for this patchset. I have tested it with an arm target, and also
with a mips target with an additional patch. I haven't done any precise
benchmarks yet. That additional patch is trivial and only changes 3
lines, but I am not 100% sure I have done things correctly (see my
comment on patch 7).

Tested-by: Aurelien Jarno <aurelien@aurel32.net>

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@aurel32.net                 http://www.aurel32.net

* Re: [Qemu-devel] [PATCH v4 07/11] target/arm: optimize indirect branches
  2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 07/11] target/arm: optimize indirect branches Emilio G. Cota
  2017-04-27  9:36   ` Aurelien Jarno
@ 2017-04-27  9:41   ` Alex Bennée
  1 sibling, 0 replies; 20+ messages in thread
From: Alex Bennée @ 2017-04-27  9:41 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: qemu-devel, Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	Pranith Kumar


Emilio G. Cota <cota@braap.org> writes:

> Speed up indirect branches by jumping to the target if it is valid.
>
> Softmmu measurements (see later commit for user-mode results):
>
> Note: baseline (i.e. speedup == 1x) is QEMU v2.9.0.
>
> - Impact on Boot time
>
> | setup  | ARM debian jessie boot+shutdown time | stddev |
> |--------+--------------------------------------+--------|
> | v2.9.0 |                                 8.84 |   0.07 |
> | +cross |                                 8.85 |   0.03 |
> | +jr    |                                 8.83 |   0.06 |
>
> -                            NBench, arm-softmmu (debian jessie guest). Host: Intel i7-4790K @ 4.00GHz
>
>   1.3x +-+-------------------------------------------------------------------------------------------------------------+-+
>        |                                                                                                                 |
>        |   cross                                                          ####                                           |
>  1.25x +cross+jr..........................................................#++#.........................................+-+
>        |                                                        ####      #  #                                           |
>        |                                                     +++#  #      #  #                                           |
>        |                                      +++            ****  #      #  #                                           |
>   1.2x +-+...................................####............*..*..#......#..#.........................................+-+
>        |                                  ****  #            *  *  #      #  #     ####                                  |
>        |                                  *  *  #            *  *  #      #  #     #  #                                  |
>  1.15x +-+................................*..*..#............*..*..#......#..#.....#..#................................+-+
>        |                                  *  *  #            *  *  #      #  #     #  #                                  |
>        |                                  *  *  #      ####  *  *  #      #  #     #  #                                  |
>        |                                  *  *  #      #  #  *  *  #      #  #     #  #                         ####     |
>   1.1x +-+................................*..*..#......#..#..*..*..#......#..#.....#..#.........................#..#...+-+
>        |                                  *  *  #      #  #  *  *  #      #  #     #  #                         #  #     |
>        |                                  *  *  #      #  #  *  *  #      #  #     #  #                         #  #     |
>  1.05x +-+..........................####..*..*..#......#..#..*..*..#......#..#.....#..#......+++............*****..#...+-+
>        |                        *****  #  *  *  #      #  #  *  *  #  *****  #     #  #   +++ |    ****###  *   *  #     |
>        |                        *+++*  #  *  *  #      #  #  *  *  #  *+++*  #  ****  #  *****###  *  *  #  *   *  #     |
>        |     *****###  +++####  *   *  #  *  *  #  *****  #  *  *  #  *   *  #  *  *  #  * | *++#  *  *  #  *   *  #     |
>     1x +-++-+*+++*-+#++****++#++*+-+*++#+-*++*++#-+*+++*-+#++*++*++#++*+-+*++#+-*++*++#-+*+++*-+#++*++*++#++*+-+*++#+-++-+
>        |     *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
>        |     *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
>  0.95x +-+---*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###---+-+
>        ASSIGNMENT BITFIELD   FOURFP EMULATION   HUFFMAN   LU DECOMPOSITIONEURAL NNUMERIC SOSTRING SORT     hmean
>   png: http://imgur.com/eOLmZNR
>
> NB. 'cross' represents the previous commit.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>


Hmm, not sure why, but this doesn't apply cleanly to master.

> ---
>  target/arm/translate.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> diff --git a/target/arm/translate.c b/target/arm/translate.c
> index 02cad96..d46a576 100644
> --- a/target/arm/translate.c
> +++ b/target/arm/translate.c
> @@ -65,6 +65,7 @@ static TCGv_i32 cpu_R[16];
>  TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
>  TCGv_i64 cpu_exclusive_addr;
>  TCGv_i64 cpu_exclusive_val;
> +static bool gen_jr;

Isn't this something that should be part of the DisasContext rather than
a global? I know we are unlikely to make the translators run
independently anytime soon, but we shouldn't use globals where we can
help it.
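
Roughly what I have in mind, as a toy sketch rather than a patch: the
flag lives in the per-translation context, so two translations cannot
see each other's state and nothing needs resetting at the end of a TB.
All names here (ToyDisasContext, toy_gen_bx, toy_finish_tb, ...) are
made up for illustration and are not the real translate.c code:

```c
#include <stdbool.h>

/* What the end-of-TB code would emit in this toy model. */
typedef enum { TOY_EMIT_EXIT_TB, TOY_EMIT_GOTO_PTR } ToyEmit;

/* Hypothetical per-translation context carrying the flag. */
typedef struct ToyDisasContext {
    bool is_jump;   /* stands in for s->is_jmp == DISAS_JUMP */
    bool gen_jr;    /* was: a file-scope "static bool gen_jr" */
} ToyDisasContext;

/* An indirect branch: the target can be looked up and jumped to. */
void toy_gen_bx(ToyDisasContext *s)
{
    s->is_jump = true;
    s->gen_jr = true;
}

/* Some other PC write that must go back to the exec loop. */
void toy_gen_pc_write_needing_exit(ToyDisasContext *s)
{
    s->is_jump = true;
}

/* End of TB: pick the exit based only on this context's state. */
ToyEmit toy_finish_tb(const ToyDisasContext *s)
{
    if (s->is_jump && s->gen_jr) {
        return TOY_EMIT_GOTO_PTR;
    }
    return TOY_EMIT_EXIT_TB;
}
```

Since the context is set up fresh for each TB, the explicit
"gen_jr = false" in the DISAS_JUMP case would no longer be needed.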

>
>  /* FIXME:  These should be removed.  */
>  static TCGv_i32 cpu_F0s, cpu_F1s;
> @@ -221,6 +222,7 @@ static void store_reg(DisasContext *s, int reg, TCGv_i32 var)
>           */
>          tcg_gen_andi_i32(var, var, s->thumb ? ~1 : ~3);
>          s->is_jmp = DISAS_JUMP;
> +        gen_jr = true;
>      }
>      tcg_gen_mov_i32(cpu_R[reg], var);
>      tcg_temp_free_i32(var);
> @@ -893,6 +895,7 @@ static inline void gen_bx_im(DisasContext *s, uint32_t addr)
>          tcg_temp_free_i32(tmp);
>      }
>      tcg_gen_movi_i32(cpu_R[15], addr & ~1);
> +    gen_jr = true;
>  }
>
>  /* Set PC and Thumb state from var.  var is marked as dead.  */
> @@ -902,6 +905,7 @@ static inline void gen_bx(DisasContext *s, TCGv_i32 var)
>      tcg_gen_andi_i32(cpu_R[15], var, ~1);
>      tcg_gen_andi_i32(var, var, 1);
>      store_cpu_field(var, thumb);
> +    gen_jr = true;
>  }
>
>  /* Variant of store_reg which uses branch&exchange logic when storing
> @@ -12034,6 +12038,20 @@ void gen_intermediate_code(CPUARMState *env, TranslationBlock *tb)
>              gen_set_pc_im(dc, dc->pc);
>              /* fall through */
>          case DISAS_JUMP:
> +            /*
> +             * gen_jr is not set on every DISAS_JUMP because for some of those
> +             * we do want to exit to the exec loop.
> +             */
> +            if (gen_jr) {
> +                TCGv addr = tcg_temp_new();
> +
> +                gen_jr = false;
> +                tcg_gen_extu_i32_tl(addr, cpu_R[15]);
> +                tcg_gen_lookup_and_goto_ptr(addr);
> +                tcg_temp_free(addr);
> +                break;
> +            }
> +            /* fall through */
>          default:
>              /* indicate that the hash table must be used to find the next TB */
>              tcg_gen_exit_tb(0);


--
Alex Bennée

* Re: [Qemu-devel] [PATCH v4 07/11] target/arm: optimize indirect branches
  2017-04-27  9:36   ` Aurelien Jarno
@ 2017-04-27  9:42     ` Richard Henderson
  2017-04-27 10:15       ` Aurelien Jarno
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2017-04-27  9:42 UTC (permalink / raw)
  To: Aurelien Jarno, Emilio G. Cota
  Cc: qemu-devel, Paolo Bonzini, Peter Crosthwaite, Peter Maydell,
	Eduardo Habkost, Andrzej Zaborowski, Alexander Graf, Stefan Weil,
	qemu-arm, alex.bennee, Pranith Kumar

On 04/27/2017 11:36 AM, Aurelien Jarno wrote:
> On 2017-04-26 23:29, Emilio G. Cota wrote:
>> Speed up indirect branches by jumping to the target if it is valid.
>>
>> Softmmu measurements (see later commit for user-mode results):
>>
>> Note: baseline (i.e. speedup == 1x) is QEMU v2.9.0.
>>
>> - Impact on Boot time
>>
>> | setup  | ARM debian jessie boot+shutdown time | stddev |
>> |--------+--------------------------------------+--------|
>> | v2.9.0 |                                 8.84 |   0.07 |
>> | +cross |                                 8.85 |   0.03 |
>> | +jr    |                                 8.83 |   0.06 |
>>
>> -                            NBench, arm-softmmu (debian jessie guest). Host: Intel i7-4790K @ 4.00GHz
>>
>>    1.3x +-+-------------------------------------------------------------------------------------------------------------+-+
>>         |                                                                                                                 |
>>         |   cross                                                          ####                                           |
>>   1.25x +cross+jr..........................................................#++#.........................................+-+
>>         |                                                        ####      #  #                                           |
>>         |                                                     +++#  #      #  #                                           |
>>         |                                      +++            ****  #      #  #                                           |
>>    1.2x +-+...................................####............*..*..#......#..#.........................................+-+
>>         |                                  ****  #            *  *  #      #  #     ####                                  |
>>         |                                  *  *  #            *  *  #      #  #     #  #                                  |
>>   1.15x +-+................................*..*..#............*..*..#......#..#.....#..#................................+-+
>>         |                                  *  *  #            *  *  #      #  #     #  #                                  |
>>         |                                  *  *  #      ####  *  *  #      #  #     #  #                                  |
>>         |                                  *  *  #      #  #  *  *  #      #  #     #  #                         ####     |
>>    1.1x +-+................................*..*..#......#..#..*..*..#......#..#.....#..#.........................#..#...+-+
>>         |                                  *  *  #      #  #  *  *  #      #  #     #  #                         #  #     |
>>         |                                  *  *  #      #  #  *  *  #      #  #     #  #                         #  #     |
>>   1.05x +-+..........................####..*..*..#......#..#..*..*..#......#..#.....#..#......+++............*****..#...+-+
>>         |                        *****  #  *  *  #      #  #  *  *  #  *****  #     #  #   +++ |    ****###  *   *  #     |
>>         |                        *+++*  #  *  *  #      #  #  *  *  #  *+++*  #  ****  #  *****###  *  *  #  *   *  #     |
>>         |     *****###  +++####  *   *  #  *  *  #  *****  #  *  *  #  *   *  #  *  *  #  * | *++#  *  *  #  *   *  #     |
>>      1x +-++-+*+++*-+#++****++#++*+-+*++#+-*++*++#-+*+++*-+#++*++*++#++*+-+*++#+-*++*++#-+*+++*-+#++*++*++#++*+-+*++#+-++-+
>>         |     *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
>>         |     *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
>>   0.95x +-+---*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###---+-+
>>         ASSIGNMENT BITFIELD   FOURFP EMULATION   HUFFMAN   LU DECOMPOSITIONEURAL NNUMERIC SOSTRING SORT     hmean
>>    png: http://imgur.com/eOLmZNR
>>
>> NB. 'cross' represents the previous commit.
>>
>> Signed-off-by: Emilio G. Cota <cota@braap.org>
>> ---
>>   target/arm/translate.c | 18 ++++++++++++++++++
>>   1 file changed, 18 insertions(+)
>>
>> diff --git a/target/arm/translate.c b/target/arm/translate.c
>> index 02cad96..d46a576 100644
>> --- a/target/arm/translate.c
>> +++ b/target/arm/translate.c
>> @@ -65,6 +65,7 @@ static TCGv_i32 cpu_R[16];
>>   TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
>>   TCGv_i64 cpu_exclusive_addr;
>>   TCGv_i64 cpu_exclusive_val;
>> +static bool gen_jr;
>>   
>>   /* FIXME:  These should be removed.  */
>>   static TCGv_i32 cpu_F0s, cpu_F1s;
>> @@ -221,6 +222,7 @@ static void store_reg(DisasContext *s, int reg, TCGv_i32 var)
>>            */
>>           tcg_gen_andi_i32(var, var, s->thumb ? ~1 : ~3);
>>           s->is_jmp = DISAS_JUMP;
>> +        gen_jr = true;
>>       }
>>       tcg_gen_mov_i32(cpu_R[reg], var);
>>       tcg_temp_free_i32(var);
>> @@ -893,6 +895,7 @@ static inline void gen_bx_im(DisasContext *s, uint32_t addr)
>>           tcg_temp_free_i32(tmp);
>>       }
>>       tcg_gen_movi_i32(cpu_R[15], addr & ~1);
>> +    gen_jr = true;
>>   }
>>   
>>   /* Set PC and Thumb state from var.  var is marked as dead.  */
>> @@ -902,6 +905,7 @@ static inline void gen_bx(DisasContext *s, TCGv_i32 var)
>>       tcg_gen_andi_i32(cpu_R[15], var, ~1);
>>       tcg_gen_andi_i32(var, var, 1);
>>       store_cpu_field(var, thumb);
>> +    gen_jr = true;
>>   }
>>   
>>   /* Variant of store_reg which uses branch&exchange logic when storing
>> @@ -12034,6 +12038,20 @@ void gen_intermediate_code(CPUARMState *env, TranslationBlock *tb)
>>               gen_set_pc_im(dc, dc->pc);
>>               /* fall through */
>>           case DISAS_JUMP:
>> +            /*
>> +             * gen_jr is not set on every DISAS_JUMP because for some of those
>> +             * we do want to exit to the exec loop.
>> +             */
> 
> What would be the reason for that? IIUC the lookup_tb_ptr helper calls
> cpu_get_tb_cpu_state to get the TB flags to look up from the current
> CPU state. This means it can handle, for example, a transition from
> user to privileged mode. Also, the exit_req flag or its new equivalent
> is tested at the beginning of each TB in case there is an interrupt.
> 
> It therefore seems to me that we can replace all calls to
> tcg_gen_exit_tb with tcg_gen_lookup_and_goto_ptr, passing the program
> counter as the argument.
> 

That was my thought too.  The only examples I saw while converting target/alpha
where I really want exit_tb are when I have just flushed the TBs and know
that the lookup will definitely fail.

That said, elsewhere (in the v3 thread?) Emilio mentioned that ARM has some 
cases where it wants interrupts (or something) to be recognized right away, 
which would also seem to call for an exit_tb.

My thought is that these cases should be specifically noted with a new 
DISAS_EXIT or something like that, rather than overloading DISAS_JUMP with a 
gen_jr flag.
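
Something along these lines, as a toy sketch only. DISAS_EXIT does not
exist yet, and the Sk* names below are invented for illustration; the
idea is that the "must return to the exec loop" cases are the ones that
get marked, and plain jumps chain by default:

```c
/* Hypothetical end-of-TB states; SK_DISAS_EXIT is the proposed one. */
typedef enum {
    SK_DISAS_NEXT,   /* fell off the end of the block */
    SK_DISAS_JUMP,   /* PC was written; chaining via lookup is fine */
    SK_DISAS_EXIT    /* PC was written; must return to the exec loop */
} SkDisasState;

/* What the translator would emit for each state in this toy model. */
typedef enum {
    SK_EMIT_EXIT_TB,
    SK_EMIT_LOOKUP_AND_GOTO_PTR
} SkEmit;

SkEmit sk_tb_epilogue(SkDisasState st)
{
    switch (st) {
    case SK_DISAS_JUMP:
        /* No side flag needed: every DISAS_JUMP may chain. */
        return SK_EMIT_LOOKUP_AND_GOTO_PTR;
    case SK_DISAS_EXIT:
    default:
        /* Explicitly marked (or unknown) cases go to the exec loop. */
        return SK_EMIT_EXIT_TB;
    }
}
```

Call sites that need interrupts recognized right away would then set
the exit state themselves instead of relying on gen_jr staying false.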


r~

* Re: [Qemu-devel] [PATCH v4 07/11] target/arm: optimize indirect branches
  2017-04-27  9:42     ` Richard Henderson
@ 2017-04-27 10:15       ` Aurelien Jarno
  0 siblings, 0 replies; 20+ messages in thread
From: Aurelien Jarno @ 2017-04-27 10:15 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Emilio G. Cota, qemu-devel, Paolo Bonzini, Peter Crosthwaite,
	Peter Maydell, Eduardo Habkost, Andrzej Zaborowski,
	Alexander Graf, Stefan Weil, qemu-arm, alex.bennee,
	Pranith Kumar

On 2017-04-27 11:42, Richard Henderson wrote:
> On 04/27/2017 11:36 AM, Aurelien Jarno wrote:
> > On 2017-04-26 23:29, Emilio G. Cota wrote:
> > > Speed up indirect branches by jumping to the target if it is valid.
> > > 
> > > Softmmu measurements (see later commit for user-mode results):
> > > 
> > > Note: baseline (i.e. speedup == 1x) is QEMU v2.9.0.
> > > 
> > > - Impact on Boot time
> > > 
> > > | setup  | ARM debian jessie boot+shutdown time | stddev |
> > > |--------+--------------------------------------+--------|
> > > | v2.9.0 |                                 8.84 |   0.07 |
> > > | +cross |                                 8.85 |   0.03 |
> > > | +jr    |                                 8.83 |   0.06 |
> > > 
> > > -                            NBench, arm-softmmu (debian jessie guest). Host: Intel i7-4790K @ 4.00GHz
> > > 
> > >    1.3x +-+-------------------------------------------------------------------------------------------------------------+-+
> > >         |                                                                                                                 |
> > >         |   cross                                                          ####                                           |
> > >   1.25x +cross+jr..........................................................#++#.........................................+-+
> > >         |                                                        ####      #  #                                           |
> > >         |                                                     +++#  #      #  #                                           |
> > >         |                                      +++            ****  #      #  #                                           |
> > >    1.2x +-+...................................####............*..*..#......#..#.........................................+-+
> > >         |                                  ****  #            *  *  #      #  #     ####                                  |
> > >         |                                  *  *  #            *  *  #      #  #     #  #                                  |
> > >   1.15x +-+................................*..*..#............*..*..#......#..#.....#..#................................+-+
> > >         |                                  *  *  #            *  *  #      #  #     #  #                                  |
> > >         |                                  *  *  #      ####  *  *  #      #  #     #  #                                  |
> > >         |                                  *  *  #      #  #  *  *  #      #  #     #  #                         ####     |
> > >    1.1x +-+................................*..*..#......#..#..*..*..#......#..#.....#..#.........................#..#...+-+
> > >         |                                  *  *  #      #  #  *  *  #      #  #     #  #                         #  #     |
> > >         |                                  *  *  #      #  #  *  *  #      #  #     #  #                         #  #     |
> > >   1.05x +-+..........................####..*..*..#......#..#..*..*..#......#..#.....#..#......+++............*****..#...+-+
> > >         |                        *****  #  *  *  #      #  #  *  *  #  *****  #     #  #   +++ |    ****###  *   *  #     |
> > >         |                        *+++*  #  *  *  #      #  #  *  *  #  *+++*  #  ****  #  *****###  *  *  #  *   *  #     |
> > >         |     *****###  +++####  *   *  #  *  *  #  *****  #  *  *  #  *   *  #  *  *  #  * | *++#  *  *  #  *   *  #     |
> > >      1x +-++-+*+++*-+#++****++#++*+-+*++#+-*++*++#-+*+++*-+#++*++*++#++*+-+*++#+-*++*++#-+*+++*-+#++*++*++#++*+-+*++#+-++-+
> > >         |     *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
> > >         |     *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #  *  *  #  *   *  #     |
> > >   0.95x +-+---*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###--****###--*****###---+-+
> > >         ASSIGNMENT BITFIELD   FOURFP EMULATION   HUFFMAN   LU DECOMPOSITIONEURAL NNUMERIC SOSTRING SORT     hmean
> > >    png: http://imgur.com/eOLmZNR
> > > 
> > > NB. 'cross' represents the previous commit.
> > > 
> > > Signed-off-by: Emilio G. Cota <cota@braap.org>
> > > ---
> > >   target/arm/translate.c | 18 ++++++++++++++++++
> > >   1 file changed, 18 insertions(+)
> > > 
> > > diff --git a/target/arm/translate.c b/target/arm/translate.c
> > > index 02cad96..d46a576 100644
> > > --- a/target/arm/translate.c
> > > +++ b/target/arm/translate.c
> > > @@ -65,6 +65,7 @@ static TCGv_i32 cpu_R[16];
> > >   TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
> > >   TCGv_i64 cpu_exclusive_addr;
> > >   TCGv_i64 cpu_exclusive_val;
> > > +static bool gen_jr;
> > >   /* FIXME:  These should be removed.  */
> > >   static TCGv_i32 cpu_F0s, cpu_F1s;
> > > @@ -221,6 +222,7 @@ static void store_reg(DisasContext *s, int reg, TCGv_i32 var)
> > >            */
> > >           tcg_gen_andi_i32(var, var, s->thumb ? ~1 : ~3);
> > >           s->is_jmp = DISAS_JUMP;
> > > +        gen_jr = true;
> > >       }
> > >       tcg_gen_mov_i32(cpu_R[reg], var);
> > >       tcg_temp_free_i32(var);
> > > @@ -893,6 +895,7 @@ static inline void gen_bx_im(DisasContext *s, uint32_t addr)
> > >           tcg_temp_free_i32(tmp);
> > >       }
> > >       tcg_gen_movi_i32(cpu_R[15], addr & ~1);
> > > +    gen_jr = true;
> > >   }
> > >   /* Set PC and Thumb state from var.  var is marked as dead.  */
> > > @@ -902,6 +905,7 @@ static inline void gen_bx(DisasContext *s, TCGv_i32 var)
> > >       tcg_gen_andi_i32(cpu_R[15], var, ~1);
> > >       tcg_gen_andi_i32(var, var, 1);
> > >       store_cpu_field(var, thumb);
> > > +    gen_jr = true;
> > >   }
> > >   /* Variant of store_reg which uses branch&exchange logic when storing
> > > @@ -12034,6 +12038,20 @@ void gen_intermediate_code(CPUARMState *env, TranslationBlock *tb)
> > >               gen_set_pc_im(dc, dc->pc);
> > >               /* fall through */
> > >           case DISAS_JUMP:
> > > +            /*
> > > +             * gen_jr is not set on every DISAS_JUMP because for some of those
> > > +             * we do want to exit to the exec loop.
> > > +             */
> > 
> > What would be the reason for that? IIUC the lookup_tb_ptr helper calls
> > cpu_get_tb_cpu_state to get the TB flags to look up from the current
> > CPU state. This means it can handle, for example, a transition from
> > user to privileged mode. Also, the exit_req flag or its new equivalent
> > is tested at the beginning of each TB in case there is an interrupt.
> > 
> > It therefore seems to me that we can replace all calls to
> > tcg_gen_exit_tb with tcg_gen_lookup_and_goto_ptr, passing the program
> > counter as the argument.
> > 
> 
> That was my thought too.  The only examples I saw while converting
> target/alpha where I really want exit_tb are when I have just flushed the
> TBs and know that the lookup will definitely fail.

Indeed that's suboptimal, but that should still work.

> That said, elsewhere (in the v3 thread?) Emilio mentioned that ARM has some
> cases where it wants interrupts (or something) to be recognized right away,
> which would also seem to call for an exit_tb.

Thanks for the pointer, I will answer the mail.

> My thought is that these cases should be specifically noted with a new
> DISAS_EXIT or something like that, rather than overloading DISAS_JUMP with a
> gen_jr flag.

I agree.

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@aurel32.net                 http://www.aurel32.net

end of thread, other threads:[~2017-04-27 10:16 UTC | newest]

Thread overview: 20+ messages
2017-04-27  3:29 [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 01/11] exec-all: export tb_htable_lookup Emilio G. Cota
2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 02/11] tcg-runtime: add lookup_tb_ptr helper Emilio G. Cota
2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 03/11] tcg: introduce goto_ptr opcode Emilio G. Cota
2017-04-27  8:09   ` Richard Henderson
2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 04/11] tcg: export tcg_gen_lookup_and_goto_ptr Emilio G. Cota
2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 05/11] tcg/i386: implement goto_ptr op Emilio G. Cota
2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 06/11] target/arm: optimize cross-page direct jumps in softmmu Emilio G. Cota
2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 07/11] target/arm: optimize indirect branches Emilio G. Cota
2017-04-27  9:36   ` Aurelien Jarno
2017-04-27  9:42     ` Richard Henderson
2017-04-27 10:15       ` Aurelien Jarno
2017-04-27  9:41   ` Alex Bennée
2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 08/11] target/i386: introduce gen_jr helper to generate lookup_and_goto_ptr Emilio G. Cota
2017-04-27  8:12   ` Richard Henderson
2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 09/11] target/i386: optimize cross-page direct jumps in softmmu Emilio G. Cota
2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 10/11] target/i386: optimize indirect branches Emilio G. Cota
2017-04-27  3:29 ` [Qemu-devel] [PATCH v4 11/11] tb-hash: improve tb_jmp_cache hash function in user mode Emilio G. Cota
2017-04-27  3:32 ` [Qemu-devel] [PATCH v4 00/11] TCG optimizations for 2.10 Emilio G. Cota
2017-04-27  9:39 ` Aurelien Jarno
