All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10
@ 2017-04-12  1:17 Emilio G. Cota
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 01/10] exec-all: add tb_from_jmp_cache Emilio G. Cota
                   ` (10 more replies)
  0 siblings, 11 replies; 21+ messages in thread
From: Emilio G. Cota @ 2017-04-12  1:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Claudio Fontana,
	Andrzej Zaborowski, Aurelien Jarno, Alexander Graf, Stefan Weil,
	qemu-arm, alex.bennee, Pranith Kumar

Hi all,

This series is aimed at 2.10 or beyond. Its goal is to improve
TCG performance by optimizing:

1- Cross-page direct jumps (softmmu only, obviously). Patches 1-4.
2- Indirect branches (softmmu and user-mode). Patches 5-9.
3- tb_jmp_cache hashing in user-mode. Patch 10.

I decided to work on this after reading this paper [1] (code at [2]),
which among other optimizations it proposes solutions for 1 and 2.
I followed the same overall scheme they follow, that is to use helpers
to check whether the target vaddr is valid, and if so, jump to its
corresponding translated code (host address) without having to go back
to the exec loop. My implementation differs from that in the paper
in that it uses tb_jmp_cache instead of adding more caches,
which is simpler and probably more resilient in environments
where TLB invalidations are frequent (in the paper they acknowledge
that they limited background processes to a minimum, which isn't
realistic).

These changes require modifications on the targets and, for optimization
number 2, a new TCG opcode to jump to a host address contained in a register.

For now I only implemented this for the i386 and arm targets, and
the i386 TCG backend. Other targets/backends can easily opt-in.

The 3rd optimization is implemented in the last patch: it improves
tb_jmp_cache hashing for user-mode by removing the requirement of
being able to clear parts of the cache given a page number, since this
requirement only applies to softmmu.

The series applies cleanly on top of 95b31d709ba34.

The commit logs include many measurements, performed using SPECint06 and
NBench from dbt-bench[3].

Feedback welcome! Thanks,

		Emilio

[1] "Optimizing Control Transfer and Memory Virtualization
in Full System Emulators", Ding-Yong Hong, Chun-Chen Hsu, Cheng-Yi Chou,
Wei-Chung Hsu, Pangfeng Liu, Jan-Jan Wu. ACM TACO, Jan. 2016.
  http://www.iis.sinica.edu.tw/page/library/TechReport/tr2015/tr15002.pdf

[2] https://github.com/tkhsu/quick-android-emulator/tree/quick-qemu

[3] https://github.com/cota/dbt-bench

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 01/10] exec-all: add tb_from_jmp_cache
  2017-04-12  1:17 [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Emilio G. Cota
@ 2017-04-12  1:17 ` Emilio G. Cota
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 02/10] exec-all: inline tb_from_jmp_cache Emilio G. Cota
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2017-04-12  1:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Claudio Fontana,
	Andrzej Zaborowski, Aurelien Jarno, Alexander Graf, Stefan Weil,
	qemu-arm, alex.bennee, Pranith Kumar

This paves the way for upcoming changes.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c              | 19 +++++++++++++++++++
 include/exec/exec-all.h |  2 +-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 748cb66..ce9750a 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -309,6 +309,25 @@ static bool tb_cmp(const void *p, const void *d)
     return false;
 }
 
+TranslationBlock *tb_from_jmp_cache(CPUArchState *env, target_ulong vaddr)
+{
+    CPUState *cpu = ENV_GET_CPU(env);
+    TranslationBlock *tb;
+    target_ulong cs_base, pc;
+    uint32_t flags;
+
+    if (unlikely(atomic_read(&cpu->exit_request))) {
+        return NULL;
+    }
+    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
+    tb = atomic_rcu_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(vaddr)]);
+    if (likely(tb && tb->pc == vaddr && tb->cs_base == cs_base &&
+               tb->flags == flags)) {
+        return tb;
+    }
+    return NULL;
+}
+
 static TranslationBlock *tb_htable_lookup(CPUState *cpu,
                                           target_ulong pc,
                                           target_ulong cs_base,
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index bcde1e6..18b80bc 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -56,7 +56,6 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
                               target_ulong pc, target_ulong cs_base,
                               uint32_t flags,
                               int cflags);
-
 void QEMU_NORETURN cpu_loop_exit(CPUState *cpu);
 void QEMU_NORETURN cpu_loop_exit_restore(CPUState *cpu, uintptr_t pc);
 void QEMU_NORETURN cpu_loop_exit_atomic(CPUState *cpu, uintptr_t pc);
@@ -368,6 +367,7 @@ struct TranslationBlock {
 void tb_free(TranslationBlock *tb);
 void tb_flush(CPUState *cpu);
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
+TranslationBlock *tb_from_jmp_cache(CPUArchState *env, target_ulong vaddr);
 
 #if defined(USE_DIRECT_JUMP)
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 02/10] exec-all: inline tb_from_jmp_cache
  2017-04-12  1:17 [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Emilio G. Cota
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 01/10] exec-all: add tb_from_jmp_cache Emilio G. Cota
@ 2017-04-12  1:17 ` Emilio G. Cota
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 03/10] target/arm: optimize cross-page block chaining in softmmu Emilio G. Cota
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2017-04-12  1:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Claudio Fontana,
	Andrzej Zaborowski, Aurelien Jarno, Alexander Graf, Stefan Weil,
	qemu-arm, alex.bennee, Pranith Kumar

The inline improves performance, as shown in subsequent commits' logs.

This commit is kept separate to ease review, since the inclusion
of tb-hash.h might be controversial. The problem here, which was
introduced before this commit, is that tb_hash_func() depends on
page_addr_t: this defeats the original purpose of tb-hash.h,
which was to be self-contained and CPU-agnostic.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c              | 19 -------------------
 include/exec/exec-all.h | 24 +++++++++++++++++++++++-
 2 files changed, 23 insertions(+), 20 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index ce9750a..748cb66 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -309,25 +309,6 @@ static bool tb_cmp(const void *p, const void *d)
     return false;
 }
 
-TranslationBlock *tb_from_jmp_cache(CPUArchState *env, target_ulong vaddr)
-{
-    CPUState *cpu = ENV_GET_CPU(env);
-    TranslationBlock *tb;
-    target_ulong cs_base, pc;
-    uint32_t flags;
-
-    if (unlikely(atomic_read(&cpu->exit_request))) {
-        return NULL;
-    }
-    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
-    tb = atomic_rcu_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(vaddr)]);
-    if (likely(tb && tb->pc == vaddr && tb->cs_base == cs_base &&
-               tb->flags == flags)) {
-        return tb;
-    }
-    return NULL;
-}
-
 static TranslationBlock *tb_htable_lookup(CPUState *cpu,
                                           target_ulong pc,
                                           target_ulong cs_base,
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 18b80bc..bd76987 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -367,7 +367,29 @@ struct TranslationBlock {
 void tb_free(TranslationBlock *tb);
 void tb_flush(CPUState *cpu);
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
-TranslationBlock *tb_from_jmp_cache(CPUArchState *env, target_ulong vaddr);
+
+/* tb_hash_func() in tb-hash.h needs tb_page_addr_t, defined above */
+#include "tb-hash.h"
+
+static inline
+TranslationBlock *tb_from_jmp_cache(CPUArchState *env, target_ulong vaddr)
+{
+    CPUState *cpu = ENV_GET_CPU(env);
+    TranslationBlock *tb;
+    target_ulong cs_base, pc;
+    uint32_t flags;
+
+    if (unlikely(atomic_read(&cpu->exit_request))) {
+        return NULL;
+    }
+    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
+    tb = atomic_rcu_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(vaddr)]);
+    if (likely(tb && tb->pc == vaddr && tb->cs_base == cs_base &&
+               tb->flags == flags)) {
+        return tb;
+    }
+    return NULL;
+}
 
 #if defined(USE_DIRECT_JUMP)
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 03/10] target/arm: optimize cross-page block chaining in softmmu
  2017-04-12  1:17 [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Emilio G. Cota
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 01/10] exec-all: add tb_from_jmp_cache Emilio G. Cota
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 02/10] exec-all: inline tb_from_jmp_cache Emilio G. Cota
@ 2017-04-12  1:17 ` Emilio G. Cota
  2017-04-15 11:24   ` Richard Henderson
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 04/10] target/i386: " Emilio G. Cota
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 21+ messages in thread
From: Emilio G. Cota @ 2017-04-12  1:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Claudio Fontana,
	Andrzej Zaborowski, Aurelien Jarno, Alexander Graf, Stefan Weil,
	qemu-arm, alex.bennee, Pranith Kumar

Instead of unconditionally exiting to the exec loop, add a helper to
check whether the target TB is valid. As long as the hit rate in
tb_jmp_cache remains high, this improves performance.

Measurements:

- Boot time of ARM debian jessie on Intel host:

| setup              | ARM debian boot+shutdown time | stddev |
|--------------------+-------------------------------+--------|
| master             |                  10.050247057 | 0.0361 |
| +cross             |                  10.311265443 | 0.0721 |

That is a 2.58% slowdown when booting. This is reasonable given that
tb_jmp_cache's hit rate when booting is expected to be low.

-                NBench, arm-softmmu. Host: Intel i7-4790K @ 4.00GHz
                        (y axis: Speedup over 95b31d70)

    1.3x+-+--------------------------------------------------------------+-+
        |                                           cross+noinline $$$     |
        |                                           cross+inline   %%%     |
        |                   $$$%%                                          |
    1.2x+-+.................$.$.%.......$$$..............................+-+
        |                   $ $ %       $ $%                               |
        |                   $ $ %       $ $%                               |
    1.1x+-+.................$.$.%.......$.$%.............................+-+
        |             $$$%% $ $ %       $ $%                               |
        |             $ $ % $ $ %       $ $% $$$%%             $$$%% $$$%% |
        | $$$%% $$$%% $ $ % $ $ % $$$%% $ $% $ $ %   %%%       $ $ % $ $ % |
      1x+-$.$B%R$R$A%G$A$H%T$M$_%P$L$i%l$n$%.$.$.%...%.%.$$$%%.$.$.%.$.$.%-+
        | $ $ % $ $ % $ $ % $ $ % $ $ % $ $% $ $ %   % % $ $ % $ $ % $ $ % |
        | $ $ % $ $ % $ $ % $ $ % $ $ % $ $% $ $ %   % % $ $ % $ $ % $ $ % |
    0.9x+-$.$.%.$.$.%.$.$.%.$.$.%.$.$.%.$.$%.$.$.%...%.%.$.$.%.$.$.%.$.$.%-+
        | $ $ % $ $ % $ $ % $ $ % $ $ % $ $% $ $ %   % % $ $ % $ $ % $ $ % |
        | $ $ % $ $ % $ $ % $ $ % $ $ % $ $% $ $ % $$$ % $ $ % $ $ % $ $ % |
        | $ $ % $ $ % $ $ % $ $ % $ $ % $ $% $ $ % $ $ % $ $ % $ $ % $ $ % |
    0.8x+-$$$%%-$$$%%-$$$%%-$$$%%-$$$%%-$$$%-$$$%%-$$$%%-$$$%%-$$$%%-$$$%%-+
       ASSIGNMBITFIELFOUFP_EMULATHUFFMALU_DECOMPNEURANUMERICSTRING_SOhmean

  png: http://imgur.com/1rmYSaF

That is, a 4.04% hmean perf improvement over master with tb_from_jmp_cache
not inlined, and a 5.82% hmean perf improvement over master with tb_from_jmp_cache
inlined (i.e. this commit). The largest improvement is 21% for the FP_EMULATION
benchmark.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/arm/helper.c    |  5 +++++
 target/arm/helper.h    |  2 ++
 target/arm/translate.c | 12 ++++++++++++
 3 files changed, 19 insertions(+)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index 8cb7a94..10b8807 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -9922,3 +9922,8 @@ uint32_t HELPER(crc32c)(uint32_t acc, uint32_t val, uint32_t bytes)
     /* Linux crc32c converts the output to one's complement.  */
     return crc32c(acc, buf, bytes) ^ 0xffffffff;
 }
+
+uint32_t HELPER(cross_page_check)(CPUARMState *env, target_ulong vaddr)
+{
+    return !!tb_from_jmp_cache(env, vaddr);
+}
diff --git a/target/arm/helper.h b/target/arm/helper.h
index df86bf7..d4b779b 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -1,6 +1,8 @@
 DEF_HELPER_FLAGS_1(sxtb16, TCG_CALL_NO_RWG_SE, i32, i32)
 DEF_HELPER_FLAGS_1(uxtb16, TCG_CALL_NO_RWG_SE, i32, i32)
 
+DEF_HELPER_2(cross_page_check, i32, env, tl)
+
 DEF_HELPER_3(add_setq, i32, env, i32, i32)
 DEF_HELPER_3(add_saturate, i32, env, i32, i32)
 DEF_HELPER_3(sub_saturate, i32, env, i32, i32)
diff --git a/target/arm/translate.c b/target/arm/translate.c
index e32e38c..ce97d0c 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -4085,6 +4085,18 @@ static inline void gen_goto_tb(DisasContext *s, int n, target_ulong dest)
         gen_set_pc_im(s, dest);
         tcg_gen_exit_tb((uintptr_t)s->tb + n);
     } else {
+        TCGv vaddr = tcg_const_tl(dest);
+        TCGv_i32 valid = tcg_temp_new_i32();
+        TCGLabel *label = gen_new_label();
+
+        gen_helper_cross_page_check(valid, cpu_env, vaddr);
+        tcg_temp_free(vaddr);
+        tcg_gen_brcondi_i32(TCG_COND_EQ, valid, 0, label);
+        tcg_temp_free_i32(valid);
+        tcg_gen_goto_tb(n);
+        gen_set_pc_im(s, dest);
+        tcg_gen_exit_tb((uintptr_t)s->tb + n);
+        gen_set_label(label);
         gen_set_pc_im(s, dest);
         tcg_gen_exit_tb(0);
     }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 04/10] target/i386: optimize cross-page block chaining in softmmu
  2017-04-12  1:17 [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Emilio G. Cota
                   ` (2 preceding siblings ...)
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 03/10] target/arm: optimize cross-page block chaining in softmmu Emilio G. Cota
@ 2017-04-12  1:17 ` Emilio G. Cota
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 05/10] tcg: add jr opcode Emilio G. Cota
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2017-04-12  1:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Claudio Fontana,
	Andrzej Zaborowski, Aurelien Jarno, Alexander Graf, Stefan Weil,
	qemu-arm, alex.bennee, Pranith Kumar

Instead of unconditionally exiting to the exec loop, add a helper to
check whether the target TB is valid. As long as the hit rate in
tb_jmp_cache remains high, this improves performance.

Measurements:

-       specINT 2006 (test set), x86_64-softmmu. Host: Intel i7-4790K @ 4.00GHz
                          Y axis: Speedup over 95b31d70

     1.3x+-+-------------------------------------------------------------+-+
         |           cross $$                                              |
    1.25x+-+.............................................................+-+
         |                                                                 |
     1.2x+-+.............................................................+-+
         |                                                        :        |
         |                                                        :        |
    1.15x+-+.............................................................+-+
         |           $$$$ $$$$                     +++            :        |
     1.1x+-+.........$..$.$..$...........................................+-+
         |           $  $ $  $                     $$$           $$$$      |
    1.05x+-+.........$..$.$..$.....................$.$.$$$$......$..$....+-+
         |           $  $ $  $      +++  +++  +++  $+$ $++$  +++ $: $ $$$$ |
         |       +++ $  $ $  $ +++  $$$   :    :   $ $ $  $ $$$$ $: $ $++$ |
       1x+-$$$$G$$$$_$EM$_$ro$s$$$..$.$.......$$$..$.$.$..$.$..$.$..$.$..$-+
         | $++$ $ :$ $  $ $  $ $ $  $ $   :   $+$  $ $ $  $ $++$ $: $ $  $ |
    0.95x+-$..$.$..$.$..$.$..$.$.$..$.$..$$$..$.$..$.$.$..$.$..$.$..$.$..$-+
         | $  $ $  $ $  $ $  $ $ $  $ $  $:$  $ $  $ $ $  $ $  $ $  $ $  $ |
     0.9x+-$$$$-$$$$-$$$$-$$$$-$$$--$$$--$$$--$$$--$$$-$$$$-$$$$-$$$$-$$$$-+
           astarbzip2gcc gobmh264rehmlibquantumcfomneperlbensjxalancbhmean
  png: http://imgur.com/cwRnmCi

That is, a hmean gain of 2.6%.

-      specINT 2006 (train set), x86_64-softmmu. Host: Intel i7-4790K @ 4.00GHz
                          Y axis: Speedup over 95b31d70

    1.25x+-+-------------------------------------------------------------+-+
         |                cross $$                                         |
         |                                                                 |
     1.2x+-+.............................................................+-+
         |                                          :            +++       |
    1.15x+-+.............................................................+-+
         |            :                            $$$ $$$$      $$$$      |
         |           $$$$ +++                      $:$ $++$  +++ $: $      |
     1.1x+-+.........$..$.$$$$.....................$.$.$..$......$..$....+-+
         |       +++ $++$ $++$ +++   :             $ $ $  $   :  $++$ +++  |
    1.05x+-+....$$$$.$..$.$..$......$$$............$.$.$..$.$$$$.$..$.$$$$-+
         |      $++$ $  $ $  $ $$$  $:$            $ $ $  $ $ :$ $  $ $  $ |
         |      $  $ $  $ $  $ $:$  $+$  +++  +++  $ $ $  $ $ :$ $  $ $  $ |
       1x+-$$$$G$AP$_$EM$_$ro$s$i$li$e$..$$$.......$.$.$..$.$..$.$..$.$..$-+
         | $++$ $  $ $  $ $  $ $+$  $ $  $:$  $$$  $ $ $  $ $  $ $  $ $  $ |
    0.95x+-$..$.$..$.$..$.$..$.$.$..$.$..$.$..$.$..$.$.$..$.$..$.$..$.$..$-+
         | $  $ $  $ $  $ $  $ $ $  $ $  $ $  $+$  $ $ $  $ $  $ $  $ $  $ |
         | $  $ $  $ $  $ $  $ $ $  $ $  $ $  $ $  $ $ $  $ $  $ $  $ $  $ |
     0.9x+-$$$$-$$$$-$$$$-$$$$-$$$--$$$--$$$--$$$--$$$-$$$$-$$$$-$$$$-$$$$-+
           astarbzip2gcc gobmh264rehmlibquantumcfomneperlbensjxalancbhmean
  png: http://imgur.com/0CbG7dD

This is the larger "train" set. We get a hmean improvement of 6.1%.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/i386/helper.h      |  2 ++
 target/i386/misc_helper.c |  5 +++++
 target/i386/translate.c   | 14 +++++++++++++-
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/target/i386/helper.h b/target/i386/helper.h
index 6fb8fb9..dceb343 100644
--- a/target/i386/helper.h
+++ b/target/i386/helper.h
@@ -1,6 +1,8 @@
 DEF_HELPER_FLAGS_4(cc_compute_all, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
 DEF_HELPER_FLAGS_4(cc_compute_c, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
 
+DEF_HELPER_2(cross_page_check, i32, env, tl)
+
 DEF_HELPER_3(write_eflags, void, env, tl, i32)
 DEF_HELPER_1(read_eflags, tl, env)
 DEF_HELPER_2(divb_AL, void, env, tl)
diff --git a/target/i386/misc_helper.c b/target/i386/misc_helper.c
index ca2ea09..a41daed 100644
--- a/target/i386/misc_helper.c
+++ b/target/i386/misc_helper.c
@@ -637,3 +637,8 @@ void helper_wrpkru(CPUX86State *env, uint32_t ecx, uint64_t val)
     env->pkru = val;
     tlb_flush(cs);
 }
+
+uint32_t helper_cross_page_check(CPUX86State *env, target_ulong vaddr)
+{
+    return !!tb_from_jmp_cache(env, vaddr);
+}
diff --git a/target/i386/translate.c b/target/i386/translate.c
index 1d1372f..ffc8ccc 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -2153,7 +2153,19 @@ static inline void gen_goto_tb(DisasContext *s, int tb_num, target_ulong eip)
         gen_jmp_im(eip);
         tcg_gen_exit_tb((uintptr_t)s->tb + tb_num);
     } else {
-        /* jump to another page: currently not optimized */
+        /* jump to another page */
+        TCGv vaddr = tcg_const_tl(eip);
+        TCGv_i32 valid = tcg_temp_new_i32();
+        TCGLabel *label = gen_new_label();
+
+        gen_helper_cross_page_check(valid, cpu_env, vaddr);
+        tcg_temp_free(vaddr);
+        tcg_gen_brcondi_i32(TCG_COND_EQ, valid, 0, label);
+        tcg_temp_free_i32(valid);
+        tcg_gen_goto_tb(tb_num);
+        gen_jmp_im(eip);
+        tcg_gen_exit_tb((uintptr_t)s->tb + tb_num);
+        gen_set_label(label);
         gen_jmp_im(eip);
         gen_eob(s);
     }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 05/10] tcg: add jr opcode
  2017-04-12  1:17 [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Emilio G. Cota
                   ` (3 preceding siblings ...)
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 04/10] target/i386: " Emilio G. Cota
@ 2017-04-12  1:17 ` Emilio G. Cota
  2017-04-13  5:09   ` Paolo Bonzini
  2017-04-15 11:40   ` Richard Henderson
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 06/10] tcg: add brcondi_ptr Emilio G. Cota
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 21+ messages in thread
From: Emilio G. Cota @ 2017-04-12  1:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Claudio Fontana,
	Andrzej Zaborowski, Aurelien Jarno, Alexander Graf, Stefan Weil,
	qemu-arm, alex.bennee, Pranith Kumar

This will be used by TCG targets to implement a fast path
for indirect branches.

I only have implemented and tested this on an i386 host, so
make this opcode optional and mark it as not implemented by
other TCG backends.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/aarch64/tcg-target.h  | 1 +
 tcg/arm/tcg-target.h      | 1 +
 tcg/i386/tcg-target.h     | 1 +
 tcg/i386/tcg-target.inc.c | 7 +++++++
 tcg/ia64/tcg-target.h     | 1 +
 tcg/mips/tcg-target.h     | 1 +
 tcg/ppc/tcg-target.h      | 1 +
 tcg/s390/tcg-target.h     | 1 +
 tcg/sparc/tcg-target.h    | 1 +
 tcg/tcg-op.h              | 6 ++++++
 tcg/tcg-opc.h             | 1 +
 tcg/tcg.c                 | 1 +
 tcg/tci/tcg-target.h      | 1 +
 13 files changed, 24 insertions(+)

diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 1a5ea23..ed2fb84 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -77,6 +77,7 @@ typedef enum {
 #define TCG_TARGET_HAS_mulsh_i32        0
 #define TCG_TARGET_HAS_extrl_i64_i32    0
 #define TCG_TARGET_HAS_extrh_i64_i32    0
+#define TCG_TARGET_HAS_jr               0
 
 #define TCG_TARGET_HAS_div_i64          1
 #define TCG_TARGET_HAS_rem_i64          1
diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index 09a19c6..1c9f0a2 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -123,6 +123,7 @@ extern bool use_idiv_instructions;
 #define TCG_TARGET_HAS_mulsh_i32        0
 #define TCG_TARGET_HAS_div_i32          use_idiv_instructions
 #define TCG_TARGET_HAS_rem_i32          0
+#define TCG_TARGET_HAS_jr               0
 
 enum {
     TCG_AREG0 = TCG_REG_R6,
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 4275787..ebbddb3 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -107,6 +107,7 @@ extern bool have_popcnt;
 #define TCG_TARGET_HAS_muls2_i32        1
 #define TCG_TARGET_HAS_muluh_i32        0
 #define TCG_TARGET_HAS_mulsh_i32        0
+#define TCG_TARGET_HAS_jr               1
 
 #if TCG_TARGET_REG_BITS == 64
 #define TCG_TARGET_HAS_extrl_i64_i32    0
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 5918008..53baf71 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -1909,6 +1909,9 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_br:
         tcg_out_jxx(s, JCC_JMP, arg_label(a0), 0);
         break;
+    case INDEX_op_jr:
+        tcg_out_modrm(s, OPC_GRP5, EXT5_JMPN_Ev, a0);
+        break;
     OP_32_64(ld8u):
         /* Note that we can ignore REXW for the zero-extend to 64-bit.  */
         tcg_out_modrm_offset(s, OPC_MOVZBL, a0, a1, a2);
@@ -2277,6 +2280,7 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
 
 static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
 {
+    static const TCGTargetOpDef ri = { .args_ct_str = { "ri" } };
     static const TCGTargetOpDef ri_r = { .args_ct_str = { "ri", "r" } };
     static const TCGTargetOpDef re_r = { .args_ct_str = { "re", "r" } };
     static const TCGTargetOpDef qi_r = { .args_ct_str = { "qi", "r" } };
@@ -2324,6 +2328,9 @@ static const TCGTargetOpDef *tcg_target_op_def(TCGOpcode op)
     case INDEX_op_st_i64:
         return &re_r;
 
+    case INDEX_op_jr:
+        return &ri;
+
     case INDEX_op_add_i32:
     case INDEX_op_add_i64:
         return &r_r_re;
diff --git a/tcg/ia64/tcg-target.h b/tcg/ia64/tcg-target.h
index 42aea03..a2760ba 100644
--- a/tcg/ia64/tcg-target.h
+++ b/tcg/ia64/tcg-target.h
@@ -173,6 +173,7 @@ typedef enum {
 #define TCG_TARGET_HAS_mulsh_i64        0
 #define TCG_TARGET_HAS_extrl_i64_i32    0
 #define TCG_TARGET_HAS_extrh_i64_i32    0
+#define TCG_TARGET_HAS_jr               0
 
 #define TCG_TARGET_deposit_i32_valid(ofs, len) ((len) <= 16)
 #define TCG_TARGET_deposit_i64_valid(ofs, len) ((len) <= 16)
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index f46d64a..d06e495 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -130,6 +130,7 @@ extern bool use_mips32r2_instructions;
 #define TCG_TARGET_HAS_muluh_i32        1
 #define TCG_TARGET_HAS_mulsh_i32        1
 #define TCG_TARGET_HAS_bswap32_i32      1
+#define TCG_TARGET_HAS_jr               0
 
 #if TCG_TARGET_REG_BITS == 64
 #define TCG_TARGET_HAS_add2_i32         0
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index abd8b3d..461bb0c 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -82,6 +82,7 @@ extern bool have_isa_3_00;
 #define TCG_TARGET_HAS_muls2_i32        0
 #define TCG_TARGET_HAS_muluh_i32        1
 #define TCG_TARGET_HAS_mulsh_i32        1
+#define TCG_TARGET_HAS_jr               0
 
 #if TCG_TARGET_REG_BITS == 64
 #define TCG_TARGET_HAS_add2_i32         0
diff --git a/tcg/s390/tcg-target.h b/tcg/s390/tcg-target.h
index cbdd2a6..b35c7b1 100644
--- a/tcg/s390/tcg-target.h
+++ b/tcg/s390/tcg-target.h
@@ -92,6 +92,7 @@ extern uint64_t s390_facilities;
 #define TCG_TARGET_HAS_mulsh_i32      0
 #define TCG_TARGET_HAS_extrl_i64_i32  0
 #define TCG_TARGET_HAS_extrh_i64_i32  0
+#define TCG_TARGET_HAS_jr             0
 
 #define TCG_TARGET_HAS_div2_i64       1
 #define TCG_TARGET_HAS_rot_i64        1
diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc/tcg-target.h
index b8b74f9..3d6f872 100644
--- a/tcg/sparc/tcg-target.h
+++ b/tcg/sparc/tcg-target.h
@@ -123,6 +123,7 @@ extern bool use_vis3_instructions;
 #define TCG_TARGET_HAS_muls2_i32        1
 #define TCG_TARGET_HAS_muluh_i32        0
 #define TCG_TARGET_HAS_mulsh_i32        0
+#define TCG_TARGET_HAS_jr               0
 
 #define TCG_TARGET_HAS_extrl_i64_i32    1
 #define TCG_TARGET_HAS_extrh_i64_i32    1
diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index c68e300..1924633 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -261,6 +261,12 @@ static inline void tcg_gen_br(TCGLabel *l)
     tcg_gen_op1(&tcg_ctx, INDEX_op_br, label_arg(l));
 }
 
+/* jump to a host address contained in a register */
+static inline void tcg_gen_jr(TCGv_ptr arg)
+{
+    tcg_gen_op1i(INDEX_op_jr, GET_TCGV_PTR(arg));
+}
+
 void tcg_gen_mb(TCGBar);
 
 /* Helper calls. */
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index f06f894..1e869af 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -34,6 +34,7 @@ DEF(set_label, 0, 0, 1, TCG_OPF_BB_END | TCG_OPF_NOT_PRESENT)
 DEF(call, 0, 0, 3, TCG_OPF_CALL_CLOBBER | TCG_OPF_NOT_PRESENT)
 
 DEF(br, 0, 0, 1, TCG_OPF_BB_END)
+DEF(jr, 0, 1, 0, TCG_OPF_BB_END)
 
 #define IMPL(X) (__builtin_constant_p(X) && !(X) ? TCG_OPF_NOT_PRESENT : 0)
 #if TCG_TARGET_REG_BITS == 32
diff --git a/tcg/tcg.c b/tcg/tcg.c
index cb898f1..a7e7842 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -1139,6 +1139,7 @@ void tcg_dump_ops(TCGContext *s)
             switch (c) {
             case INDEX_op_set_label:
             case INDEX_op_br:
+            case INDEX_op_jr:
             case INDEX_op_brcond_i32:
             case INDEX_op_brcond_i64:
             case INDEX_op_brcond2_i32:
diff --git a/tcg/tci/tcg-target.h b/tcg/tci/tcg-target.h
index 838bf3a..63d1a57 100644
--- a/tcg/tci/tcg-target.h
+++ b/tcg/tci/tcg-target.h
@@ -85,6 +85,7 @@
 #define TCG_TARGET_HAS_muls2_i32        0
 #define TCG_TARGET_HAS_muluh_i32        0
 #define TCG_TARGET_HAS_mulsh_i32        0
+#define TCG_TARGET_HAS_jr               0
 
 #if TCG_TARGET_REG_BITS == 64
 #define TCG_TARGET_HAS_extrl_i64_i32    0
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 06/10] tcg: add brcondi_ptr
  2017-04-12  1:17 [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Emilio G. Cota
                   ` (4 preceding siblings ...)
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 05/10] tcg: add jr opcode Emilio G. Cota
@ 2017-04-12  1:17 ` Emilio G. Cota
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 07/10] tcg: add tcg_temp_local_new_ptr Emilio G. Cota
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2017-04-12  1:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Claudio Fontana,
	Andrzej Zaborowski, Aurelien Jarno, Alexander Graf, Stefan Weil,
	qemu-arm, alex.bennee, Pranith Kumar

This will be used by TCG targets to implement a fast path
for indirect branches.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg-op.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 1924633..abf784b 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -1118,6 +1118,8 @@ void tcg_gen_atomic_xor_fetch_i64(TCGv_i64, TCGv, TCGv_i64, TCGArg, TCGMemOp);
     tcg_gen_addi_i32(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(A), (B))
 # define tcg_gen_ext_i32_ptr(R, A) \
     tcg_gen_mov_i32(TCGV_PTR_TO_NAT(R), (A))
+# define tcg_gen_brcondi_ptr(C, A, I, L) \
+    tcg_gen_brcondi_i32(C, TCGV_PTR_TO_NAT(A), (uintptr_t)I, L)
 #else
 # define tcg_gen_ld_ptr(R, A, O) \
     tcg_gen_ld_i64(TCGV_PTR_TO_NAT(R), (A), (O))
@@ -1129,4 +1131,6 @@ void tcg_gen_atomic_xor_fetch_i64(TCGv_i64, TCGv, TCGv_i64, TCGArg, TCGMemOp);
     tcg_gen_addi_i64(TCGV_PTR_TO_NAT(R), TCGV_PTR_TO_NAT(A), (B))
 # define tcg_gen_ext_i32_ptr(R, A) \
     tcg_gen_ext_i32_i64(TCGV_PTR_TO_NAT(R), (A))
+# define tcg_gen_brcondi_ptr(C, A, I, L) \
+    tcg_gen_brcondi_i64(C, TCGV_PTR_TO_NAT(A), (uintptr_t)I, L)
 #endif /* UINTPTR_MAX == UINT32_MAX */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 07/10] tcg: add tcg_temp_local_new_ptr
  2017-04-12  1:17 [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Emilio G. Cota
                   ` (5 preceding siblings ...)
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 06/10] tcg: add brcondi_ptr Emilio G. Cota
@ 2017-04-12  1:17 ` Emilio G. Cota
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 08/10] target/arm: optimize indirect branches with TCG's jr op Emilio G. Cota
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2017-04-12  1:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Claudio Fontana,
	Andrzej Zaborowski, Aurelien Jarno, Alexander Graf, Stefan Weil,
	qemu-arm, alex.bennee, Pranith Kumar

This will be used by TCG targets to implement a fast path
for indirect branches.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 tcg/tcg.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tcg/tcg.h b/tcg/tcg.h
index 6c216bb..37a7c8e 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -912,6 +912,7 @@ do {\
 #define tcg_global_mem_new_ptr(R, O, N) \
     TCGV_NAT_TO_PTR(tcg_global_mem_new_i32((R), (O), (N)))
 #define tcg_temp_new_ptr() TCGV_NAT_TO_PTR(tcg_temp_new_i32())
+#define tcg_temp_local_new_ptr() TCGV_NAT_TO_PTR(tcg_temp_local_new_i32())
 #define tcg_temp_free_ptr(T) tcg_temp_free_i32(TCGV_PTR_TO_NAT(T))
 #else
 #define TCGV_NAT_TO_PTR(n) MAKE_TCGV_PTR(GET_TCGV_I64(n))
@@ -923,6 +924,7 @@ do {\
 #define tcg_global_mem_new_ptr(R, O, N) \
     TCGV_NAT_TO_PTR(tcg_global_mem_new_i64((R), (O), (N)))
 #define tcg_temp_new_ptr() TCGV_NAT_TO_PTR(tcg_temp_new_i64())
+#define tcg_temp_local_new_ptr() TCGV_NAT_TO_PTR(tcg_temp_local_new_i64())
 #define tcg_temp_free_ptr(T) tcg_temp_free_i64(TCGV_PTR_TO_NAT(T))
 #endif
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 08/10] target/arm: optimize indirect branches with TCG's jr op
  2017-04-12  1:17 [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Emilio G. Cota
                   ` (6 preceding siblings ...)
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 07/10] tcg: add tcg_temp_local_new_ptr Emilio G. Cota
@ 2017-04-12  1:17 ` Emilio G. Cota
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 09/10] target/i386: " Emilio G. Cota
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2017-04-12  1:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Claudio Fontana,
	Andrzej Zaborowski, Aurelien Jarno, Alexander Graf, Stefan Weil,
	qemu-arm, alex.bennee, Pranith Kumar

Speed up indirect branches by adding a helper to look for the
TB in tb_jmp_cache. The helper returns either the corresponding
host address or NULL.

Measurements:

- Impact on Boot time

| setup   | ARM debian boot+shutdown time | stddev |
|---------+-------------------------------+--------|
| master  |                  10.050247057 | 0.0361 |
| +cross  |                  10.311265443 | 0.0721 |
| +jr     |                  10.216832579 | 0.0878 |
| +inline |                  10.405597879 | 0.0332 |

That is, a 3.5% slowdown. This is reasonable since booting
has low hit rates in tb_jmp_cache.

-                NBench, arm-linux-user. Host: Intel i7-4790K @ 4.00GHz
                            Y axis: speedup over 95b31d70

    1.25x+-+-------------------------------------------------------------+-+
         |                                                jr          $$$  |
         |                                                jr+inline   %%%  |
     1.2x+-+..................................$$$%%......................+-+
         |                                    $ $ %                        |
         |                                    $ $ %                        |
         |                          %%%       $ $ %   %%                   |
    1.15x+-+........................%.%.......$.$.%.$$$%.................+-+
         |                          % %       $ $ % $ $%                   |
         |                        $$$ %       $ $ % $ $%                   |
     1.1x+-+......................$.$.%.......$.$.%.$.$%.................+-+
         |                        $ $ %       $ $ % $ $%                   |
         |                        $ $ % $$$   $ $ % $ $%               %%% |
    1.05x+-+......................$.$.%.$.$%%.$.$.%.$.$%.............$$$.%-+
         |             $$$%% $$%% $ $ % $ $ % $ $ % $ $%             $ $ % |
         | $$$%%       $ $ % $$ % $ $ % $ $ % $ $ % $ $%             $ $ % |
         | $ $ %       $ $ % $$ % $ $ % $ $ % $ $ % $ $%   %%% $$$%% $ $ % |
       1x+-$.$B%R$$$%%G$A$H%T$$P%j$+$n%i$e$.%.$.$.%.$.$%.$$$.%.$.$.%.$.$.%-+
         +-$$$%%-$$$%%-$$$%%-$$%%-$$$%%-$$$%%-$$$%%-$$$%-$$$%%-$$$%%-$$$%%-+
        ASSIGNMBITFIELFOFP_EMULATHUFFMANLU_DECOMPNEURNUMERICSTRING_SOhmean
  png: http://imgur.com/ihqQj6l

That is, a 6.65% hmean improvement with jr+inline (5.92% w/o inlining).
Peak improvement is 21% for HUFFMAN.

-                NBench, arm-softmmu. Host: Intel i7-4790K @ 4.00GHz
                            Y axis: speedup over 95b31d70
        +------------------------------------------------------------------+
        |                                                                  |
    1.3x+-+........................................ cross+noinline    $$ +-+
        |                                           cross+inline      %%   |
        |                      &&              @@&& cross+jr+noinline @@   |
        |                   $$%@&              @@ & cross+jr+inline   &&   |
    1.2x+-+.................$$%@&......$$..&&..@@.&......................+-+
        |                   $$%@&      $$%%@&  @@ &  @@&                   |
        |                   $$%@&      $$ %@&  @@ &  @@&                   |
    1.1x+-+.................$$%@&...@@.$$.%@&..@@.&..@@&................&&-+
        |             $$%@& $$%@&   @@&$$ %@&  @@ &  @@&               @@& |
        |             $$%@& $$%@&   @@&$$ %@&$$%@ &  @@&       $$%@& $$%@& |
        | $$%&& $$%&& $$%@& $$%@&$$$%@&$$ %@&$$%@ & %%@&       $$%@& $$%@& |
      1x+-$$%@&A$$%@&A$$%@&A$$%@&$R$%@&$$T%@&$$%@s&+%%@&n$$%@&.$$%@&.$$%@&-+
        | $$%@& $$%@& $$%@& $$%@&$ $%@&$$ %@&$$%@ & %%@& $$%@& $$%@& $$%@& |
        | $$%@& $$%@& $$%@& $$%@&$ $%@&$$ %@&$$%@ & %%@& $$%@& $$%@& $$%@& |
    0.9x+-$$%@&.$$%@&.$$%@&.$$%@&$.$%@&$$.%@&$$%@.&.%%@&.$$%@&.$$%@&.$$%@&-+
        | $$%@& $$%@& $$%@& $$%@&$ $%@&$$ %@&$$%@ & %%@& $$%@& $$%@& $$%@& |
        | $$%@& $$%@& $$%@& $$%@&$ $%@&$$ %@&$$%@ &$$%@& $$%@& $$%@& $$%@& |
        | $$%@& $$%@& $$%@& $$%@&$ $%@&$$ %@&$$%@ &$$%@& $$%@& $$%@& $$%@& |
    0.8x+-$$%@&-$$%@&-$$%@&-$$%@&$$$%@&$$%%@&$$%@&&$$%@&-$$%@&-$$%@&-$$%@&-+
       ASSIGNMBITFIELFOUFP_EMULATHUFFMALU_DECOMPNEURANUMERICSTRING_SOhmean
   png: http://imgur.com/yWJivBl

That is, a 9.86% hmean improvement when combining cross+jr+inline (this commit)
over current master. Peak improvement is 25% for FP_EMULATION.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/arm/helper.c    | 11 +++++++++++
 target/arm/helper.h    |  1 +
 target/arm/translate.c | 23 +++++++++++++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/target/arm/helper.c b/target/arm/helper.c
index 10b8807..dfbc488 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -9927,3 +9927,14 @@ uint32_t HELPER(cross_page_check)(CPUARMState *env, target_ulong vaddr)
 {
     return !!tb_from_jmp_cache(env, vaddr);
 }
+
+void *HELPER(get_hostptr)(CPUARMState *env, target_ulong vaddr)
+{
+    TranslationBlock *tb;
+
+    tb = tb_from_jmp_cache(env, vaddr);
+    if (unlikely(tb == NULL)) {
+        return NULL;
+    }
+    return tb->tc_ptr;
+}
diff --git a/target/arm/helper.h b/target/arm/helper.h
index d4b779b..0faacc1 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -2,6 +2,7 @@ DEF_HELPER_FLAGS_1(sxtb16, TCG_CALL_NO_RWG_SE, i32, i32)
 DEF_HELPER_FLAGS_1(uxtb16, TCG_CALL_NO_RWG_SE, i32, i32)
 
 DEF_HELPER_2(cross_page_check, i32, env, tl)
+DEF_HELPER_2(get_hostptr, ptr, env, tl)
 
 DEF_HELPER_3(add_setq, i32, env, i32, i32)
 DEF_HELPER_3(add_saturate, i32, env, i32, i32)
diff --git a/target/arm/translate.c b/target/arm/translate.c
index ce97d0c..2510bb2 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -65,6 +65,14 @@ static TCGv_i32 cpu_R[16];
 TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
 TCGv_i64 cpu_exclusive_addr;
 TCGv_i64 cpu_exclusive_val;
+static bool gen_jr;
+
+static inline void set_jr(void)
+{
+    if (TCG_TARGET_HAS_jr) {
+        gen_jr = true;
+    }
+}
 
 /* FIXME:  These should be removed.  */
 static TCGv_i32 cpu_F0s, cpu_F1s;
@@ -221,6 +229,7 @@ static void store_reg(DisasContext *s, int reg, TCGv_i32 var)
          */
         tcg_gen_andi_i32(var, var, s->thumb ? ~1 : ~3);
         s->is_jmp = DISAS_JUMP;
+        set_jr();
     }
     tcg_gen_mov_i32(cpu_R[reg], var);
     tcg_temp_free_i32(var);
@@ -893,6 +902,7 @@ static inline void gen_bx_im(DisasContext *s, uint32_t addr)
         tcg_temp_free_i32(tmp);
     }
     tcg_gen_movi_i32(cpu_R[15], addr & ~1);
+    set_jr();
 }
 
 /* Set PC and Thumb state from var.  var is marked as dead.  */
@@ -902,6 +912,7 @@ static inline void gen_bx(DisasContext *s, TCGv_i32 var)
     tcg_gen_andi_i32(cpu_R[15], var, ~1);
     tcg_gen_andi_i32(var, var, 1);
     store_cpu_field(var, thumb);
+    set_jr();
 }
 
 /* Variant of store_reg which uses branch&exchange logic when storing
@@ -12042,6 +12053,18 @@ void gen_intermediate_code(CPUARMState *env, TranslationBlock *tb)
             gen_set_pc_im(dc, dc->pc);
             /* fall through */
         case DISAS_JUMP:
+            if (TCG_TARGET_HAS_jr && gen_jr) {
+                TCGv_ptr ptr = tcg_temp_local_new_ptr();
+                TCGLabel *label = gen_new_label();
+
+                gen_jr = false;
+                gen_helper_get_hostptr(ptr, cpu_env, cpu_R[15]);
+                tcg_gen_brcondi_ptr(TCG_COND_EQ, ptr, NULL, label);
+                tcg_gen_jr(ptr);
+                tcg_temp_free_ptr(ptr);
+                gen_set_label(label);
+                /* fall through */
+            }
         default:
             /* indicate that the hash table must be used to find the next TB */
             tcg_gen_exit_tb(0);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 09/10] target/i386: optimize indirect branches with TCG's jr op
  2017-04-12  1:17 [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Emilio G. Cota
                   ` (7 preceding siblings ...)
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 08/10] target/arm: optimize indirect branches with TCG's jr op Emilio G. Cota
@ 2017-04-12  1:17 ` Emilio G. Cota
  2017-04-12  3:43   ` Paolo Bonzini
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 10/10] tb-hash: improve tb_jmp_cache hash function in user mode Emilio G. Cota
  2017-04-12 10:03 ` [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Alex Bennée
  10 siblings, 1 reply; 21+ messages in thread
From: Emilio G. Cota @ 2017-04-12  1:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Claudio Fontana,
	Andrzej Zaborowski, Aurelien Jarno, Alexander Graf, Stefan Weil,
	qemu-arm, alex.bennee, Pranith Kumar

Speed up indirect branches by adding a helper to look for the
TB in tb_jmp_cache. The helper returns either the corresponding
host address or NULL.

Measurements:

-             NBench, x86_64-linux-user. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

     1.1x+-+-------------------------------------------------------------+-+
         |          jr             $$                                      |
    1.08x+-+......  jr+inline      %%  ..................................+-+
         |                                                                 |
         | $$$                                                             |
    1.06x+-$.$............................%%%............................+-+
         | $ $%%                          % %                              |
    1.04x+-$.$.%..........................%.%............................+-+
         | $ $ %                        $$$ %                  $$$         |
         | $ $ %              %%%       $ $ %                  $ $%%       |
    1.02x+-$.$.%.........%%%.$$.%.......$.$.%...%%%...%%.......$.$.%.$$$%%-+
         | $ $ %         % % $$ % $$$   $ $ % $$$ %   %% $$$%% $ $ % $ $ % |
       1x+-$.$B%R$$$ARGRA%H%T$$P%j$+$%%i$e$.%.$.$.%.$$$%.$.$.%.$.$.%.$.$.%-+
         | $ $ % $ $%% $$$ % $$ % $ $ % $ $ % $ $ % $ $% $ $ % $ $ % $ $ % |
    0.98x+-$.$.%.$.$.%.$.$.%.$$.%.$.$.%.$.$.%.$.$.%.$.$%.$.$.%.$.$.%.$.$.%-+
         | $ $ % $ $ % $ $ % $$ % $ $ % $ $ % $ $ % $ $% $ $ % $ $ % $ $ % |
         | $ $ % $ $ % $ $ % $$ % $ $ % $ $ % $ $ % $ $% $ $ % $ $ % $ $ % |
    0.96x+-$.$.%.$.$.%.$.$.%.$$.%.$.$.%.$.$.%.$.$.%.$.$%.$.$.%.$.$.%.$.$.%-+
         +-$$$%%-$$$%%-$$$%%-$$%%-$$$%%-$$$%%-$$$%%-$$$%-$$$%%-$$$%%-$$$%%-+
        ASSIGNMBITFIELFOFP_EMULATHUFFMANLU_DECOMPNEURNUMERICSTRING_SOhmean
  png: http://imgur.com/Jxj4hBd

The fact that NBench is not very sensitive to changes here is a
little surprising, especially given the significant improvements for
ARM shown in the previous commit. I wonder whether the compiler is doing
a better job compiling the x86_64 version (I'm using gcc 5.4.0), or I'm simply
missing some i386 instructions to which the jr optimization should
be applied.

     specINT 2006 (test set), x86_64-linux-user. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

     1.3x+-+-------------------------------------------------------------+-+
         |          jr+inline $$                                           |
    1.25x+-+.............................................................+-+
         |                                                                 |
     1.2x+-+.............................................................+-+
         |                                                                 |
         |                     +++                 +++                     |
    1.15x+-+...................$$$.................$$$...................+-+
         |                     $ $                 $:$                     |
     1.1x+-+...................$.$.................$.$...........$$$$....+-+
         |           +++       $ $                 $ $       +++ $++$      |
    1.05x+-+.........$$$$......$.$.................$.$...........$..$....+-+
         |           $  $      $ $  $$$            $ $ $$$$ $$$$ $  $ $$$$ |
         | $$$$  +++ $  $ +++  $ $  $ $  +++  $$$  $ $ $  $ $++$ $  $ $  $ |
       1x+-$BA$G$$$$_$EM$_$$$$.$.$..$.$..$$$..$.$..$.$.$..$.$..$.$..$.$..$-+
         | $  $ $  $ $  $ $  $ $ $  $ $  $ $  $ $  $ $ $  $ $  $ $  $ $  $ |
    0.95x+-$..$.$..$.$..$.$..$.$.$..$.$..$.$..$.$..$.$.$..$.$..$.$..$.$..$-+
         | $  $ $  $ $  $ $  $ $ $  $ $  $ $  $ $  $ $ $  $ $  $ $  $ $  $ |
     0.9x+-$$$$-$$$$-$$$$-$$$$-$$$--$$$--$$$--$$$--$$$-$$$$-$$$$-$$$$-$$$$-+
           astarbzip2gcc gobmh264rehmlibquantumcfomneperlbensjxalancbhmean
  png: http://imgur.com/63Ncmx8

That is a 4.4% hmean perf improvement.

-  specINT 2006 (train set), x86_64-linux-user. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

    1.4x+-+--------------------------------------------------------------+-+
        |        jr  $$                                                    |
        |                                                                  |
    1.3x+-+..............................................................+-+
        |                                                                  |
        |                                                                  |
    1.2x+-+......................................................$$$$....+-+
        |                      +++                     $$$$  :   $++$      |
        |                     $$$$                $$$$ $  $  :   $  $      |
    1.1x+-+...................$..$................$..$.$..$.$$$$.$..$....+-+
        |                     $  $                $  $ $  $ $: $ $  $ +++  |
        |  +++       +++  +++ $  $ $$$$  +++      $  $ $  $ $: $ $  $ $$$$ |
      1x+-$$$$GRAPH_$$$$_$$$$.$..$.$..$.$$$$......$..$.$..$.$..$.$..$.$..$-+
        | $++$ $$$$ $  $ $++$ $  $ $  $ $  $      $  $ $  $ $  $ $  $ $  $ |
        | $  $ $  $ $  $ $  $ $  $ $  $ $  $      $  $ $  $ $  $ $  $ $  $ |
    0.9x+-$..$.$..$.$..$.$..$.$..$.$..$.$..$......$..$.$..$.$..$.$..$.$..$-+
        | $  $ $  $ $  $ $  $ $  $ $  $ $  $ $$$$ $  $ $  $ $  $ $  $ $  $ |
        | $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ $  $ |
    0.8x+-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-$$$$-+
          astarbzip2 gcc gobmh264rehmlibquantmcfomneperlbensjexalancbhmean
  png: http://imgur.com/hd0BhU6

That is, a 4.39 % hmean improvement for jr+inline, i.e. this commit.
(4.5% for noinline). Peak improvement is 20% for xalancbmk.

-    specINT 2006 (test set), x86_64-softmmu. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

     1.3x+-+-------------------------------------------------------------+-+
         |         cross    $$                                             |
    1.25x+-+.....  jr       %%  .........................................+-+
         |         cross+jr @@                                      :      |
     1.2x+-+.............................................................+-+
         |                                                       :  :      |
         |             +++                                       :  :      |
    1.15x+-+...........@@................................................+-+
         |           $$@@ $$++                    +++            : @@      |
     1.1x+-+.........$$@@.$$@@.....................................@@....+-+
         |           $$@@ $$@@                    $$ :   @@@  +++$$@@      |
    1.05x+-+.........$$@@.$$@@...@@...............$$...$$@.@.....$$@@....+-+
         |        +++$$%@ $$@@  %%@+++++++++++++++$$+: $$@ @++@@ $$%@+$$@@+|
         |  +@@+++@@+$$%@ $$@@++%%@$$$%  ::@@ ::@@$$@@@$$% @$$@@ $$%@+$$@@ |
       1x+-$$%@A$$%@R$$%@R$$%@$$$%@$_$%@s%%%@$$%%@$$@.@$$%.@$$@@.$$%@.$$%@-+
         |+$$%@ $$%@ $$%@ $$%@$ $%@$+$%@ %+%@$$+%@$$@+@$$% @$$@@ $$%@+$$%@ |
    0.95x+-$$%@.$$%@.$$%@.$$%@$.$%@$.$%@$$.%@$$.%@$$@.@$$%.@$$%@.$$%@.$$%@-+
         | $$%@ $$%@ $$%@ $$%@$ $%@$ $%@$$ %@$$ %@$$%+@$$% @$$%@ $$%@ $$%@ |
     0.9x+-$$%@-$$%@-$$%@-$$%@$$$%@$$$%@$$%%@$$%%@$$%@@$$%@@$$%@-$$%@-$$%@-+
           astabzip2 gcc gobmh264rehmlibquantumcfomneperlbensjexalanchmean
  png: http://imgur.com/IV9UtSa

Here we see how jr works best when combined with cross -- jr by itself is
disappointingly around baseline performance. I attribute this to the frequent
page invalidations and/or TLB flushes (I'm running Ubuntu 16.04 as the guest,
so there are many processes), which lowers the maximum attainable hit rate in
tb_jmp_cache.

Overall the greatest hmean improvement comes from cross+jr though.

-      specINT 2006 (train set), x86_64-softmmu. Host: Intel i7-4790K @ 4.00GHz
                             Y axis: Speedup over 95b31d70

    1.25x+-+-------------------------------------------------------------+-+
         |        cross+inline    $$                                       |
         |        cross+jr+inline %%                     +++      +++      |
     1.2x+-+.............................................................+-+
         |                                         :      :      +++       |
    1.15x+-+.......................................................%%....+-+
         |            ::   +++                    $$$  $$$%      $$$%      |
         |           $$%%++%%%                    $:$  $+$% +++  $:$%      |
     1.1x+-+.........$$.%.$$.%....................$.$..$.$%......$.$%....+-+
         |      +++  $$+%+$$ %+++++ :+++          $ $: $ $%  :%% $+$% +++  |
    1.05x+-+....$$...$$.%.$$.%......$$............$.$%.$.$%.$$$%.$.$%.$$%%-+
         |      $$%% $$ % $$ % $$%% $$:   +++     $ $% $ $% $:$% $ $% $$+% |
         |      $$+% $$ % $$ % $$:%+$$%%+++: +++  $ $%+$ $% $:$% $ $% $$ % |
       1x+-$$$AR$$A%G$$P%_$$M%_$$o%s$$r%$$$%%e....$.$%.$.$%.$.$%.$.$%.$$.%-+
         | $+$% $$ % $$ % $$ %+$$+% $$:%$:$+%$$$++$ $% $ $% $ $% $ $% $$ % |
    0.95x+-$.$%.$$.%.$$.%.$$.%.$$.%.$$.%$.$.%$.$..$.$%.$.$%.$.$%.$.$%.$$.%-+
         | $ $% $$ % $$ % $$ % $$ % $$ %$ $ %$+$% $ $% $ $% $ $% $ $% $$ % |
         | $ $% $$ % $$ % $$ % $$ % $$ %$ $ %$ $% $ $% $ $% $ $% $ $% $$ % |
     0.9x+-$$$%-$$%%-$$%%-$$%%-$$%%-$$%%$$$%%$$$%-$$$%-$$$%-$$$%-$$$%-$$%%-+
           astabzip2 gcc gobmh264rehmlibquantumcfomneperlbensjexalanchmean
  png: http://imgur.com/CBMxrBH

This is the larger "train" set of SPECint06. Here cross+jr comes slightly
below cross, but it's within the noise margins (I didn't run this many
times, since it takes several hours).

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target/i386/helper.h      |  1 +
 target/i386/misc_helper.c | 11 +++++++++++
 target/i386/translate.c   | 42 +++++++++++++++++++++++++++++++++---------
 3 files changed, 45 insertions(+), 9 deletions(-)

diff --git a/target/i386/helper.h b/target/i386/helper.h
index dceb343..f7e9f9c 100644
--- a/target/i386/helper.h
+++ b/target/i386/helper.h
@@ -2,6 +2,7 @@ DEF_HELPER_FLAGS_4(cc_compute_all, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
 DEF_HELPER_FLAGS_4(cc_compute_c, TCG_CALL_NO_RWG_SE, tl, tl, tl, tl, int)
 
 DEF_HELPER_2(cross_page_check, i32, env, tl)
+DEF_HELPER_2(get_hostptr, ptr, env, tl)
 
 DEF_HELPER_3(write_eflags, void, env, tl, i32)
 DEF_HELPER_1(read_eflags, tl, env)
diff --git a/target/i386/misc_helper.c b/target/i386/misc_helper.c
index a41daed..5d50ab0 100644
--- a/target/i386/misc_helper.c
+++ b/target/i386/misc_helper.c
@@ -642,3 +642,14 @@ uint32_t helper_cross_page_check(CPUX86State *env, target_ulong vaddr)
 {
     return !!tb_from_jmp_cache(env, vaddr);
 }
+
+void *helper_get_hostptr(CPUX86State *env, target_ulong vaddr)
+{
+    TranslationBlock *tb;
+
+    tb = tb_from_jmp_cache(env, vaddr);
+    if (unlikely(tb == NULL)) {
+        return NULL;
+    }
+    return tb->tc_ptr;
+}
diff --git a/target/i386/translate.c b/target/i386/translate.c
index ffc8ccc..aab5c13 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -2521,7 +2521,8 @@ static void gen_bnd_jmp(DisasContext *s)
    If INHIBIT, set HF_INHIBIT_IRQ_MASK if it isn't already set.
    If RECHECK_TF, emit a rechecking helper for #DB, ignoring the state of
    S->TF.  This is used by the syscall/sysret insns.  */
-static void gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf)
+static void
+gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf, TCGv jr)
 {
     gen_update_cc_op(s);
 
@@ -2542,6 +2543,22 @@ static void gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf)
         tcg_gen_exit_tb(0);
     } else if (s->tf) {
         gen_helper_single_step(cpu_env);
+    } else if (jr) {
+#ifdef TCG_TARGET_HAS_JR
+        TCGLabel *label = gen_new_label();
+        TCGv_ptr ptr = tcg_temp_local_new_ptr();
+        TCGv vaddr = tcg_temp_new();
+
+        tcg_gen_ld_tl(vaddr, cpu_env, offsetof(CPUX86State, segs[R_CS].base));
+        tcg_gen_add_tl(vaddr, vaddr, jr);
+        gen_helper_get_hostptr(ptr, cpu_env, vaddr);
+        tcg_temp_free(vaddr);
+        tcg_gen_brcondi_ptr(TCG_COND_EQ, ptr, NULL, label);
+        tcg_gen_jr(ptr);
+        tcg_temp_free_ptr(ptr);
+        gen_set_label(label);
+#endif
+        tcg_gen_exit_tb(0);
     } else {
         tcg_gen_exit_tb(0);
     }
@@ -2552,13 +2569,18 @@ static void gen_eob_worker(DisasContext *s, bool inhibit, bool recheck_tf)
    If INHIBIT, set HF_INHIBIT_IRQ_MASK if it isn't already set.  */
 static void gen_eob_inhibit_irq(DisasContext *s, bool inhibit)
 {
-    gen_eob_worker(s, inhibit, false);
+    gen_eob_worker(s, inhibit, false, NULL);
 }
 
 /* End of block, resetting the inhibit irq flag.  */
 static void gen_eob(DisasContext *s)
 {
-    gen_eob_worker(s, false, false);
+    gen_eob_worker(s, false, false, NULL);
+}
+
+static void gen_jr(DisasContext *s, TCGv dest)
+{
+    gen_eob_worker(s, false, false, dest);
 }
 
 /* generate a jump to eip. No segment change must happen before as a
@@ -4985,7 +5007,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             gen_push_v(s, cpu_T1);
             gen_op_jmp_v(cpu_T0);
             gen_bnd_jmp(s);
-            gen_eob(s);
+            gen_jr(s, cpu_T0);
             break;
         case 3: /* lcall Ev */
             gen_op_ld_v(s, ot, cpu_T1, cpu_A0);
@@ -5003,7 +5025,8 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                                       tcg_const_i32(dflag - 1),
                                       tcg_const_i32(s->pc - s->cs_base));
             }
-            gen_eob(s);
+            tcg_gen_ld_tl(cpu_tmp4, cpu_env, offsetof(CPUX86State, eip));
+            gen_jr(s, cpu_tmp4);
             break;
         case 4: /* jmp Ev */
             if (dflag == MO_16) {
@@ -5011,7 +5034,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
             }
             gen_op_jmp_v(cpu_T0);
             gen_bnd_jmp(s);
-            gen_eob(s);
+            gen_jr(s, cpu_T0);
             break;
         case 5: /* ljmp Ev */
             gen_op_ld_v(s, ot, cpu_T1, cpu_A0);
@@ -5026,7 +5049,8 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                 gen_op_movl_seg_T0_vm(R_CS);
                 gen_op_jmp_v(cpu_T1);
             }
-            gen_eob(s);
+            tcg_gen_ld_tl(cpu_tmp4, cpu_env, offsetof(CPUX86State, eip));
+            gen_jr(s, cpu_tmp4);
             break;
         case 6: /* push Ev */
             gen_push_v(s, cpu_T0);
@@ -7143,7 +7167,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
         /* TF handling for the syscall insn is different. The TF bit is  checked
            after the syscall insn completes. This allows #DB to not be
            generated after one has entered CPL0 if TF is set in FMASK.  */
-        gen_eob_worker(s, false, true);
+        gen_eob_worker(s, false, true, NULL);
         break;
     case 0x107: /* sysret */
         if (!s->pe) {
@@ -7158,7 +7182,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
                checked after the sysret insn completes. This allows #DB to be
                generated "as if" the syscall insn in userspace has just
                completed.  */
-            gen_eob_worker(s, false, true);
+            gen_eob_worker(s, false, true, NULL);
         }
         break;
 #endif
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [Qemu-devel] [PATCH 10/10] tb-hash: improve tb_jmp_cache hash function in user mode
  2017-04-12  1:17 [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Emilio G. Cota
                   ` (8 preceding siblings ...)
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 09/10] target/i386: " Emilio G. Cota
@ 2017-04-12  1:17 ` Emilio G. Cota
  2017-04-12  3:46   ` Paolo Bonzini
  2017-04-12 10:03 ` [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Alex Bennée
  10 siblings, 1 reply; 21+ messages in thread
From: Emilio G. Cota @ 2017-04-12  1:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Claudio Fontana,
	Andrzej Zaborowski, Aurelien Jarno, Alexander Graf, Stefan Weil,
	qemu-arm, alex.bennee, Pranith Kumar

Optimizations to cross-page chaining and indirect jumps make
performance more sensitive to the hit rate of tb_jmp_cache.
The constraint of reserving some bits for the page number
lowers the achievable quality of the hashing function.

However, user-mode does not have this requirement. Thus,
with this change we use for user-mode a hashing function that
is both faster and of better quality than the previous one.

Measurements:

-    specINT 2006 (test set), x86_64-linux-user. Host: Intel i7-4790K @ 4.00GHz
                              Y axis: Speedup over 95b31d70

     1.3x+-+-------------------------------------------------------------+-+
         |        jr             $$                                        |
    1.25x+-+....  jr+xxhash      %%  ....................................+-+
         |        jr+hash+inline @@                 +++                    |
     1.2x+-+.............................................................+-+
         |                                          @@@                    |
         |                    +++@@               ++@:@       +++  @@+     |
    1.15x+-+..................$$$@@...............$$@.@.......@@...@@....+-+
         |                    $ $@@               $$@ @      %%@   @@      |
     1.1x+-+..................$.$@@...............$$@.@......%%@.$$@@....+-+
         |          +++@@+    $ $@@               $$@ @    ++%%@+$$@@   +++|
    1.05x+-+.........$$@@.....$.$@@...@@..........$$@.@..@@@.%%@.$$@@...@@-+
         |           $$@@     $ $@@$$$@@          $$% @$$@+@$$%@ $$@@+$$@@ |
         |+$$++++++++$$@@+++@@$ $@@$+$@@+++@@$$+@@$$% @$$@+@$$%@ $$%@ $$@@ |
       1x+-$$@@A$$%@R$$@@R$$@@$_$%@$_$%@$$s@@$$%%@$$%.@$$%.@$$%@.$$%@.$$%@-+
         | $$@@+$$%@ $$%@ $$@@$+$%@$ $%@$$%%@$$+%@$$% @$$% @$$%@ $$%@ $$%@ |
    0.95x+-$$%@.$$%@.$$%@.$$%@$.$%@$.$%@$$.%@$$.%@$$%.@$$%.@$$%@.$$%@.$$%@-+
         | $$%@ $$%@ $$%@ $$%@$ $%@$ $%@$$ %@$$ %@$$% @$$% @$$%@ $$%@ $$%@ |
     0.9x+-$$%@-$$%@-$$%@-$$%@$$$%@$$$%@$$%%@$$%%@$$%@@$$%@@$$%@-$$%@-$$%@-+
           astabzip2 gcc gobmh264rehmlibquantumcfomneperlbensjexalanchmean
  png: http://imgur.com/RiaBuIi

That is, a 6.45% hmean improvement for this commit. Note that this is the
test set, so some benchmarks take almost no time (and therefore aren't that
sensitive to changes here). See "train" results below.

Note also that hashing quality is not the only requirement: xxhash
gives on average the highest hit rates. However, the time spent computing
the hash negates the performance gains coming from the increase in hit rate.
Given these results, I dropped xxhash from subsequent experiments.

-   specINT 2006 (train set), x86_64-linux-user. Host: Intel i7-4790K @ 4.00GHz
                              Y axis: Speedup over 95b31d70

    1.4x+-+--------------------------------------------------------------+-+
        |    jr      $$                                           +++      |
        |    jr+hash %%                                            :       |
    1.3x+-+.......................................................%%%....+-+
        |                                               +++  +++  %:%      |
        |                      +++                      %%%   :   %+%      |
    1.2x+-+.....................%%......................%.%..%%%.$$.%....+-+
        |                     ++%%                 %%% $$+%  %:% $$+%      |
        |            +++      $$$%                $$+% $$ %  %:% $$ %      |
    1.1x+-+...........%%......$.$%................$$.%.$$.%.$$.%.$$.%..%%%-+
        |  +++        %%      $ $%            +++ $$ % $$ % $$ % $$ % +%+% |
        | ++%%  +++ ++%% ++%% $ $% $$$+ +++   %%% $$ % $$ % $$ % $$ % $$+% |
      1x+-$$$%RGR%%R$$$%H$$$%P$j$%h$s$%.$$%%..%.%.$$.%.$$.%.$$.%.$$.%.$$.%-+
        | $+$% $$$% $ $% $+$% $ $% $ $% $$+%  % % $$ % $$ % $$ % $$ % $$ % |
        | $ $% $ $% $ $% $ $% $ $% $ $% $$ %  % % $$ % $$ % $$ % $$ % $$ % |
    0.9x+-$.$%.$.$%.$.$%.$.$%.$.$%.$.$%.$$.%..%.%.$$.%.$$.%.$$.%.$$.%.$$.%-+
        | $ $% $ $% $ $% $ $% $ $% $ $% $$ % $$+% $$ % $$ % $$ % $$ % $$ % |
        | $ $% $ $% $ $% $ $% $ $% $ $% $$ % $$ % $$ % $$ % $$ % $$ % $$ % |
    0.8x+-$$$%-$$$%-$$$%-$$$%-$$$%-$$$%-$$%%-$$%%-$$%%-$$%%-$$%%-$$%%-$$%%-+
          astarbzip2 gcc gobmh264rehlibquantumcfomneperlbensjexalancbhmean
  png: http://imgur.com/55iJJgD

That is, a 10.19% hmean improvement for jr+hash (this commit).

-               NBench, arm-linux-user. Host: Intel i7-4790K @ 4.00GHz
                              Y axis: Speedup over 95b31d70

    1.35x+-+-------------------------------------------------------------+-+
         |               @@@   jr              $$                          |
     1.3x+-+.............@.@.  jr+inline       %%  ...@@@................+-+
         |               @ @   jr+inline+hash  @@     @ @                  |
         |               @ @                          @ @                  |
    1.25x+-+.............@.@..........................@.@................+-+
         |               @ @                    @@@   @ @                  |
     1.2x+-+.............@.@..................$$%.@...@.@................+-+
         |               @ @                  $$% @   @ @                  |
         |               @ @        %%@       $$% @  %% @                  |
    1.15x+-+.............@.@........%%@.......$$%.@$$$%.@................+-+
         |               @ @        %%@       $$% @$ $% @                  |
     1.1x+-+.............@.@......$$$%@.......$$%.@$.$%.@...............@@-+
         |               @ @      $ $%@       $$% @$ $% @               @@ |
         |               @ @      $ $%@ $$%%@ $$% @$ $% @            $$%%@ |
    1.05x+-+...........$$%.@$$$%@@$.$%@.$$.%@.$$%.@$.$%.@.........@@.$$.%@-+
         | $$%%@       $$% @$ $% @$ $%@ $$ %@ $$% @$ $% @       %%%@ $$ %@ |
       1x+-$$.%@AR%%%@R$$%B@$G$%P@$T$%@_$$+%@l$$%+@$s$%.@$$$%@.$$.%@.$$.%@-+
         +-$$%%@-$$%%@-$$%@@$$$%@@$$$%@-$$%%@-$$%@@$$$%@@$$$%@-$$%%@-$$%%@-+
        ASSIGNMBITFIELFOFP_EMULATHUFFMANLU_DECOMPNEURNUMERICSTRING_SOhmean
  png: http://imgur.com/i5e1gdY

That is, a 11% hmean perf gain--it almost doubles the perf gain
from implementing the jr optimization.

-              NBench, x86_64-linux-user. Host: Intel i7-4790K @ 4.00GHz

     1.1x+-+-------------------------------------------------------------+-+
         |         jr             $$                                       |
    1.08x+-+.....  jr+inline      %%  ...................................+-+
         |         jr+inline+hash @@                                       |
         | $$ @@                                                           |
    1.06x+-$$.@@.........................%%%.............................+-+
         | $$%%@                         % %                               |
    1.04x+-$$.%@.........................%.%.............................+-+
         | $$ %@         @@@            $$ %                   $$          |
         | $$ %@         @ @  %%        $$ %                   $$%%@       |
    1.02x+-$$.%@........%%.@$$$%@@......$$.%@..%%@@..%%........$$.%@.$$%%@-+
         | $$ %@    @@  %% @$ $% @$$$   $$ %@ $$% @  %%@@$$$%  $$ %@ $$ %@ |
       1x+-$$.%@A$$R@@RG%%B@$G$%P@$T$%P_$$T%@h$$%+@$$$%e@$.$%@.$$.%@.$$.%@-+
         | $$ %@ $$%%@ $$% @$ $% @$ $%  $$ %@ $$% @$ $% @$ $%@ $$ %@ $$ %@ |
    0.98x+-$$.%@.$$.%@.$$%.@$.$%.@$.$%@.$$.%@.$$%.@$.$%.@$.$%@.$$.%@.$$.%@-+
         | $$ %@ $$ %@ $$% @$ $% @$ $%@ $$ %@ $$% @$ $% @$ $%@ $$ %@ $$ %@ |
         | $$ %@ $$ %@ $$% @$ $% @$ $%@ $$ %@ $$% @$ $% @$ $%@ $$ %@ $$ %@ |
    0.96x+-$$.%@.$$.%@.$$%.@$.$%.@$.$%@.$$.%@.$$%.@$.$%.@$.$%@.$$.%@.$$.%@-+
         +-$$%%@-$$%%@-$$%@@$$$%@@$$$%@-$$%%@-$$%@@$$$%@@$$$%@-$$%%@-$$%%@-+
        ASSIGNMBITFIELFOFP_EMULATHUFFMANLU_DECOMPNEURNUMERICSTRING_SOhmean
  png: http://imgur.com/Xu0Owgu

The fact that NBench is not very sensitive to changes here was mentioned
in the previous commit's log. We get a very slight overall decrease in hmean
performance, although some workloads improve as well. Note that there are
no error bars: NBench re-runs itself until confidence on the stability of
the average is >= 95%, and it doesn't report the resulting stddev.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/tb-hash.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 2c27490..b1fe2d0 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -22,6 +22,8 @@
 
 #include "exec/tb-hash-xx.h"
 
+#ifdef CONFIG_SOFTMMU
+
 /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
    addresses on the same page.  The top bits are the same.  This allows
    TLB invalidation to quickly clear a subset of the hash table.  */
@@ -45,6 +47,16 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
            | (tmp & TB_JMP_ADDR_MASK));
 }
 
+#else
+
+/* In user-mode we can get better hashing because we do not have a TLB */
+static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
+{
+    return (pc ^ (pc >> TB_JMP_CACHE_BITS)) & (TB_JMP_CACHE_SIZE - 1);
+}
+
+#endif /* CONFIG_SOFTMMU */
+
 static inline
 uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint32_t flags)
 {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 09/10] target/i386: optimize indirect branches with TCG's jr op
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 09/10] target/i386: " Emilio G. Cota
@ 2017-04-12  3:43   ` Paolo Bonzini
  2017-04-13  1:46     ` Emilio G. Cota
  0 siblings, 1 reply; 21+ messages in thread
From: Paolo Bonzini @ 2017-04-12  3:43 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel
  Cc: Peter Crosthwaite, Richard Henderson, Peter Maydell,
	Eduardo Habkost, Claudio Fontana, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar



On 12/04/2017 09:17, Emilio G. Cota wrote:
> 
> The fact that NBench is not very sensitive to changes here is a
> little surprising, especially given the significant improvements for
> ARM shown in the previous commit. I wonder whether the compiler is doing
> a better job compiling the x86_64 version (I'm using gcc 5.4.0), or I'm simply
> missing some i386 instructions to which the jr optimization should
> be applied.

Maybe it is "ret"?  That would be a straightforward "bx lr" on ARM, but
it is missing in your i386 patch.

Paolo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 10/10] tb-hash: improve tb_jmp_cache hash function in user mode
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 10/10] tb-hash: improve tb_jmp_cache hash function in user mode Emilio G. Cota
@ 2017-04-12  3:46   ` Paolo Bonzini
  2017-04-12  5:07     ` Emilio G. Cota
  0 siblings, 1 reply; 21+ messages in thread
From: Paolo Bonzini @ 2017-04-12  3:46 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel
  Cc: Peter Crosthwaite, Richard Henderson, Peter Maydell,
	Eduardo Habkost, Claudio Fontana, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar



On 12/04/2017 09:17, Emilio G. Cota wrote:
> +
> +/* In user-mode we can get better hashing because we do not have a TLB */
> +static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
> +{
> +    return (pc ^ (pc >> TB_JMP_CACHE_BITS)) & (TB_JMP_CACHE_SIZE - 1);
> +}

What about multiplicative hashing?

	return (uint64_t) (pc * 2654435761) >> 32;

Paolo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 10/10] tb-hash: improve tb_jmp_cache hash function in user mode
  2017-04-12  3:46   ` Paolo Bonzini
@ 2017-04-12  5:07     ` Emilio G. Cota
  0 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2017-04-12  5:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-devel, Peter Crosthwaite, Richard Henderson, Peter Maydell,
	Eduardo Habkost, Andrzej Zaborowski, Aurelien Jarno,
	Alexander Graf, Stefan Weil, qemu-arm, alex.bennee,
	Pranith Kumar

On Wed, Apr 12, 2017 at 11:46:47 +0800, Paolo Bonzini wrote:
> 
> 
> On 12/04/2017 09:17, Emilio G. Cota wrote:
> > +
> > +/* In user-mode we can get better hashing because we do not have a TLB */
> > +static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
> > +{
> > +    return (pc ^ (pc >> TB_JMP_CACHE_BITS)) & (TB_JMP_CACHE_SIZE - 1);
> > +}
> 
> What about multiplicative hashing?
> 
> 	return (uint64_t) (pc * 2654435761) >> 32;

I tested this one, taking the TB_JMP_CACHE_SIZE-1 lower bits of
the result:

  http://imgur.com/QIhm875

In terms of quality it's good (I profile hit rates and they're all
pretty good), but shift+xor are just so hard to beat: shift+xor
take 1 cycle each; the multiplication takes on my machine 3 or 4
cycles (source: Fog's tables).

Thanks,

		E.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10
  2017-04-12  1:17 [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Emilio G. Cota
                   ` (9 preceding siblings ...)
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 10/10] tb-hash: improve tb_jmp_cache hash function in user mode Emilio G. Cota
@ 2017-04-12 10:03 ` Alex Bennée
  10 siblings, 0 replies; 21+ messages in thread
From: Alex Bennée @ 2017-04-12 10:03 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: qemu-devel, Paolo Bonzini, Peter Crosthwaite, Richard Henderson,
	Peter Maydell, Eduardo Habkost, Claudio Fontana,
	Andrzej Zaborowski, Aurelien Jarno, Alexander Graf, Stefan Weil,
	qemu-arm, Pranith Kumar


Emilio G. Cota <cota@braap.org> writes:

> Hi all,
>
> This series is aimed at 2.10 or beyond. Its goal is to improve
> TCG performance by optimizing:
>
> 1- Cross-page direct jumps (softmmu only, obviously). Patches 1-4.
> 2- Indirect branches (softmmu and user-mode). Patches 5-9.
> 3- tb_jmp_cache hashing in user-mode. Patch 10.
>
> I decided to work on this after reading this paper [1] (code at [2]),
> which among other optimizations it proposes solutions for 1 and 2.
> I followed the same overall scheme they follow, that is to use helpers
> to check whether the target vaddr is valid, and if so, jump to its
> corresponding translated code (host address) without having to go back
> to the exec loop. My implementation differs from that in the paper
> in that it uses tb_jmp_cache instead of adding more caches,
> which is simpler and probably more resilient in environments
> where TLB invalidations are frequent (in the paper they acknowledge
> that they limited background processes to a minimum, which isn't
> realistic).

Hi Emilio,

If you want to get some numbers on TLB invalidations please have a look
at my WIP branch:

  https://github.com/stsquad/qemu/tree/misc/tlb-flush-stats

It's mainly an experiment at how easy it is to extract number data using
QEMU's trace subsystem (it turns out pretty easy). I had started looking
at the execution trace but got a little bogged down with re-implementing
hashes in python - it would be nice if we could just ctype dll load the
C implementation (or maybe just save the computed hashes in another
trace point rather than inferring via exec_tb).

>
> These changes require modifications on the targets and, for optimization
> number 2, a new TCG opcode to jump to a host address contained in a register.
>
> For now I only implemented this for the i386 and arm targets, and
> the i386 TCG backend. Other targets/backends can easily opt-in.
>
> The 3rd optimization is implemented in the last patch: it improves
> tb_jmp_cache hashing for user-mode by removing the requirement of
> being able to clear parts of the cache given a page number, since this
> requirement only applies to softmmu.
>
> The series applies cleanly on top of 95b31d709ba34.
>
> The commit logs include many measurements, performed using SPECint06 and
> NBench from dbt-bench[3].
>
> Feedback welcome! Thanks,

Given my notes above I think it would be worthwhile coming up with some
trace-points in the helpers and hash lookups so we can analyse their
behaviour as well as just looking at the performance improvement in
benchmarks.

>
> 		Emilio
>
> [1] "Optimizing Control Transfer and Memory Virtualization
> in Full System Emulators", Ding-Yong Hong, Chun-Chen Hsu, Cheng-Yi Chou,
> Wei-Chung Hsu, Pangfeng Liu, Jan-Jan Wu. ACM TACO, Jan. 2016.
>   http://www.iis.sinica.edu.tw/page/library/TechReport/tr2015/tr15002.pdf
>
> [2] https://github.com/tkhsu/quick-android-emulator/tree/quick-qemu
>
> [3] https://github.com/cota/dbt-bench


--
Alex Bennée

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 09/10] target/i386: optimize indirect branches with TCG's jr op
  2017-04-12  3:43   ` Paolo Bonzini
@ 2017-04-13  1:46     ` Emilio G. Cota
  2017-04-14  5:17       ` Paolo Bonzini
  0 siblings, 1 reply; 21+ messages in thread
From: Emilio G. Cota @ 2017-04-13  1:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-devel, Peter Crosthwaite, Richard Henderson, Peter Maydell,
	Eduardo Habkost, Claudio Fontana, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar

On Wed, Apr 12, 2017 at 11:43:45 +0800, Paolo Bonzini wrote:
> 
> 
> On 12/04/2017 09:17, Emilio G. Cota wrote:
> > 
> > The fact that NBench is not very sensitive to changes here is a
> > little surprising, especially given the significant improvements for
> > ARM shown in the previous commit. I wonder whether the compiler is doing
> > a better job compiling the x86_64 version (I'm using gcc 5.4.0), or I'm simply
> > missing some i386 instructions to which the jr optimization should
> > be applied.
> 
> Maybe it is "ret"?  That would be a straightforward "bx lr" on ARM, but
> it is missing in your i386 patch.

Yes I missed that. I added this fix-up:

diff --git a/target/i386/translate.c b/target/i386/translate.c
index aab5c13..f2b5a0f 100644
--- a/target/i386/translate.c
+++ b/target/i386/translate.c
@@ -6430,7 +6430,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
         /* Note that gen_pop_T0 uses a zero-extending load.  */
         gen_op_jmp_v(cpu_T0);
         gen_bnd_jmp(s);
-        gen_eob(s);
+        gen_jr(s, cpu_T0);
         break;
     case 0xc3: /* ret */
         ot = gen_pop_T0(s);
@@ -6438,7 +6438,7 @@ static target_ulong disas_insn(CPUX86State *env, DisasContext *s,
         /* Note that gen_pop_T0 uses a zero-extending load.  */
         gen_op_jmp_v(cpu_T0);
         gen_bnd_jmp(s);
-        gen_eob(s);
+        gen_jr(s, cpu_T0);
         break;
     case 0xca: /* lret im */
         val = cpu_ldsw_code(env, s->pc);

Any other instructions I should look into? Perhaps lret/lret im?

Anyway, nbench does not improve much with the above. The reason seems to be
that it's full of direct jumps (visible with -d in_asm). Also tried softmmu
to see whether these jumps are in-page or not: peak improvement is ~8%, so
I guess most of them are in-page. See http://imgur.com/EKRrYUz

I'm running new tests on a server with no other users and which has
frequency scaling disabled. This should help get less noisy numbers,
since I'm having trouble replicating my own results :> (I used my desktop
machine until now). Will post these numbers tomorrow (running overnight
SPECint both train and set sizes).

Thanks,

		Emilio

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 05/10] tcg: add jr opcode
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 05/10] tcg: add jr opcode Emilio G. Cota
@ 2017-04-13  5:09   ` Paolo Bonzini
  2017-04-15 11:40   ` Richard Henderson
  1 sibling, 0 replies; 21+ messages in thread
From: Paolo Bonzini @ 2017-04-13  5:09 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel
  Cc: Peter Maydell, Eduardo Habkost, Peter Crosthwaite, Stefan Weil,
	Claudio Fontana, Alexander Graf, alex.bennee, qemu-arm,
	Pranith Kumar, Aurelien Jarno, Richard Henderson



On 12/04/2017 09:17, Emilio G. Cota wrote:
> This will be used by TCG targets to implement a fast path
> for indirect branches.
> 
> I only have implemented and tested this on an i386 host, so
> make this opcode optional and mark it as not implemented by
> other TCG backends.

Please don't forget to document this in tcg/README.

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 09/10] target/i386: optimize indirect branches with TCG's jr op
  2017-04-13  1:46     ` Emilio G. Cota
@ 2017-04-14  5:17       ` Paolo Bonzini
  0 siblings, 0 replies; 21+ messages in thread
From: Paolo Bonzini @ 2017-04-14  5:17 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: qemu-devel, Peter Crosthwaite, Richard Henderson, Peter Maydell,
	Eduardo Habkost, Claudio Fontana, Andrzej Zaborowski,
	Aurelien Jarno, Alexander Graf, Stefan Weil, qemu-arm,
	alex.bennee, Pranith Kumar


> Any other instructions I should look into? Perhaps lret/lret im?

Possibly (for completeness), but they are extremely rare in 32- and
64-bit code.

You also didn't cover any of syscall/sysret and sysenter/sysexit in your
patch, which would be on a relatively slow path but not _that_ slow.
But that probably should be a separate patch, moving the env->eip
assignment from seg_helper.c to translate.c and using the resulting TCGv
as the argument for jr.

Paolo

> Anyway, nbench does not improve much with the above. The reason seems to be
> that it's full of direct jumps (visible with -d in_asm). Also tried softmmu
> to see whether these jumps are in-page or not: peak improvement is ~8%, so
> I guess most of them are in-page. See http://imgur.com/EKRrYUz
> 
> I'm running new tests on a server with no other users and which has
> frequency scaling disabled. This should help get less noisy numbers,
> since I'm having trouble replicating my own results :> (I used my desktop
> machine until now). Will post these numbers tomorrow (running overnight
> SPECint both train and set sizes).
> 
> Thanks,
> 
> 		Emilio
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 03/10] target/arm: optimize cross-page block chaining in softmmu
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 03/10] target/arm: optimize cross-page block chaining in softmmu Emilio G. Cota
@ 2017-04-15 11:24   ` Richard Henderson
  0 siblings, 0 replies; 21+ messages in thread
From: Richard Henderson @ 2017-04-15 11:24 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel
  Cc: Peter Maydell, Eduardo Habkost, Peter Crosthwaite, Stefan Weil,
	Claudio Fontana, Alexander Graf, alex.bennee, qemu-arm,
	Pranith Kumar, Paolo Bonzini, Aurelien Jarno

On 04/11/2017 06:17 PM, Emilio G. Cota wrote:
> +uint32_t HELPER(cross_page_check)(CPUARMState *env, target_ulong vaddr)
> +{
> +    return !!tb_from_jmp_cache(env, vaddr);
> +}

FWIW, helpers like this that are intended to be used by more than one target 
should go into tcg-runtime.[ch].

That said, I don't think this is the proper abstraction.  More later...


r~

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 05/10] tcg: add jr opcode
  2017-04-12  1:17 ` [Qemu-devel] [PATCH 05/10] tcg: add jr opcode Emilio G. Cota
  2017-04-13  5:09   ` Paolo Bonzini
@ 2017-04-15 11:40   ` Richard Henderson
  2017-04-16 18:28     ` Emilio G. Cota
  1 sibling, 1 reply; 21+ messages in thread
From: Richard Henderson @ 2017-04-15 11:40 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel
  Cc: Peter Maydell, Eduardo Habkost, Peter Crosthwaite, Stefan Weil,
	Claudio Fontana, Alexander Graf, alex.bennee, qemu-arm,
	Pranith Kumar, Paolo Bonzini, Aurelien Jarno

On 04/11/2017 06:17 PM, Emilio G. Cota wrote:
> This will be used by TCG targets to implement a fast path
> for indirect branches.
>
> I only have implemented and tested this on an i386 host, so
> make this opcode optional and mark it as not implemented by
> other TCG backends.

I don't think this is quite the right abstraction.  In particular, if we can 
always return a valid address from the helper, we can eliminate a conditional 
branch.

I think this should work as follows:

(1) tb_ret_addr gets moved into TCGContext so that it's available for other 
code to see.

(2) Have a generic helper

void *HELPER(lookup_tb_ptr)(CPUArchState *env, target_ulong addr)
{
     TranslationBlock *tb = tb_from_jmp_cache(env, addr);
     return tb ? tb->tc_ptr : tcg_ctx.tb_ret_addr;
}

(3) Emit TCG opcodes like

	call t0,lookup_tb_ptr,env,addr
	jmp_tb t0

(4) Emit code for jmp_tb like

	mov	%rax,%rdx	// save target into new register
	xor	%eax,%eax	// set return value a-la exit_tb
	jmp	*%edx		// branch to tb or epilogue.

(5) There needs to be a convenience function in tcg/tcg-op.c.  If the host does 
not support jmp_tb, we should just generate exit_tb like we do now.  There 
should be no ifdefs inside target/*.



r~

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Qemu-devel] [PATCH 05/10] tcg: add jr opcode
  2017-04-15 11:40   ` Richard Henderson
@ 2017-04-16 18:28     ` Emilio G. Cota
  0 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2017-04-16 18:28 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, Peter Maydell, Eduardo Habkost, Peter Crosthwaite,
	Stefan Weil, Alexander Graf, alex.bennee, qemu-arm,
	Pranith Kumar, Paolo Bonzini, Aurelien Jarno

On Sat, Apr 15, 2017 at 04:40:35 -0700, Richard Henderson wrote:
> On 04/11/2017 06:17 PM, Emilio G. Cota wrote:
> >This will be used by TCG targets to implement a fast path
> >for indirect branches.
> >
> >I only have implemented and tested this on an i386 host, so
> >make this opcode optional and mark it as not implemented by
> >other TCG backends.
> 
> I don't think this is quite the right abstraction.  In particular, if we can
> always return a valid address from the helper, we can eliminate a
> conditional branch.
> 
> I think this should work as follows:
(snip)

Yes that's much better. In fact in the cover letter I forgot to
mention that the code by the paper authors does something similar
to avoid the branch.

I went with the design with a branch because (1) I wasn't sure that
exporting tb_ret_addr would get your approval and (2) my knowledge
of TCG backend code is shamefully poor.

Will work on a v2. Thanks for the feedback!

		Emilio

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-04-16 18:28 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-12  1:17 [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Emilio G. Cota
2017-04-12  1:17 ` [Qemu-devel] [PATCH 01/10] exec-all: add tb_from_jmp_cache Emilio G. Cota
2017-04-12  1:17 ` [Qemu-devel] [PATCH 02/10] exec-all: inline tb_from_jmp_cache Emilio G. Cota
2017-04-12  1:17 ` [Qemu-devel] [PATCH 03/10] target/arm: optimize cross-page block chaining in softmmu Emilio G. Cota
2017-04-15 11:24   ` Richard Henderson
2017-04-12  1:17 ` [Qemu-devel] [PATCH 04/10] target/i386: " Emilio G. Cota
2017-04-12  1:17 ` [Qemu-devel] [PATCH 05/10] tcg: add jr opcode Emilio G. Cota
2017-04-13  5:09   ` Paolo Bonzini
2017-04-15 11:40   ` Richard Henderson
2017-04-16 18:28     ` Emilio G. Cota
2017-04-12  1:17 ` [Qemu-devel] [PATCH 06/10] tcg: add brcondi_ptr Emilio G. Cota
2017-04-12  1:17 ` [Qemu-devel] [PATCH 07/10] tcg: add tcg_temp_local_new_ptr Emilio G. Cota
2017-04-12  1:17 ` [Qemu-devel] [PATCH 08/10] target/arm: optimize indirect branches with TCG's jr op Emilio G. Cota
2017-04-12  1:17 ` [Qemu-devel] [PATCH 09/10] target/i386: " Emilio G. Cota
2017-04-12  3:43   ` Paolo Bonzini
2017-04-13  1:46     ` Emilio G. Cota
2017-04-14  5:17       ` Paolo Bonzini
2017-04-12  1:17 ` [Qemu-devel] [PATCH 10/10] tb-hash: improve tb_jmp_cache hash function in user mode Emilio G. Cota
2017-04-12  3:46   ` Paolo Bonzini
2017-04-12  5:07     ` Emilio G. Cota
2017-04-12 10:03 ` [Qemu-devel] [PATCH 00/10] TCG optimizations for 2.10 Alex Bennée

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.