* [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
@ 2012-10-09 12:37 Yeongkyoon Lee
  2012-10-09 12:37 ` [Qemu-devel] [PATCH v5 1/3] configure: Add CONFIG_QEMU_LDST_OPTIMIZATION for TCG qemu_ld/st optimization Yeongkyoon Lee
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Yeongkyoon Lee @ 2012-10-09 12:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: Yeongkyoon Lee

Hi, all.

Here is the 5th version of the series optimizing TCG qemu_ld/st code generation.

v5:
  - Remove RFC tag

v4:
  - Remove CONFIG_SOFTMMU pre-condition from configure
  - Instead, add some CONFIG_SOFTMMU condition to TCG sources
  - Remove some unnecessary comments

v3:
  - Support CONFIG_TCG_PASS_AREG0
    (expected to get more performance enhancement than others)
  - Remove the configure option "--enable-ldst-optimization""
  - Make the optimization as default on i386 and x86_64 hosts
  - Fix some mistyping and apply checkpatch.pl before committing
  - Test i386, arm and sparc softmmu targets on i386 and x86_64 hosts
  - Test linux-user-test-0.3

v2:
  - Follow the submit rule of qemu

v1:
  - Initial commit request

I think the code generated from qemu_ld/st IRs is relatively heavy: up to 12
instructions for the TLB hit case on an i386 host.
This patch series improves the quality of the code generated for TCG qemu_ld/st
IRs by removing a jump and improving locality.
The main idea is simple and has already been described in the comments in
tcg-target.c: separate the slow path (the TLB miss case) and generate it at the
end of the TB.

For example, the code generated for qemu_ld changes as follows; a rough sketch
of the mechanism appears after the two layouts.
Before:
(1) TLB check
(2) If hit fall through, else jump to TLB miss case (5)
(3) TLB hit case: Load value from host memory
(4) Jump to next code (6)
(5) TLB miss case: call MMU helper
(6) ... (next code)

After:
(1) TLB check
(2) If hit fall through, else jump to TLB miss case (7)
(3) TLB hit case: Load value from host memory
(4) ... (next code)
...
(7) TLB miss case: call MMU helper
(8) Return to next code (4)
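
As an illustration only, the following toy program sketches the "record now,
emit at the end" pattern the backend follows; every name and number in it is
made up, while the real implementation is add_qemu_ldst_label() and
tcg_out_qemu_ldst_slow_path() in patch 3/3.

#include <stdint.h>
#include <stdio.h>

#define MAX_LDST_LABELS 640               /* mirrors TCG_MAX_QEMU_LDST */

typedef struct {
    int opc;                              /* which qemu_ld/st variant */
    uintptr_t patch_site;                 /* "jne slow_path" left by the TLB check */
    uintptr_t raddr;                      /* next-code address to jump back to */
} LdstLabel;

static LdstLabel labels[MAX_LDST_LABELS];
static int nb_labels;

/* per qemu_ld/st: emit the TLB check and hit-case load only (steps 1-3),
   record what the deferred slow path will need, then fall through (step 4) */
static void emit_qemu_ld(int opc, uintptr_t code_ptr)
{
    if (nb_labels < MAX_LDST_LABELS) {
        labels[nb_labels++] = (LdstLabel){ opc, code_ptr + 4, code_ptr + 16 };
    }
}

/* once per TB, after every other IR: emit all recorded slow paths (steps 7-8) */
static void emit_slow_paths(void)
{
    for (int i = 0; i < nb_labels; i++) {
        printf("slow path %d: patch branch at %#lx, call MMU helper, "
               "jmp back to %#lx\n", i,
               (unsigned long)labels[i].patch_site,
               (unsigned long)labels[i].raddr);
    }
}

int main(void)
{
    emit_qemu_ld(2, 0x1000);              /* a 32-bit qemu_ld inside the TB */
    emit_qemu_ld(3, 0x1040);              /* a 64-bit qemu_ld later in the TB */
    emit_slow_paths();                    /* both miss cases land after the TB */
    return 0;
}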

Below are some performance results measured on qemu 1.0.
Although there was measurement error, the results were not negligible.

* EEMBC CoreMark (before -> after)
  - Guest: i386, Linux (Tizen platform)
  - Host: Intel Core2 Quad 2.4GHz, 2GB RAM, Linux
  - Results: 1135.6 -> 1179.9 (+3.9%)

* nbench (before -> after)
  - Guest: i386, Linux (linux-0.2.img included in QEMU source)
  - Host: Intel Core2 Quad 2.4GHz, 2GB RAM, Linux
  - Results
    . MEMORY INDEX: 1.6782 -> 1.6818 (+0.2%)
    . INTEGER INDEX: 1.8258 -> 1.877 (+2.8%)
    . FLOATING-POINT INDEX: 0.5944 -> 0.5954 (+0.2%)

Summarized features:
 - The changes are wrapped in the macro "CONFIG_QEMU_LDST_OPTIMIZATION" and
   are enabled by default on i386/x86_64 hosts
 - Forcibly removing the macro will cause a compilation error on i386/x86_64 hosts
 - There is no implementation for hosts other than i386/x86_64 yet

In addition, I have tried to remove the generated code that calls the MMU
helpers for the TLB miss case from the end of the TB, but have not found a good
solution yet.
In my opinion, TLB hit performance could degrade if that call code were removed,
because the runtime parameters, such as the data, mmu index and return address,
would have to be set up in registers or on the stack even though they are not
used in the TLB hit case.
This remains an open issue.

Yeongkyoon Lee (3):
  configure: Add CONFIG_QEMU_LDST_OPTIMIZATION for TCG qemu_ld/st
    optimization
  tcg: Add declarations and templates of extended MMU helpers
  tcg: Optimize qemu_ld/st by generating slow paths at the end of a
    block

 configure             |    6 +
 softmmu_defs.h        |   39 +++++
 softmmu_header.h      |   15 ++
 softmmu_template.h    |   41 ++++-
 tcg/i386/tcg-target.c |  420 ++++++++++++++++++++++++++++++++-----------------
 tcg/tcg.c             |   13 ++
 tcg/tcg.h             |   35 ++++
 7 files changed, 416 insertions(+), 153 deletions(-)

--
1.7.5.4


* [Qemu-devel] [PATCH v5 1/3] configure: Add CONFIG_QEMU_LDST_OPTIMIZATION for TCG qemu_ld/st optimization
  2012-10-09 12:37 [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs Yeongkyoon Lee
@ 2012-10-09 12:37 ` Yeongkyoon Lee
  2012-10-09 12:37 ` [Qemu-devel] [PATCH v5 2/3] tcg: Add declarations and templates of extended MMU helpers Yeongkyoon Lee
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Yeongkyoon Lee @ 2012-10-09 12:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: Yeongkyoon Lee

Enable CONFIG_QEMU_LDST_OPTIMIZATION for the TCG qemu_ld/st optimization only
when the host is i386 or x86_64.

Signed-off-by: Yeongkyoon Lee <yeongkyoon.lee@samsung.com>
---
 configure |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/configure b/configure
index e58846d..b02e079 100755
--- a/configure
+++ b/configure
@@ -3856,6 +3856,12 @@ upper() {
     echo "$@"| LC_ALL=C tr '[a-z]' '[A-Z]'
 }

+case "$cpu" in
+  i386|x86_64)
+    echo "CONFIG_QEMU_LDST_OPTIMIZATION=y" >> $config_target_mak
+  ;;
+esac
+
 echo "TARGET_SHORT_ALIGNMENT=$target_short_alignment" >> $config_target_mak
 echo "TARGET_INT_ALIGNMENT=$target_int_alignment" >> $config_target_mak
 echo "TARGET_LONG_ALIGNMENT=$target_long_alignment" >> $config_target_mak
--
1.7.5.4


* [Qemu-devel] [PATCH v5 2/3] tcg: Add declarations and templates of extended MMU helpers
  2012-10-09 12:37 [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs Yeongkyoon Lee
  2012-10-09 12:37 ` [Qemu-devel] [PATCH v5 1/3] configure: Add CONFIG_QEMU_LDST_OPTIMIZATION for TCG qemu_ld/st optimization Yeongkyoon Lee
@ 2012-10-09 12:37 ` Yeongkyoon Lee
  2012-10-09 18:36   ` Richard Henderson
  2012-10-09 12:37 ` [Qemu-devel] [PATCH v5 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block Yeongkyoon Lee
  2012-10-09 14:26 ` [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs Aurelien Jarno
  3 siblings, 1 reply; 16+ messages in thread
From: Yeongkyoon Lee @ 2012-10-09 12:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: Yeongkyoon Lee

Add declarations and templates of extended MMU helpers.
An extended helper takes an additional argument: the address of the host code
accessing guest memory. This differs from the address of the helper call site,
because the call sites are located at the end of the generated code block.

Signed-off-by: Yeongkyoon Lee <yeongkyoon.lee@samsung.com>
---
 softmmu_defs.h     |   39 +++++++++++++++++++++++++++++++++++++++
 softmmu_header.h   |   15 +++++++++++++++
 softmmu_template.h |   41 +++++++++++++++++++++++++++++++++--------
 3 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/softmmu_defs.h b/softmmu_defs.h
index 1f25e33..a93adf0 100644
--- a/softmmu_defs.h
+++ b/softmmu_defs.h
@@ -9,6 +9,7 @@
 #ifndef SOFTMMU_DEFS_H
 #define SOFTMMU_DEFS_H

+#ifndef CONFIG_QEMU_LDST_OPTIMIZATION
 uint8_t helper_ldb_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
 void helper_stb_mmu(CPUArchState *env, target_ulong addr, uint8_t val,
                     int mmu_idx);
@@ -34,4 +35,42 @@ void helper_stl_cmmu(CPUArchState *env, target_ulong addr, uint32_t val,
 uint64_t helper_ldq_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
 void helper_stq_cmmu(CPUArchState *env, target_ulong addr, uint64_t val,
                      int mmu_idx);
+#else
+/* Extended versions of MMU helpers for qemu_ld/st optimization.
+   The additional argument is a host code address accessing guest memory */
+uint8_t ext_helper_ldb_mmu(CPUArchState *env, target_ulong addr, int mmu_idx,
+                           uintptr_t ra);
+void ext_helper_stb_mmu(CPUArchState *env, target_ulong addr, uint8_t val,
+                        int mmu_idx, uintptr_t ra);
+uint16_t ext_helper_ldw_mmu(CPUArchState *env, target_ulong addr, int mmu_idx,
+                            uintptr_t ra);
+void ext_helper_stw_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
+                        int mmu_idx, uintptr_t ra);
+uint32_t ext_helper_ldl_mmu(CPUArchState *env, target_ulong addr, int mmu_idx,
+                            uintptr_t ra);
+void ext_helper_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                        int mmu_idx, uintptr_t ra);
+uint64_t ext_helper_ldq_mmu(CPUArchState *env, target_ulong addr, int mmu_idx,
+                            uintptr_t ra);
+void ext_helper_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
+                        int mmu_idx, uintptr_t ra);
+
+uint8_t ext_helper_ldb_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx,
+                            uintptr_t ra);
+void ext_helper_stb_cmmu(CPUArchState *env, target_ulong addr, uint8_t val,
+                         int mmu_idx, uintptr_t ra);
+uint16_t ext_helper_ldw_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx,
+                             uintptr_t ra);
+void ext_helper_stw_cmmu(CPUArchState *env, target_ulong addr, uint16_t val,
+                         int mmu_idx, uintptr_t ra);
+uint32_t ext_helper_ldl_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx,
+                             uintptr_t ra);
+void ext_helper_stl_cmmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                         int mmu_idx, uintptr_t ra);
+uint64_t ext_helper_ldq_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx,
+                             uintptr_t ra);
+void ext_helper_stq_cmmu(CPUArchState *env, target_ulong addr, uint64_t val,
+                         int mmu_idx, uintptr_t ra);
+#endif  /* CONFIG_QEMU_LDST_OPTIMIZATION */
+
 #endif
diff --git a/softmmu_header.h b/softmmu_header.h
index d8d9c81..d18c8f8 100644
--- a/softmmu_header.h
+++ b/softmmu_header.h
@@ -93,7 +93,12 @@ glue(glue(cpu_ld, USUFFIX), MEMSUFFIX)(CPUArchState *env, target_ulong ptr)
     mmu_idx = CPU_MMU_INDEX;
     if (unlikely(env->tlb_table[mmu_idx][page_index].ADDR_READ !=
                  (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) {
+#ifdef CONFIG_QEMU_LDST_OPTIMIZATION
+        res = glue(glue(ext_helper_ld, SUFFIX), MMUSUFFIX)(env, addr, mmu_idx,
+                                                           (uintptr_t)NULL);
+#else
         res = glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(env, addr, mmu_idx);
+#endif
     } else {
         uintptr_t hostaddr = addr + env->tlb_table[mmu_idx][page_index].addend;
         res = glue(glue(ld, USUFFIX), _raw)(hostaddr);
@@ -114,8 +119,13 @@ glue(glue(cpu_lds, SUFFIX), MEMSUFFIX)(CPUArchState *env, target_ulong ptr)
     mmu_idx = CPU_MMU_INDEX;
     if (unlikely(env->tlb_table[mmu_idx][page_index].ADDR_READ !=
                  (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) {
+#ifdef CONFIG_QEMU_LDST_OPTIMIZATION
+        res = (DATA_STYPE)glue(glue(ext_helper_ld, SUFFIX),
+                               MMUSUFFIX)(env, addr, mmu_idx, (uintptr_t)NULL);
+#else
         res = (DATA_STYPE)glue(glue(helper_ld, SUFFIX),
                                MMUSUFFIX)(env, addr, mmu_idx);
+#endif
     } else {
         uintptr_t hostaddr = addr + env->tlb_table[mmu_idx][page_index].addend;
         res = glue(glue(lds, SUFFIX), _raw)(hostaddr);
@@ -141,7 +151,12 @@ glue(glue(cpu_st, SUFFIX), MEMSUFFIX)(CPUArchState *env, target_ulong ptr,
     mmu_idx = CPU_MMU_INDEX;
     if (unlikely(env->tlb_table[mmu_idx][page_index].addr_write !=
                  (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) {
+#ifdef CONFIG_QEMU_LDST_OPTIMIZATION
+        glue(glue(ext_helper_st, SUFFIX), MMUSUFFIX)(env, addr, v, mmu_idx,
+                                                     (uintptr_t)NULL);
+#else
         glue(glue(helper_st, SUFFIX), MMUSUFFIX)(env, addr, v, mmu_idx);
+#endif
     } else {
         uintptr_t hostaddr = addr + env->tlb_table[mmu_idx][page_index].addend;
         glue(glue(st, SUFFIX), _raw)(hostaddr, v);
diff --git a/softmmu_template.h b/softmmu_template.h
index e2490f0..e40c060 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -54,6 +54,14 @@
 #define ADDR_READ addr_read
 #endif

+#ifdef CONFIG_QEMU_LDST_OPTIMIZATION
+/* An extended MMU helper takes one more argument which is
+   a host address of generated code accessing guest memory */
+#define GET_RET_ADDR() ra
+#else
+#define GET_RET_ADDR() GETPC()
+#endif  /* CONFIG_QEMU_LDST_OPTIMIZATION */
+
 static DATA_TYPE glue(glue(slow_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
                                                         target_ulong addr,
                                                         int mmu_idx,
@@ -91,9 +99,17 @@ static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
 }

 /* handle all cases except unaligned access which span two pages */
+#ifdef CONFIG_QEMU_LDST_OPTIMIZATION
+DATA_TYPE
+glue(glue(ext_helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
+                                             target_ulong addr,
+                                             int mmu_idx,
+                                             uintptr_t ra)
+#else
 DATA_TYPE
 glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
                                          int mmu_idx)
+#endif
 {
     DATA_TYPE res;
     int index;
@@ -111,13 +127,13 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
             /* IO access */
             if ((addr & (DATA_SIZE - 1)) != 0)
                 goto do_unaligned_access;
-            retaddr = GETPC();
+            retaddr = GET_RET_ADDR();
             ioaddr = env->iotlb[mmu_idx][index];
             res = glue(io_read, SUFFIX)(env, ioaddr, addr, retaddr);
         } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
             /* slow unaligned access (it spans two pages or IO) */
         do_unaligned_access:
-            retaddr = GETPC();
+            retaddr = GET_RET_ADDR();
 #ifdef ALIGNED_ONLY
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
 #endif
@@ -128,7 +144,7 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
             uintptr_t addend;
 #ifdef ALIGNED_ONLY
             if ((addr & (DATA_SIZE - 1)) != 0) {
-                retaddr = GETPC();
+                retaddr = GET_RET_ADDR();
                 do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
             }
 #endif
@@ -138,7 +154,7 @@ glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
         }
     } else {
         /* the page is not in the TLB : fill it */
-        retaddr = GETPC();
+        retaddr = GET_RET_ADDR();
 #ifdef ALIGNED_ONLY
         if ((addr & (DATA_SIZE - 1)) != 0)
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
@@ -240,9 +256,17 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
 #endif /* SHIFT > 2 */
 }

+#ifdef CONFIG_QEMU_LDST_OPTIMIZATION
+void glue(glue(ext_helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
+                                                  target_ulong addr,
+                                                  DATA_TYPE val,
+                                                  int mmu_idx,
+                                                  uintptr_t ra)
+#else
 void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
                                               target_ulong addr, DATA_TYPE val,
                                               int mmu_idx)
+#endif
 {
     target_phys_addr_t ioaddr;
     target_ulong tlb_addr;
@@ -257,12 +281,12 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
             /* IO access */
             if ((addr & (DATA_SIZE - 1)) != 0)
                 goto do_unaligned_access;
-            retaddr = GETPC();
+            retaddr = GET_RET_ADDR();
             ioaddr = env->iotlb[mmu_idx][index];
             glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
         } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
         do_unaligned_access:
-            retaddr = GETPC();
+            retaddr = GET_RET_ADDR();
 #ifdef ALIGNED_ONLY
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
 #endif
@@ -273,7 +297,7 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
             uintptr_t addend;
 #ifdef ALIGNED_ONLY
             if ((addr & (DATA_SIZE - 1)) != 0) {
-                retaddr = GETPC();
+                retaddr = GET_RET_ADDR();
                 do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
             }
 #endif
@@ -283,7 +307,7 @@ void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
         }
     } else {
         /* the page is not in the TLB : fill it */
-        retaddr = GETPC();
+        retaddr = GET_RET_ADDR();
 #ifdef ALIGNED_ONLY
         if ((addr & (DATA_SIZE - 1)) != 0)
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
@@ -352,3 +376,4 @@ static void glue(glue(slow_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
 #undef USUFFIX
 #undef DATA_SIZE
 #undef ADDR_READ
+#undef GET_RET_ADDR
--
1.7.5.4


* [Qemu-devel] [PATCH v5 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block
  2012-10-09 12:37 [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs Yeongkyoon Lee
  2012-10-09 12:37 ` [Qemu-devel] [PATCH v5 1/3] configure: Add CONFIG_QEMU_LDST_OPTIMIZATION for TCG qemu_ld/st optimization Yeongkyoon Lee
  2012-10-09 12:37 ` [Qemu-devel] [PATCH v5 2/3] tcg: Add declarations and templates of extended MMU helpers Yeongkyoon Lee
@ 2012-10-09 12:37 ` Yeongkyoon Lee
  2012-10-09 18:49   ` Richard Henderson
  2012-10-09 14:26 ` [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs Aurelien Jarno
  3 siblings, 1 reply; 16+ messages in thread
From: Yeongkyoon Lee @ 2012-10-09 12:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: Yeongkyoon Lee

Add optimized TCG qemu_ld/st generation which places the code for TLB miss
cases at the end of a block, after the other IRs have been generated.
Currently, this optimization supports only i386 and x86_64 hosts.

Signed-off-by: Yeongkyoon Lee <yeongkyoon.lee@samsung.com>
---
 tcg/i386/tcg-target.c |  420 ++++++++++++++++++++++++++++++++-----------------
 tcg/tcg.c             |   13 ++
 tcg/tcg.h             |   35 ++++
 3 files changed, 323 insertions(+), 145 deletions(-)

diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index 0e218c8..4c50542 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -983,24 +983,34 @@ static void tcg_out_jmp(TCGContext *s, tcg_target_long dest)

 #include "../../softmmu_defs.h"

-/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
-   int mmu_idx) */
+/* extended helper signature: ext_helper_ld_mmu(CPUState *env,
+   target_ulong addr, int mmu_idx, uintptr_t raddr) */
 static const void *qemu_ld_helpers[4] = {
-    helper_ldb_mmu,
-    helper_ldw_mmu,
-    helper_ldl_mmu,
-    helper_ldq_mmu,
+    ext_helper_ldb_mmu,
+    ext_helper_ldw_mmu,
+    ext_helper_ldl_mmu,
+    ext_helper_ldq_mmu,
 };

-/* helper signature: helper_st_mmu(CPUState *env, target_ulong addr,
-   uintxx_t val, int mmu_idx) */
+/* extended helper signature: ext_helper_st_mmu(CPUState *env,
+   target_ulong addr, uintxx_t val, int mmu_idx, uintptr_t raddr) */
 static const void *qemu_st_helpers[4] = {
-    helper_stb_mmu,
-    helper_stw_mmu,
-    helper_stl_mmu,
-    helper_stq_mmu,
+    ext_helper_stb_mmu,
+    ext_helper_stw_mmu,
+    ext_helper_stl_mmu,
+    ext_helper_stq_mmu,
 };

+static void add_qemu_ldst_label(TCGContext *s,
+                                int opc_ext,
+                                int data_reg,
+                                int data_reg2,
+                                int addrlo_reg,
+                                int addrhi_reg,
+                                int mem_index,
+                                uint8_t *raddr,
+                                uint8_t **label_ptr);
+
 /* Perform the TLB load and compare.

    Inputs:
@@ -1059,19 +1069,21 @@ static inline void tcg_out_tlb_load(TCGContext *s, int addrlo_idx,

     tcg_out_mov(s, type, r0, addrlo);

-    /* jne label1 */
-    tcg_out8(s, OPC_JCC_short + JCC_JNE);
+    /* jne slow_path */
+    /* XXX: How to avoid using OPC_JCC_long for peephole optimization? */
+    tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
     label_ptr[0] = s->code_ptr;
-    s->code_ptr++;
+    s->code_ptr += 4;

     if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
         /* cmp 4(r1), addrhi */
         tcg_out_modrm_offset(s, OPC_CMP_GvEv, args[addrlo_idx+1], r1, 4);

-        /* jne label1 */
-        tcg_out8(s, OPC_JCC_short + JCC_JNE);
+        /* jne slow_path */
+        /* XXX: How to avoid using OPC_JCC_long for peephole optimization? */
+        tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
         label_ptr[1] = s->code_ptr;
-        s->code_ptr++;
+        s->code_ptr += 4;
     }

     /* TLB Hit.  */
@@ -1169,12 +1181,7 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args,
     int addrlo_idx;
 #if defined(CONFIG_SOFTMMU)
     int mem_index, s_bits;
-#if TCG_TARGET_REG_BITS == 64
-    int arg_idx;
-#else
-    int stack_adjust;
-#endif
-    uint8_t *label_ptr[3];
+    uint8_t *label_ptr[2];
 #endif

     data_reg = args[0];
@@ -1194,93 +1201,16 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args,
     /* TLB Hit.  */
     tcg_out_qemu_ld_direct(s, data_reg, data_reg2, TCG_REG_L0, 0, opc);

-    /* jmp label2 */
-    tcg_out8(s, OPC_JMP_short);
-    label_ptr[2] = s->code_ptr;
-    s->code_ptr++;
-
-    /* TLB Miss.  */
-
-    /* label1: */
-    *label_ptr[0] = s->code_ptr - label_ptr[0] - 1;
-    if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
-        *label_ptr[1] = s->code_ptr - label_ptr[1] - 1;
-    }
-
-    /* XXX: move that code at the end of the TB */
-#if TCG_TARGET_REG_BITS == 32
-    tcg_out_pushi(s, mem_index);
-    stack_adjust = 4;
-    if (TARGET_LONG_BITS == 64) {
-        tcg_out_push(s, args[addrlo_idx + 1]);
-        stack_adjust += 4;
-    }
-    tcg_out_push(s, args[addrlo_idx]);
-    stack_adjust += 4;
-    tcg_out_push(s, TCG_AREG0);
-    stack_adjust += 4;
-#else
-    /* The first argument is already loaded with addrlo.  */
-    arg_idx = 1;
-    tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[arg_idx],
-                 mem_index);
-    /* XXX/FIXME: suboptimal */
-    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[3], TCG_REG_L2);
-    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[2], TCG_REG_L1);
-    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[1], TCG_REG_L0);
-    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
-#endif
-
-    tcg_out_calli(s, (tcg_target_long)qemu_ld_helpers[s_bits]);
-
-#if TCG_TARGET_REG_BITS == 32
-    if (stack_adjust == (TCG_TARGET_REG_BITS / 8)) {
-        /* Pop and discard.  This is 2 bytes smaller than the add.  */
-        tcg_out_pop(s, TCG_REG_ECX);
-    } else if (stack_adjust != 0) {
-        tcg_out_addi(s, TCG_REG_CALL_STACK, stack_adjust);
-    }
-#endif
-
-    switch(opc) {
-    case 0 | 4:
-        tcg_out_ext8s(s, data_reg, TCG_REG_EAX, P_REXW);
-        break;
-    case 1 | 4:
-        tcg_out_ext16s(s, data_reg, TCG_REG_EAX, P_REXW);
-        break;
-    case 0:
-        tcg_out_ext8u(s, data_reg, TCG_REG_EAX);
-        break;
-    case 1:
-        tcg_out_ext16u(s, data_reg, TCG_REG_EAX);
-        break;
-    case 2:
-        tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
-        break;
-#if TCG_TARGET_REG_BITS == 64
-    case 2 | 4:
-        tcg_out_ext32s(s, data_reg, TCG_REG_EAX);
-        break;
-#endif
-    case 3:
-        if (TCG_TARGET_REG_BITS == 64) {
-            tcg_out_mov(s, TCG_TYPE_I64, data_reg, TCG_REG_RAX);
-        } else if (data_reg == TCG_REG_EDX) {
-            /* xchg %edx, %eax */
-            tcg_out_opc(s, OPC_XCHG_ax_r32 + TCG_REG_EDX, 0, 0, 0);
-            tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EAX);
-        } else {
-            tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
-            tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EDX);
-        }
-        break;
-    default:
-        tcg_abort();
-    }
-
-    /* label2: */
-    *label_ptr[2] = s->code_ptr - label_ptr[2] - 1;
+    /* Record the current context of a load into ldst label */
+    add_qemu_ldst_label(s,
+                        opc,
+                        data_reg,
+                        data_reg2,
+                        args[addrlo_idx],
+                        args[addrlo_idx + 1],
+                        mem_index,
+                        s->code_ptr,
+                        label_ptr);
 #else
     {
         int32_t offset = GUEST_BASE;
@@ -1372,8 +1302,7 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
     int addrlo_idx;
 #if defined(CONFIG_SOFTMMU)
     int mem_index, s_bits;
-    int stack_adjust;
-    uint8_t *label_ptr[3];
+    uint8_t *label_ptr[2];
 #endif

     data_reg = args[0];
@@ -1393,23 +1322,220 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
     /* TLB Hit.  */
     tcg_out_qemu_st_direct(s, data_reg, data_reg2, TCG_REG_L0, 0, opc);

-    /* jmp label2 */
-    tcg_out8(s, OPC_JMP_short);
-    label_ptr[2] = s->code_ptr;
-    s->code_ptr++;
+    /* Record the current context of a store into ldst label */
+    add_qemu_ldst_label(s,
+                        opc | HL_ST_MASK,
+                        data_reg,
+                        data_reg2,
+                        args[addrlo_idx],
+                        args[addrlo_idx + 1],
+                        mem_index,
+                        s->code_ptr,
+                        label_ptr);
+#else
+    {
+        int32_t offset = GUEST_BASE;
+        int base = args[addrlo_idx];

-    /* TLB Miss.  */
+        if (TCG_TARGET_REG_BITS == 64) {
+            /* ??? We assume all operations have left us with register
+               contents that are zero extended.  So far this appears to
+               be true.  If we want to enforce this, we can either do
+               an explicit zero-extension here, or (if GUEST_BASE == 0)
+               use the ADDR32 prefix.  For now, do nothing.  */

-    /* label1: */
-    *label_ptr[0] = s->code_ptr - label_ptr[0] - 1;
+            if (offset != GUEST_BASE) {
+                tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_L0, GUEST_BASE);
+                tgen_arithr(s, ARITH_ADD + P_REXW, TCG_REG_L0, base);
+                base = TCG_REG_L0;
+                offset = 0;
+            }
+        }
+
+        tcg_out_qemu_st_direct(s, data_reg, data_reg2, base, offset, opc);
+    }
+#endif
+}
+
+#if defined(CONFIG_SOFTMMU)
+/*
+ * Record the context of a call to the out of line helper code for the slow path
+ * for a load or store, so that we can later generate the correct helper code
+ */
+static void add_qemu_ldst_label(TCGContext *s,
+                                int opc_ext,
+                                int data_reg,
+                                int data_reg2,
+                                int addrlo_reg,
+                                int addrhi_reg,
+                                int mem_index,
+                                uint8_t *raddr,
+                                uint8_t **label_ptr)
+{
+    int idx;
+    TCGLabelQemuLdst *label;
+
+    if (s->nb_qemu_ldst_labels >= TCG_MAX_QEMU_LDST) {
+        tcg_abort();
+    }
+
+    idx = s->nb_qemu_ldst_labels++;
+    label = (TCGLabelQemuLdst *)&s->qemu_ldst_labels[idx];
+    label->opc_ext = opc_ext;
+    label->datalo_reg = data_reg;
+    label->datahi_reg = data_reg2;
+    label->addrlo_reg = addrlo_reg;
+    label->addrhi_reg = addrhi_reg;
+    label->mem_index = mem_index;
+    label->raddr = raddr;
+    label->label_ptr[0] = label_ptr[0];
     if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
-        *label_ptr[1] = s->code_ptr - label_ptr[1] - 1;
+        label->label_ptr[1] = label_ptr[1];
     }
+}

-    /* XXX: move that code at the end of the TB */
+/*
+ * Generate code for the slow path for a load at the end of block
+ */
+static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
+{
+    int s_bits;
+    int opc = label->opc_ext & HL_OPC_MASK;
+    int mem_index = label->mem_index;
 #if TCG_TARGET_REG_BITS == 32
+    int stack_adjust;
+    int addrlo_reg = label->addrlo_reg;
+    int addrhi_reg = label->addrhi_reg;
+#endif
+    int data_reg = label->datalo_reg;
+    int data_reg2 = label->datahi_reg;
+    uint8_t *raddr = label->raddr;
+    uint8_t **label_ptr = &label->label_ptr[0];
+
+    s_bits = opc & 3;
+
+    /* resolve label address */
+    *(uint32_t *)label_ptr[0] = (uint32_t)(s->code_ptr - label_ptr[0] - 4);
+    if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
+        *(uint32_t *)label_ptr[1] = (uint32_t)(s->code_ptr - label_ptr[1] - 4);
+    }
+
+    /* extended helper signature: ext_helper_ld_mmu(CPUState *env,
+       target_ulong addr, int mmu_idx, uintptr_t raddr) */
+#if TCG_TARGET_REG_BITS == 32
+    /* The last arg is the generated code address corresponding to qemu_ld IR */
+    tcg_out_pushi(s, (tcg_target_ulong)(raddr - 1));
+    stack_adjust = 4;
     tcg_out_pushi(s, mem_index);
+    stack_adjust += 4;
+    if (TARGET_LONG_BITS == 64) {
+        tcg_out_push(s, addrhi_reg);
+        stack_adjust += 4;
+    }
+    tcg_out_push(s, addrlo_reg);
+    stack_adjust += 4;
+    tcg_out_push(s, TCG_AREG0);
+    stack_adjust += 4;
+#else
+    /* The first argument is already loaded with addrlo.  */
+    tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[1],
+                 mem_index);
+    /* The last arg is the generated code address corresponding to qemu_ld IR */
+    tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2],
+                 (tcg_target_ulong)(raddr - 1));
+    /* XXX/FIXME: suboptimal */
+    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[3], TCG_REG_L2);
+    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[2], TCG_REG_L1);
+    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[1], TCG_REG_L0);
+    tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[0], TCG_AREG0);
+#endif
+
+    tcg_out_calli(s, (tcg_target_long)qemu_ld_helpers[s_bits]);
+
+#if TCG_TARGET_REG_BITS == 32
+    if (stack_adjust == (TCG_TARGET_REG_BITS / 8)) {
+        /* Pop and discard.  This is 2 bytes smaller than the add.  */
+        tcg_out_pop(s, TCG_REG_ECX);
+    } else if (stack_adjust != 0) {
+        tcg_out_addi(s, TCG_REG_CALL_STACK, stack_adjust);
+    }
+#endif
+
+    switch (opc) {
+    case 0 | 4:
+        tcg_out_ext8s(s, data_reg, TCG_REG_EAX, P_REXW);
+        break;
+    case 1 | 4:
+        tcg_out_ext16s(s, data_reg, TCG_REG_EAX, P_REXW);
+        break;
+    case 0:
+        tcg_out_ext8u(s, data_reg, TCG_REG_EAX);
+        break;
+    case 1:
+        tcg_out_ext16u(s, data_reg, TCG_REG_EAX);
+        break;
+    case 2:
+        tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
+        break;
+#if TCG_TARGET_REG_BITS == 64
+    case 2 | 4:
+        tcg_out_ext32s(s, data_reg, TCG_REG_EAX);
+        break;
+#endif
+    case 3:
+        if (TCG_TARGET_REG_BITS == 64) {
+            tcg_out_mov(s, TCG_TYPE_I64, data_reg, TCG_REG_RAX);
+        } else if (data_reg == TCG_REG_EDX) {
+            /* xchg %edx, %eax */
+            tcg_out_opc(s, OPC_XCHG_ax_r32 + TCG_REG_EDX, 0, 0, 0);
+            tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EAX);
+        } else {
+            tcg_out_mov(s, TCG_TYPE_I32, data_reg, TCG_REG_EAX);
+            tcg_out_mov(s, TCG_TYPE_I32, data_reg2, TCG_REG_EDX);
+        }
+        break;
+    default:
+        tcg_abort();
+    }
+
+    /* Jump back to the original code accessing a guest memory */
+    tcg_out_jmp(s, (tcg_target_long) raddr);
+}
+
+/*
+ * Generate code for the slow path for a store at the end of block
+ */
+static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *label)
+{
+    int s_bits;
+    int stack_adjust;
+    int opc = label->opc_ext & HL_OPC_MASK;
+    int mem_index = label->mem_index;
+    int data_reg = label->datalo_reg;
+#if TCG_TARGET_REG_BITS == 32
+    int data_reg2 = label->datahi_reg;
+    int addrlo_reg = label->addrlo_reg;
+    int addrhi_reg = label->addrhi_reg;
+#endif
+    uint8_t *raddr = label->raddr;
+    uint8_t **label_ptr = &label->label_ptr[0];
+
+    s_bits = opc & 3;
+
+    /* resolve label address */
+    *(uint32_t *)label_ptr[0] = (uint32_t)(s->code_ptr - label_ptr[0] - 4);
+    if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
+        *(uint32_t *)label_ptr[1] = (uint32_t)(s->code_ptr - label_ptr[1] - 4);
+    }
+
+    /* extended helper signature: ext_helper_st_mmu(CPUState *env,
+       target_ulong addr, uintxx_t val, int mmu_idx, uintptr_t raddr) */
+#if TCG_TARGET_REG_BITS == 32
+    /* The last arg is the generated code address corresponding to qemu_st IR */
+    tcg_out_pushi(s, (tcg_target_ulong)(raddr - 1));
     stack_adjust = 4;
+    tcg_out_pushi(s, mem_index);
+    stack_adjust += 4;
     if (opc == 3) {
         tcg_out_push(s, data_reg2);
         stack_adjust += 4;
@@ -1417,10 +1543,10 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
     tcg_out_push(s, data_reg);
     stack_adjust += 4;
     if (TARGET_LONG_BITS == 64) {
-        tcg_out_push(s, args[addrlo_idx + 1]);
+        tcg_out_push(s, addrhi_reg);
         stack_adjust += 4;
     }
-    tcg_out_push(s, args[addrlo_idx]);
+    tcg_out_push(s, addrlo_reg);
     stack_adjust += 4;
     tcg_out_push(s, TCG_AREG0);
     stack_adjust += 4;
@@ -1429,6 +1555,14 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
                 TCG_REG_L1, data_reg);
     tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_L2, mem_index);
     stack_adjust = 0;
+    /* The last arg is the generated code address corresponding to qemu_st IR */
+#if defined(_WIN64)
+    tcg_out_pushi(s, (tcg_target_ulong)(raddr - 1));
+    stack_adjust += 8;
+#else
+    tcg_out_movi(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[4],
+                 (tcg_target_ulong)(raddr - 1));
+#endif
     /* XXX/FIXME: suboptimal */
     tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[3], TCG_REG_L2);
     tcg_out_mov(s, TCG_TYPE_I64, tcg_target_call_iarg_regs[2], TCG_REG_L1);
@@ -1445,32 +1579,28 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args,
         tcg_out_addi(s, TCG_REG_CALL_STACK, stack_adjust);
     }

-    /* label2: */
-    *label_ptr[2] = s->code_ptr - label_ptr[2] - 1;
-#else
-    {
-        int32_t offset = GUEST_BASE;
-        int base = args[addrlo_idx];
+    /* Jump back to the original code accessing a guest memory */
+    tcg_out_jmp(s, (tcg_target_long) raddr);
+}

-        if (TCG_TARGET_REG_BITS == 64) {
-            /* ??? We assume all operations have left us with register
-               contents that are zero extended.  So far this appears to
-               be true.  If we want to enforce this, we can either do
-               an explicit zero-extension here, or (if GUEST_BASE == 0)
-               use the ADDR32 prefix.  For now, do nothing.  */
+/*
+ * Generate all of the slow paths of qemu_ld/st at the end of block
+ */
+void tcg_out_qemu_ldst_slow_path(TCGContext *s)
+{
+    int i;
+    TCGLabelQemuLdst *label;

-            if (offset != GUEST_BASE) {
-                tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_L0, GUEST_BASE);
-                tgen_arithr(s, ARITH_ADD + P_REXW, TCG_REG_L0, base);
-                base = TCG_REG_L0;
-                offset = 0;
+    for (i = 0; i < s->nb_qemu_ldst_labels; i++) {
+        label = (TCGLabelQemuLdst *)&s->qemu_ldst_labels[i];
+        if (IS_QEMU_LD_LABEL(label)) {
+            tcg_out_qemu_ld_slow_path(s, label);
+        } else {
+            tcg_out_qemu_st_slow_path(s, label);
             }
         }
-
-        tcg_out_qemu_st_direct(s, data_reg, data_reg2, base, offset, opc);
-    }
-#endif
 }
+#endif  /* CONFIG_SOFTMMU */

 static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
                               const TCGArg *args, const int *const_args)
diff --git a/tcg/tcg.c b/tcg/tcg.c
index c069e44..c96b7f1 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -304,6 +304,14 @@ void tcg_func_start(TCGContext *s)

     gen_opc_ptr = gen_opc_buf;
     gen_opparam_ptr = gen_opparam_buf;
+
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+    /* Initialize qemu_ld/st labels to assist code generation at the end of TB
+       for TLB miss cases at the end of TB */
+    s->qemu_ldst_labels = tcg_malloc(sizeof(TCGLabelQemuLdst) *
+                                     TCG_MAX_QEMU_LDST);
+    s->nb_qemu_ldst_labels = 0;
+#endif
 }

 static inline void tcg_temp_alloc(TCGContext *s, int n)
@@ -2163,6 +2171,11 @@ static inline int tcg_gen_code_common(TCGContext *s, uint8_t *gen_code_buf,
 #endif
     }
  the_end:
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+    /* Generate slow paths of qemu_ld/st IRs which call MMU helpers at
+       the end of block */
+    tcg_out_qemu_ldst_slow_path(s);
+#endif
     return -1;
 }

diff --git a/tcg/tcg.h b/tcg/tcg.h
index af7464a..b54884f 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -188,6 +188,29 @@ typedef tcg_target_ulong TCGArg;
    are aliases for target_ulong and host pointer sized values respectively.
  */

+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+/* Macros/structures for qemu_ld/st IR code optimization:
+   TCG_MAX_HELPER_LABELS is defined as same as OPC_BUF_SIZE in exec-all.h. */
+#define TCG_MAX_QEMU_LDST       640
+#define HL_LDST_SHIFT           4
+#define HL_LDST_MASK            (1 << HL_LDST_SHIFT)
+#define HL_ST_MASK              HL_LDST_MASK
+#define HL_OPC_MASK             (HL_LDST_MASK - 1)
+#define IS_QEMU_LD_LABEL(L)     (!((L)->opc_ext & HL_LDST_MASK))
+#define IS_QEMU_ST_LABEL(L)     ((L)->opc_ext & HL_LDST_MASK)
+
+typedef struct TCGLabelQemuLdst {
+    int opc_ext;            /* | 27bit(reserved) | 1bit(ld/st) | 4bit(opc) | */
+    int addrlo_reg;         /* reg index for low word of guest virtual addr */
+    int addrhi_reg;         /* reg index for high word of guest virtual addr */
+    int datalo_reg;         /* reg index for low word to be loaded or stored */
+    int datahi_reg;         /* reg index for high word to be loaded or stored */
+    int mem_index;          /* soft MMU memory index */
+    uint8_t *raddr;         /* gen code addr of the next IR of qemu_ld/st IR */
+    uint8_t *label_ptr[2];  /* label pointers to be updated */
+} TCGLabelQemuLdst;
+#endif
+
 #ifdef CONFIG_DEBUG_TCG
 #define DEBUG_TCGV 1
 #endif
@@ -392,6 +415,13 @@ struct TCGContext {
     int temps_in_use;
     int goto_tb_issue_mask;
 #endif
+
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+    /* labels info for qemu_ld/st IRs
+       The labels help to generate TLB miss case codes at the end of TB */
+    TCGLabelQemuLdst *qemu_ldst_labels;
+    int nb_qemu_ldst_labels;
+#endif
 };

 extern TCGContext tcg_ctx;
@@ -595,3 +625,8 @@ extern uint8_t code_gen_prologue[];
 #endif

 void tcg_register_jit(void *buf, size_t buf_size);
+
+#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
+/* Generate all of the slow paths of qemu_ld/st at the end of block */
+void tcg_out_qemu_ldst_slow_path(TCGContext *s);
+#endif
--
1.7.5.4


* Re: [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
  2012-10-09 12:37 [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs Yeongkyoon Lee
                   ` (2 preceding siblings ...)
  2012-10-09 12:37 ` [Qemu-devel] [PATCH v5 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block Yeongkyoon Lee
@ 2012-10-09 14:26 ` Aurelien Jarno
  2012-10-09 16:19   ` Aurelien Jarno
  3 siblings, 1 reply; 16+ messages in thread
From: Aurelien Jarno @ 2012-10-09 14:26 UTC (permalink / raw)
  To: Yeongkyoon Lee; +Cc: qemu-devel

On Tue, Oct 09, 2012 at 09:37:29PM +0900, Yeongkyoon Lee wrote:
> Hi, all.
> 
> Here is the 5th version of the series optimizing TCG qemu_ld/st code generation.
> 
> v5:
>   - Remove RFC tag
> 
> v4:
>   - Remove CONFIG_SOFTMMU pre-condition from configure
>   - Instead, add some CONFIG_SOFTMMU condition to TCG sources
>   - Remove some unnecessary comments
> 
> v3:
>   - Support CONFIG_TCG_PASS_AREG0
>     (expected to get more performance enhancement than others)
>   - Remove the configure option "--enable-ldst-optimization""
>   - Make the optimization as default on i386 and x86_64 hosts
>   - Fix some mistyping and apply checkpatch.pl before committing
>   - Test i386, arm and sparc softmmu targets on i386 and x86_64 hosts
>   - Test linux-user-test-0.3
> 
> v2:
>   - Follow the submit rule of qemu
> 
> v1:
>   - Initial commit request
> 
> I think the generated codes from qemu_ld/st IRs are relatively heavy, which are
> up to 12 instructions for TLB hit case on i386 host.
> This patch series enhance the code quality of TCG qemu_ld/st IRs by reducing
> jump and enhancing locality.
> Main idea is simple and has been already described in the comments in
> tcg-target.c, which separates slow path (TLB miss case), and generates it at the
> end of TB.
> 
> For example, the generated code from qemu_ld changes as follow.
> Before:
> (1) TLB check
> (2) If hit fall through, else jump to TLB miss case (5)
> (3) TLB hit case: Load value from host memory
> (4) Jump to next code (6)
> (5) TLB miss case: call MMU helper
> (6) ... (next code)
> 
> After:
> (1) TLB check
> (2) If hit fall through, else jump to TLB miss case (7)
> (3) TLB hit case: Load value from host memory
> (4) ... (next code)
> ...
> (7) TLB miss case: call MMU helper
> (8) Return to next code (4)
> 

Instead of calling the MMU helper with an additional argument (7) and then
jumping back (8) to the next code (4), what about pushing the address of the
next code (4) on the stack and using a jmp instead of the call? In that case
you don't need the extra argument to the helpers.

Otherwise I haven't looked at the code, but in principle it looks fine
to me.

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net


* Re: [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
  2012-10-09 14:26 ` [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs Aurelien Jarno
@ 2012-10-09 16:19   ` Aurelien Jarno
  2012-10-09 16:55     ` Paolo Bonzini
  0 siblings, 1 reply; 16+ messages in thread
From: Aurelien Jarno @ 2012-10-09 16:19 UTC (permalink / raw)
  To: Yeongkyoon Lee; +Cc: qemu-devel

On Tue, Oct 09, 2012 at 04:26:10PM +0200, Aurelien Jarno wrote:
> On Tue, Oct 09, 2012 at 09:37:29PM +0900, Yeongkyoon Lee wrote:
> > Hi, all.
> > 
> > Here is the 5th version of the series optimizing TCG qemu_ld/st code generation.
> > 
> > v5:
> >   - Remove RFC tag
> > 
> > v4:
> >   - Remove CONFIG_SOFTMMU pre-condition from configure
> >   - Instead, add some CONFIG_SOFTMMU condition to TCG sources
> >   - Remove some unnecessary comments
> > 
> > v3:
> >   - Support CONFIG_TCG_PASS_AREG0
> >     (expected to get more performance enhancement than others)
> >   - Remove the configure option "--enable-ldst-optimization""
> >   - Make the optimization as default on i386 and x86_64 hosts
> >   - Fix some mistyping and apply checkpatch.pl before committing
> >   - Test i386, arm and sparc softmmu targets on i386 and x86_64 hosts
> >   - Test linux-user-test-0.3
> > 
> > v2:
> >   - Follow the submit rule of qemu
> > 
> > v1:
> >   - Initial commit request
> > 
> > I think the generated codes from qemu_ld/st IRs are relatively heavy, which are
> > up to 12 instructions for TLB hit case on i386 host.
> > This patch series enhance the code quality of TCG qemu_ld/st IRs by reducing
> > jump and enhancing locality.
> > Main idea is simple and has been already described in the comments in
> > tcg-target.c, which separates slow path (TLB miss case), and generates it at the
> > end of TB.
> > 
> > For example, the generated code from qemu_ld changes as follow.
> > Before:
> > (1) TLB check
> > (2) If hit fall through, else jump to TLB miss case (5)
> > (3) TLB hit case: Load value from host memory
> > (4) Jump to next code (6)
> > (5) TLB miss case: call MMU helper
> > (6) ... (next code)
> > 
> > After:
> > (1) TLB check
> > (2) If hit fall through, else jump to TLB miss case (7)
> > (3) TLB hit case: Load value from host memory
> > (4) ... (next code)
> > ...
> > (7) TLB miss case: call MMU helper
> > (8) Return to next code (4)
> > 
> 
> Instead of calling the MMU helper with an additional argument (7), and
> then jump back (8) to the next code (4), what about pushing the address
> of the next code (4) on the stack and use a jmp instead of the call. In
> that case you don't need the extra argument to the helpers.
> 

Maybe it wasn't very clear. This is based on the fact that a call is
basically push %rip + jmp. Therefore we can fake the return address by
pushing the value we want, here the address of the next code. This means
that we don't need to pass the extra argument to the helper for the
return address, as GET_PC() would work correctly (it basically reads the
return address on the stack).

For other architectures, it might not be a push, but rather a move to
the link register; basically, put the return address where the calling
convention asks for it.

OTOH I just realized it only works if the end of the slow path (moving
the value from the return value to the correct register) is taken care
of as well. It might be something doable.
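
To make the GET_PC() point concrete: GETPC() in QEMU boils down to
__builtin_return_address(0), so a helper can only ever recover the address of
the call site it was actually entered from. A tiny standalone illustration
(nothing here is TCG code):

#include <stdio.h>

/* GETPC()-style lookup: the callee recovers the address it was called from */
__attribute__((noinline)) static void helper_probe(void)
{
    printf("call site seen by the helper: %p\n", __builtin_return_address(0));
}

int main(void)
{
    /* With a normal call the helper sees the instruction after the call.
       With the push-then-jmp trick it would instead see whatever address
       was pushed, i.e. the address we want cpu_restore_state() to search. */
    helper_probe();
    helper_probe();   /* a second call site: a different address is printed */
    return 0;
}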

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net


* Re: [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
  2012-10-09 16:19   ` Aurelien Jarno
@ 2012-10-09 16:55     ` Paolo Bonzini
  2012-10-09 17:09       ` Aurelien Jarno
  0 siblings, 1 reply; 16+ messages in thread
From: Paolo Bonzini @ 2012-10-09 16:55 UTC (permalink / raw)
  To: Aurelien Jarno; +Cc: qemu-devel, Yeongkyoon Lee

On 09/10/2012 18:19, Aurelien Jarno wrote:
>>> > > 
>> > 
>> > Instead of calling the MMU helper with an additional argument (7), and
>> > then jump back (8) to the next code (4), what about pushing the address
>> > of the next code (4) on the stack and use a jmp instead of the call. In
>> > that case you don't need the extra argument to the helpers.
>> > 
> Maybe it wasn't very clear. This is based on the fact that call is
> basically push %rip + jmp. Therefore we can fake the return address by
> putting the value we want, here the address of the next code. This mean
> that we don't need to pass the extra argument to the helper for the 
> return address, as GET_PC() would work correctly (it basically reads the
> return address on the stack).
> 
> For other architectures, it might not be a push, but rather a move to
> link register, basically put the return address where the calling
> convention asks for.
> 
> OTOH I just realized it only works if the end of the slow path (moving
> the value from the return address to the correct register). It might be
> something doable.

Branch predictors will not like old-school tricks like this one. :)

Paolo


* Re: [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
  2012-10-09 16:55     ` Paolo Bonzini
@ 2012-10-09 17:09       ` Aurelien Jarno
  2012-10-10  4:17         ` Yeongkyoon Lee
  0 siblings, 1 reply; 16+ messages in thread
From: Aurelien Jarno @ 2012-10-09 17:09 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel, Yeongkyoon Lee

On Tue, Oct 09, 2012 at 06:55:58PM +0200, Paolo Bonzini wrote:
> On 09/10/2012 18:19, Aurelien Jarno wrote:
> >>> > > 
> >> > 
> >> > Instead of calling the MMU helper with an additional argument (7), and
> >> > then jump back (8) to the next code (4), what about pushing the address
> >> > of the next code (4) on the stack and use a jmp instead of the call. In
> >> > that case you don't need the extra argument to the helpers.
> >> > 
> > Maybe it wasn't very clear. This is based on the fact that call is
> > basically push %rip + jmp. Therefore we can fake the return address by
> > putting the value we want, here the address of the next code. This mean
> > that we don't need to pass the extra argument to the helper for the 
> > return address, as GET_PC() would work correctly (it basically reads the
> > return address on the stack).
> > 
> > For other architectures, it might not be a push, but rather a move to
> > link register, basically put the return address where the calling
> > convention asks for.
> > 
> > OTOH I just realized it only works if the end of the slow path (moving
> > the value from the return address to the correct register). It might be
> > something doable.
> 
> Branch predictors will not oldschool tricks like this one. :)
> 

Given it is only used in the slow path (i.e. the exception rather than the
rule), branch prediction isn't that important there.

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net


* Re: [Qemu-devel] [PATCH v5 2/3] tcg: Add declarations and templates of extended MMU helpers
  2012-10-09 12:37 ` [Qemu-devel] [PATCH v5 2/3] tcg: Add declarations and templates of extended MMU helpers Yeongkyoon Lee
@ 2012-10-09 18:36   ` Richard Henderson
  2012-10-10 11:04     ` Yeongkyoon Lee
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Henderson @ 2012-10-09 18:36 UTC (permalink / raw)
  To: Yeongkyoon Lee; +Cc: qemu-devel

On 10/09/2012 05:37 AM, Yeongkyoon Lee wrote:
> Add declarations and templates of extended MMU helpers.
> An extended helper takes an additional argument of the host address accessing
> a guest memory which differs from the address of the call site to the helper
> because helper call sites locate at the end of a generated code block.
...
> +#ifndef CONFIG_QEMU_LDST_OPTIMIZATION


My feedback from the last round of review is that a version of the
helper functions that takes the return address should *always* be available.

There are existing issues in the target-*/foo_helper.c files where,
if a helper touches memory, we do not necessarily handle any
fault properly.  This is less true of system mode than user mode,
but it's still a problem.

The helper.c files ought to be changed to use these new "ra-enabled"
routines and pass GETPC().  That way a fault from a helper gets
treated *exactly* like it would if it were called from TCG generated code.

Thus, all this conditionalization should vanish.
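
For instance, a target helper could look like the hypothetical sketch below
(helper_example_ldl is a made-up name; the stub typedefs only make the snippet
self-contained, and in-tree code would instead include the usual headers plus
the softmmu_defs.h declarations from patch 2/3):

#include <stdint.h>

typedef struct CPUArchState CPUArchState;   /* stand-in for the real type */
typedef uint32_t target_ulong;              /* assume a 32-bit guest */
#define GETPC() ((uintptr_t)__builtin_return_address(0))

/* ra-enabled MMU helper, as declared in patch 2/3 */
uint32_t ext_helper_ldl_mmu(CPUArchState *env, target_ulong addr, int mmu_idx,
                            uintptr_t ra);

/* A target helper that touches guest memory simply forwards GETPC(), so a
   fault raised by the load is restored exactly as if the access had come
   from TCG-generated code. */
uint32_t helper_example_ldl(CPUArchState *env, target_ulong addr, int mmu_idx)
{
    return ext_helper_ldl_mmu(env, addr, mmu_idx, GETPC());
}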


r~


* Re: [Qemu-devel] [PATCH v5 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block
  2012-10-09 12:37 ` [Qemu-devel] [PATCH v5 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block Yeongkyoon Lee
@ 2012-10-09 18:49   ` Richard Henderson
  2012-10-10  4:41     ` Yeongkyoon Lee
  0 siblings, 1 reply; 16+ messages in thread
From: Richard Henderson @ 2012-10-09 18:49 UTC (permalink / raw)
  To: Yeongkyoon Lee; +Cc: qemu-devel

On 10/09/2012 05:37 AM, Yeongkyoon Lee wrote:
> +#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
> +    /* Initialize qemu_ld/st labels to assist code generation at the end of TB
> +       for TLB miss cases at the end of TB */
> +    s->qemu_ldst_labels = tcg_malloc(sizeof(TCGLabelQemuLdst) *
> +                                     TCG_MAX_QEMU_LDST);
> +    s->nb_qemu_ldst_labels = 0;
> +#endif

I said before that I wasn't fond of this sort of "constant" dynamic allocation,
regardless of what the surrounding code does.  You could clean those up too,
as a separate patch...

> +#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
> +    /* Generate slow paths of qemu_ld/st IRs which call MMU helpers at
> +       the end of block */
> +    tcg_out_qemu_ldst_slow_path(s);
> +#endif

This interface is so close to "tcg_out_ldst_and_constant_pools(s)" that
I don't think the function should be specific to ldst.  Just call it
tcg_out_tb_finalize or something.

> +/* Macros/structures for qemu_ld/st IR code optimization:
> +   TCG_MAX_HELPER_LABELS is defined as same as OPC_BUF_SIZE in exec-all.h. */
> +#define TCG_MAX_QEMU_LDST       640
> +#define HL_LDST_SHIFT           4
> +#define HL_LDST_MASK            (1 << HL_LDST_SHIFT)
> +#define HL_ST_MASK              HL_LDST_MASK
> +#define HL_OPC_MASK             (HL_LDST_MASK - 1)
> +#define IS_QEMU_LD_LABEL(L)     (!((L)->opc_ext & HL_LDST_MASK))
> +#define IS_QEMU_ST_LABEL(L)     ((L)->opc_ext & HL_LDST_MASK)
> +
> +typedef struct TCGLabelQemuLdst {
> +    int opc_ext;            /* | 27bit(reserved) | 1bit(ld/st) | 4bit(opc) | */

Any good reason to use all these masks when the compiler can do it
for you with bitfields?


r~


* Re: [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
  2012-10-09 17:09       ` Aurelien Jarno
@ 2012-10-10  4:17         ` Yeongkyoon Lee
  2012-10-10  6:45           ` Aurelien Jarno
  0 siblings, 1 reply; 16+ messages in thread
From: Yeongkyoon Lee @ 2012-10-10  4:17 UTC (permalink / raw)
  To: Aurelien Jarno; +Cc: Paolo Bonzini, qemu-devel

On 2012-10-10 02:09, Aurelien Jarno wrote:
> On Tue, Oct 09, 2012 at 06:55:58PM +0200, Paolo Bonzini wrote:
>> On 09/10/2012 18:19, Aurelien Jarno wrote:
>>>>> Instead of calling the MMU helper with an additional argument (7), and
>>>>> then jump back (8) to the next code (4), what about pushing the address
>>>>> of the next code (4) on the stack and use a jmp instead of the call. In
>>>>> that case you don't need the extra argument to the helpers.
>>>>>
>>> Maybe it wasn't very clear. This is based on the fact that call is
>>> basically push %rip + jmp. Therefore we can fake the return address by
>>> putting the value we want, here the address of the next code. This mean
>>> that we don't need to pass the extra argument to the helper for the
>>> return address, as GET_PC() would work correctly (it basically reads the
>>> return address on the stack).
>>>
>>> For other architectures, it might not be a push, but rather a move to
>>> link register, basically put the return address where the calling
>>> convention asks for.
>>>
>>> OTOH I just realized it only works if the end of the slow path (moving
>>> the value from the return address to the correct register). It might be
>>> something doable.
>> Branch predictors will not oldschool tricks like this one. :)
>>
> Given it is only used in the slow path (ie the exception more than the
> rule), branch prediction isn't that important there.
>

I had already considered the approach of using a jmp and removing the extra
argument from the helper call.
However, the problem is that the helper needs the generated code address used
by tb_find_pc() and cpu_restore_state().
That means the code address needed in the helper is actually the address
corresponding to the qemu_ld/st IR rather than the return address.
In my LDST optimization, the helper call site is not in the IR's code but at
the end of the TB.
So, if the code address is not explicitly given as a helper argument, it is
too difficult to compute it in a GETPC()-like manner.


* Re: [Qemu-devel] [PATCH v5 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block
  2012-10-09 18:49   ` Richard Henderson
@ 2012-10-10  4:41     ` Yeongkyoon Lee
  0 siblings, 0 replies; 16+ messages in thread
From: Yeongkyoon Lee @ 2012-10-10  4:41 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On 2012-10-10 03:49, Richard Henderson wrote:
> On 10/09/2012 05:37 AM, Yeongkyoon Lee wrote:
>> +#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
>> +    /* Initialize qemu_ld/st labels to assist code generation at the end of TB
>> +       for TLB miss cases at the end of TB */
>> +    s->qemu_ldst_labels = tcg_malloc(sizeof(TCGLabelQemuLdst) *
>> +                                     TCG_MAX_QEMU_LDST);
>> +    s->nb_qemu_ldst_labels = 0;
>> +#endif
> I said before that I wasn't fond of this sort of "constant" dynamic allocation.
> Regardless of what surrounding code does.  You could clean those up too,
> as a separate patch...

I can change the dynamic allocation to a static one as you said; 
however, one concern is that we would waste memory in non-TCG 
environments such as KVM mode.
What's your opinion about this?
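
For reference, the static alternative would look roughly like the sketch 
below; the field names follow the posted patch, but the fixed-size array 
inside TCGContext is the hypothetical change:

/* Sketch only: keep the label records inside TCGContext instead of
 * tcg_malloc()ing them per TB.  TCGLabelQemuLdst and TCG_MAX_QEMU_LDST
 * are as defined in the posted patch.  The array then costs
 * sizeof(TCGLabelQemuLdst) * TCG_MAX_QEMU_LDST bytes even when TCG is
 * not used (e.g. KVM), which is the concern above. */
struct TCGContext {
    /* ... existing fields ... */
#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
    TCGLabelQemuLdst qemu_ldst_labels[TCG_MAX_QEMU_LDST];
    int nb_qemu_ldst_labels;
#endif
};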

>
>> +#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
>> +    /* Generate slow paths of qemu_ld/st IRs which call MMU helpers at
>> +       the end of block */
>> +    tcg_out_qemu_ldst_slow_path(s);
>> +#endif
> This interface is so close to "tcg_out_ldst_and_constant_pools(s)" that
> I don't think the function should be specific to ldst.  Just call it
> tcg_out_tb_finalize or something.

That looks good.
I'll refactor the function names accordingly later.

>
>> +/* Macros/structures for qemu_ld/st IR code optimization:
>> +   TCG_MAX_HELPER_LABELS is defined as same as OPC_BUF_SIZE in exec-all.h. */
>> +#define TCG_MAX_QEMU_LDST       640
>> +#define HL_LDST_SHIFT           4
>> +#define HL_LDST_MASK            (1 << HL_LDST_SHIFT)
>> +#define HL_ST_MASK              HL_LDST_MASK
>> +#define HL_OPC_MASK             (HL_LDST_MASK - 1)
>> +#define IS_QEMU_LD_LABEL(L)     (!((L)->opc_ext & HL_LDST_MASK))
>> +#define IS_QEMU_ST_LABEL(L)     ((L)->opc_ext & HL_LDST_MASK)
>> +
>> +typedef struct TCGLabelQemuLdst {
>> +    int opc_ext;            /* | 27bit(reserved) | 1bit(ld/st) | 4bit(opc) | */
> Any good reason to use all these masks when the compiler can do it
> for you with bitfields?

No, it is just my coding style.
There should be no compiler problems, and bitfields would look cleaner, 
so I'll switch to bitfields later.
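
As a rough illustration, the opc_ext field and its HL_* masks could 
collapse into bitfields along these lines (a sketch of the layout only, 
not the final code):

/* Sketch: encode what opc_ext and the HL_* masks encode today as bitfields,
 * so IS_QEMU_LD_LABEL()/IS_QEMU_ST_LABEL() become plain member tests. */
typedef struct TCGLabelQemuLdst {
    unsigned opc : 4;       /* qemu_ld/st opcode (was the HL_OPC_MASK bits) */
    unsigned is_store : 1;  /* 1 = qemu_st, 0 = qemu_ld (was HL_LDST_MASK) */
    /* ... address/data registers, label pointers, etc. unchanged ... */
} TCGLabelQemuLdst;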

>
>
> r~
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
  2012-10-10  4:17         ` Yeongkyoon Lee
@ 2012-10-10  6:45           ` Aurelien Jarno
  2012-10-10 10:34             ` Yeongkyoon Lee
  0 siblings, 1 reply; 16+ messages in thread
From: Aurelien Jarno @ 2012-10-10  6:45 UTC (permalink / raw)
  To: Yeongkyoon Lee; +Cc: Paolo Bonzini, qemu-devel

On Wed, Oct 10, 2012 at 01:17:36PM +0900, Yeongkyoon Lee wrote:
> On 2012-10-10 02:09, Aurelien Jarno wrote:
> >On Tue, Oct 09, 2012 at 06:55:58PM +0200, Paolo Bonzini wrote:
> >>On 09/10/2012 18:19, Aurelien Jarno wrote:
> >>>>>Instead of calling the MMU helper with an additional argument (7), and
> >>>>>then jump back (8) to the next code (4), what about pushing the address
> >>>>>of the next code (4) on the stack and use a jmp instead of the call. In
> >>>>>that case you don't need the extra argument to the helpers.
> >>>>>
> >>>Maybe it wasn't very clear. This is based on the fact that call is
> >>>basically push %rip + jmp. Therefore we can fake the return address by
> >>>putting the value we want, here the address of the next code. This means
> >>>that we don't need to pass the extra argument to the helper for the
> >>>return address, as GET_PC() would work correctly (it basically reads the
> >>>return address on the stack).
> >>>
> >>>For other architectures, it might not be a push, but rather a move to
> >>>link register, basically put the return address where the calling
> >>>convention asks for.
> >>>
> >>>OTOH I just realized it only works if the end of the slow path (moving
> >>>the value from the return address to the correct register). It might be
> >>>something doable.
> >>Branch predictors will not like oldschool tricks like this one. :)
> >>
> >Given it is only used in the slow path (ie the exception more than the
> >rule), branch prediction isn't that important there.
> >
> 
> I had already considered the approach of using jmp and removing the
> extra argument from the helper call.
> However, the problem is that the helper needs the generated code
> address used by tb_find_pc() and cpu_restore_state().
> That means the code address the helper needs is really the one
> corresponding to the qemu_ld/st IR, not the return address.
> In my LDST optimization, the helper call site is not in the code
> emitted for the IR but at the end of the TB.

GETPC() uses the return address to determine the call site, and as long
as the code at the end of the TB sets a return address corresponding to
one of the fast path instructions, tb_find_pc() will be able to find
the correct instruction.

That implies that at least one instruction at the end of the generated
code is shared between the slow path and the fast path, but on the other
hand it avoids having two different kinds of MMU helpers.
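
For reference, on most hosts GETPC() is essentially the compiler builtin
below (a paraphrase of the generic definition; host-specific variants
differ), which is why a faked return address is enough for tb_find_pc():

/* Generic form of GETPC(): read the caller's return address. */
#define GETPC() ((uintptr_t)__builtin_return_address(0))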

-- 
Aurelien Jarno                          GPG: 1024D/F1BCDB73
aurelien@aurel32.net                 http://www.aurel32.net

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
  2012-10-10  6:45           ` Aurelien Jarno
@ 2012-10-10 10:34             ` Yeongkyoon Lee
  2012-10-10 14:09               ` Yeongkyoon Lee
  0 siblings, 1 reply; 16+ messages in thread
From: Yeongkyoon Lee @ 2012-10-10 10:34 UTC (permalink / raw)
  To: Aurelien Jarno; +Cc: Paolo Bonzini, qemu-devel

On 2012-10-10 15:45, Aurelien Jarno wrote:
> On Wed, Oct 10, 2012 at 01:17:36PM +0900, Yeongkyoon Lee wrote:
>> On 2012-10-10 02:09, Aurelien Jarno wrote:
>>> On Tue, Oct 09, 2012 at 06:55:58PM +0200, Paolo Bonzini wrote:
>>>> On 09/10/2012 18:19, Aurelien Jarno wrote:
>>>>>>> Instead of calling the MMU helper with an additional argument (7), and
>>>>>>> then jump back (8) to the next code (4), what about pushing the address
>>>>>>> of the next code (4) on the stack and use a jmp instead of the call. In
>>>>>>> that case you don't need the extra argument to the helpers.
>>>>>>>
>>>>> Maybe it wasn't very clear. This is based on the fact that call is
>>>>> basically push %rip + jmp. Therefore we can fake the return address by
>>>>> putting the value we want, here the address of the next code. This means
>>>>> that we don't need to pass the extra argument to the helper for the
>>>>> return address, as GET_PC() would work correctly (it basically reads the
>>>>> return address on the stack).
>>>>>
>>>>> For other architectures, it might not be a push, but rather a move to
>>>>> link register, basically put the return address where the calling
>>>>> convention asks for.
>>>>>
>>>>> OTOH I just realized it only works if the end of the slow path (moving
>>>>> the value from the return address to the correct register). It might be
>>>>> something doable.
>>>> Branch predictors will not like oldschool tricks like this one. :)
>>>>
>>> Given it is only used in the slow path (ie the exception more than the
>>> rule), branch prediction isn't that important there.
>>>
>> I had already considered the approach of using jmp and removing the
>> extra argument from the helper call.
>> However, the problem is that the helper needs the generated code
>> address used by tb_find_pc() and cpu_restore_state().
>> That means the code address the helper needs is really the one
>> corresponding to the qemu_ld/st IR, not the return address.
>> In my LDST optimization, the helper call site is not in the code
>> emitted for the IR but at the end of the TB.
> GETPC() uses the return address to determine the call site, and as long
> as the code at the end of the TB sets a return address corresponding to
> one of the fast path instructions, tb_find_pc() will be able to find
> the correct instruction.
>
> That implies that at least one instruction at the end of the generated
> code is shared between the slow path and the fast path, but on the other
> hand it avoids having two different kinds of MMU helpers.
>

How about using a nop instruction at the end of the fast path as the 
return address of the helper?
That means changing "call helper" to "push addr of nop" followed by 
"jmp helper".
Although I still need to check the feasibility, it is expected to avoid 
helper fragmentation (two variants of each MMU helper) and to keep the 
performance degradation minimal.
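
As a very rough sketch of that idea (the emitter names below are made 
up for illustration, not existing TCG backend functions), the slow path 
at the end of the TB could be emitted as:

#include <stdint.h>

/* Hypothetical emitter names, declared only to make the sketch
 * self-contained. */
typedef struct TCGContext TCGContext;
void tcg_out_push_imm(TCGContext *s, uintptr_t imm);
void tcg_out_jmp_addr(TCGContext *s, uintptr_t addr);

/* Instead of "call helper", push the address of a nop placed at the end
 * of the fast path and jump to the helper, so the helper's GETPC()
 * resolves back into the fast path. */
static void gen_slow_path_call(TCGContext *s, uint8_t *fast_path_nop,
                               uintptr_t helper_addr)
{
    tcg_out_push_imm(s, (uintptr_t)fast_path_nop);  /* fake return address */
    tcg_out_jmp_addr(s, helper_addr);               /* jmp, not call */
}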

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/3] tcg: Add declarations and templates of extended MMU helpers
  2012-10-09 18:36   ` Richard Henderson
@ 2012-10-10 11:04     ` Yeongkyoon Lee
  0 siblings, 0 replies; 16+ messages in thread
From: Yeongkyoon Lee @ 2012-10-10 11:04 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel

On 2012-10-10 03:36, Richard Henderson wrote:
> On 10/09/2012 05:37 AM, Yeongkyoon Lee wrote:
>> Add declarations and templates of extended MMU helpers.
>> An extended helper takes an additional argument of the host address accessing
>> a guest memory which differs from the address of the call site to the helper
>> because helper call sites locate at the end of a generated code block.
> ...
>> +#ifndef CONFIG_QEMU_LDST_OPTIMIZATION
>
> My feedback from the last round of review is that a version of the
> helper functions that take the return address should *always* be available.
>
> There are existing issues in the target-*/foo_helper.c files where
> if a helper touches memory that we do no necessarily handle any
> fault properly.  This is less true of system mode than user mode,
> but it's still a problem.
>
> The helper.c files ought to be changed to use these new "ra-enabled"
> routines and pass GETPC().  That way a fault from a helper gets
> treated *exactly* like it would if it were called from TCG generated code.
>
> Thus, all this conditionalization should vanish.

Do you mean that there are call sites in target-*/foo_helper.c which 
call the helpers of softmmu_defs.h?
As far as I know, those helpers are only reached through the functions 
in softmmu_header.h, where the extra argument is handled.

Anyway, I'll try an approach that avoids helper fragmentation, at the 
cost of a slight performance degradation of just one extra instruction 
in each fast path.

>
>
> r~
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs
  2012-10-10 10:34             ` Yeongkyoon Lee
@ 2012-10-10 14:09               ` Yeongkyoon Lee
  0 siblings, 0 replies; 16+ messages in thread
From: Yeongkyoon Lee @ 2012-10-10 14:09 UTC (permalink / raw)
  To: Aurelien Jarno; +Cc: Paolo Bonzini, qemu-devel

On 2012-10-10 19:34, Yeongkyoon Lee wrote:
> On 2012-10-10 15:45, Aurelien Jarno wrote:
>> On Wed, Oct 10, 2012 at 01:17:36PM +0900, Yeongkyoon Lee wrote:
>>> On 2012-10-10 02:09, Aurelien Jarno wrote:
>>>> On Tue, Oct 09, 2012 at 06:55:58PM +0200, Paolo Bonzini wrote:
>>>>> On 09/10/2012 18:19, Aurelien Jarno wrote:
>>>>>>>> Instead of calling the MMU helper with an additional argument 
>>>>>>>> (7), and
>>>>>>>> then jump back (8) to the next code (4), what about pushing the 
>>>>>>>> address
>>>>>>>> of the next code (4) on the stack and use a jmp instead of the 
>>>>>>>> call. In
>>>>>>>> that case you don't need the extra argument to the helpers.
>>>>>>>>
>>>>>> Maybe it wasn't very clear. This is based on the fact that call is
>>>>>> basically push %rip + jmp. Therefore we can fake the return 
>>>>>> address by
>>>>>> putting the value we want, here the address of the next code. 
>>>>>> This means
>>>>>> that we don't need to pass the extra argument to the helper for the
>>>>>> return address, as GET_PC() would work correctly (it basically 
>>>>>> reads the
>>>>>> return address on the stack).
>>>>>>
>>>>>> For other architectures, it might not be a push, but rather a 
>>>>>> move to
>>>>>> link register, basically put the return address where the calling
>>>>>> convention asks for.
>>>>>>
>>>>>> OTOH I just realized it only works if the end of the slow path 
>>>>>> (moving
>>>>>> the value from the return address to the correct register). It 
>>>>>> might be
>>>>>> something doable.
>>>>> Branch predictors will not like oldschool tricks like this one. :)
>>>>>
>>>> Given it is only used in the slow path (ie the exception more than the
>>>> rule), branch prediction isn't that important there.
>>>>
>>> I had already considered the approach of using jmp and removing the
>>> extra argument from the helper call.
>>> However, the problem is that the helper needs the generated code
>>> address used by tb_find_pc() and cpu_restore_state().
>>> That means the code address the helper needs is really the one
>>> corresponding to the qemu_ld/st IR, not the return address.
>>> In my LDST optimization, the helper call site is not in the code
>>> emitted for the IR but at the end of the TB.
>> GETPC() uses the return address to determine the call site, and as long
>> as the code at the end of the TB sets a return address corresponding to
>> one of the fast path instructions, tb_find_pc() will be able to find
>> the correct instruction.
>>
>> That implies that at least one instruction at the end of the generated
>> code is shared between the slow path and the fast path, but on the other
>> hand it avoids having two different kinds of MMU helpers.
>>
>
> How about using a nop instruction at the end of the fast path as the 
> return address of the helper?
> That means changing "call helper" to "push addr of nop" followed by 
> "jmp helper".
> Although I still need to check the feasibility, it is expected to avoid 
> helper fragmentation (two variants of each MMU helper) and to keep the 
> performance degradation minimal.
>
>

I've done some tests on the performance degradation when a nop 
instruction is inserted into the qemu_ld/st fast path.
The result is fine: I did not find any notable performance degradation.
I'll post a new version of the patches that leaves the MMU helpers' 
declarations unchanged soon.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2012-10-10 14:09 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-09 12:37 [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs Yeongkyoon Lee
2012-10-09 12:37 ` [Qemu-devel] [PATCH v5 1/3] configure: Add CONFIG_QEMU_LDST_OPTIMIZATION for TCG qemu_ld/st optimization Yeongkyoon Lee
2012-10-09 12:37 ` [Qemu-devel] [PATCH v5 2/3] tcg: Add declarations and templates of extended MMU helpers Yeongkyoon Lee
2012-10-09 18:36   ` Richard Henderson
2012-10-10 11:04     ` Yeongkyoon Lee
2012-10-09 12:37 ` [Qemu-devel] [PATCH v5 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block Yeongkyoon Lee
2012-10-09 18:49   ` Richard Henderson
2012-10-10  4:41     ` Yeongkyoon Lee
2012-10-09 14:26 ` [Qemu-devel] [PATCH v5 0/3] tcg: enhance code generation quality for qemu_ld/st IRs Aurelien Jarno
2012-10-09 16:19   ` Aurelien Jarno
2012-10-09 16:55     ` Paolo Bonzini
2012-10-09 17:09       ` Aurelien Jarno
2012-10-10  4:17         ` Yeongkyoon Lee
2012-10-10  6:45           ` Aurelien Jarno
2012-10-10 10:34             ` Yeongkyoon Lee
2012-10-10 14:09               ` Yeongkyoon Lee
