All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH  0/5] Experimenting with tb-lookup tweaks
@ 2021-02-24 16:58 Alex Bennée
  2021-02-24 16:58 ` [RFC PATCH 1/5] accel/tcg: rename tb_lookup__cpu_state and hoist state extraction Alex Bennée
                   ` (6 more replies)
  0 siblings, 7 replies; 9+ messages in thread
From: Alex Bennée @ 2021-02-24 16:58 UTC (permalink / raw)
  To: richard.henderson; +Cc: cota, Alex Bennée, qemu-devel

Hi Richard,

Well I spun up some of the ideas we talked about to see if there was
anything to be squeezed out of the function. In the end the results
seem to be a washout with my pigz benchmark:

 qemu-system-aarch64 -cpu cortex-a57 \
   -machine type=virt,virtualization=on,gic-version=3 \
   -serial mon:stdio \
   -netdev user,id=unet,hostfwd=tcp::2222-:22 \
   -device virtio-net-pci,netdev=unet,id=virt-net,disable-legacy=on \
   -device virtio-scsi-pci,id=virt-scsi,disable-legacy=on \
   -blockdev driver=raw,node-name=hd,discard=unmap,file.driver=host_device,file.filename=/dev/zen-disk/debian-buster-arm64 \
   -device scsi-hd,drive=hd,id=virt-scsi-hd \
   -smp 4 -m 4096 \
   -kernel ~/lsrc/linux.git/builds/arm64/arch/arm64/boot/Image \
   -append "root=/dev/sda2 systemd.unit=benchmark-pigz.service" \
   -display none -snapshot

| Command | Mean [s]       | Min [s] | Max [s] | Relative |
|---------+----------------+---------+---------+----------|
| Before  | 46.597 ± 2.482 |  45.208 |  53.618 |     1.00 |
| After   | 46.867 ± 2.242 |  45.871 |  53.180 |     1.00 |

Maybe the code cleanup itself makes it worthwhile. WDYT?

Alex Bennée (5):
  accel/tcg: rename tb_lookup__cpu_state and hoist state extraction
  accel/tcg: move CF_CLUSTER calculation to curr_cflags
  accel/tcg: drop the use of CF_HASH_MASK and rename params
  include/exec: lightly re-arrange TranslationBlock
  include/exec/tb-lookup: try and reduce branch prediction issues

 include/exec/exec-all.h   | 20 +++++++++++---------
 include/exec/tb-lookup.h  | 34 +++++++++++++++++-----------------
 accel/tcg/cpu-exec.c      | 31 ++++++++++++++++++-------------
 accel/tcg/tcg-runtime.c   |  6 ++++--
 accel/tcg/translate-all.c | 14 ++++++++------
 softmmu/physmem.c         |  2 +-
 6 files changed, 59 insertions(+), 48 deletions(-)

-- 
2.20.1



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH 1/5] accel/tcg: rename tb_lookup__cpu_state and hoist state extraction
  2021-02-24 16:58 [RFC PATCH 0/5] Experimenting with tb-lookup tweaks Alex Bennée
@ 2021-02-24 16:58 ` Alex Bennée
  2021-02-24 16:58 ` [RFC PATCH 2/5] accel/tcg: move CF_CLUSTER calculation to curr_cflags Alex Bennée
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Alex Bennée @ 2021-02-24 16:58 UTC (permalink / raw)
  To: richard.henderson; +Cc: Paolo Bonzini, cota, Alex Bennée, qemu-devel

Having a function return either and valid TB and some system state
seems excessive. It will make the subsequent re-factoring easier if we
lookup the current state where we are.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
 include/exec/tb-lookup.h | 18 ++++++++----------
 accel/tcg/cpu-exec.c     | 10 ++++++++--
 accel/tcg/tcg-runtime.c  |  4 +++-
 3 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/include/exec/tb-lookup.h b/include/exec/tb-lookup.h
index 9cf475bb03..62a509535d 100644
--- a/include/exec/tb-lookup.h
+++ b/include/exec/tb-lookup.h
@@ -17,30 +17,28 @@
 #include "exec/tb-hash.h"
 
 /* Might cause an exception, so have a longjmp destination ready */
-static inline TranslationBlock *
-tb_lookup__cpu_state(CPUState *cpu, target_ulong *pc, target_ulong *cs_base,
-                     uint32_t *flags, uint32_t cf_mask)
+static inline TranslationBlock * tb_lookup(CPUState *cpu,
+                                           target_ulong pc, target_ulong cs_base,
+                                           uint32_t flags, uint32_t cf_mask)
 {
-    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
     TranslationBlock *tb;
     uint32_t hash;
 
-    cpu_get_tb_cpu_state(env, pc, cs_base, flags);
-    hash = tb_jmp_cache_hash_func(*pc);
+    hash = tb_jmp_cache_hash_func(pc);
     tb = qatomic_rcu_read(&cpu->tb_jmp_cache[hash]);
 
     cf_mask &= ~CF_CLUSTER_MASK;
     cf_mask |= cpu->cluster_index << CF_CLUSTER_SHIFT;
 
     if (likely(tb &&
-               tb->pc == *pc &&
-               tb->cs_base == *cs_base &&
-               tb->flags == *flags &&
+               tb->pc == pc &&
+               tb->cs_base == cs_base &&
+               tb->flags == flags &&
                tb->trace_vcpu_dstate == *cpu->trace_dstate &&
                (tb_cflags(tb) & (CF_HASH_MASK | CF_INVALID)) == cf_mask)) {
         return tb;
     }
-    tb = tb_htable_lookup(cpu, *pc, *cs_base, *flags, cf_mask);
+    tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
     if (tb == NULL) {
         return NULL;
     }
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 16e4fe3ccd..ef96b312a1 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -245,6 +245,7 @@ static void cpu_exec_exit(CPUState *cpu)
 
 void cpu_exec_step_atomic(CPUState *cpu)
 {
+    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
     TranslationBlock *tb;
     target_ulong cs_base, pc;
     uint32_t flags;
@@ -258,7 +259,9 @@ void cpu_exec_step_atomic(CPUState *cpu)
         g_assert(!cpu->running);
         cpu->running = true;
 
-        tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, cf_mask);
+        cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
+        tb = tb_lookup(cpu, pc, cs_base, flags, cf_mask);
+
         if (tb == NULL) {
             mmap_lock();
             tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
@@ -418,11 +421,14 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
                                         TranslationBlock *last_tb,
                                         int tb_exit, uint32_t cf_mask)
 {
+    CPUArchState *env = (CPUArchState *)cpu->env_ptr;
     TranslationBlock *tb;
     target_ulong cs_base, pc;
     uint32_t flags;
 
-    tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, cf_mask);
+    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
+
+    tb = tb_lookup(cpu, pc, cs_base, flags, cf_mask);
     if (tb == NULL) {
         mmap_lock();
         tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
diff --git a/accel/tcg/tcg-runtime.c b/accel/tcg/tcg-runtime.c
index d736f4ff55..05e3d52c2f 100644
--- a/accel/tcg/tcg-runtime.c
+++ b/accel/tcg/tcg-runtime.c
@@ -152,7 +152,9 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
     target_ulong cs_base, pc;
     uint32_t flags;
 
-    tb = tb_lookup__cpu_state(cpu, &pc, &cs_base, &flags, curr_cflags());
+    cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
+
+    tb = tb_lookup(cpu, pc, cs_base, flags, curr_cflags());
     if (tb == NULL) {
         return tcg_code_gen_epilogue;
     }
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 2/5] accel/tcg: move CF_CLUSTER calculation to curr_cflags
  2021-02-24 16:58 [RFC PATCH 0/5] Experimenting with tb-lookup tweaks Alex Bennée
  2021-02-24 16:58 ` [RFC PATCH 1/5] accel/tcg: rename tb_lookup__cpu_state and hoist state extraction Alex Bennée
@ 2021-02-24 16:58 ` Alex Bennée
  2021-02-24 16:58 ` [RFC PATCH 3/5] accel/tcg: drop the use of CF_HASH_MASK and rename params Alex Bennée
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Alex Bennée @ 2021-02-24 16:58 UTC (permalink / raw)
  To: richard.henderson; +Cc: Paolo Bonzini, cota, Alex Bennée, qemu-devel

There is nothing special about this compile flag that doesn't mean we
can't just compute it with curr_cflags() which we should be using when
building a new set.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
 include/exec/exec-all.h   | 8 +++++---
 include/exec/tb-lookup.h  | 3 ---
 accel/tcg/cpu-exec.c      | 9 ++++-----
 accel/tcg/tcg-runtime.c   | 2 +-
 accel/tcg/translate-all.c | 6 +++---
 softmmu/physmem.c         | 2 +-
 6 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index b7b3c0ef12..1a69c07add 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -519,10 +519,12 @@ static inline uint32_t tb_cflags(const TranslationBlock *tb)
 }
 
 /* current cflags for hashing/comparison */
-static inline uint32_t curr_cflags(void)
+static inline uint32_t curr_cflags(CPUState *cpu)
 {
-    return (parallel_cpus ? CF_PARALLEL : 0)
-         | (icount_enabled() ? CF_USE_ICOUNT : 0);
+    uint32_t cflags = deposit32(0, CF_CLUSTER_SHIFT, 8, cpu->cluster_index);
+    cflags |= parallel_cpus ? CF_PARALLEL : 0;
+    cflags |= icount_enabled() ? CF_USE_ICOUNT : 0;
+    return cflags;
 }
 
 /* TranslationBlock invalidate API */
diff --git a/include/exec/tb-lookup.h b/include/exec/tb-lookup.h
index 62a509535d..b2247d458b 100644
--- a/include/exec/tb-lookup.h
+++ b/include/exec/tb-lookup.h
@@ -27,9 +27,6 @@ static inline TranslationBlock * tb_lookup(CPUState *cpu,
     hash = tb_jmp_cache_hash_func(pc);
     tb = qatomic_rcu_read(&cpu->tb_jmp_cache[hash]);
 
-    cf_mask &= ~CF_CLUSTER_MASK;
-    cf_mask |= cpu->cluster_index << CF_CLUSTER_SHIFT;
-
     if (likely(tb &&
                tb->pc == pc &&
                tb->cs_base == cs_base &&
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index ef96b312a1..45286dc4b3 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -249,8 +249,7 @@ void cpu_exec_step_atomic(CPUState *cpu)
     TranslationBlock *tb;
     target_ulong cs_base, pc;
     uint32_t flags;
-    uint32_t cflags = 1;
-    uint32_t cf_mask = cflags & CF_HASH_MASK;
+    uint32_t cflags = (curr_cflags(cpu) & ~CF_PARALLEL) | 1;
     int tb_exit;
 
     if (sigsetjmp(cpu->jmp_env, 0) == 0) {
@@ -260,7 +259,7 @@ void cpu_exec_step_atomic(CPUState *cpu)
         cpu->running = true;
 
         cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
-        tb = tb_lookup(cpu, pc, cs_base, flags, cf_mask);
+        tb = tb_lookup(cpu, pc, cs_base, flags, cflags);
 
         if (tb == NULL) {
             mmap_lock();
@@ -497,7 +496,7 @@ static inline bool cpu_handle_exception(CPUState *cpu, int *ret)
         if (replay_has_exception()
             && cpu_neg(cpu)->icount_decr.u16.low + cpu->icount_extra == 0) {
             /* Execute just one insn to trigger exception pending in the log */
-            cpu->cflags_next_tb = (curr_cflags() & ~CF_USE_ICOUNT) | 1;
+            cpu->cflags_next_tb = (curr_cflags(cpu) & ~CF_USE_ICOUNT) | 1;
         }
 #endif
         return false;
@@ -794,7 +793,7 @@ int cpu_exec(CPUState *cpu)
                have CF_INVALID set, -1 is a convenient invalid value that
                does not require tcg headers for cpu_common_reset.  */
             if (cflags == -1) {
-                cflags = curr_cflags();
+                cflags = curr_cflags(cpu);
             } else {
                 cpu->cflags_next_tb = -1;
             }
diff --git a/accel/tcg/tcg-runtime.c b/accel/tcg/tcg-runtime.c
index 05e3d52c2f..99403e3eb3 100644
--- a/accel/tcg/tcg-runtime.c
+++ b/accel/tcg/tcg-runtime.c
@@ -154,7 +154,7 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
 
     cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
 
-    tb = tb_lookup(cpu, pc, cs_base, flags, curr_cflags());
+    tb = tb_lookup(cpu, pc, cs_base, flags, curr_cflags(cpu));
     if (tb == NULL) {
         return tcg_code_gen_epilogue;
     }
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index bbd919a393..f29b47f090 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -2194,7 +2194,7 @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
     if (current_tb_modified) {
         page_collection_unlock(pages);
         /* Force execution of one insn next time.  */
-        cpu->cflags_next_tb = 1 | curr_cflags();
+        cpu->cflags_next_tb = 1 | curr_cflags(cpu);
         mmap_unlock();
         cpu_loop_exit_noexc(cpu);
     }
@@ -2362,7 +2362,7 @@ static bool tb_invalidate_phys_page(tb_page_addr_t addr, uintptr_t pc)
 #ifdef TARGET_HAS_PRECISE_SMC
     if (current_tb_modified) {
         /* Force execution of one insn next time.  */
-        cpu->cflags_next_tb = 1 | curr_cflags();
+        cpu->cflags_next_tb = 1 | curr_cflags(cpu);
         return true;
     }
 #endif
@@ -2438,7 +2438,7 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
      * operations only (which execute after completion) so we don't
      * double instrument the instruction.
      */
-    cpu->cflags_next_tb = curr_cflags() | CF_MEMI_ONLY | CF_LAST_IO | n;
+    cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_LAST_IO | n;
 
     qemu_log_mask_and_addr(CPU_LOG_EXEC, tb->pc,
                            "cpu_io_recompile: rewound execution of TB to "
diff --git a/softmmu/physmem.c b/softmmu/physmem.c
index 19e0aa9836..7e8b0fab89 100644
--- a/softmmu/physmem.c
+++ b/softmmu/physmem.c
@@ -937,7 +937,7 @@ void cpu_check_watchpoint(CPUState *cpu, vaddr addr, vaddr len,
                     cpu_loop_exit_restore(cpu, ra);
                 } else {
                     /* Force execution of one insn next time.  */
-                    cpu->cflags_next_tb = 1 | curr_cflags();
+                    cpu->cflags_next_tb = 1 | curr_cflags(cpu);
                     mmap_unlock();
                     if (ra) {
                         cpu_restore_state(cpu, ra, true);
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 3/5] accel/tcg: drop the use of CF_HASH_MASK and rename params
  2021-02-24 16:58 [RFC PATCH 0/5] Experimenting with tb-lookup tweaks Alex Bennée
  2021-02-24 16:58 ` [RFC PATCH 1/5] accel/tcg: rename tb_lookup__cpu_state and hoist state extraction Alex Bennée
  2021-02-24 16:58 ` [RFC PATCH 2/5] accel/tcg: move CF_CLUSTER calculation to curr_cflags Alex Bennée
@ 2021-02-24 16:58 ` Alex Bennée
  2021-02-24 16:58 ` [RFC PATCH 4/5] include/exec: lightly re-arrange TranslationBlock Alex Bennée
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Alex Bennée @ 2021-02-24 16:58 UTC (permalink / raw)
  To: richard.henderson; +Cc: Paolo Bonzini, cota, Alex Bennée, qemu-devel

We don't really deal in cf_mask most of the time. The one time it's
relevant is when we want to remove an invalidated TB from the QHT
lookup. Everywhere else we should be looking up things without
CF_INVALID set.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
 include/exec/exec-all.h   |  4 +---
 include/exec/tb-lookup.h  |  9 ++++++---
 accel/tcg/cpu-exec.c      | 16 ++++++++--------
 accel/tcg/tcg-runtime.c   |  2 +-
 accel/tcg/translate-all.c |  8 +++++---
 5 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 1a69c07add..acf66ab692 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -460,8 +460,6 @@ struct TranslationBlock {
 #define CF_PARALLEL    0x00080000 /* Generate code for a parallel context */
 #define CF_CLUSTER_MASK 0xff000000 /* Top 8 bits are cluster ID */
 #define CF_CLUSTER_SHIFT 24
-/* cflags' mask for hashing/comparison, basically ignore CF_INVALID */
-#define CF_HASH_MASK   (~CF_INVALID)
 
     /* Per-vCPU dynamic tracing state used to generate this TB */
     uint32_t trace_vcpu_dstate;
@@ -538,7 +536,7 @@ void tb_flush(CPUState *cpu);
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
 TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
                                    target_ulong cs_base, uint32_t flags,
-                                   uint32_t cf_mask);
+                                   uint32_t cflags);
 void tb_set_jmp_target(TranslationBlock *tb, int n, uintptr_t addr);
 
 /* GETPC is the true target of the return instruction that we'll execute.  */
diff --git a/include/exec/tb-lookup.h b/include/exec/tb-lookup.h
index b2247d458b..7b70412fae 100644
--- a/include/exec/tb-lookup.h
+++ b/include/exec/tb-lookup.h
@@ -19,11 +19,14 @@
 /* Might cause an exception, so have a longjmp destination ready */
 static inline TranslationBlock * tb_lookup(CPUState *cpu,
                                            target_ulong pc, target_ulong cs_base,
-                                           uint32_t flags, uint32_t cf_mask)
+                                           uint32_t flags, uint32_t cflags)
 {
     TranslationBlock *tb;
     uint32_t hash;
 
+    /* we should never be trying to look up an INVALID tb */
+    tcg_debug_assert(!(cflags & CF_INVALID));
+
     hash = tb_jmp_cache_hash_func(pc);
     tb = qatomic_rcu_read(&cpu->tb_jmp_cache[hash]);
 
@@ -32,10 +35,10 @@ static inline TranslationBlock * tb_lookup(CPUState *cpu,
                tb->cs_base == cs_base &&
                tb->flags == flags &&
                tb->trace_vcpu_dstate == *cpu->trace_dstate &&
-               (tb_cflags(tb) & (CF_HASH_MASK | CF_INVALID)) == cf_mask)) {
+               tb_cflags(tb) == cflags)) {
         return tb;
     }
-    tb = tb_htable_lookup(cpu, pc, cs_base, flags, cf_mask);
+    tb = tb_htable_lookup(cpu, pc, cs_base, flags, cflags);
     if (tb == NULL) {
         return NULL;
     }
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 45286dc4b3..931da96c2b 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -307,7 +307,7 @@ struct tb_desc {
     CPUArchState *env;
     tb_page_addr_t phys_page1;
     uint32_t flags;
-    uint32_t cf_mask;
+    uint32_t cflags;
     uint32_t trace_vcpu_dstate;
 };
 
@@ -321,7 +321,7 @@ static bool tb_lookup_cmp(const void *p, const void *d)
         tb->cs_base == desc->cs_base &&
         tb->flags == desc->flags &&
         tb->trace_vcpu_dstate == desc->trace_vcpu_dstate &&
-        (tb_cflags(tb) & (CF_HASH_MASK | CF_INVALID)) == desc->cf_mask) {
+        tb_cflags(tb) == desc->cflags) {
         /* check next page if needed */
         if (tb->page_addr[1] == -1) {
             return true;
@@ -341,7 +341,7 @@ static bool tb_lookup_cmp(const void *p, const void *d)
 
 TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
                                    target_ulong cs_base, uint32_t flags,
-                                   uint32_t cf_mask)
+                                   uint32_t cflags)
 {
     tb_page_addr_t phys_pc;
     struct tb_desc desc;
@@ -350,7 +350,7 @@ TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
     desc.env = (CPUArchState *)cpu->env_ptr;
     desc.cs_base = cs_base;
     desc.flags = flags;
-    desc.cf_mask = cf_mask;
+    desc.cflags = cflags;
     desc.trace_vcpu_dstate = *cpu->trace_dstate;
     desc.pc = pc;
     phys_pc = get_page_addr_code(desc.env, pc);
@@ -358,7 +358,7 @@ TranslationBlock *tb_htable_lookup(CPUState *cpu, target_ulong pc,
         return NULL;
     }
     desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
-    h = tb_hash_func(phys_pc, pc, flags, cf_mask, *cpu->trace_dstate);
+    h = tb_hash_func(phys_pc, pc, flags, cflags, *cpu->trace_dstate);
     return qht_lookup_custom(&tb_ctx.htable, &desc, h, tb_lookup_cmp);
 }
 
@@ -418,7 +418,7 @@ static inline void tb_add_jump(TranslationBlock *tb, int n,
 
 static inline TranslationBlock *tb_find(CPUState *cpu,
                                         TranslationBlock *last_tb,
-                                        int tb_exit, uint32_t cf_mask)
+                                        int tb_exit, uint32_t cflags)
 {
     CPUArchState *env = (CPUArchState *)cpu->env_ptr;
     TranslationBlock *tb;
@@ -427,10 +427,10 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
 
     cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
 
-    tb = tb_lookup(cpu, pc, cs_base, flags, cf_mask);
+    tb = tb_lookup(cpu, pc, cs_base, flags, cflags);
     if (tb == NULL) {
         mmap_lock();
-        tb = tb_gen_code(cpu, pc, cs_base, flags, cf_mask);
+        tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
         mmap_unlock();
         /* We add the TB in the virtual pc hash table for the fast lookup */
         qatomic_set(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)], tb);
diff --git a/accel/tcg/tcg-runtime.c b/accel/tcg/tcg-runtime.c
index 99403e3eb3..49f5de37e8 100644
--- a/accel/tcg/tcg-runtime.c
+++ b/accel/tcg/tcg-runtime.c
@@ -27,10 +27,10 @@
 #include "exec/helper-proto.h"
 #include "exec/cpu_ldst.h"
 #include "exec/exec-all.h"
-#include "exec/tb-lookup.h"
 #include "disas/disas.h"
 #include "exec/log.h"
 #include "tcg/tcg.h"
+#include "exec/tb-lookup.h"
 
 /* 32-bit helpers */
 
diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index f29b47f090..0b0bfd35ab 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -1311,7 +1311,7 @@ static bool tb_cmp(const void *ap, const void *bp)
     return a->pc == b->pc &&
         a->cs_base == b->cs_base &&
         a->flags == b->flags &&
-        (tb_cflags(a) & CF_HASH_MASK) == (tb_cflags(b) & CF_HASH_MASK) &&
+        (tb_cflags(a) & ~CF_INVALID) == (tb_cflags(b) & ~CF_INVALID) &&
         a->trace_vcpu_dstate == b->trace_vcpu_dstate &&
         a->page_addr[0] == b->page_addr[0] &&
         a->page_addr[1] == b->page_addr[1];
@@ -1616,6 +1616,7 @@ static void do_tb_phys_invalidate(TranslationBlock *tb, bool rm_from_page_list)
     PageDesc *p;
     uint32_t h;
     tb_page_addr_t phys_pc;
+    uint32_t orig_cflags = tb_cflags(tb);
 
     assert_memory_lock();
 
@@ -1626,7 +1627,7 @@ static void do_tb_phys_invalidate(TranslationBlock *tb, bool rm_from_page_list)
 
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
-    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb_cflags(tb) & CF_HASH_MASK,
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags, orig_cflags,
                      tb->trace_vcpu_dstate);
     if (!qht_remove(&tb_ctx.htable, tb, h)) {
         return;
@@ -1793,6 +1794,7 @@ tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
     uint32_t h;
 
     assert_memory_lock();
+    tcg_debug_assert(!(tb->cflags & CF_INVALID));
 
     /*
      * Add the TB to the page list, acquiring first the pages's locks.
@@ -1811,7 +1813,7 @@ tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
     }
 
     /* add in the hash table */
-    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->cflags & CF_HASH_MASK,
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->cflags,
                      tb->trace_vcpu_dstate);
     qht_insert(&tb_ctx.htable, tb, h, &existing_tb);
 
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH  4/5] include/exec: lightly re-arrange TranslationBlock
  2021-02-24 16:58 [RFC PATCH 0/5] Experimenting with tb-lookup tweaks Alex Bennée
                   ` (2 preceding siblings ...)
  2021-02-24 16:58 ` [RFC PATCH 3/5] accel/tcg: drop the use of CF_HASH_MASK and rename params Alex Bennée
@ 2021-02-24 16:58 ` Alex Bennée
  2021-02-24 16:58 ` [RFC PATCH 5/5] include/exec/tb-lookup: try and reduce branch prediction issues Alex Bennée
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Alex Bennée @ 2021-02-24 16:58 UTC (permalink / raw)
  To: richard.henderson; +Cc: Paolo Bonzini, cota, Alex Bennée, qemu-devel

Lets make sure all the flags we compare when looking up blocks are
together in the same place.

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
 include/exec/exec-all.h | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index acf66ab692..75f8c3981a 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -448,9 +448,6 @@ struct TranslationBlock {
     target_ulong pc;   /* simulated PC corresponding to this block (EIP + CS base) */
     target_ulong cs_base; /* CS base for this block */
     uint32_t flags; /* flags defining in which context the code was generated */
-    uint16_t size;      /* size of target code for this block (1 <=
-                           size <= TARGET_PAGE_SIZE) */
-    uint16_t icount;
     uint32_t cflags;    /* compile flags */
 #define CF_COUNT_MASK  0x00007fff
 #define CF_LAST_IO     0x00008000 /* Last insn may be an IO access.  */
@@ -464,6 +461,11 @@ struct TranslationBlock {
     /* Per-vCPU dynamic tracing state used to generate this TB */
     uint32_t trace_vcpu_dstate;
 
+    /* Above fields used for comparing */
+    uint16_t size;      /* size of target code for this block (1 <=
+                           size <= TARGET_PAGE_SIZE) */
+    uint16_t icount;
+
     struct tb_tc tc;
 
     /* first and second physical page containing code. The lower bit
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH 5/5] include/exec/tb-lookup: try and reduce branch prediction issues
  2021-02-24 16:58 [RFC PATCH 0/5] Experimenting with tb-lookup tweaks Alex Bennée
                   ` (3 preceding siblings ...)
  2021-02-24 16:58 ` [RFC PATCH 4/5] include/exec: lightly re-arrange TranslationBlock Alex Bennée
@ 2021-02-24 16:58 ` Alex Bennée
  2021-02-25  0:28 ` [RFC PATCH 0/5] Experimenting with tb-lookup tweaks Richard Henderson
  2021-02-25 15:45 ` no-reply
  6 siblings, 0 replies; 9+ messages in thread
From: Alex Bennée @ 2021-02-24 16:58 UTC (permalink / raw)
  To: richard.henderson; +Cc: cota, Alex Bennée, qemu-devel

Now that everything is nicely aligned instead of a compare and jump
just blitz the bits together and test for zero at the end.

[AJB: looking at perf I can't see much change. Basically the hotspot
seems to be the initial load of the TB parameters. If this reflects
the stall of the memory bus loading TB fields I guess this means there
isn't much more to be squeezed out here:

helper_lookup_tb_ptr() /home/alex/lsrc/qemu.git/builds/arm.all/qemu-system-aarch64
Event: cycles:ppp

Percent

             Disassembly of section .text:

             0000000000809c40 <helper_lookup_tb_ptr>:
             helper_lookup_tb_ptr():
             {
                 return ctpop64(arg);
             }

             const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
             {
  4.14         push   %r13
  1.32         push   %r12
             env_cpu():
              *
              * Return the CPUState associated with the environment.
              */
             static inline CPUState *env_cpu(CPUArchState *env)
             {
                 return &env_archcpu(env)->parent_obj;
  0.47         lea    -0x9dc0(%rdi),%r12
             helper_lookup_tb_ptr():
  0.80         push   %rbp
  0.40         mov    %rdi,%rbp
  0.90         push   %rbx
  1.59         sub    $0x28,%rsp
  1.95         mov    %fs:0x28,%rax
  2.28         mov    %rax,0x18(%rsp)
  0.51         xor    %eax,%eax
                 CPUState *cpu = env_cpu(env);
                 TranslationBlock *tb;
                 target_ulong cs_base, pc;
                 uint32_t flags;

                 cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
  0.98         lea    0x4(%rsp),%rcx
  0.66         lea    0x8(%rsp),%rdx
  1.09         lea    0x10(%rsp),%rsi
  2.08       → callq  cpu_get_tb_cpu_state

                 tb = tb_lookup(cpu, pc, cs_base, flags, curr_cflags(cpu));
  1.19         mov    0x10(%rsp),%rsi
  1.03         mov    0x4(%rsp),%r9d
             deposit32():
                                              uint32_t fieldval)
             {
                 uint32_t mask;
                 assert(start >= 0 && length > 0 && length <= 32 - start);
                 mask = (~0U >> (32 - length)) << start;
                 return (value & ~mask) | ((fieldval << start) & mask);
  2.53         mov    -0x1b24(%rbp),%r8d
  0.55         lea    parallel_cpus,%rdx
             tb_jmp_cache_hash_func():
             }

             static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
             {
                 target_ulong tmp;
                 tmp = pc ^ (pc >> (TARGET_PAGE_BITS - TB_JMP_PAGE_BITS));
  0.09         mov    %rsi,%rdi
             deposit32():
  0.62         shl    $0x18,%r8d
             curr_cflags():

             /* current cflags for hashing/comparison */
             static inline uint32_t curr_cflags(CPUState *cpu)
             {
                 uint32_t cflags = deposit32(0, CF_CLUSTER_SHIFT, 8, cpu->cluster_index);
                 cflags |= parallel_cpus ? CF_PARALLEL : 0;
  1.80         mov    %r8d,%eax
  0.92         or     $0x80000,%eax
  1.45         cmpb   $0x0,(%rdx)
  0.84         lea    use_icount,%rdx
  3.13         cmovne %eax,%r8d
                 cflags |= icount_enabled() ? CF_USE_ICOUNT : 0;
  0.62         mov    (%rdx),%ecx
             helper_lookup_tb_ptr():
  0.60         mov    0x8(%rsp),%rdx
  1.95         mov    %r8d,%eax
  0.60         or     $0x20000,%eax
  0.45         test   %ecx,%ecx
  3.70         cmovne %eax,%r8d
             tb_jmp_cache_hash_func():
  0.47         lea    target_page,%rax
  0.55         mov    0x4(%rax),%ecx
  0.55         sub    $0x6,%ecx
  3.17         shr    %cl,%rdi
  0.53         xor    %rsi,%rdi
                 return (((tmp >> (TARGET_PAGE_BITS - TB_JMP_PAGE_BITS)) & TB_JMP_PAGE_MASK)
  1.74         mov    %rdi,%rax
                        | (tmp & TB_JMP_ADDR_MASK));
  0.50         and    $0x3f,%edi
                 return (((tmp >> (TARGET_PAGE_BITS - TB_JMP_PAGE_BITS)) & TB_JMP_PAGE_MASK)
  2.85         shr    %cl,%rax
  0.31         and    $0xfc0,%eax
                        | (tmp & TB_JMP_ADDR_MASK));
  0.36         or     %edi,%eax
             tb_lookup():

                 /* we should never be trying to look up an INVALID tb */
                 tcg_debug_assert(!(cflags & CF_INVALID));

                 hash = tb_jmp_cache_hash_func(pc);
                 tb = qatomic_rcu_read(&cpu->tb_jmp_cache[hash]);
  2.26         lea    0x210(%r12,%rax,8),%r13
  5.86         mov    0x0(%r13),%rbx

                 if (likely(tb)) {
  0.38         test   %rbx,%rbx
  0.00       ↓ je     e7
                     uint64_t bits = tb->pc ^ pc;
  6.77         mov    (%rbx),%rax
                     bits |= tb->cs_base ^ cs_base;
  0.75         mov    0x8(%rbx),%rcx
                     uint64_t bits = tb->pc ^ pc;
  1.67         xor    %rsi,%rax
                     bits |= tb->cs_base ^ cs_base;
  0.83         xor    %rdx,%rcx
  0.95         or     %rax,%rcx
                     bits |= tb->flags ^ flags;
  0.05         mov    %r9d,%eax
  1.63         xor    0x10(%rbx),%eax
  2.26         or     %rax,%rcx
                     bits |= tb->trace_vcpu_dstate ^ *cpu->trace_dstate;
  0.08         mov    0x18(%rbx),%eax
  0.55         xor    -0x1b38(%rbp),%rax
  2.45         or     %rcx,%rax
             tb_cflags():
                 return qatomic_read(&tb->cflags);
  0.13         mov    0x14(%rbx),%ecx
             tb_lookup():
                     bits |= tb_cflags(tb) ^ cflags;
  0.22         xor    %r8d,%ecx
                     if (!bits) {
  3.34         or     %rax,%rcx
  3.69       ↓ je     fe
                         return tb;
                     }
                 }
                 tb = tb_htable_lookup(cpu, pc, cs_base, flags, cflags);
  0.21   e7:   mov    %r9d,%ecx
               mov    %r12,%rdi
  0.20       → callq  tb_htable_lookup
  0.00         mov    %rax,%rbx
                 if (tb == NULL) {
  0.02         test   %rax,%rax
             ↓ je     130
                     return NULL;
                 }
                 qatomic_set(&cpu->tb_jmp_cache[hash], tb);
  0.10         mov    %rax,0x0(%r13)
             helper_lookup_tb_ptr():
                 if (tb == NULL) {
                     return tcg_code_gen_epilogue;
                 }
                 qemu_log_mask_and_addr(CPU_LOG_EXEC, pc,
  0.40   fe:   lea    qemu_loglevel,%rax
  0.16         testb  $0x20,(%rax)
  0.14       ↓ jne    140
                                        "Chain %d: %p ["
                                        TARGET_FMT_lx "/" TARGET_FMT_lx "/%#x] %s\n",
                                        cpu->cpu_index, tb->tc.ptr, cs_base, pc, flags,
                                        lookup_symbol(pc));
                 return tb->tc.ptr;
  1.69  10a:   mov    0x20(%rbx),%rax
             }
  0.18  10e:   mov    0x18(%rsp),%rsi
  0.34         xor    %fs:0x28,%rsi
  0.50       ↓ jne    188
  4.48         add    $0x28,%rsp
  0.27         pop    %rbx
  0.10         pop    %rbp
  0.30         pop    %r12
  4.56         pop    %r13
  0.20       ← retq
               nop
                     return tcg_code_gen_epilogue;
        130:   lea    tcg_code_gen_epilogue,%rax
               mov    (%rax),%rax
             ↑ jmp    10e
               nop
                 qemu_log_mask_and_addr(CPU_LOG_EXEC, pc,
        140:   mov    0x10(%rsp),%rdi
             → callq  qemu_log_in_addr_range
               test   %al,%al
             ↑ je     10a
               mov    0x10(%rsp),%rdi
             → callq  lookup_symbol
               sub    $0x8,%rsp
               mov    0x20(%rbx),%rdx
               mov    -0x1b28(%rbp),%esi
               push   %rax
               mov    0x14(%rsp),%r9d
               lea    __PRETTY_FUNCTION__.30436+0x10,%rdi
               xor    %eax,%eax
               mov    0x20(%rsp),%r8
               mov    0x18(%rsp),%rcx
             → callq  qemu_log
               pop    %rax
               pop    %rdx
             ↑ jmp    10a
             }
        188: → callq  __stack_chk_fail@plt

]

Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
 include/exec/tb-lookup.h | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/exec/tb-lookup.h b/include/exec/tb-lookup.h
index 7b70412fae..3140abebc2 100644
--- a/include/exec/tb-lookup.h
+++ b/include/exec/tb-lookup.h
@@ -30,13 +30,15 @@ static inline TranslationBlock * tb_lookup(CPUState *cpu,
     hash = tb_jmp_cache_hash_func(pc);
     tb = qatomic_rcu_read(&cpu->tb_jmp_cache[hash]);
 
-    if (likely(tb &&
-               tb->pc == pc &&
-               tb->cs_base == cs_base &&
-               tb->flags == flags &&
-               tb->trace_vcpu_dstate == *cpu->trace_dstate &&
-               tb_cflags(tb) == cflags)) {
-        return tb;
+    if (likely(tb)) {
+        uint64_t bits = tb->pc ^ pc;
+        bits |= tb->cs_base ^ cs_base;
+        bits |= tb->flags ^ flags;
+        bits |= tb->trace_vcpu_dstate ^ *cpu->trace_dstate;
+        bits |= tb_cflags(tb) ^ cflags;
+        if (!bits) {
+            return tb;
+        }
     }
     tb = tb_htable_lookup(cpu, pc, cs_base, flags, cflags);
     if (tb == NULL) {
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/5] Experimenting with tb-lookup tweaks
  2021-02-24 16:58 [RFC PATCH 0/5] Experimenting with tb-lookup tweaks Alex Bennée
                   ` (4 preceding siblings ...)
  2021-02-24 16:58 ` [RFC PATCH 5/5] include/exec/tb-lookup: try and reduce branch prediction issues Alex Bennée
@ 2021-02-25  0:28 ` Richard Henderson
  2021-02-25 10:15   ` Alex Bennée
  2021-02-25 15:45 ` no-reply
  6 siblings, 1 reply; 9+ messages in thread
From: Richard Henderson @ 2021-02-25  0:28 UTC (permalink / raw)
  To: Alex Bennée; +Cc: cota, qemu-devel

On 2/24/21 8:58 AM, Alex Bennée wrote:
> Hi Richard,
> 
> Well I spun up some of the ideas we talked about to see if there was
> anything to be squeezed out of the function. In the end the results
> seem to be a washout with my pigz benchmark:
> 
>  qemu-system-aarch64 -cpu cortex-a57 \
>    -machine type=virt,virtualization=on,gic-version=3 \
>    -serial mon:stdio \
>    -netdev user,id=unet,hostfwd=tcp::2222-:22 \
>    -device virtio-net-pci,netdev=unet,id=virt-net,disable-legacy=on \
>    -device virtio-scsi-pci,id=virt-scsi,disable-legacy=on \
>    -blockdev driver=raw,node-name=hd,discard=unmap,file.driver=host_device,file.filename=/dev/zen-disk/debian-buster-arm64 \
>    -device scsi-hd,drive=hd,id=virt-scsi-hd \
>    -smp 4 -m 4096 \
>    -kernel ~/lsrc/linux.git/builds/arm64/arch/arm64/boot/Image \
>    -append "root=/dev/sda2 systemd.unit=benchmark-pigz.service" \
>    -display none -snapshot
> 
> | Command | Mean [s]       | Min [s] | Max [s] | Relative |
> |---------+----------------+---------+---------+----------|
> | Before  | 46.597 ± 2.482 |  45.208 |  53.618 |     1.00 |
> | After   | 46.867 ± 2.242 |  45.871 |  53.180 |     1.00 |

Well that's disappointing.

> Maybe the code cleanup itself makes it worthwhile. WDYT?

I think there's little doubt that the first 3 patches are a good code cleanup.

Patch 4 I think is still beneficial, simply so that we can add that "Above
fields" comment.

Patch 5 would only be worthwhile if we could measure any positive difference,
which it seems we cannot.

I have a follow-up patch to remove the parallel_cpus global variable which I
will post in a moment.  While it removes a handful of insns from this
fast-path, I doubt it helps.  But getting rid of a global is probably always
positive, no?

I was glancing through the lookup function for alpha, instead of aarch64 and saw:

 21e:   33 43 18                xor    0x18(%rbx),%eax
 221:   4c 31 e1                xor    %r12,%rcx
 224:   44 31 ea                xor    %r13d,%edx
 227:   09 c2                   or     %eax,%edx
 229:   48 0b 4b 08             or     0x8(%rbx),%rcx

and thought -- hang on, how come we're just ORing nor XORing here?  Of course
it's the cs_base field, which alpha has set to zero.  The compiler has
simplified bits |= 0 ^ tb->cs_base.

Which got me thinking: what if we had a per-cpu

typedef struct {
    target_ulong pc;
    ...
} TranslationBlockID;

static inline bool arch_tbid_cmp(TranslationBlockID x,
                                 TranslationBlockID y)
{
    return x.pc == y.pc && ...;
}

We could potentially reduce this to memcmp(&x, &y).

First, this would allow cs_base to be eliminated where it is not used.  Second,
this would allow cs_base to be renamed for the non-x86 targets for which it is
being abused.  Third, it would allow tb->flags to be either (a) elided or (b)
extended by the target as needed.

This final is directed at ARM, of course, where we've overflowed the uint32_t
that is tb->flags.  We could now extend that to 64-bits.

Obviously, some tweaks to tb_hash_func would be required as well, but that's
manageable.

What do you think about this last?


r~


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/5] Experimenting with tb-lookup tweaks
  2021-02-25  0:28 ` [RFC PATCH 0/5] Experimenting with tb-lookup tweaks Richard Henderson
@ 2021-02-25 10:15   ` Alex Bennée
  0 siblings, 0 replies; 9+ messages in thread
From: Alex Bennée @ 2021-02-25 10:15 UTC (permalink / raw)
  To: Richard Henderson; +Cc: cota, qemu-devel


Richard Henderson <richard.henderson@linaro.org> writes:

> On 2/24/21 8:58 AM, Alex Bennée wrote:
>> Hi Richard,
>> 
>> Well I spun up some of the ideas we talked about to see if there was
>> anything to be squeezed out of the function. In the end the results
>> seem to be a washout with my pigz benchmark:
>> 
>>  qemu-system-aarch64 -cpu cortex-a57 \
>>    -machine type=virt,virtualization=on,gic-version=3 \
>>    -serial mon:stdio \
>>    -netdev user,id=unet,hostfwd=tcp::2222-:22 \
>>    -device virtio-net-pci,netdev=unet,id=virt-net,disable-legacy=on \
>>    -device virtio-scsi-pci,id=virt-scsi,disable-legacy=on \
>>    -blockdev driver=raw,node-name=hd,discard=unmap,file.driver=host_device,file.filename=/dev/zen-disk/debian-buster-arm64 \
>>    -device scsi-hd,drive=hd,id=virt-scsi-hd \
>>    -smp 4 -m 4096 \
>>    -kernel ~/lsrc/linux.git/builds/arm64/arch/arm64/boot/Image \
>>    -append "root=/dev/sda2 systemd.unit=benchmark-pigz.service" \
>>    -display none -snapshot
>> 
>> | Command | Mean [s]       | Min [s] | Max [s] | Relative |
>> |---------+----------------+---------+---------+----------|
>> | Before  | 46.597 ± 2.482 |  45.208 |  53.618 |     1.00 |
>> | After   | 46.867 ± 2.242 |  45.871 |  53.180 |     1.00 |
>
> Well that's disappointing.
>
>> Maybe the code cleanup itself makes it worthwhile. WDYT?
>
> I think there's little doubt that the first 3 patches are a good code cleanup.
>
> Patch 4 I think is still beneficial, simply so that we can add that "Above
> fields" comment.
>
> Patch 5 would only be worthwhile if we could measure any positive difference,
> which it seems we cannot.
>
> I have a follow-up patch to remove the parallel_cpus global variable which I
> will post in a moment.  While it removes a handful of insns from this
> fast-path, I doubt it helps.  But getting rid of a global is probably always
> positive, no?
>
> I was glancing through the lookup function for alpha, instead of aarch64 and saw:
>
>  21e:   33 43 18                xor    0x18(%rbx),%eax
>  221:   4c 31 e1                xor    %r12,%rcx
>  224:   44 31 ea                xor    %r13d,%edx
>  227:   09 c2                   or     %eax,%edx
>  229:   48 0b 4b 08             or     0x8(%rbx),%rcx
>
> and thought -- hang on, how come we're just ORing nor XORing here?  Of course
> it's the cs_base field, which alpha has set to zero.  The compiler has
> simplified bits |= 0 ^ tb->cs_base.
>
> Which got me thinking: what if we had a per-cpu
>
> typedef struct {
>     target_ulong pc;
>     ...
> } TranslationBlockID;
>
> static inline bool arch_tbid_cmp(TranslationBlockID x,
>                                  TranslationBlockID y)
> {
>     return x.pc == y.pc && ...;
> }
>
> We could potentially reduce this to memcmp(&x, &y).
>
> First, this would allow cs_base to be eliminated where it is not used.  Second,
> this would allow cs_base to be renamed for the non-x86 targets for which it is
> being abused.  Third, it would allow tb->flags to be either (a) elided or (b)
> extended by the target as needed.
>
> This final is directed at ARM, of course, where we've overflowed the uint32_t
> that is tb->flags.  We could now extend that to 64-bits.
>
> Obviously, some tweaks to tb_hash_func would be required as well, but that's
> manageable.
>
> What do you think about this last?

Sounds like a good idea for clean-up, especially to get rid of
cs_base/extend tbflags when needed. One concern would be where do we go
when we get to heterogeneous emulation? Will they share the same
translation area like the current cpu->cluster_index stuff or will that
only be for similar but not quite the same architectures? Maybe I'm
thinking too far ahead... 

>
>
> r~


-- 
Alex Bennée


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH  0/5] Experimenting with tb-lookup tweaks
  2021-02-24 16:58 [RFC PATCH 0/5] Experimenting with tb-lookup tweaks Alex Bennée
                   ` (5 preceding siblings ...)
  2021-02-25  0:28 ` [RFC PATCH 0/5] Experimenting with tb-lookup tweaks Richard Henderson
@ 2021-02-25 15:45 ` no-reply
  6 siblings, 0 replies; 9+ messages in thread
From: no-reply @ 2021-02-25 15:45 UTC (permalink / raw)
  To: alex.bennee; +Cc: alex.bennee, cota, richard.henderson, qemu-devel

Patchew URL: https://patchew.org/QEMU/20210224165811.11567-1-alex.bennee@linaro.org/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20210224165811.11567-1-alex.bennee@linaro.org
Subject: [RFC PATCH  0/5] Experimenting with tb-lookup tweaks

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 - [tag update]      patchew/20210218201528.127099-1-eblake@redhat.com -> patchew/20210218201528.127099-1-eblake@redhat.com
 - [tag update]      patchew/20210224055401.492407-1-jasowang@redhat.com -> patchew/20210224055401.492407-1-jasowang@redhat.com
 * [new tag]         patchew/20210224165811.11567-1-alex.bennee@linaro.org -> patchew/20210224165811.11567-1-alex.bennee@linaro.org
 * [new tag]         patchew/20210224165837.21983-1-vgoyal@redhat.com -> patchew/20210224165837.21983-1-vgoyal@redhat.com
 - [tag update]      patchew/20210225032335.64245-1-aik@ozlabs.ru -> patchew/20210225032335.64245-1-aik@ozlabs.ru
 * [new tag]         patchew/20210225054756.35962-1-linuxmaker@163.com -> patchew/20210225054756.35962-1-linuxmaker@163.com
 - [tag update]      patchew/20210225131316.631940-1-pbonzini@redhat.com -> patchew/20210225131316.631940-1-pbonzini@redhat.com
Switched to a new branch 'test'
0be54b4 include/exec/tb-lookup: try and reduce branch prediction issues
c6233de include/exec: lightly re-arrange TranslationBlock
18534bf accel/tcg: drop the use of CF_HASH_MASK and rename params
3a30caf accel/tcg: move CF_CLUSTER calculation to curr_cflags
a135bea accel/tcg: rename tb_lookup__cpu_state and hoist state extraction

=== OUTPUT BEGIN ===
1/5 Checking commit a135bea36366 (accel/tcg: rename tb_lookup__cpu_state and hoist state extraction)
ERROR: "foo * bar" should be "foo *bar"
#84: FILE: include/exec/tb-lookup.h:20:
+static inline TranslationBlock * tb_lookup(CPUState *cpu,

WARNING: line over 80 characters
#85: FILE: include/exec/tb-lookup.h:21:
+                                           target_ulong pc, target_ulong cs_base,

total: 1 errors, 1 warnings, 80 lines checked

Patch 1/5 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

2/5 Checking commit 3a30caf5f47d (accel/tcg: move CF_CLUSTER calculation to curr_cflags)
3/5 Checking commit 18534bff0f1f (accel/tcg: drop the use of CF_HASH_MASK and rename params)
4/5 Checking commit c6233de83263 (include/exec: lightly re-arrange TranslationBlock)
WARNING: Block comments use a leading /* on a separate line
#35: FILE: include/exec/exec-all.h:465:
+    uint16_t size;      /* size of target code for this block (1 <=

WARNING: Block comments use * on subsequent lines
#36: FILE: include/exec/exec-all.h:466:
+    uint16_t size;      /* size of target code for this block (1 <=
+                           size <= TARGET_PAGE_SIZE) */

WARNING: Block comments use a trailing */ on a separate line
#36: FILE: include/exec/exec-all.h:466:
+                           size <= TARGET_PAGE_SIZE) */

total: 0 errors, 3 warnings, 20 lines checked

Patch 4/5 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
5/5 Checking commit 0be54b4ee146 (include/exec/tb-lookup: try and reduce branch prediction issues)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20210224165811.11567-1-alex.bennee@linaro.org/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-02-25 15:48 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-24 16:58 [RFC PATCH 0/5] Experimenting with tb-lookup tweaks Alex Bennée
2021-02-24 16:58 ` [RFC PATCH 1/5] accel/tcg: rename tb_lookup__cpu_state and hoist state extraction Alex Bennée
2021-02-24 16:58 ` [RFC PATCH 2/5] accel/tcg: move CF_CLUSTER calculation to curr_cflags Alex Bennée
2021-02-24 16:58 ` [RFC PATCH 3/5] accel/tcg: drop the use of CF_HASH_MASK and rename params Alex Bennée
2021-02-24 16:58 ` [RFC PATCH 4/5] include/exec: lightly re-arrange TranslationBlock Alex Bennée
2021-02-24 16:58 ` [RFC PATCH 5/5] include/exec/tb-lookup: try and reduce branch prediction issues Alex Bennée
2021-02-25  0:28 ` [RFC PATCH 0/5] Experimenting with tb-lookup tweaks Richard Henderson
2021-02-25 10:15   ` Alex Bennée
2021-02-25 15:45 ` no-reply

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.