* [Qemu-devel] [PATCH v3 0/6] trace: [tcg] Optimize per-vCPU tracing states with separate TB caches
From: Lluís Vilanova @ 2016-12-22 18:35 UTC
  To: qemu-devel; +Cc: Eric Blake, Eduardo Habkost, Stefan Hajnoczi

Optimizes tracing of events with the 'tcg' and 'vcpu' properties (e.g., memory
accesses), making it feasible to statically enable them by default on all QEMU
builds.

Some quick'n'dirty numbers with 400.perlbench (SPECcpu2006) on the train input
(medium size - suns.pl) and the guest_mem_before event:

* vanilla, statically disabled
real	0m5,827s
user	0m5,800s
sys	0m0,024s

* vanilla, statically enabled (overhead: 2.35x)
real	0m13,696s
user	0m13,684s
sys	0m0,008s

* multi-tb, statically disabled (overhead: 1.09x)
real	0m6,383s
user	0m6,352s
sys	0m0,028s

* multi-tb, statically enabled (overhead: 1.11x)
real	0m6,493s
user	0m6,468s
sys	0m0,020s


Right now, events with the 'tcg' property always generate TCG code to trace that
event at guest code execution time, where the event's dynamic state is checked.

This series adds a performance optimization where TCG code for events with the
'tcg' and 'vcpu' properties is not generated if the event is dynamically
disabled. This optimization raises two issues:

* An event can be dynamically disabled/enabled after the corresponding TCG code
  has been generated (i.e., a new TB with the corresponding code should be
  used).

* Each vCPU can have a different dynamic state for the same event (e.g.,
  tracing the memory accesses of only one process pinned to a vCPU).

To handle both issues, this series replicates the shared physical TB cache,
creating a separate physical TB cache for every combination of event states
(those with the 'vcpu' and 'tcg' properties). Then, all vCPUs tracing the same
events will use the same physical TB cache.

Sharing physical TBs makes this very space efficient (only the physical TB
caches, simple arrays of pointers, are replicated). Sharing physical TB caches
maximizes TB reuse across vCPUs whenever possible, and makes dynamic event state
changes more efficient (simply use a different physical TB cache).

The physical TB cache array is indexed with the vCPU's trace event state
bitmask. This is simpler and more efficient than emitting TCG code to check
whether an event needs tracing, where we would still have to either move the
tracing call to a cold path (making tracing performance worse) or leave it
inlined (making non-tracing performance worse).
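
For illustration, the indexing scheme boils down to the following
self-contained sketch (names and sizes are made up for the example; the
series itself uses QEMU's qht as the cache type and stores the bitmask per
vCPU):

    #include <assert.h>
    #include <stdio.h>

    #define NR_VCPU_TCG_EVENTS 3            /* illustrative value */

    struct tb_cache { int dummy; };         /* stand-in for struct qht */

    /* one physical cache per combination of event states: 2^E entries */
    static struct tb_cache caches[1u << NR_VCPU_TCG_EVENTS];

    /* the event-state bitmask is itself the array index */
    static struct tb_cache *cache_for(unsigned long event_state)
    {
        assert(event_state < (1u << NR_VCPU_TCG_EVENTS));
        return &caches[event_state];
    }

    int main(void)
    {
        /* a vCPU tracing events 0 and 2 uses cache 0b101 == 5 */
        printf("%td\n", cache_for(0x5) - caches);
        return 0;
    }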

This solution is also more efficient than eliding TCG code only when *zero*
vCPUs are tracing an event, since enabling it on a single vCPU will impact the
performance of all other vCPUs that are not tracing that event.

Note on overheads: I suspect the culprit of the 1.15x overhead lies in the
double dereference used to select the current physical TB cache.

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
---

Changes in v3
=============

* Rebase on 0737f32daf.
* Do not use reserved symbol prefixes ("__") [Stefan Hajnoczi].
* Refactor trace_get_vcpu_event_count() to be inlinable.
* Optimize cpu_tb_cache_set_requested() (hottest path).


Changes in v2
=============

* Fix bitmap copy in cpu_tb_cache_set_apply().
* Split generated code re-alignment into a separate patch [Daniel P. Berrange].


Lluís Vilanova (6):
      exec: [tcg] Refactor flush of per-CPU virtual TB cache
      trace: Make trace_get_vcpu_event_count() inlinable
      exec: [tcg] Use multiple physical TB caches
      exec: [tcg] Switch physical TB cache based on vCPU tracing state
      trace: [tcg] Do not generate TCG code to trace dynamically-disabled events
      trace: [tcg,trivial] Re-align generated code


 cpu-exec.c                               |   11 +++-
 cputlb.c                                 |    2 -
 include/exec/exec-all.h                  |   12 ++++
 include/exec/tb-context.h                |    2 -
 include/qom/cpu.h                        |    3 +
 qom/cpu.c                                |    3 +
 scripts/tracetool/__init__.py            |    1 
 scripts/tracetool/backend/dtrace.py      |    2 -
 scripts/tracetool/backend/ftrace.py      |   20 +++----
 scripts/tracetool/backend/log.py         |   16 +++--
 scripts/tracetool/backend/simple.py      |    2 -
 scripts/tracetool/backend/syslog.py      |    6 +-
 scripts/tracetool/backend/ust.py         |    2 -
 scripts/tracetool/format/h.py            |   24 ++++++--
 scripts/tracetool/format/tcg_h.py        |   19 +++++-
 scripts/tracetool/format/tcg_helper_c.py |    3 +
 trace/control-internal.h                 |    5 ++
 trace/control-target.c                   |    1 
 trace/control.c                          |    9 +--
 trace/control.h                          |    5 +-
 translate-all.c                          |   92 +++++++++++++++++++++++++-----
 translate-all.h                          |   43 ++++++++++++++
 translate-all.inc.h                      |   13 ++++
 23 files changed, 237 insertions(+), 59 deletions(-)
 create mode 100644 translate-all.inc.h


To: qemu-devel@nongnu.org
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Eric Blake <eblake@redhat.com>


* [Qemu-devel] [PATCH v3 1/6] exec: [tcg] Refactor flush of per-CPU virtual TB cache
From: Lluís Vilanova @ 2016-12-22 18:35 UTC
  To: qemu-devel
  Cc: Eric Blake, Eduardo Habkost, Stefan Hajnoczi, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

The function is reused in later patches.

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
---
 cputlb.c                |    2 +-
 include/exec/exec-all.h |    6 ++++++
 translate-all.c         |   14 +++++++++-----
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 813279f3bc..9bf9960e1b 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -80,7 +80,7 @@ void tlb_flush(CPUState *cpu, int flush_global)
 
     memset(env->tlb_table, -1, sizeof(env->tlb_table));
     memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
-    memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
+    tb_flush_jmp_cache_all(cpu);
 
     env->vtlb_index = 0;
     env->tlb_flush_addr = -1;
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index a8c13cee66..57cd978578 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -256,6 +256,12 @@ struct TranslationBlock {
 };
 
 void tb_free(TranslationBlock *tb);
+/**
+ * tb_flush_jmp_cache_all:
+ *
+ * Flush the virtual translation block cache.
+ */
+void tb_flush_jmp_cache_all(CPUState *cpu);
 void tb_flush(CPUState *cpu);
 void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
 
diff --git a/translate-all.c b/translate-all.c
index 3dd9214904..29ccb9e546 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -941,11 +941,7 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     }
 
     CPU_FOREACH(cpu) {
-        int i;
-
-        for (i = 0; i < TB_JMP_CACHE_SIZE; ++i) {
-            atomic_set(&cpu->tb_jmp_cache[i], NULL);
-        }
+        tb_flush_jmp_cache_all(cpu);
     }
 
     tcg_ctx.tb_ctx.nb_tbs = 0;
@@ -1741,6 +1737,14 @@ void tb_check_watchpoint(CPUState *cpu)
     }
 }
 
+void tb_flush_jmp_cache_all(CPUState *cpu)
+{
+    int i;
+    for (i = 0; i < TB_JMP_CACHE_SIZE; ++i) {
+        atomic_set(&cpu->tb_jmp_cache[i], NULL);
+    }
+}
+
 #ifndef CONFIG_USER_ONLY
 /* in deterministic execution mode, instructions doing device I/Os
    must be at the end of the TB */


* [Qemu-devel] [PATCH v3 2/6] trace: Make trace_get_vcpu_event_count() inlinable
From: Lluís Vilanova @ 2016-12-22 18:35 UTC
  To: qemu-devel; +Cc: Eric Blake, Eduardo Habkost, Stefan Hajnoczi

Later patches will make use of it.

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
---
 trace/control-internal.h |    5 +++++
 trace/control.c          |    9 ++-------
 trace/control.h          |    2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/trace/control-internal.h b/trace/control-internal.h
index a9d395a587..beb98a0d2c 100644
--- a/trace/control-internal.h
+++ b/trace/control-internal.h
@@ -16,6 +16,7 @@
 
 
 extern int trace_events_enabled_count;
+extern uint32_t trace_next_vcpu_id;
 
 
 static inline bool trace_event_is_pattern(const char *str)
@@ -82,6 +83,10 @@ static inline bool trace_event_get_vcpu_state_dynamic(CPUState *vcpu,
     return trace_event_get_vcpu_state_dynamic_by_vcpu_id(vcpu, vcpu_id);
 }
 
+static inline uint32_t trace_get_vcpu_event_count(void)
+{
+    return trace_next_vcpu_id;
+}
 
 void trace_event_register_group(TraceEvent **events);
 
diff --git a/trace/control.c b/trace/control.c
index 1a7bee6ddc..52d0e343fa 100644
--- a/trace/control.c
+++ b/trace/control.c
@@ -36,7 +36,7 @@ typedef struct TraceEventGroup {
 static TraceEventGroup *event_groups;
 static size_t nevent_groups;
 static uint32_t next_id;
-static uint32_t next_vcpu_id;
+uint32_t trace_next_vcpu_id;
 
 QemuOptsList qemu_trace_opts = {
     .name = "trace",
@@ -65,7 +65,7 @@ void trace_event_register_group(TraceEvent **events)
     for (i = 0; events[i] != NULL; i++) {
         events[i]->id = next_id++;
         if (events[i]->vcpu_id != TRACE_VCPU_EVENT_NONE) {
-            events[i]->vcpu_id = next_vcpu_id++;
+            events[i]->vcpu_id = trace_next_vcpu_id++;
         }
     }
     event_groups = g_renew(TraceEventGroup, event_groups, nevent_groups + 1);
@@ -299,8 +299,3 @@ char *trace_opt_parse(const char *optarg)
 
     return trace_file;
 }
-
-uint32_t trace_get_vcpu_event_count(void)
-{
-    return next_vcpu_id;
-}
diff --git a/trace/control.h b/trace/control.h
index ccaeac8552..80d326c4d1 100644
--- a/trace/control.h
+++ b/trace/control.h
@@ -237,7 +237,7 @@ char *trace_opt_parse(const char *optarg);
  *
  * Return the number of known vcpu-specific events
  */
-uint32_t trace_get_vcpu_event_count(void);
+static uint32_t trace_get_vcpu_event_count(void);
 
 
 #include "trace/control-internal.h"


* [Qemu-devel] [PATCH v3 3/6] exec: [tcg] Use multiple physical TB caches
From: Lluís Vilanova @ 2016-12-22 18:35 UTC
  To: qemu-devel
  Cc: Eric Blake, Eduardo Habkost, Stefan Hajnoczi, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

The physical TB cache is split into 2^E caches, where E is the number of
events with the "vcpu" and without the "disable" properties.

The virtual TB cache on each vCPU uses a (potentially) different
physical TB cache.

This is later exploited to support different tracing event states on a
per-vCPU basis.

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
---
 cpu-exec.c                |    5 +++-
 include/exec/exec-all.h   |    6 +++++
 include/exec/tb-context.h |    2 +-
 include/qom/cpu.h         |    2 ++
 qom/cpu.c                 |    2 ++
 translate-all.c           |   54 ++++++++++++++++++++++++++++++++++++++-------
 translate-all.h           |   17 ++++++++++++++
 translate-all.inc.h       |   13 +++++++++++
 8 files changed, 90 insertions(+), 11 deletions(-)
 create mode 100644 translate-all.inc.h

diff --git a/cpu-exec.c b/cpu-exec.c
index 4188fed3c6..a3d9eee17e 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -33,6 +33,7 @@
 #include "hw/i386/apic.h"
 #endif
 #include "sysemu/replay.h"
+#include "translate-all.h"
 
 /* -icount align implementation. */
 
@@ -298,6 +299,7 @@ static TranslationBlock *tb_htable_lookup(CPUState *cpu,
     tb_page_addr_t phys_pc;
     struct tb_desc desc;
     uint32_t h;
+    struct qht *qht;
 
     desc.env = (CPUArchState *)cpu->env_ptr;
     desc.cs_base = cs_base;
@@ -306,7 +308,8 @@ static TranslationBlock *tb_htable_lookup(CPUState *cpu,
     phys_pc = get_page_addr_code(desc.env, pc);
     desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
     h = tb_hash_func(phys_pc, pc, flags);
-    return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
+    qht = tb_caches_get(&tcg_ctx.tb_ctx, cpu->tb_cache_idx);
+    return qht_lookup(qht, tb_cmp, &desc, h);
 }
 
 static inline TranslationBlock *tb_find(CPUState *cpu,
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 57cd978578..feec0f2545 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -200,6 +200,10 @@ static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
 #define USE_DIRECT_JUMP
 #endif
 
+/**
+ * TranslationBlock:
+ * @tb_cache_idx: Index of physical TB cache where this TB has been allocated.
+ */
 struct TranslationBlock {
     target_ulong pc;   /* simulated PC corresponding to this block (EIP + CS base) */
     target_ulong cs_base; /* CS base for this block */
@@ -253,6 +257,8 @@ struct TranslationBlock {
      */
     uintptr_t jmp_list_next[2];
     uintptr_t jmp_list_first;
+
+    unsigned long *tb_cache_idx;
 };
 
 void tb_free(TranslationBlock *tb);
diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
index c7f17f26e0..f6a2b356e6 100644
--- a/include/exec/tb-context.h
+++ b/include/exec/tb-context.h
@@ -32,7 +32,7 @@ typedef struct TBContext TBContext;
 struct TBContext {
 
     TranslationBlock *tbs;
-    struct qht htable;
+    struct qht *htables;
     int nb_tbs;
     /* any access to the tbs or the page table must use this lock */
     QemuMutex tb_lock;
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 3f79a8e955..486872b752 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -295,6 +295,7 @@ struct qemu_work_item;
  * @kvm_fd: vCPU file descriptor for KVM.
  * @work_mutex: Lock to prevent multiple access to queued_work_*.
  * @queued_work_first: First asynchronous work pending.
+ * @tb_cache_idx: Index of current TB cache.
  * @trace_dstate: Dynamic tracing state of events for this vCPU (bitmask).
  *
  * State of one CPU core or thread.
@@ -370,6 +371,7 @@ struct CPUState {
      * Dynamically allocated based on bitmap required to hold up to
      * trace_get_vcpu_event_count() entries.
      */
+    unsigned long *tb_cache_idx;
     unsigned long *trace_dstate;
 
     /* TODO Move common fields from CPUArchState here. */
diff --git a/qom/cpu.c b/qom/cpu.c
index 03d9190f8c..8c702b7818 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -367,6 +367,7 @@ static void cpu_common_initfn(Object *obj)
     QTAILQ_INIT(&cpu->breakpoints);
     QTAILQ_INIT(&cpu->watchpoints);
 
+    cpu->tb_cache_idx = bitmap_new(trace_get_vcpu_event_count());
     cpu->trace_dstate = bitmap_new(trace_get_vcpu_event_count());
 
     cpu_exec_initfn(cpu);
@@ -376,6 +377,7 @@ static void cpu_common_finalize(Object *obj)
 {
     CPUState *cpu = CPU(obj);
     g_free(cpu->trace_dstate);
+    g_free(cpu->tb_cache_idx);
 }
 
 static int64_t cpu_common_get_arch_id(CPUState *cpu)
diff --git a/translate-all.c b/translate-all.c
index 29ccb9e546..1051ec6271 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -53,6 +53,7 @@
 #include "exec/cputlb.h"
 #include "exec/tb-hash.h"
 #include "translate-all.h"
+#include "qemu/error-report.h"
 #include "qemu/bitmap.h"
 #include "qemu/timer.h"
 #include "exec/log.h"
@@ -811,9 +812,19 @@ static inline void code_gen_alloc(size_t tb_size)
 
 static void tb_htable_init(void)
 {
+    int cache;
     unsigned int mode = QHT_MODE_AUTO_RESIZE;
 
-    qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
+    if (tb_caches_count() > ULONG_MAX) {
+        /* Ensure bitmaps can be used as indexes */
+        error_report("too many 'vcpu' events to index TB caches");
+    }
+
+    tcg_ctx.tb_ctx.htables = g_malloc(
+        sizeof(tcg_ctx.tb_ctx.htables[0]) * tb_caches_count());
+    for (cache = 0; cache < tb_caches_count(); cache++) {
+        qht_init(&tcg_ctx.tb_ctx.htables[cache], CODE_GEN_HTABLE_SIZE, mode);
+    }
 }
 
 /* Must be called before using the QEMU cpus. 'tb_size' is the size
@@ -856,6 +867,7 @@ static TranslationBlock *tb_alloc(target_ulong pc)
     tb->pc = pc;
     tb->cflags = 0;
     tb->invalid = false;
+    tb->tb_cache_idx = bitmap_new(trace_get_vcpu_event_count());
     return tb;
 }
 
@@ -872,6 +884,8 @@ void tb_free(TranslationBlock *tb)
         tcg_ctx.code_gen_ptr = tb->tc_ptr;
         tcg_ctx.tb_ctx.nb_tbs--;
     }
+
+    g_free(tb->tb_cache_idx);
 }
 
 static inline void invalidate_page_bitmap(PageDesc *p)
@@ -919,6 +933,8 @@ static void page_flush_tb(void)
 /* flush all the translation blocks */
 static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
 {
+    int i;
+
     tb_lock();
 
     /* If it has already been done on request of another CPU,
@@ -945,7 +961,9 @@ static void do_tb_flush(CPUState *cpu, run_on_cpu_data tb_flush_count)
     }
 
     tcg_ctx.tb_ctx.nb_tbs = 0;
-    qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
+    for (i = 0; i < tb_caches_count(); i++) {
+        qht_reset_size(&tcg_ctx.tb_ctx.htables[i], CODE_GEN_HTABLE_SIZE);
+    }
     page_flush_tb();
 
     tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
@@ -987,8 +1005,12 @@ do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
  */
 static void tb_invalidate_check(target_ulong address)
 {
+    int i;
+
     address &= TARGET_PAGE_MASK;
-    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address);
+    for (i = 0; i < tb_caches_count(); i++) {
+        qht_iter(&tcg_ctx.tb_ctx.htables[i], do_tb_invalidate_check, &address);
+    }
 }
 
 static void
@@ -1008,7 +1030,10 @@ do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
 /* verify that all the pages have correct rights for code */
 static void tb_page_check(void)
 {
-    qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
+    int i;
+    for (i = 0; i < tb_caches_count(); i++) {
+        qht_iter(&tcg_ctx.tb_ctx.htables[i], do_tb_page_check, NULL);
+    }
 }
 
 #endif
@@ -1098,6 +1123,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     CPUState *cpu;
     PageDesc *p;
     uint32_t h;
+    struct qht *qht;
     tb_page_addr_t phys_pc;
 
     assert_tb_lock();
@@ -1107,7 +1133,8 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
     h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
+    qht = tb_caches_get(&tcg_ctx.tb_ctx, tb->tb_cache_idx);
+    qht_remove(qht, tb, h);
 
     /* remove the TB from the page list */
     if (tb->page_addr[0] != page_addr) {
@@ -1239,6 +1266,7 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
                          tb_page_addr_t phys_page2)
 {
     uint32_t h;
+    struct qht *qht;
 
     assert_memory_lock();
 
@@ -1252,7 +1280,8 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
 
     /* add in the hash table */
     h = tb_hash_func(phys_pc, tb->pc, tb->flags);
-    qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
+    qht = tb_caches_get(&tcg_ctx.tb_ctx, tb->tb_cache_idx);
+    qht_insert(qht, tb, h);
 
 #ifdef DEBUG_TB_CHECK
     tb_page_check();
@@ -1294,6 +1323,8 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tb->cs_base = cs_base;
     tb->flags = flags;
     tb->cflags = cflags;
+    bitmap_copy(tb->tb_cache_idx, ENV_GET_CPU(env)->tb_cache_idx,
+                trace_get_vcpu_event_count());
 
 #ifdef CONFIG_PROFILER
     tcg_ctx.tb_count1++; /* includes aborted translations because of
@@ -1798,6 +1829,8 @@ void cpu_io_recompile(CPUState *cpu, uintptr_t retaddr)
     pc = tb->pc;
     cs_base = tb->cs_base;
     flags = tb->flags;
+    /* XXX: It is OK to invalidate only this TB, as this is the one triggering
+     * the memory access */
     tb_phys_invalidate(tb, -1);
     if (tb->cflags & CF_NOCACHE) {
         if (tb->orig_tb) {
@@ -1882,6 +1915,7 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     int direct_jmp_count, direct_jmp2_count, cross_page;
     TranslationBlock *tb;
     struct qht_stats hst;
+    int cache;
 
     tb_lock();
 
@@ -1935,9 +1969,11 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
                 tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp2_count * 100) /
                         tcg_ctx.tb_ctx.nb_tbs : 0);
 
-    qht_statistics_init(&tcg_ctx.tb_ctx.htable, &hst);
-    print_qht_statistics(f, cpu_fprintf, hst);
-    qht_statistics_destroy(&hst);
+    for (cache = 0; cache < tb_caches_count(); cache++) {
+        qht_statistics_init(&tcg_ctx.tb_ctx.htables[cache], &hst);
+        print_qht_statistics(f, cpu_fprintf, hst);
+        qht_statistics_destroy(&hst);
+    }
 
     cpu_fprintf(f, "\nStatistics:\n");
     cpu_fprintf(f, "TB flush count      %u\n",
diff --git a/translate-all.h b/translate-all.h
index ba8e4d63c4..d39bf325d9 100644
--- a/translate-all.h
+++ b/translate-all.h
@@ -20,7 +20,21 @@
 #define TRANSLATE_ALL_H
 
 #include "exec/exec-all.h"
+#include "qemu/typedefs.h"
 
+/**
+ * tb_caches_count:
+ *
+ * Number of TB caches.
+ */
+static size_t tb_caches_count(void);
+
+/**
+ * tb_caches_get:
+ *
+ * Get the TB cache for the given bitmap index.
+ */
+static struct qht *tb_caches_get(TBContext *tb_ctx, unsigned long *bitmap);
 
 /* translate-all.c */
 void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len);
@@ -33,4 +47,7 @@ void tb_check_watchpoint(CPUState *cpu);
 int page_unprotect(target_ulong address, uintptr_t pc);
 #endif
 
+
+#include "translate-all.inc.h"
+
 #endif /* TRANSLATE_ALL_H */
diff --git a/translate-all.inc.h b/translate-all.inc.h
new file mode 100644
index 0000000000..f52627cfd6
--- /dev/null
+++ b/translate-all.inc.h
@@ -0,0 +1,13 @@
+/* Inline implementations for translate-all.h */
+
+static inline size_t tb_caches_count(void)
+{
+    return 1ULL << trace_get_vcpu_event_count();
+}
+
+static inline struct qht *tb_caches_get(TBContext *tb_ctx,
+                                        unsigned long *bitmap)
+{
+    unsigned long idx = *bitmap;
+    return &tb_ctx->htables[idx];
+}


* [Qemu-devel] [PATCH v3 4/6] exec: [tcg] Switch physical TB cache based on vCPU tracing state
From: Lluís Vilanova @ 2016-12-22 18:35 UTC
  To: qemu-devel
  Cc: Eric Blake, Eduardo Habkost, Stefan Hajnoczi, Paolo Bonzini,
	Peter Crosthwaite, Richard Henderson

Uses the per-vCPU event state in CPUState->trace_dstate (a bitmap) as an
index to a physical TB cache that will contain code specific to the set
of dynamically enabled events.

Two vCPUs tracing different events will execute code from different
physical TB caches. Two vCPUs tracing the same events will execute code
from the same physical TB cache.

This is used in the next patch to optimize TCG code related to event
tracing.

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
---
 cpu-exec.c             |    6 ++++++
 include/qom/cpu.h      |    1 +
 qom/cpu.c              |    1 +
 trace/control-target.c |    1 +
 trace/control.h        |    3 +++
 translate-all.c        |   24 ++++++++++++++++++++++++
 translate-all.h        |   26 ++++++++++++++++++++++++++
 7 files changed, 62 insertions(+)

diff --git a/cpu-exec.c b/cpu-exec.c
index a3d9eee17e..3a18b2fa68 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -29,6 +29,7 @@
 #include "qemu/rcu.h"
 #include "exec/tb-hash.h"
 #include "exec/log.h"
+#include "translate-all.h"
 #if defined(TARGET_I386) && !defined(CONFIG_USER_ONLY)
 #include "hw/i386/apic.h"
 #endif
@@ -526,6 +527,11 @@ static inline void cpu_handle_interrupt(CPUState *cpu,
             *last_tb = NULL;
         }
     }
+    if (unlikely(cpu_tb_cache_set_requested(cpu))) {
+        cpu_tb_cache_set_apply(cpu);
+        /* avoid chaining TBs across physical TB caches */
+        *last_tb = NULL;
+    }
     if (unlikely(atomic_read(&cpu->exit_request) || replay_has_interrupt())) {
         atomic_set(&cpu->exit_request, 0);
         cpu->exception_index = EXCP_INTERRUPT;
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 486872b752..910f326cbe 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -371,6 +371,7 @@ struct CPUState {
      * Dynamically allocated based on bitmap required to hold up to
      * trace_get_vcpu_event_count() entries.
      */
+    bool tb_cache_idx_req;
     unsigned long *tb_cache_idx;
     unsigned long *trace_dstate;
 
diff --git a/qom/cpu.c b/qom/cpu.c
index 8c702b7818..a40ce45242 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -367,6 +367,7 @@ static void cpu_common_initfn(Object *obj)
     QTAILQ_INIT(&cpu->breakpoints);
     QTAILQ_INIT(&cpu->watchpoints);
 
+    cpu->tb_cache_idx_req = false;
     cpu->tb_cache_idx = bitmap_new(trace_get_vcpu_event_count());
     cpu->trace_dstate = bitmap_new(trace_get_vcpu_event_count());
 
diff --git a/trace/control-target.c b/trace/control-target.c
index 7ebf6e0bcb..ecae94dc0a 100644
--- a/trace/control-target.c
+++ b/trace/control-target.c
@@ -76,6 +76,7 @@ void trace_event_set_vcpu_state_dynamic(CPUState *vcpu,
             clear_bit(vcpu_id, vcpu->trace_dstate);
             (*ev->dstate)--;
         }
+        cpu_tb_cache_set_request(vcpu);
     }
 }
 
diff --git a/trace/control.h b/trace/control.h
index 80d326c4d1..cab84a0308 100644
--- a/trace/control.h
+++ b/trace/control.h
@@ -165,6 +165,9 @@ void trace_event_set_state_dynamic(TraceEvent *ev, bool state);
  * Set the dynamic tracing state of an event for the given vCPU.
  *
  * Pre-condition: trace_event_get_vcpu_state_static(ev) == true
+ *
+ * Note: Changes for execution-time events with the 'tcg' property will not be
+ *       propagated until the next TB is executed (when executing in TCG mode).
  */
 void trace_event_set_vcpu_state_dynamic(CPUState *vcpu,
                                         TraceEvent *ev, bool state);
diff --git a/translate-all.c b/translate-all.c
index 1051ec6271..e9ebd93ece 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1288,6 +1288,30 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
 #endif
 }
 
+void cpu_tb_cache_set_request(CPUState *cpu)
+{
+    /*
+     * Request is taken from cpu->trace_dstate and lazily applied into
+     * cpu->tb_cache_idx at cpu_tb_cache_set_apply().
+     */
+    /* NOTE: Checked by all TBs in gen_tb_start(). */
+    atomic_set(&cpu->tb_cache_idx_req, true);
+    atomic_set(&cpu->tcg_exit_req, 1);
+}
+
+bool cpu_tb_cache_set_requested(CPUState *cpu)
+{
+    return cpu->tb_cache_idx_req;
+}
+
+void cpu_tb_cache_set_apply(CPUState *cpu)
+{
+    cpu->tb_cache_idx_req = false;
+    bitmap_copy(cpu->tb_cache_idx, cpu->trace_dstate,
+                trace_get_vcpu_event_count());
+    tb_flush_jmp_cache_all(cpu);
+}
+
 /* Called with mmap_lock held for user mode emulation.  */
 TranslationBlock *tb_gen_code(CPUState *cpu,
                               target_ulong pc, target_ulong cs_base,
diff --git a/translate-all.h b/translate-all.h
index d39bf325d9..fcc7fb04fc 100644
--- a/translate-all.h
+++ b/translate-all.h
@@ -36,6 +36,32 @@ static size_t tb_caches_count(void);
  */
 static struct qht *tb_caches_get(TBContext *tb_ctx, unsigned long *bitmap);
 
+/**
+ * cpu_tb_cache_set_request:
+ *
+ * Request a physical TB cache switch on this @cpu.
+ */
+void cpu_tb_cache_set_request(CPUState *cpu);
+
+/**
+ * cpu_tb_cache_set_requested:
+ *
+ * Returns: %true if @cpu requested a physical TB cache switch, %false
+ *          otherwise.
+ */
+bool cpu_tb_cache_set_requested(CPUState *cpu);
+
+/**
+ * cpu_tb_cache_set_apply:
+ *
+ * Apply a physical TB cache switch.
+ *
+ * Precondition: @cpu is not currently executing any TB.
+ *
+ * Note: Invalidates the jump cache of the given vCPU.
+ */
+void cpu_tb_cache_set_apply(CPUState *cpu);
+
 /* translate-all.c */
 void tb_invalidate_phys_page_fast(tb_page_addr_t start, int len);
 void tb_invalidate_phys_page_range(tb_page_addr_t start, tb_page_addr_t end,


* [Qemu-devel] [PATCH v3 5/6] trace: [tcg] Do not generate TCG code to trace dynamically-disabled events
From: Lluís Vilanova @ 2016-12-22 18:36 UTC
  To: qemu-devel; +Cc: Eric Blake, Eduardo Habkost, Stefan Hajnoczi

If an event is dynamically disabled, the TCG code that calls the
execution-time tracer is not generated.

This removes the overhead of execution-time tracers for dynamically disabled
events. As a bonus, it also avoids checking the event state when the
execution-time tracer is called from TCG-generated code (if the event were
disabled, TCG would simply not have generated the call).
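
To give a feel for the result, the generated tracers are now split roughly
as in this sketch (event "foo" and its argument are invented; the real code
is emitted by tracetool and relies on QEMU's trace/control.h):

    /* unconditional tracer: no state check; TCG helpers call this */
    static inline void _nocheck__trace_foo(int x)
    {
        /* backend-specific tracing code is generated here */
    }

    /* checking wrapper: what ordinary C callers keep using */
    static inline void trace_foo(int x)
    {
        if (trace_event_get_state(TRACE_FOO)) {
            _nocheck__trace_foo(x);
        }
    }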

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
---
 scripts/tracetool/__init__.py            |    1 +
 scripts/tracetool/format/h.py            |   24 ++++++++++++++++++------
 scripts/tracetool/format/tcg_h.py        |   19 ++++++++++++++++---
 scripts/tracetool/format/tcg_helper_c.py |    3 ++-
 4 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/scripts/tracetool/__init__.py b/scripts/tracetool/__init__.py
index 365446fa53..63168ccdf0 100644
--- a/scripts/tracetool/__init__.py
+++ b/scripts/tracetool/__init__.py
@@ -264,6 +264,7 @@ class Event(object):
         return self._FMT.findall(self.fmt)
 
     QEMU_TRACE               = "trace_%(name)s"
+    QEMU_TRACE_NOCHECK       = "_nocheck__" + QEMU_TRACE
     QEMU_TRACE_TCG           = QEMU_TRACE + "_tcg"
     QEMU_DSTATE              = "_TRACE_%(NAME)s_DSTATE"
     QEMU_EVENT               = "_TRACE_%(NAME)s_EVENT"
diff --git a/scripts/tracetool/format/h.py b/scripts/tracetool/format/h.py
index 3682f4e6a8..a78e50ef35 100644
--- a/scripts/tracetool/format/h.py
+++ b/scripts/tracetool/format/h.py
@@ -49,6 +49,19 @@ def generate(events, backend, group):
     backend.generate_begin(events, group)
 
     for e in events:
+        # tracer without checks
+        out('',
+            'static inline void %(api)s(%(args)s)',
+            '{',
+            api=e.api(e.QEMU_TRACE_NOCHECK),
+            args=e.args)
+
+        if "disable" not in e.properties:
+            backend.generate(e, group)
+
+        out('}')
+
+        # tracer wrapper with checks (per-vCPU tracing)
         if "vcpu" in e.properties:
             trace_cpu = next(iter(e.args))[1]
             cond = "trace_event_get_vcpu_state(%(cpu)s,"\
@@ -63,16 +76,15 @@ def generate(events, backend, group):
             'static inline void %(api)s(%(args)s)',
             '{',
             '    if (%(cond)s) {',
+            '        %(api_nocheck)s(%(names)s);',
+            '    }',
+            '}',
             api=e.api(),
+            api_nocheck=e.api(e.QEMU_TRACE_NOCHECK),
             args=e.args,
+            names=", ".join(e.args.names()),
             cond=cond)
 
-        if "disable" not in e.properties:
-            backend.generate(e, group)
-
-        out('    }',
-            '}')
-
     backend.generate_end(events, group)
 
     out('#endif /* TRACE_%s_GENERATED_TRACERS_H */' % group.upper())
diff --git a/scripts/tracetool/format/tcg_h.py b/scripts/tracetool/format/tcg_h.py
index 5f213f6cba..71b5c09432 100644
--- a/scripts/tracetool/format/tcg_h.py
+++ b/scripts/tracetool/format/tcg_h.py
@@ -41,7 +41,7 @@ def generate(events, backend, group):
 
     for e in events:
         # just keep one of them
-        if "tcg-trans" not in e.properties:
+        if "tcg-exec" not in e.properties:
             continue
 
         out('static inline void %(name_tcg)s(%(args)s)',
@@ -53,12 +53,25 @@ def generate(events, backend, group):
             args_trans = e.original.event_trans.args
             args_exec = tracetool.vcpu.transform_args(
                 "tcg_helper_c", e.original.event_exec, "wrapper")
+            if "vcpu" in e.properties:
+                trace_cpu = e.args.names()[0]
+                cond = "trace_event_get_vcpu_state(%(cpu)s,"\
+                       " TRACE_%(id)s)"\
+                       % dict(
+                           cpu=trace_cpu,
+                           id=e.original.event_exec.name.upper())
+            else:
+                cond = "true"
+
             out('    %(name_trans)s(%(argnames_trans)s);',
-                '    gen_helper_%(name_exec)s(%(argnames_exec)s);',
+                '    if (%(cond)s) {',
+                '        gen_helper_%(name_exec)s(%(argnames_exec)s);',
+                '    }',
                 name_trans=e.original.event_trans.api(e.QEMU_TRACE),
                 name_exec=e.original.event_exec.api(e.QEMU_TRACE),
                 argnames_trans=", ".join(args_trans.names()),
-                argnames_exec=", ".join(args_exec.names()))
+                argnames_exec=", ".join(args_exec.names()),
+                cond=cond)
 
         out('}')
 
diff --git a/scripts/tracetool/format/tcg_helper_c.py b/scripts/tracetool/format/tcg_helper_c.py
index cc26e03008..c2a05d756c 100644
--- a/scripts/tracetool/format/tcg_helper_c.py
+++ b/scripts/tracetool/format/tcg_helper_c.py
@@ -66,10 +66,11 @@ def generate(events, backend, group):
 
         out('void %(name_tcg)s(%(args_api)s)',
             '{',
+            # NOTE: the check was already performed at TCG-generation time
             '    %(name)s(%(args_call)s);',
             '}',
             name_tcg="helper_%s_proxy" % e.api(),
-            name=e.api(),
+            name=e.api(e.QEMU_TRACE_NOCHECK),
             args_api=e_args_api,
             args_call=", ".join(e_args_call.casted()),
             )


* [Qemu-devel] [PATCH v3 6/6] trace: [tcg, trivial] Re-align generated code
From: Lluís Vilanova @ 2016-12-22 18:36 UTC
  To: qemu-devel
  Cc: Eric Blake, Eduardo Habkost, Stefan Hajnoczi, Michael Tokarev,
	Laurent Vivier, open list:Trivial patches

The last patch removed a nesting level in the generated code. Re-align all
code generated by the backends to 4-column indentation.

Signed-off-by: Lluís Vilanova <vilanova@ac.upc.edu>
---
 scripts/tracetool/backend/dtrace.py |    2 +-
 scripts/tracetool/backend/ftrace.py |   20 ++++++++++----------
 scripts/tracetool/backend/log.py    |   16 ++++++++--------
 scripts/tracetool/backend/simple.py |    2 +-
 scripts/tracetool/backend/syslog.py |    6 +++---
 scripts/tracetool/backend/ust.py    |    2 +-
 6 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/scripts/tracetool/backend/dtrace.py b/scripts/tracetool/backend/dtrace.py
index 79505c6b1a..b3a8645bf0 100644
--- a/scripts/tracetool/backend/dtrace.py
+++ b/scripts/tracetool/backend/dtrace.py
@@ -41,6 +41,6 @@ def generate_h_begin(events, group):
 
 
 def generate_h(event, group):
-    out('        QEMU_%(uppername)s(%(argnames)s);',
+    out('    QEMU_%(uppername)s(%(argnames)s);',
         uppername=event.name.upper(),
         argnames=", ".join(event.args.names()))
diff --git a/scripts/tracetool/backend/ftrace.py b/scripts/tracetool/backend/ftrace.py
index db9fe7ad57..dd0eda4441 100644
--- a/scripts/tracetool/backend/ftrace.py
+++ b/scripts/tracetool/backend/ftrace.py
@@ -29,17 +29,17 @@ def generate_h(event, group):
     if len(event.args) > 0:
         argnames = ", " + argnames
 
-    out('        {',
-        '            char ftrace_buf[MAX_TRACE_STRLEN];',
-        '            int unused __attribute__ ((unused));',
-        '            int trlen;',
-        '            if (trace_event_get_state(%(event_id)s)) {',
-        '                trlen = snprintf(ftrace_buf, MAX_TRACE_STRLEN,',
-        '                                 "%(name)s " %(fmt)s "\\n" %(argnames)s);',
-        '                trlen = MIN(trlen, MAX_TRACE_STRLEN - 1);',
-        '                unused = write(trace_marker_fd, ftrace_buf, trlen);',
-        '            }',
+    out('    {',
+        '        char ftrace_buf[MAX_TRACE_STRLEN];',
+        '        int unused __attribute__ ((unused));',
+        '        int trlen;',
+        '        if (trace_event_get_state(%(event_id)s)) {',
+        '            trlen = snprintf(ftrace_buf, MAX_TRACE_STRLEN,',
+        '                             "%(name)s " %(fmt)s "\\n" %(argnames)s);',
+        '            trlen = MIN(trlen, MAX_TRACE_STRLEN - 1);',
+        '            unused = write(trace_marker_fd, ftrace_buf, trlen);',
         '        }',
+        '    }',
         name=event.name,
         args=event.args,
         event_id="TRACE_" + event.name.upper(),
diff --git a/scripts/tracetool/backend/log.py b/scripts/tracetool/backend/log.py
index 4f4a4d38b1..8fa8addfe7 100644
--- a/scripts/tracetool/backend/log.py
+++ b/scripts/tracetool/backend/log.py
@@ -35,14 +35,14 @@ def generate_h(event, group):
     else:
         cond = "trace_event_get_state(%s)" % ("TRACE_" + event.name.upper())
 
-    out('        if (%(cond)s) {',
-        '            struct timeval _now;',
-        '            gettimeofday(&_now, NULL);',
-        '            qemu_log_mask(LOG_TRACE, "%%d@%%zd.%%06zd:%(name)s " %(fmt)s "\\n",',
-        '                          getpid(),',
-        '                          (size_t)_now.tv_sec, (size_t)_now.tv_usec',
-        '                          %(argnames)s);',
-        '        }',
+    out('    if (%(cond)s) {',
+        '        struct timeval _now;',
+        '        gettimeofday(&_now, NULL);',
+        '        qemu_log_mask(LOG_TRACE, "%%d@%%zd.%%06zd:%(name)s " %(fmt)s "\\n",',
+        '                      getpid(),',
+        '                      (size_t)_now.tv_sec, (size_t)_now.tv_usec',
+        '                      %(argnames)s);',
+        '    }',
         cond=cond,
         name=event.name,
         fmt=event.fmt.rstrip("\n"),
diff --git a/scripts/tracetool/backend/simple.py b/scripts/tracetool/backend/simple.py
index 85f61028e2..a28460b1e4 100644
--- a/scripts/tracetool/backend/simple.py
+++ b/scripts/tracetool/backend/simple.py
@@ -37,7 +37,7 @@ def generate_h_begin(events, group):
 
 
 def generate_h(event, group):
-    out('        _simple_%(api)s(%(args)s);',
+    out('    _simple_%(api)s(%(args)s);',
         api=event.api(),
         args=", ".join(event.args.names()))
 
diff --git a/scripts/tracetool/backend/syslog.py b/scripts/tracetool/backend/syslog.py
index b8ff2790c4..1ce627f0fc 100644
--- a/scripts/tracetool/backend/syslog.py
+++ b/scripts/tracetool/backend/syslog.py
@@ -35,9 +35,9 @@ def generate_h(event, group):
     else:
         cond = "trace_event_get_state(%s)" % ("TRACE_" + event.name.upper())
 
-    out('        if (%(cond)s) {',
-        '            syslog(LOG_INFO, "%(name)s " %(fmt)s %(argnames)s);',
-        '        }',
+    out('    if (%(cond)s) {',
+        '        syslog(LOG_INFO, "%(name)s " %(fmt)s %(argnames)s);',
+        '    }',
         cond=cond,
         name=event.name,
         fmt=event.fmt.rstrip("\n"),
diff --git a/scripts/tracetool/backend/ust.py b/scripts/tracetool/backend/ust.py
index 4594db6128..2d289b2e3c 100644
--- a/scripts/tracetool/backend/ust.py
+++ b/scripts/tracetool/backend/ust.py
@@ -30,6 +30,6 @@ def generate_h(event, group):
     if len(event.args) > 0:
         argnames = ", " + argnames
 
-    out('        tracepoint(qemu, %(name)s%(tp_args)s);',
+    out('    tracepoint(qemu, %(name)s%(tp_args)s);',
         name=event.name,
         tp_args=argnames)

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH v3 0/6] trace: [tcg] Optimize per-vCPU tracing states with separate TB caches
From: Richard Henderson @ 2016-12-23 18:05 UTC
  To: Lluís Vilanova, qemu-devel; +Cc: Eduardo Habkost, Stefan Hajnoczi

On 12/22/2016 10:35 AM, Lluís Vilanova wrote:
> To handle both issues, this series replicates the shared physical TB cache,
> creating a separate physical TB cache for every combination of event states
> (those with the 'vcpu' and 'tcg' properties). Then, all vCPUs tracing the same
> events will use the same physical TB cache.

Why do we need to "split the physical TB cache" as opposed to simply including
the trace state into the TB hash function?


r~


* Re: [Qemu-devel] [PATCH v3 0/6] trace: [tcg] Optimize per-vCPU tracing states with separate TB caches
From: Lluís Vilanova @ 2016-12-23 18:51 UTC
  To: Richard Henderson; +Cc: qemu-devel, Eduardo Habkost, Stefan Hajnoczi

Richard Henderson writes:

> On 12/22/2016 10:35 AM, Lluís Vilanova wrote:
>> To handle both issues, this series replicates the shared physical TB cache,
>> creating a separate physical TB cache for every combination of event states
>> (those with the 'vcpu' and 'tcg' properties). Then, all vCPUs tracing the same
>> events will use the same physical TB cache.

> Why do we need to "split the physical TB cache" as opposed to simply including
> the trace state into the TB hash function?

Mmmm, that's an interesting alternative I did not consider. Are you aiming at
minimizing the changes, or do you also think it would be more efficient?

The dynamic tracing state would then be an arbitrarily long bitmap (defined by
the number of events with the 'vcpu' property), so I'm not sure how to fit it
into the hashing function with minimal collisions (the bitmap is now limited to
an unsigned long to use it as an index to the TB cache "matrix").

The other drawback I see is that then it would also take longer to compute the
hashing function, instead of the simpler array indexing. As a benefit, workloads
with a high frequency of TB-flushing operations might be a bit faster (there
would be a single QHT).

If someone can provide me the code for the modified hash lookup function to
account for the trace dstate bitmap contents, I will integrate it and measure if
there is any important change in performance.


Cheers,
  Lluis


* Re: [Qemu-devel] [PATCH v3 0/6] trace: [tcg] Optimize per-vCPU tracing states with separate TB caches
From: Richard Henderson @ 2016-12-23 20:09 UTC
  To: qemu-devel, Eduardo Habkost, Stefan Hajnoczi

[-- Attachment #1: Type: text/plain, Size: 2095 bytes --]

On 12/23/2016 10:51 AM, Lluís Vilanova wrote:
>> On 12/22/2016 10:35 AM, Lluís Vilanova wrote:
>>> To handle both issues, this series replicates the shared physical TB cache,
>>> creating a separate physical TB cache for every combination of event states
>>> (those with the 'vcpu' and 'tcg' properties). Then, all vCPUs tracing the same
>>> events will use the same physical TB cache.
> 
>> Why do we need to "split the physical TB cache" as opposed to simply including
>> the trace state into the TB hash function?
> 
> Mmmm, that's an interesting alternative I did not consider. Are you aiming at
> minimizing the changes, or do you also think it would be more efficient?

I suspect that it will be more efficient.

> The dynamic tracing state would then be an arbitrarily long bitmap (defined by
> the number of events with the 'vcpu' property), so I'm not sure how to fit it
> into the hashing function with minimal collisions (the bitmap is now limited to
> an unsigned long to use it as an index to the TB cache "matrix").

You could consider that index a unique identifier for the tracing state, and
then only compare and hash that integer.

> The other drawback I see is that then it would also take longer to compute the
> hashing function, instead of the simpler array indexing. As a benefit, workloads
> with a high frequency of TB-flushing operations might be a bit faster (there
> would be a single QHT).

I don't see adding one more integer to the hashing function to be significant
at all.  Certainly not the 15% that you describe in your cover letter.

> If someone can provide me the code for the modified hash lookup function to
> account for the trace dstate bitmap contents, I will integrate it and measure if
> there is any important change in performance.

Something like the following should do it.  There are two /* cpu->??? */
markers that would need to be filled in.

If you can reduce the tracing identifier to 8 bits, that would be excellent.
I've been wanting to make some other changes to TB hashing, and that would fit
in well with a second "flags" value.


r~

[-- Attachment #2: z --]
[-- Type: text/plain, Size: 6321 bytes --]

diff --git a/cpu-exec.c b/cpu-exec.c
index 4188fed..3b54317 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -261,6 +261,7 @@ struct tb_desc {
     CPUArchState *env;
     tb_page_addr_t phys_page1;
     uint32_t flags;
+    uint32_t traceid;
 };
 
 static bool tb_cmp(const void *p, const void *d)
@@ -272,6 +273,7 @@ static bool tb_cmp(const void *p, const void *d)
         tb->page_addr[0] == desc->phys_page1 &&
         tb->cs_base == desc->cs_base &&
         tb->flags == desc->flags &&
+        tb->traceid == desc->traceid &&
         !atomic_read(&tb->invalid)) {
         /* check next page if needed */
         if (tb->page_addr[1] == -1) {
@@ -293,7 +295,8 @@ static bool tb_cmp(const void *p, const void *d)
 static TranslationBlock *tb_htable_lookup(CPUState *cpu,
                                           target_ulong pc,
                                           target_ulong cs_base,
-                                          uint32_t flags)
+                                          uint32_t flags,
+                                          uint32_t traceid)
 {
     tb_page_addr_t phys_pc;
     struct tb_desc desc;
@@ -302,10 +305,11 @@ static TranslationBlock *tb_htable_lookup(CPUState *cpu,
     desc.env = (CPUArchState *)cpu->env_ptr;
     desc.cs_base = cs_base;
     desc.flags = flags;
+    desc.traceid = traceid;
     desc.pc = pc;
     phys_pc = get_page_addr_code(desc.env, pc);
     desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
-    h = tb_hash_func(phys_pc, pc, flags);
+    h = tb_hash_func(phys_pc, pc, flags, traceid);
     return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
 }
 
@@ -317,6 +321,7 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
     TranslationBlock *tb;
     target_ulong cs_base, pc;
     uint32_t flags;
+    uint32_t traceid = 0 /* cpu->??? */;
     bool have_tb_lock = false;
 
     /* we record a subset of the CPU state. It will
@@ -325,8 +330,8 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
     cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
     tb = atomic_rcu_read(&cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)]);
     if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
-                 tb->flags != flags)) {
-        tb = tb_htable_lookup(cpu, pc, cs_base, flags);
+                 tb->flags != flags || tb->traceid != traceid)) {
+        tb = tb_htable_lookup(cpu, pc, cs_base, flags, traceid);
         if (!tb) {
 
             /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
@@ -340,7 +345,7 @@ static inline TranslationBlock *tb_find(CPUState *cpu,
             /* There's a chance that our desired tb has been translated while
              * taking the locks so we check again inside the lock.
              */
-            tb = tb_htable_lookup(cpu, pc, cs_base, flags);
+            tb = tb_htable_lookup(cpu, pc, cs_base, flags, traceid);
             if (!tb) {
                 /* if no translated code available, then translate it now */
                 tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index a8c13ce..40e867d 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -215,6 +215,7 @@ struct TranslationBlock {
 #define CF_IGNORE_ICOUNT 0x40000 /* Do not generate icount code */
 
     uint16_t invalid;
+    uint16_t traceid; /* identifier for tracing state */
 
     void *tc_ptr;    /* pointer to the translated code */
     uint8_t *tc_search;  /* pointer to search data */
diff --git a/include/exec/tb-hash-xx.h b/include/exec/tb-hash-xx.h
index 2c40b5c..7873740 100644
--- a/include/exec/tb-hash-xx.h
+++ b/include/exec/tb-hash-xx.h
@@ -49,7 +49,7 @@
  * contiguous in memory.
  */
 static inline
-uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e)
+uint32_t tb_hash_func6(uint64_t a0, uint64_t b0, uint32_t e, uint32_t f)
 {
     uint32_t v1 = TB_HASH_XX_SEED + PRIME32_1 + PRIME32_2;
     uint32_t v2 = TB_HASH_XX_SEED + PRIME32_2;
@@ -83,6 +83,9 @@ uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e)
     h32 += e * PRIME32_3;
     h32  = rol32(h32, 17) * PRIME32_4;
 
+    h32 += f * PRIME32_3;
+    h32  = rol32(h32, 17) * PRIME32_4;
+
     h32 ^= h32 >> 15;
     h32 *= PRIME32_2;
     h32 ^= h32 >> 13;
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 2c27490..5664851 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -46,9 +46,10 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
 }
 
 static inline
-uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint32_t flags)
+uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc,
+                      uint32_t flags, uint32_t traceid)
 {
-    return tb_hash_func5(phys_pc, pc, flags);
+    return tb_hash_func6(phys_pc, pc, flags, traceid);
 }
 
 #endif
diff --git a/tests/qht-bench.c b/tests/qht-bench.c
index 2afa09d..11c1cec 100644
--- a/tests/qht-bench.c
+++ b/tests/qht-bench.c
@@ -103,7 +103,7 @@ static bool is_equal(const void *obj, const void *userp)
 
 static inline uint32_t h(unsigned long v)
 {
-    return tb_hash_func5(v, 0, 0);
+    return tb_hash_func6(v, 0, 0, 0);
 }
 
 /*
diff --git a/translate-all.c b/translate-all.c
index 3dd9214..ee86822 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1110,7 +1110,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
 
     /* remove the TB from the hash list */
     phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
-    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->traceid);
     qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
 
     /* remove the TB from the page list */
@@ -1255,7 +1255,7 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
     }
 
     /* add in the hash table */
-    h = tb_hash_func(phys_pc, tb->pc, tb->flags);
+    h = tb_hash_func(phys_pc, tb->pc, tb->flags, tb->traceid);
     qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
 
 #ifdef DEBUG_TB_CHECK
@@ -1298,6 +1298,7 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
     tb->cs_base = cs_base;
     tb->flags = flags;
     tb->cflags = cflags;
+    tb->traceid = 0 /* cpu->??? */;
 
 #ifdef CONFIG_PROFILER
     tcg_ctx.tb_count1++; /* includes aborted translations because of


* Re: [Qemu-devel] [PATCH v3 0/6] trace: [tcg] Optimize per-vCPU tracing states with separate TB caches
From: Lluís Vilanova @ 2016-12-26 19:41 UTC
  To: Richard Henderson; +Cc: qemu-devel, Eduardo Habkost, Stefan Hajnoczi

Richard Henderson writes:

> On 12/23/2016 10:51 AM, Lluís Vilanova wrote:
>>> On 12/22/2016 10:35 AM, Lluís Vilanova wrote:
>>>> To handle both issues, this series replicates the shared physical TB cache,
>>>> creating a separate physical TB cache for every combination of event states
>>>> (those with the 'vcpu' and 'tcg' properties). Then, all vCPUs tracing the same
>>>> events will use the same physical TB cache.
>> 
>>> Why do we need to "split the physical TB cache" as opposed to simply including
>>> the trace state into the TB hash function?
>> 
>> Mmmm, that's an interesting alternative I did not consider. Are you aiming at
>> minimizing the changes, or do you also think it would be more efficient?

> I suspect that it will be more efficient.

I discovered that I ran my tests with "--enable-debug". Without it, the
performance changes for this series are within the noise range (great news!
guest code events now come for free), and the TB hash function approach
should show the same results on the hot path:

* vanilla, statically disabled
real	0m2,259s
user	0m2,252s
sys	0m0,004s

* vanilla, statically enabled (overhead: 2.18x)
real	0m4,921s
user	0m4,912s
sys	0m0,008s

* multi-tb, statically disabled (overhead: 1.00x)
real	0m2,269s
user	0m2,248s
sys	0m0,020s

* multi-tb, statically enabled (overhead: 0.98x)
real	0m2,223s
user	0m2,204s
sys	0m0,016s


The only drawback I see with going for your proposal is the possibility of a
higher TB collision rate when users set vCPUs with different tracing states
(after all, my series has a separate QHT for each combination of tracing
states).


>> The dynamic tracing state would then be an arbitrarily long bitmap (defined by
>> the number of events with the 'vcpu' property), so I'm not sure how to fit it
>> into the hashing function with minimal collisions (the bitmap is now limited to
>> an unsigned long to use it as an index to the TB cache "matrix").

> You could consider that index a unique identifier for the tracing state, and
> then only compare and hash that integer.

Yes, I get that. I meant that the bitmap has a theoretically unbounded size. But
we can limit that to some arbitrary number to keep the hashing function fixed
(in your example, you're using 32 bits, which should be enough).


>> The other drawback I see is that then it would also take longer to compute the
>> hashing function, instead of the simpler array indexing. As a benefit, workloads
>> with a high frequency of TB-flushing operations might be a bit faster (there
>> would be a single QHT).

> I don't see adding one more integer to the hashing function to be significant
> at all.  Certainly not the 15% that you describe in your cover letter.

See results above.


>> If someone can provide me the code for the modified hash lookup function to
>> account for the trace dstate bitmap contents, I will integrate it and measure if
>> there is any important change in performance.

> Something like the following should do it.  There are two /* cpu->??? */
> markers that would need to be filled in.

Thanks for the code! If you think the TB hash function is a less invasive
change, then I'll send a new series with that.


> If you can reduce the tracing identifier to 8 bits, that would be excellent.
> I've been wanting to make some other changes to TB hashing, and that would fit
> in well with a second "flags" value.

With all the vCPU events I have in the queue, it should be on the order of 16
bits. But we cannot shrink it down to 8 bits at compile time now, since we no
longer have a define with the number of vCPU events.
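
For illustration only, a runtime-checked fold of the bitmap into a 16-bit
identifier could look like the sketch below (the helper name is made up;
trace_get_vcpu_event_count() is the accessor from patch 2):

    /* Sketch: reuse the low bits of trace_dstate as a fixed-width traceid,
     * failing loudly if there are more 'vcpu' events than bits. */
    static inline uint16_t trace_dstate_to_traceid(const unsigned long *dstate)
    {
        assert(trace_get_vcpu_event_count() <= 16);
        return (uint16_t)dstate[0];
    }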


Cheers,
  Lluis
