* [PATCH 0/3] Performance optimizations for PPC64
@ 2022-10-21 17:01 Leandro Lupori
  2022-10-21 17:01 ` [PATCH 1/3] accel/tcg: Add a quicker check for breakpoints Leandro Lupori
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Leandro Lupori @ 2022-10-21 17:01 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc
  Cc: richard.henderson, pbonzini, clg, danielhb413, david, groug,
	Leandro Lupori

This patch series contains two performance optimizations that
target PPC64, although the first one also benefits other architectures.

In the first optimization, the check for an empty breakpoint queue
is moved out of check_for_breakpoints() into a macro, to avoid
the call overhead when no breakpoints are set.

In the second optimization, the most frequently executed part of
the code that updates the PMCs is translated to inline TCG ops.
Also, new HFLAGS are introduced, to keep the inline code small and fast.

With both optimizations, a reduction of about 15% in Fedora's boot time
was measured on a POWER9 machine.

The PMU tests from kernel selftests were run and all tests that pass on
master still pass with these changes.

Leandro Lupori (3):
  accel/tcg: Add a quicker check for breakpoints
  target/ppc: Add new PMC HFLAGS
  target/ppc: Increment PMC5 with inline insns

 accel/tcg/cpu-exec.c     | 13 +++----
 target/ppc/cpu.h         |  4 ++-
 target/ppc/helper.h      |  1 +
 target/ppc/helper_regs.c |  6 ++++
 target/ppc/power8-pmu.c  | 74 +++++++++++++++++++++-------------------
 target/ppc/power8-pmu.h  |  3 ++
 target/ppc/translate.c   | 32 +++++++++++++++--
 7 files changed, 87 insertions(+), 46 deletions(-)

-- 
2.25.1



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/3] accel/tcg: Add a quicker check for breakpoints
  2022-10-21 17:01 [PATCH 0/3] Performance optimizations for PPC64 Leandro Lupori
@ 2022-10-21 17:01 ` Leandro Lupori
  2022-10-22 11:12   ` Richard Henderson
  2022-10-25 12:29   ` Paolo Bonzini
  2022-10-21 17:01 ` [PATCH 2/3] target/ppc: Add new PMC HFLAGS Leandro Lupori
  2022-10-21 17:01 ` [PATCH 3/3] target/ppc: Increment PMC5 with inline insns Leandro Lupori
  2 siblings, 2 replies; 10+ messages in thread
From: Leandro Lupori @ 2022-10-21 17:01 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc
  Cc: richard.henderson, pbonzini, clg, danielhb413, david, groug,
	Leandro Lupori

Profiling QEMU during Fedora 35 for PPC64 boot revealed that a
considerable amount of time was being spent in
check_for_breakpoints() (0.61% of total time on PPC64 and 2.19% on
amd64), even though it was just checking that its queue was empty
and returning, when no breakpoints were set. It turns out this
function is not inlined by the compiler and it's always called by
helper_lookup_tb_ptr(), one of the most called functions.

By moving the check for empty queue to the have_breakpoints()
macro and calling check_for_breakpoints() only when it returns
true, it's possible to avoid the call overhead. An improvement of
about 3% in total time was measured on POWER9.

Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
---
 accel/tcg/cpu-exec.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index f9e5cc9ba0..9eec01ad9a 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -304,16 +304,15 @@ static void log_cpu_exec(target_ulong pc, CPUState *cpu,
     }
 }
 
+#define have_breakpoints(cpu)   (likely(QTAILQ_EMPTY(&(cpu)->breakpoints)) ? \
+                                 false : true)
+
 static bool check_for_breakpoints(CPUState *cpu, target_ulong pc,
                                   uint32_t *cflags)
 {
     CPUBreakpoint *bp;
     bool match_page = false;
 
-    if (likely(QTAILQ_EMPTY(&cpu->breakpoints))) {
-        return false;
-    }
-
     /*
      * Singlestep overrides breakpoints.
      * This requirement is visible in the record-replay tests, where
@@ -392,7 +391,8 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env)
     cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
 
     cflags = curr_cflags(cpu);
-    if (check_for_breakpoints(cpu, pc, &cflags)) {
+
+    if (have_breakpoints(cpu) && check_for_breakpoints(cpu, pc, &cflags)) {
         cpu_loop_exit(cpu);
     }
 
@@ -990,7 +990,8 @@ int cpu_exec(CPUState *cpu)
                 cpu->cflags_next_tb = -1;
             }
 
-            if (check_for_breakpoints(cpu, pc, &cflags)) {
+            if (have_breakpoints(cpu) &&
+                check_for_breakpoints(cpu, pc, &cflags)) {
                 break;
             }
 
-- 
2.25.1




* [PATCH 2/3] target/ppc: Add new PMC HFLAGS
  2022-10-21 17:01 [PATCH 0/3] Performance optimizations for PPC64 Leandro Lupori
  2022-10-21 17:01 ` [PATCH 1/3] accel/tcg: Add a quicker check for breakpoints Leandro Lupori
@ 2022-10-21 17:01 ` Leandro Lupori
  2022-10-25 19:24   ` Daniel Henrique Barboza
  2022-10-21 17:01 ` [PATCH 3/3] target/ppc: Increment PMC5 with inline insns Leandro Lupori
  2 siblings, 1 reply; 10+ messages in thread
From: Leandro Lupori @ 2022-10-21 17:01 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc
  Cc: richard.henderson, pbonzini, clg, danielhb413, david, groug,
	Leandro Lupori

Add 2 new PMC related HFLAGS:
- HFLAGS_PMCJCE - value of MMCR0 PMCjCE bit
- HFLAGS_PMC_OTHER - set if a PMC other than PMC5-6 is enabled

These flags allow further optimization of PMC5 update code, by
allowing frequently tested conditions to be performed at
translation time.

Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
---
 target/ppc/cpu.h         | 4 +++-
 target/ppc/helper_regs.c | 6 ++++++
 target/ppc/translate.c   | 4 ++++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index cca6c4e51c..28b9b8d4e3 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -696,7 +696,9 @@ enum {
     HFLAGS_PR = 14,  /* MSR_PR */
     HFLAGS_PMCC0 = 15,  /* MMCR0 PMCC bit 0 */
     HFLAGS_PMCC1 = 16,  /* MMCR0 PMCC bit 1 */
-    HFLAGS_INSN_CNT = 17, /* PMU instruction count enabled */
+    HFLAGS_PMCJCE = 17, /* MMCR0 PMCjCE bit */
+    HFLAGS_PMC_OTHER = 18, /* PMC other than PMC5-6 is enabled */
+    HFLAGS_INSN_CNT = 19, /* PMU instruction count enabled */
     HFLAGS_VSX = 23, /* MSR_VSX if cpu has VSX */
     HFLAGS_VR = 25,  /* MSR_VR if cpu has VRE */
 
diff --git a/target/ppc/helper_regs.c b/target/ppc/helper_regs.c
index 12235ea2e9..65f5f7b2c0 100644
--- a/target/ppc/helper_regs.c
+++ b/target/ppc/helper_regs.c
@@ -109,6 +109,9 @@ static uint32_t hreg_compute_hflags_value(CPUPPCState *env)
     if (env->spr[SPR_POWER_MMCR0] & MMCR0_PMCC1) {
         hflags |= 1 << HFLAGS_PMCC1;
     }
+    if (env->spr[SPR_POWER_MMCR0] & MMCR0_PMCjCE) {
+        hflags |= 1 << HFLAGS_PMCJCE;
+    }
 
 #ifndef CONFIG_USER_ONLY
     if (!env->has_hv_mode || (msr & (1ull << MSR_HV))) {
@@ -119,6 +122,9 @@ static uint32_t hreg_compute_hflags_value(CPUPPCState *env)
     if (env->pmc_ins_cnt) {
         hflags |= 1 << HFLAGS_INSN_CNT;
     }
+    if (env->pmc_ins_cnt & 0x1e) {
+        hflags |= 1 << HFLAGS_PMC_OTHER;
+    }
 #endif
 
     /*
diff --git a/target/ppc/translate.c b/target/ppc/translate.c
index e810842925..8fda2cf836 100644
--- a/target/ppc/translate.c
+++ b/target/ppc/translate.c
@@ -177,6 +177,8 @@ struct DisasContext {
     bool hr;
     bool mmcr0_pmcc0;
     bool mmcr0_pmcc1;
+    bool mmcr0_pmcjce;
+    bool pmc_other;
     bool pmu_insn_cnt;
     ppc_spr_t *spr_cb; /* Needed to check rights for mfspr/mtspr */
     int singlestep_enabled;
@@ -7574,6 +7576,8 @@ static void ppc_tr_init_disas_context(DisasContextBase *dcbase, CPUState *cs)
     ctx->hr = (hflags >> HFLAGS_HR) & 1;
     ctx->mmcr0_pmcc0 = (hflags >> HFLAGS_PMCC0) & 1;
     ctx->mmcr0_pmcc1 = (hflags >> HFLAGS_PMCC1) & 1;
+    ctx->mmcr0_pmcjce = (hflags >> HFLAGS_PMCJCE) & 1;
+    ctx->pmc_other = (hflags >> HFLAGS_PMC_OTHER) & 1;
     ctx->pmu_insn_cnt = (hflags >> HFLAGS_INSN_CNT) & 1;
 
     ctx->singlestep_enabled = 0;
-- 
2.25.1




* [PATCH 3/3] target/ppc: Increment PMC5 with inline insns
  2022-10-21 17:01 [PATCH 0/3] Performance optimizations for PPC64 Leandro Lupori
  2022-10-21 17:01 ` [PATCH 1/3] accel/tcg: Add a quicker check for breakpoints Leandro Lupori
  2022-10-21 17:01 ` [PATCH 2/3] target/ppc: Add new PMC HFLAGS Leandro Lupori
@ 2022-10-21 17:01 ` Leandro Lupori
  2022-10-25 19:29   ` Daniel Henrique Barboza
  2 siblings, 1 reply; 10+ messages in thread
From: Leandro Lupori @ 2022-10-21 17:01 UTC (permalink / raw)
  To: qemu-devel, qemu-ppc
  Cc: richard.henderson, pbonzini, clg, danielhb413, david, groug,
	Leandro Lupori

Profiling QEMU during Fedora 35 for PPC64 boot revealed that
6.39% of total time was being spent in helper_insns_inc(), on a
POWER9 machine. To avoid calling this helper every time PMCs had
to be incremented, an inline implementation of PMC5 increment and
check for overflow was developed. This led to a reduction of
about 12% in Fedora's boot time.

Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
---
 target/ppc/helper.h     |  1 +
 target/ppc/power8-pmu.c | 74 +++++++++++++++++++++--------------------
 target/ppc/power8-pmu.h |  3 ++
 target/ppc/translate.c  | 28 ++++++++++++++--
 4 files changed, 67 insertions(+), 39 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 57eee07256..f8cd00c976 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -29,6 +29,7 @@ DEF_HELPER_2(store_mmcr1, void, env, tl)
 DEF_HELPER_3(store_pmc, void, env, i32, i64)
 DEF_HELPER_2(read_pmc, tl, env, i32)
 DEF_HELPER_2(insns_inc, void, env, i32)
+DEF_HELPER_1(handle_pmc5_overflow, void, env)
 #endif
 DEF_HELPER_1(check_tlb_flush_local, void, env)
 DEF_HELPER_1(check_tlb_flush_global, void, env)
diff --git a/target/ppc/power8-pmu.c b/target/ppc/power8-pmu.c
index beeab5c494..1381072b9e 100644
--- a/target/ppc/power8-pmu.c
+++ b/target/ppc/power8-pmu.c
@@ -22,8 +22,6 @@
 
 #if defined(TARGET_PPC64) && !defined(CONFIG_USER_ONLY)
 
-#define PMC_COUNTER_NEGATIVE_VAL 0x80000000UL
-
 static bool pmc_has_overflow_enabled(CPUPPCState *env, int sprn)
 {
     if (sprn == SPR_POWER_PMC1) {
@@ -88,49 +86,47 @@ static bool pmu_increment_insns(CPUPPCState *env, uint32_t num_insns)
     bool overflow_triggered = false;
     target_ulong tmp;
 
-    if (unlikely(ins_cnt & 0x1e)) {
-        if (ins_cnt & (1 << 1)) {
-            tmp = env->spr[SPR_POWER_PMC1];
-            tmp += num_insns;
-            if (tmp >= PMC_COUNTER_NEGATIVE_VAL && (mmcr0 & MMCR0_PMC1CE)) {
-                tmp = PMC_COUNTER_NEGATIVE_VAL;
-                overflow_triggered = true;
-            }
-            env->spr[SPR_POWER_PMC1] = tmp;
+    if (ins_cnt & (1 << 1)) {
+        tmp = env->spr[SPR_POWER_PMC1];
+        tmp += num_insns;
+        if (tmp >= PMC_COUNTER_NEGATIVE_VAL && (mmcr0 & MMCR0_PMC1CE)) {
+            tmp = PMC_COUNTER_NEGATIVE_VAL;
+            overflow_triggered = true;
         }
+        env->spr[SPR_POWER_PMC1] = tmp;
+    }
 
-        if (ins_cnt & (1 << 2)) {
-            tmp = env->spr[SPR_POWER_PMC2];
-            tmp += num_insns;
-            if (tmp >= PMC_COUNTER_NEGATIVE_VAL && (mmcr0 & MMCR0_PMCjCE)) {
-                tmp = PMC_COUNTER_NEGATIVE_VAL;
-                overflow_triggered = true;
-            }
-            env->spr[SPR_POWER_PMC2] = tmp;
+    if (ins_cnt & (1 << 2)) {
+        tmp = env->spr[SPR_POWER_PMC2];
+        tmp += num_insns;
+        if (tmp >= PMC_COUNTER_NEGATIVE_VAL && (mmcr0 & MMCR0_PMCjCE)) {
+            tmp = PMC_COUNTER_NEGATIVE_VAL;
+            overflow_triggered = true;
+        }
+        env->spr[SPR_POWER_PMC2] = tmp;
+    }
+
+    if (ins_cnt & (1 << 3)) {
+        tmp = env->spr[SPR_POWER_PMC3];
+        tmp += num_insns;
+        if (tmp >= PMC_COUNTER_NEGATIVE_VAL && (mmcr0 & MMCR0_PMCjCE)) {
+            tmp = PMC_COUNTER_NEGATIVE_VAL;
+            overflow_triggered = true;
         }
+        env->spr[SPR_POWER_PMC3] = tmp;
+    }
 
-        if (ins_cnt & (1 << 3)) {
-            tmp = env->spr[SPR_POWER_PMC3];
+    if (ins_cnt & (1 << 4)) {
+        target_ulong mmcr1 = env->spr[SPR_POWER_MMCR1];
+        int sel = extract64(mmcr1, MMCR1_PMC4EVT_EXTR, MMCR1_EVT_SIZE);
+        if (sel == 0x02 || (env->spr[SPR_CTRL] & CTRL_RUN)) {
+            tmp = env->spr[SPR_POWER_PMC4];
             tmp += num_insns;
             if (tmp >= PMC_COUNTER_NEGATIVE_VAL && (mmcr0 & MMCR0_PMCjCE)) {
                 tmp = PMC_COUNTER_NEGATIVE_VAL;
                 overflow_triggered = true;
             }
-            env->spr[SPR_POWER_PMC3] = tmp;
-        }
-
-        if (ins_cnt & (1 << 4)) {
-            target_ulong mmcr1 = env->spr[SPR_POWER_MMCR1];
-            int sel = extract64(mmcr1, MMCR1_PMC4EVT_EXTR, MMCR1_EVT_SIZE);
-            if (sel == 0x02 || (env->spr[SPR_CTRL] & CTRL_RUN)) {
-                tmp = env->spr[SPR_POWER_PMC4];
-                tmp += num_insns;
-                if (tmp >= PMC_COUNTER_NEGATIVE_VAL && (mmcr0 & MMCR0_PMCjCE)) {
-                    tmp = PMC_COUNTER_NEGATIVE_VAL;
-                    overflow_triggered = true;
-                }
-                env->spr[SPR_POWER_PMC4] = tmp;
-            }
+            env->spr[SPR_POWER_PMC4] = tmp;
         }
     }
 
@@ -310,6 +306,12 @@ static void fire_PMC_interrupt(PowerPCCPU *cpu)
     raise_ebb_perfm_exception(env);
 }
 
+void helper_handle_pmc5_overflow(CPUPPCState *env)
+{
+    env->spr[SPR_POWER_PMC5] = PMC_COUNTER_NEGATIVE_VAL;
+    fire_PMC_interrupt(env_archcpu(env));
+}
+
 /* This helper assumes that the PMC is running. */
 void helper_insns_inc(CPUPPCState *env, uint32_t num_insns)
 {
diff --git a/target/ppc/power8-pmu.h b/target/ppc/power8-pmu.h
index 9692dd765e..c0093e2219 100644
--- a/target/ppc/power8-pmu.h
+++ b/target/ppc/power8-pmu.h
@@ -14,6 +14,9 @@
 #define POWER8_PMU_H
 
 #if defined(TARGET_PPC64) && !defined(CONFIG_USER_ONLY)
+
+#define PMC_COUNTER_NEGATIVE_VAL 0x80000000UL
+
 void cpu_ppc_pmu_init(CPUPPCState *env);
 void pmu_update_summaries(CPUPPCState *env);
 #else
diff --git a/target/ppc/translate.c b/target/ppc/translate.c
index 8fda2cf836..5c74684eee 100644
--- a/target/ppc/translate.c
+++ b/target/ppc/translate.c
@@ -36,6 +36,7 @@
 #include "exec/log.h"
 #include "qemu/atomic128.h"
 #include "spr_common.h"
+#include "power8-pmu.h"
 
 #include "qemu/qemu-print.h"
 #include "qapi/error.h"
@@ -4263,6 +4264,9 @@ static void pmu_count_insns(DisasContext *ctx)
     }
 
  #if !defined(CONFIG_USER_ONLY)
+    TCGLabel *l;
+    TCGv t0;
+
     /*
      * The PMU insns_inc() helper stops the internal PMU timer if a
      * counter overflows happens. In that case, if the guest is
@@ -4271,8 +4275,26 @@ static void pmu_count_insns(DisasContext *ctx)
      */
     gen_icount_io_start(ctx);
 
-    gen_helper_insns_inc(cpu_env, tcg_constant_i32(ctx->base.num_insns));
-#else
+    /* Avoid helper calls when only PMC5-6 are enabled. */
+    if (!ctx->pmc_other) {
+        l = gen_new_label();
+        t0 = tcg_temp_new();
+
+        gen_load_spr(t0, SPR_POWER_PMC5);
+        tcg_gen_addi_tl(t0, t0, ctx->base.num_insns);
+        gen_store_spr(SPR_POWER_PMC5, t0);
+        /* Check for overflow, if it's enabled */
+        if (ctx->mmcr0_pmcjce) {
+            tcg_gen_brcondi_tl(TCG_COND_LT, t0, PMC_COUNTER_NEGATIVE_VAL, l);
+            gen_helper_handle_pmc5_overflow(cpu_env);
+        }
+
+        gen_set_label(l);
+        tcg_temp_free(t0);
+    } else {
+        gen_helper_insns_inc(cpu_env, tcg_constant_i32(ctx->base.num_insns));
+    }
+  #else
     /*
      * User mode can read (but not write) PMC5 and start/stop
      * the PMU via MMCR0_FC. In this case just increment
@@ -4285,7 +4307,7 @@ static void pmu_count_insns(DisasContext *ctx)
     gen_store_spr(SPR_POWER_PMC5, t0);
 
     tcg_temp_free(t0);
-#endif /* #if !defined(CONFIG_USER_ONLY) */
+  #endif /* #if !defined(CONFIG_USER_ONLY) */
 }
 #else
 static void pmu_count_insns(DisasContext *ctx)
-- 
2.25.1




* Re: [PATCH 1/3] accel/tcg: Add a quicker check for breakpoints
  2022-10-21 17:01 ` [PATCH 1/3] accel/tcg: Add a quicker check for breakpoints Leandro Lupori
@ 2022-10-22 11:12   ` Richard Henderson
  2022-10-24 15:00     ` Leandro Lupori
  2022-10-25 12:29   ` Paolo Bonzini
  1 sibling, 1 reply; 10+ messages in thread
From: Richard Henderson @ 2022-10-22 11:12 UTC (permalink / raw)
  To: Leandro Lupori, qemu-devel, qemu-ppc
  Cc: pbonzini, clg, danielhb413, david, groug

On 10/22/22 03:01, Leandro Lupori wrote:
> Profiling QEMU during Fedora 35 for PPC64 boot revealed that a
> considerable amount of time was being spent in
> check_for_breakpoints() (0.61% of total time on PPC64 and 2.19% on
> amd64), even though it was just checking that its queue was empty
> and returning, when no breakpoints were set. It turns out this
> function is not inlined by the compiler and it's always called by
> helper_lookup_tb_ptr(), one of the most called functions.
> 
> By moving the check for empty queue to the have_breakpoints()
> macro and calling check_for_breakpoints() only when it returns
> true, it's possible to avoid the call overhead. An improvement of
> about 3% in total time was measured on POWER9.

Wow, 3%?

> 
> Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
> ---
>   accel/tcg/cpu-exec.c | 13 +++++++------
>   1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index f9e5cc9ba0..9eec01ad9a 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -304,16 +304,15 @@ static void log_cpu_exec(target_ulong pc, CPUState *cpu,
>       }
>   }
>   
> +#define have_breakpoints(cpu)   (likely(QTAILQ_EMPTY(&(cpu)->breakpoints)) ? \
> +                                 false : true)

First, always avoid useless ?:.

Second, I think renaming the existing check_for_breakpoints to check_for_breakpoints_slow 
and make this test be an inline function instead.  E.g.

static bool check_for_breakpoints(...)
{
     if (unlikely(!QTAILQ_EMPTY(&cpu->breakpoints))) {
         return check_for_breakpoints_slow(...);
     }
     return false;
}


r~



* Re: [PATCH 1/3] accel/tcg: Add a quicker check for breakpoints
  2022-10-22 11:12   ` Richard Henderson
@ 2022-10-24 15:00     ` Leandro Lupori
  0 siblings, 0 replies; 10+ messages in thread
From: Leandro Lupori @ 2022-10-24 15:00 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel, qemu-ppc
  Cc: pbonzini, clg, danielhb413, david, groug

On 10/22/22 08:12, Richard Henderson wrote:
> On 10/22/22 03:01, Leandro Lupori wrote:
>> Profiling QEMU during Fedora 35 for PPC64 boot revealed that a
>> considerable amount of time was being spent in
>> check_for_breakpoints() (0.61% of total time on PPC64 and 2.19% on
>> amd64), even though it was just checking that its queue was empty
>> and returning, when no breakpoints were set. It turns out this
>> function is not inlined by the compiler and it's always called by
>> helper_lookup_tb_ptr(), one of the most called functions.
>>
>> By moving the check for empty queue to the have_breakpoints()
>> macro and calling check_for_breakpoints() only when it returns
>> true, it's possible to avoid the call overhead. An improvement of
>> about 3% in total time was measured on POWER9.
> 
> Wow, 3%?
> 
I've compared the averages of 3 runs. To get a more precise number it 
would probably be better to take the average or median of 10 runs with 
each version.

>>
>> Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
>> ---
>>   accel/tcg/cpu-exec.c | 13 +++++++------
>>   1 file changed, 7 insertions(+), 6 deletions(-)
>>
>> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
>> index f9e5cc9ba0..9eec01ad9a 100644
>> --- a/accel/tcg/cpu-exec.c
>> +++ b/accel/tcg/cpu-exec.c
>> @@ -304,16 +304,15 @@ static void log_cpu_exec(target_ulong pc, 
>> CPUState *cpu,
>>       }
>>   }
>>
>> +#define have_breakpoints(cpu)   
>> (likely(QTAILQ_EMPTY(&(cpu)->breakpoints)) ? \
>> +                                 false : true)
> 
> First, always avoid useless ?:.
> 
> Second, I think renaming the existing check_for_breakpoints to 
> check_for_breakpoints_slow
> and make this test be an inline function instead.  E.g.
> 
> static bool check_for_breakpoints(...)
> {
>      if (unlikely(!QTAILQ_EMPTY(&cpu->breakpoints))) {
>          return check_for_breakpoints_slow(...);
>      }
>      return false;
> }
> 
Ok. I had left the ?: to take advantage of likely() in the macro.

Thanks,
Leandro

> 
> r~




* Re: [PATCH 1/3] accel/tcg: Add a quicker check for breakpoints
  2022-10-21 17:01 ` [PATCH 1/3] accel/tcg: Add a quicker check for breakpoints Leandro Lupori
  2022-10-22 11:12   ` Richard Henderson
@ 2022-10-25 12:29   ` Paolo Bonzini
  1 sibling, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2022-10-25 12:29 UTC (permalink / raw)
  To: Leandro Lupori, qemu-devel, qemu-ppc
  Cc: richard.henderson, clg, danielhb413, david, groug

On 10/21/22 19:01, Leandro Lupori wrote:
> Profiling QEMU during Fedora 35 for PPC64 boot revealed that a
> considerable amount of time was being spent in
> check_for_breakpoints() (0.61% of total time on PPC64 and 2.19% on
> amd64), even though it was just checking that its queue was empty
> and returning, when no breakpoints were set. It turns out this
> function is not inlined by the compiler and it's always called by
> helper_lookup_tb_ptr(), one of the most called functions.
> 
> By moving the check for empty queue to the have_breakpoints()
> macro and calling check_for_breakpoints() only when it returns
> true, it's possible to avoid the call overhead. An improvement of
> about 3% in total time was measured on POWER9.
> 
> Signed-off-by: Leandro Lupori<leandro.lupori@eldorado.org.br>
> ---
>   accel/tcg/cpu-exec.c | 13 +++++++------
>   1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
> index f9e5cc9ba0..9eec01ad9a 100644
> --- a/accel/tcg/cpu-exec.c
> +++ b/accel/tcg/cpu-exec.c
> @@ -304,16 +304,15 @@ static void log_cpu_exec(target_ulong pc, CPUState *cpu,
>       }
>   }
>   
> +#define have_breakpoints(cpu)   (likely(QTAILQ_EMPTY(&(cpu)->breakpoints)) ? \
> +                                 false : true)
> +
>   static bool check_for_breakpoints(CPUState *cpu, target_ulong pc,
>                                     uint32_t *cflags)
>   {
>       CPUBreakpoint *bp;
>       bool match_page = false;
>   
> -    if (likely(QTAILQ_EMPTY(&cpu->breakpoints))) {
> -        return false;
> -    }
> -

It's a little more readable to just split out the slow path:

-static inline bool check_for_breakpoints(CPUState *cpu, target_ulong pc,
-                                         uint32_t *cflags)
+static bool check_for_breakpoints_slow(CPUState *cpu, target_ulong pc,
+                                       uint32_t *cflags)
  {
       CPUBreakpoint *bp;
       bool match_page = false;
   
-    if (likely(QTAILQ_EMPTY(&cpu->breakpoints))) {
-        return false;
-    }
      ...
  }
+
+static inline bool check_for_breakpoints(CPUState *cpu, target_ulong pc,
+                                         uint32_t *cflags)
+{
+    return unlikely(!QTAILQ_EMPTY(&cpu->breakpoints))
+        && check_for_breakpoints_slow(cpu, pc, cflags);
+}

Paolo




* Re: [PATCH 2/3] target/ppc: Add new PMC HFLAGS
  2022-10-21 17:01 ` [PATCH 2/3] target/ppc: Add new PMC HFLAGS Leandro Lupori
@ 2022-10-25 19:24   ` Daniel Henrique Barboza
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel Henrique Barboza @ 2022-10-25 19:24 UTC (permalink / raw)
  To: Leandro Lupori, qemu-devel, qemu-ppc
  Cc: richard.henderson, pbonzini, clg, david, groug



On 10/21/22 14:01, Leandro Lupori wrote:
> Add 2 new PMC related HFLAGS:
> - HFLAGS_PMCJCE - value of MMCR0 PMCjCE bit
> - HFLAGS_PMC_OTHER - set if a PMC other than PMC5-6 is enabled
> 
> These flags allow further optimization of PMC5 update code, by
> allowing frequently tested conditions to be performed at
> translation time.
> 
> Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
> ---

Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>

>   target/ppc/cpu.h         | 4 +++-
>   target/ppc/helper_regs.c | 6 ++++++
>   target/ppc/translate.c   | 4 ++++
>   3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
> index cca6c4e51c..28b9b8d4e3 100644
> --- a/target/ppc/cpu.h
> +++ b/target/ppc/cpu.h
> @@ -696,7 +696,9 @@ enum {
>       HFLAGS_PR = 14,  /* MSR_PR */
>       HFLAGS_PMCC0 = 15,  /* MMCR0 PMCC bit 0 */
>       HFLAGS_PMCC1 = 16,  /* MMCR0 PMCC bit 1 */
> -    HFLAGS_INSN_CNT = 17, /* PMU instruction count enabled */
> +    HFLAGS_PMCJCE = 17, /* MMCR0 PMCjCE bit */
> +    HFLAGS_PMC_OTHER = 18, /* PMC other than PMC5-6 is enabled */
> +    HFLAGS_INSN_CNT = 19, /* PMU instruction count enabled */
>       HFLAGS_VSX = 23, /* MSR_VSX if cpu has VSX */
>       HFLAGS_VR = 25,  /* MSR_VR if cpu has VRE */
>   
> diff --git a/target/ppc/helper_regs.c b/target/ppc/helper_regs.c
> index 12235ea2e9..65f5f7b2c0 100644
> --- a/target/ppc/helper_regs.c
> +++ b/target/ppc/helper_regs.c
> @@ -109,6 +109,9 @@ static uint32_t hreg_compute_hflags_value(CPUPPCState *env)
>       if (env->spr[SPR_POWER_MMCR0] & MMCR0_PMCC1) {
>           hflags |= 1 << HFLAGS_PMCC1;
>       }
> +    if (env->spr[SPR_POWER_MMCR0] & MMCR0_PMCjCE) {
> +        hflags |= 1 << HFLAGS_PMCJCE;
> +    }
>   
>   #ifndef CONFIG_USER_ONLY
>       if (!env->has_hv_mode || (msr & (1ull << MSR_HV))) {
> @@ -119,6 +122,9 @@ static uint32_t hreg_compute_hflags_value(CPUPPCState *env)
>       if (env->pmc_ins_cnt) {
>           hflags |= 1 << HFLAGS_INSN_CNT;
>       }
> +    if (env->pmc_ins_cnt & 0x1e) {
> +        hflags |= 1 << HFLAGS_PMC_OTHER;
> +    }
>   #endif
>   
>       /*
> diff --git a/target/ppc/translate.c b/target/ppc/translate.c
> index e810842925..8fda2cf836 100644
> --- a/target/ppc/translate.c
> +++ b/target/ppc/translate.c
> @@ -177,6 +177,8 @@ struct DisasContext {
>       bool hr;
>       bool mmcr0_pmcc0;
>       bool mmcr0_pmcc1;
> +    bool mmcr0_pmcjce;
> +    bool pmc_other;
>       bool pmu_insn_cnt;
>       ppc_spr_t *spr_cb; /* Needed to check rights for mfspr/mtspr */
>       int singlestep_enabled;
> @@ -7574,6 +7576,8 @@ static void ppc_tr_init_disas_context(DisasContextBase *dcbase, CPUState *cs)
>       ctx->hr = (hflags >> HFLAGS_HR) & 1;
>       ctx->mmcr0_pmcc0 = (hflags >> HFLAGS_PMCC0) & 1;
>       ctx->mmcr0_pmcc1 = (hflags >> HFLAGS_PMCC1) & 1;
> +    ctx->mmcr0_pmcjce = (hflags >> HFLAGS_PMCJCE) & 1;
> +    ctx->pmc_other = (hflags >> HFLAGS_PMC_OTHER) & 1;
>       ctx->pmu_insn_cnt = (hflags >> HFLAGS_INSN_CNT) & 1;
>   
>       ctx->singlestep_enabled = 0;



* Re: [PATCH 3/3] target/ppc: Increment PMC5 with inline insns
  2022-10-21 17:01 ` [PATCH 3/3] target/ppc: Increment PMC5 with inline insns Leandro Lupori
@ 2022-10-25 19:29   ` Daniel Henrique Barboza
  2022-10-25 20:39     ` Leandro Lupori
  0 siblings, 1 reply; 10+ messages in thread
From: Daniel Henrique Barboza @ 2022-10-25 19:29 UTC (permalink / raw)
  To: Leandro Lupori, qemu-devel, qemu-ppc
  Cc: richard.henderson, pbonzini, clg, david, groug



On 10/21/22 14:01, Leandro Lupori wrote:
> Profiling QEMU during Fedora 35 for PPC64 boot revealed that
> 6.39% of total time was being spent in helper_insns_inc(), on a
> POWER9 machine. To avoid calling this helper every time PMCs had
> to be incremented, an inline implementation of PMC5 increment and
> check for overflow was developed. This led to a reduction of
> about 12% in Fedora's boot time.
> 
> Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
> ---

Given that PMC5 is the counter that is most likely to be active, yeah,
isolating the case where PMC5 is incremented standalone makes sense.

Still, 12% performance gain is not too shabby. Not too shabby at all.


Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>


>   target/ppc/helper.h     |  1 +
>   target/ppc/power8-pmu.c | 74 +++++++++++++++++++++--------------------
>   target/ppc/power8-pmu.h |  3 ++
>   target/ppc/translate.c  | 28 ++++++++++++++--
>   4 files changed, 67 insertions(+), 39 deletions(-)
> 
> diff --git a/target/ppc/helper.h b/target/ppc/helper.h
> index 57eee07256..f8cd00c976 100644
> --- a/target/ppc/helper.h
> +++ b/target/ppc/helper.h
> @@ -29,6 +29,7 @@ DEF_HELPER_2(store_mmcr1, void, env, tl)
>   DEF_HELPER_3(store_pmc, void, env, i32, i64)
>   DEF_HELPER_2(read_pmc, tl, env, i32)
>   DEF_HELPER_2(insns_inc, void, env, i32)
> +DEF_HELPER_1(handle_pmc5_overflow, void, env)
>   #endif
>   DEF_HELPER_1(check_tlb_flush_local, void, env)
>   DEF_HELPER_1(check_tlb_flush_global, void, env)
> diff --git a/target/ppc/power8-pmu.c b/target/ppc/power8-pmu.c
> index beeab5c494..1381072b9e 100644
> --- a/target/ppc/power8-pmu.c
> +++ b/target/ppc/power8-pmu.c
> @@ -22,8 +22,6 @@
>   
>   #if defined(TARGET_PPC64) && !defined(CONFIG_USER_ONLY)
>   
> -#define PMC_COUNTER_NEGATIVE_VAL 0x80000000UL
> -
>   static bool pmc_has_overflow_enabled(CPUPPCState *env, int sprn)
>   {
>       if (sprn == SPR_POWER_PMC1) {
> @@ -88,49 +86,47 @@ static bool pmu_increment_insns(CPUPPCState *env, uint32_t num_insns)
>       bool overflow_triggered = false;
>       target_ulong tmp;
>   
> -    if (unlikely(ins_cnt & 0x1e)) {
> -        if (ins_cnt & (1 << 1)) {
> -            tmp = env->spr[SPR_POWER_PMC1];
> -            tmp += num_insns;
> -            if (tmp >= PMC_COUNTER_NEGATIVE_VAL && (mmcr0 & MMCR0_PMC1CE)) {
> -                tmp = PMC_COUNTER_NEGATIVE_VAL;
> -                overflow_triggered = true;
> -            }
> -            env->spr[SPR_POWER_PMC1] = tmp;
> +    if (ins_cnt & (1 << 1)) {
> +        tmp = env->spr[SPR_POWER_PMC1];
> +        tmp += num_insns;
> +        if (tmp >= PMC_COUNTER_NEGATIVE_VAL && (mmcr0 & MMCR0_PMC1CE)) {
> +            tmp = PMC_COUNTER_NEGATIVE_VAL;
> +            overflow_triggered = true;
>           }
> +        env->spr[SPR_POWER_PMC1] = tmp;
> +    }
>   
> -        if (ins_cnt & (1 << 2)) {
> -            tmp = env->spr[SPR_POWER_PMC2];
> -            tmp += num_insns;
> -            if (tmp >= PMC_COUNTER_NEGATIVE_VAL && (mmcr0 & MMCR0_PMCjCE)) {
> -                tmp = PMC_COUNTER_NEGATIVE_VAL;
> -                overflow_triggered = true;
> -            }
> -            env->spr[SPR_POWER_PMC2] = tmp;
> +    if (ins_cnt & (1 << 2)) {
> +        tmp = env->spr[SPR_POWER_PMC2];
> +        tmp += num_insns;
> +        if (tmp >= PMC_COUNTER_NEGATIVE_VAL && (mmcr0 & MMCR0_PMCjCE)) {
> +            tmp = PMC_COUNTER_NEGATIVE_VAL;
> +            overflow_triggered = true;
> +        }
> +        env->spr[SPR_POWER_PMC2] = tmp;
> +    }
> +
> +    if (ins_cnt & (1 << 3)) {
> +        tmp = env->spr[SPR_POWER_PMC3];
> +        tmp += num_insns;
> +        if (tmp >= PMC_COUNTER_NEGATIVE_VAL && (mmcr0 & MMCR0_PMCjCE)) {
> +            tmp = PMC_COUNTER_NEGATIVE_VAL;
> +            overflow_triggered = true;
>           }
> +        env->spr[SPR_POWER_PMC3] = tmp;
> +    }
>   
> -        if (ins_cnt & (1 << 3)) {
> -            tmp = env->spr[SPR_POWER_PMC3];
> +    if (ins_cnt & (1 << 4)) {
> +        target_ulong mmcr1 = env->spr[SPR_POWER_MMCR1];
> +        int sel = extract64(mmcr1, MMCR1_PMC4EVT_EXTR, MMCR1_EVT_SIZE);
> +        if (sel == 0x02 || (env->spr[SPR_CTRL] & CTRL_RUN)) {
> +            tmp = env->spr[SPR_POWER_PMC4];
>               tmp += num_insns;
>               if (tmp >= PMC_COUNTER_NEGATIVE_VAL && (mmcr0 & MMCR0_PMCjCE)) {
>                   tmp = PMC_COUNTER_NEGATIVE_VAL;
>                   overflow_triggered = true;
>               }
> -            env->spr[SPR_POWER_PMC3] = tmp;
> -        }
> -
> -        if (ins_cnt & (1 << 4)) {
> -            target_ulong mmcr1 = env->spr[SPR_POWER_MMCR1];
> -            int sel = extract64(mmcr1, MMCR1_PMC4EVT_EXTR, MMCR1_EVT_SIZE);
> -            if (sel == 0x02 || (env->spr[SPR_CTRL] & CTRL_RUN)) {
> -                tmp = env->spr[SPR_POWER_PMC4];
> -                tmp += num_insns;
> -                if (tmp >= PMC_COUNTER_NEGATIVE_VAL && (mmcr0 & MMCR0_PMCjCE)) {
> -                    tmp = PMC_COUNTER_NEGATIVE_VAL;
> -                    overflow_triggered = true;
> -                }
> -                env->spr[SPR_POWER_PMC4] = tmp;
> -            }
> +            env->spr[SPR_POWER_PMC4] = tmp;
>           }
>       }
>   
> @@ -310,6 +306,12 @@ static void fire_PMC_interrupt(PowerPCCPU *cpu)
>       raise_ebb_perfm_exception(env);
>   }
>   
> +void helper_handle_pmc5_overflow(CPUPPCState *env)
> +{
> +    env->spr[SPR_POWER_PMC5] = PMC_COUNTER_NEGATIVE_VAL;
> +    fire_PMC_interrupt(env_archcpu(env));
> +}
> +
>   /* This helper assumes that the PMC is running. */
>   void helper_insns_inc(CPUPPCState *env, uint32_t num_insns)
>   {
> diff --git a/target/ppc/power8-pmu.h b/target/ppc/power8-pmu.h
> index 9692dd765e..c0093e2219 100644
> --- a/target/ppc/power8-pmu.h
> +++ b/target/ppc/power8-pmu.h
> @@ -14,6 +14,9 @@
>   #define POWER8_PMU_H
>   
>   #if defined(TARGET_PPC64) && !defined(CONFIG_USER_ONLY)
> +
> +#define PMC_COUNTER_NEGATIVE_VAL 0x80000000UL
> +
>   void cpu_ppc_pmu_init(CPUPPCState *env);
>   void pmu_update_summaries(CPUPPCState *env);
>   #else
> diff --git a/target/ppc/translate.c b/target/ppc/translate.c
> index 8fda2cf836..5c74684eee 100644
> --- a/target/ppc/translate.c
> +++ b/target/ppc/translate.c
> @@ -36,6 +36,7 @@
>   #include "exec/log.h"
>   #include "qemu/atomic128.h"
>   #include "spr_common.h"
> +#include "power8-pmu.h"
>   
>   #include "qemu/qemu-print.h"
>   #include "qapi/error.h"
> @@ -4263,6 +4264,9 @@ static void pmu_count_insns(DisasContext *ctx)
>       }
>   
>    #if !defined(CONFIG_USER_ONLY)
> +    TCGLabel *l;
> +    TCGv t0;
> +
>       /*
>        * The PMU insns_inc() helper stops the internal PMU timer if a
>        * counter overflows happens. In that case, if the guest is
> @@ -4271,8 +4275,26 @@ static void pmu_count_insns(DisasContext *ctx)
>        */
>       gen_icount_io_start(ctx);
>   
> -    gen_helper_insns_inc(cpu_env, tcg_constant_i32(ctx->base.num_insns));
> -#else
> +    /* Avoid helper calls when only PMC5-6 are enabled. */
> +    if (!ctx->pmc_other) {
> +        l = gen_new_label();
> +        t0 = tcg_temp_new();
> +
> +        gen_load_spr(t0, SPR_POWER_PMC5);
> +        tcg_gen_addi_tl(t0, t0, ctx->base.num_insns);
> +        gen_store_spr(SPR_POWER_PMC5, t0);
> +        /* Check for overflow, if it's enabled */
> +        if (ctx->mmcr0_pmcjce) {
> +            tcg_gen_brcondi_tl(TCG_COND_LT, t0, PMC_COUNTER_NEGATIVE_VAL, l);
> +            gen_helper_handle_pmc5_overflow(cpu_env);
> +        }
> +
> +        gen_set_label(l);
> +        tcg_temp_free(t0);
> +    } else {
> +        gen_helper_insns_inc(cpu_env, tcg_constant_i32(ctx->base.num_insns));
> +    }
> +  #else
>       /*
>        * User mode can read (but not write) PMC5 and start/stop
>        * the PMU via MMCR0_FC. In this case just increment
> @@ -4285,7 +4307,7 @@ static void pmu_count_insns(DisasContext *ctx)
>       gen_store_spr(SPR_POWER_PMC5, t0);
>   
>       tcg_temp_free(t0);
> -#endif /* #if !defined(CONFIG_USER_ONLY) */
> +  #endif /* #if !defined(CONFIG_USER_ONLY) */
>   }
>   #else
>   static void pmu_count_insns(DisasContext *ctx)


^ permalink raw reply	[flat|nested] 10+ messages in thread
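The per-counter update that each hunk in the patch above repeats can be modelled as a small standalone C function. This is only a sketch of the guest-visible semantics, not QEMU source; apart from PMC_COUNTER_NEGATIVE_VAL (taken from target/ppc/power8-pmu.h), every name below is invented for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the value added to target/ppc/power8-pmu.h by this patch. */
#define PMC_COUNTER_NEGATIVE_VAL 0x80000000UL

/*
 * Add num_insns to *pmc. If the result reaches the "counter negative"
 * threshold and overflow interrupts are enabled for this counter
 * (MMCR0[PMC1CE] for PMC1, MMCR0[PMCjCE] for PMC2-4), the counter is
 * clamped at the threshold and an overflow is reported, matching the
 * tmp/overflow_triggered pattern in helper_insns_inc() above.
 */
static bool pmc_add(uint64_t *pmc, uint32_t num_insns, bool overflow_enabled)
{
    uint64_t tmp = *pmc + num_insns;

    if (tmp >= PMC_COUNTER_NEGATIVE_VAL && overflow_enabled) {
        tmp = PMC_COUNTER_NEGATIVE_VAL;
        *pmc = tmp;
        return true;
    }
    *pmc = tmp;
    return false;
}
```

With overflow interrupts disabled the counter simply keeps counting past the threshold, which is why the diff guards the clamp with the MMCR0 condition bits.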

* Re: [PATCH 3/3] target/ppc: Increment PMC5 with inline insns
  2022-10-25 19:29   ` Daniel Henrique Barboza
@ 2022-10-25 20:39     ` Leandro Lupori
  0 siblings, 0 replies; 10+ messages in thread
From: Leandro Lupori @ 2022-10-25 20:39 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel, qemu-ppc
  Cc: richard.henderson, pbonzini, clg, david, groug

On 10/25/22 16:29, Daniel Henrique Barboza wrote:

> On 10/21/22 14:01, Leandro Lupori wrote:
>> Profiling QEMU during Fedora 35 for PPC64 boot revealed that
>> 6.39% of total time was being spent in helper_insns_inc(), on a
>> POWER9 machine. To avoid calling this helper every time PMCs had
>> to be incremented, an inline implementation of PMC5 increment and
>> check for overflow was developed. This led to a reduction of
>> about 12% in Fedora's boot time.
>>
>> Signed-off-by: Leandro Lupori <leandro.lupori@eldorado.org.br>
>> ---
> 
> Given that PMC5 is the counter that is most likely to be active, yeah,
> isolating the case where PMC5 is incremented standalone makes sense.
> 
> Still, a 12% performance gain is not too shabby. Not too shabby at all.
> 

I've tried to move more of helper_insns_inc() to the inline 
implementation, but then performance started to decrease.

Initially I found this strange, but perf revealed a considerable 
increase in the time spent in functions such as tcg_gen_code() and 
liveness_pass_1().

Since this code has to be generated and optimized for most TBs, it 
seems that making it too big slows down code generation.
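The fast/slow split behind this tradeoff can be sketched in plain C. This is a hypothetical model of what the generated code ends up doing, not QEMU source; all struct and function names here are invented, and only the threshold value comes from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

#define PMC_COUNTER_NEGATIVE_VAL 0x80000000UL

/* Invented stand-in for the PMU state touched by pmu_count_insns(). */
struct pmu_state {
    uint64_t pmc5;
    bool pmc_other;  /* models the new HFLAG: some PMC1-4 counter active */
    bool pmcjce;     /* models MMCR0[PMCjCE]: overflow interrupt enabled */
    bool overflow;   /* set where helper_handle_pmc5_overflow() would fire */
};

/* Slow path: stands in for the gen_helper_insns_inc() call. */
static void slow_insns_inc(struct pmu_state *s, uint32_t num_insns)
{
    s->pmc5 += num_insns;  /* PMC1-4 updates omitted in this sketch */
}

/*
 * Models the translate-time decision: when only PMC5/6 are counting,
 * the emitted code is just an add plus one compare-and-branch to an
 * out-of-line overflow handler; anything more falls back to the helper.
 */
static void count_insns(struct pmu_state *s, uint32_t num_insns)
{
    if (!s->pmc_other) {
        s->pmc5 += num_insns;
        if (s->pmcjce && s->pmc5 >= PMC_COUNTER_NEGATIVE_VAL) {
            s->pmc5 = PMC_COUNTER_NEGATIVE_VAL;
            s->overflow = true;
        }
    } else {
        slow_insns_inc(s, num_insns);
    }
}
```

Because count_insns() corresponds to code emitted into nearly every TB, keeping the inline branch this small is what keeps tcg_gen_code() and liveness_pass_1() cheap.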

> 
> Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> 
> 




end of thread, other threads:[~2022-10-25 20:40 UTC | newest]

Thread overview: 10+ messages
2022-10-21 17:01 [PATCH 0/3] Performance optimizations for PPC64 Leandro Lupori
2022-10-21 17:01 ` [PATCH 1/3] accel/tcg: Add a quicker check for breakpoints Leandro Lupori
2022-10-22 11:12   ` Richard Henderson
2022-10-24 15:00     ` Leandro Lupori
2022-10-25 12:29   ` Paolo Bonzini
2022-10-21 17:01 ` [PATCH 2/3] target/ppc: Add new PMC HFLAGS Leandro Lupori
2022-10-25 19:24   ` Daniel Henrique Barboza
2022-10-21 17:01 ` [PATCH 3/3] target/ppc: Increment PMC5 with inline insns Leandro Lupori
2022-10-25 19:29   ` Daniel Henrique Barboza
2022-10-25 20:39     ` Leandro Lupori
