linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/8] New metricgroup output in perf stat default mode
@ 2023-06-07 16:26 kan.liang
  2023-06-07 16:26 ` [PATCH 1/8] perf metric: Fix no group check kan.liang
                   ` (7 more replies)
  0 siblings, 8 replies; 31+ messages in thread
From: kan.liang @ 2023-06-07 16:26 UTC (permalink / raw)
  To: acme, mingo, peterz, irogers, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel
  Cc: ak, eranian, ahmad.yasin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

In the default mode, the current output of the metricgroup include both
events and metrics, which is not necessary and makes the output hard to
read. Also, different ARCHs (even different generations of the ARCH) may
have a different output format because of the different events in a
metrics.

The patch proposes a new output format which only outputting the value
of each metric and the metricgroup name. It can brings a clean and
consistent output format among ARCHs and generations.

The first two patches are bug fixes for the current code.

The patches 3-6 introduce the new metricgroup output.

The patches 7-8 improve the tests to cover the default mode.

Here are some examples for the new output.

STD output:

On SPR

perf stat -a sleep 1

 Performance counter stats for 'system wide':

        226,054.13 msec cpu-clock                        #  224.588 CPUs utilized
               932      context-switches                 #    4.123 /sec
               224      cpu-migrations                   #    0.991 /sec
                76      page-faults                      #    0.336 /sec
        45,940,682      cycles                           #    0.000 GHz
        36,676,047      instructions                     #    0.80  insn per cycle
         7,044,516      branches                         #   31.163 K/sec
            62,169      branch-misses                    #    0.88% of all branches
                        TopdownL1                 #     68.7 %  tma_backend_bound
                                                  #      3.1 %  tma_bad_speculation
                                                  #     13.0 %  tma_frontend_bound
                                                  #     15.2 %  tma_retiring
                        TopdownL2                 #      2.7 %  tma_branch_mispredicts
                                                  #     19.6 %  tma_core_bound
                                                  #      4.8 %  tma_fetch_bandwidth
                                                  #      8.3 %  tma_fetch_latency
                                                  #      2.9 %  tma_heavy_operations
                                                  #     12.3 %  tma_light_operations
                                                  #      0.4 %  tma_machine_clears
                                                  #     49.1 %  tma_memory_bound

       1.006529767 seconds time elapsed

On Hybrid

perf stat -a sleep 1

 Performance counter stats for 'system wide':

         32,154.81 msec cpu-clock                        #   31.978 CPUs utilized
               165      context-switches                 #    5.131 /sec
                33      cpu-migrations                   #    1.026 /sec
                72      page-faults                      #    2.239 /sec
         5,653,347      cpu_core/cycles/                 #    0.000 GHz
         4,164,114      cpu_atom/cycles/                 #    0.000 GHz
         3,921,839      cpu_core/instructions/           #    0.69  insn per cycle
         2,142,800      cpu_atom/instructions/           #    0.38  insn per cycle
           713,629      cpu_core/branches/               #   22.194 K/sec
           452,838      cpu_atom/branches/               #   14.083 K/sec
            26,810      cpu_core/branch-misses/          #    3.76% of all branches
            26,029      cpu_atom/branch-misses/          #    3.65% of all branches
             TopdownL1 (cpu_core)                 #     32.0 %  tma_backend_bound
                                                  #      8.0 %  tma_bad_speculation
                                                  #     45.5 %  tma_frontend_bound
                                                  #     14.5 %  tma_retiring


JSON output

on SPR

perf stat --json -a sleep 1
{"counter-value" : "225904.823297", "unit" : "msec", "event" : "cpu-clock", "event-runtime" : 225904323425, "pcnt-running" : 100.00, "metric-value" : "224.456872", "metric-unit" : "CPUs utilized"}
{"counter-value" : "986.000000", "unit" : "", "event" : "context-switches", "event-runtime" : 225904108985, "pcnt-running" : 100.00, "metric-value" : "4.364670", "metric-unit" : "/sec"}
{"counter-value" : "224.000000", "unit" : "", "event" : "cpu-migrations", "event-runtime" : 225904016141, "pcnt-running" : 100.00, "metric-value" : "0.991568", "metric-unit" : "/sec"}
{"counter-value" : "76.000000", "unit" : "", "event" : "page-faults", "event-runtime" : 225903913270, "pcnt-running" : 100.00, "metric-value" : "0.336425", "metric-unit" : "/sec"}
{"counter-value" : "48433482.000000", "unit" : "", "event" : "cycles", "event-runtime" : 225903792732, "pcnt-running" : 100.00, "metric-value" : "0.000214", "metric-unit" : "GHz"}
{"counter-value" : "38620409.000000", "unit" : "", "event" : "instructions", "event-runtime" : 225903657830, "pcnt-running" : 100.00, "metric-value" : "0.797391", "metric-unit" : "insn per cycle"}
{"counter-value" : "7369473.000000", "unit" : "", "event" : "branches", "event-runtime" : 225903464328, "pcnt-running" : 100.00, "metric-value" : "32.622026", "metric-unit" : "K/sec"}
{"counter-value" : "54747.000000", "unit" : "", "event" : "branch-misses", "event-runtime" : 225903234523, "pcnt-running" : 100.00, "metric-value" : "0.742889", "metric-unit" : "of all branches"}
{"event-runtime" : 225902840555, "pcnt-running" : 100.00, "metricgroup" : "TopdownL1"}
{"metric-value" : "69.950631", "metric-unit" : "%  tma_backend_bound"}
{"metric-value" : "2.771783", "metric-unit" : "%  tma_bad_speculation"}
{"metric-value" : "12.026074", "metric-unit" : "%  tma_frontend_bound"}
{"metric-value" : "15.251513", "metric-unit" : "%  tma_retiring"}
{"event-runtime" : 225902840555, "pcnt-running" : 100.00, "metricgroup" : "TopdownL2"}
{"metric-value" : "2.351757", "metric-unit" : "%  tma_branch_mispredicts"}
{"metric-value" : "19.729771", "metric-unit" : "%  tma_core_bound"}
{"metric-value" : "4.555207", "metric-unit" : "%  tma_fetch_bandwidth"}
{"metric-value" : "7.470867", "metric-unit" : "%  tma_fetch_latency"}
{"metric-value" : "2.938808", "metric-unit" : "%  tma_heavy_operations"}
{"metric-value" : "12.312705", "metric-unit" : "%  tma_light_operations"}
{"metric-value" : "0.420026", "metric-unit" : "%  tma_machine_clears"}
{"metric-value" : "50.220860", "metric-unit" : "%  tma_memory_bound"}

On hybrid

perf stat --json -a sleep 1
{"counter-value" : "32150.838437", "unit" : "msec", "event" : "cpu-clock", "event-runtime" : 32150846654, "pcnt-running" : 100.00, "metric-value" : "31.981465", "metric-unit" : "CPUs utilized"}
{"counter-value" : "154.000000", "unit" : "", "event" : "context-switches", "event-runtime" : 32150849941, "pcnt-running" : 100.00, "metric-value" : "4.789922", "metric-unit" : "/sec"}
{"counter-value" : "32.000000", "unit" : "", "event" : "cpu-migrations", "event-runtime" : 32150851194, "pcnt-running" : 100.00, "metric-value" : "0.995308", "metric-unit" : "/sec"}
{"counter-value" : "73.000000", "unit" : "", "event" : "page-faults", "event-runtime" : 32150855128, "pcnt-running" : 100.00, "metric-value" : "2.270547", "metric-unit" : "/sec"}
{"counter-value" : "6404864.000000", "unit" : "", "event" : "cpu_core/cycles/", "event-runtime" : 16069765136, "pcnt-running" : 100.00, "metric-value" : "0.000199", "metric-unit" : "GHz"}
{"counter-value" : "3011411.000000", "unit" : "", "event" : "cpu_atom/cycles/", "event-runtime" : 16080917475, "pcnt-running" : 100.00, "metric-value" : "0.000094", "metric-unit" : "GHz"}
{"counter-value" : "4748155.000000", "unit" : "", "event" : "cpu_core/instructions/", "event-runtime" : 16069777198, "pcnt-running" : 100.00, "metric-value" : "0.741336", "metric-unit" : "insn per cycle"}
{"counter-value" : "1129678.000000", "unit" : "", "event" : "cpu_atom/instructions/", "event-runtime" : 16080933337, "pcnt-running" : 100.00, "metric-value" : "0.176378", "metric-unit" : "insn per cycle"}
{"counter-value" : "943319.000000", "unit" : "", "event" : "cpu_core/branches/", "event-runtime" : 16069771422, "pcnt-running" : 100.00, "metric-value" : "29.340417", "metric-unit" : "K/sec"}
{"counter-value" : "194500.000000", "unit" : "", "event" : "cpu_atom/branches/", "event-runtime" : 16080937169, "pcnt-running" : 100.00, "metric-value" : "6.049609", "metric-unit" : "K/sec"}
{"counter-value" : "31974.000000", "unit" : "", "event" : "cpu_core/branch-misses/", "event-runtime" : 16069759637, "pcnt-running" : 100.00, "metric-value" : "3.389521", "metric-unit" : "of all branches"}
{"counter-value" : "18643.000000", "unit" : "", "event" : "cpu_atom/branch-misses/", "event-runtime" : 16080929464, "pcnt-running" : 100.00, "metric-value" : "1.976320", "metric-unit" : "of all branches"}
{"event-runtime" : 16069747669, "pcnt-running" : 100.00, "metricgroup" : "TopdownL1 (cpu_core)"}
{"metric-value" : "30.939553", "metric-unit" : "%  tma_backend_bound"}
{"metric-value" : "8.303274", "metric-unit" : "%  tma_bad_speculation"}
{"metric-value" : "46.181223", "metric-unit" : "%  tma_frontend_bound"}
{"metric-value" : "14.575950", "metric-unit" : "%  tma_retiring"}


CSV output

On SPR

perf stat -x, -a sleep 1
225851.20,msec,cpu-clock,225850700108,100.00,224.431,CPUs utilized
976,,context-switches,225850504803,100.00,4.321,/sec
224,,cpu-migrations,225850410336,100.00,0.992,/sec
76,,page-faults,225850304155,100.00,0.337,/sec
52288305,,cycles,225850188531,100.00,0.000,GHz
37977214,,instructions,225850071251,100.00,0.73,insn per cycle
7299859,,branches,225849890722,100.00,32.322,K/sec
51102,,branch-misses,225849672536,100.00,0.70,of all branches
,225849327050,100.00,,,,TopdownL1
,,,,,70.1,%  tma_backend_bound
,,,,,2.7,%  tma_bad_speculation
,,,,,12.5,%  tma_frontend_bound
,,,,,14.6,%  tma_retiring
,225849327050,100.00,,,,TopdownL2
,,,,,2.3,%  tma_branch_mispredicts
,,,,,19.6,%  tma_core_bound
,,,,,4.6,%  tma_fetch_bandwidth
,,,,,7.9,%  tma_fetch_latency
,,,,,2.9,%  tma_heavy_operations
,,,,,11.7,%  tma_light_operations
,,,,,0.5,%  tma_machine_clears
,,,,,50.5,%  tma_memory_bound

On Hybrid

perf stat -x, -a sleep 1
32148.69,msec,cpu-clock,32148689529,100.00,31.974,CPUs utilized
168,,context-switches,32148707526,100.00,5.226,/sec
33,,cpu-migrations,32148718292,100.00,1.026,/sec
73,,page-faults,32148729436,100.00,2.271,/sec
8632400,,cpu_core/cycles/,16067477534,100.00,0.000,GHz
3359282,,cpu_atom/cycles/,16081105672,100.00,0.000,GHz
9222630,,cpu_core/instructions/,16067506390,100.00,1.07,insn per cycle
1256594,,cpu_atom/instructions/,16081131302,100.00,0.15,insn per cycle
1842167,,cpu_core/branches/,16067509544,100.00,57.301,K/sec
215437,,cpu_atom/branches/,16081139517,100.00,6.701,K/sec
38133,,cpu_core/branch-misses/,16067511463,100.00,2.07,of all branches
20857,,cpu_atom/branch-misses/,16081135654,100.00,1.13,of all branches
,16067501860,100.00,,,,TopdownL1 (cpu_core)
,,,,,30.6,%  tma_backend_bound
,,,,,7.8,%  tma_bad_speculation
,,,,,42.0,%  tma_frontend_bound
,,,,,19.6,%  tma_retiring


Kan Liang (8):
  perf metric: Fix no group check
  perf evsel: Fix the annotation for hardware events on hybrid
  perf metric: JSON flag to default metric group
  perf vendor events arm64: Add default tags into topdown L1 metrics
  perf stat,jevents: Introduce Default tags for the default mode
  perf stat,metrics: New metrics output for the default mode
  pert tests: Support metricgroup perf stat JSON output
  perf test: Add test case for the standard perf stat output

 tools/perf/builtin-stat.c                     |   5 +-
 tools/perf/pmu-events/arch/arm64/sbsa.json    |  12 +-
 .../arch/x86/alderlake/adl-metrics.json       |  20 +-
 .../arch/x86/icelake/icl-metrics.json         |  20 +-
 .../arch/x86/icelakex/icx-metrics.json        |  20 +-
 .../arch/x86/sapphirerapids/spr-metrics.json  |  60 ++--
 .../arch/x86/tigerlake/tgl-metrics.json       |  20 +-
 tools/perf/pmu-events/jevents.py              |   5 +-
 tools/perf/pmu-events/pmu-events.h            |   1 +
 .../tests/shell/lib/perf_json_output_lint.py  |   3 +
 tools/perf/tests/shell/stat+std_output.sh     | 259 ++++++++++++++++++
 tools/perf/util/evsel.h                       |  13 +-
 tools/perf/util/metricgroup.c                 | 111 +++++++-
 tools/perf/util/metricgroup.h                 |   1 +
 tools/perf/util/stat-display.c                |  69 ++++-
 tools/perf/util/stat-shadow.c                 |  39 +--
 16 files changed, 564 insertions(+), 94 deletions(-)
 create mode 100755 tools/perf/tests/shell/stat+std_output.sh

-- 
2.35.1


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/8] perf metric: Fix no group check
  2023-06-07 16:26 [PATCH 0/8] New metricgroup output in perf stat default mode kan.liang
@ 2023-06-07 16:26 ` kan.liang
  2023-06-13 19:22   ` Ian Rogers
  2023-06-07 16:26 ` [PATCH 2/8] perf evsel: Fix the annotation for hardware events on hybrid kan.liang
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: kan.liang @ 2023-06-07 16:26 UTC (permalink / raw)
  To: acme, mingo, peterz, irogers, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel
  Cc: ak, eranian, ahmad.yasin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The no group check fails if there is more than one meticgroup in the
metricgroup_no_group.

The first parameter of the match_metric() should be the string, while
the substring should be the second parameter.

Fixes: ccc66c609280 ("perf metric: JSON flag to not group events if gathering a metric group")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/util/metricgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index 70ef2e23a710..74f2d8efc02d 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -1175,7 +1175,7 @@ static int metricgroup__add_metric_callback(const struct pmu_metric *pm,
 
 	if (pm->metric_expr && match_pm_metric(pm, data->pmu, data->metric_name)) {
 		bool metric_no_group = data->metric_no_group ||
-			match_metric(data->metric_name, pm->metricgroup_no_group);
+			match_metric(pm->metricgroup_no_group, data->metric_name);
 
 		data->has_match = true;
 		ret = add_metric(data->list, pm, data->modifier, metric_no_group,
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/8] perf evsel: Fix the annotation for hardware events on hybrid
  2023-06-07 16:26 [PATCH 0/8] New metricgroup output in perf stat default mode kan.liang
  2023-06-07 16:26 ` [PATCH 1/8] perf metric: Fix no group check kan.liang
@ 2023-06-07 16:26 ` kan.liang
  2023-06-13 19:35   ` Ian Rogers
  2023-06-07 16:26 ` [PATCH 3/8] perf metric: JSON flag to default metric group kan.liang
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: kan.liang @ 2023-06-07 16:26 UTC (permalink / raw)
  To: acme, mingo, peterz, irogers, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel
  Cc: ak, eranian, ahmad.yasin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The annotation for hardware events is wrong on hybrid. For example,

 # ./perf stat -a sleep 1

 Performance counter stats for 'system wide':

         32,148.85 msec cpu-clock                        #   32.000 CPUs utilized
               374      context-switches                 #   11.633 /sec
                33      cpu-migrations                   #    1.026 /sec
               295      page-faults                      #    9.176 /sec
        18,979,960      cpu_core/cycles/                 #  590.378 K/sec
       261,230,783      cpu_atom/cycles/                 #    8.126 M/sec                       (54.21%)
        17,019,732      cpu_core/instructions/           #  529.404 K/sec
        38,020,470      cpu_atom/instructions/           #    1.183 M/sec                       (63.36%)
         3,296,743      cpu_core/branches/               #  102.546 K/sec
         6,692,338      cpu_atom/branches/               #  208.167 K/sec                       (63.40%)
            96,421      cpu_core/branch-misses/          #    2.999 K/sec
         1,016,336      cpu_atom/branch-misses/          #   31.613 K/sec                       (63.38%)

The hardware events have extended type on hybrid, but the evsel__match()
doesn't take it into account.

Add a mask to filter the extended type on hybrid when checking the config.

With the patch,

 # ./perf stat -a sleep 1

 Performance counter stats for 'system wide':

         32,139.90 msec cpu-clock                        #   32.003 CPUs utilized
               343      context-switches                 #   10.672 /sec
                32      cpu-migrations                   #    0.996 /sec
                73      page-faults                      #    2.271 /sec
        13,712,841      cpu_core/cycles/                 #    0.000 GHz
       258,301,691      cpu_atom/cycles/                 #    0.008 GHz                         (54.20%)
        12,428,163      cpu_core/instructions/           #    0.91  insn per cycle
        37,786,557      cpu_atom/instructions/           #    2.76  insn per cycle              (63.35%)
         2,418,826      cpu_core/branches/               #   75.259 K/sec
         6,965,962      cpu_atom/branches/               #  216.739 K/sec                       (63.38%)
            72,150      cpu_core/branch-misses/          #    2.98% of all branches
         1,032,746      cpu_atom/branch-misses/          #   42.70% of all branches             (63.35%)

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/util/evsel.h       | 12 ++++++-----
 tools/perf/util/stat-shadow.c | 39 +++++++++++++++++++----------------
 2 files changed, 28 insertions(+), 23 deletions(-)

diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index b365b449c6ea..36a32e4ca168 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -350,9 +350,11 @@ u64 format_field__intval(struct tep_format_field *field, struct perf_sample *sam
 
 struct tep_format_field *evsel__field(struct evsel *evsel, const char *name);
 
-#define evsel__match(evsel, t, c)		\
+#define EVSEL_EVENT_MASK			(~0ULL)
+
+#define evsel__match(evsel, t, c, m)			\
 	(evsel->core.attr.type == PERF_TYPE_##t &&	\
-	 evsel->core.attr.config == PERF_COUNT_##c)
+	 (evsel->core.attr.config & m) == PERF_COUNT_##c)
 
 static inline bool evsel__match2(struct evsel *e1, struct evsel *e2)
 {
@@ -438,13 +440,13 @@ bool evsel__is_function_event(struct evsel *evsel);
 
 static inline bool evsel__is_bpf_output(struct evsel *evsel)
 {
-	return evsel__match(evsel, SOFTWARE, SW_BPF_OUTPUT);
+	return evsel__match(evsel, SOFTWARE, SW_BPF_OUTPUT, EVSEL_EVENT_MASK);
 }
 
 static inline bool evsel__is_clock(const struct evsel *evsel)
 {
-	return evsel__match(evsel, SOFTWARE, SW_CPU_CLOCK) ||
-	       evsel__match(evsel, SOFTWARE, SW_TASK_CLOCK);
+	return evsel__match(evsel, SOFTWARE, SW_CPU_CLOCK, EVSEL_EVENT_MASK) ||
+	       evsel__match(evsel, SOFTWARE, SW_TASK_CLOCK, EVSEL_EVENT_MASK);
 }
 
 bool evsel__fallback(struct evsel *evsel, int err, char *msg, size_t msgsize);
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 1566a206ba42..074f38b57e2d 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -6,6 +6,7 @@
 #include "color.h"
 #include "debug.h"
 #include "pmu.h"
+#include "pmus.h"
 #include "rblist.h"
 #include "evlist.h"
 #include "expr.h"
@@ -78,6 +79,8 @@ void perf_stat__reset_shadow_stats(void)
 
 static enum stat_type evsel__stat_type(const struct evsel *evsel)
 {
+	u64 mask = perf_pmus__supports_extended_type() ? PERF_HW_EVENT_MASK : EVSEL_EVENT_MASK;
+
 	/* Fake perf_hw_cache_op_id values for use with evsel__match. */
 	u64 PERF_COUNT_hw_cache_l1d_miss = PERF_COUNT_HW_CACHE_L1D |
 		((PERF_COUNT_HW_CACHE_OP_READ) << 8) |
@@ -97,41 +100,41 @@ static enum stat_type evsel__stat_type(const struct evsel *evsel)
 
 	if (evsel__is_clock(evsel))
 		return STAT_NSECS;
-	else if (evsel__match(evsel, HARDWARE, HW_CPU_CYCLES))
+	else if (evsel__match(evsel, HARDWARE, HW_CPU_CYCLES, mask))
 		return STAT_CYCLES;
-	else if (evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS))
+	else if (evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS, mask))
 		return STAT_INSTRUCTIONS;
-	else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
+	else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_FRONTEND, mask))
 		return STAT_STALLED_CYCLES_FRONT;
-	else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_BACKEND))
+	else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_BACKEND, mask))
 		return STAT_STALLED_CYCLES_BACK;
-	else if (evsel__match(evsel, HARDWARE, HW_BRANCH_INSTRUCTIONS))
+	else if (evsel__match(evsel, HARDWARE, HW_BRANCH_INSTRUCTIONS, mask))
 		return STAT_BRANCHES;
-	else if (evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES))
+	else if (evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES, mask))
 		return STAT_BRANCH_MISS;
-	else if (evsel__match(evsel, HARDWARE, HW_CACHE_REFERENCES))
+	else if (evsel__match(evsel, HARDWARE, HW_CACHE_REFERENCES, mask))
 		return STAT_CACHE_REFS;
-	else if (evsel__match(evsel, HARDWARE, HW_CACHE_MISSES))
+	else if (evsel__match(evsel, HARDWARE, HW_CACHE_MISSES, mask))
 		return STAT_CACHE_MISSES;
-	else if (evsel__match(evsel, HW_CACHE, HW_CACHE_L1D))
+	else if (evsel__match(evsel, HW_CACHE, HW_CACHE_L1D, mask))
 		return STAT_L1_DCACHE;
-	else if (evsel__match(evsel, HW_CACHE, HW_CACHE_L1I))
+	else if (evsel__match(evsel, HW_CACHE, HW_CACHE_L1I, mask))
 		return STAT_L1_ICACHE;
-	else if (evsel__match(evsel, HW_CACHE, HW_CACHE_LL))
+	else if (evsel__match(evsel, HW_CACHE, HW_CACHE_LL, mask))
 		return STAT_LL_CACHE;
-	else if (evsel__match(evsel, HW_CACHE, HW_CACHE_DTLB))
+	else if (evsel__match(evsel, HW_CACHE, HW_CACHE_DTLB, mask))
 		return STAT_DTLB_CACHE;
-	else if (evsel__match(evsel, HW_CACHE, HW_CACHE_ITLB))
+	else if (evsel__match(evsel, HW_CACHE, HW_CACHE_ITLB, mask))
 		return STAT_ITLB_CACHE;
-	else if (evsel__match(evsel, HW_CACHE, hw_cache_l1d_miss))
+	else if (evsel__match(evsel, HW_CACHE, hw_cache_l1d_miss, mask))
 		return STAT_L1D_MISS;
-	else if (evsel__match(evsel, HW_CACHE, hw_cache_l1i_miss))
+	else if (evsel__match(evsel, HW_CACHE, hw_cache_l1i_miss, mask))
 		return STAT_L1I_MISS;
-	else if (evsel__match(evsel, HW_CACHE, hw_cache_ll_miss))
+	else if (evsel__match(evsel, HW_CACHE, hw_cache_ll_miss, mask))
 		return STAT_LL_MISS;
-	else if (evsel__match(evsel, HW_CACHE, hw_cache_dtlb_miss))
+	else if (evsel__match(evsel, HW_CACHE, hw_cache_dtlb_miss, mask))
 		return STAT_DTLB_MISS;
-	else if (evsel__match(evsel, HW_CACHE, hw_cache_itlb_miss))
+	else if (evsel__match(evsel, HW_CACHE, hw_cache_itlb_miss, mask))
 		return STAT_ITLB_MISS;
 	return STAT_NONE;
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/8] perf metric: JSON flag to default metric group
  2023-06-07 16:26 [PATCH 0/8] New metricgroup output in perf stat default mode kan.liang
  2023-06-07 16:26 ` [PATCH 1/8] perf metric: Fix no group check kan.liang
  2023-06-07 16:26 ` [PATCH 2/8] perf evsel: Fix the annotation for hardware events on hybrid kan.liang
@ 2023-06-07 16:26 ` kan.liang
  2023-06-13 19:44   ` Ian Rogers
  2023-06-07 16:26 ` [PATCH 4/8] perf vendor events arm64: Add default tags into topdown L1 metrics kan.liang
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: kan.liang @ 2023-06-07 16:26 UTC (permalink / raw)
  To: acme, mingo, peterz, irogers, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel
  Cc: ak, eranian, ahmad.yasin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

For the default output, the default metric group could vary on different
platforms. For example, on SPR, the TopdownL1 and TopdownL2 metrics
should be displayed in the default mode. On ICL, only the TopdownL1
should be displayed.

Add a flag so we can tag the default metric group for different
platforms rather than hack the perf code.

The flag is added to Intel TopdownL1 since ICL and TopdownL2 metrics
since SPR.

Add a new field, DefaultMetricgroupName, in the JSON file to indicate
the real metric group name.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 .../arch/x86/alderlake/adl-metrics.json       | 20 ++++---
 .../arch/x86/icelake/icl-metrics.json         | 20 ++++---
 .../arch/x86/icelakex/icx-metrics.json        | 20 ++++---
 .../arch/x86/sapphirerapids/spr-metrics.json  | 60 +++++++++++--------
 .../arch/x86/tigerlake/tgl-metrics.json       | 20 ++++---
 5 files changed, 84 insertions(+), 56 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
index c9f7e3d4ab08..e78c85220e27 100644
--- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
@@ -832,22 +832,24 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_backend_bound",
         "MetricThreshold": "tma_backend_bound > 0.2",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
     },
     {
         "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
@@ -1112,11 +1114,12 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
-        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
@@ -2316,11 +2319,12 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_retiring",
         "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
         "ScaleUnit": "100%",
         "Unit": "cpu_core"
diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
index 20210742171d..cc4edf855064 100644
--- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
@@ -111,21 +111,23 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_backend_bound",
         "MetricThreshold": "tma_backend_bound > 0.2",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
         "ScaleUnit": "100%"
     },
@@ -372,11 +374,12 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
-        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
         "ScaleUnit": "100%"
     },
@@ -1378,11 +1381,12 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_retiring",
         "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
         "ScaleUnit": "100%"
     },
diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
index ef25cda019be..6f25b5b7aaf6 100644
--- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
@@ -315,21 +315,23 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_backend_bound",
         "MetricThreshold": "tma_backend_bound > 0.2",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
         "ScaleUnit": "100%"
     },
@@ -576,11 +578,12 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
-        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
         "ScaleUnit": "100%"
     },
@@ -1674,11 +1677,12 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_retiring",
         "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
         "ScaleUnit": "100%"
     },
diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
index 4f3dd85540b6..c732982f70b5 100644
--- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
@@ -340,31 +340,34 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_backend_bound",
         "MetricThreshold": "tma_backend_bound > 0.2",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
+        "DefaultMetricgroupName": "TopdownL2",
         "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
-        "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
+        "MetricGroup": "BadSpec;BrMispredicts;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
         "MetricName": "tma_branch_mispredicts",
         "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
-        "MetricgroupNoGroup": "TopdownL2",
+        "MetricgroupNoGroup": "TopdownL2;Default",
         "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction.  These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
         "ScaleUnit": "100%"
     },
@@ -407,11 +410,12 @@
     },
     {
         "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
+        "DefaultMetricgroupName": "TopdownL2",
         "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
-        "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
         "MetricName": "tma_core_bound",
         "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
-        "MetricgroupNoGroup": "TopdownL2",
+        "MetricgroupNoGroup": "TopdownL2;Default",
         "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck.  Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
         "ScaleUnit": "100%"
     },
@@ -509,21 +513,23 @@
     },
     {
         "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
+        "DefaultMetricgroupName": "TopdownL2",
         "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
-        "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
+        "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
         "MetricName": "tma_fetch_bandwidth",
         "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 6 > 0.35",
-        "MetricgroupNoGroup": "TopdownL2",
+        "MetricgroupNoGroup": "TopdownL2;Default",
         "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues.  For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
+        "DefaultMetricgroupName": "TopdownL2",
         "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
-        "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
+        "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
         "MetricName": "tma_fetch_latency",
         "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
-        "MetricgroupNoGroup": "TopdownL2",
+        "MetricgroupNoGroup": "TopdownL2;Default",
         "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
         "ScaleUnit": "100%"
     },
@@ -611,11 +617,12 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
-        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
         "ScaleUnit": "100%"
     },
@@ -630,11 +637,12 @@
     },
     {
         "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
+        "DefaultMetricgroupName": "TopdownL2",
         "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
-        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_heavy_operations",
         "MetricThreshold": "tma_heavy_operations > 0.1",
-        "MetricgroupNoGroup": "TopdownL2",
+        "MetricgroupNoGroup": "TopdownL2;Default",
         "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY",
         "ScaleUnit": "100%"
     },
@@ -1486,11 +1494,12 @@
     },
     {
         "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
+        "DefaultMetricgroupName": "TopdownL2",
         "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
-        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
+        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
         "MetricName": "tma_light_operations",
         "MetricThreshold": "tma_light_operations > 0.6",
-        "MetricgroupNoGroup": "TopdownL2",
+        "MetricgroupNoGroup": "TopdownL2;Default",
         "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
         "ScaleUnit": "100%"
     },
@@ -1540,11 +1549,12 @@
     },
     {
         "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
+        "DefaultMetricgroupName": "TopdownL2",
         "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
-        "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
+        "MetricGroup": "BadSpec;Default;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
         "MetricName": "tma_machine_clears",
         "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
-        "MetricgroupNoGroup": "TopdownL2",
+        "MetricgroupNoGroup": "TopdownL2;Default",
         "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears.  These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
         "ScaleUnit": "100%"
     },
@@ -1576,11 +1586,12 @@
     },
     {
         "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
+        "DefaultMetricgroupName": "TopdownL2",
         "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
-        "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
+        "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
         "MetricName": "tma_memory_bound",
         "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
-        "MetricgroupNoGroup": "TopdownL2",
+        "MetricgroupNoGroup": "TopdownL2;Default",
         "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck.  Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
         "ScaleUnit": "100%"
     },
@@ -1784,11 +1795,12 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_retiring",
         "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
         "ScaleUnit": "100%"
     },
diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
index d0538a754288..83346911aa63 100644
--- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
+++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
@@ -105,21 +105,23 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_backend_bound",
         "MetricThreshold": "tma_backend_bound > 0.2",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
         "ScaleUnit": "100%"
     },
     {
         "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_bad_speculation",
         "MetricThreshold": "tma_bad_speculation > 0.15",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
         "ScaleUnit": "100%"
     },
@@ -366,11 +368,12 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
-        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_frontend_bound",
         "MetricThreshold": "tma_frontend_bound > 0.15",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
         "ScaleUnit": "100%"
     },
@@ -1392,11 +1395,12 @@
     },
     {
         "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
+        "DefaultMetricgroupName": "TopdownL1",
         "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
-        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
+        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
         "MetricName": "tma_retiring",
         "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
-        "MetricgroupNoGroup": "TopdownL1",
+        "MetricgroupNoGroup": "TopdownL1;Default",
         "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
         "ScaleUnit": "100%"
     },
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 4/8] perf vendor events arm64: Add default tags into topdown L1 metrics
  2023-06-07 16:26 [PATCH 0/8] New metricgroup output in perf stat default mode kan.liang
                   ` (2 preceding siblings ...)
  2023-06-07 16:26 ` [PATCH 3/8] perf metric: JSON flag to default metric group kan.liang
@ 2023-06-07 16:26 ` kan.liang
  2023-06-13 19:45   ` Ian Rogers
  2023-06-14 14:30   ` John Garry
  2023-06-07 16:26 ` [PATCH 5/8] perf stat,jevents: Introduce Default tags for the default mode kan.liang
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 31+ messages in thread
From: kan.liang @ 2023-06-07 16:26 UTC (permalink / raw)
  To: acme, mingo, peterz, irogers, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel
  Cc: ak, eranian, ahmad.yasin, Kan Liang, Jing Zhang, John Garry

From: Kan Liang <kan.liang@linux.intel.com>

Add the default tags for ARM as well.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Jing Zhang <renyu.zj@linux.alibaba.com>
Cc: John Garry <john.g.garry@oracle.com>
---
 tools/perf/pmu-events/arch/arm64/sbsa.json | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/tools/perf/pmu-events/arch/arm64/sbsa.json b/tools/perf/pmu-events/arch/arm64/sbsa.json
index f678c37ea9c3..f90b338261ac 100644
--- a/tools/perf/pmu-events/arch/arm64/sbsa.json
+++ b/tools/perf/pmu-events/arch/arm64/sbsa.json
@@ -2,28 +2,32 @@
     {
         "MetricExpr": "stall_slot_frontend / (#slots * cpu_cycles)",
         "BriefDescription": "Frontend bound L1 topdown metric",
-        "MetricGroup": "TopdownL1",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricGroup": "Default;TopdownL1",
         "MetricName": "frontend_bound",
         "ScaleUnit": "100%"
     },
     {
         "MetricExpr": "(1 - op_retired / op_spec) * (1 - stall_slot / (#slots * cpu_cycles))",
         "BriefDescription": "Bad speculation L1 topdown metric",
-        "MetricGroup": "TopdownL1",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricGroup": "Default;TopdownL1",
         "MetricName": "bad_speculation",
         "ScaleUnit": "100%"
     },
     {
         "MetricExpr": "(op_retired / op_spec) * (1 - stall_slot / (#slots * cpu_cycles))",
         "BriefDescription": "Retiring L1 topdown metric",
-        "MetricGroup": "TopdownL1",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricGroup": "Default;TopdownL1",
         "MetricName": "retiring",
         "ScaleUnit": "100%"
     },
     {
         "MetricExpr": "stall_slot_backend / (#slots * cpu_cycles)",
         "BriefDescription": "Backend Bound L1 topdown metric",
-        "MetricGroup": "TopdownL1",
+        "DefaultMetricgroupName": "TopdownL1",
+        "MetricGroup": "Default;TopdownL1",
         "MetricName": "backend_bound",
         "ScaleUnit": "100%"
     }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 5/8] perf stat,jevents: Introduce Default tags for the default mode
  2023-06-07 16:26 [PATCH 0/8] New metricgroup output in perf stat default mode kan.liang
                   ` (3 preceding siblings ...)
  2023-06-07 16:26 ` [PATCH 4/8] perf vendor events arm64: Add default tags into topdown L1 metrics kan.liang
@ 2023-06-07 16:26 ` kan.liang
  2023-06-13 19:59   ` Ian Rogers
  2023-06-07 16:26 ` [PATCH 6/8] perf stat,metrics: New metricgroup output " kan.liang
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 31+ messages in thread
From: kan.liang @ 2023-06-07 16:26 UTC (permalink / raw)
  To: acme, mingo, peterz, irogers, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel
  Cc: ak, eranian, ahmad.yasin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Introduce a new metricgroup, Default, to tag all the metric groups which
will be collected in the default mode.

Add a new field, DefaultMetricgroupName, in the JSON file to indicate
the real metric group name. It will be printed in the default output
to replace the event names.

There is nothing changed for the output format.

On SPR, both TopdownL1 and TopdownL2 are displayed in the default
output.

On ARM, Intel ICL and later platforms (before SPR), only TopdownL1 is
displayed in the default output.

Suggested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/builtin-stat.c          | 4 ++--
 tools/perf/pmu-events/jevents.py   | 5 +++--
 tools/perf/pmu-events/pmu-events.h | 1 +
 tools/perf/util/metricgroup.c      | 3 +++
 4 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index c87c6897edc9..2269b3e90e9b 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -2154,14 +2154,14 @@ static int add_default_attributes(void)
 		 * Add TopdownL1 metrics if they exist. To minimize
 		 * multiplexing, don't request threshold computation.
 		 */
-		if (metricgroup__has_metric(pmu, "TopdownL1")) {
+		if (metricgroup__has_metric(pmu, "Default")) {
 			struct evlist *metric_evlist = evlist__new();
 			struct evsel *metric_evsel;
 
 			if (!metric_evlist)
 				return -1;
 
-			if (metricgroup__parse_groups(metric_evlist, pmu, "TopdownL1",
+			if (metricgroup__parse_groups(metric_evlist, pmu, "Default",
 							/*metric_no_group=*/false,
 							/*metric_no_merge=*/false,
 							/*metric_no_threshold=*/true,
diff --git a/tools/perf/pmu-events/jevents.py b/tools/perf/pmu-events/jevents.py
index 7ed258be1829..12e80bb7939b 100755
--- a/tools/perf/pmu-events/jevents.py
+++ b/tools/perf/pmu-events/jevents.py
@@ -54,8 +54,8 @@ _json_event_attributes = [
 # Attributes that are in pmu_metric rather than pmu_event.
 _json_metric_attributes = [
     'pmu', 'metric_name', 'metric_group', 'metric_expr', 'metric_threshold',
-    'desc', 'long_desc', 'unit', 'compat', 'metricgroup_no_group', 'aggr_mode',
-    'event_grouping'
+    'desc', 'long_desc', 'unit', 'compat', 'metricgroup_no_group',
+    'default_metricgroup_name', 'aggr_mode', 'event_grouping'
 ]
 # Attributes that are bools or enum int values, encoded as '0', '1',...
 _json_enum_attributes = ['aggr_mode', 'deprecated', 'event_grouping', 'perpkg']
@@ -307,6 +307,7 @@ class JsonEvent:
     self.metric_name = jd.get('MetricName')
     self.metric_group = jd.get('MetricGroup')
     self.metricgroup_no_group = jd.get('MetricgroupNoGroup')
+    self.default_metricgroup_name = jd.get('DefaultMetricgroupName')
     self.event_grouping = convert_metric_constraint(jd.get('MetricConstraint'))
     self.metric_expr = None
     if 'MetricExpr' in jd:
diff --git a/tools/perf/pmu-events/pmu-events.h b/tools/perf/pmu-events/pmu-events.h
index 8cd23d656a5d..caf59f23cd64 100644
--- a/tools/perf/pmu-events/pmu-events.h
+++ b/tools/perf/pmu-events/pmu-events.h
@@ -61,6 +61,7 @@ struct pmu_metric {
 	const char *desc;
 	const char *long_desc;
 	const char *metricgroup_no_group;
+	const char *default_metricgroup_name;
 	enum aggr_mode_class aggr_mode;
 	enum metric_event_groups event_grouping;
 };
diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index 74f2d8efc02d..efafa02db5e5 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -137,6 +137,8 @@ struct metric {
 	 * output.
 	 */
 	const char *metric_unit;
+	/** Optional default metric group name */
+	const char *default_metricgroup_name;
 	/** Optional null terminated array of referenced metrics. */
 	struct metric_ref *metric_refs;
 	/**
@@ -219,6 +221,7 @@ static struct metric *metric__new(const struct pmu_metric *pm,
 
 	m->pmu = pm->pmu ?: "cpu";
 	m->metric_name = pm->metric_name;
+	m->default_metricgroup_name = pm->default_metricgroup_name;
 	m->modifier = NULL;
 	if (modifier) {
 		m->modifier = strdup(modifier);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 6/8] perf stat,metrics: New metricgroup output for the default mode
  2023-06-07 16:26 [PATCH 0/8] New metricgroup output in perf stat default mode kan.liang
                   ` (4 preceding siblings ...)
  2023-06-07 16:26 ` [PATCH 5/8] perf stat,jevents: Introduce Default tags for the default mode kan.liang
@ 2023-06-07 16:26 ` kan.liang
  2023-06-13 20:16   ` Ian Rogers
  2023-06-07 16:26 ` [PATCH 7/8] pert tests: Support metricgroup perf stat JSON output kan.liang
  2023-06-07 16:27 ` [PATCH 8/8] perf test: Add test case for the standard perf stat output kan.liang
  7 siblings, 1 reply; 31+ messages in thread
From: kan.liang @ 2023-06-07 16:26 UTC (permalink / raw)
  To: acme, mingo, peterz, irogers, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel
  Cc: ak, eranian, ahmad.yasin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

In the default mode, the current output of the metricgroup include both
events and metrics, which is not necessary and just makes the output
hard to read. Since different ARCHs (even different generations in the
same ARCH) may use different events. The output also vary on different
platforms.

For a metricgroup, only outputting the value of each metric is good
enough.

Current perf may append different metric groups to the same leader
event, or append the metrics from the same metricgroup to different
events. That could bring confusion when perf only prints the
metricgroup output mode. For example, print the same metricgroup name
several times.
Reorganize metricgroup for the default mode and make sure that
a metricgroup can only be appended to one event.
Sort the metricgroup for the default mode by the name of the
metricgroup.

Add a new field default_metricgroup in evsel to indicate an event of
the default metricgroup. For those events, printout() should print
the metricgroup name rather than events.

Add print_metricgroup_header() to print out the metricgroup name in
different output formats.

On SPR
Before:

 ./perf_old stat sleep 1

 Performance counter stats for 'sleep 1':

              0.54 msec task-clock:u                     #    0.001 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                68      page-faults:u                    #  125.445 K/sec
           540,970      cycles:u                         #    0.998 GHz
           556,325      instructions:u                   #    1.03  insn per cycle
           123,602      branches:u                       #  228.018 M/sec
             6,889      branch-misses:u                  #    5.57% of all branches
         3,245,820      TOPDOWN.SLOTS:u                  #     18.4 %  tma_backend_bound
                                                  #     17.2 %  tma_retiring
                                                  #     23.1 %  tma_bad_speculation
                                                  #     41.4 %  tma_frontend_bound
           564,859      topdown-retiring:u
         1,370,999      topdown-fe-bound:u
           603,271      topdown-be-bound:u
           744,874      topdown-bad-spec:u
            12,661      INT_MISC.UOP_DROPPING:u          #   23.357 M/sec

       1.001798215 seconds time elapsed

       0.000193000 seconds user
       0.001700000 seconds sys

After:

$ ./perf stat sleep 1

 Performance counter stats for 'sleep 1':

              0.51 msec task-clock:u                     #    0.001 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                68      page-faults:u                    #  132.683 K/sec
           545,228      cycles:u                         #    1.064 GHz
           555,509      instructions:u                   #    1.02  insn per cycle
           123,574      branches:u                       #  241.120 M/sec
             6,957      branch-misses:u                  #    5.63% of all branches
                        TopdownL1                 #     17.5 %  tma_backend_bound
                                                  #     22.6 %  tma_bad_speculation
                                                  #     42.7 %  tma_frontend_bound
                                                  #     17.1 %  tma_retiring
                        TopdownL2                 #     21.8 %  tma_branch_mispredicts
                                                  #     11.5 %  tma_core_bound
                                                  #     13.4 %  tma_fetch_bandwidth
                                                  #     29.3 %  tma_fetch_latency
                                                  #      2.7 %  tma_heavy_operations
                                                  #     14.5 %  tma_light_operations
                                                  #      0.8 %  tma_machine_clears
                                                  #      6.1 %  tma_memory_bound

       1.001712086 seconds time elapsed

       0.000151000 seconds user
       0.001618000 seconds sys


Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/builtin-stat.c      |   1 +
 tools/perf/util/evsel.h        |   1 +
 tools/perf/util/metricgroup.c  | 106 ++++++++++++++++++++++++++++++++-
 tools/perf/util/metricgroup.h  |   1 +
 tools/perf/util/stat-display.c |  69 ++++++++++++++++++++-
 5 files changed, 172 insertions(+), 6 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 2269b3e90e9b..b274cc264d56 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -2172,6 +2172,7 @@ static int add_default_attributes(void)
 
 			evlist__for_each_entry(metric_evlist, metric_evsel) {
 				metric_evsel->skippable = true;
+				metric_evsel->default_metricgroup = true;
 			}
 			evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
 			evlist__delete(metric_evlist);
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 36a32e4ca168..61b1385108f4 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -130,6 +130,7 @@ struct evsel {
 	bool			reset_group;
 	bool			errored;
 	bool			needs_auxtrace_mmap;
+	bool			default_metricgroup;
 	struct hashmap		*per_pkg_mask;
 	int			err;
 	struct {
diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
index efafa02db5e5..22181ce4f27f 100644
--- a/tools/perf/util/metricgroup.c
+++ b/tools/perf/util/metricgroup.c
@@ -79,6 +79,7 @@ static struct rb_node *metric_event_new(struct rblist *rblist __maybe_unused,
 		return NULL;
 	memcpy(me, entry, sizeof(struct metric_event));
 	me->evsel = ((struct metric_event *)entry)->evsel;
+	me->default_metricgroup_name = NULL;
 	INIT_LIST_HEAD(&me->head);
 	return &me->nd;
 }
@@ -1133,14 +1134,19 @@ static int metricgroup__add_metric_sys_event_iter(const struct pmu_metric *pm,
 /**
  * metric_list_cmp - list_sort comparator that sorts metrics with more events to
  *                   the front. tool events are excluded from the count.
+ *                   For the default metrics, sort them by metricgroup name.
  */
-static int metric_list_cmp(void *priv __maybe_unused, const struct list_head *l,
+static int metric_list_cmp(void *priv, const struct list_head *l,
 			   const struct list_head *r)
 {
 	const struct metric *left = container_of(l, struct metric, nd);
 	const struct metric *right = container_of(r, struct metric, nd);
 	struct expr_id_data *data;
 	int i, left_count, right_count;
+	bool is_default = *(bool *)priv;
+
+	if (is_default && left->default_metricgroup_name && right->default_metricgroup_name)
+		return strcmp(left->default_metricgroup_name, right->default_metricgroup_name);
 
 	left_count = hashmap__size(left->pctx->ids);
 	perf_tool_event__for_each_event(i) {
@@ -1497,6 +1503,91 @@ static int parse_ids(bool metric_no_merge, struct perf_pmu *fake_pmu,
 	return ret;
 }
 
+static struct metric_event *
+metricgroup__lookup_default_metricgroup(struct rblist *metric_events,
+					struct evsel *evsel,
+					struct metric *m)
+{
+	struct metric_event *me;
+	char *name;
+	int err;
+
+	me = metricgroup__lookup(metric_events, evsel, true);
+	if (!me->default_metricgroup_name) {
+		if (m->pmu && strcmp(m->pmu, "cpu"))
+			err = asprintf(&name, "%s (%s)", m->default_metricgroup_name, m->pmu);
+		else
+			err = asprintf(&name, "%s", m->default_metricgroup_name);
+		if (err < 0)
+			return NULL;
+		me->default_metricgroup_name = name;
+	}
+	if (!strncmp(m->default_metricgroup_name,
+		     me->default_metricgroup_name,
+		     strlen(m->default_metricgroup_name)))
+		return me;
+
+	return NULL;
+}
+
+static struct metric_event *
+metricgroup__lookup_create(struct rblist *metric_events,
+			   struct evsel **evsel,
+			   struct list_head *metric_list,
+			   struct metric *m,
+			   bool is_default)
+{
+	struct metric_event *me;
+	struct metric *cur;
+	struct evsel *ev;
+	size_t i;
+
+	if (!is_default)
+		return metricgroup__lookup(metric_events, evsel[0], true);
+
+	/*
+	 * If the metric group has been attached to a previous
+	 * event/metric, use that metric event.
+	 */
+	list_for_each_entry(cur, metric_list, nd) {
+		if (cur == m)
+			break;
+		if (cur->pmu && strcmp(m->pmu, cur->pmu))
+			continue;
+		if (strncmp(m->default_metricgroup_name,
+			    cur->default_metricgroup_name,
+			    strlen(m->default_metricgroup_name)))
+			continue;
+		if (!cur->evlist)
+			continue;
+		evlist__for_each_entry(cur->evlist, ev) {
+			me = metricgroup__lookup(metric_events, ev, false);
+			if (!strncmp(m->default_metricgroup_name,
+				     me->default_metricgroup_name,
+				     strlen(m->default_metricgroup_name)))
+				return me;
+		}
+	}
+
+	/*
+	 * Different metric groups may append to the same leader event.
+	 * For example, TopdownL1 and TopdownL2 are appended to the
+	 * TOPDOWN.SLOTS event.
+	 * Split it and append the new metric group to the next available
+	 * event.
+	 */
+	me = metricgroup__lookup_default_metricgroup(metric_events, evsel[0], m);
+	if (me)
+		return me;
+
+	for (i = 1; i < hashmap__size(m->pctx->ids); i++) {
+		me = metricgroup__lookup_default_metricgroup(metric_events, evsel[i], m);
+		if (me)
+			return me;
+	}
+	return NULL;
+}
+
 static int parse_groups(struct evlist *perf_evlist,
 			const char *pmu, const char *str,
 			bool metric_no_group,
@@ -1512,6 +1603,7 @@ static int parse_groups(struct evlist *perf_evlist,
 	LIST_HEAD(metric_list);
 	struct metric *m;
 	bool tool_events[PERF_TOOL_MAX] = {false};
+	bool is_default = !strcmp(str, "Default");
 	int ret;
 
 	if (metric_events_list->nr_entries == 0)
@@ -1523,7 +1615,7 @@ static int parse_groups(struct evlist *perf_evlist,
 		goto out;
 
 	/* Sort metrics from largest to smallest. */
-	list_sort(NULL, &metric_list, metric_list_cmp);
+	list_sort((void *)&is_default, &metric_list, metric_list_cmp);
 
 	if (!metric_no_merge) {
 		struct expr_parse_ctx *combined = NULL;
@@ -1603,7 +1695,15 @@ static int parse_groups(struct evlist *perf_evlist,
 			goto out;
 		}
 
-		me = metricgroup__lookup(metric_events_list, metric_events[0], true);
+		me = metricgroup__lookup_create(metric_events_list,
+						metric_events,
+						&metric_list, m,
+						is_default);
+		if (!me) {
+			pr_err("Cannot create metric group for default!\n");
+			ret = -EINVAL;
+			goto out;
+		}
 
 		expr = malloc(sizeof(struct metric_expr));
 		if (!expr) {
diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
index bf18274c15df..e3609b853213 100644
--- a/tools/perf/util/metricgroup.h
+++ b/tools/perf/util/metricgroup.h
@@ -22,6 +22,7 @@ struct cgroup;
 struct metric_event {
 	struct rb_node nd;
 	struct evsel *evsel;
+	char *default_metricgroup_name;
 	struct list_head head; /* list of metric_expr */
 };
 
diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
index a2bbdc25d979..efe5fd04c033 100644
--- a/tools/perf/util/stat-display.c
+++ b/tools/perf/util/stat-display.c
@@ -21,10 +21,12 @@
 #include "iostat.h"
 #include "pmu.h"
 #include "pmus.h"
+#include "metricgroup.h"
 
 #define CNTR_NOT_SUPPORTED	"<not supported>"
 #define CNTR_NOT_COUNTED	"<not counted>"
 
+#define MGROUP_LEN   50
 #define METRIC_LEN   38
 #define EVNAME_LEN   32
 #define COUNTS_LEN   18
@@ -707,6 +709,55 @@ static bool evlist__has_hybrid(struct evlist *evlist)
 	return false;
 }
 
+static void print_metricgroup_header_json(struct perf_stat_config *config,
+					  struct outstate *os __maybe_unused,
+					  const char *metricgroup_name)
+{
+	fprintf(config->output, "\"metricgroup\" : \"%s\"}", metricgroup_name);
+	new_line_json(config, (void *)os);
+}
+
+static void print_metricgroup_header_csv(struct perf_stat_config *config,
+					 struct outstate *os,
+					 const char *metricgroup_name)
+{
+	int i;
+
+	for (i = 0; i < os->nfields; i++)
+		fputs(config->csv_sep, os->fh);
+	fprintf(config->output, "%s", metricgroup_name);
+	new_line_csv(config, (void *)os);
+}
+
+static void print_metricgroup_header_std(struct perf_stat_config *config,
+					 struct outstate *os __maybe_unused,
+					 const char *metricgroup_name)
+{
+	int n = fprintf(config->output, " %*s", EVNAME_LEN, metricgroup_name);
+
+	fprintf(config->output, "%*s", MGROUP_LEN - n - 1, "");
+}
+
+static void print_metricgroup_header(struct perf_stat_config *config,
+				     struct outstate *os,
+				     struct evsel *counter,
+				     double noise, u64 run, u64 ena,
+				     const char *metricgroup_name)
+{
+	aggr_printout(config, os->evsel, os->id, os->aggr_nr);
+
+	print_noise(config, counter, noise, /*before_metric=*/true);
+	print_running(config, run, ena, /*before_metric=*/true);
+
+	if (config->json_output) {
+		print_metricgroup_header_json(config, os, metricgroup_name);
+	} else if (config->csv_output) {
+		print_metricgroup_header_csv(config, os, metricgroup_name);
+	} else
+		print_metricgroup_header_std(config, os, metricgroup_name);
+
+}
+
 static void printout(struct perf_stat_config *config, struct outstate *os,
 		     double uval, u64 run, u64 ena, double noise, int aggr_idx)
 {
@@ -751,10 +802,17 @@ static void printout(struct perf_stat_config *config, struct outstate *os,
 	out.force_header = false;
 
 	if (!config->metric_only) {
-		abs_printout(config, os->id, os->aggr_nr, counter, uval, ok);
+		if (counter->default_metricgroup) {
+			struct metric_event *me;
 
-		print_noise(config, counter, noise, /*before_metric=*/true);
-		print_running(config, run, ena, /*before_metric=*/true);
+			me = metricgroup__lookup(&config->metric_events, counter, false);
+			print_metricgroup_header(config, os, counter, noise, run, ena,
+						 me->default_metricgroup_name);
+		} else {
+			abs_printout(config, os->id, os->aggr_nr, counter, uval, ok);
+			print_noise(config, counter, noise, /*before_metric=*/true);
+			print_running(config, run, ena, /*before_metric=*/true);
+		}
 	}
 
 	if (ok) {
@@ -883,6 +941,11 @@ static void print_counter_aggrdata(struct perf_stat_config *config,
 	if (counter->merged_stat)
 		return;
 
+	/* Only print the metric group for the default mode */
+	if (counter->default_metricgroup &&
+	    !metricgroup__lookup(&config->metric_events, counter, false))
+		return;
+
 	uniquify_counter(config, counter);
 
 	val = aggr->counts.val;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 7/8] pert tests: Support metricgroup perf stat JSON output
  2023-06-07 16:26 [PATCH 0/8] New metricgroup output in perf stat default mode kan.liang
                   ` (5 preceding siblings ...)
  2023-06-07 16:26 ` [PATCH 6/8] perf stat,metrics: New metricgroup output " kan.liang
@ 2023-06-07 16:26 ` kan.liang
  2023-06-13 20:17   ` Ian Rogers
  2023-06-07 16:27 ` [PATCH 8/8] perf test: Add test case for the standard perf stat output kan.liang
  7 siblings, 1 reply; 31+ messages in thread
From: kan.liang @ 2023-06-07 16:26 UTC (permalink / raw)
  To: acme, mingo, peterz, irogers, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel
  Cc: ak, eranian, ahmad.yasin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

A new field metricgroup has been added in the perf stat JSON output.
Support it in the test case.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/tests/shell/lib/perf_json_output_lint.py | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/perf/tests/shell/lib/perf_json_output_lint.py b/tools/perf/tests/shell/lib/perf_json_output_lint.py
index b81582a89d36..5e9bd68c83fe 100644
--- a/tools/perf/tests/shell/lib/perf_json_output_lint.py
+++ b/tools/perf/tests/shell/lib/perf_json_output_lint.py
@@ -55,6 +55,7 @@ def check_json_output(expected_items):
       'interval': lambda x: isfloat(x),
       'metric-unit': lambda x: True,
       'metric-value': lambda x: isfloat(x),
+      'metricgroup': lambda x: True,
       'node': lambda x: True,
       'pcnt-running': lambda x: isfloat(x),
       'socket': lambda x: True,
@@ -70,6 +71,8 @@ def check_json_output(expected_items):
         # values and possibly other prefixes like interval, core and
         # aggregate-number.
         pass
+      elif count != expected_items and count >= 1 and count <= 5 and 'metricgroup' in item:
+        pass
       elif count != expected_items:
         raise RuntimeError(f'wrong number of fields. counted {count} expected {expected_items}'
                            f' in \'{item}\'')
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 8/8] perf test: Add test case for the standard perf stat output
  2023-06-07 16:26 [PATCH 0/8] New metricgroup output in perf stat default mode kan.liang
                   ` (6 preceding siblings ...)
  2023-06-07 16:26 ` [PATCH 7/8] pert tests: Support metricgroup perf stat JSON output kan.liang
@ 2023-06-07 16:27 ` kan.liang
  2023-06-13 20:21   ` Ian Rogers
  7 siblings, 1 reply; 31+ messages in thread
From: kan.liang @ 2023-06-07 16:27 UTC (permalink / raw)
  To: acme, mingo, peterz, irogers, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel
  Cc: ak, eranian, ahmad.yasin, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Add a new test case to verify the standard perf stat output with
different options.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 tools/perf/tests/shell/stat+std_output.sh | 259 ++++++++++++++++++++++
 1 file changed, 259 insertions(+)
 create mode 100755 tools/perf/tests/shell/stat+std_output.sh

diff --git a/tools/perf/tests/shell/stat+std_output.sh b/tools/perf/tests/shell/stat+std_output.sh
new file mode 100755
index 000000000000..b9db0f245450
--- /dev/null
+++ b/tools/perf/tests/shell/stat+std_output.sh
@@ -0,0 +1,259 @@
+#!/bin/bash
+# perf stat STD output linter
+# SPDX-License-Identifier: GPL-2.0
+# Tests various perf stat STD output commands for
+# default event and metricgroup
+
+set -e
+
+skip_test=0
+
+stat_output=$(mktemp /tmp/__perf_test.stat_output.std.XXXXX)
+
+event_name=(cpu-clock task-clock context-switches cpu-migrations page-faults cycles instructions branches branch-misses stalled-cycles-frontend stalled-cycles-backend)
+event_metric=("CPUs utilized" "CPUs utilized" "/sec" "/sec" "/sec" "GHz" "insn per cycle" "/sec" "of all branches" "frontend cycles idle" "backend cycles idle")
+
+metricgroup_name=(TopdownL1 TopdownL2)
+
+cleanup() {
+  rm -f "${stat_output}"
+
+  trap - EXIT TERM INT
+}
+
+trap_cleanup() {
+  cleanup
+  exit 1
+}
+trap trap_cleanup EXIT TERM INT
+
+function commachecker()
+{
+	local -i cnt=0
+	local prefix=1
+
+	case "$1"
+	in "--interval")	prefix=2
+	;; "--per-thread")	prefix=2
+	;; "--system-wide-no-aggr")	prefix=2
+	;; "--per-core")	prefix=3
+	;; "--per-socket")	prefix=3
+	;; "--per-node")	prefix=3
+	;; "--per-die")		prefix=3
+	;; "--per-cache")	prefix=3
+	esac
+
+	while read line
+	do
+		# Ignore initial "started on" comment.
+		x=${line:0:1}
+		[ "$x" = "#" ] && continue
+		# Ignore initial blank line.
+		[ "$line" = "" ] && continue
+		# Ignore "Performance counter stats"
+		x=${line:0:25}
+		[ "$x" = "Performance counter stats" ] && continue
+		# Ignore "seconds time elapsed" and break
+		[[ "$line" == *"time elapsed"* ]] && break
+
+		main_body=$(echo $line | cut -d' ' -f$prefix-)
+		x=${main_body%#*}
+		# Check default metricgroup
+		y=$(echo $x | tr -d ' ')
+		[ "$y" = "" ] && continue
+		for i in "${!metricgroup_name[@]}"; do
+			[[ "$y" == *"${metricgroup_name[$i]}"* ]] && break
+		done
+		[[ "$y" == *"${metricgroup_name[$i]}"* ]] && continue
+
+		# Check default event
+		for i in "${!event_name[@]}"; do
+			[[ "$x" == *"${event_name[$i]}"* ]] && break
+		done
+
+		[[ ! "$x" == *"${event_name[$i]}"* ]] && {
+			echo "Unknown event name in $line" 1>&2
+			exit 1;
+		}
+
+		# Check event metric if it exists
+		[[ ! "$main_body" == *"#"* ]] && continue
+		[[ ! "$main_body" == *"${event_metric[$i]}"* ]] && {
+			echo "wrong event metric. expected ${event_metric[$i]} in $line" 1>&2
+			exit 1;
+		}
+	done < "${stat_output}"
+	return 0
+}
+
+# Return true if perf_event_paranoid is > $1 and not running as root.
+function ParanoidAndNotRoot()
+{
+	 [ $(id -u) != 0 ] && [ $(cat /proc/sys/kernel/perf_event_paranoid) -gt $1 ]
+}
+
+check_no_args()
+{
+	echo -n "Checking STD output: no args "
+	perf stat -o "${stat_output}" true
+        commachecker --no-args
+	echo "[Success]"
+}
+
+check_system_wide()
+{
+	echo -n "Checking STD output: system wide "
+	if ParanoidAndNotRoot 0
+	then
+		echo "[Skip] paranoid and not root"
+		return
+	fi
+	perf stat -a -o "${stat_output}" true
+        commachecker --system-wide
+	echo "[Success]"
+}
+
+check_system_wide_no_aggr()
+{
+	echo -n "Checking STD output: system wide no aggregation "
+	if ParanoidAndNotRoot 0
+	then
+		echo "[Skip] paranoid and not root"
+		return
+	fi
+	perf stat -A -a --no-merge -o "${stat_output}" true
+        commachecker --system-wide-no-aggr
+	echo "[Success]"
+}
+
+check_interval()
+{
+	echo -n "Checking STD output: interval "
+	perf stat -I 1000 -o "${stat_output}" true
+        commachecker --interval
+	echo "[Success]"
+}
+
+
+check_per_core()
+{
+	echo -n "Checking STD output: per core "
+	if ParanoidAndNotRoot 0
+	then
+		echo "[Skip] paranoid and not root"
+		return
+	fi
+	perf stat --per-core -a -o "${stat_output}" true
+        commachecker --per-core
+	echo "[Success]"
+}
+
+check_per_thread()
+{
+	echo -n "Checking STD output: per thread "
+	if ParanoidAndNotRoot 0
+	then
+		echo "[Skip] paranoid and not root"
+		return
+	fi
+	perf stat --per-thread -a -o "${stat_output}" true
+        commachecker --per-thread
+	echo "[Success]"
+}
+
+check_per_cache_instance()
+{
+	echo -n "Checking STD output: per cache instance "
+	if ParanoidAndNotRoot 0
+	then
+		echo "[Skip] paranoid and not root"
+		return
+	fi
+	perf stat  --per-cache -a true 2>&1 | commachecker --per-cache
+	echo "[Success]"
+}
+
+check_per_die()
+{
+	echo -n "Checking STD output: per die "
+	if ParanoidAndNotRoot 0
+	then
+		echo "[Skip] paranoid and not root"
+		return
+	fi
+	perf stat --per-die -a -o "${stat_output}" true
+        commachecker --per-die
+	echo "[Success]"
+}
+
+check_per_node()
+{
+	echo -n "Checking STD output: per node "
+	if ParanoidAndNotRoot 0
+	then
+		echo "[Skip] paranoid and not root"
+		return
+	fi
+	perf stat --per-node -a -o "${stat_output}" true
+        commachecker --per-node
+	echo "[Success]"
+}
+
+check_per_socket()
+{
+	echo -n "Checking STD output: per socket "
+	if ParanoidAndNotRoot 0
+	then
+		echo "[Skip] paranoid and not root"
+		return
+	fi
+	perf stat --per-socket -a -o "${stat_output}" true
+        commachecker --per-socket
+	echo "[Success]"
+}
+
+# The perf stat options for per-socket, per-core, per-die
+# and -A ( no_aggr mode ) uses the info fetched from this
+# directory: "/sys/devices/system/cpu/cpu*/topology". For
+# example, socket value is fetched from "physical_package_id"
+# file in topology directory.
+# Reference: cpu__get_topology_int in util/cpumap.c
+# If the platform doesn't expose topology information, values
+# will be set to -1. For example, incase of pSeries platform
+# of powerpc, value for  "physical_package_id" is restricted
+# and set to -1. Check here validates the socket-id read from
+# topology file before proceeding further
+
+FILE_LOC="/sys/devices/system/cpu/cpu*/topology/"
+FILE_NAME="physical_package_id"
+
+check_for_topology()
+{
+	if ! ParanoidAndNotRoot 0
+	then
+		socket_file=`ls $FILE_LOC/$FILE_NAME | head -n 1`
+		[ -z $socket_file ] && return 0
+		socket_id=`cat $socket_file`
+		[ $socket_id == -1 ] && skip_test=1
+		return 0
+	fi
+}
+
+check_for_topology
+check_no_args
+check_system_wide
+check_interval
+check_per_thread
+check_per_node
+if [ $skip_test -ne 1 ]
+then
+	check_system_wide_no_aggr
+	check_per_core
+	check_per_cache_instance
+	check_per_die
+	check_per_socket
+else
+	echo "[Skip] Skipping tests for system_wide_no_aggr, per_core, per_die and per_socket since socket id exposed via topology is invalid"
+fi
+cleanup
+exit 0
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/8] perf metric: Fix no group check
  2023-06-07 16:26 ` [PATCH 1/8] perf metric: Fix no group check kan.liang
@ 2023-06-13 19:22   ` Ian Rogers
  0 siblings, 0 replies; 31+ messages in thread
From: Ian Rogers @ 2023-06-13 19:22 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin

On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> The no group check fails if there is more than one meticgroup in the
> metricgroup_no_group.
>
> The first parameter of the match_metric() should be the string, while
> the substring should be the second parameter.
>
> Fixes: ccc66c609280 ("perf metric: JSON flag to not group events if gathering a metric group")
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>

Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

> ---
>  tools/perf/util/metricgroup.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
> index 70ef2e23a710..74f2d8efc02d 100644
> --- a/tools/perf/util/metricgroup.c
> +++ b/tools/perf/util/metricgroup.c
> @@ -1175,7 +1175,7 @@ static int metricgroup__add_metric_callback(const struct pmu_metric *pm,
>
>         if (pm->metric_expr && match_pm_metric(pm, data->pmu, data->metric_name)) {
>                 bool metric_no_group = data->metric_no_group ||
> -                       match_metric(data->metric_name, pm->metricgroup_no_group);
> +                       match_metric(pm->metricgroup_no_group, data->metric_name);
>
>                 data->has_match = true;
>                 ret = add_metric(data->list, pm, data->modifier, metric_no_group,
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/8] perf evsel: Fix the annotation for hardware events on hybrid
  2023-06-07 16:26 ` [PATCH 2/8] perf evsel: Fix the annotation for hardware events on hybrid kan.liang
@ 2023-06-13 19:35   ` Ian Rogers
  2023-06-13 20:06     ` Liang, Kan
  0 siblings, 1 reply; 31+ messages in thread
From: Ian Rogers @ 2023-06-13 19:35 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin

On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> The annotation for hardware events is wrong on hybrid. For example,
>
>  # ./perf stat -a sleep 1
>
>  Performance counter stats for 'system wide':
>
>          32,148.85 msec cpu-clock                        #   32.000 CPUs utilized
>                374      context-switches                 #   11.633 /sec
>                 33      cpu-migrations                   #    1.026 /sec
>                295      page-faults                      #    9.176 /sec
>         18,979,960      cpu_core/cycles/                 #  590.378 K/sec
>        261,230,783      cpu_atom/cycles/                 #    8.126 M/sec                       (54.21%)
>         17,019,732      cpu_core/instructions/           #  529.404 K/sec
>         38,020,470      cpu_atom/instructions/           #    1.183 M/sec                       (63.36%)
>          3,296,743      cpu_core/branches/               #  102.546 K/sec
>          6,692,338      cpu_atom/branches/               #  208.167 K/sec                       (63.40%)
>             96,421      cpu_core/branch-misses/          #    2.999 K/sec
>          1,016,336      cpu_atom/branch-misses/          #   31.613 K/sec                       (63.38%)
>
> The hardware events have extended type on hybrid, but the evsel__match()
> doesn't take it into account.
>
> Add a mask to filter the extended type on hybrid when checking the config.
>
> With the patch,
>
>  # ./perf stat -a sleep 1
>
>  Performance counter stats for 'system wide':
>
>          32,139.90 msec cpu-clock                        #   32.003 CPUs utilized
>                343      context-switches                 #   10.672 /sec
>                 32      cpu-migrations                   #    0.996 /sec
>                 73      page-faults                      #    2.271 /sec
>         13,712,841      cpu_core/cycles/                 #    0.000 GHz
>        258,301,691      cpu_atom/cycles/                 #    0.008 GHz                         (54.20%)
>         12,428,163      cpu_core/instructions/           #    0.91  insn per cycle
>         37,786,557      cpu_atom/instructions/           #    2.76  insn per cycle              (63.35%)
>          2,418,826      cpu_core/branches/               #   75.259 K/sec
>          6,965,962      cpu_atom/branches/               #  216.739 K/sec                       (63.38%)
>             72,150      cpu_core/branch-misses/          #    2.98% of all branches
>          1,032,746      cpu_atom/branch-misses/          #   42.70% of all branches             (63.35%)
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  tools/perf/util/evsel.h       | 12 ++++++-----
>  tools/perf/util/stat-shadow.c | 39 +++++++++++++++++++----------------
>  2 files changed, 28 insertions(+), 23 deletions(-)
>
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index b365b449c6ea..36a32e4ca168 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -350,9 +350,11 @@ u64 format_field__intval(struct tep_format_field *field, struct perf_sample *sam
>
>  struct tep_format_field *evsel__field(struct evsel *evsel, const char *name);
>
> -#define evsel__match(evsel, t, c)              \
> +#define EVSEL_EVENT_MASK                       (~0ULL)
> +
> +#define evsel__match(evsel, t, c, m)                   \
>         (evsel->core.attr.type == PERF_TYPE_##t &&      \
> -        evsel->core.attr.config == PERF_COUNT_##c)
> +        (evsel->core.attr.config & m) == PERF_COUNT_##c)

The EVSEL_EVENT_MASK here isn't very intention revealing, perhaps we
can remove it and do something like:

static inline bool __evsel__match(const struct evsel *evsel, u32 type,
u64 config)
{
  if ((type == PERF_TYPE_HARDWARE || type ==PERF_TYPE_HW_CACHE)  &&
perf_pmus__supports_extended_type())
     return (evsel->core.attr.config & PERF_HW_EVENT_MASK) == config;

  return evsel->core.attr.config == config;
}
#define evsel__match(evsel, t, c) __evsel__match(evsel, PERF_TYPE_##t,
PERF_COUNT_##c)

Thanks,
Ian

>
>  static inline bool evsel__match2(struct evsel *e1, struct evsel *e2)
>  {
> @@ -438,13 +440,13 @@ bool evsel__is_function_event(struct evsel *evsel);
>
>  static inline bool evsel__is_bpf_output(struct evsel *evsel)
>  {
> -       return evsel__match(evsel, SOFTWARE, SW_BPF_OUTPUT);
> +       return evsel__match(evsel, SOFTWARE, SW_BPF_OUTPUT, EVSEL_EVENT_MASK);
>  }
>
>  static inline bool evsel__is_clock(const struct evsel *evsel)
>  {
> -       return evsel__match(evsel, SOFTWARE, SW_CPU_CLOCK) ||
> -              evsel__match(evsel, SOFTWARE, SW_TASK_CLOCK);
> +       return evsel__match(evsel, SOFTWARE, SW_CPU_CLOCK, EVSEL_EVENT_MASK) ||
> +              evsel__match(evsel, SOFTWARE, SW_TASK_CLOCK, EVSEL_EVENT_MASK);
>  }
>
>  bool evsel__fallback(struct evsel *evsel, int err, char *msg, size_t msgsize);
> diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
> index 1566a206ba42..074f38b57e2d 100644
> --- a/tools/perf/util/stat-shadow.c
> +++ b/tools/perf/util/stat-shadow.c
> @@ -6,6 +6,7 @@
>  #include "color.h"
>  #include "debug.h"
>  #include "pmu.h"
> +#include "pmus.h"
>  #include "rblist.h"
>  #include "evlist.h"
>  #include "expr.h"
> @@ -78,6 +79,8 @@ void perf_stat__reset_shadow_stats(void)
>
>  static enum stat_type evsel__stat_type(const struct evsel *evsel)
>  {
> +       u64 mask = perf_pmus__supports_extended_type() ? PERF_HW_EVENT_MASK : EVSEL_EVENT_MASK;
> +
>         /* Fake perf_hw_cache_op_id values for use with evsel__match. */
>         u64 PERF_COUNT_hw_cache_l1d_miss = PERF_COUNT_HW_CACHE_L1D |
>                 ((PERF_COUNT_HW_CACHE_OP_READ) << 8) |
> @@ -97,41 +100,41 @@ static enum stat_type evsel__stat_type(const struct evsel *evsel)
>
>         if (evsel__is_clock(evsel))
>                 return STAT_NSECS;
> -       else if (evsel__match(evsel, HARDWARE, HW_CPU_CYCLES))
> +       else if (evsel__match(evsel, HARDWARE, HW_CPU_CYCLES, mask))
>                 return STAT_CYCLES;
> -       else if (evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS))
> +       else if (evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS, mask))
>                 return STAT_INSTRUCTIONS;
> -       else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
> +       else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_FRONTEND, mask))
>                 return STAT_STALLED_CYCLES_FRONT;
> -       else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_BACKEND))
> +       else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_BACKEND, mask))
>                 return STAT_STALLED_CYCLES_BACK;
> -       else if (evsel__match(evsel, HARDWARE, HW_BRANCH_INSTRUCTIONS))
> +       else if (evsel__match(evsel, HARDWARE, HW_BRANCH_INSTRUCTIONS, mask))
>                 return STAT_BRANCHES;
> -       else if (evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES))
> +       else if (evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES, mask))
>                 return STAT_BRANCH_MISS;
> -       else if (evsel__match(evsel, HARDWARE, HW_CACHE_REFERENCES))
> +       else if (evsel__match(evsel, HARDWARE, HW_CACHE_REFERENCES, mask))
>                 return STAT_CACHE_REFS;
> -       else if (evsel__match(evsel, HARDWARE, HW_CACHE_MISSES))
> +       else if (evsel__match(evsel, HARDWARE, HW_CACHE_MISSES, mask))
>                 return STAT_CACHE_MISSES;
> -       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_L1D))
> +       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_L1D, mask))
>                 return STAT_L1_DCACHE;
> -       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_L1I))
> +       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_L1I, mask))
>                 return STAT_L1_ICACHE;
> -       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_LL))
> +       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_LL, mask))
>                 return STAT_LL_CACHE;
> -       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_DTLB))
> +       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_DTLB, mask))
>                 return STAT_DTLB_CACHE;
> -       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_ITLB))
> +       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_ITLB, mask))
>                 return STAT_ITLB_CACHE;
> -       else if (evsel__match(evsel, HW_CACHE, hw_cache_l1d_miss))
> +       else if (evsel__match(evsel, HW_CACHE, hw_cache_l1d_miss, mask))
>                 return STAT_L1D_MISS;
> -       else if (evsel__match(evsel, HW_CACHE, hw_cache_l1i_miss))
> +       else if (evsel__match(evsel, HW_CACHE, hw_cache_l1i_miss, mask))
>                 return STAT_L1I_MISS;
> -       else if (evsel__match(evsel, HW_CACHE, hw_cache_ll_miss))
> +       else if (evsel__match(evsel, HW_CACHE, hw_cache_ll_miss, mask))
>                 return STAT_LL_MISS;
> -       else if (evsel__match(evsel, HW_CACHE, hw_cache_dtlb_miss))
> +       else if (evsel__match(evsel, HW_CACHE, hw_cache_dtlb_miss, mask))
>                 return STAT_DTLB_MISS;
> -       else if (evsel__match(evsel, HW_CACHE, hw_cache_itlb_miss))
> +       else if (evsel__match(evsel, HW_CACHE, hw_cache_itlb_miss, mask))
>                 return STAT_ITLB_MISS;
>         return STAT_NONE;
>  }
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/8] perf metric: JSON flag to default metric group
  2023-06-07 16:26 ` [PATCH 3/8] perf metric: JSON flag to default metric group kan.liang
@ 2023-06-13 19:44   ` Ian Rogers
  2023-06-13 20:10     ` Liang, Kan
  0 siblings, 1 reply; 31+ messages in thread
From: Ian Rogers @ 2023-06-13 19:44 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin

On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> For the default output, the default metric group could vary on different
> platforms. For example, on SPR, the TopdownL1 and TopdownL2 metrics
> should be displayed in the default mode. On ICL, only the TopdownL1
> should be displayed.
>
> Add a flag so we can tag the default metric group for different
> platforms rather than hack the perf code.
>
> The flag is added to Intel TopdownL1 since ICL and TopdownL2 metrics
> since SPR.
>
> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
> the real metric group name.
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  .../arch/x86/alderlake/adl-metrics.json       | 20 ++++---
>  .../arch/x86/icelake/icl-metrics.json         | 20 ++++---
>  .../arch/x86/icelakex/icx-metrics.json        | 20 ++++---
>  .../arch/x86/sapphirerapids/spr-metrics.json  | 60 +++++++++++--------
>  .../arch/x86/tigerlake/tgl-metrics.json       | 20 ++++---
>  5 files changed, 84 insertions(+), 56 deletions(-)
>
> diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> index c9f7e3d4ab08..e78c85220e27 100644
> --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> @@ -832,22 +832,24 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_backend_bound",
>          "MetricThreshold": "tma_backend_bound > 0.2",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>          "ScaleUnit": "100%",
>          "Unit": "cpu_core"
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_bad_speculation",
>          "MetricThreshold": "tma_bad_speculation > 0.15",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>          "ScaleUnit": "100%",
>          "Unit": "cpu_core"
> @@ -1112,11 +1114,12 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_frontend_bound",
>          "MetricThreshold": "tma_frontend_bound > 0.15",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>          "ScaleUnit": "100%",
>          "Unit": "cpu_core"
> @@ -2316,11 +2319,12 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_retiring",
>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>          "ScaleUnit": "100%",
>          "Unit": "cpu_core"

For Alderlake the Default metric group is added for all cpu_core
metrics but not cpu_atom. This will lead to only getting metrics for
performance cores while the workload could be running on atoms. This
could lead to a false conclusion that the workload has no issues with
the metrics. I think this behavior is surprising and should be called
out as intentional in the commit message.

Thanks,
Ian

> diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> index 20210742171d..cc4edf855064 100644
> --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> @@ -111,21 +111,23 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_backend_bound",
>          "MetricThreshold": "tma_backend_bound > 0.2",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>          "ScaleUnit": "100%"
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_bad_speculation",
>          "MetricThreshold": "tma_bad_speculation > 0.15",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>          "ScaleUnit": "100%"
>      },
> @@ -372,11 +374,12 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_frontend_bound",
>          "MetricThreshold": "tma_frontend_bound > 0.15",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>          "ScaleUnit": "100%"
>      },
> @@ -1378,11 +1381,12 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_retiring",
>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>          "ScaleUnit": "100%"
>      },
> diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> index ef25cda019be..6f25b5b7aaf6 100644
> --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> @@ -315,21 +315,23 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_backend_bound",
>          "MetricThreshold": "tma_backend_bound > 0.2",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>          "ScaleUnit": "100%"
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_bad_speculation",
>          "MetricThreshold": "tma_bad_speculation > 0.15",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>          "ScaleUnit": "100%"
>      },
> @@ -576,11 +578,12 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_frontend_bound",
>          "MetricThreshold": "tma_frontend_bound > 0.15",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>          "ScaleUnit": "100%"
>      },
> @@ -1674,11 +1677,12 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_retiring",
>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>          "ScaleUnit": "100%"
>      },
> diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> index 4f3dd85540b6..c732982f70b5 100644
> --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> @@ -340,31 +340,34 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_backend_bound",
>          "MetricThreshold": "tma_backend_bound > 0.2",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>          "ScaleUnit": "100%"
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_bad_speculation",
>          "MetricThreshold": "tma_bad_speculation > 0.15",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>          "ScaleUnit": "100%"
>      },
>      {
>          "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
> +        "DefaultMetricgroupName": "TopdownL2",
>          "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> -        "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
> +        "MetricGroup": "BadSpec;BrMispredicts;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
>          "MetricName": "tma_branch_mispredicts",
>          "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
> -        "MetricgroupNoGroup": "TopdownL2",
> +        "MetricgroupNoGroup": "TopdownL2;Default",
>          "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction.  These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
>          "ScaleUnit": "100%"
>      },
> @@ -407,11 +410,12 @@
>      },
>      {
>          "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
> +        "DefaultMetricgroupName": "TopdownL2",
>          "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
> -        "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> +        "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>          "MetricName": "tma_core_bound",
>          "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
> -        "MetricgroupNoGroup": "TopdownL2",
> +        "MetricgroupNoGroup": "TopdownL2;Default",
>          "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck.  Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
>          "ScaleUnit": "100%"
>      },
> @@ -509,21 +513,23 @@
>      },
>      {
>          "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
> +        "DefaultMetricgroupName": "TopdownL2",
>          "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
> -        "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
> +        "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
>          "MetricName": "tma_fetch_bandwidth",
>          "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 6 > 0.35",
> -        "MetricgroupNoGroup": "TopdownL2",
> +        "MetricgroupNoGroup": "TopdownL2;Default",
>          "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues.  For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
>          "ScaleUnit": "100%"
>      },
>      {
>          "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
> +        "DefaultMetricgroupName": "TopdownL2",
>          "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> -        "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
> +        "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
>          "MetricName": "tma_fetch_latency",
>          "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
> -        "MetricgroupNoGroup": "TopdownL2",
> +        "MetricgroupNoGroup": "TopdownL2;Default",
>          "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
>          "ScaleUnit": "100%"
>      },
> @@ -611,11 +617,12 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_frontend_bound",
>          "MetricThreshold": "tma_frontend_bound > 0.15",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>          "ScaleUnit": "100%"
>      },
> @@ -630,11 +637,12 @@
>      },
>      {
>          "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
> +        "DefaultMetricgroupName": "TopdownL2",
>          "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> -        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> +        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>          "MetricName": "tma_heavy_operations",
>          "MetricThreshold": "tma_heavy_operations > 0.1",
> -        "MetricgroupNoGroup": "TopdownL2",
> +        "MetricgroupNoGroup": "TopdownL2;Default",
>          "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY",
>          "ScaleUnit": "100%"
>      },
> @@ -1486,11 +1494,12 @@
>      },
>      {
>          "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
> +        "DefaultMetricgroupName": "TopdownL2",
>          "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
> -        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> +        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>          "MetricName": "tma_light_operations",
>          "MetricThreshold": "tma_light_operations > 0.6",
> -        "MetricgroupNoGroup": "TopdownL2",
> +        "MetricgroupNoGroup": "TopdownL2;Default",
>          "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
>          "ScaleUnit": "100%"
>      },
> @@ -1540,11 +1549,12 @@
>      },
>      {
>          "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
> +        "DefaultMetricgroupName": "TopdownL2",
>          "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
> -        "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
> +        "MetricGroup": "BadSpec;Default;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
>          "MetricName": "tma_machine_clears",
>          "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
> -        "MetricgroupNoGroup": "TopdownL2",
> +        "MetricgroupNoGroup": "TopdownL2;Default",
>          "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears.  These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
>          "ScaleUnit": "100%"
>      },
> @@ -1576,11 +1586,12 @@
>      },
>      {
>          "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
> +        "DefaultMetricgroupName": "TopdownL2",
>          "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> -        "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> +        "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>          "MetricName": "tma_memory_bound",
>          "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
> -        "MetricgroupNoGroup": "TopdownL2",
> +        "MetricgroupNoGroup": "TopdownL2;Default",
>          "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck.  Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
>          "ScaleUnit": "100%"
>      },
> @@ -1784,11 +1795,12 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_retiring",
>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>          "ScaleUnit": "100%"
>      },
> diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> index d0538a754288..83346911aa63 100644
> --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> @@ -105,21 +105,23 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_backend_bound",
>          "MetricThreshold": "tma_backend_bound > 0.2",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>          "ScaleUnit": "100%"
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_bad_speculation",
>          "MetricThreshold": "tma_bad_speculation > 0.15",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>          "ScaleUnit": "100%"
>      },
> @@ -366,11 +368,12 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_frontend_bound",
>          "MetricThreshold": "tma_frontend_bound > 0.15",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>          "ScaleUnit": "100%"
>      },
> @@ -1392,11 +1395,12 @@
>      },
>      {
>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> +        "DefaultMetricgroupName": "TopdownL1",
>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>          "MetricName": "tma_retiring",
>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> -        "MetricgroupNoGroup": "TopdownL1",
> +        "MetricgroupNoGroup": "TopdownL1;Default",
>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>          "ScaleUnit": "100%"
>      },
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/8] perf vendor events arm64: Add default tags into topdown L1 metrics
  2023-06-07 16:26 ` [PATCH 4/8] perf vendor events arm64: Add default tags into topdown L1 metrics kan.liang
@ 2023-06-13 19:45   ` Ian Rogers
  2023-06-13 20:31     ` Arnaldo Carvalho de Melo
  2023-06-14 14:30   ` John Garry
  1 sibling, 1 reply; 31+ messages in thread
From: Ian Rogers @ 2023-06-13 19:45 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin,
	Jing Zhang, John Garry

On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Add the default tags for ARM as well.
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Cc: Jing Zhang <renyu.zj@linux.alibaba.com>
> Cc: John Garry <john.g.garry@oracle.com>

Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

> ---
>  tools/perf/pmu-events/arch/arm64/sbsa.json | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/tools/perf/pmu-events/arch/arm64/sbsa.json b/tools/perf/pmu-events/arch/arm64/sbsa.json
> index f678c37ea9c3..f90b338261ac 100644
> --- a/tools/perf/pmu-events/arch/arm64/sbsa.json
> +++ b/tools/perf/pmu-events/arch/arm64/sbsa.json
> @@ -2,28 +2,32 @@
>      {
>          "MetricExpr": "stall_slot_frontend / (#slots * cpu_cycles)",
>          "BriefDescription": "Frontend bound L1 topdown metric",
> -        "MetricGroup": "TopdownL1",
> +        "DefaultMetricgroupName": "TopdownL1",
> +        "MetricGroup": "Default;TopdownL1",
>          "MetricName": "frontend_bound",
>          "ScaleUnit": "100%"
>      },
>      {
>          "MetricExpr": "(1 - op_retired / op_spec) * (1 - stall_slot / (#slots * cpu_cycles))",
>          "BriefDescription": "Bad speculation L1 topdown metric",
> -        "MetricGroup": "TopdownL1",
> +        "DefaultMetricgroupName": "TopdownL1",
> +        "MetricGroup": "Default;TopdownL1",
>          "MetricName": "bad_speculation",
>          "ScaleUnit": "100%"
>      },
>      {
>          "MetricExpr": "(op_retired / op_spec) * (1 - stall_slot / (#slots * cpu_cycles))",
>          "BriefDescription": "Retiring L1 topdown metric",
> -        "MetricGroup": "TopdownL1",
> +        "DefaultMetricgroupName": "TopdownL1",
> +        "MetricGroup": "Default;TopdownL1",
>          "MetricName": "retiring",
>          "ScaleUnit": "100%"
>      },
>      {
>          "MetricExpr": "stall_slot_backend / (#slots * cpu_cycles)",
>          "BriefDescription": "Backend Bound L1 topdown metric",
> -        "MetricGroup": "TopdownL1",
> +        "DefaultMetricgroupName": "TopdownL1",
> +        "MetricGroup": "Default;TopdownL1",
>          "MetricName": "backend_bound",
>          "ScaleUnit": "100%"
>      }
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/8] perf stat,jevents: Introduce Default tags for the default mode
  2023-06-07 16:26 ` [PATCH 5/8] perf stat,jevents: Introduce Default tags for the default mode kan.liang
@ 2023-06-13 19:59   ` Ian Rogers
  2023-06-13 20:11     ` Liang, Kan
  0 siblings, 1 reply; 31+ messages in thread
From: Ian Rogers @ 2023-06-13 19:59 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin

On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Introduce a new metricgroup, Default, to tag all the metric groups which
> will be collected in the default mode.
>
> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
> the real metric group name. It will be printed in the default output
> to replace the event names.
>
> There is nothing changed for the output format.
>
> On SPR, both TopdownL1 and TopdownL2 are displayed in the default
> output.
>
> On ARM, Intel ICL and later platforms (before SPR), only TopdownL1 is
> displayed in the default output.
>
> Suggested-by: Stephane Eranian <eranian@google.com>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  tools/perf/builtin-stat.c          | 4 ++--
>  tools/perf/pmu-events/jevents.py   | 5 +++--
>  tools/perf/pmu-events/pmu-events.h | 1 +
>  tools/perf/util/metricgroup.c      | 3 +++
>  4 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index c87c6897edc9..2269b3e90e9b 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -2154,14 +2154,14 @@ static int add_default_attributes(void)
>                  * Add TopdownL1 metrics if they exist. To minimize
>                  * multiplexing, don't request threshold computation.
>                  */
> -               if (metricgroup__has_metric(pmu, "TopdownL1")) {
> +               if (metricgroup__has_metric(pmu, "Default")) {
>                         struct evlist *metric_evlist = evlist__new();
>                         struct evsel *metric_evsel;
>
>                         if (!metric_evlist)
>                                 return -1;
>
> -                       if (metricgroup__parse_groups(metric_evlist, pmu, "TopdownL1",
> +                       if (metricgroup__parse_groups(metric_evlist, pmu, "Default",
>                                                         /*metric_no_group=*/false,
>                                                         /*metric_no_merge=*/false,
>                                                         /*metric_no_threshold=*/true,
> diff --git a/tools/perf/pmu-events/jevents.py b/tools/perf/pmu-events/jevents.py
> index 7ed258be1829..12e80bb7939b 100755
> --- a/tools/perf/pmu-events/jevents.py
> +++ b/tools/perf/pmu-events/jevents.py
> @@ -54,8 +54,8 @@ _json_event_attributes = [
>  # Attributes that are in pmu_metric rather than pmu_event.
>  _json_metric_attributes = [
>      'pmu', 'metric_name', 'metric_group', 'metric_expr', 'metric_threshold',
> -    'desc', 'long_desc', 'unit', 'compat', 'metricgroup_no_group', 'aggr_mode',
> -    'event_grouping'
> +    'desc', 'long_desc', 'unit', 'compat', 'metricgroup_no_group',
> +    'default_metricgroup_name', 'aggr_mode', 'event_grouping'
>  ]
>  # Attributes that are bools or enum int values, encoded as '0', '1',...
>  _json_enum_attributes = ['aggr_mode', 'deprecated', 'event_grouping', 'perpkg']
> @@ -307,6 +307,7 @@ class JsonEvent:
>      self.metric_name = jd.get('MetricName')
>      self.metric_group = jd.get('MetricGroup')
>      self.metricgroup_no_group = jd.get('MetricgroupNoGroup')
> +    self.default_metricgroup_name = jd.get('DefaultMetricgroupName')
>      self.event_grouping = convert_metric_constraint(jd.get('MetricConstraint'))
>      self.metric_expr = None
>      if 'MetricExpr' in jd:
> diff --git a/tools/perf/pmu-events/pmu-events.h b/tools/perf/pmu-events/pmu-events.h
> index 8cd23d656a5d..caf59f23cd64 100644
> --- a/tools/perf/pmu-events/pmu-events.h
> +++ b/tools/perf/pmu-events/pmu-events.h
> @@ -61,6 +61,7 @@ struct pmu_metric {
>         const char *desc;
>         const char *long_desc;
>         const char *metricgroup_no_group;
> +       const char *default_metricgroup_name;
>         enum aggr_mode_class aggr_mode;
>         enum metric_event_groups event_grouping;
>  };
> diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
> index 74f2d8efc02d..efafa02db5e5 100644
> --- a/tools/perf/util/metricgroup.c
> +++ b/tools/perf/util/metricgroup.c
> @@ -137,6 +137,8 @@ struct metric {
>          * output.
>          */
>         const char *metric_unit;
> +       /** Optional default metric group name */
> +       const char *default_metricgroup_name;

Adding a bit more to the comment would be useful, like:

Optional name of the metric group reported if the Default metric group
is being processed.

>         /** Optional null terminated array of referenced metrics. */
>         struct metric_ref *metric_refs;
>         /**
> @@ -219,6 +221,7 @@ static struct metric *metric__new(const struct pmu_metric *pm,
>
>         m->pmu = pm->pmu ?: "cpu";
>         m->metric_name = pm->metric_name;
> +       m->default_metricgroup_name = pm->default_metricgroup_name;
>         m->modifier = NULL;
>         if (modifier) {
>                 m->modifier = strdup(modifier);
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/8] perf evsel: Fix the annotation for hardware events on hybrid
  2023-06-13 19:35   ` Ian Rogers
@ 2023-06-13 20:06     ` Liang, Kan
  2023-06-13 21:18       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 31+ messages in thread
From: Liang, Kan @ 2023-06-13 20:06 UTC (permalink / raw)
  To: Ian Rogers
  Cc: acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin



On 2023-06-13 3:35 p.m., Ian Rogers wrote:
> On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>>
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> The annotation for hardware events is wrong on hybrid. For example,
>>
>>  # ./perf stat -a sleep 1
>>
>>  Performance counter stats for 'system wide':
>>
>>          32,148.85 msec cpu-clock                        #   32.000 CPUs utilized
>>                374      context-switches                 #   11.633 /sec
>>                 33      cpu-migrations                   #    1.026 /sec
>>                295      page-faults                      #    9.176 /sec
>>         18,979,960      cpu_core/cycles/                 #  590.378 K/sec
>>        261,230,783      cpu_atom/cycles/                 #    8.126 M/sec                       (54.21%)
>>         17,019,732      cpu_core/instructions/           #  529.404 K/sec
>>         38,020,470      cpu_atom/instructions/           #    1.183 M/sec                       (63.36%)
>>          3,296,743      cpu_core/branches/               #  102.546 K/sec
>>          6,692,338      cpu_atom/branches/               #  208.167 K/sec                       (63.40%)
>>             96,421      cpu_core/branch-misses/          #    2.999 K/sec
>>          1,016,336      cpu_atom/branch-misses/          #   31.613 K/sec                       (63.38%)
>>
>> The hardware events have extended type on hybrid, but the evsel__match()
>> doesn't take it into account.
>>
>> Add a mask to filter the extended type on hybrid when checking the config.
>>
>> With the patch,
>>
>>  # ./perf stat -a sleep 1
>>
>>  Performance counter stats for 'system wide':
>>
>>          32,139.90 msec cpu-clock                        #   32.003 CPUs utilized
>>                343      context-switches                 #   10.672 /sec
>>                 32      cpu-migrations                   #    0.996 /sec
>>                 73      page-faults                      #    2.271 /sec
>>         13,712,841      cpu_core/cycles/                 #    0.000 GHz
>>        258,301,691      cpu_atom/cycles/                 #    0.008 GHz                         (54.20%)
>>         12,428,163      cpu_core/instructions/           #    0.91  insn per cycle
>>         37,786,557      cpu_atom/instructions/           #    2.76  insn per cycle              (63.35%)
>>          2,418,826      cpu_core/branches/               #   75.259 K/sec
>>          6,965,962      cpu_atom/branches/               #  216.739 K/sec                       (63.38%)
>>             72,150      cpu_core/branch-misses/          #    2.98% of all branches
>>          1,032,746      cpu_atom/branch-misses/          #   42.70% of all branches             (63.35%)
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> ---
>>  tools/perf/util/evsel.h       | 12 ++++++-----
>>  tools/perf/util/stat-shadow.c | 39 +++++++++++++++++++----------------
>>  2 files changed, 28 insertions(+), 23 deletions(-)
>>
>> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
>> index b365b449c6ea..36a32e4ca168 100644
>> --- a/tools/perf/util/evsel.h
>> +++ b/tools/perf/util/evsel.h
>> @@ -350,9 +350,11 @@ u64 format_field__intval(struct tep_format_field *field, struct perf_sample *sam
>>
>>  struct tep_format_field *evsel__field(struct evsel *evsel, const char *name);
>>
>> -#define evsel__match(evsel, t, c)              \
>> +#define EVSEL_EVENT_MASK                       (~0ULL)
>> +
>> +#define evsel__match(evsel, t, c, m)                   \
>>         (evsel->core.attr.type == PERF_TYPE_##t &&      \
>> -        evsel->core.attr.config == PERF_COUNT_##c)
>> +        (evsel->core.attr.config & m) == PERF_COUNT_##c)
> 
> The EVSEL_EVENT_MASK here isn't very intention revealing, perhaps we
> can remove it and do something like:
> 
> static inline bool __evsel__match(const struct evsel *evsel, u32 type,
> u64 config)
> {
>   if ((type == PERF_TYPE_HARDWARE || type ==PERF_TYPE_HW_CACHE)  &&
> perf_pmus__supports_extended_type())
>      return (evsel->core.attr.config & PERF_HW_EVENT_MASK) == config;
> 
>   return evsel->core.attr.config == config;
> }
> #define evsel__match(evsel, t, c) __evsel__match(evsel, PERF_TYPE_##t,
> PERF_COUNT_##c)

Yes, the above code looks better. I will apply it in V2.

Thanks,
Kan
> 
> Thanks,
> Ian
> 
>>
>>  static inline bool evsel__match2(struct evsel *e1, struct evsel *e2)
>>  {
>> @@ -438,13 +440,13 @@ bool evsel__is_function_event(struct evsel *evsel);
>>
>>  static inline bool evsel__is_bpf_output(struct evsel *evsel)
>>  {
>> -       return evsel__match(evsel, SOFTWARE, SW_BPF_OUTPUT);
>> +       return evsel__match(evsel, SOFTWARE, SW_BPF_OUTPUT, EVSEL_EVENT_MASK);
>>  }
>>
>>  static inline bool evsel__is_clock(const struct evsel *evsel)
>>  {
>> -       return evsel__match(evsel, SOFTWARE, SW_CPU_CLOCK) ||
>> -              evsel__match(evsel, SOFTWARE, SW_TASK_CLOCK);
>> +       return evsel__match(evsel, SOFTWARE, SW_CPU_CLOCK, EVSEL_EVENT_MASK) ||
>> +              evsel__match(evsel, SOFTWARE, SW_TASK_CLOCK, EVSEL_EVENT_MASK);
>>  }
>>
>>  bool evsel__fallback(struct evsel *evsel, int err, char *msg, size_t msgsize);
>> diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
>> index 1566a206ba42..074f38b57e2d 100644
>> --- a/tools/perf/util/stat-shadow.c
>> +++ b/tools/perf/util/stat-shadow.c
>> @@ -6,6 +6,7 @@
>>  #include "color.h"
>>  #include "debug.h"
>>  #include "pmu.h"
>> +#include "pmus.h"
>>  #include "rblist.h"
>>  #include "evlist.h"
>>  #include "expr.h"
>> @@ -78,6 +79,8 @@ void perf_stat__reset_shadow_stats(void)
>>
>>  static enum stat_type evsel__stat_type(const struct evsel *evsel)
>>  {
>> +       u64 mask = perf_pmus__supports_extended_type() ? PERF_HW_EVENT_MASK : EVSEL_EVENT_MASK;
>> +
>>         /* Fake perf_hw_cache_op_id values for use with evsel__match. */
>>         u64 PERF_COUNT_hw_cache_l1d_miss = PERF_COUNT_HW_CACHE_L1D |
>>                 ((PERF_COUNT_HW_CACHE_OP_READ) << 8) |
>> @@ -97,41 +100,41 @@ static enum stat_type evsel__stat_type(const struct evsel *evsel)
>>
>>         if (evsel__is_clock(evsel))
>>                 return STAT_NSECS;
>> -       else if (evsel__match(evsel, HARDWARE, HW_CPU_CYCLES))
>> +       else if (evsel__match(evsel, HARDWARE, HW_CPU_CYCLES, mask))
>>                 return STAT_CYCLES;
>> -       else if (evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS))
>> +       else if (evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS, mask))
>>                 return STAT_INSTRUCTIONS;
>> -       else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_FRONTEND))
>> +       else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_FRONTEND, mask))
>>                 return STAT_STALLED_CYCLES_FRONT;
>> -       else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_BACKEND))
>> +       else if (evsel__match(evsel, HARDWARE, HW_STALLED_CYCLES_BACKEND, mask))
>>                 return STAT_STALLED_CYCLES_BACK;
>> -       else if (evsel__match(evsel, HARDWARE, HW_BRANCH_INSTRUCTIONS))
>> +       else if (evsel__match(evsel, HARDWARE, HW_BRANCH_INSTRUCTIONS, mask))
>>                 return STAT_BRANCHES;
>> -       else if (evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES))
>> +       else if (evsel__match(evsel, HARDWARE, HW_BRANCH_MISSES, mask))
>>                 return STAT_BRANCH_MISS;
>> -       else if (evsel__match(evsel, HARDWARE, HW_CACHE_REFERENCES))
>> +       else if (evsel__match(evsel, HARDWARE, HW_CACHE_REFERENCES, mask))
>>                 return STAT_CACHE_REFS;
>> -       else if (evsel__match(evsel, HARDWARE, HW_CACHE_MISSES))
>> +       else if (evsel__match(evsel, HARDWARE, HW_CACHE_MISSES, mask))
>>                 return STAT_CACHE_MISSES;
>> -       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_L1D))
>> +       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_L1D, mask))
>>                 return STAT_L1_DCACHE;
>> -       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_L1I))
>> +       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_L1I, mask))
>>                 return STAT_L1_ICACHE;
>> -       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_LL))
>> +       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_LL, mask))
>>                 return STAT_LL_CACHE;
>> -       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_DTLB))
>> +       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_DTLB, mask))
>>                 return STAT_DTLB_CACHE;
>> -       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_ITLB))
>> +       else if (evsel__match(evsel, HW_CACHE, HW_CACHE_ITLB, mask))
>>                 return STAT_ITLB_CACHE;
>> -       else if (evsel__match(evsel, HW_CACHE, hw_cache_l1d_miss))
>> +       else if (evsel__match(evsel, HW_CACHE, hw_cache_l1d_miss, mask))
>>                 return STAT_L1D_MISS;
>> -       else if (evsel__match(evsel, HW_CACHE, hw_cache_l1i_miss))
>> +       else if (evsel__match(evsel, HW_CACHE, hw_cache_l1i_miss, mask))
>>                 return STAT_L1I_MISS;
>> -       else if (evsel__match(evsel, HW_CACHE, hw_cache_ll_miss))
>> +       else if (evsel__match(evsel, HW_CACHE, hw_cache_ll_miss, mask))
>>                 return STAT_LL_MISS;
>> -       else if (evsel__match(evsel, HW_CACHE, hw_cache_dtlb_miss))
>> +       else if (evsel__match(evsel, HW_CACHE, hw_cache_dtlb_miss, mask))
>>                 return STAT_DTLB_MISS;
>> -       else if (evsel__match(evsel, HW_CACHE, hw_cache_itlb_miss))
>> +       else if (evsel__match(evsel, HW_CACHE, hw_cache_itlb_miss, mask))
>>                 return STAT_ITLB_MISS;
>>         return STAT_NONE;
>>  }
>> --
>> 2.35.1
>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/8] perf metric: JSON flag to default metric group
  2023-06-13 19:44   ` Ian Rogers
@ 2023-06-13 20:10     ` Liang, Kan
  2023-06-13 20:28       ` Ian Rogers
  0 siblings, 1 reply; 31+ messages in thread
From: Liang, Kan @ 2023-06-13 20:10 UTC (permalink / raw)
  To: Ian Rogers
  Cc: acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin



On 2023-06-13 3:44 p.m., Ian Rogers wrote:
> On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>>
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> For the default output, the default metric group could vary on different
>> platforms. For example, on SPR, the TopdownL1 and TopdownL2 metrics
>> should be displayed in the default mode. On ICL, only the TopdownL1
>> should be displayed.
>>
>> Add a flag so we can tag the default metric group for different
>> platforms rather than hack the perf code.
>>
>> The flag is added to Intel TopdownL1 since ICL and TopdownL2 metrics
>> since SPR.
>>
>> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
>> the real metric group name.
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> ---
>>  .../arch/x86/alderlake/adl-metrics.json       | 20 ++++---
>>  .../arch/x86/icelake/icl-metrics.json         | 20 ++++---
>>  .../arch/x86/icelakex/icx-metrics.json        | 20 ++++---
>>  .../arch/x86/sapphirerapids/spr-metrics.json  | 60 +++++++++++--------
>>  .../arch/x86/tigerlake/tgl-metrics.json       | 20 ++++---
>>  5 files changed, 84 insertions(+), 56 deletions(-)
>>
>> diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>> index c9f7e3d4ab08..e78c85220e27 100644
>> --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>> +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>> @@ -832,22 +832,24 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_backend_bound",
>>          "MetricThreshold": "tma_backend_bound > 0.2",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>          "ScaleUnit": "100%",
>>          "Unit": "cpu_core"
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_bad_speculation",
>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>          "ScaleUnit": "100%",
>>          "Unit": "cpu_core"
>> @@ -1112,11 +1114,12 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_frontend_bound",
>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>          "ScaleUnit": "100%",
>>          "Unit": "cpu_core"
>> @@ -2316,11 +2319,12 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_retiring",
>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>          "ScaleUnit": "100%",
>>          "Unit": "cpu_core"
> 
> For Alderlake the Default metric group is added for all cpu_core
> metrics but not cpu_atom. This will lead to only getting metrics for
> performance cores while the workload could be running on atoms. This
> could lead to a false conclusion that the workload has no issues with
> the metrics. I think this behavior is surprising and should be called
> out as intentional in the commit message.
>

The e-core doesn't have enough counters to calculate all the Topdown
events. It will trigger the multiplexing. We try to avoid it in the
default mode.
I will update the commit in V2.

Thanks,
Kan

> Thanks,
> Ian
> 
>> diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>> index 20210742171d..cc4edf855064 100644
>> --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>> +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>> @@ -111,21 +111,23 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_backend_bound",
>>          "MetricThreshold": "tma_backend_bound > 0.2",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>          "ScaleUnit": "100%"
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_bad_speculation",
>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>          "ScaleUnit": "100%"
>>      },
>> @@ -372,11 +374,12 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_frontend_bound",
>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>          "ScaleUnit": "100%"
>>      },
>> @@ -1378,11 +1381,12 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_retiring",
>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>          "ScaleUnit": "100%"
>>      },
>> diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>> index ef25cda019be..6f25b5b7aaf6 100644
>> --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>> +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>> @@ -315,21 +315,23 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_backend_bound",
>>          "MetricThreshold": "tma_backend_bound > 0.2",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>          "ScaleUnit": "100%"
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_bad_speculation",
>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>          "ScaleUnit": "100%"
>>      },
>> @@ -576,11 +578,12 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_frontend_bound",
>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>          "ScaleUnit": "100%"
>>      },
>> @@ -1674,11 +1677,12 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_retiring",
>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>          "ScaleUnit": "100%"
>>      },
>> diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>> index 4f3dd85540b6..c732982f70b5 100644
>> --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>> +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>> @@ -340,31 +340,34 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_backend_bound",
>>          "MetricThreshold": "tma_backend_bound > 0.2",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>          "ScaleUnit": "100%"
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_bad_speculation",
>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>          "ScaleUnit": "100%"
>>      },
>>      {
>>          "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
>> +        "DefaultMetricgroupName": "TopdownL2",
>>          "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> -        "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
>> +        "MetricGroup": "BadSpec;BrMispredicts;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
>>          "MetricName": "tma_branch_mispredicts",
>>          "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
>> -        "MetricgroupNoGroup": "TopdownL2",
>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>          "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction.  These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
>>          "ScaleUnit": "100%"
>>      },
>> @@ -407,11 +410,12 @@
>>      },
>>      {
>>          "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
>> +        "DefaultMetricgroupName": "TopdownL2",
>>          "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
>> -        "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>> +        "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>          "MetricName": "tma_core_bound",
>>          "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
>> -        "MetricgroupNoGroup": "TopdownL2",
>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>          "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck.  Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
>>          "ScaleUnit": "100%"
>>      },
>> @@ -509,21 +513,23 @@
>>      },
>>      {
>>          "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
>> +        "DefaultMetricgroupName": "TopdownL2",
>>          "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
>> -        "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
>> +        "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
>>          "MetricName": "tma_fetch_bandwidth",
>>          "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 6 > 0.35",
>> -        "MetricgroupNoGroup": "TopdownL2",
>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>          "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues.  For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
>>          "ScaleUnit": "100%"
>>      },
>>      {
>>          "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
>> +        "DefaultMetricgroupName": "TopdownL2",
>>          "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>> -        "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
>> +        "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
>>          "MetricName": "tma_fetch_latency",
>>          "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
>> -        "MetricgroupNoGroup": "TopdownL2",
>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>          "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
>>          "ScaleUnit": "100%"
>>      },
>> @@ -611,11 +617,12 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_frontend_bound",
>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>          "ScaleUnit": "100%"
>>      },
>> @@ -630,11 +637,12 @@
>>      },
>>      {
>>          "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
>> +        "DefaultMetricgroupName": "TopdownL2",
>>          "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> -        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>> +        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>          "MetricName": "tma_heavy_operations",
>>          "MetricThreshold": "tma_heavy_operations > 0.1",
>> -        "MetricgroupNoGroup": "TopdownL2",
>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>          "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY",
>>          "ScaleUnit": "100%"
>>      },
>> @@ -1486,11 +1494,12 @@
>>      },
>>      {
>>          "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
>> +        "DefaultMetricgroupName": "TopdownL2",
>>          "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
>> -        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>> +        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>          "MetricName": "tma_light_operations",
>>          "MetricThreshold": "tma_light_operations > 0.6",
>> -        "MetricgroupNoGroup": "TopdownL2",
>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>          "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
>>          "ScaleUnit": "100%"
>>      },
>> @@ -1540,11 +1549,12 @@
>>      },
>>      {
>>          "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
>> +        "DefaultMetricgroupName": "TopdownL2",
>>          "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
>> -        "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
>> +        "MetricGroup": "BadSpec;Default;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
>>          "MetricName": "tma_machine_clears",
>>          "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
>> -        "MetricgroupNoGroup": "TopdownL2",
>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>          "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears.  These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
>>          "ScaleUnit": "100%"
>>      },
>> @@ -1576,11 +1586,12 @@
>>      },
>>      {
>>          "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
>> +        "DefaultMetricgroupName": "TopdownL2",
>>          "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> -        "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>> +        "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>          "MetricName": "tma_memory_bound",
>>          "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
>> -        "MetricgroupNoGroup": "TopdownL2",
>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>          "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck.  Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
>>          "ScaleUnit": "100%"
>>      },
>> @@ -1784,11 +1795,12 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_retiring",
>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>          "ScaleUnit": "100%"
>>      },
>> diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>> index d0538a754288..83346911aa63 100644
>> --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>> +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>> @@ -105,21 +105,23 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_backend_bound",
>>          "MetricThreshold": "tma_backend_bound > 0.2",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>          "ScaleUnit": "100%"
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_bad_speculation",
>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>          "ScaleUnit": "100%"
>>      },
>> @@ -366,11 +368,12 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_frontend_bound",
>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>          "ScaleUnit": "100%"
>>      },
>> @@ -1392,11 +1395,12 @@
>>      },
>>      {
>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>> +        "DefaultMetricgroupName": "TopdownL1",
>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>          "MetricName": "tma_retiring",
>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>> -        "MetricgroupNoGroup": "TopdownL1",
>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>          "ScaleUnit": "100%"
>>      },
>> --
>> 2.35.1
>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/8] perf stat,jevents: Introduce Default tags for the default mode
  2023-06-13 19:59   ` Ian Rogers
@ 2023-06-13 20:11     ` Liang, Kan
  0 siblings, 0 replies; 31+ messages in thread
From: Liang, Kan @ 2023-06-13 20:11 UTC (permalink / raw)
  To: Ian Rogers
  Cc: acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin



On 2023-06-13 3:59 p.m., Ian Rogers wrote:
> On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>>
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Introduce a new metricgroup, Default, to tag all the metric groups which
>> will be collected in the default mode.
>>
>> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
>> the real metric group name. It will be printed in the default output
>> to replace the event names.
>>
>> There is nothing changed for the output format.
>>
>> On SPR, both TopdownL1 and TopdownL2 are displayed in the default
>> output.
>>
>> On ARM, Intel ICL and later platforms (before SPR), only TopdownL1 is
>> displayed in the default output.
>>
>> Suggested-by: Stephane Eranian <eranian@google.com>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> ---
>>  tools/perf/builtin-stat.c          | 4 ++--
>>  tools/perf/pmu-events/jevents.py   | 5 +++--
>>  tools/perf/pmu-events/pmu-events.h | 1 +
>>  tools/perf/util/metricgroup.c      | 3 +++
>>  4 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index c87c6897edc9..2269b3e90e9b 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -2154,14 +2154,14 @@ static int add_default_attributes(void)
>>                  * Add TopdownL1 metrics if they exist. To minimize
>>                  * multiplexing, don't request threshold computation.
>>                  */
>> -               if (metricgroup__has_metric(pmu, "TopdownL1")) {
>> +               if (metricgroup__has_metric(pmu, "Default")) {
>>                         struct evlist *metric_evlist = evlist__new();
>>                         struct evsel *metric_evsel;
>>
>>                         if (!metric_evlist)
>>                                 return -1;
>>
>> -                       if (metricgroup__parse_groups(metric_evlist, pmu, "TopdownL1",
>> +                       if (metricgroup__parse_groups(metric_evlist, pmu, "Default",
>>                                                         /*metric_no_group=*/false,
>>                                                         /*metric_no_merge=*/false,
>>                                                         /*metric_no_threshold=*/true,
>> diff --git a/tools/perf/pmu-events/jevents.py b/tools/perf/pmu-events/jevents.py
>> index 7ed258be1829..12e80bb7939b 100755
>> --- a/tools/perf/pmu-events/jevents.py
>> +++ b/tools/perf/pmu-events/jevents.py
>> @@ -54,8 +54,8 @@ _json_event_attributes = [
>>  # Attributes that are in pmu_metric rather than pmu_event.
>>  _json_metric_attributes = [
>>      'pmu', 'metric_name', 'metric_group', 'metric_expr', 'metric_threshold',
>> -    'desc', 'long_desc', 'unit', 'compat', 'metricgroup_no_group', 'aggr_mode',
>> -    'event_grouping'
>> +    'desc', 'long_desc', 'unit', 'compat', 'metricgroup_no_group',
>> +    'default_metricgroup_name', 'aggr_mode', 'event_grouping'
>>  ]
>>  # Attributes that are bools or enum int values, encoded as '0', '1',...
>>  _json_enum_attributes = ['aggr_mode', 'deprecated', 'event_grouping', 'perpkg']
>> @@ -307,6 +307,7 @@ class JsonEvent:
>>      self.metric_name = jd.get('MetricName')
>>      self.metric_group = jd.get('MetricGroup')
>>      self.metricgroup_no_group = jd.get('MetricgroupNoGroup')
>> +    self.default_metricgroup_name = jd.get('DefaultMetricgroupName')
>>      self.event_grouping = convert_metric_constraint(jd.get('MetricConstraint'))
>>      self.metric_expr = None
>>      if 'MetricExpr' in jd:
>> diff --git a/tools/perf/pmu-events/pmu-events.h b/tools/perf/pmu-events/pmu-events.h
>> index 8cd23d656a5d..caf59f23cd64 100644
>> --- a/tools/perf/pmu-events/pmu-events.h
>> +++ b/tools/perf/pmu-events/pmu-events.h
>> @@ -61,6 +61,7 @@ struct pmu_metric {
>>         const char *desc;
>>         const char *long_desc;
>>         const char *metricgroup_no_group;
>> +       const char *default_metricgroup_name;
>>         enum aggr_mode_class aggr_mode;
>>         enum metric_event_groups event_grouping;
>>  };
>> diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
>> index 74f2d8efc02d..efafa02db5e5 100644
>> --- a/tools/perf/util/metricgroup.c
>> +++ b/tools/perf/util/metricgroup.c
>> @@ -137,6 +137,8 @@ struct metric {
>>          * output.
>>          */
>>         const char *metric_unit;
>> +       /** Optional default metric group name */
>> +       const char *default_metricgroup_name;
> 
> Adding a bit more to the comment would be useful, like:
> 
> Optional name of the metric group reported if the Default metric group
> is being processed.

Sure.

Thanks,
Kan
> 
>>         /** Optional null terminated array of referenced metrics. */
>>         struct metric_ref *metric_refs;
>>         /**
>> @@ -219,6 +221,7 @@ static struct metric *metric__new(const struct pmu_metric *pm,
>>
>>         m->pmu = pm->pmu ?: "cpu";
>>         m->metric_name = pm->metric_name;
>> +       m->default_metricgroup_name = pm->default_metricgroup_name;
>>         m->modifier = NULL;
>>         if (modifier) {
>>                 m->modifier = strdup(modifier);
>> --
>> 2.35.1
>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/8] perf stat,metrics: New metricgroup output for the default mode
  2023-06-07 16:26 ` [PATCH 6/8] perf stat,metrics: New metricgroup output " kan.liang
@ 2023-06-13 20:16   ` Ian Rogers
  2023-06-13 20:50     ` Liang, Kan
  0 siblings, 1 reply; 31+ messages in thread
From: Ian Rogers @ 2023-06-13 20:16 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin

On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> In the default mode, the current output of the metricgroup include both
> events and metrics, which is not necessary and just makes the output
> hard to read. Since different ARCHs (even different generations in the
> same ARCH) may use different events. The output also vary on different
> platforms.
>
> For a metricgroup, only outputting the value of each metric is good
> enough.
>
> Current perf may append different metric groups to the same leader
> event, or append the metrics from the same metricgroup to different
> events. That could bring confusion when perf only prints the
> metricgroup output mode. For example, print the same metricgroup name
> several times.
> Reorganize metricgroup for the default mode and make sure that
> a metricgroup can only be appended to one event.
> Sort the metricgroup for the default mode by the name of the
> metricgroup.
>
> Add a new field default_metricgroup in evsel to indicate an event of
> the default metricgroup. For those events, printout() should print
> the metricgroup name rather than events.
>
> Add print_metricgroup_header() to print out the metricgroup name in
> different output formats.
>
> On SPR
> Before:
>
>  ./perf_old stat sleep 1
>
>  Performance counter stats for 'sleep 1':
>
>               0.54 msec task-clock:u                     #    0.001 CPUs utilized
>                  0      context-switches:u               #    0.000 /sec
>                  0      cpu-migrations:u                 #    0.000 /sec
>                 68      page-faults:u                    #  125.445 K/sec
>            540,970      cycles:u                         #    0.998 GHz
>            556,325      instructions:u                   #    1.03  insn per cycle
>            123,602      branches:u                       #  228.018 M/sec
>              6,889      branch-misses:u                  #    5.57% of all branches
>          3,245,820      TOPDOWN.SLOTS:u                  #     18.4 %  tma_backend_bound
>                                                   #     17.2 %  tma_retiring
>                                                   #     23.1 %  tma_bad_speculation
>                                                   #     41.4 %  tma_frontend_bound
>            564,859      topdown-retiring:u
>          1,370,999      topdown-fe-bound:u
>            603,271      topdown-be-bound:u
>            744,874      topdown-bad-spec:u
>             12,661      INT_MISC.UOP_DROPPING:u          #   23.357 M/sec
>
>        1.001798215 seconds time elapsed
>
>        0.000193000 seconds user
>        0.001700000 seconds sys
>
> After:
>
> $ ./perf stat sleep 1
>
>  Performance counter stats for 'sleep 1':
>
>               0.51 msec task-clock:u                     #    0.001 CPUs utilized
>                  0      context-switches:u               #    0.000 /sec
>                  0      cpu-migrations:u                 #    0.000 /sec
>                 68      page-faults:u                    #  132.683 K/sec
>            545,228      cycles:u                         #    1.064 GHz
>            555,509      instructions:u                   #    1.02  insn per cycle
>            123,574      branches:u                       #  241.120 M/sec
>              6,957      branch-misses:u                  #    5.63% of all branches
>                         TopdownL1                 #     17.5 %  tma_backend_bound
>                                                   #     22.6 %  tma_bad_speculation
>                                                   #     42.7 %  tma_frontend_bound
>                                                   #     17.1 %  tma_retiring
>                         TopdownL2                 #     21.8 %  tma_branch_mispredicts
>                                                   #     11.5 %  tma_core_bound
>                                                   #     13.4 %  tma_fetch_bandwidth
>                                                   #     29.3 %  tma_fetch_latency
>                                                   #      2.7 %  tma_heavy_operations
>                                                   #     14.5 %  tma_light_operations
>                                                   #      0.8 %  tma_machine_clears
>                                                   #      6.1 %  tma_memory_bound
>
>        1.001712086 seconds time elapsed
>
>        0.000151000 seconds user
>        0.001618000 seconds sys
>
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  tools/perf/builtin-stat.c      |   1 +
>  tools/perf/util/evsel.h        |   1 +
>  tools/perf/util/metricgroup.c  | 106 ++++++++++++++++++++++++++++++++-
>  tools/perf/util/metricgroup.h  |   1 +
>  tools/perf/util/stat-display.c |  69 ++++++++++++++++++++-
>  5 files changed, 172 insertions(+), 6 deletions(-)
>
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 2269b3e90e9b..b274cc264d56 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -2172,6 +2172,7 @@ static int add_default_attributes(void)
>
>                         evlist__for_each_entry(metric_evlist, metric_evsel) {
>                                 metric_evsel->skippable = true;
> +                               metric_evsel->default_metricgroup = true;
>                         }
>                         evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
>                         evlist__delete(metric_evlist);
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index 36a32e4ca168..61b1385108f4 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -130,6 +130,7 @@ struct evsel {
>         bool                    reset_group;
>         bool                    errored;
>         bool                    needs_auxtrace_mmap;
> +       bool                    default_metricgroup;

A comment would be useful here, something like:

If running perf stat, is this evsel a member of a Default metric group metric.

>         struct hashmap          *per_pkg_mask;
>         int                     err;
>         struct {
> diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
> index efafa02db5e5..22181ce4f27f 100644
> --- a/tools/perf/util/metricgroup.c
> +++ b/tools/perf/util/metricgroup.c
> @@ -79,6 +79,7 @@ static struct rb_node *metric_event_new(struct rblist *rblist __maybe_unused,
>                 return NULL;
>         memcpy(me, entry, sizeof(struct metric_event));
>         me->evsel = ((struct metric_event *)entry)->evsel;
> +       me->default_metricgroup_name = NULL;
>         INIT_LIST_HEAD(&me->head);
>         return &me->nd;
>  }
> @@ -1133,14 +1134,19 @@ static int metricgroup__add_metric_sys_event_iter(const struct pmu_metric *pm,
>  /**
>   * metric_list_cmp - list_sort comparator that sorts metrics with more events to
>   *                   the front. tool events are excluded from the count.
> + *                   For the default metrics, sort them by metricgroup name.
>   */
> -static int metric_list_cmp(void *priv __maybe_unused, const struct list_head *l,
> +static int metric_list_cmp(void *priv, const struct list_head *l,
>                            const struct list_head *r)
>  {
>         const struct metric *left = container_of(l, struct metric, nd);
>         const struct metric *right = container_of(r, struct metric, nd);
>         struct expr_id_data *data;
>         int i, left_count, right_count;
> +       bool is_default = *(bool *)priv;
> +
> +       if (is_default && left->default_metricgroup_name && right->default_metricgroup_name)
> +               return strcmp(left->default_metricgroup_name, right->default_metricgroup_name);

This breaks the comment above. The events are now sorted prioritizing
default metric group names. This potentially will have an effect of
reducing sharing of events between groups, it will also break
assumptions within that code that there are always the same number of
fewer events in a metric as you process the list. To remedy this I
think you need to re-sort the metrics after the event sharing has had
a chance to share events between groups.


>
>         left_count = hashmap__size(left->pctx->ids);
>         perf_tool_event__for_each_event(i) {
> @@ -1497,6 +1503,91 @@ static int parse_ids(bool metric_no_merge, struct perf_pmu *fake_pmu,
>         return ret;
>  }
>
> +static struct metric_event *
> +metricgroup__lookup_default_metricgroup(struct rblist *metric_events,
> +                                       struct evsel *evsel,
> +                                       struct metric *m)
> +{
> +       struct metric_event *me;
> +       char *name;
> +       int err;
> +
> +       me = metricgroup__lookup(metric_events, evsel, true);
> +       if (!me->default_metricgroup_name) {
> +               if (m->pmu && strcmp(m->pmu, "cpu"))
> +                       err = asprintf(&name, "%s (%s)", m->default_metricgroup_name, m->pmu);
> +               else
> +                       err = asprintf(&name, "%s", m->default_metricgroup_name);
> +               if (err < 0)
> +                       return NULL;
> +               me->default_metricgroup_name = name;
> +       }
> +       if (!strncmp(m->default_metricgroup_name,
> +                    me->default_metricgroup_name,
> +                    strlen(m->default_metricgroup_name)))
> +               return me;
> +
> +       return NULL;
> +}

A function comment would be useful as the name is confusing, why
lookup? Doesn't it create the value? Leak sanitizer isn't happy here:

```
==1545918==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 10 byte(s) in 1 object(s) allocated from:
    #0 0x7f2755a7077b in __interceptor_strdup
../../../../src/libsanitizer/asan/asan_interceptors.cpp:439
    #1 0x564986a8df31 in asprintf util/util.c:566
    #2 0x5649869b5901 in metricgroup__lookup_default_metricgroup
util/metricgroup.c:1520
    #3 0x5649869b5e57 in metricgroup__lookup_create util/metricgroup.c:1579
    #4 0x5649869b6ddc in parse_groups util/metricgroup.c:1698
    #5 0x5649869b7714 in metricgroup__parse_groups util/metricgroup.c:1771
    #6 0x5649867da9d5 in add_default_attributes tools/perf/builtin-stat.c:2164
    #7 0x5649867ddbfb in cmd_stat tools/perf/builtin-stat.c:2707
    #8 0x5649868fa5a2 in run_builtin tools/perf/perf.c:323
    #9 0x5649868fab13 in handle_internal_command tools/perf/perf.c:377
    #10 0x5649868faedb in run_argv tools/perf/perf.c:421
    #11 0x5649868fb443 in main tools/perf/perf.c:537
    #12 0x7f2754846189 in __libc_start_call_main
../sysdeps/nptl/libc_start_call_main.h:58
```

> +static struct metric_event *
> +metricgroup__lookup_create(struct rblist *metric_events,
> +                          struct evsel **evsel,
> +                          struct list_head *metric_list,
> +                          struct metric *m,
> +                          bool is_default)
> +{
> +       struct metric_event *me;
> +       struct metric *cur;
> +       struct evsel *ev;
> +       size_t i;
> +
> +       if (!is_default)
> +               return metricgroup__lookup(metric_events, evsel[0], true);
> +
> +       /*
> +        * If the metric group has been attached to a previous
> +        * event/metric, use that metric event.
> +        */
> +       list_for_each_entry(cur, metric_list, nd) {
> +               if (cur == m)
> +                       break;
> +               if (cur->pmu && strcmp(m->pmu, cur->pmu))
> +                       continue;
> +               if (strncmp(m->default_metricgroup_name,
> +                           cur->default_metricgroup_name,
> +                           strlen(m->default_metricgroup_name)))
> +                       continue;
> +               if (!cur->evlist)
> +                       continue;
> +               evlist__for_each_entry(cur->evlist, ev) {
> +                       me = metricgroup__lookup(metric_events, ev, false);
> +                       if (!strncmp(m->default_metricgroup_name,
> +                                    me->default_metricgroup_name,
> +                                    strlen(m->default_metricgroup_name)))
> +                               return me;
> +               }
> +       }
> +
> +       /*
> +        * Different metric groups may append to the same leader event.
> +        * For example, TopdownL1 and TopdownL2 are appended to the
> +        * TOPDOWN.SLOTS event.
> +        * Split it and append the new metric group to the next available
> +        * event.
> +        */
> +       me = metricgroup__lookup_default_metricgroup(metric_events, evsel[0], m);
> +       if (me)
> +               return me;
> +
> +       for (i = 1; i < hashmap__size(m->pctx->ids); i++) {
> +               me = metricgroup__lookup_default_metricgroup(metric_events, evsel[i], m);
> +               if (me)
> +                       return me;
> +       }
> +       return NULL;
> +}
> +

I have a hard time understanding this function, does it just go away
if you do the two sorts that I proposed above? Should this be
metric_event__lookup_create? A function comment saying what the code
is trying to achieve would be useful.

This appears to be trying to correct output issues by changing how
metrics are associated with events, shouldn't output issues be
resolved by fixing the output code? If not, why don't we apply this
logic to TopdownL1, why just Default?

>  static int parse_groups(struct evlist *perf_evlist,
>                         const char *pmu, const char *str,
>                         bool metric_no_group,
> @@ -1512,6 +1603,7 @@ static int parse_groups(struct evlist *perf_evlist,
>         LIST_HEAD(metric_list);
>         struct metric *m;
>         bool tool_events[PERF_TOOL_MAX] = {false};
> +       bool is_default = !strcmp(str, "Default");
>         int ret;
>
>         if (metric_events_list->nr_entries == 0)
> @@ -1523,7 +1615,7 @@ static int parse_groups(struct evlist *perf_evlist,
>                 goto out;
>
>         /* Sort metrics from largest to smallest. */
> -       list_sort(NULL, &metric_list, metric_list_cmp);
> +       list_sort((void *)&is_default, &metric_list, metric_list_cmp);
>
>         if (!metric_no_merge) {
>                 struct expr_parse_ctx *combined = NULL;
> @@ -1603,7 +1695,15 @@ static int parse_groups(struct evlist *perf_evlist,
>                         goto out;
>                 }
>
> -               me = metricgroup__lookup(metric_events_list, metric_events[0], true);
> +               me = metricgroup__lookup_create(metric_events_list,
> +                                               metric_events,
> +                                               &metric_list, m,
> +                                               is_default);
> +               if (!me) {
> +                       pr_err("Cannot create metric group for default!\n");
> +                       ret = -EINVAL;
> +                       goto out;
> +               }
>
>                 expr = malloc(sizeof(struct metric_expr));
>                 if (!expr) {
> diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
> index bf18274c15df..e3609b853213 100644
> --- a/tools/perf/util/metricgroup.h
> +++ b/tools/perf/util/metricgroup.h
> @@ -22,6 +22,7 @@ struct cgroup;
>  struct metric_event {
>         struct rb_node nd;
>         struct evsel *evsel;
> +       char *default_metricgroup_name;
>         struct list_head head; /* list of metric_expr */
>  };
>
> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> index a2bbdc25d979..efe5fd04c033 100644
> --- a/tools/perf/util/stat-display.c
> +++ b/tools/perf/util/stat-display.c
> @@ -21,10 +21,12 @@
>  #include "iostat.h"
>  #include "pmu.h"
>  #include "pmus.h"
> +#include "metricgroup.h"

This is bringing metric code from stat-shadow, which is kind of the
whole reason there is a separation and stat-shadow exists. Should the
logic exist in stat-shadow instead?

>
>  #define CNTR_NOT_SUPPORTED     "<not supported>"
>  #define CNTR_NOT_COUNTED       "<not counted>"
>
> +#define MGROUP_LEN   50
>  #define METRIC_LEN   38
>  #define EVNAME_LEN   32
>  #define COUNTS_LEN   18
> @@ -707,6 +709,55 @@ static bool evlist__has_hybrid(struct evlist *evlist)
>         return false;
>  }
>
> +static void print_metricgroup_header_json(struct perf_stat_config *config,
> +                                         struct outstate *os __maybe_unused,
> +                                         const char *metricgroup_name)
> +{
> +       fprintf(config->output, "\"metricgroup\" : \"%s\"}", metricgroup_name);
> +       new_line_json(config, (void *)os);
> +}
> +

Should the output part of this patch be separate from the
evsel/evlist/meitric modifications?

Thanks,
Ian

> +static void print_metricgroup_header_csv(struct perf_stat_config *config,
> +                                        struct outstate *os,
> +                                        const char *metricgroup_name)
> +{
> +       int i;
> +
> +       for (i = 0; i < os->nfields; i++)
> +               fputs(config->csv_sep, os->fh);
> +       fprintf(config->output, "%s", metricgroup_name);
> +       new_line_csv(config, (void *)os);
> +}
> +
> +static void print_metricgroup_header_std(struct perf_stat_config *config,
> +                                        struct outstate *os __maybe_unused,
> +                                        const char *metricgroup_name)
> +{
> +       int n = fprintf(config->output, " %*s", EVNAME_LEN, metricgroup_name);
> +
> +       fprintf(config->output, "%*s", MGROUP_LEN - n - 1, "");
> +}
> +
> +static void print_metricgroup_header(struct perf_stat_config *config,
> +                                    struct outstate *os,
> +                                    struct evsel *counter,
> +                                    double noise, u64 run, u64 ena,
> +                                    const char *metricgroup_name)
> +{
> +       aggr_printout(config, os->evsel, os->id, os->aggr_nr);
> +
> +       print_noise(config, counter, noise, /*before_metric=*/true);
> +       print_running(config, run, ena, /*before_metric=*/true);
> +
> +       if (config->json_output) {
> +               print_metricgroup_header_json(config, os, metricgroup_name);
> +       } else if (config->csv_output) {
> +               print_metricgroup_header_csv(config, os, metricgroup_name);
> +       } else
> +               print_metricgroup_header_std(config, os, metricgroup_name);
> +
> +}
> +
>  static void printout(struct perf_stat_config *config, struct outstate *os,
>                      double uval, u64 run, u64 ena, double noise, int aggr_idx)
>  {
> @@ -751,10 +802,17 @@ static void printout(struct perf_stat_config *config, struct outstate *os,
>         out.force_header = false;
>
>         if (!config->metric_only) {
> -               abs_printout(config, os->id, os->aggr_nr, counter, uval, ok);
> +               if (counter->default_metricgroup) {
> +                       struct metric_event *me;
>
> -               print_noise(config, counter, noise, /*before_metric=*/true);
> -               print_running(config, run, ena, /*before_metric=*/true);
> +                       me = metricgroup__lookup(&config->metric_events, counter, false);
> +                       print_metricgroup_header(config, os, counter, noise, run, ena,
> +                                                me->default_metricgroup_name);
> +               } else {
> +                       abs_printout(config, os->id, os->aggr_nr, counter, uval, ok);
> +                       print_noise(config, counter, noise, /*before_metric=*/true);
> +                       print_running(config, run, ena, /*before_metric=*/true);
> +               }
>         }
>
>         if (ok) {
> @@ -883,6 +941,11 @@ static void print_counter_aggrdata(struct perf_stat_config *config,
>         if (counter->merged_stat)
>                 return;
>
> +       /* Only print the metric group for the default mode */
> +       if (counter->default_metricgroup &&
> +           !metricgroup__lookup(&config->metric_events, counter, false))
> +               return;
> +
>         uniquify_counter(config, counter);
>
>         val = aggr->counts.val;
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/8] pert tests: Support metricgroup perf stat JSON output
  2023-06-07 16:26 ` [PATCH 7/8] pert tests: Support metricgroup perf stat JSON output kan.liang
@ 2023-06-13 20:17   ` Ian Rogers
  2023-06-13 20:30     ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 31+ messages in thread
From: Ian Rogers @ 2023-06-13 20:17 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin

On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> A new field metricgroup has been added in the perf stat JSON output.
> Support it in the test case.
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>

Acked-by: Ian Rogers <irogers@google.com>

Thanks,
Ian

> ---
>  tools/perf/tests/shell/lib/perf_json_output_lint.py | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/tools/perf/tests/shell/lib/perf_json_output_lint.py b/tools/perf/tests/shell/lib/perf_json_output_lint.py
> index b81582a89d36..5e9bd68c83fe 100644
> --- a/tools/perf/tests/shell/lib/perf_json_output_lint.py
> +++ b/tools/perf/tests/shell/lib/perf_json_output_lint.py
> @@ -55,6 +55,7 @@ def check_json_output(expected_items):
>        'interval': lambda x: isfloat(x),
>        'metric-unit': lambda x: True,
>        'metric-value': lambda x: isfloat(x),
> +      'metricgroup': lambda x: True,
>        'node': lambda x: True,
>        'pcnt-running': lambda x: isfloat(x),
>        'socket': lambda x: True,
> @@ -70,6 +71,8 @@ def check_json_output(expected_items):
>          # values and possibly other prefixes like interval, core and
>          # aggregate-number.
>          pass
> +      elif count != expected_items and count >= 1 and count <= 5 and 'metricgroup' in item:
> +        pass
>        elif count != expected_items:
>          raise RuntimeError(f'wrong number of fields. counted {count} expected {expected_items}'
>                             f' in \'{item}\'')
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 8/8] perf test: Add test case for the standard perf stat output
  2023-06-07 16:27 ` [PATCH 8/8] perf test: Add test case for the standard perf stat output kan.liang
@ 2023-06-13 20:21   ` Ian Rogers
  0 siblings, 0 replies; 31+ messages in thread
From: Ian Rogers @ 2023-06-13 20:21 UTC (permalink / raw)
  To: kan.liang
  Cc: acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin

On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Add a new test case to verify the standard perf stat output with
> different options.
>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> ---
>  tools/perf/tests/shell/stat+std_output.sh | 259 ++++++++++++++++++++++
>  1 file changed, 259 insertions(+)
>  create mode 100755 tools/perf/tests/shell/stat+std_output.sh
>
> diff --git a/tools/perf/tests/shell/stat+std_output.sh b/tools/perf/tests/shell/stat+std_output.sh
> new file mode 100755
> index 000000000000..b9db0f245450
> --- /dev/null
> +++ b/tools/perf/tests/shell/stat+std_output.sh
> @@ -0,0 +1,259 @@
> +#!/bin/bash
> +# perf stat STD output linter
> +# SPDX-License-Identifier: GPL-2.0
> +# Tests various perf stat STD output commands for
> +# default event and metricgroup
> +
> +set -e
> +
> +skip_test=0
> +
> +stat_output=$(mktemp /tmp/__perf_test.stat_output.std.XXXXX)
> +
> +event_name=(cpu-clock task-clock context-switches cpu-migrations page-faults cycles instructions branches branch-misses stalled-cycles-frontend stalled-cycles-backend)
> +event_metric=("CPUs utilized" "CPUs utilized" "/sec" "/sec" "/sec" "GHz" "insn per cycle" "/sec" "of all branches" "frontend cycles idle" "backend cycles idle")
> +
> +metricgroup_name=(TopdownL1 TopdownL2)
> +
> +cleanup() {
> +  rm -f "${stat_output}"
> +
> +  trap - EXIT TERM INT
> +}
> +
> +trap_cleanup() {
> +  cleanup
> +  exit 1
> +}
> +trap trap_cleanup EXIT TERM INT
> +
> +function commachecker()
> +{
> +       local -i cnt=0
> +       local prefix=1
> +
> +       case "$1"
> +       in "--interval")        prefix=2
> +       ;; "--per-thread")      prefix=2
> +       ;; "--system-wide-no-aggr")     prefix=2
> +       ;; "--per-core")        prefix=3
> +       ;; "--per-socket")      prefix=3
> +       ;; "--per-node")        prefix=3
> +       ;; "--per-die")         prefix=3
> +       ;; "--per-cache")       prefix=3
> +       esac
> +
> +       while read line
> +       do
> +               # Ignore initial "started on" comment.
> +               x=${line:0:1}
> +               [ "$x" = "#" ] && continue
> +               # Ignore initial blank line.
> +               [ "$line" = "" ] && continue
> +               # Ignore "Performance counter stats"
> +               x=${line:0:25}
> +               [ "$x" = "Performance counter stats" ] && continue
> +               # Ignore "seconds time elapsed" and break
> +               [[ "$line" == *"time elapsed"* ]] && break
> +
> +               main_body=$(echo $line | cut -d' ' -f$prefix-)
> +               x=${main_body%#*}
> +               # Check default metricgroup
> +               y=$(echo $x | tr -d ' ')
> +               [ "$y" = "" ] && continue
> +               for i in "${!metricgroup_name[@]}"; do
> +                       [[ "$y" == *"${metricgroup_name[$i]}"* ]] && break
> +               done
> +               [[ "$y" == *"${metricgroup_name[$i]}"* ]] && continue
> +
> +               # Check default event
> +               for i in "${!event_name[@]}"; do
> +                       [[ "$x" == *"${event_name[$i]}"* ]] && break
> +               done
> +
> +               [[ ! "$x" == *"${event_name[$i]}"* ]] && {
> +                       echo "Unknown event name in $line" 1>&2
> +                       exit 1;
> +               }
> +
> +               # Check event metric if it exists
> +               [[ ! "$main_body" == *"#"* ]] && continue
> +               [[ ! "$main_body" == *"${event_metric[$i]}"* ]] && {
> +                       echo "wrong event metric. expected ${event_metric[$i]} in $line" 1>&2
> +                       exit 1;
> +               }
> +       done < "${stat_output}"
> +       return 0
> +}
> +
> +# Return true if perf_event_paranoid is > $1 and not running as root.
> +function ParanoidAndNotRoot()
> +{
> +        [ $(id -u) != 0 ] && [ $(cat /proc/sys/kernel/perf_event_paranoid) -gt $1 ]
> +}
> +
> +check_no_args()
> +{
> +       echo -n "Checking STD output: no args "
> +       perf stat -o "${stat_output}" true
> +        commachecker --no-args
> +       echo "[Success]"
> +}
> +
> +check_system_wide()
> +{
> +       echo -n "Checking STD output: system wide "
> +       if ParanoidAndNotRoot 0
> +       then
> +               echo "[Skip] paranoid and not root"
> +               return
> +       fi
> +       perf stat -a -o "${stat_output}" true
> +        commachecker --system-wide
> +       echo "[Success]"
> +}
> +
> +check_system_wide_no_aggr()
> +{
> +       echo -n "Checking STD output: system wide no aggregation "
> +       if ParanoidAndNotRoot 0
> +       then
> +               echo "[Skip] paranoid and not root"
> +               return
> +       fi
> +       perf stat -A -a --no-merge -o "${stat_output}" true
> +        commachecker --system-wide-no-aggr
> +       echo "[Success]"
> +}
> +
> +check_interval()
> +{
> +       echo -n "Checking STD output: interval "
> +       perf stat -I 1000 -o "${stat_output}" true
> +        commachecker --interval
> +       echo "[Success]"
> +}
> +
> +
> +check_per_core()
> +{
> +       echo -n "Checking STD output: per core "
> +       if ParanoidAndNotRoot 0
> +       then
> +               echo "[Skip] paranoid and not root"
> +               return
> +       fi
> +       perf stat --per-core -a -o "${stat_output}" true
> +        commachecker --per-core
> +       echo "[Success]"
> +}
> +
> +check_per_thread()
> +{
> +       echo -n "Checking STD output: per thread "
> +       if ParanoidAndNotRoot 0
> +       then
> +               echo "[Skip] paranoid and not root"
> +               return
> +       fi
> +       perf stat --per-thread -a -o "${stat_output}" true
> +        commachecker --per-thread
> +       echo "[Success]"
> +}
> +
> +check_per_cache_instance()
> +{
> +       echo -n "Checking STD output: per cache instance "
> +       if ParanoidAndNotRoot 0
> +       then
> +               echo "[Skip] paranoid and not root"
> +               return
> +       fi
> +       perf stat  --per-cache -a true 2>&1 | commachecker --per-cache
> +       echo "[Success]"
> +}
> +
> +check_per_die()
> +{
> +       echo -n "Checking STD output: per die "
> +       if ParanoidAndNotRoot 0
> +       then
> +               echo "[Skip] paranoid and not root"
> +               return
> +       fi
> +       perf stat --per-die -a -o "${stat_output}" true
> +        commachecker --per-die
> +       echo "[Success]"
> +}
> +
> +check_per_node()
> +{
> +       echo -n "Checking STD output: per node "
> +       if ParanoidAndNotRoot 0
> +       then
> +               echo "[Skip] paranoid and not root"
> +               return
> +       fi
> +       perf stat --per-node -a -o "${stat_output}" true
> +        commachecker --per-node
> +       echo "[Success]"
> +}
> +
> +check_per_socket()
> +{
> +       echo -n "Checking STD output: per socket "
> +       if ParanoidAndNotRoot 0
> +       then
> +               echo "[Skip] paranoid and not root"
> +               return
> +       fi
> +       perf stat --per-socket -a -o "${stat_output}" true
> +        commachecker --per-socket
> +       echo "[Success]"
> +}
> +
> +# The perf stat options for per-socket, per-core, per-die
> +# and -A ( no_aggr mode ) uses the info fetched from this
> +# directory: "/sys/devices/system/cpu/cpu*/topology". For
> +# example, socket value is fetched from "physical_package_id"
> +# file in topology directory.
> +# Reference: cpu__get_topology_int in util/cpumap.c
> +# If the platform doesn't expose topology information, values
> +# will be set to -1. For example, incase of pSeries platform
> +# of powerpc, value for  "physical_package_id" is restricted
> +# and set to -1. Check here validates the socket-id read from
> +# topology file before proceeding further
> +
> +FILE_LOC="/sys/devices/system/cpu/cpu*/topology/"
> +FILE_NAME="physical_package_id"
> +
> +check_for_topology()
> +{
> +       if ! ParanoidAndNotRoot 0
> +       then
> +               socket_file=`ls $FILE_LOC/$FILE_NAME | head -n 1`
> +               [ -z $socket_file ] && return 0
> +               socket_id=`cat $socket_file`
> +               [ $socket_id == -1 ] && skip_test=1
> +               return 0
> +       fi
> +}

Tests, great! This logic is taken from
tools/perf/tests/shell/stat+csv_output.sh, could we share the
implementation between that and here by moving the code into something
in the lib directory?

Thanks,
Ian

> +
> +check_for_topology
> +check_no_args
> +check_system_wide
> +check_interval
> +check_per_thread
> +check_per_node
> +if [ $skip_test -ne 1 ]
> +then
> +       check_system_wide_no_aggr
> +       check_per_core
> +       check_per_cache_instance
> +       check_per_die
> +       check_per_socket
> +else
> +       echo "[Skip] Skipping tests for system_wide_no_aggr, per_core, per_die and per_socket since socket id exposed via topology is invalid"
> +fi
> +cleanup
> +exit 0
> --
> 2.35.1
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/8] perf metric: JSON flag to default metric group
  2023-06-13 20:10     ` Liang, Kan
@ 2023-06-13 20:28       ` Ian Rogers
  2023-06-13 20:59         ` Liang, Kan
  0 siblings, 1 reply; 31+ messages in thread
From: Ian Rogers @ 2023-06-13 20:28 UTC (permalink / raw)
  To: Liang, Kan, ahmad.yasin
  Cc: acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian

On Tue, Jun 13, 2023 at 1:10 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-06-13 3:44 p.m., Ian Rogers wrote:
> > On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
> >>
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> For the default output, the default metric group could vary on different
> >> platforms. For example, on SPR, the TopdownL1 and TopdownL2 metrics
> >> should be displayed in the default mode. On ICL, only the TopdownL1
> >> should be displayed.
> >>
> >> Add a flag so we can tag the default metric group for different
> >> platforms rather than hack the perf code.
> >>
> >> The flag is added to Intel TopdownL1 since ICL and TopdownL2 metrics
> >> since SPR.
> >>
> >> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
> >> the real metric group name.
> >>
> >> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >> ---
> >>  .../arch/x86/alderlake/adl-metrics.json       | 20 ++++---
> >>  .../arch/x86/icelake/icl-metrics.json         | 20 ++++---
> >>  .../arch/x86/icelakex/icx-metrics.json        | 20 ++++---
> >>  .../arch/x86/sapphirerapids/spr-metrics.json  | 60 +++++++++++--------
> >>  .../arch/x86/tigerlake/tgl-metrics.json       | 20 ++++---
> >>  5 files changed, 84 insertions(+), 56 deletions(-)
> >>
> >> diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> >> index c9f7e3d4ab08..e78c85220e27 100644
> >> --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> >> +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> >> @@ -832,22 +832,24 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_backend_bound",
> >>          "MetricThreshold": "tma_backend_bound > 0.2",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>          "ScaleUnit": "100%",
> >>          "Unit": "cpu_core"
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_bad_speculation",
> >>          "MetricThreshold": "tma_bad_speculation > 0.15",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>          "ScaleUnit": "100%",
> >>          "Unit": "cpu_core"
> >> @@ -1112,11 +1114,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
> >> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_frontend_bound",
> >>          "MetricThreshold": "tma_frontend_bound > 0.15",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>          "ScaleUnit": "100%",
> >>          "Unit": "cpu_core"
> >> @@ -2316,11 +2319,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_retiring",
> >>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>          "ScaleUnit": "100%",
> >>          "Unit": "cpu_core"
> >
> > For Alderlake the Default metric group is added for all cpu_core
> > metrics but not cpu_atom. This will lead to only getting metrics for
> > performance cores while the workload could be running on atoms. This
> > could lead to a false conclusion that the workload has no issues with
> > the metrics. I think this behavior is surprising and should be called
> > out as intentional in the commit message.
> >
>
> The e-core doesn't have enough counters to calculate all the Topdown
> events. It will trigger the multiplexing. We try to avoid it in the
> default mode.
> I will update the commit in V2.

Is multiplexing a worse crime than only giving output for half the
cores? Both can be misleading. Perhaps the safest thing is to not use
Default on hybrid platforms.

Thanks,
Ian

> Thanks,
> Kan
>
> > Thanks,
> > Ian
> >
> >> diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> >> index 20210742171d..cc4edf855064 100644
> >> --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> >> +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> >> @@ -111,21 +111,23 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_backend_bound",
> >>          "MetricThreshold": "tma_backend_bound > 0.2",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>          "ScaleUnit": "100%"
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_bad_speculation",
> >>          "MetricThreshold": "tma_bad_speculation > 0.15",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>          "ScaleUnit": "100%"
> >>      },
> >> @@ -372,11 +374,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_frontend_bound",
> >>          "MetricThreshold": "tma_frontend_bound > 0.15",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>          "ScaleUnit": "100%"
> >>      },
> >> @@ -1378,11 +1381,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_retiring",
> >>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>          "ScaleUnit": "100%"
> >>      },
> >> diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> >> index ef25cda019be..6f25b5b7aaf6 100644
> >> --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> >> +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> >> @@ -315,21 +315,23 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_backend_bound",
> >>          "MetricThreshold": "tma_backend_bound > 0.2",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>          "ScaleUnit": "100%"
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_bad_speculation",
> >>          "MetricThreshold": "tma_bad_speculation > 0.15",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>          "ScaleUnit": "100%"
> >>      },
> >> @@ -576,11 +578,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_frontend_bound",
> >>          "MetricThreshold": "tma_frontend_bound > 0.15",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>          "ScaleUnit": "100%"
> >>      },
> >> @@ -1674,11 +1677,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_retiring",
> >>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>          "ScaleUnit": "100%"
> >>      },
> >> diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> >> index 4f3dd85540b6..c732982f70b5 100644
> >> --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> >> +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> >> @@ -340,31 +340,34 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_backend_bound",
> >>          "MetricThreshold": "tma_backend_bound > 0.2",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>          "ScaleUnit": "100%"
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_bad_speculation",
> >>          "MetricThreshold": "tma_bad_speculation > 0.15",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>          "ScaleUnit": "100%"
> >>      },
> >>      {
> >>          "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
> >> +        "DefaultMetricgroupName": "TopdownL2",
> >>          "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> -        "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
> >> +        "MetricGroup": "BadSpec;BrMispredicts;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
> >>          "MetricName": "tma_branch_mispredicts",
> >>          "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
> >> -        "MetricgroupNoGroup": "TopdownL2",
> >> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>          "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction.  These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
> >>          "ScaleUnit": "100%"
> >>      },
> >> @@ -407,11 +410,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
> >> +        "DefaultMetricgroupName": "TopdownL2",
> >>          "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
> >> -        "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >> +        "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >>          "MetricName": "tma_core_bound",
> >>          "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
> >> -        "MetricgroupNoGroup": "TopdownL2",
> >> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>          "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck.  Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
> >>          "ScaleUnit": "100%"
> >>      },
> >> @@ -509,21 +513,23 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
> >> +        "DefaultMetricgroupName": "TopdownL2",
> >>          "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
> >> -        "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
> >> +        "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
> >>          "MetricName": "tma_fetch_bandwidth",
> >>          "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 6 > 0.35",
> >> -        "MetricgroupNoGroup": "TopdownL2",
> >> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>          "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues.  For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
> >>          "ScaleUnit": "100%"
> >>      },
> >>      {
> >>          "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
> >> +        "DefaultMetricgroupName": "TopdownL2",
> >>          "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >> -        "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
> >> +        "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
> >>          "MetricName": "tma_fetch_latency",
> >>          "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
> >> -        "MetricgroupNoGroup": "TopdownL2",
> >> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>          "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
> >>          "ScaleUnit": "100%"
> >>      },
> >> @@ -611,11 +617,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_frontend_bound",
> >>          "MetricThreshold": "tma_frontend_bound > 0.15",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>          "ScaleUnit": "100%"
> >>      },
> >> @@ -630,11 +637,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
> >> +        "DefaultMetricgroupName": "TopdownL2",
> >>          "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> -        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >> +        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >>          "MetricName": "tma_heavy_operations",
> >>          "MetricThreshold": "tma_heavy_operations > 0.1",
> >> -        "MetricgroupNoGroup": "TopdownL2",
> >> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>          "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY",
> >>          "ScaleUnit": "100%"
> >>      },
> >> @@ -1486,11 +1494,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
> >> +        "DefaultMetricgroupName": "TopdownL2",
> >>          "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
> >> -        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >> +        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >>          "MetricName": "tma_light_operations",
> >>          "MetricThreshold": "tma_light_operations > 0.6",
> >> -        "MetricgroupNoGroup": "TopdownL2",
> >> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>          "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
> >>          "ScaleUnit": "100%"
> >>      },
> >> @@ -1540,11 +1549,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
> >> +        "DefaultMetricgroupName": "TopdownL2",
> >>          "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
> >> -        "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
> >> +        "MetricGroup": "BadSpec;Default;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
> >>          "MetricName": "tma_machine_clears",
> >>          "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
> >> -        "MetricgroupNoGroup": "TopdownL2",
> >> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>          "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears.  These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
> >>          "ScaleUnit": "100%"
> >>      },
> >> @@ -1576,11 +1586,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
> >> +        "DefaultMetricgroupName": "TopdownL2",
> >>          "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> -        "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >> +        "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >>          "MetricName": "tma_memory_bound",
> >>          "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
> >> -        "MetricgroupNoGroup": "TopdownL2",
> >> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>          "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck.  Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
> >>          "ScaleUnit": "100%"
> >>      },
> >> @@ -1784,11 +1795,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_retiring",
> >>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>          "ScaleUnit": "100%"
> >>      },
> >> diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> >> index d0538a754288..83346911aa63 100644
> >> --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> >> +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> >> @@ -105,21 +105,23 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_backend_bound",
> >>          "MetricThreshold": "tma_backend_bound > 0.2",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>          "ScaleUnit": "100%"
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_bad_speculation",
> >>          "MetricThreshold": "tma_bad_speculation > 0.15",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>          "ScaleUnit": "100%"
> >>      },
> >> @@ -366,11 +368,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_frontend_bound",
> >>          "MetricThreshold": "tma_frontend_bound > 0.15",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>          "ScaleUnit": "100%"
> >>      },
> >> @@ -1392,11 +1395,12 @@
> >>      },
> >>      {
> >>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >> +        "DefaultMetricgroupName": "TopdownL1",
> >>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>          "MetricName": "tma_retiring",
> >>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >> -        "MetricgroupNoGroup": "TopdownL1",
> >> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>          "ScaleUnit": "100%"
> >>      },
> >> --
> >> 2.35.1
> >>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 7/8] pert tests: Support metricgroup perf stat JSON output
  2023-06-13 20:17   ` Ian Rogers
@ 2023-06-13 20:30     ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 31+ messages in thread
From: Arnaldo Carvalho de Melo @ 2023-06-13 20:30 UTC (permalink / raw)
  To: Ian Rogers
  Cc: kan.liang, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin

Em Tue, Jun 13, 2023 at 01:17:41PM -0700, Ian Rogers escreveu:
> On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
> >
> > From: Kan Liang <kan.liang@linux.intel.com>
> >
> > A new field metricgroup has been added in the perf stat JSON output.
> > Support it in the test case.
> >
> > Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> 
> Acked-by: Ian Rogers <irogers@google.com>

Thanks, applied.

- Arnaldo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/8] perf vendor events arm64: Add default tags into topdown L1 metrics
  2023-06-13 19:45   ` Ian Rogers
@ 2023-06-13 20:31     ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 31+ messages in thread
From: Arnaldo Carvalho de Melo @ 2023-06-13 20:31 UTC (permalink / raw)
  To: Ian Rogers
  Cc: kan.liang, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin,
	Jing Zhang, John Garry

Em Tue, Jun 13, 2023 at 12:45:10PM -0700, Ian Rogers escreveu:
> On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
> >
> > From: Kan Liang <kan.liang@linux.intel.com>
> >
> > Add the default tags for ARM as well.
> >
> > Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> > Cc: Jing Zhang <renyu.zj@linux.alibaba.com>
> > Cc: John Garry <john.g.garry@oracle.com>
> 
> Acked-by: Ian Rogers <irogers@google.com>

Thanks, applied.

- Arnaldo

 
> Thanks,
> Ian
> 
> > ---
> >  tools/perf/pmu-events/arch/arm64/sbsa.json | 12 ++++++++----
> >  1 file changed, 8 insertions(+), 4 deletions(-)
> >
> > diff --git a/tools/perf/pmu-events/arch/arm64/sbsa.json b/tools/perf/pmu-events/arch/arm64/sbsa.json
> > index f678c37ea9c3..f90b338261ac 100644
> > --- a/tools/perf/pmu-events/arch/arm64/sbsa.json
> > +++ b/tools/perf/pmu-events/arch/arm64/sbsa.json
> > @@ -2,28 +2,32 @@
> >      {
> >          "MetricExpr": "stall_slot_frontend / (#slots * cpu_cycles)",
> >          "BriefDescription": "Frontend bound L1 topdown metric",
> > -        "MetricGroup": "TopdownL1",
> > +        "DefaultMetricgroupName": "TopdownL1",
> > +        "MetricGroup": "Default;TopdownL1",
> >          "MetricName": "frontend_bound",
> >          "ScaleUnit": "100%"
> >      },
> >      {
> >          "MetricExpr": "(1 - op_retired / op_spec) * (1 - stall_slot / (#slots * cpu_cycles))",
> >          "BriefDescription": "Bad speculation L1 topdown metric",
> > -        "MetricGroup": "TopdownL1",
> > +        "DefaultMetricgroupName": "TopdownL1",
> > +        "MetricGroup": "Default;TopdownL1",
> >          "MetricName": "bad_speculation",
> >          "ScaleUnit": "100%"
> >      },
> >      {
> >          "MetricExpr": "(op_retired / op_spec) * (1 - stall_slot / (#slots * cpu_cycles))",
> >          "BriefDescription": "Retiring L1 topdown metric",
> > -        "MetricGroup": "TopdownL1",
> > +        "DefaultMetricgroupName": "TopdownL1",
> > +        "MetricGroup": "Default;TopdownL1",
> >          "MetricName": "retiring",
> >          "ScaleUnit": "100%"
> >      },
> >      {
> >          "MetricExpr": "stall_slot_backend / (#slots * cpu_cycles)",
> >          "BriefDescription": "Backend Bound L1 topdown metric",
> > -        "MetricGroup": "TopdownL1",
> > +        "DefaultMetricgroupName": "TopdownL1",
> > +        "MetricGroup": "Default;TopdownL1",
> >          "MetricName": "backend_bound",
> >          "ScaleUnit": "100%"
> >      }
> > --
> > 2.35.1
> >

-- 

- Arnaldo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/8] perf stat,metrics: New metricgroup output for the default mode
  2023-06-13 20:16   ` Ian Rogers
@ 2023-06-13 20:50     ` Liang, Kan
  0 siblings, 0 replies; 31+ messages in thread
From: Liang, Kan @ 2023-06-13 20:50 UTC (permalink / raw)
  To: Ian Rogers
  Cc: acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin



On 2023-06-13 4:16 p.m., Ian Rogers wrote:
> On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>>
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> In the default mode, the current output of the metricgroup include both
>> events and metrics, which is not necessary and just makes the output
>> hard to read. Since different ARCHs (even different generations in the
>> same ARCH) may use different events. The output also vary on different
>> platforms.
>>
>> For a metricgroup, only outputting the value of each metric is good
>> enough.
>>
>> Current perf may append different metric groups to the same leader
>> event, or append the metrics from the same metricgroup to different
>> events. That could bring confusion when perf only prints the
>> metricgroup output mode. For example, print the same metricgroup name
>> several times.
>> Reorganize metricgroup for the default mode and make sure that
>> a metricgroup can only be appended to one event.
>> Sort the metricgroup for the default mode by the name of the
>> metricgroup.
>>
>> Add a new field default_metricgroup in evsel to indicate an event of
>> the default metricgroup. For those events, printout() should print
>> the metricgroup name rather than events.
>>
>> Add print_metricgroup_header() to print out the metricgroup name in
>> different output formats.
>>
>> On SPR
>> Before:
>>
>>  ./perf_old stat sleep 1
>>
>>  Performance counter stats for 'sleep 1':
>>
>>               0.54 msec task-clock:u                     #    0.001 CPUs utilized
>>                  0      context-switches:u               #    0.000 /sec
>>                  0      cpu-migrations:u                 #    0.000 /sec
>>                 68      page-faults:u                    #  125.445 K/sec
>>            540,970      cycles:u                         #    0.998 GHz
>>            556,325      instructions:u                   #    1.03  insn per cycle
>>            123,602      branches:u                       #  228.018 M/sec
>>              6,889      branch-misses:u                  #    5.57% of all branches
>>          3,245,820      TOPDOWN.SLOTS:u                  #     18.4 %  tma_backend_bound
>>                                                   #     17.2 %  tma_retiring
>>                                                   #     23.1 %  tma_bad_speculation
>>                                                   #     41.4 %  tma_frontend_bound
>>            564,859      topdown-retiring:u
>>          1,370,999      topdown-fe-bound:u
>>            603,271      topdown-be-bound:u
>>            744,874      topdown-bad-spec:u
>>             12,661      INT_MISC.UOP_DROPPING:u          #   23.357 M/sec
>>
>>        1.001798215 seconds time elapsed
>>
>>        0.000193000 seconds user
>>        0.001700000 seconds sys
>>
>> After:
>>
>> $ ./perf stat sleep 1
>>
>>  Performance counter stats for 'sleep 1':
>>
>>               0.51 msec task-clock:u                     #    0.001 CPUs utilized
>>                  0      context-switches:u               #    0.000 /sec
>>                  0      cpu-migrations:u                 #    0.000 /sec
>>                 68      page-faults:u                    #  132.683 K/sec
>>            545,228      cycles:u                         #    1.064 GHz
>>            555,509      instructions:u                   #    1.02  insn per cycle
>>            123,574      branches:u                       #  241.120 M/sec
>>              6,957      branch-misses:u                  #    5.63% of all branches
>>                         TopdownL1                 #     17.5 %  tma_backend_bound
>>                                                   #     22.6 %  tma_bad_speculation
>>                                                   #     42.7 %  tma_frontend_bound
>>                                                   #     17.1 %  tma_retiring
>>                         TopdownL2                 #     21.8 %  tma_branch_mispredicts
>>                                                   #     11.5 %  tma_core_bound
>>                                                   #     13.4 %  tma_fetch_bandwidth
>>                                                   #     29.3 %  tma_fetch_latency
>>                                                   #      2.7 %  tma_heavy_operations
>>                                                   #     14.5 %  tma_light_operations
>>                                                   #      0.8 %  tma_machine_clears
>>                                                   #      6.1 %  tma_memory_bound
>>
>>        1.001712086 seconds time elapsed
>>
>>        0.000151000 seconds user
>>        0.001618000 seconds sys
>>
>>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> ---
>>  tools/perf/builtin-stat.c      |   1 +
>>  tools/perf/util/evsel.h        |   1 +
>>  tools/perf/util/metricgroup.c  | 106 ++++++++++++++++++++++++++++++++-
>>  tools/perf/util/metricgroup.h  |   1 +
>>  tools/perf/util/stat-display.c |  69 ++++++++++++++++++++-
>>  5 files changed, 172 insertions(+), 6 deletions(-)
>>
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index 2269b3e90e9b..b274cc264d56 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -2172,6 +2172,7 @@ static int add_default_attributes(void)
>>
>>                         evlist__for_each_entry(metric_evlist, metric_evsel) {
>>                                 metric_evsel->skippable = true;
>> +                               metric_evsel->default_metricgroup = true;
>>                         }
>>                         evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
>>                         evlist__delete(metric_evlist);
>> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
>> index 36a32e4ca168..61b1385108f4 100644
>> --- a/tools/perf/util/evsel.h
>> +++ b/tools/perf/util/evsel.h
>> @@ -130,6 +130,7 @@ struct evsel {
>>         bool                    reset_group;
>>         bool                    errored;
>>         bool                    needs_auxtrace_mmap;
>> +       bool                    default_metricgroup;
> 
> A comment would be useful here, something like:
> 
> If running perf stat, is this evsel a member of a Default metric group metric.

Yes, it's the member of the 'default' metric group.
I will add a comment.

> 
>>         struct hashmap          *per_pkg_mask;
>>         int                     err;
>>         struct {
>> diff --git a/tools/perf/util/metricgroup.c b/tools/perf/util/metricgroup.c
>> index efafa02db5e5..22181ce4f27f 100644
>> --- a/tools/perf/util/metricgroup.c
>> +++ b/tools/perf/util/metricgroup.c
>> @@ -79,6 +79,7 @@ static struct rb_node *metric_event_new(struct rblist *rblist __maybe_unused,
>>                 return NULL;
>>         memcpy(me, entry, sizeof(struct metric_event));
>>         me->evsel = ((struct metric_event *)entry)->evsel;
>> +       me->default_metricgroup_name = NULL;
>>         INIT_LIST_HEAD(&me->head);
>>         return &me->nd;
>>  }
>> @@ -1133,14 +1134,19 @@ static int metricgroup__add_metric_sys_event_iter(const struct pmu_metric *pm,
>>  /**
>>   * metric_list_cmp - list_sort comparator that sorts metrics with more events to
>>   *                   the front. tool events are excluded from the count.
>> + *                   For the default metrics, sort them by metricgroup name.
>>   */
>> -static int metric_list_cmp(void *priv __maybe_unused, const struct list_head *l,
>> +static int metric_list_cmp(void *priv, const struct list_head *l,
>>                            const struct list_head *r)
>>  {
>>         const struct metric *left = container_of(l, struct metric, nd);
>>         const struct metric *right = container_of(r, struct metric, nd);
>>         struct expr_id_data *data;
>>         int i, left_count, right_count;
>> +       bool is_default = *(bool *)priv;
>> +
>> +       if (is_default && left->default_metricgroup_name && right->default_metricgroup_name)
>> +               return strcmp(left->default_metricgroup_name, right->default_metricgroup_name);
> 
> This breaks the comment above. The events are now sorted prioritizing
> default metric group names. This potentially will have an effect of
> reducing sharing of events between groups, it will also break
> assumptions within that code that there are always the same number of
> fewer events in a metric as you process the list. To remedy this I
> think you need to re-sort the metrics after the event sharing has had
> a chance to share events between groups.
> 
> 
>>
>>         left_count = hashmap__size(left->pctx->ids);
>>         perf_tool_event__for_each_event(i) {
>> @@ -1497,6 +1503,91 @@ static int parse_ids(bool metric_no_merge, struct perf_pmu *fake_pmu,
>>         return ret;
>>  }
>>
>> +static struct metric_event *
>> +metricgroup__lookup_default_metricgroup(struct rblist *metric_events,
>> +                                       struct evsel *evsel,
>> +                                       struct metric *m)
>> +{
>> +       struct metric_event *me;
>> +       char *name;
>> +       int err;
>> +
>> +       me = metricgroup__lookup(metric_events, evsel, true);
>> +       if (!me->default_metricgroup_name) {
>> +               if (m->pmu && strcmp(m->pmu, "cpu"))
>> +                       err = asprintf(&name, "%s (%s)", m->default_metricgroup_name, m->pmu);
>> +               else
>> +                       err = asprintf(&name, "%s", m->default_metricgroup_name);
>> +               if (err < 0)
>> +                       return NULL;
>> +               me->default_metricgroup_name = name;
>> +       }
>> +       if (!strncmp(m->default_metricgroup_name,
>> +                    me->default_metricgroup_name,
>> +                    strlen(m->default_metricgroup_name)))
>> +               return me;
>> +
>> +       return NULL;
>> +}
> 
> A function comment would be useful as the name is confusing, why
> lookup? Doesn't it create the value? Leak sanitizer isn't happy here:
> 
> ```
> ==1545918==ERROR: LeakSanitizer: detected memory leaks
> 
> Direct leak of 10 byte(s) in 1 object(s) allocated from:
>     #0 0x7f2755a7077b in __interceptor_strdup
> ../../../../src/libsanitizer/asan/asan_interceptors.cpp:439
>     #1 0x564986a8df31 in asprintf util/util.c:566
>     #2 0x5649869b5901 in metricgroup__lookup_default_metricgroup
> util/metricgroup.c:1520
>     #3 0x5649869b5e57 in metricgroup__lookup_create util/metricgroup.c:1579
>     #4 0x5649869b6ddc in parse_groups util/metricgroup.c:1698
>     #5 0x5649869b7714 in metricgroup__parse_groups util/metricgroup.c:1771
>     #6 0x5649867da9d5 in add_default_attributes tools/perf/builtin-stat.c:2164
>     #7 0x5649867ddbfb in cmd_stat tools/perf/builtin-stat.c:2707
>     #8 0x5649868fa5a2 in run_builtin tools/perf/perf.c:323
>     #9 0x5649868fab13 in handle_internal_command tools/perf/perf.c:377
>     #10 0x5649868faedb in run_argv tools/perf/perf.c:421
>     #11 0x5649868fb443 in main tools/perf/perf.c:537
>     #12 0x7f2754846189 in __libc_start_call_main
> ../sysdeps/nptl/libc_start_call_main.h:58
> ```
> 
>> +static struct metric_event *
>> +metricgroup__lookup_create(struct rblist *metric_events,
>> +                          struct evsel **evsel,
>> +                          struct list_head *metric_list,
>> +                          struct metric *m,
>> +                          bool is_default)
>> +{
>> +       struct metric_event *me;
>> +       struct metric *cur;
>> +       struct evsel *ev;
>> +       size_t i;
>> +
>> +       if (!is_default)
>> +               return metricgroup__lookup(metric_events, evsel[0], true);
>> +
>> +       /*
>> +        * If the metric group has been attached to a previous
>> +        * event/metric, use that metric event.
>> +        */
>> +       list_for_each_entry(cur, metric_list, nd) {
>> +               if (cur == m)
>> +                       break;
>> +               if (cur->pmu && strcmp(m->pmu, cur->pmu))
>> +                       continue;
>> +               if (strncmp(m->default_metricgroup_name,
>> +                           cur->default_metricgroup_name,
>> +                           strlen(m->default_metricgroup_name)))
>> +                       continue;
>> +               if (!cur->evlist)
>> +                       continue;
>> +               evlist__for_each_entry(cur->evlist, ev) {
>> +                       me = metricgroup__lookup(metric_events, ev, false);
>> +                       if (!strncmp(m->default_metricgroup_name,
>> +                                    me->default_metricgroup_name,
>> +                                    strlen(m->default_metricgroup_name)))
>> +                               return me;
>> +               }
>> +       }
>> +
>> +       /*
>> +        * Different metric groups may append to the same leader event.
>> +        * For example, TopdownL1 and TopdownL2 are appended to the
>> +        * TOPDOWN.SLOTS event.
>> +        * Split it and append the new metric group to the next available
>> +        * event.
>> +        */
>> +       me = metricgroup__lookup_default_metricgroup(metric_events, evsel[0], m);
>> +       if (me)
>> +               return me;
>> +
>> +       for (i = 1; i < hashmap__size(m->pctx->ids); i++) {
>> +               me = metricgroup__lookup_default_metricgroup(metric_events, evsel[i], m);
>> +               if (me)
>> +                       return me;
>> +       }
>> +       return NULL;
>> +}
>> +
> 
> I have a hard time understanding this function, does it just go away
> if you do the two sorts that I proposed above? Should this be
> metric_event__lookup_create? A function comment saying what the code
> is trying to achieve would be useful.
> 
> This appears to be trying to correct output issues by changing how
> metrics are associated with events, shouldn't output issues be
> resolved by fixing the output code? If not, why don't we apply this
> logic to TopdownL1, why just Default?

Yes, the above codes try to re-organize the metric and append the metric
from the same metricgroup into the same event. So it can be easily print
out later.

With the second sort, I think it should not be problem to address it in
the output code. Let me do some experiments.

> 
>>  static int parse_groups(struct evlist *perf_evlist,
>>                         const char *pmu, const char *str,
>>                         bool metric_no_group,
>> @@ -1512,6 +1603,7 @@ static int parse_groups(struct evlist *perf_evlist,
>>         LIST_HEAD(metric_list);
>>         struct metric *m;
>>         bool tool_events[PERF_TOOL_MAX] = {false};
>> +       bool is_default = !strcmp(str, "Default");
>>         int ret;
>>
>>         if (metric_events_list->nr_entries == 0)
>> @@ -1523,7 +1615,7 @@ static int parse_groups(struct evlist *perf_evlist,
>>                 goto out;
>>
>>         /* Sort metrics from largest to smallest. */
>> -       list_sort(NULL, &metric_list, metric_list_cmp);
>> +       list_sort((void *)&is_default, &metric_list, metric_list_cmp);
>>
>>         if (!metric_no_merge) {
>>                 struct expr_parse_ctx *combined = NULL;
>> @@ -1603,7 +1695,15 @@ static int parse_groups(struct evlist *perf_evlist,
>>                         goto out;
>>                 }
>>
>> -               me = metricgroup__lookup(metric_events_list, metric_events[0], true);
>> +               me = metricgroup__lookup_create(metric_events_list,
>> +                                               metric_events,
>> +                                               &metric_list, m,
>> +                                               is_default);
>> +               if (!me) {
>> +                       pr_err("Cannot create metric group for default!\n");
>> +                       ret = -EINVAL;
>> +                       goto out;
>> +               }
>>
>>                 expr = malloc(sizeof(struct metric_expr));
>>                 if (!expr) {
>> diff --git a/tools/perf/util/metricgroup.h b/tools/perf/util/metricgroup.h
>> index bf18274c15df..e3609b853213 100644
>> --- a/tools/perf/util/metricgroup.h
>> +++ b/tools/perf/util/metricgroup.h
>> @@ -22,6 +22,7 @@ struct cgroup;
>>  struct metric_event {
>>         struct rb_node nd;
>>         struct evsel *evsel;
>> +       char *default_metricgroup_name;
>>         struct list_head head; /* list of metric_expr */
>>  };
>>
>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
>> index a2bbdc25d979..efe5fd04c033 100644
>> --- a/tools/perf/util/stat-display.c
>> +++ b/tools/perf/util/stat-display.c
>> @@ -21,10 +21,12 @@
>>  #include "iostat.h"
>>  #include "pmu.h"
>>  #include "pmus.h"
>> +#include "metricgroup.h"
> 
> This is bringing metric code from stat-shadow, which is kind of the
> whole reason there is a separation and stat-shadow exists. Should the
> logic exist in stat-shadow instead?
> 
>>
>>  #define CNTR_NOT_SUPPORTED     "<not supported>"
>>  #define CNTR_NOT_COUNTED       "<not counted>"
>>
>> +#define MGROUP_LEN   50
>>  #define METRIC_LEN   38
>>  #define EVNAME_LEN   32
>>  #define COUNTS_LEN   18
>> @@ -707,6 +709,55 @@ static bool evlist__has_hybrid(struct evlist *evlist)
>>         return false;
>>  }
>>
>> +static void print_metricgroup_header_json(struct perf_stat_config *config,
>> +                                         struct outstate *os __maybe_unused,
>> +                                         const char *metricgroup_name)
>> +{
>> +       fprintf(config->output, "\"metricgroup\" : \"%s\"}", metricgroup_name);
>> +       new_line_json(config, (void *)os);
>> +}
>> +
> 
> Should the output part of this patch be separate from the
> evsel/evlist/meitric modifications?
> 

Sure, I will split the patch.

Thanks,
Kan

> Thanks,
> Ian
> 
>> +static void print_metricgroup_header_csv(struct perf_stat_config *config,
>> +                                        struct outstate *os,
>> +                                        const char *metricgroup_name)
>> +{
>> +       int i;
>> +
>> +       for (i = 0; i < os->nfields; i++)
>> +               fputs(config->csv_sep, os->fh);
>> +       fprintf(config->output, "%s", metricgroup_name);
>> +       new_line_csv(config, (void *)os);
>> +}
>> +
>> +static void print_metricgroup_header_std(struct perf_stat_config *config,
>> +                                        struct outstate *os __maybe_unused,
>> +                                        const char *metricgroup_name)
>> +{
>> +       int n = fprintf(config->output, " %*s", EVNAME_LEN, metricgroup_name);
>> +
>> +       fprintf(config->output, "%*s", MGROUP_LEN - n - 1, "");
>> +}
>> +
>> +static void print_metricgroup_header(struct perf_stat_config *config,
>> +                                    struct outstate *os,
>> +                                    struct evsel *counter,
>> +                                    double noise, u64 run, u64 ena,
>> +                                    const char *metricgroup_name)
>> +{
>> +       aggr_printout(config, os->evsel, os->id, os->aggr_nr);
>> +
>> +       print_noise(config, counter, noise, /*before_metric=*/true);
>> +       print_running(config, run, ena, /*before_metric=*/true);
>> +
>> +       if (config->json_output) {
>> +               print_metricgroup_header_json(config, os, metricgroup_name);
>> +       } else if (config->csv_output) {
>> +               print_metricgroup_header_csv(config, os, metricgroup_name);
>> +       } else
>> +               print_metricgroup_header_std(config, os, metricgroup_name);
>> +
>> +}
>> +
>>  static void printout(struct perf_stat_config *config, struct outstate *os,
>>                      double uval, u64 run, u64 ena, double noise, int aggr_idx)
>>  {
>> @@ -751,10 +802,17 @@ static void printout(struct perf_stat_config *config, struct outstate *os,
>>         out.force_header = false;
>>
>>         if (!config->metric_only) {
>> -               abs_printout(config, os->id, os->aggr_nr, counter, uval, ok);
>> +               if (counter->default_metricgroup) {
>> +                       struct metric_event *me;
>>
>> -               print_noise(config, counter, noise, /*before_metric=*/true);
>> -               print_running(config, run, ena, /*before_metric=*/true);
>> +                       me = metricgroup__lookup(&config->metric_events, counter, false);
>> +                       print_metricgroup_header(config, os, counter, noise, run, ena,
>> +                                                me->default_metricgroup_name);
>> +               } else {
>> +                       abs_printout(config, os->id, os->aggr_nr, counter, uval, ok);
>> +                       print_noise(config, counter, noise, /*before_metric=*/true);
>> +                       print_running(config, run, ena, /*before_metric=*/true);
>> +               }
>>         }
>>
>>         if (ok) {
>> @@ -883,6 +941,11 @@ static void print_counter_aggrdata(struct perf_stat_config *config,
>>         if (counter->merged_stat)
>>                 return;
>>
>> +       /* Only print the metric group for the default mode */
>> +       if (counter->default_metricgroup &&
>> +           !metricgroup__lookup(&config->metric_events, counter, false))
>> +               return;
>> +
>>         uniquify_counter(config, counter);
>>
>>         val = aggr->counts.val;
>> --
>> 2.35.1
>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/8] perf metric: JSON flag to default metric group
  2023-06-13 20:28       ` Ian Rogers
@ 2023-06-13 20:59         ` Liang, Kan
  2023-06-13 21:28           ` Ian Rogers
  0 siblings, 1 reply; 31+ messages in thread
From: Liang, Kan @ 2023-06-13 20:59 UTC (permalink / raw)
  To: Ian Rogers, ahmad.yasin
  Cc: acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian



On 2023-06-13 4:28 p.m., Ian Rogers wrote:
> On Tue, Jun 13, 2023 at 1:10 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>>
>> On 2023-06-13 3:44 p.m., Ian Rogers wrote:
>>> On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>>>>
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> For the default output, the default metric group could vary on different
>>>> platforms. For example, on SPR, the TopdownL1 and TopdownL2 metrics
>>>> should be displayed in the default mode. On ICL, only the TopdownL1
>>>> should be displayed.
>>>>
>>>> Add a flag so we can tag the default metric group for different
>>>> platforms rather than hack the perf code.
>>>>
>>>> The flag is added to Intel TopdownL1 since ICL and TopdownL2 metrics
>>>> since SPR.
>>>>
>>>> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
>>>> the real metric group name.
>>>>
>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>> ---
>>>>  .../arch/x86/alderlake/adl-metrics.json       | 20 ++++---
>>>>  .../arch/x86/icelake/icl-metrics.json         | 20 ++++---
>>>>  .../arch/x86/icelakex/icx-metrics.json        | 20 ++++---
>>>>  .../arch/x86/sapphirerapids/spr-metrics.json  | 60 +++++++++++--------
>>>>  .../arch/x86/tigerlake/tgl-metrics.json       | 20 ++++---
>>>>  5 files changed, 84 insertions(+), 56 deletions(-)
>>>>
>>>> diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>>>> index c9f7e3d4ab08..e78c85220e27 100644
>>>> --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>>>> +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>>>> @@ -832,22 +832,24 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_backend_bound",
>>>>          "MetricThreshold": "tma_backend_bound > 0.2",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>          "ScaleUnit": "100%",
>>>>          "Unit": "cpu_core"
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_bad_speculation",
>>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>          "ScaleUnit": "100%",
>>>>          "Unit": "cpu_core"
>>>> @@ -1112,11 +1114,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
>>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_frontend_bound",
>>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>          "ScaleUnit": "100%",
>>>>          "Unit": "cpu_core"
>>>> @@ -2316,11 +2319,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_retiring",
>>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>          "ScaleUnit": "100%",
>>>>          "Unit": "cpu_core"
>>>
>>> For Alderlake the Default metric group is added for all cpu_core
>>> metrics but not cpu_atom. This will lead to only getting metrics for
>>> performance cores while the workload could be running on atoms. This
>>> could lead to a false conclusion that the workload has no issues with
>>> the metrics. I think this behavior is surprising and should be called
>>> out as intentional in the commit message.
>>>
>>
>> The e-core doesn't have enough counters to calculate all the Topdown
>> events. It will trigger the multiplexing. We try to avoid it in the
>> default mode.
>> I will update the commit in V2.
> 
> Is multiplexing a worse crime than only giving output for half the
> cores? Both can be misleading. Perhaps the safest thing is to not use
> Default on hybrid platforms.
>

I think if we cannot give the accurate number, we shouldn't show it. I
don't think it's a problem just showing the Topdown on p-core. If the
user doesn't find their interested data in the default mode, they can
always use the --topdown for a specific core.

Thanks,
Kan

> Thanks,
> Ian
> 
>> Thanks,
>> Kan
>>
>>> Thanks,
>>> Ian
>>>
>>>> diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>>>> index 20210742171d..cc4edf855064 100644
>>>> --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>>>> +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>>>> @@ -111,21 +111,23 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_backend_bound",
>>>>          "MetricThreshold": "tma_backend_bound > 0.2",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_bad_speculation",
>>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> @@ -372,11 +374,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_frontend_bound",
>>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> @@ -1378,11 +1381,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_retiring",
>>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>>>> index ef25cda019be..6f25b5b7aaf6 100644
>>>> --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>>>> +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>>>> @@ -315,21 +315,23 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_backend_bound",
>>>>          "MetricThreshold": "tma_backend_bound > 0.2",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_bad_speculation",
>>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> @@ -576,11 +578,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_frontend_bound",
>>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> @@ -1674,11 +1677,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_retiring",
>>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>>>> index 4f3dd85540b6..c732982f70b5 100644
>>>> --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>>>> +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>>>> @@ -340,31 +340,34 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_backend_bound",
>>>>          "MetricThreshold": "tma_backend_bound > 0.2",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_bad_speculation",
>>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>          "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> -        "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
>>>> +        "MetricGroup": "BadSpec;BrMispredicts;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
>>>>          "MetricName": "tma_branch_mispredicts",
>>>>          "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>          "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction.  These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> @@ -407,11 +410,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>          "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
>>>> -        "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>> +        "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>>          "MetricName": "tma_core_bound",
>>>>          "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>          "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck.  Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> @@ -509,21 +513,23 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>          "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
>>>> -        "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
>>>> +        "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
>>>>          "MetricName": "tma_fetch_bandwidth",
>>>>          "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 6 > 0.35",
>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>          "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues.  For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>          "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>> -        "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
>>>> +        "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
>>>>          "MetricName": "tma_fetch_latency",
>>>>          "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>          "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> @@ -611,11 +617,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_frontend_bound",
>>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> @@ -630,11 +637,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>          "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> -        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>> +        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>>          "MetricName": "tma_heavy_operations",
>>>>          "MetricThreshold": "tma_heavy_operations > 0.1",
>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>          "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> @@ -1486,11 +1494,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>          "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
>>>> -        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>> +        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>>          "MetricName": "tma_light_operations",
>>>>          "MetricThreshold": "tma_light_operations > 0.6",
>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>          "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> @@ -1540,11 +1549,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>          "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
>>>> -        "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
>>>> +        "MetricGroup": "BadSpec;Default;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
>>>>          "MetricName": "tma_machine_clears",
>>>>          "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>          "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears.  These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> @@ -1576,11 +1586,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>          "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> -        "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>> +        "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>>          "MetricName": "tma_memory_bound",
>>>>          "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>          "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck.  Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> @@ -1784,11 +1795,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_retiring",
>>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>>>> index d0538a754288..83346911aa63 100644
>>>> --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>>>> +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>>>> @@ -105,21 +105,23 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_backend_bound",
>>>>          "MetricThreshold": "tma_backend_bound > 0.2",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_bad_speculation",
>>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> @@ -366,11 +368,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_frontend_bound",
>>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> @@ -1392,11 +1395,12 @@
>>>>      },
>>>>      {
>>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>          "MetricName": "tma_retiring",
>>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>          "ScaleUnit": "100%"
>>>>      },
>>>> --
>>>> 2.35.1
>>>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/8] perf evsel: Fix the annotation for hardware events on hybrid
  2023-06-13 20:06     ` Liang, Kan
@ 2023-06-13 21:18       ` Arnaldo Carvalho de Melo
  2023-06-13 23:57         ` Liang, Kan
  0 siblings, 1 reply; 31+ messages in thread
From: Arnaldo Carvalho de Melo @ 2023-06-13 21:18 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Ian Rogers, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin

Em Tue, Jun 13, 2023 at 04:06:59PM -0400, Liang, Kan escreveu:
> 
> 
> On 2023-06-13 3:35 p.m., Ian Rogers wrote:
> > On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
> >>
> >> From: Kan Liang <kan.liang@linux.intel.com>
> >>
> >> The annotation for hardware events is wrong on hybrid. For example,
> >>
> >>  # ./perf stat -a sleep 1
> >>
> >>  Performance counter stats for 'system wide':
> >>
> >>          32,148.85 msec cpu-clock                        #   32.000 CPUs utilized
> >>                374      context-switches                 #   11.633 /sec
> >>                 33      cpu-migrations                   #    1.026 /sec
> >>                295      page-faults                      #    9.176 /sec
> >>         18,979,960      cpu_core/cycles/                 #  590.378 K/sec
> >>        261,230,783      cpu_atom/cycles/                 #    8.126 M/sec                       (54.21%)
> >>         17,019,732      cpu_core/instructions/           #  529.404 K/sec
> >>         38,020,470      cpu_atom/instructions/           #    1.183 M/sec                       (63.36%)
> >>          3,296,743      cpu_core/branches/               #  102.546 K/sec
> >>          6,692,338      cpu_atom/branches/               #  208.167 K/sec                       (63.40%)
> >>             96,421      cpu_core/branch-misses/          #    2.999 K/sec
> >>          1,016,336      cpu_atom/branch-misses/          #   31.613 K/sec                       (63.38%)
> >>
> >> The hardware events have extended type on hybrid, but the evsel__match()
> >> doesn't take it into account.
> >>
> >> Add a mask to filter the extended type on hybrid when checking the config.
> >>
> >> With the patch,
> >>
> >>  # ./perf stat -a sleep 1
> >>
> >>  Performance counter stats for 'system wide':
> >>
> >>          32,139.90 msec cpu-clock                        #   32.003 CPUs utilized
> >>                343      context-switches                 #   10.672 /sec
> >>                 32      cpu-migrations                   #    0.996 /sec
> >>                 73      page-faults                      #    2.271 /sec
> >>         13,712,841      cpu_core/cycles/                 #    0.000 GHz
> >>        258,301,691      cpu_atom/cycles/                 #    0.008 GHz                         (54.20%)
> >>         12,428,163      cpu_core/instructions/           #    0.91  insn per cycle
> >>         37,786,557      cpu_atom/instructions/           #    2.76  insn per cycle              (63.35%)
> >>          2,418,826      cpu_core/branches/               #   75.259 K/sec
> >>          6,965,962      cpu_atom/branches/               #  216.739 K/sec                       (63.38%)
> >>             72,150      cpu_core/branch-misses/          #    2.98% of all branches
> >>          1,032,746      cpu_atom/branch-misses/          #   42.70% of all branches             (63.35%)
> >>
> >> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >> ---
> >>  tools/perf/util/evsel.h       | 12 ++++++-----
> >>  tools/perf/util/stat-shadow.c | 39 +++++++++++++++++++----------------
> >>  2 files changed, 28 insertions(+), 23 deletions(-)
> >>
> >> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> >> index b365b449c6ea..36a32e4ca168 100644
> >> --- a/tools/perf/util/evsel.h
> >> +++ b/tools/perf/util/evsel.h
> >> @@ -350,9 +350,11 @@ u64 format_field__intval(struct tep_format_field *field, struct perf_sample *sam
> >>
> >>  struct tep_format_field *evsel__field(struct evsel *evsel, const char *name);
> >>
> >> -#define evsel__match(evsel, t, c)              \
> >> +#define EVSEL_EVENT_MASK                       (~0ULL)
> >> +
> >> +#define evsel__match(evsel, t, c, m)                   \
> >>         (evsel->core.attr.type == PERF_TYPE_##t &&      \
> >> -        evsel->core.attr.config == PERF_COUNT_##c)
> >> +        (evsel->core.attr.config & m) == PERF_COUNT_##c)
> > 
> > The EVSEL_EVENT_MASK here isn't very intention revealing, perhaps we
> > can remove it and do something like:
> > 
> > static inline bool __evsel__match(const struct evsel *evsel, u32 type,
> > u64 config)
> > {
> >   if ((type == PERF_TYPE_HARDWARE || type ==PERF_TYPE_HW_CACHE)  &&
> > perf_pmus__supports_extended_type())
> >      return (evsel->core.attr.config & PERF_HW_EVENT_MASK) == config;
> > 
> >   return evsel->core.attr.config == config;
> > }
> > #define evsel__match(evsel, t, c) __evsel__match(evsel, PERF_TYPE_##t,
> > PERF_COUNT_##c)
> 
> Yes, the above code looks better. I will apply it in V2.

Please base v2 on tmp.perf-tools-next, tests are running and that branch
will become perf-tools-next.

Some patches from your series were cherry-picked there.

- Arnaldo

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/8] perf metric: JSON flag to default metric group
  2023-06-13 20:59         ` Liang, Kan
@ 2023-06-13 21:28           ` Ian Rogers
  2023-06-14  0:02             ` Liang, Kan
  0 siblings, 1 reply; 31+ messages in thread
From: Ian Rogers @ 2023-06-13 21:28 UTC (permalink / raw)
  To: Liang, Kan
  Cc: ahmad.yasin, acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian

On Tue, Jun 13, 2023 at 2:00 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-06-13 4:28 p.m., Ian Rogers wrote:
> > On Tue, Jun 13, 2023 at 1:10 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>
> >>
> >>
> >> On 2023-06-13 3:44 p.m., Ian Rogers wrote:
> >>> On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
> >>>>
> >>>> From: Kan Liang <kan.liang@linux.intel.com>
> >>>>
> >>>> For the default output, the default metric group could vary on different
> >>>> platforms. For example, on SPR, the TopdownL1 and TopdownL2 metrics
> >>>> should be displayed in the default mode. On ICL, only the TopdownL1
> >>>> should be displayed.
> >>>>
> >>>> Add a flag so we can tag the default metric group for different
> >>>> platforms rather than hack the perf code.
> >>>>
> >>>> The flag is added to Intel TopdownL1 since ICL and TopdownL2 metrics
> >>>> since SPR.
> >>>>
> >>>> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
> >>>> the real metric group name.
> >>>>
> >>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> >>>> ---
> >>>>  .../arch/x86/alderlake/adl-metrics.json       | 20 ++++---
> >>>>  .../arch/x86/icelake/icl-metrics.json         | 20 ++++---
> >>>>  .../arch/x86/icelakex/icx-metrics.json        | 20 ++++---
> >>>>  .../arch/x86/sapphirerapids/spr-metrics.json  | 60 +++++++++++--------
> >>>>  .../arch/x86/tigerlake/tgl-metrics.json       | 20 ++++---
> >>>>  5 files changed, 84 insertions(+), 56 deletions(-)
> >>>>
> >>>> diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> >>>> index c9f7e3d4ab08..e78c85220e27 100644
> >>>> --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> >>>> +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
> >>>> @@ -832,22 +832,24 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_backend_bound",
> >>>>          "MetricThreshold": "tma_backend_bound > 0.2",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>>>          "ScaleUnit": "100%",
> >>>>          "Unit": "cpu_core"
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_bad_speculation",
> >>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>>>          "ScaleUnit": "100%",
> >>>>          "Unit": "cpu_core"
> >>>> @@ -1112,11 +1114,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
> >>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_frontend_bound",
> >>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>>>          "ScaleUnit": "100%",
> >>>>          "Unit": "cpu_core"
> >>>> @@ -2316,11 +2319,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_retiring",
> >>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>>>          "ScaleUnit": "100%",
> >>>>          "Unit": "cpu_core"
> >>>
> >>> For Alderlake the Default metric group is added for all cpu_core
> >>> metrics but not cpu_atom. This will lead to only getting metrics for
> >>> performance cores while the workload could be running on atoms. This
> >>> could lead to a false conclusion that the workload has no issues with
> >>> the metrics. I think this behavior is surprising and should be called
> >>> out as intentional in the commit message.
> >>>
> >>
> >> The e-core doesn't have enough counters to calculate all the Topdown
> >> events. It will trigger the multiplexing. We try to avoid it in the
> >> default mode.
> >> I will update the commit in V2.
> >
> > Is multiplexing a worse crime than only giving output for half the
> > cores? Both can be misleading. Perhaps the safest thing is to not use
> > Default on hybrid platforms.
> >
>
> I think if we cannot give the accurate number, we shouldn't show it. I
> don't think it's a problem just showing the Topdown on p-core. If the
> user doesn't find their interested data in the default mode, they can
> always use the --topdown for a specific core.

So --topdown is just dressing to using "-M TopdownL ..." and using -M
is how to drill down by group. I'm not sure how useful the command
line flag is, especially for levels >2.

Playing devil's advocate somewhat on the hybrid metric, let's say I
configure a managed runtime like a JVM so that all garbage collector
threads run on atom cores the main workload runs on the p-cores. This
is at least done in research papers. Let's say the garbage collector
is backend memory bound. The result from the default metrics won't
show this just (from the cover letter):

```
 Performance counter stats for 'system wide':

         32,154.81 msec cpu-clock                        #   31.978
CPUs utilized
               165      context-switches                 #    5.131 /sec
                33      cpu-migrations                   #    1.026 /sec
                72      page-faults                      #    2.239 /sec
         5,653,347      cpu_core/cycles/                 #    0.000 GHz
         4,164,114      cpu_atom/cycles/                 #    0.000 GHz
         3,921,839      cpu_core/instructions/           #    0.69
insn per cycle
         2,142,800      cpu_atom/instructions/           #    0.38
insn per cycle
           713,629      cpu_core/branches/               #   22.194 K/sec
           452,838      cpu_atom/branches/               #   14.083 K/sec
            26,810      cpu_core/branch-misses/          #    3.76% of
all branches
            26,029      cpu_atom/branch-misses/          #    3.65% of
all branches
             TopdownL1 (cpu_core)                 #     32.0 %
tma_backend_bound
                                                  #      8.0 %
tma_bad_speculation
                                                  #     45.5 %
tma_frontend_bound
                                                  #     14.5 %  tma_retiring
```

As the garbage collector needs to run to free memory it can lead to
priority inversion where the garbage collector being slow is meaning
there isn't enough heap on the p-cores. Here the user has to interpret
the "(cpu_core)" to know that only half the metrics are shown and they
should run with "-M TopdownL1" to get cpu_core and cpu_atom. From this
they can see they have a memory bound issue on the atom cores. This
seems less safe than reporting nothing then the user specifying "-M
TopdownL1" to get the metrics on both cores.

For the multiplexing problem, is it solved by removing IPC from this output?

Thanks,
Ian

> Thanks,
> Kan
>
> > Thanks,
> > Ian
> >
> >> Thanks,
> >> Kan
> >>
> >>> Thanks,
> >>> Ian
> >>>
> >>>> diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> >>>> index 20210742171d..cc4edf855064 100644
> >>>> --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> >>>> +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
> >>>> @@ -111,21 +111,23 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_backend_bound",
> >>>>          "MetricThreshold": "tma_backend_bound > 0.2",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_bad_speculation",
> >>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> @@ -372,11 +374,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_frontend_bound",
> >>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> @@ -1378,11 +1381,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_retiring",
> >>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> >>>> index ef25cda019be..6f25b5b7aaf6 100644
> >>>> --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> >>>> +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
> >>>> @@ -315,21 +315,23 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_backend_bound",
> >>>>          "MetricThreshold": "tma_backend_bound > 0.2",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_bad_speculation",
> >>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> @@ -576,11 +578,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_frontend_bound",
> >>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> @@ -1674,11 +1677,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_retiring",
> >>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> >>>> index 4f3dd85540b6..c732982f70b5 100644
> >>>> --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> >>>> +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
> >>>> @@ -340,31 +340,34 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_backend_bound",
> >>>>          "MetricThreshold": "tma_backend_bound > 0.2",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_bad_speculation",
> >>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
> >>>> +        "DefaultMetricgroupName": "TopdownL2",
> >>>>          "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> -        "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
> >>>> +        "MetricGroup": "BadSpec;BrMispredicts;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
> >>>>          "MetricName": "tma_branch_mispredicts",
> >>>>          "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
> >>>> -        "MetricgroupNoGroup": "TopdownL2",
> >>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>>>          "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction.  These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> @@ -407,11 +410,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
> >>>> +        "DefaultMetricgroupName": "TopdownL2",
> >>>>          "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
> >>>> -        "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >>>> +        "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >>>>          "MetricName": "tma_core_bound",
> >>>>          "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
> >>>> -        "MetricgroupNoGroup": "TopdownL2",
> >>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>>>          "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck.  Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> @@ -509,21 +513,23 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
> >>>> +        "DefaultMetricgroupName": "TopdownL2",
> >>>>          "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
> >>>> -        "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
> >>>> +        "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
> >>>>          "MetricName": "tma_fetch_bandwidth",
> >>>>          "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 6 > 0.35",
> >>>> -        "MetricgroupNoGroup": "TopdownL2",
> >>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>>>          "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues.  For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
> >>>> +        "DefaultMetricgroupName": "TopdownL2",
> >>>>          "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >>>> -        "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
> >>>> +        "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
> >>>>          "MetricName": "tma_fetch_latency",
> >>>>          "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
> >>>> -        "MetricgroupNoGroup": "TopdownL2",
> >>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>>>          "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> @@ -611,11 +617,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_frontend_bound",
> >>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> @@ -630,11 +637,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
> >>>> +        "DefaultMetricgroupName": "TopdownL2",
> >>>>          "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> -        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >>>> +        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >>>>          "MetricName": "tma_heavy_operations",
> >>>>          "MetricThreshold": "tma_heavy_operations > 0.1",
> >>>> -        "MetricgroupNoGroup": "TopdownL2",
> >>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>>>          "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> @@ -1486,11 +1494,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
> >>>> +        "DefaultMetricgroupName": "TopdownL2",
> >>>>          "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
> >>>> -        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >>>> +        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
> >>>>          "MetricName": "tma_light_operations",
> >>>>          "MetricThreshold": "tma_light_operations > 0.6",
> >>>> -        "MetricgroupNoGroup": "TopdownL2",
> >>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>>>          "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> @@ -1540,11 +1549,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
> >>>> +        "DefaultMetricgroupName": "TopdownL2",
> >>>>          "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
> >>>> -        "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
> >>>> +        "MetricGroup": "BadSpec;Default;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
> >>>>          "MetricName": "tma_machine_clears",
> >>>>          "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
> >>>> -        "MetricgroupNoGroup": "TopdownL2",
> >>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>>>          "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears.  These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> @@ -1576,11 +1586,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
> >>>> +        "DefaultMetricgroupName": "TopdownL2",
> >>>>          "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> -        "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >>>> +        "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
> >>>>          "MetricName": "tma_memory_bound",
> >>>>          "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
> >>>> -        "MetricgroupNoGroup": "TopdownL2",
> >>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
> >>>>          "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck.  Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> @@ -1784,11 +1795,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_retiring",
> >>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> >>>> index d0538a754288..83346911aa63 100644
> >>>> --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> >>>> +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
> >>>> @@ -105,21 +105,23 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_backend_bound",
> >>>>          "MetricThreshold": "tma_backend_bound > 0.2",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_bad_speculation",
> >>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> @@ -366,11 +368,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
> >>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_frontend_bound",
> >>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> @@ -1392,11 +1395,12 @@
> >>>>      },
> >>>>      {
> >>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
> >>>> +        "DefaultMetricgroupName": "TopdownL1",
> >>>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
> >>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
> >>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
> >>>>          "MetricName": "tma_retiring",
> >>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
> >>>> -        "MetricgroupNoGroup": "TopdownL1",
> >>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
> >>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
> >>>>          "ScaleUnit": "100%"
> >>>>      },
> >>>> --
> >>>> 2.35.1
> >>>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/8] perf evsel: Fix the annotation for hardware events on hybrid
  2023-06-13 21:18       ` Arnaldo Carvalho de Melo
@ 2023-06-13 23:57         ` Liang, Kan
  0 siblings, 0 replies; 31+ messages in thread
From: Liang, Kan @ 2023-06-13 23:57 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Ian Rogers, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian, ahmad.yasin



On 2023-06-13 5:18 p.m., Arnaldo Carvalho de Melo wrote:
> Em Tue, Jun 13, 2023 at 04:06:59PM -0400, Liang, Kan escreveu:
>>
>>
>> On 2023-06-13 3:35 p.m., Ian Rogers wrote:
>>> On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>>>>
>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>
>>>> The annotation for hardware events is wrong on hybrid. For example,
>>>>
>>>>  # ./perf stat -a sleep 1
>>>>
>>>>  Performance counter stats for 'system wide':
>>>>
>>>>          32,148.85 msec cpu-clock                        #   32.000 CPUs utilized
>>>>                374      context-switches                 #   11.633 /sec
>>>>                 33      cpu-migrations                   #    1.026 /sec
>>>>                295      page-faults                      #    9.176 /sec
>>>>         18,979,960      cpu_core/cycles/                 #  590.378 K/sec
>>>>        261,230,783      cpu_atom/cycles/                 #    8.126 M/sec                       (54.21%)
>>>>         17,019,732      cpu_core/instructions/           #  529.404 K/sec
>>>>         38,020,470      cpu_atom/instructions/           #    1.183 M/sec                       (63.36%)
>>>>          3,296,743      cpu_core/branches/               #  102.546 K/sec
>>>>          6,692,338      cpu_atom/branches/               #  208.167 K/sec                       (63.40%)
>>>>             96,421      cpu_core/branch-misses/          #    2.999 K/sec
>>>>          1,016,336      cpu_atom/branch-misses/          #   31.613 K/sec                       (63.38%)
>>>>
>>>> The hardware events have extended type on hybrid, but the evsel__match()
>>>> doesn't take it into account.
>>>>
>>>> Add a mask to filter the extended type on hybrid when checking the config.
>>>>
>>>> With the patch,
>>>>
>>>>  # ./perf stat -a sleep 1
>>>>
>>>>  Performance counter stats for 'system wide':
>>>>
>>>>          32,139.90 msec cpu-clock                        #   32.003 CPUs utilized
>>>>                343      context-switches                 #   10.672 /sec
>>>>                 32      cpu-migrations                   #    0.996 /sec
>>>>                 73      page-faults                      #    2.271 /sec
>>>>         13,712,841      cpu_core/cycles/                 #    0.000 GHz
>>>>        258,301,691      cpu_atom/cycles/                 #    0.008 GHz                         (54.20%)
>>>>         12,428,163      cpu_core/instructions/           #    0.91  insn per cycle
>>>>         37,786,557      cpu_atom/instructions/           #    2.76  insn per cycle              (63.35%)
>>>>          2,418,826      cpu_core/branches/               #   75.259 K/sec
>>>>          6,965,962      cpu_atom/branches/               #  216.739 K/sec                       (63.38%)
>>>>             72,150      cpu_core/branch-misses/          #    2.98% of all branches
>>>>          1,032,746      cpu_atom/branch-misses/          #   42.70% of all branches             (63.35%)
>>>>
>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>> ---
>>>>  tools/perf/util/evsel.h       | 12 ++++++-----
>>>>  tools/perf/util/stat-shadow.c | 39 +++++++++++++++++++----------------
>>>>  2 files changed, 28 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
>>>> index b365b449c6ea..36a32e4ca168 100644
>>>> --- a/tools/perf/util/evsel.h
>>>> +++ b/tools/perf/util/evsel.h
>>>> @@ -350,9 +350,11 @@ u64 format_field__intval(struct tep_format_field *field, struct perf_sample *sam
>>>>
>>>>  struct tep_format_field *evsel__field(struct evsel *evsel, const char *name);
>>>>
>>>> -#define evsel__match(evsel, t, c)              \
>>>> +#define EVSEL_EVENT_MASK                       (~0ULL)
>>>> +
>>>> +#define evsel__match(evsel, t, c, m)                   \
>>>>         (evsel->core.attr.type == PERF_TYPE_##t &&      \
>>>> -        evsel->core.attr.config == PERF_COUNT_##c)
>>>> +        (evsel->core.attr.config & m) == PERF_COUNT_##c)
>>>
>>> The EVSEL_EVENT_MASK here isn't very intention revealing, perhaps we
>>> can remove it and do something like:
>>>
>>> static inline bool __evsel__match(const struct evsel *evsel, u32 type,
>>> u64 config)
>>> {
>>>   if ((type == PERF_TYPE_HARDWARE || type ==PERF_TYPE_HW_CACHE)  &&
>>> perf_pmus__supports_extended_type())
>>>      return (evsel->core.attr.config & PERF_HW_EVENT_MASK) == config;
>>>
>>>   return evsel->core.attr.config == config;
>>> }
>>> #define evsel__match(evsel, t, c) __evsel__match(evsel, PERF_TYPE_##t,
>>> PERF_COUNT_##c)
>>
>> Yes, the above code looks better. I will apply it in V2.
> 
> Please base v2 on tmp.perf-tools-next, tests are running and that branch
> will become perf-tools-next.
> 

Sure.

> Some patches from your series were cherry-picked there.

Thanks.

Kan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/8] perf metric: JSON flag to default metric group
  2023-06-13 21:28           ` Ian Rogers
@ 2023-06-14  0:02             ` Liang, Kan
  0 siblings, 0 replies; 31+ messages in thread
From: Liang, Kan @ 2023-06-14  0:02 UTC (permalink / raw)
  To: Ian Rogers
  Cc: ahmad.yasin, acme, mingo, peterz, namhyung, jolsa, adrian.hunter,
	linux-perf-users, linux-kernel, ak, eranian



On 2023-06-13 5:28 p.m., Ian Rogers wrote:
> On Tue, Jun 13, 2023 at 2:00 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>>
>> On 2023-06-13 4:28 p.m., Ian Rogers wrote:
>>> On Tue, Jun 13, 2023 at 1:10 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2023-06-13 3:44 p.m., Ian Rogers wrote:
>>>>> On Wed, Jun 7, 2023 at 9:27 AM <kan.liang@linux.intel.com> wrote:
>>>>>>
>>>>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>>>>
>>>>>> For the default output, the default metric group could vary on different
>>>>>> platforms. For example, on SPR, the TopdownL1 and TopdownL2 metrics
>>>>>> should be displayed in the default mode. On ICL, only the TopdownL1
>>>>>> should be displayed.
>>>>>>
>>>>>> Add a flag so we can tag the default metric group for different
>>>>>> platforms rather than hack the perf code.
>>>>>>
>>>>>> The flag is added to Intel TopdownL1 since ICL and TopdownL2 metrics
>>>>>> since SPR.
>>>>>>
>>>>>> Add a new field, DefaultMetricgroupName, in the JSON file to indicate
>>>>>> the real metric group name.
>>>>>>
>>>>>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>>>>>> ---
>>>>>>  .../arch/x86/alderlake/adl-metrics.json       | 20 ++++---
>>>>>>  .../arch/x86/icelake/icl-metrics.json         | 20 ++++---
>>>>>>  .../arch/x86/icelakex/icx-metrics.json        | 20 ++++---
>>>>>>  .../arch/x86/sapphirerapids/spr-metrics.json  | 60 +++++++++++--------
>>>>>>  .../arch/x86/tigerlake/tgl-metrics.json       | 20 ++++---
>>>>>>  5 files changed, 84 insertions(+), 56 deletions(-)
>>>>>>
>>>>>> diff --git a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>>>>>> index c9f7e3d4ab08..e78c85220e27 100644
>>>>>> --- a/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>>>>>> +++ b/tools/perf/pmu-events/arch/x86/alderlake/adl-metrics.json
>>>>>> @@ -832,22 +832,24 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "cpu_core@topdown\\-be\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_backend_bound",
>>>>>>          "MetricThreshold": "tma_backend_bound > 0.2",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>>>          "ScaleUnit": "100%",
>>>>>>          "Unit": "cpu_core"
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_bad_speculation",
>>>>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>>>          "ScaleUnit": "100%",
>>>>>>          "Unit": "cpu_core"
>>>>>> @@ -1112,11 +1114,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "cpu_core@topdown\\-fe\\-bound@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) - cpu_core@INT_MISC.UOP_DROPPING@ / tma_info_thread_slots",
>>>>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_frontend_bound",
>>>>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>>>          "ScaleUnit": "100%",
>>>>>>          "Unit": "cpu_core"
>>>>>> @@ -2316,11 +2319,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "cpu_core@topdown\\-retiring@ / (cpu_core@topdown\\-fe\\-bound@ + cpu_core@topdown\\-bad\\-spec@ + cpu_core@topdown\\-retiring@ + cpu_core@topdown\\-be\\-bound@) + 0 * tma_info_thread_slots",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_retiring",
>>>>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>>>          "ScaleUnit": "100%",
>>>>>>          "Unit": "cpu_core"
>>>>>
>>>>> For Alderlake the Default metric group is added for all cpu_core
>>>>> metrics but not cpu_atom. This will lead to only getting metrics for
>>>>> performance cores while the workload could be running on atoms. This
>>>>> could lead to a false conclusion that the workload has no issues with
>>>>> the metrics. I think this behavior is surprising and should be called
>>>>> out as intentional in the commit message.
>>>>>
>>>>
>>>> The e-core doesn't have enough counters to calculate all the Topdown
>>>> events. It will trigger the multiplexing. We try to avoid it in the
>>>> default mode.
>>>> I will update the commit in V2.
>>>
>>> Is multiplexing a worse crime than only giving output for half the
>>> cores? Both can be misleading. Perhaps the safest thing is to not use
>>> Default on hybrid platforms.
>>>
>>
>> I think if we cannot give the accurate number, we shouldn't show it. I
>> don't think it's a problem just showing the Topdown on p-core. If the
>> user doesn't find their interested data in the default mode, they can
>> always use the --topdown for a specific core.
> 
> So --topdown is just dressing to using "-M TopdownL ..." and using -M
> is how to drill down by group. I'm not sure how useful the command
> line flag is, especially for levels >2.
> 
> Playing devil's advocate somewhat on the hybrid metric, let's say I
> configure a managed runtime like a JVM so that all garbage collector
> threads run on atom cores the main workload runs on the p-cores. This
> is at least done in research papers. Let's say the garbage collector
> is backend memory bound. The result from the default metrics won't
> show this just (from the cover letter):
> 
> ```
>  Performance counter stats for 'system wide':
> 
>          32,154.81 msec cpu-clock                        #   31.978
> CPUs utilized
>                165      context-switches                 #    5.131 /sec
>                 33      cpu-migrations                   #    1.026 /sec
>                 72      page-faults                      #    2.239 /sec
>          5,653,347      cpu_core/cycles/                 #    0.000 GHz
>          4,164,114      cpu_atom/cycles/                 #    0.000 GHz
>          3,921,839      cpu_core/instructions/           #    0.69
> insn per cycle
>          2,142,800      cpu_atom/instructions/           #    0.38
> insn per cycle
>            713,629      cpu_core/branches/               #   22.194 K/sec
>            452,838      cpu_atom/branches/               #   14.083 K/sec
>             26,810      cpu_core/branch-misses/          #    3.76% of
> all branches
>             26,029      cpu_atom/branch-misses/          #    3.65% of
> all branches
>              TopdownL1 (cpu_core)                 #     32.0 %
> tma_backend_bound
>                                                   #      8.0 %
> tma_bad_speculation
>                                                   #     45.5 %
> tma_frontend_bound
>                                                   #     14.5 %  tma_retiring
> ```
> 
> As the garbage collector needs to run to free memory it can lead to
> priority inversion where the garbage collector being slow is meaning
> there isn't enough heap on the p-cores. Here the user has to interpret
> the "(cpu_core)" to know that only half the metrics are shown and they
> should run with "-M TopdownL1" to get cpu_core and cpu_atom. From this
> they can see they have a memory bound issue on the atom cores. This
> seems less safe than reporting nothing then the user specifying "-M
> TopdownL1" to get the metrics on both cores.

OK. I will think about it. But no matter which way we choose, I think we
have to update the script anyway.

> 
> For the multiplexing problem, is it solved by removing IPC from this output?

No, IPC should only uses the fixed counters. The branch events share the
GP counters with the Topdown events.

Thanks,
Kan

> 
> Thanks,
> Ian
> 
>> Thanks,
>> Kan
>>
>>> Thanks,
>>> Ian
>>>
>>>> Thanks,
>>>> Kan
>>>>
>>>>> Thanks,
>>>>> Ian
>>>>>
>>>>>> diff --git a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>>>>>> index 20210742171d..cc4edf855064 100644
>>>>>> --- a/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>>>>>> +++ b/tools/perf/pmu-events/arch/x86/icelake/icl-metrics.json
>>>>>> @@ -111,21 +111,23 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_backend_bound",
>>>>>>          "MetricThreshold": "tma_backend_bound > 0.2",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_bad_speculation",
>>>>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> @@ -372,11 +374,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_frontend_bound",
>>>>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> @@ -1378,11 +1381,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_retiring",
>>>>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> diff --git a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>>>>>> index ef25cda019be..6f25b5b7aaf6 100644
>>>>>> --- a/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>>>>>> +++ b/tools/perf/pmu-events/arch/x86/icelakex/icx-metrics.json
>>>>>> @@ -315,21 +315,23 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_backend_bound",
>>>>>>          "MetricThreshold": "tma_backend_bound > 0.2",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_bad_speculation",
>>>>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> @@ -576,11 +578,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_frontend_bound",
>>>>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> @@ -1674,11 +1677,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_retiring",
>>>>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> diff --git a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>>>>>> index 4f3dd85540b6..c732982f70b5 100644
>>>>>> --- a/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>>>>>> +++ b/tools/perf/pmu-events/arch/x86/sapphirerapids/spr-metrics.json
>>>>>> @@ -340,31 +340,34 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_backend_bound",
>>>>>>          "MetricThreshold": "tma_backend_bound > 0.2",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_bad_speculation",
>>>>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction",
>>>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>>>          "MetricExpr": "topdown\\-br\\-mispredict / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> -        "MetricGroup": "BadSpec;BrMispredicts;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
>>>>>> +        "MetricGroup": "BadSpec;BrMispredicts;Default;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueBM",
>>>>>>          "MetricName": "tma_branch_mispredicts",
>>>>>>          "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_speculation > 0.15",
>>>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>>          "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Branch Misprediction.  These slots are either wasted by uops fetched from an incorrectly speculated program path; or stalls when the out-of-order part of the machine needs to recover its state from a speculative path. Sample with: TOPDOWN.BR_MISPREDICT_SLOTS. Related metrics: tma_info_bad_spec_branch_misprediction_cost, tma_info_bottleneck_mispredictions, tma_mispredicts_resteers",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> @@ -407,11 +410,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck",
>>>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>>>          "MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
>>>>>> -        "MetricGroup": "Backend;Compute;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>>>> +        "MetricGroup": "Backend;Compute;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>>>>          "MetricName": "tma_core_bound",
>>>>>>          "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
>>>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>>          "PublicDescription": "This metric represents fraction of slots where Core non-memory issues were of a bottleneck.  Shortage in hardware compute resources; or dependencies in software's instructions are both categorized under Core Bound. Hence it may indicate the machine ran out of an out-of-order resource; certain execution units are overloaded or dependencies in program's data- or instruction-flow are limiting the performance (e.g. FP-chained long-latency arithmetic operations).",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> @@ -509,21 +513,23 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues",
>>>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>>>          "MetricExpr": "max(0, tma_frontend_bound - tma_fetch_latency)",
>>>>>> -        "MetricGroup": "FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
>>>>>> +        "MetricGroup": "Default;FetchBW;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group;tma_issueFB",
>>>>>>          "MetricName": "tma_fetch_bandwidth",
>>>>>>          "MetricThreshold": "tma_fetch_bandwidth > 0.1 & tma_frontend_bound > 0.15 & tma_info_thread_ipc / 6 > 0.35",
>>>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>>          "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend bandwidth issues.  For example; inefficiencies at the instruction decoders; or restrictions for caching in the DSB (decoded uops cache) are categorized under Fetch Bandwidth. In such cases; the Frontend typically delivers suboptimal amount of uops to the Backend. Sample with: FRONTEND_RETIRED.LATENCY_GE_2_BUBBLES_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_1_PS;FRONTEND_RETIRED.LATENCY_GE_2_PS. Related metrics: tma_dsb_switches, tma_info_botlnk_l2_dsb_misses, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues",
>>>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>>>          "MetricExpr": "topdown\\-fetch\\-lat / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>>>> -        "MetricGroup": "Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
>>>>>> +        "MetricGroup": "Default;Frontend;TmaL2;TopdownL2;tma_L2_group;tma_frontend_bound_group",
>>>>>>          "MetricName": "tma_fetch_latency",
>>>>>>          "MetricThreshold": "tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15",
>>>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>>          "PublicDescription": "This metric represents fraction of slots the CPU was stalled due to Frontend latency issues.  For example; instruction-cache misses; iTLB misses or fetch stalls after a branch misprediction are categorized under Frontend Latency. In such cases; the Frontend eventually delivers no uops for some period. Sample with: FRONTEND_RETIRED.LATENCY_GE_16_PS;FRONTEND_RETIRED.LATENCY_GE_8_PS",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> @@ -611,11 +617,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_frontend_bound",
>>>>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> @@ -630,11 +637,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences",
>>>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>>>          "MetricExpr": "topdown\\-heavy\\-ops / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> -        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>>>> +        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>>>>          "MetricName": "tma_heavy_operations",
>>>>>>          "MetricThreshold": "tma_heavy_operations > 0.1",
>>>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>>          "PublicDescription": "This metric represents fraction of slots where the CPU was retiring heavy-weight operations -- instructions that require two or more uops or micro-coded sequences. This highly-correlates with the uop length of these instructions/sequences. Sample with: UOPS_RETIRED.HEAVY",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> @@ -1486,11 +1494,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation)",
>>>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>>>          "MetricExpr": "max(0, tma_retiring - tma_heavy_operations)",
>>>>>> -        "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>>>> +        "MetricGroup": "Default;Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_group",
>>>>>>          "MetricName": "tma_light_operations",
>>>>>>          "MetricThreshold": "tma_light_operations > 0.6",
>>>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>>          "PublicDescription": "This metric represents fraction of slots where the CPU was retiring light-weight operations -- instructions that require no more than one uop (micro-operation). This correlates with total number of instructions used by the program. A uops-per-instruction (see UopPI metric) ratio of 1 or less should be expected for decently optimized software running on Intel Core/Xeon products. While this often indicates efficient X86 instructions were executed; high value does not necessarily mean better performance cannot be achieved. Sample with: INST_RETIRED.PREC_DIST",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> @@ -1540,11 +1549,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears",
>>>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>>>          "MetricExpr": "max(0, tma_bad_speculation - tma_branch_mispredicts)",
>>>>>> -        "MetricGroup": "BadSpec;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
>>>>>> +        "MetricGroup": "BadSpec;Default;MachineClears;TmaL2;TopdownL2;tma_L2_group;tma_bad_speculation_group;tma_issueMC;tma_issueSyncxn",
>>>>>>          "MetricName": "tma_machine_clears",
>>>>>>          "MetricThreshold": "tma_machine_clears > 0.1 & tma_bad_speculation > 0.15",
>>>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>>          "PublicDescription": "This metric represents fraction of slots the CPU has wasted due to Machine Clears.  These slots are either wasted by uops fetched prior to the clear; or stalls the out-of-order portion of the machine needs to recover its state after the clear. For example; this can happen due to memory ordering Nukes (e.g. Memory Disambiguation) or Self-Modifying-Code (SMC) nukes. Sample with: MACHINE_CLEARS.COUNT. Related metrics: tma_clears_resteers, tma_contested_accesses, tma_data_sharing, tma_false_sharing, tma_l1_bound, tma_microcode_sequencer, tma_ms_switches, tma_remote_cache",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> @@ -1576,11 +1586,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck",
>>>>>> +        "DefaultMetricgroupName": "TopdownL2",
>>>>>>          "MetricExpr": "topdown\\-mem\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> -        "MetricGroup": "Backend;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>>>> +        "MetricGroup": "Backend;Default;TmaL2;TopdownL2;tma_L2_group;tma_backend_bound_group",
>>>>>>          "MetricName": "tma_memory_bound",
>>>>>>          "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0.2",
>>>>>> -        "MetricgroupNoGroup": "TopdownL2",
>>>>>> +        "MetricgroupNoGroup": "TopdownL2;Default",
>>>>>>          "PublicDescription": "This metric represents fraction of slots the Memory subsystem within the Backend was a bottleneck.  Memory Bound estimates fraction of slots where pipeline is likely stalled due to demand load or store instructions. This accounts mainly for (1) non-completed in-flight memory demand loads which coincides with execution units starvation; in addition to (2) cases where stores could impose backpressure on the pipeline when many of them get buffered at the same time (less common out of the two).",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> @@ -1784,11 +1795,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_retiring",
>>>>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> diff --git a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>>>>>> index d0538a754288..83346911aa63 100644
>>>>>> --- a/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>>>>>> +++ b/tools/perf/pmu-events/arch/x86/tigerlake/tgl-metrics.json
>>>>>> @@ -105,21 +105,23 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "topdown\\-be\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 5 * cpu@INT_MISC.RECOVERY_CYCLES\\,cmask\\=1\\,edge@ / tma_info_thread_slots",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_backend_bound",
>>>>>>          "MetricThreshold": "tma_backend_bound > 0.2",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots where no uops are being delivered due to a lack of required resources for accepting new uops in the Backend. Backend is the portion of the processor core where the out-of-order scheduler dispatches ready uops into their respective execution units; and once completed these uops get retired according to program order. For example; stalls due to data-cache misses or stalls due to the divider unit being overloaded are both categorized under Backend Bound. Backend Bound is further divided into two main categories: Memory Bound and Core Bound. Sample with: TOPDOWN.BACKEND_BOUND_SLOTS",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots wasted due to incorrect speculations",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "max(1 - (tma_frontend_bound + tma_backend_bound + tma_retiring), 0)",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_bad_speculation",
>>>>>>          "MetricThreshold": "tma_bad_speculation > 0.15",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots wasted due to incorrect speculations. This include slots used to issue uops that do not eventually get retired and slots for which the issue-pipeline was blocked due to recovery from earlier incorrect speculation. For example; wasted work due to miss-predicted branches are categorized under Bad Speculation category. Incorrect data speculation followed by Memory Ordering Nukes is another example.",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> @@ -366,11 +368,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "topdown\\-fe\\-bound / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) - INT_MISC.UOP_DROPPING / tma_info_thread_slots",
>>>>>> -        "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;PGO;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_frontend_bound",
>>>>>>          "MetricThreshold": "tma_frontend_bound > 0.15",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots where the processor's Frontend undersupplies its Backend. Frontend denotes the first part of the processor core responsible to fetch operations that are executed later on by the Backend part. Within the Frontend; a branch predictor predicts the next address to fetch; cache-lines are fetched from the memory subsystem; parsed into instructions; and lastly decoded into micro-operations (uops). Ideally the Frontend can issue Pipeline_Width uops every cycle to the Backend. Frontend Bound denotes unutilized issue-slots when there is no Backend stall; i.e. bubbles where Frontend delivered no uops while Backend could have accepted them. For example; stalls due to instruction-cache misses would be categorized under Frontend Bound. Sample with: FRONTEND_RETIRED.LATENCY_GE_4_PS",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> @@ -1392,11 +1395,12 @@
>>>>>>      },
>>>>>>      {
>>>>>>          "BriefDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired",
>>>>>> +        "DefaultMetricgroupName": "TopdownL1",
>>>>>>          "MetricExpr": "topdown\\-retiring / (topdown\\-fe\\-bound + topdown\\-bad\\-spec + topdown\\-retiring + topdown\\-be\\-bound) + 0 * tma_info_thread_slots",
>>>>>> -        "MetricGroup": "TmaL1;TopdownL1;tma_L1_group",
>>>>>> +        "MetricGroup": "Default;TmaL1;TopdownL1;tma_L1_group",
>>>>>>          "MetricName": "tma_retiring",
>>>>>>          "MetricThreshold": "tma_retiring > 0.7 | tma_heavy_operations > 0.1",
>>>>>> -        "MetricgroupNoGroup": "TopdownL1",
>>>>>> +        "MetricgroupNoGroup": "TopdownL1;Default",
>>>>>>          "PublicDescription": "This category represents fraction of slots utilized by useful work i.e. issued uops that eventually get retired. Ideally; all pipeline slots would be attributed to the Retiring category.  Retiring of 100% would indicate the maximum Pipeline_Width throughput was achieved.  Maximizing Retiring typically increases the Instructions-per-cycle (see IPC metric). Note that a high Retiring value does not necessary mean there is no room for more performance.  For example; Heavy-operations or Microcode Assists are categorized under Retiring. They often indicate suboptimal performance and can often be optimized or avoided. Sample with: UOPS_RETIRED.SLOTS",
>>>>>>          "ScaleUnit": "100%"
>>>>>>      },
>>>>>> --
>>>>>> 2.35.1
>>>>>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/8] perf vendor events arm64: Add default tags into topdown L1 metrics
  2023-06-07 16:26 ` [PATCH 4/8] perf vendor events arm64: Add default tags into topdown L1 metrics kan.liang
  2023-06-13 19:45   ` Ian Rogers
@ 2023-06-14 14:30   ` John Garry
  2023-06-16  3:17     ` Liang, Kan
  1 sibling, 1 reply; 31+ messages in thread
From: John Garry @ 2023-06-14 14:30 UTC (permalink / raw)
  To: kan.liang, acme, mingo, peterz, irogers, namhyung, jolsa,
	adrian.hunter, linux-perf-users, linux-kernel
  Cc: ak, eranian, ahmad.yasin, Jing Zhang

On 07/06/2023 17:26, kan.liang@linux.intel.com wrote:
> From: Kan Liang<kan.liang@linux.intel.com>
> 
> Add the default tags for ARM as well.
> 
> Signed-off-by: Kan Liang<kan.liang@linux.intel.com>
> Cc: Jing Zhang<renyu.zj@linux.alibaba.com>
> Cc: John Garry<john.g.garry@oracle.com>

Reviewed-by: John Garry <john.g.garry@oracle.com>

But does pmu-events/arch/arm64/hisilicon/hip08/metrics.json need to be 
fixed up as well?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/8] perf vendor events arm64: Add default tags into topdown L1 metrics
  2023-06-14 14:30   ` John Garry
@ 2023-06-16  3:17     ` Liang, Kan
  0 siblings, 0 replies; 31+ messages in thread
From: Liang, Kan @ 2023-06-16  3:17 UTC (permalink / raw)
  To: John Garry, acme, mingo, peterz, irogers, namhyung, jolsa,
	adrian.hunter, linux-perf-users, linux-kernel
  Cc: ak, eranian, ahmad.yasin, Jing Zhang



On 2023-06-14 10:30 a.m., John Garry wrote:
> On 07/06/2023 17:26, kan.liang@linux.intel.com wrote:
>> From: Kan Liang<kan.liang@linux.intel.com>
>>
>> Add the default tags for ARM as well.
>>
>> Signed-off-by: Kan Liang<kan.liang@linux.intel.com>
>> Cc: Jing Zhang<renyu.zj@linux.alibaba.com>
>> Cc: John Garry<john.g.garry@oracle.com>
> 
> Reviewed-by: John Garry <john.g.garry@oracle.com>
> 
> But does pmu-events/arch/arm64/hisilicon/hip08/metrics.json need to be
> fixed up as well?

The patch has been added in V4. Please take a look.

https://lore.kernel.org/lkml/20230616031420.3751973-6-kan.liang@linux.intel.com/

Thanks,
Kan

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2023-06-16  3:18 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-07 16:26 [PATCH 0/8] New metricgroup output in perf stat default mode kan.liang
2023-06-07 16:26 ` [PATCH 1/8] perf metric: Fix no group check kan.liang
2023-06-13 19:22   ` Ian Rogers
2023-06-07 16:26 ` [PATCH 2/8] perf evsel: Fix the annotation for hardware events on hybrid kan.liang
2023-06-13 19:35   ` Ian Rogers
2023-06-13 20:06     ` Liang, Kan
2023-06-13 21:18       ` Arnaldo Carvalho de Melo
2023-06-13 23:57         ` Liang, Kan
2023-06-07 16:26 ` [PATCH 3/8] perf metric: JSON flag to default metric group kan.liang
2023-06-13 19:44   ` Ian Rogers
2023-06-13 20:10     ` Liang, Kan
2023-06-13 20:28       ` Ian Rogers
2023-06-13 20:59         ` Liang, Kan
2023-06-13 21:28           ` Ian Rogers
2023-06-14  0:02             ` Liang, Kan
2023-06-07 16:26 ` [PATCH 4/8] perf vendor events arm64: Add default tags into topdown L1 metrics kan.liang
2023-06-13 19:45   ` Ian Rogers
2023-06-13 20:31     ` Arnaldo Carvalho de Melo
2023-06-14 14:30   ` John Garry
2023-06-16  3:17     ` Liang, Kan
2023-06-07 16:26 ` [PATCH 5/8] perf stat,jevents: Introduce Default tags for the default mode kan.liang
2023-06-13 19:59   ` Ian Rogers
2023-06-13 20:11     ` Liang, Kan
2023-06-07 16:26 ` [PATCH 6/8] perf stat,metrics: New metricgroup output " kan.liang
2023-06-13 20:16   ` Ian Rogers
2023-06-13 20:50     ` Liang, Kan
2023-06-07 16:26 ` [PATCH 7/8] pert tests: Support metricgroup perf stat JSON output kan.liang
2023-06-13 20:17   ` Ian Rogers
2023-06-13 20:30     ` Arnaldo Carvalho de Melo
2023-06-07 16:27 ` [PATCH 8/8] perf test: Add test case for the standard perf stat output kan.liang
2023-06-13 20:21   ` Ian Rogers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).