* [PATCH v2] perf stat: Introduce skippable evsels
@ 2023-04-14  5:19 Ian Rogers
  2023-04-14 18:02 ` Liang, Kan
  0 siblings, 1 reply; 29+ messages in thread
From: Ian Rogers @ 2023-04-14  5:19 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Ian Rogers, Adrian Hunter, Kan Liang, Florian Fischer,
	linux-perf-users, linux-kernel

Perf stat with no arguments will use default events and metrics. These
events may fail to open even with the kernel and hypervisor excluded.
When they fail, a permissions error is reported even though the events
were only implicitly selected. This is particularly a problem with the
automatic selection of the TopdownL1 metric group on certain
architectures like Skylake:

```
$ perf stat true
Error:
Access to performance monitoring and observability operations is limited.
Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
access to performance monitoring and observability operations for processes
without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
More information can be found at 'Perf events and tool security' document:
https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
perf_event_paranoid setting is 2:
  -1: Allow use of (almost) all events by all users
      Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow raw and ftrace function tracepoint access
>= 1: Disallow CPU event access
>= 2: Disallow kernel profiling
To make the adjusted perf_event_paranoid setting permanent preserve it
in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
```

This patch adds skippable evsels: when a skippable evsel fails to open,
perf stat does not fail and the event does not appear in the output.
The TopdownL1 events added via the metric group are marked as
skippable. This turns the failure above into:

```
$ perf stat true

 Performance counter stats for 'true':

              1.26 msec task-clock:u                     #    0.328 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                49      page-faults:u                    #   38.930 K/sec
           176,449      cycles:u                         #    0.140 GHz                         (48.99%)
           122,905      instructions:u                   #    0.70  insn per cycle
            28,264      branches:u                       #   22.456 M/sec
             2,405      branch-misses:u                  #    8.51% of all branches

       0.003834565 seconds time elapsed

       0.000000000 seconds user
       0.004130000 seconds sys
```

When the events can be opened with the kernel and hypervisor excluded,
as on Tigerlake, perf stat continues to succeed:

```
$ perf stat true

 Performance counter stats for 'true':

              0.57 msec task-clock:u                     #    0.385 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                47      page-faults:u                    #   82.329 K/sec
           287,017      cycles:u                         #    0.503 GHz
           133,318      instructions:u                   #    0.46  insn per cycle
            31,396      branches:u                       #   54.996 M/sec
             2,442      branch-misses:u                  #    7.78% of all branches
           998,790      TOPDOWN.SLOTS:u                  #     14.5 %  tma_retiring
                                                  #     27.6 %  tma_backend_bound
                                                  #     40.9 %  tma_frontend_bound
                                                  #     17.0 %  tma_bad_speculation
           144,922      topdown-retiring:u
           411,266      topdown-fe-bound:u
           258,510      topdown-be-bound:u
           184,090      topdown-bad-spec:u
             2,585      INT_MISC.UOP_DROPPING:u          #    4.528 M/sec
             3,434      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #    6.015 M/sec

       0.001480954 seconds time elapsed

       0.000000000 seconds user
       0.001686000 seconds sys
```

This likewise works when perf_event_paranoid permits it or when running as root.

v2. Don't display the skipped events as <not counted> or <not supported>.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/builtin-stat.c      | 39 ++++++++++++++++++++++++++--------
 tools/perf/util/evsel.c        | 15 +++++++++++--
 tools/perf/util/evsel.h        |  1 +
 tools/perf/util/stat-display.c |  4 ++++
 4 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index d3cbee7460fc..7a641a67486d 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -667,6 +667,13 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
 			evsel_list->core.threads->err_thread = -1;
 			return COUNTER_RETRY;
 		}
+	} else if (counter->skippable) {
+		if (verbose > 0)
+			ui__warning("skipping event %s that the kernel failed to open.\n",
+				    evsel__name(counter));
+		counter->supported = false;
+		counter->errored = true;
+		return COUNTER_SKIP;
 	}
 
 	evsel__open_strerror(counter, &target, errno, msg, sizeof(msg));
@@ -1885,15 +1892,29 @@ static int add_default_attributes(void)
 		 * Add TopdownL1 metrics if they exist. To minimize
 		 * multiplexing, don't request threshold computation.
 		 */
-		if (metricgroup__has_metric("TopdownL1") &&
-		    metricgroup__parse_groups(evsel_list, "TopdownL1",
-					    /*metric_no_group=*/false,
-					    /*metric_no_merge=*/false,
-					    /*metric_no_threshold=*/true,
-					    stat_config.user_requested_cpu_list,
-					    stat_config.system_wide,
-					    &stat_config.metric_events) < 0)
-			return -1;
+		if (metricgroup__has_metric("TopdownL1")) {
+			struct evlist *metric_evlist = evlist__new();
+			struct evsel *metric_evsel;
+
+			if (!metric_evlist)
+				return -1;
+
+			if (metricgroup__parse_groups(metric_evlist, "TopdownL1",
+							/*metric_no_group=*/false,
+							/*metric_no_merge=*/false,
+							/*metric_no_threshold=*/true,
+							stat_config.user_requested_cpu_list,
+							stat_config.system_wide,
+							&stat_config.metric_events) < 0)
+				return -1;
+
+			evlist__for_each_entry(metric_evlist, metric_evsel) {
+				metric_evsel->skippable = true;
+			}
+			evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
+			evlist__delete(metric_evlist);
+		}
+
 		/* Platform specific attrs */
 		if (evlist__add_default_attrs(evsel_list, default_null_attrs) < 0)
 			return -1;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index a85a987128aa..83a65f771666 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -290,6 +290,7 @@ void evsel__init(struct evsel *evsel,
 	evsel->per_pkg_mask  = NULL;
 	evsel->collect_stat  = false;
 	evsel->pmu_name      = NULL;
+	evsel->skippable     = false;
 }
 
 struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
@@ -1720,9 +1721,13 @@ static int get_group_fd(struct evsel *evsel, int cpu_map_idx, int thread)
 		return -1;
 
 	fd = FD(leader, cpu_map_idx, thread);
-	BUG_ON(fd == -1);
+	BUG_ON(fd == -1 && !leader->skippable);
 
-	return fd;
+	/*
+	 * When the leader has been skipped, return -2 to distinguish from no
+	 * group leader case.
+	 */
+	return fd == -1 ? -2 : fd;
 }
 
 static void evsel__remove_fd(struct evsel *pos, int nr_cpus, int nr_threads, int thread_idx)
@@ -2104,6 +2109,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
 
 			group_fd = get_group_fd(evsel, idx, thread);
 
+			if (group_fd == -2) {
+				pr_debug("broken group leader for %s\n", evsel->name);
+				err = -EINVAL;
+				goto out_close;
+			}
+
 			test_attr__ready();
 
 			/* Debug message used by test scripts */
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index 68072ec655ce..98afe3351176 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -95,6 +95,7 @@ struct evsel {
 		bool			weak_group;
 		bool			bpf_counter;
 		bool			use_config_name;
+		bool			skippable;
 		int			bpf_fd;
 		struct bpf_object	*bpf_obj;
 		struct list_head	config_terms;
diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
index e6035ecbeee8..6b46bbb3d322 100644
--- a/tools/perf/util/stat-display.c
+++ b/tools/perf/util/stat-display.c
@@ -810,6 +810,10 @@ static bool should_skip_zero_counter(struct perf_stat_config *config,
 	struct perf_cpu cpu;
 	int idx;
 
+	/* Skip counters that were speculatively/default enabled rather than requested. */
+	if (counter->skippable)
+		return true;
+
 	/*
 	 * Skip value 0 when enabling --per-thread globally,
 	 * otherwise it will have too many 0 output.
-- 
2.40.0.634.g4ca3ef3211-goog



* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-14  5:19 [PATCH v2] perf stat: Introduce skippable evsels Ian Rogers
@ 2023-04-14 18:02 ` Liang, Kan
  2023-04-14 23:03   ` Ian Rogers
  0 siblings, 1 reply; 29+ messages in thread
From: Liang, Kan @ 2023-04-14 18:02 UTC (permalink / raw)
  To: Ian Rogers, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Adrian Hunter, Florian Fischer,
	linux-perf-users, linux-kernel



On 2023-04-14 1:19 a.m., Ian Rogers wrote:
> Perf stat with no arguments will use default events and metrics. These
> events may fail to open even with the kernel and hypervisor excluded.
> When they fail, a permissions error is reported even though the events
> were only implicitly selected. This is particularly a problem with the
> automatic selection of the TopdownL1 metric group on certain
> architectures like Skylake:
> 
> ```
> $ perf stat true
> Error:
> Access to performance monitoring and observability operations is limited.
> Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
> access to performance monitoring and observability operations for processes
> without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
> More information can be found at 'Perf events and tool security' document:
> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
> perf_event_paranoid setting is 2:
>   -1: Allow use of (almost) all events by all users
>       Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
> >= 0: Disallow raw and ftrace function tracepoint access
> >= 1: Disallow CPU event access
> >= 2: Disallow kernel profiling
> To make the adjusted perf_event_paranoid setting permanent preserve it
> in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
> ```
> 
> This patch adds skippable evsels: when a skippable evsel fails to open,
> perf stat does not fail and the event does not appear in the output.
> The TopdownL1 events added via the metric group are marked as
> skippable. This turns the failure above into:
> 
> ```
> $ perf stat true
> 
>  Performance counter stats for 'true':
> 
>               1.26 msec task-clock:u                     #    0.328 CPUs utilized
>                  0      context-switches:u               #    0.000 /sec
>                  0      cpu-migrations:u                 #    0.000 /sec
>                 49      page-faults:u                    #   38.930 K/sec
>            176,449      cycles:u                         #    0.140 GHz                         (48.99%)

Multiplexing?

Thanks,
Kan

>            122,905      instructions:u                   #    0.70  insn per cycle
>             28,264      branches:u                       #   22.456 M/sec
>              2,405      branch-misses:u                  #    8.51% of all branches
> 
>        0.003834565 seconds time elapsed
> 
>        0.000000000 seconds user
>        0.004130000 seconds sys
> ```
> 
> When the events can be opened with the kernel and hypervisor excluded,
> as on Tigerlake, perf stat continues to succeed:
> 
> ```
> $ perf stat true
> 
>  Performance counter stats for 'true':
> 
>               0.57 msec task-clock:u                     #    0.385 CPUs utilized
>                  0      context-switches:u               #    0.000 /sec
>                  0      cpu-migrations:u                 #    0.000 /sec
>                 47      page-faults:u                    #   82.329 K/sec
>            287,017      cycles:u                         #    0.503 GHz
>            133,318      instructions:u                   #    0.46  insn per cycle
>             31,396      branches:u                       #   54.996 M/sec
>              2,442      branch-misses:u                  #    7.78% of all branches
>            998,790      TOPDOWN.SLOTS:u                  #     14.5 %  tma_retiring
>                                                   #     27.6 %  tma_backend_bound
>                                                   #     40.9 %  tma_frontend_bound
>                                                   #     17.0 %  tma_bad_speculation
>            144,922      topdown-retiring:u
>            411,266      topdown-fe-bound:u
>            258,510      topdown-be-bound:u
>            184,090      topdown-bad-spec:u
>              2,585      INT_MISC.UOP_DROPPING:u          #    4.528 M/sec
>              3,434      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #    6.015 M/sec
> 
>        0.001480954 seconds time elapsed
> 
>        0.000000000 seconds user
>        0.001686000 seconds sys
> ```
> 
> This likewise works when perf_event_paranoid permits it or when running as root.
> 
> v2. Don't display the skipped events as <not counted> or <not supported>.
> 
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/builtin-stat.c      | 39 ++++++++++++++++++++++++++--------
>  tools/perf/util/evsel.c        | 15 +++++++++++--
>  tools/perf/util/evsel.h        |  1 +
>  tools/perf/util/stat-display.c |  4 ++++
>  4 files changed, 48 insertions(+), 11 deletions(-)
> 
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index d3cbee7460fc..7a641a67486d 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -667,6 +667,13 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
>  			evsel_list->core.threads->err_thread = -1;
>  			return COUNTER_RETRY;
>  		}
> +	} else if (counter->skippable) {
> +		if (verbose > 0)
> +                     ui__warning("skipping event %s that the kernel failed to open.\n",
> +				    evsel__name(counter));
> +		counter->supported = false;
> +		counter->errored = true;
> +		return COUNTER_SKIP;
>  	}
>  
>  	evsel__open_strerror(counter, &target, errno, msg, sizeof(msg));
> @@ -1885,15 +1892,29 @@ static int add_default_attributes(void)
>  		 * Add TopdownL1 metrics if they exist. To minimize
>  		 * multiplexing, don't request threshold computation.
>  		 */
> -		if (metricgroup__has_metric("TopdownL1") &&
> -		    metricgroup__parse_groups(evsel_list, "TopdownL1",
> -					    /*metric_no_group=*/false,
> -					    /*metric_no_merge=*/false,
> -					    /*metric_no_threshold=*/true,
> -					    stat_config.user_requested_cpu_list,
> -					    stat_config.system_wide,
> -					    &stat_config.metric_events) < 0)
> -			return -1;
> +		if (metricgroup__has_metric("TopdownL1")) {
> +			struct evlist *metric_evlist = evlist__new();
> +			struct evsel *metric_evsel;
> +
> +			if (!metric_evlist)
> +				return -1;
> +
> +			if (metricgroup__parse_groups(metric_evlist, "TopdownL1",
> +							/*metric_no_group=*/false,
> +							/*metric_no_merge=*/false,
> +							/*metric_no_threshold=*/true,
> +							stat_config.user_requested_cpu_list,
> +							stat_config.system_wide,
> +							&stat_config.metric_events) < 0)
> +				return -1;
> +
> +			evlist__for_each_entry(metric_evlist, metric_evsel) {
> +				metric_evsel->skippable = true;
> +			}
> +			evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
> +			evlist__delete(metric_evlist);
> +		}
> +
>  		/* Platform specific attrs */
>  		if (evlist__add_default_attrs(evsel_list, default_null_attrs) < 0)
>  			return -1;
> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> index a85a987128aa..83a65f771666 100644
> --- a/tools/perf/util/evsel.c
> +++ b/tools/perf/util/evsel.c
> @@ -290,6 +290,7 @@ void evsel__init(struct evsel *evsel,
>  	evsel->per_pkg_mask  = NULL;
>  	evsel->collect_stat  = false;
>  	evsel->pmu_name      = NULL;
> +	evsel->skippable     = false;
>  }
>  
>  struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
> @@ -1720,9 +1721,13 @@ static int get_group_fd(struct evsel *evsel, int cpu_map_idx, int thread)
>  		return -1;
>  
>  	fd = FD(leader, cpu_map_idx, thread);
> -	BUG_ON(fd == -1);
> +	BUG_ON(fd == -1 && !leader->skippable);
>  
> -	return fd;
> +	/*
> +	 * When the leader has been skipped, return -2 to distinguish from no
> +	 * group leader case.
> +	 */
> +	return fd == -1 ? -2 : fd;
>  }
>  
>  static void evsel__remove_fd(struct evsel *pos, int nr_cpus, int nr_threads, int thread_idx)
> @@ -2104,6 +2109,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
>  
>  			group_fd = get_group_fd(evsel, idx, thread);
>  
> +			if (group_fd == -2) {
> +				pr_debug("broken group leader for %s\n", evsel->name);
> +				err = -EINVAL;
> +				goto out_close;
> +			}
> +
>  			test_attr__ready();
>  
>  			/* Debug message used by test scripts */
> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> index 68072ec655ce..98afe3351176 100644
> --- a/tools/perf/util/evsel.h
> +++ b/tools/perf/util/evsel.h
> @@ -95,6 +95,7 @@ struct evsel {
>  		bool			weak_group;
>  		bool			bpf_counter;
>  		bool			use_config_name;
> +		bool			skippable;
>  		int			bpf_fd;
>  		struct bpf_object	*bpf_obj;
>  		struct list_head	config_terms;
> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> index e6035ecbeee8..6b46bbb3d322 100644
> --- a/tools/perf/util/stat-display.c
> +++ b/tools/perf/util/stat-display.c
> @@ -810,6 +810,10 @@ static bool should_skip_zero_counter(struct perf_stat_config *config,
>  	struct perf_cpu cpu;
>  	int idx;
>  
> +	/* Skip counters that were speculatively/default enabled rather than requested. */
> +	if (counter->skippable)
> +		return true;
> +
>  	/*
>  	 * Skip value 0 when enabling --per-thread globally,
>  	 * otherwise it will have too many 0 output.


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-14 18:02 ` Liang, Kan
@ 2023-04-14 23:03   ` Ian Rogers
  2023-04-17 13:58     ` Liang, Kan
  0 siblings, 1 reply; 29+ messages in thread
From: Ian Rogers @ 2023-04-14 23:03 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel

On Fri, Apr 14, 2023 at 11:07 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-04-14 1:19 a.m., Ian Rogers wrote:
> > Perf stat with no arguments will use default events and metrics. These
> > events may fail to open even with the kernel and hypervisor excluded.
> > When they fail, a permissions error is reported even though the events
> > were only implicitly selected. This is particularly a problem with the
> > automatic selection of the TopdownL1 metric group on certain
> > architectures like Skylake:
> >
> > ```
> > $ perf stat true
> > Error:
> > Access to performance monitoring and observability operations is limited.
> > Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
> > access to performance monitoring and observability operations for processes
> > without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
> > More information can be found at 'Perf events and tool security' document:
> > https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
> > perf_event_paranoid setting is 2:
> >   -1: Allow use of (almost) all events by all users
> >       Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
> > >= 0: Disallow raw and ftrace function tracepoint access
> > >= 1: Disallow CPU event access
> > >= 2: Disallow kernel profiling
> > To make the adjusted perf_event_paranoid setting permanent preserve it
> > in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
> > ```
> >
> > This patch adds skippable evsels: when a skippable evsel fails to open,
> > perf stat does not fail and the event does not appear in the output.
> > The TopdownL1 events added via the metric group are marked as
> > skippable. This turns the failure above into:
> >
> > ```
> > $ perf stat true
> >
> >  Performance counter stats for 'true':
> >
> >               1.26 msec task-clock:u                     #    0.328 CPUs utilized
> >                  0      context-switches:u               #    0.000 /sec
> >                  0      cpu-migrations:u                 #    0.000 /sec
> >                 49      page-faults:u                    #   38.930 K/sec
> >            176,449      cycles:u                         #    0.140 GHz                         (48.99%)
>
> Multiplexing?
>
> Thanks,
> Kan

I may have been running a test in the background; otherwise I can't
explain it. Repeating the test yields no multiplexing:

```
$ perf stat true

 Performance counter stats for 'true':

              0.78 msec task-clock:u                     #    0.383 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                47      page-faults:u                    #   60.174 K/sec
           233,420      cycles:u                         #    0.299 GHz
           133,318      instructions:u                   #    0.57  insn per cycle
            31,396      branches:u                       #   40.196 M/sec
             2,334      branch-misses:u                  #    7.43% of all branches
         1,167,100      TOPDOWN.SLOTS:u                  #     12.2 %  tma_retiring
                                                  #     28.9 %  tma_backend_bound
                                                  #     41.0 %  tma_frontend_bound
                                                  #     18.0 %  tma_bad_speculation
           141,882      topdown-retiring:u
           480,570      topdown-fe-bound:u
           320,380      topdown-be-bound:u
           224,266      topdown-bad-spec:u
             2,173      INT_MISC.UOP_DROPPING:u          #    2.782 M/sec
             3,323      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #    4.254 M/sec

       0.002036744 seconds time elapsed

       0.002252000 seconds user
       0.000000000 seconds sys
```

Thanks,
Ian

> >            122,905      instructions:u                   #    0.70  insn per cycle
> >             28,264      branches:u                       #   22.456 M/sec
> >              2,405      branch-misses:u                  #    8.51% of all branches
> >
> >        0.003834565 seconds time elapsed
> >
> >        0.000000000 seconds user
> >        0.004130000 seconds sys
> > ```
> >
> > When the events can be opened with the kernel and hypervisor excluded,
> > as on Tigerlake, perf stat continues to succeed:
> >
> > ```
> > $ perf stat true
> >
> >  Performance counter stats for 'true':
> >
> >               0.57 msec task-clock:u                     #    0.385 CPUs utilized
> >                  0      context-switches:u               #    0.000 /sec
> >                  0      cpu-migrations:u                 #    0.000 /sec
> >                 47      page-faults:u                    #   82.329 K/sec
> >            287,017      cycles:u                         #    0.503 GHz
> >            133,318      instructions:u                   #    0.46  insn per cycle
> >             31,396      branches:u                       #   54.996 M/sec
> >              2,442      branch-misses:u                  #    7.78% of all branches
> >            998,790      TOPDOWN.SLOTS:u                  #     14.5 %  tma_retiring
> >                                                   #     27.6 %  tma_backend_bound
> >                                                   #     40.9 %  tma_frontend_bound
> >                                                   #     17.0 %  tma_bad_speculation
> >            144,922      topdown-retiring:u
> >            411,266      topdown-fe-bound:u
> >            258,510      topdown-be-bound:u
> >            184,090      topdown-bad-spec:u
> >              2,585      INT_MISC.UOP_DROPPING:u          #    4.528 M/sec
> >              3,434      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #    6.015 M/sec
> >
> >        0.001480954 seconds time elapsed
> >
> >        0.000000000 seconds user
> >        0.001686000 seconds sys
> > ```
> >
> > This likewise works when perf_event_paranoid permits it or when running as root.
> >
> > v2. Don't display the skipped events as <not counted> or <not supported>.
> >
> > Signed-off-by: Ian Rogers <irogers@google.com>
> > ---
> >  tools/perf/builtin-stat.c      | 39 ++++++++++++++++++++++++++--------
> >  tools/perf/util/evsel.c        | 15 +++++++++++--
> >  tools/perf/util/evsel.h        |  1 +
> >  tools/perf/util/stat-display.c |  4 ++++
> >  4 files changed, 48 insertions(+), 11 deletions(-)
> >
> > diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> > index d3cbee7460fc..7a641a67486d 100644
> > --- a/tools/perf/builtin-stat.c
> > +++ b/tools/perf/builtin-stat.c
> > @@ -667,6 +667,13 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
> >                       evsel_list->core.threads->err_thread = -1;
> >                       return COUNTER_RETRY;
> >               }
> > +     } else if (counter->skippable) {
> > +             if (verbose > 0)
> > +             ui__warning("skipping event %s that the kernel failed to open.\n",
> > +                                 evsel__name(counter));
> > +             counter->supported = false;
> > +             counter->errored = true;
> > +             return COUNTER_SKIP;
> >       }
> >
> >       evsel__open_strerror(counter, &target, errno, msg, sizeof(msg));
> > @@ -1885,15 +1892,29 @@ static int add_default_attributes(void)
> >                * Add TopdownL1 metrics if they exist. To minimize
> >                * multiplexing, don't request threshold computation.
> >                */
> > -             if (metricgroup__has_metric("TopdownL1") &&
> > -                 metricgroup__parse_groups(evsel_list, "TopdownL1",
> > -                                         /*metric_no_group=*/false,
> > -                                         /*metric_no_merge=*/false,
> > -                                         /*metric_no_threshold=*/true,
> > -                                         stat_config.user_requested_cpu_list,
> > -                                         stat_config.system_wide,
> > -                                         &stat_config.metric_events) < 0)
> > -                     return -1;
> > +             if (metricgroup__has_metric("TopdownL1")) {
> > +                     struct evlist *metric_evlist = evlist__new();
> > +                     struct evsel *metric_evsel;
> > +
> > +                     if (!metric_evlist)
> > +                             return -1;
> > +
> > +                     if (metricgroup__parse_groups(metric_evlist, "TopdownL1",
> > +                                                     /*metric_no_group=*/false,
> > +                                                     /*metric_no_merge=*/false,
> > +                                                     /*metric_no_threshold=*/true,
> > +                                                     stat_config.user_requested_cpu_list,
> > +                                                     stat_config.system_wide,
> > +                                                     &stat_config.metric_events) < 0)
> > +                             return -1;
> > +
> > +                     evlist__for_each_entry(metric_evlist, metric_evsel) {
> > +                             metric_evsel->skippable = true;
> > +                     }
> > +                     evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
> > +                     evlist__delete(metric_evlist);
> > +             }
> > +
> >               /* Platform specific attrs */
> >               if (evlist__add_default_attrs(evsel_list, default_null_attrs) < 0)
> >                       return -1;
> > diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> > index a85a987128aa..83a65f771666 100644
> > --- a/tools/perf/util/evsel.c
> > +++ b/tools/perf/util/evsel.c
> > @@ -290,6 +290,7 @@ void evsel__init(struct evsel *evsel,
> >       evsel->per_pkg_mask  = NULL;
> >       evsel->collect_stat  = false;
> >       evsel->pmu_name      = NULL;
> > +     evsel->skippable     = false;
> >  }
> >
> >  struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
> > @@ -1720,9 +1721,13 @@ static int get_group_fd(struct evsel *evsel, int cpu_map_idx, int thread)
> >               return -1;
> >
> >       fd = FD(leader, cpu_map_idx, thread);
> > -     BUG_ON(fd == -1);
> > +     BUG_ON(fd == -1 && !leader->skippable);
> >
> > -     return fd;
> > +     /*
> > +      * When the leader has been skipped, return -2 to distinguish from no
> > +      * group leader case.
> > +      */
> > +     return fd == -1 ? -2 : fd;
> >  }
> >
> >  static void evsel__remove_fd(struct evsel *pos, int nr_cpus, int nr_threads, int thread_idx)
> > @@ -2104,6 +2109,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
> >
> >                       group_fd = get_group_fd(evsel, idx, thread);
> >
> > +                     if (group_fd == -2) {
> > +                             pr_debug("broken group leader for %s\n", evsel->name);
> > +                             err = -EINVAL;
> > +                             goto out_close;
> > +                     }
> > +
> >                       test_attr__ready();
> >
> >                       /* Debug message used by test scripts */
> > diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> > index 68072ec655ce..98afe3351176 100644
> > --- a/tools/perf/util/evsel.h
> > +++ b/tools/perf/util/evsel.h
> > @@ -95,6 +95,7 @@ struct evsel {
> >               bool                    weak_group;
> >               bool                    bpf_counter;
> >               bool                    use_config_name;
> > +             bool                    skippable;
> >               int                     bpf_fd;
> >               struct bpf_object       *bpf_obj;
> >               struct list_head        config_terms;
> > diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> > index e6035ecbeee8..6b46bbb3d322 100644
> > --- a/tools/perf/util/stat-display.c
> > +++ b/tools/perf/util/stat-display.c
> > @@ -810,6 +810,10 @@ static bool should_skip_zero_counter(struct perf_stat_config *config,
> >       struct perf_cpu cpu;
> >       int idx;
> >
> > +     /* Skip counters that were speculatively/default enabled rather than requested. */
> > +     if (counter->skippable)
> > +             return true;
> > +
> >       /*
> >        * Skip value 0 when enabling --per-thread globally,
> >        * otherwise it will have too many 0 output.


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-14 23:03   ` Ian Rogers
@ 2023-04-17 13:58     ` Liang, Kan
  2023-04-17 15:59       ` Ian Rogers
  0 siblings, 1 reply; 29+ messages in thread
From: Liang, Kan @ 2023-04-17 13:58 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel



On 2023-04-14 7:03 p.m., Ian Rogers wrote:
> On Fri, Apr 14, 2023 at 11:07 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>>
>> On 2023-04-14 1:19 a.m., Ian Rogers wrote:
>>> Perf stat with no arguments will use default events and metrics. These
>>> events may fail to open even with kernel and hypervisor disabled. When
>>> these fail then the permissions error appears even though they were
>>> implicitly selected. This is particularly a problem with the automatic
>>> selection of the TopdownL1 metric group on certain architectures like
>>> Skylake:
>>>
>>> ```
>>> $ perf stat true
>>> Error:
>>> Access to performance monitoring and observability operations is limited.
>>> Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
>>> access to performance monitoring and observability operations for processes
>>> without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
>>> More information can be found at 'Perf events and tool security' document:
>>> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
>>> perf_event_paranoid setting is 2:
>>>   -1: Allow use of (almost) all events by all users
>>>       Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>>>> = 0: Disallow raw and ftrace function tracepoint access
>>>> = 1: Disallow CPU event access
>>>> = 2: Disallow kernel profiling
>>> To make the adjusted perf_event_paranoid setting permanent preserve it
>>> in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
>>> ```
>>>
> >>> This patch adds skippable evsels that, when they fail to open, won't
> >>> cause perf stat to fail and won't appear in the output. The TopdownL1
> >>> events, from the metric group, are marked as skippable. This turns the
> >>> failure above into:
>>>
>>> ```
>>> $ perf stat true
>>>
>>>  Performance counter stats for 'true':
>>>
>>>               1.26 msec task-clock:u                     #    0.328 CPUs utilized
>>>                  0      context-switches:u               #    0.000 /sec
>>>                  0      cpu-migrations:u                 #    0.000 /sec
>>>                 49      page-faults:u                    #   38.930 K/sec
>>>            176,449      cycles:u                         #    0.140 GHz                         (48.99%)
>>
>> Multiplexing?
>>
>> Thanks,
>> Kan
> 
> I may have been running a test in the background; otherwise I can't
> explain it. Repeating the test yields no multiplexing:


The multiplexing above should be from a Skylake (since no TopdownL1 is
printed), but the test you repeated seems to be on a Tigerlake (it has
TopdownL1). Could you please double-check on a Skylake?


Thanks,
Kan
> 
> ```
> $ perf stat true
> 
> Performance counter stats for 'true':
> 
>              0.78 msec task-clock:u                     #    0.383 CPUs utilized
>                 0      context-switches:u               #    0.000 /sec
>                 0      cpu-migrations:u                 #    0.000 /sec
>                47      page-faults:u                    #   60.174 K/sec
>           233,420      cycles:u                         #    0.299 GHz
>           133,318      instructions:u                   #    0.57  insn per cycle
>            31,396      branches:u                       #   40.196 M/sec
>             2,334      branch-misses:u                  #    7.43% of all branches
>         1,167,100      TOPDOWN.SLOTS:u                  #     12.2 %  tma_retiring
>                                                  #     28.9 %  tma_backend_bound
>                                                  #     41.0 %  tma_frontend_bound
>                                                  #     18.0 %  tma_bad_speculation
>           141,882      topdown-retiring:u
>           480,570      topdown-fe-bound:u
>           320,380      topdown-be-bound:u
>           224,266      topdown-bad-spec:u
>             2,173      INT_MISC.UOP_DROPPING:u          #    2.782 M/sec
>             3,323      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #    4.254 M/sec
> 
> 
>       0.002036744 seconds time elapsed
> 
>       0.002252000 seconds user
>       0.000000000 seconds sys
> ```
> 
> Thanks,
> Ian
> 
>>>            122,905      instructions:u                   #    0.70  insn per cycle
>>>             28,264      branches:u                       #   22.456 M/sec
>>>              2,405      branch-misses:u                  #    8.51% of all branches
>>>
>>>        0.003834565 seconds time elapsed
>>>
>>>        0.000000000 seconds user
>>>        0.004130000 seconds sys
>>> ```
>>>
>>> When the events can have kernel/hypervisor disabled, like on
>>> Tigerlake, then it continues to succeed as:
>>>
>>> ```
>>> $ perf stat true
>>>
>>>  Performance counter stats for 'true':
>>>
>>>               0.57 msec task-clock:u                     #    0.385 CPUs utilized
>>>                  0      context-switches:u               #    0.000 /sec
>>>                  0      cpu-migrations:u                 #    0.000 /sec
>>>                 47      page-faults:u                    #   82.329 K/sec
>>>            287,017      cycles:u                         #    0.503 GHz
>>>            133,318      instructions:u                   #    0.46  insn per cycle
>>>             31,396      branches:u                       #   54.996 M/sec
>>>              2,442      branch-misses:u                  #    7.78% of all branches
>>>            998,790      TOPDOWN.SLOTS:u                  #     14.5 %  tma_retiring
>>>                                                   #     27.6 %  tma_backend_bound
>>>                                                   #     40.9 %  tma_frontend_bound
>>>                                                   #     17.0 %  tma_bad_speculation
>>>            144,922      topdown-retiring:u
>>>            411,266      topdown-fe-bound:u
>>>            258,510      topdown-be-bound:u
>>>            184,090      topdown-bad-spec:u
>>>              2,585      INT_MISC.UOP_DROPPING:u          #    4.528 M/sec
>>>              3,434      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #    6.015 M/sec
>>>
>>>        0.001480954 seconds time elapsed
>>>
>>>        0.000000000 seconds user
>>>        0.001686000 seconds sys
>>> ```
>>>
>>> And this likewise works if paranoia allows or running as root.
>>>
>>> v2. Don't display the skipped events as <not counted> or <not supported>.
>>>
>>> Signed-off-by: Ian Rogers <irogers@google.com>
>>> ---
>>>  tools/perf/builtin-stat.c      | 39 ++++++++++++++++++++++++++--------
>>>  tools/perf/util/evsel.c        | 15 +++++++++++--
>>>  tools/perf/util/evsel.h        |  1 +
>>>  tools/perf/util/stat-display.c |  4 ++++
>>>  4 files changed, 48 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>>> index d3cbee7460fc..7a641a67486d 100644
>>> --- a/tools/perf/builtin-stat.c
>>> +++ b/tools/perf/builtin-stat.c
>>> @@ -667,6 +667,13 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
>>>                       evsel_list->core.threads->err_thread = -1;
>>>                       return COUNTER_RETRY;
>>>               }
>>> +     } else if (counter->skippable) {
>>> +             if (verbose > 0)
> >>> +                     ui__warning("skipping event %s that kernel failed to open.\n",
>>> +                                 evsel__name(counter));
>>> +             counter->supported = false;
>>> +             counter->errored = true;
>>> +             return COUNTER_SKIP;
>>>       }
>>>
>>>       evsel__open_strerror(counter, &target, errno, msg, sizeof(msg));
>>> @@ -1885,15 +1892,29 @@ static int add_default_attributes(void)
>>>                * Add TopdownL1 metrics if they exist. To minimize
>>>                * multiplexing, don't request threshold computation.
>>>                */
>>> -             if (metricgroup__has_metric("TopdownL1") &&
>>> -                 metricgroup__parse_groups(evsel_list, "TopdownL1",
>>> -                                         /*metric_no_group=*/false,
>>> -                                         /*metric_no_merge=*/false,
>>> -                                         /*metric_no_threshold=*/true,
>>> -                                         stat_config.user_requested_cpu_list,
>>> -                                         stat_config.system_wide,
>>> -                                         &stat_config.metric_events) < 0)
>>> -                     return -1;
>>> +             if (metricgroup__has_metric("TopdownL1")) {
>>> +                     struct evlist *metric_evlist = evlist__new();
>>> +                     struct evsel *metric_evsel;
>>> +
>>> +                     if (!metric_evlist)
>>> +                             return -1;
>>> +
>>> +                     if (metricgroup__parse_groups(metric_evlist, "TopdownL1",
>>> +                                                     /*metric_no_group=*/false,
>>> +                                                     /*metric_no_merge=*/false,
>>> +                                                     /*metric_no_threshold=*/true,
>>> +                                                     stat_config.user_requested_cpu_list,
>>> +                                                     stat_config.system_wide,
>>> +                                                     &stat_config.metric_events) < 0)
>>> +                             return -1;
>>> +
>>> +                     evlist__for_each_entry(metric_evlist, metric_evsel) {
>>> +                             metric_evsel->skippable = true;
>>> +                     }
>>> +                     evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
>>> +                     evlist__delete(metric_evlist);
>>> +             }
>>> +
>>>               /* Platform specific attrs */
>>>               if (evlist__add_default_attrs(evsel_list, default_null_attrs) < 0)
>>>                       return -1;
>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>> index a85a987128aa..83a65f771666 100644
>>> --- a/tools/perf/util/evsel.c
>>> +++ b/tools/perf/util/evsel.c
>>> @@ -290,6 +290,7 @@ void evsel__init(struct evsel *evsel,
>>>       evsel->per_pkg_mask  = NULL;
>>>       evsel->collect_stat  = false;
>>>       evsel->pmu_name      = NULL;
>>> +     evsel->skippable     = false;
>>>  }
>>>
>>>  struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
>>> @@ -1720,9 +1721,13 @@ static int get_group_fd(struct evsel *evsel, int cpu_map_idx, int thread)
>>>               return -1;
>>>
>>>       fd = FD(leader, cpu_map_idx, thread);
>>> -     BUG_ON(fd == -1);
>>> +     BUG_ON(fd == -1 && !leader->skippable);
>>>
>>> -     return fd;
>>> +     /*
>>> +      * When the leader has been skipped, return -2 to distinguish from no
>>> +      * group leader case.
>>> +      */
>>> +     return fd == -1 ? -2 : fd;
>>>  }
>>>
>>>  static void evsel__remove_fd(struct evsel *pos, int nr_cpus, int nr_threads, int thread_idx)
>>> @@ -2104,6 +2109,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
>>>
>>>                       group_fd = get_group_fd(evsel, idx, thread);
>>>
>>> +                     if (group_fd == -2) {
>>> +                             pr_debug("broken group leader for %s\n", evsel->name);
>>> +                             err = -EINVAL;
>>> +                             goto out_close;
>>> +                     }
>>> +
>>>                       test_attr__ready();
>>>
>>>                       /* Debug message used by test scripts */
>>> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
>>> index 68072ec655ce..98afe3351176 100644
>>> --- a/tools/perf/util/evsel.h
>>> +++ b/tools/perf/util/evsel.h
>>> @@ -95,6 +95,7 @@ struct evsel {
>>>               bool                    weak_group;
>>>               bool                    bpf_counter;
>>>               bool                    use_config_name;
>>> +             bool                    skippable;
>>>               int                     bpf_fd;
>>>               struct bpf_object       *bpf_obj;
>>>               struct list_head        config_terms;
>>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
>>> index e6035ecbeee8..6b46bbb3d322 100644
>>> --- a/tools/perf/util/stat-display.c
>>> +++ b/tools/perf/util/stat-display.c
>>> @@ -810,6 +810,10 @@ static bool should_skip_zero_counter(struct perf_stat_config *config,
>>>       struct perf_cpu cpu;
>>>       int idx;
>>>
>>> +     /* Skip counters that were speculatively/default enabled rather than requested. */
>>> +     if (counter->skippable)
>>> +             return true;
>>> +
>>>       /*
>>>        * Skip value 0 when enabling --per-thread globally,
>>>        * otherwise it will have too many 0 output.


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-17 13:58     ` Liang, Kan
@ 2023-04-17 15:59       ` Ian Rogers
  2023-04-17 17:31         ` Liang, Kan
  0 siblings, 1 reply; 29+ messages in thread
From: Ian Rogers @ 2023-04-17 15:59 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel

On Mon, Apr 17, 2023 at 6:58 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-04-14 7:03 p.m., Ian Rogers wrote:
> > On Fri, Apr 14, 2023 at 11:07 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>
> >>
> >>
> >> On 2023-04-14 1:19 a.m., Ian Rogers wrote:
> >>> Perf stat with no arguments will use default events and metrics. These
> >>> events may fail to open even with kernel and hypervisor disabled. When
> >>> these fail then the permissions error appears even though they were
> >>> implicitly selected. This is particularly a problem with the automatic
> >>> selection of the TopdownL1 metric group on certain architectures like
> >>> Skylake:
> >>>
> >>> ```
> >>> $ perf stat true
> >>> Error:
> >>> Access to performance monitoring and observability operations is limited.
> >>> Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
> >>> access to performance monitoring and observability operations for processes
> >>> without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
> >>> More information can be found at 'Perf events and tool security' document:
> >>> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
> >>> perf_event_paranoid setting is 2:
> >>>   -1: Allow use of (almost) all events by all users
> >>>       Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
> >>> >= 0: Disallow raw and ftrace function tracepoint access
> >>> >= 1: Disallow CPU event access
> >>> >= 2: Disallow kernel profiling
> >>> To make the adjusted perf_event_paranoid setting permanent preserve it
> >>> in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
> >>> ```
> >>>
> >>> This patch adds skippable evsels that, when they fail to open, won't
> >>> cause perf stat to fail and won't appear in the output. The TopdownL1
> >>> events, from the metric group, are marked as skippable. This turns the
> >>> failure above into:
> >>>
> >>> ```
> >>> $ perf stat true
> >>>
> >>>  Performance counter stats for 'true':
> >>>
> >>>               1.26 msec task-clock:u                     #    0.328 CPUs utilized
> >>>                  0      context-switches:u               #    0.000 /sec
> >>>                  0      cpu-migrations:u                 #    0.000 /sec
> >>>                 49      page-faults:u                    #   38.930 K/sec
> >>>            176,449      cycles:u                         #    0.140 GHz                         (48.99%)
> >>
> >> Multiplexing?
> >>
> >> Thanks,
> >> Kan
> >
> > I may have been running a test in the background; otherwise I can't
> > explain it. Repeating the test yields no multiplexing:
>
>
> The multiplexing above should be from a Skylake (since no TopdownL1 is
> printed), but the test you repeated seems to be on a Tigerlake (it has
> TopdownL1). Could you please double-check on a Skylake?

In the best circumstances (i.e. no EBS_Mode, no other events, NMI
watchdog disabled), Skylake has multiplexing for TopdownL1:

```
$ sudo perf stat --metric-no-group -M TopdownL1 --metric-no-group -a sleep 1

Performance counter stats for 'system wide':

      500,145,019      INST_RETIRED.ANY                 #     14.2 %  tma_retiring             (71.07%)
    2,402,640,337      CPU_CLK_UNHALTED.THREAD_ANY      #     41.1 %  tma_frontend_bound
                                                 #     36.2 %  tma_backend_bound
                                                 #      8.4 %  tma_bad_speculation      (85.63%)
    1,976,769,761      IDQ_UOPS_NOT_DELIVERED.CORE                    (85.81%)
      114,069,133      INT_MISC.RECOVERY_CYCLES_ANY                   (85.83%)
      684,605,487      UOPS_RETIRED.RETIRE_SLOTS                      (85.83%)
       49,695,823      UOPS_RETIRED.MACRO_FUSED                       (85.83%)
      860,603,122      UOPS_ISSUED.ANY                                (56.70%)

      1.014011174 seconds time elapsed
```

but this isn't a regression:
https://lore.kernel.org/lkml/20200520072814.128267-1-irogers@google.com/

I think this is off-topic for this change.

Thanks,
Ian

> Thanks,
> Kan
> >
> > ```
> > $ perf stat true
> >
> > Performance counter stats for 'true':
> >
> >              0.78 msec task-clock:u                     #    0.383 CPUs utilized
> >                 0      context-switches:u               #    0.000 /sec
> >                 0      cpu-migrations:u                 #    0.000 /sec
> >                47      page-faults:u                    #   60.174 K/sec
> >           233,420      cycles:u                         #    0.299 GHz
> >           133,318      instructions:u                   #    0.57  insn per cycle
> >            31,396      branches:u                       #   40.196 M/sec
> >             2,334      branch-misses:u                  #    7.43% of all branches
> >         1,167,100      TOPDOWN.SLOTS:u                  #     12.2 %  tma_retiring
> >                                                  #     28.9 %  tma_backend_bound
> >                                                  #     41.0 %  tma_frontend_bound
> >                                                  #     18.0 %  tma_bad_speculation
> >           141,882      topdown-retiring:u
> >           480,570      topdown-fe-bound:u
> >           320,380      topdown-be-bound:u
> >           224,266      topdown-bad-spec:u
> >             2,173      INT_MISC.UOP_DROPPING:u          #    2.782 M/sec
> >             3,323      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #    4.254 M/sec
> >
> >
> >       0.002036744 seconds time elapsed
> >
> >       0.002252000 seconds user
> >       0.000000000 seconds sys
> > ```
> >
> > Thanks,
> > Ian
> >
> >>>            122,905      instructions:u                   #    0.70  insn per cycle
> >>>             28,264      branches:u                       #   22.456 M/sec
> >>>              2,405      branch-misses:u                  #    8.51% of all branches
> >>>
> >>>        0.003834565 seconds time elapsed
> >>>
> >>>        0.000000000 seconds user
> >>>        0.004130000 seconds sys
> >>> ```
> >>>
> >>> When the events can have kernel/hypervisor disabled, like on
> >>> Tigerlake, then it continues to succeed as:
> >>>
> >>> ```
> >>> $ perf stat true
> >>>
> >>>  Performance counter stats for 'true':
> >>>
> >>>               0.57 msec task-clock:u                     #    0.385 CPUs utilized
> >>>                  0      context-switches:u               #    0.000 /sec
> >>>                  0      cpu-migrations:u                 #    0.000 /sec
> >>>                 47      page-faults:u                    #   82.329 K/sec
> >>>            287,017      cycles:u                         #    0.503 GHz
> >>>            133,318      instructions:u                   #    0.46  insn per cycle
> >>>             31,396      branches:u                       #   54.996 M/sec
> >>>              2,442      branch-misses:u                  #    7.78% of all branches
> >>>            998,790      TOPDOWN.SLOTS:u                  #     14.5 %  tma_retiring
> >>>                                                   #     27.6 %  tma_backend_bound
> >>>                                                   #     40.9 %  tma_frontend_bound
> >>>                                                   #     17.0 %  tma_bad_speculation
> >>>            144,922      topdown-retiring:u
> >>>            411,266      topdown-fe-bound:u
> >>>            258,510      topdown-be-bound:u
> >>>            184,090      topdown-bad-spec:u
> >>>              2,585      INT_MISC.UOP_DROPPING:u          #    4.528 M/sec
> >>>              3,434      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #    6.015 M/sec
> >>>
> >>>        0.001480954 seconds time elapsed
> >>>
> >>>        0.000000000 seconds user
> >>>        0.001686000 seconds sys
> >>> ```
> >>>
> >>> And this likewise works if paranoia allows or running as root.
> >>>
> >>> v2. Don't display the skipped events as <not counted> or <not supported>.
> >>>
> >>> Signed-off-by: Ian Rogers <irogers@google.com>
> >>> ---
> >>>  tools/perf/builtin-stat.c      | 39 ++++++++++++++++++++++++++--------
> >>>  tools/perf/util/evsel.c        | 15 +++++++++++--
> >>>  tools/perf/util/evsel.h        |  1 +
> >>>  tools/perf/util/stat-display.c |  4 ++++
> >>>  4 files changed, 48 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> >>> index d3cbee7460fc..7a641a67486d 100644
> >>> --- a/tools/perf/builtin-stat.c
> >>> +++ b/tools/perf/builtin-stat.c
> >>> @@ -667,6 +667,13 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
> >>>                       evsel_list->core.threads->err_thread = -1;
> >>>                       return COUNTER_RETRY;
> >>>               }
> >>> +     } else if (counter->skippable) {
> >>> +             if (verbose > 0)
> >>> +                     ui__warning("skipping event %s that kernel failed to open.\n",
> >>> +                                 evsel__name(counter));
> >>> +             counter->supported = false;
> >>> +             counter->errored = true;
> >>> +             return COUNTER_SKIP;
> >>>       }
> >>>
> >>>       evsel__open_strerror(counter, &target, errno, msg, sizeof(msg));
> >>> @@ -1885,15 +1892,29 @@ static int add_default_attributes(void)
> >>>                * Add TopdownL1 metrics if they exist. To minimize
> >>>                * multiplexing, don't request threshold computation.
> >>>                */
> >>> -             if (metricgroup__has_metric("TopdownL1") &&
> >>> -                 metricgroup__parse_groups(evsel_list, "TopdownL1",
> >>> -                                         /*metric_no_group=*/false,
> >>> -                                         /*metric_no_merge=*/false,
> >>> -                                         /*metric_no_threshold=*/true,
> >>> -                                         stat_config.user_requested_cpu_list,
> >>> -                                         stat_config.system_wide,
> >>> -                                         &stat_config.metric_events) < 0)
> >>> -                     return -1;
> >>> +             if (metricgroup__has_metric("TopdownL1")) {
> >>> +                     struct evlist *metric_evlist = evlist__new();
> >>> +                     struct evsel *metric_evsel;
> >>> +
> >>> +                     if (!metric_evlist)
> >>> +                             return -1;
> >>> +
> >>> +                     if (metricgroup__parse_groups(metric_evlist, "TopdownL1",
> >>> +                                                     /*metric_no_group=*/false,
> >>> +                                                     /*metric_no_merge=*/false,
> >>> +                                                     /*metric_no_threshold=*/true,
> >>> +                                                     stat_config.user_requested_cpu_list,
> >>> +                                                     stat_config.system_wide,
> >>> +                                                     &stat_config.metric_events) < 0)
> >>> +                             return -1;
> >>> +
> >>> +                     evlist__for_each_entry(metric_evlist, metric_evsel) {
> >>> +                             metric_evsel->skippable = true;
> >>> +                     }
> >>> +                     evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
> >>> +                     evlist__delete(metric_evlist);
> >>> +             }
> >>> +
> >>>               /* Platform specific attrs */
> >>>               if (evlist__add_default_attrs(evsel_list, default_null_attrs) < 0)
> >>>                       return -1;
> >>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>> index a85a987128aa..83a65f771666 100644
> >>> --- a/tools/perf/util/evsel.c
> >>> +++ b/tools/perf/util/evsel.c
> >>> @@ -290,6 +290,7 @@ void evsel__init(struct evsel *evsel,
> >>>       evsel->per_pkg_mask  = NULL;
> >>>       evsel->collect_stat  = false;
> >>>       evsel->pmu_name      = NULL;
> >>> +     evsel->skippable     = false;
> >>>  }
> >>>
> >>>  struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
> >>> @@ -1720,9 +1721,13 @@ static int get_group_fd(struct evsel *evsel, int cpu_map_idx, int thread)
> >>>               return -1;
> >>>
> >>>       fd = FD(leader, cpu_map_idx, thread);
> >>> -     BUG_ON(fd == -1);
> >>> +     BUG_ON(fd == -1 && !leader->skippable);
> >>>
> >>> -     return fd;
> >>> +     /*
> >>> +      * When the leader has been skipped, return -2 to distinguish from no
> >>> +      * group leader case.
> >>> +      */
> >>> +     return fd == -1 ? -2 : fd;
> >>>  }
> >>>
> >>>  static void evsel__remove_fd(struct evsel *pos, int nr_cpus, int nr_threads, int thread_idx)
> >>> @@ -2104,6 +2109,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
> >>>
> >>>                       group_fd = get_group_fd(evsel, idx, thread);
> >>>
> >>> +                     if (group_fd == -2) {
> >>> +                             pr_debug("broken group leader for %s\n", evsel->name);
> >>> +                             err = -EINVAL;
> >>> +                             goto out_close;
> >>> +                     }
> >>> +
> >>>                       test_attr__ready();
> >>>
> >>>                       /* Debug message used by test scripts */
> >>> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> >>> index 68072ec655ce..98afe3351176 100644
> >>> --- a/tools/perf/util/evsel.h
> >>> +++ b/tools/perf/util/evsel.h
> >>> @@ -95,6 +95,7 @@ struct evsel {
> >>>               bool                    weak_group;
> >>>               bool                    bpf_counter;
> >>>               bool                    use_config_name;
> >>> +             bool                    skippable;
> >>>               int                     bpf_fd;
> >>>               struct bpf_object       *bpf_obj;
> >>>               struct list_head        config_terms;
> >>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> >>> index e6035ecbeee8..6b46bbb3d322 100644
> >>> --- a/tools/perf/util/stat-display.c
> >>> +++ b/tools/perf/util/stat-display.c
> >>> @@ -810,6 +810,10 @@ static bool should_skip_zero_counter(struct perf_stat_config *config,
> >>>       struct perf_cpu cpu;
> >>>       int idx;
> >>>
> >>> +     /* Skip counters that were speculatively/default enabled rather than requested. */
> >>> +     if (counter->skippable)
> >>> +             return true;
> >>> +
> >>>       /*
> >>>        * Skip value 0 when enabling --per-thread globally,
> >>>        * otherwise it will have too many 0 output.


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-17 15:59       ` Ian Rogers
@ 2023-04-17 17:31         ` Liang, Kan
  2023-04-17 18:13           ` Ian Rogers
  0 siblings, 1 reply; 29+ messages in thread
From: Liang, Kan @ 2023-04-17 17:31 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel



On 2023-04-17 11:59 a.m., Ian Rogers wrote:
> On Mon, Apr 17, 2023 at 6:58 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>>
>> On 2023-04-14 7:03 p.m., Ian Rogers wrote:
>>> On Fri, Apr 14, 2023 at 11:07 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2023-04-14 1:19 a.m., Ian Rogers wrote:
>>>>> Perf stat with no arguments will use default events and metrics. These
>>>>> events may fail to open even with kernel and hypervisor disabled. When
>>>>> these fail then the permissions error appears even though they were
>>>>> implicitly selected. This is particularly a problem with the automatic
>>>>> selection of the TopdownL1 metric group on certain architectures like
>>>>> Skylake:
>>>>>
>>>>> ```
>>>>> $ perf stat true
>>>>> Error:
>>>>> Access to performance monitoring and observability operations is limited.
>>>>> Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
>>>>> access to performance monitoring and observability operations for processes
>>>>> without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
>>>>> More information can be found at 'Perf events and tool security' document:
>>>>> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
>>>>> perf_event_paranoid setting is 2:
>>>>>   -1: Allow use of (almost) all events by all users
>>>>>       Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>>>>>> = 0: Disallow raw and ftrace function tracepoint access
>>>>>> = 1: Disallow CPU event access
>>>>>> = 2: Disallow kernel profiling
>>>>> To make the adjusted perf_event_paranoid setting permanent preserve it
>>>>> in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
>>>>> ```
>>>>>
> >>>>> This patch adds skippable evsels that, when they fail to open, won't
> >>>>> cause perf stat to fail and won't appear in the output. The TopdownL1
> >>>>> events, from the metric group, are marked as skippable. This turns the
> >>>>> failure above into:
>>>>>
>>>>> ```
>>>>> $ perf stat true
>>>>>
>>>>>  Performance counter stats for 'true':
>>>>>
>>>>>               1.26 msec task-clock:u                     #    0.328 CPUs utilized
>>>>>                  0      context-switches:u               #    0.000 /sec
>>>>>                  0      cpu-migrations:u                 #    0.000 /sec
>>>>>                 49      page-faults:u                    #   38.930 K/sec
>>>>>            176,449      cycles:u                         #    0.140 GHz                         (48.99%)
>>>>
>>>> Multiplexing?
>>>>
>>>> Thanks,
>>>> Kan
>>>
>>> I may have been running a test in the background; otherwise I can't
>>> explain it. Repeating the test yields no multiplexing:
>>
>>
>> The multiplexing above should be from a Skylake (since no TopdownL1 is
>> printed), but the test you repeated seems to be on a Tigerlake (it has
>> TopdownL1). Could you please double-check on a Skylake?
> 
> In the best circumstances (ie no EBS_Mode, no other events, nmi
> watchdog disabled) Skylake has multiplexing for TopdownL1:
>

No, I mean the perf stat default on Skylake.

With this patch, on a Skylake machine, there should be no TopdownL1
event and no multiplexing.
From your test result in the v2 description, we can see that there is no
TopdownL1, which is good and expected. However, there is a (48.99%) next
to the cycles:u event, which implies multiplexing. Could you please check
what the problem is here?
Also, if it were because of the background, all the events should be
multiplexed. But it looks like only cycles:u has multiplexing. Other
events, instructions:u, branches:u, and branch-misses:u, work without
multiplexing. That's very strange.


> ```
> $ sudo perf stat --metric-no-group -M TopdownL1 --metric-no-group -a sleep 1
> 
> Performance counter stats for 'system wide':
> 
>       500,145,019      INST_RETIRED.ANY                 #     14.2 %
> tma_retiring             (71.07%)
>     2,402,640,337      CPU_CLK_UNHALTED.THREAD_ANY      #     41.1 %
> tma_frontend_bound
>                                                  #     36.2 %
> tma_backend_bound
>                                                  #      8.4 %
> tma_bad_speculation      (85.63%)
>     1,976,769,761      IDQ_UOPS_NOT_DELIVERED.CORE
>                         (85.81%)
>       114,069,133      INT_MISC.RECOVERY_CYCLES_ANY
>                         (85.83%)
>       684,605,487      UOPS_RETIRED.RETIRE_SLOTS
>                         (85.83%)
>        49,695,823      UOPS_RETIRED.MACRO_FUSED
>                         (85.83%)
>       860,603,122      UOPS_ISSUED.ANY
>                         (56.70%)
> 
>       1.014011174 seconds time elapsed
> ```
> 
> but this isn't a regression:
> https://lore.kernel.org/lkml/20200520072814.128267-1-irogers@google.com/
> 
> I think this is off-topic for this change.
> 
> Thanks,
> Ian
> 
>> Thanks,
>> Kan
>>>
>>> ```
>>> $ perf stat true
>>>
>>> Performance counter stats for 'true':
>>>
>>>              0.78 msec task-clock:u                     #    0.383
>>> CPUs utilized
>>>                 0      context-switches:u               #    0.000
>>> /sec
>>>                 0      cpu-migrations:u                 #    0.000
>>> /sec
>>>                47      page-faults:u                    #   60.174
>>> K/sec
>>>           233,420      cycles:u                         #    0.299 GHz
>>>           133,318      instructions:u                   #    0.57
>>> insn per cycle
>>>            31,396      branches:u                       #   40.196
>>> M/sec
>>>             2,334      branch-misses:u                  #    7.43% of
>>> all branches
>>>         1,167,100      TOPDOWN.SLOTS:u                  #     12.2 %
>>> tma_retiring
>>>                                                  #     28.9 %
>>> tma_backend_bound
>>>                                                  #     41.0 %
>>> tma_frontend_bound
>>>                                                  #     18.0 %
>>> tma_bad_speculation
>>>           141,882      topdown-retiring:u
>>>           480,570      topdown-fe-bound:u
>>>           320,380      topdown-be-bound:u
>>>           224,266      topdown-bad-spec:u
>>>             2,173      INT_MISC.UOP_DROPPING:u          #    2.782
>>> M/sec
>>>             3,323      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #
>>>  4.254 M/sec
>>>
>>>
>>>       0.002036744 seconds time elapsed
>>>
>>>       0.002252000 seconds user
>>>       0.000000000 seconds sys
>>> ```
>>>

This is apparently on a TigerLake, which has TopdownL1 with the perf
stat default. I don't have a problem with this.

Thanks,
Kan

>>> Thanks,
>>> Ian
>>>
>>>>>            122,905      instructions:u                   #    0.70  insn per cycle
>>>>>             28,264      branches:u                       #   22.456 M/sec
>>>>>              2,405      branch-misses:u                  #    8.51% of all branches
>>>>>
>>>>>        0.003834565 seconds time elapsed
>>>>>
>>>>>        0.000000000 seconds user
>>>>>        0.004130000 seconds sys
>>>>> ```
>>>>>
>>>>> When the events can have kernel/hypervisor disabled, like on
>>>>> Tigerlake, then it continues to succeed as:
>>>>>
>>>>> ```
>>>>> $ perf stat true
>>>>>
>>>>>  Performance counter stats for 'true':
>>>>>
>>>>>               0.57 msec task-clock:u                     #    0.385 CPUs utilized
>>>>>                  0      context-switches:u               #    0.000 /sec
>>>>>                  0      cpu-migrations:u                 #    0.000 /sec
>>>>>                 47      page-faults:u                    #   82.329 K/sec
>>>>>            287,017      cycles:u                         #    0.503 GHz
>>>>>            133,318      instructions:u                   #    0.46  insn per cycle
>>>>>             31,396      branches:u                       #   54.996 M/sec
>>>>>              2,442      branch-misses:u                  #    7.78% of all branches
>>>>>            998,790      TOPDOWN.SLOTS:u                  #     14.5 %  tma_retiring
>>>>>                                                   #     27.6 %  tma_backend_bound
>>>>>                                                   #     40.9 %  tma_frontend_bound
>>>>>                                                   #     17.0 %  tma_bad_speculation
>>>>>            144,922      topdown-retiring:u
>>>>>            411,266      topdown-fe-bound:u
>>>>>            258,510      topdown-be-bound:u
>>>>>            184,090      topdown-bad-spec:u
>>>>>              2,585      INT_MISC.UOP_DROPPING:u          #    4.528 M/sec
>>>>>              3,434      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #    6.015 M/sec
>>>>>
>>>>>        0.001480954 seconds time elapsed
>>>>>
>>>>>        0.000000000 seconds user
>>>>>        0.001686000 seconds sys
>>>>> ```
>>>>>
>>>>> And this likewise works if paranoia allows or running as root.
>>>>>
>>>>> v2. Don't display the skipped events as <not counted> or <not supported>.
>>>>>
>>>>> Signed-off-by: Ian Rogers <irogers@google.com>
>>>>> ---
>>>>>  tools/perf/builtin-stat.c      | 39 ++++++++++++++++++++++++++--------
>>>>>  tools/perf/util/evsel.c        | 15 +++++++++++--
>>>>>  tools/perf/util/evsel.h        |  1 +
>>>>>  tools/perf/util/stat-display.c |  4 ++++
>>>>>  4 files changed, 48 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>>>>> index d3cbee7460fc..7a641a67486d 100644
>>>>> --- a/tools/perf/builtin-stat.c
>>>>> +++ b/tools/perf/builtin-stat.c
>>>>> @@ -667,6 +667,13 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
>>>>>                       evsel_list->core.threads->err_thread = -1;
>>>>>                       return COUNTER_RETRY;
>>>>>               }
>>>>> +     } else if (counter->skippable) {
>>>>> +             if (verbose > 0)
>>>>> +                     ui__warning("skipping event %s that kernel failed to open .\n",
>>>>> +                                 evsel__name(counter));
>>>>> +             counter->supported = false;
>>>>> +             counter->errored = true;
>>>>> +             return COUNTER_SKIP;
>>>>>       }
>>>>>
>>>>>       evsel__open_strerror(counter, &target, errno, msg, sizeof(msg));
>>>>> @@ -1885,15 +1892,29 @@ static int add_default_attributes(void)
>>>>>                * Add TopdownL1 metrics if they exist. To minimize
>>>>>                * multiplexing, don't request threshold computation.
>>>>>                */
>>>>> -             if (metricgroup__has_metric("TopdownL1") &&
>>>>> -                 metricgroup__parse_groups(evsel_list, "TopdownL1",
>>>>> -                                         /*metric_no_group=*/false,
>>>>> -                                         /*metric_no_merge=*/false,
>>>>> -                                         /*metric_no_threshold=*/true,
>>>>> -                                         stat_config.user_requested_cpu_list,
>>>>> -                                         stat_config.system_wide,
>>>>> -                                         &stat_config.metric_events) < 0)
>>>>> -                     return -1;
>>>>> +             if (metricgroup__has_metric("TopdownL1")) {
>>>>> +                     struct evlist *metric_evlist = evlist__new();
>>>>> +                     struct evsel *metric_evsel;
>>>>> +
>>>>> +                     if (!metric_evlist)
>>>>> +                             return -1;
>>>>> +
>>>>> +                     if (metricgroup__parse_groups(metric_evlist, "TopdownL1",
>>>>> +                                                     /*metric_no_group=*/false,
>>>>> +                                                     /*metric_no_merge=*/false,
>>>>> +                                                     /*metric_no_threshold=*/true,
>>>>> +                                                     stat_config.user_requested_cpu_list,
>>>>> +                                                     stat_config.system_wide,
>>>>> +                                                     &stat_config.metric_events) < 0)
>>>>> +                             return -1;
>>>>> +
>>>>> +                     evlist__for_each_entry(metric_evlist, metric_evsel) {
>>>>> +                             metric_evsel->skippable = true;
>>>>> +                     }
>>>>> +                     evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
>>>>> +                     evlist__delete(metric_evlist);
>>>>> +             }
>>>>> +
>>>>>               /* Platform specific attrs */
>>>>>               if (evlist__add_default_attrs(evsel_list, default_null_attrs) < 0)
>>>>>                       return -1;
>>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
>>>>> index a85a987128aa..83a65f771666 100644
>>>>> --- a/tools/perf/util/evsel.c
>>>>> +++ b/tools/perf/util/evsel.c
>>>>> @@ -290,6 +290,7 @@ void evsel__init(struct evsel *evsel,
>>>>>       evsel->per_pkg_mask  = NULL;
>>>>>       evsel->collect_stat  = false;
>>>>>       evsel->pmu_name      = NULL;
>>>>> +     evsel->skippable     = false;
>>>>>  }
>>>>>
>>>>>  struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
>>>>> @@ -1720,9 +1721,13 @@ static int get_group_fd(struct evsel *evsel, int cpu_map_idx, int thread)
>>>>>               return -1;
>>>>>
>>>>>       fd = FD(leader, cpu_map_idx, thread);
>>>>> -     BUG_ON(fd == -1);
>>>>> +     BUG_ON(fd == -1 && !leader->skippable);
>>>>>
>>>>> -     return fd;
>>>>> +     /*
>>>>> +      * When the leader has been skipped, return -2 to distinguish from no
>>>>> +      * group leader case.
>>>>> +      */
>>>>> +     return fd == -1 ? -2 : fd;
>>>>>  }
>>>>>
>>>>>  static void evsel__remove_fd(struct evsel *pos, int nr_cpus, int nr_threads, int thread_idx)
>>>>> @@ -2104,6 +2109,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
>>>>>
>>>>>                       group_fd = get_group_fd(evsel, idx, thread);
>>>>>
>>>>> +                     if (group_fd == -2) {
>>>>> +                             pr_debug("broken group leader for %s\n", evsel->name);
>>>>> +                             err = -EINVAL;
>>>>> +                             goto out_close;
>>>>> +                     }
>>>>> +
>>>>>                       test_attr__ready();
>>>>>
>>>>>                       /* Debug message used by test scripts */
>>>>> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
>>>>> index 68072ec655ce..98afe3351176 100644
>>>>> --- a/tools/perf/util/evsel.h
>>>>> +++ b/tools/perf/util/evsel.h
>>>>> @@ -95,6 +95,7 @@ struct evsel {
>>>>>               bool                    weak_group;
>>>>>               bool                    bpf_counter;
>>>>>               bool                    use_config_name;
>>>>> +             bool                    skippable;
>>>>>               int                     bpf_fd;
>>>>>               struct bpf_object       *bpf_obj;
>>>>>               struct list_head        config_terms;
>>>>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
>>>>> index e6035ecbeee8..6b46bbb3d322 100644
>>>>> --- a/tools/perf/util/stat-display.c
>>>>> +++ b/tools/perf/util/stat-display.c
>>>>> @@ -810,6 +810,10 @@ static bool should_skip_zero_counter(struct perf_stat_config *config,
>>>>>       struct perf_cpu cpu;
>>>>>       int idx;
>>>>>
>>>>> +     /* Skip counters that were speculatively/default enabled rather than requested. */
>>>>> +     if (counter->skippable)
>>>>> +             return true;
>>>>> +
>>>>>       /*
>>>>>        * Skip value 0 when enabling --per-thread globally,
>>>>>        * otherwise it will have too many 0 output.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-17 17:31         ` Liang, Kan
@ 2023-04-17 18:13           ` Ian Rogers
  2023-04-18 13:03             ` Liang, Kan
  0 siblings, 1 reply; 29+ messages in thread
From: Ian Rogers @ 2023-04-17 18:13 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel

On Mon, Apr 17, 2023 at 10:31 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-04-17 11:59 a.m., Ian Rogers wrote:
> > On Mon, Apr 17, 2023 at 6:58 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>
> >>
> >>
> >> On 2023-04-14 7:03 p.m., Ian Rogers wrote:
> >>> On Fri, Apr 14, 2023 at 11:07 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2023-04-14 1:19 a.m., Ian Rogers wrote:
> >>>>> Perf stat with no arguments will use default events and metrics. These
> >>>>> events may fail to open even with kernel and hypervisor disabled. When
> >>>>> these fail then the permissions error appears even though they were
> >>>>> implicitly selected. This is particularly a problem with the automatic
> >>>>> selection of the TopdownL1 metric group on certain architectures like
> >>>>> Skylake:
> >>>>>
> >>>>> ```
> >>>>> $ perf stat true
> >>>>> Error:
> >>>>> Access to performance monitoring and observability operations is limited.
> >>>>> Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
> >>>>> access to performance monitoring and observability operations for processes
> >>>>> without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
> >>>>> More information can be found at 'Perf events and tool security' document:
> >>>>> https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
> >>>>> perf_event_paranoid setting is 2:
> >>>>>   -1: Allow use of (almost) all events by all users
> >>>>>       Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
> >>>>> >= 0: Disallow raw and ftrace function tracepoint access
> >>>>> >= 1: Disallow CPU event access
> >>>>> >= 2: Disallow kernel profiling
> >>>>> To make the adjusted perf_event_paranoid setting permanent preserve it
> >>>>> in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
> >>>>> ```
> >>>>>
> >>>>> This patch adds skippable evsels that when they fail to open won't
> >>>>> fail and won't appear in output. The TopdownL1 events, from the metric
> >>>>> group, are marked as skippable. This turns the failure above to:
> >>>>>
> >>>>> ```
> >>>>> $ perf stat true
> >>>>>
> >>>>>  Performance counter stats for 'true':
> >>>>>
> >>>>>               1.26 msec task-clock:u                     #    0.328 CPUs utilized
> >>>>>                  0      context-switches:u               #    0.000 /sec
> >>>>>                  0      cpu-migrations:u                 #    0.000 /sec
> >>>>>                 49      page-faults:u                    #   38.930 K/sec
> >>>>>            176,449      cycles:u                         #    0.140 GHz                         (48.99%)
> >>>>
> >>>> Multiplexing?
> >>>>
> >>>> Thanks,
> >>>> Kan
> >>>
> >>> I may have been running a test in the background; otherwise I can't
> >>> explain it. Repeating the test yields no multiplexing:
> >>
> >>
> >> The above multiplexing should be on a Skylake (since there is no
> >> topdownL1 printed), but the test which you repeat seems on a Tigerlake
> >> (has topdownL1). Could you please double check on a Skylake?
> >
> > In the best circumstances (i.e. no EBS_Mode, no other events, NMI
> > watchdog disabled) Skylake has multiplexing for TopdownL1:
> >
>
> No, I mean the perf stat default on Skylake.
>
> With this patch, on a Skylake machine, there should be no TopdownL1
> event and no multiplexing.

There are no top down events in:
```
$ perf stat true

 Performance counter stats for 'true':

              1.26 msec task-clock:u                     #    0.328
CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                49      page-faults:u                    #   38.930 K/sec
           176,449      cycles:u                         #    0.140
GHz                         (48.99%)
           122,905      instructions:u                   #    0.70
insn per cycle
            28,264      branches:u                       #   22.456 M/sec
             2,405      branch-misses:u                  #    8.51% of
all branches

       0.003834565 seconds time elapsed

       0.000000000 seconds user
       0.004130000 seconds sys
```
The json TopdownL1 metric group, when present, is enabled
unconditionally for the perf stat default. Enabling it on Skylake
causes multiplexing because TopdownL1 on Skylake multiplexes
regardless of this change - at least on the machine I was testing on.
We could remove the metric group TopdownL1 on Skylake so that we don't
enable it by default; the group TmaL1 would still exist. To me,
disabling TopdownL1 seems less desirable than running with
multiplexing - previously, getting into topdown analysis required
knowing that "perf stat -M TopdownL1" is the way to do it.

This doesn't relate to this change, which is about making it so that
failing to set up TopdownL1 doesn't cause an early exit. The reason I
showed TigerLake output was that on TigerLake the skip/fallback
approach works, while Skylake just needs the events disabled/skipped
unless there are sufficient permissions. Note the :u on the events in:

```
$ perf stat true

 Performance counter stats for 'true':

              0.57 msec task-clock:u                     #    0.385
CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                47      page-faults:u                    #   82.329 K/sec
           287,017      cycles:u                         #    0.503 GHz
           133,318      instructions:u                   #    0.46
insn per cycle
            31,396      branches:u                       #   54.996 M/sec
             2,442      branch-misses:u                  #    7.78% of
all branches
           998,790      TOPDOWN.SLOTS:u                  #     14.5 %
tma_retiring
                                                  #     27.6 %
tma_backend_bound
                                                  #     40.9 %
tma_frontend_bound
                                                  #     17.0 %
tma_bad_speculation
           144,922      topdown-retiring:u
           411,266      topdown-fe-bound:u
           258,510      topdown-be-bound:u
           184,090      topdown-bad-spec:u
             2,585      INT_MISC.UOP_DROPPING:u          #    4.528 M/sec
             3,434      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #
  6.015 M/sec

       0.001480954 seconds time elapsed

       0.000000000 seconds user
       0.001686000 seconds sys
```

> From your test result in the v2 description, we can see that there is no
> TopdownL1, which is good and expected. However, there is a (48.99%) with
> cycles:u event, which implies multiplexing. Could you please check
> what the problem is here?
> Also, if it were because of background load, all of the events should
> be multiplexed. But it looks like only cycles:u has multiplexing; the
> other events, instructions:u, branches:u and branch-misses:u, work
> without multiplexing. That's very strange.

I wasn't able to reproduce it and suspect it was a transient thing. I
think there are multiplexing things to look into: two events on a
fixed counter on Icelake+ will trigger multiplexing on all counters,
and Skylake's three fixed and four generic counters should fit
TopdownL1.

Thanks,
Ian

> > ```
> > $ sudo perf stat --metric-no-group -M TopdownL1 --metric-no-group -a sleep 1
> >
> > Performance counter stats for 'system wide':
> >
> >       500,145,019      INST_RETIRED.ANY                 #     14.2 %
> > tma_retiring             (71.07%)
> >     2,402,640,337      CPU_CLK_UNHALTED.THREAD_ANY      #     41.1 %
> > tma_frontend_bound
> >                                                  #     36.2 %
> > tma_backend_bound
> >                                                  #      8.4 %
> > tma_bad_speculation      (85.63%)
> >     1,976,769,761      IDQ_UOPS_NOT_DELIVERED.CORE
> >                         (85.81%)
> >       114,069,133      INT_MISC.RECOVERY_CYCLES_ANY
> >                         (85.83%)
> >       684,605,487      UOPS_RETIRED.RETIRE_SLOTS
> >                         (85.83%)
> >        49,695,823      UOPS_RETIRED.MACRO_FUSED
> >                         (85.83%)
> >       860,603,122      UOPS_ISSUED.ANY
> >                         (56.70%)
> >
> >       1.014011174 seconds time elapsed
> > ```
> >
> > but this isn't a regression:
> > https://lore.kernel.org/lkml/20200520072814.128267-1-irogers@google.com/
> >
> > I think this is off-topic for this change.
> >
> > Thanks,
> > Ian
> >
> >> Thanks,
> >> Kan
> >>>
> >>> ```
> >>> $ perf stat true
> >>>
> >>> Performance counter stats for 'true':
> >>>
> >>>              0.78 msec task-clock:u                     #    0.383
> >>> CPUs utilized
> >>>                 0      context-switches:u               #    0.000
> >>> /sec
> >>>                 0      cpu-migrations:u                 #    0.000
> >>> /sec
> >>>                47      page-faults:u                    #   60.174
> >>> K/sec
> >>>           233,420      cycles:u                         #    0.299 GHz
> >>>           133,318      instructions:u                   #    0.57
> >>> insn per cycle
> >>>            31,396      branches:u                       #   40.196
> >>> M/sec
> >>>             2,334      branch-misses:u                  #    7.43% of
> >>> all branches
> >>>         1,167,100      TOPDOWN.SLOTS:u                  #     12.2 %
> >>> tma_retiring
> >>>                                                  #     28.9 %
> >>> tma_backend_bound
> >>>                                                  #     41.0 %
> >>> tma_frontend_bound
> >>>                                                  #     18.0 %
> >>> tma_bad_speculation
> >>>           141,882      topdown-retiring:u
> >>>           480,570      topdown-fe-bound:u
> >>>           320,380      topdown-be-bound:u
> >>>           224,266      topdown-bad-spec:u
> >>>             2,173      INT_MISC.UOP_DROPPING:u          #    2.782
> >>> M/sec
> >>>             3,323      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #
> >>>  4.254 M/sec
> >>>
> >>>
> >>>       0.002036744 seconds time elapsed
> >>>
> >>>       0.002252000 seconds user
> >>>       0.000000000 seconds sys
> >>> ```
> >>>
>
> This is apparently on a TigerLake, which has TopdownL1 with the perf
> stat default. I don't have a problem with this.
>
> Thanks,
> Kan
>
> >>> Thanks,
> >>> Ian
> >>>
> >>>>>            122,905      instructions:u                   #    0.70  insn per cycle
> >>>>>             28,264      branches:u                       #   22.456 M/sec
> >>>>>              2,405      branch-misses:u                  #    8.51% of all branches
> >>>>>
> >>>>>        0.003834565 seconds time elapsed
> >>>>>
> >>>>>        0.000000000 seconds user
> >>>>>        0.004130000 seconds sys
> >>>>> ```
> >>>>>
> >>>>> When the events can have kernel/hypervisor disabled, like on
> >>>>> Tigerlake, then it continues to succeed as:
> >>>>>
> >>>>> ```
> >>>>> $ perf stat true
> >>>>>
> >>>>>  Performance counter stats for 'true':
> >>>>>
> >>>>>               0.57 msec task-clock:u                     #    0.385 CPUs utilized
> >>>>>                  0      context-switches:u               #    0.000 /sec
> >>>>>                  0      cpu-migrations:u                 #    0.000 /sec
> >>>>>                 47      page-faults:u                    #   82.329 K/sec
> >>>>>            287,017      cycles:u                         #    0.503 GHz
> >>>>>            133,318      instructions:u                   #    0.46  insn per cycle
> >>>>>             31,396      branches:u                       #   54.996 M/sec
> >>>>>              2,442      branch-misses:u                  #    7.78% of all branches
> >>>>>            998,790      TOPDOWN.SLOTS:u                  #     14.5 %  tma_retiring
> >>>>>                                                   #     27.6 %  tma_backend_bound
> >>>>>                                                   #     40.9 %  tma_frontend_bound
> >>>>>                                                   #     17.0 %  tma_bad_speculation
> >>>>>            144,922      topdown-retiring:u
> >>>>>            411,266      topdown-fe-bound:u
> >>>>>            258,510      topdown-be-bound:u
> >>>>>            184,090      topdown-bad-spec:u
> >>>>>              2,585      INT_MISC.UOP_DROPPING:u          #    4.528 M/sec
> >>>>>              3,434      cpu/INT_MISC.RECOVERY_CYCLES,cmask=1,edge/u #    6.015 M/sec
> >>>>>
> >>>>>        0.001480954 seconds time elapsed
> >>>>>
> >>>>>        0.000000000 seconds user
> >>>>>        0.001686000 seconds sys
> >>>>> ```
> >>>>>
> >>>>> And this likewise works if paranoia allows or running as root.
> >>>>>
> >>>>> v2. Don't display the skipped events as <not counted> or <not supported>.
> >>>>>
> >>>>> Signed-off-by: Ian Rogers <irogers@google.com>
> >>>>> ---
> >>>>>  tools/perf/builtin-stat.c      | 39 ++++++++++++++++++++++++++--------
> >>>>>  tools/perf/util/evsel.c        | 15 +++++++++++--
> >>>>>  tools/perf/util/evsel.h        |  1 +
> >>>>>  tools/perf/util/stat-display.c |  4 ++++
> >>>>>  4 files changed, 48 insertions(+), 11 deletions(-)
> >>>>>
> >>>>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> >>>>> index d3cbee7460fc..7a641a67486d 100644
> >>>>> --- a/tools/perf/builtin-stat.c
> >>>>> +++ b/tools/perf/builtin-stat.c
> >>>>> @@ -667,6 +667,13 @@ static enum counter_recovery stat_handle_error(struct evsel *counter)
> >>>>>                       evsel_list->core.threads->err_thread = -1;
> >>>>>                       return COUNTER_RETRY;
> >>>>>               }
> >>>>> +     } else if (counter->skippable) {
> >>>>> +             if (verbose > 0)
> >>>>> +                     ui__warning("skipping event %s that kernel failed to open .\n",
> >>>>> +                                 evsel__name(counter));
> >>>>> +             counter->supported = false;
> >>>>> +             counter->errored = true;
> >>>>> +             return COUNTER_SKIP;
> >>>>>       }
> >>>>>
> >>>>>       evsel__open_strerror(counter, &target, errno, msg, sizeof(msg));
> >>>>> @@ -1885,15 +1892,29 @@ static int add_default_attributes(void)
> >>>>>                * Add TopdownL1 metrics if they exist. To minimize
> >>>>>                * multiplexing, don't request threshold computation.
> >>>>>                */
> >>>>> -             if (metricgroup__has_metric("TopdownL1") &&
> >>>>> -                 metricgroup__parse_groups(evsel_list, "TopdownL1",
> >>>>> -                                         /*metric_no_group=*/false,
> >>>>> -                                         /*metric_no_merge=*/false,
> >>>>> -                                         /*metric_no_threshold=*/true,
> >>>>> -                                         stat_config.user_requested_cpu_list,
> >>>>> -                                         stat_config.system_wide,
> >>>>> -                                         &stat_config.metric_events) < 0)
> >>>>> -                     return -1;
> >>>>> +             if (metricgroup__has_metric("TopdownL1")) {
> >>>>> +                     struct evlist *metric_evlist = evlist__new();
> >>>>> +                     struct evsel *metric_evsel;
> >>>>> +
> >>>>> +                     if (!metric_evlist)
> >>>>> +                             return -1;
> >>>>> +
> >>>>> +                     if (metricgroup__parse_groups(metric_evlist, "TopdownL1",
> >>>>> +                                                     /*metric_no_group=*/false,
> >>>>> +                                                     /*metric_no_merge=*/false,
> >>>>> +                                                     /*metric_no_threshold=*/true,
> >>>>> +                                                     stat_config.user_requested_cpu_list,
> >>>>> +                                                     stat_config.system_wide,
> >>>>> +                                                     &stat_config.metric_events) < 0)
> >>>>> +                             return -1;
> >>>>> +
> >>>>> +                     evlist__for_each_entry(metric_evlist, metric_evsel) {
> >>>>> +                             metric_evsel->skippable = true;
> >>>>> +                     }
> >>>>> +                     evlist__splice_list_tail(evsel_list, &metric_evlist->core.entries);
> >>>>> +                     evlist__delete(metric_evlist);
> >>>>> +             }
> >>>>> +
> >>>>>               /* Platform specific attrs */
> >>>>>               if (evlist__add_default_attrs(evsel_list, default_null_attrs) < 0)
> >>>>>                       return -1;
> >>>>> diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
> >>>>> index a85a987128aa..83a65f771666 100644
> >>>>> --- a/tools/perf/util/evsel.c
> >>>>> +++ b/tools/perf/util/evsel.c
> >>>>> @@ -290,6 +290,7 @@ void evsel__init(struct evsel *evsel,
> >>>>>       evsel->per_pkg_mask  = NULL;
> >>>>>       evsel->collect_stat  = false;
> >>>>>       evsel->pmu_name      = NULL;
> >>>>> +     evsel->skippable     = false;
> >>>>>  }
> >>>>>
> >>>>>  struct evsel *evsel__new_idx(struct perf_event_attr *attr, int idx)
> >>>>> @@ -1720,9 +1721,13 @@ static int get_group_fd(struct evsel *evsel, int cpu_map_idx, int thread)
> >>>>>               return -1;
> >>>>>
> >>>>>       fd = FD(leader, cpu_map_idx, thread);
> >>>>> -     BUG_ON(fd == -1);
> >>>>> +     BUG_ON(fd == -1 && !leader->skippable);
> >>>>>
> >>>>> -     return fd;
> >>>>> +     /*
> >>>>> +      * When the leader has been skipped, return -2 to distinguish from no
> >>>>> +      * group leader case.
> >>>>> +      */
> >>>>> +     return fd == -1 ? -2 : fd;
> >>>>>  }
> >>>>>
> >>>>>  static void evsel__remove_fd(struct evsel *pos, int nr_cpus, int nr_threads, int thread_idx)
> >>>>> @@ -2104,6 +2109,12 @@ static int evsel__open_cpu(struct evsel *evsel, struct perf_cpu_map *cpus,
> >>>>>
> >>>>>                       group_fd = get_group_fd(evsel, idx, thread);
> >>>>>
> >>>>> +                     if (group_fd == -2) {
> >>>>> +                             pr_debug("broken group leader for %s\n", evsel->name);
> >>>>> +                             err = -EINVAL;
> >>>>> +                             goto out_close;
> >>>>> +                     }
> >>>>> +
> >>>>>                       test_attr__ready();
> >>>>>
> >>>>>                       /* Debug message used by test scripts */
> >>>>> diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
> >>>>> index 68072ec655ce..98afe3351176 100644
> >>>>> --- a/tools/perf/util/evsel.h
> >>>>> +++ b/tools/perf/util/evsel.h
> >>>>> @@ -95,6 +95,7 @@ struct evsel {
> >>>>>               bool                    weak_group;
> >>>>>               bool                    bpf_counter;
> >>>>>               bool                    use_config_name;
> >>>>> +             bool                    skippable;
> >>>>>               int                     bpf_fd;
> >>>>>               struct bpf_object       *bpf_obj;
> >>>>>               struct list_head        config_terms;
> >>>>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> >>>>> index e6035ecbeee8..6b46bbb3d322 100644
> >>>>> --- a/tools/perf/util/stat-display.c
> >>>>> +++ b/tools/perf/util/stat-display.c
> >>>>> @@ -810,6 +810,10 @@ static bool should_skip_zero_counter(struct perf_stat_config *config,
> >>>>>       struct perf_cpu cpu;
> >>>>>       int idx;
> >>>>>
> >>>>> +     /* Skip counters that were speculatively/default enabled rather than requested. */
> >>>>> +     if (counter->skippable)
> >>>>> +             return true;
> >>>>> +
> >>>>>       /*
> >>>>>        * Skip value 0 when enabling --per-thread globally,
> >>>>>        * otherwise it will have too many 0 output.
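To make the convention in the hunks above easier to follow, here is a standalone
sketch of the -2 sentinel logic (simplified stand-in types and helpers, not the
actual tools/perf code, which threads this through evsel__open_cpu()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for struct evsel: only the fields the sketch needs. */
struct evsel {
	const char *name;
	bool skippable;   /* event was added by default, not user-requested */
	int fd;           /* -1 if perf_event_open() failed for the leader */
};

/*
 * Mirror of the patched get_group_fd() logic: a missing fd is only legal
 * for a skippable leader, and is reported as -2 so callers can tell
 * "leader was skipped" apart from "no group leader" (-1).
 */
static int get_group_fd(const struct evsel *leader)
{
	int fd = leader->fd;

	assert(fd != -1 || leader->skippable);
	return fd == -1 ? -2 : fd;
}

/* Caller side, as in evsel__open_cpu(): a skipped leader breaks the group. */
static int open_group_member(const struct evsel *leader)
{
	int group_fd = get_group_fd(leader);

	if (group_fd == -2) {
		fprintf(stderr, "broken group leader for %s\n", leader->name);
		return -22; /* -EINVAL */
	}
	return 0;
}
```

The point of the two-value encoding is that -1 already means "open this event
without a group leader", so the skipped-leader case needs a distinct sentinel.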

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-17 18:13           ` Ian Rogers
@ 2023-04-18 13:03             ` Liang, Kan
  2023-04-18 15:43               ` Ian Rogers
  0 siblings, 1 reply; 29+ messages in thread
From: Liang, Kan @ 2023-04-18 13:03 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel



On 2023-04-17 2:13 p.m., Ian Rogers wrote:
> The json TopdownL1 is enabled if present unconditionally for perf stat
> default. Enabling it on Skylake has multiplexing as TopdownL1 on
> Skylake has multiplexing unrelated to this change - at least on the
> machine I was testing on. We can remove the metric group TopdownL1 on
> Skylake so that we don't enable it by default, there is still the
> group TmaL1. To me, disabling TopdownL1 seems less desirable than
> running with multiplexing - previously to get into topdown analysis
> there has to be knowledge that "perf stat -M TopdownL1" is the way to
> do this.

To be honest, I don't think it's a good idea to remove the TopdownL1. We
cannot remove it just because the new way cannot handle it. The perf
stat default worked well until 6.3-rc7; this is a regression in the
current perf-tools-next.

But I'm OK with adding flags to the metrics to help the perf tool
handle this case specially, if you prefer to modify the event list.

> 
> This doesn't relate to this change which is about making it so that
> failing to set up TopdownL1 doesn't cause an early exit. The reason I
> showed TigerLake output was that on TigerLake the skip/fallback
> approach works while Skylake just needs the events disabled/skipped
> unless it has sufficient permissions. Note the :u on the events in:

perf_event_open() is good at detecting insufficient permissions, but it
cannot detect the existence of a specific event. That's because the
kernel only checks for features, not specific events, so relying on the
result of perf_event_open() is not reliable here.
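Concretely, the ambiguity looks something like this (a sketch only; which errno
a given kernel returns for a bogus event encoding is not guaranteed, which is
exactly the problem):

```c
#include <errno.h>

/*
 * Sketch of what a perf_event_open() result can and cannot tell the
 * tool. Permission problems map to well-defined errnos, but a
 * nonexistent event encoding may still open successfully, because the
 * kernel validates features, not specific event codes.
 */
static const char *classify_open_result(int fd, int err)
{
	if (fd >= 0)
		return "opened (event may still count nothing)";
	switch (err) {
	case EACCES:
	case EPERM:
		return "insufficient permission (perf_event_paranoid)";
	case ENOENT:
	case ENODEV:
		return "PMU/event type not supported";
	default:
		return "other failure";
	}
}
```

So a successful open proves the feature set, not that the requested event
exists on this CPU.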


>> From your test result in the v2 description, we can see that there is no
>> TopdownL1, which is good and expected. However, there is a (48.99%) with
>> cycles:u event, which implies multiplexing. Could you please check
>> what's the problem here?
>> Also, if it's because of the background, all the events should be
>> multiplexing. But it looks like only cycles:u has multiplexing. Other
>> events, instructions:u, branches:u and branch-misses:u work without
>> multiplexing. That's very strange.
> I wasn't able to reproduce it and suspect it was a transient thing. I
> think there are multiplexing things to look into, 2 events on a fixed
> counter on Icelake+ will trigger multiplexing on all counters, and
> Skylake's 3 fixed and 4 generic should fit TopdownL1.

Just found a Cascade Lake. With this patch + the current
perf-tools-next, part of the TopdownL1 events and multiplexing can
still be observed.

$ sudo ./perf stat true

 Performance counter stats for 'true':

              2.91 msec task-clock                       #    0.316 CPUs utilized
                 0      context-switches                 #    0.000 /sec
                 0      cpu-migrations                   #    0.000 /sec
                45      page-faults                      #   15.474 K/sec
         2,819,972      cycles                           #    0.970 GHz                      (60.14%)
         5,391,406      instructions                     #    1.91  insn per cycle
         1,068,575      branches                         #  367.442 M/sec
             8,455      branch-misses                    #    0.79% of all branches
            70,283      CPU_CLK_UNHALTED.REF_XCLK        #   24.168 M/sec
            48,806      INT_MISC.RECOVERY_CYCLES_ANY     #   16.783 M/sec                    (39.86%)

       0.009204517 seconds time elapsed

       0.000000000 seconds user
       0.009614000 seconds sys


Thanks,
Kan


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-18 13:03             ` Liang, Kan
@ 2023-04-18 15:43               ` Ian Rogers
  2023-04-18 18:19                 ` Liang, Kan
  0 siblings, 1 reply; 29+ messages in thread
From: Ian Rogers @ 2023-04-18 15:43 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel

On Tue, Apr 18, 2023 at 6:03 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-04-17 2:13 p.m., Ian Rogers wrote:
> > The json TopdownL1 is enabled if present unconditionally for perf stat
> > default. Enabling it on Skylake has multiplexing as TopdownL1 on
> > Skylake has multiplexing unrelated to this change - at least on the
> > machine I was testing on. We can remove the metric group TopdownL1 on
> > Skylake so that we don't enable it by default, there is still the
> > group TmaL1. To me, disabling TopdownL1 seems less desirable than
> > running with multiplexing - previously to get into topdown analysis
> > there has to be knowledge that "perf stat -M TopdownL1" is the way to
> > do this.
>
> To be honest, I don't think it's a good idea to remove the TopdownL1. We
> cannot remove it just because the new way cannot handle it. The perf
> stat default works well until 6.3-rc7. It's a regression issue of the
> current perf-tools-next.

I'm not so clear it is a regression to consistently add TopdownL1 for
all architectures supporting it. The assertion is that because
TopdownL1 has multiplexing and multiplexing is global then we've
lowered the accuracy of other metrics, but the only other hardware
metrics in the default output pre-Icelake are IPC and branch miss
rate. Having TopdownL1 is a way of drilling into performance issues,
while IPC and branch miss rate are putting your finger in the air to
see which way the wind is blowing. Perhaps we should drop these if
TopdownL1 is present.

> But I'm OK to add some flags with the metrics to assist the perf tool to
> specially handle the case if you prefer to modify the event list.

We've already removed thresholds from the default output, we could
remove groups.

> >
> > This doesn't relate to this change which is about making it so that
> > failing to set up TopdownL1 doesn't cause an early exit. The reason I
> > showed TigerLake output was that on TigerLake the skip/fallback
> > approach works while Skylake just needs the events disabled/skipped
> > unless it has sufficient permissions. Note the :u on the events in:
>
> perf_event_open() is good at detecting insufficient permissions, but it
> cannot detect the existence of a specific event. That's because the
> kernel only checks for features, not specific events, so relying on the
> result of perf_event_open() is not reliable here.

I'm unclear why not, as perf_event_open() not failing there seems like a
kernel bug. I can see there is little motivation to fix this on older
architectures like Skylake. We do attempt to work around it with the
errata flags on the metrics introduced here:
https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py#L1296

> >> From your test result in the v2 description, we can see that there is no
> >> TopdownL1, which is good and expected. However, there is a (48.99%) with
> >> cycles:u event, which implies multiplexing. Could you please check
> >> what's the problem here?
> >> Also, if it's because of the background, all the events should be
> >> multiplexing. But it looks like only cycles:u has multiplexing. Other
> >> events, instructions:u, branches:u and branch-misses:u work without
> >> multiplexing. That's very strange.
> > I wasn't able to reproduce it and suspect it was a transient thing. I
> > think there are multiplexing things to look into, 2 events on a fixed
> > counter on Icelake+ will trigger multiplexing on all counters, and
> > Skylake's 3 fixed and 4 generic should fit TopdownL1.
>
> Just found a cascade lake. With this patch + the current
> perf-tools-next, partial of the TopdownL1 and multiplexing can still be
> observed.
>
> $ sudo ./perf stat true
>
>  Performance counter stats for 'true':
>
>               2.91 msec task-clock                       #    0.316 CPUs utilized
>                  0      context-switches                 #    0.000 /sec
>                  0      cpu-migrations                   #    0.000 /sec
>                 45      page-faults                      #   15.474 K/sec
>          2,819,972      cycles                           #    0.970 GHz                      (60.14%)
>          5,391,406      instructions                     #    1.91  insn per cycle
>          1,068,575      branches                         #  367.442 M/sec
>              8,455      branch-misses                    #    0.79% of all branches
>             70,283      CPU_CLK_UNHALTED.REF_XCLK        #   24.168 M/sec
>             48,806      INT_MISC.RECOVERY_CYCLES_ANY     #   16.783 M/sec                    (39.86%)
>
>        0.009204517 seconds time elapsed
>
>        0.000000000 seconds user
>        0.009614000 seconds sys

The issue here is that 'true' ran very quickly, so we skipped the
output of the events with 0 counts; no metrics were computed due to
the zero counts. Cascade Lake has the same TopdownL1 multiplexing
issues as Skylake.
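The display-side decision can be sketched like this, modeled on
should_skip_zero_counter() with the hunk quoted at the top of this thread
applied (a hypothetical trimmed-down type; the real conditions are more
involved):

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down view of a counter as seen by stat-display. */
struct counter {
	bool skippable;    /* default-enabled rather than user-requested */
	bool per_thread;   /* --per-thread aggregation is active globally */
};

/*
 * Called for counters whose aggregated value is zero, this decides
 * whether the zero line is printed at all. With the patch, skippable
 * (default-enabled) counters never print a zero line, which is why the
 * TopdownL1 events vanish from the output of a near-empty run like
 * 'true'. The existing rule also hides zeroes in --per-thread mode.
 */
static bool should_skip_zero_counter(const struct counter *c)
{
	if (c->skippable)
		return true;
	return c->per_thread;
}
```

Metrics built on those zero counts are likewise not computed, so nothing
TopdownL1-related shows up for very short workloads.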

Thanks,
Ian

>
> Thanks,
> Kan


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-18 15:43               ` Ian Rogers
@ 2023-04-18 18:19                 ` Liang, Kan
  2023-04-18 20:08                   ` Ian Rogers
  0 siblings, 1 reply; 29+ messages in thread
From: Liang, Kan @ 2023-04-18 18:19 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel



On 2023-04-18 11:43 a.m., Ian Rogers wrote:
> On Tue, Apr 18, 2023 at 6:03 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>>
>> On 2023-04-17 2:13 p.m., Ian Rogers wrote:
>>> The json TopdownL1 is enabled if present unconditionally for perf stat
>>> default. Enabling it on Skylake has multiplexing as TopdownL1 on
>>> Skylake has multiplexing unrelated to this change - at least on the
>>> machine I was testing on. We can remove the metric group TopdownL1 on
>>> Skylake so that we don't enable it by default, there is still the
>>> group TmaL1. To me, disabling TopdownL1 seems less desirable than
>>> running with multiplexing - previously to get into topdown analysis
>>> there has to be knowledge that "perf stat -M TopdownL1" is the way to
>>> do this.
>>
>> To be honest, I don't think it's a good idea to remove the TopdownL1. We
>> cannot remove it just because the new way cannot handle it. The perf
>> stat default works well until 6.3-rc7. It's a regression issue of the
>> current perf-tools-next.
> 
> I'm not so clear it is a regression to consistently add TopdownL1 for
> all architectures supporting it. 


Breaking the perf stat default is a regression.

The reason we once added the TopdownL1 for ICL and later platforms is
that it doesn't break the original design (a *clean* output).

> The assertion is that because
> TopdownL1 has multiplexing and multiplexing is global then we've
> lowered the accuracy of other metrics, but the only other hardware
> metrics in the default output pre-Icelake are IPC and branch miss
> rate. Having TopdownL1 is a way of drilling into performance issues,
> while IPC and branch miss rate are putting your finger in the air to
> see which way the wind is blowing. Perhaps we should drop these if
> TopdownL1 is present.
> 
>> But I'm OK to add some flags with the metrics to assist the perf tool to
>> specially handle the case if you prefer to modify the event list.
> 
> We've already removed thresholds from the default output, we could
> remove groups.
> 
>>>
>>> This doesn't relate to this change which is about making it so that
>>> failing to set up TopdownL1 doesn't cause an early exit. The reason I
>>> showed TigerLake output was that on TigerLake the skip/fallback
>>> approach works while Skylake just needs the events disabled/skipped
>>> unless it has sufficient permissions. Note the :u on the events in:
>>
>> perf_event_open() is good at detecting insufficient permissions, but it
>> cannot detect the existence of a specific event. That's because the
>> kernel only checks for features, not specific events, so relying on the
>> result of perf_event_open() is not reliable here.
> 
> I'm unclear why not as not having perf_event_open fail seems like a
> kernel bug. I can see there is little motivation to fix this on older
> architectures like Skylake.

Updating the kernel may not be an option here, because:
- There is no issue with the existing perf tool until 6.3-rc7, so it
doesn't sound like a defect of the kernel.
- The kernels shipped for SKL have been out there for many years. It's
impossible to change all of them to support a new requirement from the
perf tool.

> We do attempt to workaround it with the
> errata flags on the metrics introduced here:
> https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py#L1296
>
If so, I would still suggest checking the slots and topdown events in
sysfs and deciding whether to append the TopdownL1 to the perf stat
default.

So we probably need an arch-specific is_event_available() before
appending the events.
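Such a probe could look roughly like this (a sketch only; the helper name
is_event_available() is taken from the suggestion above, and the event names in
topdown_l1_supported() are illustrative):

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Sketch of the suggested probe: an event alias exported by the kernel
 * appears as a file under
 * /sys/bus/event_source/devices/<pmu>/events/<event>, so its presence
 * can be tested before appending TopdownL1 to the default event list.
 */
static bool is_event_available(const char *pmu, const char *event)
{
	char path[256];

	snprintf(path, sizeof(path),
		 "/sys/bus/event_source/devices/%s/events/%s", pmu, event);
	return access(path, R_OK) == 0;
}

/* Example policy: only keep TopdownL1 if the perf-metrics events exist. */
static bool topdown_l1_supported(void)
{
	return is_event_available("cpu", "slots") &&
	       is_event_available("cpu", "topdown-retiring");
}
```

On a pre-Icelake machine the slots/topdown aliases are absent, so a check like
this would drop TopdownL1 from the default without touching the kernel.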


>>>> From your test result in the v2 description, we can see that there is no
>>>> TopdownL1, which is good and expected. However, there is a (48.99%) with
>>>> cycles:u event, which implies multiplexing. Could you please check
>>>> what's the problem here?
>>>> Also, if it's because of the background, all the events should be
>>>> multiplexing. But it looks like only cycles:u has multiplexing. Other
>>>> events, instructions:u, branches:u and branch-misses:u work without
>>>> multiplexing. That's very strange.
>>> I wasn't able to reproduce it and suspect it was a transient thing. I
>>> think there are multiplexing things to look into, 2 events on a fixed
>>> counter on Icelake+ will trigger multiplexing on all counters, and
>>> Skylake's 3 fixed and 4 generic should fit TopdownL1.
>>
>> Just found a cascade lake. With this patch + the current
>> perf-tools-next, partial of the TopdownL1 and multiplexing can still be
>> observed.
>>
>> $ sudo ./perf stat true
>>
>>  Performance counter stats for 'true':
>>
>>               2.91 msec task-clock                       #    0.316 CPUs utilized
>>                  0      context-switches                 #    0.000 /sec
>>                  0      cpu-migrations                   #    0.000 /sec
>>                 45      page-faults                      #   15.474 K/sec
>>          2,819,972      cycles                           #    0.970 GHz                      (60.14%)
>>          5,391,406      instructions                     #    1.91  insn per cycle
>>          1,068,575      branches                         #  367.442 M/sec
>>              8,455      branch-misses                    #    0.79% of all branches
>>             70,283      CPU_CLK_UNHALTED.REF_XCLK        #   24.168 M/sec
>>             48,806      INT_MISC.RECOVERY_CYCLES_ANY     #   16.783 M/sec                    (39.86%)
>>
>>        0.009204517 seconds time elapsed
>>
>>        0.000000000 seconds user
>>        0.009614000 seconds sys
> 
> The issue here is that 'true' ran very quickly and so we skipped the
> output of the events with 0 counts, no metrics were computed due to
> the zero counts. Cascade lake has the same TopdownL1 multiplexing
> issues as Skylake.
> 

I tried another benchmark and can still observe the multiplexing, but
my remote machine just crashed. I'm asking tech support, so I cannot do
more tests for now.

Thanks,
Kan


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-18 18:19                 ` Liang, Kan
@ 2023-04-18 20:08                   ` Ian Rogers
  2023-04-18 21:51                     ` Liang, Kan
  0 siblings, 1 reply; 29+ messages in thread
From: Ian Rogers @ 2023-04-18 20:08 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel

On Tue, Apr 18, 2023 at 11:19 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-04-18 11:43 a.m., Ian Rogers wrote:
> > On Tue, Apr 18, 2023 at 6:03 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>
> >>
> >>
> >> On 2023-04-17 2:13 p.m., Ian Rogers wrote:
> >>> The json TopdownL1 is enabled if present unconditionally for perf stat
> >>> default. Enabling it on Skylake has multiplexing as TopdownL1 on
> >>> Skylake has multiplexing unrelated to this change - at least on the
> >>> machine I was testing on. We can remove the metric group TopdownL1 on
> >>> Skylake so that we don't enable it by default, there is still the
> >>> group TmaL1. To me, disabling TopdownL1 seems less desirable than
> >>> running with multiplexing - previously to get into topdown analysis
> >>> there has to be knowledge that "perf stat -M TopdownL1" is the way to
> >>> do this.
> >>
> >> To be honest, I don't think it's a good idea to remove the TopdownL1. We
> >> cannot remove it just because the new way cannot handle it. The perf
> >> stat default works well until 6.3-rc7. It's a regression issue of the
> >> current perf-tools-next.
> >
> > I'm not so clear it is a regression to consistently add TopdownL1 for
> > all architectures supporting it.
>
>
> Breaking the perf stat default is a regression.

"Breaking" overstates the use of multiplexing. The impact is less
accuracy in the IPC and branch-misses default metrics, if multiplexing
is necessary on your Intel architecture. I believe TopdownL1 is more
useful than either of these metrics, so having TopdownL1 be a default
is an improvement. We can add a patch (as I keep repeating, this is
off-topic for this patch) to make it so that TopdownL1 isn't enabled
on Intel CPUs pre-Icelake, but I would discourage this.

> The reason we once added the TopdownL1 for ICL and later platform is
> that it doesn't break the original design (a *clean* output).

Right, and in 6.3-rc7 the aggregation of counts was broken because of
duplicated counts and hard-coded metrics (I did a last-minute partial
fix). In perf-tools-next aggregation was fixed and we switched to the
json metrics, which are accurate and up-to-date with the latest TMA
metrics, so that we wouldn't need to maintain a duplicate code path.
What triggers enabling TopdownL1 in 6.3 is the presence of topdown
events, whilst in perf-tools-next it is the presence of the TopdownL1
metric group, as this is a more consistent approach and had first been
proposed by ARM.

> > The assertion is that because
> > TopdownL1 has multiplexing and multiplexing is global then we've
> > lowered the accuracy of other metrics, but the only other hardware
> > metrics in the default output pre-Icelake are IPC and branch miss
> > rate. Having TopdownL1 is a way of drilling into performance issues,
> > while IPC and branch miss rate are putting your finger in the air to
> > see which way the wind is blowing. Perhaps we should drop these if
> > TopdownL1 is present.
> >
> >> But I'm OK to add some flags with the metrics to assist the perf tool to
> >> specially handle the case if you prefer to modify the event list.
> >
> > We've already removed thresholds from the default output, we could
> > remove groups.
> >
> >>>
> >>> This doesn't relate to this change which is about making it so that
> >>> failing to set up TopdownL1 doesn't cause an early exit. The reason I
> >>> showed TigerLake output was that on TigerLake the skip/fallback
> >>> approach works while Skylake just needs the events disabled/skipped
> >>> unless it has sufficient permissions. Note the :u on the events in:
> >>
> >> perf_event_open() is good at detecting insufficient permissions, but it
> >> cannot detect the existence of a specific event. That's because the
> >> kernel only checks for features, not specific events, so relying on the
> >> result of perf_event_open() is not reliable here.
> >
> > I'm unclear why not as not having perf_event_open fail seems like a
> > kernel bug. I can see there is little motivation to fix this on older
> > architectures like Skylake.
>
> Updating kernel may not be an option here. Because
> - There is no issue with the existing perf tool until 6.3-rc7. It
> doesn't sound like a defect of the kernel.
> - The kernel for the SKL has been there for many years. It's impossible
> to change all the kernels to support a new requirement from the perf tool.

This isn't a new behavior in the perf tool; the expectation that
perf_event_open() can fail is already present in the implementation of
weak event groups.
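The weak-group behavior referenced here can be sketched as a toy model (the
open_fn callback stands in for perf_event_open() so the logic can be exercised
without a PMU; this is not the actual tools/perf implementation):

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for perf_event_open(): open one event, grouped or not. */
typedef bool (*open_fn)(const char *event, bool grouped);

/*
 * Toy model of a weak group: try to open all events as one group; if
 * any open fails, fall back to opening each event on its own instead
 * of failing the whole request.
 */
static int open_weak_group(const char *const *events, size_t n, open_fn open)
{
	bool group_ok = true;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!open(events[i], true)) {
			group_ok = false;
			break;
		}
	}
	if (group_ok)
		return 0;       /* whole group opened together */

	/* Weak semantics: retry every event ungrouped. */
	for (i = 0; i < n; i++) {
		if (!open(events[i], false))
			return -1;  /* a hard failure, not recoverable */
	}
	return 1;               /* opened, but without grouping */
}

/* Mock open functions for exercising the logic. */
static bool open_always(const char *e, bool g) { (void)e; (void)g; return true; }
static bool open_no_group(const char *e, bool g) { (void)e; return !g; }
static bool open_never(const char *e, bool g) { (void)e; (void)g; return false; }
```

The fallback only works because perf_event_open() reliably rejects the grouped
open in the first place, which is the expectation being discussed.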

> > We do attempt to workaround it with the
> > errata flags on the metrics introduced here:
> > https://github.com/intel/perfmon/blob/main/scripts/create_perf_json.py#L1296
> >
> If so, I would still suggest to check the slots and topdown events in
> sysfs and decide whether to append the TopdownL1 to perf stat default.
>
> So we probably need a ARCH specific is_event_available(), before
> appending the events.

This can be done, but is a feature request. I think we should enable
TopdownL1 by default: the way to avoid multiplexing, if you care or are
affected, is to choose the specific event or metric of interest and
profile with just that. I suspect the number of people wanting to do
that for branch miss-rate and IPC will be small. Educating users that
TopdownL1 exists is a harder problem and shouldn't be something present
only for Intel Icelake+ users or those who've read through the wiki.

Thanks,
Ian

> >>>> From your test result in the v2 description, we can see that there is no
> >>>> TopdownL1, which is good and expected. However, there is a (48.99%) with
> >>>> cycles:u event, which implies multiplexing. Could you please check
> >>>> what's the problem here?
> >>>> Also, if it's because of the background, all the events should be
> >>>> multiplexing. But it looks like only cycles:u has multiplexing. Other
> >>>> events, instructions:u, branches:u and branch-misses:u work without
> >>>> multiplexing. That's very strange.
> >>> I wasn't able to reproduce it and suspect it was a transient thing. I
> >>> think there are multiplexing things to look into, 2 events on a fixed
> >>> counter on Icelake+ will trigger multiplexing on all counters, and
> >>> Skylake's 3 fixed and 4 generic should fit TopdownL1.
> >>
> >> Just found a cascade lake. With this patch + the current
> >> perf-tools-next, partial of the TopdownL1 and multiplexing can still be
> >> observed.
> >>
> >> $ sudo ./perf stat true
> >>
> >>  Performance counter stats for 'true':
> >>
> >>               2.91 msec task-clock                       #    0.316 CPUs utilized
> >>                  0      context-switches                 #    0.000 /sec
> >>                  0      cpu-migrations                   #    0.000 /sec
> >>                 45      page-faults                      #   15.474 K/sec
> >>          2,819,972      cycles                           #    0.970 GHz                      (60.14%)
> >>          5,391,406      instructions                     #    1.91  insn per cycle
> >>          1,068,575      branches                         #  367.442 M/sec
> >>              8,455      branch-misses                    #    0.79% of all branches
> >>             70,283      CPU_CLK_UNHALTED.REF_XCLK        #   24.168 M/sec
> >>             48,806      INT_MISC.RECOVERY_CYCLES_ANY     #   16.783 M/sec                    (39.86%)
> >>
> >>        0.009204517 seconds time elapsed
> >>
> >>        0.000000000 seconds user
> >>        0.009614000 seconds sys
> >
> > The issue here is that 'true' ran very quickly and so we skipped the
> > output of the events with 0 counts, no metrics were computed due to
> > the zero counts. Cascade lake has the same TopdownL1 multiplexing
> > issues as Skylake.
> >
>
> I tried other benchmark. I can still observe the multiplexing. But my
> remote machine just crashed. I'm asking the tech support. So I cannot do
> more tests.
>
> Thanks,
> Kan


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-18 20:08                   ` Ian Rogers
@ 2023-04-18 21:51                     ` Liang, Kan
  2023-04-19  0:12                       ` Ian Rogers
  0 siblings, 1 reply; 29+ messages in thread
From: Liang, Kan @ 2023-04-18 21:51 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel



On 2023-04-18 4:08 p.m., Ian Rogers wrote:
> On Tue, Apr 18, 2023 at 11:19 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>>
>> On 2023-04-18 11:43 a.m., Ian Rogers wrote:
>>> On Tue, Apr 18, 2023 at 6:03 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2023-04-17 2:13 p.m., Ian Rogers wrote:
>>>>> The json TopdownL1 is enabled if present unconditionally for perf stat
>>>>> default. Enabling it on Skylake has multiplexing as TopdownL1 on
>>>>> Skylake has multiplexing unrelated to this change - at least on the
>>>>> machine I was testing on. We can remove the metric group TopdownL1 on
>>>>> Skylake so that we don't enable it by default, there is still the
>>>>> group TmaL1. To me, disabling TopdownL1 seems less desirable than
>>>>> running with multiplexing - previously to get into topdown analysis
>>>>> there has to be knowledge that "perf stat -M TopdownL1" is the way to
>>>>> do this.
>>>>
>>>> To be honest, I don't think it's a good idea to remove the TopdownL1. We
>>>> cannot remove it just because the new way cannot handle it. The perf
>>>> stat default works well until 6.3-rc7. It's a regression issue of the
>>>> current perf-tools-next.
>>>
>>> I'm not so clear it is a regression to consistently add TopdownL1 for
>>> all architectures supporting it.
>>
>>
>> Breaking the perf stat default is a regression.
> 
> Breaking is overstating the use of multiplexing. The impact is less
> accuracy in the IPC and branch misses default metrics,

Inaccuracy is a breakage for the default.

> if multiplexing
> is necessary on your Intel architecture. I believe TopdownL1 is more
> useful than either of these metrics and so having TopdownL1 be a
> default is an improvement. We can add a patch, as I keep repeating
> this is off-topic for this patch, to make it so that TopdownL1 isn't
> enabled on Intel CPUs pre-Icelake, but I would discourage this.


We need the TopdownL1. We just don't need TopdownL1 in the perf stat
default when the perf metrics feature is not available.


> 
>> The reason we once added the TopdownL1 for ICL and later platform is
>> that it doesn't break the original design (a *clean* output).
> 
> Right, and in 6.3-rc7 the aggregation of counts was broken because of
> duplicated counts and hard coded metrics (I did a last minute partial
> fix). In perf-tools-next aggregation was fixed and we switched to the
> json metrics, that are accurate and up-to-date with the latest TMA
> metrics, so that we wouldn't need to maintain a duplicate code path.
> What keys enabling TopdownL1 in 6.3 is the presence of topdown events
> whilst in perf-tools-next it is the presence of TopdownL1 metric
> group, as this is a more consistent approach and had first been
> proposed by ARM.

A consistent approach is good only when it benefits all parties rather
than sacrificing any of them.

Apparently, the approach in perf-tools-next brings all kinds of issues:
multiplexing/inaccuracy in the perf stat default on Intel platforms.
Why can't we fix it properly before applying the approach?

I think Andi also mentioned a similar request when ARM introduced the
TopdownL1 metrics.
https://lore.kernel.org/linux-perf-users/12e0deef-08db-445f-4958-bcd5c3e10367@linux.intel.com/

Thanks,
Kan


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-18 21:51                     ` Liang, Kan
@ 2023-04-19  0:12                       ` Ian Rogers
  2023-04-19  1:00                         ` Ian Rogers
  0 siblings, 1 reply; 29+ messages in thread
From: Ian Rogers @ 2023-04-19  0:12 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel

On Tue, Apr 18, 2023 at 2:51 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-04-18 4:08 p.m., Ian Rogers wrote:
> > On Tue, Apr 18, 2023 at 11:19 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>
> >>
> >>
> >> On 2023-04-18 11:43 a.m., Ian Rogers wrote:
> >>> On Tue, Apr 18, 2023 at 6:03 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 2023-04-17 2:13 p.m., Ian Rogers wrote:
> >>>>> The json TopdownL1 is enabled if present unconditionally for perf stat
> >>>>> default. Enabling it on Skylake has multiplexing as TopdownL1 on
> >>>>> Skylake has multiplexing unrelated to this change - at least on the
> >>>>> machine I was testing on. We can remove the metric group TopdownL1 on
> >>>>> Skylake so that we don't enable it by default, there is still the
> >>>>> group TmaL1. To me, disabling TopdownL1 seems less desirable than
> >>>>> running with multiplexing - previously to get into topdown analysis
> >>>>> there has to be knowledge that "perf stat -M TopdownL1" is the way to
> >>>>> do this.
> >>>>
> >>>> To be honest, I don't think it's a good idea to remove the TopdownL1. We
> >>>> cannot remove it just because the new way cannot handle it. The perf
> >>>> stat default works well until 6.3-rc7. It's a regression issue of the
> >>>> current perf-tools-next.
> >>>
> >>> I'm not so clear it is a regression to consistently add TopdownL1 for
> >>> all architectures supporting it.
> >>
> >>
> >> Breaking the perf stat default is a regression.
> >
> > Breaking is overstating the use of multiplexing. The impact is less
> > accuracy in the IPC and branch misses default metrics,
>
> Inaccuracy is a breakage for the default.

Can you present a case where this matters? The events are already not
grouped and so inaccurate for metrics.

> > if multiplexing
> > is necessary on your Intel architecture. I believe TopdownL1 is more
> > useful than either of these metrics and so having TopdownL1 be a
> > default is an improvement. We can add a patch, as I keep repeating
> > this is off-topic for this patch, to make it so that TopdownL1 isn't
> > enabled on Intel CPUs pre-Icelake, but I would discourage this.
>
>
> We need the TopdownL1. We just don't need TopdownL1 in the perf stat
> default when perf metrics feature is not available.

Perf metrics is an Intel-only Icelake+ feature. I suggest the simplest
way to achieve this would be to remove the TopdownL1 metric group from
all Intel metrics before Icelake. This will mean that on these
architectures the TmaL1 group will need to be used instead.
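Concretely, the kind of edit this implies could be sketched as below, assuming the json metrics carry a semicolon-separated "MetricGroup" field as in tools/perf/pmu-events (the entry shown is a made-up minimal example, not real pmu-events content):

```python
# Sketch: strip the "TopdownL1" name from the semicolon-separated
# MetricGroup field of pre-Icelake json metrics, leaving groups such
# as TmaL1 intact.

def drop_metric_group(metrics, group):
    """Remove 'group' from each metric's MetricGroup list, in place."""
    for m in metrics:
        groups = m.get("MetricGroup", "").split(";")
        m["MetricGroup"] = ";".join(g for g in groups if g != group)
    return metrics

# Made-up minimal example entry.
skylake_metrics = [
    {"MetricName": "tma_frontend_bound",
     "MetricGroup": "PGO;TmaL1;TopdownL1;tma_L1_group"},
]
drop_metric_group(skylake_metrics, "TopdownL1")
print(skylake_metrics[0]["MetricGroup"])
```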

Thanks,
Ian

> >
> >> The reason we once added the TopdownL1 for ICL and later platform is
> >> that it doesn't break the original design (a *clean* output).
> >
> > Right, and in 6.3-rc7 the aggregation of counts was broken because of
> > duplicated counts and hard coded metrics (I did a last minute partial
> > fix). In perf-tools-next aggregation was fixed and we switched to the
> > json metrics, that are accurate and up-to-date with the latest TMA
> > metrics, so that we wouldn't need to maintain a duplicate code path.
> > What keys enabling TopdownL1 in 6.3 is the presence of topdown events
> > whilst in perf-tools-next it is the presence of TopdownL1 metric
> > group, as this is a more consistent approach and had first been
> > proposed by ARM.
>
> A consistent approach is good only when it can benefits all parties
> rather than sacrifices any of them.
>
> Apparently, the approach in the perf-tools-next brings all kinds of
> issues, multiplexing/inaccuracy in the perf stat default on Intel
> platforms. Why cannot we fix it properly before applying the approach?
>
> I think Andi also mentioned the similar request when ARM introduced the
> TopdownL1 metrics.
> https://lore.kernel.org/linux-perf-users/12e0deef-08db-445f-4958-bcd5c3e10367@linux.intel.com/
>
> Thanks,
> Kan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-19  0:12                       ` Ian Rogers
@ 2023-04-19  1:00                         ` Ian Rogers
  2023-04-19 12:31                           ` Liang, Kan
  0 siblings, 1 reply; 29+ messages in thread
From: Ian Rogers @ 2023-04-19  1:00 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel

On Tue, Apr 18, 2023 at 5:12 PM Ian Rogers <irogers@google.com> wrote:
>
> On Tue, Apr 18, 2023 at 2:51 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >
> >
> >
> > On 2023-04-18 4:08 p.m., Ian Rogers wrote:
> > > On Tue, Apr 18, 2023 at 11:19 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> > >>
> > >>
> > >>
> > >> On 2023-04-18 11:43 a.m., Ian Rogers wrote:
> > >>> On Tue, Apr 18, 2023 at 6:03 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 2023-04-17 2:13 p.m., Ian Rogers wrote:
> > >>>>> The json TopdownL1 is enabled if present unconditionally for perf stat
> > >>>>> default. Enabling it on Skylake has multiplexing as TopdownL1 on
> > >>>>> Skylake has multiplexing unrelated to this change - at least on the
> > >>>>> machine I was testing on. We can remove the metric group TopdownL1 on
> > >>>>> Skylake so that we don't enable it by default, there is still the
> > >>>>> group TmaL1. To me, disabling TopdownL1 seems less desirable than
> > >>>>> running with multiplexing - previously to get into topdown analysis
> > >>>>> there has to be knowledge that "perf stat -M TopdownL1" is the way to
> > >>>>> do this.
> > >>>>
> > >>>> To be honest, I don't think it's a good idea to remove the TopdownL1. We
> > >>>> cannot remove it just because the new way cannot handle it. The perf
> > >>>> stat default works well until 6.3-rc7. It's a regression issue of the
> > >>>> current perf-tools-next.
> > >>>
> > >>> I'm not so clear it is a regression to consistently add TopdownL1 for
> > >>> all architectures supporting it.
> > >>
> > >>
> > >> Breaking the perf stat default is a regression.
> > >
> > > Breaking is overstating the use of multiplexing. The impact is less
> > > accuracy in the IPC and branch misses default metrics,
> >
> > Inaccuracy is a breakage for the default.
>
> Can you present a case where this matters? The events are already not
> grouped and so inaccurate for metrics.

Removing CPUs without perf metrics from the TopdownL1 metric group is
implemented here:
https://lore.kernel.org/lkml/20230419005423.343862-6-irogers@google.com/
Note, this applies to pre-Icelake and atom CPUs as these also lack
perf metric (aka topdown) events.

With that change I don't have a case that requires skippable evsels,
and so we can take that series with patch 6 over the v1 of that series
with this change.

Thanks,
Ian

> > > if multiplexing
> > > is necessary on your Intel architecture. I believe TopdownL1 is more
> > > useful than either of these metrics and so having TopdownL1 be a
> > > default is an improvement. We can add a patch, as I keep repeating
> > > this is off-topic for this patch, to make it so that TopdownL1 isn't
> > > enabled on Intel CPUs pre-Icelake, but I would discourage this.
> >
> >
> > We need the TopdownL1. We just don't need TopdownL1 in the perf stat
> > default when perf metrics feature is not available.
>
> Perf metrics is an Intel only Icelake+ feature. I suggest the simplest
> way to achieve this would be to remove the TopdownL1 metric group from
> all Intel metrics before Icelake. This will mean on these
> architectures the group TmaL1 will need using instead.
>
> Thanks,
> Ian
>
> > >
> > >> The reason we once added the TopdownL1 for ICL and later platform is
> > >> that it doesn't break the original design (a *clean* output).
> > >
> > > Right, and in 6.3-rc7 the aggregation of counts was broken because of
> > > duplicated counts and hard coded metrics (I did a last minute partial
> > > fix). In perf-tools-next aggregation was fixed and we switched to the
> > > json metrics, that are accurate and up-to-date with the latest TMA
> > > metrics, so that we wouldn't need to maintain a duplicate code path.
> > > What keys enabling TopdownL1 in 6.3 is the presence of topdown events
> > > whilst in perf-tools-next it is the presence of TopdownL1 metric
> > > group, as this is a more consistent approach and had first been
> > > proposed by ARM.
> >
> > A consistent approach is good only when it can benefits all parties
> > rather than sacrifices any of them.
> >
> > Apparently, the approach in the perf-tools-next brings all kinds of
> > issues, multiplexing/inaccuracy in the perf stat default on Intel
> > platforms. Why cannot we fix it properly before applying the approach?
> >
> > I think Andi also mentioned the similar request when ARM introduced the
> > TopdownL1 metrics.
> > https://lore.kernel.org/linux-perf-users/12e0deef-08db-445f-4958-bcd5c3e10367@linux.intel.com/
> >
> > Thanks,
> > Kan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-19  1:00                         ` Ian Rogers
@ 2023-04-19 12:31                           ` Liang, Kan
  2023-04-19 13:19                             ` Ian Rogers
  0 siblings, 1 reply; 29+ messages in thread
From: Liang, Kan @ 2023-04-19 12:31 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel



On 2023-04-18 9:00 p.m., Ian Rogers wrote:
> On Tue, Apr 18, 2023 at 5:12 PM Ian Rogers <irogers@google.com> wrote:
>>
>> On Tue, Apr 18, 2023 at 2:51 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>
>>>
>>>
>>> On 2023-04-18 4:08 p.m., Ian Rogers wrote:
>>>> On Tue, Apr 18, 2023 at 11:19 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2023-04-18 11:43 a.m., Ian Rogers wrote:
>>>>>> On Tue, Apr 18, 2023 at 6:03 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2023-04-17 2:13 p.m., Ian Rogers wrote:
>>>>>>>> The json TopdownL1 is enabled if present unconditionally for perf stat
>>>>>>>> default. Enabling it on Skylake has multiplexing as TopdownL1 on
>>>>>>>> Skylake has multiplexing unrelated to this change - at least on the
>>>>>>>> machine I was testing on. We can remove the metric group TopdownL1 on
>>>>>>>> Skylake so that we don't enable it by default, there is still the
>>>>>>>> group TmaL1. To me, disabling TopdownL1 seems less desirable than
>>>>>>>> running with multiplexing - previously to get into topdown analysis
>>>>>>>> there has to be knowledge that "perf stat -M TopdownL1" is the way to
>>>>>>>> do this.
>>>>>>>
>>>>>>> To be honest, I don't think it's a good idea to remove the TopdownL1. We
>>>>>>> cannot remove it just because the new way cannot handle it. The perf
>>>>>>> stat default works well until 6.3-rc7. It's a regression issue of the
>>>>>>> current perf-tools-next.
>>>>>>
>>>>>> I'm not so clear it is a regression to consistently add TopdownL1 for
>>>>>> all architectures supporting it.
>>>>>
>>>>>
>>>>> Breaking the perf stat default is a regression.
>>>>
>>>> Breaking is overstating the use of multiplexing. The impact is less
>>>> accuracy in the IPC and branch misses default metrics,
>>>
>>> Inaccuracy is a breakage for the default.
>>
>> Can you present a case where this matters? The events are already not
>> grouped and so inaccurate for metrics.
> 
> Removing CPUs without perf metrics from the TopdownL1 metric group is
> implemented here:
> https://lore.kernel.org/lkml/20230419005423.343862-6-irogers@google.com/
> Note, this applies to pre-Icelake and atom CPUs as these also lack
> perf metric (aka topdown) events.
>

That may give the end user the impression that pre-Icelake doesn't
support the Topdown Level 1 events, which is not true.

I think perf should either keep it for all Intel platforms which
support tma_L1_group, or remove the TopdownL1 name entirely for Intel
platforms (letting the end user use tma_L1_group and the names exposed
by the kernel as before).


> With that change I don't have a case that requires skippable evsels,
> and so we can take that series with patch 6 over the v1 of that series
> with this change.
> 

I'm afraid this is not the only problem that commit 94b1a603fca7 ("perf
stat: Add TopdownL1 metric as a default if present") in the
perf-tools-next branch introduced.

The topdown L2 metrics in the perf stat default on SPR and the big core
of the ADL are still missing. I don't see a possible fix for this on
the current perf-tools-next branch.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-19 12:31                           ` Liang, Kan
@ 2023-04-19 13:19                             ` Ian Rogers
  2023-04-19 14:16                               ` Liang, Kan
  0 siblings, 1 reply; 29+ messages in thread
From: Ian Rogers @ 2023-04-19 13:19 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel

On Wed, Apr 19, 2023 at 5:31 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-04-18 9:00 p.m., Ian Rogers wrote:
> > On Tue, Apr 18, 2023 at 5:12 PM Ian Rogers <irogers@google.com> wrote:
> >>
> >> On Tue, Apr 18, 2023 at 2:51 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>>
> >>>
> >>>
> >>> On 2023-04-18 4:08 p.m., Ian Rogers wrote:
> >>>> On Tue, Apr 18, 2023 at 11:19 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2023-04-18 11:43 a.m., Ian Rogers wrote:
> >>>>>> On Tue, Apr 18, 2023 at 6:03 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2023-04-17 2:13 p.m., Ian Rogers wrote:
> >>>>>>>> The json TopdownL1 is enabled if present unconditionally for perf stat
> >>>>>>>> default. Enabling it on Skylake has multiplexing as TopdownL1 on
> >>>>>>>> Skylake has multiplexing unrelated to this change - at least on the
> >>>>>>>> machine I was testing on. We can remove the metric group TopdownL1 on
> >>>>>>>> Skylake so that we don't enable it by default, there is still the
> >>>>>>>> group TmaL1. To me, disabling TopdownL1 seems less desirable than
> >>>>>>>> running with multiplexing - previously to get into topdown analysis
> >>>>>>>> there has to be knowledge that "perf stat -M TopdownL1" is the way to
> >>>>>>>> do this.
> >>>>>>>
> >>>>>>> To be honest, I don't think it's a good idea to remove the TopdownL1. We
> >>>>>>> cannot remove it just because the new way cannot handle it. The perf
> >>>>>>> stat default works well until 6.3-rc7. It's a regression issue of the
> >>>>>>> current perf-tools-next.
> >>>>>>
> >>>>>> I'm not so clear it is a regression to consistently add TopdownL1 for
> >>>>>> all architectures supporting it.
> >>>>>
> >>>>>
> >>>>> Breaking the perf stat default is a regression.
> >>>>
> >>>> Breaking is overstating the use of multiplexing. The impact is less
> >>>> accuracy in the IPC and branch misses default metrics,
> >>>
> >>> Inaccuracy is a breakage for the default.
> >>
> >> Can you present a case where this matters? The events are already not
> >> grouped and so inaccurate for metrics.
> >
> > Removing CPUs without perf metrics from the TopdownL1 metric group is
> > implemented here:
> > https://lore.kernel.org/lkml/20230419005423.343862-6-irogers@google.com/
> > Note, this applies to pre-Icelake and atom CPUs as these also lack
> > perf metric (aka topdown) events.
> >
>
> That may give the end user the impression that the pre-Icelake doesn't
> support the Topdown Level1 events, which is not true.
>
> I think perf should either keep it for all Intel platforms which
> supports tma_L1_group, or remove the TopdownL1 name entirely for Intel
> platform (let the end user use the tma_L1_group and the name exposed by
> the kernel as before.).

How does this work on hybrid systems? We will enable TopdownL1 because
of the presence of perf metric (aka topdown) events but this will also
enable TopdownL1 on the atom core.

>
> > With that change I don't have a case that requires skippable evsels,
> > and so we can take that series with patch 6 over the v1 of that series
> > with this change.
> >
>
> I'm afraid this is not the only problem the commit 94b1a603fca7 ("perf
> stat: Add TopdownL1 metric as a default if present") in the
> perf-tools-next branch introduced.
>
> The topdown L2 in the perf stat default on SPR and big core of the ADL
> is still missed. I don't see a possible fix for this on the current
> perf-tools-next branch.

I thought that in its current state the json metrics for TopdownL2 on
SPR have multiplexing. Given L1 is used to drill down to L2, it seems
odd to start at L2; but given L1 is used to compute the thresholds for
L2, the approach should be to have both L1 and L2 on these platforms.
However, that doesn't work as you don't want multiplexing.

This all seems backward just to avoid potential multiplexing on the
branch miss rate and IPC; always having TopdownL1 seems cleanest, with
the skippable evsels working around the permissions issue - as put
forward in this patch. Possibly add L2 metrics on ADL/SPR, but only
once the multiplexing issue is resolved.
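The L1-to-L2 drilldown ordering can be sketched as follows; the 0.2 threshold and all of the values are illustrative only (real TMA thresholds come from the metric json), and the gating rule is a simplification of what the thresholds are used for:

```python
# Sketch: L1 values gate which L2 children are worth reporting, which
# is why computing L2 without L1 is awkward. Values and the threshold
# are invented for illustration.

L1 = {"retiring": 0.55, "bad_speculation": 0.02,
      "frontend_bound": 0.05, "backend_bound": 0.38}
L2_children = {"backend_bound": {"memory_bound": 0.10, "core_bound": 0.28},
               "frontend_bound": {"fetch_latency": 0.03, "fetch_bandwidth": 0.02}}

def drill_down(l1, children, threshold=0.2):
    # Only expand L1 categories whose value exceeds the threshold.
    return {parent: kids for parent, kids in children.items()
            if l1.get(parent, 0.0) > threshold}

print(drill_down(L1, L2_children))
```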

Thanks,
Ian

> Thanks,
> Kan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-19 13:19                             ` Ian Rogers
@ 2023-04-19 14:16                               ` Liang, Kan
  2023-04-19 16:51                                 ` Ian Rogers
  0 siblings, 1 reply; 29+ messages in thread
From: Liang, Kan @ 2023-04-19 14:16 UTC (permalink / raw)
  To: Ian Rogers, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Adrian Hunter, Florian Fischer,
	linux-perf-users, linux-kernel



On 2023-04-19 9:19 a.m., Ian Rogers wrote:
> On Wed, Apr 19, 2023 at 5:31 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>>
>> On 2023-04-18 9:00 p.m., Ian Rogers wrote:
>>> On Tue, Apr 18, 2023 at 5:12 PM Ian Rogers <irogers@google.com> wrote:
>>>>
>>>> On Tue, Apr 18, 2023 at 2:51 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2023-04-18 4:08 p.m., Ian Rogers wrote:
>>>>>> On Tue, Apr 18, 2023 at 11:19 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2023-04-18 11:43 a.m., Ian Rogers wrote:
>>>>>>>> On Tue, Apr 18, 2023 at 6:03 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 2023-04-17 2:13 p.m., Ian Rogers wrote:
>>>>>>>>>> The json TopdownL1 is enabled if present unconditionally for perf stat
>>>>>>>>>> default. Enabling it on Skylake has multiplexing as TopdownL1 on
>>>>>>>>>> Skylake has multiplexing unrelated to this change - at least on the
>>>>>>>>>> machine I was testing on. We can remove the metric group TopdownL1 on
>>>>>>>>>> Skylake so that we don't enable it by default, there is still the
>>>>>>>>>> group TmaL1. To me, disabling TopdownL1 seems less desirable than
>>>>>>>>>> running with multiplexing - previously to get into topdown analysis
>>>>>>>>>> there has to be knowledge that "perf stat -M TopdownL1" is the way to
>>>>>>>>>> do this.
>>>>>>>>>
>>>>>>>>> To be honest, I don't think it's a good idea to remove the TopdownL1. We
>>>>>>>>> cannot remove it just because the new way cannot handle it. The perf
>>>>>>>>> stat default works well until 6.3-rc7. It's a regression issue of the
>>>>>>>>> current perf-tools-next.
>>>>>>>>
>>>>>>>> I'm not so clear it is a regression to consistently add TopdownL1 for
>>>>>>>> all architectures supporting it.
>>>>>>>
>>>>>>>
>>>>>>> Breaking the perf stat default is a regression.
>>>>>>
>>>>>> Breaking is overstating the use of multiplexing. The impact is less
>>>>>> accuracy in the IPC and branch misses default metrics,
>>>>>
>>>>> Inaccuracy is a breakage for the default.
>>>>
>>>> Can you present a case where this matters? The events are already not
>>>> grouped and so inaccurate for metrics.
>>>
>>> Removing CPUs without perf metrics from the TopdownL1 metric group is
>>> implemented here:
>>> https://lore.kernel.org/lkml/20230419005423.343862-6-irogers@google.com/
>>> Note, this applies to pre-Icelake and atom CPUs as these also lack
>>> perf metric (aka topdown) events.
>>>
>>
>> That may give the end user the impression that the pre-Icelake doesn't
>> support the Topdown Level1 events, which is not true.
>>
>> I think perf should either keep it for all Intel platforms which
>> supports tma_L1_group, or remove the TopdownL1 name entirely for Intel
>> platform (let the end user use the tma_L1_group and the name exposed by
>> the kernel as before.).
> 
> How does this work on hybrid systems? We will enable TopdownL1 because
> of the presence of perf metric (aka topdown) events but this will also
> enable TopdownL1 on the atom core.


This is the output from a hybrid system with the current 6.3-rc7.

As you can see, the Topdown L1 and L2 metrics are displayed for the big
core. No Topdown events are displayed for the atom core.

(BTW: The 99.15% is not multiplexing. I think it's because perf stat
may start from the big core and it takes a little time to run
something on the small core.)


$perf stat ./hybrid_triad_loop.sh

 Performance counter stats for './hybrid_triad_loop.sh':

            211.80 msec task-clock                       #    0.996 CPUs utilized
                 5      context-switches                 #   23.608 /sec
                 3      cpu-migrations                   #   14.165 /sec
               652      page-faults                      #    3.078 K/sec
       411,470,713      cpu_core/cycles/                 #    1.943 G/sec
       607,566,483      cpu_atom/cycles/                 #    2.869 G/sec                       (99.15%)
     1,613,379,362      cpu_core/instructions/           #    7.618 G/sec
     1,616,816,312      cpu_atom/instructions/           #    7.634 G/sec                       (99.15%)
       202,876,952      cpu_core/branches/               #  957.884 M/sec
       202,367,829      cpu_atom/branches/               #  955.480 M/sec                       (99.15%)
            56,740      cpu_core/branch-misses/          #  267.898 K/sec
            19,033      cpu_atom/branch-misses/          #   89.864 K/sec                       (99.15%)
     2,468,765,562      cpu_core/slots/                  #   11.656 G/sec
     1,411,184,398      cpu_core/topdown-retiring/       #     57.4% Retiring
         4,671,159      cpu_core/topdown-bad-spec/       #      0.2% Bad Speculation
        92,222,378      cpu_core/topdown-fe-bound/       #      3.7% Frontend Bound
       952,516,107      cpu_core/topdown-be-bound/       #     38.7% Backend Bound
         2,696,347      cpu_core/topdown-heavy-ops/      #      0.1% Heavy Operations          #     57.2% Light Operations
         4,460,659      cpu_core/topdown-br-mispredict/  #      0.2% Branch Mispredict         #      0.0% Machine Clears
        19,538,486      cpu_core/topdown-fetch-lat/      #      0.8% Fetch Latency             #      3.0% Fetch Bandwidth
        24,170,592      cpu_core/topdown-mem-bound/      #      1.0% Memory Bound              #     37.7% Core Bound

       0.212598999 seconds time elapsed

       0.212525000 seconds user
       0.000000000 seconds sys
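For reference, the L1 percentages in the output above can be reproduced from the raw cpu_core topdown counts; the denominator used here (the sum of the four L1 events) is an inference from the printed figures, not a statement of perf's exact internal formula:

```python
# Reproduce the Topdown L1 percentages from the cpu_core counts shown
# in the perf stat output above.
counts = {
    "retiring": 1_411_184_398,
    "bad_spec":     4_671_159,
    "fe_bound":    92_222_378,
    "be_bound":   952_516_107,
}
total = sum(counts.values())
pct = {k: 100.0 * v / total for k, v in counts.items()}
print({k: f"{v:.1f}%" for k, v in pct.items()})
```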


> 
>>
>>> With that change I don't have a case that requires skippable evsels,
>>> and so we can take that series with patch 6 over the v1 of that series
>>> with this change.
>>>
>>
>> I'm afraid this is not the only problem the commit 94b1a603fca7 ("perf
>> stat: Add TopdownL1 metric as a default if present") in the
>> perf-tools-next branch introduced.
>>
>> The topdown L2 in the perf stat default on SPR and big core of the ADL
>> is still missed. I don't see a possible fix for this on the current
>> perf-tools-next branch.
> 
> I thought in its current state the json metrics for TopdownL2 on SPR
> have multiplexing. Given L1 is used to drill down to L2, it seems odd
> to start on L2, but given L1 is used to compute the thresholds for L2,
> this should be to have both L1 and L2 on these platforms. However,
> that doesn't work as you don't want multiplexing.
> 
> This all seems backward to avoid potential multiplexing on branch miss
> rate and IPC, just always having TopdownL1 seems cleanest with the
> skippable evsels working around the permissions issue - as put forward
> in this patch. Possibly adding L2 metrics on ADL/SPR, but only once
> the multiplexing issue is resolved.
> 

No, not just that issue. Based on what I have tested these days, the
perf stat default has issues/regressions on most of the Intel
platforms with the current perf-tools-next and perf/core branches of
acme's repo.

For the pre-ICL platforms:
- The permission issue. (This patch tried to address it.)
- Unclean perf stat default. (This patch failed to address it.)
  Unnecessary multiplexing for cycles.
  Displays only part of the TopdownL1 metrics.

https://lore.kernel.org/lkml/d1fe801a-22d0-1f9b-b127-227b21635bd5@linux.intel.com/

For SPR platforms
- Topdown L2 metrics are missing, while they work with the current 6.3-rc7.

For ADL/RPL platforms
- Segmentation fault which I just found this morning.
# ./perf stat true
Segmentation fault (core dumped)


After the test on a hybrid machine, I'm inclined to revert commit
94b1a603fca7 ("perf stat: Add TopdownL1 metric as a default if present")
and related patches for now.

To clarify, I do not object to a generic solution for Topdown on
different ARCHs. But the current generic solution, aka TopdownL1, has
all kinds of problems on most Intel platforms. We should fix them
first before applying it to the mainline.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-19 14:16                               ` Liang, Kan
@ 2023-04-19 16:51                                 ` Ian Rogers
  2023-04-19 18:57                                   ` Liang, Kan
  2023-04-20 11:33                                   ` Arnaldo Carvalho de Melo
  0 siblings, 2 replies; 29+ messages in thread
From: Ian Rogers @ 2023-04-19 16:51 UTC (permalink / raw)
  To: Liang, Kan, Stephane Eranian, Andi Kleen, Yasin, Ahmad, Taylor,
	Perry, Alt, Samantha, Biggers, Caleb, Wang, Weilin, Edward
  Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel

On Wed, Apr 19, 2023 at 7:16 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-04-19 9:19 a.m., Ian Rogers wrote:
> > On Wed, Apr 19, 2023 at 5:31 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>
> >>
> >>
> >> On 2023-04-18 9:00 p.m., Ian Rogers wrote:
> >>> On Tue, Apr 18, 2023 at 5:12 PM Ian Rogers <irogers@google.com> wrote:
> >>>>
> >>>> On Tue, Apr 18, 2023 at 2:51 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2023-04-18 4:08 p.m., Ian Rogers wrote:
> >>>>>> On Tue, Apr 18, 2023 at 11:19 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2023-04-18 11:43 a.m., Ian Rogers wrote:
> >>>>>>>> On Tue, Apr 18, 2023 at 6:03 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 2023-04-17 2:13 p.m., Ian Rogers wrote:
> >>>>>>>>>> The json TopdownL1 is enabled if present unconditionally for perf stat
> >>>>>>>>>> default. Enabling it on Skylake has multiplexing as TopdownL1 on
> >>>>>>>>>> Skylake has multiplexing unrelated to this change - at least on the
> >>>>>>>>>> machine I was testing on. We can remove the metric group TopdownL1 on
> >>>>>>>>>> Skylake so that we don't enable it by default, there is still the
> >>>>>>>>>> group TmaL1. To me, disabling TopdownL1 seems less desirable than
> >>>>>>>>>> running with multiplexing - previously to get into topdown analysis
> >>>>>>>>>> there has to be knowledge that "perf stat -M TopdownL1" is the way to
> >>>>>>>>>> do this.
> >>>>>>>>>
> >>>>>>>>> To be honest, I don't think it's a good idea to remove the TopdownL1. We
> >>>>>>>>> cannot remove it just because the new way cannot handle it. The perf
> >>>>>>>>> stat default works well until 6.3-rc7. It's a regression issue of the
> >>>>>>>>> current perf-tools-next.
> >>>>>>>>
> >>>>>>>> I'm not so clear it is a regression to consistently add TopdownL1 for
> >>>>>>>> all architectures supporting it.
> >>>>>>>
> >>>>>>>
> >>>>>>> Breaking the perf stat default is a regression.
> >>>>>>
> >>>>>> Breaking is overstating the use of multiplexing. The impact is less
> >>>>>> accuracy in the IPC and branch misses default metrics,
> >>>>>
> >>>>> Inaccuracy is a breakage for the default.
> >>>>
> >>>> Can you present a case where this matters? The events are already not
> >>>> grouped and so inaccurate for metrics.
> >>>
> >>> Removing CPUs without perf metrics from the TopdownL1 metric group is
> >>> implemented here:
> >>> https://lore.kernel.org/lkml/20230419005423.343862-6-irogers@google.com/
> >>> Note, this applies to pre-Icelake and atom CPUs as these also lack
> >>> perf metric (aka topdown) events.
> >>>
> >>
> >> That may give the end user the impression that the pre-Icelake doesn't
> >> support the Topdown Level1 events, which is not true.
> >>
> >> I think perf should either keep it for all Intel platforms which
> >> supports tma_L1_group, or remove the TopdownL1 name entirely for Intel
> >> platform (let the end user use the tma_L1_group and the name exposed by
> >> the kernel as before.).
> >
> > How does this work on hybrid systems? We will enable TopdownL1 because
> > of the presence of perf metric (aka topdown) events but this will also
> > enable TopdownL1 on the atom core.
>
>
> This is the output from a hybrid system with current 6.3-rc7.
>
> As you can see that the Topdown L1 and L2 are displayed for the big
> core. No Topdown events are displayed for the atom core.
>
> (BTW: The 99.15% is not multiplexing. I think it's because the perf stat
> may starts from the big core and it takes a little bit time to run
> something on the small core.)
>
>
> $perf stat ./hybrid_triad_loop.sh
>
>  Performance counter stats for './hybrid_triad_loop.sh':
>
>             211.80 msec task-clock                       #    0.996 CPUs utilized
>                  5      context-switches                 #   23.608 /sec
>                  3      cpu-migrations                   #   14.165 /sec
>                652      page-faults                      #    3.078 K/sec
>        411,470,713      cpu_core/cycles/                 #    1.943 G/sec
>        607,566,483      cpu_atom/cycles/                 #    2.869 G/sec                       (99.15%)
>      1,613,379,362      cpu_core/instructions/           #    7.618 G/sec
>      1,616,816,312      cpu_atom/instructions/           #    7.634 G/sec                       (99.15%)
>        202,876,952      cpu_core/branches/               #  957.884 M/sec
>        202,367,829      cpu_atom/branches/               #  955.480 M/sec                       (99.15%)
>             56,740      cpu_core/branch-misses/          #  267.898 K/sec
>             19,033      cpu_atom/branch-misses/          #   89.864 K/sec                       (99.15%)
>      2,468,765,562      cpu_core/slots/                  #   11.656 G/sec
>      1,411,184,398      cpu_core/topdown-retiring/       #     57.4% Retiring
>          4,671,159      cpu_core/topdown-bad-spec/       #      0.2% Bad Speculation
>         92,222,378      cpu_core/topdown-fe-bound/       #      3.7% Frontend Bound
>        952,516,107      cpu_core/topdown-be-bound/       #     38.7% Backend Bound
>          2,696,347      cpu_core/topdown-heavy-ops/      #      0.1% Heavy Operations          #     57.2% Light Operations
>          4,460,659      cpu_core/topdown-br-mispredict/  #      0.2% Branch Mispredict         #      0.0% Machine Clears
>         19,538,486      cpu_core/topdown-fetch-lat/      #      0.8% Fetch Latency             #      3.0% Fetch Bandwidth
>         24,170,592      cpu_core/topdown-mem-bound/      #      1.0% Memory Bound              #     37.7% Core Bound
>
>        0.212598999 seconds time elapsed
>
>        0.212525000 seconds user
>        0.000000000 seconds sys
>
>
> >
> >>
> >>> With that change I don't have a case that requires skippable evsels,
> >>> and so we can take that series with patch 6 over the v1 of that series
> >>> with this change.
> >>>
> >>
> >> I'm afraid this is not the only problem the commit 94b1a603fca7 ("perf
> >> stat: Add TopdownL1 metric as a default if present") in the
> >> perf-tools-next branch introduced.
> >>
> >> The topdown L2 in the perf stat default on SPR and big core of the ADL
> >> is still missed. I don't see a possible fix for this on the current
> >> perf-tools-next branch.
> >
> > I thought in its current state the json metrics for TopdownL2 on SPR
> > have multiplexing. Given L1 is used to drill down to L2, it seems odd
> > to start on L2, but given L1 is used to compute the thresholds for L2,
> > this should be to have both L1 and L2 on these platforms. However,
> > that doesn't work as you don't want multiplexing.
> >
> > This all seems backward to avoid potential multiplexing on branch miss
> > rate and IPC, just always having TopdownL1 seems cleanest with the
> > skippable evsels working around the permissions issue - as put forward
> > in this patch. Possibly adding L2 metrics on ADL/SPR, but only once
> > the multiplexing issue is resolved.
> >
>
> No, not just that issue. Based to what I tested these days, perf stat
> default has issues/regressions on most of the Intel platforms with the
> current perf-tools-next and perf/core branch of acme's repo.
>
> For the pre-ICL platforms:
> - The permission issue. (This patch tried to address.)
> - Unclean perf stat default. (This patch failed to address.)
>   Unnecessary multiplexing for cycles.
>   Display partial of the TopdownL1
>
> https://lore.kernel.org/lkml/d1fe801a-22d0-1f9b-b127-227b21635bd5@linux.intel.com/
>
> For SPR platforms
> - Topdown L2 metrics is missed, while it works with the current 6.3-rc7.
>
> For ADL/RPL platforms
> - Segmentation fault which I just found this morning.
> # ./perf stat true
> Segmentation fault (core dumped)

This may also stem from the reference count checking work that Arnaldo
is currently merging. It is hard to test hybrid because it uses
non-generic code paths.

> After the test on a hybrid machine, I incline to revert the commit
> 94b1a603fca7 ("perf stat: Add TopdownL1 metric as a default if present")
> and related patches for now.
>
> To clarify, I do not object a generic solution for the Topdown on
> different ARCHs. But the current generic solution aka TopdownL1 has all
> kinds of problems on most of Intel platforms. We should fix them first
> before applying to the mainline.

No, 6.3 has many issues as do the default events/metrics:
 - in 6.3 aggregation is wrong for metrics, specifically double
counting happens as the summary gets applied to the saved value under
certain conditions like repeat runs. This is solved in perf-tools-next
by removing the duplicated aggregation and using a single set of
counters. The "shadow" code is both simpler and more correct than in
6.3, reverting it would be a massive regression - tbh, if it wasn't
for the size of the patches I think we should cherry-pick what is in
perf-tools-next back to 6.3 due to the size of the errors.
 - the default events for IPC and branch miss rate lack groups and so
are inaccurate when multiplexing happens - but multiplexing inaccuracy
has historically not been an issue for the default events/metrics.
 - the previous default topdown metrics were inaccurate compared to
the TMA metrics spreadsheet, specifically they lacked events to
correct for errors and the thresholds differed. This is why TopdownL2
is a challenge because the json metrics do things properly causing the
event groups not to be trivially shared.
 - the topdown event solution is not generic, it is not only Intel
specific but Intel model specific. Solving the problem via the json is
the most consistent way to do this and is what I suggest we move to
in:
https://lore.kernel.org/lkml/20230419005423.343862-6-irogers@google.com/
but you can keep pushing that perf should do something special for
Intel and for every Intel model. It also pushes a need into the metric
code to do hybrid specific things so we can ignore atom TopdownL1. In
this vein, hybrid is a giant mess that the code base needs cleaning up
from; take the recent patch to just disable generic event parsing
tests for MTL:
https://lore.kernel.org/lkml/20230411094330.653965-1-tinghao.zhang@intel.com/
We need to stop doing this, and have generic code, not least because
of complaints like:
https://www.kernel.org/doc/html/latest/gpu/i915.html#issues-hit-with-first-prototype-based-on-core-perf
and because of the crash you are reporting on ADL/RPL.

What has motivated you here is a concern over multiplexing on branch
miss rate and IPC on pre-Icelake architectures that I find hard to
understand, and which is trivially remedied in the time-honored
fashion, i.e. profiling the metrics you care about separately. A far more egregious
multiplexing problem is that on Icelake+ architectures we multiplex
fixed and generic counters unnecessarily when the same fixed counter
is needed or there are insufficient generic counters. This is
egregious as there is no user workaround and the kernel fix, to
iterate all hardware events the same way we do for all software
events, is both obvious and trivial. We've lived with this
multiplexing problem since Icelake and I think we can live with a
possible multiplexing problem, on legacy architectures, with what is
in perf-tools-next. The permissions problem tackled here means we
should merge this.
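
For reference, the skippable behavior under discussion can be sketched
as follows. This is an illustrative model only, not the actual perf
code: the `skippable` and `open_failed` fields are invented stand-ins
for the evsel flag added by this patch and for a perf_event_open()
failure such as -EACCES.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct evsel {
	const char *name;
	bool skippable;   /* set for implicitly added events, e.g. TopdownL1 */
	bool open_failed; /* stand-in for an -EACCES from perf_event_open() */
	bool counted;
};

/* Open every event; a failing skippable event is silently dropped from
 * the output instead of aborting the whole run, while a failing
 * explicitly requested event remains a hard error. */
static int open_events(struct evsel *evsels, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!evsels[i].open_failed)
			evsels[i].counted = true;
		else if (!evsels[i].skippable)
			return -EACCES;
		/* else: skippable and failed, neither counted nor fatal */
	}
	return 0;
}
```

With this shape, `perf stat true` keeps the explicitly selectable
events and simply omits the TopdownL1 lines when their events cannot
be opened.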

Thanks,
Ian

> Thanks,
> Kan


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-19 16:51                                 ` Ian Rogers
@ 2023-04-19 18:57                                   ` Liang, Kan
  2023-04-20  0:23                                     ` Ian Rogers
  2023-04-20 11:33                                   ` Arnaldo Carvalho de Melo
  1 sibling, 1 reply; 29+ messages in thread
From: Liang, Kan @ 2023-04-19 18:57 UTC (permalink / raw)
  To: Ian Rogers, Stephane Eranian, Andi Kleen, Yasin, Ahmad, Taylor,
	Perry, Alt, Samantha, Biggers, Caleb, Wang, Weilin, Edward,
	Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Adrian Hunter, Florian Fischer,
	linux-perf-users, linux-kernel


On 2023-04-19 12:51 p.m., Ian Rogers wrote:
>>>>> With that change I don't have a case that requires skippable evsels,
>>>>> and so we can take that series with patch 6 over the v1 of that series
>>>>> with this change.
>>>>>
>>>>
>>>> I'm afraid this is not the only problem the commit 94b1a603fca7 ("perf
>>>> stat: Add TopdownL1 metric as a default if present") in the
>>>> perf-tools-next branch introduced.
>>>>
>>>> The topdown L2 in the perf stat default on SPR and big core of the ADL
>>>> is still missed. I don't see a possible fix for this on the current
>>>> perf-tools-next branch.
>>>
>>> I thought in its current state the json metrics for TopdownL2 on SPR
>>> have multiplexing. Given L1 is used to drill down to L2, it seems odd
>>> to start on L2, but given L1 is used to compute the thresholds for L2,
>>> this should be to have both L1 and L2 on these platforms. However,
>>> that doesn't work as you don't want multiplexing.
>>>
>>> This all seems backward to avoid potential multiplexing on branch miss
>>> rate and IPC, just always having TopdownL1 seems cleanest with the
>>> skippable evsels working around the permissions issue - as put forward
>>> in this patch. Possibly adding L2 metrics on ADL/SPR, but only once
>>> the multiplexing issue is resolved.
>>>
>>
>> No, not just that issue. Based to what I tested these days, perf stat
>> default has issues/regressions on most of the Intel platforms with the
>> current perf-tools-next and perf/core branch of acme's repo.
>>
>> For the pre-ICL platforms:
>> - The permission issue. (This patch tried to address.)
>> - Unclean perf stat default. (This patch failed to address.)
>>   Unnecessary multiplexing for cycles.
>>   Display partial of the TopdownL1
>>
>> https://lore.kernel.org/lkml/d1fe801a-22d0-1f9b-b127-227b21635bd5@linux.intel.com/
>>
>> For SPR platforms
>> - Topdown L2 metrics is missed, while it works with the current 6.3-rc7.
>>
>> For ADL/RPL platforms
>> - Segmentation fault which I just found this morning.
>> # ./perf stat true
>> Segmentation fault (core dumped)
> 
> This may also stem from the reference count checking work that Arnaldo
> is currently merging. It is hard to test hybrid because it uses
> non-generic code paths.

There are two places which cause the Segmentation fault.
One is the TopdownL1 code.

After I disable the TopdownL1 and add a !counter->name check as below,
there are no errors from ./perf stat true.

(The below is just for test purpose.)
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 7a641a67486d..8e12ed1141e0 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1892,7 +1892,7 @@ static int add_default_attributes(void)
                 * Add TopdownL1 metrics if they exist. To minimize
                 * multiplexing, don't request threshold computation.
                 */
-               if (metricgroup__has_metric("TopdownL1")) {
+               if (0 && metricgroup__has_metric("TopdownL1")) {
                        struct evlist *metric_evlist = evlist__new();
                        struct evsel *metric_evsel;

diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
index 6b46bbb3d322..072fa56744b4 100644
--- a/tools/perf/util/stat-display.c
+++ b/tools/perf/util/stat-display.c
@@ -747,7 +747,7 @@ static void uniquify_event_name(struct evsel *counter)
        int ret = 0;

        if (counter->uniquified_name || counter->use_config_name ||
-           !counter->pmu_name || !strncmp(counter->name, counter->pmu_name,
+           !counter->pmu_name || !counter->name ||
+           !strncmp(counter->name, counter->pmu_name,
                                           strlen(counter->pmu_name)))
                return;

> 
>> After the test on a hybrid machine, I incline to revert the commit
>> 94b1a603fca7 ("perf stat: Add TopdownL1 metric as a default if present")
>> and related patches for now.
>>
>> To clarify, I do not object a generic solution for the Topdown on
>> different ARCHs. But the current generic solution aka TopdownL1 has all
>> kinds of problems on most of Intel platforms. We should fix them first
>> before applying to the mainline.
> 
> No, 6.3 has many issues as do the default events/metrics:

To be honest, I don't think they are really critical issues. For the
first one, I think there was already a temporary fix. The others have
been there for years.

However, the solution you proposed in the huge patch set
(https://lore.kernel.org/lkml/20230219092848.639226-37-irogers@google.com/)
brings many critical issues on different Intel platforms: crashes,
missing features, etc.
Also, I was just told that many of our existing tools built on top of
the perf tool will also be broken, because all the annotations of the
kernel top-down metric events disappeared.

So we really should revert the patches. I don't think patches 39 to 51
are well-tested and reviewed.

Thanks,
Kan

>  - in 6.3 aggregation is wrong for metrics, specifically double
> counting happens as the summary gets applied to the saved value under
> certain conditions like repeat runs. This is solved in perf-tools-next
> by removing the duplicated aggregation and using a single set of
> counters. The "shadow" code is both simpler and more correct than in
> 6.3, reverting it would be a massive regression - tbh, if it wasn't
> for the size of the patches I think we should cherry-pick what is in
> perf-tools-next back to 6.3 due to the size of the errors.
>  - the default events for IPC and branch miss rate lack groups and so
> are inaccurate when multiplexing happens - but multiplexing inaccuracy
> has historically not been an issue for the default events/metrics.
>  - the previous default topdown metrics were inaccurate compared to
> the TMA metrics spreadsheet, specifically they lacked events to
> correct for errors and the thresholds differed. This is why TopdownL2
> is a challenge because the json metrics do things properly causing the
> event groups not to be trivially shared.
>  - the topdown event solution is not generic, it is not only Intel
> specific but Intel model specific. Solving the problem via the json is
> the most consistent way to do this and is what I suggest we move to
> in:
> https://lore.kernel.org/lkml/20230419005423.343862-6-irogers@google.com/
> but you can keep pushing that perf should do something special for
> Intel and for every Intel model. It also pushes a need into the metric
> code to do hybrid specific things so we can ignore atom TopdownL1. In
> this vein, hybrid is a giant mess that the code base needs cleaning up
> from; take the recent patch to just disable generic event parsing
> tests for MTL:
> https://lore.kernel.org/lkml/20230411094330.653965-1-tinghao.zhang@intel.com/
> We need to stop doing this, and have generic code, not least because
> of complaints like:
> https://www.kernel.org/doc/html/latest/gpu/i915.html#issues-hit-with-first-prototype-based-on-core-perf
> and because of the crash you are reporting on ADL/RPL.
> 
> What has motivated you here is a concern over multiplexing on branch
> miss rate and IPC on pre-Icelake architectures that I find hard to
> understand and is trivially remedied in time honored fashions, ie
> profile the metrics you care about separately. A far more egregious
> multiplexing problem is that on Icelake+ architectures we multiplex
> fixed and generic counters unnecessarily when the same fixed counter
> is needed or there are insufficient generic counters. This is
> egregious as there is no user workaround and the kernel fix, to
> iterate all hardware events the same way we do for all software
> events, is both obvious and trivial. We've lived with this
> multiplexing problem since Icelake and I think we can live with a
> possible multiplexing problem, on legacy architectures, with what is
> in perf-tools-next. The permissions problem tackled here means we
> should merge this.
> 
> Thanks,
> Ian
> 
>> Thanks,
>> Kan


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-19 18:57                                   ` Liang, Kan
@ 2023-04-20  0:23                                     ` Ian Rogers
  2023-04-20 13:02                                       ` Liang, Kan
  0 siblings, 1 reply; 29+ messages in thread
From: Ian Rogers @ 2023-04-20  0:23 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Stephane Eranian, Andi Kleen, Yasin, Ahmad, Taylor, Perry, Alt,
	Samantha, Biggers, Caleb, Wang, Weilin, Edward,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel

On Wed, Apr 19, 2023 at 11:57 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
> On 2023-04-19 12:51 p.m., Ian Rogers wrote:
> >>>>> With that change I don't have a case that requires skippable evsels,
> >>>>> and so we can take that series with patch 6 over the v1 of that series
> >>>>> with this change.
> >>>>>
> >>>>
> >>>> I'm afraid this is not the only problem the commit 94b1a603fca7 ("perf
> >>>> stat: Add TopdownL1 metric as a default if present") in the
> >>>> perf-tools-next branch introduced.
> >>>>
> >>>> The topdown L2 in the perf stat default on SPR and big core of the ADL
> >>>> is still missed. I don't see a possible fix for this on the current
> >>>> perf-tools-next branch.
> >>>
> >>> I thought in its current state the json metrics for TopdownL2 on SPR
> >>> have multiplexing. Given L1 is used to drill down to L2, it seems odd
> >>> to start on L2, but given L1 is used to compute the thresholds for L2,
> >>> this should be to have both L1 and L2 on these platforms. However,
> >>> that doesn't work as you don't want multiplexing.
> >>>
> >>> This all seems backward to avoid potential multiplexing on branch miss
> >>> rate and IPC, just always having TopdownL1 seems cleanest with the
> >>> skippable evsels working around the permissions issue - as put forward
> >>> in this patch. Possibly adding L2 metrics on ADL/SPR, but only once
> >>> the multiplexing issue is resolved.
> >>>
> >>
> >> No, not just that issue. Based to what I tested these days, perf stat
> >> default has issues/regressions on most of the Intel platforms with the
> >> current perf-tools-next and perf/core branch of acme's repo.
> >>
> >> For the pre-ICL platforms:
> >> - The permission issue. (This patch tried to address.)
> >> - Unclean perf stat default. (This patch failed to address.)
> >>   Unnecessary multiplexing for cycles.
> >>   Display partial of the TopdownL1
> >>
> >> https://lore.kernel.org/lkml/d1fe801a-22d0-1f9b-b127-227b21635bd5@linux.intel.com/
> >>
> >> For SPR platforms
> >> - Topdown L2 metrics is missed, while it works with the current 6.3-rc7.
> >>
> >> For ADL/RPL platforms
> >> - Segmentation fault which I just found this morning.
> >> # ./perf stat true
> >> Segmentation fault (core dumped)
> >
> > This may also stem from the reference count checking work that Arnaldo
> > is currently merging. It is hard to test hybrid because it uses
> > non-generic code paths.
>
> There are two places which causes the Segmentation fault.
> One place is the TopdownL1.
>
> After I disable the TopdownL1 and add !counter->name as below, there are
> no errors for the ./perf stat true.
>
> (The below is just for test purpose.)
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 7a641a67486d..8e12ed1141e0 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -1892,7 +1892,7 @@ static int add_default_attributes(void)
>                  * Add TopdownL1 metrics if they exist. To minimize
>                  * multiplexing, don't request threshold computation.
>                  */
> -               if (metricgroup__has_metric("TopdownL1")) {
> +               if (0 && metricgroup__has_metric("TopdownL1")) {

So hybrid does something different that causes this. Can you provide
information to help solve it?

>                         struct evlist *metric_evlist = evlist__new();
>                         struct evsel *metric_evsel;
>
> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> index 6b46bbb3d322..072fa56744b4 100644
> --- a/tools/perf/util/stat-display.c
> +++ b/tools/perf/util/stat-display.c
> @@ -747,7 +747,7 @@ static void uniquify_event_name(struct evsel *counter)
>         int ret = 0;
>
>         if (counter->uniquified_name || counter->use_config_name ||
> -           !counter->pmu_name || !strncmp(counter->name, counter->pmu_name,
> +           !counter->pmu_name || !counter->name ||
> +           !strncmp(counter->name, counter->pmu_name,
>                                            strlen(counter->pmu_name)))
>                 return;

Is this a pre-existing hybrid bug? It is a real shame hybrid shares so
few common code paths. In general evsel__name() should be preferred
over directly accessing name.
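
A NULL-safe accessor in that spirit, as an illustrative sketch only
(simplified: the real evsel__name() builds a name from the event
attributes when none is set, and the helper names below are invented):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct evsel {
	const char *name;     /* may be NULL before a name is assigned */
	const char *pmu_name;
};

/* NULL-safe accessor: callers never have to check for a missing name
 * themselves, so a NULL name cannot reach strncmp() and crash. */
static const char *evsel_name(const struct evsel *evsel)
{
	return evsel->name ? evsel->name : "unknown";
}

/* The uniquify_event_name() guard rewritten on top of the accessor.
 * Returns true when the event name should be left as-is. */
static bool skip_uniquify(const struct evsel *c)
{
	return !c->pmu_name ||
	       !strncmp(evsel_name(c), c->pmu_name, strlen(c->pmu_name));
}
```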

> >
> >> After the test on a hybrid machine, I incline to revert the commit
> >> 94b1a603fca7 ("perf stat: Add TopdownL1 metric as a default if present")
> >> and related patches for now.
> >>
> >> To clarify, I do not object a generic solution for the Topdown on
> >> different ARCHs. But the current generic solution aka TopdownL1 has all
> >> kinds of problems on most of Intel platforms. We should fix them first
> >> before applying to the mainline.
> >
> > No, 6.3 has many issues as do the default events/metrics:
>
> To be honest, I don't think they are real critical issues. For the first
> one, I think there was already a temporary fix. For the others, they are
> there for years.

This isn't true. The aggregation bug was raised to me by Intel and
stems from aggregation refactoring in per-thread mode done by
Namhyung.

> However, the solution you proposed in the huge patch set
> (https://lore.kernel.org/lkml/20230219092848.639226-37-irogers@google.com/)
> brings many critical issues on different Intel platforms, crashes,
> Missing features, etc.
> Also, I was just told that many of our existing tools which on top of
> the perf tool will also be broken, because all the annotations of the
> kernel top-down metrics event disappeared.
>
> So we really should revert the patches. I don't think patches 39 to 51
> are well-tested and reviewed.

The only issue I'm aware of is that hard-coded uses of the inaccurate
built-in metrics now need to switch to the json metrics. This seems a
worthwhile update anyway, and not one that would justify breaking perf
stat metrics.

Thanks,
Ian

> Thanks,
> Kan
>
> >  - in 6.3 aggregation is wrong for metrics, specifically double
> > counting happens as the summary gets applied to the saved value under
> > certain conditions like repeat runs. This is solved in perf-tools-next
> > by removing the duplicated aggregation and using a single set of
> > counters. The "shadow" code is both simpler and more correct than in
> > 6.3, reverting it would be a massive regression - tbh, if it wasn't
> > for the size of the patches I think we should cherry-pick what is in
> > perf-tools-next back to 6.3 due to the size of the errors.
> >  - the default events for IPC and branch miss rate lack groups and so
> > are inaccurate when multiplexing happens - but multiplexing inaccuracy
> > has historically not been an issue for the default events/metrics.
> >  - the previous default topdown metrics were inaccurate compared to
> > the TMA metrics spreadsheet, specifically they lacked events to
> > correct for errors and the thresholds differed. This is why TopdownL2
> > is a challenge because the json metrics do things properly causing the
> > event groups not to be trivially shared.
> >  - the topdown event solution is not generic, it is not only Intel
> > specific but Intel model specific. Solving the problem via the json is
> > the most consistent way to do this and is what I suggest we move to
> > in:
> > https://lore.kernel.org/lkml/20230419005423.343862-6-irogers@google.com/
> > but you can keep pushing that perf should do something special for
> > Intel and for every Intel model. It also pushes a need into the metric
> > code to do hybrid specific things so we can ignore atom TopdownL1. In
> > this vein, hybrid is a giant mess that the code base needs cleaning up
> > from; take the recent patch to just disable generic event parsing
> > tests for MTL:
> > https://lore.kernel.org/lkml/20230411094330.653965-1-tinghao.zhang@intel.com/
> > We need to stop doing this, and have generic code, not least because
> > of complaints like:
> > https://www.kernel.org/doc/html/latest/gpu/i915.html#issues-hit-with-first-prototype-based-on-core-perf
> > and because of the crash you are reporting on ADL/RPL.
> >
> > What has motivated you here is a concern over multiplexing on branch
> > miss rate and IPC on pre-Icelake architectures that I find hard to
> > understand and is trivially remedied in time honored fashions, ie
> > profile the metrics you care about separately. A far more egregious
> > multiplexing problem is that on Icelake+ architectures we multiplex
> > fixed and generic counters unnecessarily when the same fixed counter
> > is needed or there are insufficient generic counters. This is
> > egregious as there is no user workaround and the kernel fix, to
> > iterate all hardware events the same way we do for all software
> > events, is both obvious and trivial. We've lived with this
> > multiplexing problem since Icelake and I think we can live with a
> > possible multiplexing problem, on legacy architectures, with what is
> > in perf-tools-next. The permissions problem tackled here means we
> > should merge this.
> >
> > Thanks,
> > Ian
> >
> >> Thanks,
> >> Kan


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-19 16:51                                 ` Ian Rogers
  2023-04-19 18:57                                   ` Liang, Kan
@ 2023-04-20 11:33                                   ` Arnaldo Carvalho de Melo
  2023-04-20 12:22                                     ` Liang, Kan
  1 sibling, 1 reply; 29+ messages in thread
From: Arnaldo Carvalho de Melo @ 2023-04-20 11:33 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Liang, Kan, Stephane Eranian, Andi Kleen, Yasin, Ahmad, Taylor,
	Perry, Alt, Samantha, Biggers, Caleb, Wang, Weilin, Edward,
	Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Adrian Hunter, Florian Fischer,
	linux-perf-users, linux-kernel

Em Wed, Apr 19, 2023 at 09:51:20AM -0700, Ian Rogers escreveu:
> On Wed, Apr 19, 2023 at 7:16 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> > For ADL/RPL platforms
> > - Segmentation fault which I just found this morning.
> > # ./perf stat true
> > Segmentation fault (core dumped)
 
> This may also stem from the reference count checking work that Arnaldo
> is currently merging. It is hard to test hybrid because it uses
> non-generic code paths.

Hey, could you please try this under gdb and provide a backtrace? It
may indeed be related to the refcount checking work; there was a bug
fixed by the ARM guys for cs-etm, and I combed through and fixed some
other use-before-NULL-check cases, but maybe one more slipped through.
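
For background, the idea behind the refcount checking can be sketched
roughly like this. This is a simplified illustration, not the actual
perf implementation, and all names here are invented: objects hold a
pointer to a separately allocated count, and the pointer is poisoned on
the final put so misuse trips an assert instead of corrupting memory.

```c
#include <assert.h>
#include <stdlib.h>

struct refcount {
	int count;
};

struct object {
	struct refcount *ref; /* NULL once the final reference is dropped */
	int payload;
};

static struct object *object_new(int payload)
{
	struct object *o = malloc(sizeof(*o));

	o->ref = malloc(sizeof(*o->ref));
	o->ref->count = 1;
	o->payload = payload;
	return o;
}

static struct object *object_get(struct object *o)
{
	assert(o->ref && "get after final put");
	o->ref->count++;
	return o;
}

static void object_put(struct object *o)
{
	assert(o->ref && "double put");
	if (--o->ref->count == 0) {
		free(o->ref);
		o->ref = NULL; /* poison; this sketch leaks o on purpose
				* so a buggy later get/put asserts rather
				* than becoming a use-after-free */
	}
}
```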

Here I couldn't reproduce it, but I don't have an Intel hybrid system;
I will check with an ARM one, though I'm unsure if it will exercise the
same code paths...

- Arnaldo


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-20 11:33                                   ` Arnaldo Carvalho de Melo
@ 2023-04-20 12:22                                     ` Liang, Kan
  0 siblings, 0 replies; 29+ messages in thread
From: Liang, Kan @ 2023-04-20 12:22 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Ian Rogers
  Cc: Stephane Eranian, Andi Kleen, Yasin, Ahmad, Taylor, Perry, Alt,
	Samantha, Biggers, Caleb, Wang, Weilin, Edward, Peter Zijlstra,
	Ingo Molnar, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Adrian Hunter, Florian Fischer, linux-perf-users,
	linux-kernel



On 2023-04-20 7:33 a.m., Arnaldo Carvalho de Melo wrote:
> Em Wed, Apr 19, 2023 at 09:51:20AM -0700, Ian Rogers escreveu:
>> On Wed, Apr 19, 2023 at 7:16 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>> For ADL/RPL platforms
>>> - Segmentation fault which I just found this morning.
>>> # ./perf stat true
>>> Segmentation fault (core dumped)
>  
>> This may also stem from the reference count checking work that Arnaldo
>> is currently merging. It is hard to test hybrid because it uses
>> non-generic code paths.
> 
> Hey, could you please try this under gdb and provide a backtrace? 

Here is the backtrace.

#0  get_group_fd (thread=0, cpu_map_idx=<optimized out>,
evsel=0x555556015af0) at util/evsel.c:1722
#1  evsel__open_cpu (evsel=<optimized out>, cpus=<optimized out>,
threads=<optimized out>,
    start_cpu_map_idx=<optimized out>, end_cpu_map_idx=<optimized out>)
at util/evsel.c:2105
#2  0x000055555561dd9e in __run_perf_stat (run_idx=<optimized out>,
argv=0x7fffffffe1d0, argc=1)
    at builtin-stat.c:734
#3  run_perf_stat (run_idx=<optimized out>, argv=0x7fffffffe1d0, argc=1)
at builtin-stat.c:949
#4  cmd_stat (argc=1, argv=0x7fffffffe1d0) at builtin-stat.c:2537
#5  0x00005555556b56a0 in run_builtin (p=p@entry=0x555555f84450
<commands+336>, argc=argc@entry=2,
    argv=argv@entry=0x7fffffffe1d0) at perf.c:323
#6  0x00005555555fe2d9 in handle_internal_command (argv=0x7fffffffe1d0,
argc=2) at perf.c:377
#7  run_argv (argv=<synthetic pointer>, argcp=<synthetic pointer>) at
perf.c:421
#8  main (argc=2, argv=0x7fffffffe1d0) at perf.c:537

Thanks,
Kan

> It may
> indeed be related to this refcount checking work, there was a bug fixed
> by the ARM guys for cs-etm and I combed thru and fixed some other use
> before check for NULL cases, maybe one more slipped up.
> 
> Here I couldn't reproduce, but I don't have a Intel hybrid system, will
> check with an ARM, but unsure if it will exercise the same code paths...
> 
> - Arnaldo


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-20  0:23                                     ` Ian Rogers
@ 2023-04-20 13:02                                       ` Liang, Kan
  2023-04-21  0:19                                         ` Ian Rogers
  0 siblings, 1 reply; 29+ messages in thread
From: Liang, Kan @ 2023-04-20 13:02 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Stephane Eranian, Andi Kleen, Yasin, Ahmad, Taylor, Perry, Alt,
	Samantha, Biggers, Caleb, Wang, Weilin, Edward,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel



On 2023-04-19 8:23 p.m., Ian Rogers wrote:
> On Wed, Apr 19, 2023 at 11:57 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>>
>>
>> On 2023-04-19 12:51 p.m., Ian Rogers wrote:
>>>>>>> With that change I don't have a case that requires skippable evsels,
>>>>>>> and so we can take that series with patch 6 over the v1 of that series
>>>>>>> with this change.
>>>>>>>
>>>>>>
>>>>>> I'm afraid this is not the only problem the commit 94b1a603fca7 ("perf
>>>>>> stat: Add TopdownL1 metric as a default if present") in the
>>>>>> perf-tools-next branch introduced.
>>>>>>
>>>>>> The topdown L2 in the perf stat default on SPR and big core of the ADL
>>>>>> is still missed. I don't see a possible fix for this on the current
>>>>>> perf-tools-next branch.
>>>>>
>>>>> I thought in its current state the json metrics for TopdownL2 on SPR
>>>>> have multiplexing. Given L1 is used to drill down to L2, it seems odd
>>>>> to start on L2, but given L1 is used to compute the thresholds for L2,
>>>>> this should be to have both L1 and L2 on these platforms. However,
>>>>> that doesn't work as you don't want multiplexing.
>>>>>
>>>>> This all seems backward to avoid potential multiplexing on branch miss
>>>>> rate and IPC, just always having TopdownL1 seems cleanest with the
>>>>> skippable evsels working around the permissions issue - as put forward
>>>>> in this patch. Possibly adding L2 metrics on ADL/SPR, but only once
>>>>> the multiplexing issue is resolved.
>>>>>
>>>>
>>>> No, not just that issue. Based to what I tested these days, perf stat
>>>> default has issues/regressions on most of the Intel platforms with the
>>>> current perf-tools-next and perf/core branch of acme's repo.
>>>>
>>>> For the pre-ICL platforms:
>>>> - The permission issue. (This patch tried to address.)
>>>> - Unclean perf stat default. (This patch failed to address.)
>>>>   Unnecessary multiplexing for cycles.
>>>>   Display partial of the TopdownL1
>>>>
>>>> https://lore.kernel.org/lkml/d1fe801a-22d0-1f9b-b127-227b21635bd5@linux.intel.com/
>>>>
>>>> For SPR platforms
>>>> - Topdown L2 metrics is missed, while it works with the current 6.3-rc7.
>>>>
>>>> For ADL/RPL platforms
>>>> - Segmentation fault which I just found this morning.
>>>> # ./perf stat true
>>>> Segmentation fault (core dumped)
>>>
>>> This may also stem from the reference count checking work that Arnaldo
>>> is currently merging. It is hard to test hybrid because it uses
>>> non-generic code paths.
>>
>> There are two places which causes the Segmentation fault.
>> One place is the TopdownL1.
>>
>> After I disable the TopdownL1 and add !counter->name as below, there are
>> no errors for the ./perf stat true.
>>
>> (The below is just for test purpose.)
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index 7a641a67486d..8e12ed1141e0 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -1892,7 +1892,7 @@ static int add_default_attributes(void)
>>                  * Add TopdownL1 metrics if they exist. To minimize
>>                  * multiplexing, don't request threshold computation.
>>                  */
>> -               if (metricgroup__has_metric("TopdownL1")) {
>> +               if (0 && metricgroup__has_metric("TopdownL1")) {
> 
> So hybrid has something different that causes this. Can you provide
> the information to solve?

Here is the backtrace.

#0  get_group_fd (thread=0, cpu_map_idx=<optimized out>,
evsel=0x555556015af0) at util/evsel.c:1722
#1  evsel__open_cpu (evsel=<optimized out>, cpus=<optimized out>,
threads=<optimized out>,
    start_cpu_map_idx=<optimized out>, end_cpu_map_idx=<optimized out>)
at util/evsel.c:2105
#2  0x000055555561dd9e in __run_perf_stat (run_idx=<optimized out>,
argv=0x7fffffffe1d0, argc=1)
    at builtin-stat.c:734
#3  run_perf_stat (run_idx=<optimized out>, argv=0x7fffffffe1d0, argc=1)
at builtin-stat.c:949
#4  cmd_stat (argc=1, argv=0x7fffffffe1d0) at builtin-stat.c:2537
#5  0x00005555556b56a0 in run_builtin (p=p@entry=0x555555f84450
<commands+336>, argc=argc@entry=2,
    argv=argv@entry=0x7fffffffe1d0) at perf.c:323
#6  0x00005555555fe2d9 in handle_internal_command (argv=0x7fffffffe1d0,
argc=2) at perf.c:377
#7  run_argv (argv=<synthetic pointer>, argcp=<synthetic pointer>) at
perf.c:421
#8  main (argc=2, argv=0x7fffffffe1d0) at perf.c:537

> 
>>                         struct evlist *metric_evlist = evlist__new();
>>                         struct evsel *metric_evsel;
>>
>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
>> index 6b46bbb3d322..072fa56744b4 100644
>> --- a/tools/perf/util/stat-display.c
>> +++ b/tools/perf/util/stat-display.c
>> @@ -747,7 +747,7 @@ static void uniquify_event_name(struct evsel *counter)
>>         int ret = 0;
>>
>>         if (counter->uniquified_name || counter->use_config_name ||
>> -           !counter->pmu_name || !strncmp(counter->name, counter->pmu_name,
>> +           !counter->pmu_name || !counter->name ||
>> !strncmp(counter->name, counter->pmu_name,
>>                                            strlen(counter->pmu_name)))
>>                 return;
> 
> Is this a pre-existing hybrid bug? It is a real shame hybrid shows so
> few common code paths. In general evsel__name should be preferred over
> directly accessing name.


I don't think so.

I haven't dug into the bug yet, but from the source code I can tell
that the check is the same as in the current 6.3-rc7.

With the current 6.3-rc7, perf stat true works, and
perf stat -M TopdownL1 --metric-no-group works as well.

But with the current perf-tools-next branch, perf stat true gives a
Segmentation fault.

The TopdownL1 doesn't work either.

# ./perf stat -M TopdownL1 --metric-no-group
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument)
for event (topdown-retiring).
/bin/dmesg | grep -i perf may provide additional information.
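The extra !counter->name check in the diff above guards a strncmp() that would otherwise dereference a NULL event name. Below is a minimal sketch of that guard, using hypothetical stand-in types rather than the real perf structs (in the real perf code, evsel__name() is the accessor that never returns NULL):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the evsel fields uniquify_event_name() touches. */
struct toy_counter {
	const char *name;	/* may be NULL before a name is assigned */
	const char *pmu_name;	/* may be NULL for software events */
};

/* Returns 1 when uniquification should be skipped, mirroring the early
 * return in the patched check; never dereferences a NULL name. */
static int toy_skip_uniquify(const struct toy_counter *c)
{
	if (c->pmu_name == NULL || c->name == NULL ||
	    strncmp(c->name, c->pmu_name, strlen(c->pmu_name)) == 0)
		return 1;
	return 0;
}
```

Without the NULL-name test, the strncmp() call is reached with name == NULL and segfaults, matching the behavior described above.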



> 
>>>
>>>> After the test on a hybrid machine, I incline to revert the commit
>>>> 94b1a603fca7 ("perf stat: Add TopdownL1 metric as a default if present")
>>>> and related patches for now.
>>>>
>>>> To clarify, I do not object a generic solution for the Topdown on
>>>> different ARCHs. But the current generic solution aka TopdownL1 has all
>>>> kinds of problems on most of Intel platforms. We should fix them first
>>>> before applying to the mainline.
>>>
>>> No, 6.3 has many issues as do the default events/metrics:
>>
>> To be honest, I don't think they are real critical issues. For the first
>> one, I think there was already a temporary fix. For the others, they are
>> there for years.
> 
> This isn't true. The aggregation bug was raised to me by Intel and
> stems from aggregation refactoring in per-thread mode done by
> Namhyung.
> 
>> However, the solution you proposed in the huge patch set
>> (https://lore.kernel.org/lkml/20230219092848.639226-37-irogers@google.com/)
>> brings many critical issues on different Intel platforms, crashes,
>> Missing features, etc.
>> Also, I was just told that many of our existing tools which on top of
>> the perf tool will also be broken, because all the annotations of the
>> kernel top-down metrics event disappeared.
>>
>> So we really should revert the patches. I don't think patches 39 to 51
>> are well-tested and reviewed.
> 
> The only issue I'm aware of is that hard coded use of the inaccurate
> hard coded metrics now needs to switch to json metrics. This seems a
> worthwhile update anyway, and not one that would justify breaking perf
> stat metrics.
> 

I have pointed out many issues (crashes, missing features, user-visible
changes which break existing tools) in previous replies, so I will not
repeat the details here.

The json metrics in the current perf-tools-next branch, which you want
to switch to, also have many bugs based on my recent tests:
- TopdownL2 triggers unnecessary multiplexing on SPR.
- They don't work on hybrid platforms, while the current 6.3-rc7 does.

The difference between the hard coded kernel metrics and the json
metrics is that the json metrics just collect two more events, e.g., on
Tigerlake. The core metrics are exactly the same, and they are good
enough for an initial judgement.

Given all of the above issues, I don't think the switch is a good
choice, or even necessary.

Thanks,
Kan


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-20 13:02                                       ` Liang, Kan
@ 2023-04-21  0:19                                         ` Ian Rogers
  2023-04-21 13:32                                           ` Liang, Kan
  0 siblings, 1 reply; 29+ messages in thread
From: Ian Rogers @ 2023-04-21  0:19 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Stephane Eranian, Andi Kleen, Yasin, Ahmad, Taylor, Perry, Alt,
	Samantha, Biggers, Caleb, Wang, Weilin, Edward,
	Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Adrian Hunter, Florian Fischer, linux-perf-users, linux-kernel

On Thu, Apr 20, 2023 at 6:04 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-04-19 8:23 p.m., Ian Rogers wrote:
> > On Wed, Apr 19, 2023 at 11:57 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
> >>
> >>
> >> On 2023-04-19 12:51 p.m., Ian Rogers wrote:
> >>>>>>> With that change I don't have a case that requires skippable evsels,
> >>>>>>> and so we can take that series with patch 6 over the v1 of that series
> >>>>>>> with this change.
> >>>>>>>
> >>>>>>
> >>>>>> I'm afraid this is not the only problem the commit 94b1a603fca7 ("perf
> >>>>>> stat: Add TopdownL1 metric as a default if present") in the
> >>>>>> perf-tools-next branch introduced.
> >>>>>>
> >>>>>> The topdown L2 in the perf stat default on SPR and big core of the ADL
> >>>>>> is still missed. I don't see a possible fix for this on the current
> >>>>>> perf-tools-next branch.
> >>>>>
> >>>>> I thought in its current state the json metrics for TopdownL2 on SPR
> >>>>> have multiplexing. Given L1 is used to drill down to L2, it seems odd
> >>>>> to start on L2, but given L1 is used to compute the thresholds for L2,
> >>>>> this should be to have both L1 and L2 on these platforms. However,
> >>>>> that doesn't work as you don't want multiplexing.
> >>>>>
> >>>>> This all seems backward to avoid potential multiplexing on branch miss
> >>>>> rate and IPC, just always having TopdownL1 seems cleanest with the
> >>>>> skippable evsels working around the permissions issue - as put forward
> >>>>> in this patch. Possibly adding L2 metrics on ADL/SPR, but only once
> >>>>> the multiplexing issue is resolved.
> >>>>>
> >>>>
> >>>> No, not just that issue. Based to what I tested these days, perf stat
> >>>> default has issues/regressions on most of the Intel platforms with the
> >>>> current perf-tools-next and perf/core branch of acme's repo.
> >>>>
> >>>> For the pre-ICL platforms:
> >>>> - The permission issue. (This patch tried to address.)
> >>>> - Unclean perf stat default. (This patch failed to address.)
> >>>>   Unnecessary multiplexing for cycles.
> >>>>   Display partial of the TopdownL1
> >>>>
> >>>> https://lore.kernel.org/lkml/d1fe801a-22d0-1f9b-b127-227b21635bd5@linux.intel.com/
> >>>>
> >>>> For SPR platforms
> >>>> - Topdown L2 metrics is missed, while it works with the current 6.3-rc7.
> >>>>
> >>>> For ADL/RPL platforms
> >>>> - Segmentation fault which I just found this morning.
> >>>> # ./perf stat true
> >>>> Segmentation fault (core dumped)
> >>>
> >>> This may also stem from the reference count checking work that Arnaldo
> >>> is currently merging. It is hard to test hybrid because it uses
> >>> non-generic code paths.
> >>
> >> There are two places which causes the Segmentation fault.
> >> One place is the TopdownL1.
> >>
> >> After I disable the TopdownL1 and add !counter->name as below, there are
> >> no errors for the ./perf stat true.
> >>
> >> (The below is just for test purpose.)
> >> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> >> index 7a641a67486d..8e12ed1141e0 100644
> >> --- a/tools/perf/builtin-stat.c
> >> +++ b/tools/perf/builtin-stat.c
> >> @@ -1892,7 +1892,7 @@ static int add_default_attributes(void)
> >>                  * Add TopdownL1 metrics if they exist. To minimize
> >>                  * multiplexing, don't request threshold computation.
> >>                  */
> >> -               if (metricgroup__has_metric("TopdownL1")) {
> >> +               if (0 && metricgroup__has_metric("TopdownL1")) {
> >
> > So hybrid has something different that causes this. Can you provide
> > the information to solve?
>
> Here is the backtrace.
>
> #0  get_group_fd (thread=0, cpu_map_idx=<optimized out>,
> evsel=0x555556015af0) at util/evsel.c:1722
> #1  evsel__open_cpu (evsel=<optimized out>, cpus=<optimized out>,
> threads=<optimized out>,
>     start_cpu_map_idx=<optimized out>, end_cpu_map_idx=<optimized out>)
> at util/evsel.c:2105
> #2  0x000055555561dd9e in __run_perf_stat (run_idx=<optimized out>,
> argv=0x7fffffffe1d0, argc=1)
>     at builtin-stat.c:734
> #3  run_perf_stat (run_idx=<optimized out>, argv=0x7fffffffe1d0, argc=1)
> at builtin-stat.c:949
> #4  cmd_stat (argc=1, argv=0x7fffffffe1d0) at builtin-stat.c:2537
> #5  0x00005555556b56a0 in run_builtin (p=p@entry=0x555555f84450
> <commands+336>, argc=argc@entry=2,
>     argv=argv@entry=0x7fffffffe1d0) at perf.c:323
> #6  0x00005555555fe2d9 in handle_internal_command (argv=0x7fffffffe1d0,
> argc=2) at perf.c:377
> #7  run_argv (argv=<synthetic pointer>, argcp=<synthetic pointer>) at
> perf.c:421
> #8  main (argc=2, argv=0x7fffffffe1d0) at perf.c:537
>
> >
> >>                         struct evlist *metric_evlist = evlist__new();
> >>                         struct evsel *metric_evsel;
> >>
> >> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> >> index 6b46bbb3d322..072fa56744b4 100644
> >> --- a/tools/perf/util/stat-display.c
> >> +++ b/tools/perf/util/stat-display.c
> >> @@ -747,7 +747,7 @@ static void uniquify_event_name(struct evsel *counter)
> >>         int ret = 0;
> >>
> >>         if (counter->uniquified_name || counter->use_config_name ||
> >> -           !counter->pmu_name || !strncmp(counter->name, counter->pmu_name,
> >> +           !counter->pmu_name || !counter->name ||
> >> !strncmp(counter->name, counter->pmu_name,
> >>                                            strlen(counter->pmu_name)))
> >>                 return;
> >
> > Is this a pre-existing hybrid bug? It is a real shame hybrid shows so
> > few common code paths. In general evsel__name should be preferred over
> > directly accessing name.
>
>
> I don't think so.
>
> I haven't dig into the bug yet. But from the source code I can tell that
> the check is the same as the current 6.3-rc7.
>
> For the current 6.3-rc7, perf stat true works.
> The perf stat -M TopdownL1 --metric-no-group can work as well.
>
> But with the current perf-tools-next branch, perf stat true gives a
> Segmentation fault.
>
> The TopdownL1 doesn't work either.
>
> # ./perf stat -M TopdownL1 --metric-no-group
> Error:
> The sys_perf_event_open() syscall returned with 22 (Invalid argument)
> for event (topdown-retiring).
> /bin/dmesg | grep -i perf may provide additional information.

I see hybrid failing basic sanity tests both for 6.3 and in
perf-tools-next. For metrics I see:

```
$ git status
...
Your branch is up to date with 'linus/master'
...
$ git describe
v6.3-rc7-139-gb7bc77e2f2c7
$ sudo perf stat -M TopdownL1 -a sleep 1
WARNING: events in group from different hybrid PMUs!
WARNING: grouped events cpus do not match, disabling group:
 anon group { topdown-retiring, topdown-retiring,
INT_MISC.UOP_DROPPING, topdown-fe-bound, topdown-fe-bound,
CPU_CLK_UNHALTED.CORE, topdown-be-bound, topdown-be-bound,
topdown-bad-spec, topdown-bad-spec }
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument)
for event (topdown-retiring).
/bin/dmesg | grep -i perf may provide additional information.
```
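The warning above stems from grouping events that resolve to different hybrid PMUs (the same topdown event name exists on both cpu_core and cpu_atom). Below is a minimal sketch of the validity check involved, with illustrative names rather than perf's real evlist code:

```c
#include <assert.h>
#include <string.h>

/* A group is only schedulable if all members target the same PMU (and
 * hence the same cpumask). On hybrid, "topdown-retiring" exists on both
 * cpu_core and cpu_atom, so a naive group mixes PMUs and has to be
 * broken up, as the warning above reports. Toy representation only. */

/* Returns 1 if every event in the group lives on the same PMU. */
static int toy_group_same_pmu(const char *const pmus[], int n)
{
	for (int i = 1; i < n; i++) {
		if (strcmp(pmus[i], pmus[0]) != 0)
			return 0;
	}
	return 1;
}
```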

It seems perf on hybrid is quite broken in 6.3, but I doubt we can fix
6.3 given the late stage of the release cycle. As perf-tools-next
enables TopdownL1 metrics when no events or metrics are specified and
the metric group is present, on hybrid this exposes the pre-existing
bug in the no events/metrics case. I suspect this is the cause of the
crashes you see, though I'm seeing assertion failures and similar
because I'm using a debug build.

I'm looking into fixing perf-tools-next and not 6.3. Maybe there will
be something we can cherry-pick back to fix up 6.3. It hasn't been
easy to find hardware to test on, and if the machine I'm remotely
using falls over then I have no means to test, so fingers crossed.

Thanks,
Ian

> >
> >>>
> >>>> After the test on a hybrid machine, I incline to revert the commit
> >>>> 94b1a603fca7 ("perf stat: Add TopdownL1 metric as a default if present")
> >>>> and related patches for now.
> >>>>
> >>>> To clarify, I do not object a generic solution for the Topdown on
> >>>> different ARCHs. But the current generic solution aka TopdownL1 has all
> >>>> kinds of problems on most of Intel platforms. We should fix them first
> >>>> before applying to the mainline.
> >>>
> >>> No, 6.3 has many issues as do the default events/metrics:
> >>
> >> To be honest, I don't think they are real critical issues. For the first
> >> one, I think there was already a temporary fix. For the others, they are
> >> there for years.
> >
> > This isn't true. The aggregation bug was raised to me by Intel and
> > stems from aggregation refactoring in per-thread mode done by
> > Namhyung.
> >
> >> However, the solution you proposed in the huge patch set
> >> (https://lore.kernel.org/lkml/20230219092848.639226-37-irogers@google.com/)
> >> brings many critical issues on different Intel platforms, crashes,
> >> Missing features, etc.
> >> Also, I was just told that many of our existing tools which on top of
> >> the perf tool will also be broken, because all the annotations of the
> >> kernel top-down metrics event disappeared.
> >>
> >> So we really should revert the patches. I don't think patches 39 to 51
> >> are well-tested and reviewed.
> >
> > The only issue I'm aware of is that hard coded use of the inaccurate
> > hard coded metrics now needs to switch to json metrics. This seems a
> > worthwhile update anyway, and not one that would justify breaking perf
> > stat metrics.
> >
>
> I have pointed out many issues (crash, missing features, user-visible
> changes which breaking existing tools) many times in the previous reply,
> I will not repeat the details again here.
>
> The json metrics in the current perf-tools-next branch which you want to
> switch to also has many bugs based on my recent test.
> - The TopdownL2 triggers unnecessary multiplexing on SPR.
> - Doesn't work on hybrid platforms while the current 6.3-rc7 can.
>
> The difference between the hard coded kernel metrics and the json
> metrics is that the json metrics just collect two more events, e.g., on
> Tigerlake. The core metrics is exactly the same. It's good enough for a
> initial judgement with core metrics.
>
> I don't think the switching is a good choice and necessary based on all
> of the above issues.
>
> Thanks,
> Kan


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-21  0:19                                         ` Ian Rogers
@ 2023-04-21 13:32                                           ` Liang, Kan
  2023-04-21 15:49                                             ` Ian Rogers
  2023-04-21 15:58                                             ` Ian Rogers
  0 siblings, 2 replies; 29+ messages in thread
From: Liang, Kan @ 2023-04-21 13:32 UTC (permalink / raw)
  To: Ian Rogers, Arnaldo Carvalho de Melo
  Cc: Stephane Eranian, Andi Kleen, Yasin, Ahmad, Taylor, Perry, Alt,
	Samantha, Biggers, Caleb, Wang, Weilin, Edward, Peter Zijlstra,
	Ingo Molnar, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Adrian Hunter, Florian Fischer, linux-perf-users,
	linux-kernel



On 2023-04-20 8:19 p.m., Ian Rogers wrote:
>>>>                         struct evlist *metric_evlist = evlist__new();
>>>>                         struct evsel *metric_evsel;
>>>>
>>>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
>>>> index 6b46bbb3d322..072fa56744b4 100644
>>>> --- a/tools/perf/util/stat-display.c
>>>> +++ b/tools/perf/util/stat-display.c
>>>> @@ -747,7 +747,7 @@ static void uniquify_event_name(struct evsel *counter)
>>>>         int ret = 0;
>>>>
>>>>         if (counter->uniquified_name || counter->use_config_name ||
>>>> -           !counter->pmu_name || !strncmp(counter->name, counter->pmu_name,
>>>> +           !counter->pmu_name || !counter->name ||
>>>> !strncmp(counter->name, counter->pmu_name,
>>>>                                            strlen(counter->pmu_name)))
>>>>                 return;
>>>
>>> Is this a pre-existing hybrid bug? It is a real shame hybrid shows so
>>> few common code paths. In general evsel__name should be preferred over
>>> directly accessing name.
>>
>>
>> I don't think so.
>>
>> I haven't dig into the bug yet. But from the source code I can tell that
>> the check is the same as the current 6.3-rc7.
>>
>> For the current 6.3-rc7, perf stat true works.
>> The perf stat -M TopdownL1 --metric-no-group can work as well.
>>
>> But with the current perf-tools-next branch, perf stat true gives a
>> Segmentation fault.
>>
>> The TopdownL1 doesn't work either.
>>
>> # ./perf stat -M TopdownL1 --metric-no-group
>> Error:
>> The sys_perf_event_open() syscall returned with 22 (Invalid argument)
>> for event (topdown-retiring).
>> /bin/dmesg | grep -i perf may provide additional information.
> 
> I see hybrid failing basic sanity tests both for 6.3 and in
> perf-tools-next. For metrics I see:
> 
> ```
> $ git status
> ...
> Your branch is up to date with 'linus/master'
> ...
> $ git describe
> v6.3-rc7-139-gb7bc77e2f2c7
> $ sudo perf stat -M TopdownL1 -a sleep 1

Try the --metric-no-group.


> WARNING: events in group from different hybrid PMUs!
> WARNING: grouped events cpus do not match, disabling group:
>  anon group { topdown-retiring, topdown-retiring,
> INT_MISC.UOP_DROPPING, topdown-fe-bound, topdown-fe-bound,
> CPU_CLK_UNHALTED.CORE, topdown-be-bound, topdown-be-bound,
> topdown-bad-spec, topdown-bad-spec }
> Error:
> The sys_perf_event_open() syscall returned with 22 (Invalid argument)
> for event (topdown-retiring).
> /bin/dmesg | grep -i perf may provide additional information.
> ```
> 
> It seems perf on hybrid is quite broken in 6.3, but I doubt we can fix
> 6.3 given the late stage of the release cycle. As perf-tools-next
> enables TopdownL1 metrics when no events or metric are specified and
> when the metric group is present, on hybrid this will cause the
> pre-existing bug to appear for the no events/metrics case. I suspect
> this is the cause of the crashes you see, but I'm seeing assertion
> failures and similar as I'm using a debug build.
> 
> I'm looking into fixing perf-tools-next and not 6.3. Maybe there will
> be something we can cherry-pick back to fix up 6.3. It hasn't been
> easy to find hardware to test on, and if the machine I'm remotely
> using falls over then I have no means to test, so fingers crossed.
> 

OK. So the json metric code is buggy on both 6.3 and perf-tools-next,
and I think it is even worse in perf-tools-next.
Besides the bugs, the json metrics also change the output layout of the
perf stat default. The tools built on top of perf are all impacted.

The question is why we are in such a rush to move the perf stat default
from the reliable kernel metrics to the json metrics?

Can we take a step back? Create a perf/json_metric or whatever branch,
fix all the issues thoroughly, and then merge it to the mainline.

I think the perf stat default is frequently used not only by newbies
but also by veterans, so this could have a big user-visible impact.

The 6.4 merge window is approaching. Can we at least revert the patches
for 6.4?

Arnaldo, what do you think?

Thanks,
Kan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-21 13:32                                           ` Liang, Kan
@ 2023-04-21 15:49                                             ` Ian Rogers
  2023-04-21 17:10                                               ` Liang, Kan
  2023-04-21 15:58                                             ` Ian Rogers
  1 sibling, 1 reply; 29+ messages in thread
From: Ian Rogers @ 2023-04-21 15:49 UTC (permalink / raw)
  To: Liang, Kan, Yasin, Ahmad
  Cc: Arnaldo Carvalho de Melo, Stephane Eranian, Andi Kleen, Taylor,
	Perry, Alt, Samantha, Biggers, Caleb, Wang, Weilin, Edward,
	Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Adrian Hunter, Florian Fischer,
	linux-perf-users, linux-kernel

On Fri, Apr 21, 2023 at 6:32 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-04-20 8:19 p.m., Ian Rogers wrote:
> >>>>                         struct evlist *metric_evlist = evlist__new();
> >>>>                         struct evsel *metric_evsel;
> >>>>
> >>>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> >>>> index 6b46bbb3d322..072fa56744b4 100644
> >>>> --- a/tools/perf/util/stat-display.c
> >>>> +++ b/tools/perf/util/stat-display.c
> >>>> @@ -747,7 +747,7 @@ static void uniquify_event_name(struct evsel *counter)
> >>>>         int ret = 0;
> >>>>
> >>>>         if (counter->uniquified_name || counter->use_config_name ||
> >>>> -           !counter->pmu_name || !strncmp(counter->name, counter->pmu_name,
> >>>> +           !counter->pmu_name || !counter->name ||
> >>>> !strncmp(counter->name, counter->pmu_name,
> >>>>                                            strlen(counter->pmu_name)))
> >>>>                 return;
> >>>
> >>> Is this a pre-existing hybrid bug? It is a real shame hybrid shows so
> >>> few common code paths. In general evsel__name should be preferred over
> >>> directly accessing name.
> >>
> >>
> >> I don't think so.
> >>
> >> I haven't dig into the bug yet. But from the source code I can tell that
> >> the check is the same as the current 6.3-rc7.
> >>
> >> For the current 6.3-rc7, perf stat true works.
> >> The perf stat -M TopdownL1 --metric-no-group can work as well.
> >>
> >> But with the current perf-tools-next branch, perf stat true gives a
> >> Segmentation fault.
> >>
> >> The TopdownL1 doesn't work either.
> >>
> >> # ./perf stat -M TopdownL1 --metric-no-group
> >> Error:
> >> The sys_perf_event_open() syscall returned with 22 (Invalid argument)
> >> for event (topdown-retiring).
> >> /bin/dmesg | grep -i perf may provide additional information.
> >
> > I see hybrid failing basic sanity tests both for 6.3 and in
> > perf-tools-next. For metrics I see:
> >
> > ```
> > $ git status
> > ...
> > Your branch is up to date with 'linus/master'
> > ...
> > $ git describe
> > v6.3-rc7-139-gb7bc77e2f2c7
> > $ sudo perf stat -M TopdownL1 -a sleep 1
>
> Try the --metric-no-group.
>
>
> > WARNING: events in group from different hybrid PMUs!
> > WARNING: grouped events cpus do not match, disabling group:
> >  anon group { topdown-retiring, topdown-retiring,
> > INT_MISC.UOP_DROPPING, topdown-fe-bound, topdown-fe-bound,
> > CPU_CLK_UNHALTED.CORE, topdown-be-bound, topdown-be-bound,
> > topdown-bad-spec, topdown-bad-spec }
> > Error:
> > The sys_perf_event_open() syscall returned with 22 (Invalid argument)
> > for event (topdown-retiring).
> > /bin/dmesg | grep -i perf may provide additional information.
> > ```
> >
> > It seems perf on hybrid is quite broken in 6.3, but I doubt we can fix
> > 6.3 given the late stage of the release cycle. As perf-tools-next
> > enables TopdownL1 metrics when no events or metric are specified and
> > when the metric group is present, on hybrid this will cause the
> > pre-existing bug to appear for the no events/metrics case. I suspect
> > this is the cause of the crashes you see, but I'm seeing assertion
> > failures and similar as I'm using a debug build.
> >
> > I'm looking into fixing perf-tools-next and not 6.3. Maybe there will
> > be something we can cherry-pick back to fix up 6.3. It hasn't been
> > easy to find hardware to test on, and if the machine I'm remotely
> > using falls over then I have no means to test, so fingers crossed.
> >
>
> OK. So the json metric thing is buggy on both 6.3 and perf-tools-next. I
> think it is even worse in the perf-tools-next.
> Besides the bugs, the json metric also changes the output layout of perf
> stat default. The tools, which are on top of perf, are all impacted.
>
> The question is why we are in such a rush to move the default of perf
> stat from reliable kernel metric to json metric?

So there is a long answer involving the multi-year effort between
Intel and Google to transition to fully tested json metrics, etc., but
the short answer here is that:
1) the code has always called for json to be the approach:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/builtin-stat.c#n1841
".. Once platform specific metrics support has been added to the json
files, all architectures will use this approach..."
2) aggregation in the saved values used by hard coded metrics, and
json metrics, is broken in 6.3. The bug report came from Intel around
the time 6.2 was wrapping up (i.e. not a new issue), and Intel
requested a fix. This has been in use for metric testing by Intel. The
fix removed the saved values used by the hard coded metrics, and even
if the hard coded metrics were ported to perf-tools-next, which version
should be ported? The version in 6.3 that disagrees with the TMA metric
spreadsheet, or a version matching the json metric? If you're just
going to match the json metric, then why not use the json metric
directly? It is less code and cleaner all round. Fixing the hard coded
metrics also alters the output formatting, and so is unlikely to
resolve what has been an issue here.
3) hard coded metrics have a non-zero cost, in particular in output
formatting. Take for example:
https://lore.kernel.org/all/20221205042852.83382-1-atrajeev@linux.vnet.ibm.com/
"tools/perf: Fix printing field separator in CSV metrics output"
The perf tool has also had patches in this area from RedHat. The hard
coded metrics are error prone, while the json metrics are far more
uniform and tested. We have in the region of 30 hard coded metrics in
Linux 6.3, and much fewer in perf-tools-next, whereas we have more than
4,100 json metrics.

> Can we take a step back? Create a perf/json_metric or whatever branch,
> fix all the issues thoroughly, and then merge to the mainline.
>
> I think the default of perf stat is frequently used by not only the
> newbees but also the veterans. That could have a big user-visible impact.
>
> The 6.4 merge window is approaching. Can we at least revert the patches
> for 6.4?
>
> Arnaldo, what do you think?

Revert is a complete no-go: we'd be reintroducing the bugs that have
been fixed, and who would then fix them? How would they be fixed in any
way other than how they've already been fixed? I've yet to see a
description of a bug that isn't either:
1) an issue because the output formatting changed - the fix here is to
use CSV or json output; relying on the hard coded metrics was a bug
anyway, not least because the metrics are wrong;
2) a pre-existing issue, such as hybrid being broken;
3) a non-issue, such as multiplexing on Skylake.
That's not to say everything couldn't run better, which is the issue
this patch was looking to address.
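For reference, the core idea of this patch (skippable evsels) can be sketched in a few lines. This is a toy model under assumed names (toy_evsel2, toy_open), not the actual perf stat implementation: skippable evsels turn an open failure into a silent skip instead of a hard error, and skipped evsels are later filtered out of the output:

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for the evsel fields the skippable logic needs. */
struct toy_evsel2 {
	const char *name;
	int skippable;	/* set for implicitly added default events */
	int skipped;	/* set when open failed and the evsel is skippable */
};

/* Stub for perf_event_open(): pretend topdown events fail to open on
 * this machine, as happens on pre-Icelake parts. */
static int toy_open(struct toy_evsel2 *e)
{
	return strncmp(e->name, "topdown", 7) == 0 ? -1 : 0;
}

/* Returns 0 on success, -1 on a hard (non-skippable) failure. */
static int toy_open_all(struct toy_evsel2 *evsels, int n)
{
	for (int i = 0; i < n; i++) {
		if (toy_open(&evsels[i]) == 0)
			continue;
		if (!evsels[i].skippable)
			return -1;	/* real error: report it */
		evsels[i].skipped = 1;	/* drop from output later */
	}
	return 0;
}
```

With this shape, explicitly requested events still fail loudly, while implicitly selected default events (like the TopdownL1 group) degrade gracefully.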

My team have called for this to be discussed with Intel as we're firm
on what we see as being the correct approach. I think the meeting can
conclude what will happen wrt reverts, but I haven't heard a sensible
argument why reverts are necessary. I think the energy from this
conversation would be much better directed into fixing the issues we
know we have, which is why I'm currently looking at fixing some
aspects of what is broken in hybrid.

Thanks,
Ian

> Thanks,
> Kan


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-21 13:32                                           ` Liang, Kan
  2023-04-21 15:49                                             ` Ian Rogers
@ 2023-04-21 15:58                                             ` Ian Rogers
  1 sibling, 0 replies; 29+ messages in thread
From: Ian Rogers @ 2023-04-21 15:58 UTC (permalink / raw)
  To: Liang, Kan, Yasin, Ahmad
  Cc: Arnaldo Carvalho de Melo, Stephane Eranian, Andi Kleen, Taylor,
	Perry, Alt, Samantha, Biggers, Caleb, Wang, Weilin, Edward,
	Peter Zijlstra, Ingo Molnar, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Adrian Hunter, Florian Fischer,
	linux-perf-users, linux-kernel

On Fri, Apr 21, 2023 at 6:32 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-04-20 8:19 p.m., Ian Rogers wrote:
> >>>>                         struct evlist *metric_evlist = evlist__new();
> >>>>                         struct evsel *metric_evsel;
> >>>>
> >>>> diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
> >>>> index 6b46bbb3d322..072fa56744b4 100644
> >>>> --- a/tools/perf/util/stat-display.c
> >>>> +++ b/tools/perf/util/stat-display.c
> >>>> @@ -747,7 +747,7 @@ static void uniquify_event_name(struct evsel *counter)
> >>>>         int ret = 0;
> >>>>
> >>>>         if (counter->uniquified_name || counter->use_config_name ||
> >>>> -           !counter->pmu_name || !strncmp(counter->name, counter->pmu_name,
> >>>> +           !counter->pmu_name || !counter->name ||
> >>>> !strncmp(counter->name, counter->pmu_name,
> >>>>                                            strlen(counter->pmu_name)))
> >>>>                 return;
> >>>
> >>> Is this a pre-existing hybrid bug? It is a real shame hybrid shows so
> >>> few common code paths. In general evsel__name should be preferred over
> >>> directly accessing name.
> >>
> >>
> >> I don't think so.
> >>
> >> I haven't dig into the bug yet. But from the source code I can tell that
> >> the check is the same as the current 6.3-rc7.
> >>
> >> For the current 6.3-rc7, perf stat true works.
> >> The perf stat -M TopdownL1 --metric-no-group can work as well.
> >>
> >> But with the current perf-tools-next branch, perf stat true gives a
> >> Segmentation fault.
> >>
> >> The TopdownL1 doesn't work either.
> >>
> >> # ./perf stat -M TopdownL1 --metric-no-group
> >> Error:
> >> The sys_perf_event_open() syscall returned with 22 (Invalid argument)
> >> for event (topdown-retiring).
> >> /bin/dmesg | grep -i perf may provide additional information.
> >
> > I see hybrid failing basic sanity tests both for 6.3 and in
> > perf-tools-next. For metrics I see:
> >
> > ```
> > $ git status
> > ...
> > Your branch is up to date with 'linus/master'
> > ...
> > $ git describe
> > v6.3-rc7-139-gb7bc77e2f2c7
> > $ sudo perf stat -M TopdownL1 -a sleep 1
>
> Try the --metric-no-group.

On Linux 6.3 you cannot ungroup the topdown events, which are needed
for the performance core. The logic that makes grouping unnecessary was
added to perf-tools-next, and you are currently looking to revert the
perf-tools-next changes.
https://lore.kernel.org/all/20230312021543.3060328-1-irogers@google.com/

Thanks,
Ian

>
>
> > WARNING: events in group from different hybrid PMUs!
> > WARNING: grouped events cpus do not match, disabling group:
> >  anon group { topdown-retiring, topdown-retiring,
> > INT_MISC.UOP_DROPPING, topdown-fe-bound, topdown-fe-bound,
> > CPU_CLK_UNHALTED.CORE, topdown-be-bound, topdown-be-bound,
> > topdown-bad-spec, topdown-bad-spec }
> > Error:
> > The sys_perf_event_open() syscall returned with 22 (Invalid argument)
> > for event (topdown-retiring).
> > /bin/dmesg | grep -i perf may provide additional information.
> > ```
> >
> > It seems perf on hybrid is quite broken in 6.3, but I doubt we can fix
> > 6.3 given the late stage of the release cycle. As perf-tools-next
> > enables TopdownL1 metrics when no events or metric are specified and
> > when the metric group is present, on hybrid this will cause the
> > pre-existing bug to appear for the no events/metrics case. I suspect
> > this is the cause of the crashes you see, but I'm seeing assertion
> > failures and similar as I'm using a debug build.
> >
> > I'm looking into fixing perf-tools-next and not 6.3. Maybe there will
> > be something we can cherry-pick back to fix up 6.3. It hasn't been
> > easy to find hardware to test on, and if the machine I'm remotely
> > using falls over then I have no means to test, so fingers crossed.
> >
>
> OK. So the json metric thing is buggy on both 6.3 and perf-tools-next.
> I think it is even worse in perf-tools-next.
> Besides the bugs, the json metrics also change the output layout of the
> perf stat default. The tools that are built on top of perf are all
> impacted.
>
> The question is why we are in such a rush to move the default of perf
> stat from the reliable kernel metrics to the json metrics?
>
> Can we take a step back? Create a perf/json_metric or whatever branch,
> fix all the issues thoroughly, and then merge to the mainline.
>
> I think the default of perf stat is frequently used not only by
> newbies but also by veterans. That could have a big user-visible impact.
>
> The 6.4 merge window is approaching. Can we at least revert the patches
> for 6.4?
>
> Arnaldo, what do you think?
>
> Thanks,
> Kan

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-21 15:49                                             ` Ian Rogers
@ 2023-04-21 17:10                                               ` Liang, Kan
  2023-04-21 17:30                                                 ` Ian Rogers
  0 siblings, 1 reply; 29+ messages in thread
From: Liang, Kan @ 2023-04-21 17:10 UTC (permalink / raw)
  To: Ian Rogers, Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar
  Cc: Stephane Eranian, Andi Kleen, Taylor, Perry, Alt, Samantha,
	Biggers, Caleb, Wang, Weilin, Edward, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Adrian Hunter,
	Florian Fischer, linux-perf-users, linux-kernel, Yasin, Ahmad



On 2023-04-21 11:49 a.m., Ian Rogers wrote:
>> Can we take a step back? Create a perf/json_metric or whatever branch,
>> fix all the issues thoroughly, and then merge to the mainline.
>>
>> I think the default of perf stat is frequently used not only by
>> newbies but also by veterans. That could have a big user-visible impact.
>>
>> The 6.4 merge window is approaching. Can we at least revert the patches
>> for 6.4?
>>
>> Arnaldo, what do you think?
> Revert is a complete no go, we'd be reintroducing the bugs that are
> fixed and who will then fix those? How will they be fixed in any way
> other than how they've already been fixed? I've yet to see a
> description of a bug that isn't either:
> 1) an issue because the output formatting changed - the fix here is to
> use CSV or json output, using the hard coded metrics was a bug here
> anyway, not least as the metrics are wrong,
> 2) a pre-existing issue, such as hybrid is broken,
> 3) a non-issue, such as multiplexing on Skylake.
> That's not to say everything couldn't run better, which was the issue
> this patch was looking to address.

Sigh, we really need a clear design and definition for the default of
perf stat: what is included, what the expected layout is, and what is
forbidden, so that everyone from developers to users is on the same
page. Could you please update the documentation for the default of perf
stat?


One more thing I want to point out is that the json metrics may keep
changing, while the kernel metrics (which include only the core part of
the json metrics) are quite stable. For the default of perf stat, the
kernel metrics would be the better choice.


Clarification: I'm not looking to revert all the patches for the json
metrics in the perf-tools-next branch. The request is to revert only
the patches which impact the default of perf stat, more specifically
patches 39 and 41.
https://lore.kernel.org/lkml/20230219092848.639226-40-irogers@google.com/


This has really been a long discussion. I think we have all expressed
our arguments clearly. Let's make the final decision and stop here.

Arnaldo? Peter? Ingo?

Could you please share your opinions?

Thanks,
Kan


* Re: [PATCH v2] perf stat: Introduce skippable evsels
  2023-04-21 17:10                                               ` Liang, Kan
@ 2023-04-21 17:30                                                 ` Ian Rogers
  0 siblings, 0 replies; 29+ messages in thread
From: Ian Rogers @ 2023-04-21 17:30 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
	Stephane Eranian, Andi Kleen, Taylor, Perry, Alt, Samantha,
	Biggers, Caleb, Wang, Weilin, Edward, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Adrian Hunter,
	Florian Fischer, linux-perf-users, linux-kernel, Yasin, Ahmad

On Fri, Apr 21, 2023 at 10:11 AM Liang, Kan <kan.liang@linux.intel.com> wrote:
>
>
>
> On 2023-04-21 11:49 a.m., Ian Rogers wrote:
> >> Can we take a step back? Create a perf/json_metric or whatever branch,
> >> fix all the issues thoroughly, and then merge to the mainline.
> >>
> >> I think the default of perf stat is frequently used not only by
> >> newbies but also by veterans. That could have a big user-visible impact.
> >>
> >> The 6.4 merge window is approaching. Can we at least revert the patches
> >> for 6.4?
> >>
> >> Arnaldo, what do you think?
> > Revert is a complete no go, we'd be reintroducing the bugs that are
> > fixed and who will then fix those? How will they be fixed in any way
> > other than how they've already been fixed? I've yet to see a
> > description of a bug that isn't either:
> > 1) an issue because the output formatting changed - the fix here is to
> > use CSV or json output, using the hard coded metrics was a bug here
> > anyway, not least as the metrics are wrong,
> > 2) a pre-existing issue, such as hybrid is broken,
> > 3) a non-issue, such as multiplexing on Skylake.
> > That's not to say everything couldn't run better, which was the issue
> > this patch was looking to address.
>
> Sigh, we really need a clear design and definition for the default of
> perf stat: what is included, what the expected layout is, and what is
> forbidden, so that everyone from developers to users is on the same
> page. Could you please update the documentation for the default of perf
> stat?
>
>
> One more thing I want to point out is that the json metrics may keep
> changing, while the kernel metrics (which include only the core part of
> the json metrics) are quite stable. For the default of perf stat, the
> kernel metrics would be the better choice.
>
>
> Clarification: I'm not looking to revert all the patches for the json
> metrics in the perf-tools-next branch. The request is to revert only
> the patches which impact the default of perf stat, more specifically
> patches 39 and 41.
> https://lore.kernel.org/lkml/20230219092848.639226-40-irogers@google.com/

This is silly. You can't just revert those patches. The hard-coded
logic was removed so that it didn't need to be ported when the later
patches removed saved values. You are asking for a far bigger revert.

Thanks,
Ian

> This has really been a long discussion. I think we have all expressed
> our arguments clearly. Let's make the final decision and stop here.
>
> Arnaldo? Peter? Ingo?
>
> Could you please share your opinions?
>
> Thanks,
> Kan


end of thread

Thread overview: 29+ messages
2023-04-14  5:19 [PATCH v2] perf stat: Introduce skippable evsels Ian Rogers
2023-04-14 18:02 ` Liang, Kan
2023-04-14 23:03   ` Ian Rogers
2023-04-17 13:58     ` Liang, Kan
2023-04-17 15:59       ` Ian Rogers
2023-04-17 17:31         ` Liang, Kan
2023-04-17 18:13           ` Ian Rogers
2023-04-18 13:03             ` Liang, Kan
2023-04-18 15:43               ` Ian Rogers
2023-04-18 18:19                 ` Liang, Kan
2023-04-18 20:08                   ` Ian Rogers
2023-04-18 21:51                     ` Liang, Kan
2023-04-19  0:12                       ` Ian Rogers
2023-04-19  1:00                         ` Ian Rogers
2023-04-19 12:31                           ` Liang, Kan
2023-04-19 13:19                             ` Ian Rogers
2023-04-19 14:16                               ` Liang, Kan
2023-04-19 16:51                                 ` Ian Rogers
2023-04-19 18:57                                   ` Liang, Kan
2023-04-20  0:23                                     ` Ian Rogers
2023-04-20 13:02                                       ` Liang, Kan
2023-04-21  0:19                                         ` Ian Rogers
2023-04-21 13:32                                           ` Liang, Kan
2023-04-21 15:49                                             ` Ian Rogers
2023-04-21 17:10                                               ` Liang, Kan
2023-04-21 17:30                                                 ` Ian Rogers
2023-04-21 15:58                                             ` Ian Rogers
2023-04-20 11:33                                   ` Arnaldo Carvalho de Melo
2023-04-20 12:22                                     ` Liang, Kan
