Re: [RFC PATCH 00/12] Topdown parser

From: Andi Kleen <ak@linux.intel.com>
To: Ian Rogers <irogers@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@redhat.com>, Namhyung Kim <namhyung@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Jin Yao <yao.jin@linux.intel.com>,
	John Garry <john.garry@huawei.com>, Paul Clarke <pc@us.ibm.com>,
	kajoljain <kjain@linux.ibm.com>,
	Stephane Eranian <eranian@google.com>,
	Sandeep Dasgupta <sdasgup@google.com>,
	linux-perf-users <linux-perf-users@vger.kernel.org>
Subject: Re: [RFC PATCH 00/12] Topdown parser
Date: Wed, 11 Nov 2020 22:35:42 -0800	[thread overview]
Message-ID: <20201112063542.GD894261@tassilo.jf.intel.com> (raw)
In-Reply-To: <CAP-5=fUDOLzfpuJNjk_D6KrAGMNXKXOFKfVi9O7qXRDdP_4Rpg@mail.gmail.com>

On Wed, Nov 11, 2020 at 08:09:49PM -0800, Ian Rogers wrote:
>      >    to the optimization manual the group Topdown_Group_TopDownL1
>      provides the
>      >   
>      metrics Topdown_Metric_Frontend_Bound, Topdown_Metric_Backend_Bound, Topdown_Metric_Bad_Speculation
>      >    and Topdown_Metric_Retiring. The hope is the events here will all
>      be
>      >    scheduled without multiplexing.
> 
>      That's not necessarily true. Some of the newer expressions are quite
>      complex (e.g .due to workarounds or because the events are complex, like
>      the FLOPS events) There's also some problems with the
>      scheduling of the fixed metrics on Icelake+, that need special handling.
> 
>    For FLOPS I see:
>    ( #FP_Arith_Scalar + #FP_Arith_Vector ) / ( 2 * CORE_CLKS ) 
>    is the concern about multiplexing? Could we create metrics/groups aware of
>    the limitations?

If you expand it you'll end up with a lot of events, so it has to be split
into groups. But you still need to understand the rules, otherwise the tool
ends up with non schedulable groups.

For example here's a group schedule generated by toplev for level 3 Icelake.
On pre Icelake with only 4 counters it's more complicated.

Microcode_Sequencer[3] Heavy_Operations[1] Memory_Bound[1]
Branch_Mispredicts[1] Fetch_Bandwidth[1] Other[4] Heavy_Operations[3]
Frontend_Bound[1] FP_Arith[1] Backend_Bound[1] Light_Operations[3]
Microcode_Sequencer[1] FP_Arith[4] Fetch_Bandwidth[2] Light_Operations[1]
Retiring[1] Bad_Speculation[1] Machine_Clears[1] Other[1] Core_Bound[1]
Ports_Utilization[1]:
  perf_metrics.frontend_bound[f] perf_metrics.bad_speculation[f]
  topdown.slots[f] perf_metrics.backend_bound[f] perf_metrics.retiring[f] [0
  counters]
Machine_Clears[2] Branch_Mispredicts[2] ITLB_Misses[3] Branch_Mispredicts[1]
Fetch_Bandwidth[1] Machine_Clears[1] Frontend_Bound[1] Backend_Bound[1]
ICache_Misses[3] Fetch_Latency[2] Fetch_Bandwidth[2] Bad_Speculation[1]
Memory_Bound[1] DSB_Switches[3]:
  inst_retired.any[f] machine_clears.count cpu_clk_unhalted.thread[f]
  int_misc.recovery_cycles:c1:e1 br_misp_retired.all_branches
  idq_uops_not_delivered.cycles_0_uops_deliv.core topdown.slots[f]
  dsb2mite_switches.penalty_cycles icache_64b.iftag_stall
  int_misc.uop_dropping icache_16b.ifdata_stall [8 counters]
Core_Bound[1] Core_Bound[2] Heavy_Operations[3] Memory_Bound[2]
Light_Operations[3]:
  cycle_activity.stalls_mem_any idq.ms_uops exe_activity.1_ports_util
  exe_activity.exe_bound_0_ports exe_activity.2_ports_util
  exe_activity.bound_on_stores int_misc.recovery_cycles:c1:e1 uops_issued.any
  [8 counters]
Microcode_Sequencer[3] Store_Bound[3] Branch_Resteers[3] MS_Switches[3]
Divider[3] LCP[3]:
  int_misc.clear_resteer_cycles arith.divider_active
  cpu_clk_unhalted.thread[f] idq.ms_switches ild_stall.lcp baclears.any
  exe_activity.bound_on_stores idq.ms_uops uops_issued.any [8 counters]
L1_Bound[3] L3_Bound[3]:
  cycle_activity.stalls_l1d_miss cycle_activity.stalls_mem_any
  cycle_activity.stalls_l2_miss cpu_clk_unhalted.thread[f]
  cycle_activity.stalls_l3_miss [4 counters]
DSB[3] MITE[3]:
  idq.mite_cycles_ok idq.mite_cycles_any cpu_clk_unhalted.distributed
  idq.dsb_cycles_any idq.dsb_cycles_ok [5 counters]
LSD[3] L2_Bound[3]:
  cycle_activity.stalls_l1d_miss cpu_clk_unhalted.thread[f]
  cpu_clk_unhalted.distributed lsd.cycles_ok lsd.cycles_active
  cycle_activity.stalls_l2_miss [5 counters]
Other[4] FP_Arith[4] L2_Bound[3] DRAM_Bound[3]:
  mem_load_retired.fb_hit mem_load_retired.l2_hit
  fp_arith_inst_retired.512b_packed_single mem_load_retired.l1_miss
  fp_arith_inst_retired.512b_packed_double l1d_pend_miss.fb_full_periods [6
  counters]
Ports_Utilization[3] DRAM_Bound[3]:
  cycle_activity.stalls_l1d_miss arith.divider_active mem_load_retired.l2_hit
  exe_activity.1_ports_util cpu_clk_unhalted.thread[f]
  cycle_activity.stalls_l3_miss exe_activity.2_ports_util
  cycle_activity.stalls_l2_miss exe_activity.exe_bound_0_ports [8 counters]
Other[4] FP_Arith[4]:
  fp_arith_inst_retired.128b_packed_single uops_executed.thread
  uops_executed.x87 fp_arith_inst_retired.scalar_double
  fp_arith_inst_retired.256b_packed_single
  fp_arith_inst_retired.scalar_single
  fp_arith_inst_retired.128b_packed_double
  fp_arith_inst_retired.256b_packed_double [8 counters]

>    Ok, so we can read the threshold from the spreadsheet and create an extra
>    metric for if the metric above the threshold?

Yes it can be all derived from the spreadsheet.

You need a much more complicated evaluation algorithm though, single
pass is not enough.

> 
>      Also in other cases it's probably better to not drilldown, but collect
>      everything upfront, e.g. when someone else is doing the collection
>      for you. In this case the thresholding has to be figured out from
>      existing data.
> 
>    This sounds like perf record support for metrics, which I think is a good
>    idea but not currently a use-case I'm trying to solve.

I meant just a single collection with perf stat, but with all events for all 
metrics/nodes collected. But the tool automatically figures out the bottleneck
and only shows the relevant information instead of all.

That's a good use case when it's not feasible to iterate (e.g. you ask
someone else to measure)

>      The other biggie which is currently not in metrics is per core mode,
>      which is needed for many metrics on CPUs older than Icelake. This really
>      has to be supported in some way, otherwise the results on pre Icelake
>      SMT
>      are not good (Icelake fixes this problem)
> 
>    I think this can be added in the style of event modifiers. We may want to
>    distinguish metrics from user or kernel code. We may want one cpu, etc.

There is a per core qualifier now, but you would also need to teach the perf stat output
to show the shared nodes. That's likely significant surgery.

>    Sure, but I'm not understanding how a user without vtune or toplev is
>    expected follow top down? 

It's fairly difficult. Some experts can do it, but it's not a common skill.

Giving the metrics automates a lot of the
>    process as well as alleviating differences across different models.
>    Previously a lot of metrics had just broken expressions. Were C code used

These problems should be all fixed now. We had some testing challenges,
but they have been addressed

>    instead of strings in json, these would have resulted in compilation
>    errors. Even with the metrics being broken in perf I didn't see users
>    complaining. I suspect metrics are outside of most people's use, but that
>    doesn't mean we shouldn't make them convenient to use if they want them.
>    Yes there are issues, but that's also true for most events.
>    I think ideally we have the TMA_Metrics.csv (and related tools)
>    contributed to Linux, this would be processed at build time to build all
>    events and metrics (what this patch series does). Making the metrics work,
>    complexity, multiplexing, etc. are all issues, but ones worth fixing.

My feeling before was that some selected nodes and the non TMA metrics make 
sense and that's the current metrics in fact. Perhaps it wasn't perfect,
but at least it was something, and they work already quite well
with the current perf infrastructure.

It probably also makes sense to add a few more nodes (let's say level 2
or maybe level 3).

But if you add much more you'll need all this extra machinery I described.
Without it adding all the lower level nodes blindly is not a good idea
because it will not be very usable, or worse even misleading.

Even for level 2/3 we should probably have some simple thresholding at least,
but maybe could get away with not having it yet.

While it's of course possible to add this too it would be significant surgery.
I actually considered doing it all in perf at some point, but it was quite
complicated.

I personally found it simpler to handle those higher level algorithms in a
separate tool.

-Andi