[RFC PATCH 0/4] perf stat: Add option to aggregate data based on the cache topology

* [RFC PATCH 0/4] perf stat: Add option to aggregate data based on the cache topology
@ 2023-03-31  4:51 K Prateek Nayak
  2023-03-31  4:51 ` [RFC PATCH 1/4] perf: Read cache instance ID when building " K Prateek Nayak
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: K Prateek Nayak @ 2023-03-31  4:51 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel, acme, peterz, mingo,
	mark.rutland, alexander.shishkin, jolsa, namhyung
  Cc: ravi.bangoria, sandipan.das, ananth.narayan, gautham.shenoy, eranian

Motivation behind this feature is to aggregate the data at the LLC level
for chiplet based processors which currently do not expose the chiplet
details in sysfs cpu topology information.

For the completeness of the feature, the series adds ability to
aggregate data at any cache level. Following is the example of the
output on a dual socket Zen3 processor with 2 x 64C/128T containing 8
chiplet per socket.

  $ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5

   Performance counter stats for 'system wide':

  S0-D0-L3-ID0             16              4,463      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID1             16              2,962      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID2             16              2,592      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID3             16              2,508      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID4             16              1,841      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID5             16              1,764      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID6             16              1,205      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID7             16              5,806      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID8             16              1,461      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID9             16                648      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID10            16              1,443      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID11            16              1,333      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID12            16              1,167      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID13            16                640      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID14            16                601      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID15            16              3,423      ls_dmnd_fills_from_sys.ext_cache_remote

         5.017954593 seconds time elapsed

The series also adds support for perf stat record and perf stat report
to aggregate data at various cache levels. Following is an example of
recording with aggregation at L2 level and reporting the same data with
aggregation at L3 level.

  $ sudo perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5

   Performance counter stats for 'system wide':

  S0-D0-L2-ID0              2              3,212      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID1              2                240      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID2              2                 10      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID3              2                 13      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID4              2                 13      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID5              2                319      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID6              2                348      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID7              2                648      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID8              2                284      ls_dmnd_fills_from_sys.ext_cache_remote
  ...
  S1-D1-L2-ID127            2                113      ls_dmnd_fills_from_sys.ext_cache_remote

         5.017958787 seconds time elapsed

  $ sudo perf stat report --per-cache=L3

   Performance counter stats for '/home/amd/dev/linux/tools/perf/perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5':

  S0-D0-L3-ID0             16              4,803      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID1             16              3,421      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID2             16              1,149      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID3             16              1,220      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID4             16              1,502      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID5             16              6,751      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID6             16              1,600      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID7             16              1,985      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID8             16              1,566      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID9             16              1,010      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID10            16              1,337      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID11            16              2,298      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID12            16                314      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID13            16                350      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID14            16                664      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID15            16              3,834      ls_dmnd_fills_from_sys.ext_cache_remote

         5.017958787 seconds time elapsed

The sum of the aggregate at L2 from S0-D0-L2-ID0 to S0-D0-L2-ID7 is
equal to the value for S0-D0-L3-ID0 in perf stat report with aggregation
at L3 level since L3-ID0 contains L2-ID0 to L2-ID7 on the machine.

This series makes breaking change when saving the cache details of env
for recording and reporting purpose. If there is a better way to do
this, please do let me know.

Following points were not considered when designing this RFC:

- Handling multiple cache types at same level, for example L1i and L1d
  both of which are at level 1. The current implementation will retrieve
  the instance ID from the last entry in cache_level_data[] with the
  matching level. This works as long as L1i and L1d cover same set of
  CPUs but will not work for an exotic cache topology.

- If the processor features an exotic cache topology with different
  type of caches at same level covering different set of CPUs, the
  record and report might not give consistent result as the qsort()
  function used to sort cache_level_data[] when saving the env data is
  unstable and might not preserve the order for the different caches at
  same level.

I'm seeking some clarification from the community for the above problems
and potential solutions for processors where all CPUs might not share
the same topology structure.

This series cleanly applies on top perf-tool branch from Arnaldo's tree
(https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=perf-tools)
at:

commit e8d018dd0257 ("Linux 6.3-rc3")

--
K Prateek Nayak (4):
  perf: Read cache instance ID when building cache topology
  perf: Save cache instance ID when saving cache topology data
  perf: Extract building cache level for a CPU into separate function
  perf: Add option for --per-cache aggregation

 tools/lib/perf/include/perf/cpumap.h          |   5 +
 tools/lib/perf/include/perf/event.h           |   3 +-
 tools/perf/Documentation/perf-stat.txt        |  16 ++
 tools/perf/builtin-stat.c                     | 149 +++++++++++++++++-
 .../tests/shell/lib/perf_json_output_lint.py  |   4 +-
 tools/perf/tests/shell/stat+csv_output.sh     |  14 ++
 tools/perf/tests/shell/stat+json_output.sh    |  13 ++
 tools/perf/util/cpumap.c                      |  97 ++++++++++++
 tools/perf/util/cpumap.h                      |  17 ++
 tools/perf/util/env.h                         |   1 +
 tools/perf/util/event.c                       |   7 +-
 tools/perf/util/header.c                      |  77 ++++++---
 tools/perf/util/header.h                      |   4 +
 tools/perf/util/stat-display.c                |  16 ++
 tools/perf/util/stat-shadow.c                 |   1 +
 tools/perf/util/stat.h                        |   2 +
 tools/perf/util/synthetic-events.c            |   1 +
 17 files changed, 395 insertions(+), 32 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 7+ messages in thread