linux-perf-users.vger.kernel.org archive mirror
* [RFC PATCH 0/4] perf stat: Add option to aggregate data based on the cache topology
@ 2023-03-31  4:51 K Prateek Nayak
  2023-03-31  4:51 ` [RFC PATCH 1/4] perf: Read cache instance ID when building " K Prateek Nayak
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: K Prateek Nayak @ 2023-03-31  4:51 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel, acme, peterz, mingo,
	mark.rutland, alexander.shishkin, jolsa, namhyung
  Cc: ravi.bangoria, sandipan.das, ananth.narayan, gautham.shenoy, eranian

The motivation behind this feature is to aggregate data at the LLC level
for chiplet-based processors, which currently do not expose the chiplet
details in the sysfs CPU topology information.

For completeness, the series adds the ability to aggregate data at any
cache level. Following is an example of the output on a dual-socket
Zen3 system (2 x 64C/128T) with 8 chiplets per socket.

  $ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5

   Performance counter stats for 'system wide':

  S0-D0-L3-ID0             16              4,463      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID1             16              2,962      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID2             16              2,592      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID3             16              2,508      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID4             16              1,841      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID5             16              1,764      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID6             16              1,205      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID7             16              5,806      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID8             16              1,461      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID9             16                648      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID10            16              1,443      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID11            16              1,333      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID12            16              1,167      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID13            16                640      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID14            16                601      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID15            16              3,423      ls_dmnd_fills_from_sys.ext_cache_remote

         5.017954593 seconds time elapsed

The series also adds support for perf stat record and perf stat report
to aggregate data at various cache levels. Following is an example of
recording with aggregation at the L2 level and reporting the same data
with aggregation at the L3 level.

  $ sudo perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5

   Performance counter stats for 'system wide':

  S0-D0-L2-ID0              2              3,212      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID1              2                240      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID2              2                 10      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID3              2                 13      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID4              2                 13      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID5              2                319      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID6              2                348      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID7              2                648      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID8              2                284      ls_dmnd_fills_from_sys.ext_cache_remote
  ...
  S1-D1-L2-ID127            2                113      ls_dmnd_fills_from_sys.ext_cache_remote

         5.017958787 seconds time elapsed

  $ sudo perf stat report --per-cache=L3

   Performance counter stats for '/home/amd/dev/linux/tools/perf/perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5':

  S0-D0-L3-ID0             16              4,803      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID1             16              3,421      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID2             16              1,149      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID3             16              1,220      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID4             16              1,502      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID5             16              6,751      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID6             16              1,600      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID7             16              1,985      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID8             16              1,566      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID9             16              1,010      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID10            16              1,337      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID11            16              2,298      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID12            16                314      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID13            16                350      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID14            16                664      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID15            16              3,834      ls_dmnd_fills_from_sys.ext_cache_remote

         5.017958787 seconds time elapsed

The sum of the L2 aggregates from S0-D0-L2-ID0 to S0-D0-L2-ID7 is equal
to the value reported for S0-D0-L3-ID0 when aggregating at the L3 level,
since L3-ID0 contains L2-ID0 to L2-ID7 on this machine.
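
As a quick sanity check using the per-L2 counts shown above for IDs 0
through 7:

  3,212 + 240 + 10 + 13 + 13 + 319 + 348 + 648 = 4,803

which matches the S0-D0-L3-ID0 value in the --per-cache=L3 report.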

This series makes a breaking change in how the cache details of env are
saved for recording and reporting purposes. If there is a better way to
do this, please do let me know.

The following points were not considered when designing this RFC:

- Handling multiple cache types at the same level, for example L1i and
  L1d, both of which are at level 1. The current implementation will
  retrieve the instance ID from the last entry in cache_level_data[]
  with the matching level. This works as long as L1i and L1d cover the
  same set of CPUs but will not work for an exotic cache topology.

- If the processor features an exotic cache topology with different
  types of caches at the same level covering different sets of CPUs,
  record and report might not give consistent results, as the qsort()
  used to sort cache_level_data[] when saving the env data is unstable
  and might not preserve the order of the different caches at the same
  level.
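
  One rough, untested idea for making the ordering deterministic would
  be to break ties on the cache type in the comparator. This is only a
  sketch over struct cpu_cache_level and I have not checked it against
  the comparator actually used when saving the env data:

	#include <string.h>

	/* Hypothetical comparator: sort by level, break ties by type. */
	static int cpu_cache_level__sort_cmp(const void *a, const void *b)
	{
		const struct cpu_cache_level *ca = a, *cb = b;

		if (ca->level != cb->level)
			return (int)ca->level - (int)cb->level;

		/* Keep L1i/L1d in a stable relative order. */
		return strcmp(ca->type, cb->type);
	}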

I'm seeking some clarification from the community on the above problems
and on potential solutions for processors where all CPUs might not share
the same topology structure.

This series applies cleanly on top of the perf-tools branch of Arnaldo's tree
(https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=perf-tools)
at:

commit e8d018dd0257 ("Linux 6.3-rc3")

--
K Prateek Nayak (4):
  perf: Read cache instance ID when building cache topology
  perf: Save cache instance ID when saving cache topology data
  perf: Extract building cache level for a CPU into separate function
  perf: Add option for --per-cache aggregation

 tools/lib/perf/include/perf/cpumap.h          |   5 +
 tools/lib/perf/include/perf/event.h           |   3 +-
 tools/perf/Documentation/perf-stat.txt        |  16 ++
 tools/perf/builtin-stat.c                     | 149 +++++++++++++++++-
 .../tests/shell/lib/perf_json_output_lint.py  |   4 +-
 tools/perf/tests/shell/stat+csv_output.sh     |  14 ++
 tools/perf/tests/shell/stat+json_output.sh    |  13 ++
 tools/perf/util/cpumap.c                      |  97 ++++++++++++
 tools/perf/util/cpumap.h                      |  17 ++
 tools/perf/util/env.h                         |   1 +
 tools/perf/util/event.c                       |   7 +-
 tools/perf/util/header.c                      |  77 ++++++---
 tools/perf/util/header.h                      |   4 +
 tools/perf/util/stat-display.c                |  16 ++
 tools/perf/util/stat-shadow.c                 |   1 +
 tools/perf/util/stat.h                        |   2 +
 tools/perf/util/synthetic-events.c            |   1 +
 17 files changed, 395 insertions(+), 32 deletions(-)

-- 
2.34.1


* [RFC PATCH 1/4] perf: Read cache instance ID when building cache topology
  2023-03-31  4:51 [RFC PATCH 0/4] perf stat: Add option to aggregate data based on the cache topology K Prateek Nayak
@ 2023-03-31  4:51 ` K Prateek Nayak
  2023-03-31 11:54   ` Arnaldo Carvalho de Melo
  2023-03-31  4:51 ` [RFC PATCH 2/4] perf: Save cache instance ID when saving cache topology data K Prateek Nayak
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: K Prateek Nayak @ 2023-03-31  4:51 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel, acme, peterz, mingo,
	mark.rutland, alexander.shishkin, jolsa, namhyung
  Cc: ravi.bangoria, sandipan.das, ananth.narayan, gautham.shenoy, eranian

CPU cache level data currently stores cache level, type, line size,
associativity, sets, total cache size, and the CPUs sharing the cache.
Also read and store the cache instance ID from
"/sys/devices/system/cpu/cpuX/cache/indexY/id" in the cache level data.
Use instance ID as well when comparing cache levels.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 tools/perf/util/env.h    | 1 +
 tools/perf/util/header.c | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/tools/perf/util/env.h b/tools/perf/util/env.h
index 4566c51f2fd9..d761bfae76af 100644
--- a/tools/perf/util/env.h
+++ b/tools/perf/util/env.h
@@ -17,6 +17,7 @@ struct cpu_topology_map {
 
 struct cpu_cache_level {
 	u32	level;
+	u32	id;
 	u32	line_size;
 	u32	sets;
 	u32	ways;
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 404d816ca124..5c3f5d260612 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1131,6 +1131,9 @@ static bool cpu_cache_level__cmp(struct cpu_cache_level *a, struct cpu_cache_lev
 	if (a->level != b->level)
 		return false;
 
+	if (a->id != b->id)
+		return false;
+
 	if (a->line_size != b->line_size)
 		return false;
 
@@ -1168,6 +1171,10 @@ static int cpu_cache_level__read(struct cpu_cache_level *cache, u32 cpu, u16 lev
 	if (sysfs__read_int(file, (int *) &cache->level))
 		return -1;
 
+	scnprintf(file, PATH_MAX, "%s/id", path);
+	if (sysfs__read_int(file, (int *) &cache->id))
+		return -1;
+
 	scnprintf(file, PATH_MAX, "%s/coherency_line_size", path);
 	if (sysfs__read_int(file, (int *) &cache->line_size))
 		return -1;
-- 
2.34.1


* [RFC PATCH 2/4] perf: Save cache instance ID when saving cache topology data
  2023-03-31  4:51 [RFC PATCH 0/4] perf stat: Add option to aggregate data based on the cache topology K Prateek Nayak
  2023-03-31  4:51 ` [RFC PATCH 1/4] perf: Read cache instance ID when building " K Prateek Nayak
@ 2023-03-31  4:51 ` K Prateek Nayak
  2023-03-31  4:51 ` [RFC PATCH 3/4] perf: Extract building cache level for a CPU into separate function K Prateek Nayak
  2023-03-31  4:51 ` [RFC PATCH 4/4] perf: Add option for --per-cache aggregation K Prateek Nayak
  3 siblings, 0 replies; 7+ messages in thread
From: K Prateek Nayak @ 2023-03-31  4:51 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel, acme, peterz, mingo,
	mark.rutland, alexander.shishkin, jolsa, namhyung
  Cc: ravi.bangoria, sandipan.das, ananth.narayan, gautham.shenoy, eranian

Bump up the version and save the cache instance ID when saving cache
topology information. When reading the topology information,
conditionally parse the instance ID for newer versions only.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 tools/perf/util/header.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 5c3f5d260612..50d66092c82b 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1262,7 +1262,7 @@ static int write_cache(struct feat_fd *ff,
 {
 	u32 max_caches = cpu__max_cpu().cpu * MAX_CACHE_LVL;
 	struct cpu_cache_level caches[max_caches];
-	u32 cnt = 0, i, version = 1;
+	u32 cnt = 0, i, version = 2;
 	int ret;
 
 	ret = build_caches(caches, &cnt);
@@ -1288,6 +1288,7 @@ static int write_cache(struct feat_fd *ff,
 				goto out;
 
 		_W(level)
+		_W(id)
 		_W(line_size)
 		_W(sets)
 		_W(ways)
@@ -2879,7 +2880,7 @@ static int process_cache(struct feat_fd *ff, void *data __maybe_unused)
 	if (do_read_u32(ff, &version))
 		return -1;
 
-	if (version != 1)
+	if (version != 1 && version != 2)
 		return -1;
 
 	if (do_read_u32(ff, &cnt))
@@ -2897,6 +2898,8 @@ static int process_cache(struct feat_fd *ff, void *data __maybe_unused)
 				goto out_free_caches;			\
 
 		_R(level)
+		if (version >= 2)
+			_R(id)
 		_R(line_size)
 		_R(sets)
 		_R(ways)
-- 
2.34.1


* [RFC PATCH 3/4] perf: Extract building cache level for a CPU into separate function
  2023-03-31  4:51 [RFC PATCH 0/4] perf stat: Add option to aggregate data based on the cache topology K Prateek Nayak
  2023-03-31  4:51 ` [RFC PATCH 1/4] perf: Read cache instance ID when building " K Prateek Nayak
  2023-03-31  4:51 ` [RFC PATCH 2/4] perf: Save cache instance ID when saving cache topology data K Prateek Nayak
@ 2023-03-31  4:51 ` K Prateek Nayak
  2023-03-31  4:51 ` [RFC PATCH 4/4] perf: Add option for --per-cache aggregation K Prateek Nayak
  3 siblings, 0 replies; 7+ messages in thread
From: K Prateek Nayak @ 2023-03-31  4:51 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel, acme, peterz, mingo,
	mark.rutland, alexander.shishkin, jolsa, namhyung
  Cc: ravi.bangoria, sandipan.das, ananth.narayan, gautham.shenoy, eranian

build_caches() builds the complete cache topology of the system by
iterating over all CPUs, building and comparing the cache levels of each
CPU, and keeping only the unique ones at the end.

Extract the code that builds the cache levels for a single CPU into a
separate function. Expose this function so it can be used elsewhere in
perf too.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 tools/perf/util/header.c | 62 +++++++++++++++++++++++++---------------
 tools/perf/util/header.h |  4 +++
 2 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 50d66092c82b..770b0f624d7c 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -1220,38 +1220,54 @@ static void cpu_cache_level__fprintf(FILE *out, struct cpu_cache_level *c)
 	fprintf(out, "L%d %-15s %8s [%s]\n", c->level, c->type, c->size, c->map);
 }
 
-#define MAX_CACHE_LVL 4
-
-static int build_caches(struct cpu_cache_level caches[], u32 *cntp)
+/*
+ * Build cache levels for a particular CPU from the data in
+ * /sys/devices/system/cpu/cpu<cpu>/cache/
+ * The cache level data is stored in caches[] starting at index
+ * *cntp.
+ */
+int build_caches_for_cpu(u32 cpu, struct cpu_cache_level caches[], u32 *cntp)
 {
-	u32 i, cnt = 0;
-	u32 nr, cpu;
 	u16 level;
 
-	nr = cpu__max_cpu().cpu;
+	for (level = 0; level < MAX_CACHE_LVL; level++) {
+		struct cpu_cache_level c;
+		int err;
+		u32 i;
 
-	for (cpu = 0; cpu < nr; cpu++) {
-		for (level = 0; level < MAX_CACHE_LVL; level++) {
-			struct cpu_cache_level c;
-			int err;
+		err = cpu_cache_level__read(&c, cpu, level);
+		if (err < 0)
+			return err;
 
-			err = cpu_cache_level__read(&c, cpu, level);
-			if (err < 0)
-				return err;
+		if (err == 1)
+			break;
 
-			if (err == 1)
+		for (i = 0; i < *cntp; i++) {
+			if (cpu_cache_level__cmp(&c, &caches[i]))
 				break;
+		}
 
-			for (i = 0; i < cnt; i++) {
-				if (cpu_cache_level__cmp(&c, &caches[i]))
-					break;
-			}
+		if (i == *cntp) {
+			caches[*cntp] = c;
+			*cntp = *cntp + 1;
+		} else
+			cpu_cache_level__free(&c);
+	}
 
-			if (i == cnt)
-				caches[cnt++] = c;
-			else
-				cpu_cache_level__free(&c);
-		}
+	return 0;
+}
+
+static int build_caches(struct cpu_cache_level caches[], u32 *cntp)
+{
+	u32 nr, cpu, cnt = 0;
+
+	nr = cpu__max_cpu().cpu;
+
+	for (cpu = 0; cpu < nr; cpu++) {
+		int ret = build_caches_for_cpu(cpu, caches, &cnt);
+
+		if (ret)
+			return ret;
 	}
 	*cntp = cnt;
 	return 0;
diff --git a/tools/perf/util/header.h b/tools/perf/util/header.h
index e3861ae62172..94cf2ffb6e60 100644
--- a/tools/perf/util/header.h
+++ b/tools/perf/util/header.h
@@ -177,7 +177,11 @@ int do_write(struct feat_fd *fd, const void *buf, size_t size);
 int write_padded(struct feat_fd *fd, const void *bf,
 		 size_t count, size_t count_aligned);
 
+#define MAX_CACHE_LVL 4
+
 int is_cpu_online(unsigned int cpu);
+int build_caches_for_cpu(u32 cpu, struct cpu_cache_level caches[], u32 *cntp);
+
 /*
  * arch specific callback
  */
-- 
2.34.1


* [RFC PATCH 4/4] perf: Add option for --per-cache aggregation
  2023-03-31  4:51 [RFC PATCH 0/4] perf stat: Add option to aggregate data based on the cache topology K Prateek Nayak
                   ` (2 preceding siblings ...)
  2023-03-31  4:51 ` [RFC PATCH 3/4] perf: Extract building cache level for a CPU into separate function K Prateek Nayak
@ 2023-03-31  4:51 ` K Prateek Nayak
  3 siblings, 0 replies; 7+ messages in thread
From: K Prateek Nayak @ 2023-03-31  4:51 UTC (permalink / raw)
  To: linux-perf-users, linux-kernel, acme, peterz, mingo,
	mark.rutland, alexander.shishkin, jolsa, namhyung
  Cc: ravi.bangoria, sandipan.das, ananth.narayan, gautham.shenoy, eranian

Processors based on chiplet architectures, such as AMD EPYC and Hygon,
do not expose the chiplet details in the sysfs CPU topology information.
However, this information can be derived from the per-CPU cache level
information in sysfs.

perf stat already supports aggregation based on topology information
using core ID, socket ID, etc. It is useful to aggregate based on the
cache topology to detect problems like imbalance and cache-to-cache
sharing at various cache levels.

This patch adds support for the "--per-cache" option for aggregation at
a particular cache level. Also update the docs and the related tests.
The output will look like:

  $ sudo ./perf stat --per-cache=L3 -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5

   Performance counter stats for 'system wide':

  S0-D0-L3-ID0             16              7,022      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID1             16                994      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID2             16                297      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID3             16              2,852      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID4             16              7,764      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID5             16              1,779      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID6             16              2,747      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID7             16              6,665      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID8             16                799      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID9             16                846      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID10            16              3,048      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID11            16              1,015      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID12            16                432      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID13            16                837      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID14            16                348      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID15            16              1,175      ls_dmnd_fills_from_sys.ext_cache_remote

Also add support for perf stat record and perf stat report, with the
ability to specify a different cache level to aggregate data at when
running perf stat report.

Suggested-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 tools/lib/perf/include/perf/cpumap.h          |   5 +
 tools/lib/perf/include/perf/event.h           |   3 +-
 tools/perf/Documentation/perf-stat.txt        |  16 ++
 tools/perf/builtin-stat.c                     | 153 +++++++++++++++++-
 .../tests/shell/lib/perf_json_output_lint.py  |   4 +-
 tools/perf/tests/shell/stat+csv_output.sh     |  14 ++
 tools/perf/tests/shell/stat+json_output.sh    |  13 ++
 tools/perf/util/cpumap.c                      |  97 +++++++++++
 tools/perf/util/cpumap.h                      |  17 ++
 tools/perf/util/event.c                       |   7 +-
 tools/perf/util/stat-display.c                |  17 ++
 tools/perf/util/stat-shadow.c                 |   1 +
 tools/perf/util/stat.h                        |   2 +
 tools/perf/util/synthetic-events.c            |   1 +
 14 files changed, 343 insertions(+), 7 deletions(-)

diff --git a/tools/lib/perf/include/perf/cpumap.h b/tools/lib/perf/include/perf/cpumap.h
index 3f43f770cdac..8724dde79342 100644
--- a/tools/lib/perf/include/perf/cpumap.h
+++ b/tools/lib/perf/include/perf/cpumap.h
@@ -11,6 +11,11 @@ struct perf_cpu {
 	int cpu;
 };
 
+struct perf_cache {
+	int cache_lvl;
+	int cache;
+};
+
 struct perf_cpu_map;
 
 LIBPERF_API struct perf_cpu_map *perf_cpu_map__dummy_new(void);
diff --git a/tools/lib/perf/include/perf/event.h b/tools/lib/perf/include/perf/event.h
index ad47d7b31046..f3ceb2f96593 100644
--- a/tools/lib/perf/include/perf/event.h
+++ b/tools/lib/perf/include/perf/event.h
@@ -378,7 +378,8 @@ enum {
 	PERF_STAT_CONFIG_TERM__AGGR_MODE	= 0,
 	PERF_STAT_CONFIG_TERM__INTERVAL		= 1,
 	PERF_STAT_CONFIG_TERM__SCALE		= 2,
-	PERF_STAT_CONFIG_TERM__MAX		= 3,
+	PERF_STAT_CONFIG_TERM__AGGR_LEVEL	= 3,
+	PERF_STAT_CONFIG_TERM__MAX		= 4,
 };
 
 struct perf_record_stat_config_entry {
diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 18abdc1dce05..ad7894f5c02b 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -308,6 +308,14 @@ use --per-die in addition to -a. (system-wide).  The output includes the
 die number and the number of online processors on that die. This is
 useful to gauge the amount of aggregation.
 
+--per-cache::
+Aggregate counts per cache instance for system-wide mode measurements.  By
+default, the aggregation happens for the cache level at the highest index
+in the system. To specify a particular level, mention the cache level
+alongside the option in the format [Ll][1-9][0-9]*. For example:
+"--per-cache=l3" and "--per-cache=L3" will aggregate the information
+at the boundary of the level 3 cache in the system.
+
 --per-core::
 Aggregate counts per physical processor for system-wide mode measurements.  This
 is a useful mode to detect imbalance between physical cores.  To enable this mode,
@@ -379,6 +387,14 @@ Aggregate counts per processor socket for system-wide mode measurements.
 --per-die::
 Aggregate counts per processor die for system-wide mode measurements.
 
+--per-cache::
+Aggregate counts per cache instance for system-wide mode measurements.  By
+default, the aggregation happens for the cache level at the highest index
+in the system. To specify a particular level, mention the cache level
+alongside the option in the format [Ll][1-9][0-9]*. For example:
+"--per-cache=l3" and "--per-cache=L3" will aggregate the information
+at the boundary of the level 3 cache in the system.
+
 --per-core::
 Aggregate counts per physical processor for system-wide mode measurements.
 
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index fa7c40956d0f..884f92319f9f 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -203,6 +203,7 @@ struct perf_stat {
 	struct perf_cpu_map	*cpus;
 	struct perf_thread_map *threads;
 	enum aggr_mode		 aggr_mode;
+	u32			 aggr_level;
 };
 
 static struct perf_stat		perf_stat;
@@ -210,8 +211,9 @@ static struct perf_stat		perf_stat;
 
 static volatile sig_atomic_t done = 0;
 
-static struct perf_stat_config stat_config = {
+struct perf_stat_config stat_config = {
 	.aggr_mode		= AGGR_GLOBAL,
+	.aggr_level		= MAX_CACHE_LVL + 1,
 	.scale			= true,
 	.unit_width		= 4, /* strlen("unit") */
 	.run_count		= 1,
@@ -1160,6 +1162,52 @@ static int parse_hybrid_type(const struct option *opt,
 	return 0;
 }
 
+static int parse_cache_level(const struct option *opt, const char *str,
+		 int unset __maybe_unused)
+{
+	int level;
+	u32 *aggr_mode = (u32 *)opt->value;
+	u32 *aggr_level = (u32 *)opt->data;
+
+	if (str == NULL) {
+		level = MAX_CACHE_LVL + 1;
+		goto out;
+	}
+
+	if (!strlen(str) || (str[0] != 'l' && str[0] != 'L')) {
+		pr_err("Cache level must be of form L[1-%d], or l[1-%d]\n",
+		       MAX_CACHE_LVL,
+		       MAX_CACHE_LVL);
+		return -EINVAL;
+	}
+
+	level = atoi(str);
+	if (level > 0)
+		goto out;
+
+	/*
+	 * Assume first character of string is 'L' or 'l'
+	 * if the conversion fails.
+	 */
+	level = atoi(&str[1]);
+	if (level < 1) {
+		pr_err("Cache level must be of form L[1-%d], or l[1-%d]\n",
+		       MAX_CACHE_LVL,
+		       MAX_CACHE_LVL);
+		return -EINVAL;
+	}
+
+	if (level > MAX_CACHE_LVL) {
+		pr_err("perf only supports max cache level of %d.\n"
+		       "Consider increasing MAX_CACHE_LVL\n", MAX_CACHE_LVL);
+		return -EINVAL;
+	}
+out:
+	*aggr_mode = AGGR_CACHE;
+	*aggr_level = level;
+	return 0;
+}
+
 static struct option stat_options[] = {
 	OPT_BOOLEAN('T', "transaction", &transaction_run,
 		    "hardware transaction statistics"),
@@ -1237,6 +1285,9 @@ static struct option stat_options[] = {
 		     "aggregate counts per processor socket", AGGR_SOCKET),
 	OPT_SET_UINT(0, "per-die", &stat_config.aggr_mode,
 		     "aggregate counts per processor die", AGGR_DIE),
+	OPT_CALLBACK_OPTARG(0, "per-cache", &stat_config.aggr_mode, &stat_config.aggr_level,
+			    "cache level", "aggregate count at this cache level (Default: LLC)",
+			    parse_cache_level),
 	OPT_SET_UINT(0, "per-core", &stat_config.aggr_mode,
 		     "aggregate counts per physical processor core", AGGR_CORE),
 	OPT_SET_UINT(0, "per-thread", &stat_config.aggr_mode,
@@ -1298,6 +1349,7 @@ static struct option stat_options[] = {
 
 static const char *const aggr_mode__string[] = {
 	[AGGR_CORE] = "core",
+	[AGGR_CACHE] = "cache",
 	[AGGR_DIE] = "die",
 	[AGGR_GLOBAL] = "global",
 	[AGGR_NODE] = "node",
@@ -1319,6 +1371,12 @@ static struct aggr_cpu_id perf_stat__get_die(struct perf_stat_config *config __m
 	return aggr_cpu_id__die(cpu, /*data=*/NULL);
 }
 
+static struct aggr_cpu_id perf_stat__get_cache_id(struct perf_stat_config *config __maybe_unused,
+						 struct perf_cpu cpu)
+{
+	return aggr_cpu_id__cache(cpu, /*data=*/NULL);
+}
+
 static struct aggr_cpu_id perf_stat__get_core(struct perf_stat_config *config __maybe_unused,
 					      struct perf_cpu cpu)
 {
@@ -1371,6 +1429,12 @@ static struct aggr_cpu_id perf_stat__get_die_cached(struct perf_stat_config *con
 	return perf_stat__get_aggr(config, perf_stat__get_die, cpu);
 }
 
+static struct aggr_cpu_id perf_stat__get_cache_id_cached(struct perf_stat_config *config,
+							struct perf_cpu cpu)
+{
+	return perf_stat__get_aggr(config, perf_stat__get_cache_id, cpu);
+}
+
 static struct aggr_cpu_id perf_stat__get_core_cached(struct perf_stat_config *config,
 						     struct perf_cpu cpu)
 {
@@ -1402,6 +1466,8 @@ static aggr_cpu_id_get_t aggr_mode__get_aggr(enum aggr_mode aggr_mode)
 		return aggr_cpu_id__socket;
 	case AGGR_DIE:
 		return aggr_cpu_id__die;
+	case AGGR_CACHE:
+		return aggr_cpu_id__cache;
 	case AGGR_CORE:
 		return aggr_cpu_id__core;
 	case AGGR_NODE:
@@ -1425,6 +1491,8 @@ static aggr_get_id_t aggr_mode__get_id(enum aggr_mode aggr_mode)
 		return perf_stat__get_socket_cached;
 	case AGGR_DIE:
 		return perf_stat__get_die_cached;
+	case AGGR_CACHE:
+		return perf_stat__get_cache_id_cached;
 	case AGGR_CORE:
 		return perf_stat__get_core_cached;
 	case AGGR_NODE:
@@ -1537,6 +1605,72 @@ static struct aggr_cpu_id perf_env__get_die_aggr_by_cpu(struct perf_cpu cpu, voi
 	return id;
 }
 
+static int perf_env__get_cache_id_for_cpu(struct perf_cpu cpu, struct perf_env *env,
+					  u32 cache_level, struct perf_cache *cache)
+{
+	int i;
+	int caches_cnt = env->caches_cnt;
+	struct cpu_cache_level *caches = env->caches;
+
+	if (!caches_cnt)
+		return -1;
+
+	for (i = caches_cnt - 1; i > -1; --i) {
+		struct perf_cpu_map *cpu_map;
+		int map_contains_cpu;
+
+		/*
+		 * If the user has not specified a level, find the first level with
+		 * the cpu in the map. Since building the map is expensive, do
+		 * this only if levels match.
+		 */
+		if (cache_level <= MAX_CACHE_LVL && caches[i].level != cache_level)
+			continue;
+
+		cpu_map = perf_cpu_map__new(caches[i].map);
+		map_contains_cpu = perf_cpu_map__idx(cpu_map, cpu);
+		perf_cpu_map__put(cpu_map);
+
+		if (map_contains_cpu != -1) {
+			cache->cache = caches[i].id;
+			cache->cache_lvl = caches[i].level;
+			return 0;
+		}
+	}
+
+	return -1;
+}
+
+static struct aggr_cpu_id perf_env__get_cache_aggr_by_cpu(struct perf_cpu cpu,
+							  void *data)
+{
+	struct perf_env *env = data;
+	struct aggr_cpu_id id = aggr_cpu_id__empty();
+
+	if (cpu.cpu != -1) {
+		int ret;
+		u32 cache_level;
+		struct perf_cache cache = {
+			.cache		= -1,
+			.cache_lvl	= -1,
+		};
+
+		cache_level = (perf_stat.aggr_level) ?: stat_config.aggr_level;
+
+		id.socket = env->cpu[cpu.cpu].socket_id;
+		id.die = env->cpu[cpu.cpu].die_id;
+
+		ret = perf_env__get_cache_id_for_cpu(cpu, env, cache_level, &cache);
+		if (ret)
+			return id;
+
+		id.cache_lvl = cache.cache_lvl;
+		id.cache = cache.cache;
+	}
+
+	return id;
+}
+
 static struct aggr_cpu_id perf_env__get_core_aggr_by_cpu(struct perf_cpu cpu, void *data)
 {
 	struct perf_env *env = data;
@@ -1605,6 +1739,12 @@ static struct aggr_cpu_id perf_stat__get_die_file(struct perf_stat_config *confi
 	return perf_env__get_die_aggr_by_cpu(cpu, &perf_stat.session->header.env);
 }
 
+static struct aggr_cpu_id perf_stat__get_cache_file(struct perf_stat_config *config __maybe_unused,
+						      struct perf_cpu cpu)
+{
+	return perf_env__get_cache_aggr_by_cpu(cpu, &perf_stat.session->header.env);
+}
+
 static struct aggr_cpu_id perf_stat__get_core_file(struct perf_stat_config *config __maybe_unused,
 						   struct perf_cpu cpu)
 {
@@ -1636,6 +1776,8 @@ static aggr_cpu_id_get_t aggr_mode__get_aggr_file(enum aggr_mode aggr_mode)
 		return perf_env__get_socket_aggr_by_cpu;
 	case AGGR_DIE:
 		return perf_env__get_die_aggr_by_cpu;
+	case AGGR_CACHE:
+		return perf_env__get_cache_aggr_by_cpu;
 	case AGGR_CORE:
 		return perf_env__get_core_aggr_by_cpu;
 	case AGGR_NODE:
@@ -1659,6 +1801,8 @@ static aggr_get_id_t aggr_mode__get_id_file(enum aggr_mode aggr_mode)
 		return perf_stat__get_socket_file;
 	case AGGR_DIE:
 		return perf_stat__get_die_file;
+	case AGGR_CACHE:
+		return perf_stat__get_cache_file;
 	case AGGR_CORE:
 		return perf_stat__get_core_file;
 	case AGGR_NODE:
@@ -2207,7 +2351,8 @@ static struct perf_stat perf_stat = {
 		.stat		= perf_event__process_stat_event,
 		.stat_round	= process_stat_round_event,
 	},
-	.aggr_mode = AGGR_UNSET,
+	.aggr_mode	= AGGR_UNSET,
+	.aggr_level	= 0,
 };
 
 static int __cmd_report(int argc, const char **argv)
@@ -2219,6 +2364,10 @@ static int __cmd_report(int argc, const char **argv)
 		     "aggregate counts per processor socket", AGGR_SOCKET),
 	OPT_SET_UINT(0, "per-die", &perf_stat.aggr_mode,
 		     "aggregate counts per processor die", AGGR_DIE),
+	OPT_CALLBACK_OPTARG(0, "per-cache", &perf_stat.aggr_mode, &perf_stat.aggr_level,
+			    "cache level",
+			    "aggregate count at this cache level (Default: LLC)",
+			    parse_cache_level),
 	OPT_SET_UINT(0, "per-core", &perf_stat.aggr_mode,
 		     "aggregate counts per physical processor core", AGGR_CORE),
 	OPT_SET_UINT(0, "per-node", &perf_stat.aggr_mode,
diff --git a/tools/perf/tests/shell/lib/perf_json_output_lint.py b/tools/perf/tests/shell/lib/perf_json_output_lint.py
index 97598d14e532..62489766b93c 100644
--- a/tools/perf/tests/shell/lib/perf_json_output_lint.py
+++ b/tools/perf/tests/shell/lib/perf_json_output_lint.py
@@ -14,6 +14,7 @@ ap.add_argument('--system-wide', action='store_true')
 ap.add_argument('--event', action='store_true')
 ap.add_argument('--per-core', action='store_true')
 ap.add_argument('--per-thread', action='store_true')
+ap.add_argument('--per-cache', action='store_true')
 ap.add_argument('--per-die', action='store_true')
 ap.add_argument('--per-node', action='store_true')
 ap.add_argument('--per-socket', action='store_true')
@@ -46,6 +47,7 @@ def check_json_output(expected_items):
       'counter-value': lambda x: is_counter_value(x),
       'cgroup': lambda x: True,
       'cpu': lambda x: isint(x),
+      'cache': lambda x: True,
       'die': lambda x: True,
       'event': lambda x: True,
       'event-runtime': lambda x: isfloat(x),
@@ -82,7 +84,7 @@ try:
     expected_items = 7
   elif args.interval or args.per_thread or args.system_wide_no_aggr:
     expected_items = 8
-  elif args.per_core or args.per_socket or args.per_node or args.per_die:
+  elif args.per_core or args.per_socket or args.per_node or args.per_die or args.per_cache:
     expected_items = 9
   else:
     # If no option is specified, don't check the number of items.
diff --git a/tools/perf/tests/shell/stat+csv_output.sh b/tools/perf/tests/shell/stat+csv_output.sh
index 324fc9e6edd7..6cdf2fd386d5 100755
--- a/tools/perf/tests/shell/stat+csv_output.sh
+++ b/tools/perf/tests/shell/stat+csv_output.sh
@@ -26,6 +26,7 @@ function commachecker()
 	;; "--per-socket")	exp=8
 	;; "--per-node")	exp=8
 	;; "--per-die")		exp=8
+	;; "--per-cache")	exp=8
 	esac
 
 	while read line
@@ -123,6 +124,18 @@ check_per_thread()
 	echo "[Success]"
 }
 
+check_per_cache_instance()
+{
+	echo -n "Checking CSV output: per cache instance "
+	if ParanoidAndNotRoot 0
+	then
+		echo "[Skip] paranoid and not root"
+		return
+	fi
+	perf stat -x$csv_sep --per-cache -a true 2>&1 | commachecker --per-cache
+	echo "[Success]"
+}
+
 check_per_die()
 {
 	echo -n "Checking CSV output: per die "
@@ -197,6 +210,7 @@ if [ $skip_test -ne 1 ]
 then
 	check_system_wide_no_aggr
 	check_per_core
+	check_per_cache_instance
 	check_per_die
 	check_per_socket
 else
diff --git a/tools/perf/tests/shell/stat+json_output.sh b/tools/perf/tests/shell/stat+json_output.sh
index 2c4212c641ed..d79a6e0d4042 100755
--- a/tools/perf/tests/shell/stat+json_output.sh
+++ b/tools/perf/tests/shell/stat+json_output.sh
@@ -100,6 +100,18 @@ check_per_thread()
 	echo "[Success]"
 }
 
+check_per_cache_instance()
+{
+	echo -n "Checking json output: per cache_instance "
+	if ParanoidAndNotRoot 0
+	then
+		echo "[Skip] paranoia and not root"
+		return
+	fi
+	perf stat -j --per-cache -a true 2>&1 | $PYTHON $pythonchecker --per-cache
+	echo "[Success]"
+}
+
 check_per_die()
 {
 	echo -n "Checking json output: per die "
@@ -174,6 +186,7 @@ if [ $skip_test -ne 1 ]
 then
 	check_system_wide_no_aggr
 	check_per_core
+	check_per_cache_instance
 	check_per_die
 	check_per_socket
 else
diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
index 5e564974fba4..5d62f21c6adc 100644
--- a/tools/perf/util/cpumap.c
+++ b/tools/perf/util/cpumap.c
@@ -3,6 +3,8 @@
 #include "cpumap.h"
 #include "debug.h"
 #include "event.h"
+#include "header.h"
+#include "stat.h"
 #include <assert.h>
 #include <dirent.h>
 #include <stdio.h>
@@ -227,6 +229,10 @@ static int aggr_cpu_id__cmp(const void *a_pointer, const void *b_pointer)
 		return a->socket - b->socket;
 	else if (a->die != b->die)
 		return a->die - b->die;
+	else if (a->cache_lvl != b->cache_lvl)
+		return a->cache_lvl - b->cache_lvl;
+	else if (a->cache != b->cache)
+		return a->cache - b->cache;
 	else if (a->core != b->core)
 		return a->core - b->core;
 	else
@@ -310,6 +316,91 @@ struct aggr_cpu_id aggr_cpu_id__die(struct perf_cpu cpu, void *data)
 	return id;
 }
 
+extern struct perf_stat_config stat_config;
+
+int cpu__get_cache_details(struct perf_cpu cpu, struct perf_cache *cache)
+{
+	int ret = 0;
+	struct cpu_cache_level caches[MAX_CACHE_LVL];
+	u32 cache_level = stat_config.aggr_level;
+	u32 i = 0, caches_cnt = 0;
+
+	cache->cache_lvl = -1;
+	cache->cache = -1;
+
+	ret = build_caches_for_cpu(cpu.cpu, caches, &caches_cnt);
+	if (ret) {
+		/*
+		 * If caches_cnt is not 0, cpu_cache_level data
+		 * was allocated when building the topology.
+		 * Free the allocated data before returning.
+		 */
+		if (caches_cnt)
+			goto free_caches;
+
+		return ret;
+	}
+
+	if (!caches_cnt)
+		return -1;
+
+	/*
+	 * Save the data for the highest level if no
+	 * level was specified by the user.
+	 */
+	if (cache_level > MAX_CACHE_LVL) {
+		int max_level_index = 0;
+
+		for (i = 1; i < caches_cnt; ++i) {
+			if (caches[i].level > caches[max_level_index].level)
+				max_level_index = i;
+		}
+
+		cache->cache_lvl = caches[max_level_index].level;
+		cache->cache = caches[max_level_index].id;
+
+		i = 0; // Reset i to 0 to free entire caches[]
+		goto free_caches;
+	}
+
+	for (i = 0; i < caches_cnt; ++i) {
+		if (caches[i].level == cache_level) {
+			cache->cache_lvl = cache_level;
+			cache->cache = caches[i].id;
+		}
+
+		cpu_cache_level__free(&caches[i]);
+	}
+
+free_caches:
+	/*
+	 * Free all the allocated cpu_cache_level data.
+	 */
+	while (i < caches_cnt)
+		cpu_cache_level__free(&caches[i++]);
+
+	return ret;
+}
+
+struct aggr_cpu_id aggr_cpu_id__cache(struct perf_cpu cpu, void *data)
+{
+	int ret;
+	struct aggr_cpu_id id;
+	struct perf_cache cache;
+
+	id = aggr_cpu_id__die(cpu, data);
+	if (aggr_cpu_id__is_empty(&id))
+		return id;
+
+	ret = cpu__get_cache_details(cpu, &cache);
+	if (ret)
+		return id;
+
+	id.cache_lvl = cache.cache_lvl;
+	id.cache = cache.cache;
+	return id;
+}
+
 int cpu__get_core_id(struct perf_cpu cpu)
 {
 	int value, ret = cpu__get_topology_int(cpu.cpu, "core_id", &value);
@@ -684,6 +775,8 @@ bool aggr_cpu_id__equal(const struct aggr_cpu_id *a, const struct aggr_cpu_id *b
 		a->node == b->node &&
 		a->socket == b->socket &&
 		a->die == b->die &&
+		a->cache_lvl == b->cache_lvl &&
+		a->cache == b->cache &&
 		a->core == b->core &&
 		a->cpu.cpu == b->cpu.cpu;
 }
@@ -694,6 +787,8 @@ bool aggr_cpu_id__is_empty(const struct aggr_cpu_id *a)
 		a->node == -1 &&
 		a->socket == -1 &&
 		a->die == -1 &&
+		a->cache_lvl == -1 &&
+		a->cache == -1 &&
 		a->core == -1 &&
 		a->cpu.cpu == -1;
 }
@@ -705,6 +800,8 @@ struct aggr_cpu_id aggr_cpu_id__empty(void)
 		.node = -1,
 		.socket = -1,
 		.die = -1,
+		.cache_lvl = -1,
+		.cache = -1,
 		.core = -1,
 		.cpu = (struct perf_cpu){ .cpu = -1 },
 	};
diff --git a/tools/perf/util/cpumap.h b/tools/perf/util/cpumap.h
index c2f5824a3a22..d319c260ea09 100644
--- a/tools/perf/util/cpumap.h
+++ b/tools/perf/util/cpumap.h
@@ -20,6 +20,10 @@ struct aggr_cpu_id {
 	int socket;
 	/** The die id as read from /sys/devices/system/cpu/cpuX/topology/die_id. */
 	int die;
+	/** The cache level as read from /sys/devices/system/cpu/cpuX/cache/indexY/level */
+	int cache_lvl;
+	/** The cache instance ID as read from /sys/devices/system/cpu/cpuX/cache/indexY/id */
+	int cache;
 	/** The core id as read from /sys/devices/system/cpu/cpuX/topology/core_id. */
 	int core;
 	/** CPU aggregation, note there is one CPU for each SMT thread. */
@@ -76,6 +80,12 @@ int cpu__get_socket_id(struct perf_cpu cpu);
  * /sys/devices/system/cpu/cpuX/topology/die_id for the given CPU.
  */
 int cpu__get_die_id(struct perf_cpu cpu);
+/**
+ * cpu__get_cache_details - Returns 0 if successful in populating the
+ * cache level and cache id as read from
+ * /sys/devices/system/cpu/cpuX/cache/indexY/{id, level} for the given CPU.
+ */
+int cpu__get_cache_details(struct perf_cpu cpu, struct perf_cache *cache);
 /**
  * cpu__get_core_id - Returns the core id as read from
  * /sys/devices/system/cpu/cpuX/topology/core_id for the given CPU.
@@ -116,6 +126,13 @@ struct aggr_cpu_id aggr_cpu_id__socket(struct perf_cpu cpu, void *data);
  * aggr_cpu_id_get_t.
  */
 struct aggr_cpu_id aggr_cpu_id__die(struct perf_cpu cpu, void *data);
+/**
+ * aggr_cpu_id__cache - Create an aggr_cpu_id with cache instance ID, cache
+ * level, die and socket populated with the cache instance ID, cache level,
+ * die and socket for cpu. The function signature is compatible with
+ * aggr_cpu_id_get_t.
+ */
+struct aggr_cpu_id aggr_cpu_id__cache(struct perf_cpu cpu, void *data);
 /**
  * aggr_cpu_id__core - Create an aggr_cpu_id with the core, die and socket
  * populated with the core, die and socket for cpu. The function signature is
diff --git a/tools/perf/util/event.c b/tools/perf/util/event.c
index 1fa14598b916..faf0df3c5b95 100644
--- a/tools/perf/util/event.c
+++ b/tools/perf/util/event.c
@@ -135,9 +135,10 @@ void perf_event__read_stat_config(struct perf_stat_config *config,
 			config->__val = event->data[i].val;	\
 			break;
 
-		CASE(AGGR_MODE, aggr_mode)
-		CASE(SCALE,     scale)
-		CASE(INTERVAL,  interval)
+		CASE(AGGR_MODE,  aggr_mode)
+		CASE(SCALE,      scale)
+		CASE(INTERVAL,   interval)
+		CASE(AGGR_LEVEL, aggr_level)
 #undef CASE
 		default:
 			pr_warning("unknown stat config term %" PRI_lu64 "\n",
diff --git a/tools/perf/util/stat-display.c b/tools/perf/util/stat-display.c
index 1b5cb20efd23..82ec668bc3ba 100644
--- a/tools/perf/util/stat-display.c
+++ b/tools/perf/util/stat-display.c
@@ -36,6 +36,7 @@
 
 static int aggr_header_lens[] = {
 	[AGGR_CORE] 	= 18,
+	[AGGR_CACHE]	= 22,
 	[AGGR_DIE] 	= 12,
 	[AGGR_SOCKET] 	= 6,
 	[AGGR_NODE] 	= 6,
@@ -46,6 +47,7 @@ static int aggr_header_lens[] = {
 
 static const char *aggr_header_csv[] = {
 	[AGGR_CORE] 	= 	"core,cpus,",
+	[AGGR_CACHE]	= 	"cache,cpus,",
 	[AGGR_DIE] 	= 	"die,cpus,",
 	[AGGR_SOCKET] 	= 	"socket,cpus,",
 	[AGGR_NONE] 	= 	"cpu,",
@@ -56,6 +58,7 @@ static const char *aggr_header_csv[] = {
 
 static const char *aggr_header_std[] = {
 	[AGGR_CORE] 	= 	"core",
+	[AGGR_CACHE] 	= 	"cache",
 	[AGGR_DIE] 	= 	"die",
 	[AGGR_SOCKET] 	= 	"socket",
 	[AGGR_NONE] 	= 	"cpu",
@@ -193,6 +196,10 @@ static void print_aggr_id_std(struct perf_stat_config *config,
 	case AGGR_CORE:
 		snprintf(buf, sizeof(buf), "S%d-D%d-C%d", id.socket, id.die, id.core);
 		break;
+	case AGGR_CACHE:
+		snprintf(buf, sizeof(buf), "S%d-D%d-L%d-ID%d",
+			 id.socket, id.die, id.cache_lvl, id.cache);
+		break;
 	case AGGR_DIE:
 		snprintf(buf, sizeof(buf), "S%d-D%d", id.socket, id.die);
 		break;
@@ -239,6 +246,10 @@ static void print_aggr_id_csv(struct perf_stat_config *config,
 		fprintf(output, "S%d-D%d-C%d%s%d%s",
 			id.socket, id.die, id.core, sep, nr, sep);
 		break;
+	case AGGR_CACHE:
+		fprintf(config->output, "S%d-D%d-L%d-ID%d%s%d%s",
+			id.socket, id.die, id.cache_lvl, id.cache, sep, nr, sep);
+		break;
 	case AGGR_DIE:
 		fprintf(output, "S%d-D%d%s%d%s",
 			id.socket, id.die, sep, nr, sep);
@@ -284,6 +295,10 @@ static void print_aggr_id_json(struct perf_stat_config *config,
 		fprintf(output, "\"core\" : \"S%d-D%d-C%d\", \"aggregate-number\" : %d, ",
 			id.socket, id.die, id.core, nr);
 		break;
+	case AGGR_CACHE:
+		fprintf(output, "\"cache\" : \"S%d-D%d-L%d-ID%d\", \"aggregate-number\" : %d, ",
+			id.socket, id.die, id.cache_lvl, id.cache, nr);
+		break;
 	case AGGR_DIE:
 		fprintf(output, "\"die\" : \"S%d-D%d\", \"aggregate-number\" : %d, ",
 			id.socket, id.die, nr);
@@ -1126,6 +1141,7 @@ static void print_header_interval_std(struct perf_stat_config *config,
 	case AGGR_NODE:
 	case AGGR_SOCKET:
 	case AGGR_DIE:
+	case AGGR_CACHE:
 	case AGGR_CORE:
 		fprintf(output, "#%*s %-*s cpus",
 			INTERVAL_LEN - 1, "time",
@@ -1422,6 +1438,7 @@ void evlist__print_counters(struct evlist *evlist, struct perf_stat_config *conf
 
 	switch (config->aggr_mode) {
 	case AGGR_CORE:
+	case AGGR_CACHE:
 	case AGGR_DIE:
 	case AGGR_SOCKET:
 	case AGGR_NODE:
diff --git a/tools/perf/util/stat-shadow.c b/tools/perf/util/stat-shadow.c
index 806b32156459..f080905a3ece 100644
--- a/tools/perf/util/stat-shadow.c
+++ b/tools/perf/util/stat-shadow.c
@@ -20,6 +20,7 @@
  * AGGR_GLOBAL: Use CPU 0
  * AGGR_SOCKET: Use first CPU of socket
  * AGGR_DIE: Use first CPU of die
+ * AGGR_CACHE: Use first CPU of cache level instance
  * AGGR_CORE: Use first CPU of core
  * AGGR_NONE: Use matching CPU
  * AGGR_THREAD: Not supported?
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index bf1794ebc916..848b3b3f5819 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -74,6 +74,7 @@ enum aggr_mode {
 	AGGR_GLOBAL,
 	AGGR_SOCKET,
 	AGGR_DIE,
+	AGGR_CACHE,
 	AGGR_CORE,
 	AGGR_THREAD,
 	AGGR_UNSET,
@@ -139,6 +140,7 @@ typedef struct aggr_cpu_id (*aggr_get_id_t)(struct perf_stat_config *config, str
 
 struct perf_stat_config {
 	enum aggr_mode		 aggr_mode;
+	u32			 aggr_level;
 	bool			 scale;
 	bool			 no_inherit;
 	bool			 identifier;
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index 9ab9308ee80c..2fe648be1e7d 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -1373,6 +1373,7 @@ int perf_event__synthesize_stat_config(struct perf_tool *tool,
 	ADD(AGGR_MODE,	config->aggr_mode)
 	ADD(INTERVAL,	config->interval)
 	ADD(SCALE,	config->scale)
+	ADD(AGGR_LEVEL,	config->aggr_level)
 
 	WARN_ONCE(i != PERF_STAT_CONFIG_TERM__MAX,
 		  "stat config terms unbalanced\n");
-- 
2.34.1


* Re: [RFC PATCH 1/4] perf: Read cache instance ID when building cache topology
  2023-03-31  4:51 ` [RFC PATCH 1/4] perf: Read cache instance ID when building " K Prateek Nayak
@ 2023-03-31 11:54   ` Arnaldo Carvalho de Melo
  2023-04-03  2:43     ` K Prateek Nayak
  0 siblings, 1 reply; 7+ messages in thread
From: Arnaldo Carvalho de Melo @ 2023-03-31 11:54 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: linux-perf-users, linux-kernel, peterz, mingo, mark.rutland,
	alexander.shishkin, jolsa, namhyung, ravi.bangoria, sandipan.das,
	ananth.narayan, gautham.shenoy, eranian

On Fri, Mar 31, 2023 at 10:21:14AM +0530, K Prateek Nayak wrote:
> CPU cache level data currently stores cache level, type, line size,
> associativity, sets, total cache size, and the CPUs sharing the cache.
> Also read and store the cache instance ID from
> "/sys/devices/system/cpu/cpuX/cache/indexY/id" in the cache level data.
> Use instance ID as well when comparing cache levels.

And if a new perf tool is used on an older kernel without this new 'id'
file?

Please check if the file exists; if it doesn't, don't fail, just
initialize it with a zero. This way the latest perf will be usable on
older kernels.

- Arnaldo
 
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>  tools/perf/util/env.h    | 1 +
>  tools/perf/util/header.c | 7 +++++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/tools/perf/util/env.h b/tools/perf/util/env.h
> index 4566c51f2fd9..d761bfae76af 100644
> --- a/tools/perf/util/env.h
> +++ b/tools/perf/util/env.h
> @@ -17,6 +17,7 @@ struct cpu_topology_map {
>  
>  struct cpu_cache_level {
>  	u32	level;
> +	u32	id;
>  	u32	line_size;
>  	u32	sets;
>  	u32	ways;
> diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
> index 404d816ca124..5c3f5d260612 100644
> --- a/tools/perf/util/header.c
> +++ b/tools/perf/util/header.c
> @@ -1131,6 +1131,9 @@ static bool cpu_cache_level__cmp(struct cpu_cache_level *a, struct cpu_cache_lev
>  	if (a->level != b->level)
>  		return false;
>  
> +	if (a->id != b->id)
> +		return false;
> +
>  	if (a->line_size != b->line_size)
>  		return false;
>  
> @@ -1168,6 +1171,10 @@ static int cpu_cache_level__read(struct cpu_cache_level *cache, u32 cpu, u16 lev
>  	if (sysfs__read_int(file, (int *) &cache->level))
>  		return -1;
>  
> +	scnprintf(file, PATH_MAX, "%s/id", path);
> +	if (sysfs__read_int(file, (int *) &cache->id))
> +		return -1;
> +
>  	scnprintf(file, PATH_MAX, "%s/coherency_line_size", path);
>  	if (sysfs__read_int(file, (int *) &cache->line_size))
>  		return -1;
> -- 
> 2.34.1
> 

-- 

- Arnaldo

* Re: [RFC PATCH 1/4] perf: Read cache instance ID when building cache topology
  2023-03-31 11:54   ` Arnaldo Carvalho de Melo
@ 2023-04-03  2:43     ` K Prateek Nayak
  0 siblings, 0 replies; 7+ messages in thread
From: K Prateek Nayak @ 2023-04-03  2:43 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: linux-perf-users, linux-kernel, peterz, mingo, mark.rutland,
	alexander.shishkin, jolsa, namhyung, ravi.bangoria, sandipan.das,
	ananth.narayan, gautham.shenoy, eranian

Hello Arnaldo,

Thank you for taking a look at the series.

On 3/31/2023 5:24 PM, Arnaldo Carvalho de Melo wrote:
> On Fri, Mar 31, 2023 at 10:21:14AM +0530, K Prateek Nayak wrote:
>> CPU cache level data currently stores cache level, type, line size,
>> associativity, sets, total cache size, and the CPUs sharing the cache.
>> Also read and store the cache instance ID from
>> "/sys/devices/system/cpu/cpuX/cache/indexY/id" in the cache level data.
>> Use instance ID as well when comparing cache levels.
> 
> And if a new perf tool is used on an older kernel without this new 'id'
> file?
> 
> Please check if the file exists; if it doesn't, don't fail, just
> initialize it with a zero. This way the latest perf will be usable on
> older kernels.

That makes sense. I'll handle this case as you suggested in the next
version.
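
Something along these lines, perhaps (untested sketch, defaulting the id
to zero as you suggest; the exact handling may change in the next
version):

	scnprintf(file, PATH_MAX, "%s/id", path);
	if (sysfs__read_int(file, (int *) &cache->id))
		cache->id = 0;	/* "id" file may be absent on older kernels */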

> 
> - Arnaldo
>  
>> [..snip..]
> 

 
--
Thanks and Regards,
Prateek
