bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 0/5] Make evlist CPUs more accurate
@ 2022-04-08  3:56 Ian Rogers
  2022-04-08  3:56 ` [PATCH v3 1/5] perf cpumap: Don't decrement refcnt on args to merge Ian Rogers
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Ian Rogers @ 2022-04-08  3:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Mathieu Poirier, Suzuki K Poulose, Mike Leach, Leo Yan,
	John Garry, Will Deacon, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Kajol Jain, James Clark, German Gomez,
	Adrian Hunter, Riccardo Mancini, Andi Kleen, Alexey Bayduraev,
	Alexander Antonov, linux-perf-users, linux-kernel, coresight,
	linux-arm-kernel, netdev, bpf
  Cc: Stephane Eranian, Ian Rogers

evlist has all_cpus, computed to be the merge of all evsel CPU maps,
and cpus. cpus may contain more CPUs than all_cpus, as by default cpus
holds all online CPUs whilst all_cpus holds the merge/union from
evsels. For an uncore event there may just be 1 CPU per socket, which
will be a far smaller CPU map than all online CPUs.

The v1 patches changed cpus to be called user_requested_cpus, to
reflect their potential user specified nature. The user_requested_cpus
are set to be the current value intersected with all_cpus, so that
user_requested_cpus is always a subset of all_cpus. This fixes
printing code for metrics so that unnecessary blank lines aren't
printed.

To make the intersect function perform well, a perf_cpu_map__is_subset
function is added. While adding this function, the v2 patches also
used it in perf_cpu_map__merge to avoid creating a new CPU map for
some currently missed patterns. The reference counts for these
functions is simplified as discussed here:
https://lore.kernel.org/lkml/YkdOpJDnknrOPq2t@kernel.org/ but this
means users of perf_cpu_map__merge must now do a put on the 1st
argument.

v2. Reorders the "Avoid segv" patch and makes other adjustments
    suggested by Arnaldo Carvalho de Melo <acme@kernel.org>.
v3. Modify reference count behaviour for merge and intersect. Add
    intersect tests and tidy thee cpu map tests suite.

Ian Rogers (5):
  perf cpumap: Don't decrement refcnt on args to merge
  perf tests: Additional cpumap merge tests
  perf cpumap: Add intersect function.
  perf evlist: Respect all_cpus when setting user_requested_cpus
  perf test: Combine cpu map tests into 1 suite

 tools/lib/perf/cpumap.c              | 46 ++++++++++++++---
 tools/lib/perf/evlist.c              |  6 ++-
 tools/lib/perf/include/perf/cpumap.h |  2 +
 tools/perf/tests/builtin-test.c      |  4 +-
 tools/perf/tests/cpumap.c            | 74 +++++++++++++++++++++++++---
 tools/perf/tests/tests.h             |  4 +-
 tools/perf/util/evlist.c             |  7 +++
 7 files changed, 120 insertions(+), 23 deletions(-)

-- 
2.35.1.1178.g4f1659d476-goog


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v3 1/5] perf cpumap: Don't decrement refcnt on args to merge
  2022-04-08  3:56 [PATCH v3 0/5] Make evlist CPUs more accurate Ian Rogers
@ 2022-04-08  3:56 ` Ian Rogers
  2022-04-08  3:56 ` [PATCH v3 2/5] perf tests: Additional cpumap merge tests Ian Rogers
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Ian Rogers @ 2022-04-08  3:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Mathieu Poirier, Suzuki K Poulose, Mike Leach, Leo Yan,
	John Garry, Will Deacon, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Kajol Jain, James Clark, German Gomez,
	Adrian Hunter, Riccardo Mancini, Andi Kleen, Alexey Bayduraev,
	Alexander Antonov, linux-perf-users, linux-kernel, coresight,
	linux-arm-kernel, netdev, bpf
  Cc: Stephane Eranian, Ian Rogers

Having one argument to the cpumap merge decremented but not the other
leads to an inconsistent API. Don't decrement either argument.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/lib/perf/cpumap.c   | 11 +++--------
 tools/lib/perf/evlist.c   |  6 +++++-
 tools/perf/tests/cpumap.c |  1 +
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/tools/lib/perf/cpumap.c b/tools/lib/perf/cpumap.c
index 384d5e076ee4..95c56e17241b 100644
--- a/tools/lib/perf/cpumap.c
+++ b/tools/lib/perf/cpumap.c
@@ -342,9 +342,7 @@ bool perf_cpu_map__is_subset(const struct perf_cpu_map *a, const struct perf_cpu
 /*
  * Merge two cpumaps
  *
- * orig either gets freed and replaced with a new map, or reused
- * with no reference count change (similar to "realloc")
- * other has its reference count increased.
+ * May reuse either orig or other bumping reference count accordingly.
  */
 
 struct perf_cpu_map *perf_cpu_map__merge(struct perf_cpu_map *orig,
@@ -356,11 +354,9 @@ struct perf_cpu_map *perf_cpu_map__merge(struct perf_cpu_map *orig,
 	struct perf_cpu_map *merged;
 
 	if (perf_cpu_map__is_subset(orig, other))
-		return orig;
-	if (perf_cpu_map__is_subset(other, orig)) {
-		perf_cpu_map__put(orig);
+		return perf_cpu_map__get(orig);
+	if (perf_cpu_map__is_subset(other, orig))
 		return perf_cpu_map__get(other);
-	}
 
 	tmp_len = orig->nr + other->nr;
 	tmp_cpus = malloc(tmp_len * sizeof(struct perf_cpu));
@@ -387,6 +383,5 @@ struct perf_cpu_map *perf_cpu_map__merge(struct perf_cpu_map *orig,
 
 	merged = cpu_map__trim_new(k, tmp_cpus);
 	free(tmp_cpus);
-	perf_cpu_map__put(orig);
 	return merged;
 }
diff --git a/tools/lib/perf/evlist.c b/tools/lib/perf/evlist.c
index 1b15ba13c477..b783249a038b 100644
--- a/tools/lib/perf/evlist.c
+++ b/tools/lib/perf/evlist.c
@@ -35,6 +35,8 @@ void perf_evlist__init(struct perf_evlist *evlist)
 static void __perf_evlist__propagate_maps(struct perf_evlist *evlist,
 					  struct perf_evsel *evsel)
 {
+	struct perf_cpu_map *tmp;
+
 	/*
 	 * We already have cpus for evsel (via PMU sysfs) so
 	 * keep it, if there's no target cpu list defined.
@@ -52,7 +54,9 @@ static void __perf_evlist__propagate_maps(struct perf_evlist *evlist,
 
 	perf_thread_map__put(evsel->threads);
 	evsel->threads = perf_thread_map__get(evlist->threads);
-	evlist->all_cpus = perf_cpu_map__merge(evlist->all_cpus, evsel->cpus);
+	tmp = perf_cpu_map__merge(evlist->all_cpus, evsel->cpus);
+	perf_cpu_map__put(evlist->all_cpus);
+	evlist->all_cpus = tmp;
 }
 
 static void perf_evlist__propagate_maps(struct perf_evlist *evlist)
diff --git a/tools/perf/tests/cpumap.c b/tools/perf/tests/cpumap.c
index f94929ebb54b..cf205ed6b158 100644
--- a/tools/perf/tests/cpumap.c
+++ b/tools/perf/tests/cpumap.c
@@ -133,6 +133,7 @@ static int test__cpu_map_merge(struct test_suite *test __maybe_unused, int subte
 	TEST_ASSERT_VAL("failed to merge map: bad nr", perf_cpu_map__nr(c) == 5);
 	cpu_map__snprint(c, buf, sizeof(buf));
 	TEST_ASSERT_VAL("failed to merge map: bad result", !strcmp(buf, "1-2,4-5,7"));
+	perf_cpu_map__put(a);
 	perf_cpu_map__put(b);
 	perf_cpu_map__put(c);
 	return 0;
-- 
2.35.1.1178.g4f1659d476-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 2/5] perf tests: Additional cpumap merge tests
  2022-04-08  3:56 [PATCH v3 0/5] Make evlist CPUs more accurate Ian Rogers
  2022-04-08  3:56 ` [PATCH v3 1/5] perf cpumap: Don't decrement refcnt on args to merge Ian Rogers
@ 2022-04-08  3:56 ` Ian Rogers
  2022-04-08  3:56 ` [PATCH v3 3/5] perf cpumap: Add intersect function Ian Rogers
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Ian Rogers @ 2022-04-08  3:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Mathieu Poirier, Suzuki K Poulose, Mike Leach, Leo Yan,
	John Garry, Will Deacon, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Kajol Jain, James Clark, German Gomez,
	Adrian Hunter, Riccardo Mancini, Andi Kleen, Alexey Bayduraev,
	Alexander Antonov, linux-perf-users, linux-kernel, coresight,
	linux-arm-kernel, netdev, bpf
  Cc: Stephane Eranian, Ian Rogers

Cover cases where one cpu map is a subset of the other.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/tests/cpumap.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/tools/perf/tests/cpumap.c b/tools/perf/tests/cpumap.c
index cf205ed6b158..3b9fc549d30b 100644
--- a/tools/perf/tests/cpumap.c
+++ b/tools/perf/tests/cpumap.c
@@ -123,22 +123,36 @@ static int test__cpu_map_print(struct test_suite *test __maybe_unused, int subte
 	return 0;
 }
 
-static int test__cpu_map_merge(struct test_suite *test __maybe_unused, int subtest __maybe_unused)
+static int __test__cpu_map_merge(const char *lhs, const char *rhs, int nr, const char *expected)
 {
-	struct perf_cpu_map *a = perf_cpu_map__new("4,2,1");
-	struct perf_cpu_map *b = perf_cpu_map__new("4,5,7");
+	struct perf_cpu_map *a = perf_cpu_map__new(lhs);
+	struct perf_cpu_map *b = perf_cpu_map__new(rhs);
 	struct perf_cpu_map *c = perf_cpu_map__merge(a, b);
 	char buf[100];
 
-	TEST_ASSERT_VAL("failed to merge map: bad nr", perf_cpu_map__nr(c) == 5);
+	TEST_ASSERT_EQUAL("failed to merge map: bad nr", perf_cpu_map__nr(c), nr);
 	cpu_map__snprint(c, buf, sizeof(buf));
-	TEST_ASSERT_VAL("failed to merge map: bad result", !strcmp(buf, "1-2,4-5,7"));
+	TEST_ASSERT_VAL("failed to merge map: bad result", !strcmp(buf, expected));
 	perf_cpu_map__put(a);
 	perf_cpu_map__put(b);
 	perf_cpu_map__put(c);
 	return 0;
 }
 
+static int test__cpu_map_merge(struct test_suite *test __maybe_unused, int subtest __maybe_unused)
+{
+	int ret;
+
+	ret = __test__cpu_map_merge("4,2,1", "4,5,7", 5, "1-2,4-5,7");
+	if (ret) return ret;
+	ret = __test__cpu_map_merge("4,2,1", "1", 3, "1-2,4");
+	if (ret) return ret;
+	ret = __test__cpu_map_merge("1", "4,2,1", 3, "1-2,4");
+	if (ret) return ret;
+	ret = __test__cpu_map_merge("1", "1", 1, "1");
+	return ret;
+}
+
 DEFINE_SUITE("Synthesize cpu map", cpu_map_synthesize);
 DEFINE_SUITE("Print cpu map", cpu_map_print);
 DEFINE_SUITE("Merge cpu map", cpu_map_merge);
-- 
2.35.1.1178.g4f1659d476-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 3/5] perf cpumap: Add intersect function.
  2022-04-08  3:56 [PATCH v3 0/5] Make evlist CPUs more accurate Ian Rogers
  2022-04-08  3:56 ` [PATCH v3 1/5] perf cpumap: Don't decrement refcnt on args to merge Ian Rogers
  2022-04-08  3:56 ` [PATCH v3 2/5] perf tests: Additional cpumap merge tests Ian Rogers
@ 2022-04-08  3:56 ` Ian Rogers
  2022-04-08  3:56 ` [PATCH v3 4/5] perf evlist: Respect all_cpus when setting user_requested_cpus Ian Rogers
  2022-04-08  3:56 ` [PATCH v3 5/5] perf test: Combine cpu map tests into 1 suite Ian Rogers
  4 siblings, 0 replies; 9+ messages in thread
From: Ian Rogers @ 2022-04-08  3:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Mathieu Poirier, Suzuki K Poulose, Mike Leach, Leo Yan,
	John Garry, Will Deacon, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Kajol Jain, James Clark, German Gomez,
	Adrian Hunter, Riccardo Mancini, Andi Kleen, Alexey Bayduraev,
	Alexander Antonov, linux-perf-users, linux-kernel, coresight,
	linux-arm-kernel, netdev, bpf
  Cc: Stephane Eranian, Ian Rogers

The merge function gives the union of two cpu maps. Add an intersect
function which will be used in the next change.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/lib/perf/cpumap.c              | 35 ++++++++++++++++++++++++++++
 tools/lib/perf/include/perf/cpumap.h |  2 ++
 tools/perf/tests/builtin-test.c      |  1 +
 tools/perf/tests/cpumap.c            | 35 ++++++++++++++++++++++++++++
 tools/perf/tests/tests.h             |  1 +
 5 files changed, 74 insertions(+)

diff --git a/tools/lib/perf/cpumap.c b/tools/lib/perf/cpumap.c
index 95c56e17241b..66371135e742 100644
--- a/tools/lib/perf/cpumap.c
+++ b/tools/lib/perf/cpumap.c
@@ -385,3 +385,38 @@ struct perf_cpu_map *perf_cpu_map__merge(struct perf_cpu_map *orig,
 	free(tmp_cpus);
 	return merged;
 }
+
+struct perf_cpu_map *perf_cpu_map__intersect(struct perf_cpu_map *orig,
+					     struct perf_cpu_map *other)
+{
+	struct perf_cpu *tmp_cpus;
+	int tmp_len;
+	int i, j, k;
+	struct perf_cpu_map *merged = NULL;
+
+	if (perf_cpu_map__is_subset(other, orig))
+		return perf_cpu_map__get(orig);
+	if (perf_cpu_map__is_subset(orig, other))
+		return perf_cpu_map__get(other);
+
+	tmp_len = max(orig->nr, other->nr);
+	tmp_cpus = malloc(tmp_len * sizeof(struct perf_cpu));
+	if (!tmp_cpus)
+		return NULL;
+
+	i = j = k = 0;
+	while (i < orig->nr && j < other->nr) {
+		if (orig->map[i].cpu < other->map[j].cpu)
+			i++;
+		else if (orig->map[i].cpu > other->map[j].cpu)
+			j++;
+		else {
+			j++;
+			tmp_cpus[k++] = orig->map[i++];
+		}
+	}
+	if (k)
+		merged = cpu_map__trim_new(k, tmp_cpus);
+	free(tmp_cpus);
+	return merged;
+}
diff --git a/tools/lib/perf/include/perf/cpumap.h b/tools/lib/perf/include/perf/cpumap.h
index 4a2edbdb5e2b..a2a7216c0b78 100644
--- a/tools/lib/perf/include/perf/cpumap.h
+++ b/tools/lib/perf/include/perf/cpumap.h
@@ -19,6 +19,8 @@ LIBPERF_API struct perf_cpu_map *perf_cpu_map__read(FILE *file);
 LIBPERF_API struct perf_cpu_map *perf_cpu_map__get(struct perf_cpu_map *map);
 LIBPERF_API struct perf_cpu_map *perf_cpu_map__merge(struct perf_cpu_map *orig,
 						     struct perf_cpu_map *other);
+LIBPERF_API struct perf_cpu_map *perf_cpu_map__intersect(struct perf_cpu_map *orig,
+							 struct perf_cpu_map *other);
 LIBPERF_API void perf_cpu_map__put(struct perf_cpu_map *map);
 LIBPERF_API struct perf_cpu perf_cpu_map__cpu(const struct perf_cpu_map *cpus, int idx);
 LIBPERF_API int perf_cpu_map__nr(const struct perf_cpu_map *cpus);
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index fac3717d9ba1..dffa41e7ee20 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -88,6 +88,7 @@ static struct test_suite *generic_tests[] = {
 	&suite__backward_ring_buffer,
 	&suite__cpu_map_print,
 	&suite__cpu_map_merge,
+	&suite__cpu_map_intersect,
 	&suite__sdt_event,
 	&suite__is_printable_array,
 	&suite__bitmap_print,
diff --git a/tools/perf/tests/cpumap.c b/tools/perf/tests/cpumap.c
index 3b9fc549d30b..112331829414 100644
--- a/tools/perf/tests/cpumap.c
+++ b/tools/perf/tests/cpumap.c
@@ -153,6 +153,41 @@ static int test__cpu_map_merge(struct test_suite *test __maybe_unused, int subte
 	return ret;
 }
 
+static int __test__cpu_map_intersect(const char *lhs, const char *rhs, int nr, const char *expected)
+{
+	struct perf_cpu_map *a = perf_cpu_map__new(lhs);
+	struct perf_cpu_map *b = perf_cpu_map__new(rhs);
+	struct perf_cpu_map *c = perf_cpu_map__intersect(a, b);
+	char buf[100];
+
+	TEST_ASSERT_EQUAL("failed to intersect map: bad nr", perf_cpu_map__nr(c), nr);
+	cpu_map__snprint(c, buf, sizeof(buf));
+	TEST_ASSERT_VAL("failed to intersect map: bad result", !strcmp(buf, expected));
+	perf_cpu_map__put(a);
+	perf_cpu_map__put(b);
+	perf_cpu_map__put(c);
+	return 0;
+}
+
+static int test__cpu_map_intersect(struct test_suite *test __maybe_unused, int subtest __maybe_unused)
+{
+	int ret;
+
+	ret = __test__cpu_map_intersect("4,2,1", "4,5,7", 1, "4");
+	if (ret) return ret;
+	ret = __test__cpu_map_intersect("1-8", "6-9", 3, "6-8");
+	if (ret) return ret;
+	ret = __test__cpu_map_intersect("1-8,12-20", "6-9,15", 4, "6-8,15");
+	if (ret) return ret;
+	ret = __test__cpu_map_intersect("4,2,1", "1", 1, "1");
+	if (ret) return ret;
+	ret = __test__cpu_map_intersect("1", "4,2,1", 1, "1");
+	if (ret) return ret;
+	ret = __test__cpu_map_intersect("1", "1", 1, "1");
+	return ret;
+}
+
 DEFINE_SUITE("Synthesize cpu map", cpu_map_synthesize);
 DEFINE_SUITE("Print cpu map", cpu_map_print);
 DEFINE_SUITE("Merge cpu map", cpu_map_merge);
+DEFINE_SUITE("Intersect cpu map", cpu_map_intersect);
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index 5bbb8f6a48fc..f2823c4859b8 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -127,6 +127,7 @@ DECLARE_SUITE(event_times);
 DECLARE_SUITE(backward_ring_buffer);
 DECLARE_SUITE(cpu_map_print);
 DECLARE_SUITE(cpu_map_merge);
+DECLARE_SUITE(cpu_map_intersect);
 DECLARE_SUITE(sdt_event);
 DECLARE_SUITE(is_printable_array);
 DECLARE_SUITE(bitmap_print);
-- 
2.35.1.1178.g4f1659d476-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 4/5] perf evlist: Respect all_cpus when setting user_requested_cpus
  2022-04-08  3:56 [PATCH v3 0/5] Make evlist CPUs more accurate Ian Rogers
                   ` (2 preceding siblings ...)
  2022-04-08  3:56 ` [PATCH v3 3/5] perf cpumap: Add intersect function Ian Rogers
@ 2022-04-08  3:56 ` Ian Rogers
  2022-04-28 20:15   ` Adrian Hunter
  2022-04-08  3:56 ` [PATCH v3 5/5] perf test: Combine cpu map tests into 1 suite Ian Rogers
  4 siblings, 1 reply; 9+ messages in thread
From: Ian Rogers @ 2022-04-08  3:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Mathieu Poirier, Suzuki K Poulose, Mike Leach, Leo Yan,
	John Garry, Will Deacon, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Kajol Jain, James Clark, German Gomez,
	Adrian Hunter, Riccardo Mancini, Andi Kleen, Alexey Bayduraev,
	Alexander Antonov, linux-perf-users, linux-kernel, coresight,
	linux-arm-kernel, netdev, bpf
  Cc: Stephane Eranian, Ian Rogers

If all_cpus is calculated it represents the merge/union of all
evsel cpu maps. By default user_requested_cpus is computed to be
the online CPUs. For uncore events, it is often the case currently
that all_cpus is a subset of user_requested_cpus. Metrics printed
without aggregation and with metric-only, in print_no_aggr_metric,
iterate over user_requested_cpus assuming every CPU has a metric to
print. For each CPU the prefix is printed, but then if the
evsel's cpus doesn't contain anything you get an empty line like
the following on a 2 socket 36 core SkylakeX:

```
$ perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000
     1.000453137 CPU0                       0.00
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137 CPU18                      0.00
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     1.000453137
     2.003717143 CPU0                       0.00
...
```

While it is possible to be lazier in printing the prefix and
trailing newline, having user_requested_cpus not be a subset of
all_cpus is preferential so that wasted work isn't done elsewhere
user_requested_cpus is used. The change modifies user_requested_cpus
to be the intersection of user specified CPUs, or default all online
CPUs, with the CPUs computed through the merge of all evsel cpu maps.

New behavior:
```
$ perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000
     1.001086325 CPU0                       0.00
     1.001086325 CPU18                      0.00
     2.003671291 CPU0                       0.00
     2.003671291 CPU18                      0.00
...
```

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/util/evlist.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 52ea004ba01e..196d57b905a0 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1036,6 +1036,13 @@ int evlist__create_maps(struct evlist *evlist, struct target *target)
 	if (!cpus)
 		goto out_delete_threads;
 
+	if (evlist->core.all_cpus) {
+		struct perf_cpu_map *tmp;
+
+		tmp = perf_cpu_map__intersect(cpus, evlist->core.all_cpus);
+		perf_cpu_map__put(cpus);
+		cpus = tmp;
+	}
 	evlist->core.has_user_cpus = !!target->cpu_list && !target->hybrid;
 
 	perf_evlist__set_maps(&evlist->core, cpus, threads);
-- 
2.35.1.1178.g4f1659d476-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 5/5] perf test: Combine cpu map tests into 1 suite
  2022-04-08  3:56 [PATCH v3 0/5] Make evlist CPUs more accurate Ian Rogers
                   ` (3 preceding siblings ...)
  2022-04-08  3:56 ` [PATCH v3 4/5] perf evlist: Respect all_cpus when setting user_requested_cpus Ian Rogers
@ 2022-04-08  3:56 ` Ian Rogers
  4 siblings, 0 replies; 9+ messages in thread
From: Ian Rogers @ 2022-04-08  3:56 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Namhyung Kim,
	Mathieu Poirier, Suzuki K Poulose, Mike Leach, Leo Yan,
	John Garry, Will Deacon, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Kajol Jain, James Clark, German Gomez,
	Adrian Hunter, Riccardo Mancini, Andi Kleen, Alexey Bayduraev,
	Alexander Antonov, linux-perf-users, linux-kernel, coresight,
	linux-arm-kernel, netdev, bpf
  Cc: Stephane Eranian, Ian Rogers

Combine related CPU map tests into 1 suite reducing global state.

Signed-off-by: Ian Rogers <irogers@google.com>
---
 tools/perf/tests/builtin-test.c |  5 +----
 tools/perf/tests/cpumap.c       | 16 ++++++++++++----
 tools/perf/tests/tests.h        |  5 +----
 3 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index dffa41e7ee20..1941ae52e8b6 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -79,16 +79,13 @@ static struct test_suite *generic_tests[] = {
 	&suite__bpf,
 	&suite__thread_map_synthesize,
 	&suite__thread_map_remove,
-	&suite__cpu_map_synthesize,
+	&suite__cpu_map,
 	&suite__synthesize_stat_config,
 	&suite__synthesize_stat,
 	&suite__synthesize_stat_round,
 	&suite__event_update,
 	&suite__event_times,
 	&suite__backward_ring_buffer,
-	&suite__cpu_map_print,
-	&suite__cpu_map_merge,
-	&suite__cpu_map_intersect,
 	&suite__sdt_event,
 	&suite__is_printable_array,
 	&suite__bitmap_print,
diff --git a/tools/perf/tests/cpumap.c b/tools/perf/tests/cpumap.c
index 112331829414..fc124757a082 100644
--- a/tools/perf/tests/cpumap.c
+++ b/tools/perf/tests/cpumap.c
@@ -187,7 +187,15 @@ static int test__cpu_map_intersect(struct test_suite *test __maybe_unused, int s
 	return ret;
 }
 
-DEFINE_SUITE("Synthesize cpu map", cpu_map_synthesize);
-DEFINE_SUITE("Print cpu map", cpu_map_print);
-DEFINE_SUITE("Merge cpu map", cpu_map_merge);
-DEFINE_SUITE("Intersect cpu map", cpu_map_intersect);
+static struct test_case cpu_map_tests[] = {
+	TEST_CASE("Synthesize cpu map", cpu_map_synthesize),
+	TEST_CASE("Print cpu map", cpu_map_print),
+	TEST_CASE("Merge cpu map", cpu_map_merge),
+	TEST_CASE("Intersect cpu map", cpu_map_intersect),
+	{ .name = NULL, }
+};
+
+struct test_suite suite__cpu_map = {
+	.desc = "CPU map",
+	.test_cases = cpu_map_tests,
+};
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index f2823c4859b8..895803fdedc4 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -118,16 +118,13 @@ DECLARE_SUITE(bpf);
 DECLARE_SUITE(session_topology);
 DECLARE_SUITE(thread_map_synthesize);
 DECLARE_SUITE(thread_map_remove);
-DECLARE_SUITE(cpu_map_synthesize);
+DECLARE_SUITE(cpu_map);
 DECLARE_SUITE(synthesize_stat_config);
 DECLARE_SUITE(synthesize_stat);
 DECLARE_SUITE(synthesize_stat_round);
 DECLARE_SUITE(event_update);
 DECLARE_SUITE(event_times);
 DECLARE_SUITE(backward_ring_buffer);
-DECLARE_SUITE(cpu_map_print);
-DECLARE_SUITE(cpu_map_merge);
-DECLARE_SUITE(cpu_map_intersect);
 DECLARE_SUITE(sdt_event);
 DECLARE_SUITE(is_printable_array);
 DECLARE_SUITE(bitmap_print);
-- 
2.35.1.1178.g4f1659d476-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 4/5] perf evlist: Respect all_cpus when setting user_requested_cpus
  2022-04-08  3:56 ` [PATCH v3 4/5] perf evlist: Respect all_cpus when setting user_requested_cpus Ian Rogers
@ 2022-04-28 20:15   ` Adrian Hunter
       [not found]     ` <CAP-5=fVNuQDW+yge897RjaWfE3cfQTD4ufFws6PS2k99Qe05Uw@mail.gmail.com>
  0 siblings, 1 reply; 9+ messages in thread
From: Adrian Hunter @ 2022-04-28 20:15 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Stephane Eranian, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Mathieu Poirier, Suzuki K Poulose,
	Mike Leach, Leo Yan, John Garry, Will Deacon, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Kajol Jain, James Clark,
	German Gomez, Riccardo Mancini, Andi Kleen, Alexey Bayduraev,
	Alexander Antonov, linux-perf-users, linux-kernel, coresight,
	linux-arm-kernel, netdev, bpf

On 8/04/22 06:56, Ian Rogers wrote:
> If all_cpus is calculated it represents the merge/union of all
> evsel cpu maps. By default user_requested_cpus is computed to be
> the online CPUs. For uncore events, it is often the case currently
> that all_cpus is a subset of user_requested_cpus. Metrics printed
> without aggregation and with metric-only, in print_no_aggr_metric,
> iterate over user_requested_cpus assuming every CPU has a metric to
> print. For each CPU the prefix is printed, but then if the
> evsel's cpus doesn't contain anything you get an empty line like
> the following on a 2 socket 36 core SkylakeX:
> 
> ```
> $ perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000
>      1.000453137 CPU0                       0.00
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137 CPU18                      0.00
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      1.000453137
>      2.003717143 CPU0                       0.00
> ...
> ```
> 
> While it is possible to be lazier in printing the prefix and
> trailing newline, having user_requested_cpus not be a subset of
> all_cpus is preferential so that wasted work isn't done elsewhere
> user_requested_cpus is used. The change modifies user_requested_cpus
> to be the intersection of user specified CPUs, or default all online
> CPUs, with the CPUs computed through the merge of all evsel cpu maps.
> 
> New behavior:
> ```
> $ perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000
>      1.001086325 CPU0                       0.00
>      1.001086325 CPU18                      0.00
>      2.003671291 CPU0                       0.00
>      2.003671291 CPU18                      0.00
> ...
> ```
> 
> Signed-off-by: Ian Rogers <irogers@google.com>
> ---
>  tools/perf/util/evlist.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index 52ea004ba01e..196d57b905a0 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -1036,6 +1036,13 @@ int evlist__create_maps(struct evlist *evlist, struct target *target)
>  	if (!cpus)
>  		goto out_delete_threads;
>  
> +	if (evlist->core.all_cpus) {
> +		struct perf_cpu_map *tmp;
> +
> +		tmp = perf_cpu_map__intersect(cpus, evlist->core.all_cpus);

Isn't an uncore PMU represented as being on CPU0 actually
collecting data that can be due to any CPU.

Or for an uncore PMU represented as being on CPU0-CPU4 on a
4 core 8 hyperthread processor, actually 1 PMU per core.

So I am not sure intersection makes sense.

Also it is not obvious what happens with hybrid CPUs or
per thread recording.

> +		perf_cpu_map__put(cpus);
> +		cpus = tmp;
> +	}
>  	evlist->core.has_user_cpus = !!target->cpu_list && !target->hybrid;
>  
>  	perf_evlist__set_maps(&evlist->core, cpus, threads);


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 4/5] perf evlist: Respect all_cpus when setting user_requested_cpus
       [not found]     ` <CAP-5=fVNuQDW+yge897RjaWfE3cfQTD4ufFws6PS2k99Qe05Uw@mail.gmail.com>
@ 2022-04-29 11:34       ` Adrian Hunter
  2022-04-30  1:06         ` Ian Rogers
  0 siblings, 1 reply; 9+ messages in thread
From: Adrian Hunter @ 2022-04-29 11:34 UTC (permalink / raw)
  To: Ian Rogers
  Cc: Stephane Eranian, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Mathieu Poirier, Suzuki K Poulose,
	Mike Leach, Leo Yan, John Garry, Will Deacon, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Kajol Jain, James Clark,
	German Gomez, Riccardo Mancini, Andi Kleen, Alexey Bayduraev,
	Alexander Antonov, linux-perf-users, linux-kernel, coresight,
	linux-arm-kernel, netdev, bpf

On 28/04/22 23:49, Ian Rogers wrote:
> On Thu, Apr 28, 2022 at 1:16 PM Adrian Hunter <adrian.hunter@intel.com <mailto:adrian.hunter@intel.com>> wrote:
> 
>     On 8/04/22 06:56, Ian Rogers wrote:
>     > If all_cpus is calculated it represents the merge/union of all
>     > evsel cpu maps. By default user_requested_cpus is computed to be
>     > the online CPUs. For uncore events, it is often the case currently
>     > that all_cpus is a subset of user_requested_cpus. Metrics printed
>     > without aggregation and with metric-only, in print_no_aggr_metric,
>     > iterate over user_requested_cpus assuming every CPU has a metric to
>     > print. For each CPU the prefix is printed, but then if the
>     > evsel's cpus doesn't contain anything you get an empty line like
>     > the following on a 2 socket 36 core SkylakeX:
>     >
>     > ```
>     > $ perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000
>     >      1.000453137 CPU0                       0.00
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137 CPU18                      0.00
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      1.000453137
>     >      2.003717143 CPU0                       0.00
>     > ...
>     > ```
>     >
>     > While it is possible to be lazier in printing the prefix and
>     > trailing newline, having user_requested_cpus not be a subset of
>     > all_cpus is preferential so that wasted work isn't done elsewhere
>     > user_requested_cpus is used. The change modifies user_requested_cpus
>     > to be the intersection of user specified CPUs, or default all online
>     > CPUs, with the CPUs computed through the merge of all evsel cpu maps.
>     >
>     > New behavior:
>     > ```
>     > $ perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000
>     >      1.001086325 CPU0                       0.00
>     >      1.001086325 CPU18                      0.00
>     >      2.003671291 CPU0                       0.00
>     >      2.003671291 CPU18                      0.00
>     > ...
>     > ```
>     >
>     > Signed-off-by: Ian Rogers <irogers@google.com <mailto:irogers@google.com>>
>     > ---
>     >  tools/perf/util/evlist.c | 7 +++++++
>     >  1 file changed, 7 insertions(+)
>     >
>     > diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
>     > index 52ea004ba01e..196d57b905a0 100644
>     > --- a/tools/perf/util/evlist.c
>     > +++ b/tools/perf/util/evlist.c
>     > @@ -1036,6 +1036,13 @@ int evlist__create_maps(struct evlist *evlist, struct target *target)
>     >       if (!cpus)
>     >               goto out_delete_threads;
>     > 
>     > +     if (evlist->core.all_cpus) {
>     > +             struct perf_cpu_map *tmp;
>     > +
>     > +             tmp = perf_cpu_map__intersect(cpus, evlist->core.all_cpus);
> 
>     Isn't an uncore PMU represented as being on CPU0 actually
>     collecting data that can be due to any CPU.
> 
> 
> This is correct but the counter is only opened on CPU0 as the all_cpus cpu_map will only contain CPU0. Trying to dump the counter for say CPU1 will fail as there is no counter there. This is why the metric-only output isn't displaying anything above.

That's not what happens for me:

$ perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000 -- sleep 1
#           time CPU              DRAM_BW_Use 
     1.001114691 CPU0                       0.00 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.001114691 
     1.002265387 CPU0                       0.00 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 
     1.002265387 

perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000 -C 1 -- sleep 1
#           time CPU              DRAM_BW_Use 
     1.001100827 CPU1                       0.00 
     1.002128527 CPU1                       0.00 


>  
> 
>     Or for an uncore PMU represented as being on CPU0-CPU4 on a
>     4 core 8 hyperthread processor, actually 1 PMU per core.
> 
> 
> In this case I believe the CPU map will be CPU0, CPU2, CPU4, CPU6. To get the core counter for hyperthreads on CPU0 and CPU1 you read on CPU0, there is no counter on CPU1 and trying to read it will fail as the counters are indexed by a cpu map index into the all_cpus . Not long ago I cleaned up the cpu_map code as there was quite a bit of confusion over cpus and indexes which were both of type int.
>  
> 
>     So I am not sure intersection makes sense.
> 
>     Also it is not obvious what happens with hybrid CPUs or
>     per thread recording.
> 
> 
> The majority of code is using all_cpus, and so is unchanged by this change.

I am not sure what you mean.  Every tool uses this code.  It affects everything when using PMUs with their own cpus.

 Code that is affected, when it say needs to use counters, needs to check that the user CPU was valid in all_cpus, and use the all_cpus index. The metric-only output could be fixed in the same way, ie don't display lines when the user_requested_cpu isn't in all_cpus. I prefered to solve the problem this way as it is inefficient  to be processing cpus where there can be no corresponding counters, etc. We may be setting something like affinity unnecessarily - although that doesn't currently happen as that code iterates over all_cpus. I also think it is confusing from its name when the variable all_cpus is for a cpu_map that contains fewer cpus than user_requested_cpus - albeit that was worse when user_requested_cpus was called just cpus.
> 
> It could be hybrid or intel-pt have different assumptions on these cpu_maps. I don't have access to a hybrid test system. For intel-pt it'd be great if there were a perf test. Given that most code is using all_cpus and was cleaned up as part of the cpu_map work, I believe the change to be correct.

Mainly what happens if you try to intersect all_cpus with dummy cpus?

> 
> Thanks,
> Ian
> 
> 
>     > +             perf_cpu_map__put(cpus);
>     > +             cpus = tmp;
>     > +     }
>     >       evlist->core.has_user_cpus = !!target->cpu_list && !target->hybrid;
>     > 
>     >       perf_evlist__set_maps(&evlist->core, cpus, threads);
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 4/5] perf evlist: Respect all_cpus when setting user_requested_cpus
  2022-04-29 11:34       ` Adrian Hunter
@ 2022-04-30  1:06         ` Ian Rogers
  0 siblings, 0 replies; 9+ messages in thread
From: Ian Rogers @ 2022-04-30  1:06 UTC (permalink / raw)
  To: Adrian Hunter
  Cc: Stephane Eranian, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Mathieu Poirier, Suzuki K Poulose,
	Mike Leach, Leo Yan, John Garry, Will Deacon, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Kajol Jain, James Clark,
	German Gomez, Riccardo Mancini, Andi Kleen, Alexey Bayduraev,
	Alexander Antonov, linux-perf-users, linux-kernel, coresight,
	linux-arm-kernel, netdev, bpf

On Fri, Apr 29, 2022 at 4:34 AM Adrian Hunter <adrian.hunter@intel.com> wrote:
>
> On 28/04/22 23:49, Ian Rogers wrote:
> > On Thu, Apr 28, 2022 at 1:16 PM Adrian Hunter <adrian.hunter@intel.com <mailto:adrian.hunter@intel.com>> wrote:
> >
> >     On 8/04/22 06:56, Ian Rogers wrote:
> >     > If all_cpus is calculated it represents the merge/union of all
> >     > evsel cpu maps. By default user_requested_cpus is computed to be
> >     > the online CPUs. For uncore events, it is often the case currently
> >     > that all_cpus is a subset of user_requested_cpus. Metrics printed
> >     > without aggregation and with metric-only, in print_no_aggr_metric,
> >     > iterate over user_requested_cpus assuming every CPU has a metric to
> >     > print. For each CPU the prefix is printed, but then if the
> >     > evsel's cpus doesn't contain anything you get an empty line like
> >     > the following on a 2 socket 36 core SkylakeX:
> >     >
> >     > ```
> >     > $ perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000
> >     >      1.000453137 CPU0                       0.00
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137 CPU18                      0.00
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      1.000453137
> >     >      2.003717143 CPU0                       0.00
> >     > ...
> >     > ```
> >     >
> >     > While it is possible to be lazier in printing the prefix and
> >     > trailing newline, having user_requested_cpus not be a subset of
> >     > all_cpus is preferential so that wasted work isn't done elsewhere
> >     > user_requested_cpus is used. The change modifies user_requested_cpus
> >     > to be the intersection of user specified CPUs, or default all online
> >     > CPUs, with the CPUs computed through the merge of all evsel cpu maps.
> >     >
> >     > New behavior:
> >     > ```
> >     > $ perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000
> >     >      1.001086325 CPU0                       0.00
> >     >      1.001086325 CPU18                      0.00
> >     >      2.003671291 CPU0                       0.00
> >     >      2.003671291 CPU18                      0.00
> >     > ...
> >     > ```
> >     >
> >     > Signed-off-by: Ian Rogers <irogers@google.com <mailto:irogers@google.com>>
> >     > ---
> >     >  tools/perf/util/evlist.c | 7 +++++++
> >     >  1 file changed, 7 insertions(+)
> >     >
> >     > diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> >     > index 52ea004ba01e..196d57b905a0 100644
> >     > --- a/tools/perf/util/evlist.c
> >     > +++ b/tools/perf/util/evlist.c
> >     > @@ -1036,6 +1036,13 @@ int evlist__create_maps(struct evlist *evlist, struct target *target)
> >     >       if (!cpus)
> >     >               goto out_delete_threads;
> >     >
> >     > +     if (evlist->core.all_cpus) {
> >     > +             struct perf_cpu_map *tmp;
> >     > +
> >     > +             tmp = perf_cpu_map__intersect(cpus, evlist->core.all_cpus);
> >
> >     Isn't an uncore PMU represented as being on CPU0 actually
> >     collecting data that can be due to any CPU.
> >
> >
> > This is correct but the counter is only opened on CPU0 as the all_cpus cpu_map will only contain CPU0. Trying to dump the counter for say CPU1 will fail as there is no counter there. This is why the metric-only output isn't displaying anything above.
>
> That's not what happens for me:
>
> $ perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000 -- sleep 1
> #           time CPU              DRAM_BW_Use
>      1.001114691 CPU0                       0.00
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.001114691
>      1.002265387 CPU0                       0.00
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387
>      1.002265387

Thanks! To be clear, getting rid of the unnecessary spew above is what
this patch set is about.

> perf stat -A -M DRAM_BW_Use -a --metric-only -I 1000 -C 1 -- sleep 1
> #           time CPU              DRAM_BW_Use
>      1.001100827 CPU1                       0.00
>      1.002128527 CPU1                       0.00

So this case I'd not been thinking about. What does it mean? We have a
metric that is computed using uncore_imc that has a cpumask of 0 on
your machine. You are requesting the data for CPU1. This feels like
user error. After this change the behavior is:

$ perf stat -A -M DRAM_BW_Use -a --metric-only -C 1 -- sleep 1
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument)
for event (uncore_imc/cas_count_write/).
/bin/dmesg | grep -i perf may provide additional information.

That kind of goes along with this. What has actually happened is the
intersect has resulted in an empty cpu_map and so we try to program
the events for CPU -1 and that fails.

The existing behavior is to open the event for CPU1 and more than that
it succeeds in the perf_event_open for CPU 1. I find that weird. With
the CPU being 1 we have the user_requested_cpus being {1} and all_cpus
being {0} (I'm using the curlies just to say it is really a set). That
means, for example, the affinity iterator evlist__for_each_cpu [1]
will not iterate over any of the evsels as there is no CPU where the
evsel's cpu_map agrees with the all_cpu cpu_map.

Now, I can imagine it being said the existing behavior is value add.
We can move events away from the crowded CPU0 to some arbitrary CPU
and I'm sympathetic to that. I think the bug this is highlighting is
that in this case all_cpus should be {1} and not {0} as the code
currently computes (hence the empty intersect and death). This would
also fix the evlist__for_each_cpu loop.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/util/evlist.h?h=perf/core#n353

> >
> >
> >     Or for an uncore PMU represented as being on CPU0-CPU4 on a
> >     4 core 8 hyperthread processor, actually 1 PMU per core.
> >
> >
> > In this case I believe the CPU map will be CPU0, CPU2, CPU4, CPU6. To get the core counter for hyperthreads on CPU0 and CPU1 you read on CPU0, there is no counter on CPU1 and trying to read it will fail as the counters are indexed by a cpu map index into the all_cpus . Not long ago I cleaned up the cpu_map code as there was quite a bit of confusion over cpus and indexes which were both of type int.
> >
> >
> >     So I am not sure intersection makes sense.
> >
> >     Also it is not obvious what happens with hybrid CPUs or
> >     per thread recording.
> >
> >
> > The majority of code is using all_cpus, and so is unchanged by this change.
>
> I am not sure what you mean.  Every tool uses this code.  It affects everything when using PMUs with their own cpus.

My point was that because we use iterators on all_cpus then the common
case is the all_cpus case, this code only affects user_requested_cpus.
Looking at raw references there are more to user_requested_cpus than
all_cpus, but I believe the main iterators in the stat code are
largely based on all_cpus. In any case this doesn't matter given what
I found below.

> >  Code that is affected, when it say needs to use counters, needs to check that the user CPU was valid in all_cpus, and use the all_cpus index. The metric-only output could be fixed in the same way, ie don't display lines when the user_requested_cpu isn't in all_cpus. I prefered to solve the problem this way as it is inefficient  to be processing cpus where there can be no corresponding counters, etc. We may be setting something like affinity unnecessarily - although that doesn't currently happen as that code iterates over all_cpus. I also think it is confusing from its name when the variable all_cpus is for a cpu_map that contains fewer cpus than user_requested_cpus - albeit that was worse when user_requested_cpus was called just cpus.
> >
> > It could be hybrid or intel-pt have different assumptions on these cpu_maps. I don't have access to a hybrid test system. For intel-pt it'd be great if there were a perf test. Given that most code is using all_cpus and was cleaned up as part of the cpu_map work, I believe the change to be correct.
>
> Mainly what happens if you try to intersect all_cpus with dummy cpus?

The intersect of two dummy cpu_maps is a dummy cpu_map, as with merge.
When all_cpus is computed during add:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/lib/perf/evlist.c?h=perf/core#n72
dummy maps are treated as empty and evsel's cpumap is replaced with
the user_requested_cpus which is empty:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/lib/perf/evlist.c?h=perf/core#n47
It will then merge this empty cpu_map into all_cpus:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/lib/perf/evlist.c?h=perf/core#n55
Rather than rely on the copy of an empty user_requested_cpus, I think
the merge and intersect functions should explicitly handle dummy as an
empty cpu_map.

This code then sets user_requested_cpus and it will intersect it with
all_cpus. This is then propagated to the empty cpu_maps in the evsels.
This shows another bug in this change:

Current behavior:
```
$ perf stat -A -M DRAM_BW_Use -e faults -a sleep 1

Performance counter stats for 'system wide':

CPU0                   280.09 MiB  uncore_imc/cas_count_read/ #
0.00 DRAM_BW_Use
CPU18                  184.07 MiB  uncore_imc/cas_count_read/ #
0.00 DRAM_BW_Use
CPU0                   167.29 MiB  uncore_imc/cas_count_write/
CPU18                  149.95 MiB  uncore_imc/cas_count_write/
CPU0            1,002,927,474 ns   duration_time
CPU0                       42      faults
CPU1                        2      faults
CPU2                       19      faults
CPU3                        0      faults
CPU4                        2      faults
CPU5                       88      faults
CPU6                        0      faults
CPU7                        6      faults
CPU8                        2      faults
CPU9                        2      faults
CPU10                       2      faults
CPU11                       0      faults
CPU12                       0      faults
CPU13                      12      faults
CPU14                     158      faults
CPU15                      31      faults
CPU16                       0      faults
CPU17                      18      faults
CPU18                     108      faults
CPU19                      27      faults
CPU20                      23      faults
CPU21                       9      faults
CPU22                      10      faults
CPU23                       0      faults
CPU24                       0      faults
CPU25                       1      faults
CPU26                      38      faults
CPU27                       1      faults
CPU28                      16      faults
CPU29                       0      faults
CPU30                       3      faults
CPU31                       5      faults
CPU32                       2      faults
CPU33                       1      faults
CPU34                       3      faults
CPU35                      41      faults

      1.002927474 seconds time elapsed
```

New behavior:
```
$ /tmp/perf/perf stat -A -M DRAM_BW_Use -e faults -a sleep 1

Performance counter stats for 'system wide':

CPU0                   389.17 MiB  uncore_imc/cas_count_write/ #
0.00 DRAM_BW_Use
CPU18                  158.69 MiB  uncore_imc/cas_count_write/ #
0.00 DRAM_BW_Use
CPU0                   465.99 MiB  uncore_imc/cas_count_read/
CPU18                  242.37 MiB  uncore_imc/cas_count_read/
CPU0            1,003,287,405 ns   duration_time
CPU0                      176      faults
CPU18                     110      faults

      1.003287405 seconds time elapsed
```

So I think I've come around to the idea that user_requested_cpus needs
to be the original non-intersected value for the setting of
empty/dummy evsel's cpu_map. We could add a valid_user_requested_cpus
variable, which would be the intersect of the user_requested_cpus with
all_cpus, but there seems little point.

So I still need to follow this up to fix the bug and the bugs
discovered in this thread. The main follow-up actions are:
1) modify merge to explicitly handle dummy maps - and intersect if we
still need intersect after these changes.
2) document user_requested_cpus, detail why it can have more cpus than
"all_cpus" and the behavior for dummy cpu maps
3) rename all_cpus, to perhaps merged_evsel_cpus - I want to get away
from the implication all_cpus is a super set of cpu maps like
user_requested_cpus.
4) fixup evsel cpu maps when the event will be opened on a cpu that
isn't in the cpu_map.
5) update the stat-display print_no_aggr_metric of user_requested_cpus
so that CPUs not in all_cpus/merged_evsel_cpus don't get new lines
printed.

Thanks,
Ian

> >
> > Thanks,
> > Ian
> >
> >
> >     > +             perf_cpu_map__put(cpus);
> >     > +             cpus = tmp;
> >     > +     }
> >     >       evlist->core.has_user_cpus = !!target->cpu_list && !target->hybrid;
> >     >
> >     >       perf_evlist__set_maps(&evlist->core, cpus, threads);
> >
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-04-30  1:06 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-08  3:56 [PATCH v3 0/5] Make evlist CPUs more accurate Ian Rogers
2022-04-08  3:56 ` [PATCH v3 1/5] perf cpumap: Don't decrement refcnt on args to merge Ian Rogers
2022-04-08  3:56 ` [PATCH v3 2/5] perf tests: Additional cpumap merge tests Ian Rogers
2022-04-08  3:56 ` [PATCH v3 3/5] perf cpumap: Add intersect function Ian Rogers
2022-04-08  3:56 ` [PATCH v3 4/5] perf evlist: Respect all_cpus when setting user_requested_cpus Ian Rogers
2022-04-28 20:15   ` Adrian Hunter
     [not found]     ` <CAP-5=fVNuQDW+yge897RjaWfE3cfQTD4ufFws6PS2k99Qe05Uw@mail.gmail.com>
2022-04-29 11:34       ` Adrian Hunter
2022-04-30  1:06         ` Ian Rogers
2022-04-08  3:56 ` [PATCH v3 5/5] perf test: Combine cpu map tests into 1 suite Ian Rogers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).