* [PATCH 0/4] perf lock contention: Improve call stack handling (v1)
From: Namhyung Kim @ 2022-09-08  6:37 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Ian Rogers, linux-perf-users,
	Song Liu, bpf

Hello,

I found that the call stack from the lock tracepoint (using
bpf_get_stackid) can differ from one configuration to another.  For
example, it is very different when I run it on a VM than on a real
machine.

Since perf lock contention relies on the stack trace to get the lock
caller names, this kind of difference can be annoying.  Ideally we could
skip the stack trace entries for internal BPF or lock functions and get
the correct caller, but that is not the case as of today.  Currently the
behavior of stack traces for the lock contention tracepoints is
controlled by hard-coded values.

To handle those differences, add two new options to control the number
of stack entries collected and how many of them are skipped.  The
default values worked well on my VM setup, but I had to use
--stack-skip=5 on real machines.
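
For example, both knobs can be set on the command line like this (a
hypothetical invocation; the options themselves are added in patch 3):

  # perf lock con -ab --max-stack 16 --stack-skip 5 sleep 3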

You can get it from the 'perf/lock-stack-v1' branch in

  git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git

Thanks,
Namhyung


Namhyung Kim (4):
  perf lock contention: Factor out get_symbol_name_offset()
  perf lock contention: Show full callstack with -v option
  perf lock contention: Allow to change stack depth and skip
  perf lock contention: Skip stack trace from BPF

 tools/perf/Documentation/perf-lock.txt        |  6 ++
 tools/perf/builtin-lock.c                     | 89 ++++++++++++++-----
 tools/perf/util/bpf_lock_contention.c         | 21 +++--
 .../perf/util/bpf_skel/lock_contention.bpf.c  |  3 +-
 tools/perf/util/lock-contention.h             |  3 +
 5 files changed, 96 insertions(+), 26 deletions(-)


base-commit: 6c3bd8d3e01d9014312caa52e4ef1c29d5249648
-- 
2.37.2.789.g6183377224-goog


* [PATCH 1/4] perf lock contention: Factor out get_symbol_name_offset()
From: Namhyung Kim @ 2022-09-08  6:37 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Ian Rogers, linux-perf-users,
	Song Liu, bpf

Factor out a helper that converts an address to symbol+offset form
(e.g. "tick_do_update_jiffies64+0x25", or just the symbol name when the
offset is 0).
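
A typical caller resolves the address to a kernel map and symbol first,
then formats it with the new helper (a minimal usage sketch; the actual
call sites come with the following patches):

	struct map *kmap;
	struct symbol *sym;
	char buf[128];

	/* resolve ip, then format it as "name+0xoffset" into buf */
	sym = machine__find_kernel_symbol(machine, ip, &kmap);
	get_symbol_name_offset(kmap, sym, ip, buf, sizeof(buf));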

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-lock.c | 28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 70197c0593b1..2a5672f8d22e 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -902,6 +902,23 @@ bool is_lock_function(struct machine *machine, u64 addr)
 	return false;
 }
 
+static int get_symbol_name_offset(struct map *map, struct symbol *sym, u64 ip,
+				  char *buf, int size)
+{
+	u64 offset;
+
+	if (map == NULL || sym == NULL) {
+		buf[0] = '\0';
+		return 0;
+	}
+
+	offset = map->map_ip(map, ip) - sym->start;
+
+	if (offset)
+		return scnprintf(buf, size, "%s+%#lx", sym->name, offset);
+	else
+		return strlcpy(buf, sym->name, size);
+}
 static int lock_contention_caller(struct evsel *evsel, struct perf_sample *sample,
 				  char *buf, int size)
 {
@@ -944,15 +961,8 @@ static int lock_contention_caller(struct evsel *evsel, struct perf_sample *sampl
 
 		sym = node->ms.sym;
 		if (sym && !is_lock_function(machine, node->ip)) {
-			struct map *map = node->ms.map;
-			u64 offset;
-
-			offset = map->map_ip(map, node->ip) - sym->start;
-
-			if (offset)
-				scnprintf(buf, size, "%s+%#lx", sym->name, offset);
-			else
-				strlcpy(buf, sym->name, size);
+			get_symbol_name_offset(node->ms.map, sym, node->ip,
+					       buf, size);
 			return 0;
 		}
 
-- 
2.37.2.789.g6183377224-goog


* [PATCH 2/4] perf lock contention: Show full callstack with -v option
From: Namhyung Kim @ 2022-09-08  6:37 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Ian Rogers, linux-perf-users,
	Song Liu, bpf

Currently it shows a caller function for each entry, but sometimes
users want to see the full call stack.  Use the -v/--verbose option to
do that.

  # perf lock con -a -b -v sleep 3
  Looking at the vmlinux_path (8 entries long)
  symsrc__init: cannot get elf header.
  Using /proc/kcore for kernel data
  Using /proc/kallsyms for symbols
   contended   total wait     max wait     avg wait         type   caller

           1     10.74 us     10.74 us     10.74 us     spinlock   __bpf_trace_contention_begin+0xb
                          0xffffffffc03b5c47  bpf_prog_bf07ae9e2cbd02c5_contention_begin+0x117
                          0xffffffffc03b5c47  bpf_prog_bf07ae9e2cbd02c5_contention_begin+0x117
                          0xffffffffbb8b8e75  bpf_trace_run2+0x35
                          0xffffffffbb7eab9b  __bpf_trace_contention_begin+0xb
                          0xffffffffbb7ebe75  queued_spin_lock_slowpath+0x1f5
                          0xffffffffbc1c26ff  _raw_spin_lock+0x1f
                          0xffffffffbb841015  tick_do_update_jiffies64+0x25
                          0xffffffffbb8409ee  tick_irq_enter+0x9e
           1      7.70 us      7.70 us      7.70 us     spinlock   __bpf_trace_contention_begin+0xb
                          0xffffffffc03b5c47  bpf_prog_bf07ae9e2cbd02c5_contention_begin+0x117
                          0xffffffffc03b5c47  bpf_prog_bf07ae9e2cbd02c5_contention_begin+0x117
                          0xffffffffbb8b8e75  bpf_trace_run2+0x35
                          0xffffffffbb7eab9b  __bpf_trace_contention_begin+0xb
                          0xffffffffbb7ebe75  queued_spin_lock_slowpath+0x1f5
                          0xffffffffbc1c26ff  _raw_spin_lock+0x1f
                          0xffffffffbb7bc27e  raw_spin_rq_lock_nested+0xe
                          0xffffffffbb7cef9c  load_balance+0x66c

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/builtin-lock.c             | 43 ++++++++++++++++++++++-----
 tools/perf/util/bpf_lock_contention.c |  9 ++++++
 tools/perf/util/lock-contention.h     |  1 +
 3 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index 2a5672f8d22e..e66fbb38d8df 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -972,13 +972,14 @@ static int lock_contention_caller(struct evsel *evsel, struct perf_sample *sampl
 	return -1;
 }
 
-static u64 callchain_id(struct evsel *evsel, struct perf_sample *sample)
+static u64 callchain_id(struct evsel *evsel, struct perf_sample *sample, u64 *callchains)
 {
 	struct callchain_cursor *cursor = &callchain_cursor;
 	struct machine *machine = &session->machines.host;
 	struct thread *thread;
 	u64 hash = 0;
 	int skip = 0;
+	int i = 0;
 	int ret;
 
 	thread = machine__findnew_thread(machine, -1, sample->pid);
@@ -1002,6 +1003,9 @@ static u64 callchain_id(struct evsel *evsel, struct perf_sample *sample)
 		if (node == NULL)
 			break;
 
+		if (callchains)
+			callchains[i++] = node->ip;
+
 		/* skip first few entries - for lock functions */
 		if (++skip <= CONTENTION_STACK_SKIP)
 			goto next;
@@ -1025,6 +1029,7 @@ static int report_lock_contention_begin_event(struct evsel *evsel,
 	struct lock_seq_stat *seq;
 	u64 addr = evsel__intval(evsel, sample, "lock_addr");
 	u64 key;
+	u64 callchains[CONTENTION_STACK_DEPTH];
 
 	switch (aggr_mode) {
 	case LOCK_AGGR_ADDR:
@@ -1034,7 +1039,9 @@ static int report_lock_contention_begin_event(struct evsel *evsel,
 		key = sample->tid;
 		break;
 	case LOCK_AGGR_CALLER:
-		key = callchain_id(evsel, sample);
+		if (verbose)
+			memset(callchains, 0, sizeof(callchains));
+		key = callchain_id(evsel, sample, verbose ? callchains : NULL);
 		break;
 	default:
 		pr_err("Invalid aggregation mode: %d\n", aggr_mode);
@@ -1053,6 +1060,12 @@ static int report_lock_contention_begin_event(struct evsel *evsel,
 		ls = lock_stat_findnew(key, caller, flags);
 		if (!ls)
 			return -ENOMEM;
+
+		if (aggr_mode == LOCK_AGGR_CALLER && verbose) {
+			ls->callstack = memdup(callchains, sizeof(callchains));
+			if (ls->callstack == NULL)
+				return -ENOMEM;
+		}
 	}
 
 	ts = thread_stat_findnew(sample->tid);
@@ -1117,7 +1130,7 @@ static int report_lock_contention_end_event(struct evsel *evsel,
 		key = sample->tid;
 		break;
 	case LOCK_AGGR_CALLER:
-		key = callchain_id(evsel, sample);
+		key = callchain_id(evsel, sample, NULL);
 		break;
 	default:
 		pr_err("Invalid aggregation mode: %d\n", aggr_mode);
@@ -1466,7 +1479,7 @@ static void sort_contention_result(void)
 	sort_result();
 }
 
-static void print_contention_result(void)
+static void print_contention_result(struct lock_contention *con)
 {
 	struct lock_stat *st;
 	struct lock_key *key;
@@ -1505,6 +1518,22 @@ static void print_contention_result(void)
 		}
 
 		pr_info("  %10s   %s\n", get_type_str(st), st->name);
+		if (verbose) {
+			struct map *kmap;
+			struct symbol *sym;
+			char buf[128];
+			u64 ip;
+
+			for (int i = 0; i < CONTENTION_STACK_DEPTH; i++) {
+				if (!st->callstack || !st->callstack[i])
+					break;
+
+				ip = st->callstack[i];
+				sym = machine__find_kernel_symbol(con->machine, ip, &kmap);
+				get_symbol_name_offset(kmap, sym, ip, buf, sizeof(buf));
+				pr_info("\t\t\t%#lx  %s\n", (unsigned long)ip, buf);
+			}
+		}
 	}
 
 	print_bad_events(bad, total);
@@ -1620,6 +1649,8 @@ static int __cmd_contention(int argc, const char **argv)
 		return PTR_ERR(session);
 	}
 
+	con.machine = &session->machines.host;
+
 	/* for lock function check */
 	symbol_conf.sort_by_name = true;
 	symbol__init(&session->header.env);
@@ -1638,8 +1669,6 @@ static int __cmd_contention(int argc, const char **argv)
 		signal(SIGCHLD, sighandler);
 		signal(SIGTERM, sighandler);
 
-		con.machine = &session->machines.host;
-
 		con.evlist = evlist__new();
 		if (con.evlist == NULL) {
 			err = -ENOMEM;
@@ -1711,7 +1740,7 @@ static int __cmd_contention(int argc, const char **argv)
 	setup_pager();
 
 	sort_contention_result();
-	print_contention_result();
+	print_contention_result(&con);
 
 out_delete:
 	evlist__delete(con.evlist);
diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index c591a66733ef..6545bee65347 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -8,6 +8,7 @@
 #include "util/thread_map.h"
 #include "util/lock-contention.h"
 #include <linux/zalloc.h>
+#include <linux/string.h>
 #include <bpf/bpf.h>
 
 #include "bpf_skel/lock_contention.skel.h"
@@ -171,6 +172,14 @@ int lock_contention_read(struct lock_contention *con)
 			return -1;
 		}
 
+		if (verbose) {
+			st->callstack = memdup(stack_trace, sizeof(stack_trace));
+			if (st->callstack == NULL) {
+				free(st);
+				return -1;
+			}
+		}
+
 		hlist_add_head(&st->hash_entry, con->result);
 		prev_key = key;
 	}
diff --git a/tools/perf/util/lock-contention.h b/tools/perf/util/lock-contention.h
index 2146efc33396..bdb6e2a61e5b 100644
--- a/tools/perf/util/lock-contention.h
+++ b/tools/perf/util/lock-contention.h
@@ -11,6 +11,7 @@ struct lock_stat {
 
 	u64			addr;		/* address of lockdep_map, used as ID */
 	char			*name;		/* for strcpy(), we cannot use const */
+	u64			*callstack;
 
 	unsigned int		nr_acquire;
 	unsigned int		nr_acquired;
-- 
2.37.2.789.g6183377224-goog


* [PATCH 3/4] perf lock contention: Allow to change stack depth and skip
From: Namhyung Kim @ 2022-09-08  6:37 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Ian Rogers, linux-perf-users,
	Song Liu, bpf

It needs stack traces to find the callers of locks.  To minimize the
performance overhead, it only collects up to 8 entries for each stack
trace, and it skips the first 3 entries as they come from BPF,
tracepoint and lock functions, which are not interesting to most users.

But it turned out that those numbers differ between configurations, and
using fixed numbers can result in meaningless caller names.  Let's make
them adjustable with the --max-stack and --stack-skip options.

On my setup, the default output looks like this:

  # perf lock con -ab -F contended,wait_total sleep 3
   contended   total wait         type   caller

          28      4.55 ms     rwlock:W   __bpf_trace_contention_begin+0xb
          33      1.67 ms     rwlock:W   __bpf_trace_contention_begin+0xb
          12    580.28 us     spinlock   __bpf_trace_contention_begin+0xb
          60    240.54 us      rwsem:R   __bpf_trace_contention_begin+0xb
          27     64.45 us     spinlock   __bpf_trace_contention_begin+0xb

If I change the stack skip to 5, the result looks like this:

  # perf lock con -ab -F contended,wait_total --stack-skip 5 sleep 3
   contended   total wait         type   caller

          32    715.45 us     spinlock   folio_lruvec_lock_irqsave+0x61
          26    550.22 us     spinlock   folio_lruvec_lock_irqsave+0x61
          15    486.93 us      rwsem:R   mmap_read_lock+0x13
          12    139.66 us      rwsem:W   vm_mmap_pgoff+0x93
           1      7.04 us     spinlock   tick_do_update_jiffies64+0x25
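
The collected depth can be raised as well when 8 entries are not enough
to reach a meaningful caller (a hypothetical invocation using the new
option):

  # perf lock con -ab -F contended,wait_total --max-stack 16 sleep 3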

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/Documentation/perf-lock.txt |  6 ++++++
 tools/perf/builtin-lock.c              | 20 +++++++++++++++-----
 tools/perf/util/bpf_lock_contention.c  |  7 ++++---
 tools/perf/util/lock-contention.h      |  2 ++
 4 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/tools/perf/Documentation/perf-lock.txt b/tools/perf/Documentation/perf-lock.txt
index 193c5d8b8db9..5f2dc634258e 100644
--- a/tools/perf/Documentation/perf-lock.txt
+++ b/tools/perf/Documentation/perf-lock.txt
@@ -148,6 +148,12 @@ CONTENTION OPTIONS
 --map-nr-entries::
 	Maximum number of BPF map entries (default: 10240).
 
+--max-stack::
+	Maximum stack depth when collecting lock contention (default: 8).
+
+--stack-skip::
+	Number of stack entries to skip when finding a lock caller (default: 3).
+
 
 SEE ALSO
 --------
diff --git a/tools/perf/builtin-lock.c b/tools/perf/builtin-lock.c
index e66fbb38d8df..13900ac1d186 100644
--- a/tools/perf/builtin-lock.c
+++ b/tools/perf/builtin-lock.c
@@ -56,6 +56,8 @@ static bool combine_locks;
 static bool show_thread_stats;
 static bool use_bpf;
 static unsigned long bpf_map_entries = 10240;
+static int max_stack_depth = CONTENTION_STACK_DEPTH;
+static int stack_skip = CONTENTION_STACK_SKIP;
 
 static enum {
 	LOCK_AGGR_ADDR,
@@ -939,7 +941,7 @@ static int lock_contention_caller(struct evsel *evsel, struct perf_sample *sampl
 
 	/* use caller function name from the callchain */
 	ret = thread__resolve_callchain(thread, cursor, evsel, sample,
-					NULL, NULL, CONTENTION_STACK_DEPTH);
+					NULL, NULL, max_stack_depth);
 	if (ret != 0) {
 		thread__put(thread);
 		return -1;
@@ -956,7 +958,7 @@ static int lock_contention_caller(struct evsel *evsel, struct perf_sample *sampl
 			break;
 
 		/* skip first few entries - for lock functions */
-		if (++skip <= CONTENTION_STACK_SKIP)
+		if (++skip <= stack_skip)
 			goto next;
 
 		sym = node->ms.sym;
@@ -988,7 +990,7 @@ static u64 callchain_id(struct evsel *evsel, struct perf_sample *sample, u64 *ca
 
 	/* use caller function name from the callchain */
 	ret = thread__resolve_callchain(thread, cursor, evsel, sample,
-					NULL, NULL, CONTENTION_STACK_DEPTH);
+					NULL, NULL, max_stack_depth);
 	thread__put(thread);
 
 	if (ret != 0)
@@ -1007,7 +1009,7 @@ static u64 callchain_id(struct evsel *evsel, struct perf_sample *sample, u64 *ca
 			callchains[i++] = node->ip;
 
 		/* skip first few entries - for lock functions */
-		if (++skip <= CONTENTION_STACK_SKIP)
+		if (++skip <= stack_skip)
 			goto next;
 
 		if (node->ms.sym && is_lock_function(machine, node->ip))
@@ -1524,7 +1526,7 @@ static void print_contention_result(struct lock_contention *con)
 			char buf[128];
 			u64 ip;
 
-			for (int i = 0; i < CONTENTION_STACK_DEPTH; i++) {
+			for (int i = 0; i < max_stack_depth; i++) {
 				if (!st->callstack || !st->callstack[i])
 					break;
 
@@ -1641,6 +1643,8 @@ static int __cmd_contention(int argc, const char **argv)
 		.target = &target,
 		.result = &lockhash_table[0],
 		.map_nr_entries = bpf_map_entries,
+		.max_stack = max_stack_depth,
+		.stack_skip = stack_skip,
 	};
 
 	session = perf_session__new(use_bpf ? NULL : &data, &eops);
@@ -1904,6 +1908,12 @@ int cmd_lock(int argc, const char **argv)
 		   "Trace on existing thread id (exclusive to --pid)"),
 	OPT_CALLBACK(0, "map-nr-entries", &bpf_map_entries, "num",
 		     "Max number of BPF map entries", parse_map_entry),
+	OPT_INTEGER(0, "max-stack", &max_stack_depth,
+		    "Set the maximum stack depth when collecting lock contention, "
+		    "Default: " __stringify(CONTENTION_STACK_DEPTH)),
+	OPT_INTEGER(0, "stack-skip", &stack_skip,
+		    "Set the number of stack entries to skip when finding a lock caller, "
+		    "Default: " __stringify(CONTENTION_STACK_SKIP)),
 	OPT_PARENT(lock_options)
 	};
 
diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index 6545bee65347..ef5323c78ffc 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -41,6 +41,7 @@ int lock_contention_prepare(struct lock_contention *con)
 		return -1;
 	}
 
+	bpf_map__set_value_size(skel->maps.stacks, con->max_stack * sizeof(u64));
 	bpf_map__set_max_entries(skel->maps.stacks, con->map_nr_entries);
 	bpf_map__set_max_entries(skel->maps.lock_stat, con->map_nr_entries);
 
@@ -115,7 +116,7 @@ int lock_contention_read(struct lock_contention *con)
 	struct lock_contention_data data;
 	struct lock_stat *st;
 	struct machine *machine = con->machine;
-	u64 stack_trace[CONTENTION_STACK_DEPTH];
+	u64 stack_trace[con->max_stack];
 
 	fd = bpf_map__fd(skel->maps.lock_stat);
 	stack = bpf_map__fd(skel->maps.stacks);
@@ -146,9 +147,9 @@ int lock_contention_read(struct lock_contention *con)
 		bpf_map_lookup_elem(stack, &key, stack_trace);
 
 		/* skip BPF + lock internal functions */
-		idx = CONTENTION_STACK_SKIP;
+		idx = con->stack_skip;
 		while (is_lock_function(machine, stack_trace[idx]) &&
-		       idx < CONTENTION_STACK_DEPTH - 1)
+		       idx < con->max_stack - 1)
 			idx++;
 
 		st->addr = stack_trace[idx];
diff --git a/tools/perf/util/lock-contention.h b/tools/perf/util/lock-contention.h
index bdb6e2a61e5b..67db311fc9df 100644
--- a/tools/perf/util/lock-contention.h
+++ b/tools/perf/util/lock-contention.h
@@ -115,6 +115,8 @@ struct lock_contention {
 	struct hlist_head *result;
 	unsigned long map_nr_entries;
 	unsigned long lost;
+	int max_stack;
+	int stack_skip;
 };
 
 #ifdef HAVE_BPF_SKEL
-- 
2.37.2.789.g6183377224-goog


* [PATCH 4/4] perf lock contention: Skip stack trace from BPF
From: Namhyung Kim @ 2022-09-08  6:37 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo, Jiri Olsa
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Ian Rogers, linux-perf-users,
	Song Liu, bpf

Currently it collects stack traces to the max size and then skips
entries when reading them, because there is no way to control how the
entries are skipped when the stack is collected.  But BPF can do it in
bpf_get_stackid() with a flag.

Say we have max-stack=4 and stack-skip=2; then we get these stack
traces.

Before:                    After:

      ---> +---+ <--             ---> +---+ <--
     /     |   |    \           /     |   |    \
     |     +---+  usable        |     +---+    |
    max    |   |    /          max    |   |    |
   stack   +---+ <--          stack   +---+  usable
     |     | X |                |     |   |    |
     |     +---+   skip         |     +---+    |
     \     | X |                \     |   |    /
      ---> +---+                 ---> +---+ <--    <=== collection
                                      | X |
                                      +---+   skip
                                      | X |
                                      +---+
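
For reference, bpf_get_stackid() takes the number of frames to skip in
the low 8 bits of its flags argument (BPF_F_SKIP_FIELD_MASK in
uapi/linux/bpf.h), which is why the hunk below can OR stack_skip
directly into the flags.  A more defensive variant would mask the value
explicitly; this is only a sketch, not part of the patch:

	/* low 8 bits of the flags select how many frames to skip */
	__u64 flags = BPF_F_FAST_STACK_CMP |
		      ((__u64)stack_skip & BPF_F_SKIP_FIELD_MASK);

	pelem->stack_id = bpf_get_stackid(ctx, &stacks, flags);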

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
---
 tools/perf/util/bpf_lock_contention.c          | 7 ++++---
 tools/perf/util/bpf_skel/lock_contention.bpf.c | 3 ++-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/tools/perf/util/bpf_lock_contention.c b/tools/perf/util/bpf_lock_contention.c
index ef5323c78ffc..efe5b9968e77 100644
--- a/tools/perf/util/bpf_lock_contention.c
+++ b/tools/perf/util/bpf_lock_contention.c
@@ -93,6 +93,8 @@ int lock_contention_prepare(struct lock_contention *con)
 		bpf_map_update_elem(fd, &pid, &val, BPF_ANY);
 	}
 
+	skel->bss->stack_skip = con->stack_skip;
+
 	lock_contention_bpf__attach(skel);
 	return 0;
 }
@@ -127,7 +129,7 @@ int lock_contention_read(struct lock_contention *con)
 	while (!bpf_map_get_next_key(fd, &prev_key, &key)) {
 		struct map *kmap;
 		struct symbol *sym;
-		int idx;
+		int idx = 0;
 
 		bpf_map_lookup_elem(fd, &key, &data);
 		st = zalloc(sizeof(*st));
@@ -146,8 +148,7 @@ int lock_contention_read(struct lock_contention *con)
 
 		bpf_map_lookup_elem(stack, &key, stack_trace);
 
-		/* skip BPF + lock internal functions */
-		idx = con->stack_skip;
+		/* skip lock internal functions */
 		while (is_lock_function(machine, stack_trace[idx]) &&
 		       idx < con->max_stack - 1)
 			idx++;
diff --git a/tools/perf/util/bpf_skel/lock_contention.bpf.c b/tools/perf/util/bpf_skel/lock_contention.bpf.c
index 9e8b94eb6320..e107d71f0f1a 100644
--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -72,6 +72,7 @@ struct {
 int enabled;
 int has_cpu;
 int has_task;
+int stack_skip;
 
 /* error stat */
 unsigned long lost;
@@ -117,7 +118,7 @@ int contention_begin(u64 *ctx)
 	pelem->timestamp = bpf_ktime_get_ns();
 	pelem->lock = (__u64)ctx[0];
 	pelem->flags = (__u32)ctx[1];
-	pelem->stack_id = bpf_get_stackid(ctx, &stacks, BPF_F_FAST_STACK_CMP);
+	pelem->stack_id = bpf_get_stackid(ctx, &stacks, BPF_F_FAST_STACK_CMP | stack_skip);
 
 	if (pelem->stack_id < 0)
 		lost++;
-- 
2.37.2.789.g6183377224-goog


* Re: [PATCH 0/4] perf lock contention: Improve call stack handling (v1)
From: Arnaldo Carvalho de Melo @ 2022-09-08 18:43 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Jiri Olsa, Ingo Molnar, Peter Zijlstra, LKML, Ian Rogers,
	linux-perf-users, Song Liu, bpf

Em Wed, Sep 07, 2022 at 11:37:50PM -0700, Namhyung Kim escreveu:

This clashed with a patch you Acked earlier, so let's see if someone
has extra review comments and a v2 becomes needed for another reason,
and then you can refresh it, ok?

- Arnaldo
 

-- 

- Arnaldo

* Re: [PATCH 0/4] perf lock contention: Improve call stack handling (v1)
From: Namhyung Kim @ 2022-09-08 23:44 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Ingo Molnar, Peter Zijlstra, LKML, Ian Rogers,
	linux-perf-users, Song Liu, bpf

Hi Arnaldo,

On Thu, Sep 8, 2022 at 11:43 AM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
> This clashed with a patch you Acked earlier, so let's see if someone
> has extra review comments and a v2 becomes needed for another reason,
> and then you can refresh it, ok?

Sounds good!

Thanks,
Namhyung

* Re: [PATCH 0/4] perf lock contention: Improve call stack handling (v1)
From: Arnaldo Carvalho de Melo @ 2022-09-20 20:22 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Jiri Olsa, Ingo Molnar, Peter Zijlstra, LKML, Ian Rogers,
	linux-perf-users, Song Liu, bpf

Em Thu, Sep 08, 2022 at 04:44:15PM -0700, Namhyung Kim escreveu:
> Hi Arnaldo,
> 
> On Thu, Sep 8, 2022 at 11:43 AM Arnaldo Carvalho de Melo
> <acme@kernel.org> wrote:
> > This clashed with a patch you Acked earlier, so let's see if someone
> > has extra review comments and a v2 becomes needed for another reason,
> > and then you can refresh it, ok?
> 
> Sounds good!

Have you resubmitted this? /me goes on the backlog...

- Arnaldo

* Re: [PATCH 0/4] perf lock contention: Improve call stack handling (v1)
From: Namhyung Kim @ 2022-09-20 21:04 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo
  Cc: Jiri Olsa, Ingo Molnar, Peter Zijlstra, LKML, Ian Rogers,
	linux-perf-users, Song Liu, bpf

On Tue, Sep 20, 2022 at 1:22 PM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
> Have you resubmitted this? /me goes on the backlog...

Yep :)

https://lore.kernel.org/r/20220912055314.744552-1-namhyung@kernel.org

* Re: [PATCH 0/4] perf lock contention: Improve call stack handling (v1)
From: Arnaldo Carvalho de Melo @ 2022-09-21 14:09 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Jiri Olsa, Ingo Molnar, Peter Zijlstra, LKML, Ian Rogers,
	linux-perf-users, Song Liu, bpf

Em Tue, Sep 20, 2022 at 02:04:47PM -0700, Namhyung Kim escreveu:
> On Tue, Sep 20, 2022 at 1:22 PM Arnaldo Carvalho de Melo
> <acme@kernel.org> wrote:
> > Have you resubmitted this? /me goes on the backlog...
> 
> Yep :)
> 
> https://lore.kernel.org/r/20220912055314.744552-1-namhyung@kernel.org

It applies now, testing :-)

- Arnaldo
