linux-toolchains.vger.kernel.org archive mirror
* [PATCH RFC 00/10] perf: user space sframe unwinding
@ 2023-11-09  0:41 Josh Poimboeuf
  2023-11-09  0:41 ` [PATCH RFC 01/10] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
                   ` (10 more replies)
  0 siblings, 11 replies; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains

Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system.  Using DWARF (or .eh_frame) instead isn't feasible
because of complexity and slowness.

For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64.  Similarly, for user space the GNU assembler
has created the SFrame ("Simple Frame") format starting with binutils
2.40.

These patches add support for unwinding user space from the kernel using
SFrame with perf.  It should be easy to add user unwinding support for
other components like ftrace.

I tested it on Gentoo by recompiling everything with -Wa,-gsframe and
using a custom glibc patch (which I'll send in a reply to this email).

The unwinding itself seems to work well, though I still have a major
problem: how to tell perf tool to stitch together the separate
kernel+user callchains into a single event?

Right now I have a hack which somehow causes perf tool to overwrite the
kernel callchain with the user one.  I'm perf-clueless; any ideas or
patches for a clean way to implement that would be most helpful.


Otherwise there were two main challenges:

1) Finding .sframe sections in shared/dlopened libraries

   The kernel has no visibility to the contents of shared libraries.
   This was solved by adding a PR_ADD_SFRAME option to prctl() which
   allows the runtime linker to manually provide the in-memory address
   of an .sframe section to the kernel (a usage sketch follows below).

2) Dealing with page faults

   Keeping all binaries' sframe data pinned would likely waste a lot of
   memory.  Instead, read it from user space on demand.  That can't be
   done from perf NMI context due to page faults, so defer the unwind to
   the next user exit.  Since the NMI handler doesn't do exit work,
   self-IPI and then schedule task work to be run on exit from the IPI.
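
For challenge 1, here's a rough sketch (not part of this series, helper
names made up) of what the runtime linker side could look like.  The
PR_ADD_SFRAME / PR_REMOVE_SFRAME values and the argument order (sframe
address, text start, text end, with arg5 required to be zero) are taken
from patch 09; the real glibc integration is the separate patch
mentioned above.

  #include <sys/prctl.h>

  /* From patch 09's include/uapi/linux/prctl.h; not yet in libc headers. */
  #ifndef PR_ADD_SFRAME
  #define PR_ADD_SFRAME    71
  #define PR_REMOVE_SFRAME 72
  #endif

  /* Runtime linker: register a dlopen'ed object's .sframe section. */
  static int register_sframe(unsigned long sframe_addr,
                             unsigned long text_start,
                             unsigned long text_end)
  {
          return prctl(PR_ADD_SFRAME, sframe_addr, text_start, text_end, 0);
  }

  /* ... and unregister it again before dlclose() unmaps the object. */
  static int unregister_sframe(unsigned long sframe_addr)
  {
          return prctl(PR_REMOVE_SFRAME, sframe_addr, 0, 0, 0);
  }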


Special thanks to Indu for the original concept, and to Steven and Peter
for helping a lot with the design.  And to Steven for letting me do it ;-)


TODO:
- Stitch kernel+user events together in perf tool (help needed)
- Add arm64 support
- Add VDSO .sframe support
- Allow specifying FP vs sframe from perf tool?  Right now it's
  auto-detected, maybe that's enough
- Port ftrace and others to use sframe
- Support sframe v2
- Determine the impact of missing DRAP support (aligned stacks which
  SFrame doesn't currently support)
- Add debugging hooks



Josh Poimboeuf (10):
  perf: Remove get_perf_callchain() 'init_nr' argument
  perf: Remove get_perf_callchain() 'crosstask' argument
  perf: Simplify get_perf_callchain() user logic
  perf: Introduce deferred user callchains
  perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED
  unwind: Introduce generic user space unwinding interfaces
  unwind/x86: Add HAVE_USER_UNWIND
  perf/x86: Use user_unwind interface
  unwind: Introduce SFrame user space unwinding
  unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME

 arch/Kconfig                       |   9 +
 arch/x86/Kconfig                   |   3 +
 arch/x86/events/core.c             |  65 ++---
 arch/x86/include/asm/mmu.h         |   2 +-
 arch/x86/include/asm/user_unwind.h |  11 +
 fs/binfmt_elf.c                    |  46 +++-
 include/linux/mm_types.h           |   3 +
 include/linux/perf_event.h         |  24 +-
 include/linux/sframe.h             |  46 ++++
 include/linux/user_unwind.h        |  33 +++
 include/uapi/linux/elf.h           |   1 +
 include/uapi/linux/perf_event.h    |   1 +
 include/uapi/linux/prctl.h         |   3 +
 kernel/Makefile                    |   1 +
 kernel/bpf/stackmap.c              |   6 +-
 kernel/events/callchain.c          |  39 ++-
 kernel/events/core.c               |  96 ++++++-
 kernel/fork.c                      |  10 +
 kernel/sys.c                       |  11 +
 kernel/unwind/Makefile             |   2 +
 kernel/unwind/sframe.c             | 414 +++++++++++++++++++++++++++++
 kernel/unwind/sframe.h             | 217 +++++++++++++++
 kernel/unwind/user.c               |  86 ++++++
 mm/init-mm.c                       |   2 +
 24 files changed, 1060 insertions(+), 71 deletions(-)
 create mode 100644 arch/x86/include/asm/user_unwind.h
 create mode 100644 include/linux/sframe.h
 create mode 100644 include/linux/user_unwind.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/sframe.c
 create mode 100644 kernel/unwind/sframe.h
 create mode 100644 kernel/unwind/user.c

-- 
2.41.0



* [PATCH RFC 01/10] perf: Remove get_perf_callchain() 'init_nr' argument
  2023-11-09  0:41 [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
@ 2023-11-09  0:41 ` Josh Poimboeuf
  2023-11-11  6:09   ` Namhyung Kim
  2023-11-09  0:41 ` [PATCH RFC 02/10] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains

The 'init_nr' argument has double duty: it's used to initialize both the
number of contexts and the number of stack entries.  That's confusing
and the callers always pass zero anyway.  Hard code the zero.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/perf_event.h |  2 +-
 kernel/bpf/stackmap.c      |  4 ++--
 kernel/events/callchain.c  | 12 ++++++------
 kernel/events/core.c       |  2 +-
 4 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index afb028c54f33..f4b05954076c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1533,7 +1533,7 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
 extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		   u32 max_stack, bool crosstask, bool add_mark);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index d6b277482085..b0b0fbff7c18 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -294,7 +294,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	if (max_depth > sysctl_perf_event_max_stack)
 		max_depth = sysctl_perf_event_max_stack;
 
-	trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+	trace = get_perf_callchain(regs, kernel, user, max_depth,
 				   false, false);
 
 	if (unlikely(!trace))
@@ -420,7 +420,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 	else if (kernel && task)
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
-		trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
+		trace = get_perf_callchain(regs, kernel, user, max_depth,
 					   false, false);
 	if (unlikely(!trace))
 		goto err_fault;
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 1273be84392c..1e135195250c 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -177,7 +177,7 @@ put_callchain_entry(int rctx)
 }
 
 struct perf_callchain_entry *
-get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
+get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		   u32 max_stack, bool crosstask, bool add_mark)
 {
 	struct perf_callchain_entry *entry;
@@ -188,11 +188,11 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
 	if (!entry)
 		return NULL;
 
-	ctx.entry     = entry;
-	ctx.max_stack = max_stack;
-	ctx.nr	      = entry->nr = init_nr;
-	ctx.contexts       = 0;
-	ctx.contexts_maxed = false;
+	ctx.entry		= entry;
+	ctx.max_stack		= max_stack;
+	ctx.nr			= entry->nr = 0;
+	ctx.contexts		= 0;
+	ctx.contexts_maxed	= false;
 
 	if (kernel && !user_mode(regs)) {
 		if (add_mark)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 683dc086ef10..b0d62df7df4e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7600,7 +7600,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	if (!kernel && !user)
 		return &__empty_callchain;
 
-	callchain = get_perf_callchain(regs, 0, kernel, user,
+	callchain = get_perf_callchain(regs, kernel, user,
 				       max_stack, crosstask, true);
 	return callchain ?: &__empty_callchain;
 }
-- 
2.41.0



* [PATCH RFC 02/10] perf: Remove get_perf_callchain() 'crosstask' argument
  2023-11-09  0:41 [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
  2023-11-09  0:41 ` [PATCH RFC 01/10] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
@ 2023-11-09  0:41 ` Josh Poimboeuf
  2023-11-11  6:11   ` Namhyung Kim
  2023-11-09  0:41 ` [PATCH RFC 03/10] perf: Simplify get_perf_callchain() user logic Josh Poimboeuf
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains

get_perf_callchain() doesn't support cross-task unwinding, so it doesn't
make much sense to have 'crosstask' as an argument.  Instead, have
perf_callchain() adjust 'user' accordingly.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 include/linux/perf_event.h | 2 +-
 kernel/bpf/stackmap.c      | 5 ++---
 kernel/events/callchain.c  | 6 +-----
 kernel/events/core.c       | 8 ++++----
 4 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f4b05954076c..2d8fa253b9df 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1534,7 +1534,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark);
+		   u32 max_stack, bool add_mark);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index b0b0fbff7c18..e4827ca5378d 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -294,8 +294,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	if (max_depth > sysctl_perf_event_max_stack)
 		max_depth = sysctl_perf_event_max_stack;
 
-	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false);
+	trace = get_perf_callchain(regs, kernel, user, max_depth, false);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
@@ -421,7 +420,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   false, false);
+					   false);
 	if (unlikely(!trace))
 		goto err_fault;
 
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 1e135195250c..aa5f9d11c28d 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -178,7 +178,7 @@ put_callchain_entry(int rctx)
 
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark)
+		   u32 max_stack, bool add_mark)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
@@ -209,9 +209,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		}
 
 		if (regs) {
-			if (crosstask)
-				goto exit_put;
-
 			if (add_mark)
 				perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
@@ -219,7 +216,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		}
 	}
 
-exit_put:
 	put_callchain_entry(rctx);
 
 	return entry;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index b0d62df7df4e..5e41a3b70bcd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7592,16 +7592,16 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 {
 	bool kernel = !event->attr.exclude_callchain_kernel;
 	bool user   = !event->attr.exclude_callchain_user;
-	/* Disallow cross-task user callchains. */
-	bool crosstask = event->ctx->task && event->ctx->task != current;
 	const u32 max_stack = event->attr.sample_max_stack;
 	struct perf_callchain_entry *callchain;
 
+	/* Disallow cross-task user callchains. */
+	user &= !event->ctx->task || event->ctx->task == current;
+
 	if (!kernel && !user)
 		return &__empty_callchain;
 
-	callchain = get_perf_callchain(regs, kernel, user,
-				       max_stack, crosstask, true);
+	callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
 	return callchain ?: &__empty_callchain;
 }
 
-- 
2.41.0



* [PATCH RFC 03/10] perf: Simplify get_perf_callchain() user logic
  2023-11-09  0:41 [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
  2023-11-09  0:41 ` [PATCH RFC 01/10] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
  2023-11-09  0:41 ` [PATCH RFC 02/10] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
@ 2023-11-09  0:41 ` Josh Poimboeuf
  2023-11-11  6:11   ` Namhyung Kim
  2023-11-09  0:41 ` [PATCH RFC 04/10] perf: Introduce deferred user callchains Josh Poimboeuf
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains

Simplify the get_perf_callchain() user logic a bit.  task_pt_regs()
should never be NULL.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 kernel/events/callchain.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index aa5f9d11c28d..2bee8b6fda0e 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -202,20 +202,18 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 
 	if (user) {
 		if (!user_mode(regs)) {
-			if  (current->mm)
-				regs = task_pt_regs(current);
-			else
-				regs = NULL;
+			if (!current->mm)
+				goto exit_put;
+			regs = task_pt_regs(current);
 		}
 
-		if (regs) {
-			if (add_mark)
-				perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
+		if (add_mark)
+			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
-			perf_callchain_user(&ctx, regs);
-		}
+		perf_callchain_user(&ctx, regs);
 	}
 
+exit_put:
 	put_callchain_entry(rctx);
 
 	return entry;
-- 
2.41.0



* [PATCH RFC 04/10] perf: Introduce deferred user callchains
  2023-11-09  0:41 [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
                   ` (2 preceding siblings ...)
  2023-11-09  0:41 ` [PATCH RFC 03/10] perf: Simplify get_perf_callchain() user logic Josh Poimboeuf
@ 2023-11-09  0:41 ` Josh Poimboeuf
  2023-11-11  6:57   ` Namhyung Kim
  2023-11-09  0:41 ` [PATCH RFC 05/10] perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED Josh Poimboeuf
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains

Instead of attempting to unwind user space from the NMI handler, defer
it to run in task context by sending a self-IPI and then scheduling the
unwind to run in the IRQ's exit task work before returning to user space.

This allows the user stack page to be paged in if needed, avoids
duplicate unwinds for kernel-bound workloads, and prepares for SFrame
unwinding (so .sframe sections can be paged in on demand).
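
Concretely, the record stream then looks roughly like this (multiple
samples that hit before the next user exit share a single deferred
record, thanks to the pending_unwind flag), which is also where the
tool-side stitching question from the cover letter comes in:

  /*
   * NMI-time sample, emitted immediately:
   *
   *     callchain = { PERF_CONTEXT_KERNEL, <kernel ips...>,
   *                   PERF_CONTEXT_USER_DEFERRED }
   *
   * The NMI queues event->pending_irq (self-IPI); the IRQ exit path then
   * adds event->pending_task as task work.
   *
   * Deferred sample, emitted from the task work right before returning
   * to user space:
   *
   *     callchain = { PERF_CONTEXT_USER, <user ips...> }
   *
   * perf tool is expected to stitch the two callchains of the same
   * event back together.
   */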

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/Kconfig                    |  3 ++
 include/linux/perf_event.h      | 22 ++++++--
 include/uapi/linux/perf_event.h |  1 +
 kernel/bpf/stackmap.c           |  5 +-
 kernel/events/callchain.c       |  7 ++-
 kernel/events/core.c            | 90 ++++++++++++++++++++++++++++++---
 6 files changed, 115 insertions(+), 13 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index f4b210ab0612..690c82212224 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -425,6 +425,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
 	  It uses the same command line parameters, and sysctl interface,
 	  as the generic hardlockup detectors.
 
+config HAVE_PERF_CALLCHAIN_DEFERRED
+	bool
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2d8fa253b9df..2f232111dff2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -786,6 +786,7 @@ struct perf_event {
 	struct irq_work			pending_irq;
 	struct callback_head		pending_task;
 	unsigned int			pending_work;
+	unsigned int			pending_unwind;
 
 	atomic_t			event_limit;
 
@@ -1113,7 +1114,10 @@ int perf_event_read_local(struct perf_event *event, u64 *value,
 extern u64 perf_event_read_value(struct perf_event *event,
 				 u64 *enabled, u64 *running);
 
-extern struct perf_callchain_entry *perf_callchain(struct perf_event *event, struct pt_regs *regs);
+extern void perf_callchain(struct perf_sample_data *data,
+			   struct perf_event *event, struct pt_regs *regs);
+extern void perf_callchain_deferred(struct perf_sample_data *data,
+				    struct perf_event *event, struct pt_regs *regs);
 
 static inline bool branch_sample_no_flags(const struct perf_event *event)
 {
@@ -1189,6 +1193,7 @@ struct perf_sample_data {
 	u64				data_page_size;
 	u64				code_page_size;
 	u64				aux_size;
+	bool				deferred;
 } ____cacheline_aligned;
 
 /* default value for data source */
@@ -1206,6 +1211,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
 	data->sample_flags = PERF_SAMPLE_PERIOD;
 	data->period = period;
 	data->dyn_size = 0;
+	data->deferred = false;
 
 	if (addr) {
 		data->addr = addr;
@@ -1219,7 +1225,11 @@ static inline void perf_sample_save_callchain(struct perf_sample_data *data,
 {
 	int size = 1;
 
-	data->callchain = perf_callchain(event, regs);
+	if (data->deferred)
+		perf_callchain_deferred(data, event, regs);
+	else
+		perf_callchain(data, event, regs);
+
 	size += data->callchain->nr;
 
 	data->dyn_size += size * sizeof(u64);
@@ -1534,12 +1544,18 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool add_mark);
+		   u32 max_stack, bool add_mark, bool defer_user);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
 extern void put_callchain_entry(int rctx);
 
+#ifdef CONFIG_HAVE_PERF_CALLCHAIN_DEFERRED
+extern void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
+#else
+static inline void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs) {}
+#endif
+
 extern int sysctl_perf_event_max_stack;
 extern int sysctl_perf_event_max_contexts_per_stack;
 
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 39c6a250dd1b..9a1127af4cda 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1237,6 +1237,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV			= (__u64)-32,
 	PERF_CONTEXT_KERNEL		= (__u64)-128,
 	PERF_CONTEXT_USER		= (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED	= (__u64)-640,
 
 	PERF_CONTEXT_GUEST		= (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL	= (__u64)-2176,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index e4827ca5378d..fcdd26715b12 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -294,8 +294,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 	if (max_depth > sysctl_perf_event_max_stack)
 		max_depth = sysctl_perf_event_max_stack;
 
-	trace = get_perf_callchain(regs, kernel, user, max_depth, false);
-
+	trace = get_perf_callchain(regs, kernel, user, max_depth, false, false);
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
 		return -EFAULT;
@@ -420,7 +419,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   false);
+					   false, false);
 	if (unlikely(!trace))
 		goto err_fault;
 
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 2bee8b6fda0e..16571c8d6771 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -178,7 +178,7 @@ put_callchain_entry(int rctx)
 
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool add_mark)
+		   u32 max_stack, bool add_mark, bool defer_user)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
@@ -207,6 +207,11 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 			regs = task_pt_regs(current);
 		}
 
+		if (defer_user) {
+			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+			goto exit_put;
+		}
+
 		if (add_mark)
 			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5e41a3b70bcd..290e06b0071c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6751,6 +6751,12 @@ static void perf_pending_irq(struct irq_work *entry)
 	struct perf_event *event = container_of(entry, struct perf_event, pending_irq);
 	int rctx;
 
+	if (!is_software_event(event)) {
+		if (event->pending_unwind)
+			task_work_add(current, &event->pending_task, TWA_RESUME);
+		return;
+	}
+
 	/*
 	 * If we 'fail' here, that's OK, it means recursion is already disabled
 	 * and we won't recurse 'further'.
@@ -6772,11 +6778,57 @@ static void perf_pending_irq(struct irq_work *entry)
 		perf_swevent_put_recursion_context(rctx);
 }
 
+static void perf_pending_task_unwind(struct perf_event *event)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	struct perf_output_handle handle;
+	struct perf_event_header header;
+	struct perf_sample_data data;
+	struct perf_callchain_entry *callchain;
+
+	callchain = kmalloc(sizeof(struct perf_callchain_entry) +
+			    (sizeof(__u64) * event->attr.sample_max_stack) +
+			    (sizeof(__u64) * 1) /* one context */,
+			    GFP_KERNEL);
+	if (!callchain)
+		return;
+
+	callchain->nr = 0;
+	data.callchain = callchain;
+
+	perf_sample_data_init(&data, 0, event->hw.last_period);
+
+	data.deferred = true;
+
+	perf_prepare_sample(&data, event, regs);
+
+	perf_prepare_header(&header, &data, event, regs);
+
+	if (perf_output_begin(&handle, &data, event, header.size))
+		goto done;
+
+	perf_output_sample(&handle, &header, &data, event);
+
+	perf_output_end(&handle);
+
+done:
+	kfree(callchain);
+}
+
+
 static void perf_pending_task(struct callback_head *head)
 {
 	struct perf_event *event = container_of(head, struct perf_event, pending_task);
 	int rctx;
 
+	if (!is_software_event(event)) {
+		if (event->pending_unwind) {
+			perf_pending_task_unwind(event);
+			event->pending_unwind = 0;
+		}
+		return;
+	}
+
 	/*
 	 * If we 'fail' here, that's OK, it means recursion is already disabled
 	 * and we won't recurse 'further'.
@@ -7587,22 +7639,48 @@ static u64 perf_get_page_size(unsigned long addr)
 
 static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
 
-struct perf_callchain_entry *
-perf_callchain(struct perf_event *event, struct pt_regs *regs)
+void perf_callchain(struct perf_sample_data *data, struct perf_event *event,
+		    struct pt_regs *regs)
 {
 	bool kernel = !event->attr.exclude_callchain_kernel;
 	bool user   = !event->attr.exclude_callchain_user;
 	const u32 max_stack = event->attr.sample_max_stack;
-	struct perf_callchain_entry *callchain;
+	bool defer_user = IS_ENABLED(CONFIG_HAVE_PERF_CALLCHAIN_DEFERRED);
 
 	/* Disallow cross-task user callchains. */
 	user &= !event->ctx->task || event->ctx->task == current;
 
 	if (!kernel && !user)
-		return &__empty_callchain;
+		goto empty;
 
-	callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
-	return callchain ?: &__empty_callchain;
+	data->callchain = get_perf_callchain(regs, kernel, user, max_stack, true, defer_user);
+	if (!data->callchain)
+		goto empty;
+
+	if (user && defer_user && !event->pending_unwind) {
+		event->pending_unwind = 1;
+		irq_work_queue(&event->pending_irq);
+	}
+
+	return;
+
+empty:
+	data->callchain = &__empty_callchain;
+}
+
+void perf_callchain_deferred(struct perf_sample_data *data,
+			     struct perf_event *event, struct pt_regs *regs)
+{
+	struct perf_callchain_entry_ctx ctx;
+
+	ctx.entry		= data->callchain;
+	ctx.max_stack		= event->attr.sample_max_stack;
+	ctx.nr			= 0;
+	ctx.contexts		= 0;
+	ctx.contexts_maxed	= false;
+
+	perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
+	perf_callchain_user_deferred(&ctx, regs);
 }
 
 static __always_inline u64 __cond_set(u64 flags, u64 s, u64 d)
-- 
2.41.0



* [PATCH RFC 05/10] perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED
  2023-11-09  0:41 [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
                   ` (3 preceding siblings ...)
  2023-11-09  0:41 ` [PATCH RFC 04/10] perf: Introduce deferred user callchains Josh Poimboeuf
@ 2023-11-09  0:41 ` Josh Poimboeuf
  2023-11-09  0:41 ` [PATCH RFC 06/10] unwind: Introduce generic user space unwinding interfaces Josh Poimboeuf
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains

Enable deferred user space unwinding on x86.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/x86/Kconfig       |  1 +
 arch/x86/events/core.c | 47 ++++++++++++++++++++++++++++--------------
 2 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3762f41bb092..cacf11ac4b10 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -256,6 +256,7 @@ config X86
 	select HAVE_PERF_EVENTS_NMI
 	select HAVE_HARDLOCKUP_DETECTOR_PERF	if PERF_EVENTS && HAVE_PERF_EVENTS_NMI
 	select HAVE_PCI
+	select HAVE_PERF_CALLCHAIN_DEFERRED
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select MMU_GATHER_RCU_TABLE_FREE	if PARAVIRT
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 40ad1425ffa2..ae264437f794 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2816,8 +2816,8 @@ static unsigned long get_segment_base(unsigned int segment)
 
 #include <linux/compat.h>
 
-static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *entry)
+static inline int __perf_callchain_user32(struct pt_regs *regs,
+					  struct perf_callchain_entry_ctx *entry)
 {
 	/* 32-bit process in 64-bit kernel. */
 	unsigned long ss_base, cs_base;
@@ -2831,7 +2831,6 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 	ss_base = get_segment_base(regs->ss);
 
 	fp = compat_ptr(ss_base + regs->bp);
-	pagefault_disable();
 	while (entry->nr < entry->max_stack) {
 		if (!valid_user_frame(fp, sizeof(frame)))
 			break;
@@ -2844,19 +2843,18 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 		perf_callchain_store(entry, cs_base + frame.return_address);
 		fp = compat_ptr(ss_base + frame.next_frame);
 	}
-	pagefault_enable();
 	return 1;
 }
-#else
-static inline int
-perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *entry)
+#else /* !CONFIG_IA32_EMULATION */
+static inline int __perf_callchain_user32(struct pt_regs *regs,
+					  struct perf_callchain_entry_ctx *entry)
 {
-    return 0;
+	return 0;
 }
-#endif
+#endif /* CONFIG_IA32_EMULATION */
 
-void
-perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
+void __perf_callchain_user(struct perf_callchain_entry_ctx *entry,
+			   struct pt_regs *regs, bool atomic)
 {
 	struct stack_frame frame;
 	const struct stack_frame __user *fp;
@@ -2876,13 +2874,15 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 
 	perf_callchain_store(entry, regs->ip);
 
-	if (!nmi_uaccess_okay())
+	if (atomic && !nmi_uaccess_okay())
 		return;
 
-	if (perf_callchain_user32(regs, entry))
-		return;
+	if (atomic)
+		pagefault_disable();
+
+	if (__perf_callchain_user32(regs, entry))
+		goto done;
 
-	pagefault_disable();
 	while (entry->nr < entry->max_stack) {
 		if (!valid_user_frame(fp, sizeof(frame)))
 			break;
@@ -2895,7 +2895,22 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 		perf_callchain_store(entry, frame.return_address);
 		fp = (void __user *)frame.next_frame;
 	}
-	pagefault_enable();
+done:
+	if (atomic)
+		pagefault_enable();
+}
+
+
+void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
+			 struct pt_regs *regs)
+{
+	return __perf_callchain_user(entry, regs, true);
+}
+
+void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry,
+				  struct pt_regs *regs)
+{
+	return __perf_callchain_user(entry, regs, false);
 }
 
 /*
-- 
2.41.0



* [PATCH RFC 06/10] unwind: Introduce generic user space unwinding interfaces
  2023-11-09  0:41 [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
                   ` (4 preceding siblings ...)
  2023-11-09  0:41 ` [PATCH RFC 05/10] perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED Josh Poimboeuf
@ 2023-11-09  0:41 ` Josh Poimboeuf
  2023-11-09  0:41 ` [PATCH RFC 07/10] unwind/x86: Add HAVE_USER_UNWIND Josh Poimboeuf
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains

Introduce generic user space unwinding interfaces that provide a
unified way for architectures to unwind different user space stack
frame types.
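
As a usage sketch, an in-kernel consumer running in task context would
walk the current task's user stack like this (dump_user_stack() is just
an example; patch 08 converts the perf callchain code to this pattern):

  #include <linux/printk.h>
  #include <linux/user_unwind.h>

  /*
   * Must run in task context: user_unwind_start() does might_sleep()
   * and the frame accesses may fault in user pages.
   */
  static void dump_user_stack(void)
  {
          struct user_unwind_state state;

          for_each_user_frame(state, USER_UNWIND_TYPE_AUTO) {
                  /* state.ip is the next user return address. */
                  pr_info("  %016lx\n", state.ip);
          }
  }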

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/Kconfig                |  3 ++
 include/linux/user_unwind.h | 32 +++++++++++++++
 kernel/Makefile             |  1 +
 kernel/unwind/Makefile      |  1 +
 kernel/unwind/user.c        | 77 +++++++++++++++++++++++++++++++++++++
 5 files changed, 114 insertions(+)
 create mode 100644 include/linux/user_unwind.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/user.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 690c82212224..c4a08485835e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -428,6 +428,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
 config HAVE_PERF_CALLCHAIN_DEFERRED
 	bool
 
+config HAVE_USER_UNWIND
+	bool
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/include/linux/user_unwind.h b/include/linux/user_unwind.h
new file mode 100644
index 000000000000..2812b88c95fd
--- /dev/null
+++ b/include/linux/user_unwind.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_USER_UNWIND_H
+#define _LINUX_USER_UNWIND_H
+
+#include <linux/types.h>
+#include <linux/sframe.h>
+
+enum user_unwind_type {
+	USER_UNWIND_TYPE_AUTO,
+	USER_UNWIND_TYPE_FP,
+};
+
+struct user_unwind_frame {
+	s32 cfa_off;
+	s32 ra_off;
+	s32 fp_off;
+	bool use_fp;
+};
+
+struct user_unwind_state {
+	unsigned long ip, sp, fp;
+	enum user_unwind_type type;
+	bool done;
+};
+
+extern int user_unwind_start(struct user_unwind_state *state, enum user_unwind_type);
+extern int user_unwind_next(struct user_unwind_state *state);
+
+#define for_each_user_frame(state, type) \
+	for (user_unwind_start(&state, type); !state.done; user_unwind_next(&state))
+
+#endif /* _LINUX_USER_UNWIND_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 3947122d618b..bddf58b3b496 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,6 +50,7 @@ obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
 obj-y += entry/
+obj-y += unwind/
 obj-$(CONFIG_MODULES) += module/
 
 obj-$(CONFIG_KCMP) += kcmp.o
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
new file mode 100644
index 000000000000..eb466d6a3295
--- /dev/null
+++ b/kernel/unwind/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_HAVE_USER_UNWIND) += user.o
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
new file mode 100644
index 000000000000..8f9432306482
--- /dev/null
+++ b/kernel/unwind/user.c
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+#include <linux/user_unwind.h>
+#include <linux/sframe.h>
+#include <linux/uaccess.h>
+#include <asm/user_unwind.h>
+
+static struct user_unwind_frame fp_frame = {
+	ARCH_INIT_USER_FP_FRAME
+};
+
+int user_unwind_next(struct user_unwind_state *state)
+{
+	struct user_unwind_frame _frame;
+	struct user_unwind_frame *frame = &_frame;
+	unsigned long cfa, fp, ra;
+	int ret = -EINVAL;
+
+	if (state->done)
+		return -EINVAL;
+
+	switch (state->type) {
+	case USER_UNWIND_TYPE_FP:
+		frame = &fp_frame;
+		break;
+	default:
+		BUG();
+	}
+
+	cfa = (frame->use_fp ? state->fp : state->sp) + frame->cfa_off;
+
+	if (frame->ra_off && get_user(ra, (unsigned long *)(cfa + frame->ra_off)))
+		goto the_end;
+
+	if (frame->fp_off && get_user(fp, (unsigned long *)(cfa + frame->fp_off)))
+		goto the_end;
+
+	state->sp = cfa;
+	state->ip = ra;
+	if (frame->fp_off)
+		state->fp = fp;
+
+	return 0;
+
+the_end:
+	state->done = true;
+	return ret;
+}
+
+int user_unwind_start(struct user_unwind_state *state,
+		      enum user_unwind_type type)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+
+	might_sleep();
+
+	memset(state, 0, sizeof(*state));
+
+	if (!current->mm) {
+		state->done = true;
+		return -EINVAL;
+	}
+
+	if (type == USER_UNWIND_TYPE_AUTO)
+		state->type = USER_UNWIND_TYPE_FP;
+	else
+		state->type = type;
+
+	state->sp = user_stack_pointer(regs);
+	state->ip = instruction_pointer(regs);
+	state->fp = frame_pointer(regs);
+
+	return user_unwind_next(state);
+}
-- 
2.41.0



* [PATCH RFC 07/10] unwind/x86: Add HAVE_USER_UNWIND
  2023-11-09  0:41 [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
                   ` (5 preceding siblings ...)
  2023-11-09  0:41 ` [PATCH RFC 06/10] unwind: Introduce generic user space unwinding interfaces Josh Poimboeuf
@ 2023-11-09  0:41 ` Josh Poimboeuf
  2023-11-09  0:41 ` [PATCH RFC 08/10] perf/x86: Use user_unwind interface Josh Poimboeuf
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains

Use ARCH_INIT_USER_FP_FRAME to describe how frame pointers are unwound
on x86, and enable HAVE_USER_UNWIND accordingly so the user unwind
interfaces can be used.
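
For reference, with .use_fp = true these offsets map onto the usual
x86_64 "push %rbp; mov %rsp, %rbp" frame layout (sizeof(long) == 8):

  /*
   * Per frame, user_unwind_next() computes:
   *
   *     CFA = RBP + 16       (.cfa_off = sizeof(long) *  2)
   *     RA  = *(CFA -  8)    (.ra_off  = sizeof(long) * -1)
   *     FP  = *(CFA - 16)    (.fp_off  = sizeof(long) * -2)
   *
   * where:
   *
   *     CFA      : caller's %rsp before the call
   *     CFA -  8 : return address pushed by the call
   *     CFA - 16 : saved %rbp (what the current %rbp points to)
   */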

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/x86/Kconfig                   |  1 +
 arch/x86/include/asm/user_unwind.h | 11 +++++++++++
 2 files changed, 12 insertions(+)
 create mode 100644 arch/x86/include/asm/user_unwind.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cacf11ac4b10..95939cd54dfe 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -278,6 +278,7 @@ config X86
 	select HAVE_UACCESS_VALIDATION		if HAVE_OBJTOOL
 	select HAVE_UNSTABLE_SCHED_CLOCK
 	select HAVE_USER_RETURN_NOTIFIER
+	select HAVE_USER_UNWIND
 	select HAVE_GENERIC_VDSO
 	select HOTPLUG_PARALLEL			if SMP && X86_64
 	select HOTPLUG_SMT			if SMP
diff --git a/arch/x86/include/asm/user_unwind.h b/arch/x86/include/asm/user_unwind.h
new file mode 100644
index 000000000000..caa6266abbb4
--- /dev/null
+++ b/arch/x86/include/asm/user_unwind.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_USER_UNWIND_H
+#define _ASM_X86_USER_UNWIND_H
+
+#define ARCH_INIT_USER_FP_FRAME						\
+	.ra_off		= sizeof(long) * -1,				\
+	.cfa_off	= sizeof(long) * 2,				\
+	.fp_off		= sizeof(long) * -2,				\
+	.use_fp		= true,
+
+#endif /* _ASM_X86_USER_UNWIND_H */
-- 
2.41.0



* [PATCH RFC 08/10] perf/x86: Use user_unwind interface
  2023-11-09  0:41 [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
                   ` (6 preceding siblings ...)
  2023-11-09  0:41 ` [PATCH RFC 07/10] unwind/x86: Add HAVE_USER_UNWIND Josh Poimboeuf
@ 2023-11-09  0:41 ` Josh Poimboeuf
  2023-11-09  0:41 ` [PATCH RFC 09/10] unwind: Introduce SFrame user space unwinding Josh Poimboeuf
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains

Simplify __perf_callchain_user() and prepare for sframe user space
unwinding by switching to the generic user unwind interface.

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/x86/events/core.c | 20 +++++---------------
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index ae264437f794..5c41a11f058f 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -29,6 +29,7 @@
 #include <linux/device.h>
 #include <linux/nospec.h>
 #include <linux/static_call.h>
+#include <linux/user_unwind.h>
 
 #include <asm/apic.h>
 #include <asm/stacktrace.h>
@@ -2856,8 +2857,7 @@ static inline int __perf_callchain_user32(struct pt_regs *regs,
 void __perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 			   struct pt_regs *regs, bool atomic)
 {
-	struct stack_frame frame;
-	const struct stack_frame __user *fp;
+	struct user_unwind_state state;
 
 	if (perf_guest_state()) {
 		/* TODO: We don't support guest os callchain now */
@@ -2870,8 +2870,6 @@ void __perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 	if (regs->flags & (X86_VM_MASK | PERF_EFLAGS_VM))
 		return;
 
-	fp = (void __user *)regs->bp;
-
 	perf_callchain_store(entry, regs->ip);
 
 	if (atomic && !nmi_uaccess_okay())
@@ -2883,17 +2881,9 @@ void __perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 	if (__perf_callchain_user32(regs, entry))
 		goto done;
 
-	while (entry->nr < entry->max_stack) {
-		if (!valid_user_frame(fp, sizeof(frame)))
-			break;
-
-		if (__get_user(frame.next_frame, &fp->next_frame))
-			break;
-		if (__get_user(frame.return_address, &fp->return_address))
-			break;
-
-		perf_callchain_store(entry, frame.return_address);
-		fp = (void __user *)frame.next_frame;
+	for_each_user_frame(state, USER_UNWIND_TYPE_AUTO) {
+		if (perf_callchain_store(entry, state.ip))
+			goto done;
 	}
 done:
 	if (atomic)
-- 
2.41.0



* [PATCH RFC 09/10] unwind: Introduce SFrame user space unwinding
  2023-11-09  0:41 [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
                   ` (7 preceding siblings ...)
  2023-11-09  0:41 ` [PATCH RFC 08/10] perf/x86: Use user_unwind interface Josh Poimboeuf
@ 2023-11-09  0:41 ` Josh Poimboeuf
  2023-11-09 19:31   ` Indu Bhagat
  2023-11-09  0:41 ` [PATCH RFC 10/10] unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME Josh Poimboeuf
  2023-11-09  0:45 ` [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
  10 siblings, 1 reply; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains

Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system.  Using DWARF instead isn't feasible due to complexity and
slowness.

For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64.  Similarly, for user space the GNU assembler
has created the SFrame format starting with binutils 2.40.  SFrame is a
simpler version of .eh_frame which gets placed in the .sframe section.

Add support for unwinding user space using SFrame.

More information about SFrame can be found here:

  - https://lwn.net/Articles/932209/
  - https://lwn.net/Articles/940686/
  - https://sourceware.org/binutils/docs/sframe-spec.html

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/Kconfig                |   3 +
 arch/x86/include/asm/mmu.h  |   2 +-
 fs/binfmt_elf.c             |  46 +++-
 include/linux/mm_types.h    |   3 +
 include/linux/sframe.h      |  46 ++++
 include/linux/user_unwind.h |   1 +
 include/uapi/linux/elf.h    |   1 +
 include/uapi/linux/prctl.h  |   3 +
 kernel/fork.c               |  10 +
 kernel/sys.c                |  11 +
 kernel/unwind/Makefile      |   1 +
 kernel/unwind/sframe.c      | 414 ++++++++++++++++++++++++++++++++++++
 kernel/unwind/sframe.h      | 217 +++++++++++++++++++
 kernel/unwind/user.c        |  15 +-
 mm/init-mm.c                |   2 +
 15 files changed, 768 insertions(+), 7 deletions(-)
 create mode 100644 include/linux/sframe.h
 create mode 100644 kernel/unwind/sframe.c
 create mode 100644 kernel/unwind/sframe.h

diff --git a/arch/Kconfig b/arch/Kconfig
index c4a08485835e..b133b03102c7 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -431,6 +431,9 @@ config HAVE_PERF_CALLCHAIN_DEFERRED
 config HAVE_USER_UNWIND
 	bool
 
+config HAVE_USER_UNWIND_SFRAME
+	bool
+
 config HAVE_PERF_REGS
 	bool
 	help
diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 0da5c227f490..9cf9cae8345f 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -73,7 +73,7 @@ typedef struct {
 	.context = {							\
 		.ctx_id = 1,						\
 		.lock = __MUTEX_INITIALIZER(mm.context.lock),		\
-	}
+	},
 
 void leave_mm(int cpu);
 #define leave_mm leave_mm
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 5397b552fbeb..bca207844a70 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -47,6 +47,7 @@
 #include <linux/dax.h>
 #include <linux/uaccess.h>
 #include <linux/rseq.h>
+#include <linux/sframe.h>
 #include <asm/param.h>
 #include <asm/page.h>
 
@@ -633,11 +634,13 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 		unsigned long no_base, struct elf_phdr *interp_elf_phdata,
 		struct arch_elf_state *arch_state)
 {
-	struct elf_phdr *eppnt;
+	struct elf_phdr *eppnt, *sframe_phdr = NULL;
 	unsigned long load_addr = 0;
 	int load_addr_set = 0;
 	unsigned long error = ~0UL;
 	unsigned long total_size;
+	unsigned long start_code = ~0UL;
+	unsigned long end_code = 0;
 	int i;
 
 	/* First of all, some simple consistency checks */
@@ -659,7 +662,8 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 
 	eppnt = interp_elf_phdata;
 	for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
-		if (eppnt->p_type == PT_LOAD) {
+		switch (eppnt->p_type) {
+		case PT_LOAD: {
 			int elf_type = MAP_PRIVATE;
 			int elf_prot = make_prot(eppnt->p_flags, arch_state,
 						 true, true);
@@ -698,7 +702,29 @@ static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
 				error = -ENOMEM;
 				goto out;
 			}
+
+			if ((eppnt->p_flags & PF_X) && k < start_code)
+				start_code = k;
+
+			k = load_addr + eppnt->p_vaddr + eppnt->p_filesz;
+			if ((eppnt->p_flags & PF_X) && k > end_code)
+				end_code = k;
+			break;
 		}
+		case PT_GNU_SFRAME:
+			sframe_phdr = eppnt;
+			break;
+		}
+	}
+
+	if (sframe_phdr) {
+		struct sframe_file sfile = {
+			.sframe_addr	= load_addr + sframe_phdr->p_vaddr,
+			.text_start	= start_code,
+			.text_end	= end_code,
+		};
+
+		__sframe_add_section(&sfile);
 	}
 
 	error = load_addr;
@@ -823,7 +849,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	int first_pt_load = 1;
 	unsigned long error;
 	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
-	struct elf_phdr *elf_property_phdata = NULL;
+	struct elf_phdr *elf_property_phdata = NULL, *sframe_phdr = NULL;
 	unsigned long elf_brk;
 	int retval, i;
 	unsigned long elf_entry;
@@ -931,6 +957,10 @@ static int load_elf_binary(struct linux_binprm *bprm)
 				executable_stack = EXSTACK_DISABLE_X;
 			break;
 
+		case PT_GNU_SFRAME:
+			sframe_phdr = elf_ppnt;
+			break;
+
 		case PT_LOPROC ... PT_HIPROC:
 			retval = arch_elf_pt_proc(elf_ex, elf_ppnt,
 						  bprm->file, false,
@@ -1279,6 +1309,16 @@ static int load_elf_binary(struct linux_binprm *bprm)
 				MAP_FIXED | MAP_PRIVATE, 0);
 	}
 
+	if (sframe_phdr) {
+		struct sframe_file sfile = {
+			.sframe_addr	= load_bias + sframe_phdr->p_vaddr,
+			.text_start	= start_code,
+			.text_end	= end_code,
+		};
+
+		__sframe_add_section(&sfile);
+	}
+
 	regs = current_pt_regs();
 #ifdef ELF_PLAT_INIT
 	/*
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 957ce38768b2..7c361a9ccf75 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -974,6 +974,9 @@ struct mm_struct {
 #endif
 		} lru_gen;
 #endif /* CONFIG_LRU_GEN */
+#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
+		struct maple_tree sframe_mt;
+#endif
 	} __randomize_layout;
 
 	/*
diff --git a/include/linux/sframe.h b/include/linux/sframe.h
new file mode 100644
index 000000000000..72a2e8625026
--- /dev/null
+++ b/include/linux/sframe.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SFRAME_H
+#define _LINUX_SFRAME_H
+
+#include <linux/mm_types.h>
+
+struct sframe_file {
+	unsigned long sframe_addr, text_start, text_end;
+};
+
+struct user_unwind_frame;
+
+#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
+
+#define INIT_MM_SFRAME .sframe_mt = MTREE_INIT(sframe_mt, 0),
+
+extern void sframe_free_mm(struct mm_struct *mm);
+
+extern int __sframe_add_section(struct sframe_file *file);
+extern int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end);
+extern int sframe_remove_section(unsigned long sframe_addr);
+extern int sframe_find(unsigned long ip, struct user_unwind_frame *frame);
+
+static inline bool sframe_enabled_current(void)
+{
+	struct mm_struct *mm = current->mm;
+
+	return mm && !mtree_empty(&mm->sframe_mt);
+}
+
+#else /* !CONFIG_HAVE_USER_UNWIND_SFRAME */
+
+#define INIT_MM_SFRAME
+
+static inline void sframe_free_mm(struct mm_struct *mm) {}
+
+static inline int __sframe_add_section(struct sframe_file *file) { return -EINVAL; }
+static inline int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end) { return -EINVAL; }
+static inline int sframe_remove_section(unsigned long sframe_addr) { return -EINVAL; }
+static inline int sframe_find(unsigned long ip, struct user_unwind_frame *frame) { return -EINVAL; }
+
+static inline bool sframe_enabled_current(void) { return false; }
+
+#endif /* CONFIG_HAVE_USER_UNWIND_SFRAME */
+
+#endif /* _LINUX_SFRAME_H */
diff --git a/include/linux/user_unwind.h b/include/linux/user_unwind.h
index 2812b88c95fd..9a5e6e557530 100644
--- a/include/linux/user_unwind.h
+++ b/include/linux/user_unwind.h
@@ -8,6 +8,7 @@
 enum user_unwind_type {
 	USER_UNWIND_TYPE_AUTO,
 	USER_UNWIND_TYPE_FP,
+	USER_UNWIND_TYPE_SFRAME,
 };
 
 struct user_unwind_frame {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 9417309b7230..e3a08ee03fe4 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -39,6 +39,7 @@ typedef __s64	Elf64_Sxword;
 #define PT_GNU_STACK	(PT_LOOS + 0x474e551)
 #define PT_GNU_RELRO	(PT_LOOS + 0x474e552)
 #define PT_GNU_PROPERTY	(PT_LOOS + 0x474e553)
+#define PT_GNU_SFRAME	(PT_LOOS + 0x474e554)
 
 
 /* ARM MTE memory tag segment type */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 370ed14b1ae0..336277ea9782 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -306,4 +306,7 @@ struct prctl_mm_map {
 # define PR_RISCV_V_VSTATE_CTRL_NEXT_MASK	0xc
 # define PR_RISCV_V_VSTATE_CTRL_MASK		0x1f
 
+#define PR_ADD_SFRAME			71
+#define PR_REMOVE_SFRAME		72
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 10917c3e1f03..0ec13004d86c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -99,6 +99,7 @@
 #include <linux/stackprotector.h>
 #include <linux/user_events.h>
 #include <linux/iommu.h>
+#include <linux/sframe.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -924,6 +925,7 @@ void __mmdrop(struct mm_struct *mm)
 	mm_pasid_drop(mm);
 	mm_destroy_cid(mm);
 	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
+	sframe_free_mm(mm);
 
 	free_mm(mm);
 }
@@ -1254,6 +1256,13 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
 #endif
 }
 
+static void mm_init_sframe(struct mm_struct *mm)
+{
+#ifdef CONFIG_HAVE_USER_UNWIND_SFRAME
+	mt_init(&mm->sframe_mt);
+#endif
+}
+
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
@@ -1285,6 +1294,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->pmd_huge_pte = NULL;
 #endif
 	mm_init_uprobes_state(mm);
+	mm_init_sframe(mm);
 	hugetlb_count_init(mm);
 
 	if (current->mm) {
diff --git a/kernel/sys.c b/kernel/sys.c
index 420d9cb9cc8e..4f2d6f91814d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -64,6 +64,7 @@
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
+#include <linux/sframe.h>
 
 #include <linux/nospec.h>
 
@@ -2739,6 +2740,16 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_RISCV_V_GET_CONTROL:
 		error = RISCV_V_GET_CONTROL();
 		break;
+	case PR_ADD_SFRAME:
+		if (arg5)
+			return -EINVAL;
+		error = sframe_add_section(arg2, arg3, arg4);
+		break;
+	case PR_REMOVE_SFRAME:
+		if (arg3 || arg4 || arg5)
+			return -EINVAL;
+		error = sframe_remove_section(arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index eb466d6a3295..6f202c5840cf 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1,2 @@
 obj-$(CONFIG_HAVE_USER_UNWIND) += user.o
+obj-$(CONFIG_HAVE_USER_UNWIND_SFRAME) += sframe.o
diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
new file mode 100644
index 000000000000..b167c19497e5
--- /dev/null
+++ b/kernel/unwind/sframe.c
@@ -0,0 +1,414 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/srcu.h>
+#include <linux/uaccess.h>
+#include <linux/mm.h>
+#include <linux/sframe.h>
+#include <linux/user_unwind.h>
+
+#include "sframe.h"
+
+#define SFRAME_FILENAME_LEN 32
+
+struct sframe_section {
+	struct rcu_head rcu;
+
+	unsigned long sframe_addr;
+	unsigned long text_addr;
+
+	unsigned long fdes_addr;
+	unsigned long fres_addr;
+	unsigned int  fdes_num;
+	signed char ra_off, fp_off;
+};
+
+DEFINE_STATIC_SRCU(sframe_srcu);
+
+#define __SFRAME_GET_USER(out, user_ptr, type)				\
+({									\
+	type __tmp;							\
+	if (get_user(__tmp, (type *)user_ptr))				\
+		return -EFAULT;						\
+	user_ptr += sizeof(__tmp);					\
+	out = __tmp;							\
+})
+
+#define SFRAME_GET_USER_SIGNED(out, user_ptr, size)			\
+({									\
+	switch (size) {							\
+	case 1:								\
+		__SFRAME_GET_USER(out, user_ptr, s8);			\
+		break;							\
+	case 2:								\
+		__SFRAME_GET_USER(out, user_ptr, s16);			\
+		break;							\
+	case 4:								\
+		__SFRAME_GET_USER(out, user_ptr, s32);			\
+		break;							\
+	default:							\
+		return -EINVAL;						\
+	}								\
+})
+
+#define SFRAME_GET_USER_UNSIGNED(out, user_ptr, size)			\
+({									\
+	switch (size) {							\
+	case 1:								\
+		__SFRAME_GET_USER(out, user_ptr, u8);			\
+		break;							\
+	case 2:								\
+		__SFRAME_GET_USER(out, user_ptr, u16);			\
+		break;							\
+	case 4:								\
+		__SFRAME_GET_USER(out, user_ptr, u32);			\
+		break;							\
+	default:							\
+		return -EINVAL;						\
+	}								\
+})
+
+static unsigned char fre_type_to_size(unsigned char fre_type)
+{
+	if (fre_type > 2)
+		return 0;
+	return 1 << fre_type;
+}
+
+static unsigned char offset_size_enum_to_size(unsigned char off_size)
+{
+	if (off_size > 2)
+		return 0;
+	return 1 << off_size;
+}
+
+static int find_fde(struct sframe_section *sec, unsigned long ip,
+		    struct sframe_fde *fde)
+{
+	s32 func_off, ip_off;
+	struct sframe_fde __user *first, *last, *mid, *found;
+
+	ip_off = ip - sec->sframe_addr;
+
+	first = (void *)sec->fdes_addr;
+	last = first + sec->fdes_num;
+	while (first <= last) {
+		mid = first + ((last - first) / 2);
+		if (get_user(func_off, (s32 *)mid))
+			return -EFAULT;
+		if (ip_off >= func_off) {
+			found = mid;
+			first = mid + 1;
+		} else
+			last = mid - 1;
+	}
+
+	if (!found)
+		return -EINVAL;
+
+	if (copy_from_user(fde, found, sizeof(*fde)))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int find_fre(struct sframe_section *sec, struct sframe_fde *fde,
+		    unsigned long ip, struct user_unwind_frame *frame)
+{
+	unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
+	unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
+	s32 fre_ip_off, cfa_off, ra_off, fp_off, ip_off;
+	unsigned char offset_count, offset_size;
+	unsigned char addr_size;
+	void __user *f, *last_f;
+	u8 fre_info;
+	int i;
+
+	addr_size = fre_type_to_size(fre_type);
+	if (!addr_size)
+		return -EINVAL;
+
+	ip_off = ip - sec->sframe_addr - fde->start_addr;
+
+	f = (void *)sec->fres_addr + fde->fres_off;
+
+	for (i = 0; i < fde->fres_num; i++) {
+
+		SFRAME_GET_USER_UNSIGNED(fre_ip_off, f, addr_size);
+
+		if (fde_type == SFRAME_FDE_TYPE_PCINC) {
+			if (fre_ip_off > ip_off)
+				break;
+		} else {
+			/* SFRAME_FDE_TYPE_PCMASK */
+#if 0 /* sframe v2 */
+			if (ip_off % fde->rep_size < fre_ip_off)
+				break;
+#endif
+		}
+
+		SFRAME_GET_USER_UNSIGNED(fre_info, f, 1);
+
+		offset_count = SFRAME_FRE_OFFSET_COUNT(fre_info);
+		offset_size  = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(fre_info));
+
+		if (!offset_count || !offset_size)
+			return -EINVAL;
+
+		last_f = f;
+		f += offset_count * offset_size;
+	}
+
+	if (!last_f)
+		return -EINVAL;
+
+	f = last_f;
+
+	SFRAME_GET_USER_UNSIGNED(cfa_off, f, offset_size);
+	offset_count--;
+
+	ra_off = sec->ra_off;
+	if (!ra_off) {
+		if (!offset_count--)
+			return -EINVAL;
+		SFRAME_GET_USER_SIGNED(ra_off, f, offset_size);
+	}
+
+	fp_off = sec->fp_off;
+	if (!fp_off && offset_count) {
+		offset_count--;
+		SFRAME_GET_USER_SIGNED(fp_off, f, offset_size);
+	}
+
+	if (offset_count)
+		return -EINVAL;
+
+	frame->cfa_off = cfa_off;
+	frame->ra_off = ra_off;
+	frame->fp_off = fp_off;
+	frame->use_fp = SFRAME_FRE_CFA_BASE_REG_ID(fre_info) == SFRAME_BASE_REG_FP;
+
+	return 0;
+}
+
+int sframe_find(unsigned long ip, struct user_unwind_frame *frame)
+{
+	struct mm_struct *mm = current->mm;
+	struct sframe_section *sec;
+	struct sframe_fde fde;
+	int srcu_idx;
+	int ret = -EINVAL;
+
+	srcu_idx = srcu_read_lock(&sframe_srcu);
+
+	sec = mtree_load(&mm->sframe_mt, ip);
+	if (!sec) {
+		srcu_read_unlock(&sframe_srcu, srcu_idx);
+		return -EINVAL;
+	}
+
+
+	ret = find_fde(sec, ip, &fde);
+	if (ret)
+		goto err_unlock;
+
+	ret = find_fre(sec, &fde, ip, frame);
+	if (ret)
+		goto err_unlock;
+
+	srcu_read_unlock(&sframe_srcu, srcu_idx);
+	return 0;
+
+err_unlock:
+	srcu_read_unlock(&sframe_srcu, srcu_idx);
+	return ret;
+}
+
+static int get_sframe_file(unsigned long sframe_addr, struct sframe_file *file)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *sframe_vma, *text_vma, *vma;
+	VMA_ITERATOR(vmi, mm, 0);
+
+	mmap_read_lock(mm);
+
+	sframe_vma = vma_lookup(mm, sframe_addr);
+	if (!sframe_vma || !sframe_vma->vm_file)
+		goto err_unlock;
+
+	text_vma = NULL;
+
+	for_each_vma(vmi, vma) {
+		if (vma->vm_file != sframe_vma->vm_file)
+			continue;
+		if (vma->vm_flags & VM_EXEC) {
+			if (text_vma) {
+				/*
+				 * Multiple EXEC segments in a single file
+				 * aren't currently supported, is that a thing?
+				 */
+				WARN_ON_ONCE(1);
+				goto err_unlock;
+			}
+			text_vma = vma;
+		}
+	}
+
+	file->sframe_addr	= sframe_addr;
+	file->text_start	= text_vma->vm_start;
+	file->text_end		= text_vma->vm_end;
+
+	mmap_read_unlock(mm);
+	return 0;
+
+err_unlock:
+	mmap_read_unlock(mm);
+	return -EINVAL;
+}
+
+static int validate_sframe_addrs(struct sframe_file *file)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *text_vma;
+
+	mmap_read_lock(mm);
+
+	if (!vma_lookup(mm, file->sframe_addr))
+		goto err_unlock;
+
+	text_vma = vma_lookup(mm, file->text_start);
+	if (!(text_vma->vm_flags & VM_EXEC))
+		goto err_unlock;
+
+	if (vma_lookup(mm, file->text_end-1) != text_vma)
+		goto err_unlock;
+
+	mmap_read_unlock(mm);
+	return 0;
+
+err_unlock:
+	mmap_read_unlock(mm);
+	return -EINVAL;
+}
+
+int __sframe_add_section(struct sframe_file *file)
+{
+	struct maple_tree *sframe_mt = &current->mm->sframe_mt;
+	struct sframe_section *sec;
+	struct sframe_header shdr;
+	unsigned long header_end;
+	int ret;
+
+	if (copy_from_user(&shdr, (void *)file->sframe_addr, sizeof(shdr)))
+		return -EFAULT;
+
+	if (shdr.preamble.magic != SFRAME_MAGIC ||
+	    shdr.preamble.version != SFRAME_VERSION_1 ||
+	    !(shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
+	    shdr.auxhdr_len || !shdr.num_fdes || !shdr.num_fres ||
+	    shdr.fdes_off > shdr.fres_off) {
+		return -EINVAL;
+	}
+
+	header_end = file->sframe_addr + SFRAME_HDR_SIZE(shdr);
+
+	sec = kmalloc(sizeof(*sec), GFP_KERNEL);
+	if (!sec)
+		return -ENOMEM;
+
+	sec->sframe_addr	= file->sframe_addr;
+	sec->text_addr		= file->text_start;
+	sec->fdes_addr		= header_end + shdr.fdes_off;
+	sec->fres_addr		= header_end + shdr.fres_off;
+	sec->fdes_num		= shdr.num_fdes;
+	sec->ra_off		= shdr.cfa_fixed_ra_offset;
+	sec->fp_off		= shdr.cfa_fixed_fp_offset;
+
+	ret = mtree_insert_range(sframe_mt, file->text_start, file->text_end,
+				 sec, GFP_KERNEL);
+	if (ret) {
+		kfree(sec);
+		return ret;
+	}
+
+	return 0;
+}
+
+int sframe_add_section(unsigned long sframe_addr, unsigned long text_start, unsigned long text_end)
+{
+	struct sframe_file file;
+	int ret;
+
+	if (!text_start || !text_end) {
+		ret = get_sframe_file(sframe_addr, &file);
+		if (ret)
+			return ret;
+	} else {
+		/*
+		 * This is mainly for generated code, for which the text isn't
+		 * file-backed so the user has to give the text bounds.
+		 */
+		file.sframe_addr	= sframe_addr;
+		file.text_start		= text_start;
+		file.text_end		= text_end;
+		ret = validate_sframe_addrs(&file);
+		if (ret)
+			return ret;
+	}
+
+	return __sframe_add_section(&file);
+}
+
+static void sframe_free_rcu(struct rcu_head *rcu)
+{
+	struct sframe_section *sec = container_of(rcu, struct sframe_section, rcu);
+
+	kfree(sec);
+}
+
+static int __sframe_remove_section(struct mm_struct *mm,
+				   struct sframe_section *sec)
+{
+	struct sframe_section *s;
+
+	s = mtree_erase(&mm->sframe_mt, sec->text_addr);
+	if (!s || WARN_ON_ONCE(s != sec))
+		return -EINVAL;
+
+	call_srcu(&sframe_srcu, &sec->rcu, sframe_free_rcu);
+
+	return 0;
+}
+
+int sframe_remove_section(unsigned long sframe_addr)
+{
+	struct mm_struct *mm = current->mm;
+	struct sframe_section *sec;
+	unsigned long index = 0;
+
+	sec = mtree_load(&mm->sframe_mt, sframe_addr);
+	if (!sec)
+		return -EINVAL;
+
+	mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX) {
+		if (sec->sframe_addr == sframe_addr)
+			return __sframe_remove_section(mm, sec);
+	}
+
+	return -EINVAL;
+}
+
+void sframe_free_mm(struct mm_struct *mm)
+{
+	struct sframe_section *sec;
+	unsigned long index = 0;
+
+	if (!mm)
+		return;
+
+	mt_for_each(&mm->sframe_mt, sec, index, ULONG_MAX)
+		kfree(sec);
+
+	mtree_destroy(&mm->sframe_mt);
+}
diff --git a/kernel/unwind/sframe.h b/kernel/unwind/sframe.h
new file mode 100644
index 000000000000..1f91b696daf5
--- /dev/null
+++ b/kernel/unwind/sframe.h
@@ -0,0 +1,217 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _SFRAME_H
+#define _SFRAME_H
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ *
+ * This file contains definitions for the SFrame stack tracing format, which is
+ * documented at https://sourceware.org/binutils/docs
+ */
+
+#include <linux/types.h>
+
+#define SFRAME_VERSION_1	1
+#define SFRAME_VERSION_2	2
+#define SFRAME_MAGIC		0xdee2
+
+/* Function Descriptor Entries are sorted on PC. */
+#define SFRAME_F_FDE_SORTED	0x1
+/* Frame-pointer based stack tracing. Defined, but not set. */
+#define SFRAME_F_FRAME_POINTER	0x2
+
+#define SFRAME_CFA_FIXED_FP_INVALID 0
+#define SFRAME_CFA_FIXED_RA_INVALID 0
+
+/* Supported ABIs/Arch. */
+#define SFRAME_ABI_AARCH64_ENDIAN_BIG	    1 /* AARCH64 big endian. */
+#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE    2 /* AARCH64 little endian. */
+#define SFRAME_ABI_AMD64_ENDIAN_LITTLE	    3 /* AMD64 little endian. */
+
+/* SFrame FRE types. */
+#define SFRAME_FRE_TYPE_ADDR1	0
+#define SFRAME_FRE_TYPE_ADDR2	1
+#define SFRAME_FRE_TYPE_ADDR4	2
+
+/*
+ * SFrame Function Descriptor Entry types.
+ *
+ * The SFrame format has two possible representations for functions. The
+ * choice of which type to use is made according to the instruction patterns
+ * in the relevant program stub.
+ */
+
+/* Unwinders use a (PC >= FRE_START_ADDR) check to look up a matching FRE. */
+#define SFRAME_FDE_TYPE_PCINC	0
+/*
+ * Unwinders use a (PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK)
+ * check to look up a matching FRE.  Typical use cases are pltN entries,
+ * trampolines, etc.
+ */
+#define SFRAME_FDE_TYPE_PCMASK	1
+
+/**
+ * struct sframe_preamble - SFrame Preamble.
+ * @magic: Magic number (SFRAME_MAGIC).
+ * @version: Format version number (SFRAME_VERSION).
+ * @flags: Various flags.
+ */
+struct sframe_preamble {
+	u16 magic;
+	u8  version;
+	u8  flags;
+} __packed;
+
+/**
+ * struct sframe_header - SFrame Header.
+ * @preamble: SFrame preamble.
+ * @abi_arch: Identify the arch (including endianness) and ABI.
+ * @cfa_fixed_fp_offset: Offset for the Frame Pointer (FP) from CFA may be
+ *	  fixed for some ABIs (e.g., in AMD64 when -fno-omit-frame-pointer is
+ *	  used). When fixed, this field specifies the fixed stack frame offset
+ *	  and the individual FREs do not need to track it. When not fixed, it
+ *	  is set to SFRAME_CFA_FIXED_FP_INVALID, and the individual FREs may
+ *	  provide the applicable stack frame offset, if any.
+ * @cfa_fixed_ra_offset: Offset for the Return Address from CFA is fixed for
+ *	  some ABIs. When fixed, this field specifies the fixed stack frame
+ *	  offset and the individual FREs do not need to track it. When not
+ *	  fixed, it is set to SFRAME_CFA_FIXED_RA_INVALID.
+ * @auxhdr_len: Number of bytes making up the auxiliary header, if any.
+ *	  Some ABI/arch, in the future, may use this space for extending the
+ *	  information in SFrame header. Auxiliary header is contained in bytes
+ *	  sequentially following the sframe_header.
+ * @num_fdes: Number of SFrame FDEs in this SFrame section.
+ * @num_fres: Number of SFrame Frame Row Entries.
+ * @fre_len:  Number of bytes in the SFrame Frame Row Entry section.
+ * @fdes_off: Offset of SFrame Function Descriptor Entry section.
+ * @fres_off: Offset of SFrame Frame Row Entry section.
+ */
+struct sframe_header {
+	struct sframe_preamble preamble;
+	u8  abi_arch;
+	s8  cfa_fixed_fp_offset;
+	s8  cfa_fixed_ra_offset;
+	u8  auxhdr_len;
+	u32 num_fdes;
+	u32 num_fres;
+	u32 fre_len;
+	u32 fdes_off;
+	u32 fres_off;
+} __packed;
+
+#define SFRAME_HDR_SIZE(sframe_hdr)	\
+	((sizeof(struct sframe_header) + (sframe_hdr).auxhdr_len))
+
+/* Two possible keys for executable (instruction) pointers signing. */
+#define SFRAME_AARCH64_PAUTH_KEY_A    0 /* Key A. */
+#define SFRAME_AARCH64_PAUTH_KEY_B    1 /* Key B. */
+
+/**
+ * struct sframe_fde - SFrame Function Descriptor Entry.
+ * @start_addr: Function start address. Encoded as a signed offset,
+ *	  relative to the current FDE.
+ * @size: Size of the function in bytes.
+ * @fres_off: Offset of the first SFrame Frame Row Entry of the function,
+ *	  relative to the beginning of the SFrame Frame Row Entry sub-section.
+ * @fres_num: Number of frame row entries for the function.
+ * @info: Additional information for deciphering the stack trace
+ *	  information for the function. Contains information about SFrame FRE
+ *	  type, SFrame FDE type, PAC authorization A/B key, etc.
+ * @rep_size: Block size for SFRAME_FDE_TYPE_PCMASK
+ * @padding: Unused
+ */
+struct sframe_fde {
+	s32 start_addr;
+	u32 size;
+	u32 fres_off;
+	u32 fres_num;
+	u8  info;
+#if 0 /* TODO sframe v2 */
+	u8  rep_size;
+	u16 padding;
+#endif
+} __packed;
+
+/*
+ * 'func_info' in SFrame FDE contains additional information for deciphering
+ * the stack trace information for the function. In V1, the information is
+ * organized as follows:
+ *   - 4-bits: Identify the FRE type used for the function.
+ *   - 1-bit: Identify the FDE type of the function - mask or inc.
+ *   - 1-bit: PAC authorization A/B key (aarch64).
+ *   - 2-bits: Unused.
+ * ---------------------------------------------------------------------
+ * |  Unused  |  PAC auth A/B key (aarch64) |  FDE type |   FRE type   |
+ * |          |        Unused (amd64)       |           |              |
+ * ---------------------------------------------------------------------
+ * 8          6                             5           4              0
+ */
+
+/* Note: Set PAC auth key to SFRAME_AARCH64_PAUTH_KEY_A by default.  */
+#define SFRAME_FUNC_INFO(fde_type, fre_enc_type) \
+	(((SFRAME_AARCH64_PAUTH_KEY_A & 0x1) << 5) | \
+	 (((fde_type) & 0x1) << 4) | ((fre_enc_type) & 0xf))
+
+#define SFRAME_FUNC_FRE_TYPE(data)	  ((data) & 0xf)
+#define SFRAME_FUNC_FDE_TYPE(data)	  (((data) >> 4) & 0x1)
+#define SFRAME_FUNC_PAUTH_KEY(data)	  (((data) >> 5) & 0x1)
+
+/*
+ * Size of stack frame offsets in an SFrame Frame Row Entry. A single
+ * SFrame FRE has all offsets of the same size. Offset size may vary
+ * across frame row entries.
+ */
+#define SFRAME_FRE_OFFSET_1B	  0
+#define SFRAME_FRE_OFFSET_2B	  1
+#define SFRAME_FRE_OFFSET_4B	  2
+
+/* An SFrame Frame Row Entry can be SP or FP based.  */
+#define SFRAME_BASE_REG_FP	0
+#define SFRAME_BASE_REG_SP	1
+
+/*
+ * The index at which a specific offset is presented in the variable length
+ * bytes of an FRE.
+ */
+#define SFRAME_FRE_CFA_OFFSET_IDX   0
+/*
+ * The RA stack offset, if present, will always be at index 1 in the variable
+ * length bytes of the FRE.
+ */
+#define SFRAME_FRE_RA_OFFSET_IDX    1
+/*
+ * The FP stack offset may appear at offset 1 or 2, depending on the ABI as RA
+ * may or may not be tracked.
+ */
+#define SFRAME_FRE_FP_OFFSET_IDX    2
+
+/*
+ * 'fre_info' in SFrame FRE contains information about:
+ *   - 1 bit: base reg for CFA
+ *   - 4 bits: Number of offsets (N). A value of up to 3 is allowed to track
+ *   all three of CFA, FP and RA (fixed implicit order).
+ *   - 2 bits: information about size of the offsets (S) in bytes.
+ *     Valid values are SFRAME_FRE_OFFSET_1B, SFRAME_FRE_OFFSET_2B,
+ *     SFRAME_FRE_OFFSET_4B
+ *   - 1 bit: Mangled RA state bit (aarch64 only).
+ * ---------------------------------------------------------------
+ * | Mangled-RA (aarch64) |  Size of   |   Number of  | base_reg |
+ * |  Unused (amd64)      |  offsets   |    offsets   |          |
+ * ---------------------------------------------------------------
+ * 8                      7             5             1          0
+ */
+
+/* Note: Set mangled_ra_p to zero by default. */
+#define SFRAME_FRE_INFO(base_reg_id, offset_num, offset_size) \
+	(((0 & 0x1) << 7) | (((offset_size) & 0x3) << 5) | \
+	 (((offset_num) & 0xf) << 1) | ((base_reg_id) & 0x1))
+
+/* Set the mangled_ra_p bit as indicated. */
+#define SFRAME_FRE_INFO_UPDATE_MANGLED_RA_P(mangled_ra_p, fre_info) \
+	((((mangled_ra_p) & 0x1) << 7) | ((fre_info) & 0x7f))
+
+#define SFRAME_FRE_CFA_BASE_REG_ID(data)	  ((data) & 0x1)
+#define SFRAME_FRE_OFFSET_COUNT(data)		  (((data) >> 1) & 0xf)
+#define SFRAME_FRE_OFFSET_SIZE(data)		  (((data) >> 5) & 0x3)
+#define SFRAME_FRE_MANGLED_RA_P(data)		  (((data) >> 7) & 0x1)
+
+#endif /* _SFRAME_H */
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 8f9432306482..4194180df154 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -26,6 +26,11 @@ int user_unwind_next(struct user_unwind_state *state)
 	case USER_UNWIND_TYPE_FP:
 		frame = &fp_frame;
 		break;
+	case USER_UNWIND_TYPE_SFRAME:
+		ret = sframe_find(state->ip, frame);
+		if (ret)
+			goto the_end;
+		break;
 	default:
 		BUG();
 	}
@@ -64,10 +69,14 @@ int user_unwind_start(struct user_unwind_state *state,
 		return -EINVAL;
 	}
 
-	if (type == USER_UNWIND_TYPE_AUTO)
-		state->type = USER_UNWIND_TYPE_FP;
-	else
+	if (type == USER_UNWIND_TYPE_AUTO) {
+		state->type = sframe_enabled_current() ? USER_UNWIND_TYPE_SFRAME
+						       : USER_UNWIND_TYPE_FP;
+	} else {
+		if (type == USER_UNWIND_TYPE_SFRAME && !sframe_enabled_current())
+			return -EINVAL;
 		state->type = type;
+	}
 
 	state->sp = user_stack_pointer(regs);
 	state->ip = instruction_pointer(regs);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index cfd367822cdd..288885a39e12 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -11,6 +11,7 @@
 #include <linux/atomic.h>
 #include <linux/user_namespace.h>
 #include <linux/iommu.h>
+#include <linux/sframe.h>
 #include <asm/mmu.h>
 
 #ifndef INIT_MM_CONTEXT
@@ -48,6 +49,7 @@ struct mm_struct init_mm = {
 	.pasid		= IOMMU_PASID_INVALID,
 #endif
 	INIT_MM_CONTEXT(init_mm)
+	INIT_MM_SFRAME
 };
 
 void setup_initial_init_mm(void *start_code, void *end_code,
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH RFC 10/10] unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME
  2023-11-09  0:41 [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
                   ` (8 preceding siblings ...)
  2023-11-09  0:41 ` [PATCH RFC 09/10] unwind: Introduce SFrame user space unwinding Josh Poimboeuf
@ 2023-11-09  0:41 ` Josh Poimboeuf
  2023-11-09  0:45 ` [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
  10 siblings, 0 replies; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-09  0:41 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains

Binutils 2.40 supports generating SFrame for x86_64.  It works well in
testing, so enable it.

NOTE: An out-of-tree glibc patch is still needed to enable setting
PR_ADD_SFRAME for shared libraries and dlopens.
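
For reference, a runtime could register .sframe data for generated
(non-file-backed) code roughly as below.  This is only an illustrative
sketch: PR_ADD_SFRAME (71) is the provisional value used by the
out-of-tree glibc patch, and the mapping of the remaining prctl()
arguments to the text bounds is an assumption based on
sframe_add_section() from the previous patch.

#include <sys/prctl.h>

#ifndef PR_ADD_SFRAME
#define PR_ADD_SFRAME 71	/* provisional, not yet in the UAPI headers */
#endif

/*
 * For file-backed text (e.g. shared libraries) the text bounds can be
 * zero and the kernel finds the text VMA itself; generated code must
 * pass them explicitly.
 */
static int register_sframe(unsigned long sframe_addr,
			   unsigned long text_start, unsigned long text_end)
{
	return prctl(PR_ADD_SFRAME, sframe_addr, text_start, text_end, 0);
}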

Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
---
 arch/x86/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 95939cd54dfe..770d0528e4c9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -279,6 +279,7 @@ config X86
 	select HAVE_UNSTABLE_SCHED_CLOCK
 	select HAVE_USER_RETURN_NOTIFIER
 	select HAVE_USER_UNWIND
+	select HAVE_USER_UNWIND_SFRAME		if X86_64
 	select HAVE_GENERIC_VDSO
 	select HOTPLUG_PARALLEL			if SMP && X86_64
 	select HOTPLUG_SMT			if SMP
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 00/10] perf: user space sframe unwinding
  2023-11-09  0:41 [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
                   ` (9 preceding siblings ...)
  2023-11-09  0:41 ` [PATCH RFC 10/10] unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME Josh Poimboeuf
@ 2023-11-09  0:45 ` Josh Poimboeuf
  10 siblings, 0 replies; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-09  0:45 UTC (permalink / raw)
  To: Peter Zijlstra, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Indu Bhagat, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
	linux-perf-users, Mark Brown, linux-toolchains

On Wed, Nov 08, 2023 at 04:41:05PM -0800, Josh Poimboeuf wrote:
> Some distros have started compiling frame pointers into all their
> packages to enable the kernel to do system-wide profiling of user space.
> Unfortunately that creates a runtime performance penalty across the
> entire system.  Using DWARF (or .eh_frame) instead isn't feasible
> because of complexity and slowness.
> 
> For in-kernel unwinding we solved this problem with the creation of the
> ORC unwinder for x86_64.  Similarly, for user space the GNU assembler
> has created the SFrame ("Simple Frame") format starting with binutils
> 2.40.
> 
> These patches add support for unwinding user space from the kernel using
> SFrame with perf.  It should be easy to add user unwinding support for
> other components like ftrace.
> 
> I tested it on Gentoo by recompiling everything with -Wa,-gsframe and
> using a custom glibc patch (which I'll send in a reply to this email).

Here's my glibc patch:

diff --git a/elf/dl-load.c b/elf/dl-load.c
index 2923b1141d..333d7c39fd 100644
--- a/elf/dl-load.c
+++ b/elf/dl-load.c
@@ -29,6 +29,7 @@
 #include <bits/wordsize.h>
 #include <sys/mman.h>
 #include <sys/param.h>
+#include <sys/prctl.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <gnu/lib-names.h>
@@ -88,6 +89,10 @@ struct filebuf
 
 #define STRING(x) __STRING (x)
 
+#ifndef PT_GNU_SFRAME
+#define PT_GNU_SFRAME 0x6474e554
+#endif
+
 
 int __stack_prot attribute_hidden attribute_relro
 #if _STACK_GROWS_DOWN && defined PROT_GROWSDOWN
@@ -1213,6 +1218,10 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
 	  l->l_relro_addr = ph->p_vaddr;
 	  l->l_relro_size = ph->p_memsz;
 	  break;
+
+	case PT_GNU_SFRAME:
+	  l->l_sframe_addr = ph->p_vaddr;
+	  break;
 	}
 
     if (__glibc_unlikely (nloadcmds == 0))
@@ -1263,6 +1272,8 @@ _dl_map_object_from_fd (const char *name, const char *origname, int fd,
 	l->l_map_start = l->l_map_end = 0;
 	goto lose;
       }
+
+
   }
 
   if (l->l_ld != 0)
@@ -1376,6 +1387,13 @@ cannot enable executable stack as shared object requires");
 	break;
       }
 
+#define PR_ADD_SFRAME 71
+  if (l->l_sframe_addr != 0)
+  {
+    l->l_sframe_addr += l->l_addr;
+    __prctl(PR_ADD_SFRAME, l->l_sframe_addr, NULL, NULL, NULL);
+  }
+
   /* We are done mapping in the file.  We no longer need the descriptor.  */
   if (__glibc_unlikely (__close_nocancel (fd) != 0))
     {
diff --git a/include/link.h b/include/link.h
index c6af095d87..36ac75680f 100644
--- a/include/link.h
+++ b/include/link.h
@@ -348,6 +348,8 @@ struct link_map
     ElfW(Addr) l_relro_addr;
     size_t l_relro_size;
 
+    ElfW(Addr) l_sframe_addr;
+
     unsigned long long int l_serial;
   };
 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 09/10] unwind: Introduce SFrame user space unwinding
  2023-11-09  0:41 ` [PATCH RFC 09/10] unwind: Introduce SFrame user space unwinding Josh Poimboeuf
@ 2023-11-09 19:31   ` Indu Bhagat
  2023-11-09 19:37     ` Josh Poimboeuf
  0 siblings, 1 reply; 30+ messages in thread
From: Indu Bhagat @ 2023-11-09 19:31 UTC (permalink / raw)
  To: Josh Poimboeuf, Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo
  Cc: linux-kernel, x86, Mark Rutland, Alexander Shishkin, Jiri Olsa,
	Namhyung Kim, Ian Rogers, Adrian Hunter, linux-perf-users,
	Mark Brown, linux-toolchains

On 11/8/23 16:41, Josh Poimboeuf wrote:
> Some distros have started compiling frame pointers into all their
> packages to enable the kernel to do system-wide profiling of user space.
> Unfortunately that creates a runtime performance penalty across the
> entire system.  Using DWARF instead isn't feasible due to complexity and
> slowness.
> 
> For in-kernel unwinding we solved this problem with the creation of the
> ORC unwinder for x86_64.  Similarly, for user space the GNU assembler
> has created the SFrame format starting with binutils 2.40.  SFrame is a
> simpler version of .eh_frame which gets placed in the .sframe section.
> 
> Add support for unwinding user space using SFrame.
> 
> More information about SFrame can be found here:
> 
>    - https://lwn.net/Articles/932209/
>    - https://lwn.net/Articles/940686/
>    - https://sourceware.org/binutils/docs/sframe-spec.html
> 
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
>   arch/Kconfig                |   3 +
>   arch/x86/include/asm/mmu.h  |   2 +-
>   fs/binfmt_elf.c             |  46 +++-
>   include/linux/mm_types.h    |   3 +
>   include/linux/sframe.h      |  46 ++++
>   include/linux/user_unwind.h |   1 +
>   include/uapi/linux/elf.h    |   1 +
>   include/uapi/linux/prctl.h  |   3 +
>   kernel/fork.c               |  10 +
>   kernel/sys.c                |  11 +
>   kernel/unwind/Makefile      |   1 +
>   kernel/unwind/sframe.c      | 414 ++++++++++++++++++++++++++++++++++++
>   kernel/unwind/sframe.h      | 217 +++++++++++++++++++
>   kernel/unwind/user.c        |  15 +-
>   mm/init-mm.c                |   2 +
>   15 files changed, 768 insertions(+), 7 deletions(-)
>   create mode 100644 include/linux/sframe.h
>   create mode 100644 kernel/unwind/sframe.c
>   create mode 100644 kernel/unwind/sframe.h
> 

Thanks for working on this.

> diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
> new file mode 100644
> index 000000000000..b167c19497e5
> --- /dev/null
> +++ b/kernel/unwind/sframe.c
> @@ -0,0 +1,414 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/srcu.h>
> +#include <linux/uaccess.h>
> +#include <linux/mm.h>
> +#include <linux/sframe.h>
> +#include <linux/user_unwind.h>
> +
> +#include "sframe.h"
> +
> +#define SFRAME_FILENAME_LEN 32
> +
> +struct sframe_section {
> +	struct rcu_head rcu;
> +
> +	unsigned long sframe_addr;
> +	unsigned long text_addr;
> +
> +	unsigned long fdes_addr;
> +	unsigned long fres_addr;
> +	unsigned int  fdes_num;
> +	signed char ra_off, fp_off;
> +};
> +
> +DEFINE_STATIC_SRCU(sframe_srcu);
> +
> +#define __SFRAME_GET_USER(out, user_ptr, type)				\
> +({									\
> +	type __tmp;							\
> +	if (get_user(__tmp, (type *)user_ptr))				\
> +		return -EFAULT;						\
> +	user_ptr += sizeof(__tmp);					\
> +	out = __tmp;							\
> +})
> +
> +#define SFRAME_GET_USER_SIGNED(out, user_ptr, size)			\
> +({									\
> +	switch (size) {							\
> +	case 1:								\
> +		__SFRAME_GET_USER(out, user_ptr, s8);			\
> +		break;							\
> +	case 2:								\
> +		__SFRAME_GET_USER(out, user_ptr, s16);			\
> +		break;							\
> +	case 4:								\
> +		__SFRAME_GET_USER(out, user_ptr, s32);			\
> +		break;							\
> +	default:							\
> +		return -EINVAL;						\
> +	}								\
> +})
> +
> +#define SFRAME_GET_USER_UNSIGNED(out, user_ptr, size)			\
> +({									\
> +	switch (size) {							\
> +	case 1:								\
> +		__SFRAME_GET_USER(out, user_ptr, u8);			\
> +		break;							\
> +	case 2:								\
> +		__SFRAME_GET_USER(out, user_ptr, u16);			\
> +		break;							\
> +	case 4:								\
> +		__SFRAME_GET_USER(out, user_ptr, u32);			\
> +		break;							\
> +	default:							\
> +		return -EINVAL;						\
> +	}								\
> +})
> +
> +static unsigned char fre_type_to_size(unsigned char fre_type)
> +{
> +	if (fre_type > 2)
> +		return 0;
> +	return 1 << fre_type;
> +}
> +
> +static unsigned char offset_size_enum_to_size(unsigned char off_size)
> +{
> +	if (off_size > 2)
> +		return 0;
> +	return 1 << off_size;
> +}
> +
> +static int find_fde(struct sframe_section *sec, unsigned long ip,
> +		    struct sframe_fde *fde)
> +{
> +	s32 func_off, ip_off;
> +	struct sframe_fde __user *first, *last, *mid, *found;
> +
> +	ip_off = ip - sec->sframe_addr;
> +
> +	first = (void *)sec->fdes_addr;
> +	last = first + sec->fdes_num;
> +	while (first <= last) {
> +		mid = first + ((last - first) / 2);
> +		if (get_user(func_off, (s32 *)mid))
> +			return -EFAULT;
> +		if (ip_off >= func_off) {
> +			found = mid;
> +			first = mid + 1;
> +		} else
> +			last = mid - 1;
> +	}
> +
> +	if (!found)
> +		return -EINVAL;
> +
> +	if (copy_from_user(fde, found, sizeof(*fde)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +static int find_fre(struct sframe_section *sec, struct sframe_fde *fde,
> +		    unsigned long ip, struct user_unwind_frame *frame)
> +{
> +	unsigned char fde_type = SFRAME_FUNC_FDE_TYPE(fde->info);
> +	unsigned char fre_type = SFRAME_FUNC_FRE_TYPE(fde->info);
> +	s32 fre_ip_off, cfa_off, ra_off, fp_off, ip_off;
> +	unsigned char offset_count, offset_size;
> +	unsigned char addr_size;
> +	void __user *f, *last_f;
> +	u8 fre_info;
> +	int i;
> +
> +	addr_size = fre_type_to_size(fre_type);
> +	if (!addr_size)
> +		return -EINVAL;
> +
> +	ip_off = ip - sec->sframe_addr - fde->start_addr;
> +
> +	f = (void *)sec->fres_addr + fde->fres_off;
> +
> +	for (i = 0; i < fde->fres_num; i++) {
> +
> +		SFRAME_GET_USER_UNSIGNED(fre_ip_off, f, addr_size);
> +
> +		if (fde_type == SFRAME_FDE_TYPE_PCINC) {
> +			if (fre_ip_off > ip_off)
> +				break;
> +		} else {
> +			/* SFRAME_FDE_TYPE_PCMASK */
> +#if 0 /* sframe v2 */
> +			if (ip_off % fde->rep_size < fre_ip_off)
> +				break;
> +#endif
> +		}
> +
> +		SFRAME_GET_USER_UNSIGNED(fre_info, f, 1);
> +
> +		offset_count = SFRAME_FRE_OFFSET_COUNT(fre_info);
> +		offset_size  = offset_size_enum_to_size(SFRAME_FRE_OFFSET_SIZE(fre_info));
> +
> +		if (!offset_count || !offset_size)
> +			return -EINVAL;
> +
> +		last_f = f;
> +		f += offset_count * offset_size;
> +	}
> +
> +	if (!last_f)
> +		return -EINVAL;
> +
> +	f = last_f;
> +
> +	SFRAME_GET_USER_UNSIGNED(cfa_off, f, offset_size);
> +	offset_count--;
> +
> +	ra_off = sec->ra_off;
> +	if (!ra_off) {
> +		if (!offset_count--)
> +			return -EINVAL;
> +		SFRAME_GET_USER_SIGNED(ra_off, f, offset_size);
> +	}
> +
> +	fp_off = sec->fp_off;
> +	if (!fp_off && offset_count) {
> +		offset_count--;
> +		SFRAME_GET_USER_SIGNED(fp_off, f, offset_size);
> +	}
> +
> +	if (offset_count)
> +		return -EINVAL;
> +
> +	frame->cfa_off = cfa_off;
> +	frame->ra_off = ra_off;
> +	frame->fp_off = fp_off;
> +	frame->use_fp = SFRAME_FRE_CFA_BASE_REG_ID(fre_info) == SFRAME_BASE_REG_FP;
> +
> +	return 0;
> +}
> +
> +int sframe_find(unsigned long ip, struct user_unwind_frame *frame)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct sframe_section *sec;
> +	struct sframe_fde fde;
> +	int srcu_idx;
> +	int ret = -EINVAL;
> +
> +	srcu_idx = srcu_read_lock(&sframe_srcu);
> +
> +	sec = mtree_load(&mm->sframe_mt, ip);
> +	if (!sec) {
> +		srcu_read_unlock(&sframe_srcu, srcu_idx);
> +		return -EINVAL;
> +	}
> +
> +
> +	ret = find_fde(sec, ip, &fde);
> +	if (ret)
> +		goto err_unlock;
> +
> +	ret = find_fre(sec, &fde, ip, frame);
> +	if (ret)
> +		goto err_unlock;
> +
> +	srcu_read_unlock(&sframe_srcu, srcu_idx);
> +	return 0;
> +
> +err_unlock:
> +	srcu_read_unlock(&sframe_srcu, srcu_idx);
> +	return ret;
> +}
> +
> +static int get_sframe_file(unsigned long sframe_addr, struct sframe_file *file)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *sframe_vma, *text_vma, *vma;
> +	VMA_ITERATOR(vmi, mm, 0);
> +
> +	mmap_read_lock(mm);
> +
> +	sframe_vma = vma_lookup(mm, sframe_addr);
> +	if (!sframe_vma || !sframe_vma->vm_file)
> +		goto err_unlock;
> +
> +	text_vma = NULL;
> +
> +	for_each_vma(vmi, vma) {
> +		if (vma->vm_file != sframe_vma->vm_file)
> +			continue;
> +		if (vma->vm_flags & VM_EXEC) {
> +			if (text_vma) {
> +				/*
> +				 * Multiple EXEC segments in a single file
> +				 * aren't currently supported, is that a thing?
> +				 */
> +				WARN_ON_ONCE(1);
> +				goto err_unlock;
> +			}
> +			text_vma = vma;
> +		}
> +	}
> +
> +	file->sframe_addr	= sframe_addr;
> +	file->text_start	= text_vma->vm_start;
> +	file->text_end		= text_vma->vm_end;
> +
> +	mmap_read_unlock(mm);
> +	return 0;
> +
> +err_unlock:
> +	mmap_read_unlock(mm);
> +	return -EINVAL;
> +}
> +
> +static int validate_sframe_addrs(struct sframe_file *file)
> +{
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *text_vma;
> +
> +	mmap_read_lock(mm);
> +
> +	if (!vma_lookup(mm, file->sframe_addr))
> +		goto err_unlock;
> +
> +	text_vma = vma_lookup(mm, file->text_start);
> +	if (!(text_vma->vm_flags & VM_EXEC))
> +		goto err_unlock;
> +
> +	if (vma_lookup(mm, file->text_end-1) != text_vma)
> +		goto err_unlock;
> +
> +	mmap_read_unlock(mm);
> +	return 0;
> +
> +err_unlock:
> +	mmap_read_unlock(mm);
> +	return -EINVAL;
> +}
> +
> +int __sframe_add_section(struct sframe_file *file)
> +{
> +	struct maple_tree *sframe_mt = &current->mm->sframe_mt;
> +	struct sframe_section *sec;
> +	struct sframe_header shdr;
> +	unsigned long header_end;
> +	int ret;
> +
> +	if (copy_from_user(&shdr, (void *)file->sframe_addr, sizeof(shdr)))
> +		return -EFAULT;
> +
> +	if (shdr.preamble.magic != SFRAME_MAGIC ||
> +	    shdr.preamble.version != SFRAME_VERSION_1 ||
> +	    (!shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
> +	    shdr.auxhdr_len || !shdr.num_fdes || !shdr.num_fres ||
> +	    shdr.fdes_off > shdr.fres_off) {
> +		return -EINVAL;
> +	}
> +

I would say it would be ideal to start the support with
SFRAME_VERSION_2 onwards, if we have a choice.

The SFrame FDE structure in SFRAME_VERSION_1 was unaligned on-disk.  We
fixed that in SFRAME_VERSION_2 (Binutils 2.41) by adding some padding,
as you have already noted.  For x86_64 it's not an issue though, yes.
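
For reference, the on-disk layouts differ only in the trailing members.
This is a sketch based on the struct in this patch and the SFrame spec,
using the kernel's fixed-width types; it is not a definition taken from
the kernel sources:

/* SFrame V1 FDE: 17 bytes, so consecutive FDEs end up unaligned. */
struct sframe_fde_v1 {
	s32 start_addr;
	u32 size;
	u32 fres_off;
	u32 fres_num;
	u8  info;
} __packed;

/* SFrame V2 FDE: rep_size + padding bring it to 20 bytes (4-byte aligned). */
struct sframe_fde_v2 {
	s32 start_addr;
	u32 size;
	u32 fres_off;
	u32 fres_num;
	u8  info;
	u8  rep_size;
	u16 padding;
} __packed;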


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 09/10] unwind: Introduce SFrame user space unwinding
  2023-11-09 19:31   ` Indu Bhagat
@ 2023-11-09 19:37     ` Josh Poimboeuf
  2023-11-09 19:49       ` Steven Rostedt
  0 siblings, 1 reply; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-09 19:37 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Thu, Nov 09, 2023 at 11:31:59AM -0800, Indu Bhagat wrote:
> > +	if (shdr.preamble.magic != SFRAME_MAGIC ||
> > +	    shdr.preamble.version != SFRAME_VERSION_1 ||
> > +	    (!shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
> > +	    shdr.auxhdr_len || !shdr.num_fdes || !shdr.num_fres ||
> > +	    shdr.fdes_off > shdr.fres_off) {
> > +		return -EINVAL;
> > +	}
> > +
> 
> I would say that it will be ideal to start the support with SFRAME_VERSION_2
> onwards, if we have a choice.
> 
> The structure SFrame FDE in SFRAME_VERSION_1 was unaligned on-disk.  We
> fixed that in SFRAME_VERSION_2 (Binutils 2.41) by adding some padding as you
> have already noted. For x86_64, its not an issue though, yes.

Agreed.  I actually had v2 implemented but then realized my binutils was
older so I changed to v1 for testing.  But yeah, we can make v2 the
minimum.

-- 
Josh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 09/10] unwind: Introduce SFrame user space unwinding
  2023-11-09 19:37     ` Josh Poimboeuf
@ 2023-11-09 19:49       ` Steven Rostedt
  2023-11-09 19:53         ` Josh Poimboeuf
  0 siblings, 1 reply; 30+ messages in thread
From: Steven Rostedt @ 2023-11-09 19:49 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Indu Bhagat, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Thu, 9 Nov 2023 11:37:59 -0800
Josh Poimboeuf <jpoimboe@kernel.org> wrote:

> > The structure SFrame FDE in SFRAME_VERSION_1 was unaligned on-disk.  We
> > fixed that in SFRAME_VERSION_2 (Binutils 2.41) by adding some padding as you
> > have already noted. For x86_64, its not an issue though, yes.  
> 
> Agreed.  I actually had v2 implemented but then realized my binutils was
> older so I changed to v1 for testing.  But yeah, we can make v2 the
> minimum.

Cool, my test box has binutils 2.41, so it should be good to go ;-)

-- Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 09/10] unwind: Introduce SFrame user space unwinding
  2023-11-09 19:49       ` Steven Rostedt
@ 2023-11-09 19:53         ` Josh Poimboeuf
  0 siblings, 0 replies; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-09 19:53 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Indu Bhagat, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Thu, Nov 09, 2023 at 02:49:56PM -0500, Steven Rostedt wrote:
> On Thu, 9 Nov 2023 11:37:59 -0800
> Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> 
> > > The structure SFrame FDE in SFRAME_VERSION_1 was unaligned on-disk.  We
> > > fixed that in SFRAME_VERSION_2 (Binutils 2.41) by adding some padding as you
> > > have already noted. For x86_64, its not an issue though, yes.  
> > 
> > Agreed.  I actually had v2 implemented but then realized my binutils was
> > older so I changed to v1 for testing.  But yeah, we can make v2 the
> > minimum.
> 
> Cool, my test box has binutils 2.41, so it should be good to go ;-)

You'll need something like this (untested!)

diff --git a/kernel/unwind/sframe.c b/kernel/unwind/sframe.c
index b167c19497e5..f8b2a520a689 100644
--- a/kernel/unwind/sframe.c
+++ b/kernel/unwind/sframe.c
@@ -142,10 +142,8 @@ static int find_fre(struct sframe_section *sec, struct sframe_fde *fde,
 				break;
 		} else {
 			/* SFRAME_FDE_TYPE_PCMASK */
-#if 0 /* sframe v2 */
 			if (ip_off % fde->rep_size < fre_ip_off)
 				break;
-#endif
 		}
 
 		SFRAME_GET_USER_UNSIGNED(fre_info, f, 1);
@@ -304,7 +302,7 @@ int __sframe_add_section(struct sframe_file *file)
 		return -EFAULT;
 
 	if (shdr.preamble.magic != SFRAME_MAGIC ||
-	    shdr.preamble.version != SFRAME_VERSION_1 ||
+	    shdr.preamble.version != SFRAME_VERSION_2 ||
 	    (!shdr.preamble.flags & SFRAME_F_FDE_SORTED) ||
 	    shdr.auxhdr_len || !shdr.num_fdes || !shdr.num_fres ||
 	    shdr.fdes_off > shdr.fres_off) {
diff --git a/kernel/unwind/sframe.h b/kernel/unwind/sframe.h
index 1f91b696daf5..1adfc88b784d 100644
--- a/kernel/unwind/sframe.h
+++ b/kernel/unwind/sframe.h
@@ -125,10 +125,8 @@ struct sframe_fde {
 	u32 fres_off;
 	u32 fres_num;
 	u8  info;
-#if 0 /* TODO sframe v2 */
 	u8  rep_size;
 	u16 padding;
-#endif
 } __packed;
 
 /*

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 01/10] perf: Remove get_perf_callchain() 'init_nr' argument
  2023-11-09  0:41 ` [PATCH RFC 01/10] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
@ 2023-11-11  6:09   ` Namhyung Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Namhyung Kim @ 2023-11-11  6:09 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Indu Bhagat,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

Hello,

On Wed, Nov 8, 2023 at 4:43 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> The 'init_nr' argument has double duty: it's used to initialize both the
> number of contexts and the number of stack entries.  That's confusing
> and the callers always pass zero anyway.  Hard code the zero.

IIRC it was used to skip a number of leading stack entries in BPF.
I changed that code to not use init_nr but forgot to update
the perf code.
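
Roughly, the BPF side now takes the skip count from the helper flags
and applies it after the unwind instead -- a from-memory sketch, the
exact code may differ:

	/* in __bpf_get_stackid(), approximately: */
	u32 skip = flags & BPF_F_SKIP_FIELD_MASK;

	trace_nr = trace->nr - skip;	/* drop the leading entries */
	ips = trace->ip + skip;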

Acked-by: Namhyung Kim <Namhyung@kernel.org>

Thanks,
Namhyung

>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
>  include/linux/perf_event.h |  2 +-
>  kernel/bpf/stackmap.c      |  4 ++--
>  kernel/events/callchain.c  | 12 ++++++------
>  kernel/events/core.c       |  2 +-
>  4 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index afb028c54f33..f4b05954076c 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1533,7 +1533,7 @@ DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
>  extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
>  extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
>  extern struct perf_callchain_entry *
> -get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
> +get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>                    u32 max_stack, bool crosstask, bool add_mark);
>  extern int get_callchain_buffers(int max_stack);
>  extern void put_callchain_buffers(void);
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index d6b277482085..b0b0fbff7c18 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -294,7 +294,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
>         if (max_depth > sysctl_perf_event_max_stack)
>                 max_depth = sysctl_perf_event_max_stack;
>
> -       trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
> +       trace = get_perf_callchain(regs, kernel, user, max_depth,
>                                    false, false);
>
>         if (unlikely(!trace))
> @@ -420,7 +420,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
>         else if (kernel && task)
>                 trace = get_callchain_entry_for_task(task, max_depth);
>         else
> -               trace = get_perf_callchain(regs, 0, kernel, user, max_depth,
> +               trace = get_perf_callchain(regs, kernel, user, max_depth,
>                                            false, false);
>         if (unlikely(!trace))
>                 goto err_fault;
> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index 1273be84392c..1e135195250c 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c
> @@ -177,7 +177,7 @@ put_callchain_entry(int rctx)
>  }
>
>  struct perf_callchain_entry *
> -get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
> +get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>                    u32 max_stack, bool crosstask, bool add_mark)
>  {
>         struct perf_callchain_entry *entry;
> @@ -188,11 +188,11 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user,
>         if (!entry)
>                 return NULL;
>
> -       ctx.entry     = entry;
> -       ctx.max_stack = max_stack;
> -       ctx.nr        = entry->nr = init_nr;
> -       ctx.contexts       = 0;
> -       ctx.contexts_maxed = false;
> +       ctx.entry               = entry;
> +       ctx.max_stack           = max_stack;
> +       ctx.nr                  = entry->nr = 0;
> +       ctx.contexts            = 0;
> +       ctx.contexts_maxed      = false;
>
>         if (kernel && !user_mode(regs)) {
>                 if (add_mark)
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 683dc086ef10..b0d62df7df4e 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7600,7 +7600,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
>         if (!kernel && !user)
>                 return &__empty_callchain;
>
> -       callchain = get_perf_callchain(regs, 0, kernel, user,
> +       callchain = get_perf_callchain(regs, kernel, user,
>                                        max_stack, crosstask, true);
>         return callchain ?: &__empty_callchain;
>  }
> --
> 2.41.0
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 02/10] perf: Remove get_perf_callchain() 'crosstask' argument
  2023-11-09  0:41 ` [PATCH RFC 02/10] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
@ 2023-11-11  6:11   ` Namhyung Kim
  2023-11-11 20:53     ` Jordan Rome
  0 siblings, 1 reply; 30+ messages in thread
From: Namhyung Kim @ 2023-11-11  6:11 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Indu Bhagat,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Wed, Nov 8, 2023 at 4:44 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> get_perf_callchain() doesn't support cross-task unwinding, so it doesn't

That's only the case for user stacks, but it seems there's also no caller
that needs cross-task kernel stacks.

> make much sense to have 'crosstask' as an argument.  Instead, have
> perf_callchain() adjust 'user' accordingly.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>

Acked-by: Namhyung Kim <namhyung@kernel.org>

Thanks,
Namhyung

> ---
>  include/linux/perf_event.h | 2 +-
>  kernel/bpf/stackmap.c      | 5 ++---
>  kernel/events/callchain.c  | 6 +-----
>  kernel/events/core.c       | 8 ++++----
>  4 files changed, 8 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index f4b05954076c..2d8fa253b9df 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1534,7 +1534,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
>  extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
>  extern struct perf_callchain_entry *
>  get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> -                  u32 max_stack, bool crosstask, bool add_mark);
> +                  u32 max_stack, bool add_mark);
>  extern int get_callchain_buffers(int max_stack);
>  extern void put_callchain_buffers(void);
>  extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index b0b0fbff7c18..e4827ca5378d 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -294,8 +294,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
>         if (max_depth > sysctl_perf_event_max_stack)
>                 max_depth = sysctl_perf_event_max_stack;
>
> -       trace = get_perf_callchain(regs, kernel, user, max_depth,
> -                                  false, false);
> +       trace = get_perf_callchain(regs, kernel, user, max_depth, false);
>
>         if (unlikely(!trace))
>                 /* couldn't fetch the stack trace */
> @@ -421,7 +420,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
>                 trace = get_callchain_entry_for_task(task, max_depth);
>         else
>                 trace = get_perf_callchain(regs, kernel, user, max_depth,
> -                                          false, false);
> +                                          false);
>         if (unlikely(!trace))
>                 goto err_fault;
>
> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index 1e135195250c..aa5f9d11c28d 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c
> @@ -178,7 +178,7 @@ put_callchain_entry(int rctx)
>
>  struct perf_callchain_entry *
>  get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> -                  u32 max_stack, bool crosstask, bool add_mark)
> +                  u32 max_stack, bool add_mark)
>  {
>         struct perf_callchain_entry *entry;
>         struct perf_callchain_entry_ctx ctx;
> @@ -209,9 +209,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>                 }
>
>                 if (regs) {
> -                       if (crosstask)
> -                               goto exit_put;
> -
>                         if (add_mark)
>                                 perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
>
> @@ -219,7 +216,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>                 }
>         }
>
> -exit_put:
>         put_callchain_entry(rctx);
>
>         return entry;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index b0d62df7df4e..5e41a3b70bcd 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7592,16 +7592,16 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
>  {
>         bool kernel = !event->attr.exclude_callchain_kernel;
>         bool user   = !event->attr.exclude_callchain_user;
> -       /* Disallow cross-task user callchains. */
> -       bool crosstask = event->ctx->task && event->ctx->task != current;
>         const u32 max_stack = event->attr.sample_max_stack;
>         struct perf_callchain_entry *callchain;
>
> +       /* Disallow cross-task user callchains. */
> +       user &= !event->ctx->task || event->ctx->task == current;
> +
>         if (!kernel && !user)
>                 return &__empty_callchain;
>
> -       callchain = get_perf_callchain(regs, kernel, user,
> -                                      max_stack, crosstask, true);
> +       callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
>         return callchain ?: &__empty_callchain;
>  }
>
> --
> 2.41.0
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 03/10] perf: Simplify get_perf_callchain() user logic
  2023-11-09  0:41 ` [PATCH RFC 03/10] perf: Simplify get_perf_callchain() user logic Josh Poimboeuf
@ 2023-11-11  6:11   ` Namhyung Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Namhyung Kim @ 2023-11-11  6:11 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Indu Bhagat,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Wed, Nov 8, 2023 at 4:44 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> Simplify the get_perf_callchain() user logic a bit.  task_pt_regs()
> should never be NULL.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>

Acked-by: Namhyung Kim <namhyung@kernel.org>

Thanks,
Namhyung

> ---
>  kernel/events/callchain.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index aa5f9d11c28d..2bee8b6fda0e 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c
> @@ -202,20 +202,18 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>
>         if (user) {
>                 if (!user_mode(regs)) {
> -                       if  (current->mm)
> -                               regs = task_pt_regs(current);
> -                       else
> -                               regs = NULL;
> +                       if (!current->mm)
> +                               goto exit_put;
> +                       regs = task_pt_regs(current);
>                 }
>
> -               if (regs) {
> -                       if (add_mark)
> -                               perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
> +               if (add_mark)
> +                       perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
>
> -                       perf_callchain_user(&ctx, regs);
> -               }
> +               perf_callchain_user(&ctx, regs);
>         }
>
> +exit_put:
>         put_callchain_entry(rctx);
>
>         return entry;
> --
> 2.41.0
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 04/10] perf: Introduce deferred user callchains
  2023-11-09  0:41 ` [PATCH RFC 04/10] perf: Introduce deferred user callchains Josh Poimboeuf
@ 2023-11-11  6:57   ` Namhyung Kim
  2023-11-11 18:49     ` Josh Poimboeuf
  0 siblings, 1 reply; 30+ messages in thread
From: Namhyung Kim @ 2023-11-11  6:57 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Indu Bhagat,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Wed, Nov 8, 2023 at 4:43 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> Instead of attempting to unwind user space from the NMI handler, defer
> it to run in task context by sending a self-IPI and then scheduling the
> unwind to run in the IRQ's exit task work before returning to user space.
>
> This allows the user stack page to be paged in if needed, avoids
> duplicate unwinds for kernel-bound workloads, and prepares for SFrame
> unwinding (so .sframe sections can be paged in on demand).
>
> Suggested-by: Steven Rostedt <rostedt@goodmis.org>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> ---
>  arch/Kconfig                    |  3 ++
>  include/linux/perf_event.h      | 22 ++++++--
>  include/uapi/linux/perf_event.h |  1 +
>  kernel/bpf/stackmap.c           |  5 +-
>  kernel/events/callchain.c       |  7 ++-
>  kernel/events/core.c            | 90 ++++++++++++++++++++++++++++++---
>  6 files changed, 115 insertions(+), 13 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index f4b210ab0612..690c82212224 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -425,6 +425,9 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
>           It uses the same command line parameters, and sysctl interface,
>           as the generic hardlockup detectors.
>
> +config HAVE_PERF_CALLCHAIN_DEFERRED
> +       bool
> +
>  config HAVE_PERF_REGS
>         bool
>         help
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 2d8fa253b9df..2f232111dff2 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -786,6 +786,7 @@ struct perf_event {
>         struct irq_work                 pending_irq;
>         struct callback_head            pending_task;
>         unsigned int                    pending_work;
> +       unsigned int                    pending_unwind;
>
>         atomic_t                        event_limit;
>
> @@ -1113,7 +1114,10 @@ int perf_event_read_local(struct perf_event *event, u64 *value,
>  extern u64 perf_event_read_value(struct perf_event *event,
>                                  u64 *enabled, u64 *running);
>
> -extern struct perf_callchain_entry *perf_callchain(struct perf_event *event, struct pt_regs *regs);
> +extern void perf_callchain(struct perf_sample_data *data,
> +                          struct perf_event *event, struct pt_regs *regs);
> +extern void perf_callchain_deferred(struct perf_sample_data *data,
> +                                   struct perf_event *event, struct pt_regs *regs);
>
>  static inline bool branch_sample_no_flags(const struct perf_event *event)
>  {
> @@ -1189,6 +1193,7 @@ struct perf_sample_data {
>         u64                             data_page_size;
>         u64                             code_page_size;
>         u64                             aux_size;
> +       bool                            deferred;
>  } ____cacheline_aligned;
>
>  /* default value for data source */
> @@ -1206,6 +1211,7 @@ static inline void perf_sample_data_init(struct perf_sample_data *data,
>         data->sample_flags = PERF_SAMPLE_PERIOD;
>         data->period = period;
>         data->dyn_size = 0;
> +       data->deferred = false;
>
>         if (addr) {
>                 data->addr = addr;
> @@ -1219,7 +1225,11 @@ static inline void perf_sample_save_callchain(struct perf_sample_data *data,
>  {
>         int size = 1;
>
> -       data->callchain = perf_callchain(event, regs);
> +       if (data->deferred)
> +               perf_callchain_deferred(data, event, regs);
> +       else
> +               perf_callchain(data, event, regs);
> +
>         size += data->callchain->nr;
>
>         data->dyn_size += size * sizeof(u64);
> @@ -1534,12 +1544,18 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
>  extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
>  extern struct perf_callchain_entry *
>  get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> -                  u32 max_stack, bool add_mark);
> +                  u32 max_stack, bool add_mark, bool defer_user);
>  extern int get_callchain_buffers(int max_stack);
>  extern void put_callchain_buffers(void);
>  extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
>  extern void put_callchain_entry(int rctx);
>
> +#ifdef CONFIG_HAVE_PERF_CALLCHAIN_DEFERRED
> +extern void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
> +#else
> +static inline void perf_callchain_user_deferred(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs) {}
> +#endif
> +
>  extern int sysctl_perf_event_max_stack;
>  extern int sysctl_perf_event_max_contexts_per_stack;
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 39c6a250dd1b..9a1127af4cda 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -1237,6 +1237,7 @@ enum perf_callchain_context {
>         PERF_CONTEXT_HV                 = (__u64)-32,
>         PERF_CONTEXT_KERNEL             = (__u64)-128,
>         PERF_CONTEXT_USER               = (__u64)-512,
> +       PERF_CONTEXT_USER_DEFERRED      = (__u64)-640,
>
>         PERF_CONTEXT_GUEST              = (__u64)-2048,
>         PERF_CONTEXT_GUEST_KERNEL       = (__u64)-2176,
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index e4827ca5378d..fcdd26715b12 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -294,8 +294,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
>         if (max_depth > sysctl_perf_event_max_stack)
>                 max_depth = sysctl_perf_event_max_stack;
>
> -       trace = get_perf_callchain(regs, kernel, user, max_depth, false);
> -
> +       trace = get_perf_callchain(regs, kernel, user, max_depth, false, false);
>         if (unlikely(!trace))
>                 /* couldn't fetch the stack trace */
>                 return -EFAULT;
> @@ -420,7 +419,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
>                 trace = get_callchain_entry_for_task(task, max_depth);
>         else
>                 trace = get_perf_callchain(regs, kernel, user, max_depth,
> -                                          false);
> +                                          false, false);

Hmm.. BPF is not supported.  Any plans?


>         if (unlikely(!trace))
>                 goto err_fault;
>
> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
> index 2bee8b6fda0e..16571c8d6771 100644
> --- a/kernel/events/callchain.c
> +++ b/kernel/events/callchain.c
> @@ -178,7 +178,7 @@ put_callchain_entry(int rctx)
>
>  struct perf_callchain_entry *
>  get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> -                  u32 max_stack, bool add_mark)
> +                  u32 max_stack, bool add_mark, bool defer_user)
>  {
>         struct perf_callchain_entry *entry;
>         struct perf_callchain_entry_ctx ctx;
> @@ -207,6 +207,11 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>                         regs = task_pt_regs(current);
>                 }
>
> +               if (defer_user) {
> +                       perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
> +                       goto exit_put;

Hmm.. ok.  Then perf tools now need to stitch together the call stacks
from two (or more?) samples.


> +               }
> +
>                 if (add_mark)
>                         perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 5e41a3b70bcd..290e06b0071c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6751,6 +6751,12 @@ static void perf_pending_irq(struct irq_work *entry)
>         struct perf_event *event = container_of(entry, struct perf_event, pending_irq);
>         int rctx;
>
> +       if (!is_software_event(event)) {
> +               if (event->pending_unwind)
> +                       task_work_add(current, &event->pending_task, TWA_RESUME);

I'm not familiar with this code.  Is it possible that the current task
is preempted before returning to user space (and runs
perf_pending_task_unwind)?


> +               return;
> +       }
> +
>         /*
>          * If we 'fail' here, that's OK, it means recursion is already disabled
>          * and we won't recurse 'further'.
> @@ -6772,11 +6778,57 @@ static void perf_pending_irq(struct irq_work *entry)
>                 perf_swevent_put_recursion_context(rctx);
>  }
>
> +static void perf_pending_task_unwind(struct perf_event *event)
> +{
> +       struct pt_regs *regs = task_pt_regs(current);
> +       struct perf_output_handle handle;
> +       struct perf_event_header header;
> +       struct perf_sample_data data;
> +       struct perf_callchain_entry *callchain;
> +
> +       callchain = kmalloc(sizeof(struct perf_callchain_entry) +
> +                           (sizeof(__u64) * event->attr.sample_max_stack) +
> +                           (sizeof(__u64) * 1) /* one context */,
> +                           GFP_KERNEL);

Any chance it can reuse get_perf_callchain() instead of
allocating the callchains every time?


> +       if (!callchain)
> +               return;
> +
> +       callchain->nr = 0;
> +       data.callchain = callchain;
> +
> +       perf_sample_data_init(&data, 0, event->hw.last_period);

It would double count the period for a sample.
I guess we want to set the period to 0.
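
E.g. something like this (just a sketch):

	perf_sample_data_init(&data, 0, /* period: */ 0);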

> +
> +       data.deferred = true;
> +
> +       perf_prepare_sample(&data, event, regs);

I don't think this would work properly.  Not to mention it duplicates
sample data unnecessarily, and some data needs special handling to
avoid an inefficient (full) data copy, e.g. for the (user) stack, regs and aux.

Anyway I'm not sure it can support these additional samples for
deferred callchains without breaking the existing perf tools.
For one thing, they don't know about PERF_CONTEXT_USER_DEFERRED.
I think this should be controlled by a new feature bit in the
perf_event_attr.

Then we might add a breaking change to have a special
sample record for the deferred callchains and sample ID only.

> +
> +       perf_prepare_header(&header, &data, event, regs);
> +
> +       if (perf_output_begin(&handle, &data, event, header.size))
> +               goto done;
> +
> +       perf_output_sample(&handle, &header, &data, event);
> +
> +       perf_output_end(&handle);
> +
> +done:
> +       kfree(callchain);
> +}
> +
> +
>  static void perf_pending_task(struct callback_head *head)
>  {
>         struct perf_event *event = container_of(head, struct perf_event, pending_task);
>         int rctx;
>
> +       if (!is_software_event(event)) {
> +               if (event->pending_unwind) {
> +                       perf_pending_task_unwind(event);
> +                       event->pending_unwind = 0;
> +               }
> +               return;
> +       }
> +
>         /*
>          * If we 'fail' here, that's OK, it means recursion is already disabled
>          * and we won't recurse 'further'.
> @@ -7587,22 +7639,48 @@ static u64 perf_get_page_size(unsigned long addr)
>
>  static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
>
> -struct perf_callchain_entry *
> -perf_callchain(struct perf_event *event, struct pt_regs *regs)
> +void perf_callchain(struct perf_sample_data *data, struct perf_event *event,
> +                   struct pt_regs *regs)
>  {
>         bool kernel = !event->attr.exclude_callchain_kernel;
>         bool user   = !event->attr.exclude_callchain_user;
>         const u32 max_stack = event->attr.sample_max_stack;
> -       struct perf_callchain_entry *callchain;
> +       bool defer_user = IS_ENABLED(CONFIG_HAVE_PERF_CALLCHAIN_DEFERRED);

Is it always enabled depending on the kernel config?
It could be controlled by event->attr.something..

Thanks,
Namhyung

>
>         /* Disallow cross-task user callchains. */
>         user &= !event->ctx->task || event->ctx->task == current;
>
>         if (!kernel && !user)
> -               return &__empty_callchain;
> +               goto empty;
>
> -       callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
> -       return callchain ?: &__empty_callchain;
> +       data->callchain = get_perf_callchain(regs, kernel, user, max_stack, true, defer_user);
> +       if (!data->callchain)
> +               goto empty;
> +
> +       if (user && defer_user && !event->pending_unwind) {
> +               event->pending_unwind = 1;
> +               irq_work_queue(&event->pending_irq);
> +       }
> +
> +       return;
> +
> +empty:
> +       data->callchain = &__empty_callchain;
> +}
> +
> +void perf_callchain_deferred(struct perf_sample_data *data,
> +                            struct perf_event *event, struct pt_regs *regs)
> +{
> +       struct perf_callchain_entry_ctx ctx;
> +
> +       ctx.entry               = data->callchain;
> +       ctx.max_stack           = event->attr.sample_max_stack;
> +       ctx.nr                  = 0;
> +       ctx.contexts            = 0;
> +       ctx.contexts_maxed      = false;
> +
> +       perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
> +       perf_callchain_user_deferred(&ctx, regs);
>  }
>
>  static __always_inline u64 __cond_set(u64 flags, u64 s, u64 d)
> --
> 2.41.0
>
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 04/10] perf: Introduce deferred user callchains
  2023-11-11  6:57   ` Namhyung Kim
@ 2023-11-11 18:49     ` Josh Poimboeuf
  2023-11-11 18:54       ` Josh Poimboeuf
  2023-11-13 16:56       ` Namhyung Kim
  0 siblings, 2 replies; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-11 18:49 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Indu Bhagat,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Fri, Nov 10, 2023 at 10:57:58PM -0800, Namhyung Kim wrote:
> On Wed, Nov 8, 2023 at 4:43 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > +++ b/kernel/bpf/stackmap.c
> > @@ -294,8 +294,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
> >         if (max_depth > sysctl_perf_event_max_stack)
> >                 max_depth = sysctl_perf_event_max_stack;
> >
> > -       trace = get_perf_callchain(regs, kernel, user, max_depth, false);
> > -
> > +       trace = get_perf_callchain(regs, kernel, user, max_depth, false, false);
> >         if (unlikely(!trace))
> >                 /* couldn't fetch the stack trace */
> >                 return -EFAULT;
> > @@ -420,7 +419,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
> >                 trace = get_callchain_entry_for_task(task, max_depth);
> >         else
> >                 trace = get_perf_callchain(regs, kernel, user, max_depth,
> > -                                          false);
> > +                                          false, false);
> 
> Hmm.. BPF is not supported.  Any plans?

I'm not sure whether this feature makes sense for BPF.  If the BPF
program runs in atomic context then it would have to defer the unwind
until right before exiting to user space.  But then the program is no
longer running.

> > +++ b/kernel/events/callchain.c
> > @@ -178,7 +178,7 @@ put_callchain_entry(int rctx)
> >
> >  struct perf_callchain_entry *
> >  get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> > -                  u32 max_stack, bool add_mark)
> > +                  u32 max_stack, bool add_mark, bool defer_user)
> >  {
> >         struct perf_callchain_entry *entry;
> >         struct perf_callchain_entry_ctx ctx;
> > @@ -207,6 +207,11 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> >                         regs = task_pt_regs(current);
> >                 }
> >
> > +               if (defer_user) {
> > +                       perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
> > +                       goto exit_put;
> 
> Hmm.. ok.  Then perf tools now need to stitch together the call stacks
> from two (or more?) samples.

Yes.  And it could be more than two samples if there were multiple NMIs
before returning to user space.  In that case there would be a single
user sample which needs to be stitched to each of the kernel samples.

> > +               }
> > +
> >                 if (add_mark)
> >                         perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 5e41a3b70bcd..290e06b0071c 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -6751,6 +6751,12 @@ static void perf_pending_irq(struct irq_work *entry)
> >         struct perf_event *event = container_of(entry, struct perf_event, pending_irq);
> >         int rctx;
> >
> > +       if (!is_software_event(event)) {
> > +               if (event->pending_unwind)
> > +                       task_work_add(current, &event->pending_task, TWA_RESUME);
> 
> I'm not familiar with this code.  Is it possible that the current task
> is preempted before returning to user space (and runs
> perf_pending_task_unwind)?

Yes, the task work (perf_pending_task_unwind) is preemptible.

> > +static void perf_pending_task_unwind(struct perf_event *event)
> > +{
> > +       struct pt_regs *regs = task_pt_regs(current);
> > +       struct perf_output_handle handle;
> > +       struct perf_event_header header;
> > +       struct perf_sample_data data;
> > +       struct perf_callchain_entry *callchain;
> > +
> > +       callchain = kmalloc(sizeof(struct perf_callchain_entry) +
> > +                           (sizeof(__u64) * event->attr.sample_max_stack) +
> > +                           (sizeof(__u64) * 1) /* one context */,
> > +                           GFP_KERNEL);
> 
> Any chance it can reuse get_perf_callchain() instead of
> allocating the callchains every time?

I don't think so, because if it gets preempted, the new task might also
need to do an unwind.  But there's only one task-context callchain per
CPU.

The allocation is less than ideal.  I could just allocate it on the
stack, and keep the number of entries bounded to 128 or so.
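
Something like this, maybe (just a sketch, exact bound TBD):

	/* sketch: bounded, on-stack buffer instead of kmalloc() */
	u64 buf[(sizeof(struct perf_callchain_entry) / sizeof(u64)) +
		128 /* max entries */ + 1 /* context marker */];
	struct perf_callchain_entry *callchain = (void *)buf;

	callchain->nr = 0;
	data.callchain = callchain;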

> > +       if (!callchain)
> > +               return;
> > +
> > +       callchain->nr = 0;
> > +       data.callchain = callchain;
> > +
> > +       perf_sample_data_init(&data, 0, event->hw.last_period);
> 
> It would double count the period for a sample.
> I guess we want to set the period to 0.

Ok.

> > +
> > +       data.deferred = true;
> > +
> > +       perf_prepare_sample(&data, event, regs);
> 
> I don't think this would work properly.  Not to mention it duplicates
> sample data unnecessarily, and some data needs special handling to
> avoid an inefficient (full) data copy, e.g. for the (user) stack, regs and aux.

Yeah.  I tried sending only the sample ID and callchain, but perf tool
didn't appreciate that ;-) So for the RFC I gave up and did this.

> Anyway I'm not sure it can support these additional samples for
> deferred callchains without breaking the existing perf tools.
> For one thing, they don't know about PERF_CONTEXT_USER_DEFERRED.
> I think this should be controlled by a new feature bit in the
> perf_event_attr.
> 
> Then we might add a breaking change to have a special
> sample record for the deferred callchains and sample ID only.

Sounds like a good idea.  I'll need to study the code to figure out how
to do that on the perf tool side.  Or would you care to write a patch?

> > -struct perf_callchain_entry *
> > -perf_callchain(struct perf_event *event, struct pt_regs *regs)
> > +void perf_callchain(struct perf_sample_data *data, struct perf_event *event,
> > +                   struct pt_regs *regs)
> >  {
> >         bool kernel = !event->attr.exclude_callchain_kernel;
> >         bool user   = !event->attr.exclude_callchain_user;
> >         const u32 max_stack = event->attr.sample_max_stack;
> > -       struct perf_callchain_entry *callchain;
> > +       bool defer_user = IS_ENABLED(CONFIG_HAVE_PERF_CALLCHAIN_DEFERRED);
> 
> Is it always enabled depending on the kernel config?
> It could be controlled by event->attr.something..

Possibly, depending on what y'all think.  Also, fp vs sframe could be an
attr (though sframe implies deferred).

Thanks for the review!

-- 
Josh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 04/10] perf: Introduce deferred user callchains
  2023-11-11 18:49     ` Josh Poimboeuf
@ 2023-11-11 18:54       ` Josh Poimboeuf
  2023-11-13 16:56       ` Namhyung Kim
  1 sibling, 0 replies; 30+ messages in thread
From: Josh Poimboeuf @ 2023-11-11 18:54 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Indu Bhagat,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Sat, Nov 11, 2023 at 10:49:10AM -0800, Josh Poimboeuf wrote:
> On Fri, Nov 10, 2023 at 10:57:58PM -0800, Namhyung Kim wrote:
> > > +static void perf_pending_task_unwind(struct perf_event *event)
> > > +{
> > > +       struct pt_regs *regs = task_pt_regs(current);
> > > +       struct perf_output_handle handle;
> > > +       struct perf_event_header header;
> > > +       struct perf_sample_data data;
> > > +       struct perf_callchain_entry *callchain;
> > > +
> > > +       callchain = kmalloc(sizeof(struct perf_callchain_entry) +
> > > +                           (sizeof(__u64) * event->attr.sample_max_stack) +
> > > +                           (sizeof(__u64) * 1) /* one context */,
> > > +                           GFP_KERNEL);
> > 
> > Any chance it can reuse get_perf_callchain() instead of
> > allocating the callchains every time?
> 
> I don't think so, because if it gets preempted, the new task might also
> need to do an unwind.  But there's only one task-context callchain per
> CPU.

BTW it's not just preemption, this code can also block when the unwinder
tries to copy from user space.  So disabling preemption isn't an option.

-- 
Josh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 02/10] perf: Remove get_perf_callchain() 'crosstask' argument
  2023-11-11  6:11   ` Namhyung Kim
@ 2023-11-11 20:53     ` Jordan Rome
  0 siblings, 0 replies; 30+ messages in thread
From: Jordan Rome @ 2023-11-11 20:53 UTC (permalink / raw)
  To: Namhyung Kim, Josh Poimboeuf
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Indu Bhagat,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains



On 11/11/23 1:11 AM, Namhyung Kim wrote:
> On Wed, Nov 8, 2023 at 4:44 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>>
>> get_perf_callchain() doesn't support cross-task unwinding, so it doesn't
> 
> Only for user stacks, but it seems there's no place that gets cross-task kernel
> stacks either.
> 

There is bpf_get_task_stack in kernel/bpf/stackmap.c. This can be called
inside of a BPF task iterator, where you can get the kernel stacks
for every task on the host. But as this change points out, this doesn't
work for crosstask user stack unwinding. I have a similar patch that
just exits early in this case:
https://lore.kernel.org/linux-perf-users/20231111172001.1259065-1-linux@jordanrome.com/

Though I'm not opposed to just removing the *crosstask* param
entirely as a similar check was just added in the bpf tree for
bpf_get_task_stack:
https://lore.kernel.org/bpf/20231108112334.3433136-1-jordalgo@meta.com/
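
For illustration, the kind of early exit I mean is roughly (a sketch,
not necessarily what either patch does):

	/* in __bpf_get_stack(): cross-task *user* stacks aren't supported */
	if (user && task && task != current)
		goto err_fault;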

>> make much sense to have 'crosstask' as an argument.  Instead, have
>> perf_callchain() adjust 'user' accordingly.
>>
>> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
> 
> Acked-by: Namhyung Kim <namhyung@kernel.org>
> 
> Thanks,
> Namhyung
> 
>> ---
>>   include/linux/perf_event.h | 2 +-
>>   kernel/bpf/stackmap.c      | 5 ++---
>>   kernel/events/callchain.c  | 6 +-----
>>   kernel/events/core.c       | 8 ++++----
>>   4 files changed, 8 insertions(+), 13 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index f4b05954076c..2d8fa253b9df 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -1534,7 +1534,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
>>   extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
>>   extern struct perf_callchain_entry *
>>   get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>> -                  u32 max_stack, bool crosstask, bool add_mark);
>> +                  u32 max_stack, bool add_mark);
>>   extern int get_callchain_buffers(int max_stack);
>>   extern void put_callchain_buffers(void);
>>   extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
>> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
>> index b0b0fbff7c18..e4827ca5378d 100644
>> --- a/kernel/bpf/stackmap.c
>> +++ b/kernel/bpf/stackmap.c
>> @@ -294,8 +294,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
>>          if (max_depth > sysctl_perf_event_max_stack)
>>                  max_depth = sysctl_perf_event_max_stack;
>>
>> -       trace = get_perf_callchain(regs, kernel, user, max_depth,
>> -                                  false, false);
>> +       trace = get_perf_callchain(regs, kernel, user, max_depth, false);
>>
>>          if (unlikely(!trace))
>>                  /* couldn't fetch the stack trace */
>> @@ -421,7 +420,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
>>                  trace = get_callchain_entry_for_task(task, max_depth);
>>          else
>>                  trace = get_perf_callchain(regs, kernel, user, max_depth,
>> -                                          false, false);
>> +                                          false);
>>          if (unlikely(!trace))
>>                  goto err_fault;
>>
>> diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
>> index 1e135195250c..aa5f9d11c28d 100644
>> --- a/kernel/events/callchain.c
>> +++ b/kernel/events/callchain.c
>> @@ -178,7 +178,7 @@ put_callchain_entry(int rctx)
>>
>>   struct perf_callchain_entry *
>>   get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>> -                  u32 max_stack, bool crosstask, bool add_mark)
>> +                  u32 max_stack, bool add_mark)
>>   {
>>          struct perf_callchain_entry *entry;
>>          struct perf_callchain_entry_ctx ctx;
>> @@ -209,9 +209,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>>                  }
>>
>>                  if (regs) {
>> -                       if (crosstask)
>> -                               goto exit_put;
>> -
>>                          if (add_mark)
>>                                  perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
>>
>> @@ -219,7 +216,6 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
>>                  }
>>          }
>>
>> -exit_put:
>>          put_callchain_entry(rctx);
>>
>>          return entry;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index b0d62df7df4e..5e41a3b70bcd 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -7592,16 +7592,16 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
>>   {
>>          bool kernel = !event->attr.exclude_callchain_kernel;
>>          bool user   = !event->attr.exclude_callchain_user;
>> -       /* Disallow cross-task user callchains. */
>> -       bool crosstask = event->ctx->task && event->ctx->task != current;
>>          const u32 max_stack = event->attr.sample_max_stack;
>>          struct perf_callchain_entry *callchain;
>>
>> +       /* Disallow cross-task user callchains. */
>> +       user &= !event->ctx->task || event->ctx->task == current;
>> +
>>          if (!kernel && !user)
>>                  return &__empty_callchain;
>>
>> -       callchain = get_perf_callchain(regs, kernel, user,
>> -                                      max_stack, crosstask, true);
>> +       callchain = get_perf_callchain(regs, kernel, user, max_stack, true);
>>          return callchain ?: &__empty_callchain;
>>   }
>>
>> --
>> 2.41.0
>>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 04/10] perf: Introduce deferred user callchains
  2023-11-11 18:49     ` Josh Poimboeuf
  2023-11-11 18:54       ` Josh Poimboeuf
@ 2023-11-13 16:56       ` Namhyung Kim
  2023-11-13 17:21         ` Peter Zijlstra
  2023-11-15 16:13         ` Namhyung Kim
  1 sibling, 2 replies; 30+ messages in thread
From: Namhyung Kim @ 2023-11-13 16:56 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Indu Bhagat,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Sat, Nov 11, 2023 at 10:49 AM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Fri, Nov 10, 2023 at 10:57:58PM -0800, Namhyung Kim wrote:
> > On Wed, Nov 8, 2023 at 4:43 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > > +++ b/kernel/bpf/stackmap.c
> > > @@ -294,8 +294,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
> > >         if (max_depth > sysctl_perf_event_max_stack)
> > >                 max_depth = sysctl_perf_event_max_stack;
> > >
> > > -       trace = get_perf_callchain(regs, kernel, user, max_depth, false);
> > > -
> > > +       trace = get_perf_callchain(regs, kernel, user, max_depth, false, false);
> > >         if (unlikely(!trace))
> > >                 /* couldn't fetch the stack trace */
> > >                 return -EFAULT;
> > > @@ -420,7 +419,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
> > >                 trace = get_callchain_entry_for_task(task, max_depth);
> > >         else
> > >                 trace = get_perf_callchain(regs, kernel, user, max_depth,
> > > -                                          false);
> > > +                                          false, false);
> >
> > Hmm.. BPF is not supported.  Any plans?
>
> I'm not sure whether this feature makes sense for BPF.  If the BPF
> program runs in atomic context then it would have to defer the unwind
> until right before exiting to user space.  But then the program is no
> longer running.

Ok, then BPF gets no user call stacks even with sframes.

>
> > > +++ b/kernel/events/callchain.c
> > > @@ -178,7 +178,7 @@ put_callchain_entry(int rctx)
> > >
> > >  struct perf_callchain_entry *
> > >  get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> > > -                  u32 max_stack, bool add_mark)
> > > +                  u32 max_stack, bool add_mark, bool defer_user)
> > >  {
> > >         struct perf_callchain_entry *entry;
> > >         struct perf_callchain_entry_ctx ctx;
> > > @@ -207,6 +207,11 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
> > >                         regs = task_pt_regs(current);
> > >                 }
> > >
> > > +               if (defer_user) {
> > > +                       perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
> > > +                       goto exit_put;
> >
> > Hmm.. ok.  Then perf tools now need to stitch together the call stacks
> > from two (or more?) samples.
>
> Yes.  And it could be more than two samples if there were multiple NMIs
> before returning to user space.  In that case there would be a single
> user sample which needs to be stitched to each of the kernel samples.

Ok, but ...

>
> > > +               }
> > > +
> > >                 if (add_mark)
> > >                         perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
> > >
> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index 5e41a3b70bcd..290e06b0071c 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -6751,6 +6751,12 @@ static void perf_pending_irq(struct irq_work *entry)
> > >         struct perf_event *event = container_of(entry, struct perf_event, pending_irq);
> > >         int rctx;
> > >
> > > +       if (!is_software_event(event)) {
> > > +               if (event->pending_unwind)
> > > +                       task_work_add(current, &event->pending_task, TWA_RESUME);
> >
> > I'm not familiar with this code.  Is it possible that the current task
> > is preempted before returning to user space (and runs
> > perf_pending_task_unwind)?
>
> Yes, the task work (perf_pending_task_unwind) is preemptible.

If the task is preempted, the call stack would be from another
task (if it also has the pending call stacks) and we need to
check which user call stack matches which kernel call stack.
So there's no guarantee we can just use adjacent samples.

>
> > > +static void perf_pending_task_unwind(struct perf_event *event)
> > > +{
> > > +       struct pt_regs *regs = task_pt_regs(current);
> > > +       struct perf_output_handle handle;
> > > +       struct perf_event_header header;
> > > +       struct perf_sample_data data;
> > > +       struct perf_callchain_entry *callchain;
> > > +
> > > +       callchain = kmalloc(sizeof(struct perf_callchain_entry) +
> > > +                           (sizeof(__u64) * event->attr.sample_max_stack) +
> > > +                           (sizeof(__u64) * 1) /* one context */,
> > > +                           GFP_KERNEL);
> >
> > Any chance it can reuse get_perf_callchain() instead of
> > allocating the callchains every time?
>
> I don't think so, because if it gets preempted, the new task might also
> need to do an unwind.  But there's only one task-context callchain per
> CPU.
>
> The allocation is less than ideal.  I could just allocate it on the
> stack, and keep the number of entries bounded to 128 or so.

Yeah, probably that's better.

>
> > > +       if (!callchain)
> > > +               return;
> > > +
> > > +       callchain->nr = 0;
> > > +       data.callchain = callchain;
> > > +
> > > +       perf_sample_data_init(&data, 0, event->hw.last_period);
> >
> > It would double count the period for a sample.
> > I guess we want to set the period to 0.
>
> Ok.
>
> > > +
> > > +       data.deferred = true;
> > > +
> > > +       perf_prepare_sample(&data, event, regs);
> >
> > I don't think this would work properly.  Not to mention it duplicates
> > sample data unnecessarily, and some data needs special handling to
> > avoid an inefficient (full) data copy, e.g. for the (user) stack, regs and aux.
>
> Yeah.  I tried sending only the sample ID and callchain, but perf tool
> didn't appreciate that ;-) So for the RFC I gave up and did this.

Right, it should have some compatible sample ID header fields.
It's dynamic and depends on the attr but at least it should have a
PID to match callchains.

>
> > Anyway I'm not sure it can support these additional samples for
> > deferred callchains without breaking the existing perf tools.
> > For one thing, they don't know about PERF_CONTEXT_USER_DEFERRED.
> > I think this should be controlled by a new feature bit in the
> > perf_event_attr.
> >
> > Then we might add a breaking change to have a special
> > sample record for the deferred callchains and sample ID only.
>
> Sounds like a good idea.  I'll need to study the code to figure out how
> to do that on the perf tool side.  Or would you care to write a patch?

Sure, I'd be happy to write one.

>
> > > -struct perf_callchain_entry *
> > > -perf_callchain(struct perf_event *event, struct pt_regs *regs)
> > > +void perf_callchain(struct perf_sample_data *data, struct perf_event *event,
> > > +                   struct pt_regs *regs)
> > >  {
> > >         bool kernel = !event->attr.exclude_callchain_kernel;
> > >         bool user   = !event->attr.exclude_callchain_user;
> > >         const u32 max_stack = event->attr.sample_max_stack;
> > > -       struct perf_callchain_entry *callchain;
> > > +       bool defer_user = IS_ENABLED(CONFIG_HAVE_PERF_CALLCHAIN_DEFERRED);
> >
> > Is it always enabled depending on the kernel config?
> > It could be controlled by event->attr.something..
>
> Possibly, depending on what y'all think.  Also, fp vs sframe could be an
> attr (though sframe implies deferred).
>
> Thanks for the review!

Thanks for your work,
Namhyung

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 04/10] perf: Introduce deferred user callchains
  2023-11-13 16:56       ` Namhyung Kim
@ 2023-11-13 17:21         ` Peter Zijlstra
  2023-11-13 17:48           ` Namhyung Kim
  2023-11-15 16:13         ` Namhyung Kim
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-11-13 17:21 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Josh Poimboeuf, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Indu Bhagat,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Mon, Nov 13, 2023 at 08:56:39AM -0800, Namhyung Kim wrote:

> Ok, then BPF gets no user call stacks even with sframes.

Well, you could obviously run another BPF program at return to user and
stitch there.

> Ok, but ...

> If the task is preempted, the call stack would be from another
> task (if it also has the pending call stacks) and we need to
> check which user call stack matches which kernel call stack.
> So there's no guarantee we can just use adjacent samples.

So upon requesting a user backtrace for the task it could request a
token (from eg a global atomic u64 in the most crude case) and place
this in the task_struct. The backtrace will emit this as a forward
reference for the user trace.

Once the task_work generates the user stacktrace, it will again dump
this token along with the user unwind, after which it will reset the
token for this task (to 0).
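
Crude sketch of that (field and variable names made up):

	static atomic64_t unwind_cookie_gen;

	/* NMI: kernel half of the sample */
	u64 cookie = current->unwind_cookie;	/* new task_struct field */
	if (!cookie) {
		cookie = atomic64_inc_return(&unwind_cookie_gen);
		current->unwind_cookie = cookie;
	}
	/* emit 'cookie' in the kernel sample as a forward reference */

	/* task_work, on return to user space: unwind the user stack,
	 * emit it together with 'cookie', then: */
	current->unwind_cookie = 0;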


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 04/10] perf: Introduce deferred user callchains
  2023-11-13 17:21         ` Peter Zijlstra
@ 2023-11-13 17:48           ` Namhyung Kim
  2023-11-13 18:49             ` Peter Zijlstra
  0 siblings, 1 reply; 30+ messages in thread
From: Namhyung Kim @ 2023-11-13 17:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Poimboeuf, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Indu Bhagat,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Mon, Nov 13, 2023 at 9:21 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 13, 2023 at 08:56:39AM -0800, Namhyung Kim wrote:
>
> > Ok, then BPF gets no user call stacks even with sframes.
>
> Well, you could obviously run another BPF program at return to user and
> stitch there.

Right, maybe some more work is needed for bpf_get_stackid(),
but it seems possible.

>
> > Ok, but ...
>
> > If the task is preempted, the call stack would be from another
> > task (if it also has the pending call stacks) and we need to
> > check which user call stack matches which kernel call stack.
> > So there's no guarantee we can just use adjacent samples.
>
> So upon requesting a user backtrace for the task it could request a
> token (from eg a global atomic u64 in the most crude case) and place
> this in the task_struct. The backtrace will emit this as a forward
> reference for the user trace.
>
> Once the task_work generates the user stacktrace, it will again dump
> this token along with the user unwind, after which it will reset the
> token for this task (to 0).

Yeah, I thought something like this first, but then I thought
"can we just use PID for this?"

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 04/10] perf: Introduce deferred user callchains
  2023-11-13 17:48           ` Namhyung Kim
@ 2023-11-13 18:49             ` Peter Zijlstra
  2023-11-13 19:16               ` Namhyung Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2023-11-13 18:49 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Josh Poimboeuf, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Indu Bhagat,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Mon, Nov 13, 2023 at 09:48:32AM -0800, Namhyung Kim wrote:

> Yeah, I thought something like this first, but then I thought
> "can we just use PID for this?"

TID, and assuming things are otherwise time ordered, yes.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 04/10] perf: Introduce deferred user callchains
  2023-11-13 18:49             ` Peter Zijlstra
@ 2023-11-13 19:16               ` Namhyung Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Namhyung Kim @ 2023-11-13 19:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Josh Poimboeuf, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Indu Bhagat,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Mon, Nov 13, 2023 at 10:50 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 13, 2023 at 09:48:32AM -0800, Namhyung Kim wrote:
>
> > Yeah, I thought something like this first, but then I thought
> > "can we just use PID for this?"
>
> TID, and assuming things are otherwise time ordered, yes.

Right, I meant that, not TGID.

At least, the perf tools handle events in time order.

Thanks,
Namhyung

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 04/10] perf: Introduce deferred user callchains
  2023-11-13 16:56       ` Namhyung Kim
  2023-11-13 17:21         ` Peter Zijlstra
@ 2023-11-15 16:13         ` Namhyung Kim
  2023-11-20 14:03           ` Peter Zijlstra
  1 sibling, 1 reply; 30+ messages in thread
From: Namhyung Kim @ 2023-11-15 16:13 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Indu Bhagat,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Mon, Nov 13, 2023 at 08:56:39AM -0800, Namhyung Kim wrote:
> On Sat, Nov 11, 2023 at 10:49 AM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> >
> > On Fri, Nov 10, 2023 at 10:57:58PM -0800, Namhyung Kim wrote:
> > > Anyway I'm not sure it can support these additional samples for
> > > deferred callchains without breaking the existing perf tools.
> > > For one thing, they don't know about PERF_CONTEXT_USER_DEFERRED.
> > > I think this should be controlled by a new feature bit in the
> > > perf_event_attr.
> > >
> > > Then we might add a breaking change to have a special
> > > sample record for the deferred callchains and sample ID only.
> >
> > Sounds like a good idea.  I'll need to study the code to figure out how
> > to do that on the perf tool side.  Or would you care to write a patch?
> 
> Sure, I'd be happy to write one.

I think we can start with something like the below.
To enable defer_callchain, the sample id (attr.sample_type)
should have IDENTIFIER | TID | TIME, in order to match
sample and callchain records.
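
For example, a tool would then open the event along these lines
(sketch, using the defer_callchain bit proposed below):

	struct perf_event_attr attr = {
		.size		  = sizeof(attr),
		.sample_type	  = PERF_SAMPLE_IDENTIFIER | PERF_SAMPLE_TID |
				    PERF_SAMPLE_TIME | PERF_SAMPLE_CALLCHAIN,
		.sample_max_stack = 127,
		.defer_callchain  = 1,
	};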

Thanks,
Namhyung


---8<---
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 39c6a250dd1b..a3765ff59798 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -456,7 +456,8 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				defer_callchain:  1, /* generate DEFERRED_CALLCHAINS records for userspace */
+				__reserved_1   : 25;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -1207,6 +1208,20 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX_OUTPUT_HW_ID		= 21,
 
+	/*
+	 * Deferred user stack callchains (for SFrame).  Previous samples would
+	 * have kernel callchains only and they need to be stitched with this
+	 * to make full callchains.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				nr;
+	 *	u64				ips[nr];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_DEFERRED_CALLCHAINS		= 22,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH RFC 04/10] perf: Introduce deferred user callchains
  2023-11-15 16:13         ` Namhyung Kim
@ 2023-11-20 14:03           ` Peter Zijlstra
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2023-11-20 14:03 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Josh Poimboeuf, Steven Rostedt, Ingo Molnar,
	Arnaldo Carvalho de Melo, linux-kernel, x86, Indu Bhagat,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers,
	Adrian Hunter, linux-perf-users, Mark Brown, linux-toolchains

On Wed, Nov 15, 2023 at 08:13:31AM -0800, Namhyung Kim wrote:

> ---8<---
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 39c6a250dd1b..a3765ff59798 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -456,7 +456,8 @@ struct perf_event_attr {
>  				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
>  				remove_on_exec :  1, /* event is removed from task on exec */
>  				sigtrap        :  1, /* send synchronous SIGTRAP on event */
> -				__reserved_1   : 26;
> +				defer_callchain:  1, /* generate DEFERRED_CALLCHAINS records for userspace */
> +				__reserved_1   : 25;
>  
>  	union {
>  		__u32		wakeup_events;	  /* wakeup every n events */
> @@ -1207,6 +1208,20 @@ enum perf_event_type {
>  	 */
>  	PERF_RECORD_AUX_OUTPUT_HW_ID		= 21,
>  
> +	/*
> +	 * Deferred user stack callchains (for SFrame).  Previous samples would

Possibly also useful for ShadowStack based unwinders. And by virtue of
it possibly saving work when multiple consecutive samples hit
the same kernel section, for everything.

> +	 * have kernel callchains only and they need to be stitched with this
> +	 * to make full callchains.
> +	 *
> +	 * struct {
> +	 *	struct perf_event_header	header;
> +	 *	u64				nr;
> +	 *	u64				ips[nr];
> +	 *	struct sample_id		sample_id;
> +	 * };
> +	 */
> +	PERF_RECORD_DEFERRED_CALLCHAINS		= 22,
> +
>  	PERF_RECORD_MAX,			/* non-ABI */
>  };
>  

Anyway, yeah, that should do I suppose.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2023-11-20 14:04 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-09  0:41 [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
2023-11-09  0:41 ` [PATCH RFC 01/10] perf: Remove get_perf_callchain() 'init_nr' argument Josh Poimboeuf
2023-11-11  6:09   ` Namhyung Kim
2023-11-09  0:41 ` [PATCH RFC 02/10] perf: Remove get_perf_callchain() 'crosstask' argument Josh Poimboeuf
2023-11-11  6:11   ` Namhyung Kim
2023-11-11 20:53     ` Jordan Rome
2023-11-09  0:41 ` [PATCH RFC 03/10] perf: Simplify get_perf_callchain() user logic Josh Poimboeuf
2023-11-11  6:11   ` Namhyung Kim
2023-11-09  0:41 ` [PATCH RFC 04/10] perf: Introduce deferred user callchains Josh Poimboeuf
2023-11-11  6:57   ` Namhyung Kim
2023-11-11 18:49     ` Josh Poimboeuf
2023-11-11 18:54       ` Josh Poimboeuf
2023-11-13 16:56       ` Namhyung Kim
2023-11-13 17:21         ` Peter Zijlstra
2023-11-13 17:48           ` Namhyung Kim
2023-11-13 18:49             ` Peter Zijlstra
2023-11-13 19:16               ` Namhyung Kim
2023-11-15 16:13         ` Namhyung Kim
2023-11-20 14:03           ` Peter Zijlstra
2023-11-09  0:41 ` [PATCH RFC 05/10] perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED Josh Poimboeuf
2023-11-09  0:41 ` [PATCH RFC 06/10] unwind: Introduce generic user space unwinding interfaces Josh Poimboeuf
2023-11-09  0:41 ` [PATCH RFC 07/10] unwind/x86: Add HAVE_USER_UNWIND Josh Poimboeuf
2023-11-09  0:41 ` [PATCH RFC 08/10] perf/x86: Use user_unwind interface Josh Poimboeuf
2023-11-09  0:41 ` [PATCH RFC 09/10] unwind: Introduce SFrame user space unwinding Josh Poimboeuf
2023-11-09 19:31   ` Indu Bhagat
2023-11-09 19:37     ` Josh Poimboeuf
2023-11-09 19:49       ` Steven Rostedt
2023-11-09 19:53         ` Josh Poimboeuf
2023-11-09  0:41 ` [PATCH RFC 10/10] unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME Josh Poimboeuf
2023-11-09  0:45 ` [PATCH RFC 00/10] perf: user space sframe unwinding Josh Poimboeuf
