linux-toolchains.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [POC 0/5] SFrame based stack tracer for user space in the kernel
@ 2023-05-01 20:04 Indu Bhagat
  2023-05-01 20:04 ` [POC 1/5] Kconfig: x86: Add new config options for userspace unwinder Indu Bhagat
                   ` (5 more replies)
  0 siblings, 6 replies; 32+ messages in thread
From: Indu Bhagat @ 2023-05-01 20:04 UTC (permalink / raw)
  To: linux-toolchains
  Cc: daandemeyer, andrii, rostedt, kris.van.hees, elena.zannoni,
	nick.alcock, Indu Bhagat

Hello,

This patch set is a Proof of Concept implementation for an SFrame-based
stack tracer for user space in the kernel. Some of you had expressed interest
in exploring this earlier; hopefully, this POC helps discuss the design and
take it forward.

Motivation
==========
Generating stack traces is vital for all profiling, tracing and debugging
tools. In context of generating stack traces for user space, frame-pointer
based unwinding works, but has its issues ([1],[2]).  EH_Frame based
unwinding seems undesirable for kernel's unwinding needs ([3],[4]). 
In general, EH_Frame based unwinding is undesirable in applications that need
fast, real-time stack tracers (e.g., profilers), because of the overhead of
interpreting and executing DWARF opcodes to calculate the relevant stack
offsets.

SFrame (Simple Frame) stack trace format is designed to address these concerns.
With this POC, we would like to see how to use SFrame as a viable alternative
for user space stack tracing needs in the kernel.

[1] https://lwn.net/Articles/919940/
[2] https://pagure.io/fesco/issue/2817
[3] https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/OOJDAKTJB5WGMOZRXTUX7FTPFBF3H7WE/#NXRMNKD4B23HX7U5ICMKFRZO6Z3VXQXL
[4] https://lkml.org/lkml/2012/2/10/356

What is SFrame format
=====================

SFrame is the "Simple Frame" stack trace format.  The format is documented as
part of the binutils documentation at https://sourceware.org/binutils/docs.

Starting with binutils 2.40, the GNU assembler (as) can generate SFrame stack
trace data based on the CFI directives found in the source assembly.  This is
achieved by using the --gsframe command line option when invoking the
assembler.  This option plays the same role as the existing --gdwarf-[2345]
options, only this time referring to SFrame.  The resulting stack tracing
information is stored in a new segment of its own with type PT_GNU_SFRAME,
containing a section named '.sframe'.

Also starting with binutils 2.40, the GNU linker (ld) knows how to merge
sections containing SFrame stack trace info.

SFrame based user space stack tracer POC
========================================
These patches implement a POC for an SFrame based user space stack tracer (for
x86) in the kernel.  The purpose of this code is to serve as a reference,
initiate discussions, and perhaps serve as a starting point for a viable
implementation of an SFrame based stack tracer.  Please keep in mind that my
familiarity with with kernel code/processes/conventions is still limited ;-).

High-level Design in this POC
=============================
Kconfig adds two config options for userspace unwinding
  - config USER_UNWINDER_SFRAME to enable the SFrame userspace unwinder
  - config USER_UNWINDER_FRAME_POINTER to enable the Frame Pointer userspace
    unwinder

If CONFIG_USER_UNWINDER_SFRAME is set, the task_struct keeps a reference to
the sframe_state object for the task.

For long running user programs, it makes sense to cache the sframe_state
in the task and be able to simply do a quick do_sframe_unwind() at every
unwind request.  Caching the sframe_state also means keeping the .sframe
pages (for the prog and its DSOs) pinned.  The task's sframe_state is
kmalloc'ed and initialized in load_elf_binary, when the task is close to begin
execution.  The (open) issue with this design, however, remains that we need to
detect when additional DSOs are brought in at run-time by the application.

The detection (and resolution) of stale sframe_state is not implemented in this
POC.  As such, the POC at this time is fit only for applications that are
statically linked.

Following pseudo code roughly describe the relevant stubs around how the 
SFrame-based unwinder is currently hooked.

load_elf_binary()
{
  ...
  // check if any phdr.p_type with PT_GNU_SFRAME is seen
  if phdr.p_type == PT_GNU_SFRAME is seen
  sframe_avail = true
  ...

  if sframe_avail
    sframe_state_setup() // does all kmallocs and get_user_pages_XX
  ...
  finalize_exec (bprm)
}


perf_callchain_user()
{
   ...
   // check if task.sframe_state is valid
   sframe_avail = check_sframe_state_p (current);
   
   pagefault_disable()
   
   // check if task.sframe_state is ready and not stale
   if sframe_avail && task.sframe_state is ready
       ret = sframe_callchain_user() // uses __get_user to access stack
       if ret is success
           pagefault_enable()
           return
   
   ...
   Frame pointer based unwinding
   pagefault_enable()
   ...
}

tast_struct.sframe_state is cleaned up in release_task().

What do you think about the above workflow ?

What about caching the sframe_state in task_struct? As you see, there are
some open issues around this, and discussion is needed to help resolve
some of those.

Apart from the above design points, other reasons why this remains a POC and
not ready for submission are:

  - Code deals with only Elf64_Phdr (no Elf32_Phdr) at this time; some specific
  cases like when ELF hdr's e_phnum is equal to PN_XNUM are not handled yet
  (iterate_phdr.c).
  - Missing detection of when there is a change in the memory mappings of a
  task.  E.g., dlopen/dlclose are two of the possibilities using which a user
  program's mappings may have changed over time.
  - Code stubs around user space memory access by the kernel.  For sake of
  clarity, let me outline here the three locations where user space memory is
  accessed in context of SFrame based unwinding:
    1. Access the ELF header in iterate_phdr(), followed by accessing the ELF
       PHDRs in add_sframe_unwind_info(). This is currently using
       get_user_pages_remote() in iterate_phdr().
    2. Access the .sframe section for decoding in sframe_unw_info_init_dctx().
       This is currently done by using get_user_pages_unlocked()
    3. Access the program's execution stack in sframe_unwind_next_frame() to
       read, say the caller's IP on x86_64.  This is currently done by using
       __get_user().
  - Other stubs marked with FIXME TODO,
  - The patches may not be bisectable.  I haven't particularly tried to
    compile them individually either.
  - More testing, including checking out some regression tests.

Each commit log has further details.

Testing Notes
==============
I have tested these patches minimally using:
  1. perf on kernel master
  2. BPF uprobe on kernel master
  3. dtrace with dtrace-linux-kernel v2/6.1.8 
  (https://github.com/oracle/dtrace-linux-kernel/tree/v2/6.1.8).  This
  diff between the v2/6.1.8 branch and the Linux 6.1.8 is the few patches for
  CTF/DTrace. dtrace is a tracing tool that can be used to diagnose problems
  and probe a running linux system. For the following experiment, I used 
  unchanged dtrace packages. The dtrace  command line used is:
      dtrace -c prog -n 'pid$target::func:entry { ustack (); exit(0); }'
  This triggers a ustack() action when the said function 'func' in program
  'prog' is entered.  It gives the user stack then exits. 

  The dtrace ustack() action internally invokes the perf_callchain_user(). The
  latter is updated in the POC patch set to perform SFrame based stack tracing
  for user space.  DTrace uses BPF under the hood, but testing both DTrace and
  BPF individually has been valuable overall.

All binaries below were compiled with -Wa,--gsframe. A few tests to showcase
the POC are given below.

TEST 1: Toy hello world program with the call chain as follows:
main() -> foo() -> bar() -> baz()

$ cat deep_hello_sframe.c

#include <stdio.h>
#include <stdlib.h>

int baz (int a)
{
    return a * rand () + 100;
}

int bar (int a)
{
    int c = baz (a);
    return c * a * rand ();
}

int foo (int a)
{
    int b = bar (a);
    return b * a * rand ();
}

void main (void)
{
    int a = 100;
    int b = foo (a);
    printf ("Hello world %d \n", b);
}

$ dtrace -c ./deep_hello_sframe -n \
    'pid$target::baz:entry { ustack (); exit(0); }'
DTrace 2.0.0 [Pre-Release with limited functionality]
dtrace: description 'pid$target::baz:entry ' matched 1 probe
...
CPU     ID                    FUNCTION:NAME
  1 114215                        baz:entry
              deep_hello_sframe`baz
              deep_hello_sframe`bar+0x16
              deep_hello_sframe`foo+0x16
              deep_hello_sframe`main+0x19

$ perf probe -x ./deep_hello_sframe --add baz
$ perf record -g -e probe_deep_hello_sframe:baz ./deep_hello_sframe
$ perf script
deep_hello_sfra 25887 [000] 125196.580149: probe_deep_hello_sframe:baz:
(401136)
                    1136 baz+0x0  (/<TESTPATH>/deep_hello_sframe)
                    1165 bar+0x16 (/<TESTPATH>/deep_hello_sframe)
                    1195 foo+0x16 (/<TESTPATH>/deep_hello_sframe)
                    11c8 main+0x19 (/<TESTPATH>/deep_hello_sframe)


$ perf report --call-graph --stdio
   100.00%   100.00%  (401146)
                 |
		 ---main
	            foo
	            bar
	            baz

TEST 2: Using a BPF program target.c, get stacktrace using BPF bpf_get_stack
helper in bpf-uprobe.c.  I am skipping the BPF program for brevity.

$ cat target.c
#include <stdio.h>
#include <unistd.h>

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>        /* open */

int fd;

int foo9(int x) {
  write(fd, &x, sizeof(x));
  return x  ^ 1;
}
int foo8(int x) { return foo9(x) ^ 1; }
int foo7(int x) { return foo8(x) ^ 1; }
int foo6(int x) { return foo7(x) ^ 1; }
int foo5(int x) { return foo6(x) ^ 1; }
int foo4(int x) { return foo5(x) ^ 1; }
int foo3(int x) { return foo4(x) ^ 1; }
int foo2(int x) { return foo3(x) ^ 1; }
int foo1(int x) { return foo2(x) ^ 1; }
int foo0(int x) { return foo1(x) ^ 1; }

int main(int c, char **v) {
  int x = 0;

  fd = open("/dev/null", O_WRONLY);
  if (fd == -1) {
    printf("open failed\n");
    return 1;
  }

  while ((x = foo0(x)) < 10) ;

  close(fd);
  return 0;
}

$ gcc -Wa,--gsframe -o target.sframe target.c
$ #offset=getoffset_of_foo9_in_target - baseloadaddress_in_target
$ echo "p:ibhagat/myuprobe $path_to_target:$offset" >> /sys/kernel/debug/tracing/uprobe_events
$ ./target &
$ #target_pid=`pgrep target.sframe`
$ #event_id=`sudo cat /sys/kernel/debug/tracing/events/username/myuprobe/id`
$ gcc -DTARGET_PID=$target_pid -DEVENT_ID=$event_id -o bpf-ustack bpf-ustack.c

$ sudo ./bpf-ustack # dumps IPs of callchain
  401156 401197 4011b1 4011cb 4011e5 4011ff 401219 401233 40124d 401267 4012c3

$ grep -A 2 'call' target.sframe.s | grep -A 1 'foo' 
  401192:	e8 bf ff ff ff       	callq  401156 <foo9>
  401197:	83 f0 01             	xor    $0x1,%eax
--
  4011ac:	e8 d1 ff ff ff       	callq  401182 <foo8>
  4011b1:	83 f0 01             	xor    $0x1,%eax
--
  4011c6:	e8 d1 ff ff ff       	callq  40119c <foo7>
  4011cb:	83 f0 01             	xor    $0x1,%eax
--
  4011e0:	e8 d1 ff ff ff       	callq  4011b6 <foo6>
  4011e5:	83 f0 01             	xor    $0x1,%eax
--
  4011fa:	e8 d1 ff ff ff       	callq  4011d0 <foo5>
  4011ff:	83 f0 01             	xor    $0x1,%eax
--
  401214:	e8 d1 ff ff ff       	callq  4011ea <foo4>
  401219:	83 f0 01             	xor    $0x1,%eax
--
  40122e:	e8 d1 ff ff ff       	callq  401204 <foo3>
  401233:	83 f0 01             	xor    $0x1,%eax
--
  401248:	e8 d1 ff ff ff       	callq  40121e <foo2>
  40124d:	83 f0 01             	xor    $0x1,%eax
--
  401262:	e8 d1 ff ff ff       	callq  401238 <foo1>
  401267:	83 f0 01             	xor    $0x1,%eax
--
  4012be:	e8 8f ff ff ff       	callq  401252 <foo0>
  4012c3:	89 45 fc             	mov    %eax,-0x4(%rbp)

$ perf probe -x ./target.sframe --add foo9
$ perf record -g -e probe_target:foo9 ./target.sframe
^C
$ perf script
...
target.sframe 20395 [000] 69987.711764: probe_target:foo9: (401156)
                    1156 foo9+0x0 (<TESTPATH>/target.sframe)
                    1197 foo8+0x15 (<TESTPATH>/target.sframe)
                    11b1 foo7+0x15 (<TESTPATH>/target.sframe)
                    11cb foo6+0x15 (<TESTPATH>/target.sframe)
                    11e5 foo5+0x15 (<TESTPATH>/target.sframe)
                    11ff foo4+0x15 (<TESTPATH>/target.sframe)
                    1219 foo3+0x15 (<TESTPATH>/target.sframe)
                    1233 foo2+0x15 (<TESTPATH>/target.sframe)
                    124d foo1+0x15 (<TESTPATH>/target.sframe)
                    1267 foo0+0x15 (<TESTPATH>/target.sframe)
                    12c3 main+0x57 (<TESTPATH>/target.sframe)
...

Please take a look. Any feedback is appreciated.

Thanks,

Indu Bhagat (5):
  Kconfig: x86: Add new config options for userspace unwinder
  task_struct : add additional member for sframe state
  sframe: add new SFrame library
  sframe: add an SFrame format stack tracer
  x86_64: invoke SFrame based stack tracer for user space

 arch/arm64/include/asm/sframe_regs.h |  37 ++
 arch/x86/Kconfig.debug               |  31 ++
 arch/x86/events/core.c               |  51 +++
 arch/x86/include/asm/sframe_regs.h   |  34 ++
 fs/binfmt_elf.c                      |  39 +++
 include/linux/sched.h                |   5 +
 include/sframe/sframe_regs.h         |  11 +
 include/sframe/sframe_unwind.h       |  62 ++++
 kernel/exit.c                        |   9 +
 lib/Makefile                         |   1 +
 lib/sframe/Makefile                  |  11 +
 lib/sframe/iterate_phdr.c            | 113 ++++++
 lib/sframe/iterate_phdr.h            |  34 ++
 lib/sframe/sframe.h                  | 263 ++++++++++++++
 lib/sframe/sframe_read.c             | 498 +++++++++++++++++++++++++++
 lib/sframe/sframe_read.h             |  75 ++++
 lib/sframe/sframe_state.c            | 424 +++++++++++++++++++++++
 lib/sframe/sframe_state.h            |  80 +++++
 lib/sframe/sframe_unwind.c           | 208 +++++++++++
 19 files changed, 1986 insertions(+)
 create mode 100644 arch/arm64/include/asm/sframe_regs.h
 create mode 100644 arch/x86/include/asm/sframe_regs.h
 create mode 100644 include/sframe/sframe_regs.h
 create mode 100644 include/sframe/sframe_unwind.h
 create mode 100644 lib/sframe/Makefile
 create mode 100644 lib/sframe/iterate_phdr.c
 create mode 100644 lib/sframe/iterate_phdr.h
 create mode 100644 lib/sframe/sframe.h
 create mode 100644 lib/sframe/sframe_read.c
 create mode 100644 lib/sframe/sframe_read.h
 create mode 100644 lib/sframe/sframe_state.c
 create mode 100644 lib/sframe/sframe_state.h
 create mode 100644 lib/sframe/sframe_unwind.c

-- 
2.39.2


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [POC 1/5] Kconfig: x86: Add new config options for userspace unwinder
  2023-05-01 20:04 [POC 0/5] SFrame based stack tracer for user space in the kernel Indu Bhagat
@ 2023-05-01 20:04 ` Indu Bhagat
  2023-05-01 20:04 ` [POC 2/5] task_struct : add additional member for sframe state Indu Bhagat
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Indu Bhagat @ 2023-05-01 20:04 UTC (permalink / raw)
  To: linux-toolchains
  Cc: daandemeyer, andrii, rostedt, kris.van.hees, elena.zannoni,
	nick.alcock, Indu Bhagat

Add two config options for userspace unwinding:
  - config USER_UNWINDER_SFRAME to enable the SFrame userspace unwinder,
  - config USER_UNWINDER_FRAME_POINTER to enable the Frame Pointer
    userspace unwinder.

As users may still compile their applications without SFrame sections,
the userspace stack tracer falls back on the frame-pointer based
approach (if SFrame section is not present for the user program).  If an
SFrame section is absent for a subset of the running program (e.g. a
DSO), a best-effort SFrame section based stack trace is returned.

TODO - may be rename the existing CONFIG_UNWINDER_FRAME_POINTER etc. for
the kernel space unwinder to CONFIG_KERNEL_UNWINDER_FRAME_POINTER ? Thoughts ?

Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
---
 arch/x86/Kconfig.debug | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index c5d614d28a75..e7f928dc8d9f 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -275,3 +275,34 @@ endchoice
 config FRAME_POINTER
 	depends on !UNWINDER_ORC && !UNWINDER_GUESS
 	bool
+
+choice
+	prompt "Choose userspace unwinder"
+	default USER_UNWINDER_SFRAME if X86_64
+	default USER_UNWINDER_FRAME_POINTER if !X86_64
+	help
+	  This determines which method will be used for unwinding user stack
+	  traces.
+
+config USER_UNWINDER_SFRAME
+	bool "SFrame userspace unwinder"
+	depends on X86_64
+	help
+	  This option enables the SFrame unwinder for unwinding user stack
+	  traces.
+
+	  User programs must be built with SFrame support. If not, no SFrame
+	  section will be present in the user program binary; in such a case,
+	  the userspace unwinder defaults to frame pointer unwinding.
+
+
+config USER_UNWINDER_FRAME_POINTER
+	bool "Frame pointer userspace unwinder"
+	help
+	  This option enables the frame pointer unwinder for unwinding user
+	  stack traces.
+
+	  On some architectures, user programs must be built with
+	  -fno-omit-frame-pointer to ensure useful stack traces.
+
+endchoice
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [POC 2/5] task_struct : add additional member for sframe state
  2023-05-01 20:04 [POC 0/5] SFrame based stack tracer for user space in the kernel Indu Bhagat
  2023-05-01 20:04 ` [POC 1/5] Kconfig: x86: Add new config options for userspace unwinder Indu Bhagat
@ 2023-05-01 20:04 ` Indu Bhagat
  2023-05-01 20:04 ` [POC 3/5] sframe: add new SFrame library Indu Bhagat
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 32+ messages in thread
From: Indu Bhagat @ 2023-05-01 20:04 UTC (permalink / raw)
  To: linux-toolchains
  Cc: daandemeyer, andrii, rostedt, kris.van.hees, elena.zannoni,
	nick.alcock, Indu Bhagat

Add a new member to keep track of the SFrame sections for the current
task (program and its DSOs).

The definition of struct sframe_state is owned by the SFrame unwinder,
and added in a later commit.  Regarding the state management of the
task_struct.sframe_state:
  - Allocation and initialization is done at the task initialization
    time in the kernel.
  - Update: Not clear. We need to be able to track dlopen/dlclose, or
    additional shared libraries loaded via the dynamic linker at the
    task execution time.

Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
---
 include/linux/sched.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eed5d65b8d1f..fc0b0c720979 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -71,6 +71,7 @@ struct signal_struct;
 struct task_delay_info;
 struct task_group;
 struct user_event_mm;
+struct sframe_state;
 
 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1534,6 +1535,10 @@ struct task_struct {
 	struct user_event_mm		*user_event_mm;
 #endif
 
+#ifdef CONFIG_USER_UNWINDER_SFRAME
+	struct sframe_state		*sframe_state;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [POC 3/5] sframe: add new SFrame library
  2023-05-01 20:04 [POC 0/5] SFrame based stack tracer for user space in the kernel Indu Bhagat
  2023-05-01 20:04 ` [POC 1/5] Kconfig: x86: Add new config options for userspace unwinder Indu Bhagat
  2023-05-01 20:04 ` [POC 2/5] task_struct : add additional member for sframe state Indu Bhagat
@ 2023-05-01 20:04 ` Indu Bhagat
  2023-05-01 22:40   ` Steven Rostedt
                     ` (6 more replies)
  2023-05-01 20:04 ` [POC 4/5] sframe: add an SFrame format stack tracer Indu Bhagat
                   ` (2 subsequent siblings)
  5 siblings, 7 replies; 32+ messages in thread
From: Indu Bhagat @ 2023-05-01 20:04 UTC (permalink / raw)
  To: linux-toolchains
  Cc: daandemeyer, andrii, rostedt, kris.van.hees, elena.zannoni,
	nick.alcock, Indu Bhagat

This patch adds an implementation to read SFrame stack trace data from
a .sframe section.  Some APIs are also provided to find stack tracing
information per PC, e.g., given a PC, find the SFrame FRE.

These routines are provided in the sframe_read.h and sframe_read.c.

This implmentation is malloc-free.

Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
---
 lib/Makefile             |   1 +
 lib/sframe/Makefile      |   5 +
 lib/sframe/sframe.h      | 263 +++++++++++++++++++++
 lib/sframe/sframe_read.c | 498 +++++++++++++++++++++++++++++++++++++++
 lib/sframe/sframe_read.h |  75 ++++++
 5 files changed, 842 insertions(+)
 create mode 100644 lib/sframe/Makefile
 create mode 100644 lib/sframe/sframe.h
 create mode 100644 lib/sframe/sframe_read.c
 create mode 100644 lib/sframe/sframe_read.h

diff --git a/lib/Makefile b/lib/Makefile
index 876fcdeae34e..cb02d16dbffd 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -198,6 +198,7 @@ obj-$(CONFIG_ZSTD_COMPRESS) += zstd/
 obj-$(CONFIG_ZSTD_DECOMPRESS) += zstd/
 obj-$(CONFIG_XZ_DEC) += xz/
 obj-$(CONFIG_RAID6_PQ) += raid6/
+obj-$(CONFIG_USER_UNWINDER_SFRAME) += sframe/
 
 lib-$(CONFIG_DECOMPRESS_GZIP) += decompress_inflate.o
 lib-$(CONFIG_DECOMPRESS_BZIP2) += decompress_bunzip2.o
diff --git a/lib/sframe/Makefile b/lib/sframe/Makefile
new file mode 100644
index 000000000000..4e4291d9294f
--- /dev/null
+++ b/lib/sframe/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+##################################
+obj-$(CONFIG_USER_UNWINDER_SFRAME) += sframe_read.o \
+
+CFLAGS_sframe_read.o += -I $(srctree)/lib/sframe/
diff --git a/lib/sframe/sframe.h b/lib/sframe/sframe.h
new file mode 100644
index 000000000000..b1290e92839a
--- /dev/null
+++ b/lib/sframe/sframe.h
@@ -0,0 +1,263 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#ifndef	SFRAME_H
+#define	SFRAME_H
+
+#include <linux/types.h>
+
+/* This file contains definitions for the SFrame stack tracing format, which is
+ * documented at https://sourceware.org/binutils/docs */
+
+#define SFRAME_VERSION_1	1
+#define SFRAME_MAGIC		0xdee2
+#define SFRAME_VERSION	SFRAME_VERSION_1
+
+/* Function Descriptor Entries are sorted on PC. */
+#define SFRAME_F_FDE_SORTED	0x1
+/* Frame-pointer based stack tracing. Defined, but not set. */
+#define SFRAME_F_FRAME_POINTER 0x2
+
+#define SFRAME_CFA_FIXED_FP_INVALID 0
+#define SFRAME_CFA_FIXED_RA_INVALID 0
+
+/* Supported ABIs/Arch. */
+#define SFRAME_ABI_AARCH64_ENDIAN_BIG      1 /* AARCH64 big endian. */
+#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE   2 /* AARCH64 little endian. */
+#define SFRAME_ABI_AMD64_ENDIAN_LITTLE     3 /* AMD64 little endian. */
+
+/* SFrame FRE types. */
+#define SFRAME_FRE_TYPE_ADDR1	0
+#define SFRAME_FRE_TYPE_ADDR2	1
+#define SFRAME_FRE_TYPE_ADDR4	2
+
+/*
+ * SFrame Function Descriptor Entry types.
+ *
+ * The SFrame format has two possible representations for functions.  The
+ * choice of which type to use is made according to the instruction patterns
+ * in the relevant program stub.
+ */
+
+/* Unwinders perform a (PC >= FRE_START_ADDR) to look up a matching FRE. */
+#define SFRAME_FDE_TYPE_PCINC   0
+/*
+ * Unwinders perform a (PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK)
+ * to look up a matching FRE.  Typical usecases are pltN entries, trampolines
+ * etc.
+ */
+#define SFRAME_FDE_TYPE_PCMASK  1
+
+struct sframe_preamble
+{
+	/* Magic number (SFRAME_MAGIC). */
+	uint16_t magic;
+	/* Data format version number (SFRAME_VERSION). */
+	uint8_t version;
+	/* Various flags. */
+	uint8_t flags;
+} __packed;
+
+struct sframe_header
+{
+	struct sframe_preamble preamble;
+	/* Information about the arch (endianness) and ABI. */
+	uint8_t abi_arch;
+	/*
+	 * Offset for the Frame Pointer (FP) from CFA may be fixed for some
+	 * ABIs (e.g, in AMD64 when -fno-omit-frame-pointer is used).  When fixed,
+	 * this field specifies the fixed stack frame offset and the individual
+	 * FREs do not need to track it.  When not fixed, it is set to
+	 * SFRAME_CFA_FIXED_FP_INVALID, and the individual FREs may provide
+	 * the applicable stack frame offset, if any.
+	 */
+	int8_t cfa_fixed_fp_offset;
+	/*
+	 * Offset for the Return Address from CFA is fixed for some ABIs
+	 * (e.g., AMD64 has it as CFA-8).  When fixed, the header specifies the
+	 * fixed stack frame offset and the individual FREs do not track it.  When
+	 * not fixed, it is set to SFRAME_CFA_FIXED_RA_INVALID, and individual
+	 * FREs provide the applicable stack frame offset, if any.
+	 */
+	int8_t cfa_fixed_ra_offset;
+	/*
+	 * Number of bytes making up the auxiliary header, if any.
+	 * Some ABI/arch, in the future, may use this space for extending the
+	 * information in SFrame header.  Auxiliary header is contained in
+	 * bytes sequentially following the sframe_header.
+	 */
+	uint8_t auxhdr_len;
+	/* Number of SFrame FDEs in this SFrame section. */
+	uint32_t num_fdes;
+	/* Number of SFrame Frame Row Entries. */
+	uint32_t num_fres;
+	/* Number of bytes in the SFrame Frame Row Entry section. */
+	uint32_t fre_len;
+	/* Offset of SFrame Function Descriptor Entry section. */
+	uint32_t fdeoff;
+	/* Offset of SFrame Frame Row Entry section. */
+	uint32_t freoff;
+} __packed;
+
+#define SFRAME_V1_HDR_SIZE(sframe_hdr)	\
+  ((sizeof (struct sframe_header) + (sframe_hdr).auxhdr_len))
+
+/* Two possible keys for executable (instruction) pointers signing. */
+#define SFRAME_AARCH64_PAUTH_KEY_A    0 /* Key A. */
+#define SFRAME_AARCH64_PAUTH_KEY_B    1 /* Key B. */
+
+struct sframe_func_desc_entry
+{
+	/*
+	 * Function start address.  Encoded as a signed offset, relative to the
+	 * beginning of the current FDE.
+	 */
+	int32_t func_start_address;
+	/* Size of the function in bytes. */
+	uint32_t func_size;
+	/*
+	 * Offset of the first SFrame Frame Row Entry of the function, relative to the
+	 * beginning of the SFrame Frame Row Entry sub-section.
+	 */
+	uint32_t func_start_fre_off;
+	/* Number of frame row entries for the function. */
+	uint32_t func_num_fres;
+	/*
+	 * Additional information for deciphering the unwind information for the
+	 * function.
+	 *   - 4-bits: Identify the FRE type used for the function.
+	 *   - 1-bit: Identify the FDE type of the function - mask or inc.
+	 *   - 1-bit: PAC authorization A/B key (aarch64).
+	 *   - 2-bits: Unused.
+	 * --------------------------------------------------------------------------
+	 * |     Unused    |  PAC auth A/B key (aarch64) |  FDE type |   FRE type   |
+	 * |               |        Unused (amd64)       |           |              |
+	 * --------------------------------------------------------------------------
+	 * 8               6                             5           4              0
+	 */
+	uint8_t func_info;
+} __packed;
+
+/* Note: Set PAC auth key to SFRAME_AARCH64_PAUTH_KEY_A by default.  */
+#define SFRAME_V1_FUNC_INFO(fde_type, fre_enc_type) \
+  (((SFRAME_AARCH64_PAUTH_KEY_A & 0x1) << 5) | \
+   (((fde_type) & 0x1) << 4) | ((fre_enc_type) & 0xf))
+
+#define SFRAME_V1_FUNC_FRE_TYPE(data)	  ((data) & 0xf)
+#define SFRAME_V1_FUNC_FDE_TYPE(data)	  (((data) >> 4) & 0x1)
+#define SFRAME_V1_FUNC_PAUTH_KEY(data)	  (((data) >> 5) & 0x1)
+
+/*
+ * Size of stack frame offsets in an SFrame Frame Row Entry.  A single
+ * SFrame FRE has all offsets of the same size.  Offset size may vary
+ * across frame row entries.
+ */
+#define SFRAME_FRE_OFFSET_1B	  0
+#define SFRAME_FRE_OFFSET_2B	  1
+#define SFRAME_FRE_OFFSET_4B	  2
+
+/* An SFrame Frame Row Entry can be SP or FP based.  */
+#define SFRAME_BASE_REG_FP	0
+#define SFRAME_BASE_REG_SP	1
+
+/*
+ * The index at which a specific offset is presented in the variable length
+ * bytes of an FRE.
+ */
+#define SFRAME_FRE_CFA_OFFSET_IDX   0
+/*
+ * The RA stack offset, if present, will always be at index 1 in the variable
+ * length bytes of the FRE.
+ */
+#define SFRAME_FRE_RA_OFFSET_IDX    1
+/*
+ * The FP stack offset may appear at offset 1 or 2, depending on the ABI as RA
+ * may or may not be tracked.
+ */
+#define SFRAME_FRE_FP_OFFSET_IDX    2
+
+struct sframe_fre_info
+{
+  /* Information about
+   *   - 1 bit: base reg for CFA
+   *   - 4 bits: Number of offsets (N).  A value of upto 3 is allowed to track
+   *     all three of CFA, FP and RA (fixed implicit order).
+   *   - 2 bits: information about size of the offsets (S) in bytes.
+   *     Valid values are SFRAME_FRE_OFFSET_1B, SFRAME_FRE_OFFSET_2B,
+   *     SFRAME_FRE_OFFSET_4B
+   *   - 1 bit: Mangled RA state bit (aarch64 only).
+   * -----------------------------------------------------------------------------------
+   * | Mangled-RA (aarch64) |  Size of offsets   |   Number of offsets    |   base_reg |
+   * |  Unused (amd64)      |                    |                        |            |
+   * -----------------------------------------------------------------------------------
+   * 8                     7                    5                        1            0
+   */
+  uint8_t fre_info;
+};
+
+/* Macros to compose and decompose FRE info. */
+
+/* Note: Set mangled_ra_p to zero by default. */
+#define SFRAME_V1_FRE_INFO(base_reg_id, offset_num, offset_size) \
+  (((0 & 0x1) << 7) | (((offset_size) & 0x3) << 5) | \
+   (((offset_num) & 0xf) << 1) | ((base_reg_id) & 0x1))
+
+/* Set the mangled_ra_p bit as indicated. */
+#define SFRAME_V1_FRE_INFO_UPDATE_MANGLED_RA_P(mangled_ra_p, fre_info) \
+  ((((mangled_ra_p) & 0x1) << 7) | ((fre_info) & 0x7f))
+
+#define SFRAME_V1_FRE_CFA_BASE_REG_ID(data)	  ((data) & 0x1)
+#define SFRAME_V1_FRE_OFFSET_COUNT(data)	  (((data) >> 1) & 0xf)
+#define SFRAME_V1_FRE_OFFSET_SIZE(data)		  (((data) >> 5) & 0x3)
+#define SFRAME_V1_FRE_MANGLED_RA_P(data)	  (((data) >> 7) & 0x1)
+
+/* SFrame Frame Row Entry definitions. */
+
+/*
+ * Used when SFRAME_FRE_TYPE_ADDR1 is specified as FRE type.
+ * Upper limit of start address in sframe_frame_row_entry_addr1 if 0x100 (not
+ * inclusive).
+ */
+struct sframe_frame_row_entry_addr1
+{
+	/*
+	 * Start address of the frame row entry.  Encoded as an 1-byte unsigned
+	 * offset, relative to the start address of the function.
+	 */
+	uint8_t fre_start_ip_offset;
+	struct sframe_fre_info fre_info;
+} __packed;
+
+/*
+ * Used when SFRAME_FRE_TYPE_ADDR2 is specified as FRE type.
+ * Upper limit of start address in sframe_frame_row_entry_addr2 is 0x10000 (not
+ * inclusive).
+ */
+struct sframe_frame_row_entry_addr2
+{
+	/*
+	 * Start address of the frame row entry.  Encoded as an 2-byte unsigned
+	 * offset, relative to the start address of the function.
+	 */
+	uint16_t fre_start_ip_offset;
+	struct sframe_fre_info fre_info;
+} __packed;
+
+/*
+ * Used when SFRAME_FRE_TYPE_ADDR4 is specified as FRE type.
+ * Upper limit of start address in sframe_frame_row_entry_addr2
+ * is 0x100000000 (not inclusive).
+ */
+struct sframe_frame_row_entry_addr4
+{
+	/*
+	 * Start address of the frame row entry.  Encoded as a 4-byte unsigned
+	 * offset, relative to the start address of the function.
+	 */
+	uint32_t fre_start_ip_offset;
+	struct sframe_fre_info fre_info;
+} __packed;
+
+#endif	/* SFRAME_H */
diff --git a/lib/sframe/sframe_read.c b/lib/sframe/sframe_read.c
new file mode 100644
index 000000000000..9d6558d62c54
--- /dev/null
+++ b/lib/sframe/sframe_read.c
@@ -0,0 +1,498 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#include <linux/string.h>
+
+#include "sframe_read.h"
+
+struct sframe_sec {
+	/* SFrame header. */
+	struct sframe_header header;
+	/* SFrame Function Desc Entries. */
+	void *fdes;
+	/* SFrame Frame Row Entries. */
+	void *fres;
+	/* Number of bytes needed for SFrame FREs. */
+	uint32_t fre_nbytes;
+};
+
+static int sframe_set_errno(int *errp, int error)
+{
+	if (errp != NULL)
+		*errp = error;
+	return SFRAME_ERR;
+}
+
+static uint32_t sframe_sec_get_hdr_size(struct sframe_header *sfh)
+{
+	return SFRAME_V1_HDR_SIZE(*sfh);
+}
+
+static unsigned int sframe_fre_get_offset_count(unsigned char fre_info)
+{
+	return SFRAME_V1_FRE_OFFSET_COUNT(fre_info);
+}
+
+static unsigned int sframe_fre_get_offset_size(unsigned char fre_info)
+{
+	return SFRAME_V1_FRE_OFFSET_SIZE(fre_info);
+}
+
+static unsigned int sframe_get_fre_type(struct sframe_func_desc_entry *fdep)
+{
+	return (fdep) ? SFRAME_V1_FUNC_FRE_TYPE(fdep->func_info) : 0;
+}
+
+static unsigned int sframe_get_fde_type(struct sframe_func_desc_entry *fdep)
+{
+	return (fdep) ? SFRAME_V1_FUNC_FDE_TYPE(fdep->func_info) : 0;
+}
+
+static bool sframe_header_sanity_check_p(struct sframe_header *hp)
+{
+	unsigned char all_flags = SFRAME_F_FDE_SORTED | SFRAME_F_FRAME_POINTER;
+	/* Check that the preamble is valid. */
+	if ((hp->preamble.magic != SFRAME_MAGIC)
+	    || (hp->preamble.version != SFRAME_VERSION)
+	    || ((hp->preamble.flags | all_flags) != all_flags))
+		return false;
+
+	/* Check that the offsets are valid. */
+	if (hp->fdeoff > hp->freoff)
+		return false;
+
+	return true;
+}
+
+static bool sframe_fre_sanity_check_p(struct sframe_fre *frep)
+{
+	unsigned int offset_size, offset_cnt;
+
+	if (frep == NULL)
+		return false;
+
+	offset_size = sframe_fre_get_offset_size(frep->fre_info);
+
+	if (offset_size != SFRAME_FRE_OFFSET_1B
+	    && offset_size != SFRAME_FRE_OFFSET_2B
+	    && offset_size != SFRAME_FRE_OFFSET_4B)
+		return false;
+
+	offset_cnt = sframe_fre_get_offset_count(frep->fre_info);
+	if (offset_cnt > MAX_NUM_STACK_OFFSETS)
+		return false;
+
+	return true;
+}
+
+static int32_t sframe_get_fre_offset(struct sframe_fre *frep, uint32_t idx,
+				     int *errp)
+{
+	int offset_cnt, offset_size;
+
+	if (frep == NULL || !sframe_fre_sanity_check_p(frep))
+		return sframe_set_errno(errp, SFRAME_ERR_FRE_INVAL);
+
+	offset_cnt = sframe_fre_get_offset_count(frep->fre_info);
+	offset_size = sframe_fre_get_offset_size(frep->fre_info);
+
+	if (offset_cnt < idx + 1)
+		return sframe_set_errno(errp, SFRAME_ERR_FREOFFSET_NOPRESENT);
+
+	if (errp != NULL)
+		*errp = 0; /* Offset Valid. */
+
+	if (offset_size == SFRAME_FRE_OFFSET_1B) {
+		int8_t *stack_offsets = (int8_t *)frep->fre_offsets;
+		return stack_offsets[idx];
+	} else if (offset_size == SFRAME_FRE_OFFSET_2B) {
+		int16_t *stack_offsets = (int16_t *)frep->fre_offsets;
+		return stack_offsets[idx];
+	} else {
+		int32_t *stack_offsets = (int32_t *)frep->fre_offsets;
+		return stack_offsets[idx];
+	}
+}
+
+static struct sframe_header *sframe_sec_get_header(struct sframe_sec *sfsec)
+{
+	return sfsec ? &sfsec->header : NULL;
+}
+
+static int sframe_fre_copy(struct sframe_fre *dst,
+			   struct sframe_fre *src)
+{
+	if (dst == NULL || src == NULL)
+		return SFRAME_ERR;
+
+	memcpy(dst, src, sizeof(struct sframe_fre));
+	return 0;
+}
+
+static int sframe_decode_start_ip_offset(const char *fre_buf,
+					 uint32_t *start_ip_offset,
+					 unsigned int fre_type)
+{
+	uint32_t saddr = 0;
+
+	if (fre_type == SFRAME_FRE_TYPE_ADDR1) {
+		uint8_t *uc = (uint8_t *)fre_buf;
+		saddr = (uint32_t)*uc;
+	} else if (fre_type == SFRAME_FRE_TYPE_ADDR2) {
+		uint16_t *ust = (uint16_t *)fre_buf;
+		saddr = (uint32_t)*ust;
+	} else if (fre_type == SFRAME_FRE_TYPE_ADDR4) {
+		uint32_t *uit = (uint32_t *)fre_buf;
+		saddr = (uint32_t)*uit;
+	} else {
+		return SFRAME_ERR_INVAL;
+	}
+
+	*start_ip_offset = saddr;
+	return 0;
+}
+
+/* Get the total size in bytes for the stack offsets. */
+static size_t sframe_fre_stack_offsets_size(unsigned char fre_info)
+{
+	unsigned int offset_size, offset_cnt;
+
+	offset_size = sframe_fre_get_offset_size(fre_info);
+	offset_cnt = sframe_fre_get_offset_count(fre_info);
+
+	if (offset_size == SFRAME_FRE_OFFSET_2B
+	    || offset_size == SFRAME_FRE_OFFSET_4B)	/* 2 or 4 bytes. */
+		return (offset_cnt * (offset_size * 2));
+
+	return offset_cnt;
+}
+
+static size_t sframe_fre_get_start_address_tsize(unsigned int fre_type)
+{
+	/* Type size of the start_addr in an FRE. */
+	size_t saddr_tsize = 0;
+
+	switch (fre_type) {
+	case SFRAME_FRE_TYPE_ADDR1:
+		saddr_tsize = sizeof(uint8_t);
+		break;
+	case SFRAME_FRE_TYPE_ADDR2:
+		saddr_tsize = sizeof(uint16_t);
+		break;
+	case SFRAME_FRE_TYPE_ADDR4:
+		saddr_tsize = sizeof(uint32_t);
+		break;
+	default:
+		/* No other value is expected. */
+		break;
+	}
+	return saddr_tsize;
+}
+
+static size_t sframe_fre_vlen_size(struct sframe_fre *frep,
+				   unsigned int fre_type)
+{
+	unsigned char fre_info;
+	size_t ip_offset_tsize;
+
+	if (frep == NULL)
+		return 0;
+
+	fre_info = frep->fre_info;
+	ip_offset_tsize = sframe_fre_get_start_address_tsize(fre_type);
+
+	/*
+	 * An SFrame FRE is a variable length structure.  It includes the start
+	 * IP offset, FRE info field, and all trailing the stack offsets.
+	 */
+	return (ip_offset_tsize + sizeof(fre_info)
+		+ sframe_fre_stack_offsets_size(fre_info));
+}
+
+/*
+ * Read an SFrame FRE which starts at location FRE_BUF.  The function
+ * updates FRE_SIZE to the size of the FRE as stored in the binary format.
+ *
+ * Returns SFRAME_ERR if failure.
+ */
+static int sframe_sec_read_fre(const char *fre_buf, struct sframe_fre *frep,
+			       unsigned int fre_type, size_t *fre_size)
+{
+	void *stack_offsets;
+	size_t stack_offsets_sz;
+	size_t ip_offset_tsize;
+	size_t esz;
+
+	if (fre_buf == NULL || frep == NULL || fre_size == NULL)
+		return SFRAME_ERR_INVAL;
+
+	/* Copy over the FRE start address. */
+	sframe_decode_start_ip_offset(fre_buf, &frep->start_ip_offset,
+				      fre_type);
+
+	ip_offset_tsize = sframe_fre_get_start_address_tsize(fre_type);
+	/* PS: Note how this API works closely with SFrame binary format. */
+	frep->fre_info = *(unsigned char *)(fre_buf + ip_offset_tsize);
+
+	memset(frep->fre_offsets, 0, MAX_STACK_OFFSET_NBYTES);
+	/* Get stack offsets. */
+	stack_offsets_sz = sframe_fre_stack_offsets_size(frep->fre_info);
+	stack_offsets = ((unsigned char *)fre_buf + ip_offset_tsize
+			 + sizeof(frep->fre_info));
+	memcpy(frep->fre_offsets, stack_offsets, stack_offsets_sz);
+
+	esz = sframe_fre_vlen_size(frep, fre_type);
+	*fre_size = esz;
+
+	return 0;
+}
+
+static struct sframe_func_desc_entry *
+sframe_sec_find_fde(struct sframe_sec *sfsec, int32_t addr, int *errp)
+{
+	struct sframe_header *header;
+	struct sframe_func_desc_entry *fde;
+	int low, high, cnt;
+
+	if (sfsec == NULL) {
+		sframe_set_errno(errp, SFRAME_ERR_INVAL);
+		return NULL;
+	}
+
+	header = sframe_sec_get_header(sfsec);
+	if (header == NULL || header->num_fdes == 0 || sfsec->fdes == NULL) {
+		sframe_set_errno(errp, SFRAME_ERR_INIT_INVAL);
+		return NULL;
+	}
+	/*
+	 * Skip binary search if FDE sub-section is not sorted on PCs.  GNU ld
+	 * sorts the FDEs on start PC by default though.
+	 */
+	if ((header->preamble.flags & SFRAME_F_FDE_SORTED) == 0) {
+		sframe_set_errno(errp, SFRAME_ERR_FDE_NOTSORTED);
+		return NULL;
+	}
+
+	/* Find the FDE that may contain the addr. */
+	fde = (struct sframe_func_desc_entry *)sfsec->fdes;
+	low = 0;
+	high = header->num_fdes;
+	cnt = high;
+	while (low <= high) {
+		int mid = low + (high - low) / 2;
+
+		if (fde[mid].func_start_address == addr)
+			return fde + mid;
+
+		if (fde[mid].func_start_address < addr) {
+			if (mid == (cnt - 1))
+				return fde + (cnt - 1);
+			else if (fde[mid+1].func_start_address > addr)
+				return fde + mid;
+			low = mid + 1;
+		} else
+			high = mid - 1;
+	}
+
+	sframe_set_errno(errp, SFRAME_ERR_FDE_NOTFOUND);
+	return NULL;
+}
+
+static int8_t sframe_sec_get_fixed_fp_offset(struct sframe_sec *sfsec)
+{
+	struct sframe_header *header = sframe_sec_get_header(sfsec);
+	return header->cfa_fixed_fp_offset;
+}
+
+static int8_t sframe_sec_get_fixed_ra_offset(struct sframe_sec *sfsec)
+{
+	struct sframe_header *header = sframe_sec_get_header(sfsec);
+	return header->cfa_fixed_ra_offset;
+}
+
+size_t sframe_sec_sizeof(void)
+{
+	return sizeof (struct sframe_sec);
+}
+
+int sframe_sec_init(struct sframe_sec *sfsec, const char *sf_buf,
+		    size_t sf_size)
+{
+	char *frame_buf;
+	const struct sframe_preamble *preamble;
+	struct sframe_header *header;
+
+	if ((sf_buf == NULL) || (sf_size < sizeof(struct sframe_header)))
+		return SFRAME_ERR_INVAL;
+
+	/* Check for foreign endianness. */
+	preamble = (const struct sframe_preamble *) sf_buf;
+	if (preamble->magic != SFRAME_MAGIC)
+		return SFRAME_ERR_INVAL;
+
+	/* Reset the SFrame section object. */
+	memset(sfsec, 0, sizeof(struct sframe_sec));
+
+	frame_buf = (char *)sf_buf;
+
+	/* Initialize the reference to the SFrame header. */
+	sfsec->header = *(struct sframe_header *) frame_buf;
+	header = &sfsec->header;
+	if (!sframe_header_sanity_check_p(header))
+		return SFRAME_ERR_INVAL;
+
+	/* Initialize the referece to the SFrame FDE section. */
+	frame_buf += sframe_sec_get_hdr_size(header);
+	sfsec->fdes = frame_buf;
+
+	/* Initialize the reference to the the SFrame FRE section. */
+	frame_buf += (header->num_fdes * sizeof(struct sframe_func_desc_entry));
+	sfsec->fres = frame_buf;
+
+	sfsec->fre_nbytes = header->fre_len;
+
+	return 0;
+}
+
+/*
+ * Find the SFrame Frame Row Entry which contains the PC.
+ * Returns error code if failure.
+ */
+int sframe_sec_find_fre(struct sframe_sec *sfsec, int32_t pc,
+			struct sframe_fre *frep)
+{
+	struct sframe_func_desc_entry *fdep;
+	uint32_t start_address, i;
+	struct sframe_fre cur_fre, next_fre;
+	unsigned char *fres;
+	unsigned int fre_type, fde_type;
+	size_t esz;
+	int err = 0;
+	size_t size = 0;
+	/*
+	 * For regular FDEs(i.e. fde_type SFRAME_FDE_TYPE_PCINC),
+	 * where the start address in the FRE is an offset from start pc,
+	 * use a bitmask with all bits set so that none of the address bits are
+	 * ignored.  In this case, we need to return the FRE where
+	 * (PC >= FRE_START_ADDR).
+	 */
+	uint64_t bitmask = 0xffffffff;
+
+	if ((sfsec == NULL) || (frep == NULL))
+		return SFRAME_ERR_INVAL;
+
+	/* Find the FDE which contains the PC, then scan its FRE entries. */
+	fdep = sframe_sec_find_fde(sfsec, pc, &err);
+	if (fdep == NULL || sfsec->fres == NULL)
+		return SFRAME_ERR_INIT_INVAL;
+
+	fre_type = sframe_get_fre_type(fdep);
+	fde_type = sframe_get_fde_type(fdep);
+
+	/*
+	 * For FDEs for repetitive pattern of insns, we need to return the FRE
+	 * such that(PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK).
+	 * so, update the bitmask to the start address.
+	 */
+	/* FIXME - the bitmask. */
+	if (fde_type == SFRAME_FDE_TYPE_PCMASK)
+		bitmask = 0xff;
+
+	fres = (unsigned char *)sfsec->fres + fdep->func_start_fre_off;
+	for (i = 0; i < fdep->func_num_fres; i++) {
+		err = sframe_sec_read_fre((const char *)fres, &next_fre,
+					  fre_type, &esz);
+		start_address = next_fre.start_ip_offset;
+
+		if (((fdep->func_start_address
+		      + (int32_t)start_address) & bitmask) <= (pc & bitmask)) {
+			sframe_fre_copy(&cur_fre, &next_fre);
+
+			/* Get the next FRE in sequence. */
+			if (i < fdep->func_num_fres - 1) {
+				fres += esz;
+				err = sframe_sec_read_fre((const char *)fres,
+							  &next_fre,
+							  fre_type, &esz);
+
+				/* Sanity check the next FRE. */
+				if (!sframe_fre_sanity_check_p(&next_fre))
+					return SFRAME_ERR_FRE_INVAL;
+
+				size = next_fre.start_ip_offset;
+			} else {
+				size = fdep->func_size;
+			}
+
+			if (((fdep->func_start_address
+			      + (int32_t)size) & bitmask) > (pc & bitmask)) {
+				/* Cur FRE contains the PC, return it. */
+				sframe_fre_copy(frep, &cur_fre);
+				return 0;
+			}
+		} else {
+			return SFRAME_ERR_FRE_INVAL;
+		}
+	}
+	return SFRAME_ERR_FDE_INVAL;
+}
+
+unsigned int sframe_fre_get_base_reg_id(struct sframe_fre *frep,
+					int *errp)
+{
+	if (!frep)
+		return sframe_set_errno(errp, SFRAME_ERR_FRE_INVAL);
+
+	return SFRAME_V1_FRE_CFA_BASE_REG_ID(frep->fre_info);
+}
+
+int32_t sframe_fre_get_cfa_offset(struct sframe_sec *sfsec __always_unused,
+				  struct sframe_fre *frep, int *errp)
+{
+	return sframe_get_fre_offset(frep, SFRAME_FRE_CFA_OFFSET_IDX, errp);
+}
+
+int32_t sframe_fre_get_fp_offset(struct sframe_sec *sfsec,
+				 struct sframe_fre *frep, int *errp)
+{
+	uint32_t fp_offset_idx = 0;
+	int8_t fp_offset = sframe_sec_get_fixed_fp_offset(sfsec);
+	/*
+	 * If the FP offset is not being tracked, return the fixed FP offset
+	 * from the SFrame header.
+	 */
+	if (fp_offset != SFRAME_CFA_FIXED_FP_INVALID) {
+		*errp = 0;
+		return fp_offset;
+	}
+
+	/*
+	 * In some ABIs, the stack offset to recover RA (using the CFA) from is
+	 * fixed (like AMD64).  In such cases, the stack offset to recover FP
+	 * will appear at the second index.
+	 */
+	fp_offset_idx = ((sframe_sec_get_fixed_ra_offset(sfsec)
+			  != SFRAME_CFA_FIXED_RA_INVALID)
+			 ? SFRAME_FRE_RA_OFFSET_IDX
+			 : SFRAME_FRE_FP_OFFSET_IDX);
+	return sframe_get_fre_offset(frep, fp_offset_idx, errp);
+}
+
+int32_t sframe_fre_get_ra_offset(struct sframe_sec *sfsec,
+				 struct sframe_fre *frep, int *errp)
+{
+	int8_t ra_offset = sframe_sec_get_fixed_ra_offset(sfsec);
+	/*
+	 * If the RA offset was not being tracked, return the fixed RA offset
+	 * from the SFrame header.
+	 */
+	if (ra_offset != SFRAME_CFA_FIXED_RA_INVALID) {
+		*errp = 0;
+		return ra_offset;
+	}
+
+	/* Otherwise, get the RA offset from the FRE. */
+	return sframe_get_fre_offset(frep, SFRAME_FRE_RA_OFFSET_IDX, errp);
+}
diff --git a/lib/sframe/sframe_read.h b/lib/sframe/sframe_read.h
new file mode 100644
index 000000000000..6632fb76d8b1
--- /dev/null
+++ b/lib/sframe/sframe_read.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#ifndef SFRAME_READ_H
+#define SFRAME_READ_H
+
+#include <linux/types.h>
+
+#include "sframe.h"
+
+struct sframe_sec;
+
+#define MAX_NUM_STACK_OFFSETS 3
+
+#define MAX_STACK_OFFSET_NBYTES \
+  ((SFRAME_FRE_OFFSET_4B * 2 * MAX_NUM_STACK_OFFSETS))
+
+/*
+ * SFrame Frame Row Entry for the SFrame reader.
+ * Providing such an abstraction helps decouple stack tracer from the
+ * binary format representation of the same.
+ */
+struct sframe_fre {
+	uint32_t start_ip_offset;
+	unsigned char fre_offsets[MAX_STACK_OFFSET_NBYTES];
+	unsigned char fre_info;
+};
+
+#define SFRAME_ERR ((int) -1)
+
+/* SFrame version not supported. */
+#define SFRAME_ERR_VERSION_INVAL	(-2000)
+/* Corrupt SFrame. */
+#define SFRAME_ERR_INVAL		(-2001)
+/* SFrame Section Initialization Error. */
+#define SFRAME_ERR_INIT_INVAL		(-2002)
+/* Corrupt FDE. */
+#define SFRAME_ERR_FDE_INVAL		(-2003)
+/* Corrupt FRE. */
+#define SFRAME_ERR_FRE_INVAL		(-2004)
+/* FDE not found. */
+#define SFRAME_ERR_FDE_NOTFOUND		(-2005)
+/* FDEs not sorted. */
+#define SFRAME_ERR_FDE_NOTSORTED	(-2006)
+/* FRE not found. */
+#define SFRAME_ERR_FRE_NOTFOUND		(-2007)
+/* FRE offset not present. */
+#define SFRAME_ERR_FREOFFSET_NOPRESENT	(-2008)
+
+extern size_t sframe_sec_sizeof(void);
+
+extern int sframe_sec_init(struct sframe_sec *sfsec, const char *cf_buf,
+			   size_t cf_size);
+
+extern int sframe_sec_find_fre(struct sframe_sec *ctx, int32_t pc,
+			   struct sframe_fre *frep);
+
+extern unsigned int sframe_fre_get_base_reg_id(struct sframe_fre *fre,
+					       int *errp);
+extern int32_t sframe_fre_get_cfa_offset(struct sframe_sec *dtcx,
+					 struct sframe_fre *fre,
+					 int *errp);
+extern int32_t sframe_fre_get_fp_offset(struct sframe_sec *sfsec,
+					struct sframe_fre *fre,
+					int *errp);
+extern int32_t sframe_fre_get_ra_offset(struct sframe_sec *sfsec,
+					struct sframe_fre *fre,
+					int *errp);
+extern bool sframe_fre_get_ra_mangled_p(struct sframe_sec *sfsec,
+					struct sframe_fre *fre,
+					int *errp);
+
+#endif /* SFRAME_READ_H */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [POC 4/5] sframe: add an SFrame format stack tracer
  2023-05-01 20:04 [POC 0/5] SFrame based stack tracer for user space in the kernel Indu Bhagat
                   ` (2 preceding siblings ...)
  2023-05-01 20:04 ` [POC 3/5] sframe: add new SFrame library Indu Bhagat
@ 2023-05-01 20:04 ` Indu Bhagat
  2023-05-01 23:00   ` Steven Rostedt
                     ` (2 more replies)
  2023-05-01 20:04 ` [POC 5/5] x86_64: invoke SFrame based stack tracer for user space Indu Bhagat
  2023-05-01 22:15 ` [POC 0/5] SFrame based stack tracer for user space in the kernel Steven Rostedt
  5 siblings, 3 replies; 32+ messages in thread
From: Indu Bhagat @ 2023-05-01 20:04 UTC (permalink / raw)
  To: linux-toolchains
  Cc: daandemeyer, andrii, rostedt, kris.van.hees, elena.zannoni,
	nick.alcock, Indu Bhagat

This patch adds an SFrame format based stack tracer.

The files iterate_phdr.c, iterate_phdr.h implement a dl_iterate_phdr()
like functionality.

The SFrame format based stack tracer is implemented in the
sframe_unwind.c with architecture specific bits in the
arch/arm64/include/asm/sframe_regs.h and
arch/x86/include/asm/sframe_regs.h.  Please note that the SFrame format
is supported for x86_64 (AMD64 ABI) and aarch64 (AAPCS64 ABI) only at
this time.

The files sframe_state.[ch] implement the SFrame state management APIs.

Some aspects of the implementation are "POC like". These will need to
addressed for the implementation to become more palatable:
- dealing with only Elf64_Phdr (no Elf32_Phdr) at this time, and other
  TODOs in the iterate_phdr.c,
- detecting whether a program did a dlopen/dlclose,
- code stubs around user space memory access (.sframe section, ELF hdr
  etc.) by the kernel need careful review.

There are more aspects than above; The intention of this patch set is to
help drive the discussion on how to best incorporate an SFrame-based user
space unwinder in the kernel.

Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
---
 arch/arm64/include/asm/sframe_regs.h |  37 +++
 arch/x86/include/asm/sframe_regs.h   |  34 +++
 include/sframe/sframe_regs.h         |  11 +
 include/sframe/sframe_unwind.h       |  62 ++++
 lib/sframe/Makefile                  |   8 +-
 lib/sframe/iterate_phdr.c            | 113 +++++++
 lib/sframe/iterate_phdr.h            |  34 +++
 lib/sframe/sframe_state.c            | 424 +++++++++++++++++++++++++++
 lib/sframe/sframe_state.h            |  80 +++++
 lib/sframe/sframe_unwind.c           | 208 +++++++++++++
 10 files changed, 1010 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/include/asm/sframe_regs.h
 create mode 100644 arch/x86/include/asm/sframe_regs.h
 create mode 100644 include/sframe/sframe_regs.h
 create mode 100644 include/sframe/sframe_unwind.h
 create mode 100644 lib/sframe/iterate_phdr.c
 create mode 100644 lib/sframe/iterate_phdr.h
 create mode 100644 lib/sframe/sframe_state.c
 create mode 100644 lib/sframe/sframe_state.h
 create mode 100644 lib/sframe/sframe_unwind.c

diff --git a/arch/arm64/include/asm/sframe_regs.h b/arch/arm64/include/asm/sframe_regs.h
new file mode 100644
index 000000000000..ae9ab9d5d3c1
--- /dev/null
+++ b/arch/arm64/include/asm/sframe_regs.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#ifdef ASM_ARM64_SFRAME_REGS_H
+#define ASM_ARM64_SFRAME_REGS_H
+
+#define STACK_ACCESS_LEN 8
+
+static inline uint64_t
+get_ptregs_ip(struct pt_regs *regs)
+{
+	return regs->pc;
+}
+
+static inline uint64_t
+get_ptregs_sp(struct pt_regs *regs)
+{
+	return regs->sp;
+}
+
+static inline uint64_t
+get_ptregs_fp(struct pt_regs *regs)
+{
+#define UNWIND_AARCH64_X29              29      /* 64-bit frame pointer.  */
+	return (uint64_t)regs->regs[UNWIND_AARCH64_X29];
+}
+
+static inline uint64_t
+get_ptregs_ra(struct pt_regs *regs)
+{
+#define UNWIND_AARCH64_X30              30      /* 64-bit link pointer.  */
+	return (uint64_t)regs->regs[UNWIND_AARCH64_X30];
+}
+
+#endif /* ASM_ARM64_SFRAME_REGS_H */
diff --git a/arch/x86/include/asm/sframe_regs.h b/arch/x86/include/asm/sframe_regs.h
new file mode 100644
index 000000000000..99f84955854a
--- /dev/null
+++ b/arch/x86/include/asm/sframe_regs.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#ifndef ASM_X86_SFRAME_REGS_H
+#define ASM_X86_SFRAME_REGS_H
+
+#define STACK_ACCESS_LEN 8
+
+static inline uint64_t
+get_ptregs_ip(struct pt_regs *regs)
+{
+	return (uint64_t)regs->ip;
+}
+
+static inline uint64_t
+get_ptregs_sp(struct pt_regs *regs)
+{
+	return (uint64_t)regs->sp;
+}
+
+static inline uint64_t
+get_ptregs_fp(struct pt_regs *regs)
+{
+	return (uint64_t)regs->bp;
+}
+
+static inline uint64_t
+get_ptregs_ra(struct pt_regs *regs)
+{
+	return 0; /* SFRAME_CFA_FIXED_RA_INVALID */
+}
+#endif /* ASM_X86_SFRAME_REGS_H */
diff --git a/include/sframe/sframe_regs.h b/include/sframe/sframe_regs.h
new file mode 100644
index 000000000000..32b67f7a7c78
--- /dev/null
+++ b/include/sframe/sframe_regs.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#ifndef _SFRAME_REGS_H
+#define _SFRAME_REGS_H
+
+#include <asm/sframe_regs.h>
+
+#endif /* _SFRAME_REGS_H */
diff --git a/include/sframe/sframe_unwind.h b/include/sframe/sframe_unwind.h
new file mode 100644
index 000000000000..3e2c12816b60
--- /dev/null
+++ b/include/sframe/sframe_unwind.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#ifndef _SFRAME_UNWIND_H
+#define _SFRAME_UNWIND_H
+
+#include <linux/sched.h>
+#include <linux/perf_event.h>
+
+#define PT_GNU_SFRAME  0x6474e554
+
+/*
+ * State used for SFrame-based stack tracing for a user space task.
+ */
+struct user_unwind_state {
+	uint64_t pc, sp, fp, ra;
+	enum stack_type stype;
+	struct task_struct *task;
+	bool error;
+};
+
+/*
+ * APIs for an SFrame based stack tracer.
+ */
+
+void sframe_unwind_start(struct user_unwind_state *state,
+			 struct task_struct *task, struct pt_regs *regs);
+bool sframe_unwind_next_frame(struct user_unwind_state *state);
+uint64_t sframe_unwind_get_return_address(struct user_unwind_state *state);
+
+static inline bool sframe_unwind_done(struct user_unwind_state *state)
+{
+	return state->stype == STACK_TYPE_UNKNOWN;
+}
+
+static inline bool sframe_unwind_error(struct user_unwind_state *state)
+{
+	return state->error;
+}
+
+/*
+ * APIs to manage the SFrame state per task for stack tracing.
+ */
+
+extern struct sframe_state *unwind_sframe_state_alloc(struct task_struct *task);
+extern int unwind_sframe_state_update(struct task_struct *task);
+extern void unwind_sframe_state_cleanup(struct task_struct *task);
+
+extern bool unwind_sframe_state_valid_p(struct sframe_state *sfstate);
+extern bool unwind_sframe_state_ready_p(struct sframe_state *sftate);
+
+/*
+ * Get the callchain using SFrame unwind info for the given task.
+ */
+extern int
+sframe_callchain_user(struct task_struct *task,
+		      struct perf_callchain_entry_ctx *entry,
+		      struct pt_regs *regs);
+
+#endif /* _SFRAME_UNWIND_H */
diff --git a/lib/sframe/Makefile b/lib/sframe/Makefile
index 4e4291d9294f..5ee9e3e7ec93 100644
--- a/lib/sframe/Makefile
+++ b/lib/sframe/Makefile
@@ -1,5 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0
 ##################################
-obj-$(CONFIG_USER_UNWINDER_SFRAME) += sframe_read.o \
+obj-$(CONFIG_USER_UNWINDER_SFRAME) += iterate_phdr.o \
+				      sframe_read.o \
+				      sframe_state.o \
+				      sframe_unwind.o
 
+CFLAGS_iterate_phdr.o += -I $(srctree)/lib/sframe/ -Wno-error=declaration-after-statement
 CFLAGS_sframe_read.o += -I $(srctree)/lib/sframe/
+CFLAGS_sframe_state.o += -I $(srctree)/lib/sframe/
+CFLAGS_sframe_unwind.o += -I $(srctree)/lib/sframe/
diff --git a/lib/sframe/iterate_phdr.c b/lib/sframe/iterate_phdr.c
new file mode 100644
index 000000000000..c10d590ecc67
--- /dev/null
+++ b/lib/sframe/iterate_phdr.c
@@ -0,0 +1,113 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#include <linux/elf.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/mm_types.h>
+
+#include "iterate_phdr.h"
+
+/*
+ * Iterate over the task's memory mappings and find the ELF headers.
+ *
+ * This is expected to be called from perf_callchain_user(), so user process
+ * context is expected.
+ */
+
+int iterate_phdr(int (*callback)(struct phdr_info *info,
+				 struct task_struct *task,
+				 void *data),
+		 struct task_struct *task, void *data)
+{
+	struct mm_struct *mm;
+	struct vm_area_struct *vma_mt;
+	struct page *page;
+
+	Elf64_Ehdr *ehdr;
+	struct phdr_info phinfo;
+
+	int ret = 0, res = 0;
+	int err = 0;
+	bool first = true;
+
+	memset(&phinfo, 0, sizeof(struct phdr_info));
+
+	mm = task->mm;
+
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
+
+	mas_for_each(&mas, vma_mt, ULONG_MAX) {
+		/* ELF header has a fixed place in the file, starting at offset
+		 * zero.
+		 */
+		if (vma_mt->vm_pgoff)
+			continue;
+
+		/* For the callback to infer if its the prog or DSO we are
+		 * dealing with.
+		 */
+		phinfo.pi_prog = first;
+		first = false;
+		/* FIXME TODO
+		 *  - This code assumes 64-bit ELF by using Elf64_Ehdr.
+		 *  - Detect the case when ELF program headers to be of
+		 * size > 1 page.
+		 */
+
+		/* FIXME TODO KERNEL
+		 *  - get_user_pages_WHAT, which API.
+		 *  What flags ? Is this correct ?
+		 */
+		ret = get_user_pages_remote(mm, vma_mt->vm_start, 1, FOLL_GET,
+					    &page, &vma_mt, NULL);
+		if (ret <= 0)
+			continue;
+
+		/* The first page must have the ELF header. */
+		ehdr = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
+		if (!ehdr)
+			goto put_page;
+
+		/* Check for magic bytes to make sure this is ehdr. */
+		err = 0;
+		err |= ((ehdr->e_ident[EI_MAG0] != ELFMAG0)
+			|| (ehdr->e_ident[EI_MAG1] != ELFMAG1)
+			|| (ehdr->e_ident[EI_MAG2] != ELFMAG2)
+			|| (ehdr->e_ident[EI_MAG3] != ELFMAG3));
+		if (err)
+			goto unmap;
+
+		/*
+		 * FIXME TODO handle the case when number of program headers is
+		 * greater than or equal to PN_XNUM later.
+		 */
+		if (ehdr->e_phnum == PN_XNUM)
+			goto unmap;
+		/*
+		 * FIXME TODO handle the case when Elf phdrs span more than one
+		 * page later ?
+		 */
+		if ((sizeof(Elf64_Ehdr) + ehdr->e_phentsize * ehdr->e_phnum)
+		    > PAGE_SIZE)
+			goto unmap;
+
+		/* Save the location of program headers and the phnum. */
+		phinfo.pi_addr = vma_mt->vm_start;
+		phinfo.pi_phdr = (void *)ehdr + ehdr->e_phoff;
+		phinfo.pi_phnum = ehdr->e_phnum;
+
+		res = callback(&phinfo, task, data);
+unmap:
+		vunmap(ehdr);
+put_page:
+		put_page(page);
+
+		if (res < 0)
+			break;
+	}
+
+	return res;
+}
diff --git a/lib/sframe/iterate_phdr.h b/lib/sframe/iterate_phdr.h
new file mode 100644
index 000000000000..78e73cade579
--- /dev/null
+++ b/lib/sframe/iterate_phdr.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#ifndef ITERATE_PHDR_H_
+#define ITERATE_PHDR_H_
+
+#include <linux/sched.h>
+
+struct phdr_info {
+	/* Determine whether prog or DSO. */
+	bool pi_prog;
+	/* Base address. */
+	unsigned long pi_addr;
+	/* Reference to the ELF program headers of the object. */
+	void *pi_phdr;
+	/* Number of entries in the program header table. */
+	unsigned int pi_phnum;
+	/*
+	 * Following two fields are for optimization - keep track of any
+	 * dlopen/dlclose activity done after program startup.
+	 * FIXME TODO Currently unused.
+	 */
+	uint64_t pi_adds;
+	uint64_t pi_subs;
+};
+
+int iterate_phdr(int (*callback)(struct phdr_info *info,
+				 struct task_struct *task,
+				 void *data),
+		 struct task_struct *task, void *data);
+
+#endif /* ITERATE_PHDR_H_ */
diff --git a/lib/sframe/sframe_state.c b/lib/sframe/sframe_state.c
new file mode 100644
index 000000000000..a34f762acf42
--- /dev/null
+++ b/lib/sframe/sframe_state.c
@@ -0,0 +1,424 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#include <linux/elf.h>
+#include <linux/vmalloc.h>
+#include <sframe/sframe_unwind.h>
+
+#include "sframe_state.h"
+#include "iterate_phdr.h"
+
+#define NUM_OF_DSOS    32
+
+static int num_entries = NUM_OF_DSOS;
+
+/*
+ * Error codes for SFrame state.
+ *
+ * All condition codes less than SFRAME_UNW_INFO_OK are used to indicate
+ * an unhealthy SFrame state.
+ */
+enum {
+	SFRAME_UNW_INVAL_SFRAME = -3, /* An SFrame section is invalid. */
+	SFRAME_UNW_NO_SFSTATE = -2, /* SFrame state could not be set up. */
+	SFRAME_UNW_NO_PROG_SFRAME = -1,  /* No SFrame section in prog. */
+	SFRAME_UNW_INFO_OK = 0,
+	SFRAME_UNW_PARTIAL_INFO = 1,
+};
+
+static int sframe_unw_info_cleanup(struct sframe_unw_info *sfu_info)
+{
+	int i;
+
+	if (!sfu_info)
+		return 1;
+
+	if (sfu_info->sframe_vmap) {
+		vunmap(sfu_info->sframe_vmap);
+		sfu_info->sframe_vmap = NULL;
+	}
+
+	if (sfu_info->sframe_pages) {
+		for (i = 0; i < sfu_info->sframe_npages; i++)
+			put_page(sfu_info->sframe_pages[i]);
+		kfree(sfu_info->sframe_pages);
+		sfu_info->sframe_pages = NULL;
+	}
+
+	kfree(sfu_info->sfsec);
+	sfu_info->sfsec = NULL;
+
+	return 0;
+}
+
+static void sframe_unw_info_init(struct sframe_unw_info *sfu_info,
+				 uint64_t sframe_addr, size_t sframe_size,
+				 uint64_t text_addr, size_t text_size)
+{
+	if (!sfu_info)
+		return;
+	sfu_info->sframe_addr = sframe_addr;
+	sfu_info->sframe_size = sframe_size;
+	sfu_info->text_addr = text_addr;
+	sfu_info->text_size = text_size;
+	sfu_info->sframe_pages = NULL;
+	sfu_info->sframe_vmap = NULL;
+}
+
+/*
+ * Get the user pages containing the SFrame section and set up the SFrame
+ * section object for the stacktracer to use later.
+ */
+static int sframe_unw_info_init_sfsec(struct sframe_state *sfstate,
+				      struct sframe_unw_info *sfu_info)
+{
+	int i;
+	int err = 0;
+	struct task_struct *task;
+	struct vm_area_struct *vma;
+	struct sframe_sec *sfsec;
+	struct page **pages;
+	unsigned long npages;
+	char *sframe_vmap, *sframe_buf;
+
+	sfsec = kmalloc(sframe_sec_sizeof(), GFP_KERNEL);
+	if (!sfsec)
+		return -ENOMEM;
+	sfu_info->sfsec = sfsec;
+
+	task = sfstate->task;
+
+	vma = find_vma(task->mm, sfu_info->sframe_addr);
+	npages = vma_pages(vma);
+	pages = kmalloc((sizeof(struct page *) * npages), GFP_KERNEL);
+	if (!pages) {
+		err = -ENOMEM;
+		goto free_sfsec;
+	}
+
+#if 0
+	npages = get_user_pages_remote(task->mm, sfu_info->sframe_addr, npages,
+				       FOLL_GET, pages, &vma, NULL);
+#endif
+	npages = get_user_pages_unlocked(vma->vm_start, npages, pages,
+					 FOLL_GET);
+	if (npages <= 0)
+		goto free_page;
+
+	sfu_info->sframe_pages = pages;
+	sfu_info->sframe_npages = npages;
+
+	sframe_vmap = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
+	if (!sframe_vmap)
+		goto put_page;
+	sfu_info->sframe_vmap = sframe_vmap;
+
+	sframe_buf = sframe_vmap + (sfu_info->sframe_addr - vma->vm_start);
+	err = sframe_sec_init(sfu_info->sfsec,
+			      sframe_buf,
+			      sfu_info->sframe_size);
+
+	/*
+	 * put_page, vunmap should not be done yet as SFrame section will be
+	 * used when do_sframe_unwind ().
+	 * In the rare possibility that this is a corrupt SFrame section,
+	 * clean up the sframe_unw_info object, and signal error so the
+	 * caller.
+	 */
+	if (!err)
+		return 0;
+
+	vunmap(sframe_vmap);
+put_page:
+	for (i = 0; i < npages; i++)
+		put_page(pages[i]);
+free_page:
+	kfree(pages);
+free_sfsec:
+	kfree(sfu_info->sfsec);
+	sfu_info->sfsec = NULL;
+
+	return err;
+}
+
+static int sframe_state_unw_info_list_add(struct sframe_state *sfstate,
+					  struct sframe_unw_info *sfu_info)
+{
+	size_t realloc_size = 0;
+
+	if (sfstate->su_dsos.alloced == 0) {
+		sfstate->su_dsos.entry
+		  = kcalloc(num_entries, sizeof(struct sframe_unw_info),
+			    GFP_KERNEL);
+		if (!sfstate->su_dsos.entry)
+			return -ENOMEM;
+		sfstate->su_dsos.alloced = num_entries;
+	} else if (sfstate->su_dsos.used == sfstate->su_dsos.alloced) {
+		realloc_size = ((sfstate->su_dsos.alloced + num_entries)
+				* sizeof(struct sframe_unw_info));
+		sfstate->su_dsos.entry
+		  = krealloc(sfstate->su_dsos.entry, realloc_size, GFP_KERNEL);
+		if (!sfstate->su_dsos.entry)
+			return -ENOMEM;
+
+		memset(&sfstate->su_dsos.entry[sfstate->su_dsos.alloced], 0,
+			num_entries * sizeof(struct sframe_unw_info));
+		sfstate->su_dsos.alloced += num_entries;
+	}
+
+	sfstate->su_dsos.entry[sfstate->su_dsos.used++] = *sfu_info;
+	return 0;
+}
+
+static int sframe_state_add_unw_info(struct sframe_state *sfstate,
+				     uint64_t sframe_addr, size_t sframe_size,
+				     uint64_t text_addr, size_t text_size)
+{
+	struct sframe_unw_info *sfu_info;
+	int ret = 0;
+
+	sfu_info = kzalloc(sizeof(*sfu_info), GFP_KERNEL);
+	if (!sfu_info)
+		return -ENOMEM;
+
+	sframe_unw_info_init(sfu_info, sframe_addr, sframe_size, text_addr,
+			     text_size);
+
+	if (sframe_unw_info_init_sfsec(sfstate, sfu_info)) {
+		ret = SFRAME_UNW_INVAL_SFRAME;
+		goto end;
+	}
+
+	/* Add sframe_unw_info object for the program or its DSOs. */
+	if (!sfstate->su_prog.sframe_size)
+		memcpy(&(sfstate->su_prog), sfu_info, sizeof(*sfu_info));
+	else
+		ret = sframe_state_unw_info_list_add(sfstate, sfu_info);
+
+end:
+	kfree(sfu_info);
+
+	return ret;
+}
+
+/*
+ * Add SFrame unwind info for the given (prog or DSO) phdr_info into the SFrame
+ * state object in the task.
+ *
+ * Callback routine from iterate_phdr function.
+ *
+ * Returns 0 if success.
+ */
+static int add_sframe_unwind_info(struct phdr_info *info,
+				  struct task_struct *task,
+				  void *data)
+{
+	int err = 0;
+	int p_type;
+	struct sframe_state *sframe_state;
+	/* FIXME TODO what if its Elf32_Phdr. */
+	Elf64_Phdr *phdr = NULL, *sframe_phdr = NULL, *text_phdr = NULL;
+
+	uint64_t text_addr;
+	size_t text_size;
+	uint64_t sframe_addr;
+	size_t sframe_size;
+
+	phdr = info->pi_phdr;
+	sframe_state = (struct sframe_state *)data;
+
+	for (int j = 0; j < info->pi_phnum; j++) {
+		p_type = phdr[j].p_type;
+		/* Find the executable section and the SFrame section. */
+		if (p_type == PT_GNU_SFRAME) {
+			sframe_phdr = &phdr[j];
+			continue;
+		}
+		/* FIXME TODO Elf 101 - there be only one PF_X. Looks like it? */
+		if (p_type == PT_LOAD && phdr[j].p_flags & PF_X) {
+			/*
+			 * This is the executable part of the ELF binary
+			 * containing the instructions, and may contain
+			 * sections other than .text.  The usage of `text` in
+			 * this function is colloquial.
+			 */
+			text_phdr = &phdr[j];
+			continue;
+		}
+
+		if (sframe_phdr && text_phdr)
+			break;
+	}
+
+	/*
+	 * If there is no SFrame section for the prog, SFrame based unwinding
+	 * should not be attempted.  If no SFrame section is found for a DSO,
+	 * however, it may still be possible to generate useful stacktraces
+	 * using the SFrame sections' for other parts of the program.
+	 */
+	if (!sframe_phdr)
+		return info->pi_prog
+		  ? SFRAME_UNW_NO_PROG_SFRAME
+		  : SFRAME_UNW_PARTIAL_INFO;
+
+	text_addr = info->pi_prog ? text_phdr->p_vaddr
+				  : info->pi_addr + text_phdr->p_vaddr;
+	text_size = text_phdr->p_memsz;
+	sframe_addr = info->pi_prog ? sframe_phdr->p_vaddr
+				    : info->pi_addr + sframe_phdr->p_vaddr;
+	sframe_size = sframe_phdr->p_memsz;
+
+	/* Add the SFrame unwind info object to the list. */
+	err = sframe_state_add_unw_info(sframe_state, sframe_addr,
+					sframe_size, text_addr, text_size);
+	/*
+	 * An error indicates SFrame unwind info addition failed, but one can
+	 * still unwind using .sframe for other parts of the program.
+	 */
+	if (err)
+		return SFRAME_UNW_PARTIAL_INFO;
+
+	return SFRAME_UNW_INFO_OK;
+}
+
+static int add_sframe_unwind_info_for_task(struct task_struct *task)
+{
+	struct sframe_state *sfstate = task->sframe_state;
+
+	/* sfstate should be already allocated. */
+	if (!sfstate)
+		return SFRAME_UNW_NO_SFSTATE;
+
+	return iterate_phdr(add_sframe_unwind_info, task, sfstate);
+}
+
+/*
+ * SFrame Unwind Info APIs.
+ */
+
+struct sframe_unw_info *sframe_state_find_unw_info(struct sframe_state *sfstate,
+						   uint64_t addr)
+{
+	struct sframe_unw_info_list *unw_info_list;
+	struct sframe_unw_info sfu_info;
+	int i;
+
+	if (!sfstate)
+		return NULL;
+
+	if (sfstate->su_prog.text_addr < addr
+	    && sfstate->su_prog.text_addr + sfstate->su_prog.text_size > addr)
+		return &sfstate->su_prog;
+
+	unw_info_list = &sfstate->su_dsos;
+	for (i = 0; i < unw_info_list->used; ++i) {
+		sfu_info = unw_info_list->entry[i];
+		if ((sfu_info.text_addr <= addr)
+		    && (sfu_info.text_addr + sfu_info.text_size >= addr))
+			return &unw_info_list->entry[i];
+	}
+
+	return NULL;
+}
+
+struct sframe_sec *sframe_unw_info_get_sfsec(struct sframe_unw_info *sfu_info)
+{
+	if (!sfu_info || !sfu_info->sfsec)
+		return NULL;
+
+	return sfu_info->sfsec;
+}
+
+/*
+ * SFrame state APIs.
+ */
+
+static void unwind_sframe_state_free(struct sframe_state *sfstate)
+{
+	struct sframe_unw_info_list *unw_info_list;
+	struct sframe_unw_info *sfu_info;
+	int i;
+
+	if (!sfstate)
+		return;
+
+	sfu_info = &(sfstate->su_prog);
+	sframe_unw_info_cleanup(sfu_info);
+
+	unw_info_list = &sfstate->su_dsos;
+	for (i = 0; i < unw_info_list->used; ++i) {
+		sfu_info = &unw_info_list->entry[i];
+		sframe_unw_info_cleanup(sfu_info);
+	}
+
+	kfree(sfstate->su_dsos.entry);
+	sfstate->su_dsos.entry = NULL;
+	sfstate->su_dsos.alloced = 0;
+	sfstate->su_dsos.used = 0;
+
+	sfstate->task = NULL;
+}
+
+bool unwind_sframe_state_valid_p(struct sframe_state *sfstate)
+{
+	return (sfstate && sfstate->cond != SFSTATE_INVAL);
+}
+
+bool unwind_sframe_state_ready_p(struct sframe_state *sfstate)
+{
+	return (sfstate && sfstate->cond == SFSTATE_READY);
+}
+
+struct sframe_state *unwind_sframe_state_alloc(struct task_struct *task)
+{
+	struct sframe_state *sfstate = NULL;
+	/*
+	 * Check if the task's SFrame unwind information needs to be set up.
+	 */
+	sfstate = task->sframe_state;
+	if (!sfstate) {
+		/* Free'd up in release_task(). */
+		sfstate = kzalloc(sizeof(*sfstate), GFP_KERNEL);
+		if (!sfstate)
+			return NULL;
+		sfstate->cond = SFSTATE_ALLOCED;
+		task->sframe_state = sfstate;
+	}
+
+	sfstate->task = task;
+
+	return sfstate;
+}
+
+void unwind_sframe_state_cleanup(struct task_struct *task)
+{
+	if (!task->sframe_state)
+		return;
+
+	unwind_sframe_state_free(task->sframe_state);
+	kfree(task->sframe_state);
+	task->sframe_state = NULL;
+}
+
+/*
+ * Update the SFrame unwind state object cached per task.
+ *
+ * Sets cond to SFSTATE_INVAL if any error.
+ * Sets cond to SFSTATE_READY if no error.
+ */
+int unwind_sframe_state_update(struct task_struct *task)
+{
+	struct sframe_state *sfstate = NULL;
+	int sferr = 0;
+
+	sfstate = task->sframe_state;
+	if (sfstate->cond == SFSTATE_ALLOCED || sfstate->cond == SFSTATE_STALE)
+		sferr = add_sframe_unwind_info_for_task(task);
+
+	sfstate->cond = (sferr < SFRAME_UNW_INFO_OK) ? SFSTATE_INVAL
+						     : SFSTATE_READY;
+
+	return sferr < SFRAME_UNW_INFO_OK;
+}
diff --git a/lib/sframe/sframe_state.h b/lib/sframe/sframe_state.h
new file mode 100644
index 000000000000..e2f0251b30d5
--- /dev/null
+++ b/lib/sframe/sframe_state.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#ifndef SFRAME_STATE_H
+#define SFRAME_STATE_H
+
+#include "sframe_read.h"
+
+/*
+ * SFrame unwind info of the program or a DSO.
+ */
+struct sframe_unw_info {
+	/* SFrame segment's virtual addr.  */
+	uint64_t sframe_addr;
+	/* SFrame segment's size.  */
+	uint64_t sframe_size;
+	/*
+	 * Keep a reference to the pages and vma for the lifetime of this SFrame
+	 * unwind info object.
+	 */
+	struct page **sframe_pages;
+	unsigned long sframe_npages;
+	/* Address of the vmap'd area that contains the .sframe section.  */
+	const char *sframe_vmap;
+	/*
+	 * text_addr and text_size below are used only for looking up the
+	 * associated SFrame section.  See sframe_state_find_unw_info.
+	 */
+	/* Text segment's virtual addr.  */
+	uint64_t text_addr;
+	/* Text segment's size.  */
+	uint64_t text_size;
+	/* SFrame section contents.  */
+	struct sframe_sec *sfsec;
+};
+
+/*
+ * List of SFrame unwind info objects.
+ * Typically used to represent SFrame unwind info for set of shared libraries
+ * of a program.
+ */
+struct sframe_unw_info_list {
+	/* Entries allocated.  */
+	int alloced;
+	/* Entries used.  */
+	int used;
+	/* List of SFrame unwind info objects.  */
+	struct sframe_unw_info *entry;
+};
+
+enum sframe_state_code {
+	SFSTATE_READY = 0,  /* SFrame unwind OK to use.  */
+	SFSTATE_INVAL = 1,  /* SFrame unwind is invalid.  */
+	SFSTATE_ALLOCED = 2,   /* SFrame unwind is alloc'd but not initialized.  */
+	SFSTATE_STALE = 3,  /* SFrame unwind is stale and not OK to use.  */
+};
+
+/*
+ * Per task SFrame unwind state.
+ */
+struct sframe_state {
+	/* The task that this SFrame unwind info belongs to.  */
+	struct task_struct *task;
+	/* Current SFrame state condition.  */
+	enum sframe_state_code cond;
+	/* SFrame stack trace info for program.  */
+	struct sframe_unw_info su_prog;
+	/* SFrame stack trace info for the shared objects.  */
+	struct sframe_unw_info_list su_dsos;
+};
+
+extern struct sframe_unw_info *
+sframe_state_find_unw_info(struct sframe_state *sfstate, uint64_t addr);
+
+extern struct sframe_sec *
+sframe_unw_info_get_sfsec(struct sframe_unw_info *sfu_info);
+
+#endif /* SFRAME_STATE_H.  */
diff --git a/lib/sframe/sframe_unwind.c b/lib/sframe/sframe_unwind.c
new file mode 100644
index 000000000000..32716008d0b4
--- /dev/null
+++ b/lib/sframe/sframe_unwind.c
@@ -0,0 +1,208 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#include <linux/perf_event.h>
+#include <sframe/sframe_unwind.h>
+#include <sframe/sframe_regs.h>
+
+#include "sframe_state.h"
+
+/*
+ * Check if ADDR is in text segment (for which we also found a corresponding
+ * SFrame section.  The address is considered invalid, if it is not in any of
+ * the address ranges of the text segments of either the main program or any of
+ * it's DSOs (for which a corresponding SFrame section existed).
+ * Return true if valid, false otherwise.
+ */
+static bool unwind_sframe_ip_ok(struct sframe_state *sfstate, uint64_t addr)
+{
+	return (sframe_state_find_unw_info(sfstate, addr) != NULL);
+}
+
+void sframe_unwind_start(struct user_unwind_state *user_unw_state,
+			 struct task_struct *task, struct pt_regs *regs)
+{
+	if (!task->sframe_state
+	    || !unwind_sframe_state_ready_p(task->sframe_state))
+		goto error;
+
+	if (!regs)
+		goto error;
+
+	user_unw_state->sp = get_ptregs_sp(regs);
+	user_unw_state->pc = get_ptregs_ip(regs);
+	user_unw_state->fp = get_ptregs_fp(regs);
+
+	user_unw_state->task = task;
+	user_unw_state->stype = STACK_TYPE_TASK;
+
+	/* We need to skip ahead by one. */
+	sframe_unwind_next_frame(user_unw_state);
+	return;
+error:
+	user_unw_state->error = true;
+}
+
+bool sframe_unwind_next_frame(struct user_unwind_state *ustate)
+{
+	struct sframe_unw_info *sfu_info;
+	struct sframe_sec *sfsec;
+	struct task_struct *task;
+	struct sframe_state *sfstate;
+
+	int32_t ra_offset, rfp_offset, cfa_offset;
+
+	uint64_t cfa;
+	uint64_t return_addr;
+	uint64_t rfp_stack_loc;
+	uint64_t ra_stack_loc;
+	uint64_t sframe_vma;
+
+	uint64_t pc = ustate->pc;
+	uint64_t sp = ustate->sp;
+	uint64_t fp = ustate->fp;
+
+	struct sframe_fre fre, *frep = &fre;
+	int err = 0;
+
+	task = ustate->task;
+	sfstate = task->sframe_state;
+	/*
+	 * Indicate end of stack trace when SFrame unwind info
+	 * is not found for the given PC.
+	 */
+	sfu_info = sframe_state_find_unw_info(sfstate, pc);
+
+	if (!sfu_info)
+		goto the_end;
+
+	sfsec = sframe_unw_info_get_sfsec(sfu_info);
+
+	/* Find the SFrame FRE. */
+	sframe_vma = sfu_info->sframe_addr;
+	pc -= sframe_vma;
+	err = sframe_sec_find_fre(sfsec, pc, frep);
+
+	if (err != 0)
+		goto error;
+
+	/* Get the CFA offset from the FRE. */
+	cfa_offset = sframe_fre_get_cfa_offset(sfsec, frep, &err);
+	if (err == SFRAME_ERR_FREOFFSET_NOPRESENT)
+		goto error;
+	cfa = ((sframe_fre_get_base_reg_id(frep, &err)
+		== SFRAME_BASE_REG_SP) ? sp : fp) + cfa_offset;
+
+	/* Get the RA offset from the FRE. */
+	ra_offset = sframe_fre_get_ra_offset(sfsec, frep, &err);
+	if (err != 0)
+		goto error;
+	ra_stack_loc = cfa + ra_offset;
+
+	if (!access_ok((const uint64_t __user *)ra_stack_loc, STACK_ACCESS_LEN))
+		goto error;
+
+	if (__get_user(return_addr, (const uint64_t __user *)ra_stack_loc))
+		goto error;
+
+	/* Get the FP offset from the FRE. */
+	rfp_offset = sframe_fre_get_fp_offset(sfsec, frep, &err);
+	if (err == 0) {
+		rfp_stack_loc = cfa + rfp_offset;
+
+		if (!access_ok((const uint64_t __user *) rfp_stack_loc,
+			       STACK_ACCESS_LEN))
+			goto error;
+
+		if (__get_user(fp, (const uint64_t __user *)rfp_stack_loc))
+			goto error;
+	}
+
+	/* Validate and add return address to the list. */
+	if (unwind_sframe_ip_ok(sfstate, return_addr)) {
+		ustate->pc = return_addr;
+		ustate->sp = cfa;
+		ustate->fp = fp;
+
+	} else {
+		goto error;
+	}
+
+	return true;
+
+error:
+	ustate->error = true;
+	return false;
+the_end:
+	ustate->stype = STACK_TYPE_UNKNOWN;
+	return false;
+}
+
+uint64_t sframe_unwind_get_return_address(struct user_unwind_state *state)
+{
+	return state->pc;
+}
+
+/*
+ * Generate stack trace using SFrame stack trace information.
+ * Return 0 if success, 1 otherwise.
+ */
+
+static int do_sframe_unwind(struct task_struct *task,
+			    struct sframe_state *sfstate,
+			    struct perf_callchain_entry_ctx *entry,
+			    struct pt_regs *regs)
+{
+	struct user_unwind_state ustate;
+	uint64_t addr;
+
+	memset(&ustate, 0, sizeof(ustate));
+
+	for (sframe_unwind_start(&ustate, task, regs);
+	     !sframe_unwind_done(&ustate) && !sframe_unwind_error(&ustate);
+	     sframe_unwind_next_frame(&ustate)) {
+		addr = sframe_unwind_get_return_address(&ustate);
+		if (!addr || perf_callchain_store(entry, addr))
+			break;
+	}
+
+	return 0;
+}
+
+/*
+ * Get the stack trace for the task using SFrame stack trace information.
+ * Returns 0 if success, 1 otherwise.
+ */
+
+int sframe_callchain_user(struct task_struct *task,
+			   struct perf_callchain_entry_ctx *entry,
+			   struct pt_regs *regs)
+{
+	int ret = 0;
+	struct sframe_state *sfstate;
+
+	if (task != current)
+		return 1;
+
+	/* Get the current task's sframe state. */
+	sfstate = task->sframe_state;
+
+	if (!sfstate)
+		return 1;
+
+	/*
+	 * Prepare for stack tracing.  State must be SFSTATE_READY at this time.
+	 * FIXME TODO SFrame state may be stale because there was a change in
+	 * the set of DSOs used by the program, for example.
+	 * FIXME TODO Need to update task->sframe_state if program
+	 * dlopen/dlclose a DSO.
+	 */
+	if (ret || !unwind_sframe_state_ready_p(sfstate))
+		return 1;
+
+	ret = do_sframe_unwind(task, sfstate, entry, regs);
+
+	return ret;
+}
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [POC 5/5] x86_64: invoke SFrame based stack tracer for user space
  2023-05-01 20:04 [POC 0/5] SFrame based stack tracer for user space in the kernel Indu Bhagat
                   ` (3 preceding siblings ...)
  2023-05-01 20:04 ` [POC 4/5] sframe: add an SFrame format stack tracer Indu Bhagat
@ 2023-05-01 20:04 ` Indu Bhagat
  2023-05-01 23:11   ` Steven Rostedt
  2023-05-02 10:53   ` Peter Zijlstra
  2023-05-01 22:15 ` [POC 0/5] SFrame based stack tracer for user space in the kernel Steven Rostedt
  5 siblings, 2 replies; 32+ messages in thread
From: Indu Bhagat @ 2023-05-01 20:04 UTC (permalink / raw)
  To: linux-toolchains
  Cc: daandemeyer, andrii, rostedt, kris.van.hees, elena.zannoni,
	nick.alcock, Indu Bhagat

The task's sframe_state is allocated and initialized if a phdr with type
PT_GNU_SFRAME is encountered for the binary.

perf_callchain_user() will fall back on the frame pointer based stack
trace approach if:
  - SFrame section for the main program is not found.
  - SFrame state for the task is either not setup or stale and cannot
  be refreshed.

Finally, the sframe_state is cleaned up in release_task().

Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
---
 arch/x86/events/core.c | 51 ++++++++++++++++++++++++++++++++++++++++++
 fs/binfmt_elf.c        | 39 ++++++++++++++++++++++++++++++++
 kernel/exit.c          |  9 ++++++++
 3 files changed, 99 insertions(+)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d096b04bf80e..4be9e826714a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2860,11 +2860,54 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 }
 #endif
 
+#ifdef CONFIG_USER_UNWINDER_SFRAME
+
+#include <sframe/sframe_unwind.h>
+
+/* Check if the specified task has SFrame unwind state set up.  */
+static inline bool check_sframe_state_p(struct task_struct *task)
+{
+	bool sframe_ok = false;
+
+	/* FIXME TODO - only current task can be unwinded at this time.
+	 * Even for current tasks, following unknowns remain and hence, not
+	 * handled:
+	 *    - dlopen / dlclose detection and update of sframe_state,
+	 *    - in general, any change in memory mappings.
+	 */
+	if (task != current)
+		return false;
+
+	if (!task->sframe_state)
+		return false;
+
+	sframe_ok = (unwind_sframe_state_valid_p(task->sframe_state)
+		     || unwind_sframe_state_ready_p(task->sframe_state));
+
+	return sframe_ok;
+}
+
+#else
+/* Check if the specified task has SFrame unwind state set up. */
+static inline bool check_sframe_state_p(struct task_struct *task)
+{
+	  return false;
+}
+
+static inline int sframe_callchain_user(struct task_struct *task,
+					struct perf_callchain_entry_ctx *entry,
+					struct pt_regs *regs)
+{
+	  return 0;
+}
+#endif
+
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
 	struct stack_frame frame;
 	const struct stack_frame __user *fp;
+	bool sframe_avail;
 
 	if (perf_guest_state()) {
 		/* TODO: We don't support guest os callchain now */
@@ -2887,7 +2930,15 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 	if (perf_callchain_user32(regs, entry))
 		return;
 
+	sframe_avail = check_sframe_state_p (current);
+
 	pagefault_disable();
+
+	if (sframe_avail && !sframe_callchain_user (current, entry, regs)) {
+		pagefault_enable();
+		return;
+	}
+
 	while (entry->nr < entry->max_stack) {
 		if (!valid_user_frame(fp, sizeof(frame)))
 			break;
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 1033fbdfdbec..8a932bc9ad5d 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -820,6 +820,34 @@ static int parse_elf_properties(struct file *f, const struct elf_phdr *phdr,
 	return ret == -ENOENT ? 0 : ret;
 }
 
+#ifdef CONFIG_USER_UNWINDER_SFRAME
+
+#include <sframe/sframe_unwind.h>
+
+static inline int sframe_state_setup(void)
+{
+	int ret;
+
+	/* Allocate the SFrame state (per task) if NULL. */
+	if (!current->sframe_state)
+		current->sframe_state = unwind_sframe_state_alloc(current);
+
+	if (!current->sframe_state)
+		return -ENOMEM;
+
+	ret = unwind_sframe_state_update(current);
+
+	return ret;
+}
+
+#else
+static inline int sframe_state_setup(void)
+{
+	  return 0;
+}
+
+#endif
+
 static int load_elf_binary(struct linux_binprm *bprm)
 {
 	struct file *interpreter = NULL; /* to shut gcc up */
@@ -842,6 +870,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE;
 	struct mm_struct *mm;
 	struct pt_regs *regs;
+	bool sframe_avail = false;
 
 	retval = -ENOEXEC;
 	/* First of all, some simple consistency checks */
@@ -861,6 +890,14 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	if (!elf_phdata)
 		goto out;
 
+	elf_ppnt = elf_phdata;
+	for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++) {
+		if (elf_ppnt->p_type == PT_GNU_SFRAME) {
+			  sframe_avail = true;
+			  break;
+		}
+	}
+
 	elf_ppnt = elf_phdata;
 	for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++) {
 		char *elf_interpreter;
@@ -1342,6 +1379,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	 */
 	ELF_PLAT_INIT(regs, reloc_func_desc);
 #endif
+	if (sframe_avail && sframe_state_setup())
+		goto out;
 
 	finalize_exec(bprm);
 	START_THREAD(elf_ex, regs, elf_entry, bprm->p);
diff --git a/kernel/exit.c b/kernel/exit.c
index 34b90e2e7cf7..e9b461f689a4 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -70,6 +70,10 @@
 #include <linux/sysfs.h>
 #include <linux/user_events.h>
 
+#ifdef CONFIG_USER_UNWINDER_SFRAME
+#include <sframe/sframe_unwind.h>
+#endif
+
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
 #include <asm/mmu_context.h>
@@ -241,6 +245,11 @@ void release_task(struct task_struct *p)
 	struct task_struct *leader;
 	struct pid *thread_pid;
 	int zap_leader;
+
+#ifdef CONFIG_USER_UNWINDER_SFRAME
+	unwind_sframe_state_cleanup(p);
+#endif
+
 repeat:
 	/* don't need to get the RCU readlock here - the process is dead and
 	 * can't be modifying its own credentials. But shut RCU-lockdep up */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [POC 0/5] SFrame based stack tracer for user space in the kernel
  2023-05-01 20:04 [POC 0/5] SFrame based stack tracer for user space in the kernel Indu Bhagat
                   ` (4 preceding siblings ...)
  2023-05-01 20:04 ` [POC 5/5] x86_64: invoke SFrame based stack tracer for user space Indu Bhagat
@ 2023-05-01 22:15 ` Steven Rostedt
  5 siblings, 0 replies; 32+ messages in thread
From: Steven Rostedt @ 2023-05-01 22:15 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: linux-toolchains, daandemeyer, andrii, kris.van.hees,
	elena.zannoni, nick.alcock

On Mon,  1 May 2023 13:04:05 -0700
Indu Bhagat <indu.bhagat@oracle.com> wrote:

> Hello,
> 

Hi Indu,

This is really great! I think we should include LKML in this as well. And
possibly even linux-trace-kernel@vger.kernel.org.


> This patch set is a Proof of Concept implementation for an SFrame-based
> stack tracer for user space in the kernel. Some of you had expressed interest
> in exploring this earlier; hopefully, this POC helps discuss the design and
> take it forward.
> 
> Motivation
> ==========
> Generating stack traces is vital for all profiling, tracing and debugging
> tools. In context of generating stack traces for user space, frame-pointer
> based unwinding works, but has its issues ([1],[2]).  EH_Frame based
> unwinding seems undesirable for kernel's unwinding needs ([3],[4]). 
> In general, EH_Frame based unwinding is undesirable in applications that need
> fast, real-time stack tracers (e.g., profilers), because of the overhead of
> interpreting and executing DWARF opcodes to calculate the relevant stack
> offsets.
> 
> SFrame (Simple Frame) stack trace format is designed to address these concerns.
> With this POC, we would like to see how to use SFrame as a viable alternative
> for user space stack tracing needs in the kernel.
> 
> [1] https://lwn.net/Articles/919940/
> [2] https://pagure.io/fesco/issue/2817
> [3] https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/OOJDAKTJB5WGMOZRXTUX7FTPFBF3H7WE/#NXRMNKD4B23HX7U5ICMKFRZO6Z3VXQXL
> [4] https://lkml.org/lkml/2012/2/10/356
> 
> What is SFrame format
> =====================
> 
> SFrame is the "Simple Frame" stack trace format.  The format is documented as
> part of the binutils documentation at https://sourceware.org/binutils/docs.
> 
> Starting with binutils 2.40, the GNU assembler (as) can generate SFrame stack
> trace data based on the CFI directives found in the source assembly.  This is
> achieved by using the --gsframe command line option when invoking the
> assembler.  This option plays the same role as the existing --gdwarf-[2345]
> options, only this time referring to SFrame.  The resulting stack tracing
> information is stored in a new segment of its own with type PT_GNU_SFRAME,
> containing a section named '.sframe'.
> 
> Also starting with binutils 2.40, the GNU linker (ld) knows how to merge
> sections containing SFrame stack trace info.
> 
> SFrame based user space stack tracer POC
> ========================================
> These patches implement a POC for an SFrame based user space stack tracer (for
> x86) in the kernel.  The purpose of this code is to serve as a reference,
> initiate discussions, and perhaps serve as a starting point for a viable
> implementation of an SFrame based stack tracer.  Please keep in mind that my
> familiarity with with kernel code/processes/conventions is still limited ;-).
> 
> High-level Design in this POC
> =============================
> Kconfig adds two config options for userspace unwinding
>   - config USER_UNWINDER_SFRAME to enable the SFrame userspace unwinder
>   - config USER_UNWINDER_FRAME_POINTER to enable the Frame Pointer userspace
>     unwinder
> 
> If CONFIG_USER_UNWINDER_SFRAME is set, the task_struct keeps a reference to
> the sframe_state object for the task.
> 
> For long running user programs, it makes sense to cache the sframe_state
> in the task and be able to simply do a quick do_sframe_unwind() at every
> unwind request.  Caching the sframe_state also means keeping the .sframe
> pages (for the prog and its DSOs) pinned.  The task's sframe_state is
> kmalloc'ed and initialized in load_elf_binary, when the task is close to begin
> execution.  The (open) issue with this design, however, remains that we need to
> detect when additional DSOs are brought in at run-time by the application.
> 
> The detection (and resolution) of stale sframe_state is not implemented in this
> POC.  As such, the POC at this time is fit only for applications that are
> statically linked.

So my thoughts on this was not to pin the sframe, but simply note that it
exists. When perf/bpf/ftrace wants a user space stack trace, it will ask
for one (I plan on adding an interface around this process, as it will also
handle the case the sframe is not available).

As the user stack trace will not change while the task is in the kernel, it
does not need to be triggered when asked for. Instead, it could register a
callback, and then on exiting back to user space (via the ptrace path), it
would then do the sframe look up, and pass the user space stack trace to
the perf/bpf/ftrace handlers.

In this location, we can allow for the sframe to be faulted in, as it will
be in a context where it can safely take a fault (and schedule out!). It
would be no different than any part of the elf file faulting in, and can be
swapped back out with memory pressure.

I'll go ahead and play with this code.

Thanks again, this is really helpful.

-- Steve


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 3/5] sframe: add new SFrame library
  2023-05-01 20:04 ` [POC 3/5] sframe: add new SFrame library Indu Bhagat
@ 2023-05-01 22:40   ` Steven Rostedt
  2023-05-02  5:07     ` Indu Bhagat
  2023-05-02  8:46     ` Peter Zijlstra
  2023-05-02  9:09   ` Peter Zijlstra
                     ` (5 subsequent siblings)
  6 siblings, 2 replies; 32+ messages in thread
From: Steven Rostedt @ 2023-05-01 22:40 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: linux-toolchains, daandemeyer, andrii, kris.van.hees,
	elena.zannoni, nick.alcock

On Mon,  1 May 2023 13:04:08 -0700
Indu Bhagat <indu.bhagat@oracle.com> wrote:

> This patch adds an implementation to read SFrame stack trace data from
> a .sframe section.  Some APIs are also provided to find stack tracing
> information per PC, e.g., given a PC, find the SFrame FRE.
> 
> These routines are provided in the sframe_read.h and sframe_read.c.
> 
> This implmentation is malloc-free.
> 
> Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
> ---
>  lib/Makefile             |   1 +
>  lib/sframe/Makefile      |   5 +
>  lib/sframe/sframe.h      | 263 +++++++++++++++++++++
>  lib/sframe/sframe_read.c | 498 +++++++++++++++++++++++++++++++++++++++
>  lib/sframe/sframe_read.h |  75 ++++++
>  5 files changed, 842 insertions(+)
>  create mode 100644 lib/sframe/Makefile
>  create mode 100644 lib/sframe/sframe.h
>  create mode 100644 lib/sframe/sframe_read.c
>  create mode 100644 lib/sframe/sframe_read.h
> 
> diff --git a/lib/Makefile b/lib/Makefile
> index 876fcdeae34e..cb02d16dbffd 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -198,6 +198,7 @@ obj-$(CONFIG_ZSTD_COMPRESS) += zstd/
>  obj-$(CONFIG_ZSTD_DECOMPRESS) += zstd/
>  obj-$(CONFIG_XZ_DEC) += xz/
>  obj-$(CONFIG_RAID6_PQ) += raid6/
> +obj-$(CONFIG_USER_UNWINDER_SFRAME) += sframe/
>  
>  lib-$(CONFIG_DECOMPRESS_GZIP) += decompress_inflate.o
>  lib-$(CONFIG_DECOMPRESS_BZIP2) += decompress_bunzip2.o
> diff --git a/lib/sframe/Makefile b/lib/sframe/Makefile
> new file mode 100644
> index 000000000000..4e4291d9294f
> --- /dev/null
> +++ b/lib/sframe/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +##################################
> +obj-$(CONFIG_USER_UNWINDER_SFRAME) += sframe_read.o \

The ending backslash is only needed if there's a line wrap, which you don't
have here.

> +
> +CFLAGS_sframe_read.o += -I $(srctree)/lib/sframe/
> diff --git a/lib/sframe/sframe.h b/lib/sframe/sframe.h
> new file mode 100644
> index 000000000000..b1290e92839a
> --- /dev/null
> +++ b/lib/sframe/sframe.h
> @@ -0,0 +1,263 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (C) 2023, Oracle and/or its affiliates.
> + */
> +
> +#ifndef	SFRAME_H
> +#define	SFRAME_H
> +
> +#include <linux/types.h>
> +
> +/* This file contains definitions for the SFrame stack tracing format, which is
> + * documented at https://sourceware.org/binutils/docs */
> +
> +#define SFRAME_VERSION_1	1
> +#define SFRAME_MAGIC		0xdee2
> +#define SFRAME_VERSION	SFRAME_VERSION_1
> +
> +/* Function Descriptor Entries are sorted on PC. */
> +#define SFRAME_F_FDE_SORTED	0x1
> +/* Frame-pointer based stack tracing. Defined, but not set. */
> +#define SFRAME_F_FRAME_POINTER 0x2
> +
> +#define SFRAME_CFA_FIXED_FP_INVALID 0
> +#define SFRAME_CFA_FIXED_RA_INVALID 0
> +
> +/* Supported ABIs/Arch. */
> +#define SFRAME_ABI_AARCH64_ENDIAN_BIG      1 /* AARCH64 big endian. */
> +#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE   2 /* AARCH64 little endian. */
> +#define SFRAME_ABI_AMD64_ENDIAN_LITTLE     3 /* AMD64 little endian. */
> +
> +/* SFrame FRE types. */
> +#define SFRAME_FRE_TYPE_ADDR1	0
> +#define SFRAME_FRE_TYPE_ADDR2	1
> +#define SFRAME_FRE_TYPE_ADDR4	2
> +
> +/*
> + * SFrame Function Descriptor Entry types.
> + *
> + * The SFrame format has two possible representations for functions.  The
> + * choice of which type to use is made according to the instruction patterns
> + * in the relevant program stub.
> + */
> +
> +/* Unwinders perform a (PC >= FRE_START_ADDR) to look up a matching FRE. */
> +#define SFRAME_FDE_TYPE_PCINC   0
> +/*
> + * Unwinders perform a (PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK)
> + * to look up a matching FRE.  Typical usecases are pltN entries, trampolines
> + * etc.
> + */
> +#define SFRAME_FDE_TYPE_PCMASK  1
> +
> +struct sframe_preamble
> +{

Note, for Linux coding style, structs should have the '{' on the same line:

struct sframe_preamble {

This is required because a common way to find a struct definition is to
search the name followed by a '{'.


> +	/* Magic number (SFRAME_MAGIC). */
> +	uint16_t magic;
> +	/* Data format version number (SFRAME_VERSION). */
> +	uint8_t version;
> +	/* Various flags. */
> +	uint8_t flags;
> +} __packed;
> +
> +struct sframe_header
> +{

Here too.

> +	struct sframe_preamble preamble;
> +	/* Information about the arch (endianness) and ABI. */
> +	uint8_t abi_arch;
> +	/*
> +	 * Offset for the Frame Pointer (FP) from CFA may be fixed for some
> +	 * ABIs (e.g, in AMD64 when -fno-omit-frame-pointer is used).  When fixed,
> +	 * this field specifies the fixed stack frame offset and the individual
> +	 * FREs do not need to track it.  When not fixed, it is set to
> +	 * SFRAME_CFA_FIXED_FP_INVALID, and the individual FREs may provide
> +	 * the applicable stack frame offset, if any.
> +	 */
> +	int8_t cfa_fixed_fp_offset;
> +	/*
> +	 * Offset for the Return Address from CFA is fixed for some ABIs
> +	 * (e.g., AMD64 has it as CFA-8).  When fixed, the header specifies the
> +	 * fixed stack frame offset and the individual FREs do not track it.  When
> +	 * not fixed, it is set to SFRAME_CFA_FIXED_RA_INVALID, and individual
> +	 * FREs provide the applicable stack frame offset, if any.
> +	 */
> +	int8_t cfa_fixed_ra_offset;
> +	/*
> +	 * Number of bytes making up the auxiliary header, if any.
> +	 * Some ABI/arch, in the future, may use this space for extending the
> +	 * information in SFrame header.  Auxiliary header is contained in
> +	 * bytes sequentially following the sframe_header.
> +	 */
> +	uint8_t auxhdr_len;
> +	/* Number of SFrame FDEs in this SFrame section. */
> +	uint32_t num_fdes;
> +	/* Number of SFrame Frame Row Entries. */
> +	uint32_t num_fres;
> +	/* Number of bytes in the SFrame Frame Row Entry section. */
> +	uint32_t fre_len;
> +	/* Offset of SFrame Function Descriptor Entry section. */
> +	uint32_t fdeoff;
> +	/* Offset of SFrame Frame Row Entry section. */
> +	uint32_t freoff;
> +} __packed;
> +
> +#define SFRAME_V1_HDR_SIZE(sframe_hdr)	\
> +  ((sizeof (struct sframe_header) + (sframe_hdr).auxhdr_len))
> +
> +/* Two possible keys for executable (instruction) pointers signing. */
> +#define SFRAME_AARCH64_PAUTH_KEY_A    0 /* Key A. */
> +#define SFRAME_AARCH64_PAUTH_KEY_B    1 /* Key B. */
> +
> +struct sframe_func_desc_entry
> +{

Basically all of them.

> +	/*
> +	 * Function start address.  Encoded as a signed offset, relative to the
> +	 * beginning of the current FDE.
> +	 */
> +	int32_t func_start_address;
> +	/* Size of the function in bytes. */
> +	uint32_t func_size;
> +	/*
> +	 * Offset of the first SFrame Frame Row Entry of the function, relative to the
> +	 * beginning of the SFrame Frame Row Entry sub-section.
> +	 */
> +	uint32_t func_start_fre_off;
> +	/* Number of frame row entries for the function. */
> +	uint32_t func_num_fres;
> +	/*
> +	 * Additional information for deciphering the unwind information for the
> +	 * function.
> +	 *   - 4-bits: Identify the FRE type used for the function.
> +	 *   - 1-bit: Identify the FDE type of the function - mask or inc.
> +	 *   - 1-bit: PAC authorization A/B key (aarch64).
> +	 *   - 2-bits: Unused.
> +	 * --------------------------------------------------------------------------
> +	 * |     Unused    |  PAC auth A/B key (aarch64) |  FDE type |   FRE type   |
> +	 * |               |        Unused (amd64)       |           |              |
> +	 * --------------------------------------------------------------------------
> +	 * 8               6                             5           4              0
> +	 */
> +	uint8_t func_info;
> +} __packed;
> +
> +/* Note: Set PAC auth key to SFRAME_AARCH64_PAUTH_KEY_A by default.  */
> +#define SFRAME_V1_FUNC_INFO(fde_type, fre_enc_type) \
> +  (((SFRAME_AARCH64_PAUTH_KEY_A & 0x1) << 5) | \
> +   (((fde_type) & 0x1) << 4) | ((fre_enc_type) & 0xf))
> +
> +#define SFRAME_V1_FUNC_FRE_TYPE(data)	  ((data) & 0xf)
> +#define SFRAME_V1_FUNC_FDE_TYPE(data)	  (((data) >> 4) & 0x1)
> +#define SFRAME_V1_FUNC_PAUTH_KEY(data)	  (((data) >> 5) & 0x1)
> +
> +/*
> + * Size of stack frame offsets in an SFrame Frame Row Entry.  A single
> + * SFrame FRE has all offsets of the same size.  Offset size may vary
> + * across frame row entries.
> + */
> +#define SFRAME_FRE_OFFSET_1B	  0
> +#define SFRAME_FRE_OFFSET_2B	  1
> +#define SFRAME_FRE_OFFSET_4B	  2
> +
> +/* An SFrame Frame Row Entry can be SP or FP based.  */
> +#define SFRAME_BASE_REG_FP	0
> +#define SFRAME_BASE_REG_SP	1
> +
> +/*
> + * The index at which a specific offset is presented in the variable length
> + * bytes of an FRE.
> + */
> +#define SFRAME_FRE_CFA_OFFSET_IDX   0
> +/*
> + * The RA stack offset, if present, will always be at index 1 in the variable
> + * length bytes of the FRE.
> + */
> +#define SFRAME_FRE_RA_OFFSET_IDX    1
> +/*
> + * The FP stack offset may appear at offset 1 or 2, depending on the ABI as RA
> + * may or may not be tracked.
> + */
> +#define SFRAME_FRE_FP_OFFSET_IDX    2
> +
> +struct sframe_fre_info
> +{
> +  /* Information about
> +   *   - 1 bit: base reg for CFA
> +   *   - 4 bits: Number of offsets (N).  A value of upto 3 is allowed to track
> +   *     all three of CFA, FP and RA (fixed implicit order).
> +   *   - 2 bits: information about size of the offsets (S) in bytes.
> +   *     Valid values are SFRAME_FRE_OFFSET_1B, SFRAME_FRE_OFFSET_2B,
> +   *     SFRAME_FRE_OFFSET_4B
> +   *   - 1 bit: Mangled RA state bit (aarch64 only).
> +   * -----------------------------------------------------------------------------------
> +   * | Mangled-RA (aarch64) |  Size of offsets   |   Number of offsets    |   base_reg |
> +   * |  Unused (amd64)      |                    |                        |            |
> +   * -----------------------------------------------------------------------------------
> +   * 8                     7                    5                        1            0
> +   */
> +  uint8_t fre_info;
> +};
> +
> +/* Macros to compose and decompose FRE info. */
> +
> +/* Note: Set mangled_ra_p to zero by default. */
> +#define SFRAME_V1_FRE_INFO(base_reg_id, offset_num, offset_size) \
> +  (((0 & 0x1) << 7) | (((offset_size) & 0x3) << 5) | \
> +   (((offset_num) & 0xf) << 1) | ((base_reg_id) & 0x1))
> +
> +/* Set the mangled_ra_p bit as indicated. */
> +#define SFRAME_V1_FRE_INFO_UPDATE_MANGLED_RA_P(mangled_ra_p, fre_info) \
> +  ((((mangled_ra_p) & 0x1) << 7) | ((fre_info) & 0x7f))
> +
> +#define SFRAME_V1_FRE_CFA_BASE_REG_ID(data)	  ((data) & 0x1)
> +#define SFRAME_V1_FRE_OFFSET_COUNT(data)	  (((data) >> 1) & 0xf)
> +#define SFRAME_V1_FRE_OFFSET_SIZE(data)		  (((data) >> 5) & 0x3)
> +#define SFRAME_V1_FRE_MANGLED_RA_P(data)	  (((data) >> 7) & 0x1)
> +
> +/* SFrame Frame Row Entry definitions. */
> +
> +/*
> + * Used when SFRAME_FRE_TYPE_ADDR1 is specified as FRE type.
> + * Upper limit of start address in sframe_frame_row_entry_addr1 if 0x100 (not
> + * inclusive).
> + */
> +struct sframe_frame_row_entry_addr1
> +{
> +	/*
> +	 * Start address of the frame row entry.  Encoded as an 1-byte unsigned
> +	 * offset, relative to the start address of the function.
> +	 */
> +	uint8_t fre_start_ip_offset;
> +	struct sframe_fre_info fre_info;
> +} __packed;
> +
> +/*
> + * Used when SFRAME_FRE_TYPE_ADDR2 is specified as FRE type.
> + * Upper limit of start address in sframe_frame_row_entry_addr2 is 0x10000 (not
> + * inclusive).
> + */
> +struct sframe_frame_row_entry_addr2
> +{
> +	/*
> +	 * Start address of the frame row entry.  Encoded as an 2-byte unsigned
> +	 * offset, relative to the start address of the function.
> +	 */
> +	uint16_t fre_start_ip_offset;
> +	struct sframe_fre_info fre_info;
> +} __packed;
> +
> +/*
> + * Used when SFRAME_FRE_TYPE_ADDR4 is specified as FRE type.
> + * Upper limit of start address in sframe_frame_row_entry_addr2
> + * is 0x100000000 (not inclusive).
> + */
> +struct sframe_frame_row_entry_addr4
> +{
> +	/*
> +	 * Start address of the frame row entry.  Encoded as a 4-byte unsigned
> +	 * offset, relative to the start address of the function.
> +	 */
> +	uint32_t fre_start_ip_offset;
> +	struct sframe_fre_info fre_info;
> +} __packed;
> +
> +#endif	/* SFRAME_H */
> diff --git a/lib/sframe/sframe_read.c b/lib/sframe/sframe_read.c
> new file mode 100644
> index 000000000000..9d6558d62c54
> --- /dev/null
> +++ b/lib/sframe/sframe_read.c
> @@ -0,0 +1,498 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2023, Oracle and/or its affiliates.
> + */
> +
> +#include <linux/string.h>
> +
> +#include "sframe_read.h"
> +
> +struct sframe_sec {
> +	/* SFrame header. */
> +	struct sframe_header header;
> +	/* SFrame Function Desc Entries. */
> +	void *fdes;
> +	/* SFrame Frame Row Entries. */
> +	void *fres;
> +	/* Number of bytes needed for SFrame FREs. */
> +	uint32_t fre_nbytes;
> +};
> +
> +static int sframe_set_errno(int *errp, int error)
> +{
> +	if (errp != NULL)
> +		*errp = error;
> +	return SFRAME_ERR;
> +}
> +
> +static uint32_t sframe_sec_get_hdr_size(struct sframe_header *sfh)
> +{
> +	return SFRAME_V1_HDR_SIZE(*sfh);
> +}
> +
> +static unsigned int sframe_fre_get_offset_count(unsigned char fre_info)
> +{
> +	return SFRAME_V1_FRE_OFFSET_COUNT(fre_info);
> +}
> +
> +static unsigned int sframe_fre_get_offset_size(unsigned char fre_info)
> +{
> +	return SFRAME_V1_FRE_OFFSET_SIZE(fre_info);
> +}
> +
> +static unsigned int sframe_get_fre_type(struct sframe_func_desc_entry *fdep)
> +{
> +	return (fdep) ? SFRAME_V1_FUNC_FRE_TYPE(fdep->func_info) : 0;
> +}
> +
> +static unsigned int sframe_get_fde_type(struct sframe_func_desc_entry *fdep)
> +{
> +	return (fdep) ? SFRAME_V1_FUNC_FDE_TYPE(fdep->func_info) : 0;
> +}
> +
> +static bool sframe_header_sanity_check_p(struct sframe_header *hp)
> +{
> +	unsigned char all_flags = SFRAME_F_FDE_SORTED | SFRAME_F_FRAME_POINTER;

Add a space here.

> +	/* Check that the preamble is valid. */
> +	if ((hp->preamble.magic != SFRAME_MAGIC)
> +	    || (hp->preamble.version != SFRAME_VERSION)
> +	    || ((hp->preamble.flags | all_flags) != all_flags))
> +		return false;
> +
> +	/* Check that the offsets are valid. */
> +	if (hp->fdeoff > hp->freoff)
> +		return false;
> +
> +	return true;
> +}
> +
> +static bool sframe_fre_sanity_check_p(struct sframe_fre *frep)
> +{
> +	unsigned int offset_size, offset_cnt;
> +
> +	if (frep == NULL)
> +		return false;
> +
> +	offset_size = sframe_fre_get_offset_size(frep->fre_info);
> +
> +	if (offset_size != SFRAME_FRE_OFFSET_1B
> +	    && offset_size != SFRAME_FRE_OFFSET_2B
> +	    && offset_size != SFRAME_FRE_OFFSET_4B)
> +		return false;
> +
> +	offset_cnt = sframe_fre_get_offset_count(frep->fre_info);
> +	if (offset_cnt > MAX_NUM_STACK_OFFSETS)
> +		return false;
> +
> +	return true;
> +}
> +
> +static int32_t sframe_get_fre_offset(struct sframe_fre *frep, uint32_t idx,
> +				     int *errp)
> +{
> +	int offset_cnt, offset_size;
> +
> +	if (frep == NULL || !sframe_fre_sanity_check_p(frep))
> +		return sframe_set_errno(errp, SFRAME_ERR_FRE_INVAL);
> +
> +	offset_cnt = sframe_fre_get_offset_count(frep->fre_info);
> +	offset_size = sframe_fre_get_offset_size(frep->fre_info);
> +
> +	if (offset_cnt < idx + 1)
> +		return sframe_set_errno(errp, SFRAME_ERR_FREOFFSET_NOPRESENT);
> +
> +	if (errp != NULL)
> +		*errp = 0; /* Offset Valid. */
> +
> +	if (offset_size == SFRAME_FRE_OFFSET_1B) {
> +		int8_t *stack_offsets = (int8_t *)frep->fre_offsets;
> +		return stack_offsets[idx];
> +	} else if (offset_size == SFRAME_FRE_OFFSET_2B) {
> +		int16_t *stack_offsets = (int16_t *)frep->fre_offsets;
> +		return stack_offsets[idx];
> +	} else {
> +		int32_t *stack_offsets = (int32_t *)frep->fre_offsets;
> +		return stack_offsets[idx];
> +	}
> +}
> +
> +static struct sframe_header *sframe_sec_get_header(struct sframe_sec *sfsec)
> +{
> +	return sfsec ? &sfsec->header : NULL;
> +}
> +
> +static int sframe_fre_copy(struct sframe_fre *dst,
> +			   struct sframe_fre *src)
> +{
> +	if (dst == NULL || src == NULL)
> +		return SFRAME_ERR;
> +
> +	memcpy(dst, src, sizeof(struct sframe_fre));
> +	return 0;
> +}
> +
> +static int sframe_decode_start_ip_offset(const char *fre_buf,
> +					 uint32_t *start_ip_offset,
> +					 unsigned int fre_type)
> +{
> +	uint32_t saddr = 0;
> +
> +	if (fre_type == SFRAME_FRE_TYPE_ADDR1) {
> +		uint8_t *uc = (uint8_t *)fre_buf;
> +		saddr = (uint32_t)*uc;
> +	} else if (fre_type == SFRAME_FRE_TYPE_ADDR2) {
> +		uint16_t *ust = (uint16_t *)fre_buf;
> +		saddr = (uint32_t)*ust;
> +	} else if (fre_type == SFRAME_FRE_TYPE_ADDR4) {
> +		uint32_t *uit = (uint32_t *)fre_buf;
> +		saddr = (uint32_t)*uit;
> +	} else {
> +		return SFRAME_ERR_INVAL;
> +	}
> +
> +	*start_ip_offset = saddr;
> +	return 0;
> +}
> +
> +/* Get the total size in bytes for the stack offsets. */
> +static size_t sframe_fre_stack_offsets_size(unsigned char fre_info)
> +{
> +	unsigned int offset_size, offset_cnt;
> +
> +	offset_size = sframe_fre_get_offset_size(fre_info);
> +	offset_cnt = sframe_fre_get_offset_count(fre_info);
> +
> +	if (offset_size == SFRAME_FRE_OFFSET_2B
> +	    || offset_size == SFRAME_FRE_OFFSET_4B)	/* 2 or 4 bytes. */
> +		return (offset_cnt * (offset_size * 2));
> +
> +	return offset_cnt;
> +}
> +
> +static size_t sframe_fre_get_start_address_tsize(unsigned int fre_type)
> +{
> +	/* Type size of the start_addr in an FRE. */
> +	size_t saddr_tsize = 0;
> +
> +	switch (fre_type) {
> +	case SFRAME_FRE_TYPE_ADDR1:
> +		saddr_tsize = sizeof(uint8_t);
> +		break;
> +	case SFRAME_FRE_TYPE_ADDR2:
> +		saddr_tsize = sizeof(uint16_t);
> +		break;
> +	case SFRAME_FRE_TYPE_ADDR4:
> +		saddr_tsize = sizeof(uint32_t);
> +		break;
> +	default:
> +		/* No other value is expected. */
> +		break;
> +	}
> +	return saddr_tsize;
> +}
> +
> +static size_t sframe_fre_vlen_size(struct sframe_fre *frep,
> +				   unsigned int fre_type)
> +{
> +	unsigned char fre_info;
> +	size_t ip_offset_tsize;
> +
> +	if (frep == NULL)
> +		return 0;
> +
> +	fre_info = frep->fre_info;
> +	ip_offset_tsize = sframe_fre_get_start_address_tsize(fre_type);
> +
> +	/*
> +	 * An SFrame FRE is a variable length structure.  It includes the start
> +	 * IP offset, FRE info field, and all trailing the stack offsets.
> +	 */
> +	return (ip_offset_tsize + sizeof(fre_info)
> +		+ sframe_fre_stack_offsets_size(fre_info));
> +}
> +
> +/*
> + * Read an SFrame FRE which starts at location FRE_BUF.  The function
> + * updates FRE_SIZE to the size of the FRE as stored in the binary format.
> + *
> + * Returns SFRAME_ERR if failure.
> + */
> +static int sframe_sec_read_fre(const char *fre_buf, struct sframe_fre *frep,
> +			       unsigned int fre_type, size_t *fre_size)
> +{
> +	void *stack_offsets;
> +	size_t stack_offsets_sz;
> +	size_t ip_offset_tsize;
> +	size_t esz;
> +
> +	if (fre_buf == NULL || frep == NULL || fre_size == NULL)
> +		return SFRAME_ERR_INVAL;
> +
> +	/* Copy over the FRE start address. */
> +	sframe_decode_start_ip_offset(fre_buf, &frep->start_ip_offset,
> +				      fre_type);
> +
> +	ip_offset_tsize = sframe_fre_get_start_address_tsize(fre_type);
> +	/* PS: Note how this API works closely with SFrame binary format. */
> +	frep->fre_info = *(unsigned char *)(fre_buf + ip_offset_tsize);
> +
> +	memset(frep->fre_offsets, 0, MAX_STACK_OFFSET_NBYTES);
> +	/* Get stack offsets. */
> +	stack_offsets_sz = sframe_fre_stack_offsets_size(frep->fre_info);
> +	stack_offsets = ((unsigned char *)fre_buf + ip_offset_tsize
> +			 + sizeof(frep->fre_info));
> +	memcpy(frep->fre_offsets, stack_offsets, stack_offsets_sz);
> +
> +	esz = sframe_fre_vlen_size(frep, fre_type);
> +	*fre_size = esz;
> +
> +	return 0;
> +}
> +
> +static struct sframe_func_desc_entry *
> +sframe_sec_find_fde(struct sframe_sec *sfsec, int32_t addr, int *errp)
> +{
> +	struct sframe_header *header;
> +	struct sframe_func_desc_entry *fde;
> +	int low, high, cnt;
> +
> +	if (sfsec == NULL) {
> +		sframe_set_errno(errp, SFRAME_ERR_INVAL);
> +		return NULL;
> +	}
> +
> +	header = sframe_sec_get_header(sfsec);
> +	if (header == NULL || header->num_fdes == 0 || sfsec->fdes == NULL) {
> +		sframe_set_errno(errp, SFRAME_ERR_INIT_INVAL);
> +		return NULL;
> +	}
> +	/*
> +	 * Skip binary search if FDE sub-section is not sorted on PCs.  GNU ld
> +	 * sorts the FDEs on start PC by default though.
> +	 */
> +	if ((header->preamble.flags & SFRAME_F_FDE_SORTED) == 0) {
> +		sframe_set_errno(errp, SFRAME_ERR_FDE_NOTSORTED);
> +		return NULL;
> +	}
> +
> +	/* Find the FDE that may contain the addr. */
> +	fde = (struct sframe_func_desc_entry *)sfsec->fdes;
> +	low = 0;
> +	high = header->num_fdes;
> +	cnt = high;
> +	while (low <= high) {
> +		int mid = low + (high - low) / 2;
> +
> +		if (fde[mid].func_start_address == addr)
> +			return fde + mid;
> +
> +		if (fde[mid].func_start_address < addr) {
> +			if (mid == (cnt - 1))
> +				return fde + (cnt - 1);
> +			else if (fde[mid+1].func_start_address > addr)
> +				return fde + mid;
> +			low = mid + 1;
> +		} else
> +			high = mid - 1;
> +	}
> +
> +	sframe_set_errno(errp, SFRAME_ERR_FDE_NOTFOUND);
> +	return NULL;
> +}
> +
> +static int8_t sframe_sec_get_fixed_fp_offset(struct sframe_sec *sfsec)
> +{
> +	struct sframe_header *header = sframe_sec_get_header(sfsec);
> +	return header->cfa_fixed_fp_offset;
> +}
> +
> +static int8_t sframe_sec_get_fixed_ra_offset(struct sframe_sec *sfsec)
> +{
> +	struct sframe_header *header = sframe_sec_get_header(sfsec);
> +	return header->cfa_fixed_ra_offset;
> +}
> +
> +size_t sframe_sec_sizeof(void)
> +{
> +	return sizeof (struct sframe_sec);
> +}
> +
> +int sframe_sec_init(struct sframe_sec *sfsec, const char *sf_buf,
> +		    size_t sf_size)
> +{
> +	char *frame_buf;
> +	const struct sframe_preamble *preamble;
> +	struct sframe_header *header;
> +
> +	if ((sf_buf == NULL) || (sf_size < sizeof(struct sframe_header)))
> +		return SFRAME_ERR_INVAL;
> +
> +	/* Check for foreign endianness. */
> +	preamble = (const struct sframe_preamble *) sf_buf;
> +	if (preamble->magic != SFRAME_MAGIC)
> +		return SFRAME_ERR_INVAL;
> +
> +	/* Reset the SFrame section object. */
> +	memset(sfsec, 0, sizeof(struct sframe_sec));
> +
> +	frame_buf = (char *)sf_buf;
> +
> +	/* Initialize the reference to the SFrame header. */
> +	sfsec->header = *(struct sframe_header *) frame_buf;
> +	header = &sfsec->header;
> +	if (!sframe_header_sanity_check_p(header))
> +		return SFRAME_ERR_INVAL;
> +
> +	/* Initialize the referece to the SFrame FDE section. */
> +	frame_buf += sframe_sec_get_hdr_size(header);
> +	sfsec->fdes = frame_buf;
> +
> +	/* Initialize the reference to the the SFrame FRE section. */
> +	frame_buf += (header->num_fdes * sizeof(struct sframe_func_desc_entry));
> +	sfsec->fres = frame_buf;
> +
> +	sfsec->fre_nbytes = header->fre_len;
> +
> +	return 0;
> +}
> +
> +/*
> + * Find the SFrame Frame Row Entry which contains the PC.
> + * Returns error code if failure.
> + */
> +int sframe_sec_find_fre(struct sframe_sec *sfsec, int32_t pc,
> +			struct sframe_fre *frep)
> +{
> +	struct sframe_func_desc_entry *fdep;
> +	uint32_t start_address, i;
> +	struct sframe_fre cur_fre, next_fre;

Although the sframe_fre structure is relatively small, I always get nervous
when I see structures defined on the kernel stack.

Also, I personally prefer to keep items defined themselves. That is, one
variable per line.

> +	unsigned char *fres;
> +	unsigned int fre_type, fde_type;
> +	size_t esz;
> +	int err = 0;
> +	size_t size = 0;

And to organize it in an "upside-down x-mas" tree fashion:

	struct sframe_func_desc_entry *fdep;
	struct sframe_fre next_fre;
	struct sframe_fre cur_fre
	uint32_t start_address, i;
	unsigned int fre_type;
	unsigned int fde_type;
	unsigned char *fres;
	size_t size = 0;
	size_t esz;
	int err = 0;

Looks much nicer and easier to read that way ;-)

-- Steve


> +	/*
> +	 * For regular FDEs(i.e. fde_type SFRAME_FDE_TYPE_PCINC),
> +	 * where the start address in the FRE is an offset from start pc,
> +	 * use a bitmask with all bits set so that none of the address bits are
> +	 * ignored.  In this case, we need to return the FRE where
> +	 * (PC >= FRE_START_ADDR).
> +	 */
> +	uint64_t bitmask = 0xffffffff;
> +
> +	if ((sfsec == NULL) || (frep == NULL))
> +		return SFRAME_ERR_INVAL;
> +
> +	/* Find the FDE which contains the PC, then scan its FRE entries. */
> +	fdep = sframe_sec_find_fde(sfsec, pc, &err);
> +	if (fdep == NULL || sfsec->fres == NULL)
> +		return SFRAME_ERR_INIT_INVAL;
> +
> +	fre_type = sframe_get_fre_type(fdep);
> +	fde_type = sframe_get_fde_type(fdep);
> +
> +	/*
> +	 * For FDEs for repetitive pattern of insns, we need to return the FRE
> +	 * such that(PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK).
> +	 * so, update the bitmask to the start address.
> +	 */
> +	/* FIXME - the bitmask. */
> +	if (fde_type == SFRAME_FDE_TYPE_PCMASK)
> +		bitmask = 0xff;
> +
> +	fres = (unsigned char *)sfsec->fres + fdep->func_start_fre_off;
> +	for (i = 0; i < fdep->func_num_fres; i++) {
> +		err = sframe_sec_read_fre((const char *)fres, &next_fre,
> +					  fre_type, &esz);
> +		start_address = next_fre.start_ip_offset;
> +
> +		if (((fdep->func_start_address
> +		      + (int32_t)start_address) & bitmask) <= (pc & bitmask)) {
> +			sframe_fre_copy(&cur_fre, &next_fre);
> +
> +			/* Get the next FRE in sequence. */
> +			if (i < fdep->func_num_fres - 1) {
> +				fres += esz;
> +				err = sframe_sec_read_fre((const char *)fres,
> +							  &next_fre,
> +							  fre_type, &esz);
> +
> +				/* Sanity check the next FRE. */
> +				if (!sframe_fre_sanity_check_p(&next_fre))
> +					return SFRAME_ERR_FRE_INVAL;
> +
> +				size = next_fre.start_ip_offset;
> +			} else {
> +				size = fdep->func_size;
> +			}
> +
> +			if (((fdep->func_start_address
> +			      + (int32_t)size) & bitmask) > (pc & bitmask)) {
> +				/* Cur FRE contains the PC, return it. */
> +				sframe_fre_copy(frep, &cur_fre);
> +				return 0;
> +			}
> +		} else {
> +			return SFRAME_ERR_FRE_INVAL;
> +		}
> +	}
> +	return SFRAME_ERR_FDE_INVAL;
> +}
> +
> +unsigned int sframe_fre_get_base_reg_id(struct sframe_fre *frep,
> +					int *errp)
> +{
> +	if (!frep)
> +		return sframe_set_errno(errp, SFRAME_ERR_FRE_INVAL);
> +
> +	return SFRAME_V1_FRE_CFA_BASE_REG_ID(frep->fre_info);
> +}
> +
> +int32_t sframe_fre_get_cfa_offset(struct sframe_sec *sfsec __always_unused,
> +				  struct sframe_fre *frep, int *errp)
> +{
> +	return sframe_get_fre_offset(frep, SFRAME_FRE_CFA_OFFSET_IDX, errp);
> +}
> +
> +int32_t sframe_fre_get_fp_offset(struct sframe_sec *sfsec,
> +				 struct sframe_fre *frep, int *errp)
> +{
> +	uint32_t fp_offset_idx = 0;
> +	int8_t fp_offset = sframe_sec_get_fixed_fp_offset(sfsec);
> +	/*
> +	 * If the FP offset is not being tracked, return the fixed FP offset
> +	 * from the SFrame header.
> +	 */
> +	if (fp_offset != SFRAME_CFA_FIXED_FP_INVALID) {
> +		*errp = 0;
> +		return fp_offset;
> +	}
> +
> +	/*
> +	 * In some ABIs, the stack offset to recover RA (using the CFA) from is
> +	 * fixed (like AMD64).  In such cases, the stack offset to recover FP
> +	 * will appear at the second index.
> +	 */
> +	fp_offset_idx = ((sframe_sec_get_fixed_ra_offset(sfsec)
> +			  != SFRAME_CFA_FIXED_RA_INVALID)
> +			 ? SFRAME_FRE_RA_OFFSET_IDX
> +			 : SFRAME_FRE_FP_OFFSET_IDX);
> +	return sframe_get_fre_offset(frep, fp_offset_idx, errp);
> +}
> +
> +int32_t sframe_fre_get_ra_offset(struct sframe_sec *sfsec,
> +				 struct sframe_fre *frep, int *errp)
> +{
> +	int8_t ra_offset = sframe_sec_get_fixed_ra_offset(sfsec);
> +	/*
> +	 * If the RA offset was not being tracked, return the fixed RA offset
> +	 * from the SFrame header.
> +	 */
> +	if (ra_offset != SFRAME_CFA_FIXED_RA_INVALID) {
> +		*errp = 0;
> +		return ra_offset;
> +	}
> +
> +	/* Otherwise, get the RA offset from the FRE. */
> +	return sframe_get_fre_offset(frep, SFRAME_FRE_RA_OFFSET_IDX, errp);
> +}
> diff --git a/lib/sframe/sframe_read.h b/lib/sframe/sframe_read.h
> new file mode 100644
> index 000000000000..6632fb76d8b1
> --- /dev/null
> +++ b/lib/sframe/sframe_read.h
> @@ -0,0 +1,75 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (C) 2023, Oracle and/or its affiliates.
> + */
> +
> +#ifndef SFRAME_READ_H
> +#define SFRAME_READ_H
> +
> +#include <linux/types.h>
> +
> +#include "sframe.h"
> +
> +struct sframe_sec;
> +
> +#define MAX_NUM_STACK_OFFSETS 3
> +
> +#define MAX_STACK_OFFSET_NBYTES \
> +  ((SFRAME_FRE_OFFSET_4B * 2 * MAX_NUM_STACK_OFFSETS))
> +
> +/*
> + * SFrame Frame Row Entry for the SFrame reader.
> + * Providing such an abstraction helps decouple stack tracer from the
> + * binary format representation of the same.
> + */
> +struct sframe_fre {
> +	uint32_t start_ip_offset;
> +	unsigned char fre_offsets[MAX_STACK_OFFSET_NBYTES];
> +	unsigned char fre_info;
> +};
> +
> +#define SFRAME_ERR ((int) -1)
> +
> +/* SFrame version not supported. */
> +#define SFRAME_ERR_VERSION_INVAL	(-2000)
> +/* Corrupt SFrame. */
> +#define SFRAME_ERR_INVAL		(-2001)
> +/* SFrame Section Initialization Error. */
> +#define SFRAME_ERR_INIT_INVAL		(-2002)
> +/* Corrupt FDE. */
> +#define SFRAME_ERR_FDE_INVAL		(-2003)
> +/* Corrupt FRE. */
> +#define SFRAME_ERR_FRE_INVAL		(-2004)
> +/* FDE not found. */
> +#define SFRAME_ERR_FDE_NOTFOUND		(-2005)
> +/* FDEs not sorted. */
> +#define SFRAME_ERR_FDE_NOTSORTED	(-2006)
> +/* FRE not found. */
> +#define SFRAME_ERR_FRE_NOTFOUND		(-2007)
> +/* FRE offset not present. */
> +#define SFRAME_ERR_FREOFFSET_NOPRESENT	(-2008)
> +
> +extern size_t sframe_sec_sizeof(void);
> +
> +extern int sframe_sec_init(struct sframe_sec *sfsec, const char *cf_buf,
> +			   size_t cf_size);
> +
> +extern int sframe_sec_find_fre(struct sframe_sec *ctx, int32_t pc,
> +			   struct sframe_fre *frep);
> +
> +extern unsigned int sframe_fre_get_base_reg_id(struct sframe_fre *fre,
> +					       int *errp);
> +extern int32_t sframe_fre_get_cfa_offset(struct sframe_sec *dtcx,
> +					 struct sframe_fre *fre,
> +					 int *errp);
> +extern int32_t sframe_fre_get_fp_offset(struct sframe_sec *sfsec,
> +					struct sframe_fre *fre,
> +					int *errp);
> +extern int32_t sframe_fre_get_ra_offset(struct sframe_sec *sfsec,
> +					struct sframe_fre *fre,
> +					int *errp);
> +extern bool sframe_fre_get_ra_mangled_p(struct sframe_sec *sfsec,
> +					struct sframe_fre *fre,
> +					int *errp);
> +
> +#endif /* SFRAME_READ_H */


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 4/5] sframe: add an SFrame format stack tracer
  2023-05-01 20:04 ` [POC 4/5] sframe: add an SFrame format stack tracer Indu Bhagat
@ 2023-05-01 23:00   ` Steven Rostedt
  2023-05-02  6:16     ` Indu Bhagat
  2023-05-02  8:53   ` Peter Zijlstra
  2023-05-02  9:04   ` Peter Zijlstra
  2 siblings, 1 reply; 32+ messages in thread
From: Steven Rostedt @ 2023-05-01 23:00 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: linux-toolchains, daandemeyer, andrii, kris.van.hees,
	elena.zannoni, nick.alcock

On Mon,  1 May 2023 13:04:09 -0700
Indu Bhagat <indu.bhagat@oracle.com> wrote:

> This patch adds an SFrame format based stack tracer.
> 
> The files iterate_phdr.c, iterate_phdr.h implement a dl_iterate_phdr()
> like functionality.
> 
> The SFrame format based stack tracer is implemented in the
> sframe_unwind.c with architecture specific bits in the
> arch/arm64/include/asm/sframe_regs.h and
> arch/x86/include/asm/sframe_regs.h.  Please note that the SFrame format
> is supported for x86_64 (AMD64 ABI) and aarch64 (AAPCS64 ABI) only at
> this time.
> 
> The files sframe_state.[ch] implement the SFrame state management APIs.
> 
> Some aspects of the implementation are "POC like". These will need to
> addressed for the implementation to become more palatable:
> - dealing with only Elf64_Phdr (no Elf32_Phdr) at this time, and other
>   TODOs in the iterate_phdr.c,
> - detecting whether a program did a dlopen/dlclose,
> - code stubs around user space memory access (.sframe section, ELF hdr
>   etc.) by the kernel need careful review.
> 
> There are more aspects than above; The intention of this patch set is to
> help drive the discussion on how to best incorporate an SFrame-based user
> space unwinder in the kernel.
> 
> Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
> ---
>  arch/arm64/include/asm/sframe_regs.h |  37 +++
>  arch/x86/include/asm/sframe_regs.h   |  34 +++
>  include/sframe/sframe_regs.h         |  11 +
>  include/sframe/sframe_unwind.h       |  62 ++++
>  lib/sframe/Makefile                  |   8 +-
>  lib/sframe/iterate_phdr.c            | 113 +++++++
>  lib/sframe/iterate_phdr.h            |  34 +++
>  lib/sframe/sframe_state.c            | 424 +++++++++++++++++++++++++++
>  lib/sframe/sframe_state.h            |  80 +++++
>  lib/sframe/sframe_unwind.c           | 208 +++++++++++++
>  10 files changed, 1010 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/include/asm/sframe_regs.h
>  create mode 100644 arch/x86/include/asm/sframe_regs.h
>  create mode 100644 include/sframe/sframe_regs.h
>  create mode 100644 include/sframe/sframe_unwind.h
>  create mode 100644 lib/sframe/iterate_phdr.c
>  create mode 100644 lib/sframe/iterate_phdr.h
>  create mode 100644 lib/sframe/sframe_state.c
>  create mode 100644 lib/sframe/sframe_state.h
>  create mode 100644 lib/sframe/sframe_unwind.c
> 
> diff --git a/arch/arm64/include/asm/sframe_regs.h b/arch/arm64/include/asm/sframe_regs.h
> new file mode 100644
> index 000000000000..ae9ab9d5d3c1
> --- /dev/null
> +++ b/arch/arm64/include/asm/sframe_regs.h
> @@ -0,0 +1,37 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (C) 2023, Oracle and/or its affiliates.
> + */
> +
> +#ifdef ASM_ARM64_SFRAME_REGS_H
> +#define ASM_ARM64_SFRAME_REGS_H
> +
> +#define STACK_ACCESS_LEN 8
> +
> +static inline uint64_t
> +get_ptregs_ip(struct pt_regs *regs)
> +{
> +	return regs->pc;
> +}
> +
> +static inline uint64_t
> +get_ptregs_sp(struct pt_regs *regs)
> +{
> +	return regs->sp;
> +}
> +
> +static inline uint64_t
> +get_ptregs_fp(struct pt_regs *regs)
> +{
> +#define UNWIND_AARCH64_X29              29      /* 64-bit frame pointer.  */
> +	return (uint64_t)regs->regs[UNWIND_AARCH64_X29];
> +}
> +
> +static inline uint64_t
> +get_ptregs_ra(struct pt_regs *regs)
> +{
> +#define UNWIND_AARCH64_X30              30      /* 64-bit link pointer.  */
> +	return (uint64_t)regs->regs[UNWIND_AARCH64_X30];
> +}
> +
> +#endif /* ASM_ARM64_SFRAME_REGS_H */
> diff --git a/arch/x86/include/asm/sframe_regs.h b/arch/x86/include/asm/sframe_regs.h
> new file mode 100644
> index 000000000000..99f84955854a
> --- /dev/null
> +++ b/arch/x86/include/asm/sframe_regs.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (C) 2023, Oracle and/or its affiliates.
> + */
> +
> +#ifndef ASM_X86_SFRAME_REGS_H
> +#define ASM_X86_SFRAME_REGS_H
> +
> +#define STACK_ACCESS_LEN 8
> +
> +static inline uint64_t
> +get_ptregs_ip(struct pt_regs *regs)
> +{
> +	return (uint64_t)regs->ip;
> +}
> +
> +static inline uint64_t
> +get_ptregs_sp(struct pt_regs *regs)
> +{
> +	return (uint64_t)regs->sp;
> +}
> +
> +static inline uint64_t
> +get_ptregs_fp(struct pt_regs *regs)
> +{
> +	return (uint64_t)regs->bp;
> +}
> +
> +static inline uint64_t
> +get_ptregs_ra(struct pt_regs *regs)
> +{
> +	return 0; /* SFRAME_CFA_FIXED_RA_INVALID */
> +}
> +#endif /* ASM_X86_SFRAME_REGS_H */
> diff --git a/include/sframe/sframe_regs.h b/include/sframe/sframe_regs.h
> new file mode 100644
> index 000000000000..32b67f7a7c78
> --- /dev/null
> +++ b/include/sframe/sframe_regs.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (C) 2023, Oracle and/or its affiliates.
> + */
> +
> +#ifndef _SFRAME_REGS_H
> +#define _SFRAME_REGS_H
> +
> +#include <asm/sframe_regs.h>
> +
> +#endif /* _SFRAME_REGS_H */
> diff --git a/include/sframe/sframe_unwind.h b/include/sframe/sframe_unwind.h
> new file mode 100644
> index 000000000000..3e2c12816b60
> --- /dev/null
> +++ b/include/sframe/sframe_unwind.h

Also, these should probably go into include/linux, Unless there's going to
be a lot more header files.

> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (C) 2023, Oracle and/or its affiliates.
> + */
> +
> +#ifndef _SFRAME_UNWIND_H
> +#define _SFRAME_UNWIND_H
> +
> +#include <linux/sched.h>
> +#include <linux/perf_event.h>
> +
> +#define PT_GNU_SFRAME  0x6474e554
> +
> +/*
> + * State used for SFrame-based stack tracing for a user space task.
> + */
> +struct user_unwind_state {
> +	uint64_t pc, sp, fp, ra;

I know this is POC, but please make each structure field a separate item.
Also, should be tab delimited.

> +	enum stack_type stype;
> +	struct task_struct *task;
> +	bool error;
> +};

Also swap the task and the stype, as the pointer to the task will create a
hole in the structure.

struct user_unwind_state {
	uint64_t		pc;
	uint64_t		sp;
	uint64_t		fp;
	uint64_t		ra;
	struct task_stuct	*task;
	enum stack_type		stype;
	bool			error;
};

> +
> +/*
> + * APIs for an SFrame based stack tracer.
> + */
> +
> +void sframe_unwind_start(struct user_unwind_state *state,
> +			 struct task_struct *task, struct pt_regs *regs);
> +bool sframe_unwind_next_frame(struct user_unwind_state *state);
> +uint64_t sframe_unwind_get_return_address(struct user_unwind_state *state);
> +
> +static inline bool sframe_unwind_done(struct user_unwind_state *state)
> +{
> +	return state->stype == STACK_TYPE_UNKNOWN;
> +}
> +
> +static inline bool sframe_unwind_error(struct user_unwind_state *state)
> +{
> +	return state->error;
> +}
> +
> +/*
> + * APIs to manage the SFrame state per task for stack tracing.
> + */
> +
> +extern struct sframe_state *unwind_sframe_state_alloc(struct task_struct *task);
> +extern int unwind_sframe_state_update(struct task_struct *task);
> +extern void unwind_sframe_state_cleanup(struct task_struct *task);
> +
> +extern bool unwind_sframe_state_valid_p(struct sframe_state *sfstate);
> +extern bool unwind_sframe_state_ready_p(struct sframe_state *sftate);
> +
> +/*
> + * Get the callchain using SFrame unwind info for the given task.
> + */
> +extern int
> +sframe_callchain_user(struct task_struct *task,
> +		      struct perf_callchain_entry_ctx *entry,
> +		      struct pt_regs *regs);


I plan on using this without any perf involvement, I'd like to keep perf
separate from the sframe logic. As I mentioned in a previous email, I
expect sframe to have callbacks. So the callchain format should be defined
by sframe, and not reuse perf.

> +
> +#endif /* _SFRAME_UNWIND_H */
> diff --git a/lib/sframe/Makefile b/lib/sframe/Makefile
> index 4e4291d9294f..5ee9e3e7ec93 100644
> --- a/lib/sframe/Makefile
> +++ b/lib/sframe/Makefile
> @@ -1,5 +1,11 @@
>  # SPDX-License-Identifier: GPL-2.0
>  ##################################
> -obj-$(CONFIG_USER_UNWINDER_SFRAME) += sframe_read.o \
> +obj-$(CONFIG_USER_UNWINDER_SFRAME) += iterate_phdr.o \
> +				      sframe_read.o \
> +				      sframe_state.o \
> +				      sframe_unwind.o

Ah, the backslash is fixed here.

>  
> +CFLAGS_iterate_phdr.o += -I $(srctree)/lib/sframe/ -Wno-error=declaration-after-statement
>  CFLAGS_sframe_read.o += -I $(srctree)/lib/sframe/
> +CFLAGS_sframe_state.o += -I $(srctree)/lib/sframe/
> +CFLAGS_sframe_unwind.o += -I $(srctree)/lib/sframe/
> diff --git a/lib/sframe/iterate_phdr.c b/lib/sframe/iterate_phdr.c
> new file mode 100644
> index 000000000000..c10d590ecc67
> --- /dev/null
> +++ b/lib/sframe/iterate_phdr.c
> @@ -0,0 +1,113 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2023, Oracle and/or its affiliates.
> + */
> +
> +#include <linux/elf.h>
> +#include <linux/mm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/mm_types.h>
> +
> +#include "iterate_phdr.h"
> +
> +/*
> + * Iterate over the task's memory mappings and find the ELF headers.
> + *
> + * This is expected to be called from perf_callchain_user(), so user process
> + * context is expected.

My thought is that this will be called in the ptrace path (not the perf
path), so yes, it will be in user process context.

> + */
> +
> +int iterate_phdr(int (*callback)(struct phdr_info *info,
> +				 struct task_struct *task,
> +				 void *data),
> +		 struct task_struct *task, void *data)
> +{
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma_mt;
> +	struct page *page;
> +
> +	Elf64_Ehdr *ehdr;
> +	struct phdr_info phinfo;
> +
> +	int ret = 0, res = 0;
> +	int err = 0;
> +	bool first = true;
> +
> +	memset(&phinfo, 0, sizeof(struct phdr_info));
> +
> +	mm = task->mm;
> +
> +	MA_STATE(mas, &mm->mm_mt, 0, 0);
> +

So this is the code I want to discuss at LSFMM :-) As there will be more
experts about this than what I know.

Let me go and start making the infrastructure to encompass this.

-- Steve


> +	mas_for_each(&mas, vma_mt, ULONG_MAX) {
> +		/* ELF header has a fixed place in the file, starting at offset
> +		 * zero.
> +		 */
> +		if (vma_mt->vm_pgoff)
> +			continue;
> +
> +		/* For the callback to infer if its the prog or DSO we are
> +		 * dealing with.
> +		 */
> +		phinfo.pi_prog = first;
> +		first = false;
> +		/* FIXME TODO
> +		 *  - This code assumes 64-bit ELF by using Elf64_Ehdr.
> +		 *  - Detect the case when ELF program headers to be of
> +		 * size > 1 page.
> +		 */
> +
> +		/* FIXME TODO KERNEL
> +		 *  - get_user_pages_WHAT, which API.
> +		 *  What flags ? Is this correct ?
> +		 */
> +		ret = get_user_pages_remote(mm, vma_mt->vm_start, 1, FOLL_GET,
> +					    &page, &vma_mt, NULL);
> +		if (ret <= 0)
> +			continue;
> +
> +		/* The first page must have the ELF header. */
> +		ehdr = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> +		if (!ehdr)
> +			goto put_page;
> +
> +		/* Check for magic bytes to make sure this is ehdr. */
> +		err = 0;
> +		err |= ((ehdr->e_ident[EI_MAG0] != ELFMAG0)
> +			|| (ehdr->e_ident[EI_MAG1] != ELFMAG1)
> +			|| (ehdr->e_ident[EI_MAG2] != ELFMAG2)
> +			|| (ehdr->e_ident[EI_MAG3] != ELFMAG3));
> +		if (err)
> +			goto unmap;
> +
> +		/*
> +		 * FIXME TODO handle the case when number of program headers is
> +		 * greater than or equal to PN_XNUM later.
> +		 */
> +		if (ehdr->e_phnum == PN_XNUM)
> +			goto unmap;
> +		/*
> +		 * FIXME TODO handle the case when Elf phdrs span more than one
> +		 * page later ?
> +		 */
> +		if ((sizeof(Elf64_Ehdr) + ehdr->e_phentsize * ehdr->e_phnum)
> +		    > PAGE_SIZE)
> +			goto unmap;
> +
> +		/* Save the location of program headers and the phnum. */
> +		phinfo.pi_addr = vma_mt->vm_start;
> +		phinfo.pi_phdr = (void *)ehdr + ehdr->e_phoff;
> +		phinfo.pi_phnum = ehdr->e_phnum;
> +
> +		res = callback(&phinfo, task, data);
> +unmap:
> +		vunmap(ehdr);
> +put_page:
> +		put_page(page);
> +
> +		if (res < 0)
> +			break;
> +	}
> +
> +	return res;
> +}
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 5/5] x86_64: invoke SFrame based stack tracer for user space
  2023-05-01 20:04 ` [POC 5/5] x86_64: invoke SFrame based stack tracer for user space Indu Bhagat
@ 2023-05-01 23:11   ` Steven Rostedt
  2023-05-02 10:53   ` Peter Zijlstra
  1 sibling, 0 replies; 32+ messages in thread
From: Steven Rostedt @ 2023-05-01 23:11 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: linux-toolchains, daandemeyer, andrii, kris.van.hees,
	elena.zannoni, nick.alcock

On Mon,  1 May 2023 13:04:10 -0700
Indu Bhagat <indu.bhagat@oracle.com> wrote:

>  void
>  perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
>  {
>  	struct stack_frame frame;
>  	const struct stack_frame __user *fp;
> +	bool sframe_avail;
>  
>  	if (perf_guest_state()) {
>  		/* TODO: We don't support guest os callchain now */
> @@ -2887,7 +2930,15 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
>  	if (perf_callchain_user32(regs, entry))
>  		return;
>  
> +	sframe_avail = check_sframe_state_p (current);
> +
>  	pagefault_disable();
> +
> +	if (sframe_avail && !sframe_callchain_user (current,1 entry, regs)) {
> +		pagefault_enable();
> +		return;
> +	}
> +

So my idea is to have a more generic approach for user space stack traces
that perf calls into, instead of having sframe hooking into perf.

-- Steve

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 3/5] sframe: add new SFrame library
  2023-05-01 22:40   ` Steven Rostedt
@ 2023-05-02  5:07     ` Indu Bhagat
  2023-05-02  8:46     ` Peter Zijlstra
  1 sibling, 0 replies; 32+ messages in thread
From: Indu Bhagat @ 2023-05-02  5:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-toolchains, daandemeyer, andrii, kris.van.hees,
	elena.zannoni, nick.alcock

On 5/1/23 15:40, Steven Rostedt wrote:
> On Mon,  1 May 2023 13:04:08 -0700
> Indu Bhagat <indu.bhagat@oracle.com> wrote:
> 
>> This patch adds an implementation to read SFrame stack trace data from
>> a .sframe section.  Some APIs are also provided to find stack tracing
>> information per PC, e.g., given a PC, find the SFrame FRE.
>>
>> These routines are provided in the sframe_read.h and sframe_read.c.
>>
>> This implmentation is malloc-free.
>>
>> Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
>> ---
>>   lib/Makefile             |   1 +
>>   lib/sframe/Makefile      |   5 +
>>   lib/sframe/sframe.h      | 263 +++++++++++++++++++++
>>   lib/sframe/sframe_read.c | 498 +++++++++++++++++++++++++++++++++++++++
>>   lib/sframe/sframe_read.h |  75 ++++++
>>   5 files changed, 842 insertions(+)
>>   create mode 100644 lib/sframe/Makefile
>>   create mode 100644 lib/sframe/sframe.h
>>   create mode 100644 lib/sframe/sframe_read.c
>>   create mode 100644 lib/sframe/sframe_read.h
>>
>> diff --git a/lib/Makefile b/lib/Makefile
>> index 876fcdeae34e..cb02d16dbffd 100644
>> --- a/lib/Makefile
>> +++ b/lib/Makefile
>> @@ -198,6 +198,7 @@ obj-$(CONFIG_ZSTD_COMPRESS) += zstd/
>>   obj-$(CONFIG_ZSTD_DECOMPRESS) += zstd/
>>   obj-$(CONFIG_XZ_DEC) += xz/
>>   obj-$(CONFIG_RAID6_PQ) += raid6/
>> +obj-$(CONFIG_USER_UNWINDER_SFRAME) += sframe/
>>   
>>   lib-$(CONFIG_DECOMPRESS_GZIP) += decompress_inflate.o
>>   lib-$(CONFIG_DECOMPRESS_BZIP2) += decompress_bunzip2.o
>> diff --git a/lib/sframe/Makefile b/lib/sframe/Makefile
>> new file mode 100644
>> index 000000000000..4e4291d9294f
>> --- /dev/null
>> +++ b/lib/sframe/Makefile
>> @@ -0,0 +1,5 @@
>> +# SPDX-License-Identifier: GPL-2.0
>> +##################################
>> +obj-$(CONFIG_USER_UNWINDER_SFRAME) += sframe_read.o \
> 
> The ending backslash is only needed if there's a line wrap, which you don't
> have here.
> 
>> +
>> +CFLAGS_sframe_read.o += -I $(srctree)/lib/sframe/
>> diff --git a/lib/sframe/sframe.h b/lib/sframe/sframe.h
>> new file mode 100644
>> index 000000000000..b1290e92839a
>> --- /dev/null
>> +++ b/lib/sframe/sframe.h
>> @@ -0,0 +1,263 @@
>> +/* SPDX-License-Identifier: GPL-2.0-or-later */
>> +/*
>> + * Copyright (C) 2023, Oracle and/or its affiliates.
>> + */
>> +
>> +#ifndef	SFRAME_H
>> +#define	SFRAME_H
>> +
>> +#include <linux/types.h>
>> +
>> +/* This file contains definitions for the SFrame stack tracing format, which is
>> + * documented at https://sourceware.org/binutils/docs */
>> +
>> +#define SFRAME_VERSION_1	1
>> +#define SFRAME_MAGIC		0xdee2
>> +#define SFRAME_VERSION	SFRAME_VERSION_1
>> +
>> +/* Function Descriptor Entries are sorted on PC. */
>> +#define SFRAME_F_FDE_SORTED	0x1
>> +/* Frame-pointer based stack tracing. Defined, but not set. */
>> +#define SFRAME_F_FRAME_POINTER 0x2
>> +
>> +#define SFRAME_CFA_FIXED_FP_INVALID 0
>> +#define SFRAME_CFA_FIXED_RA_INVALID 0
>> +
>> +/* Supported ABIs/Arch. */
>> +#define SFRAME_ABI_AARCH64_ENDIAN_BIG      1 /* AARCH64 big endian. */
>> +#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE   2 /* AARCH64 little endian. */
>> +#define SFRAME_ABI_AMD64_ENDIAN_LITTLE     3 /* AMD64 little endian. */
>> +
>> +/* SFrame FRE types. */
>> +#define SFRAME_FRE_TYPE_ADDR1	0
>> +#define SFRAME_FRE_TYPE_ADDR2	1
>> +#define SFRAME_FRE_TYPE_ADDR4	2
>> +
>> +/*
>> + * SFrame Function Descriptor Entry types.
>> + *
>> + * The SFrame format has two possible representations for functions.  The
>> + * choice of which type to use is made according to the instruction patterns
>> + * in the relevant program stub.
>> + */
>> +
>> +/* Unwinders perform a (PC >= FRE_START_ADDR) to look up a matching FRE. */
>> +#define SFRAME_FDE_TYPE_PCINC   0
>> +/*
>> + * Unwinders perform a (PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK)
>> + * to look up a matching FRE.  Typical usecases are pltN entries, trampolines
>> + * etc.
>> + */
>> +#define SFRAME_FDE_TYPE_PCMASK  1
>> +
>> +struct sframe_preamble
>> +{
> 
> Note, for Linux coding style, structs should have the '{' on the same line:
> 
> struct sframe_preamble {
> 
> This is required because a common way to find a struct definition is to
> search the name followed by a '{'.
> 
> 
>> +	/* Magic number (SFRAME_MAGIC). */
>> +	uint16_t magic;
>> +	/* Data format version number (SFRAME_VERSION). */
>> +	uint8_t version;
>> +	/* Various flags. */
>> +	uint8_t flags;
>> +} __packed;
>> +
>> +struct sframe_header
>> +{
> 
> Here too.
> 
>> +	struct sframe_preamble preamble;
>> +	/* Information about the arch (endianness) and ABI. */
>> +	uint8_t abi_arch;
>> +	/*
>> +	 * Offset for the Frame Pointer (FP) from CFA may be fixed for some
>> +	 * ABIs (e.g, in AMD64 when -fno-omit-frame-pointer is used).  When fixed,
>> +	 * this field specifies the fixed stack frame offset and the individual
>> +	 * FREs do not need to track it.  When not fixed, it is set to
>> +	 * SFRAME_CFA_FIXED_FP_INVALID, and the individual FREs may provide
>> +	 * the applicable stack frame offset, if any.
>> +	 */
>> +	int8_t cfa_fixed_fp_offset;
>> +	/*
>> +	 * Offset for the Return Address from CFA is fixed for some ABIs
>> +	 * (e.g., AMD64 has it as CFA-8).  When fixed, the header specifies the
>> +	 * fixed stack frame offset and the individual FREs do not track it.  When
>> +	 * not fixed, it is set to SFRAME_CFA_FIXED_RA_INVALID, and individual
>> +	 * FREs provide the applicable stack frame offset, if any.
>> +	 */
>> +	int8_t cfa_fixed_ra_offset;
>> +	/*
>> +	 * Number of bytes making up the auxiliary header, if any.
>> +	 * Some ABI/arch, in the future, may use this space for extending the
>> +	 * information in SFrame header.  Auxiliary header is contained in
>> +	 * bytes sequentially following the sframe_header.
>> +	 */
>> +	uint8_t auxhdr_len;
>> +	/* Number of SFrame FDEs in this SFrame section. */
>> +	uint32_t num_fdes;
>> +	/* Number of SFrame Frame Row Entries. */
>> +	uint32_t num_fres;
>> +	/* Number of bytes in the SFrame Frame Row Entry section. */
>> +	uint32_t fre_len;
>> +	/* Offset of SFrame Function Descriptor Entry section. */
>> +	uint32_t fdeoff;
>> +	/* Offset of SFrame Frame Row Entry section. */
>> +	uint32_t freoff;
>> +} __packed;
>> +
>> +#define SFRAME_V1_HDR_SIZE(sframe_hdr)	\
>> +  ((sizeof (struct sframe_header) + (sframe_hdr).auxhdr_len))
>> +
>> +/* Two possible keys for executable (instruction) pointers signing. */
>> +#define SFRAME_AARCH64_PAUTH_KEY_A    0 /* Key A. */
>> +#define SFRAME_AARCH64_PAUTH_KEY_B    1 /* Key B. */
>> +
>> +struct sframe_func_desc_entry
>> +{
> 
> Basically all of them.
> 
>> +	/*
>> +	 * Function start address.  Encoded as a signed offset, relative to the
>> +	 * beginning of the current FDE.
>> +	 */
>> +	int32_t func_start_address;
>> +	/* Size of the function in bytes. */
>> +	uint32_t func_size;
>> +	/*
>> +	 * Offset of the first SFrame Frame Row Entry of the function, relative to the
>> +	 * beginning of the SFrame Frame Row Entry sub-section.
>> +	 */
>> +	uint32_t func_start_fre_off;
>> +	/* Number of frame row entries for the function. */
>> +	uint32_t func_num_fres;
>> +	/*
>> +	 * Additional information for deciphering the unwind information for the
>> +	 * function.
>> +	 *   - 4-bits: Identify the FRE type used for the function.
>> +	 *   - 1-bit: Identify the FDE type of the function - mask or inc.
>> +	 *   - 1-bit: PAC authorization A/B key (aarch64).
>> +	 *   - 2-bits: Unused.
>> +	 * --------------------------------------------------------------------------
>> +	 * |     Unused    |  PAC auth A/B key (aarch64) |  FDE type |   FRE type   |
>> +	 * |               |        Unused (amd64)       |           |              |
>> +	 * --------------------------------------------------------------------------
>> +	 * 8               6                             5           4              0
>> +	 */
>> +	uint8_t func_info;
>> +} __packed;
>> +
>> +/* Note: Set PAC auth key to SFRAME_AARCH64_PAUTH_KEY_A by default.  */
>> +#define SFRAME_V1_FUNC_INFO(fde_type, fre_enc_type) \
>> +  (((SFRAME_AARCH64_PAUTH_KEY_A & 0x1) << 5) | \
>> +   (((fde_type) & 0x1) << 4) | ((fre_enc_type) & 0xf))
>> +
>> +#define SFRAME_V1_FUNC_FRE_TYPE(data)	  ((data) & 0xf)
>> +#define SFRAME_V1_FUNC_FDE_TYPE(data)	  (((data) >> 4) & 0x1)
>> +#define SFRAME_V1_FUNC_PAUTH_KEY(data)	  (((data) >> 5) & 0x1)
>> +
>> +/*
>> + * Size of stack frame offsets in an SFrame Frame Row Entry.  A single
>> + * SFrame FRE has all offsets of the same size.  Offset size may vary
>> + * across frame row entries.
>> + */
>> +#define SFRAME_FRE_OFFSET_1B	  0
>> +#define SFRAME_FRE_OFFSET_2B	  1
>> +#define SFRAME_FRE_OFFSET_4B	  2
>> +
>> +/* An SFrame Frame Row Entry can be SP or FP based.  */
>> +#define SFRAME_BASE_REG_FP	0
>> +#define SFRAME_BASE_REG_SP	1
>> +
>> +/*
>> + * The index at which a specific offset is presented in the variable length
>> + * bytes of an FRE.
>> + */
>> +#define SFRAME_FRE_CFA_OFFSET_IDX   0
>> +/*
>> + * The RA stack offset, if present, will always be at index 1 in the variable
>> + * length bytes of the FRE.
>> + */
>> +#define SFRAME_FRE_RA_OFFSET_IDX    1
>> +/*
>> + * The FP stack offset may appear at offset 1 or 2, depending on the ABI as RA
>> + * may or may not be tracked.
>> + */
>> +#define SFRAME_FRE_FP_OFFSET_IDX    2
>> +
>> +struct sframe_fre_info
>> +{
>> +  /* Information about
>> +   *   - 1 bit: base reg for CFA
>> +   *   - 4 bits: Number of offsets (N).  A value of upto 3 is allowed to track
>> +   *     all three of CFA, FP and RA (fixed implicit order).
>> +   *   - 2 bits: information about size of the offsets (S) in bytes.
>> +   *     Valid values are SFRAME_FRE_OFFSET_1B, SFRAME_FRE_OFFSET_2B,
>> +   *     SFRAME_FRE_OFFSET_4B
>> +   *   - 1 bit: Mangled RA state bit (aarch64 only).
>> +   * -----------------------------------------------------------------------------------
>> +   * | Mangled-RA (aarch64) |  Size of offsets   |   Number of offsets    |   base_reg |
>> +   * |  Unused (amd64)      |                    |                        |            |
>> +   * -----------------------------------------------------------------------------------
>> +   * 8                     7                    5                        1            0
>> +   */
>> +  uint8_t fre_info;
>> +};
>> +
>> +/* Macros to compose and decompose FRE info. */
>> +
>> +/* Note: Set mangled_ra_p to zero by default. */
>> +#define SFRAME_V1_FRE_INFO(base_reg_id, offset_num, offset_size) \
>> +  (((0 & 0x1) << 7) | (((offset_size) & 0x3) << 5) | \
>> +   (((offset_num) & 0xf) << 1) | ((base_reg_id) & 0x1))
>> +
>> +/* Set the mangled_ra_p bit as indicated. */
>> +#define SFRAME_V1_FRE_INFO_UPDATE_MANGLED_RA_P(mangled_ra_p, fre_info) \
>> +  ((((mangled_ra_p) & 0x1) << 7) | ((fre_info) & 0x7f))
>> +
>> +#define SFRAME_V1_FRE_CFA_BASE_REG_ID(data)	  ((data) & 0x1)
>> +#define SFRAME_V1_FRE_OFFSET_COUNT(data)	  (((data) >> 1) & 0xf)
>> +#define SFRAME_V1_FRE_OFFSET_SIZE(data)		  (((data) >> 5) & 0x3)
>> +#define SFRAME_V1_FRE_MANGLED_RA_P(data)	  (((data) >> 7) & 0x1)
>> +
>> +/* SFrame Frame Row Entry definitions. */
>> +
>> +/*
>> + * Used when SFRAME_FRE_TYPE_ADDR1 is specified as FRE type.
>> + * Upper limit of start address in sframe_frame_row_entry_addr1 if 0x100 (not
>> + * inclusive).
>> + */
>> +struct sframe_frame_row_entry_addr1
>> +{
>> +	/*
>> +	 * Start address of the frame row entry.  Encoded as an 1-byte unsigned
>> +	 * offset, relative to the start address of the function.
>> +	 */
>> +	uint8_t fre_start_ip_offset;
>> +	struct sframe_fre_info fre_info;
>> +} __packed;
>> +
>> +/*
>> + * Used when SFRAME_FRE_TYPE_ADDR2 is specified as FRE type.
>> + * Upper limit of start address in sframe_frame_row_entry_addr2 is 0x10000 (not
>> + * inclusive).
>> + */
>> +struct sframe_frame_row_entry_addr2
>> +{
>> +	/*
>> +	 * Start address of the frame row entry.  Encoded as an 2-byte unsigned
>> +	 * offset, relative to the start address of the function.
>> +	 */
>> +	uint16_t fre_start_ip_offset;
>> +	struct sframe_fre_info fre_info;
>> +} __packed;
>> +
>> +/*
>> + * Used when SFRAME_FRE_TYPE_ADDR4 is specified as FRE type.
>> + * Upper limit of start address in sframe_frame_row_entry_addr2
>> + * is 0x100000000 (not inclusive).
>> + */
>> +struct sframe_frame_row_entry_addr4
>> +{
>> +	/*
>> +	 * Start address of the frame row entry.  Encoded as a 4-byte unsigned
>> +	 * offset, relative to the start address of the function.
>> +	 */
>> +	uint32_t fre_start_ip_offset;
>> +	struct sframe_fre_info fre_info;
>> +} __packed;
>> +
>> +#endif	/* SFRAME_H */
>> diff --git a/lib/sframe/sframe_read.c b/lib/sframe/sframe_read.c
>> new file mode 100644
>> index 000000000000..9d6558d62c54
>> --- /dev/null
>> +++ b/lib/sframe/sframe_read.c
>> @@ -0,0 +1,498 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * Copyright (C) 2023, Oracle and/or its affiliates.
>> + */
>> +
>> +#include <linux/string.h>
>> +
>> +#include "sframe_read.h"
>> +
>> +struct sframe_sec {
>> +	/* SFrame header. */
>> +	struct sframe_header header;
>> +	/* SFrame Function Desc Entries. */
>> +	void *fdes;
>> +	/* SFrame Frame Row Entries. */
>> +	void *fres;
>> +	/* Number of bytes needed for SFrame FREs. */
>> +	uint32_t fre_nbytes;
>> +};
>> +
>> +static int sframe_set_errno(int *errp, int error)
>> +{
>> +	if (errp != NULL)
>> +		*errp = error;
>> +	return SFRAME_ERR;
>> +}
>> +
>> +static uint32_t sframe_sec_get_hdr_size(struct sframe_header *sfh)
>> +{
>> +	return SFRAME_V1_HDR_SIZE(*sfh);
>> +}
>> +
>> +static unsigned int sframe_fre_get_offset_count(unsigned char fre_info)
>> +{
>> +	return SFRAME_V1_FRE_OFFSET_COUNT(fre_info);
>> +}
>> +
>> +static unsigned int sframe_fre_get_offset_size(unsigned char fre_info)
>> +{
>> +	return SFRAME_V1_FRE_OFFSET_SIZE(fre_info);
>> +}
>> +
>> +static unsigned int sframe_get_fre_type(struct sframe_func_desc_entry *fdep)
>> +{
>> +	return (fdep) ? SFRAME_V1_FUNC_FRE_TYPE(fdep->func_info) : 0;
>> +}
>> +
>> +static unsigned int sframe_get_fde_type(struct sframe_func_desc_entry *fdep)
>> +{
>> +	return (fdep) ? SFRAME_V1_FUNC_FDE_TYPE(fdep->func_info) : 0;
>> +}
>> +
>> +static bool sframe_header_sanity_check_p(struct sframe_header *hp)
>> +{
>> +	unsigned char all_flags = SFRAME_F_FDE_SORTED | SFRAME_F_FRAME_POINTER;
> 
> Add a space here.
> 
>> +	/* Check that the preamble is valid. */
>> +	if ((hp->preamble.magic != SFRAME_MAGIC)
>> +	    || (hp->preamble.version != SFRAME_VERSION)
>> +	    || ((hp->preamble.flags | all_flags) != all_flags))
>> +		return false;
>> +
>> +	/* Check that the offsets are valid. */
>> +	if (hp->fdeoff > hp->freoff)
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static bool sframe_fre_sanity_check_p(struct sframe_fre *frep)
>> +{
>> +	unsigned int offset_size, offset_cnt;
>> +
>> +	if (frep == NULL)
>> +		return false;
>> +
>> +	offset_size = sframe_fre_get_offset_size(frep->fre_info);
>> +
>> +	if (offset_size != SFRAME_FRE_OFFSET_1B
>> +	    && offset_size != SFRAME_FRE_OFFSET_2B
>> +	    && offset_size != SFRAME_FRE_OFFSET_4B)
>> +		return false;
>> +
>> +	offset_cnt = sframe_fre_get_offset_count(frep->fre_info);
>> +	if (offset_cnt > MAX_NUM_STACK_OFFSETS)
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static int32_t sframe_get_fre_offset(struct sframe_fre *frep, uint32_t idx,
>> +				     int *errp)
>> +{
>> +	int offset_cnt, offset_size;
>> +
>> +	if (frep == NULL || !sframe_fre_sanity_check_p(frep))
>> +		return sframe_set_errno(errp, SFRAME_ERR_FRE_INVAL);
>> +
>> +	offset_cnt = sframe_fre_get_offset_count(frep->fre_info);
>> +	offset_size = sframe_fre_get_offset_size(frep->fre_info);
>> +
>> +	if (offset_cnt < idx + 1)
>> +		return sframe_set_errno(errp, SFRAME_ERR_FREOFFSET_NOPRESENT);
>> +
>> +	if (errp != NULL)
>> +		*errp = 0; /* Offset Valid. */
>> +
>> +	if (offset_size == SFRAME_FRE_OFFSET_1B) {
>> +		int8_t *stack_offsets = (int8_t *)frep->fre_offsets;
>> +		return stack_offsets[idx];
>> +	} else if (offset_size == SFRAME_FRE_OFFSET_2B) {
>> +		int16_t *stack_offsets = (int16_t *)frep->fre_offsets;
>> +		return stack_offsets[idx];
>> +	} else {
>> +		int32_t *stack_offsets = (int32_t *)frep->fre_offsets;
>> +		return stack_offsets[idx];
>> +	}
>> +}
>> +
>> +static struct sframe_header *sframe_sec_get_header(struct sframe_sec *sfsec)
>> +{
>> +	return sfsec ? &sfsec->header : NULL;
>> +}
>> +
>> +static int sframe_fre_copy(struct sframe_fre *dst,
>> +			   struct sframe_fre *src)
>> +{
>> +	if (dst == NULL || src == NULL)
>> +		return SFRAME_ERR;
>> +
>> +	memcpy(dst, src, sizeof(struct sframe_fre));
>> +	return 0;
>> +}
>> +
>> +static int sframe_decode_start_ip_offset(const char *fre_buf,
>> +					 uint32_t *start_ip_offset,
>> +					 unsigned int fre_type)
>> +{
>> +	uint32_t saddr = 0;
>> +
>> +	if (fre_type == SFRAME_FRE_TYPE_ADDR1) {
>> +		uint8_t *uc = (uint8_t *)fre_buf;
>> +		saddr = (uint32_t)*uc;
>> +	} else if (fre_type == SFRAME_FRE_TYPE_ADDR2) {
>> +		uint16_t *ust = (uint16_t *)fre_buf;
>> +		saddr = (uint32_t)*ust;
>> +	} else if (fre_type == SFRAME_FRE_TYPE_ADDR4) {
>> +		uint32_t *uit = (uint32_t *)fre_buf;
>> +		saddr = (uint32_t)*uit;
>> +	} else {
>> +		return SFRAME_ERR_INVAL;
>> +	}
>> +
>> +	*start_ip_offset = saddr;
>> +	return 0;
>> +}
>> +
>> +/* Get the total size in bytes for the stack offsets. */
>> +static size_t sframe_fre_stack_offsets_size(unsigned char fre_info)
>> +{
>> +	unsigned int offset_size, offset_cnt;
>> +
>> +	offset_size = sframe_fre_get_offset_size(fre_info);
>> +	offset_cnt = sframe_fre_get_offset_count(fre_info);
>> +
>> +	if (offset_size == SFRAME_FRE_OFFSET_2B
>> +	    || offset_size == SFRAME_FRE_OFFSET_4B)	/* 2 or 4 bytes. */
>> +		return (offset_cnt * (offset_size * 2));
>> +
>> +	return offset_cnt;
>> +}
>> +
>> +static size_t sframe_fre_get_start_address_tsize(unsigned int fre_type)
>> +{
>> +	/* Type size of the start_addr in an FRE. */
>> +	size_t saddr_tsize = 0;
>> +
>> +	switch (fre_type) {
>> +	case SFRAME_FRE_TYPE_ADDR1:
>> +		saddr_tsize = sizeof(uint8_t);
>> +		break;
>> +	case SFRAME_FRE_TYPE_ADDR2:
>> +		saddr_tsize = sizeof(uint16_t);
>> +		break;
>> +	case SFRAME_FRE_TYPE_ADDR4:
>> +		saddr_tsize = sizeof(uint32_t);
>> +		break;
>> +	default:
>> +		/* No other value is expected. */
>> +		break;
>> +	}
>> +	return saddr_tsize;
>> +}
>> +
>> +static size_t sframe_fre_vlen_size(struct sframe_fre *frep,
>> +				   unsigned int fre_type)
>> +{
>> +	unsigned char fre_info;
>> +	size_t ip_offset_tsize;
>> +
>> +	if (frep == NULL)
>> +		return 0;
>> +
>> +	fre_info = frep->fre_info;
>> +	ip_offset_tsize = sframe_fre_get_start_address_tsize(fre_type);
>> +
>> +	/*
>> +	 * An SFrame FRE is a variable length structure.  It includes the start
>> +	 * IP offset, FRE info field, and all trailing the stack offsets.
>> +	 */
>> +	return (ip_offset_tsize + sizeof(fre_info)
>> +		+ sframe_fre_stack_offsets_size(fre_info));
>> +}
>> +
>> +/*
>> + * Read an SFrame FRE which starts at location FRE_BUF.  The function
>> + * updates FRE_SIZE to the size of the FRE as stored in the binary format.
>> + *
>> + * Returns SFRAME_ERR if failure.
>> + */
>> +static int sframe_sec_read_fre(const char *fre_buf, struct sframe_fre *frep,
>> +			       unsigned int fre_type, size_t *fre_size)
>> +{
>> +	void *stack_offsets;
>> +	size_t stack_offsets_sz;
>> +	size_t ip_offset_tsize;
>> +	size_t esz;
>> +
>> +	if (fre_buf == NULL || frep == NULL || fre_size == NULL)
>> +		return SFRAME_ERR_INVAL;
>> +
>> +	/* Copy over the FRE start address. */
>> +	sframe_decode_start_ip_offset(fre_buf, &frep->start_ip_offset,
>> +				      fre_type);
>> +
>> +	ip_offset_tsize = sframe_fre_get_start_address_tsize(fre_type);
>> +	/* PS: Note how this API works closely with SFrame binary format. */
>> +	frep->fre_info = *(unsigned char *)(fre_buf + ip_offset_tsize);
>> +
>> +	memset(frep->fre_offsets, 0, MAX_STACK_OFFSET_NBYTES);
>> +	/* Get stack offsets. */
>> +	stack_offsets_sz = sframe_fre_stack_offsets_size(frep->fre_info);
>> +	stack_offsets = ((unsigned char *)fre_buf + ip_offset_tsize
>> +			 + sizeof(frep->fre_info));
>> +	memcpy(frep->fre_offsets, stack_offsets, stack_offsets_sz);
>> +
>> +	esz = sframe_fre_vlen_size(frep, fre_type);
>> +	*fre_size = esz;
>> +
>> +	return 0;
>> +}
>> +
>> +static struct sframe_func_desc_entry *
>> +sframe_sec_find_fde(struct sframe_sec *sfsec, int32_t addr, int *errp)
>> +{
>> +	struct sframe_header *header;
>> +	struct sframe_func_desc_entry *fde;
>> +	int low, high, cnt;
>> +
>> +	if (sfsec == NULL) {
>> +		sframe_set_errno(errp, SFRAME_ERR_INVAL);
>> +		return NULL;
>> +	}
>> +
>> +	header = sframe_sec_get_header(sfsec);
>> +	if (header == NULL || header->num_fdes == 0 || sfsec->fdes == NULL) {
>> +		sframe_set_errno(errp, SFRAME_ERR_INIT_INVAL);
>> +		return NULL;
>> +	}
>> +	/*
>> +	 * Skip binary search if FDE sub-section is not sorted on PCs.  GNU ld
>> +	 * sorts the FDEs on start PC by default though.
>> +	 */
>> +	if ((header->preamble.flags & SFRAME_F_FDE_SORTED) == 0) {
>> +		sframe_set_errno(errp, SFRAME_ERR_FDE_NOTSORTED);
>> +		return NULL;
>> +	}
>> +
>> +	/* Find the FDE that may contain the addr. */
>> +	fde = (struct sframe_func_desc_entry *)sfsec->fdes;
>> +	low = 0;
>> +	high = header->num_fdes;
>> +	cnt = high;
>> +	while (low <= high) {
>> +		int mid = low + (high - low) / 2;
>> +
>> +		if (fde[mid].func_start_address == addr)
>> +			return fde + mid;
>> +
>> +		if (fde[mid].func_start_address < addr) {
>> +			if (mid == (cnt - 1))
>> +				return fde + (cnt - 1);
>> +			else if (fde[mid+1].func_start_address > addr)
>> +				return fde + mid;
>> +			low = mid + 1;
>> +		} else
>> +			high = mid - 1;
>> +	}
>> +
>> +	sframe_set_errno(errp, SFRAME_ERR_FDE_NOTFOUND);
>> +	return NULL;
>> +}
>> +
>> +static int8_t sframe_sec_get_fixed_fp_offset(struct sframe_sec *sfsec)
>> +{
>> +	struct sframe_header *header = sframe_sec_get_header(sfsec);
>> +	return header->cfa_fixed_fp_offset;
>> +}
>> +
>> +static int8_t sframe_sec_get_fixed_ra_offset(struct sframe_sec *sfsec)
>> +{
>> +	struct sframe_header *header = sframe_sec_get_header(sfsec);
>> +	return header->cfa_fixed_ra_offset;
>> +}
>> +
>> +size_t sframe_sec_sizeof(void)
>> +{
>> +	return sizeof (struct sframe_sec);
>> +}
>> +
>> +int sframe_sec_init(struct sframe_sec *sfsec, const char *sf_buf,
>> +		    size_t sf_size)
>> +{
>> +	char *frame_buf;
>> +	const struct sframe_preamble *preamble;
>> +	struct sframe_header *header;
>> +
>> +	if ((sf_buf == NULL) || (sf_size < sizeof(struct sframe_header)))
>> +		return SFRAME_ERR_INVAL;
>> +
>> +	/* Check for foreign endianness. */
>> +	preamble = (const struct sframe_preamble *) sf_buf;
>> +	if (preamble->magic != SFRAME_MAGIC)
>> +		return SFRAME_ERR_INVAL;
>> +
>> +	/* Reset the SFrame section object. */
>> +	memset(sfsec, 0, sizeof(struct sframe_sec));
>> +
>> +	frame_buf = (char *)sf_buf;
>> +
>> +	/* Initialize the reference to the SFrame header. */
>> +	sfsec->header = *(struct sframe_header *) frame_buf;
>> +	header = &sfsec->header;
>> +	if (!sframe_header_sanity_check_p(header))
>> +		return SFRAME_ERR_INVAL;
>> +
>> +	/* Initialize the referece to the SFrame FDE section. */
>> +	frame_buf += sframe_sec_get_hdr_size(header);
>> +	sfsec->fdes = frame_buf;
>> +
>> +	/* Initialize the reference to the the SFrame FRE section. */
>> +	frame_buf += (header->num_fdes * sizeof(struct sframe_func_desc_entry));
>> +	sfsec->fres = frame_buf;
>> +
>> +	sfsec->fre_nbytes = header->fre_len;
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Find the SFrame Frame Row Entry which contains the PC.
>> + * Returns error code if failure.
>> + */
>> +int sframe_sec_find_fre(struct sframe_sec *sfsec, int32_t pc,
>> +			struct sframe_fre *frep)
>> +{
>> +	struct sframe_func_desc_entry *fdep;
>> +	uint32_t start_address, i;
>> +	struct sframe_fre cur_fre, next_fre;
> 
> Although the sframe_fre structure is relatively small, I always get nervous
> when I see structures defined on the kernel stack.
> 
> Also, I personally prefer to keep items defined themselves. That is, one
> variable per line.
> 
>> +	unsigned char *fres;
>> +	unsigned int fre_type, fde_type;
>> +	size_t esz;
>> +	int err = 0;
>> +	size_t size = 0;
> 
> And to organize it in an "upside-down x-mas" tree fashion:
> 
> 	struct sframe_func_desc_entry *fdep;
> 	struct sframe_fre next_fre;
> 	struct sframe_fre cur_fre
> 	uint32_t start_address, i;
> 	unsigned int fre_type;
> 	unsigned int fde_type;
> 	unsigned char *fres;
> 	size_t size = 0;
> 	size_t esz;
> 	int err = 0;
> 
> Looks much nicer and easier to read that way ;-)
> 
> -- Steve
> 

OK. I can work these out. At some point I was running the 
script/checkpatch.pl but forgot about it over time. Will fix these issues.

Thanks


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 4/5] sframe: add an SFrame format stack tracer
  2023-05-01 23:00   ` Steven Rostedt
@ 2023-05-02  6:16     ` Indu Bhagat
  0 siblings, 0 replies; 32+ messages in thread
From: Indu Bhagat @ 2023-05-02  6:16 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-toolchains, daandemeyer, andrii, kris.van.hees,
	elena.zannoni, nick.alcock

On 5/1/23 16:00, Steven Rostedt wrote:
> On Mon,  1 May 2023 13:04:09 -0700
> Indu Bhagat <indu.bhagat@oracle.com> wrote:
> 
>> This patch adds an SFrame format based stack tracer.
>>
>> The files iterate_phdr.c, iterate_phdr.h implement a dl_iterate_phdr()
>> like functionality.
>>
>> The SFrame format based stack tracer is implemented in the
>> sframe_unwind.c with architecture specific bits in the
>> arch/arm64/include/asm/sframe_regs.h and
>> arch/x86/include/asm/sframe_regs.h.  Please note that the SFrame format
>> is supported for x86_64 (AMD64 ABI) and aarch64 (AAPCS64 ABI) only at
>> this time.
>>
>> The files sframe_state.[ch] implement the SFrame state management APIs.
>>
>> Some aspects of the implementation are "POC like". These will need to
>> addressed for the implementation to become more palatable:
>> - dealing with only Elf64_Phdr (no Elf32_Phdr) at this time, and other
>>    TODOs in the iterate_phdr.c,
>> - detecting whether a program did a dlopen/dlclose,
>> - code stubs around user space memory access (.sframe section, ELF hdr
>>    etc.) by the kernel need careful review.
>>
>> There are more aspects than above; The intention of this patch set is to
>> help drive the discussion on how to best incorporate an SFrame-based user
>> space unwinder in the kernel.
>>
>> Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
>> ---
>>   arch/arm64/include/asm/sframe_regs.h |  37 +++
>>   arch/x86/include/asm/sframe_regs.h   |  34 +++
>>   include/sframe/sframe_regs.h         |  11 +
>>   include/sframe/sframe_unwind.h       |  62 ++++
>>   lib/sframe/Makefile                  |   8 +-
>>   lib/sframe/iterate_phdr.c            | 113 +++++++
>>   lib/sframe/iterate_phdr.h            |  34 +++
>>   lib/sframe/sframe_state.c            | 424 +++++++++++++++++++++++++++
>>   lib/sframe/sframe_state.h            |  80 +++++
>>   lib/sframe/sframe_unwind.c           | 208 +++++++++++++
>>   10 files changed, 1010 insertions(+), 1 deletion(-)
>>   create mode 100644 arch/arm64/include/asm/sframe_regs.h
>>   create mode 100644 arch/x86/include/asm/sframe_regs.h
>>   create mode 100644 include/sframe/sframe_regs.h
>>   create mode 100644 include/sframe/sframe_unwind.h
>>   create mode 100644 lib/sframe/iterate_phdr.c
>>   create mode 100644 lib/sframe/iterate_phdr.h
>>   create mode 100644 lib/sframe/sframe_state.c
>>   create mode 100644 lib/sframe/sframe_state.h
>>   create mode 100644 lib/sframe/sframe_unwind.c
>>
>> diff --git a/arch/arm64/include/asm/sframe_regs.h b/arch/arm64/include/asm/sframe_regs.h
>> new file mode 100644
>> index 000000000000..ae9ab9d5d3c1
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/sframe_regs.h
>> @@ -0,0 +1,37 @@
>> +/* SPDX-License-Identifier: GPL-2.0-or-later */
>> +/*
>> + * Copyright (C) 2023, Oracle and/or its affiliates.
>> + */
>> +
>> +#ifdef ASM_ARM64_SFRAME_REGS_H
>> +#define ASM_ARM64_SFRAME_REGS_H
>> +
>> +#define STACK_ACCESS_LEN 8
>> +
>> +static inline uint64_t
>> +get_ptregs_ip(struct pt_regs *regs)
>> +{
>> +	return regs->pc;
>> +}
>> +
>> +static inline uint64_t
>> +get_ptregs_sp(struct pt_regs *regs)
>> +{
>> +	return regs->sp;
>> +}
>> +
>> +static inline uint64_t
>> +get_ptregs_fp(struct pt_regs *regs)
>> +{
>> +#define UNWIND_AARCH64_X29              29      /* 64-bit frame pointer.  */
>> +	return (uint64_t)regs->regs[UNWIND_AARCH64_X29];
>> +}
>> +
>> +static inline uint64_t
>> +get_ptregs_ra(struct pt_regs *regs)
>> +{
>> +#define UNWIND_AARCH64_X30              30      /* 64-bit link pointer.  */
>> +	return (uint64_t)regs->regs[UNWIND_AARCH64_X30];
>> +}
>> +
>> +#endif /* ASM_ARM64_SFRAME_REGS_H */
>> diff --git a/arch/x86/include/asm/sframe_regs.h b/arch/x86/include/asm/sframe_regs.h
>> new file mode 100644
>> index 000000000000..99f84955854a
>> --- /dev/null
>> +++ b/arch/x86/include/asm/sframe_regs.h
>> @@ -0,0 +1,34 @@
>> +/* SPDX-License-Identifier: GPL-2.0-or-later */
>> +/*
>> + * Copyright (C) 2023, Oracle and/or its affiliates.
>> + */
>> +
>> +#ifndef ASM_X86_SFRAME_REGS_H
>> +#define ASM_X86_SFRAME_REGS_H
>> +
>> +#define STACK_ACCESS_LEN 8
>> +
>> +static inline uint64_t
>> +get_ptregs_ip(struct pt_regs *regs)
>> +{
>> +	return (uint64_t)regs->ip;
>> +}
>> +
>> +static inline uint64_t
>> +get_ptregs_sp(struct pt_regs *regs)
>> +{
>> +	return (uint64_t)regs->sp;
>> +}
>> +
>> +static inline uint64_t
>> +get_ptregs_fp(struct pt_regs *regs)
>> +{
>> +	return (uint64_t)regs->bp;
>> +}
>> +
>> +static inline uint64_t
>> +get_ptregs_ra(struct pt_regs *regs)
>> +{
>> +	return 0; /* SFRAME_CFA_FIXED_RA_INVALID */
>> +}
>> +#endif /* ASM_X86_SFRAME_REGS_H */
>> diff --git a/include/sframe/sframe_regs.h b/include/sframe/sframe_regs.h
>> new file mode 100644
>> index 000000000000..32b67f7a7c78
>> --- /dev/null
>> +++ b/include/sframe/sframe_regs.h
>> @@ -0,0 +1,11 @@
>> +/* SPDX-License-Identifier: GPL-2.0-or-later */
>> +/*
>> + * Copyright (C) 2023, Oracle and/or its affiliates.
>> + */
>> +
>> +#ifndef _SFRAME_REGS_H
>> +#define _SFRAME_REGS_H
>> +
>> +#include <asm/sframe_regs.h>
>> +
>> +#endif /* _SFRAME_REGS_H */
>> diff --git a/include/sframe/sframe_unwind.h b/include/sframe/sframe_unwind.h
>> new file mode 100644
>> index 000000000000..3e2c12816b60
>> --- /dev/null
>> +++ b/include/sframe/sframe_unwind.h
> 
> Also, these should probably go into include/linux, Unless there's going to
> be a lot more header files.
> 

I'd expect at most the current headers:
include/sframe/sframe_unwind.h
include/sframe/sframe_regs.h

And perhaps one more, for a callchain format and callbacks suggested below ?

>> @@ -0,0 +1,62 @@
>> +/* SPDX-License-Identifier: GPL-2.0-or-later */
>> +/*
>> + * Copyright (C) 2023, Oracle and/or its affiliates.
>> + */
>> +
>> +#ifndef _SFRAME_UNWIND_H
>> +#define _SFRAME_UNWIND_H
>> +
>> +#include <linux/sched.h>
>> +#include <linux/perf_event.h>
>> +
>> +#define PT_GNU_SFRAME  0x6474e554
>> +
>> +/*
>> + * State used for SFrame-based stack tracing for a user space task.
>> + */
>> +struct user_unwind_state {
>> +	uint64_t pc, sp, fp, ra;
> 
> I know this is POC, but please make each structure field a separate item.
> Also, should be tab delimited.
> 

OK.

>> +	enum stack_type stype;
>> +	struct task_struct *task;
>> +	bool error;
>> +};
> 
> Also swap the task and the stype, as the pointer to the task will create a
> hole in the structure.
> 
> struct user_unwind_state {
> 	uint64_t		pc;
> 	uint64_t		sp;
> 	uint64_t		fp;
> 	uint64_t		ra;
> 	struct task_stuct	*task;
> 	enum stack_type		stype;
> 	bool			error;
> };
> 

OK.

>> +
>> +/*
>> + * APIs for an SFrame based stack tracer.
>> + */
>> +
>> +void sframe_unwind_start(struct user_unwind_state *state,
>> +			 struct task_struct *task, struct pt_regs *regs);
>> +bool sframe_unwind_next_frame(struct user_unwind_state *state);
>> +uint64_t sframe_unwind_get_return_address(struct user_unwind_state *state);
>> +
>> +static inline bool sframe_unwind_done(struct user_unwind_state *state)
>> +{
>> +	return state->stype == STACK_TYPE_UNKNOWN;
>> +}
>> +
>> +static inline bool sframe_unwind_error(struct user_unwind_state *state)
>> +{
>> +	return state->error;
>> +}
>> +
>> +/*
>> + * APIs to manage the SFrame state per task for stack tracing.
>> + */
>> +
>> +extern struct sframe_state *unwind_sframe_state_alloc(struct task_struct *task);
>> +extern int unwind_sframe_state_update(struct task_struct *task);
>> +extern void unwind_sframe_state_cleanup(struct task_struct *task);
>> +
>> +extern bool unwind_sframe_state_valid_p(struct sframe_state *sfstate);
>> +extern bool unwind_sframe_state_ready_p(struct sframe_state *sftate);
>> +
>> +/*
>> + * Get the callchain using SFrame unwind info for the given task.
>> + */
>> +extern int
>> +sframe_callchain_user(struct task_struct *task,
>> +		      struct perf_callchain_entry_ctx *entry,
>> +		      struct pt_regs *regs);
> 
> 
> I plan on using this without any perf involvement, I'd like to keep perf
> separate from the sframe logic. As I mentioned in a previous email, I
> expect sframe to have callbacks. So the callchain format should be defined
> by sframe, and not reuse perf.
> 

I will think about this. Do you have some model of the expected 
callbacks for me to explore ?

>> +
>> +#endif /* _SFRAME_UNWIND_H */
>> diff --git a/lib/sframe/Makefile b/lib/sframe/Makefile
>> index 4e4291d9294f..5ee9e3e7ec93 100644
>> --- a/lib/sframe/Makefile
>> +++ b/lib/sframe/Makefile
>> @@ -1,5 +1,11 @@
>>   # SPDX-License-Identifier: GPL-2.0
>>   ##################################
>> -obj-$(CONFIG_USER_UNWINDER_SFRAME) += sframe_read.o \
>> +obj-$(CONFIG_USER_UNWINDER_SFRAME) += iterate_phdr.o \
>> +				      sframe_read.o \
>> +				      sframe_state.o \
>> +				      sframe_unwind.o
> 
> Ah, the backslash is fixed here.
> 

Yes, it was a rebase thing. It got missed when moving code between patches.

>>   
>> +CFLAGS_iterate_phdr.o += -I $(srctree)/lib/sframe/ -Wno-error=declaration-after-statement
>>   CFLAGS_sframe_read.o += -I $(srctree)/lib/sframe/
>> +CFLAGS_sframe_state.o += -I $(srctree)/lib/sframe/
>> +CFLAGS_sframe_unwind.o += -I $(srctree)/lib/sframe/
>> diff --git a/lib/sframe/iterate_phdr.c b/lib/sframe/iterate_phdr.c
>> new file mode 100644
>> index 000000000000..c10d590ecc67
>> --- /dev/null
>> +++ b/lib/sframe/iterate_phdr.c
>> @@ -0,0 +1,113 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * Copyright (C) 2023, Oracle and/or its affiliates.
>> + */
>> +
>> +#include <linux/elf.h>
>> +#include <linux/mm.h>
>> +#include <linux/vmalloc.h>
>> +#include <linux/mm_types.h>
>> +
>> +#include "iterate_phdr.h"
>> +
>> +/*
>> + * Iterate over the task's memory mappings and find the ELF headers.
>> + *
>> + * This is expected to be called from perf_callchain_user(), so user process
>> + * context is expected.
> 
> My thought is that this will be called in the ptrace path (not the perf
> path), so yes, it will be in user process context.
> 
>> + */
>> +
>> +int iterate_phdr(int (*callback)(struct phdr_info *info,
>> +				 struct task_struct *task,
>> +				 void *data),
>> +		 struct task_struct *task, void *data)
>> +{
>> +	struct mm_struct *mm;
>> +	struct vm_area_struct *vma_mt;
>> +	struct page *page;
>> +
>> +	Elf64_Ehdr *ehdr;
>> +	struct phdr_info phinfo;
>> +
>> +	int ret = 0, res = 0;
>> +	int err = 0;
>> +	bool first = true;
>> +
>> +	memset(&phinfo, 0, sizeof(struct phdr_info));
>> +
>> +	mm = task->mm;
>> +
>> +	MA_STATE(mas, &mm->mm_mt, 0, 0);
>> +
> 
> So this is the code I want to discuss at LSFMM :-) As there will be more
> experts about this than what I know.
> 
> Let me go and start making the infrastructure to encompass this.
> 
> -- Steve
> 
> 
>> +	mas_for_each(&mas, vma_mt, ULONG_MAX) {
>> +		/* ELF header has a fixed place in the file, starting at offset
>> +		 * zero.
>> +		 */
>> +		if (vma_mt->vm_pgoff)
>> +			continue;
>> +
>> +		/* For the callback to infer if its the prog or DSO we are
>> +		 * dealing with.
>> +		 */
>> +		phinfo.pi_prog = first;
>> +		first = false;
>> +		/* FIXME TODO
>> +		 *  - This code assumes 64-bit ELF by using Elf64_Ehdr.
>> +		 *  - Detect the case when ELF program headers to be of
>> +		 * size > 1 page.
>> +		 */
>> +
>> +		/* FIXME TODO KERNEL
>> +		 *  - get_user_pages_WHAT, which API.
>> +		 *  What flags ? Is this correct ?
>> +		 */
>> +		ret = get_user_pages_remote(mm, vma_mt->vm_start, 1, FOLL_GET,
>> +					    &page, &vma_mt, NULL);
>> +		if (ret <= 0)
>> +			continue;
>> +
>> +		/* The first page must have the ELF header. */
>> +		ehdr = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
>> +		if (!ehdr)
>> +			goto put_page;
>> +
>> +		/* Check for magic bytes to make sure this is ehdr. */
>> +		err = 0;
>> +		err |= ((ehdr->e_ident[EI_MAG0] != ELFMAG0)
>> +			|| (ehdr->e_ident[EI_MAG1] != ELFMAG1)
>> +			|| (ehdr->e_ident[EI_MAG2] != ELFMAG2)
>> +			|| (ehdr->e_ident[EI_MAG3] != ELFMAG3));
>> +		if (err)
>> +			goto unmap;
>> +
>> +		/*
>> +		 * FIXME TODO handle the case when number of program headers is
>> +		 * greater than or equal to PN_XNUM later.
>> +		 */
>> +		if (ehdr->e_phnum == PN_XNUM)
>> +			goto unmap;
>> +		/*
>> +		 * FIXME TODO handle the case when Elf phdrs span more than one
>> +		 * page later ?
>> +		 */
>> +		if ((sizeof(Elf64_Ehdr) + ehdr->e_phentsize * ehdr->e_phnum)
>> +		    > PAGE_SIZE)
>> +			goto unmap;
>> +
>> +		/* Save the location of program headers and the phnum. */
>> +		phinfo.pi_addr = vma_mt->vm_start;
>> +		phinfo.pi_phdr = (void *)ehdr + ehdr->e_phoff;
>> +		phinfo.pi_phnum = ehdr->e_phnum;
>> +
>> +		res = callback(&phinfo, task, data);
>> +unmap:
>> +		vunmap(ehdr);
>> +put_page:
>> +		put_page(page);
>> +
>> +		if (res < 0)
>> +			break;
>> +	}
>> +
>> +	return res;
>> +}
>>


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 3/5] sframe: add new SFrame library
  2023-05-01 22:40   ` Steven Rostedt
  2023-05-02  5:07     ` Indu Bhagat
@ 2023-05-02  8:46     ` Peter Zijlstra
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2023-05-02  8:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Indu Bhagat, linux-toolchains, daandemeyer, andrii,
	kris.van.hees, elena.zannoni, nick.alcock

On Mon, May 01, 2023 at 06:40:08PM -0400, Steven Rostedt wrote:
> > --- /dev/null
> > +++ b/lib/sframe/sframe.h
> > @@ -0,0 +1,263 @@
> > +/* SPDX-License-Identifier: GPL-2.0-or-later */
> > +/*
> > + * Copyright (C) 2023, Oracle and/or its affiliates.
> > + */
> > +
> > +#ifndef	SFRAME_H
> > +#define	SFRAME_H
> > +
> > +#include <linux/types.h>
> > +
> > +/* This file contains definitions for the SFrame stack tracing format, which is
> > + * documented at https://sourceware.org/binutils/docs */

More CodingStyle nits, this also isn't a valid multi line comment style
for the kernel.

> > +
> > +#define SFRAME_VERSION_1	1
> > +#define SFRAME_MAGIC		0xdee2
> > +#define SFRAME_VERSION	SFRAME_VERSION_1
> > +
> > +/* Function Descriptor Entries are sorted on PC. */
> > +#define SFRAME_F_FDE_SORTED	0x1
> > +/* Frame-pointer based stack tracing. Defined, but not set. */
> > +#define SFRAME_F_FRAME_POINTER 0x2
> > +
> > +#define SFRAME_CFA_FIXED_FP_INVALID 0
> > +#define SFRAME_CFA_FIXED_RA_INVALID 0
> > +
> > +/* Supported ABIs/Arch. */
> > +#define SFRAME_ABI_AARCH64_ENDIAN_BIG      1 /* AARCH64 big endian. */
> > +#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE   2 /* AARCH64 little endian. */
> > +#define SFRAME_ABI_AMD64_ENDIAN_LITTLE     3 /* AMD64 little endian. */
> > +
> > +/* SFrame FRE types. */
> > +#define SFRAME_FRE_TYPE_ADDR1	0
> > +#define SFRAME_FRE_TYPE_ADDR2	1
> > +#define SFRAME_FRE_TYPE_ADDR4	2
> > +
> > +/*
> > + * SFrame Function Descriptor Entry types.
> > + *
> > + * The SFrame format has two possible representations for functions.  The
> > + * choice of which type to use is made according to the instruction patterns
> > + * in the relevant program stub.
> > + */

This is.

> > +/* Unwinders perform a (PC >= FRE_START_ADDR) to look up a matching FRE. */
> > +#define SFRAME_FDE_TYPE_PCINC   0
> > +/*
> > + * Unwinders perform a (PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK)
> > + * to look up a matching FRE.  Typical usecases are pltN entries, trampolines
> > + * etc.
> > + */
> > +#define SFRAME_FDE_TYPE_PCMASK  1
> > +

> > +static bool sframe_header_sanity_check_p(struct sframe_header *hp)
> > +{
> > +	unsigned char all_flags = SFRAME_F_FDE_SORTED | SFRAME_F_FRAME_POINTER;
> 
> Add a space here.
> 
> > +	/* Check that the preamble is valid. */
> > +	if ((hp->preamble.magic != SFRAME_MAGIC)
> > +	    || (hp->preamble.version != SFRAME_VERSION)
> > +	    || ((hp->preamble.flags | all_flags) != all_flags))
> > +		return false;

In kernel we prefer the operators at the end of a line in case of
linebreak.

> > +
> > +	/* Check that the offsets are valid. */
> > +	if (hp->fdeoff > hp->freoff)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +static bool sframe_fre_sanity_check_p(struct sframe_fre *frep)
> > +{
> > +	unsigned int offset_size, offset_cnt;
> > +
> > +	if (frep == NULL)
> > +		return false;
> > +
> > +	offset_size = sframe_fre_get_offset_size(frep->fre_info);
> > +
> > +	if (offset_size != SFRAME_FRE_OFFSET_1B
> > +	    && offset_size != SFRAME_FRE_OFFSET_2B
> > +	    && offset_size != SFRAME_FRE_OFFSET_4B)
> > +		return false;

idem.

> > +
> > +	offset_cnt = sframe_fre_get_offset_count(frep->fre_info);
> > +	if (offset_cnt > MAX_NUM_STACK_OFFSETS)
> > +		return false;
> > +
> > +	return true;
> > +}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 4/5] sframe: add an SFrame format stack tracer
  2023-05-01 20:04 ` [POC 4/5] sframe: add an SFrame format stack tracer Indu Bhagat
  2023-05-01 23:00   ` Steven Rostedt
@ 2023-05-02  8:53   ` Peter Zijlstra
  2023-05-02  9:04   ` Peter Zijlstra
  2 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2023-05-02  8:53 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: linux-toolchains, daandemeyer, andrii, rostedt, kris.van.hees,
	elena.zannoni, nick.alcock

On Mon, May 01, 2023 at 01:04:09PM -0700, Indu Bhagat wrote:
> diff --git a/arch/arm64/include/asm/sframe_regs.h b/arch/arm64/include/asm/sframe_regs.h
> new file mode 100644
> index 000000000000..ae9ab9d5d3c1
> --- /dev/null
> +++ b/arch/arm64/include/asm/sframe_regs.h
> @@ -0,0 +1,37 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (C) 2023, Oracle and/or its affiliates.
> + */
> +
> +#ifdef ASM_ARM64_SFRAME_REGS_H
> +#define ASM_ARM64_SFRAME_REGS_H
> +
> +#define STACK_ACCESS_LEN 8
> +
> +static inline uint64_t
> +get_ptregs_ip(struct pt_regs *regs)
> +{
> +	return regs->pc;
> +}
> +
> +static inline uint64_t
> +get_ptregs_sp(struct pt_regs *regs)
> +{
> +	return regs->sp;
> +}
> +
> +static inline uint64_t
> +get_ptregs_fp(struct pt_regs *regs)
> +{
> +#define UNWIND_AARCH64_X29              29      /* 64-bit frame pointer.  */
> +	return (uint64_t)regs->regs[UNWIND_AARCH64_X29];
> +}
> +
> +static inline uint64_t
> +get_ptregs_ra(struct pt_regs *regs)
> +{
> +#define UNWIND_AARCH64_X30              30      /* 64-bit link pointer.  */
> +	return (uint64_t)regs->regs[UNWIND_AARCH64_X30];
> +}
> +
> +#endif /* ASM_ARM64_SFRAME_REGS_H */
> diff --git a/arch/x86/include/asm/sframe_regs.h b/arch/x86/include/asm/sframe_regs.h
> new file mode 100644
> index 000000000000..99f84955854a
> --- /dev/null
> +++ b/arch/x86/include/asm/sframe_regs.h
> @@ -0,0 +1,34 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (C) 2023, Oracle and/or its affiliates.
> + */
> +
> +#ifndef ASM_X86_SFRAME_REGS_H
> +#define ASM_X86_SFRAME_REGS_H
> +
> +#define STACK_ACCESS_LEN 8
> +
> +static inline uint64_t
> +get_ptregs_ip(struct pt_regs *regs)
> +{
> +	return (uint64_t)regs->ip;
> +}
> +
> +static inline uint64_t
> +get_ptregs_sp(struct pt_regs *regs)
> +{
> +	return (uint64_t)regs->sp;
> +}
> +
> +static inline uint64_t
> +get_ptregs_fp(struct pt_regs *regs)
> +{
> +	return (uint64_t)regs->bp;
> +}
> +
> +static inline uint64_t
> +get_ptregs_ra(struct pt_regs *regs)
> +{
> +	return 0; /* SFRAME_CFA_FIXED_RA_INVALID */
> +}
> +#endif /* ASM_X86_SFRAME_REGS_H */

arch/*/include/ptrace.h already provides:

  instruction_pointer(regs)
  kernel_stack_pointer(regs)
  frame_pointer(regs)

and on arm64:

  procedure_link_pointer(regs)

That do all this, I would suggest you use those instead of creating yet
another set of wrappers.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 4/5] sframe: add an SFrame format stack tracer
  2023-05-01 20:04 ` [POC 4/5] sframe: add an SFrame format stack tracer Indu Bhagat
  2023-05-01 23:00   ` Steven Rostedt
  2023-05-02  8:53   ` Peter Zijlstra
@ 2023-05-02  9:04   ` Peter Zijlstra
  2 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2023-05-02  9:04 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: linux-toolchains, daandemeyer, andrii, rostedt, kris.van.hees,
	elena.zannoni, nick.alcock

On Mon, May 01, 2023 at 01:04:09PM -0700, Indu Bhagat wrote:
> +int iterate_phdr(int (*callback)(struct phdr_info *info,
> +				 struct task_struct *task,
> +				 void *data),
> +		 struct task_struct *task, void *data)
> +{
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma_mt;
> +	struct page *page;
> +
> +	Elf64_Ehdr *ehdr;
> +	struct phdr_info phinfo;
> +
> +	int ret = 0, res = 0;
> +	int err = 0;
> +	bool first = true;
> +
> +	memset(&phinfo, 0, sizeof(struct phdr_info));
> +
> +	mm = task->mm;
> +
> +	MA_STATE(mas, &mm->mm_mt, 0, 0);
> +
> +	mas_for_each(&mas, vma_mt, ULONG_MAX) {
> +		/* ELF header has a fixed place in the file, starting at offset
> +		 * zero.
> +		 */

This is a vile comment form; please abstain from using it.

  https://lkml.org/lkml/2016/7/8/625

> +		if (vma_mt->vm_pgoff)
> +			continue;
> +
> +		/* For the callback to infer if its the prog or DSO we are
> +		 * dealing with.
> +		 */
> +		phinfo.pi_prog = first;
> +		first = false;
> +		/* FIXME TODO
> +		 *  - This code assumes 64-bit ELF by using Elf64_Ehdr.
> +		 *  - Detect the case when ELF program headers to be of
> +		 * size > 1 page.
> +		 */
> +
> +		/* FIXME TODO KERNEL
> +		 *  - get_user_pages_WHAT, which API.
> +		 *  What flags ? Is this correct ?
> +		 */
> +		ret = get_user_pages_remote(mm, vma_mt->vm_start, 1, FOLL_GET,
> +					    &page, &vma_mt, NULL);
> +		if (ret <= 0)
> +			continue;
> +
> +		/* The first page must have the ELF header. */
> +		ehdr = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
> +		if (!ehdr)
> +			goto put_page;
> +
> +		/* Check for magic bytes to make sure this is ehdr. */
> +		err = 0;
> +		err |= ((ehdr->e_ident[EI_MAG0] != ELFMAG0)
> +			|| (ehdr->e_ident[EI_MAG1] != ELFMAG1)
> +			|| (ehdr->e_ident[EI_MAG2] != ELFMAG2)
> +			|| (ehdr->e_ident[EI_MAG3] != ELFMAG3));

And again, operators go at the end.

> +		if (err)
> +			goto unmap;
> +
> +		/*
> +		 * FIXME TODO handle the case when number of program headers is
> +		 * greater than or equal to PN_XNUM later.
> +		 */

Also note the glorious inconsistency in comment styles within a single
function; why?

> +		if (ehdr->e_phnum == PN_XNUM)
> +			goto unmap;
> +		/*
> +		 * FIXME TODO handle the case when Elf phdrs span more than one
> +		 * page later ?
> +		 */
> +		if ((sizeof(Elf64_Ehdr) + ehdr->e_phentsize * ehdr->e_phnum)
> +		    > PAGE_SIZE)

Srsly, you can read this?, just bust the line limit or use a temporary
somewhere.

> +			goto unmap;
> +
> +		/* Save the location of program headers and the phnum. */
> +		phinfo.pi_addr = vma_mt->vm_start;
> +		phinfo.pi_phdr = (void *)ehdr + ehdr->e_phoff;
> +		phinfo.pi_phnum = ehdr->e_phnum;
> +
> +		res = callback(&phinfo, task, data);
> +unmap:
> +		vunmap(ehdr);
> +put_page:
> +		put_page(page);
> +
> +		if (res < 0)
> +			break;
> +	}
> +
> +	return res;
> +}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 3/5] sframe: add new SFrame library
  2023-05-01 20:04 ` [POC 3/5] sframe: add new SFrame library Indu Bhagat
  2023-05-01 22:40   ` Steven Rostedt
@ 2023-05-02  9:09   ` Peter Zijlstra
  2023-05-02  9:20   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2023-05-02  9:09 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: linux-toolchains, daandemeyer, andrii, rostedt, kris.van.hees,
	elena.zannoni, nick.alcock

On Mon, May 01, 2023 at 01:04:08PM -0700, Indu Bhagat wrote:
> +struct sframe_header
> +{
> +	struct sframe_preamble preamble;
> +	/* Information about the arch (endianness) and ABI. */
> +	uint8_t abi_arch;
> +	/*
> +	 * Offset for the Frame Pointer (FP) from CFA may be fixed for some
> +	 * ABIs (e.g, in AMD64 when -fno-omit-frame-pointer is used).  When fixed,
> +	 * this field specifies the fixed stack frame offset and the individual
> +	 * FREs do not need to track it.  When not fixed, it is set to
> +	 * SFRAME_CFA_FIXED_FP_INVALID, and the individual FREs may provide
> +	 * the applicable stack frame offset, if any.
> +	 */
> +	int8_t cfa_fixed_fp_offset;
> +	/*
> +	 * Offset for the Return Address from CFA is fixed for some ABIs
> +	 * (e.g., AMD64 has it as CFA-8).  When fixed, the header specifies the
> +	 * fixed stack frame offset and the individual FREs do not track it.  When
> +	 * not fixed, it is set to SFRAME_CFA_FIXED_RA_INVALID, and individual
> +	 * FREs provide the applicable stack frame offset, if any.
> +	 */
> +	int8_t cfa_fixed_ra_offset;
> +	/*
> +	 * Number of bytes making up the auxiliary header, if any.
> +	 * Some ABI/arch, in the future, may use this space for extending the
> +	 * information in SFrame header.  Auxiliary header is contained in
> +	 * bytes sequentially following the sframe_header.
> +	 */
> +	uint8_t auxhdr_len;
> +	/* Number of SFrame FDEs in this SFrame section. */
> +	uint32_t num_fdes;
> +	/* Number of SFrame Frame Row Entries. */
> +	uint32_t num_fres;
> +	/* Number of bytes in the SFrame Frame Row Entry section. */
> +	uint32_t fre_len;
> +	/* Offset of SFrame Function Descriptor Entry section. */
> +	uint32_t fdeoff;
> +	/* Offset of SFrame Frame Row Entry section. */
> +	uint32_t freoff;
> +} __packed;

So I know there's people that hate hard on tail comments, but personally
I can't read the above, something like:

	/*
	 * Number of bytes making up the auxiliary header, if any.
	 * Some ABI/arch, in the future, may use this space for extending the
	 * information in SFrame header.  Auxiliary header is contained in
	 * bytes sequentially following the sframe_header.
	 */
	uint8_t auxhdr_len;
	uint32_t num_fdes; /* Number of SFrame FDEs in this SFrame section. */
	uint32_t num_fres; /* Number of SFrame Frame Row Entries. */
	uint32_t fre_len;  /* Number of bytes in the SFrame Frame Row Entry section. */
	uint32_t fdeoff;   /* Offset of SFrame Function Descriptor Entry section. */
	uint32_t freoff;   /* Offset of SFrame Frame Row Entry section. */

Is far more readable (but yes, busts the line length a little).

An alternative form might be:

	/*
	 * @auxhdr_len: Number of bytes making up the auxiliary header, if any.
	 *   Some ABI/arch, in the future, may use this space for extending the
	 *   information in SFrame header.  Auxiliary header is contained in
	 *   bytes sequentially following the sframe_header.
	 * @num_fdes: Number of SFrame FDEs in this SFrame section.
	 * @num_fres: Number of SFrame Frame Row Entries.
	 * @fre_len:  Number of bytes in the SFrame Frame Row Entry section.
	 * @fdeoff:   Offset of SFrame Function Descriptor Entry section.
	 * @freoff:   Offset of SFrame Frame Row Entry section.
	 */
	uint8_t auxhdr_len;
	uint32_t num_fdes;
	uint32_t num_fres;
	uint32_t fre_len;
	uint32_t fdeoff;
	uint32_t freoff;



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 3/5] sframe: add new SFrame library
  2023-05-01 20:04 ` [POC 3/5] sframe: add new SFrame library Indu Bhagat
  2023-05-01 22:40   ` Steven Rostedt
  2023-05-02  9:09   ` Peter Zijlstra
@ 2023-05-02  9:20   ` Peter Zijlstra
  2023-05-02  9:28   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2023-05-02  9:20 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: linux-toolchains, daandemeyer, andrii, rostedt, kris.van.hees,
	elena.zannoni, nick.alcock

On Mon, May 01, 2023 at 01:04:08PM -0700, Indu Bhagat wrote:
> +static unsigned int sframe_get_fre_type(struct sframe_func_desc_entry *fdep)
> +{
> +	return (fdep) ? SFRAME_V1_FUNC_FRE_TYPE(fdep->func_info) : 0;
> +}
> +
> +static unsigned int sframe_get_fde_type(struct sframe_func_desc_entry *fdep)
> +{
> +	return (fdep) ? SFRAME_V1_FUNC_FDE_TYPE(fdep->func_info) : 0;
> +}

vs.

> +unsigned int sframe_fre_get_base_reg_id(struct sframe_fre *frep,
> +					int *errp)
> +{
> +	if (!frep)
> +		return sframe_set_errno(errp, SFRAME_ERR_FRE_INVAL);
> +
> +	return SFRAME_V1_FRE_CFA_BASE_REG_ID(frep->fre_info);
> +}

Inconsistent codying style here... FWIW, I prefer the latter.


> +int32_t sframe_fre_get_fp_offset(struct sframe_sec *sfsec,
> +				 struct sframe_fre *frep, int *errp)
> +{
> +	uint32_t fp_offset_idx = 0;

There's an unconditional assignment later, why are you initializing to 0
here? Also, that xmas thing Steve mentioned.

> +	int8_t fp_offset = sframe_sec_get_fixed_fp_offset(sfsec);
> +	/*
> +	 * If the FP offset is not being tracked, return the fixed FP offset
> +	 * from the SFrame header.
> +	 */
> +	if (fp_offset != SFRAME_CFA_FIXED_FP_INVALID) {
> +		*errp = 0;
> +		return fp_offset;
> +	}
> +
> +	/*
> +	 * In some ABIs, the stack offset to recover RA (using the CFA) from is
> +	 * fixed (like AMD64).  In such cases, the stack offset to recover FP
> +	 * will appear at the second index.
> +	 */
> +	fp_offset_idx = ((sframe_sec_get_fixed_ra_offset(sfsec)
> +			  != SFRAME_CFA_FIXED_RA_INVALID)
> +			 ? SFRAME_FRE_RA_OFFSET_IDX
> +			 : SFRAME_FRE_FP_OFFSET_IDX);

And as such, I'll suggest you write this like:

	fp_offset_idx = SFRAME_FRE_FP_OFFSET_IDX;
	if (sframe_sec_get_fixed_ra_offset(sfrec) != SFRAME_CFA_FIXED_RA_INVALID)
		fp_offset_idx = SFRAME_FRE_RA_OFFSET_IDX;

And also, if you'd use shorter identifiers, you'd not constantly run
into the line limit like you do.

Basically never use ?: unless you _have_ to :-)

> +	return sframe_get_fre_offset(frep, fp_offset_idx, errp);
> +}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 3/5] sframe: add new SFrame library
  2023-05-01 20:04 ` [POC 3/5] sframe: add new SFrame library Indu Bhagat
                     ` (2 preceding siblings ...)
  2023-05-02  9:20   ` Peter Zijlstra
@ 2023-05-02  9:28   ` Peter Zijlstra
  2023-05-02  9:30   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2023-05-02  9:28 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: linux-toolchains, daandemeyer, andrii, rostedt, kris.van.hees,
	elena.zannoni, nick.alcock

On Mon, May 01, 2023 at 01:04:08PM -0700, Indu Bhagat wrote:
> +int sframe_sec_find_fre(struct sframe_sec *sfsec, int32_t pc,
> +			struct sframe_fre *frep)
> +{
> +	struct sframe_func_desc_entry *fdep;
> +	uint32_t start_address, i;
> +	struct sframe_fre cur_fre, next_fre;
> +	unsigned char *fres;
> +	unsigned int fre_type, fde_type;
> +	size_t esz;
> +	int err = 0;
> +	size_t size = 0;
> +	/*
> +	 * For regular FDEs(i.e. fde_type SFRAME_FDE_TYPE_PCINC),
> +	 * where the start address in the FRE is an offset from start pc,
> +	 * use a bitmask with all bits set so that none of the address bits are
> +	 * ignored.  In this case, we need to return the FRE where
> +	 * (PC >= FRE_START_ADDR).
> +	 */
> +	uint64_t bitmask = 0xffffffff;
> +
> +	if ((sfsec == NULL) || (frep == NULL))
> +		return SFRAME_ERR_INVAL;
> +
> +	/* Find the FDE which contains the PC, then scan its FRE entries. */
> +	fdep = sframe_sec_find_fde(sfsec, pc, &err);
> +	if (fdep == NULL || sfsec->fres == NULL)
> +		return SFRAME_ERR_INIT_INVAL;
> +
> +	fre_type = sframe_get_fre_type(fdep);
> +	fde_type = sframe_get_fde_type(fdep);
> +
> +	/*
> +	 * For FDEs for repetitive pattern of insns, we need to return the FRE
> +	 * such that(PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK).
> +	 * so, update the bitmask to the start address.
> +	 */
> +	/* FIXME - the bitmask. */
> +	if (fde_type == SFRAME_FDE_TYPE_PCMASK)
> +		bitmask = 0xff;
> +
> +	fres = (unsigned char *)sfsec->fres + fdep->func_start_fre_off;
> +	for (i = 0; i < fdep->func_num_fres; i++) {
> +		err = sframe_sec_read_fre((const char *)fres, &next_fre,
> +					  fre_type, &esz);
> +		start_address = next_fre.start_ip_offset;
> +
> +		if (((fdep->func_start_address
> +		      + (int32_t)start_address) & bitmask) <= (pc & bitmask)) {

What's the purpose of that int32_t cast?

> +			sframe_fre_copy(&cur_fre, &next_fre);
> +
> +			/* Get the next FRE in sequence. */
> +			if (i < fdep->func_num_fres - 1) {
> +				fres += esz;
> +				err = sframe_sec_read_fre((const char *)fres,
> +							  &next_fre,
> +							  fre_type, &esz);
> +
> +				/* Sanity check the next FRE. */
> +				if (!sframe_fre_sanity_check_p(&next_fre))
> +					return SFRAME_ERR_FRE_INVAL;
> +
> +				size = next_fre.start_ip_offset;
> +			} else {
> +				size = fdep->func_size;
> +			}
> +
> +			if (((fdep->func_start_address
> +			      + (int32_t)size) & bitmask) > (pc & bitmask)) {
> +				/* Cur FRE contains the PC, return it. */
> +				sframe_fre_copy(frep, &cur_fre);
> +				return 0;
> +			}
> +		} else {
> +			return SFRAME_ERR_FRE_INVAL;
> +		}

Or, you could've done:

		start_address += fdep->func_start_address;
		if ((start_address & bitmask) > (pc & bitmask))
			return SFRAME_ERR_FRE_INVAL;

		...and saved yourself an indent level and an unreadable
		   linebreak...

> +	}
> +	return SFRAME_ERR_FDE_INVAL;
> +}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 3/5] sframe: add new SFrame library
  2023-05-01 20:04 ` [POC 3/5] sframe: add new SFrame library Indu Bhagat
                     ` (3 preceding siblings ...)
  2023-05-02  9:28   ` Peter Zijlstra
@ 2023-05-02  9:30   ` Peter Zijlstra
  2023-05-03  6:03     ` Indu Bhagat
  2023-05-02 10:31   ` Peter Zijlstra
  2023-05-02 10:41   ` Peter Zijlstra
  6 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2023-05-02  9:30 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: linux-toolchains, daandemeyer, andrii, rostedt, kris.van.hees,
	elena.zannoni, nick.alcock

On Mon, May 01, 2023 at 01:04:08PM -0700, Indu Bhagat wrote:
> +	/*
> +	 * Skip binary search if FDE sub-section is not sorted on PCs.  GNU ld
> +	 * sorts the FDEs on start PC by default though.
> +	 */
> +	if ((header->preamble.flags & SFRAME_F_FDE_SORTED) == 0) {
> +		sframe_set_errno(errp, SFRAME_ERR_FDE_NOTSORTED);
> +		return NULL;
> +	}

Why does SFrame allow unsorted at all? That seems like a bug in the
spec.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 3/5] sframe: add new SFrame library
  2023-05-01 20:04 ` [POC 3/5] sframe: add new SFrame library Indu Bhagat
                     ` (4 preceding siblings ...)
  2023-05-02  9:30   ` Peter Zijlstra
@ 2023-05-02 10:31   ` Peter Zijlstra
  2023-05-02 10:41   ` Peter Zijlstra
  6 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2023-05-02 10:31 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: linux-toolchains, daandemeyer, andrii, rostedt, kris.van.hees,
	elena.zannoni, nick.alcock

On Mon, May 01, 2023 at 01:04:08PM -0700, Indu Bhagat wrote:

> +static int sframe_fre_copy(struct sframe_fre *dst,
> +			   struct sframe_fre *src)
> +{
> +	if (dst == NULL || src == NULL)
> +		return SFRAME_ERR;
> +
> +	memcpy(dst, src, sizeof(struct sframe_fre));
> +	return 0;
> +}

> +int sframe_sec_find_fre(struct sframe_sec *sfsec, int32_t pc,
> +			struct sframe_fre *frep)
> +{
> +	struct sframe_func_desc_entry *fdep;
> +	uint32_t start_address, i;
> +	struct sframe_fre cur_fre, next_fre;
> +	unsigned char *fres;
> +	unsigned int fre_type, fde_type;
> +	size_t esz;
> +	int err = 0;
> +	size_t size = 0;
> +	/*
> +	 * For regular FDEs(i.e. fde_type SFRAME_FDE_TYPE_PCINC),
> +	 * where the start address in the FRE is an offset from start pc,
> +	 * use a bitmask with all bits set so that none of the address bits are
> +	 * ignored.  In this case, we need to return the FRE where
> +	 * (PC >= FRE_START_ADDR).
> +	 */
> +	uint64_t bitmask = 0xffffffff;
> +
> +	if ((sfsec == NULL) || (frep == NULL))
> +		return SFRAME_ERR_INVAL;
> +
> +	/* Find the FDE which contains the PC, then scan its FRE entries. */
> +	fdep = sframe_sec_find_fde(sfsec, pc, &err);
> +	if (fdep == NULL || sfsec->fres == NULL)
> +		return SFRAME_ERR_INIT_INVAL;
> +
> +	fre_type = sframe_get_fre_type(fdep);
> +	fde_type = sframe_get_fde_type(fdep);
> +
> +	/*
> +	 * For FDEs for repetitive pattern of insns, we need to return the FRE
> +	 * such that(PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK).
> +	 * so, update the bitmask to the start address.
> +	 */
> +	/* FIXME - the bitmask. */
> +	if (fde_type == SFRAME_FDE_TYPE_PCMASK)
> +		bitmask = 0xff;
> +
> +	fres = (unsigned char *)sfsec->fres + fdep->func_start_fre_off;
> +	for (i = 0; i < fdep->func_num_fres; i++) {
> +		err = sframe_sec_read_fre((const char *)fres, &next_fre,
> +					  fre_type, &esz);
> +		start_address = next_fre.start_ip_offset;
> +
> +		if (((fdep->func_start_address
> +		      + (int32_t)start_address) & bitmask) <= (pc & bitmask)) {
> +			sframe_fre_copy(&cur_fre, &next_fre);
> +
> +			/* Get the next FRE in sequence. */
> +			if (i < fdep->func_num_fres - 1) {
> +				fres += esz;
> +				err = sframe_sec_read_fre((const char *)fres,
> +							  &next_fre,
> +							  fre_type, &esz);
> +
> +				/* Sanity check the next FRE. */
> +				if (!sframe_fre_sanity_check_p(&next_fre))
> +					return SFRAME_ERR_FRE_INVAL;
> +
> +				size = next_fre.start_ip_offset;
> +			} else {
> +				size = fdep->func_size;
> +			}
> +
> +			if (((fdep->func_start_address
> +			      + (int32_t)size) & bitmask) > (pc & bitmask)) {
> +				/* Cur FRE contains the PC, return it. */
> +				sframe_fre_copy(frep, &cur_fre);
> +				return 0;
> +			}
> +		} else {
> +			return SFRAME_ERR_FRE_INVAL;
> +		}
> +	}
> +	return SFRAME_ERR_FDE_INVAL;
> +}

What's the purpose of that sframe_fre_copy() thing? Why can't you do:

	cur_fre = next_fre;

	...

	if (frep)
		*frep = cur_fre;


and call it a day? C is quite capable of writing that memcpy for you.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 3/5] sframe: add new SFrame library
  2023-05-01 20:04 ` [POC 3/5] sframe: add new SFrame library Indu Bhagat
                     ` (5 preceding siblings ...)
  2023-05-02 10:31   ` Peter Zijlstra
@ 2023-05-02 10:41   ` Peter Zijlstra
  2023-05-02 15:22     ` Steven Rostedt
  6 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2023-05-02 10:41 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: linux-toolchains, daandemeyer, andrii, rostedt, kris.van.hees,
	elena.zannoni, nick.alcock

On Mon, May 01, 2023 at 01:04:08PM -0700, Indu Bhagat wrote:
> +/* SFrame FRE types. */
> +#define SFRAME_FRE_TYPE_ADDR1	0
> +#define SFRAME_FRE_TYPE_ADDR2	1
> +#define SFRAME_FRE_TYPE_ADDR4	2

> +static size_t sframe_fre_get_start_address_tsize(unsigned int fre_type)

That name is just amazing :/

> +{
> +	/* Type size of the start_addr in an FRE. */
> +	size_t saddr_tsize = 0;
> +
> +	switch (fre_type) {
> +	case SFRAME_FRE_TYPE_ADDR1:
> +		saddr_tsize = sizeof(uint8_t);
> +		break;
> +	case SFRAME_FRE_TYPE_ADDR2:
> +		saddr_tsize = sizeof(uint16_t);
> +		break;
> +	case SFRAME_FRE_TYPE_ADDR4:
> +		saddr_tsize = sizeof(uint32_t);
> +		break;
> +	default:
> +		/* No other value is expected. */
> +		break;
> +	}
> +	return saddr_tsize;

	return 1 << fre_type;

was too complicated?

> +}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 5/5] x86_64: invoke SFrame based stack tracer for user space
  2023-05-01 20:04 ` [POC 5/5] x86_64: invoke SFrame based stack tracer for user space Indu Bhagat
  2023-05-01 23:11   ` Steven Rostedt
@ 2023-05-02 10:53   ` Peter Zijlstra
  2023-05-02 15:27     ` Steven Rostedt
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2023-05-02 10:53 UTC (permalink / raw)
  To: Indu Bhagat
  Cc: linux-toolchains, daandemeyer, andrii, rostedt, kris.van.hees,
	elena.zannoni, nick.alcock

On Mon, May 01, 2023 at 01:04:10PM -0700, Indu Bhagat wrote:
> The task's sframe_state is allocated and initialized if a phdr with type
> PT_GNU_SFRAME is encountered for the binary.
> 
> perf_callchain_user() will fall back on the frame pointer based stack
> trace approach if:
>   - SFrame section for the main program is not found.
>   - SFrame state for the task is either not setup or stale and cannot
>   be refreshed.
> 
> Finally, the sframe_state is cleaned up in release_task().
> 
> Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
> ---
>  arch/x86/events/core.c | 51 ++++++++++++++++++++++++++++++++++++++++++
>  fs/binfmt_elf.c        | 39 ++++++++++++++++++++++++++++++++
>  kernel/exit.c          |  9 ++++++++
>  3 files changed, 99 insertions(+)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index d096b04bf80e..4be9e826714a 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2860,11 +2860,54 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
>  }
>  #endif
>  
> +#ifdef CONFIG_USER_UNWINDER_SFRAME
> +
> +#include <sframe/sframe_unwind.h>
> +
> +/* Check if the specified task has SFrame unwind state set up.  */
> +static inline bool check_sframe_state_p(struct task_struct *task)
> +{
> +	bool sframe_ok = false;
> +
> +	/* FIXME TODO - only current task can be unwinded at this time.
> +	 * Even for current tasks, following unknowns remain and hence, not
> +	 * handled:
> +	 *    - dlopen / dlclose detection and update of sframe_state,
> +	 *    - in general, any change in memory mappings.
> +	 */
> +	if (task != current)
> +		return false;
> +
> +	if (!task->sframe_state)
> +		return false;
> +
> +	sframe_ok = (unwind_sframe_state_valid_p(task->sframe_state)
> +		     || unwind_sframe_state_ready_p(task->sframe_state));

Please; forget this style is even possible, it is horrific :/

> +
> +	return sframe_ok;
> +}
> +
> +#else
> +/* Check if the specified task has SFrame unwind state set up. */
> +static inline bool check_sframe_state_p(struct task_struct *task)
> +{
> +	  return false;
> +}
> +
> +static inline int sframe_callchain_user(struct task_struct *task,
> +					struct perf_callchain_entry_ctx *entry,
> +					struct pt_regs *regs)
> +{
> +	  return 0;
> +}
> +#endif
> +
>  void
>  perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
>  {
>  	struct stack_frame frame;
>  	const struct stack_frame __user *fp;
> +	bool sframe_avail;
>  
>  	if (perf_guest_state()) {
>  		/* TODO: We don't support guest os callchain now */
> @@ -2887,7 +2930,15 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
>  	if (perf_callchain_user32(regs, entry))
>  		return;
>  
> +	sframe_avail = check_sframe_state_p (current);
> +
>  	pagefault_disable();
> +
> +	if (sframe_avail && !sframe_callchain_user (current, entry, regs)) {
> +		pagefault_enable();
> +		return;
> +	}
> +
>  	while (entry->nr < entry->max_stack) {
>  		if (!valid_user_frame(fp, sizeof(frame)))
>  			break;

This is broken, the sframe stuff is not NMI safe. First you need to
change perf to emit a forward reference to a 'next' user trace and then
emit the user trace on return-to-user.

As is, perf does everything in-place from NMI context.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 3/5] sframe: add new SFrame library
  2023-05-02 10:41   ` Peter Zijlstra
@ 2023-05-02 15:22     ` Steven Rostedt
  0 siblings, 0 replies; 32+ messages in thread
From: Steven Rostedt @ 2023-05-02 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Indu Bhagat, linux-toolchains, daandemeyer, andrii,
	kris.van.hees, elena.zannoni, nick.alcock

On Tue, 2 May 2023 12:41:25 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > +{
> > +	/* Type size of the start_addr in an FRE. */
> > +	size_t saddr_tsize = 0;
> > +
> > +	switch (fre_type) {
> > +	case SFRAME_FRE_TYPE_ADDR1:
> > +		saddr_tsize = sizeof(uint8_t);
> > +		break;
> > +	case SFRAME_FRE_TYPE_ADDR2:
> > +		saddr_tsize = sizeof(uint16_t);
> > +		break;
> > +	case SFRAME_FRE_TYPE_ADDR4:
> > +		saddr_tsize = sizeof(uint32_t);
> > +		break;
> > +	default:
> > +		/* No other value is expected. */
> > +		break;
> > +	}
> > +	return saddr_tsize;  
> 
> 	return 1 << fre_type;
> 
> was too complicated?

I would argue that the above is easier for the reviewer, but yes, the
"1 << fre_type" works too, and is more efficient, but requires a comment
that explains it.

It would also require a comment stating the dependency to this by the enum
definition.

-- Steve

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 5/5] x86_64: invoke SFrame based stack tracer for user space
  2023-05-02 10:53   ` Peter Zijlstra
@ 2023-05-02 15:27     ` Steven Rostedt
  2023-05-16 17:25       ` Andrii Nakryiko
  2024-03-13 14:37       ` Tatsuyuki Ishi
  0 siblings, 2 replies; 32+ messages in thread
From: Steven Rostedt @ 2023-05-02 15:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Indu Bhagat, linux-toolchains, daandemeyer, andrii,
	kris.van.hees, elena.zannoni, nick.alcock

On Tue, 2 May 2023 12:53:53 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> >  	while (entry->nr < entry->max_stack) {
> >  		if (!valid_user_frame(fp, sizeof(frame)))
> >  			break;  
> 
> This is broken, the sframe stuff is not NMI safe. First you need to
> change perf to emit a forward reference to a 'next' user trace and then
> emit the user trace on return-to-user.
> 
> As is, perf does everything in-place from NMI context.

Exactly. What I would like to see with the new sframe infrastructure is
just a way to tell it "I want a user stack trace before going back to user
space". Then all the work can be done in the ptrace path, where it's safe
to memory map in the sframe section. It's not like the user space stack
frame will change before going back to user space.

This will allow for asking for the user space stack trace at any context,
because the work will not be done until later. Of course, this may also
require user space tooling to be updated to handle this case. Perf may
already do that.

It would also mean that even if you take multiple profiling hits in one
system call, you will only get one user space stack trace. But that's a
good thing. As the stack trace doesn't change, and you don't waste ring
buffer space with unneeded duplicate stacks.

-- Steve

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 3/5] sframe: add new SFrame library
  2023-05-02  9:30   ` Peter Zijlstra
@ 2023-05-03  6:03     ` Indu Bhagat
  0 siblings, 0 replies; 32+ messages in thread
From: Indu Bhagat @ 2023-05-03  6:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-toolchains, daandemeyer, andrii, rostedt, kris.van.hees,
	elena.zannoni, nick.alcock

On 5/2/23 02:30, Peter Zijlstra wrote:
> On Mon, May 01, 2023 at 01:04:08PM -0700, Indu Bhagat wrote:
>> +	/*
>> +	 * Skip binary search if FDE sub-section is not sorted on PCs.  GNU ld
>> +	 * sorts the FDEs on start PC by default though.
>> +	 */
>> +	if ((header->preamble.flags & SFRAME_F_FDE_SORTED) == 0) {
>> +		sframe_set_errno(errp, SFRAME_ERR_FDE_NOTSORTED);
>> +		return NULL;
>> +	}
> 
> Why does SFrame allow unsorted at all? That seems like a bug in the
> spec.

The GNU assembler emits an SFrame section with Function Descriptor 
Entries (FDEs) not sorted on PC.

Its only the GNU ld which always emits the final SFrame section with 
FDEs sorted on the start PCs of functions.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 5/5] x86_64: invoke SFrame based stack tracer for user space
  2023-05-02 15:27     ` Steven Rostedt
@ 2023-05-16 17:25       ` Andrii Nakryiko
  2023-05-16 17:38         ` Steven Rostedt
  2024-03-13 14:37       ` Tatsuyuki Ishi
  1 sibling, 1 reply; 32+ messages in thread
From: Andrii Nakryiko @ 2023-05-16 17:25 UTC (permalink / raw)
  To: Steven Rostedt, bpf
  Cc: Peter Zijlstra, Indu Bhagat, linux-toolchains, daandemeyer,
	andrii, kris.van.hees, elena.zannoni, nick.alcock

adding bpf@

On Tue, May 2, 2023 at 8:27 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 2 May 2023 12:53:53 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > >     while (entry->nr < entry->max_stack) {
> > >             if (!valid_user_frame(fp, sizeof(frame)))
> > >                     break;
> >
> > This is broken, the sframe stuff is not NMI safe. First you need to
> > change perf to emit a forward reference to a 'next' user trace and then
> > emit the user trace on return-to-user.
> >
> > As is, perf does everything in-place from NMI context.
>
> Exactly. What I would like to see with the new sframe infrastructure is
> just a way to tell it "I want a user stack trace before going back to user
> space". Then all the work can be done in the ptrace path, where it's safe
> to memory map in the sframe section. It's not like the user space stack
> frame will change before going back to user space.
>
> This will allow for asking for the user space stack trace at any context,
> because the work will not be done until later. Of course, this may also
> require user space tooling to be updated to handle this case. Perf may
> already do that.
>
> It would also mean that even if you take multiple profiling hits in one
> system call, you will only get one user space stack trace. But that's a
> good thing. As the stack trace doesn't change, and you don't waste ring
> buffer space with unneeded duplicate stacks.

As discussed in the halls of LSF/MM2023, such lazily mapped .sframe in
approach won't work with BPF's bpf_get_stack() approach, which expects
stack trace to be grabbed and returned synchronously from NMI context.
But we can probably retrofit it into bpf_get_stackid()+STACK_TRACE BPF
map API.

Indu, please cc bpf@vger.kernel.org for future revisions so we can
track and plan accordingly. Thank you!

>
> -- Steve

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 5/5] x86_64: invoke SFrame based stack tracer for user space
  2023-05-16 17:25       ` Andrii Nakryiko
@ 2023-05-16 17:38         ` Steven Rostedt
  2023-05-16 17:51           ` Andrii Nakryiko
  0 siblings, 1 reply; 32+ messages in thread
From: Steven Rostedt @ 2023-05-16 17:38 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Peter Zijlstra, Indu Bhagat, linux-toolchains, daandemeyer,
	andrii, kris.van.hees, elena.zannoni, nick.alcock

On Tue, 16 May 2023 10:25:52 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> As discussed in the halls of LSF/MM2023, such lazily mapped .sframe in
> approach won't work with BPF's bpf_get_stack() approach, which expects
> stack trace to be grabbed and returned synchronously from NMI context.
> But we can probably retrofit it into bpf_get_stackid()+STACK_TRACE BPF
> map API.

Note that we will likely not be able to use sframe in NMI context, and if
that's a requirement, then BPF will need to continue using the method it is
currently using.

The best way to access sframe reliable is in normal context. NMI is
special, and really should never had been used to access user space
addresses. That was just a simple solution but not a good one. There's a
lot of hacks just to allow a page fault in NMI context.
See https://lwn.net/Articles/484932/

> 
> Indu, please cc bpf@vger.kernel.org for future revisions so we can
> track and plan accordingly. Thank you!

I'll likely be taking over the kernel side of sframes. I'll be happy to
Cc that work to the bpf mailing list.

-- Steve

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 5/5] x86_64: invoke SFrame based stack tracer for user space
  2023-05-16 17:38         ` Steven Rostedt
@ 2023-05-16 17:51           ` Andrii Nakryiko
  0 siblings, 0 replies; 32+ messages in thread
From: Andrii Nakryiko @ 2023-05-16 17:51 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: bpf, Peter Zijlstra, Indu Bhagat, linux-toolchains, daandemeyer,
	andrii, kris.van.hees, elena.zannoni, nick.alcock

On Tue, May 16, 2023 at 10:38 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 16 May 2023 10:25:52 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > As discussed in the halls of LSF/MM2023, such lazily mapped .sframe in
> > approach won't work with BPF's bpf_get_stack() approach, which expects
> > stack trace to be grabbed and returned synchronously from NMI context.
> > But we can probably retrofit it into bpf_get_stackid()+STACK_TRACE BPF
> > map API.
>
> Note that we will likely not be able to use sframe in NMI context, and if
> that's a requirement, then BPF will need to continue using the method it is
> currently using.

Yes, unfortunately. We should give users a way to use sframes, but
transition might be slow.

>
> The best way to access sframe reliable is in normal context. NMI is
> special, and really should never had been used to access user space
> addresses. That was just a simple solution but not a good one. There's a
> lot of hacks just to allow a page fault in NMI context.
> See https://lwn.net/Articles/484932/
>
> >
> > Indu, please cc bpf@vger.kernel.org for future revisions so we can
> > track and plan accordingly. Thank you!
>
> I'll likely be taking over the kernel side of sframes. I'll be happy to
> Cc that work to the bpf mailing list.

Thanks!


>
> -- Steve

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 5/5] x86_64: invoke SFrame based stack tracer for user space
  2023-05-02 15:27     ` Steven Rostedt
  2023-05-16 17:25       ` Andrii Nakryiko
@ 2024-03-13 14:37       ` Tatsuyuki Ishi
  2024-03-13 14:52         ` Steven Rostedt
  1 sibling, 1 reply; 32+ messages in thread
From: Tatsuyuki Ishi @ 2024-03-13 14:37 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Indu Bhagat, linux-toolchains, daandemeyer,
	andrii, kris.van.hees, elena.zannoni, nick.alcock

> On May 2, 2023, at 23:27, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Tue, 2 May 2023 12:53:53 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
>>> 	while (entry->nr < entry->max_stack) {
>>> 		if (!valid_user_frame(fp, sizeof(frame)))
>>> 			break;  
>> 
>> This is broken, the sframe stuff is not NMI safe. First you need to
>> change perf to emit a forward reference to a 'next' user trace and then
>> emit the user trace on return-to-user.
>> 
>> As is, perf does everything in-place from NMI context.
> 
> Exactly. What I would like to see with the new sframe infrastructure is
> just a way to tell it "I want a user stack trace before going back to user
> space". Then all the work can be done in the ptrace path, where it's safe
> to memory map in the sframe section. It's not like the user space stack
> frame will change before going back to user space.
> 

Hi, just discovered this idea while looking for options to implement in-kernel unwinding, and in our case, for Wine applications where recompiling existing proprietary applications with frame pointers is not feasible.

I looked a bit more into how deferring work from NMI could work, but it looks like this might not be easy for now. For normal IRQs, we can call task_work_add and irqentry_exit will handle it for us. But task works are not triggered for NMI, and that’s likely because unmasking NMI requires IRET on x86, so if the NMI comes in during user mode, then we need to forge a kernel IRET frame to do non-NMI work.

Does the forging stuff sound feasible? Suggestions are also welcome.

Tatsuyuki.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 5/5] x86_64: invoke SFrame based stack tracer for user space
  2024-03-13 14:37       ` Tatsuyuki Ishi
@ 2024-03-13 14:52         ` Steven Rostedt
  2024-03-13 14:58           ` Tatsuyuki Ishi
  0 siblings, 1 reply; 32+ messages in thread
From: Steven Rostedt @ 2024-03-13 14:52 UTC (permalink / raw)
  To: Tatsuyuki Ishi
  Cc: Peter Zijlstra, Indu Bhagat, linux-toolchains, daandemeyer,
	andrii, kris.van.hees, elena.zannoni, nick.alcock

On Wed, 13 Mar 2024 22:37:26 +0800
Tatsuyuki Ishi <ishitatsuyuki@gmail.com> wrote:

  
> 
> Hi, just discovered this idea while looking for options to implement
> in-kernel unwinding, and in our case, for Wine applications where
> recompiling existing proprietary applications with frame pointers is not
> feasible.
> 
> I looked a bit more into how deferring work from NMI could work, but it
> looks like this might not be easy for now. For normal IRQs, we can call
> task_work_add and irqentry_exit will handle it for us. But task works are
> not triggered for NMI, and that’s likely because unmasking NMI requires
> IRET on x86, so if the NMI comes in during user mode, then we need to
> forge a kernel IRET frame to do non-NMI work.
> 
> Does the forging stuff sound feasible? Suggestions are also welcome.

The simple answer is to call irq_work. That will trigger an interrupt when
irqs are enabled again (returning from NMI and just before entering user
space).

The return from the irq_work can do the proper ptrace path.

-- Steve


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 5/5] x86_64: invoke SFrame based stack tracer for user space
  2024-03-13 14:52         ` Steven Rostedt
@ 2024-03-13 14:58           ` Tatsuyuki Ishi
  2024-03-13 15:04             ` Steven Rostedt
  0 siblings, 1 reply; 32+ messages in thread
From: Tatsuyuki Ishi @ 2024-03-13 14:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Indu Bhagat, linux-toolchains, daandemeyer,
	andrii, kris.van.hees, elena.zannoni, nick.alcock

> On Mar 13, 2024, at 22:52, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> The simple answer is to call irq_work. That will trigger an interrupt when
> irqs are enabled again (returning from NMI and just before entering user
> space).
> 
> The return from the irq_work can do the proper ptrace path.
> 

Yeah, that makes sense. Thanks!
Not sure how much overhead a self-IPI causes, compared to forging an IRET, but we can probably measure that later.

Tatsuyuki


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [POC 5/5] x86_64: invoke SFrame based stack tracer for user space
  2024-03-13 14:58           ` Tatsuyuki Ishi
@ 2024-03-13 15:04             ` Steven Rostedt
  0 siblings, 0 replies; 32+ messages in thread
From: Steven Rostedt @ 2024-03-13 15:04 UTC (permalink / raw)
  To: Tatsuyuki Ishi
  Cc: Peter Zijlstra, Indu Bhagat, linux-toolchains, daandemeyer,
	andrii, kris.van.hees, elena.zannoni, nick.alcock

On Wed, 13 Mar 2024 22:58:18 +0800
Tatsuyuki Ishi <ishitatsuyuki@gmail.com> wrote:

> > On Mar 13, 2024, at 22:52, Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > The simple answer is to call irq_work. That will trigger an interrupt when
> > irqs are enabled again (returning from NMI and just before entering user
> > space).
> > 
> > The return from the irq_work can do the proper ptrace path.
> >   
> 
> Yeah, that makes sense. Thanks!
> Not sure how much overhead a self-IPI causes, compared to forging an IRET, but we can probably measure that later.
> 

I brought up the "forging of an IRET" before, and was told about irq_work ;-)

The "forging of an IRET" from NMI will never make it upstream. The NMI
handling is already way too complex. Adding more complexity just for
tracing user space stack traces will never fly.

-- Steve

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2024-03-13 15:02 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-01 20:04 [POC 0/5] SFrame based stack tracer for user space in the kernel Indu Bhagat
2023-05-01 20:04 ` [POC 1/5] Kconfig: x86: Add new config options for userspace unwinder Indu Bhagat
2023-05-01 20:04 ` [POC 2/5] task_struct : add additional member for sframe state Indu Bhagat
2023-05-01 20:04 ` [POC 3/5] sframe: add new SFrame library Indu Bhagat
2023-05-01 22:40   ` Steven Rostedt
2023-05-02  5:07     ` Indu Bhagat
2023-05-02  8:46     ` Peter Zijlstra
2023-05-02  9:09   ` Peter Zijlstra
2023-05-02  9:20   ` Peter Zijlstra
2023-05-02  9:28   ` Peter Zijlstra
2023-05-02  9:30   ` Peter Zijlstra
2023-05-03  6:03     ` Indu Bhagat
2023-05-02 10:31   ` Peter Zijlstra
2023-05-02 10:41   ` Peter Zijlstra
2023-05-02 15:22     ` Steven Rostedt
2023-05-01 20:04 ` [POC 4/5] sframe: add an SFrame format stack tracer Indu Bhagat
2023-05-01 23:00   ` Steven Rostedt
2023-05-02  6:16     ` Indu Bhagat
2023-05-02  8:53   ` Peter Zijlstra
2023-05-02  9:04   ` Peter Zijlstra
2023-05-01 20:04 ` [POC 5/5] x86_64: invoke SFrame based stack tracer for user space Indu Bhagat
2023-05-01 23:11   ` Steven Rostedt
2023-05-02 10:53   ` Peter Zijlstra
2023-05-02 15:27     ` Steven Rostedt
2023-05-16 17:25       ` Andrii Nakryiko
2023-05-16 17:38         ` Steven Rostedt
2023-05-16 17:51           ` Andrii Nakryiko
2024-03-13 14:37       ` Tatsuyuki Ishi
2024-03-13 14:52         ` Steven Rostedt
2024-03-13 14:58           ` Tatsuyuki Ishi
2024-03-13 15:04             ` Steven Rostedt
2023-05-01 22:15 ` [POC 0/5] SFrame based stack tracer for user space in the kernel Steven Rostedt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).