linux-toolchains.vger.kernel.org archive mirror
From: Indu Bhagat <indu.bhagat@oracle.com>
To: linux-toolchains@vger.kernel.org
Cc: daandemeyer@meta.com, andrii@kernel.org, rostedt@goodmis.org,
	kris.van.hees@oracle.com, elena.zannoni@oracle.com,
	nick.alcock@oracle.com, Indu Bhagat <indu.bhagat@oracle.com>
Subject: [POC 0/5] SFrame based stack tracer for user space in the kernel
Date: Mon,  1 May 2023 13:04:05 -0700
Message-ID: <20230501200410.3973453-1-indu.bhagat@oracle.com>

Hello,

This patch set is a Proof of Concept implementation for an SFrame-based
stack tracer for user space in the kernel. Some of you had expressed interest
in exploring this earlier; hopefully, this POC helps discuss the design and
take it forward.

Motivation
==========
Generating stack traces is vital for all profiling, tracing and debugging
tools. In the context of generating stack traces for user space, frame-pointer
based unwinding works, but has its issues ([1],[2]).  EH_Frame based
unwinding seems undesirable for the kernel's unwinding needs ([3],[4]).
In general, EH_Frame based unwinding is undesirable in applications that need
fast, real-time stack tracers (e.g., profilers), because of the overhead of
interpreting and executing DWARF opcodes to calculate the relevant stack
offsets.

The SFrame (Simple Frame) stack trace format is designed to address these
concerns.  With this POC, we would like to see how to use SFrame as a viable
alternative for user space stack tracing needs in the kernel.

[1] https://lwn.net/Articles/919940/
[2] https://pagure.io/fesco/issue/2817
[3] https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/OOJDAKTJB5WGMOZRXTUX7FTPFBF3H7WE/#NXRMNKD4B23HX7U5ICMKFRZO6Z3VXQXL
[4] https://lkml.org/lkml/2012/2/10/356

What is the SFrame format
=========================

SFrame is the "Simple Frame" stack trace format.  The format is documented as
part of the binutils documentation at https://sourceware.org/binutils/docs.

Starting with binutils 2.40, the GNU assembler (as) can generate SFrame stack
trace data based on the CFI directives found in the source assembly.  This is
achieved by using the --gsframe command line option when invoking the
assembler.  This option plays the same role as the existing --gdwarf-[2345]
options, only this time referring to SFrame.  The resulting stack tracing
information is stored in a new segment of its own with type PT_GNU_SFRAME,
containing a section named '.sframe'.

Also starting with binutils 2.40, the GNU linker (ld) knows how to merge
sections containing SFrame stack trace info.
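
As an illustration only (this is not code from the patch set), a consumer
can tell whether an object carries SFrame data by scanning its program
headers for PT_GNU_SFRAME.  The sketch below is user space C; the helper name
find_sframe_phdr() is hypothetical, and the fallback #define assumes the
value binutils assigns to the segment type (PT_LOOS + 0x474e554) for systems
whose <elf.h> predates it.

```c
#include <elf.h>
#include <stddef.h>

/* Older <elf.h> headers may not define PT_GNU_SFRAME yet.  The value below
   is assumed to match binutils: PT_LOOS + 0x474e554.  */
#ifndef PT_GNU_SFRAME
#define PT_GNU_SFRAME 0x6474e554
#endif

/* Hypothetical helper: return the first PT_GNU_SFRAME program header in
   PHDRS (PHNUM entries), or NULL if the object has no SFrame segment.  */
static const Elf64_Phdr *
find_sframe_phdr (const Elf64_Phdr *phdrs, size_t phnum)
{
  for (size_t i = 0; i < phnum; i++)
    if (phdrs[i].p_type == PT_GNU_SFRAME)
      return &phdrs[i];
  return NULL;
}
```

When a header is found, its p_vaddr and p_memsz describe where the .sframe
data lives once the object is loaded.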

SFrame based user space stack tracer POC
========================================
These patches implement a POC for an SFrame based user space stack tracer (for
x86) in the kernel.  The purpose of this code is to serve as a reference,
initiate discussions, and perhaps serve as a starting point for a viable
implementation of an SFrame based stack tracer.  Please keep in mind that my
familiarity with kernel code/processes/conventions is still limited ;-).

High-level Design in this POC
=============================
Kconfig adds two config options for userspace unwinding:
  - config USER_UNWINDER_SFRAME to enable the SFrame userspace unwinder
  - config USER_UNWINDER_FRAME_POINTER to enable the Frame Pointer userspace
    unwinder

If CONFIG_USER_UNWINDER_SFRAME is set, the task_struct keeps a reference to
the sframe_state object for the task.

For long running user programs, it makes sense to cache the sframe_state
in the task and be able to simply do a quick do_sframe_unwind() at every
unwind request.  Caching the sframe_state also means keeping the .sframe
pages (for the prog and its DSOs) pinned.  The task's sframe_state is
kmalloc'ed and initialized in load_elf_binary, when the task is about to begin
execution.  The open issue with this design, however, is that we need to
detect when additional DSOs are brought in at run-time by the application.

The detection (and resolution) of stale sframe_state is not implemented in this
POC.  As such, the POC at this time is fit only for applications that are
statically linked.

The following pseudocode roughly describes the relevant stubs and how the
SFrame-based unwinder is currently hooked in.

load_elf_binary()
{
  ...
  // check whether any program header has p_type PT_GNU_SFRAME
  if any phdr.p_type == PT_GNU_SFRAME
    sframe_avail = true
  ...

  if sframe_avail
    sframe_state_setup() // does all kmallocs and get_user_pages_XX
  ...
  finalize_exec (bprm)
}


perf_callchain_user()
{
   ...
   // check if task.sframe_state is valid
   sframe_avail = check_sframe_state_p (current);
   
   pagefault_disable()
   
   // check if task.sframe_state is ready and not stale
   if sframe_avail && task.sframe_state is ready
       ret = sframe_callchain_user() // uses __get_user to access stack
       if ret is success
           pagefault_enable()
           return
   
   ...
   // otherwise, fall back to frame pointer based unwinding
   pagefault_enable()
   ...
}

task_struct.sframe_state is cleaned up in release_task().
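
The try-SFrame-then-fall-back flow in perf_callchain_user() above can be
sketched in plain C as follows.  This is only an illustration of the control
flow, not the patch code: callchain_user(), the unwind_fn callback type and
the stub unwinders are all hypothetical names, and the kernel-only parts
(pagefault_disable/enable, __get_user) are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical unwinder callback: stores up to MAX return addresses in IPS
   and returns the number stored, or -1 on failure.  */
typedef int (*unwind_fn) (unsigned long *ips, size_t max);

/* Prefer the SFrame unwinder when the cached state is usable; otherwise,
   or if SFrame unwinding fails, use the frame pointer unwinder.  */
static int
callchain_user (bool sframe_ready, unwind_fn sframe_unwind,
                unwind_fn fp_unwind, unsigned long *ips, size_t max)
{
  if (sframe_ready)
    {
      int n = sframe_unwind (ips, max);
      if (n >= 0)
        return n;       /* SFrame succeeded; skip the fallback.  */
    }
  return fp_unwind (ips, max);
}

/* Stand-in unwinders, used only to exercise the control flow.  */
static int
stub_sframe_fail (unsigned long *ips, size_t max)
{
  (void) ips;
  (void) max;
  return -1;
}

static int
stub_sframe_ok (unsigned long *ips, size_t max)
{
  if (max < 1)
    return -1;
  ips[0] = 0x401136;
  return 1;
}

static int
stub_fp (unsigned long *ips, size_t max)
{
  if (max < 2)
    return -1;
  ips[0] = 0x401136;
  ips[1] = 0x401165;
  return 2;
}
```

With these stubs, a ready SFrame state returns the SFrame result, while a
failing or unavailable SFrame unwinder yields the frame pointer result.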

What do you think about the above workflow?

What about caching the sframe_state in task_struct? As you see, there are
some open issues around this, and discussion is needed to help resolve
some of those.

Apart from the above design points, other reasons why this remains a POC and
not ready for submission are:

  - Code deals with only Elf64_Phdr (no Elf32_Phdr) at this time; some specific
  cases like when ELF hdr's e_phnum is equal to PN_XNUM are not handled yet
  (iterate_phdr.c).
  - Missing detection of when there is a change in the memory mappings of a
  task.  E.g., dlopen/dlclose are two of the ways in which a user program's
  mappings may change over time.
  - Code stubs around user space memory access by the kernel.  For the sake of
  clarity, let me outline here the three locations where user space memory is
  accessed in the context of SFrame based unwinding:
    1. Access the ELF header in iterate_phdr(), followed by accessing the ELF
       PHDRs in add_sframe_unwind_info(). This is currently using
       get_user_pages_remote() in iterate_phdr().
    2. Access the .sframe section for decoding in sframe_unw_info_init_dctx().
       This is currently done by using get_user_pages_unlocked().
    3. Access the program's execution stack in sframe_unwind_next_frame() to
       read, say, the caller's IP on x86_64.  This is currently done by using
       __get_user().
  - Other stubs marked with FIXME or TODO.
  - The patches may not be bisectable.  I haven't particularly tried to
    compile them individually either.
  - More testing, including checking out some regression tests.
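
To make item 3 of the memory access list above concrete, here is a
simplified user space sketch of a single unwind step.  It is not the code in
sframe_unwind.c: struct frame_row is a hypothetical, flattened stand-in for
a decoded SFrame frame row entry, and read_word() over a fake stack stands
in for the __get_user() access.  The only ABI fact it relies on is that on
x86_64 the return address is stored at CFA - 8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one decoded frame row entry.  */
struct frame_row
{
  bool cfa_base_is_fp;  /* CFA is computed from FP rather than SP.  */
  int32_t cfa_offset;   /* CFA = base register + cfa_offset.  */
  int32_t ra_offset;    /* RA is stored at CFA + ra_offset.  */
  bool fp_saved;        /* Whether the caller's FP was saved.  */
  int32_t fp_offset;    /* Saved FP is at CFA + fp_offset.  */
};

struct regs
{
  uint64_t ip, sp, fp;
};

/* Stand-in for user stack memory: WORDS holds the 8-byte slots that start
   at address BASE.  */
struct fake_stack
{
  uint64_t base;
  const uint64_t *words;
  size_t nwords;
};

/* Bounds-checked 8-byte read; plays the role of __get_user().  */
static bool
read_word (const struct fake_stack *st, uint64_t addr, uint64_t *out)
{
  if (addr < st->base || (addr - st->base) % 8 != 0)
    return false;
  size_t idx = (addr - st->base) / 8;
  if (idx >= st->nwords)
    return false;
  *out = st->words[idx];
  return true;
}

/* Step R from the current frame to its caller using ROW.  R is left
   untouched if any stack read fails.  */
static bool
unwind_next_frame (const struct fake_stack *st, const struct frame_row *row,
                   struct regs *r)
{
  uint64_t base = row->cfa_base_is_fp ? r->fp : r->sp;
  uint64_t cfa = base + (uint64_t) (int64_t) row->cfa_offset;
  uint64_t ra, fp = r->fp;

  if (!read_word (st, cfa + (uint64_t) (int64_t) row->ra_offset, &ra))
    return false;
  if (row->fp_saved
      && !read_word (st, cfa + (uint64_t) (int64_t) row->fp_offset, &fp))
    return false;

  r->ip = ra;   /* Caller's IP.  */
  r->sp = cfa;  /* Caller's SP is the CFA.  */
  r->fp = fp;
  return true;
}
```

At a function's entry point, for example, the row would be
{ false, 8, -8, false, 0 }: the CFA is SP + 8 and the return address sits
right at SP.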

Each commit log has further details.

Testing Notes
==============
I have tested these patches minimally using:
  1. perf on kernel master
  2. BPF uprobe on kernel master
  3. dtrace with dtrace-linux-kernel v2/6.1.8
  (https://github.com/oracle/dtrace-linux-kernel/tree/v2/6.1.8).  The
  diff between the v2/6.1.8 branch and Linux 6.1.8 consists of a few patches
  for CTF/DTrace. dtrace is a tracing tool that can be used to diagnose
  problems and probe a running Linux system. For the following experiment, I
  used unchanged dtrace packages. The dtrace command line used is:
      dtrace -c prog -n 'pid$target::func:entry { ustack (); exit(0); }'
  This triggers a ustack() action when the function 'func' in program
  'prog' is entered.  It prints the user stack and then exits.

  The dtrace ustack() action internally invokes perf_callchain_user(). The
  latter is updated in the POC patch set to perform SFrame based stack tracing
  for user space.  DTrace uses BPF under the hood, but testing both DTrace and
  BPF individually has been valuable overall.

All binaries below were compiled with -Wa,--gsframe. A few tests to showcase
the POC are given below.

TEST 1: Toy hello world program with the call chain as follows:
main() -> foo() -> bar() -> baz()

$ cat deep_hello_sframe.c

#include <stdio.h>
#include <stdlib.h>

int baz (int a)
{
    return a * rand () + 100;
}

int bar (int a)
{
    int c = baz (a);
    return c * a * rand ();
}

int foo (int a)
{
    int b = bar (a);
    return b * a * rand ();
}

void main (void)
{
    int a = 100;
    int b = foo (a);
    printf ("Hello world %d \n", b);
}

$ dtrace -c ./deep_hello_sframe -n \
    'pid$target::baz:entry { ustack (); exit(0); }'
DTrace 2.0.0 [Pre-Release with limited functionality]
dtrace: description 'pid$target::baz:entry ' matched 1 probe
...
CPU     ID                    FUNCTION:NAME
  1 114215                        baz:entry
              deep_hello_sframe`baz
              deep_hello_sframe`bar+0x16
              deep_hello_sframe`foo+0x16
              deep_hello_sframe`main+0x19

$ perf probe -x ./deep_hello_sframe --add baz
$ perf record -g -e probe_deep_hello_sframe:baz ./deep_hello_sframe
$ perf script
deep_hello_sfra 25887 [000] 125196.580149: probe_deep_hello_sframe:baz:
(401136)
                    1136 baz+0x0  (/<TESTPATH>/deep_hello_sframe)
                    1165 bar+0x16 (/<TESTPATH>/deep_hello_sframe)
                    1195 foo+0x16 (/<TESTPATH>/deep_hello_sframe)
                    11c8 main+0x19 (/<TESTPATH>/deep_hello_sframe)


$ perf report --call-graph --stdio
   100.00%   100.00%  (401146)
                 |
		 ---main
	            foo
	            bar
	            baz

TEST 2: For the target program target.c, get a stack trace using the BPF
bpf_get_stack helper in bpf-uprobe.c.  I am skipping the BPF program for
brevity.

$ cat target.c
#include <stdio.h>
#include <unistd.h>

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>        /* open */

int fd;

int foo9(int x) {
  write(fd, &x, sizeof(x));
  return x  ^ 1;
}
int foo8(int x) { return foo9(x) ^ 1; }
int foo7(int x) { return foo8(x) ^ 1; }
int foo6(int x) { return foo7(x) ^ 1; }
int foo5(int x) { return foo6(x) ^ 1; }
int foo4(int x) { return foo5(x) ^ 1; }
int foo3(int x) { return foo4(x) ^ 1; }
int foo2(int x) { return foo3(x) ^ 1; }
int foo1(int x) { return foo2(x) ^ 1; }
int foo0(int x) { return foo1(x) ^ 1; }

int main(int c, char **v) {
  int x = 0;

  fd = open("/dev/null", O_WRONLY);
  if (fd == -1) {
    printf("open failed\n");
    return 1;
  }

  while ((x = foo0(x)) < 10) ;

  close(fd);
  return 0;
}

$ gcc -Wa,--gsframe -o target.sframe target.c
$ #offset=getoffset_of_foo9_in_target - baseloadaddress_in_target
$ echo "p:ibhagat/myuprobe $path_to_target:$offset" >> /sys/kernel/debug/tracing/uprobe_events
$ ./target.sframe &
$ #target_pid=`pgrep target.sframe`
$ #event_id=`sudo cat /sys/kernel/debug/tracing/events/ibhagat/myuprobe/id`
$ gcc -DTARGET_PID=$target_pid -DEVENT_ID=$event_id -o bpf-ustack bpf-ustack.c

$ sudo ./bpf-ustack # dumps IPs of callchain
  401156 401197 4011b1 4011cb 4011e5 4011ff 401219 401233 40124d 401267 4012c3

$ grep -A 2 'call' target.sframe.s | grep -A 1 'foo' 
  401192:	e8 bf ff ff ff       	callq  401156 <foo9>
  401197:	83 f0 01             	xor    $0x1,%eax
--
  4011ac:	e8 d1 ff ff ff       	callq  401182 <foo8>
  4011b1:	83 f0 01             	xor    $0x1,%eax
--
  4011c6:	e8 d1 ff ff ff       	callq  40119c <foo7>
  4011cb:	83 f0 01             	xor    $0x1,%eax
--
  4011e0:	e8 d1 ff ff ff       	callq  4011b6 <foo6>
  4011e5:	83 f0 01             	xor    $0x1,%eax
--
  4011fa:	e8 d1 ff ff ff       	callq  4011d0 <foo5>
  4011ff:	83 f0 01             	xor    $0x1,%eax
--
  401214:	e8 d1 ff ff ff       	callq  4011ea <foo4>
  401219:	83 f0 01             	xor    $0x1,%eax
--
  40122e:	e8 d1 ff ff ff       	callq  401204 <foo3>
  401233:	83 f0 01             	xor    $0x1,%eax
--
  401248:	e8 d1 ff ff ff       	callq  40121e <foo2>
  40124d:	83 f0 01             	xor    $0x1,%eax
--
  401262:	e8 d1 ff ff ff       	callq  401238 <foo1>
  401267:	83 f0 01             	xor    $0x1,%eax
--
  4012be:	e8 8f ff ff ff       	callq  401252 <foo0>
  4012c3:	89 45 fc             	mov    %eax,-0x4(%rbp)

$ perf probe -x ./target.sframe --add foo9
$ perf record -g -e probe_target:foo9 ./target.sframe
^C
$ perf script
...
target.sframe 20395 [000] 69987.711764: probe_target:foo9: (401156)
                    1156 foo9+0x0 (<TESTPATH>/target.sframe)
                    1197 foo8+0x15 (<TESTPATH>/target.sframe)
                    11b1 foo7+0x15 (<TESTPATH>/target.sframe)
                    11cb foo6+0x15 (<TESTPATH>/target.sframe)
                    11e5 foo5+0x15 (<TESTPATH>/target.sframe)
                    11ff foo4+0x15 (<TESTPATH>/target.sframe)
                    1219 foo3+0x15 (<TESTPATH>/target.sframe)
                    1233 foo2+0x15 (<TESTPATH>/target.sframe)
                    124d foo1+0x15 (<TESTPATH>/target.sframe)
                    1267 foo0+0x15 (<TESTPATH>/target.sframe)
                    12c3 main+0x57 (<TESTPATH>/target.sframe)
...

Please take a look. Any feedback is appreciated.

Thanks,

Indu Bhagat (5):
  Kconfig: x86: Add new config options for userspace unwinder
  task_struct : add additional member for sframe state
  sframe: add new SFrame library
  sframe: add an SFrame format stack tracer
  x86_64: invoke SFrame based stack tracer for user space

 arch/arm64/include/asm/sframe_regs.h |  37 ++
 arch/x86/Kconfig.debug               |  31 ++
 arch/x86/events/core.c               |  51 +++
 arch/x86/include/asm/sframe_regs.h   |  34 ++
 fs/binfmt_elf.c                      |  39 +++
 include/linux/sched.h                |   5 +
 include/sframe/sframe_regs.h         |  11 +
 include/sframe/sframe_unwind.h       |  62 ++++
 kernel/exit.c                        |   9 +
 lib/Makefile                         |   1 +
 lib/sframe/Makefile                  |  11 +
 lib/sframe/iterate_phdr.c            | 113 ++++++
 lib/sframe/iterate_phdr.h            |  34 ++
 lib/sframe/sframe.h                  | 263 ++++++++++++++
 lib/sframe/sframe_read.c             | 498 +++++++++++++++++++++++++++
 lib/sframe/sframe_read.h             |  75 ++++
 lib/sframe/sframe_state.c            | 424 +++++++++++++++++++++++
 lib/sframe/sframe_state.h            |  80 +++++
 lib/sframe/sframe_unwind.c           | 208 +++++++++++
 19 files changed, 1986 insertions(+)
 create mode 100644 arch/arm64/include/asm/sframe_regs.h
 create mode 100644 arch/x86/include/asm/sframe_regs.h
 create mode 100644 include/sframe/sframe_regs.h
 create mode 100644 include/sframe/sframe_unwind.h
 create mode 100644 lib/sframe/Makefile
 create mode 100644 lib/sframe/iterate_phdr.c
 create mode 100644 lib/sframe/iterate_phdr.h
 create mode 100644 lib/sframe/sframe.h
 create mode 100644 lib/sframe/sframe_read.c
 create mode 100644 lib/sframe/sframe_read.h
 create mode 100644 lib/sframe/sframe_state.c
 create mode 100644 lib/sframe/sframe_state.h
 create mode 100644 lib/sframe/sframe_unwind.c

-- 
2.39.2

