linux-toolchains.vger.kernel.org archive mirror
* [POC,V2 0/5] SFrame based stack tracer for user space in the kernel
@ 2023-05-26  5:32 Indu Bhagat
  2023-05-26  5:32 ` [POC,V2 1/5] Kconfig: x86: Add new config options for userspace unwinder Indu Bhagat
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Indu Bhagat @ 2023-05-26  5:32 UTC (permalink / raw)
  To: linux-toolchains, rostedt, peterz; +Cc: Indu Bhagat

Hello,

I have addressed the code review comments and hopefully made the SFrame
reader code more maintainable.

Link to V1 posting: https://lore.kernel.org/linux-toolchains/20230501181515.098acdce@gandalf.local.home/T/#t

The POC still suffers from the same issue: accessing SFrame data is not NMI
safe.  This is expected to be addressed via the changes to perf workings
proposed by Steve.

Once it is clearer what sort of callback APIs the kernel requires to gather
user space stack traces using the SFrame stack tracer, I am happy to work on
that and make the necessary changes to the SFrame stack tracer code in
lib/sframe/sframe_unwind.c, lib/sframe/sframe_state.c, etc.

As some of the code in lib/sframe/iterate_phdr.[ch] will get dropped or
reworked, I have not spent time improving those files for now.

Thanks,

Indu Bhagat (5):
  Kconfig: x86: Add new config options for userspace unwinder
  task_struct : add additional member for sframe state
  sframe: add new SFrame library
  sframe: add an SFrame format stack tracer
  x86_64: invoke SFrame based stack tracer for user space

 arch/x86/Kconfig.debug        |  31 ++
 arch/x86/events/core.c        |  51 ++++
 fs/binfmt_elf.c               |  37 +++
 include/linux/sched.h         |   5 +
 include/linux/sframe_unwind.h |  72 +++++
 kernel/exit.c                 |   9 +
 lib/Makefile                  |   1 +
 lib/sframe/Makefile           |  11 +
 lib/sframe/iterate_phdr.c     | 115 +++++++
 lib/sframe/iterate_phdr.h     |  39 +++
 lib/sframe/sframe.h           | 264 ++++++++++++++++
 lib/sframe/sframe_read.c      | 549 ++++++++++++++++++++++++++++++++++
 lib/sframe/sframe_read.h      |  80 +++++
 lib/sframe/sframe_state.c     | 447 +++++++++++++++++++++++++++
 lib/sframe/sframe_state.h     |  84 ++++++
 lib/sframe/sframe_unwind.c    | 214 +++++++++++++
 16 files changed, 2009 insertions(+)
 create mode 100644 include/linux/sframe_unwind.h
 create mode 100644 lib/sframe/Makefile
 create mode 100644 lib/sframe/iterate_phdr.c
 create mode 100644 lib/sframe/iterate_phdr.h
 create mode 100644 lib/sframe/sframe.h
 create mode 100644 lib/sframe/sframe_read.c
 create mode 100644 lib/sframe/sframe_read.h
 create mode 100644 lib/sframe/sframe_state.c
 create mode 100644 lib/sframe/sframe_state.h
 create mode 100644 lib/sframe/sframe_unwind.c

-- 
2.39.2



* [POC,V2 1/5] Kconfig: x86: Add new config options for userspace unwinder
  2023-05-26  5:32 [POC,V2 0/5] SFrame based stack tracer for user space in the kernel Indu Bhagat
@ 2023-05-26  5:32 ` Indu Bhagat
  2023-05-26  5:32 ` [POC,V2 2/5] task_struct : add additional member for sframe state Indu Bhagat
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Indu Bhagat @ 2023-05-26  5:32 UTC (permalink / raw)
  To: linux-toolchains, rostedt, peterz; +Cc: Indu Bhagat

[Changes in V2]
  - No changes.
[End of changes in V2]

Add two config options for userspace unwinding:
  - config USER_UNWINDER_SFRAME to enable the SFrame userspace unwinder,
  - config USER_UNWINDER_FRAME_POINTER to enable the Frame Pointer
    userspace unwinder.

As users may still compile their applications without SFrame sections,
the userspace stack tracer falls back on the frame-pointer based
approach (if SFrame section is not present for the user program).  If an
SFrame section is absent for a subset of the running program (e.g. a
DSO), a best-effort SFrame section based stack trace is returned.
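
A minimal sketch of the fallback described above, in userspace C.  All
names here (demo_map, demo_pick_unwinder, demo_sframe_depth) are
hypothetical and not part of this series, and the reading of "best
effort" as "stop at the first frame whose mapping has no SFrame section"
is an assumption made for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; none of them come from this series. */
enum demo_unwind_method {
	DEMO_UNWIND_SFRAME,
	DEMO_UNWIND_FRAME_POINTER,
};

/* One mapped object (program or DSO) and whether it has a .sframe section. */
struct demo_map {
	unsigned long start;	/* first address of the text mapping */
	unsigned long end;	/* one past the last address */
	bool has_sframe;
};

/* Use SFrame if the main program has a section, else frame pointers. */
static enum demo_unwind_method demo_pick_unwinder(const struct demo_map *prog)
{
	if (prog && prog->has_sframe)
		return DEMO_UNWIND_SFRAME;
	return DEMO_UNWIND_FRAME_POINTER;
}

/*
 * Assumed meaning of "best effort": count the leading PCs that fall in a
 * mapping with SFrame data, and stop at the first one that does not.
 */
static size_t demo_sframe_depth(const struct demo_map *maps, size_t nmaps,
				const unsigned long *pcs, size_t npcs)
{
	size_t depth, i;

	for (depth = 0; depth < npcs; depth++) {
		bool covered = false;

		for (i = 0; i < nmaps; i++) {
			if (pcs[depth] >= maps[i].start &&
			    pcs[depth] < maps[i].end && maps[i].has_sframe) {
				covered = true;
				break;
			}
		}
		if (!covered)
			break;
	}
	return depth;
}
```

With a program mapping that has SFrame data and a DSO that does not,
demo_sframe_depth() returns the number of frames reached before the
trace first enters the DSO.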

TODO - maybe rename the existing CONFIG_UNWINDER_FRAME_POINTER etc. for
the kernel space unwinder to CONFIG_KERNEL_UNWINDER_FRAME_POINTER? WDYT?

Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
---
 arch/x86/Kconfig.debug | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index c5d614d28a75..e7f928dc8d9f 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -275,3 +275,34 @@ endchoice
 config FRAME_POINTER
 	depends on !UNWINDER_ORC && !UNWINDER_GUESS
 	bool
+
+choice
+	prompt "Choose userspace unwinder"
+	default USER_UNWINDER_SFRAME if X86_64
+	default USER_UNWINDER_FRAME_POINTER if !X86_64
+	help
+	  This determines which method will be used for unwinding user stack
+	  traces.
+
+config USER_UNWINDER_SFRAME
+	bool "SFrame userspace unwinder"
+	depends on X86_64
+	help
+	  This option enables the SFrame unwinder for unwinding user stack
+	  traces.
+
+	  User programs must be built with SFrame support. If not, no SFrame
+	  section will be present in the user program binary; in such a case,
+	  the userspace unwinder defaults to frame pointer unwinding.
+
+
+config USER_UNWINDER_FRAME_POINTER
+	bool "Frame pointer userspace unwinder"
+	help
+	  This option enables the frame pointer unwinder for unwinding user
+	  stack traces.
+
+	  On some architectures, user programs must be built with
+	  -fno-omit-frame-pointer to ensure useful stack traces.
+
+endchoice
-- 
2.39.2



* [POC,V2 2/5] task_struct : add additional member for sframe state
  2023-05-26  5:32 [POC,V2 0/5] SFrame based stack tracer for user space in the kernel Indu Bhagat
  2023-05-26  5:32 ` [POC,V2 1/5] Kconfig: x86: Add new config options for userspace unwinder Indu Bhagat
@ 2023-05-26  5:32 ` Indu Bhagat
  2023-05-26  5:32 ` [POC,V2 3/5] sframe: add new SFrame library Indu Bhagat
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Indu Bhagat @ 2023-05-26  5:32 UTC (permalink / raw)
  To: linux-toolchains, rostedt, peterz; +Cc: Indu Bhagat

[Changes in V2]
  - No changes yet.
  - ATM, it is understood that this POC is broken because accessing
    SFrame sections may fault while the perf event is being handled in
    the NMI context. The changes in this patch will likely be reworked.
[End of Changes in V2]

Add a new member to keep track of the SFrame sections for the current
task (program and its DSOs).

The definition of struct sframe_state is owned by the SFrame unwinder,
and added in a later commit.  Regarding the state management of the
task_struct.sframe_state:
  - Allocation and initialization are done at task initialization
    time in the kernel.
  - Update: Not clear. We need to be able to track dlopen/dlclose, or
    additional shared libraries loaded via the dynamic linker at the
    task execution time.

Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
---
 include/linux/sched.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eed5d65b8d1f..fc0b0c720979 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -71,6 +71,7 @@ struct signal_struct;
 struct task_delay_info;
 struct task_group;
 struct user_event_mm;
+struct sframe_state;
 
 /*
  * Task state bitmask. NOTE! These bits are also
@@ -1534,6 +1535,10 @@ struct task_struct {
 	struct user_event_mm		*user_event_mm;
 #endif
 
+#ifdef CONFIG_USER_UNWINDER_SFRAME
+	struct sframe_state		*sframe_state;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
-- 
2.39.2



* [POC,V2 3/5] sframe: add new SFrame library
  2023-05-26  5:32 [POC,V2 0/5] SFrame based stack tracer for user space in the kernel Indu Bhagat
  2023-05-26  5:32 ` [POC,V2 1/5] Kconfig: x86: Add new config options for userspace unwinder Indu Bhagat
  2023-05-26  5:32 ` [POC,V2 2/5] task_struct : add additional member for sframe state Indu Bhagat
@ 2023-05-26  5:32 ` Indu Bhagat
  2023-05-26  5:32 ` [POC,V2 4/5] sframe: add an SFrame format stack tracer Indu Bhagat
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Indu Bhagat @ 2023-05-26  5:32 UTC (permalink / raw)
  To: linux-toolchains, rostedt, peterz; +Cc: Indu Bhagat

[Changes in V2]
  - Attempt to use better function names.
  - Use uint8_t, uint32_t, const char * consistently, where appropriate.
  - Revisit sframe_find_fre API.
  - Use kernel coding style - various fixes to code and comments.
  - Use consistent documentation style for all structures.
[End of Changes in V2]

This patch adds an implementation to read SFrame stack trace data from
a .sframe section.  Some APIs are also provided to find stack tracing
information per PC, e.g., given a PC, find the SFrame FRE.
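
To illustrate the per-PC lookup for a regular (SFRAME_FDE_TYPE_PCINC)
function, here is a simplified userspace sketch.  It assumes the rows
have already been decoded into a flat array, which the reader in this
patch does not do (it walks the variable-length encoding in place);
demo_fre and demo_find_fre are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified, already-decoded row: only the start offset matters here. */
struct demo_fre {
	uint32_t start_ip_offset;	/* offset from the function start */
};

/*
 * Return the index of the row covering pc, or -1 if pc lies outside
 * [func_start, func_start + func_size).  Rows must be sorted by
 * start_ip_offset.
 */
static int demo_find_fre(const struct demo_fre *rows, size_t nrows,
			 uint32_t func_start, uint32_t func_size, uint32_t pc)
{
	uint32_t off;
	size_t i;

	if (!rows || nrows == 0 || pc < func_start ||
	    pc - func_start >= func_size)
		return -1;

	off = pc - func_start;
	for (i = 0; i < nrows; i++) {
		/*
		 * A row ends one byte before the next row starts; the last
		 * row ends at the last byte of the function.
		 */
		uint32_t end = (i + 1 < nrows) ?
			rows[i + 1].start_ip_offset - 1 : func_size - 1;

		if (rows[i].start_ip_offset <= off && off <= end)
			return (int)i;
	}
	return -1;
}
```

The end of each row is derived from its successor because the format
stores only start offsets, which is the same deduction
sframe_fre_get_end_ip_offset () makes in this patch.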

These routines are provided in the sframe_read.h and sframe_read.c.

This implementation is malloc-free.
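
One reason a malloc-free reader is feasible is that an FRE carries a
small, bounded amount of data.  The sketch below decodes the fre_info
byte following the bit layout documented in sframe.h in this patch
(bit 0: CFA base register, bits 1-4: offset count, bits 5-6: offset
size code, bit 7: mangled RA).  The demo_* names are hypothetical, and
the 12-byte worst case (3 offsets of 4 bytes each) is inferred from the
header comments rather than taken from sframe_read.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Decoded view of the fre_info byte; hypothetical helper type. */
struct demo_fre_info {
	uint8_t base_reg;	/* 0 = FP, 1 = SP */
	uint8_t offset_cnt;	/* number of trailing stack offsets */
	uint8_t offset_bytes;	/* 1, 2 or 4; 0 if the size code is invalid */
	uint8_t mangled_ra;	/* aarch64 only */
};

static struct demo_fre_info demo_decode_fre_info(uint8_t info)
{
	struct demo_fre_info d;
	uint8_t size_code = (info >> 5) & 0x3;

	d.base_reg = info & 0x1;
	d.offset_cnt = (info >> 1) & 0xf;
	/* Size codes 0, 1, 2 mean 1, 2, 4 bytes; code 3 is not defined. */
	d.offset_bytes = (size_code <= 2) ? (uint8_t)(1u << size_code) : 0;
	d.mangled_ra = (info >> 7) & 0x1;
	return d;
}

/* Total size in bytes of the stack offsets that trail the FRE. */
static size_t demo_offsets_nbytes(uint8_t info)
{
	struct demo_fre_info d = demo_decode_fre_info(info);

	return (size_t)d.offset_cnt * d.offset_bytes;
}
```

For a valid FRE the result never exceeds 12 bytes under the stated
assumption, so a fixed-size array in the caller's struct can hold the
offsets without any dynamic allocation.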

The usage of sframe_fre_copy () remains in sframe_sec_find_fre ().  As
pointed out in V1 reviews, this call is not entirely necessary; I have
one more idea around optimizing this function, after which this call
will not be necessary in the current form.

Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
---
 lib/Makefile             |   1 +
 lib/sframe/Makefile      |   5 +
 lib/sframe/sframe.h      | 264 +++++++++++++++++++
 lib/sframe/sframe_read.c | 549 +++++++++++++++++++++++++++++++++++++++
 lib/sframe/sframe_read.h |  80 ++++++
 5 files changed, 899 insertions(+)
 create mode 100644 lib/sframe/Makefile
 create mode 100644 lib/sframe/sframe.h
 create mode 100644 lib/sframe/sframe_read.c
 create mode 100644 lib/sframe/sframe_read.h

diff --git a/lib/Makefile b/lib/Makefile
index 876fcdeae34e..cb02d16dbffd 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -198,6 +198,7 @@ obj-$(CONFIG_ZSTD_COMPRESS) += zstd/
 obj-$(CONFIG_ZSTD_DECOMPRESS) += zstd/
 obj-$(CONFIG_XZ_DEC) += xz/
 obj-$(CONFIG_RAID6_PQ) += raid6/
+obj-$(CONFIG_USER_UNWINDER_SFRAME) += sframe/
 
 lib-$(CONFIG_DECOMPRESS_GZIP) += decompress_inflate.o
 lib-$(CONFIG_DECOMPRESS_BZIP2) += decompress_bunzip2.o
diff --git a/lib/sframe/Makefile b/lib/sframe/Makefile
new file mode 100644
index 000000000000..14d6cfd42a1d
--- /dev/null
+++ b/lib/sframe/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+##################################
+obj-$(CONFIG_USER_UNWINDER_SFRAME) += sframe_read.o
+
+CFLAGS_sframe_read.o += -I $(srctree)/lib/sframe/
diff --git a/lib/sframe/sframe.h b/lib/sframe/sframe.h
new file mode 100644
index 000000000000..340ec80ffa87
--- /dev/null
+++ b/lib/sframe/sframe.h
@@ -0,0 +1,264 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#ifndef SFRAME_H
+#define SFRAME_H
+
+#include <linux/types.h>
+
+/*
+ * This file contains definitions for the SFrame stack tracing format, which is
+ * documented at https://sourceware.org/binutils/docs
+ */
+
+#define SFRAME_VERSION_1	1
+#define SFRAME_MAGIC		0xdee2
+#define SFRAME_VERSION		SFRAME_VERSION_1
+
+/* Function Descriptor Entries are sorted on PC. */
+#define SFRAME_F_FDE_SORTED	0x1
+/* Frame-pointer based stack tracing. Defined, but not set. */
+#define SFRAME_F_FRAME_POINTER	0x2
+
+#define SFRAME_CFA_FIXED_FP_INVALID 0
+#define SFRAME_CFA_FIXED_RA_INVALID 0
+
+/* Supported ABIs/Arch. */
+#define SFRAME_ABI_AARCH64_ENDIAN_BIG	    1 /* AARCH64 big endian. */
+#define SFRAME_ABI_AARCH64_ENDIAN_LITTLE    2 /* AARCH64 little endian. */
+#define SFRAME_ABI_AMD64_ENDIAN_LITTLE	    3 /* AMD64 little endian. */
+
+/* SFrame FRE types. */
+#define SFRAME_FRE_TYPE_ADDR1	0
+#define SFRAME_FRE_TYPE_ADDR2	1
+#define SFRAME_FRE_TYPE_ADDR4	2
+
+/*
+ * SFrame Function Descriptor Entry types.
+ *
+ * The SFrame format has two possible representations for functions. The
+ * choice of which type to use is made according to the instruction patterns
+ * in the relevant program stub.
+ */
+
+/* Unwinders perform a (PC >= FRE_START_ADDR) to look up a matching FRE. */
+#define SFRAME_FDE_TYPE_PCINC	0
+/*
+ * Unwinders perform a (PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK)
+ * to look up a matching FRE. Typical usecases are pltN entries, trampolines
+ * etc.
+ */
+#define SFRAME_FDE_TYPE_PCMASK	1
+
+/**
+ * struct sframe_preamble - SFrame Preamble.
+ * @magic: Magic number (SFRAME_MAGIC).
+ * @version: Format version number (SFRAME_VERSION).
+ * @flags: Various flags.
+ */
+struct sframe_preamble {
+	uint16_t magic;
+	uint8_t version;
+	uint8_t flags;
+} __packed;
+
+/**
+ * struct sframe_header - SFrame Header.
+ * @preamble: SFrame preamble.
+ * @abi_arch: Identify the arch (including endianness) and ABI.
+ * @cfa_fixed_fp_offset: Offset for the Frame Pointer (FP) from CFA may be
+ *	  fixed for some ABIs (e.g., in AMD64 when -fno-omit-frame-pointer is
+ *	  used). When fixed, this field specifies the fixed stack frame offset
+ *	  and the individual FREs do not need to track it. When not fixed, it
+ *	  is set to SFRAME_CFA_FIXED_FP_INVALID, and the individual FREs may
+ *	  provide the applicable stack frame offset, if any.
+ * @cfa_fixed_ra_offset: Offset for the Return Address from CFA is fixed for
+ *	  some ABIs. When fixed, this field specifies the fixed stack frame
+ *	  offset and the individual FREs do not need to track it. When not
+ *	  fixed, it is set to SFRAME_CFA_FIXED_RA_INVALID.
+ * @auxhdr_len: Number of bytes making up the auxiliary header, if any.
+ *	  Some ABI/arch, in the future, may use this space for extending the
+ *	  information in SFrame header. Auxiliary header is contained in bytes
+ *	  sequentially following the sframe_header.
+ * @num_fdes: Number of SFrame FDEs in this SFrame section.
+ * @num_fres: Number of SFrame Frame Row Entries.
+ * @fre_len:  Number of bytes in the SFrame Frame Row Entry section.
+ * @fdes_off: Offset of SFrame Function Descriptor Entry section.
+ * @fres_off: Offset of SFrame Frame Row Entry section.
+ */
+struct sframe_header {
+	struct sframe_preamble preamble;
+	uint8_t abi_arch;
+	int8_t cfa_fixed_fp_offset;
+	int8_t cfa_fixed_ra_offset;
+	uint8_t auxhdr_len;
+	uint32_t num_fdes;
+	uint32_t num_fres;
+	uint32_t fre_len;
+	uint32_t fdes_off;
+	uint32_t fres_off;
+} __packed;
+
+#define SFRAME_V1_HDR_SIZE(sframe_hdr)	\
+	((sizeof(struct sframe_header) + (sframe_hdr).auxhdr_len))
+
+/* Two possible keys for executable (instruction) pointers signing. */
+#define SFRAME_AARCH64_PAUTH_KEY_A    0 /* Key A. */
+#define SFRAME_AARCH64_PAUTH_KEY_B    1 /* Key B. */
+
+/**
+ * struct sframe_func_desc_entry - SFrame Function Descriptor Entry.
+ * @func_start_addr: Function start address. Encoded as a signed offset,
+ *	  relative to the current FDE.
+ * @func_size: Size of the function in bytes.
+ * @func_fres_off: Offset of the first SFrame Frame Row Entry of the function,
+ *	  relative to the beginning of the SFrame Frame Row Entry sub-section.
+ * @func_fres_num: Number of frame row entries for the function.
+ * @func_info: Additional information for deciphering the stack trace
+ *	  information for the function. Contains information about SFrame FRE
+ *	  type, SFrame FDE type, PAC authorization A/B key, etc.
+ */
+struct sframe_func_desc_entry {
+	int32_t func_start_addr;
+	uint32_t func_size;
+	uint32_t func_fres_off;
+	uint32_t func_fres_num;
+	uint8_t func_info;
+} __packed;
+
+/*
+ * 'func_info' in SFrame FDE contains additional information for deciphering
+ * the stack trace information for the function. In V1, the information is
+ * organized as follows:
+ *   - 4-bits: Identify the FRE type used for the function.
+ *   - 1-bit: Identify the FDE type of the function - mask or inc.
+ *   - 1-bit: PAC authorization A/B key (aarch64).
+ *   - 2-bits: Unused.
+ * ---------------------------------------------------------------------
+ * |  Unused  |  PAC auth A/B key (aarch64) |  FDE type |   FRE type   |
+ * |          |        Unused (amd64)       |           |              |
+ * ---------------------------------------------------------------------
+ * 8          6                             5           4              0
+ */
+
+/* Note: Set PAC auth key to SFRAME_AARCH64_PAUTH_KEY_A by default.  */
+#define SFRAME_V1_FUNC_INFO(fde_type, fre_enc_type) \
+	(((SFRAME_AARCH64_PAUTH_KEY_A & 0x1) << 5) | \
+	 (((fde_type) & 0x1) << 4) | ((fre_enc_type) & 0xf))
+
+#define SFRAME_V1_FUNC_FRE_TYPE(data)	  ((data) & 0xf)
+#define SFRAME_V1_FUNC_FDE_TYPE(data)	  (((data) >> 4) & 0x1)
+#define SFRAME_V1_FUNC_PAUTH_KEY(data)	  (((data) >> 5) & 0x1)
+
+/*
+ * Size of stack frame offsets in an SFrame Frame Row Entry. A single
+ * SFrame FRE has all offsets of the same size. Offset size may vary
+ * across frame row entries.
+ */
+#define SFRAME_FRE_OFFSET_1B	  0
+#define SFRAME_FRE_OFFSET_2B	  1
+#define SFRAME_FRE_OFFSET_4B	  2
+
+/* An SFrame Frame Row Entry can be SP or FP based.  */
+#define SFRAME_BASE_REG_FP	0
+#define SFRAME_BASE_REG_SP	1
+
+/*
+ * The index at which a specific offset is presented in the variable length
+ * bytes of an FRE.
+ */
+#define SFRAME_FRE_CFA_OFFSET_IDX   0
+/*
+ * The RA stack offset, if present, will always be at index 1 in the variable
+ * length bytes of the FRE.
+ */
+#define SFRAME_FRE_RA_OFFSET_IDX    1
+/*
+ * The FP stack offset may appear at index 1 or 2, depending on the ABI as RA
+ * may or may not be tracked.
+ */
+#define SFRAME_FRE_FP_OFFSET_IDX    2
+
+/**
+ * struct sframe_fre_info - SFrame FRE Info.
+ * @fre_info: Bitmap to store information to decipher the SFrame FREs, e.g.,
+ *	  number and size of stack offsets, whether the RA is mangled, etc.
+ */
+struct sframe_fre_info {
+	uint8_t fre_info;
+};
+
+/*
+ * 'fre_info' in SFrame FRE contains information about:
+ *   - 1 bit: base reg for CFA
+ *   - 4 bits: Number of offsets (N). A value of up to 3 is allowed to track
+ *   all three of CFA, FP and RA (fixed implicit order).
+ *   - 2 bits: information about size of the offsets (S) in bytes.
+ *     Valid values are SFRAME_FRE_OFFSET_1B, SFRAME_FRE_OFFSET_2B,
+ *     SFRAME_FRE_OFFSET_4B
+ *   - 1 bit: Mangled RA state bit (aarch64 only).
+ * ---------------------------------------------------------------
+ * | Mangled-RA (aarch64) |  Size of   |   Number of  | base_reg |
+ * |  Unused (amd64)      |  offsets   |    offsets   |          |
+ * ---------------------------------------------------------------
+ * 8                      7             5             1          0
+ */
+
+/* Note: Set mangled_ra_p to zero by default. */
+#define SFRAME_V1_FRE_INFO(base_reg_id, offset_num, offset_size) \
+	(((0 & 0x1) << 7) | (((offset_size) & 0x3) << 5) | \
+	 (((offset_num) & 0xf) << 1) | ((base_reg_id) & 0x1))
+
+/* Set the mangled_ra_p bit as indicated. */
+#define SFRAME_V1_FRE_INFO_UPDATE_MANGLED_RA_P(mangled_ra_p, fre_info) \
+	((((mangled_ra_p) & 0x1) << 7) | ((fre_info) & 0x7f))
+
+#define SFRAME_V1_FRE_CFA_BASE_REG_ID(data)	  ((data) & 0x1)
+#define SFRAME_V1_FRE_OFFSET_COUNT(data)	  (((data) >> 1) & 0xf)
+#define SFRAME_V1_FRE_OFFSET_SIZE(data)		  (((data) >> 5) & 0x3)
+#define SFRAME_V1_FRE_MANGLED_RA_P(data)	  (((data) >> 7) & 0x1)
+
+/* SFrame Frame Row Entry definitions. */
+
+/**
+ * struct sframe_frame_row_entry_addr1 - SFrame Frame Row Entry for 1-byte
+ *	  Address offsets.
+ * @fre_start_ip_off: Start address of the frame row entry. Encoded as a
+ *	  1-byte unsigned offset, relative to the start address of the
+ *	  function.
+ * @fre_info: SFrame FRE Info.
+ */
+struct sframe_frame_row_entry_addr1 {
+	uint8_t fre_start_ip_off;
+	struct sframe_fre_info fre_info;
+} __packed;
+
+/**
+ * struct sframe_frame_row_entry_addr2 - SFrame Frame Row Entry for 2-byte
+ *	  Address offsets.
+ * @fre_start_ip_off: Start address of the frame row entry. Encoded as a
+ *	  2-byte unsigned offset, relative to the start address of the
+ *	  function.
+ * @fre_info: SFrame FRE Info.
+ */
+struct sframe_frame_row_entry_addr2 {
+	uint16_t fre_start_ip_off;
+	struct sframe_fre_info fre_info;
+} __packed;
+
+/**
+ * struct sframe_frame_row_entry_addr4 - SFrame Frame Row Entry for 4-byte
+ *	  Address offsets.
+ * @fre_start_ip_off: Start address of the frame row entry. Encoded as a
+ *	  4-byte unsigned offset, relative to the start address of the
+ *	  function.
+ * @fre_info: SFrame FRE Info.
+ */
+struct sframe_frame_row_entry_addr4 {
+	uint32_t fre_start_ip_off;
+	struct sframe_fre_info fre_info;
+} __packed;
+
+#endif /* SFRAME_H */
diff --git a/lib/sframe/sframe_read.c b/lib/sframe/sframe_read.c
new file mode 100644
index 000000000000..ebb7d38ac6ef
--- /dev/null
+++ b/lib/sframe/sframe_read.c
@@ -0,0 +1,549 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#include <linux/string.h>
+
+#include "sframe_read.h"
+
+/**
+ * struct sframe_sec - SFrame section
+ * @header: SFrame header.
+ * @fdes: SFrame Function Descriptor Entries.
+ * @fres: SFrame Frame Row Entries.
+ * @fre_nbytes: Number of bytes needed for SFrame FREs.
+ */
+struct sframe_sec {
+	struct sframe_header header;
+	const char *fdes;
+	const char *fres;
+	uint32_t fre_nbytes;
+};
+
+static int sframe_set_errno(int *errp, int error)
+{
+	if (errp)
+		*errp = error;
+	return SFRAME_ERR;
+}
+
+static uint32_t sframe_sec_get_hdr_size(struct sframe_header *sfh)
+{
+	return SFRAME_V1_HDR_SIZE(*sfh);
+}
+
+static uint8_t sframe_fre_get_offset_count(uint8_t fre_info)
+{
+	return SFRAME_V1_FRE_OFFSET_COUNT(fre_info);
+}
+
+static uint8_t sframe_fre_get_offset_size(uint8_t fre_info)
+{
+	return SFRAME_V1_FRE_OFFSET_SIZE(fre_info);
+}
+
+static uint32_t sframe_get_fre_type(struct sframe_func_desc_entry *fdep)
+{
+	if (!fdep)
+		return 0;
+
+	return SFRAME_V1_FUNC_FRE_TYPE(fdep->func_info);
+}
+
+static uint32_t sframe_get_fde_type(struct sframe_func_desc_entry *fdep)
+{
+	if (!fdep)
+		return 0;
+
+	return SFRAME_V1_FUNC_FDE_TYPE(fdep->func_info);
+}
+
+static bool sframe_header_sanity_check_p(struct sframe_header *hp)
+{
+	uint8_t all_flags;
+
+	all_flags = SFRAME_F_FDE_SORTED | SFRAME_F_FRAME_POINTER;
+
+	/* Check that the preamble is valid. */
+	if ((hp->preamble.magic != SFRAME_MAGIC) ||
+	    (hp->preamble.version != SFRAME_VERSION) ||
+	    ((hp->preamble.flags | all_flags) != all_flags))
+		return false;
+
+	/* Check that the offsets are valid. */
+	if (hp->fdes_off > hp->fres_off)
+		return false;
+
+	return true;
+}
+
+static bool sframe_fre_sanity_check_p(struct sframe_fre *frep)
+{
+	uint8_t offset_size;
+	uint8_t offset_cnt;
+
+	if (!frep)
+		return false;
+
+	offset_size = sframe_fre_get_offset_size(frep->fre_info);
+
+	if ((offset_size != SFRAME_FRE_OFFSET_1B) &&
+	    (offset_size != SFRAME_FRE_OFFSET_2B) &&
+	    (offset_size != SFRAME_FRE_OFFSET_4B))
+		return false;
+
+	offset_cnt = sframe_fre_get_offset_count(frep->fre_info);
+	if (offset_cnt > MAX_NUM_STACK_OFFSETS)
+		return false;
+
+	return true;
+}
+
+static int32_t sframe_get_fre_offset(struct sframe_fre *frep, uint32_t idx,
+				     int *errp)
+{
+	uint8_t offset_size;
+	uint8_t offset_cnt;
+
+	if (!frep || !sframe_fre_sanity_check_p(frep))
+		return sframe_set_errno(errp, SFRAME_ERR_FRE_INVAL);
+
+	offset_cnt = sframe_fre_get_offset_count(frep->fre_info);
+	offset_size = sframe_fre_get_offset_size(frep->fre_info);
+
+	if (offset_cnt < idx + 1)
+		return sframe_set_errno(errp, SFRAME_ERR_FREOFFSET_NOPRESENT);
+
+	if (errp)
+		*errp = 0; /* Offset Valid. */
+
+	if (offset_size == SFRAME_FRE_OFFSET_1B) {
+		int8_t *stack_offsets = (int8_t *)frep->fre_offsets;
+		return stack_offsets[idx];
+	} else if (offset_size == SFRAME_FRE_OFFSET_2B) {
+		int16_t *stack_offsets = (int16_t *)frep->fre_offsets;
+		return stack_offsets[idx];
+	} else {
+		int32_t *stack_offsets = (int32_t *)frep->fre_offsets;
+		return stack_offsets[idx];
+	}
+}
+
+static struct sframe_header *sframe_sec_get_header(struct sframe_sec *sfsec)
+{
+	if (!sfsec)
+		return NULL;
+
+	return &sfsec->header;
+}
+
+static int sframe_fre_copy(struct sframe_fre *dst,
+			   struct sframe_fre *src)
+{
+	if (!dst || !src)
+		return SFRAME_ERR;
+
+	memcpy(dst, src, sizeof(struct sframe_fre));
+	return 0;
+}
+
+static int sframe_read_start_ip_offset(const char *fre_buf,
+				       uint32_t *start_ip_offset,
+				       uint32_t fre_type)
+{
+	uint32_t saddr = 0;
+	uint32_t *uit;
+	uint16_t *ust;
+	uint8_t *uc;
+
+	if (fre_type == SFRAME_FRE_TYPE_ADDR1) {
+		uc = (uint8_t *)fre_buf;
+		saddr = (uint32_t)*uc;
+	} else if (fre_type == SFRAME_FRE_TYPE_ADDR2) {
+		ust = (uint16_t *)fre_buf;
+		saddr = (uint32_t)*ust;
+	} else if (fre_type == SFRAME_FRE_TYPE_ADDR4) {
+		uit = (uint32_t *)fre_buf;
+		saddr = (uint32_t)*uit;
+	} else {
+		return SFRAME_ERR_INVAL;
+	}
+
+	*start_ip_offset = saddr;
+	return 0;
+}
+
+/* Get the total size in bytes for the stack offsets. */
+static size_t sframe_fre_stack_offsets_sizeof(uint8_t fre_info)
+{
+	uint8_t offset_size;
+	uint8_t offset_cnt;
+
+	offset_size = sframe_fre_get_offset_size(fre_info);
+	offset_cnt = sframe_fre_get_offset_count(fre_info);
+
+	if (offset_size == SFRAME_FRE_OFFSET_2B ||
+	    offset_size == SFRAME_FRE_OFFSET_4B)	/* 2 or 4 bytes. */
+		return (offset_cnt * (offset_size * 2));
+
+	return offset_cnt;
+}
+
+static size_t sframe_fre_ip_offset_sizeof(uint32_t fre_type)
+{
+	/* Type size of the start_addr in an FRE. */
+	size_t saddr_tsize = 0;
+
+	switch (fre_type) {
+	case SFRAME_FRE_TYPE_ADDR1:
+		saddr_tsize = sizeof(uint8_t);
+		break;
+	case SFRAME_FRE_TYPE_ADDR2:
+		saddr_tsize = sizeof(uint16_t);
+		break;
+	case SFRAME_FRE_TYPE_ADDR4:
+		saddr_tsize = sizeof(uint32_t);
+		break;
+	default:
+		/* No other value is expected. */
+		break;
+	}
+	return saddr_tsize;
+}
+
+static size_t sframe_fre_vlen_size(struct sframe_fre *frep,
+				   uint32_t fre_type)
+{
+	size_t ip_offset_size;
+	size_t fre_vlen_size;
+	uint8_t fre_info;
+
+	if (!frep)
+		return 0;
+
+	fre_info = frep->fre_info;
+	ip_offset_size = sframe_fre_ip_offset_sizeof(fre_type);
+
+	/*
+	 * An SFrame FRE is a variable length structure. It includes the start
+	 * IP offset, the FRE info field, and all the trailing stack offsets.
+	 */
+	fre_vlen_size = ip_offset_size + sizeof(fre_info);
+	fre_vlen_size += sframe_fre_stack_offsets_sizeof(fre_info);
+
+	return fre_vlen_size;
+}
+
+/*
+ * Get the end IP offset of the SFrame FRE at the given index.
+ * The end IP offset is not directly encoded in the FRE.
+ */
+static uint32_t
+sframe_fre_get_end_ip_offset(struct sframe_func_desc_entry *fdep,
+			     unsigned int i, const char *fres)
+{
+	uint32_t end_ip_offset;
+	uint32_t fre_type;
+
+	fre_type = sframe_get_fre_type(fdep);
+
+	/*
+	 * Use the start address of the next FRE in sequence. If this is the
+	 * last FRE, end IP offset needs to be deduced from the function size.
+	 */
+	if (i < fdep->func_fres_num - 1) {
+		sframe_read_start_ip_offset(fres, &end_ip_offset, fre_type);
+		end_ip_offset -= 1;
+	} else {
+		end_ip_offset = fdep->func_size - 1;
+	}
+
+	return end_ip_offset;
+}
+
+/*
+ * Read an SFrame FRE which starts at location FRE_BUF. The function updates
+ * FRE_SIZE to the size of the FRE as stored in the binary format.
+ *
+ * Returns SFRAME_ERR if failure.
+ */
+static int sframe_sec_read_fre(const char *fre_buf, struct sframe_fre *frep,
+			       uint32_t fre_type, size_t *fre_size)
+{
+	const char *stack_offsets_buf;
+	size_t stack_offsets_size;
+	size_t ip_offset_size;
+	size_t size;
+
+	if (!fre_buf || !frep || !fre_size)
+		return SFRAME_ERR_INVAL;
+
+	/* Copy over the FRE start address. */
+	sframe_read_start_ip_offset(fre_buf, &frep->start_ip_offset, fre_type);
+
+	ip_offset_size = sframe_fre_ip_offset_sizeof(fre_type);
+	/* PS: Note how this API works closely with SFrame binary format. */
+	frep->fre_info = *(uint8_t *)(fre_buf + ip_offset_size);
+
+	memset(frep->fre_offsets, 0, MAX_STACK_OFFSET_NBYTES);
+	/* Copy over the stack offsets. */
+	stack_offsets_size = sframe_fre_stack_offsets_sizeof(frep->fre_info);
+	stack_offsets_buf = fre_buf + ip_offset_size + sizeof(frep->fre_info);
+	memcpy(frep->fre_offsets, stack_offsets_buf, stack_offsets_size);
+
+	size = sframe_fre_vlen_size(frep, fre_type);
+	*fre_size = size;
+
+	return 0;
+}
+
+static struct sframe_func_desc_entry *
+sframe_sec_find_fde(struct sframe_sec *sfsec, int32_t addr, int *errp)
+{
+	struct sframe_func_desc_entry *fde;
+	struct sframe_header *header;
+	int low;
+	int high;
+	int cnt;
+
+	if (!sfsec) {
+		sframe_set_errno(errp, SFRAME_ERR_INVAL);
+		return NULL;
+	}
+
+	header = sframe_sec_get_header(sfsec);
+	if (!header || (header->num_fdes == 0) || !sfsec->fdes) {
+		sframe_set_errno(errp, SFRAME_ERR_INIT_INVAL);
+		return NULL;
+	}
+	/*
+	 * Skip binary search if FDE sub-section is not sorted on PCs. GNU ld
+	 * sorts the FDEs on start PC by default though.
+	 */
+	if ((header->preamble.flags & SFRAME_F_FDE_SORTED) == 0) {
+		sframe_set_errno(errp, SFRAME_ERR_FDE_NOTSORTED);
+		return NULL;
+	}
+
+	/* Find the FDE that may contain the addr. */
+	fde = (struct sframe_func_desc_entry *)sfsec->fdes;
+	low = 0;
+	high = header->num_fdes;
+	cnt = high;
+	while (low <= high) {
+		int mid = low + (high - low) / 2;
+
+		if (fde[mid].func_start_addr == addr)
+			return fde + mid;
+
+		if (fde[mid].func_start_addr < addr) {
+			if (mid == (cnt - 1))
+				return fde + (cnt - 1);
+			else if (fde[mid+1].func_start_addr > addr)
+				return fde + mid;
+			low = mid + 1;
+		} else
+			high = mid - 1;
+	}
+
+	sframe_set_errno(errp, SFRAME_ERR_FDE_NOTFOUND);
+	return NULL;
+}
+
+static int8_t sframe_sec_get_fixed_fp_offset(struct sframe_sec *sfsec)
+{
+	struct sframe_header *header = sframe_sec_get_header(sfsec);
+
+	return header->cfa_fixed_fp_offset;
+}
+
+static int8_t sframe_sec_get_fixed_ra_offset(struct sframe_sec *sfsec)
+{
+	struct sframe_header *header = sframe_sec_get_header(sfsec);
+
+	return header->cfa_fixed_ra_offset;
+}
+
+size_t sframe_sec_sizeof(void)
+{
+	return sizeof(struct sframe_sec);
+}
+
+int sframe_sec_init(struct sframe_sec *sfsec, const char *sf_buf,
+		    size_t sf_size)
+{
+	const struct sframe_preamble *preamble;
+	struct sframe_header *header;
+	const char *frame_buf;
+
+	if (!sf_buf || (sf_size < sizeof(struct sframe_header)))
+		return SFRAME_ERR_INVAL;
+
+	/* Check for foreign endianness. */
+	preamble = (const struct sframe_preamble *) sf_buf;
+	if (preamble->magic != SFRAME_MAGIC)
+		return SFRAME_ERR_INVAL;
+
+	/* Reset the SFrame section object. */
+	memset(sfsec, 0, sizeof(struct sframe_sec));
+
+	frame_buf = (char *)sf_buf;
+
+	/* Initialize the reference to the SFrame header. */
+	sfsec->header = *(struct sframe_header *) frame_buf;
+	header = &sfsec->header;
+	if (!sframe_header_sanity_check_p(header))
+		return SFRAME_ERR_INVAL;
+
+	/* Initialize the reference to the SFrame FDE section. */
+	frame_buf += sframe_sec_get_hdr_size(header);
+	sfsec->fdes = frame_buf;
+
+	/* Initialize the reference to the SFrame FRE section. */
+	frame_buf += (header->num_fdes * sizeof(struct sframe_func_desc_entry));
+	sfsec->fres = frame_buf;
+
+	sfsec->fre_nbytes = header->fre_len;
+
+	return 0;
+}
+
+/*
+ * Find the SFrame Frame Row Entry which contains the PC.
+ * Returns error code if failure.
+ */
+int sframe_sec_find_fre(struct sframe_sec *sfsec, int32_t pc,
+			struct sframe_fre *frep)
+{
+	struct sframe_func_desc_entry *fdep;
+	struct sframe_fre cur_fre;
+	int32_t func_start_addr;
+	const char *fres_next;
+	uint32_t end_offset;
+	uint32_t fre_type;
+	uint32_t fde_type;
+	const char *fres;
+	int32_t start_ip;
+	size_t size = 0;
+	int32_t end_ip;
+	int err = 0;
+	uint32_t i;
+	/*
+	 * For regular FDEs (i.e. fde_type SFRAME_FDE_TYPE_PCINC),
+	 * where the start address in the FRE is an offset from start pc,
+	 * use a bitmask with all bits set so that none of the address bits are
+	 * ignored. In this case, we need to return the FRE where
+	 * (PC >= FRE_START_ADDR).
+	 */
+	uint64_t bitmask = 0xffffffff;
+
+	if (!sfsec || !frep)
+		return SFRAME_ERR_INVAL;
+
+	/* Find the FDE which contains the PC. */
+	fdep = sframe_sec_find_fde(sfsec, pc, &err);
+	if (!fdep || !sfsec->fres)
+		return SFRAME_ERR_INIT_INVAL;
+
+	fre_type = sframe_get_fre_type(fdep);
+	fde_type = sframe_get_fde_type(fdep);
+
+	/*
+	 * For FDEs for a repetitive pattern of insns, we need to return the FRE
+	 * such that (PC & FRE_START_ADDR_AS_MASK >= FRE_START_ADDR_AS_MASK).
+	 * So, update the bitmask to the start address.
+	 */
+	/* FIXME - the bitmask should be encoded explicitly in the format. */
+	if (fde_type == SFRAME_FDE_TYPE_PCMASK)
+		bitmask = 0xff;
+
+	fres = sfsec->fres + fdep->func_fres_off;
+	func_start_addr = fdep->func_start_addr;
+
+	for (i = 0; i < fdep->func_fres_num; i++) {
+		err = sframe_sec_read_fre(fres, &cur_fre, fre_type, &size);
+		if (err)
+			return sframe_set_errno(&err, SFRAME_ERR_FRE_INVAL);
+
+		start_ip = func_start_addr + cur_fre.start_ip_offset;
+		fres_next = fres + size;
+		end_offset = sframe_fre_get_end_ip_offset(fdep, i, fres_next);
+		end_ip = func_start_addr + end_offset;
+
+		if ((start_ip & bitmask) > (pc & bitmask))
+			return sframe_set_errno(&err, SFRAME_ERR_FRE_INVAL);
+
+		if (((start_ip & bitmask) <= (pc & bitmask)) &&
+		    ((end_ip & bitmask) >= (pc & bitmask))) {
+			sframe_fre_copy(frep, &cur_fre);
+			return 0;
+		}
+		fres += size;
+	}
+	return sframe_set_errno(&err, SFRAME_ERR_FDE_INVAL);
+}
+
+unsigned int sframe_fre_get_base_reg_id(struct sframe_fre *frep,
+					int *errp)
+{
+	if (!frep)
+		return sframe_set_errno(errp, SFRAME_ERR_FRE_INVAL);
+
+	return SFRAME_V1_FRE_CFA_BASE_REG_ID(frep->fre_info);
+}
+
+int32_t sframe_fre_get_cfa_offset(struct sframe_sec *sfsec __always_unused,
+				  struct sframe_fre *frep, int *errp)
+{
+	return sframe_get_fre_offset(frep, SFRAME_FRE_CFA_OFFSET_IDX, errp);
+}
+
+int32_t sframe_fre_get_fp_offset(struct sframe_sec *sfsec,
+				 struct sframe_fre *frep, int *errp)
+{
+	uint32_t fp_offset_idx;
+	int8_t fp_offset;
+	int8_t ra_offset;
+
+	fp_offset = sframe_sec_get_fixed_fp_offset(sfsec);
+	fp_offset_idx = SFRAME_FRE_FP_OFFSET_IDX;
+	/*
+	 * If the FP offset is not being tracked, return the fixed FP offset
+	 * from the SFrame header.
+	 */
+	if (fp_offset != SFRAME_CFA_FIXED_FP_INVALID) {
+		*errp = 0;
+		return fp_offset;
+	}
+
+	/*
+	 * In some ABIs, the stack offset to recover RA from (relative to the
+	 * CFA) is fixed (like AMD64). In such cases, the stack offset to
+	 * recover FP will appear at the second index.
+	 */
+	ra_offset = sframe_sec_get_fixed_ra_offset(sfsec);
+	if (ra_offset != SFRAME_CFA_FIXED_RA_INVALID)
+		fp_offset_idx = SFRAME_FRE_RA_OFFSET_IDX;
+
+	return sframe_get_fre_offset(frep, fp_offset_idx, errp);
+}
+
+int32_t sframe_fre_get_ra_offset(struct sframe_sec *sfsec,
+				 struct sframe_fre *frep, int *errp)
+{
+	int8_t ra_offset;
+
+	ra_offset = sframe_sec_get_fixed_ra_offset(sfsec);
+	/*
+	 * If the RA offset was not being tracked, return the fixed RA offset
+	 * from the SFrame header.
+	 */
+	if (ra_offset != SFRAME_CFA_FIXED_RA_INVALID) {
+		*errp = 0;
+		return ra_offset;
+	}
+
+	/* Otherwise, get the RA offset from the FRE. */
+	return sframe_get_fre_offset(frep, SFRAME_FRE_RA_OFFSET_IDX, errp);
+}
diff --git a/lib/sframe/sframe_read.h b/lib/sframe/sframe_read.h
new file mode 100644
index 000000000000..30cf6be9cf00
--- /dev/null
+++ b/lib/sframe/sframe_read.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#ifndef SFRAME_READ_H
+#define SFRAME_READ_H
+
+#include <linux/types.h>
+
+#include "sframe.h"
+
+struct sframe_sec;
+
+#define MAX_NUM_STACK_OFFSETS 3
+
+#define MAX_STACK_OFFSET_NBYTES \
+	((SFRAME_FRE_OFFSET_4B * 2 * MAX_NUM_STACK_OFFSETS))
+
+/**
+ * struct sframe_fre - SFrame Frame Row Entry for the SFrame reader.
+ * @start_ip_offset: Start IP offset (from the start addr of the function)
+ * @fre_offsets: Byte array containing the stack offsets.
+ * @fre_info: Other information necessary to interpret the stack offsets.
+ *
+ * Providing such an abstraction helps decouple stack tracer from the
+ * binary format representation of the SFrame FRE. Each member is kept aligned
+ * at its natural boundary.
+ */
+struct sframe_fre {
+	uint32_t start_ip_offset;
+	unsigned char fre_offsets[MAX_STACK_OFFSET_NBYTES];
+	unsigned char fre_info;
+};
+
+#define SFRAME_ERR ((int) -1)
+
+/* SFrame version not supported. */
+#define SFRAME_ERR_VERSION_INVAL	(-2000)
+/* Corrupt SFrame. */
+#define SFRAME_ERR_INVAL		(-2001)
+/* SFrame Section Initialization Error. */
+#define SFRAME_ERR_INIT_INVAL		(-2002)
+/* Corrupt FDE. */
+#define SFRAME_ERR_FDE_INVAL		(-2003)
+/* Corrupt FRE. */
+#define SFRAME_ERR_FRE_INVAL		(-2004)
+/* FDE not found. */
+#define SFRAME_ERR_FDE_NOTFOUND		(-2005)
+/* FDEs not sorted. */
+#define SFRAME_ERR_FDE_NOTSORTED	(-2006)
+/* FRE not found. */
+#define SFRAME_ERR_FRE_NOTFOUND		(-2007)
+/* FRE offset not present. */
+#define SFRAME_ERR_FREOFFSET_NOPRESENT	(-2008)
+
+extern size_t sframe_sec_sizeof(void);
+
+extern int sframe_sec_init(struct sframe_sec *sfsec, const char *sf_buf,
+			   size_t sf_size);
+
+extern int sframe_sec_find_fre(struct sframe_sec *sfsec, int32_t pc,
+			       struct sframe_fre *frep);
+
+extern unsigned int sframe_fre_get_base_reg_id(struct sframe_fre *fre,
+					       int *errp);
+extern int32_t sframe_fre_get_cfa_offset(struct sframe_sec *sfsec,
+					 struct sframe_fre *fre,
+					 int *errp);
+extern int32_t sframe_fre_get_fp_offset(struct sframe_sec *sfsec,
+					struct sframe_fre *fre,
+					int *errp);
+extern int32_t sframe_fre_get_ra_offset(struct sframe_sec *sfsec,
+					struct sframe_fre *fre,
+					int *errp);
+extern bool sframe_fre_get_ra_mangled_p(struct sframe_sec *sfsec,
+					struct sframe_fre *fre,
+					int *errp);
+
+#endif /* SFRAME_READ_H */
-- 
2.39.2


* [POC,V2 4/5] sframe: add an SFrame format stack tracer
  2023-05-26  5:32 [POC,V2 0/5] SFrame based stack tracer for user space in the kernel Indu Bhagat
                   ` (2 preceding siblings ...)
  2023-05-26  5:32 ` [POC,V2 3/5] sframe: add new SFrame library Indu Bhagat
@ 2023-05-26  5:32 ` Indu Bhagat
  2023-05-26  5:32 ` [POC,V2 5/5] x86_64: invoke SFrame based stack tracer for user space Indu Bhagat
  2023-05-26  7:56 ` [POC,V2 0/5] SFrame based stack tracer for user space in the kernel Steven Rostedt
  5 siblings, 0 replies; 7+ messages in thread
From: Indu Bhagat @ 2023-05-26  5:32 UTC (permalink / raw)
  To: linux-toolchains, rostedt, peterz; +Cc: Indu Bhagat

[Changes in V2]
  - Formatting related fixes - kernel style code and comments etc.
  - Use abstractions from ptrace.h. Get rid of arch-specific
    sframe_regs.h.
  - Use const char * consistently.
  - More xmas trees.
  - Move sframe_unwind.h from include/sframe/ to include/linux/.
  - Use consistent documentation style for all structures.
[End of changes in V2]

This patch adds an SFrame format based stack tracer.

The files iterate_phdr.c and iterate_phdr.h implement
dl_iterate_phdr()-like functionality.
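
As a rough userspace illustration of what the per-object callback has to
do with the program headers it is handed, here is a minimal sketch. The
names struct seg_pair and find_segments() are hypothetical and exist only
for this example; the PT_GNU_SFRAME value is the one defined in
sframe_unwind.h in this patch.

```c
#include <elf.h>
#include <stddef.h>

#ifndef PT_GNU_SFRAME
#define PT_GNU_SFRAME 0x6474e554	/* value defined in sframe_unwind.h */
#endif

/* Hypothetical result type, for illustration only. */
struct seg_pair {
	const Elf64_Phdr *sframe;	/* PT_GNU_SFRAME segment, or NULL */
	const Elf64_Phdr *text;		/* executable PT_LOAD segment, or NULL */
};

/* Scan a program header table for the SFrame and executable segments. */
static struct seg_pair find_segments(const Elf64_Phdr *phdr, size_t phnum)
{
	struct seg_pair sp = { NULL, NULL };
	size_t i;

	for (i = 0; i < phnum; i++) {
		if (phdr[i].p_type == PT_GNU_SFRAME)
			sp.sframe = &phdr[i];
		else if (phdr[i].p_type == PT_LOAD && (phdr[i].p_flags & PF_X))
			sp.text = &phdr[i];

		/* Stop early once both segments are known. */
		if (sp.sframe && sp.text)
			break;
	}
	return sp;
}
```

Like the patch, this sketch assumes 64-bit ELF and a single executable
PT_LOAD segment per object.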

The SFrame format based stack tracer is implemented in sframe_unwind.c,
with architecture-specific bits pulled in from
arch/arm64/include/asm/ptrace.h and arch/x86/include/asm/ptrace.h.
Please note that the SFrame format is currently supported only for
x86_64 (AMD64 ABI) and AArch64 (AAPCS64 ABI).
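
The core of one unwind step can be sketched in userspace as follows. This
is a simplified model, not the code in sframe_unwind.c: struct row,
struct regs, struct fake_stack, read_u64() and unwind_step() are all
hypothetical names, the user stack is simulated with an array, and the
sketch always reads both RA and FP from the stack, whereas the patch
skips a read when the corresponding offset is not available.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the data taken from an FRE. */
struct row {
	bool cfa_base_is_sp;	/* CFA is SP-based if true, FP-based otherwise */
	int32_t cfa_offset;	/* CFA = base register + cfa_offset */
	int32_t ra_offset;	/* return address is stored at CFA + ra_offset */
	int32_t fp_offset;	/* caller's FP is stored at CFA + fp_offset */
};

struct regs {
	uint64_t pc, sp, fp;
};

/* Simulated user stack: 'words' 8-byte slots starting at address 'base'. */
struct fake_stack {
	uint64_t base;
	const uint64_t *slots;
	size_t words;
};

/* Stand-in for __get_user(): fails on out-of-range or unaligned access. */
static bool read_u64(const struct fake_stack *st, uint64_t addr, uint64_t *val)
{
	uint64_t off = addr - st->base;

	if (addr < st->base || (off % 8) || off / 8 >= st->words)
		return false;
	*val = st->slots[off / 8];
	return true;
}

/* Step from the current frame to its caller; returns false on a bad read. */
static bool unwind_step(const struct fake_stack *st, const struct row *r,
			struct regs *regs)
{
	uint64_t cfa = (r->cfa_base_is_sp ? regs->sp : regs->fp) + r->cfa_offset;
	uint64_t ra, fp;

	if (!read_u64(st, cfa + r->ra_offset, &ra) ||
	    !read_u64(st, cfa + r->fp_offset, &fp))
		return false;

	regs->pc = ra;		/* caller's PC is the return address */
	regs->sp = cfa;		/* caller's SP is the CFA */
	regs->fp = fp;
	return true;
}
```

For a typical AMD64 frame after "push %rbp; mov %rsp,%rbp", the row would
be FP-based with cfa_offset 16, ra_offset -8 and fp_offset -16.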

The files sframe_state.[ch] implement the SFrame state management APIs.
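
One job of the state code is mapping a PC to the object (program or DSO)
whose SFrame section covers it. A minimal sketch of that lookup, with
hypothetical names (struct obj_info, struct obj_table, find_obj()) and
only the text range kept from the per-object record:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down per-object record. */
struct obj_info {
	uint64_t text_addr;
	uint64_t text_size;
};

struct obj_table {
	struct obj_info prog;		/* main program */
	const struct obj_info *dsos;	/* shared objects */
	size_t ndsos;
};

static int addr_in_text(const struct obj_info *o, uint64_t addr)
{
	/* Inclusive upper bound, as in sframe_unw_info_text_range_p(). */
	return o->text_addr <= addr && o->text_addr + o->text_size >= addr;
}

/* Check the main program first, then each DSO in order. */
static const struct obj_info *find_obj(const struct obj_table *t, uint64_t addr)
{
	size_t i;

	if (addr_in_text(&t->prog, addr))
		return &t->prog;
	for (i = 0; i < t->ndsos; i++)
		if (addr_in_text(&t->dsos[i], addr))
			return &t->dsos[i];
	return NULL;
}
```

A NULL result corresponds to the "no SFrame unwind info for this PC" case,
which the stack tracer treats as the end of the trace.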

Some aspects of the implementation are "POC like". These will need to be
addressed for the implementation to become more palatable:
- dealing with only Elf64_Phdr (no Elf32_Phdr) at this time, and other
  TODOs in iterate_phdr.c,
- detecting whether a program did a dlopen/dlclose,
- code stubs around user space memory access (.sframe section, ELF hdr
  etc.) will need to be adapted to use a more appropriate access method.

There are more such aspects than those listed above. The intention of
this patch set is to help the effort to best incorporate an SFrame-based
user space unwinder in the kernel.

Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
---
 include/linux/sframe_unwind.h |  72 ++++++
 lib/sframe/Makefile           |   8 +-
 lib/sframe/iterate_phdr.c     | 115 +++++++++
 lib/sframe/iterate_phdr.h     |  39 +++
 lib/sframe/sframe_state.c     | 447 ++++++++++++++++++++++++++++++++++
 lib/sframe/sframe_state.h     |  84 +++++++
 lib/sframe/sframe_unwind.c    | 214 ++++++++++++++++
 7 files changed, 978 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/sframe_unwind.h
 create mode 100644 lib/sframe/iterate_phdr.c
 create mode 100644 lib/sframe/iterate_phdr.h
 create mode 100644 lib/sframe/sframe_state.c
 create mode 100644 lib/sframe/sframe_state.h
 create mode 100644 lib/sframe/sframe_unwind.c

diff --git a/include/linux/sframe_unwind.h b/include/linux/sframe_unwind.h
new file mode 100644
index 000000000000..d94f44d18400
--- /dev/null
+++ b/include/linux/sframe_unwind.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#ifndef _SFRAME_UNWIND_H
+#define _SFRAME_UNWIND_H
+
+#include <linux/sched.h>
+#include <linux/perf_event.h>
+
+#define PT_GNU_SFRAME  0x6474e554
+
+/**
+ * struct user_unwind_state - User space stack tracing state.
+ * @pc: Program counter.
+ * @sp: Stack pointer.
+ * @fp: Frame pointer.
+ * @ra: Return address.
+ * @task: Reference of the user space task.
+ * @stype: Stack type.
+ * @error: Error code.
+ */
+struct user_unwind_state {
+	uint64_t pc;
+	uint64_t sp;
+	uint64_t fp;
+	uint64_t ra;
+	struct task_struct *task;
+	enum stack_type stype;
+	bool error;
+};
+
+/*
+ * APIs for an SFrame based stack tracer.
+ */
+
+void sframe_unwind_start(struct user_unwind_state *state,
+			 struct task_struct *task, struct pt_regs *regs);
+bool sframe_unwind_next_frame(struct user_unwind_state *state);
+uint64_t sframe_unwind_get_return_address(struct user_unwind_state *state);
+
+static inline bool sframe_unwind_done(struct user_unwind_state *state)
+{
+	return state->stype == STACK_TYPE_UNKNOWN;
+}
+
+static inline bool sframe_unwind_error(struct user_unwind_state *state)
+{
+	return state->error;
+}
+
+/*
+ * APIs to manage the SFrame state per task for stack tracing.
+ */
+
+extern struct sframe_state *unwind_sframe_state_alloc(struct task_struct *task);
+extern int unwind_sframe_state_update(struct task_struct *task);
+extern void unwind_sframe_state_cleanup(struct task_struct *task);
+
+extern bool unwind_sframe_state_valid_p(struct sframe_state *sfstate);
+extern bool unwind_sframe_state_ready_p(struct sframe_state *sfstate);
+
+/*
+ * Get the callchain using SFrame unwind info for the given task.
+ */
+extern int
+sframe_callchain_user(struct task_struct *task,
+		      struct perf_callchain_entry_ctx *entry,
+		      struct pt_regs *regs);
+
+#endif /* _SFRAME_UNWIND_H */
diff --git a/lib/sframe/Makefile b/lib/sframe/Makefile
index 14d6cfd42a1d..5ee9e3e7ec93 100644
--- a/lib/sframe/Makefile
+++ b/lib/sframe/Makefile
@@ -1,5 +1,11 @@
 # SPDX-License-Identifier: GPL-2.0
 ##################################
-obj-$(CONFIG_USER_UNWINDER_SFRAME) += sframe_read.o
+obj-$(CONFIG_USER_UNWINDER_SFRAME) += iterate_phdr.o \
+				      sframe_read.o \
+				      sframe_state.o \
+				      sframe_unwind.o
 
+CFLAGS_iterate_phdr.o += -I $(srctree)/lib/sframe/ -Wno-error=declaration-after-statement
 CFLAGS_sframe_read.o += -I $(srctree)/lib/sframe/
+CFLAGS_sframe_state.o += -I $(srctree)/lib/sframe/
+CFLAGS_sframe_unwind.o += -I $(srctree)/lib/sframe/
diff --git a/lib/sframe/iterate_phdr.c b/lib/sframe/iterate_phdr.c
new file mode 100644
index 000000000000..c638ff5443ee
--- /dev/null
+++ b/lib/sframe/iterate_phdr.c
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#include <linux/elf.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/mm_types.h>
+
+#include "iterate_phdr.h"
+
+/*
+ * Iterate over the task's memory mappings and find the ELF headers.
+ *
+ * This is expected to be called from perf_callchain_user(), so user process
+ * context is expected.
+ */
+
+int iterate_phdr(int (*callback)(struct phdr_info *info,
+				 struct task_struct *task,
+				 void *data),
+		 struct task_struct *task, void *data)
+{
+	struct vm_area_struct *vma_mt;
+	struct phdr_info phinfo;
+	struct mm_struct *mm;
+	struct page *page;
+	bool first = true;
+	Elf64_Ehdr *ehdr;
+	size_t size;
+	int res = 0;
+	int err;
+	int ret;
+
+	memset(&phinfo, 0, sizeof(struct phdr_info));
+
+	mm = task->mm;
+
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
+
+	mas_for_each(&mas, vma_mt, ULONG_MAX) {
+		/*
+		 * ELF header has a fixed place in the file, starting at offset
+		 * zero.
+		 */
+		if (vma_mt->vm_pgoff)
+			continue;
+
+		/*
+		 * For the callback to infer if it's the prog or DSO we are
+		 * dealing with.
+		 */
+		phinfo.pi_prog = first;
+		first = false;
+		/*
+		 * FIXME TODO
+		 *  - This code assumes 64-bit ELF by using Elf64_Ehdr.
+		 *  - Detect the case when the ELF program headers are of size
+		 *  greater than 1 page.
+		 */
+
+		/*
+		 * FIXME TODO KERNEL
+		 *  - get_user_pages_WHAT, which API. What flags ?
+		 */
+		ret = get_user_pages_remote(mm, vma_mt->vm_start, 1, FOLL_GET,
+					    &page, &vma_mt, NULL);
+		if (ret <= 0)
+			continue;
+
+		/* The first page must have the ELF header. */
+		ehdr = vmap(&page, 1, VM_MAP, PAGE_KERNEL);
+		if (!ehdr)
+			goto put_page;
+
+		/* Check for magic bytes to make sure this is ehdr. */
+		err = ((ehdr->e_ident[EI_MAG0] != ELFMAG0) ||
+		       (ehdr->e_ident[EI_MAG1] != ELFMAG1) ||
+		       (ehdr->e_ident[EI_MAG2] != ELFMAG2) ||
+		       (ehdr->e_ident[EI_MAG3] != ELFMAG3));
+		if (err)
+			goto unmap;
+
+		/*
+		 * FIXME TODO handle the case when number of program headers is
+		 * greater than or equal to PN_XNUM later.
+		 */
+		if (ehdr->e_phnum == PN_XNUM)
+			goto unmap;
+		/*
+		 * FIXME TODO handle the case when Elf phdrs span more than one
+		 * page later ?
+		 */
+		size = sizeof(Elf64_Ehdr) + ehdr->e_phentsize * ehdr->e_phnum;
+		if (size > PAGE_SIZE)
+			goto unmap;
+
+		/* Save the location of program headers and the phnum. */
+		phinfo.pi_addr = vma_mt->vm_start;
+		phinfo.pi_phdr = (void *)ehdr + ehdr->e_phoff;
+		phinfo.pi_phnum = ehdr->e_phnum;
+
+		res = callback(&phinfo, task, data);
+unmap:
+		vunmap(ehdr);
+put_page:
+		put_page(page);
+
+		if (res < 0)
+			break;
+	}
+
+	return res;
+}
diff --git a/lib/sframe/iterate_phdr.h b/lib/sframe/iterate_phdr.h
new file mode 100644
index 000000000000..44c75e43f5a7
--- /dev/null
+++ b/lib/sframe/iterate_phdr.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#ifndef ITERATE_PHDR_H_
+#define ITERATE_PHDR_H_
+
+#include <linux/sched.h>
+
+/**
+ * struct phdr_info - Helper object for iterate_phdr API
+ * @pi_prog: Determine whether prog or DSO.
+ * @pi_addr: Base address.
+ * @pi_phdr: Reference to the ELF program headers of the object.
+ * @pi_phnum: Number of entries in the program header table.
+ * @pi_adds: Number of shared objects added after program startup.
+ * @pi_subs: Number of shared objects removed after program startup.
+ */
+struct phdr_info {
+	bool pi_prog;
+	unsigned long pi_addr;
+	void *pi_phdr;
+	unsigned int pi_phnum;
+	/*
+	 * Following two fields are for optimization - keep track of any
+	 * dlopen/dlclose activity done after program startup.
+	 * FIXME TODO Currently unused.
+	 */
+	uint64_t pi_adds;
+	uint64_t pi_subs;
+};
+
+int iterate_phdr(int (*callback)(struct phdr_info *info,
+				 struct task_struct *task,
+				 void *data),
+		 struct task_struct *task, void *data);
+
+#endif /* ITERATE_PHDR_H_ */
diff --git a/lib/sframe/sframe_state.c b/lib/sframe/sframe_state.c
new file mode 100644
index 000000000000..e2f621f2464c
--- /dev/null
+++ b/lib/sframe/sframe_state.c
@@ -0,0 +1,447 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#include <linux/elf.h>
+#include <linux/vmalloc.h>
+#include <linux/sframe_unwind.h>
+
+#include "sframe_state.h"
+#include "iterate_phdr.h"
+
+#define NUM_OF_DSOS    32
+
+static int num_entries = NUM_OF_DSOS;
+
+/*
+ * Error codes for SFrame state.
+ *
+ * All condition codes less than SFRAME_UNW_INFO_OK are used to indicate
+ * an unhealthy SFrame state.
+ */
+enum {
+	SFRAME_UNW_INVAL_SFRAME = -3, /* An SFrame section is invalid. */
+	SFRAME_UNW_NO_SFSTATE = -2, /* SFrame state could not be set up. */
+	SFRAME_UNW_NO_PROG_SFRAME = -1,  /* No SFrame section in prog. */
+	SFRAME_UNW_INFO_OK = 0,
+	SFRAME_UNW_PARTIAL_INFO = 1,
+};
+
+/*
+ * SFrame Unwind Info APIs.
+ */
+
+static int sframe_unw_info_cleanup(struct sframe_unw_info *sfu_info)
+{
+	int i;
+
+	if (!sfu_info)
+		return 1;
+
+	if (sfu_info->sframe_vmap) {
+		vunmap(sfu_info->sframe_vmap);
+		sfu_info->sframe_vmap = NULL;
+	}
+
+	if (sfu_info->sframe_pages) {
+		for (i = 0; i < sfu_info->sframe_npages; i++)
+			put_page(sfu_info->sframe_pages[i]);
+		kfree(sfu_info->sframe_pages);
+		sfu_info->sframe_pages = NULL;
+	}
+
+	kfree(sfu_info->sfsec);
+	sfu_info->sfsec = NULL;
+
+	return 0;
+}
+
+static void sframe_unw_info_init(struct sframe_unw_info *sfu_info,
+				 uint64_t sframe_addr, size_t sframe_size,
+				 uint64_t text_addr, size_t text_size)
+{
+	if (!sfu_info)
+		return;
+
+	sfu_info->sframe_addr = sframe_addr;
+	sfu_info->sframe_size = sframe_size;
+	sfu_info->text_addr = text_addr;
+	sfu_info->text_size = text_size;
+	sfu_info->sframe_pages = NULL;
+	sfu_info->sframe_vmap = NULL;
+}
+
+/*
+ * Get the user pages containing the SFrame section and set up the SFrame
+ * section object for the stacktracer to use later.
+ */
+static int sframe_unw_info_init_sfsec(struct sframe_state *sfstate,
+				      struct sframe_unw_info *sfu_info)
+{
+	struct vm_area_struct *vma;
+	struct task_struct *task;
+	struct sframe_sec *sfsec;
+	const char *sframe_vmap;
+	const char *sframe_buf;
+	unsigned long npages;
+	struct page **pages;
+	int err = -EFAULT;
+	int i;
+
+	sfsec = kmalloc(sframe_sec_sizeof(), GFP_KERNEL);
+	if (!sfsec)
+		return -ENOMEM;
+	sfu_info->sfsec = sfsec;
+
+	task = sfstate->task;
+
+	vma = find_vma(task->mm, sfu_info->sframe_addr);
+	npages = vma_pages(vma);
+	pages = kmalloc((sizeof(struct page *) * npages), GFP_KERNEL);
+	if (!pages) {
+		err = -ENOMEM;
+		goto free_sfsec;
+	}
+
+#if 0
+	npages = get_user_pages_remote(task->mm, sfu_info->sframe_addr, npages,
+				       FOLL_GET, pages, &vma, NULL);
+#endif
+	npages = get_user_pages_unlocked(vma->vm_start, npages, pages, FOLL_GET);
+	if (npages <= 0)
+		goto free_page;
+
+	sfu_info->sframe_pages = pages;
+	sfu_info->sframe_npages = npages;
+
+	sframe_vmap = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
+	if (!sframe_vmap)
+		goto put_page;
+	sfu_info->sframe_vmap = sframe_vmap;
+
+	sframe_buf = sframe_vmap + (sfu_info->sframe_addr - vma->vm_start);
+	err = sframe_sec_init(sfu_info->sfsec,
+			      sframe_buf,
+			      sfu_info->sframe_size);
+
+	/*
+	 * put_page, vunmap should not be done yet as the SFrame section will
+	 * be used during do_sframe_unwind().
+	 * In the rare possibility that this is a corrupt SFrame section,
+	 * clean up the sframe_unw_info object, and signal error to the
+	 * caller.
+	 */
+	if (!err)
+		return 0;
+
+	vunmap(sframe_vmap);
+put_page:
+	for (i = 0; i < npages; i++)
+		put_page(pages[i]);
+free_page:
+	kfree(pages);
+free_sfsec:
+	kfree(sfu_info->sfsec);
+	sfu_info->sfsec = NULL;
+
+	return err;
+}
+
+static int sframe_state_unw_info_list_add(struct sframe_state *sfstate,
+					  struct sframe_unw_info *sfu_info)
+{
+	size_t realloc_size = 0;
+
+	if (sfstate->su_dsos.alloced == 0) {
+		sfstate->su_dsos.entry
+		  = kcalloc(num_entries, sizeof(struct sframe_unw_info),
+			    GFP_KERNEL);
+		if (!sfstate->su_dsos.entry)
+			return -ENOMEM;
+		sfstate->su_dsos.alloced = num_entries;
+	} else if (sfstate->su_dsos.used == sfstate->su_dsos.alloced) {
+		realloc_size = ((sfstate->su_dsos.alloced + num_entries)
+				* sizeof(struct sframe_unw_info));
+		sfstate->su_dsos.entry
+		  = krealloc(sfstate->su_dsos.entry, realloc_size, GFP_KERNEL);
+		if (!sfstate->su_dsos.entry)
+			return -ENOMEM;
+
+		memset(&sfstate->su_dsos.entry[sfstate->su_dsos.alloced], 0,
+			num_entries * sizeof(struct sframe_unw_info));
+		sfstate->su_dsos.alloced += num_entries;
+	}
+
+	sfstate->su_dsos.entry[sfstate->su_dsos.used++] = *sfu_info;
+	return 0;
+}
+
+static int sframe_state_add_unw_info(struct sframe_state *sfstate,
+				     uint64_t sframe_addr, size_t sframe_size,
+				     uint64_t text_addr, size_t text_size)
+{
+	struct sframe_unw_info *sfu_info;
+	int ret = 0;
+
+	sfu_info = kzalloc(sizeof(*sfu_info), GFP_KERNEL);
+	if (!sfu_info)
+		return -ENOMEM;
+
+	sframe_unw_info_init(sfu_info, sframe_addr, sframe_size, text_addr,
+			     text_size);
+
+	if (sframe_unw_info_init_sfsec(sfstate, sfu_info)) {
+		ret = SFRAME_UNW_INVAL_SFRAME;
+		goto end;
+	}
+
+	/* Add sframe_unw_info object for the program or its DSOs. */
+	if (!sfstate->su_prog.sframe_size)
+		memcpy(&(sfstate->su_prog), sfu_info, sizeof(*sfu_info));
+	else
+		ret = sframe_state_unw_info_list_add(sfstate, sfu_info);
+
+end:
+	kfree(sfu_info);
+
+	return ret;
+}
+
+/*
+ * Add SFrame unwind info for the given (prog or DSO) phdr_info into the SFrame
+ * state object in the task.
+ *
+ * Callback routine from iterate_phdr function.
+ *
+ * Returns 0 if success.
+ */
+static int add_sframe_unwind_info(struct phdr_info *info,
+				  struct task_struct *task,
+				  void *data)
+{
+	struct sframe_state *sframe_state;
+	/* FIXME TODO what if it is Elf32_Phdr. */
+	Elf64_Phdr *sframe_phdr = NULL;
+	Elf64_Phdr *text_phdr = NULL;
+	Elf64_Phdr *phdr = NULL;
+
+	uint64_t sframe_addr;
+	size_t sframe_size;
+	uint64_t text_addr;
+	size_t text_size;
+
+	int err = 0;
+	int p_type;
+
+	phdr = info->pi_phdr;
+	sframe_state = (struct sframe_state *)data;
+
+	for (int j = 0; j < info->pi_phnum; j++) {
+		p_type = phdr[j].p_type;
+		/* Find the executable section and the SFrame section. */
+		if (p_type == PT_GNU_SFRAME) {
+			sframe_phdr = &phdr[j];
+			continue;
+		}
+		/* FIXME TODO Elf 101 - there be only one PF_X. Looks like it? */
+		if (p_type == PT_LOAD && phdr[j].p_flags & PF_X) {
+			/*
+			 * This is the executable part of the ELF binary
+			 * containing the instructions, and may contain
+			 * sections other than .text. The usage of `text` in
+			 * this function is colloquial.
+			 */
+			text_phdr = &phdr[j];
+			continue;
+		}
+
+		if (sframe_phdr && text_phdr)
+			break;
+	}
+
+	/*
+	 * If there is no SFrame section for the prog, SFrame based unwinding
+	 * should not be attempted. If no SFrame section is found for a DSO,
+	 * however, it may still be possible to generate useful stacktraces
+	 * using the SFrame sections for other parts of the program.
+	 */
+	if (!sframe_phdr) {
+		if (info->pi_prog)
+			return SFRAME_UNW_NO_PROG_SFRAME;
+		else
+			return SFRAME_UNW_PARTIAL_INFO;
+	}
+
+	text_addr = info->pi_prog ? text_phdr->p_vaddr
+				  : info->pi_addr + text_phdr->p_vaddr;
+	text_size = text_phdr->p_memsz;
+	sframe_addr = info->pi_prog ? sframe_phdr->p_vaddr
+				    : info->pi_addr + sframe_phdr->p_vaddr;
+	sframe_size = sframe_phdr->p_memsz;
+
+	/* Add the SFrame unwind info object to the list. */
+	err = sframe_state_add_unw_info(sframe_state, sframe_addr,
+					sframe_size, text_addr, text_size);
+	/*
+	 * An error indicates SFrame unwind info addition failed, but one can
+	 * still unwind using .sframe for other parts of the program.
+	 */
+	if (err)
+		return SFRAME_UNW_PARTIAL_INFO;
+
+	return SFRAME_UNW_INFO_OK;
+}
+
+static int add_sframe_unwind_info_for_task(struct task_struct *task)
+{
+	struct sframe_state *sfstate = task->sframe_state;
+
+	/* sfstate should be already allocated. */
+	if (!sfstate)
+		return SFRAME_UNW_NO_SFSTATE;
+
+	return iterate_phdr(add_sframe_unwind_info, task, sfstate);
+}
+
+static bool sframe_unw_info_text_range_p(struct sframe_unw_info *sfu_info,
+					 uint64_t addr)
+{
+	bool in_range = false;
+
+	if (!sfu_info)
+		return false;
+
+	if ((sfu_info->text_addr <= addr) &&
+	    (sfu_info->text_addr + sfu_info->text_size >= addr))
+		in_range = true;
+
+	return in_range;
+}
+
+struct sframe_unw_info *sframe_state_find_unw_info(struct sframe_state *sfstate,
+						   uint64_t addr)
+{
+	struct sframe_unw_info_list *unw_info_list;
+	struct sframe_unw_info sfu_info;
+	int i;
+
+	if (!sfstate)
+		return NULL;
+
+	if (sframe_unw_info_text_range_p(&sfstate->su_prog, addr))
+		return &sfstate->su_prog;
+
+	unw_info_list = &sfstate->su_dsos;
+	for (i = 0; i < unw_info_list->used; ++i) {
+		sfu_info = unw_info_list->entry[i];
+		if (sframe_unw_info_text_range_p(&sfu_info, addr))
+			return &unw_info_list->entry[i];
+	}
+
+	return NULL;
+}
+
+struct sframe_sec *sframe_unw_info_get_sfsec(struct sframe_unw_info *sfu_info)
+{
+	if (!sfu_info || !sfu_info->sfsec)
+		return NULL;
+
+	return sfu_info->sfsec;
+}
+
+/*
+ * SFrame state APIs.
+ */
+
+static void unwind_sframe_state_free(struct sframe_state *sfstate)
+{
+	struct sframe_unw_info_list *unw_info_list;
+	struct sframe_unw_info *sfu_info;
+	int i;
+
+	if (!sfstate)
+		return;
+
+	sfu_info = &(sfstate->su_prog);
+	sframe_unw_info_cleanup(sfu_info);
+
+	unw_info_list = &sfstate->su_dsos;
+	for (i = 0; i < unw_info_list->used; ++i) {
+		sfu_info = &unw_info_list->entry[i];
+		sframe_unw_info_cleanup(sfu_info);
+	}
+
+	kfree(sfstate->su_dsos.entry);
+	sfstate->su_dsos.entry = NULL;
+	sfstate->su_dsos.alloced = 0;
+	sfstate->su_dsos.used = 0;
+
+	sfstate->task = NULL;
+}
+
+bool unwind_sframe_state_valid_p(struct sframe_state *sfstate)
+{
+	return (sfstate && sfstate->cond != SFSTATE_INVAL);
+}
+
+bool unwind_sframe_state_ready_p(struct sframe_state *sfstate)
+{
+	return (sfstate && sfstate->cond == SFSTATE_READY);
+}
+
+struct sframe_state *unwind_sframe_state_alloc(struct task_struct *task)
+{
+	struct sframe_state *sfstate = NULL;
+	/*
+	 * Check if the task's SFrame unwind information needs to be set up.
+	 */
+	sfstate = task->sframe_state;
+	if (!sfstate) {
+		/* Free'd up in release_task(). */
+		sfstate = kzalloc(sizeof(*sfstate), GFP_KERNEL);
+		if (!sfstate)
+			return NULL;
+		sfstate->cond = SFSTATE_ALLOCED;
+		task->sframe_state = sfstate;
+	}
+
+	sfstate->task = task;
+
+	return sfstate;
+}
+
+void unwind_sframe_state_cleanup(struct task_struct *task)
+{
+	if (!task->sframe_state)
+		return;
+
+	unwind_sframe_state_free(task->sframe_state);
+	kfree(task->sframe_state);
+	task->sframe_state = NULL;
+}
+
+/*
+ * Update the SFrame unwind state object cached per task.
+ *
+ * Sets cond to SFSTATE_INVAL if any error.
+ * Sets cond to SFSTATE_READY if no error.
+ */
+int unwind_sframe_state_update(struct task_struct *task)
+{
+	struct sframe_state *sfstate;
+	int sferr = 0;
+	bool ret;
+
+	sfstate = task->sframe_state;
+	if (sfstate->cond == SFSTATE_ALLOCED || sfstate->cond == SFSTATE_STALE)
+		sferr = add_sframe_unwind_info_for_task(task);
+
+	ret = (sferr < SFRAME_UNW_INFO_OK);
+	if (ret)
+		sfstate->cond = SFSTATE_INVAL;
+	else
+		sfstate->cond = SFSTATE_READY;
+
+	return ret;
+}
diff --git a/lib/sframe/sframe_state.h b/lib/sframe/sframe_state.h
new file mode 100644
index 000000000000..abdd279c6597
--- /dev/null
+++ b/lib/sframe/sframe_state.h
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#ifndef SFRAME_STATE_H
+#define SFRAME_STATE_H
+
+#include "sframe_read.h"
+
+/**
+ * struct sframe_unw_info - SFrame unwind info of the program or a DSO.
+ * @sframe_addr: SFrame segment's virtual addr.
+ * @sframe_size: SFrame segment's size.
+ * @sframe_pages: Pages containing the SFrame section.
+ * @sframe_npages: Number of pages containing the SFrame section.
+ * @sframe_vmap: Address of the vmap'd area that contains the .sframe section.
+ * @text_addr: Text segment's virtual addr.
+ * @text_size: Text segment's size.
+ * @sfsec: SFrame section contents.
+ */
+struct sframe_unw_info {
+	uint64_t sframe_addr;
+	uint64_t sframe_size;
+	/*
+	 * Keep a reference to the pages and vma for the lifetime of this SFrame
+	 * unwind info object.
+	 */
+	struct page **sframe_pages;
+	unsigned long sframe_npages;
+	const char *sframe_vmap;
+	/*
+	 * text_addr and text_size below are used only for looking up the
+	 * associated SFrame section.  See sframe_state_find_unw_info.
+	 */
+	uint64_t text_addr;
+	uint64_t text_size;
+
+	struct sframe_sec *sfsec;
+};
+
+/**
+ * struct sframe_unw_info_list - List of SFrame unwind info objects.
+ * @alloced: Number of entries allocated.
+ * @used: Number of entries used.
+ * @entry: Array of SFrame unwind info objects.
+ *
+ * Typically used to represent SFrame unwind info for set of shared libraries
+ * of a program.
+ */
+struct sframe_unw_info_list {
+	int alloced;
+	int used;
+	struct sframe_unw_info *entry;
+};
+
+enum sframe_state_code {
+	SFSTATE_READY = 0,  /* SFrame unwind OK to use.  */
+	SFSTATE_INVAL = 1,  /* SFrame unwind is invalid.  */
+	SFSTATE_ALLOCED = 2,   /* SFrame unwind is alloc'd but not initialized.  */
+	SFSTATE_STALE = 3,  /* SFrame unwind is stale and not OK to use.  */
+};
+
+/**
+ * struct sframe_state - Per task SFrame unwind state.
+ * @task: The task that this SFrame unwind info belongs to.
+ * @cond: Current condition of the SFrame state.
+ * @su_prog: SFrame unwind info object for the program.
+ * @su_dsos: SFrame unwind info list for the shared objects.
+ */
+struct sframe_state {
+	struct task_struct *task;
+	enum sframe_state_code cond;
+	struct sframe_unw_info su_prog;
+	struct sframe_unw_info_list su_dsos;
+};
+
+extern struct sframe_unw_info *
+sframe_state_find_unw_info(struct sframe_state *sfstate, uint64_t addr);
+
+extern struct sframe_sec *
+sframe_unw_info_get_sfsec(struct sframe_unw_info *sfu_info);
+
+#endif /* SFRAME_STATE_H.  */
diff --git a/lib/sframe/sframe_unwind.c b/lib/sframe/sframe_unwind.c
new file mode 100644
index 000000000000..56d579e18034
--- /dev/null
+++ b/lib/sframe/sframe_unwind.c
@@ -0,0 +1,214 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2023, Oracle and/or its affiliates.
+ */
+
+#include <linux/perf_event.h>
+#include <asm/ptrace.h>
+#include <linux/sframe_unwind.h>
+
+#include "sframe_state.h"
+
+/*
+ * Check if ADDR is in text segment (for which we also found a corresponding
+ * SFrame section. The address is considered invalid, if it is not in any of
+ * the address ranges of the text segments of either the main program or any of
+ * it's DSOs (for which a corresponding SFrame section existed).
+ * Return true if valid, false otherwise.
+ */
+static bool unwind_sframe_ip_ok(struct sframe_state *sfstate, uint64_t addr)
+{
+	return (sframe_state_find_unw_info(sfstate, addr) != NULL);
+}
+
+void sframe_unwind_start(struct user_unwind_state *user_unw_state,
+			 struct task_struct *task, struct pt_regs *regs)
+{
+	if (!task->sframe_state ||
+	    !unwind_sframe_state_ready_p(task->sframe_state))
+		goto error;
+
+	if (!regs)
+		goto error;
+
+	user_unw_state->sp = user_stack_pointer(regs);
+	user_unw_state->pc = instruction_pointer(regs);
+	user_unw_state->fp = frame_pointer(regs);
+
+#ifdef procedure_link_pointer
+	user_unw_state->ra = procedure_link_pointer(regs);
+#else
+	user_unw_state->ra = 0;
+#endif
+
+	user_unw_state->task = task;
+	user_unw_state->stype = STACK_TYPE_TASK;
+
+	/* We need to skip ahead by one. */
+	sframe_unwind_next_frame(user_unw_state);
+	return;
+error:
+	user_unw_state->error = true;
+}
+
+bool sframe_unwind_next_frame(struct user_unwind_state *ustate)
+{
+	struct sframe_unw_info *sfu_info;
+	struct sframe_state *sfstate;
+	struct sframe_sec *sfsec;
+	struct task_struct *task;
+	struct sframe_fre *frep;
+	struct sframe_fre fre;
+
+	int32_t ra_offset;
+	int32_t rfp_offset;
+	int32_t cfa_offset;
+
+	uint64_t rfp_stack_loc;
+	uint64_t ra_stack_loc;
+	uint64_t sframe_vma;
+	uint64_t cfa;
+	int err = 0;
+
+	uint64_t pc;
+	uint64_t sp;
+	uint64_t fp;
+	uint64_t ra;
+
+	pc = ustate->pc;
+	sp = ustate->sp;
+	fp = ustate->fp;
+	ra = ustate->ra;
+
+	frep = &fre;
+
+	task = ustate->task;
+	sfstate = task->sframe_state;
+
+	/*
+	 * Indicate end of stack trace when SFrame unwind info
+	 * is not found for the given PC.
+	 */
+	sfu_info = sframe_state_find_unw_info(sfstate, pc);
+
+	if (!sfu_info)
+		goto the_end;
+
+	sfsec = sframe_unw_info_get_sfsec(sfu_info);
+
+	/* Find the SFrame FRE. */
+	sframe_vma = sfu_info->sframe_addr;
+	pc -= sframe_vma;
+	err = sframe_sec_find_fre(sfsec, pc, frep);
+
+	if (err != 0)
+		goto error;
+
+	/* Get the CFA offset from the FRE. */
+	cfa_offset = sframe_fre_get_cfa_offset(sfsec, frep, &err);
+	if (err == SFRAME_ERR_FREOFFSET_NOPRESENT)
+		goto error;
+	cfa = ((sframe_fre_get_base_reg_id(frep, &err)
+		== SFRAME_BASE_REG_SP) ? sp : fp) + cfa_offset;
+
+	/* Get the RA offset from the FRE. */
+	ra_offset = sframe_fre_get_ra_offset(sfsec, frep, &err);
+	if (err == 0) {
+		ra_stack_loc = cfa + ra_offset;
+		if (__get_user(ra, (const uint64_t __user *)ra_stack_loc))
+			goto error;
+	}
+
+	/* Get the FP offset from the FRE. */
+	rfp_offset = sframe_fre_get_fp_offset(sfsec, frep, &err);
+	if (err == 0) {
+		rfp_stack_loc = cfa + rfp_offset;
+		if (__get_user(fp, (const uint64_t __user *)rfp_stack_loc))
+			goto error;
+	}
+
+	/* Validate and add return address to the list. */
+	if (unwind_sframe_ip_ok(sfstate, ra)) {
+		ustate->pc = ra;
+		ustate->sp = cfa;
+		ustate->fp = fp;
+	} else {
+		goto error;
+	}
+
+	return true;
+
+error:
+	ustate->error = true;
+	return false;
+the_end:
+	ustate->stype = STACK_TYPE_UNKNOWN;
+	return false;
+}
+
+uint64_t sframe_unwind_get_return_address(struct user_unwind_state *state)
+{
+	return state->pc;
+}
+
+/*
+ * Generate stack trace using SFrame stack trace information.
+ * Currently always returns 0; an unwind error only ends the trace early.
+ */
+
+static int do_sframe_unwind(struct task_struct *task,
+			    struct sframe_state *sfstate,
+			    struct perf_callchain_entry_ctx *entry,
+			    struct pt_regs *regs)
+{
+	struct user_unwind_state ustate;
+	uint64_t addr;
+
+	memset(&ustate, 0, sizeof(ustate));
+
+	for (sframe_unwind_start(&ustate, task, regs);
+	     !sframe_unwind_done(&ustate) && !sframe_unwind_error(&ustate);
+	     sframe_unwind_next_frame(&ustate)) {
+		addr = sframe_unwind_get_return_address(&ustate);
+		if (!addr || perf_callchain_store(entry, addr))
+			break;
+	}
+
+	return 0;
+}
+
+/*
+ * Get the stack trace for the task using SFrame stack trace information.
+ * Returns 0 if success, 1 otherwise.
+ */
+
+int sframe_callchain_user(struct task_struct *task,
+			   struct perf_callchain_entry_ctx *entry,
+			   struct pt_regs *regs)
+{
+	struct sframe_state *sfstate;
+	int ret;
+
+	if (task != current)
+		return 1;
+
+	/* Get the current task's sframe state. */
+	sfstate = task->sframe_state;
+
+	if (!sfstate)
+		return 1;
+
+	/*
+	 * Prepare for stack tracing. State must be SFSTATE_READY at this time.
+	 * FIXME TODO SFrame state may be stale because there was a change in
+	 * the set of DSOs used by the program, for example.
+	 * FIXME TODO Need to update task->sframe_state if program
+	 * dlopen/dlclose a DSO.
+	 */
+	if (!unwind_sframe_state_ready_p(sfstate))
+		return 1;
+
+	ret = do_sframe_unwind(task, sfstate, entry, regs);
+
+	return ret;
+}
-- 
2.39.2
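
For readers following the arithmetic in sframe_unwind_next_frame() above,
here is a minimal user-space sketch of one unwind step. The toy_* names and
struct layouts are hypothetical stand-ins, not the kernel's struct sframe_fre
or the lib/sframe API, and stack reads go through a bounds-checked array
instead of __get_user(). Unlike the patch, the sketch treats a missing RA
offset as an error and does not validate the recovered return address against
the known text ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified frame row entry (not the kernel's sframe_fre). */
struct toy_fre {
	bool cfa_base_is_sp;	/* true: CFA = SP + offset, false: FP + offset */
	int32_t cfa_offset;
	bool has_ra_offset;
	int32_t ra_offset;	/* return address is saved at CFA + ra_offset */
	bool has_fp_offset;
	int32_t fp_offset;	/* caller's FP is saved at CFA + fp_offset */
};

struct toy_regs {
	uint64_t pc, sp, fp;
};

/* Toy "user stack": 8-byte slots starting at address base. */
struct toy_stack {
	uint64_t base;
	const uint64_t *slots;
	size_t nslots;
};

/* Bounds-checked read standing in for __get_user(). */
static bool toy_read(const struct toy_stack *stk, uint64_t addr, uint64_t *out)
{
	uint64_t idx;

	if (addr < stk->base || (addr - stk->base) % 8 != 0)
		return false;
	idx = (addr - stk->base) / 8;
	if (idx >= stk->nslots)
		return false;
	*out = stk->slots[idx];
	return true;
}

/* Recover the caller's PC, SP and FP; regs is left untouched on failure. */
static bool toy_unwind_step(const struct toy_fre *fre,
			    const struct toy_stack *stk,
			    struct toy_regs *regs)
{
	uint64_t cfa, ra, fp;

	assert(fre != NULL && stk != NULL && regs != NULL);

	fp = regs->fp;
	cfa = (fre->cfa_base_is_sp ? regs->sp : regs->fp)
	      + (uint64_t)(int64_t)fre->cfa_offset;

	if (!fre->has_ra_offset)
		return false;
	if (!toy_read(stk, cfa + (uint64_t)(int64_t)fre->ra_offset, &ra))
		return false;

	/* FP is only reloaded when the FRE says it was saved. */
	if (fre->has_fp_offset &&
	    !toy_read(stk, cfa + (uint64_t)(int64_t)fre->fp_offset, &fp))
		return false;

	regs->pc = ra;
	regs->sp = cfa;
	regs->fp = fp;
	return true;
}
```

With an FP-based row (CFA = FP + 16, RA at CFA - 8, FP at CFA - 16) and
FP = 0x1000, the step reads the return address from 0x1008 and the caller's
FP from 0x1000, and the caller's SP becomes 0x1010.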



* [POC,V2 5/5] x86_64: invoke SFrame based stack tracer for user space
  2023-05-26  5:32 [POC,V2 0/5] SFrame based stack tracer for user space in the kernel Indu Bhagat
                   ` (3 preceding siblings ...)
  2023-05-26  5:32 ` [POC,V2 4/5] sframe: add an SFrame format stack tracer Indu Bhagat
@ 2023-05-26  5:32 ` Indu Bhagat
  2023-05-26  7:56 ` [POC,V2 0/5] SFrame based stack tracer for user space in the kernel Steven Rostedt
  5 siblings, 0 replies; 7+ messages in thread
From: Indu Bhagat @ 2023-05-26  5:32 UTC (permalink / raw)
  To: linux-toolchains, rostedt, peterz; +Cc: Indu Bhagat

[Changes in V2]
  - Code changes to use the sframe_unwind.h from its new location
    (include/linux/sframe_unwind.h).
  - No further changes yet. ATM, it is understood that this POC is
    broken because accessing SFrame sections may fault while the
    perf event is being handled in the NMI context. The changes in this
    patch are expected to be reworked.
[End of changes in V2]

The task's sframe_state is allocated and initialized if a phdr with type
PT_GNU_SFRAME is encountered for the binary.

perf_callchain_user() will fall back on the frame pointer based stack
trace approach if:
  - SFrame section for the main program is not found.
  - SFrame state for the task is either not set up or stale and cannot
    be refreshed.

Finally, the sframe_state is cleaned up in release_task().

Signed-off-by: Indu Bhagat <indu.bhagat@oracle.com>
---
 arch/x86/events/core.c | 51 ++++++++++++++++++++++++++++++++++++++++++
 fs/binfmt_elf.c        | 37 ++++++++++++++++++++++++++++++
 kernel/exit.c          |  9 ++++++++
 3 files changed, 97 insertions(+)
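
As a rough illustration of the decision described in the commit message
above, the sketch below scans a program header table for PT_GNU_SFRAME and
then picks an unwinder along the lines of what perf_callchain_user() does in
this patch. It is user-space C with hypothetical toy_* names and a
stripped-down program header struct. The PT_GNU_SFRAME value shown
(PT_LOOS + 0x474e554) is an assumption that should be checked against the
elf.h in use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PT_LOAD		1u
#define TOY_PT_GNU_SFRAME	0x6474e554u	/* assumed: PT_LOOS + 0x474e554 */

/* Hypothetical minimal program header: only the type field is modelled. */
struct toy_phdr {
	uint32_t p_type;
};

enum toy_unwinder {
	TOY_UNWIND_SFRAME,
	TOY_UNWIND_FRAME_POINTER,
};

/* Mirrors the extra phdr loop added to load_elf_binary(). */
static bool toy_has_sframe_phdr(const struct toy_phdr *phdrs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (phdrs[i].p_type == TOY_PT_GNU_SFRAME)
			return true;
	return false;
}

/*
 * Mirrors "sframe_avail && !sframe_callchain_user(...)": SFrame is used only
 * when the per-task state is ready and the SFrame unwind reports success (0).
 * Anything else falls back to the frame pointer walk.
 */
static enum toy_unwinder toy_pick_unwinder(bool state_ready, int sframe_rc)
{
	if (state_ready && sframe_rc == 0)
		return TOY_UNWIND_SFRAME;
	return TOY_UNWIND_FRAME_POINTER;
}
```

A binary without a PT_GNU_SFRAME phdr never gets an sframe_state, so
state_ready stays false and the frame pointer path is taken.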

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d096b04bf80e..0351e3c444a3 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2860,11 +2860,54 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 }
 #endif
 
+#ifdef CONFIG_USER_UNWINDER_SFRAME
+
+#include <linux/sframe_unwind.h>
+
+/* Check if the specified task has SFrame unwind state set up.  */
+static inline bool check_sframe_state_p(struct task_struct *task)
+{
+	bool sframe_ok = false;
+
+	/*
+	 * FIXME TODO - only the current task can be unwound at this time.
+	 * Even for the current task, the following unknowns remain and hence
+	 * are not handled:
+	 *    - dlopen / dlclose detection and update of sframe_state,
+	 *    - in general, any change in memory mappings.
+	 */
+	if (task != current)
+		return false;
+
+	if (!task->sframe_state)
+		return false;
+
+	sframe_ok = unwind_sframe_state_ready_p(task->sframe_state);
+
+	return sframe_ok;
+}
+
+#else
+/* Check if the specified task has SFrame unwind state set up. */
+static inline bool check_sframe_state_p(struct task_struct *task)
+{
+	return false;
+}
+
+static inline int sframe_callchain_user(struct task_struct *task,
+					struct perf_callchain_entry_ctx *entry,
+					struct pt_regs *regs)
+{
+	return 0;
+}
+#endif
+
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
 	struct stack_frame frame;
 	const struct stack_frame __user *fp;
+	bool sframe_avail;
 
 	if (perf_guest_state()) {
 		/* TODO: We don't support guest os callchain now */
@@ -2887,7 +2930,15 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 	if (perf_callchain_user32(regs, entry))
 		return;
 
+	sframe_avail = check_sframe_state_p(current);
+
 	pagefault_disable();
+
+	if (sframe_avail && !sframe_callchain_user(current, entry, regs)) {
+		pagefault_enable();
+		return;
+	}
+
 	while (entry->nr < entry->max_stack) {
 		if (!valid_user_frame(fp, sizeof(frame)))
 			break;
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 1033fbdfdbec..750f9d397c1e 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -820,6 +820,32 @@ static int parse_elf_properties(struct file *f, const struct elf_phdr *phdr,
 	return ret == -ENOENT ? 0 : ret;
 }
 
+#ifdef CONFIG_USER_UNWINDER_SFRAME
+
+#include <linux/sframe_unwind.h>
+
+static inline int sframe_state_setup(void)
+{
+	int ret;
+
+	/* Allocate the SFrame state (per task) if NULL. */
+	if (!current->sframe_state)
+		current->sframe_state = unwind_sframe_state_alloc(current);
+
+	if (!current->sframe_state)
+		return -ENOMEM;
+
+	ret = unwind_sframe_state_update(current);
+
+	return ret;
+}
+#else
+static inline int sframe_state_setup(void)
+{
+	return 0;
+}
+#endif
+
 static int load_elf_binary(struct linux_binprm *bprm)
 {
 	struct file *interpreter = NULL; /* to shut gcc up */
@@ -842,6 +868,7 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE;
 	struct mm_struct *mm;
 	struct pt_regs *regs;
+	bool sframe_avail = false;
 
 	retval = -ENOEXEC;
 	/* First of all, some simple consistency checks */
@@ -861,6 +888,14 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	if (!elf_phdata)
 		goto out;
 
+	elf_ppnt = elf_phdata;
+	for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++) {
+		if (elf_ppnt->p_type == PT_GNU_SFRAME) {
+			sframe_avail = true;
+			break;
+		}
+	}
+
 	elf_ppnt = elf_phdata;
 	for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++) {
 		char *elf_interpreter;
@@ -1342,6 +1377,8 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	 */
 	ELF_PLAT_INIT(regs, reloc_func_desc);
 #endif
+	if (sframe_avail && sframe_state_setup())
+		goto out;
 
 	finalize_exec(bprm);
 	START_THREAD(elf_ex, regs, elf_entry, bprm->p);
diff --git a/kernel/exit.c b/kernel/exit.c
index 34b90e2e7cf7..14c57ecd03ea 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -70,6 +70,10 @@
 #include <linux/sysfs.h>
 #include <linux/user_events.h>
 
+#ifdef CONFIG_USER_UNWINDER_SFRAME
+#include <linux/sframe_unwind.h>
+#endif
+
 #include <linux/uaccess.h>
 #include <asm/unistd.h>
 #include <asm/mmu_context.h>
@@ -241,6 +245,11 @@ void release_task(struct task_struct *p)
 	struct task_struct *leader;
 	struct pid *thread_pid;
 	int zap_leader;
+
+#ifdef CONFIG_USER_UNWINDER_SFRAME
+	unwind_sframe_state_cleanup(p);
+#endif
+
 repeat:
 	/* don't need to get the RCU readlock here - the process is dead and
 	 * can't be modifying its own credentials. But shut RCU-lockdep up */
-- 
2.39.2



* Re: [POC,V2 0/5] SFrame based stack tracer for user space in the kernel
  2023-05-26  5:32 [POC,V2 0/5] SFrame based stack tracer for user space in the kernel Indu Bhagat
                   ` (4 preceding siblings ...)
  2023-05-26  5:32 ` [POC,V2 5/5] x86_64: invoke SFrame based stack tracer for user space Indu Bhagat
@ 2023-05-26  7:56 ` Steven Rostedt
  5 siblings, 0 replies; 7+ messages in thread
From: Steven Rostedt @ 2023-05-26  7:56 UTC (permalink / raw)
  To: Indu Bhagat; +Cc: linux-toolchains, peterz

On Thu, 25 May 2023 22:32:10 -0700
Indu Bhagat <indu.bhagat@oracle.com> wrote:

> Hello,

Hi Indu,

> 
> I have addressed the code review comments and hopefully made the SFrame
> reader code more maintainable.
> 
> Link to V1 posting: https://lore.kernel.org/linux-toolchains/20230501181515.098acdce@gandalf.local.home/T/#t

This is great!

> 
> The POC still suffers with the same issue: accessing SFrame data is not NMI
> safe.  This is expected to be addressed via the changes to perf workings
> proposed by Steve.
> 
> Once it is clearer what sort of callback APIs are required for the kernel to
> gather user space stack trace using SFrame stack tracer, I am happy to work on
> that later and make the necessary changes to SFrame stack tracer code in
> lib/sframe/sframe_unwind.c and lib/sframe/sframe_state.c etc.

Thanks for sending these.

> 
> As some of the code in lib/sframe/iterate_phdr.[ch] will get dropped or
> reworked, I skipped spending time on improving those files ATM.

OK.

-- Steve


end of thread, other threads:[~2023-05-26  7:58 UTC | newest]

Thread overview: 7+ messages
2023-05-26  5:32 [POC,V2 0/5] SFrame based stack tracer for user space in the kernel Indu Bhagat
2023-05-26  5:32 ` [POC,V2 1/5] Kconfig: x86: Add new config options for userspace unwinder Indu Bhagat
2023-05-26  5:32 ` [POC,V2 2/5] task_struct : add additional member for sframe state Indu Bhagat
2023-05-26  5:32 ` [POC,V2 3/5] sframe: add new SFrame library Indu Bhagat
2023-05-26  5:32 ` [POC,V2 4/5] sframe: add an SFrame format stack tracer Indu Bhagat
2023-05-26  5:32 ` [POC,V2 5/5] x86_64: invoke SFrame based stack tracer for user space Indu Bhagat
2023-05-26  7:56 ` [POC,V2 0/5] SFrame based stack tracer for user space in the kernel Steven Rostedt
