linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] arm64: perf: Make compat tracing better
@ 2021-05-07 20:55 Douglas Anderson
  2021-05-07 20:55 ` [PATCH 1/3] arm64: perf: perf_callchain_user() compat support for clang/non-APCS-gcc-arm Douglas Anderson
                   ` (4 more replies)
  0 siblings, 5 replies; 9+ messages in thread
From: Douglas Anderson @ 2021-05-07 20:55 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: Nick Desaulniers, Seth LaForge, Ricky Liang, Douglas Anderson,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ingo Molnar,
	Jiri Olsa, Mark Rutland, Namhyung Kim, Nathan Chancellor,
	Peter Zijlstra, clang-built-linux, linux-arm-kernel,
	linux-kernel, linux-perf-users

The goal for this series is to improve "perf" behavior when 32-bit
userspace code is involved. This turns out to be fairly important for
Chrome OS which still runs 32-bit userspace for the time being (long
story there).

I won't repeat everything said in the individual patches since since
they are wordy enough as it is.

Please enjoy and I hope this isn't too ugly/hacky for inclusion in
mainline.

Thanks to Nick Desaulniers for his early review of these patches and
to Ricky for the super early prototype that some of this is based on.


Douglas Anderson (3):
  arm64: perf: perf_callchain_user() compat support for
    clang/non-APCS-gcc-arm
  arm64: perf: Improve compat perf_callchain_user() for clang leaf
    functions
  arm64: perf: Add a config option saying 32-bit thumb code uses R11 for
    FP

 arch/arm64/Kconfig                 |  12 ++
 arch/arm64/kernel/perf_callchain.c | 329 +++++++++++++++++++++++++----
 2 files changed, 305 insertions(+), 36 deletions(-)

-- 
2.31.1.607.g51e8a6a459-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/3] arm64: perf: perf_callchain_user() compat support for clang/non-APCS-gcc-arm
  2021-05-07 20:55 [PATCH 0/3] arm64: perf: Make compat tracing better Douglas Anderson
@ 2021-05-07 20:55 ` Douglas Anderson
  2021-05-07 20:55 ` [PATCH 2/3] arm64: perf: Improve compat perf_callchain_user() for clang leaf functions Douglas Anderson
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 9+ messages in thread
From: Douglas Anderson @ 2021-05-07 20:55 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: Nick Desaulniers, Seth LaForge, Ricky Liang, Douglas Anderson,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ingo Molnar,
	Jiri Olsa, Mark Rutland, Namhyung Kim, Nathan Chancellor,
	Peter Zijlstra, clang-built-linux, linux-arm-kernel,
	linux-kernel, linux-perf-users

Up until now the arm32-compat support in perf_callchain_user() made
the assumption that userspace code was compiled with the defunct APCS
(ARM Procedure Call Standard) calling convention. The existing code
also didn't support Thumb at all. Let's add a whole lot more
support. Specifically, we'll now support:

- clang-compiled Thumb userspace code w/ frame pointers.
- clang-compiled ARM userspace code w/ frame pointers.
- gcc-compiled ARM userspace code w/ frame pointers.

We'll also continue to keep supporting the APCS calling convention.

All of the above will be supported automagically. The important
first-order thing to care about is that ARM and Thumb use different
registers for storing the frame pointer. After that, we need to handle
the fact that the compilers don't agree on the format of a stack
frame. Luckily, we can differentiate between all of the different
supported conventions by relying on the simple fact that the stack is
non-executable and the PC is executable. All of the details of how
this is accomplished is described in the copius inline comments
introduced by this patch.

Please note that there is one compiler convention that is explicitly
_NOT_ supported by this patch, primary because it's believed to be
impossible to support without much fancier logic (like unwind
tables). The _UNSUPPORTED_ case is gcc-compiled Thumb userspace code
w/ frame pointers. The frame pointers in gcc-compiled Thumb userspace
code are largely useless for the purposes of crawling the stack. Given
that the primary purpose of frame pointers is to facilitate stack
crawling, it has been asserted that turning on frame pointers in
gcc-compiled Thumb userspace code is 100% useless. For some history,
you can see the gnu bugzilla [1].

Other notes:
- Presumably tracing code that intermixes ARM / Thumb wouldn't work
  amazingly well given the disagreement about which register stores
  the frame pointer. This patch would theoretically support such a
  situation if (and only if) the code creating the frame pointers was
  smart enough to create them correctly. The problem here is that the
  userspace code creating the stack frames would need to look at the
  mode of the caller (ARM vs. thumb) to figure out if it should store
  R11 or R7 as the old FP. I don't believe code does that.
- This patch likely supports intermixing APCS, clang-ARM, and gcc-ARM
  code, though that hasn't been tested.
- Given a limitation in clang, tracing clang code will omit the
  _callers_ of leaf functions. A future patch will attempt to improve
  the situation a little.

This patch has been tested with some toy code. Some details and
history may be able to be found in the Chrome OS bug system [2]. In
general, when testing, I used these commands:
  perf record  -e cycles:u -g -o "${OUT}" -- "${PROG"}
  perf report -i "${OUT}" -g flat

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92172
[2] https://crbug.com/1040659

Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Ricky Liang <jcliang@chromium.org>
Cc: Seth LaForge <sethml@google.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
---
I carried Nick's review from the downstream review on the Chromium
gerrit. I believe this is what he intended by +1-ing my WIP patch at
<https://crrev.com/c/2877595>.

 arch/arm64/kernel/perf_callchain.c | 311 +++++++++++++++++++++++++----
 1 file changed, 275 insertions(+), 36 deletions(-)

diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c
index 88ff471b0bce..e5ce5f7965d1 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -53,49 +53,294 @@ user_backtrace(struct frame_tail __user *tail,
 }
 
 #ifdef CONFIG_COMPAT
-/*
- * The registers we're interested in are at the end of the variable
- * length saved register structure. The fp points at the end of this
- * structure so the address of this struct is:
- * (struct compat_frame_tail *)(xxx->fp)-1
- *
- * This code has been adapted from the ARM OProfile support.
- */
-struct compat_frame_tail {
-	compat_uptr_t	fp; /* a (struct compat_frame_tail *) in compat mode */
-	u32		sp;
-	u32		lr;
-} __attribute__((packed));
 
-static struct compat_frame_tail __user *
-compat_user_backtrace(struct compat_frame_tail __user *tail,
-		      struct perf_callchain_entry_ctx *entry)
+static int compat_user_read(void *dest, u32 user_addr, size_t bytes)
 {
-	struct compat_frame_tail buftail;
+	const u32 __user *addr = compat_ptr(user_addr);
 	unsigned long err;
 
-	/* Also check accessibility of one struct frame_tail beyond */
-	if (!access_ok(tail, sizeof(buftail)))
-		return NULL;
+	if (!access_ok(addr, bytes))
+		return -EACCES;
 
 	pagefault_disable();
-	err = __copy_from_user_inatomic(&buftail, tail, sizeof(buftail));
+	err = __copy_from_user_inatomic(dest, addr, bytes);
 	pagefault_enable();
 
 	if (err)
-		return NULL;
+		return -EACCES;
+
+	return 0;
+}
 
-	perf_callchain_store(entry, buftail.lr);
+static bool compat_is_ptr_executable(u32 ptr, u32 fp)
+{
+	struct vm_area_struct *vma;
 
 	/*
-	 * Frame pointers should strictly progress back up the stack
-	 * (towards higher addresses).
+	 * Fastpath: if what's pointed to by the frame pointer
+	 * is on the same page as the frame pointer (which
+	 * must point to the stack) then it's on the stack too.
 	 */
-	if (tail + 1 >= (struct compat_frame_tail __user *)
-			compat_ptr(buftail.fp))
-		return NULL;
+	if ((ptr & ~0xfff) == (fp & ~0xfff))
+		return false;
 
-	return (struct compat_frame_tail __user *)compat_ptr(buftail.fp) - 1;
+	vma = find_vma(current->active_mm, (unsigned long)compat_ptr(ptr));
+
+	if (!vma)
+		return false;
+
+	return vma->vm_flags & VM_EXEC;
+}
+
+/**
+ * compat_perf_trace_1() - Trace down one stack frame
+ * @fp: Input: the frame pointer to trace at; output: the next frame pointer.
+ * @pc: Input: the program counter at the start; output: the next PC.
+ * @leaf_lr: If this is the first frame, this will have the value from the
+ *           CPU register LR (link register). If this isn't the first frame,
+ *           this will be 0.
+ *
+ * Return: 0 if no error; else an error code
+ */
+static int compat_perf_trace_1(u32 *fp, u32 *pc, u32 leaf_lr)
+{
+	u32 frame_buf[3];
+	u32 before_fp;
+	u32 at_fp;
+	u32 at_at_fp;
+	int err;
+
+	if (*pc & 1) {
+		/*
+		 * Thumb mode.
+		 *
+		 * There are no flags that make frame pointers on gcc-compiled
+		 * thumb code useful. Thus we'll simply assume userspace was
+		 * compiled by clang and hope for the best. On clang things
+		 * look like this:
+		 *
+		 * +-----------+ 0x1000
+		 * |   random  |
+		 * +-----------+ 0x1004
+		 * | old FP    | <--- FP
+		 * +-----------+ 0x1008
+		 * |     LR    |
+		 * +-----------+
+		 * |   random  |
+		 * +-----------+
+		 *
+		 * Note that clang doesn't create frames for leaf functions
+		 * which means the _caller_ of the leaf function will be
+		 * omitted from the crawl. Clang has a flag
+		 * "-mno-omit-leaf-frame-pointer " that's supposed to change
+		 * this behavior but (when I tried it) it didn't seem to work.
+		 * We also won't know for sure if userspace was built with it
+		 * since it presumably adds a bunch of extra overhead.
+		 *
+		 *
+		 * Just for reference, here is what happens when you ask gcc to
+		 * compile in thumb mode with frame pointers. This illustrates
+		 * why ARM should probably remove the following from AAPCS:
+		 *   The AAPCS does not specify where, within a function's
+		 *   stack frame record, the frame chain data structure
+		 *   resides. This permits implementors the freedom to use
+		 *   whatever location will result in the most efficient code
+		 *   needed to establish the frame chain record.
+		 *
+		 * +-----------+ 0x1000
+		 * |   random  | <--- FP
+		 * +-----------+ 0x1004
+		 * |    ...    |
+		 * +-----------+ ???
+		 * |   random  |
+		 * +-----------+ ???
+		 * | old FP    |
+		 * +-----------+ ???
+		 * |     LR    |
+		 * +-----------+
+		 * |   random  |
+		 * +-----------+
+		 */
+		err = compat_user_read(frame_buf, *fp, 8);
+		if (err)
+			return err;
+
+		*fp = frame_buf[0];
+		*pc = frame_buf[1];
+
+		return 0;
+	}
+
+	/*
+	 * ARM mode.
+	 *
+	 * With ARM code we can support 3 options:
+	 * - clang compiled with frame pointers
+	 *     Looks just like the clang thumb style except that we get the
+	 *     initial FP from R11 instead of R7.
+	 *
+	 *     NOTE: the R11 vs. R7 bit will likely change in the future.
+	 *     The 2019Q4 Issue of AAPCS revised the frame pointer to be R11
+	 *     for BOTH ARM and Thumb, as of 2021Q2 no such production
+	 *     compiler implements that change. There are rumors that
+	 *     toolchains may change to match this specification in the future.
+	 *     If/when that happens, we'll probably need a CONFIG option to
+	 *     support userspace code generated by the old compilers.
+	 *
+	 *
+	 * - gcc compiled with frame pointers (non-leaf)
+	 *     +-----------+ 0x1000
+	 *     |   random  |
+	 *     +-----------+ 0x1004
+	 *     | old FP    |
+	 *     +-----------+ 0x1008
+	 *     |     LR    | <--- FP
+	 *     +-----------+
+	 *     |   random  |
+	 *     +-----------+
+	 *
+	 * - gcc compiled with frame pointers (leaf). NOTE LR is in registers
+	 *     +-----------+ 0x1000
+	 *     |   random  |
+	 *     +-----------+ 0x1004
+	 *     | old FP    | <--- FP
+	 *     +-----------+
+	 *     |   random  |
+	 *     +-----------+
+	 *
+	 * - APCS (ARM Procedure Call Standard)
+	 *     +-----------+ 0x1000
+	 *     |   random  |
+	 *     +-----------+ 0x1004
+	 *     | old FP    |
+	 *     +-----------+ 0x1008
+	 *     |     SP    |
+	 *     +-----------+
+	 *     |     LR    |
+	 *     +-----------+
+	 *     |     PC    |  <--- FP
+	 *     +-----------+
+	 *     |   random  |
+	 *     +-----------+
+	 *
+	 * We will autodetect.
+	 */
+
+	/*
+	 * We'll start with reading whatever the FP points directly at. That
+	 * should be valid memory with all 3 supported layouts.
+	 */
+	err = compat_user_read(&at_fp, *fp, 4);
+	if (err)
+		return err;
+
+	/*
+	 * First let's try to identify clang or a gcc leaf function. In both
+	 * cases the FP points directly at the old FP. Frame pointers always
+	 * point to the stack and the stack should never be marked as
+	 * executable. This contrasts to APCS / gcc-non-leaf where FP will
+	 * either point to a PC near the start of the function or a LR, both
+	 * of which must be executable.
+	 */
+	if (!compat_is_ptr_executable(at_fp, *fp)) {
+		/*
+		 * We need to tell the difference between gcc-leaf and clang.
+		 *
+		 * The first clue is to look at "leaf_lr". It will only be
+		 * non-zero on the first call which is the only one that can
+		 * be a leaf. Thus if it's non-zero, we're clang.
+		 *
+		 * If there's a chance we could be in a leaf, look one frame
+		 * up. On gcc that would be a non-leaf FP and would point to
+		 * something executable. If it's non-executable then we're
+		 * clang. If we have any errors, we'll assume clang.
+		 */
+		if (!leaf_lr || compat_user_read(&at_at_fp, at_fp, 4) ||
+		    !compat_is_ptr_executable(at_at_fp, at_fp)) {
+			/* It's clang; we already have new FP so get PC */
+			err = compat_user_read(pc, *fp + 4, 4);
+			if (err)
+				return err;
+
+			*fp = at_fp;
+
+			return 0;
+		}
+
+		/* It's gcc leaf */
+		*pc = leaf_lr;
+		*fp = at_fp;
+		return 0;
+	}
+
+	/* We need to look at one word before to decide APCS / gcc-ARM */
+	err = compat_user_read(&before_fp, *fp - 4, 4);
+	if (err)
+		return err;
+
+	if (!compat_is_ptr_executable(before_fp, *fp)) {
+		/* It's gcc-ARM */
+		*fp = before_fp;
+		*pc = at_fp;
+
+		return 0;
+	}
+
+	/* APCS */
+	err = compat_user_read(frame_buf, *fp - 12, 12);
+	if (err)
+		return err;
+
+	*fp = frame_buf[0];
+	*pc = frame_buf[2];
+
+	return 0;
+}
+
+static void compat_perf_callchain_user(struct perf_callchain_entry_ctx *entry,
+				       struct pt_regs *regs)
+{
+	u32 old_fp = 0;
+	u32 fp;
+	u32 pc;
+	u32 leaf_lr;
+	int err;
+
+	pc = regs->pc;
+	leaf_lr = regs->regs[14];
+
+	/*
+	 * Assuming userspace is compiled with frame pointers then it's in
+	 * R11 for ARM code and R7 for thumb code. If it's thumb mode we'll
+	 * also set the low bit of the PC to match how the PC indicates thumb
+	 * mode when crawling down the stack.
+	 */
+	if (compat_thumb_mode(regs)) {
+		fp = regs->regs[7];
+		pc |= BIT(0);
+	} else {
+		fp = regs->compat_fp;
+	}
+
+	/* Trace as long as we have room and we have a 4-byte aligned FP. */
+	while ((entry->nr < entry->max_stack) && fp && !(fp & 0x3)) {
+		err = compat_perf_trace_1(&fp, &pc, leaf_lr);
+
+		/* Bail out on any type of error */
+		if (err)
+			break;
+
+		/*
+		 * Frame pointers should strictly progress back up the stack
+		 * (towards higher addresses).
+		 */
+		if (fp <= old_fp)
+			break;
+		old_fp = fp;
+		leaf_lr = 0;
+
+		perf_callchain_store(entry, pc & ~BIT(0));
+	}
 }
 #endif /* CONFIG_COMPAT */
 
@@ -121,13 +366,7 @@ void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 	} else {
 #ifdef CONFIG_COMPAT
 		/* AARCH32 compat mode */
-		struct compat_frame_tail __user *tail;
-
-		tail = (struct compat_frame_tail __user *)regs->compat_fp - 1;
-
-		while ((entry->nr < entry->max_stack) &&
-			tail && !((unsigned long)tail & 0x3))
-			tail = compat_user_backtrace(tail, entry);
+		compat_perf_callchain_user(entry, regs);
 #endif
 	}
 }
-- 
2.31.1.607.g51e8a6a459-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/3] arm64: perf: Improve compat perf_callchain_user() for clang leaf functions
  2021-05-07 20:55 [PATCH 0/3] arm64: perf: Make compat tracing better Douglas Anderson
  2021-05-07 20:55 ` [PATCH 1/3] arm64: perf: perf_callchain_user() compat support for clang/non-APCS-gcc-arm Douglas Anderson
@ 2021-05-07 20:55 ` Douglas Anderson
  2021-06-07  9:14   ` James Clark
  2021-05-07 20:55 ` [PATCH 3/3] arm64: perf: Add a config option saying 32-bit thumb code uses R11 for FP Douglas Anderson
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 9+ messages in thread
From: Douglas Anderson @ 2021-05-07 20:55 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: Nick Desaulniers, Seth LaForge, Ricky Liang, Douglas Anderson,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ingo Molnar,
	Jiri Olsa, Mark Rutland, Namhyung Kim, Nathan Chancellor,
	Peter Zijlstra, clang-built-linux, linux-arm-kernel,
	linux-kernel, linux-perf-users

It turns out that even when you compile code with clang with
"-fno-omit-frame-pointer" that it won't generate a frame pointer for
leaf functions (those that don't call any sub-functions). Presumably
clang does this to reduce the overhead of frame pointers. In a leaf
function you don't really need frame pointers since the Link Register
(LR) is guaranteed to always point to the caller.

Clang's optimization here is a bit of a pain for us, though. A human
might have an easy time figuring out if a function is a leaf function
or not and in theory we could have a way to lookup a given PC to find
out if it's in a leaf function. Unfortunately we haven't passed the
Turing test and we don't have any auxiliary data to help us.

If we just ignore this problem then the end result isn't terrible. It
just turns out that the _callers_ of leaf functions won't be logged. I
guess that's OK, but it could lead to some confusing traces.

Another option (the one proposed in this patch) is to always log the
first LR when we're tracing, assuming that we hadn't already decided
to log it for some other reason.

Good things about always logging the LR:
* clang leaf functions are handled better.
* if PC is right at the start of a function (even on non-clang) it's
  handled better.

Bad things about the LR:
* We could log a "bogus" PC in the trace.

I believe that the most common "bogus" PC that would be logged would
be a PC somewhere in the top function being traced. As an example, if
we have this function:

  non_leaf():
    1. Setup the frame pointer
    2. Call example()
    3. Do something slow
    4. Do something else slow
    5. Call example2()
    6. Return

If the PC was in the middle of "Do something else slow" and then we
tried to trace, our stack trace would look like this:
  Top:  a) A PC in the middle of "Do something else slow".
        b) The return address that example() used, probably in "Do
           something slow"
	c) The caller of non_leaf()

Specifically you can see that there would be two PCs logged for
non_leaf().

To me it feels like this is a net-win. It also should be noted that
the consumer of our trace records probably _does_ have more
information than we do. It could fairly easily ignore this info.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 arch/arm64/kernel/perf_callchain.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c
index e5ce5f7965d1..b3cd9f371469 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -326,6 +326,20 @@ static void compat_perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 	while ((entry->nr < entry->max_stack) && fp && !(fp & 0x3)) {
 		err = compat_perf_trace_1(&fp, &pc, leaf_lr);
 
+		/*
+		 * If this is the first trace and it didn't find the LR then
+		 * let's throw it in the trace first. This isn't perfect but
+		 * is the best we can do for handling clang leaf functions (or
+		 * the case where we're right at the start of the function
+		 * before the new frame has been pushed). In the worst case
+		 * this can cause us to throw an extra entry that will be some
+		 * location in the same function as the PC. That's not
+		 * amazing but shouldn't really hurt. It seems better than
+		 * throwing away the LR.
+		 */
+		if (leaf_lr && leaf_lr != pc)
+			perf_callchain_store(entry, leaf_lr & ~BIT(0));
+
 		/* Bail out on any type of error */
 		if (err)
 			break;
-- 
2.31.1.607.g51e8a6a459-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/3] arm64: perf: Add a config option saying 32-bit thumb code uses R11 for FP
  2021-05-07 20:55 [PATCH 0/3] arm64: perf: Make compat tracing better Douglas Anderson
  2021-05-07 20:55 ` [PATCH 1/3] arm64: perf: perf_callchain_user() compat support for clang/non-APCS-gcc-arm Douglas Anderson
  2021-05-07 20:55 ` [PATCH 2/3] arm64: perf: Improve compat perf_callchain_user() for clang leaf functions Douglas Anderson
@ 2021-05-07 20:55 ` Douglas Anderson
  2021-05-25 15:04 ` [PATCH 0/3] arm64: perf: Make compat tracing better Doug Anderson
  2021-06-02 17:55 ` Will Deacon
  4 siblings, 0 replies; 9+ messages in thread
From: Douglas Anderson @ 2021-05-07 20:55 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: Nick Desaulniers, Seth LaForge, Ricky Liang, Douglas Anderson,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ingo Molnar,
	Jiri Olsa, Mark Rutland, Namhyung Kim, Nathan Chancellor,
	Peter Zijlstra, clang-built-linux, linux-arm-kernel,
	linux-kernel, linux-perf-users

The frame pointer story for 32-bit ARM/Thumb code is a bit of a
nightmare. See the patch ("arm64: perf: perf_callchain_user() compat
support clang/non-APCS-gcc-arm") (including the inline comments) for
some details.

Apparently, all hope is not lost for some resolution to this story.
Recently it's been agreed upon that the frame pointer should be R11
across both ARM and Thumb. This should, at least, allow us to crawl
through mixed code.

Unfortunately I can't think of any automagic way to figure out if code
is using R7 or R11 for the frame pointer. We'll force the person
compiling the kernel to choose one or the other.

NOTE: apparently as-of right now (2021Q1) there are no compilers out
there that actually support this. Thus this patch is untested.
However, it's so simple that it feels right to land it now while
everyone is thinking about it. I have, at least, confirmed that
tracing Thumb code produced with the old compiler _breaks_ when I set
this option. ;-)

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 arch/arm64/Kconfig                 | 12 ++++++++++++
 arch/arm64/kernel/perf_callchain.c | 12 ++++++++----
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9f1d8566bbf9..f123736ac535 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1062,6 +1062,18 @@ config ARCH_SPARSEMEM_ENABLE
 	select SPARSEMEM_VMEMMAP_ENABLE
 	select SPARSEMEM_VMEMMAP
 
+config PERF_COMPAT_THUMB_FP_R11
+	bool "Assume userspace 32-bit Thumb code uses R11 for Frame pointer"
+	help
+	  Historically R11 was the frame pointer (FP) for 32-bit ARM code
+	  and R7 was the frame pointer for 32-bit Thumb code. This resulted in
+	  the inability to use the FP for stack crawling with mixed code.
+	  The 2019Q4 Issue of AAPCS revised the frame pointer to be R11 for
+	  BOTH ARM and Thumb. If your userspace was built with this new
+	  standard then say "yes" here.
+	depends on PERF_EVENTS
+	depends on COMPAT
+
 config HW_PERF_EVENTS
 	def_bool y
 	depends on ARM_PMU
diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c
index b3cd9f371469..c8187acdbf3f 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -311,12 +311,16 @@ static void compat_perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 
 	/*
 	 * Assuming userspace is compiled with frame pointers then it's in
-	 * R11 for ARM code and R7 for thumb code. If it's thumb mode we'll
-	 * also set the low bit of the PC to match how the PC indicates thumb
-	 * mode when crawling down the stack.
+	 * R11 for ARM code and R7 for thumb code (unless you've got a really
+	 * new compiler). If it's thumb mode we'll also set the low bit of
+	 * the PC to match how the PC indicates thumb mode when crawling
+	 * down the stack.
 	 */
 	if (compat_thumb_mode(regs)) {
-		fp = regs->regs[7];
+		if (IS_ENABLED(CONFIG_PERF_COMPAT_THUMB_FP_R11))
+			fp = regs->regs[11];
+		else
+			fp = regs->regs[7];
 		pc |= BIT(0);
 	} else {
 		fp = regs->compat_fp;
-- 
2.31.1.607.g51e8a6a459-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/3] arm64: perf: Make compat tracing better
  2021-05-07 20:55 [PATCH 0/3] arm64: perf: Make compat tracing better Douglas Anderson
                   ` (2 preceding siblings ...)
  2021-05-07 20:55 ` [PATCH 3/3] arm64: perf: Add a config option saying 32-bit thumb code uses R11 for FP Douglas Anderson
@ 2021-05-25 15:04 ` Doug Anderson
  2021-06-02 17:55 ` Will Deacon
  4 siblings, 0 replies; 9+ messages in thread
From: Doug Anderson @ 2021-05-25 15:04 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon
  Cc: Nick Desaulniers, Seth LaForge, Ricky Liang, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Namhyung Kim, Nathan Chancellor, Peter Zijlstra,
	clang-built-linux, Linux ARM, LKML, linux-perf-users

Catalin / Will,

On Fri, May 7, 2021 at 1:55 PM Douglas Anderson <dianders@chromium.org> wrote:
>
> The goal for this series is to improve "perf" behavior when 32-bit
> userspace code is involved. This turns out to be fairly important for
> Chrome OS which still runs 32-bit userspace for the time being (long
> story there).
>
> I won't repeat everything said in the individual patches since since
> they are wordy enough as it is.
>
> Please enjoy and I hope this isn't too ugly/hacky for inclusion in
> mainline.
>
> Thanks to Nick Desaulniers for his early review of these patches and
> to Ricky for the super early prototype that some of this is based on.
>
>
> Douglas Anderson (3):
>   arm64: perf: perf_callchain_user() compat support for
>     clang/non-APCS-gcc-arm
>   arm64: perf: Improve compat perf_callchain_user() for clang leaf
>     functions
>   arm64: perf: Add a config option saying 32-bit thumb code uses R11 for
>     FP
>
>  arch/arm64/Kconfig                 |  12 ++
>  arch/arm64/kernel/perf_callchain.c | 329 +++++++++++++++++++++++++----
>  2 files changed, 305 insertions(+), 36 deletions(-)

While this isn't massively urgent, I'd like to confirm that this is on
your plate to eventually review and/or land. If it's not, do you have
any suggestions for how I should proceed?

Thanks!

-Doug

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/3] arm64: perf: Make compat tracing better
  2021-05-07 20:55 [PATCH 0/3] arm64: perf: Make compat tracing better Douglas Anderson
                   ` (3 preceding siblings ...)
  2021-05-25 15:04 ` [PATCH 0/3] arm64: perf: Make compat tracing better Doug Anderson
@ 2021-06-02 17:55 ` Will Deacon
  2021-06-07 20:34   ` Doug Anderson
  4 siblings, 1 reply; 9+ messages in thread
From: Will Deacon @ 2021-06-02 17:55 UTC (permalink / raw)
  To: Douglas Anderson
  Cc: Catalin Marinas, Nick Desaulniers, Seth LaForge, Ricky Liang,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ingo Molnar,
	Jiri Olsa, Mark Rutland, Namhyung Kim, Nathan Chancellor,
	Peter Zijlstra, clang-built-linux, linux-arm-kernel,
	linux-kernel, linux-perf-users

Hi Doug,

Thanks for posting this, and sorry for the delay in getting to it.

On Fri, May 07, 2021 at 01:55:10PM -0700, Douglas Anderson wrote:
> The goal for this series is to improve "perf" behavior when 32-bit
> userspace code is involved. This turns out to be fairly important for
> Chrome OS which still runs 32-bit userspace for the time being (long
> story there).

Watch out, your days are numbered! See [1].

> I won't repeat everything said in the individual patches since since
> they are wordy enough as it is.
> 
> Please enjoy and I hope this isn't too ugly/hacky for inclusion in
> mainline.
> 
> Thanks to Nick Desaulniers for his early review of these patches and
> to Ricky for the super early prototype that some of this is based on.

I can see that you've put a lot of effort into this, but I'm not thrilled
with the prospect of maintaining these heuristics in the kernel. The
callchain behaviour is directly visible to userspace, and all we'll be able
to do is throw more heuristics at it if faced with any regression reports.
Every assumption made about userspace behaviour results in diminishing
returns where some set of programs no longer fall into the "supported"
bucket and, on balance, I don't think the trade-off is worth it.

If we were to do this in the kernel, then I'd like to see a spec for how
frame-pointer based unwinding should work for Thumb and have it agreed
upon and implemented by both GCC and LLVM. That way, we can implement
the unwinder according to that spec and file bug reports against the
compiler if it goes wrong.

In lieu of that, I think we must defer to userspace to unwind using DWARF.
Perf supports this via PERF_SAMPLE_STACK_USER and PERF_SAMPLE_REGS_USER,
which allows libunwind to be used to create the callchain. You haven't
mentioned that here, so I'd be interested to know why not.

Finally, you've probably noticed that our unwinding code for compat tasks
is basically identical to the code in arch/arm/. If the functionality is
going to be extended, it should be done there first and then we will follow
to be compatible.

Cheers,

Will

[1] https://lore.kernel.org/lkml/20210602164719.31777-20-will@kernel.org/T/#u

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/3] arm64: perf: Improve compat perf_callchain_user() for clang leaf functions
  2021-05-07 20:55 ` [PATCH 2/3] arm64: perf: Improve compat perf_callchain_user() for clang leaf functions Douglas Anderson
@ 2021-06-07  9:14   ` James Clark
  2021-06-07 20:57     ` Doug Anderson
  0 siblings, 1 reply; 9+ messages in thread
From: James Clark @ 2021-06-07  9:14 UTC (permalink / raw)
  To: Douglas Anderson, Catalin Marinas, Will Deacon
  Cc: Nick Desaulniers, Seth LaForge, Ricky Liang, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Ingo Molnar, Jiri Olsa, Mark Rutland,
	Namhyung Kim, Nathan Chancellor, Peter Zijlstra,
	clang-built-linux, linux-arm-kernel, linux-kernel,
	linux-perf-users, Alexandre Truong, Wilco Dijkstra, Al Grant



On 07/05/2021 23:55, Douglas Anderson wrote:
> It turns out that even when you compile code with clang with
> "-fno-omit-frame-pointer" that it won't generate a frame pointer for
> leaf functions (those that don't call any sub-functions). Presumably
> clang does this to reduce the overhead of frame pointers. In a leaf
> function you don't really need frame pointers since the Link Register
> (LR) is guaranteed to always point to the caller> 
[...]
> 
>  arch/arm64/kernel/perf_callchain.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c
> index e5ce5f7965d1..b3cd9f371469 100644
> --- a/arch/arm64/kernel/perf_callchain.c
> +++ b/arch/arm64/kernel/perf_callchain.c
> @@ -326,6 +326,20 @@ static void compat_perf_callchain_user(struct perf_callchain_entry_ctx *entry,
>  	while ((entry->nr < entry->max_stack) && fp && !(fp & 0x3)) {
>  		err = compat_perf_trace_1(&fp, &pc, leaf_lr);
>  
> +		/*
> +		 * If this is the first trace and it didn't find the LR then
> +		 * let's throw it in the trace first. This isn't perfect but
> +		 * is the best we can do for handling clang leaf functions (or
> +		 * the case where we're right at the start of the function
> +		 * before the new frame has been pushed). In the worst case
> +		 * this can cause us to throw an extra entry that will be some
> +		 * location in the same function as the PC. That's not
> +		 * amazing but shouldn't really hurt. It seems better than
> +		 * throwing away the LR.
> +		 */

Hi Douglas,

I think the behaviour with GCC is also similar. We were working on this change
(https://lore.kernel.org/lkml/20210304163255.10363-4-alexandre.truong@arm.com/)
in userspace Perf which addresses the same issue.

The basic concept of our version is to record only the link register
(as in --user-regs=lr). Then use the existing dwarf based unwind
to determine if the link register is valid for that frame, and then if
it is and it doesn't already exist on the stack then insert it.

You mention that your version isn't perfect, do you think that saving the
LR and using something like libunwind in a post process could be better?

Thanks
James

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 0/3] arm64: perf: Make compat tracing better
  2021-06-02 17:55 ` Will Deacon
@ 2021-06-07 20:34   ` Doug Anderson
  0 siblings, 0 replies; 9+ messages in thread
From: Doug Anderson @ 2021-06-07 20:34 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Nick Desaulniers, Seth LaForge, Ricky Liang,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Ingo Molnar,
	Jiri Olsa, Mark Rutland, Namhyung Kim, Nathan Chancellor,
	Peter Zijlstra, clang-built-linux, Linux ARM, LKML,
	linux-perf-users

Hi,

On Wed, Jun 2, 2021 at 10:56 AM Will Deacon <will@kernel.org> wrote:
>
> Hi Doug,
>
> Thanks for posting this, and sorry for the delay in getting to it.
>
> On Fri, May 07, 2021 at 01:55:10PM -0700, Douglas Anderson wrote:
> > The goal for this series is to improve "perf" behavior when 32-bit
> > userspace code is involved. This turns out to be fairly important for
> > Chrome OS which still runs 32-bit userspace for the time being (long
> > story there).
>
> Watch out, your days are numbered! See [1].

Yeah, folks on the Chrome OS team are aware and we're trying our
darndest to move away. It's been an unfortunate set of circumstances
that has kept us on 32-bit this long. :( BTW: I like your suggestion
of "retirement" as a solution to dealing with this problem, but I'm
not quite ready to retire yet.


> > I won't repeat everything said in the individual patches since since
> > they are wordy enough as it is.
> >
> > Please enjoy and I hope this isn't too ugly/hacky for inclusion in
> > mainline.
> >
> > Thanks to Nick Desaulniers for his early review of these patches and
> > to Ricky for the super early prototype that some of this is based on.
>
> I can see that you've put a lot of effort into this, but I'm not thrilled
> with the prospect of maintaining these heuristics in the kernel. The
> callchain behaviour is directly visible to userspace, and all we'll be able
> to do is throw more heuristics at it if faced with any regression reports.
> Every assumption made about userspace behaviour results in diminishing
> returns where some set of programs no longer fall into the "supported"
> bucket and, on balance, I don't think the trade-off is worth it.
>
> If we were to do this in the kernel, then I'd like to see a spec for how
> frame-pointer based unwinding should work for Thumb and have it agreed
> upon and implemented by both GCC and LLVM. That way, we can implement
> the unwinder according to that spec and file bug reports against the
> compiler if it goes wrong.

Given how long this has been going on, I'd somewhat guess that getting
this implemented in GCC and LLVM is 1+ year out. Presumably Chrome OS
will be transitioned off 32-bit ARM by then.


> In lieu of that, I think we must defer to userspace to unwind using DWARF.
> Perf supports this via PERF_SAMPLE_STACK_USER and PERF_SAMPLE_REGS_USER,
> which allows libunwind to be used to create the callchain. You haven't
> mentioned that here, so I'd be interested to know why not.

Good point. So I guess I didn't mention it because:

a) I really know very little about perf. I got roped in this because I
understand stack unwinding, not because I know how to use perf well.
:-P So I personally have no idea how to set this up.

b) In the little bit of reading I did about this, people seemed to say
that using libunwind for perf sampling was just too slow / too much
overhead.


> Finally, you've probably noticed that our unwinding code for compat tasks
> is basically identical to the code in arch/arm/. If the functionality is
> going to be extended, it should be done there first and then we will follow
> to be compatible.

That's fair. I doubt that submitting patches to this area of code for
arm32 would be enjoyable, so I'll pass if it's all the same.

Given your feedback, I think it's fair to consider ${SUBJECT} patch
abandoned then. I'll see if people want to land it as a private patch
in the Chrome OS tree for the time being until we can more fully
abandon arm32 support or until the ARM teams working on gcc and clang
come up with a standard that we can support more properly.

In the meantime, if anyone cares to pick this patch up and move
forward, feel free to do so with my blessing.

-Doug

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/3] arm64: perf: Improve compat perf_callchain_user() for clang leaf functions
  2021-06-07  9:14   ` James Clark
@ 2021-06-07 20:57     ` Doug Anderson
  0 siblings, 0 replies; 9+ messages in thread
From: Doug Anderson @ 2021-06-07 20:57 UTC (permalink / raw)
  To: James Clark
  Cc: Catalin Marinas, Will Deacon, Nick Desaulniers, Seth LaForge,
	Ricky Liang, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ingo Molnar, Jiri Olsa, Mark Rutland, Namhyung Kim,
	Nathan Chancellor, Peter Zijlstra, clang-built-linux, Linux ARM,
	LKML, linux-perf-users, Alexandre Truong, Wilco Dijkstra,
	Al Grant

Hi,

On Mon, Jun 7, 2021 at 2:14 AM James Clark <james.clark@arm.com> wrote:
>
>
>
> On 07/05/2021 23:55, Douglas Anderson wrote:
> > It turns out that even when you compile code with clang with
> > "-fno-omit-frame-pointer" that it won't generate a frame pointer for
> > leaf functions (those that don't call any sub-functions). Presumably
> > clang does this to reduce the overhead of frame pointers. In a leaf
> > function you don't really need frame pointers since the Link Register
> > (LR) is guaranteed to always point to the caller>
> [...]
> >
> >  arch/arm64/kernel/perf_callchain.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> >
> > diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c
> > index e5ce5f7965d1..b3cd9f371469 100644
> > --- a/arch/arm64/kernel/perf_callchain.c
> > +++ b/arch/arm64/kernel/perf_callchain.c
> > @@ -326,6 +326,20 @@ static void compat_perf_callchain_user(struct perf_callchain_entry_ctx *entry,
> >       while ((entry->nr < entry->max_stack) && fp && !(fp & 0x3)) {
> >               err = compat_perf_trace_1(&fp, &pc, leaf_lr);
> >
> > +             /*
> > +              * If this is the first trace and it didn't find the LR then
> > +              * let's throw it in the trace first. This isn't perfect but
> > +              * is the best we can do for handling clang leaf functions (or
> > +              * the case where we're right at the start of the function
> > +              * before the new frame has been pushed). In the worst case
> > +              * this can cause us to throw an extra entry that will be some
> > +              * location in the same function as the PC. That's not
> > +              * amazing but shouldn't really hurt. It seems better than
> > +              * throwing away the LR.
> > +              */
>
> Hi Douglas,
>
> I think the behaviour with GCC is also similar. We were working on this change
> (https://lore.kernel.org/lkml/20210304163255.10363-4-alexandre.truong@arm.com/)
> in userspace Perf which addresses the same issue.
>
> The basic concept of our version is to record only the link register
> (as in --user-regs=lr). Then use the existing dwarf based unwind
> to determine if the link register is valid for that frame, and then if
> it is and it doesn't already exist on the stack then insert it.
>
> You mention that your version isn't perfect, do you think that saving the
> LR and using something like libunwind in a post process could be better?

Using post processing atop a patch to always save the LR is definitely
the right solution IMO and (I think) you can fully overcome the "no
frame pointers in leaf functions" with the post processing.

-Doug

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-06-07 21:00 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-07 20:55 [PATCH 0/3] arm64: perf: Make compat tracing better Douglas Anderson
2021-05-07 20:55 ` [PATCH 1/3] arm64: perf: perf_callchain_user() compat support for clang/non-APCS-gcc-arm Douglas Anderson
2021-05-07 20:55 ` [PATCH 2/3] arm64: perf: Improve compat perf_callchain_user() for clang leaf functions Douglas Anderson
2021-06-07  9:14   ` James Clark
2021-06-07 20:57     ` Doug Anderson
2021-05-07 20:55 ` [PATCH 3/3] arm64: perf: Add a config option saying 32-bit thumb code uses R11 for FP Douglas Anderson
2021-05-25 15:04 ` [PATCH 0/3] arm64: perf: Make compat tracing better Doug Anderson
2021-06-02 17:55 ` Will Deacon
2021-06-07 20:34   ` Doug Anderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).