linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3] tracing: Function stack size and its name mismatch in arm64
@ 2019-08-02  9:41 Jiping Ma
  2019-08-02 15:22 ` Steven Rostedt
  0 siblings, 1 reply; 15+ messages in thread
From: Jiping Ma @ 2019-08-02  9:41 UTC (permalink / raw)
  To: rostedt, mingo, catalin.marinas, will.deacon, joel
  Cc: linux-kernel, linux-arm-kernel, jiping.ma2

There is not PC in ARM64 stack, LR is used to for walk_stackframe in
ARM64. Tere is no the issue in ARM32 because there is PC in ARM32 stack.
PC is used to calculate the stack size in trace_stack.c, so the
function name and its stack size appear to be off-by-one.
ARM64 stack layout:
	LR
        FP
        ......
        LR
        FP
        ......

Wrong info:
Depth Size Location (16 entries)
----- ---- --------
0) 5400 16 __update_load_avg_se.isra.2+0x28/0x220
1) 5384 96 put_prev_entity+0x250/0x338
2) 5288 80 pick_next_task_fair+0x4c4/0x508
3) 5208 72 __schedule+0x100/0x600
4) 5136 184 preempt_schedule_common+0x28/0x48
5) 4952 32 preempt_schedule+0x28/0x30
6) 4920 16 vprintk_emit+0x170/0x1f8
7) 4904 128 vprintk_default+0x48/0x58
8) 4776 64 vprintk_func+0xf8/0x1c8
9) 4712 112 printk+0x70/0x90
10) 4600 176 occupy_stack_init+0x64/0xc0 [kernel_stack]
11) 4424 3376 do_one_initcall+0x68/0x248
12) 1048 144 do_init_module+0x60/0x1f0
13) 904 48 load_module+0x1d50/0x2340
14) 856 352 sys_finit_module+0xd0/0xe8
15) 504 504 el0_svc_naked+0x30/0x34

Correct info:
Depth Size Location (18 entries)
----- ---- --------
0) 5464 48 cgroup_rstat_updated+0x20/0x100
1) 5416 32 cgroup_base_stat_cputime_account_end.isra.0+0x30/0x60
2) 5384 32 __cgroup_account_cputime+0x3c/0x48
3) 5352 64 update_curr+0xc4/0x1d0
4) 5288 72 pick_next_task_fair+0x444/0x508
5) 5216 184 __schedule+0x100/0x600
6) 5032 32 preempt_schedule_common+0x28/0x48
7) 5000 16 preempt_schedule+0x28/0x30
8) 4984 128 vprintk_emit+0x170/0x1f8
9) 4856 64 vprintk_default+0x48/0x58
10) 4792 112 vprintk_func+0xf8/0x1c8
11) 4680 176 printk+0x70/0x90
12) 4504 80 func_test+0x7c/0xb8 [kernel_stack]
13) 4424 3376 occupy_stack_init+0x7c/0xc0 [kernel_stack]
14) 1048 144 do_one_initcall+0x68/0x248
15) 904 48 do_init_module+0x60/0x1f0
16) 856 352 load_module+0x1d50/0x2340
17) 504 504 sys_finit_module+0xd0/0xe8

Signed-off-by: Jiping Ma <jiping.ma2@windriver.com>
---
 kernel/trace/trace_stack.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index 5d16f73898db..ed80b95abf06 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -40,16 +40,28 @@ static void print_max_stack(void)
 
 	pr_emerg("        Depth    Size   Location    (%d entries)\n"
 			   "        -----    ----   --------\n",
+#ifdef CONFIG_ARM64
+			   stack_trace_nr_entries - 1);
+#else
 			   stack_trace_nr_entries);
-
+#endif
+#ifdef CONFIG_ARM64
+	for (i = 1; i < stack_trace_nr_entries; i++) {
+#else
 	for (i = 0; i < stack_trace_nr_entries; i++) {
+#endif
 		if (i + 1 == stack_trace_nr_entries)
 			size = stack_trace_index[i];
 		else
 			size = stack_trace_index[i] - stack_trace_index[i+1];
 
+#ifdef CONFIG_ARM64
+		pr_emerg("%3ld) %8d   %5d   %pS\n", i-1, stack_trace_index[i],
+				size, (void *)stack_dump_trace[i-1]);
+#else
 		pr_emerg("%3ld) %8d   %5d   %pS\n", i, stack_trace_index[i],
 				size, (void *)stack_dump_trace[i]);
+#endif
 	}
 }
 
@@ -324,8 +336,11 @@ static int t_show(struct seq_file *m, void *v)
 		seq_printf(m, "        Depth    Size   Location"
 			   "    (%d entries)\n"
 			   "        -----    ----   --------\n",
+#ifdef CONFIG_ARM64
+			   stack_trace_nr_entries - 1);
+#else
 			   stack_trace_nr_entries);
-
+#endif
 		if (!stack_tracer_enabled && !stack_trace_max_size)
 			print_disabled(m);
 
@@ -334,6 +349,10 @@ static int t_show(struct seq_file *m, void *v)
 
 	i = *(long *)v;
 
+#ifdef CONFIG_ARM64
+	if (i == 0)
+		return 0;
+#endif
 	if (i >= stack_trace_nr_entries)
 		return 0;
 
@@ -342,9 +361,14 @@ static int t_show(struct seq_file *m, void *v)
 	else
 		size = stack_trace_index[i] - stack_trace_index[i+1];
 
+#ifdef CONFIG_ARM64
+	seq_printf(m, "%3ld) %8d   %5d   ", i-1, stack_trace_index[i], size);
+	trace_lookup_stack(m, i-1);
+#else
 	seq_printf(m, "%3ld) %8d   %5d   ", i, stack_trace_index[i], size);
 
 	trace_lookup_stack(m, i);
+#endif
 
 	return 0;
 }
-- 
2.18.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
  2019-08-02  9:41 [PATCH v3] tracing: Function stack size and its name mismatch in arm64 Jiping Ma
@ 2019-08-02 15:22 ` Steven Rostedt
  2019-08-02 16:09   ` Steven Rostedt
  2019-08-03  8:26   ` Joel Fernandes
  0 siblings, 2 replies; 15+ messages in thread
From: Steven Rostedt @ 2019-08-02 15:22 UTC (permalink / raw)
  To: Jiping Ma
  Cc: mingo, catalin.marinas, will.deacon, joel, linux-kernel,
	linux-arm-kernel

On Fri, 2 Aug 2019 17:41:03 +0800
Jiping Ma <jiping.ma2@windriver.com> wrote:

First of all, this patch will not be accepted. I will not allow arch
specific defines in generic code like this. That said, there are better
solutions.


> There is not PC in ARM64 stack, LR is used to for walk_stackframe in
> ARM64. Tere is no the issue in ARM32 because there is PC in ARM32 stack.
> PC is used to calculate the stack size in trace_stack.c, so the
> function name and its stack size appear to be off-by-one.
> ARM64 stack layout:
> 	LR
>         FP
>         ......
>         LR
>         FP
>         ......

I think you are not explaining the issue correctly. From looking at the
document, I think what you want to say is that the LR is saved *after*
the data for the function. Is that correct? If so, then yes, it would
cause the stack tracing algorithm to be incorrect.

Most archs do this:

On entry to a function:

	save return address
 	reserve local variables and such for current function

I think you are saying that arm64 does this backwards.

	reserve local variables and such for current function
	save return address (LR)


For normal archs, we usually have something like this:

sys_call_entry:
   [save regs and crap on stack ]
   bl sys_finit_module
 sys_call_entry+x:
   [ here's the address put into the lr regsister ]

sys_finit_module:
   save lr on stack [syscall_entry+x]
   save more stuff on stack
   [..]
   bl load_module
  sys_finit_module+y:

load_module:
  save lr on stack [sys_finit_module+y]
  save more stuff on stack
  [..]
  bl do_init_module
 load_module+z:

do_init_module:
  save lr on stack [load_module+z]
  [ let's say we do the stack trace here! ]

On the stack we have:

  data for kernel entry
  sys_call_entry+x [ lr saved in sys_finit_module ]
  data for sys_finit_module
  sys_finit_module+y [ lr saved in load_module ]
  data for load_module
  load_module+z [ lr saved in do_init_module

When we call stack_trace_save() we get:

  sys_call_entry+x
  sys_finit_module+y
  load_module+z

We then find those locations on the stack. Let's add some numbers here:

0:   top of stack
     data for kernel entry
504: sys_call_entry+x
     data for sys_finit_module
856: sys_finit_module+y
     data for sys_load_module
904: load_module+z
1048: BOTTOM OF STACK HERE

To do the work, we start from the bottom of the stack, looking for what
was returned from the stack_trace_save

start = 1048
 *p == load_module+z
  stack size = 1048 - 904

Then we say load_module stack size is 144.

What you are describing doesn't make any sense. Yes, we all add LR to
the stack, not the PC! But if the issue is that the LR is saved *after*
the function data, then that *will* make a difference. We then have:


sys_call_entry:
   [save regs and crap on stack ]
   bl sys_finit_module
 sys_call_entry+x:
   [ here's the address put into the lr regsister ]

sys_finit_module:
   save sys_finit_module stuff on stack
   save lr on stack [syscall_entry+x]  <-- reversed order from before!
   [..]
   bl load_module
  sys_finit_module+y:

load_module:
  save load_module stuff on stack
  save lr on stack [sys_finit_module+y]
  [..]
  bl do_init_module
 load_module+z:

do_init_module:
  [ let's say we do the stack trace here! ]

On the stack we have:

  data for sys_call_entry
  data for sys_finit_module
  sys_call_entry+x [ lr saved in sys_finit_module ]
  data for load_module
  sys_finit_module+y [ lr saved in load_module ]
  data for do_init_module
  load_module+z [ lr saved in do_init_module

This also brings up the question of when the mcount is called?

We get the same data from stack_trace_save():

  sys_call_entry+x
  sys_finit_module+y
  load_module+z

But this time, the stack is different!

0:   top of stack
     data from kernel entry
     data for sys_call_entry
     data for sys_finit_module
504: sys_call_entry+x
     data for sys_finit_module
856: sys_finit_module+y
     data for load_module
904: load_module+z
     data for do_init_module?
1048: BOTTOM OF STACK HERE

So yes, the same algorithm is will come up with the wrong result,
because...

start = 1048
 *p == load_module+z
  stack size = 1048 - 904

We just calculated the size of data for do_init_module, and not for
load_module!

Can someone confirm that this is the real issue?

-- Steve


> 
> Wrong info:
> Depth Size Location (16 entries)
> ----- ---- --------
> 0) 5400 16 __update_load_avg_se.isra.2+0x28/0x220
> 1) 5384 96 put_prev_entity+0x250/0x338
> 2) 5288 80 pick_next_task_fair+0x4c4/0x508
> 3) 5208 72 __schedule+0x100/0x600
> 4) 5136 184 preempt_schedule_common+0x28/0x48
> 5) 4952 32 preempt_schedule+0x28/0x30
> 6) 4920 16 vprintk_emit+0x170/0x1f8
> 7) 4904 128 vprintk_default+0x48/0x58
> 8) 4776 64 vprintk_func+0xf8/0x1c8
> 9) 4712 112 printk+0x70/0x90
> 10) 4600 176 occupy_stack_init+0x64/0xc0 [kernel_stack]
> 11) 4424 3376 do_one_initcall+0x68/0x248
> 12) 1048 144 do_init_module+0x60/0x1f0
> 13) 904 48 load_module+0x1d50/0x2340
> 14) 856 352 sys_finit_module+0xd0/0xe8
> 15) 504 504 el0_svc_naked+0x30/0x34
> 
> Correct info:
> Depth Size Location (18 entries)
> ----- ---- --------
> 0) 5464 48 cgroup_rstat_updated+0x20/0x100
> 1) 5416 32 cgroup_base_stat_cputime_account_end.isra.0+0x30/0x60
> 2) 5384 32 __cgroup_account_cputime+0x3c/0x48
> 3) 5352 64 update_curr+0xc4/0x1d0
> 4) 5288 72 pick_next_task_fair+0x444/0x508
> 5) 5216 184 __schedule+0x100/0x600
> 6) 5032 32 preempt_schedule_common+0x28/0x48
> 7) 5000 16 preempt_schedule+0x28/0x30
> 8) 4984 128 vprintk_emit+0x170/0x1f8
> 9) 4856 64 vprintk_default+0x48/0x58
> 10) 4792 112 vprintk_func+0xf8/0x1c8
> 11) 4680 176 printk+0x70/0x90
> 12) 4504 80 func_test+0x7c/0xb8 [kernel_stack]
> 13) 4424 3376 occupy_stack_init+0x7c/0xc0 [kernel_stack]
> 14) 1048 144 do_one_initcall+0x68/0x248
> 15) 904 48 do_init_module+0x60/0x1f0
> 16) 856 352 load_module+0x1d50/0x2340
> 17) 504 504 sys_finit_module+0xd0/0xe8
> 
> Signed-off-by: Jiping Ma <jiping.ma2@windriver.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
  2019-08-02 15:22 ` Steven Rostedt
@ 2019-08-02 16:09   ` Steven Rostedt
  2019-08-02 16:11     ` Steven Rostedt
  2019-08-06  6:01     ` Jiping Ma
  2019-08-03  8:26   ` Joel Fernandes
  1 sibling, 2 replies; 15+ messages in thread
From: Steven Rostedt @ 2019-08-02 16:09 UTC (permalink / raw)
  To: Jiping Ma
  Cc: mingo, catalin.marinas, will.deacon, joel, linux-kernel,
	linux-arm-kernel

[-- Attachment #1: Type: text/plain, Size: 1556 bytes --]

On Fri, 2 Aug 2019 11:22:59 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I think you are not explaining the issue correctly. From looking at the
> document, I think what you want to say is that the LR is saved *after*
> the data for the function. Is that correct? If so, then yes, it would
> cause the stack tracing algorithm to be incorrect.
> 

[..]

> Can someone confirm that this is the real issue?

Does this patch fix your issue?

-- Steve

diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index 5ab5200b2bdc..13a4832cfb00 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -13,6 +13,7 @@
 #define HAVE_FUNCTION_GRAPH_FP_TEST
 #define MCOUNT_ADDR		((unsigned long)_mcount)
 #define MCOUNT_INSN_SIZE	AARCH64_INSN_SIZE
+#define ARCH_RET_ADDR_AFTER_LOCAL_VARS 1
 
 #ifndef __ASSEMBLY__
 #include <linux/compat.h>
diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index 5d16f73898db..050c6bd9beac 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -158,6 +158,18 @@ static void check_stack(unsigned long ip, unsigned long *stack)
 			i++;
 	}
 
+#ifdef ARCH_RET_ADDR_AFTER_LOCAL_VARS
+	/*
+	 * Most archs store the return address before storing the
+	 * function's local variables. But some archs do this backwards.
+	 */
+	if (x > 1) {
+		memmove(&stack_trace_index[0], &stack_trace_index[1],
+			sizeof(stack_trace_index[0]) * (x - 1));
+		x--;
+	}
+#endif
+
 	stack_trace_nr_entries = x;
 
 	if (task_stack_end_corrupted(current)) {

[-- Attachment #2: f.patch --]
[-- Type: text/x-patch, Size: 1100 bytes --]

diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index 5ab5200b2bdc..13a4832cfb00 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -13,6 +13,7 @@
 #define HAVE_FUNCTION_GRAPH_FP_TEST
 #define MCOUNT_ADDR		((unsigned long)_mcount)
 #define MCOUNT_INSN_SIZE	AARCH64_INSN_SIZE
+#define ARCH_RET_ADDR_AFTER_LOCAL_VARS 1
 
 #ifndef __ASSEMBLY__
 #include <linux/compat.h>
diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index 5d16f73898db..050c6bd9beac 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -158,6 +158,18 @@ static void check_stack(unsigned long ip, unsigned long *stack)
 			i++;
 	}
 
+#ifdef ARCH_RET_ADDR_AFTER_LOCAL_VARS
+	/*
+	 * Most archs store the return address before storing the
+	 * function's local variables. But some archs do this backwards.
+	 */
+	if (x > 1) {
+		memmove(&stack_trace_index[0], &stack_trace_index[1],
+			sizeof(stack_trace_index[0]) * (x - 1));
+		x--;
+	}
+#endif
+
 	stack_trace_nr_entries = x;
 
 	if (task_stack_end_corrupted(current)) {

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
  2019-08-02 16:09   ` Steven Rostedt
@ 2019-08-02 16:11     ` Steven Rostedt
  2019-08-06 15:48       ` Joel Fernandes
  2019-08-06  6:01     ` Jiping Ma
  1 sibling, 1 reply; 15+ messages in thread
From: Steven Rostedt @ 2019-08-02 16:11 UTC (permalink / raw)
  To: Jiping Ma
  Cc: mingo, catalin.marinas, will.deacon, joel, linux-kernel,
	linux-arm-kernel

On Fri, 2 Aug 2019 12:09:20 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Fri, 2 Aug 2019 11:22:59 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > I think you are not explaining the issue correctly. From looking at the
> > document, I think what you want to say is that the LR is saved *after*
> > the data for the function. Is that correct? If so, then yes, it would
> > cause the stack tracing algorithm to be incorrect.
> > 
> 
> [..]
> 
> > Can someone confirm that this is the real issue?
> 
> Does this patch fix your issue?
>

Bah, I hit "attach" instead of "insert" (I wondered why it didn't
insert). Here's the patch without the attachment.

-- Steve

diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index 5ab5200b2bdc..13a4832cfb00 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -13,6 +13,7 @@
 #define HAVE_FUNCTION_GRAPH_FP_TEST
 #define MCOUNT_ADDR		((unsigned long)_mcount)
 #define MCOUNT_INSN_SIZE	AARCH64_INSN_SIZE
+#define ARCH_RET_ADDR_AFTER_LOCAL_VARS 1
 
 #ifndef __ASSEMBLY__
 #include <linux/compat.h>
diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index 5d16f73898db..050c6bd9beac 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -158,6 +158,18 @@ static void check_stack(unsigned long ip, unsigned long *stack)
 			i++;
 	}
 
+#ifdef ARCH_RET_ADDR_AFTER_LOCAL_VARS
+	/*
+	 * Most archs store the return address before storing the
+	 * function's local variables. But some archs do this backwards.
+	 */
+	if (x > 1) {
+		memmove(&stack_trace_index[0], &stack_trace_index[1],
+			sizeof(stack_trace_index[0]) * (x - 1));
+		x--;
+	}
+#endif
+
 	stack_trace_nr_entries = x;
 
 	if (task_stack_end_corrupted(current)) {

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
  2019-08-02 15:22 ` Steven Rostedt
  2019-08-02 16:09   ` Steven Rostedt
@ 2019-08-03  8:26   ` Joel Fernandes
  2019-08-03  8:32     ` Joel Fernandes
  2019-08-05 11:25     ` Will Deacon
  1 sibling, 2 replies; 15+ messages in thread
From: Joel Fernandes @ 2019-08-03  8:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiping Ma, mingo, catalin.marinas, will.deacon, linux-kernel,
	linux-arm-kernel, kernel-team

On Fri, Aug 02, 2019 at 11:22:59AM -0400, Steven Rostedt wrote:
[snip]
> > There is not PC in ARM64 stack, LR is used to for walk_stackframe in
> > ARM64. Tere is no the issue in ARM32 because there is PC in ARM32 stack.
> > PC is used to calculate the stack size in trace_stack.c, so the
> > function name and its stack size appear to be off-by-one.
> > ARM64 stack layout:
> > 	LR
> >         FP
> >         ......
> >         LR
> >         FP
> >         ......
> 
> I think you are not explaining the issue correctly. From looking at the
> document, I think what you want to say is that the LR is saved *after*
> the data for the function. Is that correct? If so, then yes, it would
> cause the stack tracing algorithm to be incorrect.
> 
> Most archs do this:
> 
> On entry to a function:
> 
> 	save return address
>  	reserve local variables and such for current function
> 
> I think you are saying that arm64 does this backwards.
> 
> 	reserve local variables and such for current function
> 	save return address (LR)

Actually for arm64 it is like what you said about 'Most archs'. It saves FP
and LR first onto the current stack frame, then assigns the top of the stack
to FP (marking the new frame). Then executes branch-link, and then allocates
space to variables on stack in the callee.

Disassembly of prog:

int foo(int x, int y) {
	int a[32];
        return (x + y + a[31]);
}

int bar(void) {
        foo(1, 2);
}

confirms it:

000000000000073c <bar>:
 73c:   a9bf7bfd        stp     x29, x30, [sp, #-16]! <-- save FP and LR
 740:   910003fd        mov     x29, sp		      <-- create new FP 
 744:   52800041        mov     w1, #0x2              <-- pass arguments
 748:   52800020        mov     w0, #0x1 
 74c:   97fffff4        bl      71c <foo>             <-- branch (sets LR)
 750:   d503201f        nop
 754:   a8c17bfd        ldp     x29, x30, [sp], #16   <-- restore FP, LR
 758:   d65f03c0        ret

000000000000071c <foo>:
 71c:   d10243ff        sub     sp, sp, #0x90        <-- space for local var
 720:   b9000fe0        str     w0, [sp, #12]
 724:   b9000be1        str     w1, [sp, #8]
 728:   b9400fe1        ldr     w1, [sp, #12]
 72c:   b9400be0        ldr     w0, [sp, #8]
 730:   0b000021        add     w1, w1, w0
 734:   b9408fe0        ldr     w0, [sp, #140]
 738:   0b000020        add     w0, w1, w0
 73c:   910243ff        add     sp, sp, #0x90      <-- restore sp before ret
 740:   d65f03c0        ret

 thanks,

  - Joel


> For normal archs, we usually have something like this:
> 
> sys_call_entry:
>    [save regs and crap on stack ]
>    bl sys_finit_module
>  sys_call_entry+x:
>    [ here's the address put into the lr regsister ]
> 
> sys_finit_module:
>    save lr on stack [syscall_entry+x]
>    save more stuff on stack
>    [..]
>    bl load_module
>   sys_finit_module+y:
> 
> load_module:
>   save lr on stack [sys_finit_module+y]
>   save more stuff on stack
>   [..]
>   bl do_init_module
>  load_module+z:
> 
> do_init_module:
>   save lr on stack [load_module+z]
>   [ let's say we do the stack trace here! ]
> 
> On the stack we have:
> 
>   data for kernel entry
>   sys_call_entry+x [ lr saved in sys_finit_module ]
>   data for sys_finit_module
>   sys_finit_module+y [ lr saved in load_module ]
>   data for load_module
>   load_module+z [ lr saved in do_init_module
> 
> When we call stack_trace_save() we get:
> 
>   sys_call_entry+x
>   sys_finit_module+y
>   load_module+z
> 
> We then find those locations on the stack. Let's add some numbers here:
> 
> 0:   top of stack
>      data for kernel entry
> 504: sys_call_entry+x
>      data for sys_finit_module
> 856: sys_finit_module+y
>      data for sys_load_module
> 904: load_module+z
> 1048: BOTTOM OF STACK HERE
> 
> To do the work, we start from the bottom of the stack, looking for what
> was returned from the stack_trace_save
> 
> start = 1048
>  *p == load_module+z
>   stack size = 1048 - 904
> 
> Then we say load_module stack size is 144.
> 
> What you are describing doesn't make any sense. Yes, we all add LR to
> the stack, not the PC! But if the issue is that the LR is saved *after*
> the function data, then that *will* make a difference. We then have:
> 
> 
> sys_call_entry:
>    [save regs and crap on stack ]
>    bl sys_finit_module
>  sys_call_entry+x:
>    [ here's the address put into the lr regsister ]
> 
> sys_finit_module:
>    save sys_finit_module stuff on stack
>    save lr on stack [syscall_entry+x]  <-- reversed order from before!
>    [..]
>    bl load_module
>   sys_finit_module+y:
> 
> load_module:
>   save load_module stuff on stack
>   save lr on stack [sys_finit_module+y]
>   [..]
>   bl do_init_module
>  load_module+z:
> 
> do_init_module:
>   [ let's say we do the stack trace here! ]
> 
> On the stack we have:
> 
>   data for sys_call_entry
>   data for sys_finit_module
>   sys_call_entry+x [ lr saved in sys_finit_module ]
>   data for load_module
>   sys_finit_module+y [ lr saved in load_module ]
>   data for do_init_module
>   load_module+z [ lr saved in do_init_module
> 
> This also brings up the question of when the mcount is called?
> 
> We get the same data from stack_trace_save():
> 
>   sys_call_entry+x
>   sys_finit_module+y
>   load_module+z
> 
> But this time, the stack is different!
> 
> 0:   top of stack
>      data from kernel entry
>      data for sys_call_entry
>      data for sys_finit_module
> 504: sys_call_entry+x
>      data for sys_finit_module
> 856: sys_finit_module+y
>      data for load_module
> 904: load_module+z
>      data for do_init_module?
> 1048: BOTTOM OF STACK HERE
> 
> So yes, the same algorithm is will come up with the wrong result,
> because...
> 
> start = 1048
>  *p == load_module+z
>   stack size = 1048 - 904
> 
> We just calculated the size of data for do_init_module, and not for
> load_module!
> 
> Can someone confirm that this is the real issue?
> 
> -- Steve
> 
> 
> > 
> > Wrong info:
> > Depth Size Location (16 entries)
> > ----- ---- --------
> > 0) 5400 16 __update_load_avg_se.isra.2+0x28/0x220
> > 1) 5384 96 put_prev_entity+0x250/0x338
> > 2) 5288 80 pick_next_task_fair+0x4c4/0x508
> > 3) 5208 72 __schedule+0x100/0x600
> > 4) 5136 184 preempt_schedule_common+0x28/0x48
> > 5) 4952 32 preempt_schedule+0x28/0x30
> > 6) 4920 16 vprintk_emit+0x170/0x1f8
> > 7) 4904 128 vprintk_default+0x48/0x58
> > 8) 4776 64 vprintk_func+0xf8/0x1c8
> > 9) 4712 112 printk+0x70/0x90
> > 10) 4600 176 occupy_stack_init+0x64/0xc0 [kernel_stack]
> > 11) 4424 3376 do_one_initcall+0x68/0x248
> > 12) 1048 144 do_init_module+0x60/0x1f0
> > 13) 904 48 load_module+0x1d50/0x2340
> > 14) 856 352 sys_finit_module+0xd0/0xe8
> > 15) 504 504 el0_svc_naked+0x30/0x34
> > 
> > Correct info:
> > Depth Size Location (18 entries)
> > ----- ---- --------
> > 0) 5464 48 cgroup_rstat_updated+0x20/0x100
> > 1) 5416 32 cgroup_base_stat_cputime_account_end.isra.0+0x30/0x60
> > 2) 5384 32 __cgroup_account_cputime+0x3c/0x48
> > 3) 5352 64 update_curr+0xc4/0x1d0
> > 4) 5288 72 pick_next_task_fair+0x444/0x508
> > 5) 5216 184 __schedule+0x100/0x600
> > 6) 5032 32 preempt_schedule_common+0x28/0x48
> > 7) 5000 16 preempt_schedule+0x28/0x30
> > 8) 4984 128 vprintk_emit+0x170/0x1f8
> > 9) 4856 64 vprintk_default+0x48/0x58
> > 10) 4792 112 vprintk_func+0xf8/0x1c8
> > 11) 4680 176 printk+0x70/0x90
> > 12) 4504 80 func_test+0x7c/0xb8 [kernel_stack]
> > 13) 4424 3376 occupy_stack_init+0x7c/0xc0 [kernel_stack]
> > 14) 1048 144 do_one_initcall+0x68/0x248
> > 15) 904 48 do_init_module+0x60/0x1f0
> > 16) 856 352 load_module+0x1d50/0x2340
> > 17) 504 504 sys_finit_module+0xd0/0xe8
> > 
> > Signed-off-by: Jiping Ma <jiping.ma2@windriver.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
  2019-08-03  8:26   ` Joel Fernandes
@ 2019-08-03  8:32     ` Joel Fernandes
  2019-08-05 11:25     ` Will Deacon
  1 sibling, 0 replies; 15+ messages in thread
From: Joel Fernandes @ 2019-08-03  8:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiping Ma, Ingo Molnar, Catalin Marinas, Will Deacon, LKML,
	moderated list:ARM/STM32 ARCHITECTURE, kernel-team

On Sat, Aug 3, 2019 at 4:26 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Fri, Aug 02, 2019 at 11:22:59AM -0400, Steven Rostedt wrote:
> [snip]
> > > There is not PC in ARM64 stack, LR is used to for walk_stackframe in
> > > ARM64. Tere is no the issue in ARM32 because there is PC in ARM32 stack.
> > > PC is used to calculate the stack size in trace_stack.c, so the
> > > function name and its stack size appear to be off-by-one.
> > > ARM64 stack layout:
> > >     LR
> > >         FP
> > >         ......
> > >         LR
> > >         FP
> > >         ......
> >
> > I think you are not explaining the issue correctly. From looking at the
> > document, I think what you want to say is that the LR is saved *after*
> > the data for the function. Is that correct? If so, then yes, it would
> > cause the stack tracing algorithm to be incorrect.
> >
> > Most archs do this:
> >
> > On entry to a function:
> >
> >       save return address
> >       reserve local variables and such for current function
> >
> > I think you are saying that arm64 does this backwards.
> >
> >       reserve local variables and such for current function
> >       save return address (LR)
>
> Actually for arm64 it is like what you said about 'Most archs'. It saves FP
> and LR first onto the current stack frame, then assigns the top of the stack
> to FP (marking the new frame). Then executes branch-link, and then allocates
> space to variables on stack in the callee.

Just to add to that, when the branch-link (BL) instruction is
executed, the Link Register (LR) will contain the return address. This
why the existing LR needs to be saved first before the call. The
existing LR will contain the return address of the caller's original
caller.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
  2019-08-03  8:26   ` Joel Fernandes
  2019-08-03  8:32     ` Joel Fernandes
@ 2019-08-05 11:25     ` Will Deacon
  2019-08-05 13:59       ` Steven Rostedt
                         ` (2 more replies)
  1 sibling, 3 replies; 15+ messages in thread
From: Will Deacon @ 2019-08-05 11:25 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Steven Rostedt, Jiping Ma, catalin.marinas, will.deacon,
	linux-kernel, mingo, kernel-team, linux-arm-kernel,
	takahiro.akashi

[+Akashi, since he may remember more of the gory details here]

On Sat, Aug 03, 2019 at 04:26:42AM -0400, Joel Fernandes wrote:
> On Fri, Aug 02, 2019 at 11:22:59AM -0400, Steven Rostedt wrote:
> [snip]
> > > There is not PC in ARM64 stack, LR is used to for walk_stackframe in
> > > ARM64. Tere is no the issue in ARM32 because there is PC in ARM32 stack.
> > > PC is used to calculate the stack size in trace_stack.c, so the
> > > function name and its stack size appear to be off-by-one.
> > > ARM64 stack layout:
> > > 	LR
> > >         FP
> > >         ......
> > >         LR
> > >         FP
> > >         ......
> > 
> > I think you are not explaining the issue correctly. From looking at the
> > document, I think what you want to say is that the LR is saved *after*
> > the data for the function. Is that correct? If so, then yes, it would
> > cause the stack tracing algorithm to be incorrect.
> > 
> > Most archs do this:
> > 
> > On entry to a function:
> > 
> > 	save return address
> >  	reserve local variables and such for current function
> > 
> > I think you are saying that arm64 does this backwards.
> > 
> > 	reserve local variables and such for current function
> > 	save return address (LR)
> 
> Actually for arm64 it is like what you said about 'Most archs'. It saves FP
> and LR first onto the current stack frame, then assigns the top of the stack
> to FP (marking the new frame). Then executes branch-link, and then allocates
> space to variables on stack in the callee.
> 
> Disassembly of prog:
> 
> int foo(int x, int y) {
> 	int a[32];
>         return (x + y + a[31]);
> }
> 
> int bar(void) {
>         foo(1, 2);
> }
> 
> confirms it:
> 
> 000000000000073c <bar>:
>  73c:   a9bf7bfd        stp     x29, x30, [sp, #-16]! <-- save FP and LR
>  740:   910003fd        mov     x29, sp		      <-- create new FP 
>  744:   52800041        mov     w1, #0x2              <-- pass arguments
>  748:   52800020        mov     w0, #0x1 
>  74c:   97fffff4        bl      71c <foo>             <-- branch (sets LR)
>  750:   d503201f        nop
>  754:   a8c17bfd        ldp     x29, x30, [sp], #16   <-- restore FP, LR
>  758:   d65f03c0        ret
> 
> 000000000000071c <foo>:
>  71c:   d10243ff        sub     sp, sp, #0x90        <-- space for local var
>  720:   b9000fe0        str     w0, [sp, #12]
>  724:   b9000be1        str     w1, [sp, #8]
>  728:   b9400fe1        ldr     w1, [sp, #12]
>  72c:   b9400be0        ldr     w0, [sp, #8]
>  730:   0b000021        add     w1, w1, w0
>  734:   b9408fe0        ldr     w0, [sp, #140]
>  738:   0b000020        add     w0, w1, w0
>  73c:   910243ff        add     sp, sp, #0x90      <-- restore sp before ret
>  740:   d65f03c0        ret

I think we need to untangle things a bit here.

The arm64 PCS makes no guarantee about the position of the frame record with
respect to stack allocations, so relying on this is fragile at best. This is
partly why the ftrace-with-regs work currently relies on
-fpatchable-function-entry, since that allows the very beginning of the
function to be intercepted which I don't think is necessarily the case with
-pg/_mcount.

For the snippet above, foo is a leaf function and does not have a frame record
placed on the stack. If we instead look at something like:

__attribute__((noinline)) int baz(int x)
{
	return x;
}

__attribute__((noinline)) int bar(int x)
{
	int a[32];
	return baz(x * a[0]);
}

int foo(int x)
{
	return bar(x);
}

then the first instruction of bar() allocates stack space and pushes
the frame record at the same time:

	stp	x29, x30, [sp, -144]!

This can be read as "subtract 144 bytes (32*4 + 16) from the stack pointer,
write the frame record there and then update the stack pointer to point at the
bottom of the newly allocated stack", which means that the array 'a[32]' sits
directly /above/ the frame record on the stack. However, this is just what my
GCC happened to do today. When we looked at this back in 2015, there were other
cases we ended up having to identify with heuristics based on what had been
observed under various compilers:

http://lists.infradead.org/pipermail/linux-arm-kernel/2015-December/393721.html

This was deemed to be far too fragile, so we didn't merge it in the end.

If this is to work reliably, then we need support from the tools. This was
raised when we first merged support for ftrace, but I'm not sure it went
anywhere:

https://gcc.gnu.org/ml/gcc/2016-01/msg00035.html

So, I completely agree with Steve that we shouldn't be littering the core code
with #ifdef CONFIG_ARM64, but we probably do need something in the arch backend
if we're going to do this properly, and that in turn is likely to need a very
different algorithm from the one currently in use.

Will

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
  2019-08-05 11:25     ` Will Deacon
@ 2019-08-05 13:59       ` Steven Rostedt
  2019-08-06 13:00       ` Steven Rostedt
  2019-08-06 14:47       ` Joel Fernandes
  2 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2019-08-05 13:59 UTC (permalink / raw)
  To: Will Deacon
  Cc: Joel Fernandes, Jiping Ma, catalin.marinas, will.deacon,
	linux-kernel, mingo, kernel-team, linux-arm-kernel,
	takahiro.akashi

On Mon, 5 Aug 2019 12:25:25 +0100
Will Deacon <will@kernel.org> wrote:

> So, I completely agree with Steve that we shouldn't be littering the core code
> with #ifdef CONFIG_ARM64, but we probably do need something in the arch backend
> if we're going to do this properly, and that in turn is likely to need a very
> different algorithm from the one currently in use.

And I'm perfectly fine if someone wants to tear up the current
implementation, and make one that has callbacks to arch code (with
defaults for those that don't need it).

But it has to be clean. We worked hard to get rid of most "#ifdef"
being scattered in the kernel, and I'm not bringing it back in my code.

-- Steve

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
  2019-08-02 16:09   ` Steven Rostedt
  2019-08-02 16:11     ` Steven Rostedt
@ 2019-08-06  6:01     ` Jiping Ma
  1 sibling, 0 replies; 15+ messages in thread
From: Jiping Ma @ 2019-08-06  6:01 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: catalin.marinas, will.deacon, linux-kernel, mingo, joel,
	linux-arm-kernel



On 2019年08月03日 00:09, Steven Rostedt wrote:
> On Fri, 2 Aug 2019 11:22:59 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
>> I think you are not explaining the issue correctly. From looking at the
>> document, I think what you want to say is that the LR is saved *after*
>> the data for the function. Is that correct? If so, then yes, it would
>> cause the stack tracing algorithm to be incorrect.
>>
> [..]
>
>> Can someone confirm that this is the real issue?
> Does this patch fix your issue?
Yes, it does.

-- Jiping
>
> -- Steve
>
> diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
> index 5ab5200b2bdc..13a4832cfb00 100644
> --- a/arch/arm64/include/asm/ftrace.h
> +++ b/arch/arm64/include/asm/ftrace.h
> @@ -13,6 +13,7 @@
>   #define HAVE_FUNCTION_GRAPH_FP_TEST
>   #define MCOUNT_ADDR		((unsigned long)_mcount)
>   #define MCOUNT_INSN_SIZE	AARCH64_INSN_SIZE
> +#define ARCH_RET_ADDR_AFTER_LOCAL_VARS 1
>   
>   #ifndef __ASSEMBLY__
>   #include <linux/compat.h>
> diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
> index 5d16f73898db..050c6bd9beac 100644
> --- a/kernel/trace/trace_stack.c
> +++ b/kernel/trace/trace_stack.c
> @@ -158,6 +158,18 @@ static void check_stack(unsigned long ip, unsigned long *stack)
>   			i++;
>   	}
>   
> +#ifdef ARCH_RET_ADDR_AFTER_LOCAL_VARS
> +	/*
> +	 * Most archs store the return address before storing the
> +	 * function's local variables. But some archs do this backwards.
> +	 */
> +	if (x > 1) {
> +		memmove(&stack_trace_index[0], &stack_trace_index[1],
> +			sizeof(stack_trace_index[0]) * (x - 1));
> +		x--;
> +	}
> +#endif
> +
>   	stack_trace_nr_entries = x;
>   
>   	if (task_stack_end_corrupted(current)) {
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
  2019-08-05 11:25     ` Will Deacon
  2019-08-05 13:59       ` Steven Rostedt
@ 2019-08-06 13:00       ` Steven Rostedt
  2019-08-06 14:47       ` Joel Fernandes
  2 siblings, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2019-08-06 13:00 UTC (permalink / raw)
  To: Will Deacon
  Cc: Joel Fernandes, Jiping Ma, catalin.marinas, will.deacon,
	linux-kernel, mingo, kernel-team, linux-arm-kernel,
	takahiro.akashi

On Mon, 5 Aug 2019 12:25:25 +0100
Will Deacon <will@kernel.org> wrote:

> This can be read as "subtract 144 bytes (32*4 + 16) from the stack pointer,
> write the frame record there and then update the stack pointer to point at the
> bottom of the newly allocated stack", which means that the array 'a[32]' sits
> directly /above/ the frame record on the stack. However, this is just what my
> GCC happened to do today. When we looked at this back in 2015, there were other
> cases we ended up having to identify with heuristics based on what had been
> observed under various compilers:
> 
> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-December/393721.html

That's a bit more involved than what I came up with.

> 
> This was deemed to be far too fragile, so we didn't merge it in the end.
> 
> If this is to work reliably, then we need support from the tools. This was
> raised when we first merged support for ftrace, but I'm not sure it went
> anywhere:
> 
> https://gcc.gnu.org/ml/gcc/2016-01/msg00035.html
> 
> So, I completely agree with Steve that we shouldn't be littering the core code
> with #ifdef CONFIG_ARM64, but we probably do need something in the arch backend
> if we're going to do this properly, and that in turn is likely to need a very
> different algorithm from the one currently in use.

So basically it seems that on arm64, gcc only saves the lr when needed
(leaf functions don't need it). And I can even see that if you have
several paths that don't call other functions, why save the lr?

It does seem that doing the slight change makes the current code more
accurate without need for any heuristics.

Here's my patch again, slightly tweaked and Jiping said it solved the
issue for him.

Are you OK with this change?

-- Steve

diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
index 5ab5200b2bdc..13a4832cfb00 100644
--- a/arch/arm64/include/asm/ftrace.h
+++ b/arch/arm64/include/asm/ftrace.h
@@ -13,6 +13,7 @@
 #define HAVE_FUNCTION_GRAPH_FP_TEST
 #define MCOUNT_ADDR		((unsigned long)_mcount)
 #define MCOUNT_INSN_SIZE	AARCH64_INSN_SIZE
+#define ARCH_RET_ADDR_AFTER_LOCAL_VARS 1
 
 #ifndef __ASSEMBLY__
 #include <linux/compat.h>
diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index 5d16f73898db..40e4a88eea8f 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -158,6 +158,20 @@ static void check_stack(unsigned long ip, unsigned long *stack)
 			i++;
 	}
 
+#ifdef ARCH_RET_ADDR_AFTER_LOCAL_VARS
+	/*
+	 * Some archs will store the link register before calling
+	 * nested functions. This means the saved return address
+	 * comes after the local storage, and we need to shift
+	 * for that.
+	 */
+	if (x > 1) {
+		memmove(&stack_trace_index[0], &stack_trace_index[1],
+			sizeof(stack_trace_index[0]) * (x - 1));
+		x--;
+	}
+#endif
+
 	stack_trace_nr_entries = x;
 
 	if (task_stack_end_corrupted(current)) {

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
  2019-08-05 11:25     ` Will Deacon
  2019-08-05 13:59       ` Steven Rostedt
  2019-08-06 13:00       ` Steven Rostedt
@ 2019-08-06 14:47       ` Joel Fernandes
  2 siblings, 0 replies; 15+ messages in thread
From: Joel Fernandes @ 2019-08-06 14:47 UTC (permalink / raw)
  To: Will Deacon
  Cc: Steven Rostedt, Jiping Ma, catalin.marinas, will.deacon,
	linux-kernel, mingo, kernel-team, linux-arm-kernel,
	takahiro.akashi

On Mon, Aug 05, 2019 at 12:25:25PM +0100, Will Deacon wrote:
> [+Akashi, since he may remember more of the gory details here]
> 
> On Sat, Aug 03, 2019 at 04:26:42AM -0400, Joel Fernandes wrote:
> > On Fri, Aug 02, 2019 at 11:22:59AM -0400, Steven Rostedt wrote:
> > [snip]
> > > > There is not PC in ARM64 stack, LR is used to for walk_stackframe in
> > > > ARM64. Tere is no the issue in ARM32 because there is PC in ARM32 stack.
> > > > PC is used to calculate the stack size in trace_stack.c, so the
> > > > function name and its stack size appear to be off-by-one.
> > > > ARM64 stack layout:
> > > > 	LR
> > > >         FP
> > > >         ......
> > > >         LR
> > > >         FP
> > > >         ......
> > > 
> > > I think you are not explaining the issue correctly. From looking at the
> > > document, I think what you want to say is that the LR is saved *after*
> > > the data for the function. Is that correct? If so, then yes, it would
> > > cause the stack tracing algorithm to be incorrect.
> > > 
> > > Most archs do this:
> > > 
> > > On entry to a function:
> > > 
> > > 	save return address
> > >  	reserve local variables and such for current function
> > > 
> > > I think you are saying that arm64 does this backwards.
> > > 
> > > 	reserve local variables and such for current function
> > > 	save return address (LR)
> > 
> > Actually for arm64 it is like what you said about 'Most archs'. It saves FP
> > and LR first onto the current stack frame, then assigns the top of the stack
> > to FP (marking the new frame). Then executes branch-link, and then allocates
> > space to variables on stack in the callee.
> > 
> > Disassembly of prog:
> > 
> > int foo(int x, int y) {
> > 	int a[32];
> >         return (x + y + a[31]);
> > }
> > 
> > int bar(void) {
> >         foo(1, 2);
> > }
> > 
> > confirms it:
> > 
> > 000000000000073c <bar>:
> >  73c:   a9bf7bfd        stp     x29, x30, [sp, #-16]! <-- save FP and LR
> >  740:   910003fd        mov     x29, sp		      <-- create new FP 
> >  744:   52800041        mov     w1, #0x2              <-- pass arguments
> >  748:   52800020        mov     w0, #0x1 
> >  74c:   97fffff4        bl      71c <foo>             <-- branch (sets LR)
> >  750:   d503201f        nop
> >  754:   a8c17bfd        ldp     x29, x30, [sp], #16   <-- restore FP, LR
> >  758:   d65f03c0        ret
> > 
> > 000000000000071c <foo>:
> >  71c:   d10243ff        sub     sp, sp, #0x90        <-- space for local var
> >  720:   b9000fe0        str     w0, [sp, #12]
> >  724:   b9000be1        str     w1, [sp, #8]
> >  728:   b9400fe1        ldr     w1, [sp, #12]
> >  72c:   b9400be0        ldr     w0, [sp, #8]
> >  730:   0b000021        add     w1, w1, w0
> >  734:   b9408fe0        ldr     w0, [sp, #140]
> >  738:   0b000020        add     w0, w1, w0
> >  73c:   910243ff        add     sp, sp, #0x90      <-- restore sp before ret
> >  740:   d65f03c0        ret
> 
> I think we need to untangle things a bit here.
> 
> The arm64 PCS makes no guarantee about the position of the frame record with
> respect to stack allocations, so relying on this is fragile at best. This is
> partly why the ftrace-with-regs work currently relies on
> -fpatchable-function-entry, since that allows the very beginning of the
> function to be intercepted which I don't think is necessarily the case with
> -pg/_mcount.
> 
> For the snippet above, foo is a leaf function and does not have a frame record
> placed on the stack. If we instead look at something like:
> 
> __attribute__((noinline)) int baz(int x)
> {
> 	return x;
> }
> 
> __attribute__((noinline)) int bar(int x)
> {
> 	int a[32];
> 	return baz(x * a[0]);
> }
> 
> int foo(int x)
> {
> 	return bar(x);
> }
> 
> then the first instruction of bar() allocates stack space and pushes
> the frame record at the same time:
> 
> 	stp	x29, x30, [sp, -144]!
> 
> This can be read as "subtract 144 bytes (32*4 + 16) from the stack pointer,
> write the frame record there and then update the stack pointer to point at the
> bottom of the newly allocated stack", which means that the array 'a[32]' sits
> directly /above/ the frame record on the stack. However, this is just what my
> GCC happened to do today. When we looked at this back in 2015, there were other
> cases we ended up having to identify with heuristics based on what had been
> observed under various compilers:

Right point taken, I was looking at bar()'s stack frame instead of foo()'s.
foo() did not have to save any frame record, so probably my example was
confusing. My bad.

So indeed, it seems an odd ball case.


> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-December/393721.html
> 
> This was deemed to be far too fragile, so we didn't merge it in the end.
> 
> If this is to work reliably, then we need support from the tools. This was
> raised when we first merged support for ftrace, but I'm not sure it went
> anywhere:
> 
> https://gcc.gnu.org/ml/gcc/2016-01/msg00035.html


> So, I completely agree with Steve that we shouldn't be littering the core code
> with #ifdef CONFIG_ARM64, but we probably do need something in the arch backend
> if we're going to do this properly, and that in turn is likely to need a very
> different algorithm from the one currently in use.

Yeah, I am in line with that thought as well. I am reviewing the latest patch
from Steve now to understand it better.


thanks,

 - Joel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
  2019-08-02 16:11     ` Steven Rostedt
@ 2019-08-06 15:48       ` Joel Fernandes
  2019-08-06 16:34         ` Steven Rostedt
  0 siblings, 1 reply; 15+ messages in thread
From: Joel Fernandes @ 2019-08-06 15:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiping Ma, mingo, catalin.marinas, will.deacon, linux-kernel,
	linux-arm-kernel

On Fri, Aug 02, 2019 at 12:11:24PM -0400, Steven Rostedt wrote:
> On Fri, 2 Aug 2019 12:09:20 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Fri, 2 Aug 2019 11:22:59 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > I think you are not explaining the issue correctly. From looking at the
> > > document, I think what you want to say is that the LR is saved *after*
> > > the data for the function. Is that correct? If so, then yes, it would
> > > cause the stack tracing algorithm to be incorrect.
> > > 
> > 
> > [..]
> > 
> > > Can someone confirm that this is the real issue?
> > 
> > Does this patch fix your issue?
> >
> 
> Bah, I hit "attach" instead of "insert" (I wondered why it didn't
> insert). Here's the patch without the attachment.
> 
> -- Steve
> 
> diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
> index 5ab5200b2bdc..13a4832cfb00 100644
> --- a/arch/arm64/include/asm/ftrace.h
> +++ b/arch/arm64/include/asm/ftrace.h
> @@ -13,6 +13,7 @@
>  #define HAVE_FUNCTION_GRAPH_FP_TEST
>  #define MCOUNT_ADDR		((unsigned long)_mcount)
>  #define MCOUNT_INSN_SIZE	AARCH64_INSN_SIZE
> +#define ARCH_RET_ADDR_AFTER_LOCAL_VARS 1
>  
>  #ifndef __ASSEMBLY__
>  #include <linux/compat.h>
> diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
> index 5d16f73898db..050c6bd9beac 100644
> --- a/kernel/trace/trace_stack.c
> +++ b/kernel/trace/trace_stack.c
> @@ -158,6 +158,18 @@ static void check_stack(unsigned long ip, unsigned long *stack)
>  			i++;
>  	}
>  
> +#ifdef ARCH_RET_ADDR_AFTER_LOCAL_VARS
> +	/*
> +	 * Most archs store the return address before storing the
> +	 * function's local variables. But some archs do this backwards.
> +	 */
> +	if (x > 1) {
> +		memmove(&stack_trace_index[0], &stack_trace_index[1],
> +			sizeof(stack_trace_index[0]) * (x - 1));
> +		x--;
> +	}
> +#endif
> +
>  	stack_trace_nr_entries = x;
>  
>  	if (task_stack_end_corrupted(current)) {


I am not fully understanding the fix :(. If the positions of the data and
FP/LR are swapped, then there should be a loop of some sort where the FP/LR
are copied repeatedly to undo the mess we are discussing. But in this patch
I see only one copy happening. May be I just don't understand this code well
enough. Are there any more clues for helping understand the fix?

Also, this stack trace loop (original code) is a bit hairy :) It appears
there is a call to stack_trace_save() followed by another loop that goes
through the returned entries from there and tries to generate a set of
indexes. Isn't the real issue that the entries returned by stack_trace_save()
are a out of whack? I am curious also if other users of stack_trace_save()
will suffer from the same issue.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
  2019-08-06 15:48       ` Joel Fernandes
@ 2019-08-06 16:34         ` Steven Rostedt
  2019-08-06 17:03           ` Steven Rostedt
  2019-08-06 17:25           ` Joel Fernandes
  0 siblings, 2 replies; 15+ messages in thread
From: Steven Rostedt @ 2019-08-06 16:34 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Jiping Ma, mingo, catalin.marinas, will.deacon, linux-kernel,
	linux-arm-kernel

On Tue, 6 Aug 2019 11:48:11 -0400
Joel Fernandes <joel@joelfernandes.org> wrote:


> > diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
> > index 5ab5200b2bdc..13a4832cfb00 100644
> > --- a/arch/arm64/include/asm/ftrace.h
> > +++ b/arch/arm64/include/asm/ftrace.h
> > @@ -13,6 +13,7 @@
> >  #define HAVE_FUNCTION_GRAPH_FP_TEST
> >  #define MCOUNT_ADDR		((unsigned long)_mcount)
> >  #define MCOUNT_INSN_SIZE	AARCH64_INSN_SIZE
> > +#define ARCH_RET_ADDR_AFTER_LOCAL_VARS 1
> >  
> >  #ifndef __ASSEMBLY__
> >  #include <linux/compat.h>
> > diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
> > index 5d16f73898db..050c6bd9beac 100644
> > --- a/kernel/trace/trace_stack.c
> > +++ b/kernel/trace/trace_stack.c
> > @@ -158,6 +158,18 @@ static void check_stack(unsigned long ip, unsigned long *stack)
> >  			i++;
> >  	}
> >  
> > +#ifdef ARCH_RET_ADDR_AFTER_LOCAL_VARS
> > +	/*
> > +	 * Most archs store the return address before storing the
> > +	 * function's local variables. But some archs do this backwards.
> > +	 */
> > +	if (x > 1) {
> > +		memmove(&stack_trace_index[0], &stack_trace_index[1],
> > +			sizeof(stack_trace_index[0]) * (x - 1));
> > +		x--;
> > +	}
> > +#endif
> > +
> >  	stack_trace_nr_entries = x;
> >  
> >  	if (task_stack_end_corrupted(current)) {  
> 
> 
> I am not fully understanding the fix :(. If the positions of the data and
> FP/LR are swapped, then there should be a loop of some sort where the FP/LR
> are copied repeatedly to undo the mess we are discussing. But in this patch
> I see only one copy happening. May be I just don't understand this code well
> enough. Are there any more clues for helping understand the fix?

Here's the best way to explain this. The code is using the stack trace
to figure out which function is the stack hog. Or perhaps a serious of
stack hogs. On x86, a call stores the return address as it calls the
next function. Then that function allocates its stack frame for its
local variables and saving of registers.

on x86:

[ top of stack ]
 0: sys call entry frame
10: return addr to entry code
11: start of sys_foo frame
20: return addr to sys_foo
21: start of kernel_func_bar frame
30: return addr to kernel_func_bar
31: [ do trace stack here ]


Then we do a save_stack_trace which returns the addresses of the
functions it finds. Which would be (from the bottom of the stack to the
top)

  return addr to kernel_func_bar
  return addr to sys_foo
  return addr to entry code

What we do here is try to figure out how much stack each of theses
functions have. So we loop through the stack looking for the addresses
returned by the save_stack trace, and see where on the stack this is.
This gives us:

  return addr to kernel_func_bar [ 30 ]
  return addr to sys_foo         [ 20 ]
  return addr to entry frame     [ 10 ]

From this, we can conclude (on x86) that the size of the stack used for
kernel_func_bar is 30 - 20 = 10. Because on the stack we have:

20: return addr to sys_foo
21: start of kernel_func_bar frame  <<-- kernel_func_bar stack frame
30: return addr to kernel_func_bar


Now, what Jiping reported, is that on arm64, it saves the link register
(the return address) when it is needed to, which is after the stack
frame for the current function has been saved. That means we have
something that looks like this:

on arm64:

[ top of stack ]
 0: sys call entry frame
10: start of sys_foo_frame
19: return addr to entry code << lr saved before calling kern_func_bar
20: start of kernel_func_bar frame
29: return addr to sys_foo_frame << lr saved before calling next function
30: [ do trace stack here ]

Now, I have a question. To call the mcount code (ftrace and the stack
tracing), you need to save the return address of kern_func_bar
somewhere, otherwise the call to mcount will overwrite the lr. But
let's say it does and then forgets it, so we have:

30: return addr of kernel_func_bar frame
31: [ do trace stack here ]

Now save_stack_trace gives us the same result:

 return addr to kernel_func bar
 return addr to sys_foo
 return addr to entry frame

But we get a different result when finding them in the location of the
stack.

 return addr to kernel_func_bar [ 30 ]
 return addr to sys foo         [ 29 ]
 return addr to entry frame     [ 19 ]

The simple subtractions will be off:

kernel_func_bar stack size = 30 - 29 = 1
Or even, sys_foo 29 - 19 = 10, but if we look at the stack:

10: start of sys_foo_frame
19: return addr to entry_code
20: start of kernel_func_bar frame
29: return addr to sys_foo

We are measuring the kernel_func_bar frame for sys_foo!

We are off by one here.

stack_trace_index[] is an array of the offsets mapping to the function
return addresses found. If we shift it by one, then we then sync the
functions found with their frames:

stack_trace_index[0] = 30
stack_trace_index[1] = 29
stack_trace_index[2] = 19

		memmove((&stack_trace_index[0], &stack_trace_index[1],
			sizeof(stack_trace_index[0]) * (x - 1));

Makes that:

stack_trace_index[0] = 29
stack_trace_index[1] = 19

And we do x-- to lose the last frame.

With the stack_dump_trace being:

stack_dump_trace[0] = return addr kernel_func_bar
stack_dump_trace[1] = return addr sys_foo

we then match which frame size belongs to which function better.


> 
> Also, this stack trace loop (original code) is a bit hairy :) It appears
> there is a call to stack_trace_save() followed by another loop that goes
> through the returned entries from there and tries to generate a set of
> indexes. Isn't the real issue that the entries returned by stack_trace_save()
> are a out of whack? I am curious also if other users of stack_trace_save()
> will suffer from the same issue.

No, the order is fine. The issue is that we are using the location of
the return address in the stack to find out what function has the
biggest stack usage, and our assumption for arm64 is incorrect in that
location.

-- Steve

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
  2019-08-06 16:34         ` Steven Rostedt
@ 2019-08-06 17:03           ` Steven Rostedt
  2019-08-06 17:25           ` Joel Fernandes
  1 sibling, 0 replies; 15+ messages in thread
From: Steven Rostedt @ 2019-08-06 17:03 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Jiping Ma, mingo, catalin.marinas, will.deacon, linux-kernel,
	linux-arm-kernel

On Tue, 6 Aug 2019 12:34:55 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Here's the best way to explain this. The code is using the stack trace
> to figure out which function is the stack hog. Or perhaps a serious of

 Why so serious? ....  s/serious/series/

-- Steve

> stack hogs. On x86, a call stores the return address as it calls the
> next function. Then that function allocates its stack frame for its
> local variables and saving of registers.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] tracing: Function stack size and its name mismatch in arm64
  2019-08-06 16:34         ` Steven Rostedt
  2019-08-06 17:03           ` Steven Rostedt
@ 2019-08-06 17:25           ` Joel Fernandes
  1 sibling, 0 replies; 15+ messages in thread
From: Joel Fernandes @ 2019-08-06 17:25 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jiping Ma, mingo, catalin.marinas, will.deacon, linux-kernel,
	linux-arm-kernel

On Tue, Aug 06, 2019 at 12:34:55PM -0400, Steven Rostedt wrote:
> On Tue, 6 Aug 2019 11:48:11 -0400
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> 
> > > diff --git a/arch/arm64/include/asm/ftrace.h b/arch/arm64/include/asm/ftrace.h
> > > index 5ab5200b2bdc..13a4832cfb00 100644
> > > --- a/arch/arm64/include/asm/ftrace.h
> > > +++ b/arch/arm64/include/asm/ftrace.h
> > > @@ -13,6 +13,7 @@
> > >  #define HAVE_FUNCTION_GRAPH_FP_TEST
> > >  #define MCOUNT_ADDR		((unsigned long)_mcount)
> > >  #define MCOUNT_INSN_SIZE	AARCH64_INSN_SIZE
> > > +#define ARCH_RET_ADDR_AFTER_LOCAL_VARS 1
> > >  
> > >  #ifndef __ASSEMBLY__
> > >  #include <linux/compat.h>
> > > diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
> > > index 5d16f73898db..050c6bd9beac 100644
> > > --- a/kernel/trace/trace_stack.c
> > > +++ b/kernel/trace/trace_stack.c
> > > @@ -158,6 +158,18 @@ static void check_stack(unsigned long ip, unsigned long *stack)
> > >  			i++;
> > >  	}
> > >  
> > > +#ifdef ARCH_RET_ADDR_AFTER_LOCAL_VARS
> > > +	/*
> > > +	 * Most archs store the return address before storing the
> > > +	 * function's local variables. But some archs do this backwards.
> > > +	 */
> > > +	if (x > 1) {
> > > +		memmove(&stack_trace_index[0], &stack_trace_index[1],
> > > +			sizeof(stack_trace_index[0]) * (x - 1));
> > > +		x--;
> > > +	}
> > > +#endif
> > > +
> > >  	stack_trace_nr_entries = x;
> > >  
> > >  	if (task_stack_end_corrupted(current)) {  
> > 
> > 
> > I am not fully understanding the fix :(. If the positions of the data and
> > FP/LR are swapped, then there should be a loop of some sort where the FP/LR
> > are copied repeatedly to undo the mess we are discussing. But in this patch
> > I see only one copy happening. May be I just don't understand this code well
> > enough. Are there any more clues for helping understand the fix?
> 
> Here's the best way to explain this. The code is using the stack trace
> to figure out which function is the stack hog. Or perhaps a serious of
> stack hogs. On x86, a call stores the return address as it calls the
> next function. Then that function allocates its stack frame for its
> local variables and saving of registers.

This makes perfect sense, (probably also makes sense to push this whole
explanation into either the changelog or the kernel documentation)

Thanks a lot, Steve!

 - Joel



> on x86:
> 
> [ top of stack ]
>  0: sys call entry frame
> 10: return addr to entry code
> 11: start of sys_foo frame
> 20: return addr to sys_foo
> 21: start of kernel_func_bar frame
> 30: return addr to kernel_func_bar
> 31: [ do trace stack here ]
> 
> 
> Then we do a save_stack_trace which returns the addresses of the
> functions it finds. Which would be (from the bottom of the stack to the
> top)
> 
>   return addr to kernel_func_bar
>   return addr to sys_foo
>   return addr to entry code
> 
> What we do here is try to figure out how much stack each of theses
> functions have. So we loop through the stack looking for the addresses
> returned by the save_stack trace, and see where on the stack this is.
> This gives us:
> 
>   return addr to kernel_func_bar [ 30 ]
>   return addr to sys_foo         [ 20 ]
>   return addr to entry frame     [ 10 ]
> 
> From this, we can conclude (on x86) that the size of the stack used for
> kernel_func_bar is 30 - 20 = 10. Because on the stack we have:
> 
> 20: return addr to sys_foo
> 21: start of kernel_func_bar frame  <<-- kernel_func_bar stack frame
> 30: return addr to kernel_func_bar
> 
> 
> Now, what Jiping reported, is that on arm64, it saves the link register
> (the return address) when it is needed to, which is after the stack
> frame for the current function has been saved. That means we have
> something that looks like this:
> 
> on arm64:
> 
> [ top of stack ]
>  0: sys call entry frame
> 10: start of sys_foo_frame
> 19: return addr to entry code << lr saved before calling kern_func_bar
> 20: start of kernel_func_bar frame
> 29: return addr to sys_foo_frame << lr saved before calling next function
> 30: [ do trace stack here ]
> 
> Now, I have a question. To call the mcount code (ftrace and the stack
> tracing), you need to save the return address of kern_func_bar
> somewhere, otherwise the call to mcount will overwrite the lr. But
> let's say it does and then forgets it, so we have:
> 
> 30: return addr of kernel_func_bar frame
> 31: [ do trace stack here ]
> 
> Now save_stack_trace gives us the same result:
> 
>  return addr to kernel_func bar
>  return addr to sys_foo
>  return addr to entry frame
> 
> But we get a different result when finding them in the location of the
> stack.
> 
>  return addr to kernel_func_bar [ 30 ]
>  return addr to sys foo         [ 29 ]
>  return addr to entry frame     [ 19 ]
> 
> The simple subtractions will be off:
> 
> kernel_func_bar stack size = 30 - 29 = 1
> Or even, sys_foo 29 - 19 = 10, but if we look at the stack:
> 
> 10: start of sys_foo_frame
> 19: return addr to entry_code
> 20: start of kernel_func_bar frame
> 29: return addr to sys_foo
> 
> We are measuring the kernel_func_bar frame for sys_foo!
> 
> We are off by one here.
> 
> stack_trace_index[] is an array of the offsets mapping to the function
> return addresses found. If we shift it by one, then we then sync the
> functions found with their frames:
> 
> stack_trace_index[0] = 30
> stack_trace_index[1] = 29
> stack_trace_index[2] = 19
> 
> 		memmove((&stack_trace_index[0], &stack_trace_index[1],
> 			sizeof(stack_trace_index[0]) * (x - 1));
> 
> Makes that:
> 
> stack_trace_index[0] = 29
> stack_trace_index[1] = 19
> 
> And we do x-- to lose the last frame.
> 
> With the stack_dump_trace being:
> 
> stack_dump_trace[0] = return addr kernel_func_bar
> stack_dump_trace[1] = return addr sys_foo
> 
> we then match which frame size belongs to which function better.
> 
> 
> > 
> > Also, this stack trace loop (original code) is a bit hairy :) It appears
> > there is a call to stack_trace_save() followed by another loop that goes
> > through the returned entries from there and tries to generate a set of
> > indexes. Isn't the real issue that the entries returned by stack_trace_save()
> > are a out of whack? I am curious also if other users of stack_trace_save()
> > will suffer from the same issue.
> 
> No, the order is fine. The issue is that we are using the location of
> the return address in the stack to find out what function has the
> biggest stack usage, and our assumption for arm64 is incorrect in that
> location.
> 
> -- Steve

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-08-06 17:25 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-02  9:41 [PATCH v3] tracing: Function stack size and its name mismatch in arm64 Jiping Ma
2019-08-02 15:22 ` Steven Rostedt
2019-08-02 16:09   ` Steven Rostedt
2019-08-02 16:11     ` Steven Rostedt
2019-08-06 15:48       ` Joel Fernandes
2019-08-06 16:34         ` Steven Rostedt
2019-08-06 17:03           ` Steven Rostedt
2019-08-06 17:25           ` Joel Fernandes
2019-08-06  6:01     ` Jiping Ma
2019-08-03  8:26   ` Joel Fernandes
2019-08-03  8:32     ` Joel Fernandes
2019-08-05 11:25     ` Will Deacon
2019-08-05 13:59       ` Steven Rostedt
2019-08-06 13:00       ` Steven Rostedt
2019-08-06 14:47       ` Joel Fernandes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).