[RFC,v3] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

Message ID 20190418160730.11901-1-kasong@redhat.com
State New, archived
Series
  • [RFC,v3] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

Commit Message

Kairui Song April 18, 2019, 4:07 p.m. UTC
Currently, the perf callchain doesn't work well when sampling from a
trace point with the ORC unwinder enabled and CONFIG_FRAME_POINTER
disabled. We get a useless in-kernel callchain like this:

perf  6429 [000]    22.498450:             kmem:mm_page_alloc: page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
    ffffffffbe23e32e __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
	7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
	5651468729c1 [unknown] (/usr/bin/perf)
	5651467ee82a main+0x69a (/usr/bin/perf)
	7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
    5541f689495641d7 [unknown] ([unknown])

The root cause is that within a trace point, perf tries to dump the
caller's registers, but without CONFIG_FRAME_POINTER we can't get the
caller's BP as the frame pointer, so the current frame pointer is
returned instead. This produces an invalid register combination which
confuses the unwinder and ends the stacktrace early.

So in such a case, just don't dump BP when doing a partial register
dump, and let the unwinder start directly when the registers can't
provide a valid unwinding start point. Use SP as the skip mark and
skip all the frames until we reach the frame we want.
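The SP-as-skip-mark idea above can be sketched in user space (this is a hypothetical illustration with made-up `struct frame` and `skip_to_mark` names, not the kernel's unwinder): frames whose stack pointer lies below the captured SP belong to the perf/tracepoint machinery and are discarded.

```c
#include <stdint.h>

/*
 * Hypothetical user-space illustration (not the kernel implementation):
 * with only regs->sp available, start from the current stack and skip
 * every frame whose stack pointer lies below the captured SP -- those
 * frames belong to the perf/tracepoint machinery itself. Stacks grow
 * down on x86, so caller frames have higher SP values.
 */
struct frame {
	uintptr_t sp;	/* stack pointer while this frame was active */
	uintptr_t ip;	/* return address recorded for the frame */
};

/* Return the index of the first frame at or above the SP skip mark. */
static int skip_to_mark(const struct frame *frames, int n, uintptr_t mark_sp)
{
	int i;

	for (i = 0; i < n; i++) {
		if (frames[i].sp >= mark_sp)
			return i;
	}
	return n;	/* mark never reached: every frame was below it */
}
```

The real unwinder does the equivalent by being started at `regs->sp` instead of the current frame.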

This makes the callchain show the full kernel-space stacktrace again:

perf  6503 [000]  1567.570191:             kmem:mm_page_alloc: page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
    ffffffffb523e2ae __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
    ffffffffb52383bd __get_free_pages+0xd (/lib/modules/5.1.0-rc3+/build/vmlinux)
    ffffffffb52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
    ffffffffb521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
    ffffffffb52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
    ffffffffb52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux)
    ffffffffb500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux)
    ffffffffb5a0008c entry_SYSCALL_64_after_hwframe+0x44 (/lib/modules/5.1.0-rc3+/build/vmlinux)
	7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
	55a22960d9c1 [unknown] (/usr/bin/perf)
	55a22958982a main+0x69a (/usr/bin/perf)
	7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
    5541f689495641d7 [unknown] ([unknown])

Signed-off-by: Kairui Song <kasong@redhat.com>
---

Update from V2:
  - Instead of looking at if BP is 0, use X86_EFLAGS_FIXED flag bit as
    the indicator of where the pt_regs is valid for unwinding. As
    suggested by Peter Zijlstra
  - Update some comments accordingly.

Update from V1:
  Get rid of a lot of unnecessary code and just don't dump an inaccurate
  BP, and use SP as the marker for the target frame.
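The V2 indicator described above can be illustrated with a minimal user-space sketch (the struct and helper names here are illustrative, not the kernel's `pt_regs` or its API):

```c
#include <stdbool.h>

/*
 * Bit 1 of EFLAGS is architecturally reserved and always reads as 1,
 * so any genuine flags snapshot has it set. A fake partial register
 * dump that writes flags = 0 is therefore distinguishable: a clear
 * bit 1 marks the regs as unsuitable as an unwinding start point.
 */
#define X86_EFLAGS_FIXED	0x0002UL	/* bit 1: always set in real EFLAGS */

struct regs_sketch {
	unsigned long ip;
	unsigned long sp;
	unsigned long flags;
};

static bool regs_valid_for_unwind(const struct regs_sketch *regs)
{
	return (regs->flags & X86_EFLAGS_FIXED) != 0;
}
```

With this test, a real snapshot (flags captured from hardware) passes, while a synthesized one with `flags = 0` falls back to the SP-based unwind path.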

 arch/x86/events/core.c            | 24 +++++++++++++++++++++---
 arch/x86/include/asm/perf_event.h |  5 +++++
 arch/x86/include/asm/stacktrace.h |  9 +++++++--
 include/linux/perf_event.h        |  6 +++---
 4 files changed, 36 insertions(+), 8 deletions(-)

Comments

Josh Poimboeuf April 19, 2019, 12:58 a.m. UTC | #1
On Thu, Apr 18, 2019 at 12:07:30PM -0400, Kairui Song wrote:
> Currently perf callchain doesn't work well when sampling from trace
> point, with ORC unwinder enabled and CONFIG_FRAME_POINTER disabled.
> We'll get useless in kernel callchain like this:
> 
> perf  6429 [000]    22.498450:             kmem:mm_page_alloc: page=0x176a17 pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
>     ffffffffbe23e32e __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
> 	7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
> 	5651468729c1 [unknown] (/usr/bin/perf)
> 	5651467ee82a main+0x69a (/usr/bin/perf)
> 	7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
>     5541f689495641d7 [unknown] ([unknown])
> 
> The root cause is within a trace point perf will try to dump the
> required caller's registers, but without CONFIG_FRAME_POINTER we
> can't get caller's BP as the frame pointer, so current frame pointer
> is returned instead. We get a invalid register combination which
> confuse the unwinder and end the stacktrace early.
> 
> So in such case just don't try dump BP when doing partial register
> dump. And just let the unwinder start directly when the register is
> incapable of presenting a unwinding start point. Use SP as the skip
> mark, skip all the frames until we meet the frame we want.
> 
> This make the callchain get the full kernel space stacktrace again:
> 
> perf  6503 [000]  1567.570191:             kmem:mm_page_alloc: page=0x16c904 pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
>     ffffffffb523e2ae __alloc_pages_nodemask+0x22e (/lib/modules/5.1.0-rc3+/build/vmlinux)
>     ffffffffb52383bd __get_free_pages+0xd (/lib/modules/5.1.0-rc3+/build/vmlinux)
>     ffffffffb52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
>     ffffffffb521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
>     ffffffffb52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
>     ffffffffb52ff027 __x64_sys_poll+0x37 (/lib/modules/5.1.0-rc3+/build/vmlinux)
>     ffffffffb500418b do_syscall_64+0x5b (/lib/modules/5.1.0-rc3+/build/vmlinux)
>     ffffffffb5a0008c entry_SYSCALL_64_after_hwframe+0x44 (/lib/modules/5.1.0-rc3+/build/vmlinux)
> 	7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
> 	55a22960d9c1 [unknown] (/usr/bin/perf)
> 	55a22958982a main+0x69a (/usr/bin/perf)
> 	7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
>     5541f689495641d7 [unknown] ([unknown])
> 
> Signed-off-by: Kairui Song <kasong@redhat.com>
> ---
> 
> Update from V2:
>   - Instead of looking at if BP is 0, use X86_EFLAGS_FIXED flag bit as
>     the indicator of where the pt_regs is valid for unwinding. As
>     suggested by Peter Zijlstra
>   - Update some comments accordingly.
> 
> Update from V1:
>   Get rid of a lot of unneccessary code and just don't dump a inaccurate
>   BP, and use SP as the marker for target frame.
> 
>  arch/x86/events/core.c            | 24 +++++++++++++++++++++---
>  arch/x86/include/asm/perf_event.h |  5 +++++
>  arch/x86/include/asm/stacktrace.h |  9 +++++++--
>  include/linux/perf_event.h        |  6 +++---
>  4 files changed, 36 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index e2b1447192a8..e181e195fe5d 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2355,6 +2355,18 @@ void arch_perf_update_userpage(struct perf_event *event,
>  	cyc2ns_read_end();
>  }
>  
> +static inline int
> +valid_unwinding_registers(struct pt_regs *regs)
> +{
> +	/*
> +	 * regs might be a fake one, it won't dump the flags reg,
> +	 * and without frame pointer, it won't have a valid BP.
> +	 */
> +	if (IS_ENABLED(CONFIG_FRAME_POINTER))
> +		return 1;
> +	return (regs->flags & PERF_EFLAGS_SNAP);
> +}
> +
>  void
>  perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
>  {
> @@ -2366,11 +2378,17 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
>  		return;
>  	}
>  
> -	if (perf_callchain_store(entry, regs->ip))
> +	if (valid_unwinding_registers(regs)) {
> +		if (perf_callchain_store(entry, regs->ip))
> +			return;
> +		unwind_start(&state, current, regs, NULL);
> +	} else if (regs->sp) {
> +		unwind_start(&state, current, NULL, (unsigned long *)regs->sp);
> +	} else {
>  		return;
> +	}
>  
> -	for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
> -	     unwind_next_frame(&state)) {
> +	for (; !unwind_done(&state); unwind_next_frame(&state)) {
>  		addr = unwind_get_return_address(&state);
>  		if (!addr || perf_callchain_store(entry, addr))
>  			return;
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index 8bdf74902293..77c8519512ff 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -239,11 +239,16 @@ extern void perf_events_lapic_init(void);
>   * Abuse bits {3,5} of the cpu eflags register. These flags are otherwise
>   * unused and ABI specified to be 0, so nobody should care what we do with
>   * them.
> + * And also leverage X86_EFLAGS_FIXED (bit 1). This bit is reserved as
> + * always set.
>   *
> + * SNAP  - flags is considered a real snapshot captured upon event
> + *         is triggered.
>   * EXACT - the IP points to the exact instruction that triggered the
>   *         event (HW bugs exempt).
>   * VM    - original X86_VM_MASK; see set_linear_ip().
>   */
> +#define PERF_EFLAGS_SNAP	X86_EFLAGS_FIXED
>  #define PERF_EFLAGS_EXACT	(1UL << 3)
>  #define PERF_EFLAGS_VM		(1UL << 5)
>  
> diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
> index f335aad404a4..226077e20412 100644
> --- a/arch/x86/include/asm/stacktrace.h
> +++ b/arch/x86/include/asm/stacktrace.h
> @@ -98,18 +98,23 @@ struct stack_frame_ia32 {
>      u32 return_address;
>  };
>  
> +#ifdef CONFIG_FRAME_POINTER
>  static inline unsigned long caller_frame_pointer(void)
>  {
>  	struct stack_frame *frame;
>  
>  	frame = __builtin_frame_address(0);
>  
> -#ifdef CONFIG_FRAME_POINTER
>  	frame = frame->next_frame;
> -#endif
>  
>  	return (unsigned long)frame;
>  }
> +#else
> +static inline unsigned long caller_frame_pointer(void)
> +{
> +	return 0;
> +}
> +#endif
>  
>  void show_opcodes(struct pt_regs *regs, const char *loglvl);
>  void show_ip(struct pt_regs *regs, const char *loglvl);
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index e47ef764f613..07bcb1870220 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1055,11 +1055,11 @@ static inline void perf_arch_fetch_caller_regs(struct pt_regs *regs, unsigned lo
>  #endif
>  
>  /*
> - * Take a snapshot of the regs. Skip ip and frame pointer to
> - * the nth caller. We only need a few of the regs:
> + * Take a partial snapshot of the caller's regs.
> + * We only need a few of the regs:
>   * - ip for PERF_SAMPLE_IP
>   * - cs for user_mode() tests
> - * - bp for callchains
> + * - sp, bp for callchains
>   * - eflags, for future purposes, just in case
>   */
>  static inline void perf_fetch_caller_regs(struct pt_regs *regs)

I still don't like using regs->bp because it results in different code
paths for FP and ORC.  In the FP case, the regs are treated like real
regs even though they're fake.

Something like the below would be much simpler.  Would this work?  I don't
know if any other code relies on the fake regs->bp or regs->sp.

(BTW, tomorrow is a US holiday so I may not be very responsive until
Monday.)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index de1a924a4914..f315425d8468 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2382,6 +2382,15 @@ void arch_perf_update_userpage(struct perf_event *event,
 	cyc2ns_read_end();
 }
 
+/*
+ * Determine whether the regs were taken from an irq/exception handler rather
+ * than from perf_arch_fetch_caller_regs().
+ */
+static bool perf_hw_regs(struct pt_regs *regs)
+{
+	return regs->flags & X86_EFLAGS_FIXED;
+}
+
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
@@ -2393,11 +2402,15 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
 		return;
 	}
 
-	if (perf_callchain_store(entry, regs->ip))
-		return;
+	if (perf_hw_regs(regs)) {
+		if (perf_callchain_store(entry, regs->ip))
+			return;
+		unwind_start(&state, current, regs, NULL);
+	} else {
+		unwind_start(&state, current, NULL, (void *)regs->sp);
+	}
 
-	for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
-	     unwind_next_frame(&state)) {
+	for (; !unwind_done(&state); unwind_next_frame(&state)) {
 		addr = unwind_get_return_address(&state);
 		if (!addr || perf_callchain_store(entry, addr))
 			return;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 04768a3a5454..1392d5e6e8d6 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -308,14 +308,9 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
  */
 #define perf_arch_fetch_caller_regs(regs, __ip)		{	\
 	(regs)->ip = (__ip);					\
-	(regs)->bp = caller_frame_pointer();			\
+	(regs)->sp = (unsigned long)__builtin_frame_address(0);	\
 	(regs)->cs = __KERNEL_CS;				\
 	regs->flags = 0;					\
-	asm volatile(						\
-		_ASM_MOV "%%"_ASM_SP ", %0\n"			\
-		: "=m" ((regs)->sp)				\
-		:: "memory"					\
-	);							\
 }
 
 struct perf_guest_switch_msr {
diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index d6d758a187b6..a8d0cdf48616 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -100,19 +100,6 @@ struct stack_frame_ia32 {
     u32 return_address;
 };
 
-static inline unsigned long caller_frame_pointer(void)
-{
-	struct stack_frame *frame;
-
-	frame = __builtin_frame_address(0);
-
-#ifdef CONFIG_FRAME_POINTER
-	frame = frame->next_frame;
-#endif
-
-	return (unsigned long)frame;
-}
-
 void show_opcodes(struct pt_regs *regs, const char *loglvl);
 void show_ip(struct pt_regs *regs, const char *loglvl);
 #endif /* _ASM_X86_STACKTRACE_H */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f3864e1c5569..0f560069aeec 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1062,7 +1062,7 @@ static inline void perf_arch_fetch_caller_regs(struct pt_regs *regs, unsigned lo
  * the nth caller. We only need a few of the regs:
  * - ip for PERF_SAMPLE_IP
  * - cs for user_mode() tests
- * - bp for callchains
+ * - sp for callchains
  * - eflags, for future purposes, just in case
  */
 static inline void perf_fetch_caller_regs(struct pt_regs *regs)
Kairui Song April 19, 2019, 2:17 a.m. UTC | #2
On Fri, Apr 19, 2019 at 8:58 AM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> I still don't like using regs->bp because it results in different code
> paths for FP and ORC.  In the FP case, the regs are treated like real
> regs even though they're fake.
>
> Something like the below would be much simpler.  Would this work?  I don't
> know if any other code relies on the fake regs->bp or regs->sp.

Works perfectly. My only concern is that the FP path used to work very
well; I'm not sure it's a good idea to change it, and this may bring
some extra overhead for the FP path.

>
> (BTW, tomorrow is a US holiday so I may not be very responsive until
> Monday.)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index de1a924a4914..f315425d8468 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2382,6 +2382,15 @@ void arch_perf_update_userpage(struct perf_event *event,
>         cyc2ns_read_end();
>  }
>
> +/*
> + * Determine whether the regs were taken from an irq/exception handler rather
> + * than from perf_arch_fetch_caller_regs().
> + */
> +static bool perf_hw_regs(struct pt_regs *regs)
> +{
> +       return regs->flags & X86_EFLAGS_FIXED;
> +}
> +
>  void
>  perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
>  {
> @@ -2393,11 +2402,15 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
>                 return;
>         }
>
> -       if (perf_callchain_store(entry, regs->ip))
> -               return;
> +       if (perf_hw_regs(regs)) {
> +               if (perf_callchain_store(entry, regs->ip))
> +                       return;
> +               unwind_start(&state, current, regs, NULL);
> +       } else {
> +               unwind_start(&state, current, NULL, (void *)regs->sp);
> +       }
>
> -       for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
> -            unwind_next_frame(&state)) {
> +       for (; !unwind_done(&state); unwind_next_frame(&state)) {
>                 addr = unwind_get_return_address(&state);
>                 if (!addr || perf_callchain_store(entry, addr))
>                         return;
> diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
> index 04768a3a5454..1392d5e6e8d6 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -308,14 +308,9 @@ extern unsigned long perf_misc_flags(struct pt_regs *regs);
>   */
>  #define perf_arch_fetch_caller_regs(regs, __ip)                {       \
>         (regs)->ip = (__ip);                                    \
> -       (regs)->bp = caller_frame_pointer();                    \
> +       (regs)->sp = (unsigned long)__builtin_frame_address(0); \
>         (regs)->cs = __KERNEL_CS;                               \
>         regs->flags = 0;                                        \
> -       asm volatile(                                           \
> -               _ASM_MOV "%%"_ASM_SP ", %0\n"                   \
> -               : "=m" ((regs)->sp)                             \
> -               :: "memory"                                     \
> -       );                                                      \
>  }
>
>  struct perf_guest_switch_msr {
> diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
> index d6d758a187b6..a8d0cdf48616 100644
> --- a/arch/x86/include/asm/stacktrace.h
> +++ b/arch/x86/include/asm/stacktrace.h
> @@ -100,19 +100,6 @@ struct stack_frame_ia32 {
>      u32 return_address;
>  };
>
> -static inline unsigned long caller_frame_pointer(void)
> -{
> -       struct stack_frame *frame;
> -
> -       frame = __builtin_frame_address(0);
> -
> -#ifdef CONFIG_FRAME_POINTER
> -       frame = frame->next_frame;
> -#endif
> -
> -       return (unsigned long)frame;
> -}
> -
>  void show_opcodes(struct pt_regs *regs, const char *loglvl);
>  void show_ip(struct pt_regs *regs, const char *loglvl);
>  #endif /* _ASM_X86_STACKTRACE_H */
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index f3864e1c5569..0f560069aeec 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1062,7 +1062,7 @@ static inline void perf_arch_fetch_caller_regs(struct pt_regs *regs, unsigned lo
>   * the nth caller. We only need a few of the regs:
>   * - ip for PERF_SAMPLE_IP
>   * - cs for user_mode() tests
> - * - bp for callchains
> + * - sp for callchains
>   * - eflags, for future purposes, just in case
>   */
>  static inline void perf_fetch_caller_regs(struct pt_regs *regs)
Peter Zijlstra April 19, 2019, 9:43 a.m. UTC | #3
On Fri, Apr 19, 2019 at 10:17:49AM +0800, Kairui Song wrote:
> On Fri, Apr 19, 2019 at 8:58 AM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >
> > I still don't like using regs->bp because it results in different code
> > paths for FP and ORC.  In the FP case, the regs are treated like real
> > regs even though they're fake.
> >
> > Something like the below would be much simpler.  Would this work?  I don't
> > know if any other code relies on the fake regs->bp or regs->sp.
> 
> Works perfectly. My only concern is that FP path used to work very
> well, not sure it's a good idea to change it, and this may bring some
> extra overhead for FP path.

Given Josh wrote all that code, I'm fairly sure it is still OK :-)

But also looking at the code in unwind_frame.c, __unwind_start() seems
to pretty much do what the removed caller_frame_pointer() did (when
.regs=NULL) but better.

> > +       if (perf_hw_regs(regs)) {
> > +               if (perf_callchain_store(entry, regs->ip))
> > +                       return;
> > +               unwind_start(&state, current, regs, NULL);
> > +       } else {
> > +               unwind_start(&state, current, NULL, (void *)regs->sp);
> > +       }

> >  #define perf_arch_fetch_caller_regs(regs, __ip)                {       \
> >         (regs)->ip = (__ip);                                    \
> > +       (regs)->sp = (unsigned long)__builtin_frame_address(0); \
> >         (regs)->cs = __KERNEL_CS;                               \
> >         regs->flags = 0;                                        \
> >  }
Kairui Song April 19, 2019, 11:47 a.m. UTC | #4
On Fri, Apr 19, 2019 at 5:43 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Apr 19, 2019 at 10:17:49AM +0800, Kairui Song wrote:
> > On Fri, Apr 19, 2019 at 8:58 AM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > >
> > > I still don't like using regs->bp because it results in different code
> > > paths for FP and ORC.  In the FP case, the regs are treated like real
> > > regs even though they're fake.
> > >
> > > Something like the below would be much simpler.  Would this work?  I don't
> > > know if any other code relies on the fake regs->bp or regs->sp.
> >
> > Works perfectly. My only concern is that FP path used to work very
> > well, not sure it's a good idea to change it, and this may bring some
> > extra overhead for FP path.
>
> Given Josh wrote all that code, I'm fairly sure it is still OK :-)
>
> But also looking at the code in unwind_frame.c, __unwind_start() seems
> to pretty much do what the removed caller_frame_pointer() did (when
> .regs=NULL) but better.
>

OK, with FP we will also need to do a few extra unwinding steps:
previously it started directly from the frame of the trace point, but
now it has to trace back to the trace point first.
If that's fine I could post another update (that will be pretty much
just a copy&paste of the code Josh posted :P , is this OK?)

--
Best Regards,
Kairui Song
Josh Poimboeuf April 19, 2019, 3:45 p.m. UTC | #5
On Fri, Apr 19, 2019 at 07:47:18PM +0800, Kairui Song wrote:
> On Fri, Apr 19, 2019 at 5:43 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Apr 19, 2019 at 10:17:49AM +0800, Kairui Song wrote:
> > > On Fri, Apr 19, 2019 at 8:58 AM Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > > >
> > > > I still don't like using regs->bp because it results in different code
> > > > paths for FP and ORC.  In the FP case, the regs are treated like real
> > > > regs even though they're fake.
> > > >
> > > > Something like the below would be much simpler.  Would this work?  I don't
> > > > know if any other code relies on the fake regs->bp or regs->sp.
> > >
> > > Works perfectly. My only concern is that FP path used to work very
> > > well, not sure it's a good idea to change it, and this may bring some
> > > extra overhead for FP path.
> >
> > Given Josh wrote all that code, I'm fairly sure it is still OK :-)
> >
> > But also looking at the code in unwind_frame.c, __unwind_start() seems
> > to pretty much do what the removed caller_frame_pointer() did (when
> > .regs=NULL) but better.
> >
> 
> OK, with FP we will also need to do a few more extra unwinding,
> previously it start directly from the frame of the trace point, now
> have to trace back to the trace point first.
> If that's fine I could post another update (that will be pretty much
> just copy&paste from the Josh's code he posted :P , is this OK?)

You're right that FP will need to unwind a few extra frames, but I doubt
that will make much of a difference performance-wise.  I prefer the
simpler approach.

If you use that patch for the next version then you can add

  Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>

Thanks.

Patch

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e2b1447192a8..e181e195fe5d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2355,6 +2355,18 @@  void arch_perf_update_userpage(struct perf_event *event,
 	cyc2ns_read_end();
 }
 
+static inline int
+valid_unwinding_registers(struct pt_regs *regs)
+{
+	/*
+	 * regs might be a fake one, it won't dump the flags reg,
+	 * and without frame pointer, it won't have a valid BP.
+	 */
+	if (IS_ENABLED(CONFIG_FRAME_POINTER))
+		return 1;
+	return (regs->flags & PERF_EFLAGS_SNAP);
+}
+
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs)
 {
@@ -2366,11 +2378,17 @@  perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
 		return;
 	}
 
-	if (perf_callchain_store(entry, regs->ip))
+	if (valid_unwinding_registers(regs)) {
+		if (perf_callchain_store(entry, regs->ip))
+			return;
+		unwind_start(&state, current, regs, NULL);
+	} else if (regs->sp) {
+		unwind_start(&state, current, NULL, (unsigned long *)regs->sp);
+	} else {
 		return;
+	}
 
-	for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
-	     unwind_next_frame(&state)) {
+	for (; !unwind_done(&state); unwind_next_frame(&state)) {
 		addr = unwind_get_return_address(&state);
 		if (!addr || perf_callchain_store(entry, addr))
 			return;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8bdf74902293..77c8519512ff 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -239,11 +239,16 @@  extern void perf_events_lapic_init(void);
  * Abuse bits {3,5} of the cpu eflags register. These flags are otherwise
  * unused and ABI specified to be 0, so nobody should care what we do with
  * them.
+ * And also leverage X86_EFLAGS_FIXED (bit 1). This bit is reserved as
+ * always set.
  *
+ * SNAP  - flags is considered a real snapshot captured upon event
+ *         is triggered.
  * EXACT - the IP points to the exact instruction that triggered the
  *         event (HW bugs exempt).
  * VM    - original X86_VM_MASK; see set_linear_ip().
  */
+#define PERF_EFLAGS_SNAP	X86_EFLAGS_FIXED
 #define PERF_EFLAGS_EXACT	(1UL << 3)
 #define PERF_EFLAGS_VM		(1UL << 5)
 
diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index f335aad404a4..226077e20412 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -98,18 +98,23 @@  struct stack_frame_ia32 {
     u32 return_address;
 };
 
+#ifdef CONFIG_FRAME_POINTER
 static inline unsigned long caller_frame_pointer(void)
 {
 	struct stack_frame *frame;
 
 	frame = __builtin_frame_address(0);
 
-#ifdef CONFIG_FRAME_POINTER
 	frame = frame->next_frame;
-#endif
 
 	return (unsigned long)frame;
 }
+#else
+static inline unsigned long caller_frame_pointer(void)
+{
+	return 0;
+}
+#endif
 
 void show_opcodes(struct pt_regs *regs, const char *loglvl);
 void show_ip(struct pt_regs *regs, const char *loglvl);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e47ef764f613..07bcb1870220 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1055,11 +1055,11 @@  static inline void perf_arch_fetch_caller_regs(struct pt_regs *regs, unsigned lo
 #endif
 
 /*
- * Take a snapshot of the regs. Skip ip and frame pointer to
- * the nth caller. We only need a few of the regs:
+ * Take a partial snapshot of the caller's regs.
+ * We only need a few of the regs:
  * - ip for PERF_SAMPLE_IP
  * - cs for user_mode() tests
- * - bp for callchains
+ * - sp, bp for callchains
  * - eflags, for future purposes, just in case
  */
 static inline void perf_fetch_caller_regs(struct pt_regs *regs)