* [BUG] Stack overflow when running perf and function tracer
@ 2020-10-30 4:27 Steven Rostedt
2020-10-30 9:00 ` Peter Zijlstra
0 siblings, 1 reply; 9+ messages in thread
From: Steven Rostedt @ 2020-10-30 4:27 UTC (permalink / raw)
To: LKML; +Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, kan.liang, like.xu
I found a bug in the recursion protection that prevented function
tracing from running in NMI context. Applying this fix to 5.9 worked
fine (tested by running perf record and function tracing at the same
time). But when I applied the patch to 5.10-rc1, it blew up with a
stack overflow:
perf: interrupt took too long (14611 > 13550), lowering kernel.perf_event_max_sample_rate to 13000
perf: interrupt took too long (27965 > 18263), lowering kernel.perf_event_max_sample_rate to 7000
INFO: NMI handler (perf_event_nmi_handler) took too long to run: 0.000 msecs
INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.087 msecs
INFO: NMI handler (perf_event_nmi_handler) took too long to run: 9.082 msecs
perf: interrupt took too long (27965 > 18263), lowering kernel.perf_event_max_sample_rate to 7000
traps: PANIC: double fault, error_code: 0x0
double fault: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 1655 Comm: perf Not tainted 5.9.0-rc1-test+ #685
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
RIP: 0010:__rb_reserve_next+0xb/0x450
Code: 4f 10 f0 ff 41 08 0f 0b 83 f8 02 0f 84 97 fb ff ff e9 a5 fc ff ff b8 04 00 00 00 eb e1 66 90 41 57 41 56 41 55 41 54 49 89 f4 <55> 48 89 fd 53 48 83 ec 08 48 8b 9f 98 00 00 00 48 89 5e 28 4c 8b
RSP: 0018:fffffe000003c000 EFLAGS: 00010046
RAX: 000000000000001c RBX: ffff928ada27b400 RCX: 0000000000000000
RDX: ffff928ada07b200 RSI: fffffe000003c028 RDI: ffff928ada27b400
RBP: ffff928ada27b4f0 R08: 0000000000000001 R09: 0000000000000000
R10: fffffe000003c440 R11: ffff928a7383cc60 R12: fffffe000003c028
R13: 00000000000003e8 R14: 0000000000000046 R15: 0000000000110001
FS: 00007f25d43cf780(0000) GS:ffff928adaa40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffe000003bff8 CR3: 00000000b52a8005 CR4: 00000000001707e0
Call Trace:
<NMI>
ring_buffer_lock_reserve+0x12c/0x380
? perf_output_begin+0x4d/0x2d0
? rcu_panic+0x20/0x20
trace_function+0x27/0x130
? rcu_panic+0x20/0x20
? perf_output_begin+0x4d/0x2d0
function_trace_call+0x133/0x180
0xffffffffc02a106a
? __rcu_read_lock+0x5/0x20
__rcu_read_lock+0x5/0x20
perf_output_begin+0x4d/0x2d0
? sched_clock+0x5/0x10
? trace_clock_local+0xc/0x20
? __rb_reserve_next+0x5c/0x450
? sched_clock+0x5/0x10
? ring_buffer_lock_reserve+0x12c/0x380
? sched_clock+0x5/0x10
? trace_clock_local+0xc/0x20
? __rb_reserve_next+0x5c/0x450
? ring_buffer_lock_reserve+0x12c/0x380
? perf_log_throttle+0x9a/0x120
? perf_log_throttle+0x9a/0x120
? perf_output_begin_backward+0x280/0x280
? trace_function+0x88/0x130
? ring_buffer_unlock_commit+0x25/0x140
? perf_output_begin_backward+0x280/0x280
? perf_log_throttle+0x9a/0x120
? function_trace_call+0x133/0x180
? 0xffffffffc02a106a
? perf_log_throttle+0x9a/0x120
? perf_output_begin+0x5/0x2d0
? __perf_event_header__init_id.isra.0+0x39/0xd0
? perf_output_begin+0x5/0x2d0
perf_log_throttle+0x9a/0x120
? trace_clock_local+0xc/0x20
? __rb_reserve_next+0x5c/0x450
? sched_clock+0x5/0x10
? ring_buffer_lock_reserve+0x12c/0x380
? sched_clock+0x5/0x10
? sched_clock+0x5/0x10
? sched_clock+0x5/0x10
? trace_clock_local+0xc/0x20
? __rb_reserve_next+0x5c/0x450
? sched_clock+0x5/0x10
? trace_clock_local+0xc/0x20
? __rb_reserve_next+0x5c/0x450
? setup_pebs_fixed_sample_data+0x365/0x3a0
? setup_pebs_fixed_sample_data+0x365/0x3a0
? intel_pmu_pebs_event_update_no_drain+0x70/0x70
? trace_function+0x88/0x130
? ring_buffer_unlock_commit+0x25/0x140
? intel_pmu_pebs_event_update_no_drain+0x70/0x70
? setup_pebs_fixed_sample_data+0x365/0x3a0
? function_trace_call+0x133/0x180
? intel_pmu_pebs_fixup_ip+0x32/0x1b0
? function_trace_call+0x133/0x180
? native_write_msr+0x6/0x20
__perf_event_account_interrupt+0xa9/0x120
__perf_event_overflow+0x2b/0xf0
__intel_pmu_pebs_event+0x2ec/0x3e0
? trace_function+0x88/0x130
? ring_buffer_unlock_commit+0x25/0x140
? setup_pebs_adaptive_sample_data+0x3e0/0x3e0
? 0xffffffffc02a106a
? 0xffffffffc02a106a
? generic_exec_single+0xa6/0xe0
? perf_duration_warn+0x20/0x20
? generic_exec_single+0xa6/0xe0
? native_write_msr+0x6/0x20
? ring_buffer_lock_reserve+0x1b9/0x380
? ring_buffer_lock_reserve+0x12c/0x380
? intel_pmu_drain_pebs_nhm+0x268/0x330
? intel_pmu_drain_pebs_nhm+0x268/0x330
? release_bts_buffer+0x60/0x60
? trace_function+0x88/0x130
? ring_buffer_unlock_commit+0x25/0x140
? release_bts_buffer+0x60/0x60
? intel_pmu_drain_pebs_nhm+0x268/0x330
? function_trace_call+0x133/0x180
? 0xffffffffc02a106a
? ring_buffer_lock_reserve+0x12c/0x380
intel_pmu_drain_pebs_nhm+0x268/0x330
? setup_pebs_adaptive_sample_data+0x3e0/0x3e0
handle_pmi_common+0xc2/0x2b0
? __rb_reserve_next+0x5c/0x450
? sched_clock+0x5/0x10
? trace_clock_local+0xc/0x20
? __rb_reserve_next+0x5c/0x450
? ring_buffer_lock_reserve+0x12c/0x380
? intel_pmu_handle_irq+0xc8/0x160
? intel_pmu_ha
Lost 135 message(s)!
---[ end trace ba215b7ba269800a ]---
RIP: 0010:__rb_reserve_next+0xb/0x450
Code: 4f 10 f0 ff 41 08 0f 0b 83 f8 02 0f 84 97 fb ff ff e9 a5 fc ff ff b8 04 00 00 00 eb e1 66 90 41 57 41 56 41 55 41 54 49 89 f4 <55> 48 89 fd 53 48 83 ec 08 48 8b 9f 98 00 00 00 48 89 5e 28 4c 8b
RSP: 0018:fffffe000003c000 EFLAGS: 00010046
RAX: 000000000000001c RBX: ffff928ada27b400 RCX: 0000000000000000
RDX: ffff928ada07b200 RSI: fffffe000003c028 RDI: ffff928ada27b400
RBP: ffff928ada27b4f0 R08: 0000000000000001 R09: 0000000000000000
R10: fffffe000003c440 R11: ffff928a7383cc60 R12: fffffe000003c028
R13: 00000000000003e8 R14: 0000000000000046 R15: 0000000000110001
FS: 00007f25d43cf780(0000) GS:ffff928adaa40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffe000003bff8 CR3: 00000000b52a8005 CR4: 00000000001707e0
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
I bisected it down to:
35d1ce6bec133679ff16325d335217f108b84871 ("perf/x86/intel/ds: Fix
x86_pmu_stop warning for large PEBS")
Which looks to be storing an awful lot on the stack:
static void __intel_pmu_pebs_event(struct perf_event *event,
				   struct pt_regs *iregs,
				   void *base, void *top,
				   int bit, int count,
				   void (*setup_sample)(struct perf_event *,
							struct pt_regs *,
							void *,
							struct perf_sample_data *,
							struct pt_regs *))
{
	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
	struct hw_perf_event *hwc = &event->hw;
	struct perf_sample_data data;
	struct x86_perf_regs perf_regs;
	struct pt_regs *regs = &perf_regs.regs;
	void *at = get_next_pebs_record_by_bit(base, top, bit);
	struct pt_regs dummy_iregs;
Reverting this patch from 5.10-rc1 makes the crash go away.
-- Steve
PS. here's the patch to fix nmi function tracing:
From c1c8a30be7c138a5c3519c12c8ef3b80288d5514 Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (VMware)" <rostedt@goodmis.org>
Date: Thu, 29 Oct 2020 17:31:45 -0400
Subject: [PATCH] ftrace: Fix recursion check for NMI test
The recursion-check code is designed to do the check only once when checks
nest: the top check does the work, and the nested checks see that recursion
was already checked and return zero for their "bit". On the return side,
nothing is done when the "bit" is zero.
The problem is that zero is also the "good" bit returned when in NMI context.
The clear path then does nothing for it, leaving the NMI bit set, making it
look like *all* NMI tracing is recursing, and preventing tracing of anything
in NMI context!
The simple fix is to return "bit + 1" from the set path and subtract one
again on the clear path to get back the real bit.
Cc: stable@vger.kernel.org
Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
kernel/trace/trace.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index f3f5e77123ad..fee535a89560 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -698,7 +698,7 @@ static __always_inline int trace_test_and_set_recursion(int start, int max)
current->trace_recursion = val;
barrier();
- return bit;
+ return bit + 1;
}
static __always_inline void trace_clear_recursion(int bit)
@@ -708,6 +708,7 @@ static __always_inline void trace_clear_recursion(int bit)
if (!bit)
return;
+ bit--;
bit = 1 << bit;
val &= ~bit;
--
2.25.4
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [BUG] Stack overflow when running perf and function tracer
2020-10-30 4:27 [BUG] Stack overflow when running perf and function tracer Steven Rostedt
@ 2020-10-30 9:00 ` Peter Zijlstra
2020-10-30 10:26 ` Thomas Gleixner
2020-10-30 11:19 ` [PATCH] perf/x86: Make dummy_iregs static Peter Zijlstra
0 siblings, 2 replies; 9+ messages in thread
From: Peter Zijlstra @ 2020-10-30 9:00 UTC (permalink / raw)
To: Steven Rostedt; +Cc: LKML, Thomas Gleixner, Ingo Molnar, kan.liang, like.xu
On Fri, Oct 30, 2020 at 12:27:22AM -0400, Steven Rostedt wrote:
> I found a bug in the recursion protection that prevented function
> tracing from running in NMI context. Applying this fix to 5.9 worked
> fine (tested by running perf record and function tracing at the same
> time). But when I applied the patch to 5.10-rc1, it blew up with a
> stack overflow:
So we just blew away our NMI stack, right?
> perf: interrupt took too long (14611 > 13550), lowering kernel.perf_event_max_sample_rate to 13000
> perf: interrupt took too long (27965 > 18263), lowering kernel.perf_event_max_sample_rate to 7000
> INFO: NMI handler (perf_event_nmi_handler) took too long to run: 0.000 msecs
> INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.087 msecs
> INFO: NMI handler (perf_event_nmi_handler) took too long to run: 9.082 msecs
> perf: interrupt took too long (27965 > 18263), lowering kernel.perf_event_max_sample_rate to 7000
> traps: PANIC: double fault, error_code: 0x0
> double fault: 0000 [#1] PREEMPT SMP PTI
> CPU: 1 PID: 1655 Comm: perf Not tainted 5.9.0-rc1-test+ #685
> Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
> RIP: 0010:__rb_reserve_next+0xb/0x450
> Code: 4f 10 f0 ff 41 08 0f 0b 83 f8 02 0f 84 97 fb ff ff e9 a5 fc ff ff b8 04 00 00 00 eb e1 66 90 41 57 41 56 41 55 41 54 49 89 f4 <55> 48 89 fd 53 48 83 ec 08 48 8b 9f 98 00 00 00 48 89 5e 28 4c 8b
> RSP: 0018:fffffe000003c000 EFLAGS: 00010046
> RAX: 000000000000001c RBX: ffff928ada27b400 RCX: 0000000000000000
> RDX: ffff928ada07b200 RSI: fffffe000003c028 RDI: ffff928ada27b400
> RBP: ffff928ada27b4f0 R08: 0000000000000001 R09: 0000000000000000
> R10: fffffe000003c440 R11: ffff928a7383cc60 R12: fffffe000003c028
> R13: 00000000000003e8 R14: 0000000000000046 R15: 0000000000110001
> FS: 00007f25d43cf780(0000) GS:ffff928adaa40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: fffffe000003bff8 CR3: 00000000b52a8005 CR4: 00000000001707e0
> Call Trace:
> <NMI>
> ring_buffer_lock_reserve+0x12c/0x380
> ? perf_output_begin+0x4d/0x2d0
> ? rcu_panic+0x20/0x20
> trace_function+0x27/0x130
> ? rcu_panic+0x20/0x20
> ? perf_output_begin+0x4d/0x2d0
> function_trace_call+0x133/0x180
> 0xffffffffc02a106a
> ? __rcu_read_lock+0x5/0x20
> __rcu_read_lock+0x5/0x20
> perf_output_begin+0x4d/0x2d0
> ? sched_clock+0x5/0x10
> ? trace_clock_local+0xc/0x20
> ? __rb_reserve_next+0x5c/0x450
> ? sched_clock+0x5/0x10
> ? ring_buffer_lock_reserve+0x12c/0x380
> ? sched_clock+0x5/0x10
> ? trace_clock_local+0xc/0x20
> ? __rb_reserve_next+0x5c/0x450
> ? ring_buffer_lock_reserve+0x12c/0x380
> ? perf_log_throttle+0x9a/0x120
> ? perf_log_throttle+0x9a/0x120
> ? perf_output_begin_backward+0x280/0x280
> ? trace_function+0x88/0x130
> ? ring_buffer_unlock_commit+0x25/0x140
> ? perf_output_begin_backward+0x280/0x280
> ? perf_log_throttle+0x9a/0x120
> ? function_trace_call+0x133/0x180
> ? 0xffffffffc02a106a
> ? perf_log_throttle+0x9a/0x120
> ? perf_output_begin+0x5/0x2d0
> ? __perf_event_header__init_id.isra.0+0x39/0xd0
> ? perf_output_begin+0x5/0x2d0
> perf_log_throttle+0x9a/0x120
> ? trace_clock_local+0xc/0x20
> ? __rb_reserve_next+0x5c/0x450
> ? sched_clock+0x5/0x10
> ? ring_buffer_lock_reserve+0x12c/0x380
> ? sched_clock+0x5/0x10
> ? sched_clock+0x5/0x10
> ? sched_clock+0x5/0x10
> ? trace_clock_local+0xc/0x20
> ? __rb_reserve_next+0x5c/0x450
> ? sched_clock+0x5/0x10
> ? trace_clock_local+0xc/0x20
> ? __rb_reserve_next+0x5c/0x450
> ? setup_pebs_fixed_sample_data+0x365/0x3a0
> ? setup_pebs_fixed_sample_data+0x365/0x3a0
> ? intel_pmu_pebs_event_update_no_drain+0x70/0x70
> ? trace_function+0x88/0x130
> ? ring_buffer_unlock_commit+0x25/0x140
> ? intel_pmu_pebs_event_update_no_drain+0x70/0x70
> ? setup_pebs_fixed_sample_data+0x365/0x3a0
> ? function_trace_call+0x133/0x180
> ? intel_pmu_pebs_fixup_ip+0x32/0x1b0
> ? function_trace_call+0x133/0x180
> ? native_write_msr+0x6/0x20
> __perf_event_account_interrupt+0xa9/0x120
> __perf_event_overflow+0x2b/0xf0
> __intel_pmu_pebs_event+0x2ec/0x3e0
> ? trace_function+0x88/0x130
> ? ring_buffer_unlock_commit+0x25/0x140
> ? setup_pebs_adaptive_sample_data+0x3e0/0x3e0
> ? 0xffffffffc02a106a
> ? 0xffffffffc02a106a
> ? generic_exec_single+0xa6/0xe0
> ? perf_duration_warn+0x20/0x20
> ? generic_exec_single+0xa6/0xe0
> ? native_write_msr+0x6/0x20
> ? ring_buffer_lock_reserve+0x1b9/0x380
> ? ring_buffer_lock_reserve+0x12c/0x380
> ? intel_pmu_drain_pebs_nhm+0x268/0x330
> ? intel_pmu_drain_pebs_nhm+0x268/0x330
> ? release_bts_buffer+0x60/0x60
> ? trace_function+0x88/0x130
> ? ring_buffer_unlock_commit+0x25/0x140
> ? release_bts_buffer+0x60/0x60
> ? intel_pmu_drain_pebs_nhm+0x268/0x330
> ? function_trace_call+0x133/0x180
> ? 0xffffffffc02a106a
> ? ring_buffer_lock_reserve+0x12c/0x380
> intel_pmu_drain_pebs_nhm+0x268/0x330
> ? setup_pebs_adaptive_sample_data+0x3e0/0x3e0
> handle_pmi_common+0xc2/0x2b0
> ? __rb_reserve_next+0x5c/0x450
> ? sched_clock+0x5/0x10
> ? trace_clock_local+0xc/0x20
> ? __rb_reserve_next+0x5c/0x450
> ? ring_buffer_lock_reserve+0x12c/0x380
> ? intel_pmu_handle_irq+0xc8/0x160
> ? intel_pmu_ha
> Lost 135 message(s)!
What kind of shit console is that?
Anyway, I can't seem to see any actual recursion happening, possibly due
to that losing 135 entries. Which seems to suggest we genuinely blew
away the NMI stack.
> ---[ end trace ba215b7ba269800a ]---
> RIP: 0010:__rb_reserve_next+0xb/0x450
> Code: 4f 10 f0 ff 41 08 0f 0b 83 f8 02 0f 84 97 fb ff ff e9 a5 fc ff ff b8 04 00 00 00 eb e1 66 90 41 57 41 56 41 55 41 54 49 89 f4 <55> 48 89 fd 53 48 83 ec 08 48 8b 9f 98 00 00 00 48 89 5e 28 4c 8b
> RSP: 0018:fffffe000003c000 EFLAGS: 00010046
> RAX: 000000000000001c RBX: ffff928ada27b400 RCX: 0000000000000000
> RDX: ffff928ada07b200 RSI: fffffe000003c028 RDI: ffff928ada27b400
> RBP: ffff928ada27b4f0 R08: 0000000000000001 R09: 0000000000000000
> R10: fffffe000003c440 R11: ffff928a7383cc60 R12: fffffe000003c028
> R13: 00000000000003e8 R14: 0000000000000046 R15: 0000000000110001
> FS: 00007f25d43cf780(0000) GS:ffff928adaa40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: fffffe000003bff8 CR3: 00000000b52a8005 CR4: 00000000001707e0
> Kernel panic - not syncing: Fatal exception in interrupt
> Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
>
> I bisected it down to:
>
> 35d1ce6bec133679ff16325d335217f108b84871 ("perf/x86/intel/ds: Fix
> x86_pmu_stop warning for large PEBS")
>
> Which looks to be storing an awful lot on the stack:
>
> static void __intel_pmu_pebs_event(struct perf_event *event,
> struct pt_regs *iregs,
> void *base, void *top,
> int bit, int count,
> void (*setup_sample)(struct perf_event *,
> struct pt_regs *,
> void *,
> struct perf_sample_data *,
> struct pt_regs *))
> {
> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> struct hw_perf_event *hwc = &event->hw;
> struct perf_sample_data data;
> struct x86_perf_regs perf_regs;
> struct pt_regs *regs = &perf_regs.regs;
> void *at = get_next_pebs_record_by_bit(base, top, bit);
> struct pt_regs dummy_iregs;
The only thing I can come up with in a hurry is that that dummy_iregs
thing really should be static. That's 168 bytes of stack out the window
right there.
Still, this seems to suggest (barring some actual issue hiding in those
135 lost lines) we're very close to the limit on the NMI stack, which is
a single 4k page IIRC.
* Re: [BUG] Stack overflow when running perf and function tracer
2020-10-30 9:00 ` Peter Zijlstra
@ 2020-10-30 10:26 ` Thomas Gleixner
2020-10-30 10:32 ` Peter Zijlstra
2020-10-30 10:59 ` [PATCH] perf/arch: Remove perf_sample_data::regs_user_copy Peter Zijlstra
2020-10-30 11:19 ` [PATCH] perf/x86: Make dummy_iregs static Peter Zijlstra
1 sibling, 2 replies; 9+ messages in thread
From: Thomas Gleixner @ 2020-10-30 10:26 UTC (permalink / raw)
To: Peter Zijlstra, Steven Rostedt; +Cc: LKML, Ingo Molnar, kan.liang, like.xu
On Fri, Oct 30 2020 at 10:00, Peter Zijlstra wrote:
> On Fri, Oct 30, 2020 at 12:27:22AM -0400, Steven Rostedt wrote:
>> I found a bug in the recursion protection that prevented function
>> tracing from running in NMI context. Applying this fix to 5.9 worked
>> fine (tested by running perf record and function tracing at the same
>> time). But when I applied the patch to 5.10-rc1, it blew up with a
>> stack overflow:
>
> So we just blew away our NMI stack, right?
Looks like that:
>> RSP: 0018:fffffe000003c000 EFLAGS: 00010046
Clearly a page boundary.
>> RAX: 000000000000001c RBX: ffff928ada27b400 RCX: 0000000000000000
>> RDX: ffff928ada07b200 RSI: fffffe000003c028 RDI: ffff928ada27b400
>> RBP: ffff928ada27b4f0 R08: 0000000000000001 R09: 0000000000000000
>> R10: fffffe000003c440 R11: ffff928a7383cc60 R12: fffffe000003c028
>> R13: 00000000000003e8 R14: 0000000000000046 R15: 0000000000110001
>> FS: 00007f25d43cf780(0000) GS:ffff928adaa40000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: fffffe000003bff8 CR3: 00000000b52a8005 CR4: 00000000001707e0
and CR2 says it tried below.
>> I bisected it down to:
>>
>> 35d1ce6bec133679ff16325d335217f108b84871 ("perf/x86/intel/ds: Fix
>> x86_pmu_stop warning for large PEBS")
>>
>> Which looks to be storing an awful lot on the stack:
>>
>> static void __intel_pmu_pebs_event(struct perf_event *event,
>> struct pt_regs *iregs,
>> void *base, void *top,
>> int bit, int count,
>> void (*setup_sample)(struct perf_event *,
>> struct pt_regs *,
>> void *,
>> struct perf_sample_data *,
>> struct pt_regs *))
>> {
>> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
>> struct hw_perf_event *hwc = &event->hw;
>> struct perf_sample_data data;
>> struct x86_perf_regs perf_regs;
>> struct pt_regs *regs = &perf_regs.regs;
>> void *at = get_next_pebs_record_by_bit(base, top, bit);
>> struct pt_regs dummy_iregs;
>
> The only thing I can come up with in a hurry is that that dummy_iregs
> thing really should be static. That's 168 bytes of stack out the window
> right there.
What's worse is perf_sample_data, which is 384 bytes and 64-byte aligned.
> Still, this seems to suggest (barring some actual issue hiding in those
> 135 lost lines) we're very close to the limit on the NMI stack, which is
> a single 4k page IIRC.
Yes, unless KASAN is enabled.
Thanks,
tglx
* Re: [BUG] Stack overflow when running perf and function tracer
2020-10-30 10:26 ` Thomas Gleixner
@ 2020-10-30 10:32 ` Peter Zijlstra
2020-10-30 11:36 ` Thomas Gleixner
2020-10-30 10:59 ` [PATCH] perf/arch: Remove perf_sample_data::regs_user_copy Peter Zijlstra
1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2020-10-30 10:32 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: Steven Rostedt, LKML, Ingo Molnar, kan.liang, like.xu
On Fri, Oct 30, 2020 at 11:26:01AM +0100, Thomas Gleixner wrote:
> > The only thing I can come up with in a hurry is that that dummy_iregs
> > thing really should be static. That's 168 bytes of stack out the window
> > right there.
>
> What's worse is perf_sample_data which is 384 bytes and is 64 bytes aligned.
Urgh, that thing just keeps on growing :/
I'll try and have a look if we can both shrink the thing and move it off
stack.
* [PATCH] perf/arch: Remove perf_sample_data::regs_user_copy
2020-10-30 10:26 ` Thomas Gleixner
2020-10-30 10:32 ` Peter Zijlstra
@ 2020-10-30 10:59 ` Peter Zijlstra
2020-10-30 12:33 ` Steven Rostedt
1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2020-10-30 10:59 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Steven Rostedt, LKML, Ingo Molnar, kan.liang, like.xu, Andy Lutomirski
On Fri, Oct 30, 2020 at 11:26:01AM +0100, Thomas Gleixner wrote:
> What's worse is perf_sample_data which is 384 bytes and is 64 bytes aligned.
Here; this shrinks it to 192 bytes. Combined with the static dummy this
should reduce stack usage by 360 bytes.
---
Subject: perf/arch: Remove perf_sample_data::regs_user_copy
struct perf_sample_data lives on-stack, so we should be careful about its
size. Furthermore, the pt_regs copy in there is only because x86_64 is a
trainwreck; solve it differently.
Halves sizeof(perf_sample_data).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
arch/arm/kernel/perf_regs.c | 3 +--
arch/arm64/kernel/perf_regs.c | 3 +--
arch/csky/kernel/perf_regs.c | 3 +--
arch/powerpc/perf/perf_regs.c | 3 +--
arch/riscv/kernel/perf_regs.c | 3 +--
arch/s390/kernel/perf_regs.c | 3 +--
arch/x86/kernel/perf_regs.c | 15 +++++++++++----
include/linux/perf_event.h | 6 ------
include/linux/perf_regs.h | 6 ++----
kernel/events/core.c | 8 +++-----
10 files changed, 22 insertions(+), 31 deletions(-)
diff --git a/arch/arm/kernel/perf_regs.c b/arch/arm/kernel/perf_regs.c
index 05fe92aa7d98..0529f90395c9 100644
--- a/arch/arm/kernel/perf_regs.c
+++ b/arch/arm/kernel/perf_regs.c
@@ -32,8 +32,7 @@ u64 perf_reg_abi(struct task_struct *task)
}
void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs,
- struct pt_regs *regs_user_copy)
+ struct pt_regs *regs)
{
regs_user->regs = task_pt_regs(current);
regs_user->abi = perf_reg_abi(current);
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index 94e8718e7229..f6f58e6265df 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -73,8 +73,7 @@ u64 perf_reg_abi(struct task_struct *task)
}
void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs,
- struct pt_regs *regs_user_copy)
+ struct pt_regs *regs)
{
regs_user->regs = task_pt_regs(current);
regs_user->abi = perf_reg_abi(current);
diff --git a/arch/csky/kernel/perf_regs.c b/arch/csky/kernel/perf_regs.c
index eb32838b8210..09b7f88a2d6a 100644
--- a/arch/csky/kernel/perf_regs.c
+++ b/arch/csky/kernel/perf_regs.c
@@ -32,8 +32,7 @@ u64 perf_reg_abi(struct task_struct *task)
}
void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs,
- struct pt_regs *regs_user_copy)
+ struct pt_regs *regs)
{
regs_user->regs = task_pt_regs(current);
regs_user->abi = perf_reg_abi(current);
diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
index 8e53f2fc3fe0..6f681b105eec 100644
--- a/arch/powerpc/perf/perf_regs.c
+++ b/arch/powerpc/perf/perf_regs.c
@@ -144,8 +144,7 @@ u64 perf_reg_abi(struct task_struct *task)
}
void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs,
- struct pt_regs *regs_user_copy)
+ struct pt_regs *regs)
{
regs_user->regs = task_pt_regs(current);
regs_user->abi = (regs_user->regs) ? perf_reg_abi(current) :
diff --git a/arch/riscv/kernel/perf_regs.c b/arch/riscv/kernel/perf_regs.c
index 04a38fbeb9c7..fd304a248de6 100644
--- a/arch/riscv/kernel/perf_regs.c
+++ b/arch/riscv/kernel/perf_regs.c
@@ -36,8 +36,7 @@ u64 perf_reg_abi(struct task_struct *task)
}
void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs,
- struct pt_regs *regs_user_copy)
+ struct pt_regs *regs)
{
regs_user->regs = task_pt_regs(current);
regs_user->abi = perf_reg_abi(current);
diff --git a/arch/s390/kernel/perf_regs.c b/arch/s390/kernel/perf_regs.c
index 4352a504f235..6e9e5d5e927e 100644
--- a/arch/s390/kernel/perf_regs.c
+++ b/arch/s390/kernel/perf_regs.c
@@ -53,8 +53,7 @@ u64 perf_reg_abi(struct task_struct *task)
}
void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs,
- struct pt_regs *regs_user_copy)
+ struct pt_regs *regs)
{
/*
* Use the regs from the first interruption and let
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index bb7e1132290b..750129bf539c 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -101,8 +101,7 @@ u64 perf_reg_abi(struct task_struct *task)
}
void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs,
- struct pt_regs *regs_user_copy)
+ struct pt_regs *regs)
{
regs_user->regs = task_pt_regs(current);
regs_user->abi = perf_reg_abi(current);
@@ -129,12 +128,20 @@ u64 perf_reg_abi(struct task_struct *task)
return PERF_SAMPLE_REGS_ABI_64;
}
+static DEFINE_PER_CPU(struct pt_regs, nmi_user_regs);
+
void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs,
- struct pt_regs *regs_user_copy)
+ struct pt_regs *regs)
{
+ struct pt_regs *regs_user_copy = this_cpu_ptr(&nmi_user_regs);
struct pt_regs *user_regs = task_pt_regs(current);
+ if (!in_nmi()) {
+ regs_user->regs = task_pt_regs(current);
+ regs_user->abi = perf_reg_abi(current);
+ return;
+ }
+
/*
* If we're in an NMI that interrupted task_pt_regs setup, then
* we can't sample user regs at all. This check isn't really
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0c19d279b97f..e58f4e6b6e1c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1022,13 +1022,7 @@ struct perf_sample_data {
struct perf_callchain_entry *callchain;
u64 aux_size;
- /*
- * regs_user may point to task_pt_regs or to regs_user_copy, depending
- * on arch details.
- */
struct perf_regs regs_user;
- struct pt_regs regs_user_copy;
-
struct perf_regs regs_intr;
u64 stack_user_size;
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index 2d12e97d5e7b..5efcdb768c75 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -20,8 +20,7 @@ u64 perf_reg_value(struct pt_regs *regs, int idx);
int perf_reg_validate(u64 mask);
u64 perf_reg_abi(struct task_struct *task);
void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs,
- struct pt_regs *regs_user_copy);
+ struct pt_regs *regs);
#else
#define PERF_REG_EXTENDED_MASK 0
@@ -42,8 +41,7 @@ static inline u64 perf_reg_abi(struct task_struct *task)
}
static inline void perf_get_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs,
- struct pt_regs *regs_user_copy)
+ struct pt_regs *regs)
{
regs_user->regs = task_pt_regs(current);
regs_user->abi = perf_reg_abi(current);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index da467e1dd49a..14709182d94e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6374,14 +6374,13 @@ perf_output_sample_regs(struct perf_output_handle *handle,
}
static void perf_sample_regs_user(struct perf_regs *regs_user,
- struct pt_regs *regs,
- struct pt_regs *regs_user_copy)
+ struct pt_regs *regs)
{
if (user_mode(regs)) {
regs_user->abi = perf_reg_abi(current);
regs_user->regs = regs;
} else if (!(current->flags & PF_KTHREAD)) {
- perf_get_regs_user(regs_user, regs, regs_user_copy);
+ perf_get_regs_user(regs_user, regs);
} else {
regs_user->abi = PERF_SAMPLE_REGS_ABI_NONE;
regs_user->regs = NULL;
@@ -7083,8 +7082,7 @@ void perf_prepare_sample(struct perf_event_header *header,
}
if (sample_type & (PERF_SAMPLE_REGS_USER | PERF_SAMPLE_STACK_USER))
- perf_sample_regs_user(&data->regs_user, regs,
- &data->regs_user_copy);
+ perf_sample_regs_user(&data->regs_user, regs);
if (sample_type & PERF_SAMPLE_REGS_USER) {
/* regs dump ABI info */
* [PATCH] perf/x86: Make dummy_iregs static
2020-10-30 9:00 ` Peter Zijlstra
2020-10-30 10:26 ` Thomas Gleixner
@ 2020-10-30 11:19 ` Peter Zijlstra
1 sibling, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2020-10-30 11:19 UTC (permalink / raw)
To: Steven Rostedt; +Cc: LKML, Thomas Gleixner, Ingo Molnar, kan.liang, like.xu
On Fri, Oct 30, 2020 at 10:00:37AM +0100, Peter Zijlstra wrote:
> > static void __intel_pmu_pebs_event(struct perf_event *event,
> > struct pt_regs *iregs,
> > void *base, void *top,
> > int bit, int count,
> > void (*setup_sample)(struct perf_event *,
> > struct pt_regs *,
> > void *,
> > struct perf_sample_data *,
> > struct pt_regs *))
> > {
> > struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> > struct hw_perf_event *hwc = &event->hw;
> > struct perf_sample_data data;
> > struct x86_perf_regs perf_regs;
> > struct pt_regs *regs = &perf_regs.regs;
> > void *at = get_next_pebs_record_by_bit(base, top, bit);
> > struct pt_regs dummy_iregs;
>
> The only thing I can come up with in a hurry is that that dummy_iregs
> thing really should be static. That's 168 bytes of stack out the window
> right there.
---
Subject: perf/x86: Make dummy_iregs static
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri Oct 30 12:15:06 CET 2020
Having pt_regs on-stack is unfortunate; it's 168 bytes. Since it isn't
actually used, make it a static variable. This both gets it off the
stack and ensures it gets zero-initialized, just in case someone does
look at it.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
arch/x86/events/intel/ds.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1735,7 +1735,7 @@ static void __intel_pmu_pebs_event(struc
struct x86_perf_regs perf_regs;
struct pt_regs *regs = &perf_regs.regs;
void *at = get_next_pebs_record_by_bit(base, top, bit);
- struct pt_regs dummy_iregs;
+ static struct pt_regs dummy_iregs;
if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
/*
* Re: [BUG] Stack overflow when running perf and function tracer
2020-10-30 10:32 ` Peter Zijlstra
@ 2020-10-30 11:36 ` Thomas Gleixner
2020-10-30 13:42 ` Thomas Gleixner
0 siblings, 1 reply; 9+ messages in thread
From: Thomas Gleixner @ 2020-10-30 11:36 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Steven Rostedt, LKML, Ingo Molnar, kan.liang, like.xu
On Fri, Oct 30 2020 at 11:32, Peter Zijlstra wrote:
> On Fri, Oct 30, 2020 at 11:26:01AM +0100, Thomas Gleixner wrote:
>
>> > The only thing I can come up with in a hurry is that that dummy_iregs
>> > thing really should be static. That's 168 bytes of stack out the window
>> > right there.
>>
>> What's worse is perf_sample_data which is 384 bytes and is 64 bytes
>> aligned.
And there is also x86_perf_regs which is another 176 bytes ....
> Urgh, that thing just keeps on growing :/
>
> I'll try and have a look if we can both shrink the thing and move it off
> stack.
Even then we still need to worry about the stack depth because that
tracer code was not yet done. Throw some BPF at it and it goes south
again.
So the real question is what else is on that stack which blows it up
close to 4k? Btw, it would be massively helpful for this kind of crash
to print the actual stack depth per entry in the backtrace.
Here is the partial stack trace:
                                                Stack usage
  ring_buffer_lock_reserve+0x12c/0x380
  trace_function+0x27/0x130
  function_trace_call+0x133/0x180
  perf_output_begin+0x4d/0x2d0                  64+
  perf_log_throttle+0x9a/0x120                  470+
  __perf_event_account_interrupt+0xa9/0x120
  __perf_event_overflow+0x2b/0xf0
  __intel_pmu_pebs_event+0x2ec/0x3e0            760+
  intel_pmu_drain_pebs_nhm+0x268/0x330          200+
  handle_pmi_common+0xc2/0x2b0
The missing call chain is:
intel_pmu_handle_irq or intel_pmu_handle_irq_v4
perf_event_nmi_handler
nmi
So the larger offenders accumulate to ~1.5k of data on the stack, but
where is the rest of the 4k?
It's about 15 calls. So to fill up the stack you'd need about 230 bytes
per call to go up to 4k.
Something is fishy.
Thanks,
tglx
* Re: [PATCH] perf/arch: Remove perf_sample_data::regs_user_copy
2020-10-30 10:59 ` [PATCH] perf/arch: Remove perf_sample_data::regs_user_copy Peter Zijlstra
@ 2020-10-30 12:33 ` Steven Rostedt
0 siblings, 0 replies; 9+ messages in thread
From: Steven Rostedt @ 2020-10-30 12:33 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, LKML, Ingo Molnar, kan.liang, like.xu, Andy Lutomirski
On Fri, 30 Oct 2020 11:59:17 +0100
Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Oct 30, 2020 at 11:26:01AM +0100, Thomas Gleixner wrote:
> > What's worse is perf_sample_data which is 384 bytes and is 64 bytes aligned.
>
> Here; this shrinks it to 192 bytes. Combined with the static dummy this
> should reduce stack usage by 360 bytes.
>
> ---
> Subject: perf/arch: Remove perf_sample_data::regs_user_copy
>
> struct perf_sample_data lives on-stack, we should be careful about its
> size. Furthermore, the pt_regs copy in there is only because x86_64 is a
> trainwreck, solve it differently.
>
> Halves sizeof(perf_sample_data).
>
>
I tested with just this patch and it doesn't crash when running
function tracing and perf simultaneously.
-- Steve
* Re: [BUG] Stack overflow when running perf and function tracer
2020-10-30 11:36 ` Thomas Gleixner
@ 2020-10-30 13:42 ` Thomas Gleixner
0 siblings, 0 replies; 9+ messages in thread
From: Thomas Gleixner @ 2020-10-30 13:42 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Steven Rostedt, LKML, Ingo Molnar, kan.liang, like.xu
On Fri, Oct 30 2020 at 12:36, Thomas Gleixner wrote:
> On Fri, Oct 30 2020 at 11:32, Peter Zijlstra wrote:
> So the real question is what else is on that stack which blows it up
> close to 4k? Btw, it would be massively helpful for this kind of crash
> to print the actual stack depth per entry in the backtrace.
>
> Here is the partial stack trace:
> Stack usage
> ring_buffer_lock_reserve+0x12c/0x380
> trace_function+0x27/0x130
> function_trace_call+0x133/0x180
> perf_output_begin+0x4d/0x2d0 64+
> perf_log_throttle+0x9a/0x120 470+
> __perf_event_account_interrupt+0xa9/0x120
> __perf_event_overflow+0x2b/0xf0
> __intel_pmu_pebs_event+0x2ec/0x3e0 760+
> intel_pmu_drain_pebs_nhm+0x268/0x330 200+
> handle_pmi_common+0xc2/0x2b0
So Steven provided a backtrace with the actual stack depth printed:
ring_buffer_lock_reserve+0x12c/0x380 0030 104
trace_function+0x27/0xf0 0098 56
function_trace_call+0x124/0x190 00d0 224
__rcu_read_lock+0x5/0x20 01b0 8
perf_output_begin+0x4d/0x2d0 01b8 640
perf_log_throttle+0x9a/0x120 0438 624
__perf_event_account_interrupt+0xa6/0x120 06a8 32
__perf_event_overflow+0x2b/0xf0 06c8 48
__intel_pmu_pebs_event+0x2ec/0x3e0 06f8 960
intel_pmu_drain_pebs_nhm+0x268/0x330 0ab8 256
handle_pmi_common+0xc2/0x2b0 0bb8 584
intel_pmu_handle_irq+0xc8/0x160 0e00 64
perf_event_nmi_handler+0x28/0x50 0e40 32
nmi_handle+0x80/0x190 0e60 64
default_do_nmi+0x6b/0x170 0ea0 40
exc_nmi+0x15d/0x1a0 0ec8 40
end_repeat_nmi+0x16/0x55 0ef0 272
So I missed perf_output_begin and handle_pmi_common ...
Thread overview: 9+ messages
2020-10-30 4:27 [BUG] Stack overflow when running perf and function tracer Steven Rostedt
2020-10-30 9:00 ` Peter Zijlstra
2020-10-30 10:26 ` Thomas Gleixner
2020-10-30 10:32 ` Peter Zijlstra
2020-10-30 11:36 ` Thomas Gleixner
2020-10-30 13:42 ` Thomas Gleixner
2020-10-30 10:59 ` [PATCH] perf/arch: Remove perf_sample_data::regs_user_copy Peter Zijlstra
2020-10-30 12:33 ` Steven Rostedt
2020-10-30 11:19 ` [PATCH] perf/x86: Make dummy_iregs static Peter Zijlstra