* perf/tracepoint: another fuzzer generated lockup
From: Vince Weaver @ 2013-11-08 19:45 UTC
To: Steven Rostedt
Cc: LKML, Ingo Molnar, Peter Zijlstra, Dave Jones, Frederic Weisbecker

The perf_fuzzer hits this bug pretty reliably within about an hour, but
unfortunately only after a large number of syscalls, so it is not easy to
isolate. This is against stock 3.12 with the
no-ftrace-function-tracing-unless-root patch applied. I can trigger this
as a normal user, no root involved.

[106657.412003] ------------[ cut here ]------------
[106657.412003] WARNING: CPU: 1 PID: 4131 at kernel/watchdog.c:245 watchdog_ove)
[106657.412003] Watchdog detected hard LOCKUP on cpu 1
[106657.412003] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powed
[106657.412003] CPU: 1 PID: 4131 Comm: perf_fuzzer Not tainted 3.12.0 #2
[106657.412003] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, B2
[106657.412003] 00000000000000f5 ffff88011fc87c48 ffffffff8151aad6 ffff88011fc8
[106657.412003] ffffffff8103cd68 0000000100000000 ffffffff810a101c ffff88011b30
[106657.412003] ffff88011fc87ef8 ffff88011fc87d58 0000000000000000 ffff88011fc8
[106657.412003] Call Trace:
[106657.412003] <NMI> [<ffffffff8151aad6>] ? dump_stack+0x41/0x53
[106657.412003] [<ffffffff8103cd68>] ? warn_slowpath_common+0x81/0x9c
[106657.412003] [<ffffffff810a101c>] ? watchdog_overflow_callback+0x9b/0xa6
[106657.412003] [<ffffffff8103ce25>] ? warn_slowpath_fmt+0x46/0x48
[106657.412003] [<ffffffff810a101c>] ? watchdog_overflow_callback+0x9b/0xa6
[106657.412003] [<ffffffff810ca4a2>] ? __perf_event_overflow+0x12e/0x1ab
[106657.412003] [<ffffffff810167d3>] ? intel_pmu_lbr_read+0x253/0x349
[106657.412003] [<ffffffff81018c95>] ? intel_pmu_handle_irq+0x2b8/0x34d
[106657.412003] [<ffffffff8151edfc>] ? nmi_handle+0xa8/0x13a
[106657.412003] [<ffffffff8151f473>] ? perf_event_nmi_handler+0x24/0x3d
[106657.412003] [<ffffffff8151edb2>] ? nmi_handle+0x5e/0x13a
[106657.412003] [<ffffffff8151ef35>] ? do_nmi+0xa7/0x2bd
[106657.412003] [<ffffffff8151e537>] ? end_repeat_nmi+0x1e/0x2e
[106657.412003] [<ffffffff8104b09c>] ? do_send_sig_info+0x73/0x75
[106657.412003] [<ffffffff8104b09c>] ? do_send_sig_info+0x73/0x75
[106657.412003] [<ffffffff8104b09c>] ? do_send_sig_info+0x73/0x75
[106657.412003] <<EOE>> <IRQ> [<ffffffff8111f683>] ? send_sigio_to_task+0xeb5
[106657.412003] [<ffffffff8111f737>] ? send_sigio+0xaa/0xdd
[106657.412003] [<ffffffff8111f7d3>] ? kill_fasync+0x69/0x86
[106657.412003] [<ffffffff810cabf9>] ? perf_event_wakeup+0x8c/0xa1
[106657.412003] [<ffffffff810c494d>] ? __irq_work_run+0x5f/0x7b
[106657.412003] [<ffffffff81005f82>] ? smp_trace_irq_work_interrupt+0x4b/0x7e
[106657.412003] [<ffffffff815261ca>] ? trace_irq_work_interrupt+0x6a/0x70
[106657.412003] [<ffffffff8151da07>] ? _raw_spin_unlock_irqrestore+0xf/0x11
[106657.412003] [<ffffffff810a3a89>] ? rcu_process_callbacks+0x4a4/0x4b8
[106657.412003] [<ffffffff81041027>] ? __do_softirq+0x102/0x208
[106657.412003] [<ffffffff815263dc>] ? call_softirq+0x1c/0x30
[106657.412003] [<ffffffff810041b9>] ? do_softirq+0x38/0x6d
[106657.412003] [<ffffffff81040d1c>] ? irq_exit+0x45/0xa1
[106657.412003] [<ffffffff81526a14>] ? smp_trace_apic_timer_interrupt+0x83/0x91
[106657.412003] [<ffffffff8152597a>] ? trace_apic_timer_interrupt+0x6a/0x70
[106657.412003] <EOI>
[106657.412003] ---[ end trace ff266ee4a75ddbbf ]---
[106657.412003] perf samples too long (6605838 > 10000), lowering kernel.perf_e0
[106657.412003] INFO: NMI handler (perf_event_nmi_handler) took too long to runs
[106678.192001] perf samples too long (6554240 > 20000), lowering kernel.perf_e0
[106679.508002] perf samples too long (6503044 > 40000), lowering kernel.perf_e0
[106680.852001] perf samples too long (6452247 > 76923), lowering kernel.perf_e0
[106682.192001] perf samples too long (6401848 > 142857), lowering kernel.perf_0
[106683.536001] perf samples too long (6351842 > 250000), lowering kernel.perf_0
[106684.876001] perf samples too long (6302227 > 500000), lowering kernel.perf_0
* Re: perf/tracepoint: another fuzzer generated lockup
From: Vince Weaver @ 2013-11-08 20:06 UTC
To: Vince Weaver
Cc: Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Dave Jones, Frederic Weisbecker

and again, this time after 600k successful syscalls or so.

This is on a core2 machine.

[ 1020.396002] ------------[ cut here ]------------
[ 1020.396002] WARNING: CPU: 1 PID: 3036 at kernel/watchdog.c:245 watchdog_over)
[ 1020.396002] Watchdog detected hard LOCKUP on cpu 1
[ 1020.396002] Modules linked in: cpufreq_userspace cpufreq_stats cpufreq_powern
[ 1020.396002] CPU: 1 PID: 3036 Comm: perf_fuzzer Not tainted 3.12.0 #2
[ 1020.396002] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BI2
[ 1020.396002] 00000000000000f5 ffff88011fc87c48 ffffffff8151aad6 ffff88011fc88
[ 1020.396002] ffffffff8103cd68 0000000100000000 ffffffff810a101c ffff88011b380
[ 1020.396002] ffff88011fc87ef8 ffff88011fc87d58 0000000000000000 ffff88011fc88
[ 1020.396002] Call Trace:
[ 1020.396002] <NMI> [<ffffffff8151aad6>] ? dump_stack+0x41/0x53
[ 1020.396002] [<ffffffff8103cd68>] ? warn_slowpath_common+0x81/0x9c
[ 1020.396002] [<ffffffff810a101c>] ? watchdog_overflow_callback+0x9b/0xa6
[ 1020.396002] [<ffffffff8103ce25>] ? warn_slowpath_fmt+0x46/0x48
[ 1020.396002] [<ffffffff810a101c>] ? watchdog_overflow_callback+0x9b/0xa6
[ 1020.396002] [<ffffffff810ca4a2>] ? __perf_event_overflow+0x12e/0x1ab
[ 1020.396002] [<ffffffff81018cf5>] ? intel_pmu_handle_irq+0x318/0x34d
[ 1020.396002] [<ffffffff81018cfc>] ? intel_pmu_handle_irq+0x31f/0x34d
[ 1020.396002] [<ffffffff81018c95>] ? intel_pmu_handle_irq+0x2b8/0x34d
[ 1020.396002] [<ffffffff81018671>] ? intel_pmu_enable_all+0x55/0xa7
[ 1020.396002] [<ffffffff8151edfc>] ? nmi_handle+0xa8/0x13a
[ 1020.396002] [<ffffffff8151f473>] ? perf_event_nmi_handler+0x24/0x3d
[ 1020.396002] [<ffffffff8151edb2>] ? nmi_handle+0x5e/0x13a
[ 1020.396002] [<ffffffff8151ef35>] ? do_nmi+0xa7/0x2bd
[ 1020.396002] [<ffffffff8151e537>] ? end_repeat_nmi+0x1e/0x2e
[ 1020.396002] [<ffffffff8102bcc8>] ? apic_send_IPI_self+0x33/0x35
[ 1020.396002] [<ffffffff8102bcc8>] ? apic_send_IPI_self+0x33/0x35
[ 1020.396002] [<ffffffff8102bcc8>] ? apic_send_IPI_self+0x33/0x35
[ 1020.396002] <<EOE>> <IRQ> [<ffffffff81005ee2>] ? arch_irq_work_raise+0x27a
[ 1020.396002] [<ffffffff810c4ada>] ? irq_work_queue+0x70/0x72
[ 1020.396002] [<ffffffff810ca510>] ? __perf_event_overflow+0x19c/0x1ab
[ 1020.396002] [<ffffffff8104ae16>] ? __send_signal+0x244/0x26f
[ 1020.396002] [<ffffffff810ca68b>] ? perf_swevent_overflow+0x3f/0x58
[ 1020.396002] [<ffffffff810ca8a5>] ? perf_tp_event+0xe9/0x1f6
[ 1020.396002] [<ffffffff81005f6b>] ? smp_trace_irq_work_interrupt+0x34/0x7e
[ 1020.396002] [<ffffffff81040954>] ? perf_trace_softirq+0x8a/0x96
[ 1020.396002] [<ffffffff8100366f>] ? perf_trace_x86_irq_vector+0xc0/0xc9
[ 1020.396002] [<ffffffff81005f6b>] ? smp_trace_irq_work_interrupt+0x34/0x7e
[ 1020.396002] [<ffffffff81005f6b>] ? smp_trace_irq_work_interrupt+0x34/0x7e
[ 1020.396002] [<ffffffff815261ca>] ? trace_irq_work_interrupt+0x6a/0x70
[ 1020.396002] [<ffffffff81040fba>] ? __do_softirq+0x95/0x208
[ 1020.396002] [<ffffffff815263dc>] ? call_softirq+0x1c/0x30
[ 1020.396002] [<ffffffff810041b9>] ? do_softirq+0x38/0x6d
[ 1020.396002] [<ffffffff81040d1c>] ? irq_exit+0x45/0xa1
[ 1020.396002] [<ffffffff81526a14>] ? smp_trace_apic_timer_interrupt+0x83/0x91
[ 1020.396002] [<ffffffff8152597a>] ? trace_apic_timer_interrupt+0x6a/0x70
[ 1020.396002] <EOI> [<ffffffff810c7c20>] ? __perf_event_enable+0x118/0x173
[ 1020.396002] [<ffffffff8104c694>] ? get_signal_to_deliver+0x47d/0x48a
[ 1020.396002] [<ffffffff810024da>] ? do_signal+0x46/0x5f5
[ 1020.396002] [<ffffffff8100b9f5>] ? perf_trace_sys_exit+0x3e/0xa7
[ 1020.396002] [<ffffffff81002ab5>] ? do_notify_resume+0x2c/0x6a
[ 1020.396002] [<ffffffff81524ea0>] ? int_signal+0x12/0x17
[ 1020.396002] ---[ end trace 2b385d729ba32e92 ]---
[ 1020.396002] perf samples too long (8101340 > 10000), lowering kernel.perf_ev0
[ 1020.396002] INFO: NMI handler (perf_event_nmi_handler) took too long to run:s
[ 1033.301811] perf samples too long (8038058 > 20000), lowering kernel.perf_ev0
[ 1034.125811] perf samples too long (7975268 > 40000), lowering kernel.perf_ev0
[ 1034.973813] perf samples too long (7912968 > 76923), lowering kernel.perf_ev0
[ 1035.821813] perf samples too long (7851154 > 142857), lowering kernel.perf_e0
[ 1036.669813] perf samples too long (7789824 > 250000), lowering kernel.perf_e0
[ 1037.517812] perf samples too long (7728973 > 500000), lowering kernel.perf_e0
* Re: perf/tracepoint: another fuzzer generated lockup
From: Frederic Weisbecker @ 2013-11-08 20:02 UTC
To: Vince Weaver
Cc: Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Dave Jones

On Fri, Nov 08, 2013 at 03:06:11PM -0500, Vince Weaver wrote:
> 
> and again, this time after 600k successful syscalls or so.
> 
> This is on a core2 machine.
> 
> [ 1020.396002] ------------[ cut here ]------------
> [ 1020.396002] WARNING: CPU: 1 PID: 3036 at kernel/watchdog.c:245 watchdog_over)
> [ 1020.396002] Watchdog detected hard LOCKUP on cpu 1
> [... full trace quoted above ...]
> [ 1037.517812] perf samples too long (7728973 > 500000), lowering kernel.perf_e0

There seems to be a loop that takes too long in intel_pmu_handle_irq(). Your two
previous reports seemed to suggest that lbr is involved, but not this one.

I'm going to look further into this, unfortunately my testbox is an AMD...
* Re: perf/tracepoint: another fuzzer generated lockup
From: Vince Weaver @ 2013-11-08 20:23 UTC
To: Frederic Weisbecker
Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Dave Jones

On Fri, 8 Nov 2013, Frederic Weisbecker wrote:
> 
> There seems to be a loop that takes too long in intel_pmu_handle_irq(). Your two
> previous reports seemed to suggest that lbr is involved, but not this one.

I may be wrong but I think everything between <NMI> and <EOE> is just
noise from the NMI perf-event watchdog timer kicking in.

Why that code would be reading the LBR registers I don't know.

Vince
* Re: perf/tracepoint: another fuzzer generated lockup
From: Frederic Weisbecker @ 2013-11-08 20:48 UTC
To: Vince Weaver
Cc: Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Dave Jones

On Fri, Nov 08, 2013 at 03:23:07PM -0500, Vince Weaver wrote:
> On Fri, 8 Nov 2013, Frederic Weisbecker wrote:
> > 
> > There seems to be a loop that takes too long in intel_pmu_handle_irq(). Your two
> > previous reports seemed to suggest that lbr is involved, but not this one.
> 
> I may be wrong but I think everything between <NMI> and <EOE> is just
> noise from the NMI perf-event watchdog timer kicking in.

Ah, good point.

So the pattern seems to be that irq work/perf_event_wakeup is involved, maybe
interrupting a tracepoint event or so.

It would be nice if you enabled CONFIG_FRAME_POINTER in your future reports
so that we get more precise traces. Or maybe you actually enabled it and it
doesn't work?

> Why that code would be reading the LBR registers I don't know.

Yeah, that's just an unconditional call to check if there is an lbr branch
to record.

> Vince
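[Editor's note: CONFIG_FRAME_POINTER is a real build-time option (under
"Kernel hacking" in menuconfig). With it enabled, the x86 stack walker can
follow actual frame pointers rather than heuristically scanning the stack,
which is why the traces above are full of best-guess "?" entries. The
fragment below shows the relevant .config line; nothing else is implied.]

CONFIG_FRAME_POINTER=y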
* Re: perf/tracepoint: another fuzzer generated lockup
From: Vince Weaver @ 2013-11-08 21:15 UTC
To: Frederic Weisbecker
Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Dave Jones

On Fri, 8 Nov 2013, Frederic Weisbecker wrote:
> On Fri, Nov 08, 2013 at 03:23:07PM -0500, Vince Weaver wrote:
> > On Fri, 8 Nov 2013, Frederic Weisbecker wrote:
> > > 
> > > There seems to be a loop that takes too long in intel_pmu_handle_irq(). Your two
> > > previous reports seemed to suggest that lbr is involved, but not this one.
> > 
> > I may be wrong but I think everything between <NMI> and <EOE> is just
> > noise from the NMI perf-event watchdog timer kicking in.
> 
> Ah, good point.
> 
> So the pattern seems to be that irq work/perf_event_wakeup is involved, maybe
> interrupting a tracepoint event or so.

I managed to construct a reproducible test case, which is attached. I
sometimes have to run it 5-10 times before it triggers.

> It would be nice if you enabled CONFIG_FRAME_POINTER in your future reports
> so that we get more precise traces. Or maybe you actually enabled it and it
> doesn't work?

No, it's not enabled, I can try setting that in the future.

Vince

[-- Attachment #2: Type: TEXT/x-csrc; name=out.c, Size: 2950 bytes --]

/* log_to_code output from ./new.trace */
/* by Vince Weaver <vincent.weaver _at_ maine.edu> */

#define _GNU_SOURCE 1

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <poll.h>
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>

int fd[1024];
struct perf_event_attr pe[1024];
char *mmap_result[1024];

#define MAX_READ_SIZE 65536
static long long data[MAX_READ_SIZE];

#define MAX_POLL_FDS 128
struct pollfd pollfds[MAX_POLL_FDS];

long long id;
int forked_pid;
struct sigaction sa;
static int overflows=0;
static int sigios=0;

/* Overflow signal handler: disable the event, then immediately re-arm
   it for one more overflow via PERF_EVENT_IOC_REFRESH. */
static void our_handler(int signum, siginfo_t *info, void *uc) {

	int fd = info->si_fd;
	int ret;

	overflows++;
	ioctl(fd,PERF_EVENT_IOC_DISABLE,0);
	if (sigios) return;
	ret=ioctl(fd, PERF_EVENT_IOC_REFRESH,1);
}

int perf_event_open(struct perf_event_attr *hw_event_uptr,
	pid_t pid, int cpu, int group_fd, unsigned long flags) {

	return syscall(__NR_perf_event_open,hw_event_uptr, pid, cpu,
		group_fd, flags);
}

int main(int argc, char **argv) {

	/* 1 */
	/* fd = 82 */

	memset(&pe[82],0,sizeof(struct perf_event_attr));
	pe[82].type=PERF_TYPE_TRACEPOINT;
	pe[82].size=80;
	pe[82].config=0x18;
	pe[82].sample_period=0xdd95cac7b947b610;
	pe[82].sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TIME|PERF_SAMPLE_READ|PERF_SAMPLE_CALLCHAIN|PERF_SAMPLE_PERIOD|PERF_SAMPLE_DATA_SRC; //|PERF_SAMPLE_IDENTIFIER; /* 18135 */
	pe[82].read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING|PERF_FORMAT_ID|PERF_FORMAT_GROUP; /* f */
	pe[82].exclude_hv=1;
	pe[82].mmap=1;
	pe[82].comm=1;
	pe[82].watermark=1;
	pe[82].precise_ip=1; /* constant skid */
	pe[82].mmap_data=1;
	pe[82].exclude_host=1;
	pe[82].exclude_guest=1;
	pe[82].exclude_callchain_user=1;
	pe[82].wakeup_watermark=232;
	pe[82].bp_type=HW_BREAKPOINT_EMPTY;
	pe[82].config1=0x9;
	pe[82].branch_sample_type=PERF_SAMPLE_BRANCH_USER|PERF_SAMPLE_BRANCH_KERNEL|PERF_SAMPLE_BRANCH_HV|PERF_SAMPLE_BRANCH_ANY|PERF_SAMPLE_BRANCH_ANY_CALL|PERF_SAMPLE_BRANCH_ANY_RETURN|PERF_SAMPLE_BRANCH_IND_CALL; //|PERF_SAMPLE_BRANCH_ABORT_TX;
	pe[82].sample_regs_user=39;
	pe[82].sample_stack_user=37552;

	fd[82]=perf_event_open(&pe[82],
				0, /* current thread */
				-1, /* all cpus */
				-1, /* New Group Leader */
				PERF_FLAG_FD_OUTPUT /*2*/ );

	/* 2 */
	mmap_result[82]=mmap(NULL, 20480,PROT_READ|PROT_WRITE, MAP_SHARED,fd[82], 0);

	/* 3 */
	/* Route overflow notifications to SIGRTMIN+2 via fasync. */
	memset(&sa, 0, sizeof(struct sigaction));
	sa.sa_sigaction = our_handler;
	sa.sa_flags = SA_SIGINFO;
	if (sigaction( SIGRTMIN+2, &sa, NULL) < 0) {
		printf("Error setting up signal handler\n");
	}
	fcntl(fd[82], F_SETFL, O_RDWR|O_NONBLOCK|O_ASYNC);
	fcntl(fd[82], F_SETSIG, SIGRTMIN+2);
	fcntl(fd[82], F_SETOWN,getpid());

	/* Replayed 3 syscalls */
	return 0;
}
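[Editor's note: since the testcase needs several runs to trigger, it is
convenient to drive it in a loop. A minimal sketch of such a harness,
assuming the code above is compiled to ./out; the path and the iteration
count of 100 are arbitrary, not from the thread.]

/* run_loop.c: repeatedly fork/exec the reproducer until the machine
 * locks up or the loop completes. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
	for (int i = 0; i < 100; i++) {
		pid_t pid = fork();
		if (pid == 0) {
			execl("./out", "./out", (char *)NULL);
			perror("execl");
			exit(1);
		}
		int status;
		waitpid(pid, &status, 0);
		printf("run %d: exit status %d\n", i, status);
	}
	return 0;
}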
* Re: perf/tracepoint: another fuzzer generated lockup
From: Frederic Weisbecker @ 2013-11-08 22:24 UTC
To: Vince Weaver
Cc: Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Dave Jones

On Fri, Nov 08, 2013 at 04:15:21PM -0500, Vince Weaver wrote:
> On Fri, 8 Nov 2013, Frederic Weisbecker wrote:
> 
> > So the pattern seems to be that irq work/perf_event_wakeup is involved, maybe
> > interrupting a tracepoint event or so.
> 
> I managed to construct a reproducible test case, which is attached. I
> sometimes have to run it 5-10 times before it triggers.
> 

Ah, with this I can reproduce. I need to run it in a loop for a few seconds
to trigger it:

[ 91.750943] ------------[ cut here ]------------
[ 91.755440] WARNING: CPU: 1 PID: 968 at kernel/watchdog.c:246 watchdog_overflow_callback+0x9a/0xc0()
[ 91.764530] Watchdog detected hard LOCKUP on cpu 1
[ 91.769121] Modules linked in:
[ 91.772328] CPU: 1 PID: 968 Comm: out Not tainted 3.12.0+ #47
[ 91.778042] Hardware name: FUJITSU SIEMENS AMD690VM-FMH/AMD690VM-FMH, BIOS V5.13 03/14/2008
[ 91.786358] 00000000000000f6 ffff880107c87bd8 ffffffff815bac1a ffff8800cd0bafa0
[ 91.793714] ffff880107c87c28 ffff880107c87c18 ffffffff8104dffc ffff880107c87e08
[ 91.801076] ffff880103d20000 0000000000000000 ffff880107c87d38 0000000000000000
[ 91.808444] Call Trace:
[ 91.810869] <NMI> [<ffffffff815bac1a>] dump_stack+0x4f/0x7c
[ 91.816589] [<ffffffff8104dffc>] warn_slowpath_common+0x8c/0xc0
[ 91.822564] [<ffffffff8104e0e6>] warn_slowpath_fmt+0x46/0x50
[ 91.828281] [<ffffffff810ef2ea>] watchdog_overflow_callback+0x9a/0xc0
[ 91.834777] [<ffffffff8111bc78>] __perf_event_overflow+0x98/0x310
[ 91.840928] [<ffffffff81118a40>] ? perf_event_task_disable+0x90/0x90
[ 91.847336] [<ffffffff8111cb44>] perf_event_overflow+0x14/0x20
[ 91.853227] [<ffffffff81019eda>] x86_pmu_handle_irq+0x12a/0x180
[ 91.859198] [<ffffffff81018ad4>] perf_event_nmi_handler+0x34/0x60
[ 91.865353] [<ffffffff81006f76>] nmi_handle.isra.3+0xc6/0x3e0
[ 91.871154] [<ffffffff81006eb5>] ? nmi_handle.isra.3+0x5/0x3e0
[ 91.877045] [<ffffffff8102fd60>] ? perf_ibs_handle_irq+0x420/0x420
[ 91.883275] [<ffffffff810073a0>] do_nmi+0x110/0x390
[ 91.888219] [<ffffffff815c9631>] end_repeat_nmi+0x1e/0x2e
[ 91.893676] [<ffffffff812ef38d>] ? delay_tsc+0x6d/0xc0
[ 91.898871] [<ffffffff812ef38d>] ? delay_tsc+0x6d/0xc0
[ 91.904068] [<ffffffff812ef38d>] ? delay_tsc+0x6d/0xc0
[ 91.909258] <<EOE>> <IRQ> [<ffffffff812ef27f>] __delay+0xf/0x20
[ 91.915415] [<ffffffff812ef2b7>] __const_udelay+0x27/0x30
[ 91.920873] [<ffffffff810ba31c>] __rcu_read_unlock+0x5c/0xa0
[ 91.926589] [<ffffffff811815b3>] kill_fasync+0x163/0x2a0
[ 91.931958] [<ffffffff81181476>] ? kill_fasync+0x26/0x2a0
[ 91.937414] [<ffffffff812ef2b7>] ? __const_udelay+0x27/0x30
[ 91.943045] [<ffffffff8111988e>] perf_event_wakeup+0x15e/0x290
[ 91.948935] [<ffffffff81119730>] ? __perf_event_task_sched_out+0x540/0x540
[ 91.955864] [<ffffffff811199f7>] perf_pending_event+0x37/0x60
[ 91.961667] [<ffffffff81112bb7>] __irq_work_run+0x77/0xa0
[ 91.967122] [<ffffffff81112c58>] irq_work_run+0x18/0x30
[ 91.972404] [<ffffffff810080ad>] smp_trace_irq_work_interrupt+0x3d/0x2a0
[ 91.979163] [<ffffffff815cb132>] trace_irq_work_interrupt+0x72/0x80
[ 91.985486] [<ffffffff815c8fb7>] ? retint_restore_args+0x13/0x13
[ 91.991548] [<ffffffff815c877a>] ? _raw_spin_unlock_irqrestore+0x7a/0x90
[ 91.998305] [<ffffffff810c101b>] rcu_process_callbacks+0x1db/0x530
[ 92.004541] [<ffffffff8105381d>] __do_softirq+0xdd/0x490
[ 92.009911] [<ffffffff81053fe6>] irq_exit+0x96/0xc0
[ 92.014849] [<ffffffff815cbc3a>] smp_trace_apic_timer_interrupt+0x5a/0x2b4
[ 92.021777] [<ffffffff815ca7b2>] trace_apic_timer_interrupt+0x72/0x80
[ 92.028264] <EOI> [<ffffffff81180a95>] ? f_setown+0x5/0x150
[ 92.033990] [<ffffffff81180d3a>] ? SyS_fcntl+0xda/0x640
[ 92.039267] [<ffffffff812f06ae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 92.045683] [<ffffffff815c9aa9>] system_call_fastpath+0x16/0x1b
[ 92.051657] ---[ end trace cd18ca1175c8c7fd ]---
[ 92.056246] perf samples too long (2385259 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[ 92.065604] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 314.659 msecs
[ 101.744824] perf samples too long (2366626 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[ 111.738707] perf samples too long (2348139 > 10000), lowering kernel.perf_event_max_sample_rate to 13000
[ 121.732589] perf samples too long (2329797 > 19230), lowering kernel.perf_event_max_sample_rate to 7000
[ 131.726473] perf samples too long (2311632 > 35714), lowering kernel.perf_event_max_sample_rate to 4000
[ 141.720356] perf samples too long (2293575 > 62500), lowering kernel.perf_event_max_sample_rate to 2000
[ 151.714239] perf samples too long (2275693 > 125000), lowering kernel.perf_event_max_sample_rate to 1000
* Re: perf/tracepoint: another fuzzer generated lockup
From: Frederic Weisbecker @ 2013-11-08 22:36 UTC
To: Vince Weaver
Cc: Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Dave Jones

On Fri, Nov 08, 2013 at 04:15:21PM -0500, Vince Weaver wrote:
> On Fri, 8 Nov 2013, Frederic Weisbecker wrote:
> 
> > So the pattern seems to be that irq work/perf_event_wakeup is involved, maybe
> > interrupting a tracepoint event or so.
> 
> I managed to construct a reproducible test case, which is attached. I
> sometimes have to run it 5-10 times before it triggers.
> 

And another one; I wonder if the problem is actually in the perf NMI
handler itself, look below:

[ 237.192886] ------------[ cut here ]------------
[ 237.197385] WARNING: CPU: 2 PID: 1016 at kernel/watchdog.c:246 watchdog_overflow_callback+0x9a/0xc0()
[ 237.206562] Watchdog detected hard LOCKUP on cpu 2
[ 237.211152] Modules linked in:
[ 237.214361] CPU: 2 PID: 1016 Comm: out Not tainted 3.12.0+ #47
[ 237.220160] Hardware name: FUJITSU SIEMENS AMD690VM-FMH/AMD690VM-FMH, BIOS V5.13 03/14/2008
[ 237.228475] 00000000000000f6 ffff880107d07bd8 ffffffff815bac1a ffff8800cda34f60
[ 237.235834] ffff880107d07c28 ffff880107d07c18 ffffffff8104dffc ffff880107d07e08
[ 237.243192] ffff880103d58000 0000000000000000 ffff880107d07d38 0000000000000000
[ 237.250556] Call Trace:
[ 237.252984] <NMI> [<ffffffff815bac1a>] dump_stack+0x4f/0x7c
[ 237.258706] [<ffffffff8104dffc>] warn_slowpath_common+0x8c/0xc0
[ 237.264682] [<ffffffff8104e0e6>] warn_slowpath_fmt+0x46/0x50
[ 237.270397] [<ffffffff810ef2ea>] watchdog_overflow_callback+0x9a/0xc0
[ 237.276895] [<ffffffff8111bc78>] __perf_event_overflow+0x98/0x310
[ 237.283045] [<ffffffff81118a40>] ? perf_event_task_disable+0x90/0x90
[ 237.289453] [<ffffffff8111cb44>] perf_event_overflow+0x14/0x20
[ 237.295344] [<ffffffff81019eda>] x86_pmu_handle_irq+0x12a/0x180
[ 237.301319] [<ffffffff81018ad4>] perf_event_nmi_handler+0x34/0x60
[ 237.307464] [<ffffffff81006f76>] nmi_handle.isra.3+0xc6/0x3e0
[ 237.313271] [<ffffffff81006eb5>] ? nmi_handle.isra.3+0x5/0x3e0
[ 237.319163] [<ffffffff8102fd60>] ? perf_ibs_handle_irq+0x420/0x420
[ 237.325397] [<ffffffff810073a0>] do_nmi+0x110/0x390
[ 237.330336] [<ffffffff815c9631>] end_repeat_nmi+0x1e/0x2e
[ 237.335792] [<ffffffff81005045>] ? dump_trace+0x75/0x2f0
[ 237.341162] [<ffffffff81005045>] ? dump_trace+0x75/0x2f0
[ 237.346532] [<ffffffff81005045>] ? dump_trace+0x75/0x2f0
[ 237.351900] <<EOE>> <IRQ> [<ffffffff8105ec15>] ? __sigqueue_alloc+0x5/0x280
[ 237.359091] [<ffffffff8101a4d1>] perf_callchain_kernel+0x51/0x70
[ 237.365155] [<ffffffff8111fec6>] perf_callchain+0x256/0x2c0
[ 237.370783] [<ffffffff8111bb5b>] perf_prepare_sample+0x27b/0x300
[ 237.376849] [<ffffffff810bc1ea>] ? __rcu_is_watching+0x1a/0x30
[ 237.382736] [<ffffffff8111bd2c>] __perf_event_overflow+0x14c/0x310
[ 237.388973] [<ffffffff8111bcd9>] ? __perf_event_overflow+0xf9/0x310
[ 237.395291] [<ffffffff8109aa6d>] ? trace_hardirqs_off+0xd/0x10
[ 237.401186] [<ffffffff815c8753>] ? _raw_spin_unlock_irqrestore+0x53/0x90
[ 237.407941] [<ffffffff81061b46>] ? do_send_sig_info+0x66/0x90
[ 237.413744] [<ffffffff8111c0f9>] perf_swevent_overflow+0xa9/0xc0
[ 237.419808] [<ffffffff8111c16f>] perf_swevent_event+0x5f/0x80
[ 237.425610] [<ffffffff8111c2b8>] perf_tp_event+0x128/0x420
[ 237.431154] [<ffffffff81008108>] ? smp_trace_irq_work_interrupt+0x98/0x2a0
[ 237.438085] [<ffffffff815c83b5>] ? _raw_read_unlock+0x35/0x60
[ 237.443887] [<ffffffff81003fe7>] perf_trace_x86_irq_vector+0xc7/0xe0
[ 237.450295] [<ffffffff81008108>] ? smp_trace_irq_work_interrupt+0x98/0x2a0
[ 237.457226] [<ffffffff81008108>] smp_trace_irq_work_interrupt+0x98/0x2a0
[ 237.463983] [<ffffffff815cb132>] trace_irq_work_interrupt+0x72/0x80
[ 237.470303] [<ffffffff815c8fb7>] ? retint_restore_args+0x13/0x13
[ 237.476366] [<ffffffff815c877a>] ? _raw_spin_unlock_irqrestore+0x7a/0x90
[ 237.483117] [<ffffffff810c101b>] rcu_process_callbacks+0x1db/0x530
[ 237.489360] [<ffffffff8105381d>] __do_softirq+0xdd/0x490
[ 237.494728] [<ffffffff81053fe6>] irq_exit+0x96/0xc0
[ 237.499668] [<ffffffff815cbc3a>] smp_trace_apic_timer_interrupt+0x5a/0x2b4
[ 237.506596] [<ffffffff815ca7b2>] trace_apic_timer_interrupt+0x72/0x80
[ 237.513083] <EOI> [<ffffffff812ef38a>] ? delay_tsc+0x6a/0xc0
[ 237.518895] [<ffffffff812ef382>] ? delay_tsc+0x62/0xc0
[ 237.524091] [<ffffffff812ef27f>] __delay+0xf/0x20
[ 237.528856] [<ffffffff812ef2b7>] __const_udelay+0x27/0x30
[ 237.534313] [<ffffffff810ba31c>] __rcu_read_unlock+0x5c/0xa0
[ 237.540029] [<ffffffff81122255>] find_get_page+0xf5/0x220
[ 237.545485] [<ffffffff81122165>] ? find_get_page+0x5/0x220
[ 237.551031] [<ffffffff8112497b>] filemap_fault+0x8b/0x500
[ 237.556487] [<ffffffff81148459>] __do_fault+0x69/0x4f0
[ 237.561682] [<ffffffff8114c6dc>] handle_mm_fault+0x18c/0x940
[ 237.567400] [<ffffffff81042c71>] ? __do_page_fault+0x111/0x4e0
[ 237.573289] [<ffffffff8109af14>] ? __lock_is_held+0x54/0x70
[ 237.578913] [<ffffffff81042cde>] __do_page_fault+0x17e/0x4e0
[ 237.584636] [<ffffffff812ef3a8>] ? delay_tsc+0x88/0xc0
[ 237.589831] [<ffffffff812ef27f>] ? __delay+0xf/0x20
[ 237.594769] [<ffffffff812ef2b7>] ? __const_udelay+0x27/0x30
[ 237.600399] [<ffffffff810ba31c>] ? __rcu_read_unlock+0x5c/0xa0
[ 237.606290] [<ffffffff812f06ed>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 237.612785] [<ffffffff8104307e>] do_page_fault+0xe/0x10
[ 237.618069] [<ffffffff815c9228>] page_fault+0x28/0x30
[ 237.623178] ---[ end trace 40cda30d05d0ffa6 ]---
[ 237.627769] perf samples too long (3397569 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[ 237.637124] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 444.233 msecs

444 msecs is huge.

[ 247.186562] perf samples too long (3371028 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[ 257.180241] perf samples too long (3344694 > 10000), lowering kernel.perf_event_max_sample_rate to 13000
[ 267.173920] perf samples too long (3318566 > 19230), lowering kernel.perf_event_max_sample_rate to 7000
[ 277.167598] perf samples too long (3292642 > 35714), lowering kernel.perf_event_max_sample_rate to 4000
[ 287.161277] perf samples too long (3266920 > 62500), lowering kernel.perf_event_max_sample_rate to 2000
[ 297.154956] perf samples too long (3241400 > 125000), lowering kernel.perf_event_max_sample_rate to 1000
* Re: perf/tracepoint: another fuzzer generated lockup
From: Steven Rostedt @ 2013-11-09 1:09 UTC
To: Frederic Weisbecker
Cc: Vince Weaver, LKML, Ingo Molnar, Peter Zijlstra, Dave Jones

On Fri, 8 Nov 2013 23:36:58 +0100
Frederic Weisbecker <fweisbec@gmail.com> wrote:

> [ 237.623178] ---[ end trace 40cda30d05d0ffa6 ]---
> [ 237.627769] perf samples too long (3397569 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
> [ 237.637124] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 444.233 msecs
> 
> 444 msecs is huge.

I don't think that's the issue you are looking for. Peter told me there's
a timer race with overflows or something (I don't remember the details,
Peter can clarify it) that causes the second timestamp to be bogus compared
to the first, and this makes the code think that the NMI handler was
running much longer than it actually was.

-- Steve

> [ 247.186562] perf samples too long (3371028 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
> [...]
* Re: perf/tracepoint: another fuzzer generated lockup
From: Peter Zijlstra @ 2013-11-09 14:10 UTC
To: Frederic Weisbecker
Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones

On Fri, Nov 08, 2013 at 11:36:58PM +0100, Frederic Weisbecker wrote:
> [ 237.627769] perf samples too long (3397569 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
> [ 237.637124] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 444.233 msecs
> 
> 444 msecs is huge.

Be glad your system lived to tell about it ;-) Calling printk() from NMI
context is Russian roulette; I'm still waiting for the first report that it
actually locked up :-)

That said, I'm not sure what kernel you're running, but there were some
issues with time-keeping hereabouts, but more importantly that second
timing includes the printk() call of the first -- so that's always going
to be fucked.
* Re: perf/tracepoint: another fuzzer generated lockup
From: Frederic Weisbecker @ 2013-11-09 14:20 UTC
To: Peter Zijlstra
Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones

On Sat, Nov 09, 2013 at 03:10:39PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 08, 2013 at 11:36:58PM +0100, Frederic Weisbecker wrote:
> > [ 237.627769] perf samples too long (3397569 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
> > [ 237.637124] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 444.233 msecs
> > 
> > 444 msecs is huge.
> 
> Be glad your system lived to tell about it ;-) Calling printk() from NMI
> context is Russian roulette; I'm still waiting for the first report that it
> actually locked up :-)

Haha, right. I dump with earlyprintk, but that doesn't change the fact that
it passes through the printk machinery. Fortunately I haven't yet been
burned by that. Although... maybe printk plays a role in the issue here...

> That said, I'm not sure what kernel you're running, but there were some
> issues with time-keeping hereabouts, but more importantly that second
> timing includes the printk() call of the first -- so that's always going
> to be fucked.

It's a recent tip:master. So the delta debug printout is certainly buggy;
meanwhile, these lockups only happen with Vince's selftests, and they
trigger a lot of these NMI-too-long issues -- or maybe that's the other
way round :)...

I'm trying to narrow down the issue; let's hope the lockup is not actually
due to printk itself.
* Re: perf/tracepoint: another fuzzer generated lockup
From: Ingo Molnar @ 2013-11-11 12:44 UTC
To: Frederic Weisbecker
Cc: Peter Zijlstra, Vince Weaver, Steven Rostedt, LKML, Dave Jones

* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> > That said, I'm not sure what kernel you're running, but there were
> > some issues with time-keeping hereabouts, but more importantly that
> > second timing includes the printk() call of the first -- so that's
> > always going to be fucked.
> 
> It's a recent tip:master. So the delta debug printout is certainly buggy;
> meanwhile, these lockups only happen with Vince's selftests, and they
> trigger a lot of these NMI-too-long issues -- or maybe that's the other
> way round :)...
> 
> I'm trying to narrow down the issue; let's hope the lockup is not actually
> due to printk itself.

I'd _very_ strongly suggest to not include the printk() overhead in the
execution time delta! What that function wants to report is pure NMI
execution overhead, not problem reporting overhead.

That way any large number reported there is always a bug somewhere, somehow.

Thanks,

	Ingo
* Re: perf/tracepoint: another fuzzer generated lockup
From: Peter Zijlstra @ 2013-11-11 15:53 UTC
To: Ingo Molnar
Cc: Frederic Weisbecker, Vince Weaver, Steven Rostedt, LKML, Dave Jones

On Mon, Nov 11, 2013 at 01:44:19PM +0100, Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > It's a recent tip:master. So the delta debug printout is certainly buggy;
> > meanwhile, these lockups only happen with Vince's selftests, and they
> > trigger a lot of these NMI-too-long issues -- or maybe that's the other
> > way round :)...
> > 
> > I'm trying to narrow down the issue; let's hope the lockup is not actually
> > due to printk itself.
> 
> I'd _very_ strongly suggest to not include the printk() overhead in the
> execution time delta! What that function wants to report is pure NMI
> execution overhead, not problem reporting overhead.
> 
> That way any large number reported there is always a bug somewhere,
> somehow.

-ENOPATCH :-)

You'll find that there's two levels of measuring NMI latency and the
outer will invariably include the reporting of the inner one; fixing
that is going to be hideously ugly.

That said, I would very strongly suggest to tear that printk() from the
NMI path, it's just waiting to wreck someone's machine :-)
* Re: perf/tracepoint: another fuzzer generated lockup
From: Ingo Molnar @ 2013-11-11 21:13 UTC
To: Peter Zijlstra
Cc: Frederic Weisbecker, Vince Weaver, Steven Rostedt, LKML, Dave Jones

* Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Nov 11, 2013 at 01:44:19PM +0100, Ingo Molnar wrote:
> > [...]
> > I'd _very_ strongly suggest to not include the printk() overhead in the
> > execution time delta! What that function wants to report is pure NMI
> > execution overhead, not problem reporting overhead.
> > 
> > That way any large number reported there is always a bug somewhere,
> > somehow.
> 
> -ENOPATCH :-)
> 
> You'll find that there's two levels of measuring NMI latency and the
> outer will invariably include the reporting of the inner one; fixing
> that is going to be hideously ugly.
> 
> That said, I would very strongly suggest to tear that printk() from the
> NMI path, it's just waiting to wreck someone's machine :-)

So why not just write the value somewhere and printk once at the end of
the NMI sequence, once everything is said and done?

Thanks,

	Ingo
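[Editor's note: a minimal sketch of Ingo's suggestion, for illustration
only. nmi_note_duration() and nmi_report() are hypothetical names, not
kernel APIs; DEFINE_PER_CPU, this_cpu_read/write and pr_info() are real.
The idea: record the worst-case duration in the NMI path and defer the
printk() to a context where it is safe.]

#include <linux/percpu.h>
#include <linux/printk.h>

static DEFINE_PER_CPU(u64, nmi_max_duration);

/* Called from the NMI handler: just remember the worst case. */
static void nmi_note_duration(u64 delta_ns)
{
	if (delta_ns > this_cpu_read(nmi_max_duration))
		this_cpu_write(nmi_max_duration, delta_ns);
}

/* Called later, outside NMI context, where printk() is safe. */
static void nmi_report(void)
{
	u64 max = this_cpu_read(nmi_max_duration);

	if (max) {
		pr_info("NMI handler took too long to run: %llu ns\n", max);
		this_cpu_write(nmi_max_duration, 0);
	}
}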
* Re: perf/tracepoint: another fuzzer generated lockup
From: Frederic Weisbecker @ 2013-11-09 14:52 UTC
To: Peter Zijlstra
Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones

On Sat, Nov 09, 2013 at 03:10:39PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 08, 2013 at 11:36:58PM +0100, Frederic Weisbecker wrote:
> > [ 237.627769] perf samples too long (3397569 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
> > [ 237.637124] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 444.233 msecs
> > 
> > 444 msecs is huge.
> 
> That said, I'm not sure what kernel you're running, but there were some
> issues with time-keeping hereabouts, but more importantly that second
> timing includes the printk() call of the first -- so that's always going
> to be fucked.

So, an idea of what may be happening: an event overflows while the FASYNC
flag is set, so it triggers an irq work to send the signal (kill_fasync).
After the irq work runs, it generates an irq_work_exit event, which in turn
overflows and, if it has FASYNC, triggers a new irq work. That irq work runs
and generates an irq work exit event which has the FASYNC flag, etc...

Looks like a nice way to deadlock with an infinite loop of irq work.
* Re: perf/tracepoint: another fuzzer generated lockup
From: Peter Zijlstra @ 2013-11-09 15:13 UTC
To: Frederic Weisbecker
Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones

On Sat, Nov 09, 2013 at 03:52:59PM +0100, Frederic Weisbecker wrote:
> So, an idea of what may be happening: an event overflows while the FASYNC
> flag is set, so it triggers an irq work to send the signal (kill_fasync).
> After the irq work runs, it generates an irq_work_exit event, which in turn
> overflows and, if it has FASYNC, triggers a new irq work. That irq work runs
> and generates an irq work exit event which has the FASYNC flag, etc...
> 
> Looks like a nice way to deadlock with an infinite loop of irq work.

Yep, exactly; see the email I just sent.
* Re: perf/tracepoint: another fuzzer generated lockup
From: Frederic Weisbecker @ 2013-11-09 15:27 UTC
To: Peter Zijlstra
Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones

On Sat, Nov 09, 2013 at 04:13:56PM +0100, Peter Zijlstra wrote:
> On Sat, Nov 09, 2013 at 03:52:59PM +0100, Frederic Weisbecker wrote:
> > Looks like a nice way to deadlock with an infinite loop of irq work.
> 
> Yep, exactly; see the email I just sent.

In fact, raising an irq work from an irq work should simply be prohibited.
That's not sane behaviour.

It's natural for async stuff that has reasonable delays between each pass
to allow re-enqueuing, like workqueues or rcu callbacks, or timers. But
with irq work that doesn't look right, except for lazy irq works, though.
But let's just not allow it at all :)
* Re: perf/tracepoint: another fuzzer generated lockup
From: Peter Zijlstra @ 2013-11-09 15:59 UTC
To: Frederic Weisbecker
Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones

On Sat, Nov 09, 2013 at 04:27:01PM +0100, Frederic Weisbecker wrote:
> In fact, raising an irq work from an irq work should simply be prohibited.
> That's not sane behaviour.

Well, it is, because as you raised on IRC we could be holding locks and
trying to avoid deadlocks. This is the very reason irq_work gets used in
printk.

And it's not a recursive run()->work()->queue() either, because as you said
this tracepoint is in arch code _after_ work_run completes.

All in all an exceedingly vexing issue.
* Re: perf/tracepoint: another fuzzer generated lockup
From: Frederic Weisbecker @ 2013-11-09 16:08 UTC
To: Peter Zijlstra
Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones

On Sat, Nov 09, 2013 at 04:59:38PM +0100, Peter Zijlstra wrote:
> On Sat, Nov 09, 2013 at 04:27:01PM +0100, Frederic Weisbecker wrote:
> > In fact, raising an irq work from an irq work should simply be prohibited.
> > That's not sane behaviour.
> 
> Well, it is, because as you raised on IRC we could be holding locks and
> trying to avoid deadlocks. This is the very reason irq_work gets used in
> printk.
> 
> And it's not a recursive run()->work()->queue() either, because as you said
> this tracepoint is in arch code _after_ work_run completes.

Yeah, indeed :(

> 
> All in all an exceedingly vexing issue.

Yep, the only sane way to solve this now is to check that the signal is
pending; unfortunately, as you said, it's not a given that we can even
just check that locklessly...
* Re: perf/tracepoint: another fuzzer generated lockup
From: Peter Zijlstra @ 2013-11-09 15:11 UTC
To: Frederic Weisbecker
Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones

On Fri, Nov 08, 2013 at 11:36:58PM +0100, Frederic Weisbecker wrote:
> [ 237.359091] [<ffffffff8101a4d1>] perf_callchain_kernel+0x51/0x70
> [ 237.365155] [<ffffffff8111fec6>] perf_callchain+0x256/0x2c0
> [ 237.370783] [<ffffffff8111bb5b>] perf_prepare_sample+0x27b/0x300
> [ 237.376849] [<ffffffff810bc1ea>] ? __rcu_is_watching+0x1a/0x30
> [ 237.382736] [<ffffffff8111bd2c>] __perf_event_overflow+0x14c/0x310
> [ 237.388973] [<ffffffff8111bcd9>] ? __perf_event_overflow+0xf9/0x310
> [ 237.395291] [<ffffffff8109aa6d>] ? trace_hardirqs_off+0xd/0x10
> [ 237.401186] [<ffffffff815c8753>] ? _raw_spin_unlock_irqrestore+0x53/0x90
> [ 237.407941] [<ffffffff81061b46>] ? do_send_sig_info+0x66/0x90
> [ 237.413744] [<ffffffff8111c0f9>] perf_swevent_overflow+0xa9/0xc0
> [ 237.419808] [<ffffffff8111c16f>] perf_swevent_event+0x5f/0x80
> [ 237.425610] [<ffffffff8111c2b8>] perf_tp_event+0x128/0x420
> [ 237.431154] [<ffffffff81008108>] ? smp_trace_irq_work_interrupt+0x98/0x2a0
> [ 237.438085] [<ffffffff815c83b5>] ? _raw_read_unlock+0x35/0x60
> [ 237.443887] [<ffffffff81003fe7>] perf_trace_x86_irq_vector+0xc7/0xe0
> [ 237.450295] [<ffffffff81008108>] ? smp_trace_irq_work_interrupt+0x98/0x2a0
> [ 237.457226] [<ffffffff81008108>] smp_trace_irq_work_interrupt+0x98/0x2a0
> [ 237.463983] [<ffffffff815cb132>] trace_irq_work_interrupt+0x72/0x80
> [ 237.470303] [<ffffffff815c8fb7>] ? retint_restore_args+0x13/0x13
> [ 237.476366] [<ffffffff815c877a>] ? _raw_spin_unlock_irqrestore+0x7a/0x90
> [ 237.483117] [<ffffffff810c101b>] rcu_process_callbacks+0x1db/0x530
> [ 237.489360] [<ffffffff8105381d>] __do_softirq+0xdd/0x490
> [ 237.494728] [<ffffffff81053fe6>] irq_exit+0x96/0xc0
> [ 237.499668] [<ffffffff815cbc3a>] smp_trace_apic_timer_interrupt+0x5a/0x2b4
> [ 237.506596] [<ffffffff815ca7b2>] trace_apic_timer_interrupt+0x72/0x80

Cute.. so what appears to happen is that:

 1) we trace irq_work_exit
 2) we generate event
 3) event needs to deliver signal
 4) we queue irq_work to send signal
 5) goto 1

Does something like this solve it?

---
 kernel/events/core.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4dc078d18929..a3ad40f347c4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5289,6 +5289,16 @@ static void perf_log_throttle(struct perf_event *event, int enable)
 	perf_output_end(&handle);
 }
 
+static inline void perf_pending(struct perf_event *event)
+{
+	if (in_nmi()) {
+		irq_work_pending(&event->pending);
+		return;
+	}
+
+	perf_pending_event(&event->pending);
+}
+
 /*
  * Generic event overflow handling, sampling.
  */
@@ -5345,7 +5355,7 @@ static int __perf_event_overflow(struct perf_event *event,
 		ret = 1;
 		event->pending_kill = POLL_HUP;
 		event->pending_disable = 1;
-		irq_work_queue(&event->pending);
+		perf_pending(event);
 	}
 
 	if (event->overflow_handler)
@@ -5355,7 +5365,7 @@ static int __perf_event_overflow(struct perf_event *event,
 
 	if (event->fasync && event->pending_kill) {
 		event->pending_wakeup = 1;
-		irq_work_queue(&event->pending);
+		perf_pending(event);
 	}
 
 	return ret;
* Re: perf/tracepoint: another fuzzer generated lockup
From: Frederic Weisbecker @ 2013-11-09 15:22 UTC
To: Peter Zijlstra
Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones

On Sat, Nov 09, 2013 at 04:11:01PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 08, 2013 at 11:36:58PM +0100, Frederic Weisbecker wrote:
> > [ 237.359091] [<ffffffff8101a4d1>] perf_callchain_kernel+0x51/0x70
> > [... trace quoted in full above ...]
> > [ 237.506596] [<ffffffff815ca7b2>] trace_apic_timer_interrupt+0x72/0x80
> 
> Cute.. so what appears to happen is that:
> 
>  1) we trace irq_work_exit
>  2) we generate event
>  3) event needs to deliver signal
>  4) we queue irq_work to send signal
>  5) goto 1
> 
> Does something like this solve it?
> 
> ---
>  kernel/events/core.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 4dc078d18929..a3ad40f347c4 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5289,6 +5289,16 @@ static void perf_log_throttle(struct perf_event *event, int enable)
> 	perf_output_end(&handle);
> }
> 
> +static inline void perf_pending(struct perf_event *event)
> +{
> +	if (in_nmi()) {
> +		irq_work_pending(&event->pending);

I guess you mean irq_work_queue()?

But there are many more reasons than just being in NMI to do async wakeups,
signal sending, etc... The fact that an event can happen anywhere (rq lock
acquire or whatever) makes perf events all fragile enough to always require
irq work for these.

Probably what we need is rather some limit. Maybe we can't seriously apply
recursion checks here, but perhaps the simple fact that we raise an irq work
from an irq work should trigger an alarm of some sort.
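[Editor's note: an illustrative sketch of the "alarm" Frederic proposes,
not the actual kernel implementation. A per-CPU flag marks "we are inside
irq_work_run()", and queueing from that context warns once.
irq_work_queue() and irq_work_run() are real APIs; the two wrappers are
hypothetical.]

#include <linux/irq_work.h>
#include <linux/percpu.h>
#include <linux/bug.h>

static DEFINE_PER_CPU(bool, in_irq_work);

/* Queue an irq_work, but complain if one is already running here:
 * re-queueing from inside the handler is how the infinite
 * work -> tracepoint -> overflow -> work loop above starts. */
static void irq_work_queue_checked(struct irq_work *work)
{
	WARN_ON_ONCE(this_cpu_read(in_irq_work));
	irq_work_queue(work);
}

/* Run pending irq_works with the recursion flag set. */
static void irq_work_run_flagged(void)
{
	this_cpu_write(in_irq_work, true);
	irq_work_run();
	this_cpu_write(in_irq_work, false);
}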
* Re: perf/tracepoint: another fuzzer generated lockup
From: Peter Zijlstra @ 2013-11-09 15:30 UTC
To: Frederic Weisbecker
Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones

On Sat, Nov 09, 2013 at 04:22:57PM +0100, Frederic Weisbecker wrote:
> > +static inline void perf_pending(struct perf_event *event)
> > +{
> > +	if (in_nmi()) {
> > +		irq_work_pending(&event->pending);
> 
> I guess you mean irq_work_queue()?

Uhm, yah.

> But there are many more reasons than just being in NMI to do async wakeups,
> signal sending, etc... The fact that an event can happen anywhere (rq lock
> acquire or whatever) makes perf events all fragile enough to always require
> irq work for these.

Fair enough :/

> Probably what we need is rather some limit. Maybe we can't seriously apply
> recursion checks here, but perhaps the simple fact that we raise an irq work
> from an irq work should trigger an alarm of some sort.

I think irq_work was designed explicitly to allow this -- Oleg had some
usecase for this.

So my initial approach was trying to detect if there was a fasync signal
pending and break out of the loop in that case; but fasync gives me a
bloody headache. It looks like you cannot even determine the signum you
need to test pending without acquiring locks, let alone find all the tasks
it would be raised against.
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-09 15:11 ` Peter Zijlstra 2013-11-09 15:22 ` Frederic Weisbecker @ 2013-11-14 15:23 ` Peter Zijlstra 2013-11-14 15:33 ` Peter Zijlstra ` (3 more replies) 1 sibling, 4 replies; 38+ messages in thread From: Peter Zijlstra @ 2013-11-14 15:23 UTC (permalink / raw) To: Frederic Weisbecker Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones On Sat, Nov 09, 2013 at 04:10:14PM +0100, Peter Zijlstra wrote: > Cute.. so what appears to happen is that: > > 1) we trace irq_work_exit > 2) we generate event > 3) event needs to deliver signal > 4) we queue irq_work to send signal > 5) goto 1 --- arch/x86/include/asm/trace/irq_vectors.h | 11 +++++++++++ include/linux/ftrace_event.h | 16 ++++++++++++++++ include/linux/tracepoint.h | 4 ++++ include/trace/ftrace.h | 7 +++++++ kernel/trace/trace_event_perf.c | 6 ++++++ 5 files changed, 44 insertions(+) diff --git a/arch/x86/include/asm/trace/irq_vectors.h b/arch/x86/include/asm/trace/irq_vectors.h index 2874df24e7a4..4cab890007a7 100644 --- a/arch/x86/include/asm/trace/irq_vectors.h +++ b/arch/x86/include/asm/trace/irq_vectors.h @@ -72,6 +72,17 @@ DEFINE_IRQ_VECTOR_EVENT(x86_platform_ipi); DEFINE_IRQ_VECTOR_EVENT(irq_work); /* + * We must dis-allow sampling irq_work_exit() because perf event sampling + * itself can cause irq_work, which would lead to an infinite loop; + * + * 1) irq_work_exit happens + * 2) generates perf sample + * 3) generates irq_work + * 4) goto 1 + */ +TRACE_EVENT_PERF_PERM(irq_work_exit, is_sampling_event(p_event) ? -EPERM : 0); + +/* * call_function - called when entering/exiting a call function interrupt * vector handler */ diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h index 5eaa746735ff..3e4d05566022 100644 --- a/include/linux/ftrace_event.h +++ b/include/linux/ftrace_event.h @@ -244,6 +244,9 @@ struct ftrace_event_call { #ifdef CONFIG_PERF_EVENTS int perf_refcount; struct hlist_head __percpu *perf_events; + + int (*perf_perm)(struct ftrace_event_call *, + struct perf_event *); #endif }; @@ -306,6 +309,19 @@ struct ftrace_event_file { } \ early_initcall(trace_init_flags_##name); +#define __TRACE_EVENT_PERF_PERM(name, expr) \ + static int perf_perm_##name(struct ftrace_event_call *tp_event, \ + struct perf_event *p_event) \ + { \ + return expr; \ + } \ + static int __init trace_init_flags_##name(void) \ + { \ + event_##name.perf_perm = &perf_perm_##name; \ + return 0; \ + } \ + early_initcall(trace_init_flags_##name); + #define PERF_MAX_TRACE_SIZE 2048 #define MAX_FILTER_STR_VAL 256 /* Should handle KSYM_SYMBOL_LEN */ diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h index ebeab360d851..dd5cb1dca207 100644 --- a/include/linux/tracepoint.h +++ b/include/linux/tracepoint.h @@ -267,6 +267,8 @@ static inline void tracepoint_synchronize_unregister(void) #define TRACE_EVENT_FLAGS(event, flag) +#define TRACE_EVENT_PERF_PERM(event, expr) + #endif /* DECLARE_TRACE */ #ifndef TRACE_EVENT @@ -399,4 +401,6 @@ static inline void tracepoint_synchronize_unregister(void) #define TRACE_EVENT_FLAGS(event, flag) +#define TRACE_EVENT_PERF_PERM(event, expr) + #endif /* ifdef TRACE_EVENT (see note above) */ diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h index 5c7ab17cbb02..7bdda40fc406 100644 --- a/include/trace/ftrace.h +++ b/include/trace/ftrace.h @@ -90,6 +90,10 @@ #define TRACE_EVENT_FLAGS(name, value) \ __TRACE_EVENT_FLAGS(name, value) +#undef TRACE_EVENT_PERF_PERM +#define TRACE_EVENT_PERF_PERM(name, expr) \ + 
__TRACE_EVENT_PERF_PERM(name, expr) + #include TRACE_INCLUDE(TRACE_INCLUDE_FILE) @@ -140,6 +144,9 @@ #undef TRACE_EVENT_FLAGS #define TRACE_EVENT_FLAGS(event, flag) +#undef TRACE_EVENT_PERF_PERM +#define TRACE_EVENT_PERF_PERM(event, expr) + #include TRACE_INCLUDE(TRACE_INCLUDE_FILE) /* diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c index 78e27e3b52ac..630889f68b1d 100644 --- a/kernel/trace/trace_event_perf.c +++ b/kernel/trace/trace_event_perf.c @@ -24,6 +24,12 @@ static int total_ref_count; static int perf_trace_event_perm(struct ftrace_event_call *tp_event, struct perf_event *p_event) { + if (tp_event->perf_perm) { + int ret = tp_event->perf_perm(tp_event, p_event); + if (ret) + return ret; + } + /* The ftrace function trace is allowed only for root. */ if (ftrace_event_is_function(tp_event) && perf_paranoid_tracepoint_raw() && !capable(CAP_SYS_ADMIN)) ^ permalink raw reply related [flat|nested] 38+ messages in thread
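With this in place, any tracepoint can attach a perf permission check next to its definition. A hypothetical usage example (the event name is invented; is_sampling_event() and capable() are real helpers, and p_event is the parameter name of the generated callback):

/* Refuse non-root perf sampling of some fragile tracepoint. */
TRACE_EVENT_PERF_PERM(my_fragile_event,
        is_sampling_event(p_event) && !capable(CAP_SYS_ADMIN) ? -EPERM : 0);

The macro expands to a perf_perm_my_fragile_event() callback that an early initcall hooks into the event, and perf_trace_event_perm() consults it before letting perf bind to the tracepoint.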
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-14 15:23 ` Peter Zijlstra @ 2013-11-14 15:33 ` Peter Zijlstra 2013-11-14 15:35 ` Frederic Weisbecker 2013-11-15 1:16 ` Masami Hiramatsu 2013-11-14 16:03 ` Frederic Weisbecker ` (2 subsequent siblings) 3 siblings, 2 replies; 38+ messages in thread From: Peter Zijlstra @ 2013-11-14 15:33 UTC (permalink / raw) To: Frederic Weisbecker Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones, masami.hiramatsu.pt On Thu, Nov 14, 2013 at 04:23:04PM +0100, Peter Zijlstra wrote: > /* > + * We must dis-allow sampling irq_work_exit() because perf event sampling > + * itself can cause irq_work, which would lead to an infinite loop; > + * > + * 1) irq_work_exit happens > + * 2) generates perf sample > + * 3) generates irq_work > + * 4) goto 1 > + */ > +TRACE_EVENT_PERF_PERM(irq_work_exit, is_sampling_event(p_event) ? -EPERM : 0); And the only reason this doesn't feed fwd itself into oblivion for irq_work_enter() is because the irq_work_list must not be empty when the interrupt is raised, and queueing further work does not re-raise the IPI. Also, we should probably do something 'smart' for kprobes, as all of irq_work.c and plenty of perf itself is not __kprobe marked so you're all free to insert kprobes in the middle of perf and then attach perf to such a thing. ^ permalink raw reply [flat|nested] 38+ messages in thread
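The behaviour Peter relies on here comes from the queueing path in that era's kernel/irq_work.c, which was roughly the following shape (simplified from memory -- the lazy-work special case is omitted and details may differ):

void irq_work_queue(struct irq_work *work)
{
        /* Claim the work: if it is already pending, do nothing. */
        if (!irq_work_claim(work))
                return;

        preempt_disable();

        llist_add(&work->llnode, &__get_cpu_var(irq_work_list));

        /* Raise the self-IPI at most once until the list gets processed. */
        if (!this_cpu_cmpxchg(irq_work_raised, 0, 1))
                arch_irq_work_raise();

        preempt_enable();
}

So work queued from inside the irq_work interrupt lands on an already-claimed list and does not raise a fresh interrupt, which is what keeps irq_work_enter from looping the way irq_work_exit does.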
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-14 15:33 ` Peter Zijlstra @ 2013-11-14 15:35 ` Frederic Weisbecker 2013-11-15 1:16 ` Masami Hiramatsu 1 sibling, 0 replies; 38+ messages in thread From: Frederic Weisbecker @ 2013-11-14 15:35 UTC (permalink / raw) To: Peter Zijlstra Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones, masami.hiramatsu.pt On Thu, Nov 14, 2013 at 04:33:01PM +0100, Peter Zijlstra wrote: > On Thu, Nov 14, 2013 at 04:23:04PM +0100, Peter Zijlstra wrote: > > /* > > + * We must dis-allow sampling irq_work_exit() because perf event sampling > > + * itself can cause irq_work, which would lead to an infinite loop; > > + * > > + * 1) irq_work_exit happens > > + * 2) generates perf sample > > + * 3) generates irq_work > > + * 4) goto 1 > > + */ > > +TRACE_EVENT_PERF_PERM(irq_work_exit, is_sampling_event(p_event) ? -EPERM : 0); > > And the only reason this doesn't feed fwd itself into oblivion for > irq_work_enter() is because the irq_work_list must not be empty when the > interrupt is raised, and queueing further work does not re-raise the > IPI. Right. > > > Also, we should probably do something 'smart' for kprobes, as all of > irq_work.c and plenty of perf itself is not __kprobe marked so you're > all free to insert kprobes in the middle of perf and then attach perf to > such a thing. > True, ok I'm going to comment about that on your patch. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-14 15:33 ` Peter Zijlstra 2013-11-14 15:35 ` Frederic Weisbecker @ 2013-11-15 1:16 ` Masami Hiramatsu 2013-11-15 12:28 ` Peter Zijlstra 1 sibling, 1 reply; 38+ messages in thread From: Masami Hiramatsu @ 2013-11-15 1:16 UTC (permalink / raw) To: Peter Zijlstra Cc: Frederic Weisbecker, Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones (2013/11/15 0:33), Peter Zijlstra wrote: > On Thu, Nov 14, 2013 at 04:23:04PM +0100, Peter Zijlstra wrote: >> /* >> + * We must dis-allow sampling irq_work_exit() because perf event sampling >> + * itself can cause irq_work, which would lead to an infinite loop; >> + * >> + * 1) irq_work_exit happens >> + * 2) generates perf sample >> + * 3) generates irq_work >> + * 4) goto 1 >> + */ >> +TRACE_EVENT_PERF_PERM(irq_work_exit, is_sampling_event(p_event) ? -EPERM : 0); > > And the only reason this doesn't feed fwd itself into oblivion for > irq_work_enter() is because the irq_work_list must not be empty when the > interrupt is raised, and queueing further work does not re-raise the > IPI. > > > Also, we should probably do something 'smart' for kprobes, as all of > irq_work.c and plenty of perf itself is not __kprobe marked so you're > all free to insert kprobes in the middle of perf and then attach perf to > such a thing. Kprobes itself can detect nested calls by using a per-cpu current-running kprobe pointer, and if it is nested, it just skips calling the handlers. Anyway, I don't recommend probing inside the handlers, but yes, you can trace a perf handler with ftrace B). I actually traced a kprobe bug with the kprobe tracer last night, that was amazing :) Thank you, -- Masami HIRAMATSU IT Management Research Dept. Linux Technology Center Hitachi, Ltd., Yokohama Research Laboratory E-mail: masami.hiramatsu.pt@hitachi.com ^ permalink raw reply [flat|nested] 38+ messages in thread
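Schematically, the reentrancy check Masami describes looks like this -- a heavily trimmed sketch modeled on the x86 int3 kprobe handler of that era, not verbatim kernel code:

static int sketch_kprobe_handler(struct kprobe *p, struct pt_regs *regs)
{
        if (kprobe_running()) {
                /*
                 * This CPU is already inside a kprobe handler: skip the
                 * user handlers for the nested hit and only account it.
                 */
                kprobes_inc_nmissed_count(p);
                return 1;
        }
        set_current_kprobe(p, regs, get_kprobe_ctlblk());
        /* ... invoke p->pre_handler(), single-step, post handler ... */
        return 1;
}

kprobe_running() just reads the per-cpu current_kprobe pointer Masami mentions, so a probe that fires from within another probe's handler is dropped rather than recursed into.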
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-15 1:16 ` Masami Hiramatsu @ 2013-11-15 12:28 ` Peter Zijlstra 2013-11-15 14:15 ` Steven Rostedt 0 siblings, 1 reply; 38+ messages in thread From: Peter Zijlstra @ 2013-11-15 12:28 UTC (permalink / raw) To: Masami Hiramatsu Cc: Frederic Weisbecker, Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones On Fri, Nov 15, 2013 at 10:16:18AM +0900, Masami Hiramatsu wrote: > Kprobes itself can detect nested calls by using a per-cpu current-running > kprobe pointer, and if it is nested, it just skips calling the handlers. > Anyway, I don't recommend probing inside the handlers, but yes, > you can trace a perf handler with ftrace B). I actually traced a kprobe bug > with the kprobe tracer last night, that was amazing :) Ah, ok, so that would avoid the worst problems. Good. Should we still mark the entire perf swevent path as __kprobes just to be sure? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-15 12:28 ` Peter Zijlstra @ 2013-11-15 14:15 ` Steven Rostedt 2013-11-15 14:28 ` Frederic Weisbecker 0 siblings, 1 reply; 38+ messages in thread From: Steven Rostedt @ 2013-11-15 14:15 UTC (permalink / raw) To: Peter Zijlstra Cc: Masami Hiramatsu, Frederic Weisbecker, Vince Weaver, LKML, Ingo Molnar, Dave Jones On Fri, 15 Nov 2013 13:28:33 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, Nov 15, 2013 at 10:16:18AM +0900, Masami Hiramatsu wrote: > > Kprobes itself can detect nested calls by using a per-cpu current-running > > kprobe pointer, and if it is nested, it just skips calling the handlers. > > Anyway, I don't recommend probing inside the handlers, but yes, > > you can trace a perf handler with ftrace B). I actually traced a kprobe bug > > with the kprobe tracer last night, that was amazing :) > > Ah, ok, so that would avoid the worst problems. Good. Should we still > mark the entire perf swevent path as __kprobes just to be sure? I wouldn't unless we can prove that it breaks. It's sometimes nice to be able to debug the debugging facilities with the debugging facilities ;-) -- Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-15 14:15 ` Steven Rostedt @ 2013-11-15 14:28 ` Frederic Weisbecker 2013-11-17 7:53 ` Masami Hiramatsu 0 siblings, 1 reply; 38+ messages in thread From: Frederic Weisbecker @ 2013-11-15 14:28 UTC (permalink / raw) To: Steven Rostedt Cc: Peter Zijlstra, Masami Hiramatsu, Vince Weaver, LKML, Ingo Molnar, Dave Jones On Fri, Nov 15, 2013 at 09:15:21AM -0500, Steven Rostedt wrote: > On Fri, 15 Nov 2013 13:28:33 +0100 > Peter Zijlstra <peterz@infradead.org> wrote: > > > On Fri, Nov 15, 2013 at 10:16:18AM +0900, Masami Hiramatsu wrote: > > > Kprobes itself can detect nested calls by using a per-cpu current-running > > > kprobe pointer, and if it is nested, it just skips calling the handlers. > > > Anyway, I don't recommend probing inside the handlers, but yes, > > > you can trace a perf handler with ftrace B). I actually traced a kprobe bug > > > with the kprobe tracer last night, that was amazing :) > > > > Ah, ok, so that would avoid the worst problems. Good. Should we still > > mark the entire perf swevent path as __kprobes just to be sure? > > I wouldn't unless we can prove that it breaks. It's sometimes nice to > be able to debug the debugging facilities with the debugging > facilities ;-) Even with the existing __kprobes annotations, I'm sure we can find many ways to break the kernel. We can reproduce that irq_work recursion bug by setting a kprobe at the end of the arch low-level irq_work() handler, for example, or simply somewhere in irq_exit(). I think we could do dangerous things with breakpoints as well, like setting breakpoints in do_debug() or stuff like that. So keeping up with __kprobes annotations for every potentially dangerous site is not viable IMHO. It's important to maintain basic sanity by tagging the sites that are used by kprobes itself, but we can't really prevent every issue. At some point it's up to the user to know what he's doing and avoid recursion, as long as kprobes can be set only by root. OTOH it would be nice to detect this kind of behaviour (thinking about irq_work for example) and warn when something wrong is suspected. ^ permalink raw reply [flat|nested] 38+ messages in thread
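For readers following along: the annotation under discussion simply moves a function into a dedicated section that the kprobe core refuses to probe. Roughly, per include/linux/compiler.h with CONFIG_KPROBES enabled, __kprobes expands to __attribute__((__section__(".kprobes.text"))):

/* register_kprobe() rejects any address inside .kprobes.text. */
static void __kprobes fragile_helper(void)
{
        /* code that must never itself trigger a kprobe */
}

Frederic's point is that chasing every fragile call chain with this tag does not scale.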
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-15 14:28 ` Frederic Weisbecker @ 2013-11-17 7:53 ` Masami Hiramatsu 2013-11-17 9:43 ` Peter Zijlstra 0 siblings, 1 reply; 38+ messages in thread From: Masami Hiramatsu @ 2013-11-17 7:53 UTC (permalink / raw) To: Frederic Weisbecker Cc: Steven Rostedt, Peter Zijlstra, Vince Weaver, LKML, Ingo Molnar, Dave Jones (2013/11/15 23:28), Frederic Weisbecker wrote: > On Fri, Nov 15, 2013 at 09:15:21AM -0500, Steven Rostedt wrote: >> On Fri, 15 Nov 2013 13:28:33 +0100 >> Peter Zijlstra <peterz@infradead.org> wrote: >> >>> On Fri, Nov 15, 2013 at 10:16:18AM +0900, Masami Hiramatsu wrote: >>>> Kprobes itself can detect nested calls by using a per-cpu current-running >>>> kprobe pointer, and if it is nested, it just skips calling the handlers. >>>> Anyway, I don't recommend probing inside the handlers, but yes, >>>> you can trace a perf handler with ftrace B). I actually traced a kprobe bug >>>> with the kprobe tracer last night, that was amazing :) >>> >>> Ah, ok, so that would avoid the worst problems. Good. Should we still >>> mark the entire perf swevent path as __kprobes just to be sure? >> >> I wouldn't unless we can prove that it breaks. It's sometimes nice to >> be able to debug the debugging facilities with the debugging >> facilities ;-) > > Even with the existing __kprobes annotations, I'm sure we can find many ways to > break the kernel. > > We can reproduce that irq_work recursion bug by setting a kprobe at > the end of the arch low-level irq_work() handler, for example, or simply > somewhere in irq_exit(). > > I think we could do dangerous things with breakpoints as well, like setting breakpoints > in do_debug() or stuff like that. > > So keeping up with __kprobes annotations for every potentially dangerous site > is not viable IMHO. It's important to maintain basic sanity by tagging the sites > that are used by kprobes itself, but we can't really prevent every issue. > > At some point it's up to the user to know what he's doing and avoid recursion, > as long as kprobes can be set only by root. Hmm, it would be better to add some documentation on how users can avoid such things. > OTOH it would be nice to detect this kind of behaviour (thinking about irq_work for > example) and warn when something wrong is suspected. Agreed. FYI, kprobes has a recursion detection counter and it is reported via debugfs/tracing/kprobe_profile :) Thank you, -- Masami HIRAMATSU IT Management Research Dept. Linux Technology Center Hitachi, Ltd., Yokohama Research Laboratory E-mail: masami.hiramatsu.pt@hitachi.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-17 7:53 ` Masami Hiramatsu @ 2013-11-17 9:43 ` Peter Zijlstra 0 siblings, 0 replies; 38+ messages in thread From: Peter Zijlstra @ 2013-11-17 9:43 UTC (permalink / raw) To: Masami Hiramatsu Cc: Frederic Weisbecker, Steven Rostedt, Vince Weaver, LKML, Ingo Molnar, Dave Jones On Sun, Nov 17, 2013 at 04:53:59PM +0900, Masami Hiramatsu wrote: > Agreed. > FYI, kprobes has a recursion detection counter and it is reported via > debugfs/tracing/kprobe_profile :) Perf also has a recursion counter (4 actually, not reported to userspace), but the problem with the irq_work_exit thing is that it's a non-recursive event loop, so recursion checks don't actually work to catch it. ^ permalink raw reply [flat|nested] 38+ messages in thread
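The four counters Peter refers to are the per-CPU, per-context recursion flags of the software event path -- roughly this, cf. get_recursion_context() in kernel/events/internal.h of that era:

static inline int get_recursion_context(int *recursion)
{
        int rctx;

        if (in_nmi())
                rctx = 3;
        else if (in_irq())
                rctx = 2;
        else if (in_softirq())
                rctx = 1;
        else
                rctx = 0;

        if (recursion[rctx])
                return -1;      /* already in a swevent in this context: drop */

        recursion[rctx]++;
        barrier();

        return rctx;
}

This catches an event firing inside another event in the same context, but the irq_work_exit loop is a hand-off across successive interrupts: each iteration starts with all four counters clear, so nothing trips.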
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-14 15:23 ` Peter Zijlstra 2013-11-14 15:33 ` Peter Zijlstra @ 2013-11-14 16:03 ` Frederic Weisbecker 2013-11-14 17:20 ` Vince Weaver 2013-11-19 19:18 ` [tip:perf/urgent] ftrace, perf: Avoid infinite event generation loop tip-bot for Peter Zijlstra 3 siblings, 0 replies; 38+ messages in thread From: Frederic Weisbecker @ 2013-11-14 16:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones On Thu, Nov 14, 2013 at 04:23:04PM +0100, Peter Zijlstra wrote: > On Sat, Nov 09, 2013 at 04:10:14PM +0100, Peter Zijlstra wrote: > > Cute.. so what appears to happen is that: > > > > 1) we trace irq_work_exit > > 2) we generate event > > 3) event needs to deliver signal > > 4) we queue irq_work to send signal > > 5) goto 1 > > --- > arch/x86/include/asm/trace/irq_vectors.h | 11 +++++++++++ > include/linux/ftrace_event.h | 16 ++++++++++++++++ > include/linux/tracepoint.h | 4 ++++ > include/trace/ftrace.h | 7 +++++++ > kernel/trace/trace_event_perf.c | 6 ++++++ > 5 files changed, 44 insertions(+) > > diff --git a/arch/x86/include/asm/trace/irq_vectors.h b/arch/x86/include/asm/trace/irq_vectors.h > index 2874df24e7a4..4cab890007a7 100644 > --- a/arch/x86/include/asm/trace/irq_vectors.h > +++ b/arch/x86/include/asm/trace/irq_vectors.h > @@ -72,6 +72,17 @@ DEFINE_IRQ_VECTOR_EVENT(x86_platform_ipi); > DEFINE_IRQ_VECTOR_EVENT(irq_work); > > /* > + * We must dis-allow sampling irq_work_exit() because perf event sampling > + * itself can cause irq_work, which would lead to an infinite loop; > + * > + * 1) irq_work_exit happens > + * 2) generates perf sample > + * 3) generates irq_work > + * 4) goto 1 > + */ > +TRACE_EVENT_PERF_PERM(irq_work_exit, is_sampling_event(p_event) ? 
-EPERM : 0); > + > +/* > * call_function - called when entering/exiting a call function interrupt > * vector handler > */ > diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h > index 5eaa746735ff..3e4d05566022 100644 > --- a/include/linux/ftrace_event.h > +++ b/include/linux/ftrace_event.h > @@ -244,6 +244,9 @@ struct ftrace_event_call { > #ifdef CONFIG_PERF_EVENTS > int perf_refcount; > struct hlist_head __percpu *perf_events; > + > + int (*perf_perm)(struct ftrace_event_call *, > + struct perf_event *); > #endif > }; > > @@ -306,6 +309,19 @@ struct ftrace_event_file { > } \ > early_initcall(trace_init_flags_##name); > > +#define __TRACE_EVENT_PERF_PERM(name, expr) \ > + static int perf_perm_##name(struct ftrace_event_call *tp_event, \ > + struct perf_event *p_event) \ > + { \ > + return expr; \ > + } \ > + static int __init trace_init_flags_##name(void) \ > + { \ > + event_##name.perf_perm = &perf_perm_##name; \ > + return 0; \ > + } \ > + early_initcall(trace_init_flags_##name); > + > #define PERF_MAX_TRACE_SIZE 2048 > > #define MAX_FILTER_STR_VAL 256 /* Should handle KSYM_SYMBOL_LEN */ > diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h > index ebeab360d851..dd5cb1dca207 100644 > --- a/include/linux/tracepoint.h > +++ b/include/linux/tracepoint.h > @@ -267,6 +267,8 @@ static inline void tracepoint_synchronize_unregister(void) > > #define TRACE_EVENT_FLAGS(event, flag) > > +#define TRACE_EVENT_PERF_PERM(event, expr) > + > #endif /* DECLARE_TRACE */ > > #ifndef TRACE_EVENT > @@ -399,4 +401,6 @@ static inline void tracepoint_synchronize_unregister(void) > > #define TRACE_EVENT_FLAGS(event, flag) > > +#define TRACE_EVENT_PERF_PERM(event, expr) > + > #endif /* ifdef TRACE_EVENT (see note above) */ > diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h > index 5c7ab17cbb02..7bdda40fc406 100644 > --- a/include/trace/ftrace.h > +++ b/include/trace/ftrace.h > @@ -90,6 +90,10 @@ > #define TRACE_EVENT_FLAGS(name, value) \ > __TRACE_EVENT_FLAGS(name, value) > > +#undef TRACE_EVENT_PERF_PERM > +#define TRACE_EVENT_PERF_PERM(name, expr) \ > + __TRACE_EVENT_PERF_PERM(name, expr) > + > #include TRACE_INCLUDE(TRACE_INCLUDE_FILE) > > > @@ -140,6 +144,9 @@ > #undef TRACE_EVENT_FLAGS > #define TRACE_EVENT_FLAGS(event, flag) > > +#undef TRACE_EVENT_PERF_PERM > +#define TRACE_EVENT_PERF_PERM(event, expr) > + > #include TRACE_INCLUDE(TRACE_INCLUDE_FILE) > > /* > diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c > index 78e27e3b52ac..630889f68b1d 100644 > --- a/kernel/trace/trace_event_perf.c > +++ b/kernel/trace/trace_event_perf.c > @@ -24,6 +24,12 @@ static int total_ref_count; > static int perf_trace_event_perm(struct ftrace_event_call *tp_event, > struct perf_event *p_event) > { > + if (tp_event->perf_perm) { > + int ret = tp_event->perf_perm(tp_event, p_event); > + if (ret) > + return ret; > + } > + > /* The ftrace function trace is allowed only for root. */ > if (ftrace_event_is_function(tp_event) && > perf_paranoid_tracepoint_raw() && !capable(CAP_SYS_ADMIN)) Looks good; I can't think of a more general solution. Now, thinking about kprobes, which can be used to reproduce this same kind of scenario when put in the right place, perhaps we need to do something in perf_swevent_overflow() that rejects events once they cross some dangerous threshold, arbitrarily defined on top of what looks like a storm that can trigger a lockup. ^ permalink raw reply [flat|nested] 38+ messages in thread
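A cap along the lines Frederic suggests could sit in the overflow loop. A sketch only -- the guard and its threshold are invented, and the surrounding loop is paraphrased from the real perf_swevent_overflow() of that era:

static void perf_swevent_overflow(struct perf_event *event, u64 overflow,
                                  struct perf_sample_data *data,
                                  struct pt_regs *regs)
{
        int throttle = 0;

        /* Invented guard: refuse implausibly large overflow bursts. */
        if (WARN_ON_ONCE(overflow > 100000))
                return;

        for (; overflow; overflow--) {
                if (__perf_event_overflow(event, throttle, data, regs))
                        break;  /* throttled: hwc->interrupts hit MAX_INTERRUPTS */
                throttle = 1;
        }
}

Picking a threshold that separates a fuzzer-induced storm from a legitimately busy event is exactly the open question here.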
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-14 15:23 ` Peter Zijlstra 2013-11-14 15:33 ` Peter Zijlstra 2013-11-14 16:03 ` Frederic Weisbecker @ 2013-11-14 17:20 ` Vince Weaver 2013-11-14 17:14 ` Peter Zijlstra 2013-11-19 19:18 ` [tip:perf/urgent] ftrace, perf: Avoid infinite event generation loop tip-bot for Peter Zijlstra 3 siblings, 1 reply; 38+ messages in thread From: Vince Weaver @ 2013-11-14 17:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Frederic Weisbecker, Vince Weaver, Steven Rostedt, LKML, Ingo Molnar, Dave Jones possibly not an issue with your patch, but found in nearby code: in perf_trace_init: int event_id = p_event->attr.config; ... if (tp_event->event.type == event_id && note that "attr.config" is 64-bit but "event_id" is only 32-bit, so things like 0xffffffff00000018 will map to irq_work_exit. I'm not sure if your PERM code properly handles this case, I don't have time to work through the macro-magic in your patch. Vince ^ permalink raw reply [flat|nested] 38+ messages in thread
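The narrowing Vince describes is easy to demonstrate in isolation; this is stand-alone user-space C mirroring the two kernel lines quoted above:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t config = 0xffffffff00000018ULL; /* attr.config as the fuzzer sets it */
        int event_id = config;                   /* the narrowing assignment in perf_trace_init */

        printf("config   = %#llx\n", (unsigned long long)config);
        printf("event_id = %#x\n", event_id);    /* prints 0x18 */
        return 0;
}

So any config value whose low 32 bits match a valid tracepoint id passes the comparison, whatever the high bits contain.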
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-14 17:20 ` Vince Weaver @ 2013-11-14 17:14 ` Peter Zijlstra 2013-11-14 17:41 ` Steven Rostedt 2013-11-14 19:18 ` Vince Weaver 0 siblings, 2 replies; 38+ messages in thread From: Peter Zijlstra @ 2013-11-14 17:14 UTC (permalink / raw) To: Vince Weaver Cc: Frederic Weisbecker, Steven Rostedt, LKML, Ingo Molnar, Dave Jones On Thu, Nov 14, 2013 at 12:20:35PM -0500, Vince Weaver wrote: > > possibly not an issue with your patch, but found in nearby code: > > in perf_trace_init: > > int event_id = p_event->attr.config; > > ... > > if (tp_event->event.type == event_id && > > > note that "attr.config" is 64-bit but "event_id" is only 32-bit, > so things like 0xffffffff00000018 will map to irq_work_exit. I will have a look. > I'm not sure if your PERM code properly handles this case, I don't have > time to work through the macro-magic in your patch. Can't blame you; took me a fair chunk of the day to come up with this. That trace event stuff is horrid. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-14 17:14 ` Peter Zijlstra @ 2013-11-14 17:41 ` Steven Rostedt 2013-11-14 19:18 ` Vince Weaver 1 sibling, 0 replies; 38+ messages in thread From: Steven Rostedt @ 2013-11-14 17:41 UTC (permalink / raw) To: Peter Zijlstra Cc: Vince Weaver, Frederic Weisbecker, LKML, Ingo Molnar, Dave Jones On Thu, 14 Nov 2013 18:14:05 +0100 Peter Zijlstra <peterz@infradead.org> wrote: > Can't blame you; took me a fair chunk of the day to come up with this. > That trace event stuff is horrid. I feel honored ;-) -- Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-14 17:14 ` Peter Zijlstra 2013-11-14 17:41 ` Steven Rostedt @ 2013-11-14 19:18 ` Vince Weaver 1 sibling, 0 replies; 38+ messages in thread From: Vince Weaver @ 2013-11-14 19:18 UTC (permalink / raw) To: Peter Zijlstra Cc: Vince Weaver, Frederic Weisbecker, Steven Rostedt, LKML, Ingo Molnar, Dave Jones On Thu, 14 Nov 2013, Peter Zijlstra wrote: > On Thu, Nov 14, 2013 at 12:20:35PM -0500, Vince Weaver wrote: > > > > possibly not an issue with your patch, but found in nearby code: > > > > in perf_trace_init: > > > > int event_id = p_event->attr.config; > > > > ... > > > > if (tp_event->event.type == event_id && > > > > > > note that "attr.config" is 64-bit but "event_id" is only 32-bit, > > so things like 0xffffffff00000018 will map to irq_work_exit. > > I will have a look. Assuming this is a real issue, I'll submit a patch to fix this if I'm not beaten to it by someone else. > > I'm not sure if your PERM code properly handles this case, I don't have > > time to work through the macro-magic in your patch. > > Can't blame you; took me a fair chunk of the day to come up with this. > That trace event stuff is horrid. I did some tests both with the high bits set and non-set and wasn't able to trigger the bug. Also ran the fuzzer for a while and while it keeps finding other bugs, it hasn't tripped over any that were overtly tracepoint related. So you can add my Tested-by: Vince Weaver <vincent.weaver@maine.edu> Vince ^ permalink raw reply [flat|nested] 38+ messages in thread
* [tip:perf/urgent] ftrace, perf: Avoid infinite event generation loop 2013-11-14 15:23 ` Peter Zijlstra ` (2 preceding siblings ...) 2013-11-14 17:20 ` Vince Weaver @ 2013-11-19 19:18 ` tip-bot for Peter Zijlstra 3 siblings, 0 replies; 38+ messages in thread From: tip-bot for Peter Zijlstra @ 2013-11-19 19:18 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, peterz, davej, vincent.weaver, fweisbec, rostedt, tglx Commit-ID: d5b5f391d434c5cc8bcb1ab2d759738797b85f52 Gitweb: http://git.kernel.org/tip/d5b5f391d434c5cc8bcb1ab2d759738797b85f52 Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Thu, 14 Nov 2013 16:23:04 +0100 Committer: Ingo Molnar <mingo@kernel.org> CommitDate: Tue, 19 Nov 2013 16:57:40 +0100 ftrace, perf: Avoid infinite event generation loop Vince's perf-trinity fuzzer found yet another 'interesting' problem. When we sample the irq_work_exit tracepoint with period==1 (or PERF_SAMPLE_PERIOD) and we add an fasync SIGNAL handler we create an infinite event generation loop: ,-> <IPI> | irq_work_exit() -> | trace_irq_work_exit() -> | ... | __perf_event_overflow() -> (due to fasync) | irq_work_queue() -> (irq_work_list must be empty) '--------- arch_irq_work_raise() Similar things can happen due to regular poll() wakeups if we exceed the ring-buffer wakeup watermark, or have an event_limit. To avoid this, dis-allow sampling this particular tracepoint. In order to achieve this, create a special perf_perm function pointer for each event and call this (when set) on trying to create a tracepoint perf event. [ roasted: use expr... to allow for ',' in your expression ] Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Dave Jones <davej@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20131114152304.GC5364@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> --- arch/x86/include/asm/trace/irq_vectors.h | 11 +++++++++++ include/linux/ftrace_event.h | 16 ++++++++++++++++ include/linux/tracepoint.h | 4 ++++ include/trace/ftrace.h | 7 +++++++ kernel/trace/trace_event_perf.c | 6 ++++++ 5 files changed, 44 insertions(+) diff --git a/arch/x86/include/asm/trace/irq_vectors.h b/arch/x86/include/asm/trace/irq_vectors.h index 2874df2..4cab890 100644 --- a/arch/x86/include/asm/trace/irq_vectors.h +++ b/arch/x86/include/asm/trace/irq_vectors.h @@ -72,6 +72,17 @@ DEFINE_IRQ_VECTOR_EVENT(x86_platform_ipi); DEFINE_IRQ_VECTOR_EVENT(irq_work); /* + * We must dis-allow sampling irq_work_exit() because perf event sampling + * itself can cause irq_work, which would lead to an infinite loop; + * + * 1) irq_work_exit happens + * 2) generates perf sample + * 3) generates irq_work + * 4) goto 1 + */ +TRACE_EVENT_PERF_PERM(irq_work_exit, is_sampling_event(p_event) ? 
-EPERM : 0); + +/* * call_function - called when entering/exiting a call function interrupt * vector handler */ diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h index 9abbe63..8c9b7a1 100644 --- a/include/linux/ftrace_event.h +++ b/include/linux/ftrace_event.h @@ -248,6 +248,9 @@ struct ftrace_event_call { #ifdef CONFIG_PERF_EVENTS int perf_refcount; struct hlist_head __percpu *perf_events; + + int (*perf_perm)(struct ftrace_event_call *, + struct perf_event *); #endif }; @@ -317,6 +320,19 @@ struct ftrace_event_file { } \ early_initcall(trace_init_flags_##name); +#define __TRACE_EVENT_PERF_PERM(name, expr...) \ + static int perf_perm_##name(struct ftrace_event_call *tp_event, \ + struct perf_event *p_event) \ + { \ + return ({ expr; }); \ + } \ + static int __init trace_init_perf_perm_##name(void) \ + { \ + event_##name.perf_perm = &perf_perm_##name; \ + return 0; \ + } \ + early_initcall(trace_init_perf_perm_##name); + #define PERF_MAX_TRACE_SIZE 2048 #define MAX_FILTER_STR_VAL 256 /* Should handle KSYM_SYMBOL_LEN */ diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h index ebeab36..f16dc0a 100644 --- a/include/linux/tracepoint.h +++ b/include/linux/tracepoint.h @@ -267,6 +267,8 @@ static inline void tracepoint_synchronize_unregister(void) #define TRACE_EVENT_FLAGS(event, flag) +#define TRACE_EVENT_PERF_PERM(event, expr...) + #endif /* DECLARE_TRACE */ #ifndef TRACE_EVENT @@ -399,4 +401,6 @@ static inline void tracepoint_synchronize_unregister(void) #define TRACE_EVENT_FLAGS(event, flag) +#define TRACE_EVENT_PERF_PERM(event, expr...) + #endif /* ifdef TRACE_EVENT (see note above) */ diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h index 52594b2..6b852f6 100644 --- a/include/trace/ftrace.h +++ b/include/trace/ftrace.h @@ -90,6 +90,10 @@ #define TRACE_EVENT_FLAGS(name, value) \ __TRACE_EVENT_FLAGS(name, value) +#undef TRACE_EVENT_PERF_PERM +#define TRACE_EVENT_PERF_PERM(name, expr...) \ + __TRACE_EVENT_PERF_PERM(name, expr) + #include TRACE_INCLUDE(TRACE_INCLUDE_FILE) @@ -140,6 +144,9 @@ #undef TRACE_EVENT_FLAGS #define TRACE_EVENT_FLAGS(event, flag) +#undef TRACE_EVENT_PERF_PERM +#define TRACE_EVENT_PERF_PERM(event, expr...) + #include TRACE_INCLUDE(TRACE_INCLUDE_FILE) /* diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c index 78e27e3..630889f 100644 --- a/kernel/trace/trace_event_perf.c +++ b/kernel/trace/trace_event_perf.c @@ -24,6 +24,12 @@ static int total_ref_count; static int perf_trace_event_perm(struct ftrace_event_call *tp_event, struct perf_event *p_event) { + if (tp_event->perf_perm) { + int ret = tp_event->perf_perm(tp_event, p_event); + if (ret) + return ret; + } + /* The ftrace function trace is allowed only for root. */ if (ftrace_event_is_function(tp_event) && perf_paranoid_tracepoint_raw() && !capable(CAP_SYS_ADMIN)) ^ permalink raw reply related [flat|nested] 38+ messages in thread
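One detail of the final version worth noting: the macro argument became variadic (expr...) and is evaluated inside a statement expression, return ({ expr; });, so a permission check may contain top-level commas and multiple statements. A hypothetical example (the event name is invented; the helpers are real):

TRACE_EVENT_PERF_PERM(some_event,
        int sampling = is_sampling_event(p_event), root = capable(CAP_SYS_ADMIN);
        (sampling && !root) ? -EPERM : 0);

The single-argument form posted earlier in the thread would have split this at the first comma.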
* Re: perf/tracepoint: another fuzzer generated lockup 2013-11-08 21:15 ` Vince Weaver 2013-11-08 22:24 ` Frederic Weisbecker 2013-11-08 22:36 ` Frederic Weisbecker @ 2013-11-09 0:25 ` Frederic Weisbecker 2 siblings, 0 replies; 38+ messages in thread From: Frederic Weisbecker @ 2013-11-09 0:25 UTC (permalink / raw) To: Vince Weaver Cc: Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Dave Jones On Fri, Nov 08, 2013 at 04:15:21PM -0500, Vince Weaver wrote: > > int main(int argc, char **argv) { > /* 1 */ > /* fd = 82 */ > > memset(&pe[82],0,sizeof(struct perf_event_attr)); > pe[82].type=PERF_TYPE_TRACEPOINT; > pe[82].size=80; > pe[82].config=0x18; I did some more testing and 0x18 on my setup is the irq_work_exit event. I tried a few other tracepoints (tried the timer interrupt for example) but they did not trigger the issue. And since the irq work handles the pending disable and wakeup, maybe that fiddles with some perf state and makes the NMI handler run into some buggy loop. Perhaps that's triggered with some help from the disable ioctl. I don't know, my bed is calling so I'll test that further tomorrow or so. Thanks. > pe[82].sample_period=0xdd95cac7b947b610; > pe[82].sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TIME|PERF_SAMPLE_READ|PERF_SAMPLE_CALLCHAIN|PERF_SAMPLE_PERIOD|PERF_SAMPLE_DATA_SRC; > //|PERF_SAMPLE_IDENTIFIER; /* 18135 */ > pe[82].read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING|PERF_FORMAT_ID|PERF_FORMAT_GROUP; /* f */ > pe[82].exclude_hv=1; > pe[82].mmap=1; > pe[82].comm=1; > pe[82].watermark=1; > pe[82].precise_ip=1; /* constant skid */ > pe[82].mmap_data=1; > pe[82].exclude_host=1; > pe[82].exclude_guest=1; > pe[82].exclude_callchain_user=1; > pe[82].wakeup_watermark=232; > pe[82].bp_type=HW_BREAKPOINT_EMPTY; > pe[82].config1=0x9; > pe[82].branch_sample_type=PERF_SAMPLE_BRANCH_USER|PERF_SAMPLE_BRANCH_KERNEL|PERF_SAMPLE_BRANCH_HV|PERF_SAMPLE_BRANCH_ANY|PERF_SAMPLE_BRANCH_ANY_CALL|PERF_SAMPLE_BRANCH_ANY_RETURN|PERF_SAMPLE_BRANCH_IND_CALL; > //|PERF_SAMPLE_BRANCH_ABORT_TX; > pe[82].sample_regs_user=39; > pe[82].sample_stack_user=37552; > > fd[82]=perf_event_open(&pe[82], > 0, /* current thread */ > -1, /* all cpus */ > -1, /* New Group Leader */ > PERF_FLAG_FD_OUTPUT /*2*/ ); > > > /* 2 */ > mmap_result[82]=mmap(NULL, 20480,PROT_READ|PROT_WRITE, MAP_SHARED,fd[82], 0); > /* 3 */ > memset(&sa, 0, sizeof(struct sigaction)); > sa.sa_sigaction = our_handler; > sa.sa_flags = SA_SIGINFO; > if (sigaction( SIGRTMIN+2, &sa, NULL) < 0) { > printf("Error setting up signal handler\n"); > } > fcntl(fd[82], F_SETFL, O_RDWR|O_NONBLOCK|O_ASYNC); > fcntl(fd[82], F_SETSIG, SIGRTMIN+2); > fcntl(fd[82], F_SETOWN,getpid()); > /* Replayed 3 syscalls */ > return 0; > } ^ permalink raw reply [flat|nested] 38+ messages in thread