[BUG] Core2 cpu triggers hard lockup with perf test

From: Jiri Olsa @ 2016-02-27 12:37 UTC
  To: Arnaldo Carvalho de Melo, Ingo Molnar, Peter Zijlstra,
	Andi Kleen, Stephane Eranian, Wang Nan, zheng.z.yan, Kan Liang
  Cc: LKML

hi,
we are getting hard lockups on Core2 cpus (model 23)
just by running 'perf test'

PID: 10425  TASK: ffff880068562e00  CPU: 3   COMMAND: "perf"
 #0 [ffff88007d985a08] machine_kexec at ffffffff8105521b
 #1 [ffff88007d985a68] crash_kexec at ffffffff810f7412
 #2 [ffff88007d985b38] panic at ffffffff8163c031
 #3 [ffff88007d985bb8] watchdog_overflow_callback at ffffffff81120472
 #4 [ffff88007d985bc8] __perf_event_overflow at ffffffff81164e0e
 #5 [ffff88007d985c00] perf_event_overflow at ffffffff81165a44
 #6 [ffff88007d985c10] intel_pmu_handle_irq at ffffffff81033198
 #7 [ffff88007d985e60] perf_event_nmi_handler at ffffffff8164be8b
 #8 [ffff88007d985e80] nmi_handle at ffffffff8164b5d9
 #9 [ffff88007d985ec8] do_nmi at ffffffff8164b789
#10 [ffff88007d985ef0] end_repeat_nmi at ffffffff8164aa13
    [exception RIP: intel_pmu_enable_all+17]
    RIP: ffffffff81032301  RSP: ffff88005e917c98  RFLAGS: 00000046
    RAX: ffff88007d98cd20  RBX: ffff88005e991000  RCX: 000000000000038f
    RDX: 0000000000000007  RSI: 0000000000000003  RDI: 0000000000000000
    RBP: ffff88005e917cd8   R8: ffffffffffffff85   R9: 000000ffffffffff
    R10: ffff88007d98c100  R11: ffff88005e9179e0  R12: ffff88007d98bd10
    R13: ffff88007d98b9e0  R14: ffff88007d98bc08  R15: 0000000000000002
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#11 [ffff88005e917c98] intel_pmu_enable_all at ffffffff81032301
#12 [ffff88005e917c98] x86_pmu_enable at ffffffff8102ba24
#13 [ffff88005e917ce0] perf_pmu_enable at ffffffff81160457
#14 [ffff88005e917cf0] perf_event_context_sched_in at ffffffff81161930
#15 [ffff88005e917d20] perf_event_exec at ffffffff811621db
#16 [ffff88005e917d68] setup_new_exec at ffffffff811edffd
#17 [ffff88005e917d88] load_elf_binary at ffffffff81240ed9
#18 [ffff88005e917e58] search_binary_handler at ffffffff811ec89d
#19 [ffff88005e917ea0] do_execve_common at ffffffff811ede04
#20 [ffff88005e917f30] sys_execve at ffffffff811ee199
#21 [ffff88005e917f50] stub_execve at ffffffff816531a9

the reproducer seems to be a hw event with a very small
period, like (thanks Arnaldo ;-):
  perf record -e cycles -c 123 kill
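
To give a feel for how small that period is, here is a rough
back-of-the-envelope sketch. The clock frequency is an assumption
(the report does not state it), and the helper name is made up for
illustration; only the period of 123 comes from the command above.

```python
def overflows_per_second(cpu_hz: float, sample_period: int) -> float:
    """Counter overflows per second for a cycles event with a fixed period,
    assuming the counter is counting for the whole second."""
    if sample_period <= 0:
        raise ValueError("sample_period must be positive")
    return cpu_hz / sample_period


ASSUMED_CPU_HZ = 2.0e9  # hypothetical clock speed, not taken from the report
PERIOD = 123            # from 'perf record -e cycles -c 123'

rate = overflows_per_second(ASSUMED_CPU_HZ, PERIOD)
print(f"{rate:,.0f} overflows/s, one every {1e9 / rate:.1f} ns")
```

Under that assumed 2 GHz clock this works out to roughly 16 million
counter overflows per second, i.e. one about every 61.5 ns of counted
time, which is far more often than a typical sampling setup.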

I bisected it down to:
  156174999dd1 perf/intel/x86: Enlarge the PEBS buffer

It looks like the bigger PEBS buffer, together with the event being
marked as PERF_X86_EVENT_FREERUNNING, blocks the CPU right after the
event is enabled, before it can reach local_irq_enable, which then
triggers the NMI watchdog.

I can't find what's special about the Core2 CPU PEBS setup;
other CPUs seem to be ok (tried on ivb/snb/hsw).

reverting 156174999dd1 fixed the issue for me

ideas? thanks,
jirka
