From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752858Ab3CNVGE (ORCPT );
	Thu, 14 Mar 2013 17:06:04 -0400
Received: from mail-vb0-f51.google.com ([209.85.212.51]:34000 "EHLO
	mail-vb0-f51.google.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1752501Ab3CNVGD (ORCPT );
	Thu, 14 Mar 2013 17:06:03 -0400
MIME-Version: 1.0
In-Reply-To:
References: <20130226070247.GA14094@gmail.com>
Date: Thu, 14 Mar 2013 14:06:01 -0700
X-Google-Sender-Auth: nAt9pPsaJM287yNP2ABtfEcvDPQ
Message-ID:
Subject: Re: [GIT PULL] perf fixes
From: Linus Torvalds
To: Ingo Molnar , Stephane Eranian
Cc: Arnaldo Carvalho de Melo , Peter Zijlstra , Thomas Gleixner ,
	Andrew Morton , Linux Kernel Mailing List
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Mar 14, 2013 at 1:32 PM, Linus Torvalds wrote:
>
> And to make things interesting, I seem to be able to only reproduce
> this *after* a suspend cycle. That may be just happenstance, since it
> seemed to be hard to replicate and most of the time it has happened
> under X with no messages visible at all, but that *seems* to be the
> pattern.
>
> And the one time I got it to happen on the text console, things
> scrolled off (watchdog warnings due to lockups), but I did get a NULL
> pointer dereference in intel_pmu_enable_all().
>
> I'll try to reproduce it and get a picture,

Theory more or less confirmed. It does need a suspend/resume cycle,
and I have a picture. The oops happens immediately when trying to do
any perf work after the first suspend, before suspending I seem to be
able to reliably use perf. It could still be just random flakiness,
but I don't think so.

The NULL pointer dereference is at

    intel_pmu_enable_all+0x4d/0xa0

for me, which seems to be the load of the

    if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))

thing.
It says

    BUG: unable to handle NULL pointer dereference at 0000000000000028

But that error makes no sense. The code at that EIP is

    48 8b 83 00 02 00 00    mov 0x200(%rbx),%rax    <-- trapping instruction

and the value printed out for %rbx is 0xffff80014f20b8e0, so it should
*not* be a NULL pointer dereference (and "cpuc" was also used just
before the wrmsrl).

So I suspect that the "wrmsrl" that was just before that instruction
does something odd, and the PMU is in some odd state, so that the NULL
pointer dereference actually has something to do with *that*, rather
than the instruction itself.

The callchain looks normal. It's

    finish_task_switch ->
    __perf_event_task_sched_in ->
    perf_event_context_sched_in ->
    perf_pmu_enable ->
    x86_pmu_enable ->
    intel_pmu_enable_all()

The immediately preceding wrmsrl was done with rax=0xf, rdx=0x7,
rcx=0x38f according to the register dump (but the picture isn't great,
so the numbers aren't 100% reliable).

Does this give any clues?

                Linus