From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758278Ab2DFWkj (ORCPT ); Fri, 6 Apr 2012 18:40:39 -0400 Received: from www.linutronix.de ([62.245.132.108]:40421 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755403Ab2DFWki (ORCPT ); Fri, 6 Apr 2012 18:40:38 -0400 Date: Sat, 7 Apr 2012 00:40:28 +0200 (CEST) From: Thomas Gleixner To: Jiri Slaby cc: Chris Wilson , Jiri Slaby , Keith Packard , dri-devel@lists.freedesktop.org, LKML , daniel@ffwll.ch Subject: Re: i915_driver_irq_handler: irq 42: nobody cared [generic IRQ handling broken?] In-Reply-To: <4F7F60BF.9070500@suse.cz> Message-ID: References: <4F717CE3.4040206@suse.cz> <4F717D80.9040207@suse.cz> <4F758400.3080907@suse.cz> <1333104359_155028@CP5-2952> <4F75A303.3030409@suse.cz> <1333110296_156038@CP5-2952> <4F7F60BF.9070500@suse.cz> User-Agent: Alpine 2.02 (LFD 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Linutronix-Spam-Score: -1.0 X-Linutronix-Spam-Level: - X-Linutronix-Spam-Status: No , -1.0 points, 5.0 required, ALL_TRUSTED=-1,SHORTCIRCUIT=-0.0001 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 6 Apr 2012, Jiri Slaby wrote: > On 03/30/2012 02:24 PM, Chris Wilson wrote: > > On Fri, 30 Mar 2012 14:11:47 +0200, Jiri Slaby wrote: > >> On 03/30/2012 12:45 PM, Chris Wilson wrote: > >>> On Fri, 30 Mar 2012 11:59:28 +0200, Jiri Slaby wrote: > >>>> I don't know what to dump more, because iir is obviously zero too. What > >>>> other sources of interrupts are on the (G33) chip? > >>> > >>> IIR is the master interrupt, with chained secondary interrupt statuses. > >>> If IIR is 0, the interrupt wasn't raised by the GPU. > >> > >> This does not make sense, the handler does something different. Even if > >> IIR is 0, it still takes a look at pipe stats. > > > > That was introduced in 05eff845a28499762075d3a72e238a31f4d2407c to close > > a race where the pipestat triggered an interrupt after we processed the > > secondary registers and before reseting the primary. > > > > But the basic premise that we should only enter the interrupt handler > > with IIR!=0 holds (presuming non-shared interrupt lines such as MSI). > > Ok, this behavior is definitely new. I get several "nobody cared" about > this interrupt a week. This never used to happen. And something weird > emerges in /proc/interrupts when this happens: > 42: 1003292 1212890 PCI-MSI-edge ???s????????????:0000:00:02.0 > instead of > 42: 1006715 1218472 PCI-MSI-edge i915@pci:0000:00:02.0 > > It very looks like the generic IRQ handling code is broken. Like it > frees/corrupts irq_desc and ... OMG, your problem analyzing skills are amazing. If irq_desc would have been freed, then it wouldn't print the numbers and the irq type. And irq_desc is not corrupted either, otherwise the whole thing would explode in your face. The printout of the name is done via action->name. The irq action merily holds a pointer to the device name string, which is handed over with request_irq. So you are saying that the core code corrupts the memory which was handed in via a pointer by the driver? So now that's really an amazing core feature: It corrupts the memory with weird characters and still maintains the PCI bus number correct. So it not only corrupts memory it also moves the PCI part of the string a few characters to the end. If the pointer in the irq action would have been corrupted, then you would see a few weird characters and then the full string, not a random thing which is half correct and shifted by a few bytes. The pointer which is handed in is dev->devname, which gets allocated and filled in drm_pci_set_busid(). > ... then as well calls random handlers. Which random handlers would be called? The core code only calls handlers which are associated to an particular interrupt. And only when that particular interrupt is raised and not because the CPU pulls interrupt events out of thin air. And it calls the stupid i915 handler and not something else, otherwise you would not observe the IIR=0 printk or whatever you put there for debugging. > Suspend/resume cycle helps in this case and "i915@pci:0000:00:02.0" is > back in /proc/interrupts as can be seen above. That's proving what? That the irq core code magically restores the correct string, right? And probably it stops calling random handlers as well. Brilliant deduction. You know what? suspend calls free_irq() via i915_drm_freeze() -> drm_irq_uninstall() and the resume code calls request_irq() again. free_irq() removes the action and request_irq installs it fresh. So now the interesting part is that free_irq() checks the dev_id cookie for a match, which is also stored in the irq action. So we are dealing with a magic corrupt only action->name and action->handler problem. Pretty realistic. What the heck makes you assume that the irq core code is broken? Core code, which works on a gazillion of machines and different device drivers and does not corrupt anything except that i915 thingy? Come on, you need to provide better evidence than weird ass guessing. If you're still convinced that the irq core is messing with your device string, then simply hand in a NULL pointer when requesting the interrupt. That will make the core code explode nicely when it tries to modify that memory. Thanks, tglx