From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1756916AbXFOV0S (ORCPT );
        Fri, 15 Jun 2007 17:26:18 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
        id S1753313AbXFOV0J (ORCPT );
        Fri, 15 Jun 2007 17:26:09 -0400
Received: from mx1.redhat.com ([66.187.233.31]:40930 "EHLO mx1.redhat.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751395AbXFOV0G (ORCPT );
        Fri, 15 Jun 2007 17:26:06 -0400
Message-ID: <467303C9.9030706@redhat.com>
Date: Fri, 15 Jun 2007 17:25:29 -0400
From: Chuck Ebbert
Organization: Red Hat
User-Agent: Thunderbird 1.5.0.12 (X11/20070530)
MIME-Version: 1.0
To: Miklos Szeredi
CC: mingo@elte.hu, chris@atlee.ca, linux-kernel@vger.kernel.org,
        tglx@linutronix.de
Subject: Re: [BUG] long freezes on thinkpad t60
References: <20070524125453.GA7554@elte.hu> <20070524141059.GA19872@elte.hu>
        <20070524144447.GA25068@elte.hu> <20070524210153.GB19672@elte.hu>
In-Reply-To:
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On 06/14/2007 12:04 PM, Miklos Szeredi wrote:
> I've got some more info about this bug. It is gathered with
> nmi_watchdog=2 and a modified nmi_watchdog_tick(), which instead of
> calling die_nmi() just prints a line and calls show_registers().
>
> This makes the machine actually survive the NMI tracing. The attached
> traces are gathered over about an hour of stressing. An mp3 player is
> also going on continually, and I can hear a couple of seconds of
> "looping" quite often, but it gets as far as the NMI trace only
> rarely. AFAICS only the last pair shows a trace for both CPUs during
> the same "freeze".
>
> I've put some effort into understanding what's going on, but I'm not
> familiar with how interrupts work and that sort of thing.
>
> The pattern that emerges is that on CPU0 we have an interrupt, which
> is trying to acquire the rq lock, but can't.
>
> On CPU1 we have strace which is doing wait_task_inactive(), which sort
> of spins acquiring and releasing the rq lock. I've checked some of
> the traces and it is just before acquiring the rq lock, or just after
> releasing it, but is not actually holding it.
>
> So is it possible that wait_task_inactive() could be starving the
> other waiters of the rq spinlock? Any ideas?

Spinlocks aren't fair, so this kind of problem is always a possibility.
I think maybe we need another kind of unlock that gives another
processor a fair chance at the lock.

Some things you could try to see if they help:

  - add smp_mb() after the unlock
  - replace cpu_relax() with usleep()
  - use an xchg instruction to do the unlock, like i386 does when
    CONFIG_X86_OOSTORE is set
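
To make the "fair chance at the lock" idea concrete, here is a rough sketch of one well-known way to get fairness: a ticket lock, where waiters are served strictly in arrival order, so a CPU that keeps re-taking the lock in a tight loop has to queue behind anyone already waiting. This is only an illustration in user-space C11, not kernel code and not a patch against the rq lock. All the names (ticket_lock_t, ticket_lock_init(), ticket_lock(), ticket_unlock()) are made up for the example, and sched_yield() merely stands in for cpu_relax().

```c
/* Hypothetical user-space sketch of a ticket lock; not kernel code. */
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>

typedef struct {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket that currently owns the lock */
} ticket_lock_t;

static void ticket_lock_init(ticket_lock_t *l)
{
    atomic_init(&l->next, 0);
    atomic_init(&l->owner, 0);
}

static void ticket_lock(ticket_lock_t *l)
{
    /* Take a ticket; this fixes our place in the queue. */
    unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
                                                memory_order_relaxed);

    /* Spin until our ticket comes up. */
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        sched_yield();  /* stand-in for cpu_relax() */
}

static void ticket_unlock(ticket_lock_t *l)
{
    /* Sanity check: somebody must be holding the lock. */
    assert(atomic_load(&l->owner) != atomic_load(&l->next));

    /* Hand the lock to the next ticket in line, if there is one. */
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

With this scheme, a loop that unlocks and immediately relocks takes a new ticket each time, so it cannot jump ahead of a waiter that took its ticket earlier. The cost is an extra counter and a bit more cache-line traffic than a plain test-and-set lock.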