From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S935291Ab3BOBPN (ORCPT ); Thu, 14 Feb 2013 20:15:13 -0500
Received: from mx1.redhat.com ([209.132.183.28]:58713 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S932201Ab3BOBPL (ORCPT ); Thu, 14 Feb 2013 20:15:11 -0500
Date: Thu, 14 Feb 2013 20:15:03 -0500
From: Dave Jones
To: Linus Torvalds
Cc: Hugh Dickins , Linux Kernel Mailing List , paul.mckenney@linaro.org
Subject: Re: Debugging Thinkpad T430s occasional suspend failure.
Message-ID: <20130215011503.GA11914@redhat.com>
Mail-Followup-To: Dave Jones , Linus Torvalds , Hugh Dickins ,
	Linux Kernel Mailing List , paul.mckenney@linaro.org
References: <20130212193901.GA18906@redhat.com>
	<20130213004059.GA14451@redhat.com>
	<20130213041629.GA28622@redhat.com>
	<20130213193411.GA15928@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Feb 13, 2013 at 11:56:25AM -0800, Linus Torvalds wrote:
> On Wed, Feb 13, 2013 at 11:34 AM, Dave Jones wrote:
> >
> > My test was a loop of 100 suspend/resume cycles before calling something
> > 'good'. The 'bad' cases all failed within 10 cycles (usually 2-3).
>
> Considering that you apparently already found one case where the BIOS
> crapped out due to effectively unrelated timing details (ie timing
> triggered a temperature issue that then triggered behavioral changes),
> I wonder if your more occasional problem might not be a sign of
> something similar.
>
> But since you seem to be able to automate it well, maybe one thing to
> try is to change the timing a bit while testing. Maybe some failures
> were hidden by the timing just happening to work out.

Given I never saw this on a Fedora kernel, just my self-built ones, I eventually
gave up on bisecting code, and switched to bisecting config options.
I should have started this way, as I figured it out within an hour.

The 3.7 merge window is when I started seeing this, and here's what got
introduced during that time..

commit e3ebfb96f396731ca2d0b108785d5da31b53ab00
Author: Paul E. McKenney
Date:   Mon Jul 2 14:42:01 2012 -0700

    rcu: Add PROVE_RCU_DELAY to provoke difficult races

'difficult' is an understatement.
This explains why some of those 'good' bisects survived 100 suspends on one day,
and failed the next.

Unfortunately, I don't think there's any sane way to retrieve whatever debug
info might be getting spewed. Perhaps when I reinstall, and switch to booting
EFI, I'll be able to use pstore, but on a BIOS-based boot, all hope seems lost.
No netconsole, no usb-serial, and even crippling i915's suspend routine doesn't
help.

I'll just disable this option for now.

	Dave
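
For anyone wanting to repeat the config bisection, here is a minimal sketch of the idea, not the exact procedure used above. It assumes a single culprit option and ignores Kconfig dependencies. bisect_config and the is_bad callback are hypothetical names; is_bad stands in for "build the known-good config plus these options, boot it, and run the suspend/resume loop". Because the failure is intermittent, that callback has to run enough cycles to be trusted.

```python
from typing import Callable, FrozenSet, Sequence


def bisect_config(
    diff_options: Sequence[str],
    is_bad: Callable[[FrozenSet[str]], bool],
) -> str:
    """Find the one option that turns a good config into a bad one.

    diff_options: options set in the bad config but not in the good one.
    is_bad: hypothetical callback that tests the good config with the
            given options added and returns True if the failure shows up.
    Assumes exactly one culprit and no dependencies between options.
    """
    candidates = list(diff_options)
    if not candidates:
        raise ValueError("no differing options to bisect")

    while len(candidates) > 1:
        mid = len(candidates) // 2
        first_half = candidates[:mid]
        if is_bad(frozenset(first_half)):
            # Failure reproduced: the culprit is in the half we enabled.
            candidates = first_half
        else:
            # No failure: the culprit must be in the half we left out.
            candidates = candidates[mid:]
    return candidates[0]


if __name__ == "__main__":
    # CONFIG_EXAMPLE_* are placeholder names, not real kernel options.
    options = [
        "CONFIG_EXAMPLE_A",
        "CONFIG_PROVE_RCU_DELAY",
        "CONFIG_EXAMPLE_B",
        "CONFIG_EXAMPLE_C",
        "CONFIG_EXAMPLE_D",
    ]
    # Stand-in for a real build/boot/suspend test.
    culprit = bisect_config(options, lambda on: "CONFIG_PROVE_RCU_DELAY" in on)
    print(culprit)
```

With N differing options this needs about log2(N) build-and-test rounds, which is why it can finish within an hour when each round is quick.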