From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1752603AbWAHQ2a (ORCPT ); Sun, 8 Jan 2006 11:28:30 -0500
Received: (majordomo@vger.kernel.org) by vger.kernel.org
    id S1752605AbWAHQ2a (ORCPT ); Sun, 8 Jan 2006 11:28:30 -0500
Received: from main.gmane.org ([80.91.229.2]:39625 "EHLO ciao.gmane.org")
    by vger.kernel.org with ESMTP id S1752603AbWAHQ2a (ORCPT );
    Sun, 8 Jan 2006 11:28:30 -0500
X-Injected-Via-Gmane: http://gmane.org/
To: linux-kernel@vger.kernel.org
From: Kalin KOZHUHAROV
Subject: Re: Help track down a freezing machine, libata or hardware
Date: Mon, 09 Jan 2006 01:28:17 +0900
Message-ID:
References: <81b0412b0512140625i7cc5779ar224de3d64c615fbc@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Complaints-To: usenet@sea.gmane.org
X-Gmane-NNTP-Posting-Host: s175249.ppp.asahi-net.or.jp
User-Agent: Mozilla Thunderbird 1.0.7 (X11/20060103)
X-Accept-Language: en-us, en
In-Reply-To:
X-Enigmail-Version: 0.93.0.0
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Kalin KOZHUHAROV wrote:
> Kalin KOZHUHAROV wrote:
>
>>Alex Riesen wrote:
>>
>>>On 12/14/05, Kalin KOZHUHAROV wrote:
>>>
>>>>Now that I get a repetitive freeze, is there anything to debug the problem?
>>>>I guess, the point when kernel is still responsive to keyboard, but I cannot login.
>>>
>>>try to connect a serial console to it and press Alt+SysRq+t
>>
>>Thank you for the suggestio, Alex.
>>
>>I was always trying to avoid the serial console till now (it just seems difficult, and I DO know it
>>is not), and didn't even bother with the netconsole...
>
> It is, I as have to go and buy a null modem cable... will do it.
>
>>So until now, here is an oops, the first I saw in a few months, captured by my camera and then
>>digitally enhanced: http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char__oops1.jpg
>>
>>The OCRed/handwritten text ( http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char__oops1.txt )
>>says:
>>
>>Call trace:
>>SCSI device sda: 145226112 512-byte hdwr sectors (74356 MB)
>>SCSI device sda: drive cache: write back
>>[] kobject_put+0x1f/0x30
>>[] scsi_end_request+0xdd/0xf0
>>[] scsi_io_completion+0x26e/0x570
>>[] load_balance_newidle+0x43/0x110
>>[] scsi_generic_done+0x35/0x50
>>[] scsi_finish_command+0x8e/0xd0
>>[] schedule+0x4da/0xd50
>>[] schedule+0x50d/0xd50
>>[] scsi_sortirq+0xdf/0x160
>>[] __do_softirq+0xd6/0xf0
>>[] do_softirq+0x35/0x40
>>[] ksoftirqd+0x95/0xe0
>>[] ksoftirqd+0x0/0xe0
>>[] kthread+0xba/0xc0
>>[] kthread+0x0/0xc0
>>[] kernel_thread_helper+0x5/0x10
>>Code: e1 08 00 89 44 24 04 89 1c 24 e8 27 b0 ff ff eb a5 90 8d 74 26 00 55 57 56
>> 53 83 ec 08 8b 44 24 1c 89 44 24 04 8b 80 ec 00 00 00 <8b> 38 f6 80 79 01 00 00
>> 80 0f 85 98 00 00 00 8b 47 2c 8d 6f 20
>><0>Kernel panic - not syncing: Fatal exception in interrupt
>>
>>Unfortunately everything was frozen (KBD too), so I couldn't scroll up to see the beginning. As you
>>may guess, it was not written to the disk.
>>
>>The oops happened on boot (after a hard power-off) and is probbably related to the SATA system.
>
> All right, the above started to be reproducible, about once every 3 boots: the system freezes when
> tries to initialize the ata sybsystem. (still don't have cable for the serial console, sorry)
>
>>The .config is available at http://linux.tar.bz/reports/oopses/char/2.6.14.3-K01_char.config
>
> Now this is 2.6.14.4 and the new config is:
> http://linux.tar.bz/reports/oopses/char/2.6.14.4-K01_char.config
>
> I added another 250GB SATA HDD and changed the PSU, but it does not seem to be related for that bug.
> Will try to tweak the IDE parameters in BIOS.

OK, now this looks like a hardware failure...

I ran 2.6.15 the other day and was happy to see 2 days of uptime :-)
However... everything was borked: root was mounted RO, and the fs was generally screwed.

I don't have physical access to the box right now, so I will post more details tomorrow, but this is
what I got as dmesg:

ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
ata1: port reset, p_is 40000001 is 1 pis 0 cmd c017 tf 471 ss 113 se 0
ata1: translated ATA stat/err 0x71/04 to SCSI SK/ASC/ASCQ 0xb/00/00
ata1: status=0x71 { DriveReady DeviceFault SeekComplete Error }
ata1: error=0x04 { DriveStatusError }
sd 0:0:0:0: SCSI error: return code = 0x8000002
sda: Current: sense key=0xb ASC=0x0 ASCQ=0x0
end_request: I/O error, dev sda, sector 10803956
Buffer I/O error on device sda3, logical block 362497

The above was repeated many times (the buffer was full, so I don't know how many). I cannot tell
the time, as trying to `cat /var/log/everything` resulted in an I/O error.
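
For reference, the status=0x71 and error=0x04 values in the dmesg above can be decoded by hand. Below
is a minimal Python sketch, assuming the standard ATA status and error register bit layout. The names
STATUS_BITS, ERROR_BITS and decode are made up for illustration. Labels for bits that appear in the
log are copied from it; the other labels are approximations and may not match what the kernel prints.

```python
# Hypothetical helper: maps set bits of an ATA register value to labels.
# Bit positions follow the standard ATA status/error registers; only the
# labels that appear in the dmesg above are taken from the kernel output.

STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",      # seen in the log
    0x20: "DeviceFault",     # seen in the log
    0x10: "SeekComplete",    # seen in the log
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",           # seen in the log
}

ERROR_BITS = {
    0x80: "InterfaceCRC",      # ICRC (approximate label)
    0x40: "Uncorrectable",     # UNC (approximate label)
    0x10: "IdNotFound",        # IDNF (approximate label)
    0x04: "DriveStatusError",  # ABRT, seen in the log
}


def decode(value, table):
    """Return the labels of all bits set in value, highest bit first."""
    return [name for mask, name in table.items() if value & mask]


if __name__ == "__main__":
    print("status=0x71:", decode(0x71, STATUS_BITS))
    print("error=0x04: ", decode(0x04, ERROR_BITS))
```

Run as a script, this lists DriveReady, DeviceFault, SeekComplete and Error for 0x71, and
DriveStatusError for 0x04, which matches the kernel's own decoding in the log. It only restates
what the drive reported; it does not by itself tell a failing disk from a bad cable or controller.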

I also got a bunch of these:

ReiserFS: sda3: warning: clm-6006: writing inode 56216 on readonly FS
ReiserFS: sda3: warning: clm-6006: writing inode 56216 on readonly FS
ReiserFS: sda3: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [25971 27180 0x0 SD]
ReiserFS: sda3: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [9064 25871 0x0 SD]
ReiserFS: sda3: warning: vs-13070: reiserfs_read_locked_inode: i/o failure occurred trying to find stat data of [25971 27180 0x0 SD]

Do you see this as a hardware problem?

The drive in question is a WD740GD (SATA, 10k RPM) and I will run the WD tools on it in a day or two
to check if it is a hardware failure.

Kalin.

-- 
|[ ~~~~~~~~~~~~~~~~~~~~~~ ]|
+-> http://ThinRope.net/ <-+
|[ ______________________ ]|