From: Jeff Garzik <jeff@garzik.org>
To: Niel Lambrechts <niel.lambrechts@gmail.com>
Cc: Tejun Heo <tj@kernel.org>, "linux.kernel" <linux-kernel@vger.kernel.org>
Subject: Re: 2.6.29 regression: ATA bus errors on resume
Date: Fri, 03 Apr 2009 16:09:14 -0400
Message-ID: <49D66CEA.8080605@garzik.org>
In-Reply-To: <49D3C4FB.5070002@gmail.com>
Niel Lambrechts wrote:
> On 03/30/2009 04:40 PM, Jeff Garzik wrote:
>> Niel Lambrechts wrote:
>>> On 03/30/2009 11:00 AM, Tejun Heo wrote:
>>>> Hello,
>>>>
>>>> For some reason, I can't find the original thread, so replying here.
>>>>
>>>> Niel Lambrechts wrote:
>>>>>>>>> The ext4 errors are interleaved with hardware errors, and the ext4
>>>>>>>>> errors are about I/O errors.
>>>>>>>>>
>>>>>>>>> EXT4-fs error (device sda6): __ext4_get_inode_loc: unable to read inode block - inode=2346519
>>>>>>>>> EXT4-fs error (device sda6) in ext4_reserve_inode_write: IO failure
>>>>>>>>>
>>>>>>>>> This looks more like a hibernation problem than an ext4 problem.
>>>>>>>>> Looks like the hard drive is being left in some inconsistent state
>>>>>>>>> after resuming from hibernation.
>>>> Yeap, ext4 is just the victim here.
>>>>
>>>>>>>> ata1.00: irq_stat 0x00400008, PHY RDY changed
>>>>>>>> ata1: SError: { PHYRdyChg CommWake }
>>>>>>> Your SATA hardware flags a connect-or-disconnect event ("PHY
>>>>>>> RDY"), which requires us to abort a bunch of queued commands:
>>>>>>>
>>>>>>>> ata1.00: cmd 60/18:00:77:88:6f/00:00:0e:00:00/40 tag 0 ncq 12288 in
>>>>>>>> res 50/00:30:07:b3:10/00:00:0c:00:00/40 Emask 0x10 (ATA bus error)
>>>>>>> [...]
>>>> ...
>>>>>>> The SCSI subsystem aborts each of the queued commands.
>>>>>> No ... this is the SCSI subsystem receiving an ABORTED COMMAND
>>>>>> return in sense data for each of the outstanding I/Os.
>>>>>>
>>>>>> The only place these are generated is in ata_to_sense_error(),
>>>>>> which is only reached if there's some type of ATA error.
>>>>>>
>>>>>> If I had to theorise, I'd say the system suspended with commands
>>>>>> outstanding to the device. On resume, the device gets reset and
>>>>>> returns some type of ATA error which gets translated to ABORTED
>>>>>> COMMAND which causes a failure.
>>>>>>
>>>>>> In the mid layer, we translate ABORTED_COMMAND into a retry until the
>>>>>> command runs out of them ... could it be there's a race readying the
>>>>>> device and we run through the retries before it can accept the
>>>>>> command?
>>>> When libata-eh thinks that the problem isn't worth retrying, it sets
>>>> scmd->retries to scmd->allowed so that it gets aborted immediately.
>>>> The code is in ata_eh_qc_complete().
>>>>
>>>> Whether a command is to be retried or not is determined by
>>>> ATA_QCFLAG_RETRY, which is set in ata_eh_link_autopsy() for each
>>>> failed command. The immediate-failure criteria are pretty strict:
>>>> only driver software errors (AC_ERR_INVALID) and PC or other special
>>>> commands that were aborted by the device get the immediate pink
>>>> slip. In this case, the commands are from the FS and failed with
>>>> AC_ERR_ATA_BUS, so they definitely don't fit the criteria.
>>>> Strange.
>>>>
>>>> How reproducible is the problem? Are you interested in trying out
>>>> some debug patches?
>>> Hi Tejun,
>>>
>>> I think I should be able to reproduce when actively using X with 2.6.29,
>>> and I have an external disk where I could backup to / boot from if the
>>> corruption became a problem.
>>>
>>> These issues are keeping me from 2.6.29 so I'll gladly help where I can,
>>> if you can please provide me the patches and the .config settings that
>>> may be required?
>>>
>>> Niel
>> Any chance you could use bisect to narrow down the problem commit?
>>
>> http://kernel.org/pub/software/scm/git/docs/v1.4.4.4/howto/isolate-bugs-with-bisect.txt
>>
>>
>> This should identify which patch caused your problems, if you have a
>> known good starting point (such as 2.6.28).
> I'm struggling with this - my good kernel is 2.6.28.9 and as far as I
> can tell the closest known-good kernel I can tell git to use is the
> 2.6.28 base itself. So now resume fails entirely during some of the
> bisect steps because of unrelated regressions present in kernels both
> older and newer than mine, so I can't test the real issue! :(
"git help bisect" or "man git-bisect" has a wealth of information.
Most notably, you can use "git bisect skip" if the current commit cannot
be tested, and thus cannot be declared good or bad.
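
As a rough illustration only (a throwaway repository with made-up file
names and a pretend bug, not your kernel tree), skipping untestable
commits could look something like the sketch below. It uses
"git bisect run", which treats exit code 125 from the test script as
"skip this commit"; when bisecting by hand, the equivalent at each step is
"git bisect good", "git bisect bad" or "git bisect skip".

```shell
#!/bin/sh
# Sketch only: a throwaway repository with hypothetical file names, showing
# how untestable commits are skipped during a bisect.
set -e

work=$(mktemp -d)
mkdir "$work/repo"
cd "$work/repo"
git init -q
git config user.email "bisect-demo@example.invalid"
git config user.name "Bisect Demo"

# Eight commits: 4 and 5 cannot be tested, the pretend bug appears in 7.
for i in 1 2 3 4 5 6 7 8; do
    if [ "$i" -eq 4 ] || [ "$i" -eq 5 ]; then
        echo untestable > state
    elif [ "$i" -ge 7 ]; then
        echo bad > state
    else
        echo good > state
    fi
    echo "$i" > version
    git add state version
    git commit -q -m "commit $i"
    if [ "$i" -eq 1 ]; then
        git tag known-good
    fi
done

# Test script kept outside the repository so checkouts do not touch it.
# Exit 0 = good, 1 = bad, 125 = cannot test (bisect skips the commit).
cat > "$work/check.sh" <<'EOF'
#!/bin/sh
case "$(cat state)" in
    good) exit 0 ;;
    bad)  exit 1 ;;
    *)    exit 125 ;;
esac
EOF

# Bad commit first (HEAD), then the known-good starting point.
git bisect start HEAD known-good >/dev/null
git bisect run sh "$work/check.sh" >/dev/null

# refs/bisect/bad now points at the first bad commit that was found.
first_bad=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "first bad commit: $first_bad"
```

Note that skipping only helps while the untestable commits are not
directly next to the one that introduced the bug; otherwise bisect can
only narrow the result down to a range of candidates.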
Jeff