All of lore.kernel.org
* libata timeouts when stressing a Samsung HDD
@ 2009-02-02 21:40 Chuck Ebbert
  2009-02-11  1:30 ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Chuck Ebbert @ 2009-02-02 21:40 UTC (permalink / raw)
  To: linux-ide

If I use an ext3 filesystem with noatime I never see problems, but if I use XFS
with barriers and atime enabled, I keep getting this:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
end_request: I/O error, dev sda, sector 13851948
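For anyone decoding the dump: the first byte of the "cmd" line is the ATA opcode, and 0xEA is FLUSH CACHE EXT per ATA-7. A rough Python sketch of that decoding (opcode table abridged; field layout assumed from libata's error-dump format):

```python
# Rough sketch: pull the ATA opcode out of a libata "cmd" dump line.
# Opcode table abridged; values are from the ATA-7 command set.
ATA_OPCODES = {
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
    0xCA: "WRITE DMA",
}

def decode_cmd(line):
    """Return the command name for a line like
    'cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0'."""
    taskfile = line.split()[1]              # "ea/00:...:00/a0"
    opcode = int(taskfile.split("/")[0], 16)
    return ATA_OPCODES.get(opcode, hex(opcode))

print(decode_cmd("cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0"))
# -> FLUSH CACHE EXT
```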


The errors always happen in the XFS log which then causes filesystem shutdown.

The drive reports itself as:
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-7: SAMSUNG HD160JJ/P, ZM100-34, max UDMA7
ata1.00: 312500000 sectors, multi 8: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133

Kernel is 2.6.27.12, using the ahci driver on this hardware:
Intel Corporation 82801GR/GH (ICH7 Family) SATA AHCI Controller

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: libata timeouts when stressing a Samsung HDD
  2009-02-02 21:40 libata timeouts when stressing a Samsung HDD Chuck Ebbert
@ 2009-02-11  1:30 ` Tejun Heo
  2009-02-11  4:08   ` Mark Lord
  2009-02-11 20:24   ` Chuck Ebbert
  0 siblings, 2 replies; 23+ messages in thread
From: Tejun Heo @ 2009-02-11  1:30 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: linux-ide

Chuck Ebbert wrote:
> If I use an ext3 filesystem with noatime I never see problems, but if I use XFS
> with barriers and atime enabled, I keep getting this:
> 
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata1.00: status: { DRDY }
> ata1: hard resetting link
> ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata1.00: configured for UDMA/133
> end_request: I/O error, dev sda, sector 13851948

ext3 doesn't use barrier by default.  Timing out on FLUSH_CACHE is a
pretty good sign that something is wrong with the disk.  Can you
please post the output of "smartctl -a /dev/sda"?

-- 
tejun


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-11  1:30 ` Tejun Heo
@ 2009-02-11  4:08   ` Mark Lord
  2009-02-11 20:29     ` Chuck Ebbert
  2009-02-11 20:24   ` Chuck Ebbert
  1 sibling, 1 reply; 23+ messages in thread
From: Mark Lord @ 2009-02-11  4:08 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Chuck Ebbert, linux-ide

Tejun Heo wrote:
> Chuck Ebbert wrote:
>> If I use an ext3 filesystem with noatime I never see problems, but if I use XFS
>> with barriers and atime enabled, I keep getting this:
>>
>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>> ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>> ata1.00: status: { DRDY }
>> ata1: hard resetting link
>> ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> ata1.00: configured for UDMA/133
>> end_request: I/O error, dev sda, sector 13851948
> 
> ext3 doesn't use barrier by default.  Timing out on FLUSH_CACHE is a
> pretty good sign that something is wrong with the disk.  Can you
> please post the output of "smartctl -a /dev/sda"?
..

I missed the start of this thread,
but want to point out that something similar was observed here
with a pair of Hitachi 750GB drives (RAID0) and XFS and FLUSH_CACHE.

If I let hddtemp or smartctl run periodically during heavy writes,
one (or both?) of the drives would eventually have issues and require
a reset.  Problem was never resolved (I simply got rid of the periodic
hddtemp and smartctl invocations instead).

-ml




* Re: libata timeouts when stressing a Samsung HDD
  2009-02-11  1:30 ` Tejun Heo
  2009-02-11  4:08   ` Mark Lord
@ 2009-02-11 20:24   ` Chuck Ebbert
  1 sibling, 0 replies; 23+ messages in thread
From: Chuck Ebbert @ 2009-02-11 20:24 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-ide

On Wed, 11 Feb 2009 10:30:21 +0900
Tejun Heo <tj@kernel.org> wrote:

> 
> ext3 doesn't use barrier by default.  Timing out on FLUSH_CACHE is a
> pretty good sign that something is wrong with the disk.  Can you
> please post the output of "smartctl -a /dev/sda"?
> 

smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD160JJ/P
Serial Number:    S0DFJ1NLA12279
Firmware Version: ZM100-34
User Capacity:    160,000,000,000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Wed Feb 11 20:24:01 2009 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (3699) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (  61) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       9
  3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always       -       5248
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       72
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   253   253   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   253   253   000    Old_age   Offline      -       0
  9 Power_On_Half_Minutes   0x0032   100   100   000    Old_age   Always       -       109h+27m
 10 Spin_Retry_Count        0x0032   253   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   253   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       62
190 Airflow_Temperature_Cel 0x0022   118   100   000    Old_age   Always       -       40
194 Temperature_Celsius     0x0022   118   100   000    Old_age   Always       -       40
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       18436133
196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
202 TA_Increase_Count       0x0032   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     12919         -
# 2  Short offline       Completed without error       00%      5645         -
# 3  Extended offline    Completed without error       00%         1         -
# 4  Extended offline    Completed without error       00%         1         -
# 5  Short offline       Completed without error       00%         0         -

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-11  4:08   ` Mark Lord
@ 2009-02-11 20:29     ` Chuck Ebbert
  2009-02-11 22:03       ` Mark Lord
  0 siblings, 1 reply; 23+ messages in thread
From: Chuck Ebbert @ 2009-02-11 20:29 UTC (permalink / raw)
  To: Mark Lord; +Cc: Tejun Heo, linux-ide

On Tue, 10 Feb 2009 23:08:40 -0500
Mark Lord <liml@rtr.ca> wrote:

> 
> If I let hddtemp or smartctl run periodically during heavy writes,
> one (or both?) of the drives would eventually have issues and require
> a reset.  Problem was never resolved (I simply got rid of the periodic
> hddtemp and smartctl invocations instead).
> 

smartd is disabled and I don't even have hddtemp installed, so I don't
think they could be causing this.


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-11 20:29     ` Chuck Ebbert
@ 2009-02-11 22:03       ` Mark Lord
  2009-02-11 22:11         ` Jeff Garzik
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Lord @ 2009-02-11 22:03 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: Tejun Heo, linux-ide

Chuck Ebbert wrote:
> On Tue, 10 Feb 2009 23:08:40 -0500
> Mark Lord <liml@rtr.ca> wrote:
> 
>> If I let hddtemp or smartctl run periodically during heavy writes,
>> one (or both?) of the drives would eventually have issues and require
>> a reset.  Problem was never resolved (I simply got rid of the periodic
>> hddtemp and smartctl invocations instead).
>>
> 
> smartd is disabled and I don't even have hddtemp installed, so I don't
> think they could be causing this.
..

I didn't figure they were the cause, just a possible trigger
for some other rare issue.

I wonder if it's just a case of too short a timeout on the cache flushes?


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-11 22:03       ` Mark Lord
@ 2009-02-11 22:11         ` Jeff Garzik
  2009-02-11 22:29           ` Mark Lord
  2009-02-12  4:28           ` Robert Hancock
  0 siblings, 2 replies; 23+ messages in thread
From: Jeff Garzik @ 2009-02-11 22:11 UTC (permalink / raw)
  To: Mark Lord; +Cc: Chuck Ebbert, Tejun Heo, linux-ide

Mark Lord wrote:
> I wonder if it's just a case of too short a timeout on the cache flushes?

The answer in general to this question has always been "yes".....

Unless this has changed in the past year, the worst case for SATA cache 
flush can definitely exceed 30 seconds...  it is unbounded as defined in 
the spec, and unbounded in practice as well.

Of course, users' patience is not unbounded :)

	Jeff





* Re: libata timeouts when stressing a Samsung HDD
  2009-02-11 22:11         ` Jeff Garzik
@ 2009-02-11 22:29           ` Mark Lord
  2009-02-11 22:54             ` Chuck Ebbert
  2009-02-12  4:28           ` Robert Hancock
  1 sibling, 1 reply; 23+ messages in thread
From: Mark Lord @ 2009-02-11 22:29 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Chuck Ebbert, Tejun Heo, linux-ide

Jeff Garzik wrote:
> Mark Lord wrote:
>> I wonder if it's just a case of too short a timeout on the cache flushes?
> 
> The answer in general to this question has always been "yes".....
> 
> Unless this has changed in the past year, the worst case for SATA cache 
> flush can definitely exceed 30 seconds...  it is unbounded as defined in 
> the spec, and unbounded in practice as well.
..

But I don't think we've yet seen a proven case of it taking too long
for the current libata timeouts.  Unless it's happening now.
T'would be good to find out..  Chuck?


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-11 22:29           ` Mark Lord
@ 2009-02-11 22:54             ` Chuck Ebbert
  2009-02-12 16:10               ` Mark Lord
  0 siblings, 1 reply; 23+ messages in thread
From: Chuck Ebbert @ 2009-02-11 22:54 UTC (permalink / raw)
  To: Mark Lord; +Cc: Jeff Garzik, Tejun Heo, linux-ide

On Wed, 11 Feb 2009 17:29:43 -0500
Mark Lord <liml@rtr.ca> wrote:

> > Unless this has changed in the past year, the worst case for SATA cache 
> > flush can definitely exceed 30 seconds...  it is unbounded as defined in 
> > the spec, and unbounded in practice as well.
> ..
> 
> But I don't think we've yet seen a proven case of it taking too long
> for the current libata timeouts.  Unless it's happening now.
> T'would be good to find out..  Chuck?

How do I change the timeout?

I'm pretty sure I have a spare partition on this drive to test with.




* Re: libata timeouts when stressing a Samsung HDD
  2009-02-11 22:11         ` Jeff Garzik
  2009-02-11 22:29           ` Mark Lord
@ 2009-02-12  4:28           ` Robert Hancock
  2009-02-19 15:27             ` Mark Lord
  1 sibling, 1 reply; 23+ messages in thread
From: Robert Hancock @ 2009-02-12  4:28 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Mark Lord, Chuck Ebbert, Tejun Heo, linux-ide

Jeff Garzik wrote:
> Mark Lord wrote:
>> I wonder if it's just a case of too short a timeout on the cache flushes?
> 
> The answer in general to this question has always been "yes".....
> 
> Unless this has changed in the past year, the worst case for SATA cache 
> flush can definitely exceed 30 seconds...  it is unbounded as defined in 
> the spec, and unbounded in practice as well.
> 
> Of course, users' patience is not unbounded :)

Yeah, it's pretty ludicrous for a cache flush to potentially take that 
long. Theoretically if you had a 32MB write cache completely full with 
completely non-contiguous sectors you could end up with it taking 
something like 2 minutes or more to write out. However, the drive really 
should be ensuring that it doesn't build up so much in the write cache 
that a flush would take this long - that is a lot of data that could be 
lost on an unexpected power-off.
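A back-of-envelope check of that estimate, with assumed figures (the chunk size and per-chunk latency below are guesses, not drive specs):

```python
# Back-of-envelope: time to drain a 32 MiB write cache in scattered chunks.
# Chunk size and per-chunk latency are assumptions, not measured values.
cache_bytes = 32 * 1024 ** 2     # 32 MiB of dirty write cache
chunk_bytes = 4 * 1024           # assume ~4 KiB per non-contiguous chunk
per_chunk_s = 0.012              # ~12 ms avg seek + rotational latency
chunks = cache_bytes // chunk_bytes
total_s = chunks * per_chunk_s
print(chunks, total_s)           # 8192 chunks, roughly 98 s of pure seeking
```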

However, in this case the drive is not reporting Busy status at the 
timeout, which suggests maybe an interrupt got lost or something. (Could 
be still the drive's fault.)

Chuck, when this happens can you tell if the disk sounds like it is 
grinding away until the timeout happens or is it just sitting there?


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-11 22:54             ` Chuck Ebbert
@ 2009-02-12 16:10               ` Mark Lord
  2009-02-12 16:13                 ` Mark Lord
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Lord @ 2009-02-12 16:10 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: Jeff Garzik, Tejun Heo, linux-ide

Chuck Ebbert wrote:
> On Wed, 11 Feb 2009 17:29:43 -0500
> Mark Lord <liml@rtr.ca> wrote:
> 
>>> Unless this has changed in the past year, the worst case for SATA cache 
>>> flush can definitely exceed 30 seconds...  it is unbounded as defined in 
>>> the spec, and unbounded in practice as well.
>> ..
>>
>> But I don't think we've yet seen a proven case of it taking too long
>> for the current libata timeouts.  Unless it's happening now.
>> T'would be good to find out..  Chuck?
> 
> How do I change the timeout?
..

Heh.. actually, I'm not entirely sure myself.
This stuff bounces between the block, scsi, and ata layers,
and I think it's set before it gets to libata.

Tejun?


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-12 16:10               ` Mark Lord
@ 2009-02-12 16:13                 ` Mark Lord
  2009-02-16  3:12                   ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Lord @ 2009-02-12 16:13 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: Jeff Garzik, Tejun Heo, linux-ide

Mark Lord wrote:
> Chuck Ebbert wrote:
>
>> How do I change the timeout?
> ..
> 
> Heh.. actually, I'm not entirely sure myself.
> This stuff bounces between the block, scsi, and ata layers,
> and I think it's set before it gets to libata.
..

Mmm.. I think it is the SD_TIMEOUT value, defined in drivers/scsi/sd.h,
which currently uses (30 * HZ) as the timeout for most things,
including cache flushes.

Try changing it to (180 * HZ) just for fun.
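For reference, a sketch of the one-line change being suggested; the context is from a 2.6.27-era drivers/scsi/sd.h and may differ in other trees:

```diff
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@
-#define SD_TIMEOUT		(30 * HZ)
+#define SD_TIMEOUT		(180 * HZ)	/* testing only: allow very slow flushes */
```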

Cheers


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-12 16:13                 ` Mark Lord
@ 2009-02-16  3:12                   ` Tejun Heo
  2009-02-16 21:30                     ` Chuck Ebbert
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2009-02-16  3:12 UTC (permalink / raw)
  To: Mark Lord; +Cc: Chuck Ebbert, Jeff Garzik, linux-ide

Mark Lord wrote:
> Mark Lord wrote:
>> Chuck Ebbert wrote:
>>
>>> How do I change the timeout?
>> ..
>>
>> Heh.. actually, I'm not entirely sure myself.
>> This stuff bounces between the block, scsi, and ata layers,
>> and I think it's set before it gets to libata.
> ..
> 
> Mmm.. I think it is the SD_TIMEOUT value, defined in drivers/scsi/sd.h
> which currently uses (30 * HZ) as the timeout for most things,
> including for cache flushes.
> 
> Try changing it to (180 * HZ) just for fun.

Yeap, SD_TIMEOUT should be it.  But I've never personally seen disk
flushing taking as long as 30 seconds.  Does it really happen?  It's
not like the drive would be doing random seeking.  One full stroke
across the platter should be it.  Even with the rotational delay,
going over 30 seconds doesn't seem very likely.

Thanks.

-- 
tejun


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-16  3:12                   ` Tejun Heo
@ 2009-02-16 21:30                     ` Chuck Ebbert
  2009-02-19  6:21                       ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Chuck Ebbert @ 2009-02-16 21:30 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Mark Lord, Jeff Garzik, linux-ide

On Mon, 16 Feb 2009 12:12:56 +0900
Tejun Heo <tj@kernel.org> wrote:


> > 
> > Mmm.. I think it is the SD_TIMEOUT value, defined in drivers/scsi/sd.h
> > which currently uses (30 * HZ) as the timeout for most things,
> > including for cache flushes.
> > 
> > Try changing it to (180 * HZ) just for fun.
> 
> Yeap, SD_TIMEOUT should be it.  But I've never personally seen disk
> flushing taking as long as 30 seconds.  Does it really happen?  It's
> not like the drive would be doing random seeking.  One full stroke
> across the platter should be it.  Even with the rotational delay,
> going over 30 seconds doesn't seem very likely.

I just noticed that the other drive, a Western Digital, was using 1.5Gbps
instead of 3.0. So I forced the Samsung to the slower speed and now I
can't make the timeout happen anymore.


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-16 21:30                     ` Chuck Ebbert
@ 2009-02-19  6:21                       ` Tejun Heo
  2009-02-19 16:41                         ` Greg Freemyer
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2009-02-19  6:21 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: Mark Lord, Jeff Garzik, linux-ide

Hello,

Chuck Ebbert wrote:
>> Yeap, SD_TIMEOUT should be it.  But I've never personally seen disk
>> flushing taking as long as 30 seconds.  Does it really happen?  It's
>> not like the drive would be doing random seeking.  One full stroke
>> across the platter should be it.  Even with the rotational delay,
>> going over 30 seconds doesn't seem very likely.
> 
> I just noticed that the other drive, a Western Digital, was using 1.5Gbps
> instead of 3.0. So I forced the Samsung to the slower speed and now I
> can't make the timeout happen anymore.

Timeouts are much more likely with the higher transfer speed but it's
kind of strange for flush cache to be affected by it as the command
doesn't have any data to transfer.  :-(

-- 
tejun


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-12  4:28           ` Robert Hancock
@ 2009-02-19 15:27             ` Mark Lord
  2009-02-20  0:32               ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Mark Lord @ 2009-02-19 15:27 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Robert Hancock, Jeff Garzik, Chuck Ebbert, linux-ide

 >ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
 >ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
 >         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
 >ata1.00: status: { DRDY }

>>> I wonder if it's just a case of too short a timeout on the cache flushes?
..
> However, in this case the drive is not reporting Busy status at the 
> timeout, which suggests maybe an interrupt got lost or something. (Could 
> be still the drive's fault.)
..

If I recall correctly, the reported shadow register contents are bogus
when a timeout occurs.  So we don't actually know what the drive state was.

Or do we, Tejun?


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-19  6:21                       ` Tejun Heo
@ 2009-02-19 16:41                         ` Greg Freemyer
  2009-02-20  0:33                           ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Greg Freemyer @ 2009-02-19 16:41 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Chuck Ebbert, Mark Lord, Jeff Garzik, linux-ide

On Thu, Feb 19, 2009 at 1:21 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> Chuck Ebbert wrote:
>>> Yeap, SD_TIMEOUT should be it.  But I've never personally seen disk
>>> flushing taking as long as 30 seconds.  Does it really happen?  It's
>>> not like the drive would be doing random seeking.  One full stroke
>>> across the platter should be it.  Even with the rotational delay,
>>> going over 30 seconds doesn't seem very likely.
>>
>> I just noticed that the other drive, a Western Digital, was using 1.5Gbps
>> instead of 3.0. So I forced the Samsung to the slower speed and now I
>> can't make the timeout happen anymore.
>
> Timeouts are much more likely with the higher transfer speed but it's
> kind of strange for flush cache to be affected by it as the command
> doesn't have any data to transfer.  :-(

I have not been following this thread, but those jumpers normally
reduce the feature set from SATA-II to SATA-I as well.

So it could be an issue with a SATA-II feature.  NCQ?


Greg
-- 
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-19 15:27             ` Mark Lord
@ 2009-02-20  0:32               ` Tejun Heo
  2009-02-20  2:52                 ` Robert Hancock
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2009-02-20  0:32 UTC (permalink / raw)
  To: Mark Lord; +Cc: Robert Hancock, Jeff Garzik, Chuck Ebbert, linux-ide

Mark Lord wrote:
>>ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>>ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>>         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>ata1.00: status: { DRDY }
> 
>>>> I wonder if it's just a case of too short a timeout on the cache
>>>> flushes?
> ..
>> However, in this case the drive is not reporting Busy status at the
>> timeout, which suggests maybe an interrupt got lost or something.
>> (Could be still the drive's fault.)
> ..
> 
> If I recall correctly, The reported shadow register contents are bogus
> when a timeout occurs.  So we don't actually know what the drive state was.
> 
> Or do we, Tejun?

Yeah, it's bogus.  Maybe we should just report zeros.

Thanks.

-- 
tejun


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-19 16:41                         ` Greg Freemyer
@ 2009-02-20  0:33                           ` Tejun Heo
  0 siblings, 0 replies; 23+ messages in thread
From: Tejun Heo @ 2009-02-20  0:33 UTC (permalink / raw)
  To: Greg Freemyer; +Cc: Chuck Ebbert, Mark Lord, Jeff Garzik, linux-ide

Greg Freemyer wrote:
> On Thu, Feb 19, 2009 at 1:21 AM, Tejun Heo <tj@kernel.org> wrote:
>> Hello,
>>
>> Chuck Ebbert wrote:
>>>> Yeap, SD_TIMEOUT should be it.  But I've never personally seen disk
>>>> flushing taking as long as 30 seconds.  Does it really happen?  It's
>>>> not like the drive would be doing random seeking.  One full stroke
>>>> across the platter should be it.  Even with the rotational delay,
>>>> going over 30 seconds doesn't seem very likely.
>>> I just noticed that the other drive, a Western Digital, was using 1.5Gbps
>>> instead of 3.0. So I forced the Samsung to the slower speed and now I
>>> can't make the timeout happen anymore.
>> Timeouts are much more likely with the higher transfer speed but it's
>> kind of strange for flush cache to be affected by it as the command
>> doesn't have any data to transfer.  :-(
> 
> I have not been following this thread, but those jumpers normally
> reduce the feature set from SATA-II to SATA-I as well.
> 
> So it could be an issue with a SATA-II feature.  ncq?

Hmmm... maybe.  Chuck, can you please use "libata.force=1.5Gbps"
instead of the jumper and see whether the problem reappears?
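For the archives: libata.force also takes an optional port prefix, so the cap can be applied to a single port on the kernel command line (the "1" below assumes the drive is on ata1, as in the logs above):

```
# Kernel boot parameters (append in the bootloader):
libata.force=1.5Gbps      # cap every port at 1.5 Gbps
libata.force=1:1.5Gbps    # cap only ata1
```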

Thanks.

-- 
tejun


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-20  0:32               ` Tejun Heo
@ 2009-02-20  2:52                 ` Robert Hancock
  2009-02-20  3:26                   ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Robert Hancock @ 2009-02-20  2:52 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Mark Lord, Jeff Garzik, Chuck Ebbert, linux-ide

Tejun Heo wrote:
> Mark Lord wrote:
>>> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
>>> ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
>>>         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
>>> ata1.00: status: { DRDY }
>>>>> I wonder if it's just a case of too short a timeout on the cache
>>>>> flushes?
>> ..
>>> However, in this case the drive is not reporting Busy status at the
>>> timeout, which suggests maybe an interrupt got lost or something.
>>> (Could be still the drive's fault.)
>> ..
>>
>> If I recall correctly, The reported shadow register contents are bogus
>> when a timeout occurs.  So we don't actually know what the drive state was.
>>
>> Or do we, Tejun?
> 
> Yeah, it's bogus.  Maybe we should just report zeros.

Didn't know that. Shouldn't we be able to do a qc_fill_rtf before error 
handling in this case? That would make it easier to tell if we lost an 
interrupt or if the drive is just taking too long..


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-20  2:52                 ` Robert Hancock
@ 2009-02-20  3:26                   ` Tejun Heo
  2009-03-01 19:31                     ` Robert Hancock
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2009-02-20  3:26 UTC (permalink / raw)
  To: Robert Hancock; +Cc: Mark Lord, Jeff Garzik, Chuck Ebbert, linux-ide

Hello,

Robert Hancock wrote:
>>> If I recall correctly, The reported shadow register contents are bogus
>>> when a timeout occurs.  So we don't actually know what the drive
>>> state was.
>>>
>>> Or do we, Tejun?
>>
>> Yeah, it's bogus.  Maybe we should just report zeros.
> 
> Didn't know that. Shouldn't we be able to do a qc_fill_rtf before error
> handling in this case? That would make it easier to tell if we lost an
> interrupt or if the drive is just taking too long..

I think Alan already did it in the patches which added improved
timeout handling callback.  Hmmm... Can't find it.  I thought it was
in #upstream.  Anyways, I'm slightly worried about reading status
blindly after timeout mainly due to experiences I had early while
developing libata EH.  Some controllers were simply scary and very
eager to lock up the whole machine.  That said, it could be that I'm
just overly paranoid.  After all, with shared IRQ, we don't have
control over when altstatus is read at least.

Thanks.

-- 
tejun


* Re: libata timeouts when stressing a Samsung HDD
  2009-02-20  3:26                   ` Tejun Heo
@ 2009-03-01 19:31                     ` Robert Hancock
  2009-03-01 20:28                       ` Alan Cox
  0 siblings, 1 reply; 23+ messages in thread
From: Robert Hancock @ 2009-03-01 19:31 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Mark Lord, Jeff Garzik, Chuck Ebbert, linux-ide

Tejun Heo wrote:
> Hello,
> 
> Robert Hancock wrote:
>>>> If I recall correctly, The reported shadow register contents are bogus
>>>> when a timeout occurs.  So we don't actually know what the drive
>>>> state was.
>>>>
>>>> Or do we, Tejun?
>>> Yeah, it's bogus.  Maybe we should just report zeros.
>> Didn't know that. Shouldn't we be able to do a qc_fill_rtf before error
>> handling in this case? That would make it easier to tell if we lost an
>> interrupt or if the drive is just taking too long..
> 
> I think Alan already did it in the patches which added improved
> timeout handling callback.  Hmmm... Can't find it.  I thought it was
> in #upstream.  Anyways, I'm slightly worried about reading status
> blindly after timeout mainly due to experiences I had early while
> developing libata EH.  Some controllers were simply scary and very
> eager to lock up the whole machine.  That said, it could be that I'm
> just overly paranoid.  After all, with shared IRQ, we don't have
> control over when altstatus is read at least.

For some of these timeout issues I think it would make things a bit 
easier to diagnose, certainly.. With some controllers there might be a 
little bit of risk (nForce4 seems to be one of those twitchy ones, 
whether in ADMA mode or not, at least for command errors, possibly not 
for timeouts), however certainly for ones like AHCI which just store the 
D2H register FIS in memory, there's really no reason not to read it out..


* Re: libata timeouts when stressing a Samsung HDD
  2009-03-01 19:31                     ` Robert Hancock
@ 2009-03-01 20:28                       ` Alan Cox
  0 siblings, 0 replies; 23+ messages in thread
From: Alan Cox @ 2009-03-01 20:28 UTC (permalink / raw)
  To: Robert Hancock; +Cc: Tejun Heo, Mark Lord, Jeff Garzik, Chuck Ebbert, linux-ide

> For some of these timeout issues I think it would make things a bit 
> easier to diagnose, certainly.. With some controllers there might be a 
> little bit of risk (nForce4 seems to be one of those twitchy ones, 
> whether in ADMA mode or not, at least for command errors, possibly not 
> for timeouts), however certainly for ones like AHCI which just store the 
> D2H register FIS in memory, there's really no reason not to read it out..

The timeout patches are not yet upstream because they needed a bit of
further work for some controllers that subclass SFF but aren't very SFF,
and so broke when the default timeout method was used.  That's all fixed
now and waiting for the next update.

Alan


end of thread, other threads:[~2009-03-01 20:31 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-02 21:40 libata timeouts when stressing a Samsung HDD Chuck Ebbert
2009-02-11  1:30 ` Tejun Heo
2009-02-11  4:08   ` Mark Lord
2009-02-11 20:29     ` Chuck Ebbert
2009-02-11 22:03       ` Mark Lord
2009-02-11 22:11         ` Jeff Garzik
2009-02-11 22:29           ` Mark Lord
2009-02-11 22:54             ` Chuck Ebbert
2009-02-12 16:10               ` Mark Lord
2009-02-12 16:13                 ` Mark Lord
2009-02-16  3:12                   ` Tejun Heo
2009-02-16 21:30                     ` Chuck Ebbert
2009-02-19  6:21                       ` Tejun Heo
2009-02-19 16:41                         ` Greg Freemyer
2009-02-20  0:33                           ` Tejun Heo
2009-02-12  4:28           ` Robert Hancock
2009-02-19 15:27             ` Mark Lord
2009-02-20  0:32               ` Tejun Heo
2009-02-20  2:52                 ` Robert Hancock
2009-02-20  3:26                   ` Tejun Heo
2009-03-01 19:31                     ` Robert Hancock
2009-03-01 20:28                       ` Alan Cox
2009-02-11 20:24   ` Chuck Ebbert
