* Wierd: Degrading while recovering raid5
From: Kyle Logue @ 2015-02-10 4:20 UTC
To: linux-raid
Hey all:
I have a 5 disk software raid5 that was working fine until I decided
to swap out an old disk with a new one.
mdadm /dev/md0 --add /dev/sda1
mdadm /dev/md0 --fail /dev/sde1
At this point it started automatically rebuilding the array.
About 60% of the way in, it stopped and I saw a lot of this repeated in my dmesg:
[Mon Feb 9 18:06:48 2015] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[Mon Feb 9 18:06:48 2015] ata5.00: failed command: SMART
[Mon Feb 9 18:06:48 2015] ata5.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 7
[Mon Feb 9 18:06:48 2015]          res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[Mon Feb 9 18:06:48 2015] ata5.00: status: { DRDY }
[Mon Feb 9 18:06:48 2015] ata5: hard resetting link
[Mon Feb 9 18:06:58 2015] ata5: softreset failed (1st FIS failed)
[Mon Feb 9 18:06:58 2015] ata5: hard resetting link
[Mon Feb 9 18:07:08 2015] ata5: softreset failed (1st FIS failed)
[Mon Feb 9 18:07:08 2015] ata5: hard resetting link
[Mon Feb 9 18:07:12 2015] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[Mon Feb 9 18:07:12 2015] ata5.00: configured for UDMA/33
[Mon Feb 9 18:07:12 2015] ata5: EH complete
ata5 corresponds to my /dev/sdc drive.
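One way to confirm an ataN-to-sdX mapping like this is to look at the sysfs path behind the block device, which normally contains the ATA port as one of its components. The following is only a sketch: the helper name `ata_port_from_path` is made up for illustration, and the sample path in the comment is a hypothetical example of what such a path can look like, not output from this machine.

```shell
# ata_port_from_path PATH
# Print the first path component of the form "ataN" (e.g. "ata5").
# Hypothetical helper; it only parses a string and touches no hardware.
ata_port_from_path() {
    echo "$1" | tr '/' '\n' | grep -E '^ata[0-9]+$' | head -n 1
}

# On a live system the path would come from sysfs, for example:
#   ata_port_from_path "$(readlink -f /sys/block/sdc)"
# A resolved path might look like (hypothetical):
#   /sys/devices/pci0000:00/0000:00:1f.2/ata5/host4/target4:0:0/4:0:0:0/block/sdc
```

If the helper prints nothing, the disk is probably not attached through libata (USB enclosures, for instance), and the dmesg port numbers have to be matched some other way.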
So I was worried, but it didn't look so terrible when I ran --examine:
sudo mdadm --examine /dev/sd[dabfec]1 | egrep 'dev|Update|Role|State|Events'
/dev/sda1:
State : clean
Update Time : Sun Feb 8 20:43:27 2015
Device Role : spare
Array State : .A.AA ('A' == active, '.' == missing)
Events : 27009
/dev/sdb1:
State : clean
Update Time : Sun Feb 8 20:43:27 2015
Device Role : Active device 4
Array State : .A.AA ('A' == active, '.' == missing)
Events : 27009
/dev/sdc1:
State : clean
Update Time : Sun Feb 8 20:21:13 2015
Device Role : Active device 0
Array State : AAAAA ('A' == active, '.' == missing)
Events : 26995
/dev/sdd1:
State : clean
Update Time : Sun Feb 8 20:43:27 2015
Device Role : Active device 1
Array State : .A.AA ('A' == active, '.' == missing)
Events : 27009
/dev/sde1:
State : clean
Update Time : Sun Feb 8 12:17:10 2015
Device Role : Active device 2
Array State : AAAAA ('A' == active, '.' == missing)
Events : 21977
/dev/sdf1:
State : clean
Update Time : Sun Feb 8 20:43:27 2015
Device Role : Active device 3
Array State : .A.AA ('A' == active, '.' == missing)
Events : 27009
So the event counts looked pretty close on the drives I was updating, so I did:
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sd[dabfec]1
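To put a number on "pretty close", the Events field of each member can be compared with the highest value in the set. This is a minimal sketch, not an mdadm feature: the helper name `event_gaps` is invented here, and it assumes `--examine` output shaped like the listing above (a `/dev/...:` line followed later by an `Events :` line).

```shell
# event_gaps: read `mdadm --examine` output on stdin and print one line per
# member: device, event count, and how far it lags behind the newest member.
# Hypothetical helper, not part of mdadm.
event_gaps() {
    awk '
        /^\/dev\// { dev = $1; sub(/:$/, "", dev) }
        $1 == "Events" { ev[dev] = $3; if ($3 + 0 > max) max = $3 + 0 }
        END { for (d in ev) printf "%s %d %d\n", d, ev[d], max - ev[d] }
    ' | sort
}

# Example (needs root and real member devices):
#   mdadm --examine /dev/sd[a-f]1 | event_gaps
```

With the values posted above, sdc1 (26995) lags the up-to-date members (27009) by 14 events, while sde1 (21977) lags by 5032.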
But recovery stopped again at some point while I was at work, with
the same ATA errors in dmesg.
Searching the web for these errors shows lots of people having this
issue with various Linux distros and laying the blame on everything
from faulty SATA cables to BIOS to NVIDIA drivers - nothing
definitive. I powered off my box and reconnected all my SATA cables as
a sanity check.
I tried --assemble --force again and it got to 70%:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdc1[7] sda1[8] sdb1[6] sdf1[4] sdd1[5]
      7814047744 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UU_UU]
      [=============>.......]  recovery = 68.9% (1347855508/1953511936) finish=306.1min speed=32967K/sec
...but died again. I was monitoring dmesg like a hawk this time and
saw those ata5 errors every 3-15 minutes with different cmd and res
values. At the very end I got this:
[Mon Feb 9 23:11:01 2015] ata5.00: configured for UDMA/33
[Mon Feb 9 23:11:01 2015] sd 4:0:0:0: [sdc] Unhandled sense code
[Mon Feb 9 23:11:01 2015] sd 4:0:0:0: [sdc]
[Mon Feb 9 23:11:01 2015] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon Feb 9 23:11:01 2015] sd 4:0:0:0: [sdc]
[Mon Feb 9 23:11:01 2015] Sense Key : Medium Error [current] [descriptor]
[Mon Feb 9 23:11:01 2015] Descriptor sense data with sense descriptors (in hex):
[Mon Feb 9 23:11:01 2015]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[Mon Feb 9 23:11:01 2015]         a4 1c 1d e8
[Mon Feb 9 23:11:01 2015] sd 4:0:0:0: [sdc]
[Mon Feb 9 23:11:01 2015] Add. Sense: Unrecovered read error - auto reallocate failed
[Mon Feb 9 23:11:01 2015] sd 4:0:0:0: [sdc] CDB:
[Mon Feb 9 23:11:01 2015] Read(10): 28 00 a4 1c 1d e8 00 00 80 00
[Mon Feb 9 23:11:01 2015] end_request: I/O error, dev sdc, sector 2753306088
[Mon Feb 9 23:11:01 2015] md/raid:md0: Disk failure on sdc1, disabling device.
[Mon Feb 9 23:11:01 2015] md/raid:md0: Operation continuing on 3 devices.
[Mon Feb 9 23:11:01 2015] ata5: EH complete
[Mon Feb 9 23:11:01 2015] md: md0: recovery interrupted.
[Mon Feb 9 23:11:01 2015] RAID conf printout:
[Mon Feb 9 23:11:01 2015] --- level:5 rd:5 wd:3
[Mon Feb 9 23:11:01 2015] disk 0, o:0, dev:sdc1
[Mon Feb 9 23:11:01 2015] disk 1, o:1, dev:sdd1
[Mon Feb 9 23:11:01 2015] disk 2, o:1, dev:sda1
[Mon Feb 9 23:11:01 2015] disk 3, o:1, dev:sdf1
[Mon Feb 9 23:11:01 2015] disk 4, o:1, dev:sdb1
[Mon Feb 9 23:11:01 2015] RAID conf printout:
[Mon Feb 9 23:11:01 2015] --- level:5 rd:5 wd:3
[Mon Feb 9 23:11:01 2015] disk 1, o:1, dev:sdd1
[Mon Feb 9 23:11:01 2015] disk 2, o:1, dev:sda1
[Mon Feb 9 23:11:01 2015] disk 3, o:1, dev:sdf1
[Mon Feb 9 23:11:01 2015] disk 4, o:1, dev:sdb1
[Mon Feb 9 23:11:01 2015] RAID conf printout:
[Mon Feb 9 23:11:01 2015] --- level:5 rd:5 wd:3
[Mon Feb 9 23:11:01 2015] disk 1, o:1, dev:sdd1
[Mon Feb 9 23:11:01 2015] disk 2, o:1, dev:sda1
[Mon Feb 9 23:11:01 2015] disk 3, o:1, dev:sdf1
[Mon Feb 9 23:11:01 2015] disk 4, o:1, dev:sdb1
[Mon Feb 9 23:11:01 2015] RAID conf printout:
[Mon Feb 9 23:11:01 2015] --- level:5 rd:5 wd:3
[Mon Feb 9 23:11:01 2015] disk 1, o:1, dev:sdd1
[Mon Feb 9 23:11:01 2015] disk 3, o:1, dev:sdf1
[Mon Feb 9 23:11:01 2015] disk 4, o:1, dev:sdb1
and mdstat now has:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdc1[7](F) sda1[8](S) sdb1[6] sdf1[4] sdd1[5]
      7814047744 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/3] [_U_UU]
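For readers less used to mdstat: the `[5/3]` field reads as "5 members configured, 3 currently active", and a raid5 can only run with at most one member missing. A tiny sketch of that arithmetic follows; the helper name `md_missing` is made up for illustration.

```shell
# md_missing "[N/M]"
# Print how many members are missing, given the "[configured/active]"
# field from /proc/mdstat. Hypothetical helper; pure string arithmetic.
md_missing() {
    echo "$1" | tr -d '[]' | awk -F/ '{ print $1 - $2 }'
}

md_missing "[5/3]"   # prints 2
```

Two missing members is one more than raid5 can tolerate, which is why recovery cannot continue from this state.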
And now I am out of ideas. Any thoughts on correcting those ata5
errors, or maybe skipping those sectors? While sde1 is the disk I
manually failed, it hasn't been touched yet. Its event count is way
off now, but maybe I can use that somehow? Should I replace the SATA
cable for sdc and retry?
Anybody in DC want a beer on me for helping figure this out? I have
more log files stored, but was trying to keep it short.
Thanks for looking,
Kyle L
PS. mdadm v3.2.5 on Ubuntu 14.04 running linux 3.13.0-45
PPS. Last full backup was six months ago. Hmm.
* Re: Wierd: Degrading while recovering raid5
From: Adam Goryachev @ 2015-02-10 7:35 UTC
To: Kyle Logue, linux-raid
Hi Kyle,
There are other people who will jump in and help you with your problem,
but I'll add a couple of pointers while you are waiting. See below.
On 10/02/15 15:20, Kyle Logue wrote:
> Hey all:
>
> I have a 5 disk software raid5 that was working fine until I decided
> to swap out an old disk with a new one.
>
> mdadm /dev/md0 --add /dev/sda1
> mdadm /dev/md0 --fail /dev/sde1
>
> At this point it started automatically rebuilding the array.
> About 60%? of the way in it stops and I see a lot of this repeated in my dmesg:
>
> [Mon Feb 9 18:06:48 2015] ata5.00: exception Emask 0x0 SAct 0x0 SErr
> 0x0 action 0x6 frozen
> [Mon Feb 9 18:06:48 2015] ata5.00: failed command: SMART
> [Mon Feb 9 18:06:48 2015] ata5.00: cmd
> b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 7
> [Mon Feb 9 18:06:48 2015] res
> 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
> [Mon Feb 9 18:06:48 2015] ata5.00: status: { DRDY }
> [Mon Feb 9 18:06:48 2015] ata5: hard resetting link
> [Mon Feb 9 18:06:58 2015] ata5: softreset failed (1st FIS failed)
> [Mon Feb 9 18:06:58 2015] ata5: hard resetting link
> [Mon Feb 9 18:07:08 2015] ata5: softreset failed (1st FIS failed)
> [Mon Feb 9 18:07:08 2015] ata5: hard resetting link
> [Mon Feb 9 18:07:12 2015] ata5: SATA link up 1.5 Gbps (SStatus 113
> SControl 310)
> [Mon Feb 9 18:07:12 2015] ata5.00: configured for UDMA/33
> [Mon Feb 9 18:07:12 2015] ata5: EH complete
>
> ata5 corresponds to my /dev/sdc drive.
First, check if the drive is faulty.
dd if=/dev/sdc of=/dev/null bs=10M
If that completes without any errors from dd, then the drive can be read
OK. Then check the logs: were there any errors there? Whether or not
there were, read about timing mismatches between the kernel and the
hard drive, and how to solve that. There was another post earlier today
with some links to specific posts that will be helpful (check the
online archive).
Finally, I think your first mistake was to fail the drive. You should
have replaced it instead, which would have kept the array protected
against a drive failure while the new disk was being built.
See the second answer to this question:
http://unix.stackexchange.com/questions/74924/how-to-safely-replace-a-not-yet-failed-disk-in-a-linux-raid5-array
Regards,
Adam
--
Adam Goryachev Website Managers www.websitemanagers.com.au
* Re: Wierd: Degrading while recovering raid5
From: Phil Turmel @ 2015-02-10 13:51 UTC
To: Adam Goryachev, Kyle Logue, linux-raid
Hi Kyle,
Your symptoms look like classic timeout mismatch. Details interleaved.
On 02/10/2015 02:35 AM, Adam Goryachev wrote:
> There are other people who will jump in and help you with your problem,
> but I'll add a couple of pointers while you are waiting. See below.
> On 10/02/15 15:20, Kyle Logue wrote:
>> Hey all:
>>
>> I have a 5 disk software raid5 that was working fine until I decided
>> to swap out an old disk with a new one.
>>
>> mdadm /dev/md0 --add /dev/sda1
>> mdadm /dev/md0 --fail /dev/sde1
As Adam pointed out, you should have used --replace, but you probably
wouldn't have made it through the replace function anyways.
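For reference, a hot replace could look roughly like the sketch below. The old member stays in service until the new one is fully built, so the array never runs degraded. Treat the details as assumptions: the helper name `print_replace_plan` is invented here, it only prints the commands rather than running them, and as far as I know the `--replace`/`--with` options need mdadm 3.3 or newer (the mdadm v3.2.5 mentioned earlier in the thread would be too old).

```shell
# print_replace_plan ARRAY OLD NEW
# Print (do not run) the commands for a hot replace of OLD by NEW.
# Hypothetical helper; --replace/--with are assumed to need mdadm >= 3.3.
print_replace_plan() {
    array=$1 old=$2 new=$3
    echo "mdadm $array --add $new"
    echo "mdadm $array --replace $old --with $new"
}

print_replace_plan /dev/md0 /dev/sde1 /dev/sda1

# With an older mdadm on a kernel that supports hot replace, the same
# request can reportedly be made through sysfs after adding the spare:
#   echo want_replacement > /sys/block/md0/md/dev-sde1/state
```

Printing instead of executing keeps the sketch safe to paste; the two printed lines can be reviewed and then run by hand as root.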
>> At this point it started automatically rebuilding the array.
>> About 60%? of the way in it stops and I see a lot of this repeated in
>> my dmesg:
>>
>> [Mon Feb 9 18:06:48 2015] ata5.00: exception Emask 0x0 SAct 0x0 SErr
>> 0x0 action 0x6 frozen
>> [Mon Feb 9 18:06:48 2015] ata5.00: failed command: SMART
>> [Mon Feb 9 18:06:48 2015] ata5.00: cmd
>> b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 7
>> [Mon Feb 9 18:06:48 2015] res
>> 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Smoking gun: the "(timeout)" at the end of that line.
>> [Mon Feb 9 18:06:48 2015] ata5.00: status: { DRDY }
>> [Mon Feb 9 18:06:48 2015] ata5: hard resetting link
>> [Mon Feb 9 18:06:58 2015] ata5: softreset failed (1st FIS failed)
>> [Mon Feb 9 18:06:58 2015] ata5: hard resetting link
>> [Mon Feb 9 18:07:08 2015] ata5: softreset failed (1st FIS failed)
>> [Mon Feb 9 18:07:08 2015] ata5: hard resetting link
>> [Mon Feb 9 18:07:12 2015] ata5: SATA link up 1.5 Gbps (SStatus 113
>> SControl 310)
>> [Mon Feb 9 18:07:12 2015] ata5.00: configured for UDMA/33
>> [Mon Feb 9 18:07:12 2015] ata5: EH complete
Notice that after a timeout error, the drive is unresponsive for several
more seconds -- about 24 in your case.
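The 24 seconds is simply the gap between the first exception line (18:06:48) and "EH complete" (18:07:12) in the quoted log. A small sketch of that arithmetic, with an invented helper name `secs_between` and the assumption that both timestamps fall on the same day:

```shell
# secs_between HH:MM:SS HH:MM:SS
# Print the number of seconds from the first time to the second.
# Hypothetical helper; assumes both times are on the same day.
secs_between() {
    echo "$1 $2" | awk '{
        split($1, a, ":"); split($2, b, ":")
        print (b[1]*3600 + b[2]*60 + b[3]) - (a[1]*3600 + a[2]*60 + a[3])
    }'
}

secs_between 18:06:48 18:07:12   # prints 24
```

That recovery window comes on top of the time the command had already spent waiting before the kernel gave up on it.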
> .... read about timing mismatches
> between the kernel and the hard drive, and how to solve that. There was
> another post earlier today with some links to specific posts that will
> be helpful (check the online archive).
That would have been me. Start with this link for a description of what
you are experiencing:
http://marc.info/?l=linux-raid&m=135811522817345&w=1
First, you need to protect yourself from timeout mismatch due to the use
of desktop-grade drives. (Enterprise and raid-rated drives don't have
this problem.)
{ If you were stuck in the middle of a replace and you had just
worked around your timeout problem, it would likely continue and
complete. You've lost that opportunity. }
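A common way to work around timeout mismatch is sketched below. The assumptions are loud ones: the helper names are invented, the 7.0-second drive limit and the 180-second kernel timeout are conventional choices rather than required values, and the fallback relies on `smartctl` returning a non-zero exit status when the drive rejects the SCT ERC command. Neither setting survives a power cycle, so something like this would have to run at every boot.

```shell
# set_kernel_timeout DEV SECS [SYSFS_ROOT]
# Set the kernel's command timeout for one disk (the default is 30 s).
# SYSFS_ROOT exists only so the function can be tried against a scratch
# directory; on a real system leave it at /sys and run as root.
set_kernel_timeout() {
    dev=$1 secs=$2 root=${3:-/sys}
    echo "$secs" > "$root/block/$dev/device/timeout"
}

# fix_timeout_mismatch DEV
# Prefer a 7.0 s error-recovery limit inside the drive (units are 0.1 s).
# If the drive rejects SCT ERC, fall back to a long kernel timeout so the
# kernel waits out the drive's own retries instead of resetting the link.
fix_timeout_mismatch() {
    if smartctl -l scterc,70,70 "/dev/$1" >/dev/null 2>&1; then
        echo "$1: SCT ERC set to 7.0 s"
    else
        set_kernel_timeout "$1" 180
        echo "$1: no SCT ERC, kernel timeout set to 180 s"
    fi
}

# Example (as root):
#   for d in sda sdb sdc sdd sdf; do fix_timeout_mismatch "$d"; done
```

The current kernel value for a disk can be read back with `cat /sys/block/sdc/device/timeout`.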
Show us the output of "smartctl -x" for all of your drives if you'd like
advice on your particular drives. (Pasted inline is preferred.)
Second, you need to find and overwrite (with zeros) the bad sectors on
your drives. Or ddrescue to a complete set of replacement drives and
assemble those.
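As an illustration of the "find and overwrite" route: the failing LBA can be read from the kernel log, where the Read(10) CDB and sense data earlier in the thread carry it as the bytes `a4 1c 1d e8`. The sketch below converts those bytes to a decimal sector number and prints, without running them, `hdparm` commands to test and zero that sector. The helper names are invented, and zeroing a sector destroys its contents, so on an array with no redundancy left that data is gone.

```shell
# hex_to_lba "a4 1c 1d e8"
# Convert the big-endian LBA bytes from a CDB or sense dump to decimal.
# Hypothetical helper.
hex_to_lba() {
    printf '%d\n' "0x$(echo "$1" | tr -d ' ')"
}

# print_sector_fix DISK LBA
# Print (do not run) a read test and a zero-fill for one logical sector.
# Hypothetical helper; the write command is destructive if executed.
print_sector_fix() {
    echo "hdparm --read-sector $2 $1"
    echo "hdparm --yes-i-know-what-i-am-doing --write-sector $2 $1"
}

print_sector_fix /dev/sdc "$(hex_to_lba 'a4 1c 1d e8')"
```

The conversion yields 2753306088, which matches the sector in the "end_request: I/O error" line of the original report. On drives with 4096-byte physical sectors, the neighbouring logical sectors in the same physical sector may need the same treatment.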
Third, you need to set up a cron job to scrub your array regularly to
clean out UREs before they accumulate beyond MD's ability to handle it
(20 read errors in an hour, 10 per hour sustained).
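A scrub is started by writing `check` to the array's `sync_action` file in sysfs. The sketch below wraps that in a function and shows one possible cron entry in a comment. The helper name `start_scrub`, the optional sysfs-root argument, and the weekly schedule are all assumptions made for illustration.

```shell
# start_scrub MD [SYSFS_ROOT]
# Ask md to read every stripe of the array; sectors that fail to read are
# rebuilt from redundancy and rewritten. SYSFS_ROOT exists only so the
# function can be tried against a scratch directory.
start_scrub() {
    md=$1 root=${2:-/sys}
    echo check > "$root/block/$md/md/sync_action"
}

# Possible /etc/cron.d entry (weekly, Sunday 01:00; the schedule is arbitrary):
#   0 1 * * 0  root  echo check > /sys/block/md0/md/sync_action
```

Progress of a running check shows up in /proc/mdstat.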
Phil
* Re: Wierd: Degrading while recovering raid5
From: Kyle Logue @ 2015-02-10 21:50 UTC
To: linux-raid
Phil:
Thanks for your detailed response. That link does seem to describe my
problem and I do understand that desktop grade drives are sub-optimal.
It was many years ago when I first set up this array on my home
theater PC. Until now I had no idea about the cron job - I'll make
sure to implement that. I am preparing to move to 6 TB disks sometime
soon and I'll definitely go enterprise this time.
Regarding the drive timeout: I understand that I need to increase it
from 30 seconds to something larger (2+ min) but am unaware how to do
this. Is it a kernel variable? I'll keep googling but this seems like
it's what's going to save me.
tl;dr: How do I change the drive timeout?
Here is the smartctl -x for all my drives:
Reminder: SDA is the new drive. SDC is the troublemaker. SDE is the
one I failed.
> sudo smartctl -x /dev/sda
> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-45-generic] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> === START OF INFORMATION SECTION ===
> Model Family: Seagate Barracuda 7200.14 (AF)
> Device Model: ST2000DM001-1CH164
> Serial Number: Z340F2SP
> LU WWN Device Id: 5 000c50 064d5887d
> Firmware Version: CC27
> User Capacity: 2,000,398,934,016 bytes [2.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Rotation Rate: 7200 rpm
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
> SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is: Tue Feb 10 16:37:52 2015 EST
> ==> WARNING: A firmware update for this drive may be available,
> see the following Seagate web pages:
> http://knowledge.seagate.com/articles/en_US/FAQ/207931en
> http://knowledge.seagate.com/articles/en_US/FAQ/223651en
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is: Unavailable
> APM level is: 254 (maximum performance)
> Rd look-ahead is: Enabled
> Write cache is: Enabled
> ATA Security is: Disabled, NOT FROZEN [SEC1]
> Write SCT (Get) XXX Error Recovery Control Command failed: scsi error aborted command
> Wt Cache Reorder: N/A
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> General SMART Values:
> Offline data collection status: (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: ( 584) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 1) minutes.
> Extended self-test routine
> recommended polling time: ( 212) minutes.
> Conveyance self-test routine
> recommended polling time: ( 2) minutes.
> SCT capabilities: (0x3085) SCT Status supported.
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR--   105   099   006    -    9806192
>   3 Spin_Up_Time            PO----   097   097   000    -    0
>   4 Start_Stop_Count        -O--CK   100   100   020    -    4
>   5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
>   7 Seek_Error_Rate         POSR--   100   253   030    -    289070
>   9 Power_On_Hours          -O--CK   100   100   000    -    35
>  10 Spin_Retry_Count        PO--C-   100   100   097    -    0
>  12 Power_Cycle_Count       -O--CK   100   100   020    -    5
> 183 Runtime_Bad_Block       -O--CK   099   099   000    -    1
> 184 End-to-End_Error        -O--CK   100   100   099    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> 188 Command_Timeout         -O--CK   100   100   000    -    0 0 0
> 189 High_Fly_Writes         -O-RCK   100   100   000    -    0
> 190 Airflow_Temperature_Cel -O---K   073   062   045    -    27 (Min/Max 25/27)
> 191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    4
> 193 Load_Cycle_Count        -O--CK   100   100   000    -    8
> 194 Temperature_Celsius     -O---K   027   040   000    -    27 (0 22 0 0 0)
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    0
> 199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
> 240 Head_Flying_Hours       ------   100   253   000    -    35h+41m+13.042s
> 241 Total_LBAs_Written      ------   100   253   000    -    11031892416
> 242 Total_LBAs_Read         ------   100   253   000    -    2769646
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
> General Purpose Log Directory Version 1
> SMART Log Directory Version 1 [multi-sector log support]
> Address Access R/W Size Description
> 0x00 GPL,SL R/O 1 Log Directory
> 0x01 SL R/O 1 Summary SMART error log
> 0x02 SL R/O 5 Comprehensive SMART error log
> 0x03 GPL R/O 5 Ext. Comprehensive SMART error log
> 0x06 SL R/O 1 SMART self-test log
> 0x07 GPL R/O 1 Extended self-test log
> 0x09 SL R/W 1 Selective self-test log
> 0x10 GPL R/O 1 NCQ Command Error log
> 0x11 GPL R/O 1 SATA Phy Event Counters
> 0x21 GPL R/O 1 Write stream error log
> 0x22 GPL R/O 1 Read stream error log
> 0x80-0x9f GPL,SL R/W 16 Host vendor specific log
> 0xa1 GPL,SL VS 20 Device vendor specific log
> 0xa2 GPL VS 4496 Device vendor specific log
> 0xa8 GPL,SL VS 129 Device vendor specific log
> 0xa9 GPL,SL VS 1 Device vendor specific log
> 0xab GPL VS 1 Device vendor specific log
> 0xb0 GPL VS 5176 Device vendor specific log
> 0xbe-0xbf GPL VS 65535 Device vendor specific log
> 0xc0 GPL,SL VS 1 Device vendor specific log
> 0xc1 GPL,SL VS 10 Device vendor specific log
> 0xc4 GPL,SL VS 5 Device vendor specific log
> 0xe0 GPL,SL R/W 1 SCT Command/Status
> 0xe1 GPL,SL R/W 1 SCT Data Transfer
> SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
> No Errors Logged
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged. [To run self-tests, use: smartctl -t]
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> SCT Data Table command not supported
> SCT Error Recovery Control command not supported
> Device Statistics (GP Log 0x04) not supported
> SATA Phy Event Counters (GP Log 0x11)
> ID Size Value Description
> 0x000a 2 6 Device-to-host register FISes sent due to a COMRESET
> 0x0001 2 0 Command failed due to ICRC error
> 0x0003 2 0 R_ERR response for device-to-host data FIS
> 0x0004 2 0 R_ERR response for host-to-device data FIS
> 0x0006 2 0 R_ERR response for device-to-host non-data FIS
> 0x0007 2 0 R_ERR response for host-to-device non-data FIS
>
> sudo smartctl -x /dev/sdb
> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-45-generic] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> === START OF INFORMATION SECTION ===
> Model Family: Seagate Barracuda 7200.14 (AF)
> Device Model: ST2000DM001-1CH164
> Serial Number: S1E1CW9Y
> LU WWN Device Id: 5 000c50 05c085bef
> Firmware Version: CC24
> User Capacity: 2,000,398,934,016 bytes [2.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Rotation Rate: 7200 rpm
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ATA8-ACS T13/1699-D revision 4
> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is: Tue Feb 10 16:40:24 2015 EST
> ==> WARNING: A firmware update for this drive may be available,
> see the following Seagate web pages:
> http://knowledge.seagate.com/articles/en_US/FAQ/207931en
> http://knowledge.seagate.com/articles/en_US/FAQ/223651en
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is: Unavailable
> APM level is: 254 (maximum performance)
> Rd look-ahead is: Enabled
> Write cache is: Enabled
> ATA Security is: Disabled, NOT FROZEN [SEC1]
> Write SCT (Get) XXX Error Recovery Control Command failed: scsi error aborted command
> Wt Cache Reorder: N/A
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> General SMART Values:
> Offline data collection status: (0x82) Offline data collection activity
> was completed without error.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: ( 584) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 1) minutes.
> Extended self-test routine
> recommended polling time: ( 225) minutes.
> Conveyance self-test routine
> recommended polling time: ( 2) minutes.
> SCT capabilities: (0x3085) SCT Status supported.
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR--   117   099   006    -    153090384
>   3 Spin_Up_Time            PO----   096   096   000    -    0
>   4 Start_Stop_Count        -O--CK   100   100   020    -    58
>   5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
>   7 Seek_Error_Rate         POSR--   063   058   030    -    8594213138
>   9 Power_On_Hours          -O--CK   084   084   000    -    14743
>  10 Spin_Retry_Count        PO--C-   100   100   097    -    0
>  12 Power_Cycle_Count       -O--CK   100   100   020    -    58
> 183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
> 184 End-to-End_Error        -O--CK   100   100   099    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> 188 Command_Timeout         -O--CK   100   099   000    -    1 1 1
> 189 High_Fly_Writes         -O-RCK   100   100   000    -    0
> 190 Airflow_Temperature_Cel -O---K   072   057   045    -    28 (Min/Max 26/28)
> 191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    34
> 193 Load_Cycle_Count        -O--CK   100   100   000    -    110
> 194 Temperature_Celsius     -O---K   028   043   000    -    28 (0 18 0 0 0)
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    0
> 199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
> 240 Head_Flying_Hours       ------   100   253   000    -    14740h+55m+31.297s
> 241 Total_LBAs_Written      ------   100   253   000    -    9249405614
> 242 Total_LBAs_Read         ------   100   253   000    -    100539385901
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
> General Purpose Log Directory Version 1
> SMART Log Directory Version 1 [multi-sector log support]
> Address Access R/W Size Description
> 0x00 GPL,SL R/O 1 Log Directory
> 0x01 SL R/O 1 Summary SMART error log
> 0x02 SL R/O 5 Comprehensive SMART error log
> 0x03 GPL R/O 5 Ext. Comprehensive SMART error log
> 0x06 SL R/O 1 SMART self-test log
> 0x07 GPL R/O 1 Extended self-test log
> 0x09 SL R/W 1 Selective self-test log
> 0x10 GPL R/O 1 NCQ Command Error log
> 0x11 GPL R/O 1 SATA Phy Event Counters
> 0x21 GPL R/O 1 Write stream error log
> 0x22 GPL R/O 1 Read stream error log
> 0x80-0x9f GPL,SL R/W 16 Host vendor specific log
> 0xa1 GPL,SL VS 20 Device vendor specific log
> 0xa2 GPL VS 4496 Device vendor specific log
> 0xa8 GPL,SL VS 129 Device vendor specific log
> 0xa9 GPL,SL VS 1 Device vendor specific log
> 0xab GPL VS 1 Device vendor specific log
> 0xb0 GPL VS 5176 Device vendor specific log
> 0xbd GPL VS 512 Device vendor specific log
> 0xbe-0xbf GPL VS 65535 Device vendor specific log
> 0xc0 GPL,SL VS 1 Device vendor specific log
> 0xc1 GPL,SL VS 10 Device vendor specific log
> 0xc4 GPL,SL VS 5 Device vendor specific log
> 0xe0 GPL,SL R/W 1 SCT Command/Status
> 0xe1 GPL,SL R/W 1 SCT Data Transfer
> SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
> No Errors Logged
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged. [To run self-tests, use: smartctl -t]
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> SCT Data Table command not supported
> SCT Error Recovery Control command not supported
> Device Statistics (GP Log 0x04) not supported
> SATA Phy Event Counters (GP Log 0x11)
> ID Size Value Description
> 0x000a 2 6 Device-to-host register FISes sent due to a COMRESET
> 0x0001 2 0 Command failed due to ICRC error
> 0x0003 2 0 R_ERR response for device-to-host data FIS
> 0x0004 2 0 R_ERR response for host-to-device data FIS
> 0x0006 2 0 R_ERR response for device-to-host non-data FIS
> 0x0007 2 0 R_ERR response for host-to-device non-data FIS
> THIS IS THE BAD DISK:
> sudo smartctl -x /dev/sdc
> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-45-generic] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> === START OF INFORMATION SECTION ===
> Model Family: Seagate Barracuda 7200.14 (AF)
> Device Model: ST2000DM001-1CH164
> Serial Number: S240V6VR
> LU WWN Device Id: 5 000c50 05c05c2e7
> Firmware Version: CC24
> User Capacity: 2,000,398,934,016 bytes [2.00 TB]
> Sector Sizes: 512 bytes logical, 4096 bytes physical
> Rotation Rate: 7200 rpm
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ATA8-ACS T13/1699-D revision 4
> SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is: Tue Feb 10 16:42:53 2015 EST
> ==> WARNING: A firmware update for this drive may be available,
> see the following Seagate web pages:
> http://knowledge.seagate.com/articles/en_US/FAQ/207931en
> http://knowledge.seagate.com/articles/en_US/FAQ/223651en
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is: Unavailable
> APM level is: 254 (maximum performance)
> Rd look-ahead is: Enabled
> Write cache is: Enabled
> ATA Security is: Disabled, NOT FROZEN [SEC1]
> Write SCT (Get) XXX Error Recovery Control Command failed: scsi error aborted command
> Wt Cache Reorder: N/A
> Read SMART Data failed: scsi error aborted command
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: UNKNOWN!
> SMART Status, Attributes and Thresholds cannot be read.
> General Purpose Log Directory Version 1
> SMART Log Directory Version 1 [multi-sector log support]
> Address Access R/W Size Description
> 0x00 GPL,SL R/O 1 Log Directory
> 0x01 SL R/O 1 Summary SMART error log
> 0x02 SL R/O 5 Comprehensive SMART error log
> 0x03 GPL R/O 5 Ext. Comprehensive SMART error log
> 0x06 SL R/O 1 SMART self-test log
> 0x07 GPL R/O 1 Extended self-test log
> 0x09 SL R/W 1 Selective self-test log
> 0x10 GPL R/O 1 NCQ Command Error log
> 0x11 GPL R/O 1 SATA Phy Event Counters
> 0x21 GPL R/O 1 Write stream error log
> 0x22 GPL R/O 1 Read stream error log
> 0x80-0x9f GPL,SL R/W 16 Host vendor specific log
> 0xa1 GPL,SL VS 20 Device vendor specific log
> 0xa2 GPL VS 4496 Device vendor specific log
> 0xa8 GPL,SL VS 129 Device vendor specific log
> 0xa9 GPL,SL VS 1 Device vendor specific log
> 0xab GPL VS 1 Device vendor specific log
> 0xb0 GPL VS 5176 Device vendor specific log
> 0xbd GPL VS 512 Device vendor specific log
> 0xbe-0xbf GPL VS 65535 Device vendor specific log
> 0xc0 GPL,SL VS 1 Device vendor specific log
> 0xc1 GPL,SL VS 10 Device vendor specific log
> 0xc4 GPL,SL VS 5 Device vendor specific log
> 0xe0 GPL,SL R/W 1 SCT Command/Status
> 0xe1 GPL,SL R/W 1 SCT Data Transfer
> SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
> Device Error Count: 9
> CR = Command Register
> FEATR = Features Register
> COUNT = Count (was: Sector Count) Register
> LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
> LH = LBA High (was: Cylinder High) Register ] LBA
> LM = LBA Mid (was: Cylinder Low) Register ] Register
> LL = LBA Low (was: Sector Number) Register ]
> DV = Device (was: Device/Head) Register
> DC = Device Control Register
> ER = Error register
> ST = Status register
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
> Error 9 [8] occurred at disk power-on lifetime: 14697 hours (612 days + 9 hours)
> When the command that caused the error occurred, the device was active or idle.
> After command completion occurred, registers were:
> ER -- ST COUNT LBA_48 LH LM LL DV DC
> -- -- -- == -- == == == -- -- -- -- --
> 40 -- 51 00 00 00 00 a4 1c 1d e8 00 00 Error: UNC at LBA = 0xa41c1de8 = 2753306088
> Commands leading to the command that caused the error were:
> CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
> -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
> 25 00 00 00 80 00 00 a4 1c 1d e8 e0 00 04:55:26.791 READ DMA EXT
> 25 00 00 04 00 00 00 a4 1c 21 00 e0 00 04:55:26.776 READ DMA EXT
> ef 00 10 00 02 00 00 00 00 00 00 a0 00 04:55:26.775 SET FEATURES [Enable SATA feature]
> 27 00 00 00 00 00 00 00 00 00 00 e0 00 04:55:26.775 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
> ec 00 00 00 00 00 00 00 00 00 00 a0 00 04:55:26.774 IDENTIFY DEVICE
> Error 8 [7] occurred at disk power-on lifetime: 14697 hours (612 days + 9 hours)
> When the command that caused the error occurred, the device was active or idle.
> After command completion occurred, registers were:
> ER -- ST COUNT LBA_48 LH LM LL DV DC
> -- -- -- == -- == == == -- -- -- -- --
> 40 -- 51 00 00 00 00 a4 1c 1d e8 00 00 Error: UNC at LBA = 0xa41c1de8 = 2753306088
> Commands leading to the command that caused the error were:
> CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
> -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
> 25 00 00 04 00 00 00 a4 1c 1d 00 e0 00 04:55:23.631 READ DMA EXT
> 25 00 00 04 00 00 00 a4 1c 19 00 e0 00 04:55:23.553 READ DMA EXT
> 25 00 00 04 00 00 00 a4 1c 15 00 e0 00 04:55:23.108 READ DMA EXT
> 25 00 00 04 00 00 00 a4 1c 11 00 e0 00 04:55:23.004 READ DMA EXT
> 25 00 00 04 00 00 00 a4 1c 0d 00 e0 00 04:55:22.893 READ DMA EXT
> Error 7 [6] occurred at disk power-on lifetime: 14686 hours (611 days + 22 hours)
> When the command that caused the error occurred, the device was active or idle.
> After command completion occurred, registers were:
> ER -- ST COUNT LBA_48 LH LM LL DV DC
> -- -- -- == -- == == == -- -- -- -- --
> 40 -- 51 00 00 00 00 a4 1c 1d e8 00 00 Error: UNC at LBA = 0xa41c1de8 = 2753306088
> Commands leading to the command that caused the error were:
> CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
> -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
> 25 00 00 03 c0 00 00 a4 1c 1d e8 e0 00 1d+00:26:44.862 READ DMA EXT
> 25 00 00 00 08 00 00 a4 1c 21 a8 e0 00 1d+00:26:44.852 READ DMA EXT
> ec 00 00 00 01 00 00 00 00 00 00 00 00 1d+00:26:44.851 IDENTIFY DEVICE
> ec 00 00 00 01 00 00 00 00 00 00 00 00 1d+00:26:44.851 IDENTIFY DEVICE
> e5 00 00 00 00 00 00 00 00 00 00 00 00 1d+00:26:44.851 CHECK POWER MODE
> Error 6 [5] occurred at disk power-on lifetime: 14686 hours (611 days + 22 hours)
> When the command that caused the error occurred, the device was active or idle.
> After command completion occurred, registers were:
> ER -- ST COUNT LBA_48 LH LM LL DV DC
> -- -- -- == -- == == == -- -- -- -- --
> 40 -- 51 00 00 00 00 a4 1c 1d e8 00 00 Error: UNC at LBA = 0xa41c1de8 = 2753306088
> Commands leading to the command that caused the error were:
> CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
> -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
> 25 00 00 04 00 00 00 a4 1c 1d a8 e0 00 1d+00:26:30.653 READ DMA EXT
> ef 00 90 00 03 00 00 00 00 00 00 a0 00 1d+00:26:30.638 SET FEATURES [Disable SATA feature]
> ef 00 10 00 02 00 00 00 00 00 00 a0 00 1d+00:26:30.638 SET FEATURES [Enable SATA feature]
> 27 00 00 00 00 00 00 00 00 00 00 e0 00 1d+00:26:30.638 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
> ec 00 00 00 00 00 00 00 00 00 00 a0 00 1d+00:26:30.638 IDENTIFY DEVICE
> Error 5 [4] occurred at disk power-on lifetime: 14676 hours (611 days + 12 hours)
> When the command that caused the error occurred, the device was active or idle.
> After command completion occurred, registers were:
> ER -- ST COUNT LBA_48 LH LM LL DV DC
> -- -- -- == -- == == == -- -- -- -- --
> 40 -- 51 00 00 00 00 a4 1c 1d e8 00 00 Error: UNC at LBA = 0xa41c1de8 = 2753306088
> Commands leading to the command that caused the error were:
> CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
> -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
> 25 00 00 00 a8 00 00 a4 1c 1d e8 e0 00 14:43:09.384 READ DMA EXT
> e5 00 00 00 00 00 00 00 00 00 00 00 00 14:43:09.383 CHECK POWER MODE
> 25 00 00 04 00 00 00 a4 1c 1e 90 e0 00 14:43:09.371 READ DMA EXT
> ef 00 10 00 02 00 00 00 00 00 00 a0 00 14:43:09.370 SET FEATURES [Enable SATA feature]
> 27 00 00 00 00 00 00 00 00 00 00 e0 00 14:43:09.370 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
> Error 4 [3] occurred at disk power-on lifetime: 14676 hours (611 days + 12 hours)
> When the command that caused the error occurred, the device was active or idle.
> After command completion occurred, registers were:
> ER -- ST COUNT LBA_48 LH LM LL DV DC
> -- -- -- == -- == == == -- -- -- -- --
> 40 -- 51 00 00 00 00 a4 1c 1d e8 00 00 Error: UNC at LBA = 0xa41c1de8 = 2753306088
> Commands leading to the command that caused the error were:
> CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
> -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
> 25 00 00 04 00 00 00 a4 1c 1a 90 e0 00 14:43:06.283 READ DMA EXT
> 25 00 00 04 00 00 00 a4 1c 16 90 e0 00 14:43:06.205 READ DMA EXT
> 25 00 00 04 00 00 00 a4 1c 12 90 e0 00 14:43:04.892 READ DMA EXT
> 25 00 00 04 00 00 00 a4 1c 0e 90 e0 00 14:43:04.855 READ DMA EXT
> 25 00 00 04 00 00 00 a4 1c 0a 90 e0 00 14:43:04.819 READ DMA EXT
> Error 3 [2] occurred at disk power-on lifetime: 14670 hours (611 days + 6 hours)
> When the command that caused the error occurred, the device was active or idle.
> After command completion occurred, registers were:
> ER -- ST COUNT LBA_48 LH LM LL DV DC
> -- -- -- == -- == == == -- -- -- -- --
> 40 -- 51 00 00 00 00 a4 1c 1d e8 00 00 Error: UNC at LBA = 0xa41c1de8 = 2753306088
> Commands leading to the command that caused the error were:
> CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
> -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
> 25 00 00 04 00 00 00 a4 1c 1a 00 e0 00 08:33:02.502 READ DMA EXT
> ef 00 10 00 02 00 00 00 00 00 00 a0 00 08:33:02.501 SET FEATURES [Enable SATA feature]
> 27 00 00 00 00 00 00 00 00 00 00 e0 00 08:33:02.501 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
> ec 00 00 00 00 00 00 00 00 00 00 a0 00 08:33:02.501 IDENTIFY DEVICE
> ef 00 03 00 42 00 00 00 00 00 00 a0 00 08:33:02.501 SET FEATURES [Set transfer mode]
> Error 2 [1] occurred at disk power-on lifetime: 14670 hours (611 days + 6 hours)
> When the command that caused the error occurred, the device was active or idle.
> After command completion occurred, registers were:
> ER -- ST COUNT LBA_48 LH LM LL DV DC
> -- -- -- == -- == == == -- -- -- -- --
> 40 -- 51 00 00 00 00 a4 1c 13 d0 00 00 Error: UNC at LBA = 0xa41c13d0 = 2753303504
> Commands leading to the command that caused the error were:
> CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
> -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
> 25 00 00 02 30 00 00 a4 1c 13 d0 e0 00 08:32:59.645 READ DMA EXT
> e5 00 00 00 00 00 00 00 00 00 00 00 00 08:32:59.643 CHECK POWER MODE
> 25 00 00 04 00 00 00 a4 1c 16 00 e0 00 08:32:59.581 READ DMA EXT
> ef 00 10 00 02 00 00 00 00 00 00 a0 00 08:32:59.580 SET FEATURES [Enable SATA feature]
> 27 00 00 00 00 00 00 00 00 00 00 e0 00 08:32:59.580 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged. [To run self-tests, use: smartctl -t]
> Selective Self-tests/Logging not supported
> SCT Data Table command not supported
> SCT Error Recovery Control command not supported
> Device Statistics (GP Log 0x04) not supported
> SATA Phy Event Counters (GP Log 0x11)
> ID Size Value Description
> 0x000a 2 6 Device-to-host register FISes sent due to a COMRESET
> 0x0001 2 0 Command failed due to ICRC error
> 0x0003 2 0 R_ERR response for device-to-host data FIS
> 0x0004 2 0 R_ERR response for host-to-device data FIS
> 0x0006 2 0 R_ERR response for device-to-host non-data FIS
> 0x0007 2 0 R_ERR response for host-to-device non-data FIS
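For anyone reading the error log above by hand: the "LBA = 0x..." figure smartctl prints is simply the six LBA register bytes (LBA_48, LH, LM, LL) read most significant first. Below is a minimal sketch of that decoding, assuming the 13-column register row layout shown in the log. The function name is my own and is not part of smartmontools.

```python
def lba_from_registers(row: str) -> int:
    """Decode the 48-bit LBA from one 'registers were:' row.

    Assumed column layout, taken from the log header:
    ER -- ST COUNT(2 bytes) LBA_48(3 bytes) LH LM LL DV DC
    """
    tokens = row.split()
    if len(tokens) != 13:
        raise ValueError("expected 13 columns, got %d" % len(tokens))
    # Columns 5..10 hold the LBA bytes, most significant first.
    return int("".join(tokens[5:11]), 16)


# The two distinct failing addresses that appear in the log above.
for row in (
    "40 -- 51 00 00 00 00 a4 1c 1d e8 00 00",
    "40 -- 51 00 00 00 00 a4 1c 13 d0 00 00",
):
    lba = lba_from_registers(row)
    print(hex(lba), lba)
```

All but the oldest of the logged errors decode to the same LBA, so the log looks like repeated reads of one unreadable sector rather than a spread of separate bad spots.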
> sudo smartctl -x /dev/sdd
> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-45-generic] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> === START OF INFORMATION SECTION ===
> Model Family: Hitachi Deskstar 7K3000
> Device Model: Hitachi HDS723020BLA642
> Serial Number: MN3220F32GX10E
> LU WWN Device Id: 5 000cca 369e2f56f
> Firmware Version: MN6OA5C0
> User Capacity: 2,000,398,934,016 bytes [2.00 TB]
> Sector Size: 512 bytes logical/physical
> Rotation Rate: 7200 rpm
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ATA8-ACS T13/1699-D revision 4
> SATA Version is: SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is: Tue Feb 10 16:45:04 2015 EST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is: Unavailable
> APM feature is: Disabled
> Rd look-ahead is: Enabled
> Write cache is: Enabled
> ATA Security is: Disabled, NOT FROZEN [SEC1]
> Wt Cache Reorder: Enabled
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> General SMART Values:
> Offline data collection status: (0x84) Offline data collection activity
> was suspended by an interrupting command from host.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (18096) seconds.
> Offline data collection
> capabilities: (0x5b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> No Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 1) minutes.
> Extended self-test routine
> recommended polling time: ( 302) minutes.
> SCT capabilities: (0x003d) SCT Status supported.
> SCT Error Recovery Control supported.
> SCT Feature Control supported.
> SCT Data Table supported.
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
>   2 Throughput_Performance  P-S---   136   136   054    -    82
>   3 Spin_Up_Time            POS---   152   152   024    -    434 (Average 320)
>   4 Start_Stop_Count        -O--C-   100   100   000    -    97
>   5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
>   7 Seek_Error_Rate         PO-R--   100   100   067    -    0
>   8 Seek_Time_Performance   P-S---   135   135   020    -    26
>   9 Power_On_Hours          -O--C-   097   097   000    -    27235
>  10 Spin_Retry_Count        PO--C-   100   100   060    -    0
>  12 Power_Cycle_Count       -O--CK   100   100   000    -    97
> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    755
> 193 Load_Cycle_Count        -O--C-   100   100   000    -    755
> 194 Temperature_Celsius     -O----   200   200   000    -    30 (Min/Max 19/45)
> 196 Reallocated_Event_Count -O--CK   100   100   000    -    0
> 197 Current_Pending_Sector  -O---K   100   100   000    -    0
> 198 Offline_Uncorrectable   ---R--   100   100   000    -    0
> 199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
> General Purpose Log Directory Version 1
> SMART Log Directory Version 1 [multi-sector log support]
> Address Access R/W Size Description
> 0x00 GPL,SL R/O 1 Log Directory
> 0x01 SL R/O 1 Summary SMART error log
> 0x03 GPL R/O 1 Ext. Comprehensive SMART error log
> 0x04 GPL R/O 7 Device Statistics log
> 0x06 SL R/O 1 SMART self-test log
> 0x07 GPL R/O 1 Extended self-test log
> 0x08 GPL R/O 1 Power Conditions log
> 0x09 SL R/W 1 Selective self-test log
> 0x10 GPL R/O 1 NCQ Command Error log
> 0x11 GPL R/O 1 SATA Phy Event Counters
> 0x20 GPL R/O 1 Streaming performance log [OBS-8]
> 0x21 GPL R/O 1 Write stream error log
> 0x22 GPL R/O 1 Read stream error log
> 0x80-0x9f GPL,SL R/W 16 Host vendor specific log
> 0xe0 GPL,SL R/W 1 SCT Command/Status
> 0xe1 GPL,SL R/W 1 SCT Data Transfer
> SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
> No Errors Logged
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged. [To run self-tests, use: smartctl -t]
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> SCT Status Version: 3
> SCT Version (vendor specific): 256 (0x0100)
> SCT Support Level: 1
> Device State: SMART Off-line Data Collection executing in background (4)
> Current Temperature: 30 Celsius
> Power Cycle Min/Max Temperature: 27/30 Celsius
> Lifetime Min/Max Temperature: 19/45 Celsius
> Under/Over Temperature Limit Count: 0/0
> SCT Temperature History Version: 2
> Temperature Sampling Period: 1 minute
> Temperature Logging Interval: 1 minute
> Min/Max recommended Temperature: 0/60 Celsius
> Min/Max Temperature Limit: -40/70 Celsius
> Temperature History Size (Index): 128 (52)
> Index Estimated Time Temperature Celsius
> 53 2015-02-10 14:38 37 ******************
> ... ..( 24 skipped). .. ******************
> 78 2015-02-10 15:03 37 ******************
> 79 2015-02-10 15:04 36 *****************
> 80 2015-02-10 15:05 36 *****************
> 81 2015-02-10 15:06 37 ******************
> ... ..( 5 skipped). .. ******************
> 87 2015-02-10 15:12 37 ******************
> 88 2015-02-10 15:13 36 *****************
> 89 2015-02-10 15:14 37 ******************
> ... ..( 5 skipped). .. ******************
> 95 2015-02-10 15:20 37 ******************
> 96 2015-02-10 15:21 36 *****************
> 97 2015-02-10 15:22 37 ******************
> 98 2015-02-10 15:23 37 ******************
> 99 2015-02-10 15:24 36 *****************
> 100 2015-02-10 15:25 37 ******************
> ... ..( 4 skipped). .. ******************
> 105 2015-02-10 15:30 37 ******************
> 106 2015-02-10 15:31 36 *****************
> 107 2015-02-10 15:32 36 *****************
> 108 2015-02-10 15:33 37 ******************
> ... ..( 6 skipped). .. ******************
> 115 2015-02-10 15:40 37 ******************
> 116 2015-02-10 15:41 36 *****************
> 117 2015-02-10 15:42 36 *****************
> 118 2015-02-10 15:43 36 *****************
> 119 2015-02-10 15:44 37 ******************
> ... ..( 2 skipped). .. ******************
> 122 2015-02-10 15:47 37 ******************
> 123 2015-02-10 15:48 36 *****************
> 124 2015-02-10 15:49 37 ******************
> 125 2015-02-10 15:50 37 ******************
> 126 2015-02-10 15:51 36 *****************
> 127 2015-02-10 15:52 36 *****************
> 0 2015-02-10 15:53 37 ******************
> 1 2015-02-10 15:54 36 *****************
> 2 2015-02-10 15:55 37 ******************
> 3 2015-02-10 15:56 36 *****************
> 4 2015-02-10 15:57 36 *****************
> 5 2015-02-10 15:58 37 ******************
> ... ..( 2 skipped). .. ******************
> 8 2015-02-10 16:01 37 ******************
> 9 2015-02-10 16:02 36 *****************
> 10 2015-02-10 16:03 37 ******************
> ... ..( 2 skipped). .. ******************
> 13 2015-02-10 16:06 37 ******************
> 14 2015-02-10 16:07 36 *****************
> 15 2015-02-10 16:08 37 ******************
> ... ..( 10 skipped). .. ******************
> 26 2015-02-10 16:19 37 ******************
> 27 2015-02-10 16:20 36 *****************
> ... ..( 5 skipped). .. *****************
> 33 2015-02-10 16:26 36 *****************
> 34 2015-02-10 16:27 37 ******************
> ... ..( 4 skipped). .. ******************
> 39 2015-02-10 16:32 37 ******************
> 40 2015-02-10 16:33 ? -
> 41 2015-02-10 16:34 27 ********
> 42 2015-02-10 16:35 28 *********
> 43 2015-02-10 16:36 28 *********
> 44 2015-02-10 16:37 28 *********
> 45 2015-02-10 16:38 29 **********
> ... ..( 2 skipped). .. **********
> 48 2015-02-10 16:41 29 **********
> 49 2015-02-10 16:42 30 ***********
> ... ..( 2 skipped). .. ***********
> 52 2015-02-10 16:45 30 ***********
> SCT Error Recovery Control:
> Read: Disabled
> Write: Disabled
> Device Statistics (GP Log 0x04)
> Page Offset Size Value Description
> 1 ===== = = == General Statistics (rev 1) ==
> 1 0x008 4 97 Lifetime Power-On Resets
> 1 0x010 4 27235 Power-on Hours
> 1 0x018 6 11734342067 Logical Sectors Written
> 1 0x020 6 27559380 Number of Write Commands
> 1 0x028 6 2738754035727 Logical Sectors Read
> 1 0x030 6 5733165681 Number of Read Commands
> 3 ===== = = == Rotating Media Statistics (rev 1) ==
> 3 0x008 4 27229 Spindle Motor Power-on Hours
> 3 0x010 4 27229 Head Flying Hours
> 3 0x018 4 755 Head Load Events
> 3 0x020 4 0 Number of Reallocated Logical Sectors
> 3 0x028 4 276 Read Recovery Attempts
> 3 0x030 4 7 Number of Mechanical Start Failures
> 4 ===== = = == General Errors Statistics (rev 1) ==
> 4 0x008 4 0 Number of Reported Uncorrectable Errors
> 4 0x010 4 2 Resets Between Cmd Acceptance and Completion
> 5 ===== = = == Temperature Statistics (rev 1) ==
> 5 0x008 1 30 Current Temperature
> 5 0x010 1 35~ Average Short Term Temperature
> 5 0x018 1 33~ Average Long Term Temperature
> 5 0x020 1 45 Highest Temperature
> 5 0x028 1 19 Lowest Temperature
> 5 0x030 1 42~ Highest Average Short Term Temperature
> 5 0x038 1 24~ Lowest Average Short Term Temperature
> 5 0x040 1 39~ Highest Average Long Term Temperature
> 5 0x048 1 25~ Lowest Average Long Term Temperature
> 5 0x050 4 0 Time in Over-Temperature
> 5 0x058 1 60 Specified Maximum Operating Temperature
> 5 0x060 4 0 Time in Under-Temperature
> 5 0x068 1 0 Specified Minimum Operating Temperature
> 6 ===== = = == Transport Statistics (rev 1) ==
> 6 0x008 4 1122 Number of Hardware Resets
> 6 0x010 4 1027 Number of ASR Events
> 6 0x018 4 0 Number of Interface CRC Errors
> |_ ~ normalized value
> SATA Phy Event Counters (GP Log 0x11)
> ID Size Value Description
> 0x0001 2 0 Command failed due to ICRC error
> 0x0002 2 0 R_ERR response for data FIS
> 0x0003 2 0 R_ERR response for device-to-host data FIS
> 0x0004 2 0 R_ERR response for host-to-device data FIS
> 0x0005 2 0 R_ERR response for non-data FIS
> 0x0006 2 0 R_ERR response for device-to-host non-data FIS
> 0x0007 2 0 R_ERR response for host-to-device non-data FIS
> 0x0009 2 6 Transition from drive PhyRdy to drive PhyNRdy
> 0x000a 2 5 Device-to-host register FISes sent due to a COMRESET
> 0x000b 2 0 CRC errors within host-to-device FIS
> 0x000d 2 0 Non-CRC errors within host-to-device FIS
> sudo smartctl -x /dev/sde
> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-45-generic] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> === START OF INFORMATION SECTION ===
> Model Family: Hitachi Deskstar 7K2000
> Device Model: Hitachi HDS722020ALA330
> Serial Number: JK1171YAGAD8LS
> LU WWN Device Id: 5 000cca 221c4b9cc
> Firmware Version: JKAOA20N
> User Capacity: 2,000,398,934,016 bytes [2.00 TB]
> Sector Size: 512 bytes logical/physical
> Rotation Rate: 7200 rpm
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ATA8-ACS T13/1699-D revision 4
> SATA Version is: SATA 2.6, 3.0 Gb/s
> Local Time is: Tue Feb 10 16:45:31 2015 EST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is: Disabled
> APM feature is: Disabled
> Rd look-ahead is: Enabled
> Write cache is: Enabled
> ATA Security is: Disabled, NOT FROZEN [SEC1]
> Wt Cache Reorder: Enabled
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> General SMART Values:
> Offline data collection status: (0x84) Offline data collection activity
> was suspended by an interrupting command from host.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (21007) seconds.
> Offline data collection
> capabilities: (0x5b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> No Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 1) minutes.
> Extended self-test routine
> recommended polling time: ( 350) minutes.
> SCT capabilities: (0x003d) SCT Status supported.
> SCT Error Recovery Control supported.
> SCT Feature Control supported.
> SCT Data Table supported.
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
>   2 Throughput_Performance  P-S---   134   134   054    -    98
>   3 Spin_Up_Time            POS---   137   137   024    -    619 (Average 439)
>   4 Start_Stop_Count        -O--C-   100   100   000    -    207
>   5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
>   7 Seek_Error_Rate         PO-R--   100   100   067    -    0
>   8 Seek_Time_Performance   P-S---   112   112   020    -    39
>   9 Power_On_Hours          -O--C-   094   094   000    -    44002
>  10 Spin_Retry_Count        PO--C-   100   100   060    -    0
>  12 Power_Cycle_Count       -O--CK   100   100   000    -    207
> 192 Power-Off_Retract_Count -O--CK   099   099   000    -    1267
> 193 Load_Cycle_Count        -O--C-   099   099   000    -    1267
> 194 Temperature_Celsius     -O----   181   181   000    -    33 (Min/Max 20/53)
> 196 Reallocated_Event_Count -O--CK   100   100   000    -    0
> 197 Current_Pending_Sector  -O---K   100   100   000    -    0
> 198 Offline_Uncorrectable   ---R--   100   100   000    -    0
> 199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    9
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
> General Purpose Log Directory Version 1
> SMART Log Directory Version 1 [multi-sector log support]
> Address Access R/W Size Description
> 0x00 GPL,SL R/O 1 Log Directory
> 0x01 SL R/O 1 Summary SMART error log
> 0x03 GPL R/O 1 Ext. Comprehensive SMART error log
> 0x04 GPL R/O 7 Device Statistics log
> 0x06 SL R/O 1 SMART self-test log
> 0x07 GPL R/O 1 Extended self-test log
> 0x09 SL R/W 1 Selective self-test log
> 0x10 GPL R/O 1 NCQ Command Error log
> 0x11 GPL R/O 1 SATA Phy Event Counters
> 0x20 GPL R/O 1 Streaming performance log [OBS-8]
> 0x21 GPL R/O 1 Write stream error log
> 0x22 GPL R/O 1 Read stream error log
> 0x80-0x9f GPL,SL R/W 16 Host vendor specific log
> 0xe0 GPL,SL R/W 1 SCT Command/Status
> 0xe1 GPL,SL R/W 1 SCT Data Transfer
> SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
> Device Error Count: 10 (device log contains only the most recent 4 errors)
> CR = Command Register
> FEATR = Features Register
> COUNT = Count (was: Sector Count) Register
> LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
> LH = LBA High (was: Cylinder High) Register ] LBA
> LM = LBA Mid (was: Cylinder Low) Register ] Register
> LL = LBA Low (was: Sector Number) Register ]
> DV = Device (was: Device/Head) Register
> DC = Device Control Register
> ER = Error register
> ST = Status register
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
> Error 10 [1] occurred at disk power-on lifetime: 1655 hours (68 days + 23 hours)
> When the command that caused the error occurred, the device was active or idle.
> After command completion occurred, registers were:
> ER -- ST COUNT LBA_48 LH LM LL DV DC
> -- -- -- == -- == == == -- -- -- -- --
> 84 -- 51 01 28 00 00 50 83 5d e8 00 00 Error: ICRC, ABRT 296 sectors at LBA = 0x50835de8 = 1350786536
> Commands leading to the command that caused the error were:
> CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
> -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
> 25 00 00 02 a8 00 00 50 83 5c 68 e0 08 23d+05:05:37.425 READ DMA EXT
> 25 00 00 03 68 00 00 50 83 59 00 e0 08 23d+05:05:37.413 READ DMA EXT
> 25 00 00 01 00 00 00 50 83 58 00 e0 08 23d+05:05:37.409 READ DMA EXT
> 25 00 00 00 f0 00 00 50 83 57 10 e0 08 23d+05:05:37.405 READ DMA EXT
> 25 00 00 02 a0 00 00 50 83 54 70 e0 08 23d+05:05:37.352 READ DMA EXT
> Error 9 [0] occurred at disk power-on lifetime: 1654 hours (68 days + 22 hours)
> When the command that caused the error occurred, the device was active or idle.
> After command completion occurred, registers were:
> ER -- ST COUNT LBA_48 LH LM LL DV DC
> -- -- -- == -- == == == -- -- -- -- --
> 84 -- 51 00 90 00 00 4e eb 15 70 00 00 Error: ICRC, ABRT 144 sectors at LBA = 0x4eeb1570 = 1324029296
> Commands leading to the command that caused the error were:
> CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
> -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
> 25 00 00 01 00 00 00 4e eb 15 00 ee 08 23d+04:47:42.788 READ DMA EXT
> 25 00 00 02 28 00 00 4e eb 12 d8 ee 08 23d+04:47:42.713 READ DMA EXT
> 25 00 00 03 d8 00 00 4e eb 0f 00 ee 08 23d+04:47:42.698 READ DMA EXT
> 25 00 00 01 00 00 00 4e eb 0e 00 ee 08 23d+04:47:42.694 READ DMA EXT
> 25 00 00 01 00 00 00 4e eb 0d 00 ee 08 23d+04:47:42.691 READ DMA EXT
> Error 8 [3] occurred at disk power-on lifetime: 1654 hours (68 days + 22 hours)
> When the command that caused the error occurred, the device was active or idle.
> After command completion occurred, registers were:
> ER -- ST COUNT LBA_48 LH LM LL DV DC
> -- -- -- == -- == == == -- -- -- -- --
> 84 -- 51 00 28 00 00 36 08 f1 d8 00 00 Error: ICRC, ABRT 40 sectors at LBA = 0x3608f1d8 = 906555864
> Commands leading to the command that caused the error were:
> CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
> -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
> 25 00 00 00 f8 00 00 36 08 f1 08 e6 08 23d+00:06:40.966 READ DMA EXT
> 25 00 00 02 78 00 00 36 08 ee 90 e6 08 23d+00:06:40.914 READ DMA EXT
> 25 00 00 03 90 00 00 36 08 eb 00 e6 08 23d+00:06:40.900 READ DMA EXT
> 25 00 00 01 00 00 00 36 08 ea 00 e6 08 23d+00:06:40.896 READ DMA EXT
> 25 00 00 00 f8 00 00 36 08 e9 08 e6 08 23d+00:06:40.893 READ DMA EXT
> Error 7 [2] occurred at disk power-on lifetime: 1654 hours (68 days + 22 hours)
> When the command that caused the error occurred, the device was active or idle.
> After command completion occurred, registers were:
> ER -- ST COUNT LBA_48 LH LM LL DV DC
> -- -- -- == -- == == == -- -- -- -- --
> 84 -- 51 01 28 00 00 33 d1 bb 40 00 00 Error: ICRC, ABRT 296 sectors at LBA = 0x33d1bb40 = 869382976
> Commands leading to the command that caused the error were:
> CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
> -- == -- == -- == == == -- -- -- -- -- --------------- --------------------
> 25 00 00 03 68 00 00 33 d1 b9 00 e3 08 22d+23:42:04.107 READ DMA EXT
> 25 00 00 01 00 00 00 33 d1 b8 00 e3 08 22d+23:42:04.103 READ DMA EXT
> 25 00 00 00 f0 00 00 33 d1 b7 10 e3 08 22d+23:42:04.099 READ DMA EXT
> 25 00 00 02 b0 00 00 33 d1 b4 60 e3 08 22d+23:42:04.022 READ DMA EXT
> 25 00 00 03 60 00 00 33 d1 b1 00 e3 08 22d+23:42:04.009 READ DMA EXT
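The "Error: ICRC, ABRT 296 sectors" summaries in the log above can be reproduced from the raw register row. Here is a minimal sketch, assuming the same 13-column layout as the log header. The bit table lists only the ATA error-register bits I am sure of (ICRC 0x80, UNC 0x40, IDNF 0x10, ABRT 0x04), and the function name is hypothetical.

```python
# Error register bits (subset): interface CRC error, uncorrectable data,
# ID not found, command aborted.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_error_row(row: str):
    """Return (flag names, sector count, LBA) for one register row.

    Assumed layout: ER -- ST COUNT(2) LBA_48(3) LH LM LL DV DC
    """
    tokens = row.split()
    if len(tokens) != 13:
        raise ValueError("expected 13 columns, got %d" % len(tokens))
    er = int(tokens[0], 16)
    flags = [name for bit, name in ERROR_BITS.items() if er & bit]
    count = int(tokens[3] + tokens[4], 16)   # two-byte COUNT register
    lba = int("".join(tokens[5:11]), 16)     # six LBA bytes, MSB first
    return flags, count, lba


# Error 10 from the log above.
print(decode_error_row("84 -- 51 01 28 00 00 50 83 5d e8 00 00"))
```

ICRC is the interface CRC bit, so these four entries point at the link (cabling or controller) rather than the platters, which is consistent with the UDMA_CRC_Error_Count of 9 in the attribute table. They were also logged at 1654 to 1655 power-on hours, long before the current 44002.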
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged. [To run self-tests, use: smartctl -t]
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> SCT Status Version: 3
> SCT Version (vendor specific): 256 (0x0100)
> SCT Support Level: 1
> Device State: SMART Off-line Data Collection executing in background (4)
> Current Temperature: 33 Celsius
> Power Cycle Min/Max Temperature: 27/33 Celsius
> Lifetime Min/Max Temperature: 20/53 Celsius
> Under/Over Temperature Limit Count: 0/0
> SCT Temperature History Version: 2
> Temperature Sampling Period: 1 minute
> Temperature Logging Interval: 1 minute
> Min/Max recommended Temperature: 0/60 Celsius
> Min/Max Temperature Limit: -40/70 Celsius
> Temperature History Size (Index): 128 (81)
> Index Estimated Time Temperature Celsius
> 82 2015-02-10 14:38 41 **********************
> ... ..(113 skipped). .. **********************
> 68 2015-02-10 16:32 41 **********************
> 69 2015-02-10 16:33 ? -
> 70 2015-02-10 16:34 28 *********
> 71 2015-02-10 16:35 28 *********
> 72 2015-02-10 16:36 29 **********
> 73 2015-02-10 16:37 29 **********
> 74 2015-02-10 16:38 30 ***********
> 75 2015-02-10 16:39 30 ***********
> 76 2015-02-10 16:40 31 ************
> 77 2015-02-10 16:41 31 ************
> 78 2015-02-10 16:42 32 *************
> 79 2015-02-10 16:43 32 *************
> 80 2015-02-10 16:44 33 **************
> 81 2015-02-10 16:45 33 **************
> SCT Error Recovery Control:
> Read: Disabled
> Write: Disabled
> Device Statistics (GP Log 0x04)
> Page Offset Size Value Description
> 1 ===== = = == General Statistics (rev 1) ==
> 1 0x008 4 207 Lifetime Power-On Resets
> 1 0x010 4 44002 Power-on Hours
> 1 0x018 6 19676641503 Logical Sectors Written
> 1 0x020 6 47285021 Number of Write Commands
> 1 0x028 6 4518358603939 Logical Sectors Read
> 1 0x030 6 5982270826 Number of Read Commands
> 3 ===== = = == Rotating Media Statistics (rev 1) ==
> 3 0x008 4 43993 Spindle Motor Power-on Hours
> 3 0x010 4 43993 Head Flying Hours
> 3 0x018 4 1267 Head Load Events
> 3 0x020 4 0 Number of Reallocated Logical Sectors
> 3 0x028 4 14 Read Recovery Attempts
> 3 0x030 4 1 Number of Mechanical Start Failures
> 4 ===== = = == General Errors Statistics (rev 1) ==
> 4 0x008 4 0 Number of Reported Uncorrectable Errors
> 4 0x010 4 180 Resets Between Cmd Acceptance and Completion
> 5 ===== = = == Temperature Statistics (rev 1) ==
> 5 0x008 1 33 Current Temperature
> 5 0x010 1 41~ Average Short Term Temperature
> 5 0x018 1 41~ Average Long Term Temperature
> 5 0x020 1 53 Highest Temperature
> 5 0x028 1 20 Lowest Temperature
> 5 0x030 1 49~ Highest Average Short Term Temperature
> 5 0x038 1 0~ Lowest Average Short Term Temperature
> 5 0x040 1 47~ Highest Average Long Term Temperature
> 5 0x048 1 0~ Lowest Average Long Term Temperature
> 5 0x050 4 0 Time in Over-Temperature
> 5 0x058 1 60 Specified Maximum Operating Temperature
> 5 0x060 4 0 Time in Under-Temperature
> 5 0x068 1 0 Specified Minimum Operating Temperature
> 6 ===== = = == Transport Statistics (rev 1) ==
> 6 0x008 4 1957 Number of Hardware Resets
> 6 0x010 4 1773 Number of ASR Events
> 6 0x018 4 9 Number of Interface CRC Errors
> |_ ~ normalized value
> SATA Phy Event Counters (GP Log 0x11)
> ID Size Value Description
> 0x0001 2 0 Command failed due to ICRC error
> 0x0002 2 0 R_ERR response for data FIS
> 0x0005 2 0 R_ERR response for non-data FIS
> 0x0009 2 6 Transition from drive PhyRdy to drive PhyNRdy
> 0x000a 2 4 Device-to-host register FISes sent due to a COMRESET
> 0x000b 2 0 CRC errors within host-to-device FIS
> 0x000d 2 0 Non-CRC errors within host-to-device FIS
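A note on the Powered_Up_Time column in the error logs above: the legend gives the format as DDd+hh:mm:SS.sss and says it wraps after 49.710 days, which is what a 32-bit millisecond counter would do (that reading is my assumption, not something the log states). Below is a small sketch for turning those stamps into durations so that entries can be compared. The helper name is hypothetical.

```python
import re
from datetime import timedelta

# Assumption: the counter is 32-bit milliseconds, hence the wrap point.
WRAP = timedelta(milliseconds=2**32)

_STAMP = re.compile(r"(?:(\d+)d\+)?(\d+):(\d+):(\d+)\.(\d+)")


def parse_powered_up_time(stamp: str) -> timedelta:
    """Parse 'DDd+hh:mm:SS.sss' (the 'DDd+' part is optional)."""
    m = _STAMP.fullmatch(stamp)
    if m is None:
        raise ValueError("unrecognised Powered_Up_Time: %r" % stamp)
    days, hh, mm, ss, ms = (int(g) if g else 0 for g in m.groups())
    return timedelta(days=days, hours=hh, minutes=mm, seconds=ss,
                     milliseconds=ms)


# Wrap point expressed in days, for comparison with the legend.
print(WRAP / timedelta(days=1))

# Gap between the commands leading to Error 9 and Error 10 on this drive.
gap = (parse_powered_up_time("23d+05:05:37.425")
       - parse_powered_up_time("23d+04:47:42.788"))
print(gap)
```

Stamps are only comparable within one power cycle, and only as long as the counter has not wrapped in between.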
> sudo smartctl -x /dev/sdf
> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-45-generic] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> === START OF INFORMATION SECTION ===
> Model Family: Hitachi Deskstar 7K2000
> Device Model: Hitachi HDS722020ALA330
> Serial Number: JK1171YAGDAD5S
> LU WWN Device Id: 5 000cca 221c59b77
> Firmware Version: JKAOA20N
> User Capacity: 2,000,397,852,160 bytes [2.00 TB]
> Sector Size: 512 bytes logical/physical
> Rotation Rate: 7200 rpm
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: ATA8-ACS T13/1699-D revision 4
> SATA Version is: SATA 2.6, 3.0 Gb/s
> Local Time is: Tue Feb 10 16:46:04 2015 EST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is: Disabled
> APM feature is: Disabled
> Rd look-ahead is: Enabled
> Write cache is: Enabled
> ATA Security is: Disabled, NOT FROZEN [SEC1]
> Wt Cache Reorder: Enabled
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> General SMART Values:
> Offline data collection status: (0x84) Offline data collection activity
> was suspended by an interrupting command from host.
> Auto Offline Data Collection: Enabled.
> Self-test execution status: ( 0) The previous self-test routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (22917) seconds.
> Offline data collection
> capabilities: (0x5b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> No Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 1) minutes.
> Extended self-test routine
> recommended polling time: ( 382) minutes.
> SCT capabilities: (0x003d) SCT Status supported.
> SCT Error Recovery Control supported.
> SCT Feature Control supported.
> SCT Data Table supported.
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
> 1 Raw_Read_Error_Rate PO-R-- 100 100 016 - 0
> 2 Throughput_Performance P-S--- 133 133 054 - 101
> 3 Spin_Up_Time POS--- 134 134 024 - 627 (Average 452)
> 4 Start_Stop_Count -O--C- 100 100 000 - 203
> 5 Reallocated_Sector_Ct PO--CK 100 100 005 - 0
> 7 Seek_Error_Rate PO-R-- 100 100 067 - 0
> 8 Seek_Time_Performance P-S--- 112 112 020 - 39
> 9 Power_On_Hours -O--C- 094 094 000 - 44006
> 10 Spin_Retry_Count PO--C- 100 100 060 - 0
> 12 Power_Cycle_Count -O--CK 100 100 000 - 203
> 192 Power-Off_Retract_Count -O--CK 099 099 000 - 1248
> 193 Load_Cycle_Count -O--C- 099 099 000 - 1248
> 194 Temperature_Celsius -O---- 193 193 000 - 31 (Min/Max 20/50)
> 196 Reallocated_Event_Count -O--CK 100 100 000 - 0
> 197 Current_Pending_Sector -O---K 100 100 000 - 0
> 198 Offline_Uncorrectable ---R-- 100 100 000 - 0
> 199 UDMA_CRC_Error_Count -O-R-- 200 200 000 - 0
> ||||||_ K auto-keep
> |||||__ C event count
> ||||___ R error rate
> |||____ S speed/performance
> ||_____ O updated online
> |______ P prefailure warning
> General Purpose Log Directory Version 1
> SMART Log Directory Version 1 [multi-sector log support]
> Address Access R/W Size Description
> 0x00 GPL,SL R/O 1 Log Directory
> 0x01 SL R/O 1 Summary SMART error log
> 0x03 GPL R/O 1 Ext. Comprehensive SMART error log
> 0x04 GPL R/O 7 Device Statistics log
> 0x06 SL R/O 1 SMART self-test log
> 0x07 GPL R/O 1 Extended self-test log
> 0x09 SL R/W 1 Selective self-test log
> 0x10 GPL R/O 1 NCQ Command Error log
> 0x11 GPL R/O 1 SATA Phy Event Counters
> 0x20 GPL R/O 1 Streaming performance log [OBS-8]
> 0x21 GPL R/O 1 Write stream error log
> 0x22 GPL R/O 1 Read stream error log
> 0x80-0x9f GPL,SL R/W 16 Host vendor specific log
> 0xe0 GPL,SL R/W 1 SCT Command/Status
> 0xe1 GPL,SL R/W 1 SCT Data Transfer
> SMART Extended Comprehensive Error Log Version: 0 (1 sectors)
> No Errors Logged
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged. [To run self-tests, use: smartctl -t]
> SMART Selective self-test log data structure revision number 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> SCT Status Version: 3
> SCT Version (vendor specific): 256 (0x0100)
> SCT Support Level: 1
> Device State: SMART Off-line Data Collection executing in background (4)
> Current Temperature: 31 Celsius
> Power Cycle Min/Max Temperature: 27/31 Celsius
> Lifetime Min/Max Temperature: 20/50 Celsius
> Under/Over Temperature Limit Count: 0/0
> SCT Temperature History Version: 2
> Temperature Sampling Period: 1 minute
> Temperature Logging Interval: 1 minute
> Min/Max recommended Temperature: 0/60 Celsius
> Min/Max Temperature Limit: -40/70 Celsius
> Temperature History Size (Index): 128 (47)
> Index Estimated Time Temperature Celsius
> 48 2015-02-10 14:39 39 ********************
> ... ..( 98 skipped). .. ********************
> 19 2015-02-10 16:18 39 ********************
> 20 2015-02-10 16:19 40 *********************
> 21 2015-02-10 16:20 39 ********************
> ... ..( 3 skipped). .. ********************
> 25 2015-02-10 16:24 39 ********************
> 26 2015-02-10 16:25 38 *******************
> ... ..( 6 skipped). .. *******************
> 33 2015-02-10 16:32 38 *******************
> 34 2015-02-10 16:33 ? -
> 35 2015-02-10 16:34 27 ********
> 36 2015-02-10 16:35 28 *********
> 37 2015-02-10 16:36 28 *********
> 38 2015-02-10 16:37 29 **********
> 39 2015-02-10 16:38 29 **********
> 40 2015-02-10 16:39 30 ***********
> ... ..( 2 skipped). .. ***********
> 43 2015-02-10 16:42 30 ***********
> 44 2015-02-10 16:43 31 ************
> ... ..( 2 skipped). .. ************
> 47 2015-02-10 16:46 31 ************
> SCT Error Recovery Control:
> Read: Disabled
> Write: Disabled
> Device Statistics (GP Log 0x04)
> Page Offset Size Value Description
> 1 ===== = = == General Statistics (rev 1) ==
> 1 0x008 4 203 Lifetime Power-On Resets
> 1 0x010 4 44006 Power-on Hours
> 1 0x018 6 15872353160 Logical Sectors Written
> 1 0x020 6 39140100 Number of Write Commands
> 1 0x028 6 4462388816379 Logical Sectors Read
> 1 0x030 6 5927428317 Number of Read Commands
> 3 ===== = = == Rotating Media Statistics (rev 1) ==
> 3 0x008 4 43997 Spindle Motor Power-on Hours
> 3 0x010 4 43997 Head Flying Hours
> 3 0x018 4 1248 Head Load Events
> 3 0x020 4 0 Number of Reallocated Logical Sectors
> 3 0x028 4 32 Read Recovery Attempts
> 3 0x030 4 0 Number of Mechanical Start Failures
> 4 ===== = = == General Errors Statistics (rev 1) ==
> 4 0x008 4 0 Number of Reported Uncorrectable Errors
> 4 0x010 4 192 Resets Between Cmd Acceptance and Completion
> 5 ===== = = == Temperature Statistics (rev 1) ==
> 5 0x008 1 31 Current Temperature
> 5 0x010 1 37~ Average Short Term Temperature
> 5 0x018 1 35~ Average Long Term Temperature
> 5 0x020 1 50 Highest Temperature
> 5 0x028 1 20 Lowest Temperature
> 5 0x030 1 44~ Highest Average Short Term Temperature
> 5 0x038 1 0~ Lowest Average Short Term Temperature
> 5 0x040 1 42~ Highest Average Long Term Temperature
> 5 0x048 1 0~ Lowest Average Long Term Temperature
> 5 0x050 4 0 Time in Over-Temperature
> 5 0x058 1 60 Specified Maximum Operating Temperature
> 5 0x060 4 0 Time in Under-Temperature
> 5 0x068 1 0 Specified Minimum Operating Temperature
> 6 ===== = = == Transport Statistics (rev 1) ==
> 6 0x008 4 1947 Number of Hardware Resets
> 6 0x010 4 1765 Number of ASR Events
> 6 0x018 4 0 Number of Interface CRC Errors
> |_ ~ normalized value
> SATA Phy Event Counters (GP Log 0x11)
> ID Size Value Description
> 0x0001 2 0 Command failed due to ICRC error
> 0x0002 2 0 R_ERR response for data FIS
> 0x0005 2 0 R_ERR response for non-data FIS
> 0x0009 2 6 Transition from drive PhyRdy to drive PhyNRdy
> 0x000a 2 4 Device-to-host register FISes sent due to a COMRESET
> 0x000b 2 0 CRC errors within host-to-device FIS
> 0x000d 2 0 Non-CRC errors within host-to-device FIS
Adam:
I actually read that exact stackexchange article about using the
--replace command, but I had neither kernel 3.2+ nor mdadm 3.3+, which
seemed to be required. I suppose I could have booted to
a more recent kernel livecd, but sadly I did not.
Thank you both for your help,
Kyle L
On Tue, Feb 10, 2015 at 8:51 AM, Phil Turmel <philip@turmel.org> wrote:
> Hi Kyle,
>
> Your symptoms look like classic timeout mismatch. Details interleaved.
>
> On 02/10/2015 02:35 AM, Adam Goryachev wrote:
>
>> There are other people who will jump in and help you with your problem,
>> but I'll add a couple of pointers while you are waiting. See below.
>
>> On 10/02/15 15:20, Kyle Logue wrote:
>>> Hey all:
>>>
>>> I have a 5 disk software raid5 that was working fine until I decided
>>> to swap out an old disk with a new one.
>>>
>>> mdadm /dev/md0 --add /dev/sda1
>>> mdadm /dev/md0 --fail /dev/sde1
>
> As Adam pointed out, you should have used --replace, but you probably
> wouldn't have made it through the replace function anyways.
>
>>> At this point it started automatically rebuilding the array.
>>> About 60%? of the way in it stops and I see a lot of this repeated in
>>> my dmesg:
>>>
>>> [Mon Feb 9 18:06:48 2015] ata5.00: exception Emask 0x0 SAct 0x0 SErr
>>> 0x0 action 0x6 frozen
>>> [Mon Feb 9 18:06:48 2015] ata5.00: failed command: SMART
>>> [Mon Feb 9 18:06:48 2015] ata5.00: cmd
>>> b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 7
>>> [Mon Feb 9 18:06:48 2015] res
>>> 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
> ^^^^^^^^^
> Smoking gun.
>
>>> [Mon Feb 9 18:06:48 2015] ata5.00: status: { DRDY }
>>> [Mon Feb 9 18:06:48 2015] ata5: hard resetting link
>>> [Mon Feb 9 18:06:58 2015] ata5: softreset failed (1st FIS failed)
>>> [Mon Feb 9 18:06:58 2015] ata5: hard resetting link
>>> [Mon Feb 9 18:07:08 2015] ata5: softreset failed (1st FIS failed)
>>> [Mon Feb 9 18:07:08 2015] ata5: hard resetting link
>>> [Mon Feb 9 18:07:12 2015] ata5: SATA link up 1.5 Gbps (SStatus 113
>>> SControl 310)
>>> [Mon Feb 9 18:07:12 2015] ata5.00: configured for UDMA/33
>>> [Mon Feb 9 18:07:12 2015] ata5: EH complete
>
> Notice that after a timeout error, the drive is unresponsive for several
> more seconds -- about 24 in your case.
>
>> .... read about timing mismatches
>> between the kernel and the hard drive, and how to solve that. There was
>> another post earlier today with some links to specific posts that will
>> be helpful (check the online archive).
>
> That would have been me. Start with this link for a description of what
> you are experiencing:
>
> http://marc.info/?l=linux-raid&m=135811522817345&w=1
>
> First, you need to protect yourself from timeout mismatch due to the use
> of desktop-grade drives. (Enterprise and raid-rated drives don't have
> this problem.)
>
> { If you were stuck in the middle of a replace and you had just
> worked-around your timeout problem, it would likely continue and
> complete. You've lost that opportunity. }
>
> Show us the output of "smartctl -x" for all of your drives if you'd like
> advice on your particular drives. (Pasted inline is preferred.)
>
> Second, you need to find and overwrite (with zeros) the bad sectors on
> your drives. Or ddrescue to a complete set of replacement drives and
> assemble those.
>
> Third, you need to set up a cron job to scrub your array regularly to
> clean out UREs before they accumulate beyond MD's ability to handle it
> (20 read errors in an hour, 10 per hour sustained).
>
> Phil
* Re: Wierd: Degrading while recovering raid5
2015-02-10 21:50 ` Kyle Logue
@ 2015-02-11 2:14 ` Phil Turmel
0 siblings, 0 replies; 9+ messages in thread
From: Phil Turmel @ 2015-02-11 2:14 UTC (permalink / raw)
To: Kyle Logue, linux-raid
Hi Kyle,
{ Convention on kernel.org lists is reply-to-all, trim replies, and
either bottom post or interleave }
On 02/10/2015 04:50 PM, Kyle Logue wrote:
> Phil:
>
> Thanks for your detailed response. That link does seem to describe my
> problem and I do understand that desktop grade drives are sub-optimal.
> It was many years ago when I first set up this array on my home
> theater pc. Until now I had no idea about the cron job - I'll make
> sure to implement that. I am preparing to move to 6 TB disks sometime
> soon and I'll definitely go enterprise this time.
>
> Regarding the drive timeout: I understand that I need to increase it
> from 30 seconds to something larger (2+ min) but am unaware how to do
> this. Is it a kernel variable? I'll keep googling but this seems like
> it's what's going to save me.
>
> tl;dr: How do I change the drive timeout?
Put something like this in /etc/rc.local or wherever your distro suggests:
for x in /sys/block/sd[a-f]/device/timeout ; do
    echo 180 > "$x"
done
Where the [a-f] is adjusted to suit your needs, and only for drives
that are neither raid-rated nor scterc-capable.
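
A minimal sketch of the same idea that leaves drives with active error
recovery control alone. It assumes smartmontools is installed and that
"smartctl -l scterc" prints a "Read: ... seconds" line when ERC is
enabled; pick_timeout and apply_timeouts are hypothetical helper names,
and 30 is the default timeout mentioned above.

```shell
#!/bin/sh
# pick_timeout: take the text of "smartctl -l scterc <dev>" and print the
# kernel command timeout (in seconds) to use for that drive.
pick_timeout() {
    case "$1" in
        *Read:*seconds*) echo 30 ;;    # ERC active: drive gives up quickly
        *) echo 180 ;;                 # ERC disabled or not supported
    esac
}

# apply_timeouts: write the chosen timeout for each named disk (needs root).
apply_timeouts() {
    for name in "$@"; do
        report=$(smartctl -l scterc "/dev/$name" 2>/dev/null)
        pick_timeout "$report" > "/sys/block/$name/device/timeout"
    done
}

# Example (as root): apply_timeouts sda sdb sdc sdd sde sdf
```

A drive whose report says "Read: Disabled", like the sdf output posted
earlier, would get the 180-second timeout.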
Phil
* Re: Wierd: Degrading while recovering raid5
2015-02-11 22:12 ` Kyle Logue
@ 2015-02-12 0:15 ` Phil Turmel
0 siblings, 0 replies; 9+ messages in thread
From: Phil Turmel @ 2015-02-12 0:15 UTC (permalink / raw)
To: Kyle Logue; +Cc: linux-raid
On 02/11/2015 05:12 PM, Kyle Logue wrote:
> Good news, Phil. Under the hypothesis that the new disk that I added
> didn't fully replace my sde, I omitted it from my assemble. The array
> went full UUUUU, then I echo'd check > /sys/block/md0/md/sync_action
>
> Much later it kicked out the faulty disk (previously sdc) and now I
> have a _UUUU.
>
> So hopefully this is the final question, but should I just evacuate as
> much data as possible immediately? Or try to add another spare and
> rebuild?
So long as you haven't mounted it yet, I suggest you do another forced
assembly to get back to UUUUU, then kick off another check. When many
UREs are allowed to accumulate, MD can hit its read error rate limit
and kick the drive. If it hasn't been mounted, you can keep doing it
until you get through the entire check.
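
A rough sketch of that assemble-and-check cycle, assuming the array is
md0 and that the caller supplies the member list. The helper names
is_degraded and reassemble_and_check are hypothetical, and the member
list in the example is only an illustration.

```shell
#!/bin/sh
# is_degraded: succeed if the mdstat text passed in shows a missing member,
# i.e. an underscore inside the [UUUUU] status brackets.
is_degraded() {
    printf '%s\n' "$1" | grep -q '\[[U_]*_[U_]*]'
}

# reassemble_and_check: forced assembly followed by a check scrub (needs root).
# The member devices are passed in by the caller; nothing here guesses them.
reassemble_and_check() {
    md=$1; shift
    mdadm --stop "/dev/$md"
    mdadm --assemble --force "/dev/$md" "$@" || return 1
    echo check > "/sys/block/$md/md/sync_action"
}

# Example (as root), repeated until a check finishes with every member present:
#   reassemble_and_check md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
#   is_degraded "$(cat /proc/mdstat)" && echo "member kicked, assemble again"
```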
But, you also had misaligned partitions. If sdcN is one of them, the
above won't work, and you should get your backups ASAP. And then make a
new array from scratch.
If you do succeed in completing a check scrub, you can use --replace to
put the array on properly aligned partitions.
Phil
* Re: Wierd: Degrading while recovering raid5
2015-02-11 14:28 ` Phil Turmel
@ 2015-02-11 22:12 ` Kyle Logue
2015-02-12 0:15 ` Phil Turmel
0 siblings, 1 reply; 9+ messages in thread
From: Kyle Logue @ 2015-02-11 22:12 UTC (permalink / raw)
To: Phil Turmel; +Cc: linux-raid
Good news, Phil. Under the hypothesis that the new disk that I added
didn't fully replace my sde, I omitted it from my assemble. The array
went full UUUUU, then I echo'd check > /sys/block/md0/md/sync_action
Much later it kicked out the faulty disk (previously sdc) and now I
have a _UUUU.
So hopefully this is the final question, but should I just evacuate as
much data as possible immediately? Or try to add another spare and
rebuild?
Thanks for the help,
Kyle L
* Re: Wierd: Degrading while recovering raid5
2015-02-11 6:23 Kyle Logue
@ 2015-02-11 14:28 ` Phil Turmel
2015-02-11 22:12 ` Kyle Logue
0 siblings, 1 reply; 9+ messages in thread
From: Phil Turmel @ 2015-02-11 14:28 UTC (permalink / raw)
To: Kyle Logue, linux-raid
On 02/11/2015 01:23 AM, Kyle Logue wrote:
> Phil:
>
> For a while I really thought that was going to work. I swapped out the
> sata cable and set the timeout to 10 minutes. At about 70% rebuilt I
> got the following dmesg which seems to indicate the death of my sdc
> drive.
Ten minutes is way overkill. The three minutes I suggested is already
extreme, and most drives will only need two minutes.
> Here is my question: I still have this sde that I manually failed and
> that hasn't been touched. Can I force re-add it to the array and just
> take the data corruption hit?
No, sde is being replaced by sda, so it's no help for sdc. If you put
it back into service, it would have to take the role of sda. (Forced
assembly, though, not a re-add.) If the array was in use during your
first replacement attempt, the differences could be substantial.
I'm not sure how MD will handle the rebuild status in this case.
Hopefully, it will take you back to a working, non-rebuilding array. If
you try this, you should test with a set of overlay devices as described
on the wiki.
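
For reference, a minimal sketch of one overlay in the style of the wiki
recipe, assuming dmsetup, losetup and blockdev are available. The helper
names, the 4G copy-on-write file size and the example paths are
illustrative assumptions, not values from this thread.

```shell
#!/bin/sh
# snapshot_table: print the device-mapper table for a copy-on-write overlay.
#   $1 origin size in 512-byte sectors, $2 origin device, $3 COW device
# "P 8" asks for a persistent snapshot with 8-sector chunks.
snapshot_table() {
    echo "0 $1 snapshot $2 $3 P 8"
}

# make_overlay: create /dev/mapper/<name> on top of one member (needs root).
# Writes land in a sparse file, so the real partition is never modified.
make_overlay() {
    dev=$1; name=$2; cow_file=$3
    truncate -s 4G "$cow_file"
    loop=$(losetup -f --show "$cow_file") || return 1
    size=$(blockdev --getsz "$dev") || return 1
    snapshot_table "$size" "$dev" "$loop" | dmsetup create "$name"
}

# Example (as root): make_overlay /dev/sde1 sde1-ovl /tmp/sde1.cow
# then trial-assemble with /dev/mapper/sde1-ovl in place of /dev/sde1.
```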
> I'd rather have to revert part of my data than all of it. The drive
> counts are significantly different now, but I haven't mounted the
> drives since the beginning. I haven't tried it but I saw someone else
> online get a message like 'raid has failed so using --add cannot work
> and might destroy data'. Is there a force add? What are my chances?
The right answer here depends on whether the array was in use. If it
wasn't, I'd try to use sde in place of sda to get back to a
non-rebuilding array. If the test run succeeds, undo the overlays and
do it for real. Then zero the superblock on sda, add it back as a
spare, then --replace sdc.
If the trial doesn't work (or the changes to sda are too great), the
alternative is to ddrescue sdc onto a spare disk (sde would be available
at that point, if it's useless for assembly). Then manually reassemble
and let the rebuild finish. If you run into more errors on the other
members, you may have to repeat the ddrescue process for each.
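
A sketch of what the ddrescue step might look like with GNU ddrescue.
The two-pass approach, the retry count and the map file name are
assumptions for illustration; ddrescue_plan is a hypothetical helper
that only prints the commands so they can be reviewed before running.

```shell
#!/bin/sh
# ddrescue_plan: print a two-pass GNU ddrescue plan for copying a failing
# disk onto a replacement. Nothing is executed here.
#   $1 source disk, $2 destination disk, $3 map file recording progress
ddrescue_plan() {
    # Pass 1: copy everything that reads easily, skipping the scraping phase.
    echo "ddrescue -f -n $1 $2 $3"
    # Pass 2: revisit the bad areas with direct I/O and three retry passes.
    echo "ddrescue -f -d -r3 $1 $2 $3"
}

# Example: ddrescue_plan /dev/sdc /dev/sde sdc.map
```

Double-check source and destination before running the printed
commands, since -f allows ddrescue to overwrite a block device.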
Whichever path you take, when done, consider switching to raid6 using
the extra drive. That's far more secure than a hot spare (if a little
slower).
I did notice one other issue in your posted dmesg: misaligned
partitions. This cripples MD's ability to fix UREs on the fly or during
a scrub. You *must* rebuild your array with properly aligned partitions
before you quit.
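
One way to see where each partition starts, as a sketch. It assumes
sdX-style device names and treats a 1 MiB boundary (2048 sectors of 512
bytes) as "aligned", which is a common convention rather than the exact
rule behind the kernel warning; is_aligned and report_alignment are
hypothetical helper names.

```shell
#!/bin/sh
# is_aligned: succeed if a start sector (512-byte units) is a multiple of
# the alignment, also given in 512-byte sectors (default 2048 = 1 MiB).
is_aligned() {
    [ $(( $1 % ${2:-2048} )) -eq 0 ]
}

# report_alignment: print the start sector and a verdict for each partition,
# reading /sys/block/<disk>/<partition>/start.
report_alignment() {
    for part in "$@"; do
        disk=${part%%[0-9]*}
        start=$(cat "/sys/block/$disk/$part/start") || continue
        if is_aligned "$start"; then verdict=aligned; else verdict=misaligned; fi
        echo "$part start=$start $verdict"
    done
}

# Example: report_alignment sda1 sdb1 sdc1 sdd1 sde1 sdf1
```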
Phil
* Re: Wierd: Degrading while recovering raid5
@ 2015-02-11 6:23 Kyle Logue
2015-02-11 14:28 ` Phil Turmel
0 siblings, 1 reply; 9+ messages in thread
From: Kyle Logue @ 2015-02-11 6:23 UTC (permalink / raw)
To: linux-raid
Phil:
For a while I really thought that was going to work. I swapped out the
sata cable and set the timeout to 10 minutes. At about 70% rebuilt I
got the following dmesg which seems to indicate the death of my sdc
drive.
Here is my question: I still have this sde that I manually failed and
that hasn't been touched. Can I force re-add it to the array and just
take the data corruption hit?
I'd rather have to revert part of my data than all of it. The drive
counts are significantly different now, but I haven't mounted the
drives since the beginning. I haven't tried it but I saw someone else
online get a message like 'raid has failed so using --add cannot work
and might destroy data'. Is there a force add? What are my chances?
The dmesg in question. I started rebuilding at 20:24.
[Tue Feb 10 20:23:59 2015] md: md0 stopped.
[Tue Feb 10 20:23:59 2015] md: unbind<sdf1>
[Tue Feb 10 20:23:59 2015] md: export_rdev(sdf1)
[Tue Feb 10 20:23:59 2015] md: unbind<sde1>
[Tue Feb 10 20:23:59 2015] md: export_rdev(sde1)
[Tue Feb 10 20:23:59 2015] md: unbind<sdd1>
[Tue Feb 10 20:23:59 2015] md: export_rdev(sdd1)
[Tue Feb 10 20:23:59 2015] md: unbind<sdc1>
[Tue Feb 10 20:23:59 2015] md: export_rdev(sdc1)
[Tue Feb 10 20:23:59 2015] md: unbind<sdb1>
[Tue Feb 10 20:23:59 2015] md: export_rdev(sdb1)
[Tue Feb 10 20:23:59 2015] md: unbind<sda1>
[Tue Feb 10 20:23:59 2015] md: export_rdev(sda1)
[Tue Feb 10 20:24:59 2015] md: md0 stopped.
[Tue Feb 10 20:24:59 2015] md: bind<sdd1>
[Tue Feb 10 20:24:59 2015] md: bind<sde1>
[Tue Feb 10 20:24:59 2015] md: bind<sdf1>
[Tue Feb 10 20:24:59 2015] md: bind<sdb1>
[Tue Feb 10 20:24:59 2015] md: bind<sda1>
[Tue Feb 10 20:24:59 2015] md: bind<sdc1>
[Tue Feb 10 20:24:59 2015] md: kicking non-fresh sde1 from array!
[Tue Feb 10 20:24:59 2015] md: unbind<sde1>
[Tue Feb 10 20:24:59 2015] md: export_rdev(sde1)
[Tue Feb 10 20:24:59 2015] md/raid:md0: device sdc1 operational as raid disk 0
[Tue Feb 10 20:24:59 2015] md/raid:md0: device sdb1 operational as raid disk 4
[Tue Feb 10 20:24:59 2015] md/raid:md0: device sdf1 operational as raid disk 3
[Tue Feb 10 20:24:59 2015] md/raid:md0: device sdd1 operational as raid disk 1
[Tue Feb 10 20:24:59 2015] md/raid:md0: allocated 0kB
[Tue Feb 10 20:24:59 2015] md/raid:md0: raid level 5 active with 4 out of 5 devices, algorithm 2
[Tue Feb 10 20:24:59 2015] RAID conf printout:
[Tue Feb 10 20:24:59 2015] --- level:5 rd:5 wd:4
[Tue Feb 10 20:24:59 2015] disk 0, o:1, dev:sdc1
[Tue Feb 10 20:24:59 2015] disk 1, o:1, dev:sdd1
[Tue Feb 10 20:24:59 2015] disk 3, o:1, dev:sdf1
[Tue Feb 10 20:24:59 2015] disk 4, o:1, dev:sdb1
[Tue Feb 10 20:24:59 2015] md0: Warning: Device sda1 is misaligned
[Tue Feb 10 20:24:59 2015] md0: Warning: Device sdb1 is misaligned
[Tue Feb 10 20:24:59 2015] md0: Warning: Device sdb1 is misaligned
[Tue Feb 10 20:24:59 2015] md0: detected capacity change from 0 to 8001584889856
[Tue Feb 10 20:24:59 2015] RAID conf printout:
[Tue Feb 10 20:24:59 2015] --- level:5 rd:5 wd:4
[Tue Feb 10 20:24:59 2015] disk 0, o:1, dev:sdc1
[Tue Feb 10 20:24:59 2015] disk 1, o:1, dev:sdd1
[Tue Feb 10 20:24:59 2015] disk 2, o:1, dev:sda1
[Tue Feb 10 20:24:59 2015] disk 3, o:1, dev:sdf1
[Tue Feb 10 20:24:59 2015] disk 4, o:1, dev:sdb1
[Tue Feb 10 20:24:59 2015] md: recovery of RAID array md0
[Tue Feb 10 20:24:59 2015] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[Tue Feb 10 20:24:59 2015] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[Tue Feb 10 20:24:59 2015] md: using 128k window, over a total of 1953511936k.
[Tue Feb 10 20:24:59 2015] md0: unknown partition table
[Tue Feb 10 20:35:34 2015] perf samples too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[Wed Feb 11 01:02:15 2015] ata5.00: exception Emask 0x0 SAct 0x30 SErr 0x0 action 0x0
[Wed Feb 11 01:02:15 2015] ata5.00: irq_stat 0x40000008
[Wed Feb 11 01:02:15 2015] ata5.00: failed command: READ FPDMA QUEUED
[Wed Feb 11 01:02:15 2015] ata5.00: cmd 60/00:20:18:1d:1c/04:00:a4:00:00/40 tag 4 ncq 524288 in
[Wed Feb 11 01:02:15 2015] res 41/40:00:e8:1d:1c/00:04:a4:00:00/00 Emask 0x409 (media error) <F>
[Wed Feb 11 01:02:15 2015] ata5.00: status: { DRDY ERR }
[Wed Feb 11 01:02:15 2015] ata5.00: error: { UNC }
[Wed Feb 11 01:02:15 2015] ata5.00: configured for UDMA/133
[Wed Feb 11 01:02:15 2015] sd 4:0:0:0: [sdc] Unhandled sense code
[Wed Feb 11 01:02:15 2015] sd 4:0:0:0: [sdc]
[Wed Feb 11 01:02:15 2015] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Feb 11 01:02:15 2015] sd 4:0:0:0: [sdc]
[Wed Feb 11 01:02:15 2015] Sense Key : Medium Error [current] [descriptor]
[Wed Feb 11 01:02:15 2015] Descriptor sense data with sense descriptors (in hex):
[Wed Feb 11 01:02:15 2015] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[Wed Feb 11 01:02:15 2015] a4 1c 1d e8
[Wed Feb 11 01:02:15 2015] sd 4:0:0:0: [sdc]
[Wed Feb 11 01:02:15 2015] Add. Sense: Unrecovered read error - auto reallocate failed
[Wed Feb 11 01:02:15 2015] sd 4:0:0:0: [sdc] CDB:
[Wed Feb 11 01:02:15 2015] Read(10): 28 00 a4 1c 1d 18 00 04 00 00
[Wed Feb 11 01:02:15 2015] end_request: I/O error, dev sdc, sector 2753306088
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304040 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304048 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304056 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304064 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304072 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304080 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304088 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304096 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304104 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304112 on sdc1).
[Wed Feb 11 01:02:15 2015] ata5: EH complete
[Wed Feb 11 01:02:18 2015] ata5.00: exception Emask 0x0 SAct 0xff80 SErr 0x0 action 0x0
[Wed Feb 11 01:02:18 2015] ata5.00: irq_stat 0x40000008
[Wed Feb 11 01:02:18 2015] ata5.00: failed command: READ FPDMA QUEUED
[Wed Feb 11 01:02:18 2015] ata5.00: cmd 60/80:38:e8:1d:1c/00:00:a4:00:00/40 tag 7 ncq 65536 in
[Wed Feb 11 01:02:18 2015] res 41/40:80:e8:1d:1c/00:00:a4:00:00/00 Emask 0x409 (media error) <F>
[Wed Feb 11 01:02:18 2015] ata5.00: status: { DRDY ERR }
[Wed Feb 11 01:02:18 2015] ata5.00: error: { UNC }
[Wed Feb 11 01:02:18 2015] ata5.00: configured for UDMA/133
[Wed Feb 11 01:02:18 2015] sd 4:0:0:0: [sdc] Unhandled sense code
[Wed Feb 11 01:02:18 2015] sd 4:0:0:0: [sdc]
[Wed Feb 11 01:02:18 2015] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Feb 11 01:02:18 2015] sd 4:0:0:0: [sdc]
[Wed Feb 11 01:02:18 2015] Sense Key : Medium Error [current] [descriptor]
[Wed Feb 11 01:02:18 2015] Descriptor sense data with sense descriptors (in hex):
[Wed Feb 11 01:02:18 2015] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[Wed Feb 11 01:02:18 2015] a4 1c 1d e8
[Wed Feb 11 01:02:18 2015] sd 4:0:0:0: [sdc]
[Wed Feb 11 01:02:18 2015] Add. Sense: Unrecovered read error - auto reallocate failed
[Wed Feb 11 01:02:18 2015] sd 4:0:0:0: [sdc] CDB:
[Wed Feb 11 01:02:18 2015] Read(10): 28 00 a4 1c 1d e8 00 00 80 00
[Wed Feb 11 01:02:18 2015] end_request: I/O error, dev sdc, sector 2753306088
[Wed Feb 11 01:02:18 2015] md/raid:md0: Disk failure on sdc1, disabling device.
[Wed Feb 11 01:02:18 2015] md/raid:md0: Operation continuing on 3 devices.
[Wed Feb 11 01:02:18 2015] ata5: EH complete
[Wed Feb 11 01:02:18 2015] md: md0: recovery interrupted.
[Wed Feb 11 01:02:18 2015] RAID conf printout:
[Wed Feb 11 01:02:18 2015] --- level:5 rd:5 wd:3
[Wed Feb 11 01:02:18 2015] disk 0, o:0, dev:sdc1
[Wed Feb 11 01:02:18 2015] disk 1, o:1, dev:sdd1
[Wed Feb 11 01:02:18 2015] disk 2, o:1, dev:sda1
[Wed Feb 11 01:02:18 2015] disk 3, o:1, dev:sdf1
[Wed Feb 11 01:02:18 2015] disk 4, o:1, dev:sdb1
[Wed Feb 11 01:02:18 2015] RAID conf printout:
[Wed Feb 11 01:02:18 2015] --- level:5 rd:5 wd:3
[Wed Feb 11 01:02:18 2015] disk 1, o:1, dev:sdd1
[Wed Feb 11 01:02:18 2015] disk 2, o:1, dev:sda1
[Wed Feb 11 01:02:18 2015] disk 3, o:1, dev:sdf1
[Wed Feb 11 01:02:18 2015] disk 4, o:1, dev:sdb1
[Wed Feb 11 01:02:18 2015] RAID conf printout:
[Wed Feb 11 01:02:18 2015] --- level:5 rd:5 wd:3
[Wed Feb 11 01:02:18 2015] disk 1, o:1, dev:sdd1
[Wed Feb 11 01:02:18 2015] disk 2, o:1, dev:sda1
[Wed Feb 11 01:02:18 2015] disk 3, o:1, dev:sdf1
[Wed Feb 11 01:02:18 2015] disk 4, o:1, dev:sdb1
[Wed Feb 11 01:02:18 2015] RAID conf printout:
[Wed Feb 11 01:02:18 2015] --- level:5 rd:5 wd:3
[Wed Feb 11 01:02:18 2015] disk 1, o:1, dev:sdd1
[Wed Feb 11 01:02:18 2015] disk 3, o:1, dev:sdf1
[Wed Feb 11 01:02:18 2015] disk 4, o:1, dev:sdb1
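
As a cross-check of the log above: the four LBA bytes in the second
Read(10) CDB (a4 1c 1d e8) are the same sector that end_request reports
in decimal. A small sketch of the conversion, where lba_from_bytes is a
hypothetical helper name:

```shell
#!/bin/sh
# lba_from_bytes: convert space-separated hex bytes (most significant first)
# into a decimal sector number.
lba_from_bytes() {
    printf '%d\n' "0x$(printf '%s' "$1" | tr -d ' ')"
}

lba_from_bytes "a4 1c 1d e8"    # prints 2753306088
```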
Thanks again,
Kyle L
On Tue, Feb 10, 2015 at 9:14 PM, Phil Turmel <philip@turmel.org> wrote:
>
> Hi Kyle,
>
> { Convention on kernel.org lists is reply-to-all, trim replies, and
> either bottom post or interleave }
>
> On 02/10/2015 04:50 PM, Kyle Logue wrote:
> > Phil:
> >
> > Thanks for your detailed response. That link does seem to describe my
> > problem and I do understand that desktop grade drives are sub-optimal.
> > It was many years ago when I first set up this array on my home
> > theater pc. Until now I had no idea about the cron job - I'll make
> > sure to implement that. I am preparing to move to 6 TB disks sometime
> > soon and I'll definitely go enterprise this time.
> >
> > Regarding the drive timeout: I understand that I need to increase it
> > from 30 seconds to something larger (2+ min) but am unaware how to do
> > this. Is it a kernel variable? I'll keep googling but this seems like
> > it's what's going to save me.
> >
> > tl;dr: How do I change the drive timeout?
>
> Put something like this in /etc/rc.local or wherever your distro suggests:
>
> for x in /sys/block/sd[a-f]/device/timeout ; do
>     echo 180 > "$x"
> done
>
> Where the [a-f] is adjusted to suit your needs, and only for non-raid
> non-scterc drives.
>
> Phil
Thread overview: 9+ messages
2015-02-10 4:20 Wierd: Degrading while recovering raid5 Kyle Logue
2015-02-10 7:35 ` Adam Goryachev
2015-02-10 13:51 ` Phil Turmel
2015-02-10 21:50 ` Kyle Logue
2015-02-11 2:14 ` Phil Turmel
2015-02-11 6:23 Kyle Logue
2015-02-11 14:28 ` Phil Turmel
2015-02-11 22:12 ` Kyle Logue
2015-02-12 0:15 ` Phil Turmel