* recovering RAID5 from multiple disk failures
@ 2013-02-01 12:28 Michael Ritzert
2013-02-01 13:21 ` Phil Turmel
0 siblings, 1 reply; 9+ messages in thread
From: Michael Ritzert @ 2013-02-01 12:28 UTC (permalink / raw)
To: linux-raid
Hi all,
this looks bad:
I have a RAID5 that showed a disk error. The disk failed badly with read
errors. Apparently, these happen to be at locations important to the file
system, as the RAID read speed was a few kB/s, with permanent timeouts
reading from the disk.
So I removed the disk from the RAID, to be able to take a backup. The
backup ran well for one directory and then stopped completely. It turned
out that another disk had also suddenly developed read errors.
So the situation is: I have a four-disk RAID5 with two active disks, and
two that dropped out at different times.
I made 1:1 copies of all 4 disks with ddrescue, and the error report shows
that the erroneous regions do not overlap. So I hope there is a chance to
recover the data.
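For reference, the copies were made along these lines (device names and
mapfile names are placeholders, of course):

```shell
# Clone each failing member to a fresh disk, keeping a mapfile so the
# copy can be resumed and the bad regions inspected afterwards.
# /dev/sdX = old (failing) disk, /dev/sdY = new disk -- placeholders.
ddrescue -f -n /dev/sdX /dev/sdY sdX.map   # first pass, skip scraping
ddrescue -f -r3 /dev/sdX /dev/sdY sdX.map  # then retry bad areas 3 times
```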
Except for the filesystem mount, there were only read accesses to the
array after the first disk dropped out. So my strategy would be to
convince md to accept all disks as up to date and treat the read errors
on two disks, and the differing filesystem metadata, as RAID errors that
can hopefully be corrected.
The mdadm report for one of the disks looks like this:
/dev/sdb3:
Magic : a92b4efc
Version : 0.90.00
UUID : f5ad617a:14ccd4b1:3d7a38e4:71465fe8
Creation Time : Fri Nov 26 19:58:40 2010
Raid Level : raid5
Used Dev Size : 1951945600 (1861.52 GiB 1998.79 GB)
Array Size : 5855836800 (5584.56 GiB 5996.38 GB)
Raid Devices : 4
Total Devices : 3
Preferred Minor : 0
Update Time : Fri Jan 4 16:33:36 2013
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 1
Spare Devices : 0
Checksum : 74966e68 - correct
Events : 237
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 3 8 51 3 active sync
0 0 0 0 0 removed
1 1 8 19 1 active sync /dev/sdb3
2 2 0 0 2 faulty removed
3 3 8 51 3 active sync
My first attempt would be to try
mdadm --create --metadata=0.9 --chunk=64 --assume-clean, etc.
Is there a chance for this to succeed? Or do you have better suggestions?
If all recovery that involves assembling the array fails: Is it possible
to manually assemble the data?
I'm thinking along the lines of: take the first 64k from disk1, then 64k
from disk2, etc. This would probably take years to complete, but the data
is really important to me (which is why I put it on a RAID in the
first place...).
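To make that concrete, here is a rough sketch of the chunk-to-disk
bookkeeping I would have to follow (assuming the 4-disk left-symmetric
layout reported above; the function name is mine):

```shell
# Which member disk holds a given logical 64k data chunk, for a 4-disk
# RAID5 with the left-symmetric layout (mdadm's default)?
# In each stripe the parity rotates backwards from the last disk, and
# the data chunks start on the disk just after the parity disk.
chunk_disk() {
    n=4                               # number of member disks
    stripe=$(( $1 / (n - 1) ))        # each stripe holds n-1 data chunks
    pos=$(( $1 % (n - 1) ))           # position within the stripe
    parity=$(( n - 1 - stripe % n ))  # parity disk for this stripe
    echo $(( (parity + 1 + pos) % n ))
}
```

For example, chunk 3 (the first data chunk of stripe 1) would land on
disk 3, because stripe 1 keeps its parity on disk 2.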
Thanks,
Michael
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: recovering RAID5 from multiple disk failures
2013-02-01 12:28 recovering RAID5 from multiple disk failures Michael Ritzert
@ 2013-02-01 13:21 ` Phil Turmel
2013-02-02 13:04 ` Michael Ritzert
0 siblings, 1 reply; 9+ messages in thread
From: Phil Turmel @ 2013-02-01 13:21 UTC (permalink / raw)
To: Michael Ritzert; +Cc: linux-raid
Hi Michael,
On 02/01/2013 07:28 AM, Michael Ritzert wrote:
> Hi all,
>
> this looks bad:
> I have a RAID5 that showed a disk error. The disk failed badly with read
> errors. Apparently, these happen to be at locations important to the file
> system, as the RAID read speed was a few kB/s, with permanent timeouts
> reading from the disk.
> So I removed the disk from the RAID, to be able to take a backup. The
> backup ran well for one directory and then stopped completely. It turned
> out that another disk had also suddenly developed read errors.
>
> So the situation is: I have a four-disk RAID5 with two active disks, and
> two that dropped out at different times.
Please show the errors from dmesg.
And show "smartctl -x" for the drives that failed.
> I made 1:1 copies of all 4 disks with ddrescue, and the error report shows
> that the erroneous regions do not overlap. So I hope there is a chance to
> recover the data.
Very good.
> Except for the filesystem mount, there were only read accesses to the
> array after the first disk dropped out. So my strategy would be to
> convince md to accept all disks as up to date and treat the read errors
> on two disks, and the differing filesystem metadata, as RAID errors that
> can hopefully be corrected.
>
> The mdadm report for one of the disks looks like this:
> /dev/sdb3:
> Magic : a92b4efc
> Version : 0.90.00
> UUID : f5ad617a:14ccd4b1:3d7a38e4:71465fe8
> Creation Time : Fri Nov 26 19:58:40 2010
> Raid Level : raid5
> Used Dev Size : 1951945600 (1861.52 GiB 1998.79 GB)
> Array Size : 5855836800 (5584.56 GiB 5996.38 GB)
> Raid Devices : 4
> Total Devices : 3
> Preferred Minor : 0
>
> Update Time : Fri Jan 4 16:33:36 2013
> State : clean
> Active Devices : 2
> Working Devices : 2
> Failed Devices : 1
> Spare Devices : 0
> Checksum : 74966e68 - correct
> Events : 237
>
> Layout : left-symmetric
> Chunk Size : 64K
>
> Number Major Minor RaidDevice State
> this 3 8 51 3 active sync
>
> 0 0 0 0 0 removed
> 1 1 8 19 1 active sync /dev/sdb3
> 2 2 0 0 2 faulty removed
> 3 3 8 51 3 active sync
Also show "mdadm -E" for all of the member devices. This data is an
absolute *must* before any major surgery on an array.
> My first attempt would be to try
> mdadm --create --metadata=0.9 --chunk=64 --assume-clean, etc.
>
> Is there a chance for this to succeed? Or do you have better suggestions?
"--create" is a *terrible* first step. "mdadm --assemble --force" is
the right tool for this job.
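A sketch of that, run against your ddrescue copies (the md device and
member names here are placeholders; substitute yours):

```shell
# Force-assemble from the copies, pulling the two stale members back in
# despite their lower event counts. Device names are placeholders.
mdadm --stop /dev/md0                  # in case a partial assembly exists
mdadm --assemble --force /dev/md0 \
    /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
mdadm --detail /dev/md0                # verify all four members are active
```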
> If all recovery that involves assembling the array fails: Is it possible
> to manually assemble the data?
> I'm thinking along the lines of: take the first 64k from disk1, then 64k
> from disk2, etc. This would probably take years to complete, but the data
> is really important to me (which is why I put it on a RAID in the
> first place...).
Your scenario sounds like the common timeout-mismatch catastrophe, which
is why I asked for "smartctl -x". If that is the case, MD won't be able
to do the reconstructions that it should when encountering read errors.
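If the drives support SCT ERC, the usual mitigation looks like this
(sdX is a placeholder; repeat for each member, on every boot):

```shell
# Check and set SCT Error Recovery Control so the drive gives up on a
# bad sector (7.0 s here) before the kernel's command timeout fires.
smartctl -l scterc /dev/sdX            # show current read/write ERC
smartctl -l scterc,70,70 /dev/sdX      # values are in tenths of a second
# If the drive does not support ERC, raise the kernel timeout instead:
echo 180 > /sys/block/sdX/device/timeout
```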
Also, you have a poor understanding of MD's use--it is *not* a backup
alternative. It is a tool for maximizing *uptime*. It will keep you
running through the normal random failures that complex
electro-mechanical systems experience.
MD won't save your data from accidental deletion or other operator
error. It won't save your data from a lightning strike. It won't save
your data from a home or office fire. You still need to make backups.
Phil
* Re: recovering RAID5 from multiple disk failures
2013-02-01 13:21 ` Phil Turmel
@ 2013-02-02 13:04 ` Michael Ritzert
2013-02-02 13:44 ` Phil Turmel
0 siblings, 1 reply; 9+ messages in thread
From: Michael Ritzert @ 2013-02-02 13:04 UTC (permalink / raw)
To: linux-raid
Hi Phil,
In article <510BC173.7070002@turmel.org> you wrote:
>> So the situation is: I have a four-disk RAID5 with two active disks, and
>> two that dropped out at different times.
>
> Please show the errors from dmesg.
I don't think I can provide that. The RAID ran in a QNAP system, and if
there is a log at all, it's on this disk...
During the copy process, it was all media errors, however.
> And show "smartctl -x" for the drives that failed.
See below.
[...]
> Also show "mdadm -E" for all of the member devices. This data is an
> absolute *must* before any major surgery on an array.
Also below.
>> My first attempt would be to try
>> mdadm --create --metadata=0.9 --chunk=64 --assume-clean, etc.
>>
>> Is there a chance for this to succeed? Or do you have better suggestions?
>
> "--create" is a *terrible* first step. "mdadm --assemble --force" is
> the right tool for this job.
I forgot to mention: I tried that, and stopped it after I saw that the
first thing it did was to start a rebuild of the array. I couldn't figure
out which disk it was trying to rebuild, but whichever of the two
dropped-out disks it was, I can't see how it could reconstruct the data
once it reaches the point of the errors on the disk it uses for the
reconstruction.
(So "first" above should really read, more verbosely, "first after the
new copies are finished".)
mdadm --assemble --assume-clean sounded like the most logical combination of
options, but was rejected.
Unfortunately, the data on the disk is not simply a filesystem where bad
blocks mean a few unreadable files, but a filesystem with a number of files
on it that represent a volume exported via iSCSI, on which there is an
encrypted partition with a filesystem. So I'm not too sure whether any of
these indirections badly multiplies the effect of a single bad sector, and
I'm trying to reach 100% good, if possible.
>> If all recovery that involves assembling the array fails: Is it possible
>> to manually assemble the data?
>> I'm thinking along the lines of: take the first 64k from disk1, then 64k
>> from disk2, etc. This would probably take years to complete, but the data
>> is really important to me (which is why I put it on a RAID in the
>> first place...).
>
> Your scenario sounds like the common timeout-mismatch catastrophe, which
> is why I asked for "smartctl -x". If that is the case, MD won't be able
> to do the reconstructions that it should when encountering read errors.
You mean the "timeout of the disk is longer than RAID's patience" problem?
I have no idea whether the old disks suffered from it; I used Samsung
HD204UI drives, which were certified by QNAP. The copies are now on WD NAS
edition disks, which have a lower timeout.
> Also, you have a poor understanding of MD's use--it is *not* a backup
> alternative. It is a tool for maximizing *uptime*. It will keep you
> running through the normal random failures that complex
> electro-mechanical systems experience.
>
> MD won't save your data from accidental deletion or other operator
> error. It won't save your data from a lightning strike. It won't save
> your data from a home or office fire. You still need to make backups.
I'm all too aware of that, and we also tried to keep a "manual RAID" by
copying to a number of USB disks stored at a different location, to survive
a burnt-down house. However, this event uncovered a bad oversight on our
side in that process: we simply missed some data under certain circumstances
(the "I-thought-you-did-it" bug in human interaction). So out of the ~800GB
on the RAID, for some +/-20GB this is the only remaining copy.
Recently, I also started copying all data to Amazon Glacier, for 100%-epsilon
reliable storage, but this upload simply took longer than the disks lasted
(= less than 30 days of spinning! Very disappointing).
Regards,
Michael
All the disk data follows; disks 1 and 3 failed:
(I installed the patch for the firmware bug from December 2012.)
Disk1:
======
/dev/sdc3:
Magic : a92b4efc
Version : 0.90.00
UUID : f5ad617a:14ccd4b1:3d7a38e4:71465fe8
Creation Time : Fri Nov 26 19:58:40 2010
Raid Level : raid5
Used Dev Size : 1951945600 (1861.52 GiB 1998.79 GB)
Array Size : 5855836800 (5584.56 GiB 5996.38 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Update Time : Fri Jan 4 15:11:07 2013
State : active
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Checksum : 7496591c - correct
Events : 25
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 0 8 3 0 active sync
0 0 8 3 0 active sync
1 1 8 19 1 active sync /dev/sdb3
2 2 8 35 2 active sync /dev/sdc3
3 3 8 51 3 active sync
Disk2:
======
/dev/sdb3:
Magic : a92b4efc
Version : 0.90.00
UUID : f5ad617a:14ccd4b1:3d7a38e4:71465fe8
Creation Time : Fri Nov 26 19:58:40 2010
Raid Level : raid5
Used Dev Size : 1951945600 (1861.52 GiB 1998.79 GB)
Array Size : 5855836800 (5584.56 GiB 5996.38 GB)
Raid Devices : 4
Total Devices : 3
Preferred Minor : 0
Update Time : Fri Jan 4 16:33:36 2013
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 1
Spare Devices : 0
Checksum : 74966e44 - correct
Events : 237
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 1 8 19 1 active sync /dev/sdb3
0 0 0 0 0 removed
1 1 8 19 1 active sync /dev/sdb3
2 2 0 0 2 faulty removed
3 3 8 51 3 active sync
Disk3:
======
/dev/sdb3:
Magic : a92b4efc
Version : 0.90.00
UUID : f5ad617a:14ccd4b1:3d7a38e4:71465fe8
Creation Time : Fri Nov 26 19:58:40 2010
Raid Level : raid5
Used Dev Size : 1951945600 (1861.52 GiB 1998.79 GB)
Array Size : 5855836800 (5584.56 GiB 5996.38 GB)
Raid Devices : 4
Total Devices : 3
Preferred Minor : 0
Update Time : Fri Jan 4 16:32:27 2013
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Checksum : 74966e04 - correct
Events : 236
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 2 8 35 2 active sync /dev/sdc3
0 0 0 0 0 removed
1 1 8 19 1 active sync /dev/sdb3
2 2 8 35 2 active sync /dev/sdc3
3 3 8 51 3 active sync
Disk4:
======
/dev/sdc3:
Magic : a92b4efc
Version : 0.90.00
UUID : f5ad617a:14ccd4b1:3d7a38e4:71465fe8
Creation Time : Fri Nov 26 19:58:40 2010
Raid Level : raid5
Used Dev Size : 1951945600 (1861.52 GiB 1998.79 GB)
Array Size : 5855836800 (5584.56 GiB 5996.38 GB)
Raid Devices : 4
Total Devices : 3
Preferred Minor : 0
Update Time : Fri Jan 4 16:33:36 2013
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 1
Spare Devices : 0
Checksum : 74966e68 - correct
Events : 237
Layout : left-symmetric
Chunk Size : 64K
Number Major Minor RaidDevice State
this 3 8 51 3 active sync
0 0 0 0 0 removed
1 1 8 19 1 active sync /dev/sdb3
2 2 0 0 2 faulty removed
3 3 8 51 3 active sync
Disk1:
======
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F4 EG (AFT)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7J1BZA16176
LU WWN Device Id: 5 0024e9 004358105
Firmware Version: 1AQ10001
User Capacity: 2.000.398.934.016 bytes [2,00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Fri Feb 1 20:33:54 2013 CET
==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
Buggy and fixed firmware report same version number!
See the following web pages for details:
http://www.samsung.com/global/business/hdd/faqView.do?b2b_bbs_msg_id=386
http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 119) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (20640) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 100 100 051 - 1121
2 Throughput_Performance -OS--K 252 252 000 - 0
3 Spin_Up_Time PO---K 066 065 025 - 10415
4 Start_Stop_Count -O--CK 099 099 000 - 1123
5 Reallocated_Sector_Ct PO--CK 252 252 010 - 0
7 Seek_Error_Rate -OSR-K 252 252 051 - 0
8 Seek_Time_Performance --S--K 252 252 015 - 0
9 Power_On_Hours -O--CK 100 100 000 - 717
10 Spin_Retry_Count -O--CK 252 252 051 - 0
11 Calibration_Retry_Count -O--CK 252 252 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 120
181 Program_Fail_Cnt_Total -O---K 100 100 000 - 903
191 G-Sense_Error_Rate -O---K 100 100 000 - 3
192 Power-Off_Retract_Count -O---K 252 252 000 - 0
194 Temperature_Celsius -O---- 064 064 000 - 17 (Min/Max 14/33)
195 Hardware_ECC_Recovered -O-RCK 100 100 000 - 0
196 Reallocated_Event_Count -O--CK 252 252 000 - 0
197 Current_Pending_Sector -O--CK 100 100 000 - 30
198 Offline_Uncorrectable ----CK 252 252 000 - 0
199 UDMA_CRC_Error_Count -OS-CK 200 200 000 - 0
200 Multi_Zone_Error_Rate -O-R-K 100 100 000 - 0
223 Load_Retry_Count -O--CK 252 252 000 - 0
225 Load_Cycle_Count -O--CK 100 100 000 - 1124
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
GP/S Log at address 0x00 has 1 sectors [Log Directory]
SMART Log at address 0x01 has 1 sectors [Summary SMART error log]
SMART Log at address 0x02 has 2 sectors [Comprehensive SMART error log]
GP Log at address 0x03 has 2 sectors [Ext. Comprehensive SMART error log]
SMART Log at address 0x06 has 1 sectors [SMART self-test log]
GP Log at address 0x07 has 2 sectors [Extended self-test log]
GP Log at address 0x08 has 2 sectors [Power Conditions]
SMART Log at address 0x09 has 1 sectors [Selective self-test log]
GP Log at address 0x10 has 1 sectors [NCQ Command Error]
GP Log at address 0x11 has 1 sectors [SATA Phy Event Counters]
GP/S Log at address 0x80 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x81 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x82 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x83 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x84 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x85 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x86 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x87 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x88 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x89 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8f has 16 sectors [Host vendor specific log]
GP/S Log at address 0x90 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x91 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x92 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x93 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x94 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x95 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x96 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x97 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x98 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x99 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9f has 16 sectors [Host vendor specific log]
GP/S Log at address 0xe0 has 1 sectors [SCT Command/Status]
GP/S Log at address 0xe1 has 1 sectors [SCT Data Transfer]
SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
Device Error Count: 1005 (device log contains only the most recent 8 errors)
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 1005 [4] occurred at disk power-on lifetime: 716 hours (29 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 13 a4 a7 88 e0 00 Error: UNC 8 sectors at LBA = 0x13a4a788 = 329557896
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 13 a4 a7 88 e0 00 00:01:16.277 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.277 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:01:16.277 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:01:16.277 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.277 READ NATIVE MAX ADDRESS EXT
Error 1004 [3] occurred at disk power-on lifetime: 716 hours (29 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 13 a4 a7 88 e0 00 Error: UNC 8 sectors at LBA = 0x13a4a788 = 329557896
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 13 a4 a7 88 e0 00 00:01:16.271 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.271 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:01:16.271 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:01:16.271 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.271 READ NATIVE MAX ADDRESS EXT
Error 1003 [2] occurred at disk power-on lifetime: 716 hours (29 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 13 a4 a7 88 e0 00 Error: UNC 8 sectors at LBA = 0x13a4a788 = 329557896
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 13 a4 a7 88 e0 00 00:01:16.266 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.266 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:01:16.266 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:01:16.266 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.266 READ NATIVE MAX ADDRESS EXT
Error 1002 [1] occurred at disk power-on lifetime: 716 hours (29 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 13 a4 a7 88 e0 00 Error: UNC 8 sectors at LBA = 0x13a4a788 = 329557896
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 13 a4 a7 88 e0 00 00:01:16.261 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.261 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:01:16.261 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:01:16.261 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.261 READ NATIVE MAX ADDRESS EXT
Error 1001 [0] occurred at disk power-on lifetime: 716 hours (29 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 13 a4 a7 88 e0 00 Error: UNC 8 sectors at LBA = 0x13a4a788 = 329557896
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 13 a4 a7 88 e0 00 00:01:16.256 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.256 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:01:16.256 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:01:16.256 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.256 READ NATIVE MAX ADDRESS EXT
Error 1000 [7] occurred at disk power-on lifetime: 716 hours (29 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 13 a4 a7 88 e0 00 Error: UNC 8 sectors at LBA = 0x13a4a788 = 329557896
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 13 a4 a7 88 e0 00 00:01:16.251 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.251 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:01:16.251 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:01:16.251 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.251 READ NATIVE MAX ADDRESS EXT
Error 999 [6] occurred at disk power-on lifetime: 716 hours (29 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 13 a4 a7 88 e0 00 Error: UNC 8 sectors at LBA = 0x13a4a788 = 329557896
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 13 a4 a7 88 e0 00 00:01:16.246 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.246 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:01:16.246 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:01:16.246 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.246 READ NATIVE MAX ADDRESS EXT
Error 998 [5] occurred at disk power-on lifetime: 716 hours (29 days + 20 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 13 a4 a7 88 e0 00 Error: UNC 8 sectors at LBA = 0x13a4a788 = 329557896
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 13 a4 a7 88 e0 00 00:01:16.241 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.241 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:01:16.241 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:01:16.241 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:01:16.241 READ NATIVE MAX ADDRESS EXT
SMART Extended Self-test Log Version: 1 (2 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 70% 698 327170640
# 2 Extended offline Completed: read failure 90% 693 327170208
# 3 Short offline Completed: read failure 10% 692 327170648
Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Completed_read_failure [70% left] (0-65535)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 2
SCT Version (vendor specific): 256 (0x0100)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 17 Celsius
Power Cycle Min/Max Temperature: 16/17 Celsius
Lifetime Min/Max Temperature: 14/67 Celsius
Lifetime Average Temperature: 80 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 5 minutes
Temperature Logging Interval: 5 minutes
Min/Max recommended Temperature: -5/80 Celsius
Min/Max Temperature Limit: -10/85 Celsius
Temperature History Size (Index): 128 (104)
Index Estimated Time Temperature Celsius
105 2013-02-01 09:55 17 -
106 2013-02-01 10:00 29 **********
107 2013-02-01 10:05 28 *********
... ..( 15 skipped). .. *********
123 2013-02-01 11:25 28 *********
124 2013-02-01 11:30 29 **********
125 2013-02-01 11:35 28 *********
... ..( 9 skipped). .. *********
7 2013-02-01 12:25 28 *********
8 2013-02-01 12:30 29 **********
... ..( 9 skipped). .. **********
18 2013-02-01 13:20 29 **********
19 2013-02-01 13:25 28 *********
20 2013-02-01 13:30 29 **********
... ..( 6 skipped). .. **********
27 2013-02-01 14:05 29 **********
28 2013-02-01 14:10 28 *********
29 2013-02-01 14:15 29 **********
30 2013-02-01 14:20 28 *********
31 2013-02-01 14:25 28 *********
32 2013-02-01 14:30 28 *********
33 2013-02-01 14:35 29 **********
34 2013-02-01 14:40 28 *********
... ..( 9 skipped). .. *********
44 2013-02-01 15:30 28 *********
45 2013-02-01 15:35 29 **********
46 2013-02-01 15:40 28 *********
... ..( 7 skipped). .. *********
54 2013-02-01 16:20 28 *********
55 2013-02-01 16:25 29 **********
56 2013-02-01 16:30 28 *********
57 2013-02-01 16:35 29 **********
58 2013-02-01 16:40 28 *********
59 2013-02-01 16:45 28 *********
60 2013-02-01 16:50 27 ********
61 2013-02-01 16:55 28 *********
62 2013-02-01 17:00 29 **********
63 2013-02-01 17:05 28 *********
... ..( 13 skipped). .. *********
77 2013-02-01 18:15 28 *********
78 2013-02-01 18:20 29 **********
79 2013-02-01 18:25 28 *********
... ..( 14 skipped). .. *********
94 2013-02-01 19:40 28 *********
95 2013-02-01 19:45 27 ********
96 2013-02-01 19:50 28 *********
97 2013-02-01 19:55 28 *********
98 2013-02-01 20:00 27 ********
... ..( 5 skipped). .. ********
104 2013-02-01 20:30 27 ********
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 4 0 Command failed due to ICRC error
0x0002 4 0 R_ERR response for data FIS
0x0003 4 0 R_ERR response for device-to-host data FIS
0x0004 4 0 R_ERR response for host-to-device data FIS
0x0005 4 0 R_ERR response for non-data FIS
0x0006 4 0 R_ERR response for device-to-host non-data FIS
0x0007 4 0 R_ERR response for host-to-device non-data FIS
0x0008 4 0 Device-to-host non-data FIS retries
0x0009 4 7 Transition from drive PhyRdy to drive PhyNRdy
0x000a 4 0 Device-to-host register FISes sent due to a COMRESET
0x000b 4 0 CRC errors within host-to-device FIS
0x000d 4 0 Non-CRC errors within host-to-device FIS
0x000f 4 0 R_ERR response for host-to-device data FIS, CRC
0x0010 4 0 R_ERR response for host-to-device data FIS, non-CRC
0x0012 4 0 R_ERR response for host-to-device non-data FIS, CRC
0x0013 4 0 R_ERR response for host-to-device non-data FIS, non-CRC
0x8e00 4 0 Vendor specific
0x8e01 4 0 Vendor specific
0x8e02 4 0 Vendor specific
0x8e03 4 0 Vendor specific
0x8e04 4 0 Vendor specific
0x8e05 4 0 Vendor specific
0x8e06 4 0 Vendor specific
0x8e07 4 0 Vendor specific
0x8e08 4 0 Vendor specific
0x8e09 4 0 Vendor specific
0x8e0a 4 0 Vendor specific
0x8e0b 4 0 Vendor specific
0x8e0c 4 0 Vendor specific
0x8e0d 4 0 Vendor specific
0x8e0e 4 0 Vendor specific
0x8e0f 4 0 Vendor specific
0x8e10 4 0 Vendor specific
0x8e11 4 0 Vendor specific
Disk 2:
=======
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F4 EG (AFT)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7J1BZA16132
LU WWN Device Id: 5 0024e9 004357e94
Firmware Version: 1AQ10001
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Fri Feb 1 20:28:32 2013 CET
==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
Buggy and fixed firmware report same version number!
See the following web pages for details:
http://www.samsung.com/global/business/hdd/faqView.do?b2b_bbs_msg_id=386
http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (20760) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 100 100 051 - 0
2 Throughput_Performance -OS--K 252 252 000 - 0
3 Spin_Up_Time PO---K 067 066 025 - 10206
4 Start_Stop_Count -O--CK 099 099 000 - 1124
5 Reallocated_Sector_Ct PO--CK 252 252 010 - 0
7 Seek_Error_Rate -OSR-K 252 252 051 - 0
8 Seek_Time_Performance --S--K 252 252 015 - 0
9 Power_On_Hours -O--CK 100 100 000 - 709
10 Spin_Retry_Count -O--CK 252 252 051 - 0
11 Calibration_Retry_Count -O--CK 252 252 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 128
181 Program_Fail_Cnt_Total -O---K 100 100 000 - 925
191 G-Sense_Error_Rate -O---K 252 252 000 - 0
192 Power-Off_Retract_Count -O---K 252 252 000 - 0
194 Temperature_Celsius -O---- 064 064 000 - 21 (Min/Max 14/35)
195 Hardware_ECC_Recovered -O-RCK 100 100 000 - 0
196 Reallocated_Event_Count -O--CK 252 252 000 - 0
197 Current_Pending_Sector -O--CK 252 252 000 - 0
198 Offline_Uncorrectable ----CK 252 252 000 - 0
199 UDMA_CRC_Error_Count -OS-CK 200 200 000 - 0
200 Multi_Zone_Error_Rate -O-R-K 100 100 000 - 0
223 Load_Retry_Count -O--CK 252 252 000 - 0
225 Load_Cycle_Count -O--CK 100 100 000 - 1126
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
GP/S Log at address 0x00 has 1 sectors [Log Directory]
SMART Log at address 0x01 has 1 sectors [Summary SMART error log]
SMART Log at address 0x02 has 2 sectors [Comprehensive SMART error log]
GP Log at address 0x03 has 2 sectors [Ext. Comprehensive SMART error log]
SMART Log at address 0x06 has 1 sectors [SMART self-test log]
GP Log at address 0x07 has 2 sectors [Extended self-test log]
GP Log at address 0x08 has 2 sectors [Power Conditions]
SMART Log at address 0x09 has 1 sectors [Selective self-test log]
GP Log at address 0x10 has 1 sectors [NCQ Command Error]
GP Log at address 0x11 has 1 sectors [SATA Phy Event Counters]
GP/S Log at address 0x80 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x81 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x82 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x83 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x84 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x85 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x86 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x87 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x88 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x89 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8f has 16 sectors [Host vendor specific log]
GP/S Log at address 0x90 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x91 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x92 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x93 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x94 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x95 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x96 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x97 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x98 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x99 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9f has 16 sectors [Host vendor specific log]
GP/S Log at address 0xe0 has 1 sectors [SCT Command/Status]
GP/S Log at address 0xe1 has 1 sectors [SCT Data Transfer]
SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
No Errors Logged
SMART Extended Self-test Log Version: 1 (2 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 697 -
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Completed [00% left] (0-65535)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 2
SCT Version (vendor specific): 256 (0x0100)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 21 Celsius
Power Cycle Min/Max Temperature: 20/21 Celsius
Lifetime Min/Max Temperature: 14/62 Celsius
Lifetime Average Temperature: 80 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 5 minutes
Temperature Logging Interval: 5 minutes
Min/Max recommended Temperature: -5/80 Celsius
Min/Max Temperature Limit: -10/85 Celsius
Temperature History Size (Index): 128 (2)
Index Estimated Time Temperature Celsius
3 2013-02-01 09:50 21 **
4 2013-02-01 09:55 26 *******
5 2013-02-01 10:00 24 *****
6 2013-02-01 10:05 24 *****
7 2013-02-01 10:10 25 ******
8 2013-02-01 10:15 25 ******
9 2013-02-01 10:20 18 -
10 2013-02-01 10:25 19 -
11 2013-02-01 10:30 21 **
12 2013-02-01 10:35 22 ***
13 2013-02-01 10:40 23 ****
14 2013-02-01 10:45 24 *****
15 2013-02-01 10:50 24 *****
16 2013-02-01 10:55 25 ******
17 2013-02-01 11:00 25 ******
18 2013-02-01 11:05 26 *******
19 2013-02-01 11:10 26 *******
20 2013-02-01 11:15 26 *******
21 2013-02-01 11:20 27 ********
22 2013-02-01 11:25 26 *******
23 2013-02-01 11:30 27 ********
... ..( 4 skipped). .. ********
28 2013-02-01 11:55 27 ********
29 2013-02-01 12:00 28 *********
30 2013-02-01 12:05 27 ********
31 2013-02-01 12:10 27 ********
32 2013-02-01 12:15 18 -
33 2013-02-01 12:20 19 -
34 2013-02-01 12:25 20 *
35 2013-02-01 12:30 22 ***
36 2013-02-01 12:35 23 ****
37 2013-02-01 12:40 18 -
38 2013-02-01 12:45 19 -
39 2013-02-01 12:50 21 **
40 2013-02-01 12:55 22 ***
41 2013-02-01 13:00 24 *****
42 2013-02-01 13:05 24 *****
43 2013-02-01 13:10 24 *****
44 2013-02-01 13:15 25 ******
45 2013-02-01 13:20 25 ******
46 2013-02-01 13:25 26 *******
47 2013-02-01 13:30 26 *******
48 2013-02-01 13:35 19 -
49 2013-02-01 13:40 22 ***
50 2013-02-01 13:45 24 *****
51 2013-02-01 13:50 26 *******
52 2013-02-01 13:55 27 ********
53 2013-02-01 14:00 29 **********
54 2013-02-01 14:05 30 ***********
55 2013-02-01 14:10 31 ************
56 2013-02-01 14:15 31 ************
57 2013-02-01 14:20 32 *************
58 2013-02-01 14:25 32 *************
59 2013-02-01 14:30 32 *************
60 2013-02-01 14:35 33 **************
... ..( 2 skipped). .. **************
63 2013-02-01 14:50 33 **************
64 2013-02-01 14:55 32 *************
65 2013-02-01 15:00 33 **************
66 2013-02-01 15:05 33 **************
67 2013-02-01 15:10 34 ***************
68 2013-02-01 15:15 33 **************
69 2013-02-01 15:20 34 ***************
70 2013-02-01 15:25 34 ***************
71 2013-02-01 15:30 33 **************
72 2013-02-01 15:35 33 **************
73 2013-02-01 15:40 34 ***************
... ..( 46 skipped). .. ***************
120 2013-02-01 19:35 34 ***************
121 2013-02-01 19:40 33 **************
122 2013-02-01 19:45 33 **************
123 2013-02-01 19:50 35 ****************
124 2013-02-01 19:55 34 ***************
125 2013-02-01 20:00 34 ***************
126 2013-02-01 20:05 33 **************
... ..( 3 skipped). .. **************
2 2013-02-01 20:25 33 **************
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 4 0 Command failed due to ICRC error
0x0002 4 0 R_ERR response for data FIS
0x0003 4 0 R_ERR response for device-to-host data FIS
0x0004 4 0 R_ERR response for host-to-device data FIS
0x0005 4 0 R_ERR response for non-data FIS
0x0006 4 0 R_ERR response for device-to-host non-data FIS
0x0007 4 0 R_ERR response for host-to-device non-data FIS
0x0008 4 0 Device-to-host non-data FIS retries
0x0009 4 3 Transition from drive PhyRdy to drive PhyNRdy
0x000a 4 0 Device-to-host register FISes sent due to a COMRESET
0x000b 4 0 CRC errors within host-to-device FIS
0x000d 4 0 Non-CRC errors within host-to-device FIS
0x000f 4 0 R_ERR response for host-to-device data FIS, CRC
0x0010 4 0 R_ERR response for host-to-device data FIS, non-CRC
0x0012 4 0 R_ERR response for host-to-device non-data FIS, CRC
0x0013 4 0 R_ERR response for host-to-device non-data FIS, non-CRC
0x8e00 4 0 Vendor specific
0x8e01 4 0 Vendor specific
0x8e02 4 0 Vendor specific
0x8e03 4 0 Vendor specific
0x8e04 4 0 Vendor specific
0x8e05 4 0 Vendor specific
0x8e06 4 0 Vendor specific
0x8e07 4 0 Vendor specific
0x8e08 4 0 Vendor specific
0x8e09 4 0 Vendor specific
0x8e0a 4 0 Vendor specific
0x8e0b 4 0 Vendor specific
0x8e0c 4 0 Vendor specific
0x8e0d 4 0 Vendor specific
0x8e0e 4 0 Vendor specific
0x8e0f 4 0 Vendor specific
0x8e10 4 0 Vendor specific
0x8e11 4 0 Vendor specific
Disk 3:
=======
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F4 EG (AFT)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7J1BZA16147
LU WWN Device Id: 5 0024e9 004357edb
Firmware Version: 1AQ10001
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Fri Feb 1 20:33:49 2013 CET
==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
Buggy and fixed firmware report same version number!
See the following web pages for details:
http://www.samsung.com/global/business/hdd/faqView.do?b2b_bbs_msg_id=386
http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 116) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (21000) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 100 100 051 - 29439
2 Throughput_Performance -OS--K 252 252 000 - 0
3 Spin_Up_Time PO---K 067 066 025 - 10221
4 Start_Stop_Count -O--CK 099 099 000 - 1130
5 Reallocated_Sector_Ct PO--CK 252 252 010 - 0
7 Seek_Error_Rate -OSR-K 252 252 051 - 0
8 Seek_Time_Performance --S--K 252 252 015 - 0
9 Power_On_Hours -O--CK 100 100 000 - 766
10 Spin_Retry_Count -O--CK 252 252 051 - 0
11 Calibration_Retry_Count -O--CK 252 252 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 133
181 Program_Fail_Cnt_Total -O---K 100 100 000 - 919
191 G-Sense_Error_Rate -O---K 252 252 000 - 0
192 Power-Off_Retract_Count -O---K 252 252 000 - 0
194 Temperature_Celsius -O---- 064 064 000 - 18 (Min/Max 14/35)
195 Hardware_ECC_Recovered -O-RCK 100 100 000 - 0
196 Reallocated_Event_Count -O--CK 252 252 000 - 0
197 Current_Pending_Sector -O--CK 081 081 000 - 2350
198 Offline_Uncorrectable ----CK 252 100 000 - 0
199 UDMA_CRC_Error_Count -OS-CK 200 200 000 - 0
200 Multi_Zone_Error_Rate -O-R-K 100 100 000 - 0
223 Load_Retry_Count -O--CK 252 252 000 - 0
225 Load_Cycle_Count -O--CK 100 100 000 - 1132
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
GP/S Log at address 0x00 has 1 sectors [Log Directory]
SMART Log at address 0x01 has 1 sectors [Summary SMART error log]
SMART Log at address 0x02 has 2 sectors [Comprehensive SMART error log]
GP Log at address 0x03 has 2 sectors [Ext. Comprehensive SMART error log]
SMART Log at address 0x06 has 1 sectors [SMART self-test log]
GP Log at address 0x07 has 2 sectors [Extended self-test log]
GP Log at address 0x08 has 2 sectors [Power Conditions]
SMART Log at address 0x09 has 1 sectors [Selective self-test log]
GP Log at address 0x10 has 1 sectors [NCQ Command Error]
GP Log at address 0x11 has 1 sectors [SATA Phy Event Counters]
GP/S Log at address 0x80 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x81 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x82 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x83 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x84 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x85 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x86 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x87 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x88 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x89 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8f has 16 sectors [Host vendor specific log]
GP/S Log at address 0x90 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x91 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x92 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x93 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x94 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x95 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x96 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x97 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x98 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x99 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9f has 16 sectors [Host vendor specific log]
GP/S Log at address 0xe0 has 1 sectors [SCT Command/Status]
GP/S Log at address 0xe1 has 1 sectors [SCT Data Transfer]
SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
Device Error Count: 21341 (device log contains only the most recent 8 errors)
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 21341 [4] occurred at disk power-on lifetime: 766 hours (31 days + 22 hours)
When the command that caused the error occurred, the device was in a vendor specific state.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 2c 88 72 70 e0 00 Error: UNC 8 sectors at LBA = 0x2c887270 = 747139696
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 2c 88 72 70 e0 00 00:00:50.113 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.113 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:00:50.113 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:00:50.113 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.113 READ NATIVE MAX ADDRESS EXT
Error 21340 [3] occurred at disk power-on lifetime: 766 hours (31 days + 22 hours)
When the command that caused the error occurred, the device was in a vendor specific state.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 2c 88 72 70 e0 00 Error: UNC 8 sectors at LBA = 0x2c887270 = 747139696
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 2c 88 72 70 e0 00 00:00:50.108 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.108 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:00:50.108 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:00:50.108 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.108 READ NATIVE MAX ADDRESS EXT
Error 21339 [2] occurred at disk power-on lifetime: 766 hours (31 days + 22 hours)
When the command that caused the error occurred, the device was in a vendor specific state.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 2c 88 72 70 e0 00 Error: UNC 8 sectors at LBA = 0x2c887270 = 747139696
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 2c 88 72 70 e0 00 00:00:50.103 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.103 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:00:50.103 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:00:50.103 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.103 READ NATIVE MAX ADDRESS EXT
Error 21338 [1] occurred at disk power-on lifetime: 766 hours (31 days + 22 hours)
When the command that caused the error occurred, the device was in a vendor specific state.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 2c 88 72 70 e0 00 Error: UNC 8 sectors at LBA = 0x2c887270 = 747139696
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 2c 88 72 70 e0 00 00:00:50.098 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.098 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:00:50.098 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:00:50.098 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.098 READ NATIVE MAX ADDRESS EXT
Error 21337 [0] occurred at disk power-on lifetime: 766 hours (31 days + 22 hours)
When the command that caused the error occurred, the device was in a vendor specific state.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 2c 88 72 70 e0 00 Error: UNC 8 sectors at LBA = 0x2c887270 = 747139696
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 2c 88 72 70 e0 00 00:00:50.093 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.093 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:00:50.093 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:00:50.093 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.093 READ NATIVE MAX ADDRESS EXT
Error 21336 [7] occurred at disk power-on lifetime: 766 hours (31 days + 22 hours)
When the command that caused the error occurred, the device was in a vendor specific state.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 2c 88 72 70 e0 00 Error: UNC 8 sectors at LBA = 0x2c887270 = 747139696
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 2c 88 72 70 e0 00 00:00:50.087 READ DMA EXT
25 00 00 00 08 00 00 2c 88 72 78 e0 00 00:00:50.087 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.087 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:00:50.087 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:00:50.087 SET FEATURES [Set transfer mode]
Error 21335 [6] occurred at disk power-on lifetime: 766 hours (31 days + 22 hours)
When the command that caused the error occurred, the device was in a vendor specific state.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 2c 88 72 b8 e0 00 Error: UNC 8 sectors at LBA = 0x2c8872b8 = 747139768
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 2c 88 72 b8 e0 00 00:00:50.082 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.082 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:00:50.082 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:00:50.082 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.082 READ NATIVE MAX ADDRESS EXT
Error 21334 [5] occurred at disk power-on lifetime: 766 hours (31 days + 22 hours)
When the command that caused the error occurred, the device was in a vendor specific state.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
40 -- 51 00 08 00 00 2c 88 72 b8 e0 00 Error: UNC 8 sectors at LBA = 0x2c8872b8 = 747139768
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
25 00 00 00 08 00 00 2c 88 72 b8 e0 00 00:00:50.077 READ DMA EXT
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.077 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 00 00 00 00 00 a0 00 00:00:50.077 IDENTIFY DEVICE
ef 00 03 00 42 00 00 00 00 00 00 a0 00 00:00:50.077 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 00 00 00 00 00 e0 00 00:00:50.077 READ NATIVE MAX ADDRESS EXT
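All eight logged errors cluster around two neighbouring LBAs (0x2c887270 and 0x2c8872b8). Converting the hex LBA from the register dump into a decimal sector and byte offset makes it easier to cross-check these against the ddrescue error map; this is plain arithmetic, assuming only the 512-byte sectors the drive reports above:

```python
# Convert LBAs from the SMART error log (hex register dump) into
# decimal sector numbers and byte offsets, for a 512-byte-sector drive.
SECTOR = 512

for hex_lba in ("0x2c887270", "0x2c8872b8"):
    lba = int(hex_lba, 16)
    print(f"{hex_lba} = sector {lba} = byte offset {lba * SECTOR}"
          f" ({lba * SECTOR / 1e9:.1f} GB into the disk)")
```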
SMART Extended Self-test Log Version: 1 (2 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 40% 766 329136144
# 2 Short offline Completed: read failure 10% 745 717909280
# 3 Short offline Completed: read failure 70% 714 327191864
# 4 Extended offline Completed: read failure 90% 695 329136144
# 5 Short offline Completed: read failure 80% 695 724561192
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Completed_read_failure [40% left] (0-65535)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
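Given the array geometry from the mdadm report quoted earlier (64K chunks, left-symmetric layout, four raid devices), a bad drive LBA like the ones in the self-test log can be mapped to its RAID5 stripe, which helps judge whether the error regions on the two dropped disks land on the same stripes. A minimal sketch; `PART_START` is a hypothetical partition start sector that you would read from your own partition table, not a value from this thread:

```python
CHUNK_SECTORS = 64 * 1024 // 512   # 64K chunk size, 512-byte sectors
NDEVS = 4                          # Raid Devices : 4 in the mdadm output

def locate(device_index, drive_lba, part_start):
    """Map an absolute drive LBA to (stripe, role) for a left-symmetric
    RAID5. role is 'parity' or the array-linear data sector the LBA holds."""
    member_sector = drive_lba - part_start        # offset inside the md member
    stripe = member_sector // CHUNK_SECTORS
    in_chunk = member_sector % CHUNK_SECTORS
    # left-symmetric: parity starts on the last disk and rotates downward
    parity_disk = (NDEVS - 1) - (stripe % NDEVS)
    if device_index == parity_disk:
        return stripe, "parity"
    # data chunks start just after the parity disk and wrap around
    d = (device_index - parity_disk - 1) % NDEVS
    array_sector = (stripe * (NDEVS - 1) + d) * CHUNK_SECTORS + in_chunk
    return stripe, array_sector

# Example: the UNC error at LBA 747139696 on RaidDevice 2, with a
# hypothetical PART_START of 2048 sectors.
PART_START = 2048
print(locate(2, 747139696, PART_START))
```

If the stripes computed for the bad regions of the two failed members never coincide, each unreadable chunk can in principle be rebuilt from the other three devices.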
SCT Status Version: 2
SCT Version (vendor specific): 256 (0x0100)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 17 Celsius
Power Cycle Min/Max Temperature: 15/17 Celsius
Lifetime Min/Max Temperature: 13/67 Celsius
Lifetime Average Temperature: 80 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 5 minutes
Temperature Logging Interval: 5 minutes
Min/Max recommended Temperature: -5/80 Celsius
Min/Max Temperature Limit: -10/85 Celsius
Temperature History Size (Index): 128 (5)
Index Estimated Time Temperature Celsius
6 2013-02-01 09:55 17 -
7 2013-02-01 10:00 27 ********
... ..( 24 skipped). .. ********
32 2013-02-01 12:05 27 ********
33 2013-02-01 12:10 26 *******
34 2013-02-01 12:15 27 ********
... ..( 29 skipped). .. ********
64 2013-02-01 14:45 27 ********
65 2013-02-01 14:50 28 *********
66 2013-02-01 14:55 27 ********
... ..( 22 skipped). .. ********
89 2013-02-01 16:50 27 ********
90 2013-02-01 16:55 26 *******
91 2013-02-01 17:00 27 ********
... ..( 5 skipped). .. ********
97 2013-02-01 17:30 27 ********
98 2013-02-01 17:35 26 *******
99 2013-02-01 17:40 26 *******
100 2013-02-01 17:45 27 ********
101 2013-02-01 17:50 26 *******
102 2013-02-01 17:55 27 ********
103 2013-02-01 18:00 26 *******
... ..( 6 skipped). .. *******
110 2013-02-01 18:35 26 *******
111 2013-02-01 18:40 27 ********
112 2013-02-01 18:45 27 ********
113 2013-02-01 18:50 26 *******
114 2013-02-01 18:55 26 *******
115 2013-02-01 19:00 27 ********
... ..( 2 skipped). .. ********
118 2013-02-01 19:15 27 ********
119 2013-02-01 19:20 26 *******
120 2013-02-01 19:25 26 *******
121 2013-02-01 19:30 27 ********
122 2013-02-01 19:35 26 *******
123 2013-02-01 19:40 27 ********
124 2013-02-01 19:45 27 ********
125 2013-02-01 19:50 26 *******
... ..( 3 skipped). .. *******
1 2013-02-01 20:10 26 *******
2 2013-02-01 20:15 27 ********
3 2013-02-01 20:20 27 ********
4 2013-02-01 20:25 26 *******
5 2013-02-01 20:30 26 *******
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 4 0 Command failed due to ICRC error
0x0002 4 0 R_ERR response for data FIS
0x0003 4 0 R_ERR response for device-to-host data FIS
0x0004 4 0 R_ERR response for host-to-device data FIS
0x0005 4 0 R_ERR response for non-data FIS
0x0006 4 0 R_ERR response for device-to-host non-data FIS
0x0007 4 0 R_ERR response for host-to-device non-data FIS
0x0008 4 0 Device-to-host non-data FIS retries
0x0009 4 9 Transition from drive PhyRdy to drive PhyNRdy
0x000a 4 1 Device-to-host register FISes sent due to a COMRESET
0x000b 4 0 CRC errors within host-to-device FIS
0x000d 4 0 Non-CRC errors within host-to-device FIS
0x000f 4 0 R_ERR response for host-to-device data FIS, CRC
0x0010 4 0 R_ERR response for host-to-device data FIS, non-CRC
0x0012 4 0 R_ERR response for host-to-device non-data FIS, CRC
0x0013 4 0 R_ERR response for host-to-device non-data FIS, non-CRC
0x8e00 4 0 Vendor specific
0x8e01 4 0 Vendor specific
0x8e02 4 0 Vendor specific
0x8e03 4 0 Vendor specific
0x8e04 4 0 Vendor specific
0x8e05 4 0 Vendor specific
0x8e06 4 0 Vendor specific
0x8e07 4 0 Vendor specific
0x8e08 4 0 Vendor specific
0x8e09 4 0 Vendor specific
0x8e0a 4 0 Vendor specific
0x8e0b 4 0 Vendor specific
0x8e0c 4 0 Vendor specific
0x8e0d 4 0 Vendor specific
0x8e0e 4 0 Vendor specific
0x8e0f 4 0 Vendor specific
0x8e10 4 0 Vendor specific
0x8e11 4 0 Vendor specific
Disk 4:
=======
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: SAMSUNG SpinPoint F4 EG (AFT)
Device Model: SAMSUNG HD204UI
Serial Number: S2H7J1BZA16596
LU WWN Device Id: 5 0024e9 00435a3ba
Firmware Version: 1AQ10001
User Capacity: 2.000.398.934.016 bytes [2,00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 6
Local Time is: Fri Feb 1 20:30:17 2013 CET
==> WARNING: Using smartmontools or hdparm with this
drive may result in data loss due to a firmware bug.
****** THIS DRIVE MAY OR MAY NOT BE AFFECTED! ******
Buggy and fixed firmware report same version number!
See the following web pages for details:
http://www.samsung.com/global/business/hdd/faqView.do?b2b_bbs_msg_id=386
http://sourceforge.net/apps/trac/smartmontools/wiki/SamsungF4EGBadBlocks
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (20820) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 100 100 051 - 1
2 Throughput_Performance -OS--K 252 252 000 - 0
3 Spin_Up_Time PO---K 067 060 025 - 10215
4 Start_Stop_Count -O--CK 099 099 000 - 1134
5 Reallocated_Sector_Ct PO--CK 252 252 010 - 0
7 Seek_Error_Rate -OSR-K 252 252 051 - 0
8 Seek_Time_Performance --S--K 252 252 015 - 0
9 Power_On_Hours -O--CK 100 100 000 - 722
10 Spin_Retry_Count -O--CK 252 252 051 - 0
11 Calibration_Retry_Count -O--CK 252 252 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 144
181 Program_Fail_Cnt_Total -O---K 100 100 000 - 926
191 G-Sense_Error_Rate -O---K 100 100 000 - 3
192 Power-Off_Retract_Count -O---K 252 252 000 - 0
194 Temperature_Celsius -O---- 064 063 000 - 21 (Min/Max 14/37)
195 Hardware_ECC_Recovered -O-RCK 100 100 000 - 0
196 Reallocated_Event_Count -O--CK 252 252 000 - 0
197 Current_Pending_Sector -O--CK 252 252 000 - 0
198 Offline_Uncorrectable ----CK 252 252 000 - 0
199 UDMA_CRC_Error_Count -OS-CK 200 200 000 - 0
200 Multi_Zone_Error_Rate -O-R-K 100 100 000 - 0
223 Load_Retry_Count -O--CK 252 252 000 - 0
225 Load_Cycle_Count -O--CK 100 100 000 - 1147
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
GP/S Log at address 0x00 has 1 sectors [Log Directory]
SMART Log at address 0x01 has 1 sectors [Summary SMART error log]
SMART Log at address 0x02 has 2 sectors [Comprehensive SMART error log]
GP Log at address 0x03 has 2 sectors [Ext. Comprehensive SMART error log]
SMART Log at address 0x06 has 1 sectors [SMART self-test log]
GP Log at address 0x07 has 2 sectors [Extended self-test log]
GP Log at address 0x08 has 2 sectors [Power Conditions]
SMART Log at address 0x09 has 1 sectors [Selective self-test log]
GP Log at address 0x10 has 1 sectors [NCQ Command Error]
GP Log at address 0x11 has 1 sectors [SATA Phy Event Counters]
GP/S Log at address 0x80 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x81 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x82 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x83 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x84 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x85 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x86 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x87 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x88 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x89 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x8f has 16 sectors [Host vendor specific log]
GP/S Log at address 0x90 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x91 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x92 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x93 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x94 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x95 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x96 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x97 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x98 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x99 has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9a has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9b has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9c has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9d has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9e has 16 sectors [Host vendor specific log]
GP/S Log at address 0x9f has 16 sectors [Host vendor specific log]
GP/S Log at address 0xe0 has 1 sectors [SCT Command/Status]
GP/S Log at address 0xe1 has 1 sectors [SCT Data Transfer]
SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
No Errors Logged
SMART Extended Self-test Log Version: 1 (2 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 717 -
# 2 Short offline Completed without error 00% 707 -
# 3 Short offline Completed without error 00% 693 -
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Completed [00% left] (0-65535)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 2
SCT Version (vendor specific): 256 (0x0100)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 21 Celsius
Power Cycle Min/Max Temperature: 20/22 Celsius
Lifetime Min/Max Temperature: 13/65 Celsius
Lifetime Average Temperature: 80 Celsius
Under/Over Temperature Limit Count: 0/0
SCT Temperature History Version: 2
Temperature Sampling Period: 5 minutes
Temperature Logging Interval: 5 minutes
Min/Max recommended Temperature: -5/80 Celsius
Min/Max Temperature Limit: -10/85 Celsius
Temperature History Size (Index): 128 (108)
Index Estimated Time Temperature Celsius
109 2013-02-01 09:55 22 ***
110 2013-02-01 10:00 31 ************
... ..( 4 skipped). .. ************
115 2013-02-01 10:25 31 ************
116 2013-02-01 10:30 30 ***********
117 2013-02-01 10:35 31 ************
... ..( 52 skipped). .. ************
42 2013-02-01 15:00 31 ************
43 2013-02-01 15:05 26 *******
44 2013-02-01 15:10 28 *********
45 2013-02-01 15:15 30 ***********
46 2013-02-01 15:20 31 ************
47 2013-02-01 15:25 32 *************
48 2013-02-01 15:30 33 **************
49 2013-02-01 15:35 34 ***************
50 2013-02-01 15:40 35 ****************
51 2013-02-01 15:45 35 ****************
52 2013-02-01 15:50 37 ******************
53 2013-02-01 15:55 36 *****************
... ..( 9 skipped). .. *****************
63 2013-02-01 16:45 36 *****************
64 2013-02-01 16:50 35 ****************
65 2013-02-01 16:55 36 *****************
... ..( 4 skipped). .. *****************
70 2013-02-01 17:20 36 *****************
71 2013-02-01 17:25 37 ******************
72 2013-02-01 17:30 36 *****************
... ..( 11 skipped). .. *****************
84 2013-02-01 18:30 36 *****************
85 2013-02-01 18:35 37 ******************
86 2013-02-01 18:40 36 *****************
87 2013-02-01 18:45 36 *****************
88 2013-02-01 18:50 37 ******************
89 2013-02-01 18:55 36 *****************
90 2013-02-01 19:00 36 *****************
91 2013-02-01 19:05 36 *****************
92 2013-02-01 19:10 37 ******************
93 2013-02-01 19:15 36 *****************
94 2013-02-01 19:20 36 *****************
95 2013-02-01 19:25 35 ****************
... ..( 11 skipped). .. ****************
107 2013-02-01 20:25 35 ****************
108 2013-02-01 20:30 21 **
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 4 0 Command failed due to ICRC error
0x0002 4 0 R_ERR response for data FIS
0x0003 4 0 R_ERR response for device-to-host data FIS
0x0004 4 0 R_ERR response for host-to-device data FIS
0x0005 4 0 R_ERR response for non-data FIS
0x0006 4 0 R_ERR response for device-to-host non-data FIS
0x0007 4 0 R_ERR response for host-to-device non-data FIS
0x0008 4 0 Device-to-host non-data FIS retries
0x0009 4 3 Transition from drive PhyRdy to drive PhyNRdy
0x000a 4 0 Device-to-host register FISes sent due to a COMRESET
0x000b 4 0 CRC errors within host-to-device FIS
0x000d 4 0 Non-CRC errors within host-to-device FIS
0x000f 4 0 R_ERR response for host-to-device data FIS, CRC
0x0010 4 0 R_ERR response for host-to-device data FIS, non-CRC
0x0012 4 0 R_ERR response for host-to-device non-data FIS, CRC
0x0013 4 0 R_ERR response for host-to-device non-data FIS, non-CRC
0x8e00 4 0 Vendor specific
0x8e01 4 0 Vendor specific
0x8e02 4 0 Vendor specific
0x8e03 4 0 Vendor specific
0x8e04 4 0 Vendor specific
0x8e05 4 0 Vendor specific
0x8e06 4 0 Vendor specific
0x8e07 4 0 Vendor specific
0x8e08 4 0 Vendor specific
0x8e09 4 0 Vendor specific
0x8e0a 4 0 Vendor specific
0x8e0b 4 0 Vendor specific
0x8e0c 4 0 Vendor specific
0x8e0d 4 0 Vendor specific
0x8e0e 4 0 Vendor specific
0x8e0f 4 0 Vendor specific
0x8e10 4 0 Vendor specific
0x8e11 4 0 Vendor specific
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: recovering RAID5 from multiple disk failures
2013-02-02 13:04 ` Michael Ritzert
@ 2013-02-02 13:44 ` Phil Turmel
2013-02-02 20:20 ` Chris Murphy
0 siblings, 1 reply; 9+ messages in thread
From: Phil Turmel @ 2013-02-02 13:44 UTC (permalink / raw)
To: Michael Ritzert; +Cc: linux-raid
On 02/02/2013 08:04 AM, Michael Ritzert wrote:
> Hi Phil,
>
> In article <510BC173.7070002@turmel.org> you wrote:
>>> So the situation is: I have a four-disk RAID5 with two active disks, and
>>> two that dropped out at different times.
>>
>> Please show the errors from dmesg.
>
> I don't think I can provide that. The RAID ran in a QNAP system, and if
> there is a log at all, it's on this disk...
> During the copy process, it was all media errors, however.
>
>> And show "smartctl -x" for the drives that failed.
>
> See below.
>
> [...]
>> Also show "mdadm -E" for all of the member devices. This data is an
>> absolute *must* before any major surgery on an array.
>
> also below.
>
>>> My first attempt would be to try
>>> mdadm --create --metadata=0.9 --chunk=64 --assume-clean, etc.
>>>
>>> Is there a chance for this to succeed? Or do you have better suggestions?
>>
>> "--create" is a *terrible* first step. "mdadm --assemble --force" is
>> the right tool for this job.
>
> I forgot to mention: I tried that, and stopped it, after I saw the first
> thing it did was to start a rebuild of the array. I couldn't figure out
> which disk it was trying to rebuild, but whichever of the two dropped out
> disks it was, I can't see how it could reconstruct the data once it reaches
> the point of the errors on the disk it uses in the reconstruction.
> (So "first" above should really say more verbose "first after the new copies
> are finished".)
Ok.
> mdadm --assemble --assume-clean sounded like the most logical combination of
> options, but was rejected.
Now it is appropriate, but I'm concerned about mapping drives to device
names in your setup (plugging and unplugging to get these reports?).
Please map drive serial numbers to device names with all drives plugged
in. "lsdrv"[1] or an extract from /dev/disk/by-id/.
> Unfortunately, the data on the disk is not simply a filesystem where bad
> blocks mean a few unreadable files, but a filesystem with a number of files
> on it that represent a volume exported by iSCSI, on which there is an
> encrypted partition with a filesystem. So I'm not sure whether any of these
> indirections badly multiplies the effect of a single bad sector, and I'm
> trying to get back to 100% good data, if possible.
Ugly. Yes, there's a bit of multiplication. Not sure how to quantify it.
>>> If all recovery that involves assembling the array fails: Is it possible
>>> to manually assemble the data?
>>> I'm thinking in the direction of: take the first 64k from disk1, then 64k
>>> from disk2, etc.? This would probably take years to complete, but the data
>>> is of really big importance to me (which is why I put it on a RAID in the
>>> first place...).
>>
>> Your scenario sounds like the common timeout mismatch catastrophe, which
>> is why I asked for "smartctl -x". If that is the case, MD won't be able
>> to do the reconstructions that it should when encountering read errors.
>
> You mean the "timeout of the disk is longer than RAID's patience" problem?
> I have no idea whether the old disks suffered from it; I used Samsung HD204UI
> drives, which were certified by QNAP. The copies are now WD NAS edition disks,
> which have a lower timeout.
I've never heard it called a "patience" problem, but that's apt. Your
drives are raid-capable, but they aren't safe out of the box. From your
smartctl reports:
> SCT Error Recovery Control:
> Read: Disabled
> Write: Disabled
You *must* issue "smartctl -l scterc,70,70 /dev/sdX" for each of these
drives *every* time they are powered on. Based on the event counts in
your superblocks, I'd say disk1 was kicked out long ago due to a normal
URE (hundreds of hours ago) and the array has been degraded ever since.
Totally useless way to run a raid. When you started your urgent backup
effort, you found more UREs, in a time/quantity combination that kicked
out another (disk3).
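That step could be sketched as a dry run like this (the function only prints the commands; drop the "echo" and fill in your real device names to apply it, e.g. from a boot script at every power-on):

```shell
# Dry-run sketch: print the scterc command for each member drive.
# Device names below are placeholders, not the OP's actual drives.
set_erc() {
    for dev in "$@"; do
        # drop the "echo" to actually issue the command
        echo smartctl -l scterc,70,70 "$dev"
    done
}
set_erc /dev/sda /dev/sdb /dev/sdc /dev/sdd
```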
> Recently, I also started copying all data to Amazon Glacier, for 100%-epsilon
> reliable storage, but this upload simply took longer than the disks lasted
> (=less than 30 days spinning! very disappointing).
All of your drives are in perfect condition (no relocations at all).
Meaning that all of your troubles are due to timeout mismatch, lack of
scrubbing (or timeout error on the first scrub), and lack of backups.
Aim your disappointment elsewhere.
"mdadm --create .... missing /dev/sd[XYZ]" is your next step (leaving
out disk1) after you fix your drive timeouts. Match parameters exactly,
of course. Then add disk1 and let it rebuild. If that doesn't succeed,
you will need to use dd_rescue on disks 2-4 to clean up their remaining
UREs, then repeat the "--create ... missing".
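Matching parameters exactly means lifting them from the superblock dump at the top of the thread (v0.90 metadata, RAID5, 4 devices, 64K chunk, left-symmetric). A dry-run sketch, with the command only printed; the device names and slot order here are assumptions that must be checked against "mdadm -E" for every member before running anything:

```shell
# Dry-run: the full --create line with parameters matched to the old
# superblocks and disk1's slot left as "missing".
# /dev/sdX3 names are placeholders -- verify against mdadm -E first.
cmd="mdadm --create /dev/md0 --metadata=0.90 --level=5 \
--raid-devices=4 --chunk=64 --layout=left-symmetric \
missing /dev/sdb3 /dev/sdc3 /dev/sdd3"
echo "$cmd"
```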
You won't achieve 100% good, as the URE locations on disk 2-4 cannot be
recovered from disk1 (too old, almost certainly).
I'll be offline for several hours. Good luck (or ask for more help from
others).
Phil
* Re: recovering RAID5 from multiple disk failures
2013-02-02 13:44 ` Phil Turmel
@ 2013-02-02 20:20 ` Chris Murphy
2013-02-02 21:56 ` Michael Ritzert
0 siblings, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2013-02-02 20:20 UTC (permalink / raw)
To: Phil Turmel; +Cc: Michael Ritzert, linux-raid
On Feb 2, 2013, at 6:44 AM, Phil Turmel <philip@turmel.org> wrote:
>
> All of your drives are in perfect condition (no relocations at all).
One disk has Current_Pending_Sector raw value 30, and read error rate of 1121.
Another disk has Current_Pending_Sector raw value 2350, and read error rate of 29439.
I think for new drives that's unreasonable.
It's probably also unreasonable to trust a new drive without testing it. But some of the drives were tested by someone or something, and the test itself was aborted due to read failures, even though the disk was not flagged by SMART as "failed" or failure in progress. Example:
# 1 Short offline Completed: read failure 40% 766 329136144
# 2 Short offline Completed: read failure 10% 745 717909280
# 3 Short offline Completed: read failure 70% 714 327191864
# 4 Extended offline Completed: read failure 90% 695 329136144
# 5 Short offline Completed: read failure 80% 695 724561192
Almost 100 hours ago, at least, problems with this disk were identified. Maybe this is a NAS feature limitation problem, but if the NAS is going to purport to do SMART testing and then fail to inform the user that the tests themselves are failing due to bad sectors, that's negligence in my opinion. Sadly it's common.
These NAS products should have an option to vet the drives: secure-erase them, run long extended tests, make sure the tests finish, make sure there are no pending-sector errors, and report any pending-sector errors to the user along with what to do about them.
Otherwise, it's a crap product. The knowledge is available, the tools are there; the product just isn't using them.
> Based on the event counts in your superblocks, I'd say disk1 was kicked out long ago due to a normal URE (hundreds of hours ago) and the array has been degraded ever since.
I'm confused because the OP reports disk 1 and disk 4 as sdc3, disk 2 and disk 3 as sdb3; yet the superblock info has different checksums for each. So based on Update Time field, I'm curious what other information leads you to believe disk1 was kicked hundreds of hours ago:
disk 1:
Fri Jan 4 15:11:07 2013
disk 2:
Fri Jan 4 16:33:36 2013
disk 3:
Fri Jan 4 16:32:27 2013
disk 4:
Fri Jan 4 16:33:36 2013
Nevertheless, over an hour and a half is a long time if the file system were being updated at all. There'd definitely be data/parity mismatches for disk1.
If disk 1 is assumed to be useless, meaning force assemble the array in degraded mode; a URE or linux SCSI layer time out is to be avoided or the array as a whole fails. Every sector is needed. So what do you think about raising the linux scsi layer time out to maybe 2 minutes, and leaving the remaining drive's SCT ERC alone so that they don't time out sooner, but rather go into whatever deep recovery they have to in the hopes those bad sectors can be read?
echo 120 >/sys/block/sdX/device/timeout
I'm seeing from the SMART data that, even though there are disks with bad sectors, there are NO hardware ECC recovered events. So I don't know that we know those sectors are totally lost causes yet. If they are, it seems like the array is toast unless disk 1 can somehow be included.
> Meaning that all of your troubles are due to timeout mismatch, lack of
> scrubbing (or timeout error on the first scrub), and lack of backups.
> Aim your disappointment elsewhere.
I tentatively agree. This is a case of maybe not the best drives out of the box; a contributing factor is certainly that they had bad sectors on arrival. Not good. Combine that with a NAS that doesn't properly set the SCT ERC on any of the drives. Combine that with whoever or whatever did the offline tests but did not report the aborts due to read failures to the user.
It's a collision of multiple "not good" events.
For a normal, non-degraded array, I read man 4 md to mean either a check or repair would "fix" bad sectors resulting in UREs: i.e. whether a data chunk or parity chunk, the URE'd sector will be overwritten with correct data.
But what about --assemble --force where --assume-clean isn't accepted? Does this involve "check" behavior, or is the ensuing resync assuming data chunks are valid and parity chunks invalid (subject to being overwritten with recomputed parity)? If so, then what happens with a data chunk URE? Can this resync repair that?
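For reference, these are the md scrubbing knobs from man 4 md that the question turns on, sketched as a dry run (the array name md0 is an assumption; the commands are printed, not executed):

```shell
# Dry-run: the sysfs scrub interface from man 4 md.  "check" reads every
# stripe, fixing read errors and counting mismatches in mismatch_cnt;
# "repair" additionally rewrites mismatched stripes.
scrub_cmds() {
    echo "echo check  > /sys/block/$1/md/sync_action"
    echo "cat /sys/block/$1/md/mismatch_cnt"
    echo "echo repair > /sys/block/$1/md/sync_action"
}
scrub_cmds md0
```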
Chris Murphy
* Re: recovering RAID5 from multiple disk failures
2013-02-02 20:20 ` Chris Murphy
@ 2013-02-02 21:56 ` Michael Ritzert
2013-02-02 23:08 ` Chris Murphy
0 siblings, 1 reply; 9+ messages in thread
From: Michael Ritzert @ 2013-02-02 21:56 UTC (permalink / raw)
To: linux-raid
Hi Phil, Chris,
Chris Murphy <lists@colorremedies.com> wrote:
> On Feb 2, 2013, at 6:44 AM, Phil Turmel <philip@turmel.org> wrote:
>>
>> All of your drives are in perfect condition (no relocations at all).
>
> One disk has Current_Pending_Sector raw value 30, and read error rate of 1121.
> Another disk has Current_Pending_Sector raw value 2350, and read error rate of 29439.
>
> I think for new drives that's unreasonable.
>
> It's probably also unreasonable to trust a new drive without testing it. But some of the drives were tested by someone or something, and the test itself was aborted due to read failures, even though the disk was not flagged by SMART as "failed" or failure in progress. Example:
>
> # 1 Short offline Completed: read failure 40% 766 329136144
> # 2 Short offline Completed: read failure 10% 745 717909280
> # 3 Short offline Completed: read failure 70% 714 327191864
> # 4 Extended offline Completed: read failure 90% 695 329136144
> # 5 Short offline Completed: read failure 80% 695 724561192
That was probably me manually starting tests.
When I first noticed signs of trouble, i.e. slow access, I immediately
checked the disk status, and the status page said "OK". I couldn't believe
that, so I started unscheduled and extended tests.
Would you consider running a full SMART self-test on a new disk sufficient?
Or do you propose even stricter tests?
> Almost 100 hours ago, at least, problems with this disk were identified. Maybe this is a NAS feature limitation problem, but if the NAS is going to purport to do SMART testing and then fail to inform the user that the tests themselves are failing due to bad sectors, that's negligence in my opinion. Sadly it's common.
When judging the 100 hours, you have to keep in mind that these disks have
been running since the failure. Making the copy took a few hours (times
two by now), and a few more hours were added because it finished at
night and the disks stayed on until I got up. Still, that shouldn't add
up to 100 hours.
>> Based on the event counts in your superblocks, I'd say disk1 was kicked out long ago due to a normal URE (hundreds of hours ago) and the array has been degraded ever since.
>
> I'm confused because the OP reports disk 1 and disk 4 as sdc3, disk 2 and disk 3 as sdb3; yet the superblock info has different checksums for each. So based on Update Time field, I'm curious what other information leads you to believe disk1 was kicked hundreds of hours ago:
The disks are attached to a desktop PC at the moment. The way things are
set up, I can only plug in two disks at a time, so I had to connect the
disks in two pairs to get all four reports. That's why the device names
are identical.
> disk 1:
> Fri Jan 4 15:11:07 2013
> disk 2:
> Fri Jan 4 16:33:36 2013
> disk 3:
> Fri Jan 4 16:32:27 2013
> disk 4:
> Fri Jan 4 16:33:36 2013
>
> Nevertheless, over an hour and a half is a long time if the file system were being updated at all. There'd definitely be data/parity mismatches for disk1.
After disk1 failed, the only write access should have been metadata update
when the filesystem was mounted. I only read data from the filesystem
> thereafter. So only atime changes are to be expected there, and only
for a small number of files that I could capture before disk3 failed. I
know which files are affected, and could leave them alone.
> If disk 1 is assumed to be useless, meaning force assemble the array in degraded mode; a URE or linux SCSI layer time out is to be avoided or the array as a whole fails. Every sector is needed. So what do you think about raising the linux scsi layer time out to maybe 2 minutes, and leaving the remaining drive's SCT ERC alone so that they don't time out sooner, but rather go into whatever deep recovery they have to in the hopes those bad sectors can be read?
>
> echo 120 >/sys/block/sdX/device/timeout
I just tried that, but I couldn't see any effect. The error rate coming
in is much higher than 1 every two minutes.
When I assemble the array, I will have all new disks (with good smart
selftests...), so I wouldn't expect timeouts. Instead, junk data will be
returned from the sectors in question¹. How will md react to that?
Regards,
Michael
¹ One could think about filling these gaps with data from the three
remaining disks. Disk1 is still up to date in 99%+ of all chunks. So
data from 3 disks is available. I could implement the RAID5 algorithm
in userspace to compute what should be in the bad sector. I do know
where the bad sectors are from the ddrescue report. We are talking
about less than 50 kB of bad data on disk1. Unfortunately, disk3 is
worse, but there is no sector that is bad on both disks.
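The footnote's idea can be illustrated with a toy single-byte "stripe" (made-up contents, not the real data): the chunk missing from one RAID5 member is the byte-wise XOR of the same stripe's chunks on the other members. Mapping which 64K chunk on which disk belongs to which stripe under the left-symmetric layout is still the hard part and is not shown here:

```shell
# Toy single-byte stripe: c2, c3, c4 stand in for the 64K chunks that the
# three surviving members hold for one stripe; the XOR of their bytes is
# what the unreadable sector on the fourth member should contain.
cd "$(mktemp -d)"
printf 'A' > c2; printf 'C' > c3; printf 'G' > c4
b2=$(od -An -tu1 c2); b3=$(od -An -tu1 c3); b4=$(od -An -tu1 c4)
missing=$(( b2 ^ b3 ^ b4 ))        # 0x41 ^ 0x43 ^ 0x47 = 0x45
printf 'reconstructed byte value: %d\n' "$missing"   # 69 = ASCII "E"
```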
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: recovering RAID5 from multiple disk failures
2013-02-02 21:56 ` Michael Ritzert
@ 2013-02-02 23:08 ` Chris Murphy
2013-02-03 0:23 ` Phil Turmel
0 siblings, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2013-02-02 23:08 UTC (permalink / raw)
To: Michael Ritzert; +Cc: Phil Turmel, linux-raid@vger.kernel.org list
On Feb 2, 2013, at 2:56 PM, Michael Ritzert <ksciplot@gmx.net> wrote:
>
> Chris Murphy <lists@colorremedies.com> wrote:
>>
>> Nevertheless, over an hour and a half is a long time if the file system were being updated at all. There'd definitely be data/parity mismatches for disk1.
>
> After disk1 failed, the only write access should have been metadata update
> when the filesystem was mounted.
Was it mounted ro?
> I only read data from the filesystem
> thereafter. So only atime changes are to be expected, there, and only
> for a small number of files that I could capture before disk3 failed. I
> know which files are affected, and could leave them alone.
Even for a small number of files there could be dozens or hundreds of chunks altered. I think conservatively you have to consider disk 1 out and mount in degraded mode.
>
>> If disk 1 is assumed to be useless, meaning force assemble the array in degraded mode; a URE or linux SCSI layer time out is to be avoided or the array as a whole fails. Every sector is needed. So what do you think about raising the linux scsi layer time out to maybe 2 minutes, and leaving the remaining drive's SCT ERC alone so that they don't time out sooner, but rather go into whatever deep recovery they have to in the hopes those bad sectors can be read?
>>
>> echo 120 >/sys/block/sdX/device/timeout
>
> I just tried that, but I couldn't see any effect. The error rate coming
> in is much higher than 1 every two minutes.
This timeout is not about error rate. And what the value should be depends on context. In normal operation you want the disk error recovery to be short, so that the disk produces a bona fide URE, not a SCSI layer timeout error. That way md will correct the bad sector. That's what probably wasn't happening in your case, which allowed bad sectors to accumulate until it was a real problem.
But now, for the cloning process, you want the disk error timeout to be long (or disabled) so that the disk has as long as possible to do ECC to recover each of these problematic sectors. But this also means getting the SCSI layer timeout set to at least 1 second longer than the longest recovery time for the drives, so that the SCSI layer time out doesn't stop sector recovery during cloning. Now maybe the disk still won't be able to recover all data from these bad sectors, but it's your best shot IMO.
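Those two cloning-time settings could be sketched per source drive as a dry run (the commands are printed rather than executed; the drive name is a placeholder, and the 180-second SCSI timeout is just an example chosen to sit above any plausible internal retry time):

```shell
# Dry-run: for the cloning pass, disable drive-level ERC (unlimited
# internal retries) and raise the kernel's SCSI timeout above it.
# Printed only -- drop the echoes to apply for real.
prep_for_clone() {
    dev="$1"    # bare device name, e.g. sdb (placeholder)
    echo "smartctl -l scterc,0,0 /dev/$dev"
    echo "echo 180 > /sys/block/$dev/device/timeout"
}
prep_for_clone sdb
```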
> When I assemble the array, I will have all new disks (with good smart
> selftests...), so I wouldn't expect timeouts. Instead, junk data will be
> returned from the sectors in question¹. How will md react to that?
Well yeah, with the new drives, they won't report UREs. So there's an ambiguity with any mismatch between data and parity chunks as to which is correct. Without a URE, md can't tell whether the data chunk is right or wrong with RAID 5.
Phil may disagree, and I have to defer to his experience in this, but I think the most conservative and best shot you have at getting the 20GB you want off the array is this:
a.) Make sure the SCT ERC for all drives is disabled. That means it will take the longest time to recover from bad sectors, and thus has as much of a chance as there can be for the disk firmware to do ECC recovery on them.
b.) Make sure the linux SCSI layer has a timeout set that's at least 1 second longer than the disk error timeout. I think that no vendor uses a time longer than 2 minutes.
c.) Base your disk 2, 3, 4 clones on the above settings. If you cloned the data from old to new disks using the default /sys/block/sdX/device/timeout of 30 seconds, then the source disks did not have every chance to recover their sectors, and you almost certainly have more holes in the clones than you want. If you are still getting errors in dmesg during the clone, they should only be bona fide read errors from the disk, not timeouts. Report the errors if you're not clear on this point.
d.) Try to force-assemble the clones of disks 2, 3, 4, which means the array is brought up degraded. Disk 1, conservatively, probably can't be trusted. It's an open question whether you do or do not want to assume clean in this case; maybe it doesn't matter because it's degraded anyway.
But now you try to extract those 20GB you really need.
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: recovering RAID5 from multiple disk failures
2013-02-02 23:08 ` Chris Murphy
@ 2013-02-03 0:23 ` Phil Turmel
2013-02-03 0:39 ` Chris Murphy
0 siblings, 1 reply; 9+ messages in thread
From: Phil Turmel @ 2013-02-03 0:23 UTC (permalink / raw)
To: Chris Murphy; +Cc: Michael Ritzert, linux-raid@vger.kernel.org list
On 02/02/2013 06:08 PM, Chris Murphy wrote:
>
> On Feb 2, 2013, at 2:56 PM, Michael Ritzert <ksciplot@gmx.net>
> wrote:
>>
>> Chris Murphy <lists@colorremedies.com> wrote:
>>>
>>> Nevertheless, over an hour and a half is a long time if the file
>>> system were being updated at all. There'd definitely be
>>> data/parity mismatches for disk1.
>>
>> After disk1 failed, the only write access should have been
>> metadata update when the filesystem was mounted.
This is significant.
> Was it mounted ro?
>
>> I only read data from the filesystem thereafter. So only atime
>> changes are to be expected, there, and only for a small number of
>> files that I could capture before disk3 failed. I know which files
>> are affected, and could leave them alone.
>
> Even for a small number of files there could be dozens or hundreds
> of chunks altered. I think conservatively you have to consider disk
> 1 out and mount in degraded mode.
>
>>
>>> If disk 1 is assumed to be useless, meaning force assemble the
>>> array in degraded mode; a URE or linux SCSI layer time out is to
>>> be avoided or the array as a whole fails. Every sector is
>>> needed. So what do you think about raising the linux scsi layer
>>> time out to maybe 2 minutes, and leaving the remaining drive's
>>> SCT ERC alone so that they don't time out sooner, but rather go
>>> into whatever deep recovery they have to in the hopes those bad
>>> sectors can be read?
>>>
>>> echo 120 >/sys/block/sdX/device/timeout
>>
>> I just tried that, but I couldn't see any effect. The error rate
>> coming in is much higher than 1 every two minutes.
>
> This timeout is not about error rate. And what the value should be
> depends on context. Normal operation you want the disk error
> recovery to be short, so that the disk produces a bonafide URE, not a
> SCSI layer timeout error. That way md will correct the bad sector.
> That's what probably wasn't happening in your case, which allowed
> bad sectors to accumulate until it was a real problem.
If you try to recover from the degraded array, this is the correct approach.
> But now, for the cloning process, you want the disk error timeout to
> be long (or disabled) so that the disk has as long as possible to do
> ECC to recover each of these problematic sectors. But this also
> means getting the SCSI layer timeout set to at least 1 second longer
> than the longest recovery time for the drives, so that the SCSI layer
> time out doesn't stop sector recovery during cloning. Now maybe the
> disk still won't be able to recover all data from these bad sectors,
> but it's your best shot IMO.
For the array assembled degraded (disk1 left out).
>> When I assemble the array, I will have all new disks (with good
>> smart selftests...), so I wouldn't expect timeouts. Instead, junk
>> data will be returned from the sectors in question¹. How will md
>> react to that?
>
> Well yeah, with the new drives, they won't report UREs. So there's
> an ambiguity with any mismatch between data and parity chunks as to
> which is correct. Without a URE, md doesn't know that the data chunk
> is right or wrong with RAID 5.
Bingo. Working from the copies guarantees you won't have correct data
where the UREs are. (The copies are very good to have, of course.)
> Phil may disagree, and I have to defer to his experience in this,
> but I think the most conservative and best shot you have at getting
> the 20GB you want off the array is this:
I do disagree.
The above, combined with:
> I do know where the bad sectors are from the ddrescue report. We are
> talking about less than 50kB bad data on disk1. Unfortunately, disk3
> is worse, but there is no sector that is bad on both disks.
Leads me to recommend "mdadm --create --assume-clean" using the original
drives, taking care to specify the devices in the proper order (per
their "Raid Device" number in the --examine reports). I still haven't
seen any data that definitively links specific serial numbers to
specific raid device numbers. Please do that.
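One way to collect that serial-to-slot mapping, and a sketch of the re-create command it feeds into, might be the following. The device names and especially the device *order* are placeholders — the order must come from the "this ... RaidDevice" slot in your own --examine output, and the parameters must match the original array (metadata 0.90, left-symmetric, 64K chunk, per the report above):

```shell
# Record each member's drive serial number and its raid device slot:
for d in /dev/sdb3 /dev/sdc3 /dev/sdd3 /dev/sde3; do
  echo "== $d =="
  smartctl -i "${d%3}" | grep -i 'serial'   # drive serial number
  mdadm --examine "$d" | grep -E 'this|UUID' # slot number + array UUID
done

# Re-create with the ORIGINAL parameters; devices listed in slot
# order 0,1,2,3 (the order below is a placeholder, NOT the answer):
mdadm --create /dev/md0 --assume-clean --metadata=0.90 --level=5 \
      --raid-devices=4 --layout=left-symmetric --chunk=64 \
      /dev/sde3 /dev/sdb3 /dev/sdc3 /dev/sdd3
```

Getting the order wrong with --assume-clean scrambles the array, so double-check the slot numbers before running the create.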
After re-creating the array, and setting all the drive timeouts to 7.0
seconds, issue a "check" scrub:
echo "check" >/sys/block/md0/md/sync_action
This should clean up the few pending sectors on disk #1 by
reconstruction from the others, and may very well do the same for disk #3.
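With hypothetical device names, the 7.0-second timeout setup might be (smartctl expresses SCT ERC in deciseconds, and not all drives accept it):

```shell
# 7.0 seconds = 70 deciseconds in smartctl's scterc units:
for d in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
  smartctl -l scterc,70,70 "$d"
done

# Scrub progress can then be watched in /proc/mdstat:
cat /proc/mdstat
```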
If disk #3 gets kicked out at this point, assemble in degraded mode with
disk #2, #4, and a fresh copy of disk #1 (picking up the new superblock
and any fixes during the partial scrub). Then "--add" a spare (wiped)
disk and let the array rebuild.
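A sketch of that fallback, with hypothetical names (disk #3 = /dev/sdc3 kicked out, fresh re-clone of disk #1 = /dev/sde3, wiped spare = /dev/sdf3):

```shell
# Stop the array, then force-assemble degraded without disk #3:
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sdb3 /dev/sdd3 /dev/sde3

# Add the wiped spare; md rebuilds the missing member onto it:
mdadm /dev/md0 --add /dev/sdf3
```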
And grab your data.
Phil.
--
* Re: recovering RAID5 from multiple disk failures
2013-02-03 0:23 ` Phil Turmel
@ 2013-02-03 0:39 ` Chris Murphy
0 siblings, 0 replies; 9+ messages in thread
From: Chris Murphy @ 2013-02-03 0:39 UTC (permalink / raw)
To: Phil Turmel; +Cc: Michael Ritzert, linux-raid@vger.kernel.org list
On Feb 2, 2013, at 5:23 PM, Phil Turmel <philip@turmel.org> wrote:
>
> I do disagree.
>
> The above, combined with:
>
>> I do know where the bad sectors are from the ddrescue report. We are
>> talking about less than 50kB bad data on disk1. Unfortunately, disk3
>> is worse, but there is no sector that is bad on both disks.
>
> Leads me to recommend "mdadm --create --assume-clean" using the original
> drives, taking care to specify the devices in the proper order (per
> their "Raid Device" number in the --examine reports). I still haven't
> seen any data that definitively links specific serial numbers to
> specific raid device numbers. Please do that.
>
> After re-creating the array, and setting all the drive timeouts to 7.0
> seconds, issue a "check" scrub:
>
> echo "check" >/sys/block/md0/md/sync_action
>
> This should clean up the few pending sectors on disk #1 by
> reconstruction from the others, and may very well do the same for disk #3.
>
> If disk #3 gets kicked out at this point, assemble in degraded mode with
> disk #2, #4, and a fresh copy of disk #1 (picking up the new superblock
> and any fixes during the partial scrub). Then "--add" a spare (wiped)
> disk and let the array rebuild.
>
> And grab your data.
OK, I understand. This seems reasonable to me as well. It is very important to set *each* drive's SCT ERC before starting the check!
So basically, disk1 being out of sync is considered a minimal risk in this instance, and worth taking a chance on in order to avoid losing the ~50kB of data affected by bad sectors, because those sectors may be all the difference in easily getting the array up, mounted, and the data off the disks.
Chris Murphy
Thread overview: 9+ messages
2013-02-01 12:28 recovering RAID5 from multiple disk failures Michael Ritzert
2013-02-01 13:21 ` Phil Turmel
2013-02-02 13:04 ` Michael Ritzert
2013-02-02 13:44 ` Phil Turmel
2013-02-02 20:20 ` Chris Murphy
2013-02-02 21:56 ` Michael Ritzert
2013-02-02 23:08 ` Chris Murphy
2013-02-03 0:23 ` Phil Turmel
2013-02-03 0:39 ` Chris Murphy