* Wierd: Degrading while recovering raid5
@ 2015-02-10  4:20 Kyle Logue
  2015-02-10  7:35 ` Adam Goryachev
  0 siblings, 1 reply; 9+ messages in thread
From: Kyle Logue @ 2015-02-10  4:20 UTC
  To: linux-raid

Hey all:

I have a 5 disk software raid5 that was working fine until I decided
to swap out an old disk with a new one.

mdadm /dev/md0 --add /dev/sda1
mdadm /dev/md0 --fail /dev/sde1

At this point it started automatically rebuilding the array.
About 60% of the way in, it stopped and I saw a lot of this repeated in my dmesg:

[Mon Feb  9 18:06:48 2015] ata5.00: exception Emask 0x0 SAct 0x0 SErr
0x0 action 0x6 frozen
[Mon Feb  9 18:06:48 2015] ata5.00: failed command: SMART
[Mon Feb  9 18:06:48 2015] ata5.00: cmd
b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 7
[Mon Feb  9 18:06:48 2015]          res
40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[Mon Feb  9 18:06:48 2015] ata5.00: status: { DRDY }
[Mon Feb  9 18:06:48 2015] ata5: hard resetting link
[Mon Feb  9 18:06:58 2015] ata5: softreset failed (1st FIS failed)
[Mon Feb  9 18:06:58 2015] ata5: hard resetting link
[Mon Feb  9 18:07:08 2015] ata5: softreset failed (1st FIS failed)
[Mon Feb  9 18:07:08 2015] ata5: hard resetting link
[Mon Feb  9 18:07:12 2015] ata5: SATA link up 1.5 Gbps (SStatus 113
SControl 310)
[Mon Feb  9 18:07:12 2015] ata5.00: configured for UDMA/33
[Mon Feb  9 18:07:12 2015] ata5: EH complete

ata5 corresponds to my /dev/sdc drive.
So I was worried, but it didn't look so terrible when I ran --examine:

sudo mdadm --examine /dev/sd[dabfec]1 | egrep 'dev|Update|Role|State|Events'
/dev/sda1:
          State : clean
    Update Time : Sun Feb  8 20:43:27 2015
   Device Role : spare
   Array State : .A.AA ('A' == active, '.' == missing)
         Events : 27009
/dev/sdb1:
          State : clean
    Update Time : Sun Feb  8 20:43:27 2015
   Device Role : Active device 4
   Array State : .A.AA ('A' == active, '.' == missing)
         Events : 27009
/dev/sdc1:
          State : clean
    Update Time : Sun Feb  8 20:21:13 2015
   Device Role : Active device 0
   Array State : AAAAA ('A' == active, '.' == missing)
         Events : 26995
/dev/sdd1:
          State : clean
    Update Time : Sun Feb  8 20:43:27 2015
   Device Role : Active device 1
   Array State : .A.AA ('A' == active, '.' == missing)
         Events : 27009
/dev/sde1:
          State : clean
    Update Time : Sun Feb  8 12:17:10 2015
   Device Role : Active device 2
   Array State : AAAAA ('A' == active, '.' == missing)
         Events : 21977
/dev/sdf1:
          State : clean
    Update Time : Sun Feb  8 20:43:27 2015
   Device Role : Active device 3
   Array State : .A.AA ('A' == active, '.' == missing)
         Events : 27009

So the event counts looked pretty close on the drives I was updating, so I did:

mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sd[dabfec]1
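
The event-count comparison above can be scripted. A minimal sketch, assuming the `mdadm --examine` output format shown in this message; `event_spread` is an illustrative name, not an mdadm feature:

```shell
#!/bin/sh
# event_spread: read `mdadm --examine` output on stdin, print one
# "device events" pair per member, then the gap between the highest
# and lowest event count seen.
event_spread() {
    awk '
        /^\/dev\// { dev = $1; sub(/:$/, "", dev) }   # remember current member
        $1 == "Events" {
            n = $3 + 0
            print dev, n
            if (min == "" || n < min) min = n
            if (n > max) max = n
        }
        END { print "spread", max - min }
    '
}
# Usage (as root): mdadm --examine /dev/sd[a-f]1 | event_spread
```

On the output above this would report a spread of 5032, all of it due to sde1 (21977). The members that ended up in the assembled array differ by only 14 events (27009 vs 26995).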

But it stopped again at some point during recovery, while I was at work,
with the same ATA errors in dmesg.
Searching the web for these errors shows lots of people having this
issue with various Linux distros and laying the blame on everything
from faulty SATA cables to BIOS to NVIDIA drivers - nothing
definitive. I powered off my box and reconnected all my SATA cables as
a sanity check.

I tried --assemble --force again and it got to 70%:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md0 : active raid5 sdc1[7] sda1[8] sdb1[6] sdf1[4] sdd1[5]
      7814047744 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [UU_UU]
      [=============>.......]  recovery = 68.9%
(1347855508/1953511936) finish=306.1min speed=32967K/sec

...but died again. I was monitoring dmesg like a hawk this time and
saw those ata5 errors every 3-15 minutes with different cmd and res
values. At the very end I got this:

[Mon Feb  9 23:11:01 2015] ata5.00: configured for UDMA/33
[Mon Feb  9 23:11:01 2015] sd 4:0:0:0: [sdc] Unhandled sense code
[Mon Feb  9 23:11:01 2015] sd 4:0:0:0: [sdc]
[Mon Feb  9 23:11:01 2015] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon Feb  9 23:11:01 2015] sd 4:0:0:0: [sdc]
[Mon Feb  9 23:11:01 2015] Sense Key : Medium Error [current] [descriptor]
[Mon Feb  9 23:11:01 2015] Descriptor sense data with sense
descriptors (in hex):
[Mon Feb  9 23:11:01 2015]         72 03 11 04 00 00 00 0c 00 0a 80 00
00 00 00 00
[Mon Feb  9 23:11:01 2015]         a4 1c 1d e8
[Mon Feb  9 23:11:01 2015] sd 4:0:0:0: [sdc]
[Mon Feb  9 23:11:01 2015] Add. Sense: Unrecovered read error - auto
reallocate failed
[Mon Feb  9 23:11:01 2015] sd 4:0:0:0: [sdc] CDB:
[Mon Feb  9 23:11:01 2015] Read(10): 28 00 a4 1c 1d e8 00 00 80 00
[Mon Feb  9 23:11:01 2015] end_request: I/O error, dev sdc, sector 2753306088
[Mon Feb  9 23:11:01 2015] md/raid:md0: Disk failure on sdc1, disabling device.
[Mon Feb  9 23:11:01 2015] md/raid:md0: Operation continuing on 3 devices.
[Mon Feb  9 23:11:01 2015] ata5: EH complete
[Mon Feb  9 23:11:01 2015] md: md0: recovery interrupted.
[Mon Feb  9 23:11:01 2015] RAID conf printout:
[Mon Feb  9 23:11:01 2015]  --- level:5 rd:5 wd:3
[Mon Feb  9 23:11:01 2015]  disk 0, o:0, dev:sdc1
[Mon Feb  9 23:11:01 2015]  disk 1, o:1, dev:sdd1
[Mon Feb  9 23:11:01 2015]  disk 2, o:1, dev:sda1
[Mon Feb  9 23:11:01 2015]  disk 3, o:1, dev:sdf1
[Mon Feb  9 23:11:01 2015]  disk 4, o:1, dev:sdb1
[Mon Feb  9 23:11:01 2015] RAID conf printout:
[Mon Feb  9 23:11:01 2015]  --- level:5 rd:5 wd:3
[Mon Feb  9 23:11:01 2015]  disk 1, o:1, dev:sdd1
[Mon Feb  9 23:11:01 2015]  disk 2, o:1, dev:sda1
[Mon Feb  9 23:11:01 2015]  disk 3, o:1, dev:sdf1
[Mon Feb  9 23:11:01 2015]  disk 4, o:1, dev:sdb1
[Mon Feb  9 23:11:01 2015] RAID conf printout:
[Mon Feb  9 23:11:01 2015]  --- level:5 rd:5 wd:3
[Mon Feb  9 23:11:01 2015]  disk 1, o:1, dev:sdd1
[Mon Feb  9 23:11:01 2015]  disk 2, o:1, dev:sda1
[Mon Feb  9 23:11:01 2015]  disk 3, o:1, dev:sdf1
[Mon Feb  9 23:11:01 2015]  disk 4, o:1, dev:sdb1
[Mon Feb  9 23:11:01 2015] RAID conf printout:
[Mon Feb  9 23:11:01 2015]  --- level:5 rd:5 wd:3
[Mon Feb  9 23:11:01 2015]  disk 1, o:1, dev:sdd1
[Mon Feb  9 23:11:01 2015]  disk 3, o:1, dev:sdf1
[Mon Feb  9 23:11:01 2015]  disk 4, o:1, dev:sdb1
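
As a cross-check on the log above: the bytes `a4 1c 1d e8` in the sense data, which also appear as the starting LBA in the Read(10) CDB, are a big-endian sector number. A small sketch converting them to decimal (`lba_from_hex` is an illustrative helper, not part of any tool):

```shell
#!/bin/sh
# lba_from_hex BYTE...: join hex bytes (most significant first) and
# print the resulting sector number in decimal.
lba_from_hex() {
    # printf reuses '%s' for every argument, so the bytes are concatenated.
    hex=$(printf '%s' "$@")
    printf '%d\n' "0x$hex"
}

lba_from_hex a4 1c 1d e8
```

This prints 2753306088, the same sector that end_request reports for sdc.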

and mdstat now has:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
[raid4] [raid10]
md0 : active raid5 sdc1[7](F) sda1[8](S) sdb1[6] sdf1[4] sdd1[5]
      7814047744 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/3] [_U_UU]

And now I am out of ideas. Any thoughts on correcting those ata5
errors? Or skipping those sectors, maybe? While sde1 is the disk I
manually failed, it hasn't been touched yet. The event count is way
off now, but maybe I can use that somehow? Should I replace the SATA
cable for sdc and retry?

Anybody in DC want a beer on me for helping figure this out? I have
more log files stored, but was trying to keep it short.

Thanks for looking,

Kyle L

PS. mdadm v3.2.5 on Ubuntu 14.04 running linux 3.13.0-45
PPS. Last full backup was six months ago. Hmm.


* Re: Wierd: Degrading while recovering raid5
  2015-02-10  4:20 Wierd: Degrading while recovering raid5 Kyle Logue
@ 2015-02-10  7:35 ` Adam Goryachev
  2015-02-10 13:51   ` Phil Turmel
  0 siblings, 1 reply; 9+ messages in thread
From: Adam Goryachev @ 2015-02-10  7:35 UTC
  To: Kyle Logue, linux-raid

Hi Kyle,

There are other people who will jump in and help you with your problem, 
but I'll add a couple of pointers while you are waiting. See below.

On 10/02/15 15:20, Kyle Logue wrote:
> Hey all:
>
> I have a 5 disk software raid5 that was working fine until I decided
> to swap out an old disk with a new one.
>
> mdadm /dev/md0 --add /dev/sda1
> mdadm /dev/md0 --fail /dev/sde1
>
> At this point it started automatically rebuilding the array.
> About 60%? of the way in it stops and I see a lot of this repeated in my dmesg:
>
> [Mon Feb  9 18:06:48 2015] ata5.00: exception Emask 0x0 SAct 0x0 SErr
> 0x0 action 0x6 frozen
> [Mon Feb  9 18:06:48 2015] ata5.00: failed command: SMART
> [Mon Feb  9 18:06:48 2015] ata5.00: cmd
> b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 7
> [Mon Feb  9 18:06:48 2015]          res
> 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
> [Mon Feb  9 18:06:48 2015] ata5.00: status: { DRDY }
> [Mon Feb  9 18:06:48 2015] ata5: hard resetting link
> [Mon Feb  9 18:06:58 2015] ata5: softreset failed (1st FIS failed)
> [Mon Feb  9 18:06:58 2015] ata5: hard resetting link
> [Mon Feb  9 18:07:08 2015] ata5: softreset failed (1st FIS failed)
> [Mon Feb  9 18:07:08 2015] ata5: hard resetting link
> [Mon Feb  9 18:07:12 2015] ata5: SATA link up 1.5 Gbps (SStatus 113
> SControl 310)
> [Mon Feb  9 18:07:12 2015] ata5.00: configured for UDMA/33
> [Mon Feb  9 18:07:12 2015] ata5: EH complete
>
> ata5 corresponds to my /dev/sdc drive.
First, check if the drive is faulty.
dd if=/dev/sdc of=/dev/null bs=10M

If that completes without any errors from dd, then the drive can be read
OK. Now check the logs: were there any errors there? Especially if there
were errors in the logs (but even if not), read about timing mismatches
between the kernel and the hard drive, and how to solve them. There was
another post earlier today with some links to specific posts that will
be helpful (check the online archive).

Finally, I think your first mistake was to fail the drive. You should
have replaced it, which would have stopped you from losing protection
against a failed drive.
See the second answer to this question:
http://unix.stackexchange.com/questions/74924/how-to-safely-replace-a-not-yet-failed-disk-in-a-linux-raid5-array
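
For reference, a hedged sketch of what a hot replace would look like with the device names from this thread. It assumes mdadm 3.3 or later on a kernel with hot-replace support; the mdadm v3.2.5 mentioned in the original post predates the `--replace` option. The function only prints the commands, it does not run them, and `replace_cmds` is an illustrative name:

```shell
#!/bin/sh
# replace_cmds ARRAY OLD NEW: print (do not run) the commands for
# replacing member OLD with NEW while OLD stays active in the array.
# Assumption: mdadm >= 3.3, which provides --replace / --with.
replace_cmds() {
    array=$1 old=$2 new=$3
    echo "mdadm $array --add $new"
    echo "mdadm $array --replace $old --with $new"
}

replace_cmds /dev/md0 /dev/sde1 /dev/sda1
```

With an older mdadm, the same request can reportedly be made by writing `want_replacement` to the member's `state` file under /sys/block/md0/md/; check the md documentation for your kernel before relying on that.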

Regards,
Adam

-- 
Adam Goryachev Website Managers www.websitemanagers.com.au


* Re: Wierd: Degrading while recovering raid5
  2015-02-10  7:35 ` Adam Goryachev
@ 2015-02-10 13:51   ` Phil Turmel
  2015-02-10 21:50     ` Kyle Logue
  0 siblings, 1 reply; 9+ messages in thread
From: Phil Turmel @ 2015-02-10 13:51 UTC
  To: Adam Goryachev, Kyle Logue, linux-raid

Hi Kyle,

Your symptoms look like classic timeout mismatch.  Details interleaved.

On 02/10/2015 02:35 AM, Adam Goryachev wrote:

> There are other people who will jump in and help you with your problem,
> but I'll add a couple of pointers while you are waiting. See below.

> On 10/02/15 15:20, Kyle Logue wrote:
>> Hey all:
>>
>> I have a 5 disk software raid5 that was working fine until I decided
>> to swap out an old disk with a new one.
>>
>> mdadm /dev/md0 --add /dev/sda1
>> mdadm /dev/md0 --fail /dev/sde1

As Adam pointed out, you should have used --replace, but you probably
wouldn't have made it through the replace function anyways.

>> At this point it started automatically rebuilding the array.
>> About 60%? of the way in it stops and I see a lot of this repeated in
>> my dmesg:
>>
>> [Mon Feb  9 18:06:48 2015] ata5.00: exception Emask 0x0 SAct 0x0 SErr
>> 0x0 action 0x6 frozen
>> [Mon Feb  9 18:06:48 2015] ata5.00: failed command: SMART
>> [Mon Feb  9 18:06:48 2015] ata5.00: cmd
>> b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 7
>> [Mon Feb  9 18:06:48 2015]          res
>> 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
                                                 ^^^^^^^^^
Smoking gun.

>> [Mon Feb  9 18:06:48 2015] ata5.00: status: { DRDY }
>> [Mon Feb  9 18:06:48 2015] ata5: hard resetting link
>> [Mon Feb  9 18:06:58 2015] ata5: softreset failed (1st FIS failed)
>> [Mon Feb  9 18:06:58 2015] ata5: hard resetting link
>> [Mon Feb  9 18:07:08 2015] ata5: softreset failed (1st FIS failed)
>> [Mon Feb  9 18:07:08 2015] ata5: hard resetting link
>> [Mon Feb  9 18:07:12 2015] ata5: SATA link up 1.5 Gbps (SStatus 113
>> SControl 310)
>> [Mon Feb  9 18:07:12 2015] ata5.00: configured for UDMA/33
>> [Mon Feb  9 18:07:12 2015] ata5: EH complete

Notice that after a timeout error, the drive is unresponsive for several
more seconds -- about 24 in your case.

> ....  read about timing mismatches
> between the kernel and the hard drive, and how to solve that. There was
> another post earlier today with some links to specific posts that will
> be helpful (check the online archive).

That would have been me.  Start with this link for a description of what
you are experiencing:

http://marc.info/?l=linux-raid&m=135811522817345&w=1

First, you need to protect yourself from timeout mismatch due to the use
of desktop-grade drives.  (Enterprise and raid-rated drives don't have
this problem.)

{ If you were stuck in the middle of a replace and you had just
worked around your timeout problem, it would likely continue and
complete.  You've lost that opportunity. }

Show us the output of "smartctl -x" for all of your drives if you'd like
advice on your particular drives.  (Pasted inline is preferred.)

Second, you need to find and overwrite (with zeros) the bad sectors on
your drives.  Or ddrescue to a complete set of replacement drives and
assemble those.

Third, you need to set up a cron job to scrub your array regularly to
clean out UREs before they accumulate beyond MD's ability to handle it
(20 read errors in an hour, 10 per hour sustained).
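
A minimal sketch of such a scrub job, assuming the array is md0 and a weekly schedule (both are assumptions; pick what suits your array). Writing `check` to the array's `sync_action` file asks md to read every stripe. The optional second argument exists only so the function can be tried against a scratch directory instead of the real /sys:

```shell
#!/bin/sh
# start_scrub MD_NAME [SYSFS_ROOT]: request a "check" pass on an md array.
# SYSFS_ROOT defaults to /sys; override it only for dry runs.
start_scrub() {
    md=$1
    root=${2:-/sys}
    echo check > "$root/block/$md/md/sync_action"
}
# Example use from a script run by root's crontab (weekly, Sunday 03:00):
#   0 3 * * 0  /usr/local/sbin/md-scrub.sh    <- hypothetical path
# where the script ends with:  start_scrub md0
```

Progress of a running check shows up in /proc/mdstat, in the same place as the recovery progress earlier in this thread.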

Phil


* Re: Wierd: Degrading while recovering raid5
  2015-02-10 13:51   ` Phil Turmel
@ 2015-02-10 21:50     ` Kyle Logue
  2015-02-11  2:14       ` Phil Turmel
  0 siblings, 1 reply; 9+ messages in thread
From: Kyle Logue @ 2015-02-10 21:50 UTC
  To: linux-raid

Phil:

Thanks for your detailed response. That link does seem to describe my
problem and I do understand that desktop grade drives are sub-optimal.
It was many years ago when I first set up this array on my home
theater pc.  Until now I had no idea about the cron job - I'll make
sure to implement that. I am preparing to move to 6 TB disks sometime
soon and I'll definitely go enterprise this time.

Regarding the drive timeout: I understand that I need to increase it
from 30 seconds to something larger (2+ min) but am unaware how to do
this. Is it a kernel variable? I'll keep googling but this seems like
it's what's going to save me.

tl;dr: How do I change the drive timeout?
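
For readers with the same question, a sketch of the commonly used two-step workaround, not a definitive recipe. For each drive it first tries to enable SCT Error Recovery Control with a 7.0 second limit via smartctl. If the drive refuses (the smartctl output in this message says "SCT Error Recovery Control command not supported"), it raises the kernel's per-device command timeout through sysfs instead. The 180 second value is an assumption in line with the 2+ minutes mentioned above, and `SYSFS_ROOT` is a hypothetical override added only so the function can be tried without touching /sys:

```shell
#!/bin/sh
# set_timeouts DISK...: work around timeout mismatch for each disk (e.g. sdc).
# Needs root and smartmontools. 70 means 7.0 s; 180 s is an assumed value.
set_timeouts() {
    root=${SYSFS_ROOT:-/sys}
    for disk in "$@"; do
        if smartctl -l scterc,70,70 "/dev/$disk" >/dev/null 2>&1; then
            echo "$disk: SCT ERC set to 7.0 seconds"
        else
            # Drive cannot limit its own error recovery: let the kernel wait longer.
            echo 180 > "$root/block/$disk/device/timeout"
            echo "$disk: no SCT ERC, kernel timeout raised to 180 seconds"
        fi
    done
}
# Usage (as root): set_timeouts sda sdb sdc sdd sdf
```

Neither setting survives a reboot, and the SCT ERC setting is also lost when the drive is power cycled, so this would normally be run from a boot script.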

Here is the smartctl -x for all my drives:

Reminder: SDA is the new drive. SDC is the troublemaker. SDE is the
one I failed.

> sudo smartctl -x /dev/sda
> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-45-generic] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda 7200.14 (AF)
> Device Model:     ST2000DM001-1CH164
> Serial Number:    Z340F2SP
> LU WWN Device Id: 5 000c50 064d5887d
> Firmware Version: CC27
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    7200 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
> SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is:    Tue Feb 10 16:37:52 2015 EST
> ==> WARNING: A firmware update for this drive may be available,
> see the following Seagate web pages:
> http://knowledge.seagate.com/articles/en_US/FAQ/207931en
> http://knowledge.seagate.com/articles/en_US/FAQ/223651en
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is:   Unavailable
> APM level is:     254 (maximum performance)
> Rd look-ahead is: Enabled
> Write cache is:   Enabled
> ATA Security is:  Disabled, NOT FROZEN [SEC1]
> Write SCT (Get) XXX Error Recovery Control Command failed: scsi error aborted command
> Wt Cache Reorder: N/A
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
>                                         was completed without error.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                (  584) seconds.
> Offline data collection
> capabilities:                    (0x7b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   1) minutes.
> Extended self-test routine
> recommended polling time:        ( 212) minutes.
> Conveyance self-test routine
> recommended polling time:        (   2) minutes.
> SCT capabilities:              (0x3085) SCT Status supported.
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR--   105   099   006    -    9806192
>   3 Spin_Up_Time            PO----   097   097   000    -    0
>   4 Start_Stop_Count        -O--CK   100   100   020    -    4
>   5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
>   7 Seek_Error_Rate         POSR--   100   253   030    -    289070
>   9 Power_On_Hours          -O--CK   100   100   000    -    35
>  10 Spin_Retry_Count        PO--C-   100   100   097    -    0
>  12 Power_Cycle_Count       -O--CK   100   100   020    -    5
> 183 Runtime_Bad_Block       -O--CK   099   099   000    -    1
> 184 End-to-End_Error        -O--CK   100   100   099    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> 188 Command_Timeout         -O--CK   100   100   000    -    0 0 0
> 189 High_Fly_Writes         -O-RCK   100   100   000    -    0
> 190 Airflow_Temperature_Cel -O---K   073   062   045    -    27 (Min/Max 25/27)
> 191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    4
> 193 Load_Cycle_Count        -O--CK   100   100   000    -    8
> 194 Temperature_Celsius     -O---K   027   040   000    -    27 (0 22 0 0 0)
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    0
> 199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
> 240 Head_Flying_Hours       ------   100   253   000    -    35h+41m+13.042s
> 241 Total_LBAs_Written      ------   100   253   000    -    11031892416
> 242 Total_LBAs_Read         ------   100   253   000    -    2769646
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
> General Purpose Log Directory Version 1
> SMART           Log Directory Version 1 [multi-sector log support]
> Address    Access  R/W   Size  Description
> 0x00       GPL,SL  R/O      1  Log Directory
> 0x01           SL  R/O      1  Summary SMART error log
> 0x02           SL  R/O      5  Comprehensive SMART error log
> 0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
> 0x06           SL  R/O      1  SMART self-test log
> 0x07       GPL     R/O      1  Extended self-test log
> 0x09           SL  R/W      1  Selective self-test log
> 0x10       GPL     R/O      1  NCQ Command Error log
> 0x11       GPL     R/O      1  SATA Phy Event Counters
> 0x21       GPL     R/O      1  Write stream error log
> 0x22       GPL     R/O      1  Read stream error log
> 0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
> 0xa1       GPL,SL  VS      20  Device vendor specific log
> 0xa2       GPL     VS    4496  Device vendor specific log
> 0xa8       GPL,SL  VS     129  Device vendor specific log
> 0xa9       GPL,SL  VS       1  Device vendor specific log
> 0xab       GPL     VS       1  Device vendor specific log
> 0xb0       GPL     VS    5176  Device vendor specific log
> 0xbe-0xbf  GPL     VS   65535  Device vendor specific log
> 0xc0       GPL,SL  VS       1  Device vendor specific log
> 0xc1       GPL,SL  VS      10  Device vendor specific log
> 0xc4       GPL,SL  VS       5  Device vendor specific log
> 0xe0       GPL,SL  R/W      1  SCT Command/Status
> 0xe1       GPL,SL  R/W      1  SCT Data Transfer
> SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
> No Errors Logged
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> SCT Data Table command not supported
> SCT Error Recovery Control command not supported
> Device Statistics (GP Log 0x04) not supported
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x000a  2            6  Device-to-host register FISes sent due to a COMRESET
> 0x0001  2            0  Command failed due to ICRC error
> 0x0003  2            0  R_ERR response for device-to-host data FIS
> 0x0004  2            0  R_ERR response for host-to-device data FIS
> 0x0006  2            0  R_ERR response for device-to-host non-data FIS
> 0x0007  2            0  R_ERR response for host-to-device non-data FIS
>
> sudo smartctl -x /dev/sdb
> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-45-generic] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda 7200.14 (AF)
> Device Model:     ST2000DM001-1CH164
> Serial Number:    S1E1CW9Y
> LU WWN Device Id: 5 000c50 05c085bef
> Firmware Version: CC24
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    7200 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ATA8-ACS T13/1699-D revision 4
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is:    Tue Feb 10 16:40:24 2015 EST
> ==> WARNING: A firmware update for this drive may be available,
> see the following Seagate web pages:
> http://knowledge.seagate.com/articles/en_US/FAQ/207931en
> http://knowledge.seagate.com/articles/en_US/FAQ/223651en
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is:   Unavailable
> APM level is:     254 (maximum performance)
> Rd look-ahead is: Enabled
> Write cache is:   Enabled
> ATA Security is:  Disabled, NOT FROZEN [SEC1]
> Write SCT (Get) XXX Error Recovery Control Command failed: scsi error aborted command
> Wt Cache Reorder: N/A
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
>                                         was completed without error.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                (  584) seconds.
> Offline data collection
> capabilities:                    (0x7b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   1) minutes.
> Extended self-test routine
> recommended polling time:        ( 225) minutes.
> Conveyance self-test routine
> recommended polling time:        (   2) minutes.
> SCT capabilities:              (0x3085) SCT Status supported.
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     POSR--   117   099   006    -    153090384
>   3 Spin_Up_Time            PO----   096   096   000    -    0
>   4 Start_Stop_Count        -O--CK   100   100   020    -    58
>   5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
>   7 Seek_Error_Rate         POSR--   063   058   030    -    8594213138
>   9 Power_On_Hours          -O--CK   084   084   000    -    14743
>  10 Spin_Retry_Count        PO--C-   100   100   097    -    0
>  12 Power_Cycle_Count       -O--CK   100   100   020    -    58
> 183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
> 184 End-to-End_Error        -O--CK   100   100   099    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> 188 Command_Timeout         -O--CK   100   099   000    -    1 1 1
> 189 High_Fly_Writes         -O-RCK   100   100   000    -    0
> 190 Airflow_Temperature_Cel -O---K   072   057   045    -    28 (Min/Max 26/28)
> 191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    34
> 193 Load_Cycle_Count        -O--CK   100   100   000    -    110
> 194 Temperature_Celsius     -O---K   028   043   000    -    28 (0 18 0 0 0)
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    0
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    0
> 199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
> 240 Head_Flying_Hours       ------   100   253   000    -    14740h+55m+31.297s
> 241 Total_LBAs_Written      ------   100   253   000    -    9249405614
> 242 Total_LBAs_Read         ------   100   253   000    -    100539385901
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
> General Purpose Log Directory Version 1
> SMART           Log Directory Version 1 [multi-sector log support]
> Address    Access  R/W   Size  Description
> 0x00       GPL,SL  R/O      1  Log Directory
> 0x01           SL  R/O      1  Summary SMART error log
> 0x02           SL  R/O      5  Comprehensive SMART error log
> 0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
> 0x06           SL  R/O      1  SMART self-test log
> 0x07       GPL     R/O      1  Extended self-test log
> 0x09           SL  R/W      1  Selective self-test log
> 0x10       GPL     R/O      1  NCQ Command Error log
> 0x11       GPL     R/O      1  SATA Phy Event Counters
> 0x21       GPL     R/O      1  Write stream error log
> 0x22       GPL     R/O      1  Read stream error log
> 0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
> 0xa1       GPL,SL  VS      20  Device vendor specific log
> 0xa2       GPL     VS    4496  Device vendor specific log
> 0xa8       GPL,SL  VS     129  Device vendor specific log
> 0xa9       GPL,SL  VS       1  Device vendor specific log
> 0xab       GPL     VS       1  Device vendor specific log
> 0xb0       GPL     VS    5176  Device vendor specific log
> 0xbd       GPL     VS     512  Device vendor specific log
> 0xbe-0xbf  GPL     VS   65535  Device vendor specific log
> 0xc0       GPL,SL  VS       1  Device vendor specific log
> 0xc1       GPL,SL  VS      10  Device vendor specific log
> 0xc4       GPL,SL  VS       5  Device vendor specific log
> 0xe0       GPL,SL  R/W      1  SCT Command/Status
> 0xe1       GPL,SL  R/W      1  SCT Data Transfer
> SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
> No Errors Logged
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> SCT Data Table command not supported
> SCT Error Recovery Control command not supported
> Device Statistics (GP Log 0x04) not supported
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x000a  2            6  Device-to-host register FISes sent due to a COMRESET
> 0x0001  2            0  Command failed due to ICRC error
> 0x0003  2            0  R_ERR response for device-to-host data FIS
> 0x0004  2            0  R_ERR response for host-to-device data FIS
> 0x0006  2            0  R_ERR response for device-to-host non-data FIS
> 0x0007  2            0  R_ERR response for host-to-device non-data FIS
> THIS IS THE BAD DISK:
> sudo smartctl -x /dev/sdc
> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-45-generic] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda 7200.14 (AF)
> Device Model:     ST2000DM001-1CH164
> Serial Number:    S240V6VR
> LU WWN Device Id: 5 000c50 05c05c2e7
> Firmware Version: CC24
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Sector Sizes:     512 bytes logical, 4096 bytes physical
> Rotation Rate:    7200 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ATA8-ACS T13/1699-D revision 4
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is:    Tue Feb 10 16:42:53 2015 EST
> ==> WARNING: A firmware update for this drive may be available,
> see the following Seagate web pages:
> http://knowledge.seagate.com/articles/en_US/FAQ/207931en
> http://knowledge.seagate.com/articles/en_US/FAQ/223651en
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is:   Unavailable
> APM level is:     254 (maximum performance)
> Rd look-ahead is: Enabled
> Write cache is:   Enabled
> ATA Security is:  Disabled, NOT FROZEN [SEC1]
> Write SCT (Get) XXX Error Recovery Control Command failed: scsi error aborted command
> Wt Cache Reorder: N/A
> Read SMART Data failed: scsi error aborted command
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: UNKNOWN!
> SMART Status, Attributes and Thresholds cannot be read.
> General Purpose Log Directory Version 1
> SMART           Log Directory Version 1 [multi-sector log support]
> Address    Access  R/W   Size  Description
> 0x00       GPL,SL  R/O      1  Log Directory
> 0x01           SL  R/O      1  Summary SMART error log
> 0x02           SL  R/O      5  Comprehensive SMART error log
> 0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
> 0x06           SL  R/O      1  SMART self-test log
> 0x07       GPL     R/O      1  Extended self-test log
> 0x09           SL  R/W      1  Selective self-test log
> 0x10       GPL     R/O      1  NCQ Command Error log
> 0x11       GPL     R/O      1  SATA Phy Event Counters
> 0x21       GPL     R/O      1  Write stream error log
> 0x22       GPL     R/O      1  Read stream error log
> 0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
> 0xa1       GPL,SL  VS      20  Device vendor specific log
> 0xa2       GPL     VS    4496  Device vendor specific log
> 0xa8       GPL,SL  VS     129  Device vendor specific log
> 0xa9       GPL,SL  VS       1  Device vendor specific log
> 0xab       GPL     VS       1  Device vendor specific log
> 0xb0       GPL     VS    5176  Device vendor specific log
> 0xbd       GPL     VS     512  Device vendor specific log
> 0xbe-0xbf  GPL     VS   65535  Device vendor specific log
> 0xc0       GPL,SL  VS       1  Device vendor specific log
> 0xc1       GPL,SL  VS      10  Device vendor specific log
> 0xc4       GPL,SL  VS       5  Device vendor specific log
> 0xe0       GPL,SL  R/W      1  SCT Command/Status
> 0xe1       GPL,SL  R/W      1  SCT Data Transfer
> SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
> Device Error Count: 9
>         CR     = Command Register
>         FEATR  = Features Register
>         COUNT  = Count (was: Sector Count) Register
>         LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
>         LH     = LBA High (was: Cylinder High) Register    ]   LBA
>         LM     = LBA Mid (was: Cylinder Low) Register      ] Register
>         LL     = LBA Low (was: Sector Number) Register     ]
>         DV     = Device (was: Device/Head) Register
>         DC     = Device Control Register
>         ER     = Error register
>         ST     = Status register
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
> Error 9 [8] occurred at disk power-on lifetime: 14697 hours (612 days + 9 hours)
>   When the command that caused the error occurred, the device was active or idle.
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 a4 1c 1d e8 00 00  Error: UNC at LBA = 0xa41c1de8 = 2753306088
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   25 00 00 00 80 00 00 a4 1c 1d e8 e0 00     04:55:26.791  READ DMA EXT
>   25 00 00 04 00 00 00 a4 1c 21 00 e0 00     04:55:26.776  READ DMA EXT
>   ef 00 10 00 02 00 00 00 00 00 00 a0 00     04:55:26.775  SET FEATURES [Enable SATA feature]
>   27 00 00 00 00 00 00 00 00 00 00 e0 00     04:55:26.775  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
>   ec 00 00 00 00 00 00 00 00 00 00 a0 00     04:55:26.774  IDENTIFY DEVICE
> Error 8 [7] occurred at disk power-on lifetime: 14697 hours (612 days + 9 hours)
>   When the command that caused the error occurred, the device was active or idle.
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 a4 1c 1d e8 00 00  Error: UNC at LBA = 0xa41c1de8 = 2753306088
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   25 00 00 04 00 00 00 a4 1c 1d 00 e0 00     04:55:23.631  READ DMA EXT
>   25 00 00 04 00 00 00 a4 1c 19 00 e0 00     04:55:23.553  READ DMA EXT
>   25 00 00 04 00 00 00 a4 1c 15 00 e0 00     04:55:23.108  READ DMA EXT
>   25 00 00 04 00 00 00 a4 1c 11 00 e0 00     04:55:23.004  READ DMA EXT
>   25 00 00 04 00 00 00 a4 1c 0d 00 e0 00     04:55:22.893  READ DMA EXT
> Error 7 [6] occurred at disk power-on lifetime: 14686 hours (611 days + 22 hours)
>   When the command that caused the error occurred, the device was active or idle.
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 a4 1c 1d e8 00 00  Error: UNC at LBA = 0xa41c1de8 = 2753306088
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   25 00 00 03 c0 00 00 a4 1c 1d e8 e0 00  1d+00:26:44.862  READ DMA EXT
>   25 00 00 00 08 00 00 a4 1c 21 a8 e0 00  1d+00:26:44.852  READ DMA EXT
>   ec 00 00 00 01 00 00 00 00 00 00 00 00  1d+00:26:44.851  IDENTIFY DEVICE
>   ec 00 00 00 01 00 00 00 00 00 00 00 00  1d+00:26:44.851  IDENTIFY DEVICE
>   e5 00 00 00 00 00 00 00 00 00 00 00 00  1d+00:26:44.851  CHECK POWER MODE
> Error 6 [5] occurred at disk power-on lifetime: 14686 hours (611 days + 22 hours)
>   When the command that caused the error occurred, the device was active or idle.
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 a4 1c 1d e8 00 00  Error: UNC at LBA = 0xa41c1de8 = 2753306088
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   25 00 00 04 00 00 00 a4 1c 1d a8 e0 00  1d+00:26:30.653  READ DMA EXT
>   ef 00 90 00 03 00 00 00 00 00 00 a0 00  1d+00:26:30.638  SET FEATURES [Disable SATA feature]
>   ef 00 10 00 02 00 00 00 00 00 00 a0 00  1d+00:26:30.638  SET FEATURES [Enable SATA feature]
>   27 00 00 00 00 00 00 00 00 00 00 e0 00  1d+00:26:30.638  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
>   ec 00 00 00 00 00 00 00 00 00 00 a0 00  1d+00:26:30.638  IDENTIFY DEVICE
> Error 5 [4] occurred at disk power-on lifetime: 14676 hours (611 days + 12 hours)
>   When the command that caused the error occurred, the device was active or idle.
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 a4 1c 1d e8 00 00  Error: UNC at LBA = 0xa41c1de8 = 2753306088
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   25 00 00 00 a8 00 00 a4 1c 1d e8 e0 00     14:43:09.384  READ DMA EXT
>   e5 00 00 00 00 00 00 00 00 00 00 00 00     14:43:09.383  CHECK POWER MODE
>   25 00 00 04 00 00 00 a4 1c 1e 90 e0 00     14:43:09.371  READ DMA EXT
>   ef 00 10 00 02 00 00 00 00 00 00 a0 00     14:43:09.370  SET FEATURES [Enable SATA feature]
>   27 00 00 00 00 00 00 00 00 00 00 e0 00     14:43:09.370  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
> Error 4 [3] occurred at disk power-on lifetime: 14676 hours (611 days + 12 hours)
>   When the command that caused the error occurred, the device was active or idle.
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 a4 1c 1d e8 00 00  Error: UNC at LBA = 0xa41c1de8 = 2753306088
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   25 00 00 04 00 00 00 a4 1c 1a 90 e0 00     14:43:06.283  READ DMA EXT
>   25 00 00 04 00 00 00 a4 1c 16 90 e0 00     14:43:06.205  READ DMA EXT
>   25 00 00 04 00 00 00 a4 1c 12 90 e0 00     14:43:04.892  READ DMA EXT
>   25 00 00 04 00 00 00 a4 1c 0e 90 e0 00     14:43:04.855  READ DMA EXT
>   25 00 00 04 00 00 00 a4 1c 0a 90 e0 00     14:43:04.819  READ DMA EXT
> Error 3 [2] occurred at disk power-on lifetime: 14670 hours (611 days + 6 hours)
>   When the command that caused the error occurred, the device was active or idle.
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 a4 1c 1d e8 00 00  Error: UNC at LBA = 0xa41c1de8 = 2753306088
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   25 00 00 04 00 00 00 a4 1c 1a 00 e0 00     08:33:02.502  READ DMA EXT
>   ef 00 10 00 02 00 00 00 00 00 00 a0 00     08:33:02.501  SET FEATURES [Enable SATA feature]
>   27 00 00 00 00 00 00 00 00 00 00 e0 00     08:33:02.501  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
>   ec 00 00 00 00 00 00 00 00 00 00 a0 00     08:33:02.501  IDENTIFY DEVICE
>   ef 00 03 00 42 00 00 00 00 00 00 a0 00     08:33:02.501  SET FEATURES [Set transfer mode]
> Error 2 [1] occurred at disk power-on lifetime: 14670 hours (611 days + 6 hours)
>   When the command that caused the error occurred, the device was active or idle.
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   40 -- 51 00 00 00 00 a4 1c 13 d0 00 00  Error: UNC at LBA = 0xa41c13d0 = 2753303504
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   25 00 00 02 30 00 00 a4 1c 13 d0 e0 00     08:32:59.645  READ DMA EXT
>   e5 00 00 00 00 00 00 00 00 00 00 00 00     08:32:59.643  CHECK POWER MODE
>   25 00 00 04 00 00 00 a4 1c 16 00 e0 00     08:32:59.581  READ DMA EXT
>   ef 00 10 00 02 00 00 00 00 00 00 a0 00     08:32:59.580  SET FEATURES [Enable SATA feature]
>   27 00 00 00 00 00 00 00 00 00 00 e0 00     08:32:59.580  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
> Selective Self-tests/Logging not supported
> SCT Data Table command not supported
> SCT Error Recovery Control command not supported
> Device Statistics (GP Log 0x04) not supported
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x000a  2            6  Device-to-host register FISes sent due to a COMRESET
> 0x0001  2            0  Command failed due to ICRC error
> 0x0003  2            0  R_ERR response for device-to-host data FIS
> 0x0004  2            0  R_ERR response for host-to-device data FIS
> 0x0006  2            0  R_ERR response for device-to-host non-data FIS
> 0x0007  2            0  R_ERR response for host-to-device non-data FIS
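A quick way to sanity-check the failing address in the error log above: the six bytes under LBA_48, LH, LM and LL in each register row form the 48-bit LBA. The helper below is only a sketch, and `lba_to_dec` is a made-up name, not part of smartmontools. It joins those bytes and prints the decimal value, so you can compare it with what smartctl reports on the "Error: UNC at LBA" lines.

```shell
#!/bin/sh
# Hypothetical helper (not a smartmontools command): takes the six LBA
# bytes from a register row in the order LBA_48 (3 bytes), LH, LM, LL
# and prints the LBA in decimal.
lba_to_dec() {
    printf '%d\n' "0x$1$2$3$4$5$6"
}

# Register row from Errors 3 to 9: 00 00 a4 1c 1d e8
lba_to_dec 00 00 a4 1c 1d e8

# Register row from Error 2: 00 00 a4 1c 13 d0
lba_to_dec 00 00 a4 1c 13 d0
```

Seven of the eight logged errors land on the same LBA, and they are spread over several power-on hours, so the drive keeps failing to read one spot rather than failing at random.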
> sudo smartctl -x /dev/sdd
> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-45-generic] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> === START OF INFORMATION SECTION ===
> Model Family:     Hitachi Deskstar 7K3000
> Device Model:     Hitachi HDS723020BLA642
> Serial Number:    MN3220F32GX10E
> LU WWN Device Id: 5 000cca 369e2f56f
> Firmware Version: MN6OA5C0
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Sector Size:      512 bytes logical/physical
> Rotation Rate:    7200 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ATA8-ACS T13/1699-D revision 4
> SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)
> Local Time is:    Tue Feb 10 16:45:04 2015 EST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is:   Unavailable
> APM feature is:   Disabled
> Rd look-ahead is: Enabled
> Write cache is:   Enabled
> ATA Security is:  Disabled, NOT FROZEN [SEC1]
> Wt Cache Reorder: Enabled
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> General SMART Values:
> Offline data collection status:  (0x84) Offline data collection activity
>                                         was suspended by an interrupting command from host.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                (18096) seconds.
> Offline data collection
> capabilities:                    (0x5b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         No Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   1) minutes.
> Extended self-test routine
> recommended polling time:        ( 302) minutes.
> SCT capabilities:              (0x003d) SCT Status supported.
>                                         SCT Error Recovery Control supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
>   2 Throughput_Performance  P-S---   136   136   054    -    82
>   3 Spin_Up_Time            POS---   152   152   024    -    434 (Average 320)
>   4 Start_Stop_Count        -O--C-   100   100   000    -    97
>   5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
>   7 Seek_Error_Rate         PO-R--   100   100   067    -    0
>   8 Seek_Time_Performance   P-S---   135   135   020    -    26
>   9 Power_On_Hours          -O--C-   097   097   000    -    27235
>  10 Spin_Retry_Count        PO--C-   100   100   060    -    0
>  12 Power_Cycle_Count       -O--CK   100   100   000    -    97
> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    755
> 193 Load_Cycle_Count        -O--C-   100   100   000    -    755
> 194 Temperature_Celsius     -O----   200   200   000    -    30 (Min/Max 19/45)
> 196 Reallocated_Event_Count -O--CK   100   100   000    -    0
> 197 Current_Pending_Sector  -O---K   100   100   000    -    0
> 198 Offline_Uncorrectable   ---R--   100   100   000    -    0
> 199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
> General Purpose Log Directory Version 1
> SMART           Log Directory Version 1 [multi-sector log support]
> Address    Access  R/W   Size  Description
> 0x00       GPL,SL  R/O      1  Log Directory
> 0x01           SL  R/O      1  Summary SMART error log
> 0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
> 0x04       GPL     R/O      7  Device Statistics log
> 0x06           SL  R/O      1  SMART self-test log
> 0x07       GPL     R/O      1  Extended self-test log
> 0x08       GPL     R/O      1  Power Conditions log
> 0x09           SL  R/W      1  Selective self-test log
> 0x10       GPL     R/O      1  NCQ Command Error log
> 0x11       GPL     R/O      1  SATA Phy Event Counters
> 0x20       GPL     R/O      1  Streaming performance log [OBS-8]
> 0x21       GPL     R/O      1  Write stream error log
> 0x22       GPL     R/O      1  Read stream error log
> 0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
> 0xe0       GPL,SL  R/W      1  SCT Command/Status
> 0xe1       GPL,SL  R/W      1  SCT Data Transfer
> SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
> No Errors Logged
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> SCT Status Version:                  3
> SCT Version (vendor specific):       256 (0x0100)
> SCT Support Level:                   1
> Device State:                        SMART Off-line Data Collection executing in background (4)
> Current Temperature:                    30 Celsius
> Power Cycle Min/Max Temperature:     27/30 Celsius
> Lifetime    Min/Max Temperature:     19/45 Celsius
> Under/Over Temperature Limit Count:   0/0
> SCT Temperature History Version:     2
> Temperature Sampling Period:         1 minute
> Temperature Logging Interval:        1 minute
> Min/Max recommended Temperature:      0/60 Celsius
> Min/Max Temperature Limit:           -40/70 Celsius
> Temperature History Size (Index):    128 (52)
> Index    Estimated Time   Temperature Celsius
>   53    2015-02-10 14:38    37  ******************
>  ...    ..( 24 skipped).    ..  ******************
>   78    2015-02-10 15:03    37  ******************
>   79    2015-02-10 15:04    36  *****************
>   80    2015-02-10 15:05    36  *****************
>   81    2015-02-10 15:06    37  ******************
>  ...    ..(  5 skipped).    ..  ******************
>   87    2015-02-10 15:12    37  ******************
>   88    2015-02-10 15:13    36  *****************
>   89    2015-02-10 15:14    37  ******************
>  ...    ..(  5 skipped).    ..  ******************
>   95    2015-02-10 15:20    37  ******************
>   96    2015-02-10 15:21    36  *****************
>   97    2015-02-10 15:22    37  ******************
>   98    2015-02-10 15:23    37  ******************
>   99    2015-02-10 15:24    36  *****************
>  100    2015-02-10 15:25    37  ******************
>  ...    ..(  4 skipped).    ..  ******************
>  105    2015-02-10 15:30    37  ******************
>  106    2015-02-10 15:31    36  *****************
>  107    2015-02-10 15:32    36  *****************
>  108    2015-02-10 15:33    37  ******************
>  ...    ..(  6 skipped).    ..  ******************
>  115    2015-02-10 15:40    37  ******************
>  116    2015-02-10 15:41    36  *****************
>  117    2015-02-10 15:42    36  *****************
>  118    2015-02-10 15:43    36  *****************
>  119    2015-02-10 15:44    37  ******************
>  ...    ..(  2 skipped).    ..  ******************
>  122    2015-02-10 15:47    37  ******************
>  123    2015-02-10 15:48    36  *****************
>  124    2015-02-10 15:49    37  ******************
>  125    2015-02-10 15:50    37  ******************
>  126    2015-02-10 15:51    36  *****************
>  127    2015-02-10 15:52    36  *****************
>    0    2015-02-10 15:53    37  ******************
>    1    2015-02-10 15:54    36  *****************
>    2    2015-02-10 15:55    37  ******************
>    3    2015-02-10 15:56    36  *****************
>    4    2015-02-10 15:57    36  *****************
>    5    2015-02-10 15:58    37  ******************
>  ...    ..(  2 skipped).    ..  ******************
>    8    2015-02-10 16:01    37  ******************
>    9    2015-02-10 16:02    36  *****************
>   10    2015-02-10 16:03    37  ******************
>  ...    ..(  2 skipped).    ..  ******************
>   13    2015-02-10 16:06    37  ******************
>   14    2015-02-10 16:07    36  *****************
>   15    2015-02-10 16:08    37  ******************
>  ...    ..( 10 skipped).    ..  ******************
>   26    2015-02-10 16:19    37  ******************
>   27    2015-02-10 16:20    36  *****************
>  ...    ..(  5 skipped).    ..  *****************
>   33    2015-02-10 16:26    36  *****************
>   34    2015-02-10 16:27    37  ******************
>  ...    ..(  4 skipped).    ..  ******************
>   39    2015-02-10 16:32    37  ******************
>   40    2015-02-10 16:33     ?  -
>   41    2015-02-10 16:34    27  ********
>   42    2015-02-10 16:35    28  *********
>   43    2015-02-10 16:36    28  *********
>   44    2015-02-10 16:37    28  *********
>   45    2015-02-10 16:38    29  **********
>  ...    ..(  2 skipped).    ..  **********
>   48    2015-02-10 16:41    29  **********
>   49    2015-02-10 16:42    30  ***********
>  ...    ..(  2 skipped).    ..  ***********
>   52    2015-02-10 16:45    30  ***********
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled
> Device Statistics (GP Log 0x04)
> Page Offset Size         Value  Description
>   1  =====  =                =  == General Statistics (rev 1) ==
>   1  0x008  4               97  Lifetime Power-On Resets
>   1  0x010  4            27235  Power-on Hours
>   1  0x018  6      11734342067  Logical Sectors Written
>   1  0x020  6         27559380  Number of Write Commands
>   1  0x028  6    2738754035727  Logical Sectors Read
>   1  0x030  6       5733165681  Number of Read Commands
>   3  =====  =                =  == Rotating Media Statistics (rev 1) ==
>   3  0x008  4            27229  Spindle Motor Power-on Hours
>   3  0x010  4            27229  Head Flying Hours
>   3  0x018  4              755  Head Load Events
>   3  0x020  4                0  Number of Reallocated Logical Sectors
>   3  0x028  4              276  Read Recovery Attempts
>   3  0x030  4                7  Number of Mechanical Start Failures
>   4  =====  =                =  == General Errors Statistics (rev 1) ==
>   4  0x008  4                0  Number of Reported Uncorrectable Errors
>   4  0x010  4                2  Resets Between Cmd Acceptance and Completion
>   5  =====  =                =  == Temperature Statistics (rev 1) ==
>   5  0x008  1               30  Current Temperature
>   5  0x010  1               35~ Average Short Term Temperature
>   5  0x018  1               33~ Average Long Term Temperature
>   5  0x020  1               45  Highest Temperature
>   5  0x028  1               19  Lowest Temperature
>   5  0x030  1               42~ Highest Average Short Term Temperature
>   5  0x038  1               24~ Lowest Average Short Term Temperature
>   5  0x040  1               39~ Highest Average Long Term Temperature
>   5  0x048  1               25~ Lowest Average Long Term Temperature
>   5  0x050  4                0  Time in Over-Temperature
>   5  0x058  1               60  Specified Maximum Operating Temperature
>   5  0x060  4                0  Time in Under-Temperature
>   5  0x068  1                0  Specified Minimum Operating Temperature
>   6  =====  =                =  == Transport Statistics (rev 1) ==
>   6  0x008  4             1122  Number of Hardware Resets
>   6  0x010  4             1027  Number of ASR Events
>   6  0x018  4                0  Number of Interface CRC Errors
>                               |_ ~ normalized value
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x0001  2            0  Command failed due to ICRC error
> 0x0002  2            0  R_ERR response for data FIS
> 0x0003  2            0  R_ERR response for device-to-host data FIS
> 0x0004  2            0  R_ERR response for host-to-device data FIS
> 0x0005  2            0  R_ERR response for non-data FIS
> 0x0006  2            0  R_ERR response for device-to-host non-data FIS
> 0x0007  2            0  R_ERR response for host-to-device non-data FIS
> 0x0009  2            6  Transition from drive PhyRdy to drive PhyNRdy
> 0x000a  2            5  Device-to-host register FISes sent due to a COMRESET
> 0x000b  2            0  CRC errors within host-to-device FIS
> 0x000d  2            0  Non-CRC errors within host-to-device FIS
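Note the difference in the two reports above: sdc says "SCT Error Recovery Control command not supported", while sdd supports it but shows Read and Write as Disabled. A common way to deal with that on md arrays is to enable a short ERC timeout where the drive supports it, and to raise the kernel command timeout where it does not. The sketch below only prints a suggested command from a saved `smartctl -x` report. `erc_action` is a hypothetical name, and the 7.0 second ERC value (70) and the 180 second kernel timeout are commonly quoted values that I am assuming here, not something taken from this output.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -x` output on stdin and prints
# (does not run) a suggested command for the device named in $1.
erc_action() {
    report=$(cat)
    case $report in
        *'SCT Error Recovery Control supported'*)
            # Drive advertises ERC: suggest 7.0 s read/write limits
            # (units are tenths of a second; value is an assumption).
            echo "smartctl -l scterc,70,70 /dev/$1"
            ;;
        *)
            # No ERC advertised: suggest a longer kernel command timeout
            # instead (180 s is an assumed value).
            echo "echo 180 > /sys/block/$1/device/timeout"
            ;;
    esac
}
```

Usage would be something like `smartctl -x /dev/sdd | erc_action sdd`. Neither setting is expected to survive a power cycle, so review the printed command and reapply it at boot if you decide to use it.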
> sudo smartctl -x /dev/sde
> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-45-generic] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> === START OF INFORMATION SECTION ===
> Model Family:     Hitachi Deskstar 7K2000
> Device Model:     Hitachi HDS722020ALA330
> Serial Number:    JK1171YAGAD8LS
> LU WWN Device Id: 5 000cca 221c4b9cc
> Firmware Version: JKAOA20N
> User Capacity:    2,000,398,934,016 bytes [2.00 TB]
> Sector Size:      512 bytes logical/physical
> Rotation Rate:    7200 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ATA8-ACS T13/1699-D revision 4
> SATA Version is:  SATA 2.6, 3.0 Gb/s
> Local Time is:    Tue Feb 10 16:45:31 2015 EST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is:   Disabled
> APM feature is:   Disabled
> Rd look-ahead is: Enabled
> Write cache is:   Enabled
> ATA Security is:  Disabled, NOT FROZEN [SEC1]
> Wt Cache Reorder: Enabled
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> General SMART Values:
> Offline data collection status:  (0x84) Offline data collection activity
>                                         was suspended by an interrupting command from host.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                (21007) seconds.
> Offline data collection
> capabilities:                    (0x5b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         No Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   1) minutes.
> Extended self-test routine
> recommended polling time:        ( 350) minutes.
> SCT capabilities:              (0x003d) SCT Status supported.
>                                         SCT Error Recovery Control supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
>   2 Throughput_Performance  P-S---   134   134   054    -    98
>   3 Spin_Up_Time            POS---   137   137   024    -    619 (Average 439)
>   4 Start_Stop_Count        -O--C-   100   100   000    -    207
>   5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
>   7 Seek_Error_Rate         PO-R--   100   100   067    -    0
>   8 Seek_Time_Performance   P-S---   112   112   020    -    39
>   9 Power_On_Hours          -O--C-   094   094   000    -    44002
>  10 Spin_Retry_Count        PO--C-   100   100   060    -    0
>  12 Power_Cycle_Count       -O--CK   100   100   000    -    207
> 192 Power-Off_Retract_Count -O--CK   099   099   000    -    1267
> 193 Load_Cycle_Count        -O--C-   099   099   000    -    1267
> 194 Temperature_Celsius     -O----   181   181   000    -    33 (Min/Max 20/53)
> 196 Reallocated_Event_Count -O--CK   100   100   000    -    0
> 197 Current_Pending_Sector  -O---K   100   100   000    -    0
> 198 Offline_Uncorrectable   ---R--   100   100   000    -    0
> 199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    9
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
> General Purpose Log Directory Version 1
> SMART           Log Directory Version 1 [multi-sector log support]
> Address    Access  R/W   Size  Description
> 0x00       GPL,SL  R/O      1  Log Directory
> 0x01           SL  R/O      1  Summary SMART error log
> 0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
> 0x04       GPL     R/O      7  Device Statistics log
> 0x06           SL  R/O      1  SMART self-test log
> 0x07       GPL     R/O      1  Extended self-test log
> 0x09           SL  R/W      1  Selective self-test log
> 0x10       GPL     R/O      1  NCQ Command Error log
> 0x11       GPL     R/O      1  SATA Phy Event Counters
> 0x20       GPL     R/O      1  Streaming performance log [OBS-8]
> 0x21       GPL     R/O      1  Write stream error log
> 0x22       GPL     R/O      1  Read stream error log
> 0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
> 0xe0       GPL,SL  R/W      1  SCT Command/Status
> 0xe1       GPL,SL  R/W      1  SCT Data Transfer
> SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
> Device Error Count: 10 (device log contains only the most recent 4 errors)
>         CR     = Command Register
>         FEATR  = Features Register
>         COUNT  = Count (was: Sector Count) Register
>         LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
>         LH     = LBA High (was: Cylinder High) Register    ]   LBA
>         LM     = LBA Mid (was: Cylinder Low) Register      ] Register
>         LL     = LBA Low (was: Sector Number) Register     ]
>         DV     = Device (was: Device/Head) Register
>         DC     = Device Control Register
>         ER     = Error register
>         ST     = Status register
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
> Error 10 [1] occurred at disk power-on lifetime: 1655 hours (68 days + 23 hours)
>   When the command that caused the error occurred, the device was active or idle.
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   84 -- 51 01 28 00 00 50 83 5d e8 00 00  Error: ICRC, ABRT 296 sectors at LBA = 0x50835de8 = 1350786536
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   25 00 00 02 a8 00 00 50 83 5c 68 e0 08 23d+05:05:37.425  READ DMA EXT
>   25 00 00 03 68 00 00 50 83 59 00 e0 08 23d+05:05:37.413  READ DMA EXT
>   25 00 00 01 00 00 00 50 83 58 00 e0 08 23d+05:05:37.409  READ DMA EXT
>   25 00 00 00 f0 00 00 50 83 57 10 e0 08 23d+05:05:37.405  READ DMA EXT
>   25 00 00 02 a0 00 00 50 83 54 70 e0 08 23d+05:05:37.352  READ DMA EXT
> Error 9 [0] occurred at disk power-on lifetime: 1654 hours (68 days + 22 hours)
>   When the command that caused the error occurred, the device was active or idle.
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   84 -- 51 00 90 00 00 4e eb 15 70 00 00  Error: ICRC, ABRT 144 sectors at LBA = 0x4eeb1570 = 1324029296
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   25 00 00 01 00 00 00 4e eb 15 00 ee 08 23d+04:47:42.788  READ DMA EXT
>   25 00 00 02 28 00 00 4e eb 12 d8 ee 08 23d+04:47:42.713  READ DMA EXT
>   25 00 00 03 d8 00 00 4e eb 0f 00 ee 08 23d+04:47:42.698  READ DMA EXT
>   25 00 00 01 00 00 00 4e eb 0e 00 ee 08 23d+04:47:42.694  READ DMA EXT
>   25 00 00 01 00 00 00 4e eb 0d 00 ee 08 23d+04:47:42.691  READ DMA EXT
> Error 8 [3] occurred at disk power-on lifetime: 1654 hours (68 days + 22 hours)
>   When the command that caused the error occurred, the device was active or idle.
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   84 -- 51 00 28 00 00 36 08 f1 d8 00 00  Error: ICRC, ABRT 40 sectors at LBA = 0x3608f1d8 = 906555864
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   25 00 00 00 f8 00 00 36 08 f1 08 e6 08 23d+00:06:40.966  READ DMA EXT
>   25 00 00 02 78 00 00 36 08 ee 90 e6 08 23d+00:06:40.914  READ DMA EXT
>   25 00 00 03 90 00 00 36 08 eb 00 e6 08 23d+00:06:40.900  READ DMA EXT
>   25 00 00 01 00 00 00 36 08 ea 00 e6 08 23d+00:06:40.896  READ DMA EXT
>   25 00 00 00 f8 00 00 36 08 e9 08 e6 08 23d+00:06:40.893  READ DMA EXT
> Error 7 [2] occurred at disk power-on lifetime: 1654 hours (68 days + 22 hours)
>   When the command that caused the error occurred, the device was active or idle.
>   After command completion occurred, registers were:
>   ER -- ST COUNT  LBA_48  LH LM LL DV DC
>   -- -- -- == -- == == == -- -- -- -- --
>   84 -- 51 01 28 00 00 33 d1 bb 40 00 00  Error: ICRC, ABRT 296 sectors at LBA = 0x33d1bb40 = 869382976
>   Commands leading to the command that caused the error were:
>   CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
>   -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
>   25 00 00 03 68 00 00 33 d1 b9 00 e3 08 22d+23:42:04.107  READ DMA EXT
>   25 00 00 01 00 00 00 33 d1 b8 00 e3 08 22d+23:42:04.103  READ DMA EXT
>   25 00 00 00 f0 00 00 33 d1 b7 10 e3 08 22d+23:42:04.099  READ DMA EXT
>   25 00 00 02 b0 00 00 33 d1 b4 60 e3 08 22d+23:42:04.022  READ DMA EXT
>   25 00 00 03 60 00 00 33 d1 b1 00 e3 08 22d+23:42:04.009  READ DMA EXT
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> SCT Status Version:                  3
> SCT Version (vendor specific):       256 (0x0100)
> SCT Support Level:                   1
> Device State:                        SMART Off-line Data Collection executing in background (4)
> Current Temperature:                    33 Celsius
> Power Cycle Min/Max Temperature:     27/33 Celsius
> Lifetime    Min/Max Temperature:     20/53 Celsius
> Under/Over Temperature Limit Count:   0/0
> SCT Temperature History Version:     2
> Temperature Sampling Period:         1 minute
> Temperature Logging Interval:        1 minute
> Min/Max recommended Temperature:      0/60 Celsius
> Min/Max Temperature Limit:           -40/70 Celsius
> Temperature History Size (Index):    128 (81)
> Index    Estimated Time   Temperature Celsius
>   82    2015-02-10 14:38    41  **********************
>  ...    ..(113 skipped).    ..  **********************
>   68    2015-02-10 16:32    41  **********************
>   69    2015-02-10 16:33     ?  -
>   70    2015-02-10 16:34    28  *********
>   71    2015-02-10 16:35    28  *********
>   72    2015-02-10 16:36    29  **********
>   73    2015-02-10 16:37    29  **********
>   74    2015-02-10 16:38    30  ***********
>   75    2015-02-10 16:39    30  ***********
>   76    2015-02-10 16:40    31  ************
>   77    2015-02-10 16:41    31  ************
>   78    2015-02-10 16:42    32  *************
>   79    2015-02-10 16:43    32  *************
>   80    2015-02-10 16:44    33  **************
>   81    2015-02-10 16:45    33  **************
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled
> Device Statistics (GP Log 0x04)
> Page Offset Size         Value  Description
>   1  =====  =                =  == General Statistics (rev 1) ==
>   1  0x008  4              207  Lifetime Power-On Resets
>   1  0x010  4            44002  Power-on Hours
>   1  0x018  6      19676641503  Logical Sectors Written
>   1  0x020  6         47285021  Number of Write Commands
>   1  0x028  6    4518358603939  Logical Sectors Read
>   1  0x030  6       5982270826  Number of Read Commands
>   3  =====  =                =  == Rotating Media Statistics (rev 1) ==
>   3  0x008  4            43993  Spindle Motor Power-on Hours
>   3  0x010  4            43993  Head Flying Hours
>   3  0x018  4             1267  Head Load Events
>   3  0x020  4                0  Number of Reallocated Logical Sectors
>   3  0x028  4               14  Read Recovery Attempts
>   3  0x030  4                1  Number of Mechanical Start Failures
>   4  =====  =                =  == General Errors Statistics (rev 1) ==
>   4  0x008  4                0  Number of Reported Uncorrectable Errors
>   4  0x010  4              180  Resets Between Cmd Acceptance and Completion
>   5  =====  =                =  == Temperature Statistics (rev 1) ==
>   5  0x008  1               33  Current Temperature
>   5  0x010  1               41~ Average Short Term Temperature
>   5  0x018  1               41~ Average Long Term Temperature
>   5  0x020  1               53  Highest Temperature
>   5  0x028  1               20  Lowest Temperature
>   5  0x030  1               49~ Highest Average Short Term Temperature
>   5  0x038  1                0~ Lowest Average Short Term Temperature
>   5  0x040  1               47~ Highest Average Long Term Temperature
>   5  0x048  1                0~ Lowest Average Long Term Temperature
>   5  0x050  4                0  Time in Over-Temperature
>   5  0x058  1               60  Specified Maximum Operating Temperature
>   5  0x060  4                0  Time in Under-Temperature
>   5  0x068  1                0  Specified Minimum Operating Temperature
>   6  =====  =                =  == Transport Statistics (rev 1) ==
>   6  0x008  4             1957  Number of Hardware Resets
>   6  0x010  4             1773  Number of ASR Events
>   6  0x018  4                9  Number of Interface CRC Errors
>                               |_ ~ normalized value
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x0001  2            0  Command failed due to ICRC error
> 0x0002  2            0  R_ERR response for data FIS
> 0x0005  2            0  R_ERR response for non-data FIS
> 0x0009  2            6  Transition from drive PhyRdy to drive PhyNRdy
> 0x000a  2            4  Device-to-host register FISes sent due to a COMRESET
> 0x000b  2            0  CRC errors within host-to-device FIS
> 0x000d  2            0  Non-CRC errors within host-to-device FIS
>  sudo smartctl -x /dev/sdf
> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-45-generic] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> === START OF INFORMATION SECTION ===
> Model Family:     Hitachi Deskstar 7K2000
> Device Model:     Hitachi HDS722020ALA330
> Serial Number:    JK1171YAGDAD5S
> LU WWN Device Id: 5 000cca 221c59b77
> Firmware Version: JKAOA20N
> User Capacity:    2,000,397,852,160 bytes [2.00 TB]
> Sector Size:      512 bytes logical/physical
> Rotation Rate:    7200 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ATA8-ACS T13/1699-D revision 4
> SATA Version is:  SATA 2.6, 3.0 Gb/s
> Local Time is:    Tue Feb 10 16:46:04 2015 EST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is:   Disabled
> APM feature is:   Disabled
> Rd look-ahead is: Enabled
> Write cache is:   Enabled
> ATA Security is:  Disabled, NOT FROZEN [SEC1]
> Wt Cache Reorder: Enabled
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> General SMART Values:
> Offline data collection status:  (0x84) Offline data collection activity
>                                         was suspended by an interrupting command from host.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                (22917) seconds.
> Offline data collection
> capabilities:                    (0x5b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         No Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   1) minutes.
> Extended self-test routine
> recommended polling time:        ( 382) minutes.
> SCT capabilities:              (0x003d) SCT Status supported.
>                                         SCT Error Recovery Control supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
>   2 Throughput_Performance  P-S---   133   133   054    -    101
>   3 Spin_Up_Time            POS---   134   134   024    -    627 (Average 452)
>   4 Start_Stop_Count        -O--C-   100   100   000    -    203
>   5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
>   7 Seek_Error_Rate         PO-R--   100   100   067    -    0
>   8 Seek_Time_Performance   P-S---   112   112   020    -    39
>   9 Power_On_Hours          -O--C-   094   094   000    -    44006
>  10 Spin_Retry_Count        PO--C-   100   100   060    -    0
>  12 Power_Cycle_Count       -O--CK   100   100   000    -    203
> 192 Power-Off_Retract_Count -O--CK   099   099   000    -    1248
> 193 Load_Cycle_Count        -O--C-   099   099   000    -    1248
> 194 Temperature_Celsius     -O----   193   193   000    -    31 (Min/Max 20/50)
> 196 Reallocated_Event_Count -O--CK   100   100   000    -    0
> 197 Current_Pending_Sector  -O---K   100   100   000    -    0
> 198 Offline_Uncorrectable   ---R--   100   100   000    -    0
> 199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
>                             ||||||_ K auto-keep
>                             |||||__ C event count
>                             ||||___ R error rate
>                             |||____ S speed/performance
>                             ||_____ O updated online
>                             |______ P prefailure warning
> General Purpose Log Directory Version 1
> SMART           Log Directory Version 1 [multi-sector log support]
> Address    Access  R/W   Size  Description
> 0x00       GPL,SL  R/O      1  Log Directory
> 0x01           SL  R/O      1  Summary SMART error log
> 0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
> 0x04       GPL     R/O      7  Device Statistics log
> 0x06           SL  R/O      1  SMART self-test log
> 0x07       GPL     R/O      1  Extended self-test log
> 0x09           SL  R/W      1  Selective self-test log
> 0x10       GPL     R/O      1  NCQ Command Error log
> 0x11       GPL     R/O      1  SATA Phy Event Counters
> 0x20       GPL     R/O      1  Streaming performance log [OBS-8]
> 0x21       GPL     R/O      1  Write stream error log
> 0x22       GPL     R/O      1  Read stream error log
> 0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
> 0xe0       GPL,SL  R/W      1  SCT Command/Status
> 0xe1       GPL,SL  R/W      1  SCT Data Transfer
> SMART Extended Comprehensive Error Log Version: 0 (1 sectors)
> No Errors Logged
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> SCT Status Version:                  3
> SCT Version (vendor specific):       256 (0x0100)
> SCT Support Level:                   1
> Device State:                        SMART Off-line Data Collection executing in background (4)
> Current Temperature:                    31 Celsius
> Power Cycle Min/Max Temperature:     27/31 Celsius
> Lifetime    Min/Max Temperature:     20/50 Celsius
> Under/Over Temperature Limit Count:   0/0
> SCT Temperature History Version:     2
> Temperature Sampling Period:         1 minute
> Temperature Logging Interval:        1 minute
> Min/Max recommended Temperature:      0/60 Celsius
> Min/Max Temperature Limit:           -40/70 Celsius
> Temperature History Size (Index):    128 (47)
> Index    Estimated Time   Temperature Celsius
>   48    2015-02-10 14:39    39  ********************
>  ...    ..( 98 skipped).    ..  ********************
>   19    2015-02-10 16:18    39  ********************
>   20    2015-02-10 16:19    40  *********************
>   21    2015-02-10 16:20    39  ********************
>  ...    ..(  3 skipped).    ..  ********************
>   25    2015-02-10 16:24    39  ********************
>   26    2015-02-10 16:25    38  *******************
>  ...    ..(  6 skipped).    ..  *******************
>   33    2015-02-10 16:32    38  *******************
>   34    2015-02-10 16:33     ?  -
>   35    2015-02-10 16:34    27  ********
>   36    2015-02-10 16:35    28  *********
>   37    2015-02-10 16:36    28  *********
>   38    2015-02-10 16:37    29  **********
>   39    2015-02-10 16:38    29  **********
>   40    2015-02-10 16:39    30  ***********
>  ...    ..(  2 skipped).    ..  ***********
>   43    2015-02-10 16:42    30  ***********
>   44    2015-02-10 16:43    31  ************
>  ...    ..(  2 skipped).    ..  ************
>   47    2015-02-10 16:46    31  ************
> SCT Error Recovery Control:
>            Read: Disabled
>           Write: Disabled
> Device Statistics (GP Log 0x04)
> Page Offset Size         Value  Description
>   1  =====  =                =  == General Statistics (rev 1) ==
>   1  0x008  4              203  Lifetime Power-On Resets
>   1  0x010  4            44006  Power-on Hours
>   1  0x018  6      15872353160  Logical Sectors Written
>   1  0x020  6         39140100  Number of Write Commands
>   1  0x028  6    4462388816379  Logical Sectors Read
>   1  0x030  6       5927428317  Number of Read Commands
>   3  =====  =                =  == Rotating Media Statistics (rev 1) ==
>   3  0x008  4            43997  Spindle Motor Power-on Hours
>   3  0x010  4            43997  Head Flying Hours
>   3  0x018  4             1248  Head Load Events
>   3  0x020  4                0  Number of Reallocated Logical Sectors
>   3  0x028  4               32  Read Recovery Attempts
>   3  0x030  4                0  Number of Mechanical Start Failures
>   4  =====  =                =  == General Errors Statistics (rev 1) ==
>   4  0x008  4                0  Number of Reported Uncorrectable Errors
>   4  0x010  4              192  Resets Between Cmd Acceptance and Completion
>   5  =====  =                =  == Temperature Statistics (rev 1) ==
>   5  0x008  1               31  Current Temperature
>   5  0x010  1               37~ Average Short Term Temperature
>   5  0x018  1               35~ Average Long Term Temperature
>   5  0x020  1               50  Highest Temperature
>   5  0x028  1               20  Lowest Temperature
>   5  0x030  1               44~ Highest Average Short Term Temperature
>   5  0x038  1                0~ Lowest Average Short Term Temperature
>   5  0x040  1               42~ Highest Average Long Term Temperature
>   5  0x048  1                0~ Lowest Average Long Term Temperature
>   5  0x050  4                0  Time in Over-Temperature
>   5  0x058  1               60  Specified Maximum Operating Temperature
>   5  0x060  4                0  Time in Under-Temperature
>   5  0x068  1                0  Specified Minimum Operating Temperature
>   6  =====  =                =  == Transport Statistics (rev 1) ==
>   6  0x008  4             1947  Number of Hardware Resets
>   6  0x010  4             1765  Number of ASR Events
>   6  0x018  4                0  Number of Interface CRC Errors
>                               |_ ~ normalized value
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x0001  2            0  Command failed due to ICRC error
> 0x0002  2            0  R_ERR response for data FIS
> 0x0005  2            0  R_ERR response for non-data FIS
> 0x0009  2            6  Transition from drive PhyRdy to drive PhyNRdy
> 0x000a  2            4  Device-to-host register FISes sent due to a COMRESET
> 0x000b  2            0  CRC errors within host-to-device FIS
> 0x000d  2            0  Non-CRC errors within host-to-device FIS



Adam:

I actually read that exact stackexchange article about using --replace,
but I had neither kernel 3.2+ nor mdadm 3.3+, which seemed to be
required. I suppose I could have booted a livecd with a more recent
kernel, but sadly I did not.

Thank you both for your help,

Kyle L

On Tue, Feb 10, 2015 at 8:51 AM, Phil Turmel <philip@turmel.org> wrote:
> Hi Kyle,
>
> Your symptoms look like classic timeout mismatch.  Details interleaved.
>
> On 02/10/2015 02:35 AM, Adam Goryachev wrote:
>
>> There are other people who will jump in and help you with your problem,
>> but I'll add a couple of pointers while you are waiting. See below.
>
>> On 10/02/15 15:20, Kyle Logue wrote:
>>> Hey all:
>>>
>>> I have a 5 disk software raid5 that was working fine until I decided
>>> to swap out an old disk with a new one.
>>>
>>> mdadm /dev/md0 --add /dev/sda1
>>> mdadm /dev/md0 --fail /dev/sde1
>
> As Adam pointed out, you should have used --replace, but you probably
> wouldn't have made it through the replace function anyways.
>
>>> At this point it started automatically rebuilding the array.
>>> About 60%? of the way in it stops and I see a lot of this repeated in
>>> my dmesg:
>>>
>>> [Mon Feb  9 18:06:48 2015] ata5.00: exception Emask 0x0 SAct 0x0 SErr
>>> 0x0 action 0x6 frozen
>>> [Mon Feb  9 18:06:48 2015] ata5.00: failed command: SMART
>>> [Mon Feb  9 18:06:48 2015] ata5.00: cmd
>>> b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 7
>>> [Mon Feb  9 18:06:48 2015]          res
>>> 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
>                                                  ^^^^^^^^^
> Smoking gun.
>
>>> [Mon Feb  9 18:06:48 2015] ata5.00: status: { DRDY }
>>> [Mon Feb  9 18:06:48 2015] ata5: hard resetting link
>>> [Mon Feb  9 18:06:58 2015] ata5: softreset failed (1st FIS failed)
>>> [Mon Feb  9 18:06:58 2015] ata5: hard resetting link
>>> [Mon Feb  9 18:07:08 2015] ata5: softreset failed (1st FIS failed)
>>> [Mon Feb  9 18:07:08 2015] ata5: hard resetting link
>>> [Mon Feb  9 18:07:12 2015] ata5: SATA link up 1.5 Gbps (SStatus 113
>>> SControl 310)
>>> [Mon Feb  9 18:07:12 2015] ata5.00: configured for UDMA/33
>>> [Mon Feb  9 18:07:12 2015] ata5: EH complete
>
> Notice that after a timeout error, the drive is unresponsive for several
> more seconds -- about 24 in your case.
>
>> ....  read about timing mismatches
>> between the kernel and the hard drive, and how to solve that. There was
>> another post earlier today with some links to specific posts that will
>> be helpful (check the online archive).
>
> That would have been me.  Start with this link for a description of what
> you are experiencing:
>
> http://marc.info/?l=linux-raid&m=135811522817345&w=1
>
> First, you need to protect yourself from timeout mismatch due to the use
> of desktop-grade drives.  (Enterprise and raid-rated drives don't have
> this problem.)
>
> { If you were stuck in the middle of a replace and you had just
> worked-around your timeout problem, it would likely continue and
> complete.  You've lost that opportunity. }
>
> Show us the output of "smartctl -x" for all of your drives if you'd like
> advice on your particular drives.  (Pasted inline is preferred.)
>
> Second, you need to find and overwrite (with zeros) the bad sectors on
> your drives.  Or ddrescue to a complete set of replacement drives and
> assemble those.
>
> Third, you need to set up a cron job to scrub your array regularly to
> clean out UREs before they accumulate beyond MD's ability to handle it
> (20 read errors in an hour, 10 per hour sustained).
>
> Phil


* Re: Wierd: Degrading while recovering raid5
  2015-02-10 21:50     ` Kyle Logue
@ 2015-02-11  2:14       ` Phil Turmel
  0 siblings, 0 replies; 9+ messages in thread
From: Phil Turmel @ 2015-02-11  2:14 UTC (permalink / raw)
  To: Kyle Logue, linux-raid

Hi Kyle,

{ Convention on kernel.org lists is reply-to-all, trim replies, and
either bottom post or interleave }

On 02/10/2015 04:50 PM, Kyle Logue wrote:
> Phil:
> 
> Thanks for your detailed response. That link does seem to describe my
> problem and I do understand that desktop grade drives are sub-optimal.
> It was many years ago when I first set up this array on my home
> theater pc.  Until now I had no idea about the cron job - I'll make
> sure to implement that. I am preparing to move to 6 TB disks sometime
> soon and I'll definitely go enterprise this time.
> 
> Regarding the drive timeout: I understand that I need to increase it
> from 30 seconds to something larger (2+ min) but am unaware how to do
> this. Is it a kernel variable? I'll keep googling but this seems like
> it's what's going to save me.
> 
> tl;dr: How do I change the drive timeout?

Put something like this in /etc/rc.local or wherever your distro suggests:

for x in /sys/block/sd[a-f]/device/timeout ; do
  echo 180 > $x
done

Adjust the [a-f] to suit your needs, and apply this only to drives that
are neither raid-rated nor scterc-capable.
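
A minimal sketch of the same idea that tries scterc first and only falls
back to the long kernel timeout when the drive refuses it. It assumes
smartctl exits non-zero when the drive rejects the setting, which is how
the commonly circulated versions of this loop behave. SYSFS_ROOT and
SMARTCTL are hypothetical knobs added here so the logic can be dry-run;
they are not part of the loop above.

```shell
#!/bin/sh
# Sketch only: prefer SCT ERC (7.0 s) and fall back to a 180 s kernel
# timeout. SYSFS_ROOT and SMARTCTL are illustrative overrides.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/block}"
SMARTCTL="${SMARTCTL:-smartctl}"

set_drive_timeouts() {
    for dev in "$@"; do
        # scterc values are in tenths of a second: 70 means 7.0 s.
        if "$SMARTCTL" -l scterc,70,70 "/dev/$dev" >/dev/null 2>&1; then
            echo "$dev: scterc enabled, kernel timeout left alone"
        else
            # Drive has no usable scterc: give it time to finish its
            # own internal error recovery before the kernel resets it.
            echo 180 > "$SYSFS_ROOT/$dev/device/timeout"
            echo "$dev: no scterc, kernel timeout set to 180"
        fi
    done
}
```

Called as "set_drive_timeouts sda sdb sdc" (as root) from the same boot
script. The scterc setting does not survive a power cycle on most
drives, so it has to be reapplied at every boot. The smartctl -x output
posted earlier in the thread lists SCT Error Recovery Control as
supported but disabled, so those drives would likely take the first
branch.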

Phil


* Re: Wierd: Degrading while recovering raid5
  2015-02-11 22:12   ` Kyle Logue
@ 2015-02-12  0:15     ` Phil Turmel
  0 siblings, 0 replies; 9+ messages in thread
From: Phil Turmel @ 2015-02-12  0:15 UTC (permalink / raw)
  To: Kyle Logue; +Cc: linux-raid

On 02/11/2015 05:12 PM, Kyle Logue wrote:
> Good news, Phil. Under the hypothesis that the new disk that I added
> didn't fully replace my sde, I omitted it from my assemble. The array
> went full UUUUU, then I echo'd check > /sys/block/md0/md/sync_action
> 
> Much later it kicked out the faulty disk (previously sdc) and now I
> have a _UUUU.
> 
> So hopefully this is the final question, but should I just evacuate as
> much data as possible immediately? Or try to add another spare and
> rebuild?

So long as you haven't mounted it yet, I suggest you do another forced
assembly to get back to UUUUU, then kick off another check.  When many
UREs are allowed to accumulate, mdadm can hit its read error rate limit
and kick the drive.  If it hasn't been mounted, you can keep doing it
until you get through the entire check.
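
The loop described above could be sketched roughly as follows. The
member list in the comments is an assumption taken from this thread (the
new disk sda1 left out); check it against your own --examine output
before running anything. MD_SYSFS is a hypothetical override so the
status function can be tried against a fake directory.

```shell
#!/bin/sh
# Sketch only: report whether a check is running and how far along it is.
MD_SYSFS="${MD_SYSFS:-/sys/block/md0/md}"

check_status() {
    action=$(cat "$MD_SYSFS/sync_action")
    if [ "$action" = "idle" ]; then
        # Nothing running: show what the last check counted.
        echo "idle, mismatch_cnt=$(cat "$MD_SYSFS/mismatch_cnt")"
    else
        # sync_completed reads as "<sectors done> / <sectors total>".
        echo "$action, sync_completed=$(cat "$MD_SYSFS/sync_completed")"
    fi
}

# Manual steps, as root, repeated each time a drive gets kicked
# (device names are assumptions from this thread):
#   mdadm --stop /dev/md0
#   mdadm --assemble --force /dev/md0 /dev/sd[bcdef]1
#   echo check > /sys/block/md0/md/sync_action
```

Running check_status now and then (or watching /proc/mdstat) shows
whether the check is still going or has dropped back to idle.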

But, you also had misaligned partitions.  If sdcN is one of them, the
above won't work, and you should get your backups ASAP.  And then make a
new array from scratch.

If you do succeed in completing a check scrub, you can use --replace to
put the array on properly aligned partitions.

Phil


* Re: Wierd: Degrading while recovering raid5
  2015-02-11 14:28 ` Phil Turmel
@ 2015-02-11 22:12   ` Kyle Logue
  2015-02-12  0:15     ` Phil Turmel
  0 siblings, 1 reply; 9+ messages in thread
From: Kyle Logue @ 2015-02-11 22:12 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

Good news, Phil. Under the hypothesis that the new disk that I added
didn't fully replace my sde, I omitted it from my assemble. The array
went full UUUUU, then I echo'd check > /sys/block/md0/md/sync_action

Much later it kicked out the faulty disk (previously sdc) and now I
have a _UUUU.

So hopefully this is the final question, but should I just evacuate as
much data as possible immediately? Or try to add another spare and
rebuild?

Thanks for the help,
Kyle L


* Re: Wierd: Degrading while recovering raid5
  2015-02-11  6:23 Kyle Logue
@ 2015-02-11 14:28 ` Phil Turmel
  2015-02-11 22:12   ` Kyle Logue
  0 siblings, 1 reply; 9+ messages in thread
From: Phil Turmel @ 2015-02-11 14:28 UTC (permalink / raw)
  To: Kyle Logue, linux-raid

On 02/11/2015 01:23 AM, Kyle Logue wrote:
> Phil:
> 
> For a while I really thought that was going to work. I swapped out the
> sata cable and set the timeout to 10 minutes. At about 70% rebuilt I
> got the following dmesg which seems to indicate the death of my sdc
> drive.

Ten minutes is way overkill.  The three minutes I suggested is already
extreme, and most drives will only need two minutes.

> Here is my question: I still have this sde that I manually failed and
> that hasn't been touched. Can I force re-add it to the array and just
> take the data corruption hit?

No, sde is being replaced by sda, so it's no help for sdc.  If you put
it back into service, it would have to take the role of sda.  (Forced
assembly, though, not a re-add.)  If the array was in use during your
first replacement attempt, the differences could be substantial.

I'm not sure how MD will handle the rebuild status in this case.
Hopefully, it will take you back to a working, non-rebuilding array.  If
you try this, you should test with a set of overlay devices as described
on the wiki.
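
For reference, the overlay approach boils down to a device-mapper
snapshot per member, with all writes landing in a sparse file instead of
on the real disk. A minimal sketch, assuming a 50G sparse file is enough
for a trial assembly; the function names and the "ovl-" prefix are made
up for this example and are not the wiki's own script.

```shell
#!/bin/sh
# Sketch only: one copy-on-write overlay per array member.

overlay_table() {
    # $1 = origin device, $2 = loop device backing the overlay file,
    # $3 = origin size in 512-byte sectors (from blockdev --getsz)
    echo "0 $3 snapshot $1 $2 P 8"
}

make_overlay() {
    dev=$1                               # e.g. /dev/sdc1
    name=$(basename "$dev")
    truncate -s 50G "overlay-$name"      # sparse file, size is a guess
    loop=$(losetup -f --show "overlay-$name")
    size=$(blockdev --getsz "$dev")
    # dmsetup reads the table from stdin when --table is not given.
    overlay_table "$dev" "$loop" "$size" | dmsetup create "ovl-$name"
}
```

After "make_overlay /dev/sdc1" (as root) the overlay shows up as
/dev/mapper/ovl-sdc1, and the trial assembly is done against the
/dev/mapper devices instead of the real partitions. Undoing it is
"dmsetup remove ovl-sdc1", "losetup -d" on the loop device, and deleting
the overlay file.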

> I'd rather have to revert part of my data than all of it. The drive
> counts are significantly different now, but I haven't mounted the
> drives since the beginning. I haven't tried it but I saw someone else
> online get a message like 'raid has failed so using --add cannot work
> and might destroy data'. Is there a force add? What are my chances?

The right answer here depends on whether the array was in use.  If it
wasn't, I'd try to use sde in place of sda to get back to a
non-rebuilding array.  If the test run succeeds, undo the overlays and
do it for real.  Then zero the superblock on sda, add it back as a
spare, then --replace sdc.

If the trial doesn't work (or the changes to sda are too great), the
alternative is to ddrescue sdc onto a spare disk (sde would be available
at that point, if it's useless for assembly).  Then manually reassemble
and let the rebuild finish.  If you run into more errors on the other
members, you may have to repeat the ddrescue process for each.
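
A usual two-pass GNU ddrescue run for this kind of copy looks roughly
like the sketch below. The helper only prints the commands so they can
be reviewed first; the device names and map file path in the usage line
are assumptions, and -f will overwrite whatever the target is, so
double-check both before running them for real.

```shell
#!/bin/sh
# Sketch only: print a two-pass ddrescue plan instead of running it.

rescue_cmds() {
    # $1 = failing source, $2 = target disk, $3 = map file
    # Pass 1: copy everything that reads easily, skip the scraping phase.
    echo "ddrescue -f -n $1 $2 $3"
    # Pass 2: go back for the bad areas with direct access, 3 retries.
    echo "ddrescue -f -d -r3 $1 $2 $3"
}
```

For example, "rescue_cmds /dev/sdc /dev/sde /root/sdc.map" prints the
two commands. Keeping the same map file across both passes is what lets
the second pass pick up only the areas the first one could not read.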

Whichever path you take, when done, consider switching to raid6 using
the extra drive.  That's far more secure than a hot spare (if a little
slower).

I did notice one other issue in your posted dmesg:  misaligned
partitions.  This cripples MD's ability to fix UREs on the fly or during
a scrub.  You *must* rebuild your array with properly aligned partitions
before you quit.
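
A quick way to see which partitions are affected is to look at each
partition's start sector in sysfs. This sketch assumes "aligned" means a
start on a 4 KiB boundary (a multiple of 8 sectors of 512 bytes); pass a
different second argument to is_aligned, e.g. 2048 for 1 MiB, if you
want a stricter rule. SYSFS_ROOT is a hypothetical override for trying
it out against a fake tree.

```shell
#!/bin/sh
# Sketch only: flag partitions whose start sector is not 4 KiB aligned.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/block}"

is_aligned() {
    # $1 = start sector in 512-byte units, $2 = alignment in sectors
    [ $(( $1 % ${2:-8} )) -eq 0 ]
}

report_alignment() {
    for part in "$@"; do                 # e.g. sdc1
        disk=${part%%[0-9]*}             # sdc1 -> sdc
        start=$(cat "$SYSFS_ROOT/$disk/$part/start")
        if is_aligned "$start"; then
            echo "$part: start=$start aligned"
        else
            echo "$part: start=$start MISALIGNED"
        fi
    done
}
```

"report_alignment sda1 sdb1 sdc1 sdd1 sde1 sdf1" prints one line per
member. A start sector of 63 (the old DOS default) shows up as
MISALIGNED; 2048 shows up as aligned.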

Phil


* Re: Wierd: Degrading while recovering raid5
@ 2015-02-11  6:23 Kyle Logue
  2015-02-11 14:28 ` Phil Turmel
  0 siblings, 1 reply; 9+ messages in thread
From: Kyle Logue @ 2015-02-11  6:23 UTC (permalink / raw)
  To: linux-raid

Phil:

For a while I really thought that was going to work. I swapped out the
sata cable and set the timeout to 10 minutes. At about 70% rebuilt I
got the following dmesg which seems to indicate the death of my sdc
drive.

Here is my question: I still have this sde that I manually failed and
that hasn't been touched. Can I force re-add it to the array and just
take the data corruption hit?

I'd rather have to revert part of my data than all of it. The drive
counts are significantly different now, but I haven't mounted the
drives since the beginning. I haven't tried it but I saw someone else
online get a message like 'raid has failed so using --add cannot work
and might destroy data'. Is there a force add? What are my chances?

The dmesg in question. I started rebuilding at 20:24.

[Tue Feb 10 20:23:59 2015] md: md0 stopped.
[Tue Feb 10 20:23:59 2015] md: unbind<sdf1>
[Tue Feb 10 20:23:59 2015] md: export_rdev(sdf1)
[Tue Feb 10 20:23:59 2015] md: unbind<sde1>
[Tue Feb 10 20:23:59 2015] md: export_rdev(sde1)
[Tue Feb 10 20:23:59 2015] md: unbind<sdd1>
[Tue Feb 10 20:23:59 2015] md: export_rdev(sdd1)
[Tue Feb 10 20:23:59 2015] md: unbind<sdc1>
[Tue Feb 10 20:23:59 2015] md: export_rdev(sdc1)
[Tue Feb 10 20:23:59 2015] md: unbind<sdb1>
[Tue Feb 10 20:23:59 2015] md: export_rdev(sdb1)
[Tue Feb 10 20:23:59 2015] md: unbind<sda1>
[Tue Feb 10 20:23:59 2015] md: export_rdev(sda1)
[Tue Feb 10 20:24:59 2015] md: md0 stopped.
[Tue Feb 10 20:24:59 2015] md: bind<sdd1>
[Tue Feb 10 20:24:59 2015] md: bind<sde1>
[Tue Feb 10 20:24:59 2015] md: bind<sdf1>
[Tue Feb 10 20:24:59 2015] md: bind<sdb1>
[Tue Feb 10 20:24:59 2015] md: bind<sda1>
[Tue Feb 10 20:24:59 2015] md: bind<sdc1>
[Tue Feb 10 20:24:59 2015] md: kicking non-fresh sde1 from array!
[Tue Feb 10 20:24:59 2015] md: unbind<sde1>
[Tue Feb 10 20:24:59 2015] md: export_rdev(sde1)
[Tue Feb 10 20:24:59 2015] md/raid:md0: device sdc1 operational as raid disk 0
[Tue Feb 10 20:24:59 2015] md/raid:md0: device sdb1 operational as raid disk 4
[Tue Feb 10 20:24:59 2015] md/raid:md0: device sdf1 operational as raid disk 3
[Tue Feb 10 20:24:59 2015] md/raid:md0: device sdd1 operational as raid disk 1
[Tue Feb 10 20:24:59 2015] md/raid:md0: allocated 0kB
[Tue Feb 10 20:24:59 2015] md/raid:md0: raid level 5 active with 4 out of 5 devices, algorithm 2
[Tue Feb 10 20:24:59 2015] RAID conf printout:
[Tue Feb 10 20:24:59 2015]  --- level:5 rd:5 wd:4
[Tue Feb 10 20:24:59 2015]  disk 0, o:1, dev:sdc1
[Tue Feb 10 20:24:59 2015]  disk 1, o:1, dev:sdd1
[Tue Feb 10 20:24:59 2015]  disk 3, o:1, dev:sdf1
[Tue Feb 10 20:24:59 2015]  disk 4, o:1, dev:sdb1
[Tue Feb 10 20:24:59 2015] md0: Warning: Device sda1 is misaligned
[Tue Feb 10 20:24:59 2015] md0: Warning: Device sdb1 is misaligned
[Tue Feb 10 20:24:59 2015] md0: Warning: Device sdb1 is misaligned
[Tue Feb 10 20:24:59 2015] md0: detected capacity change from 0 to 8001584889856
[Tue Feb 10 20:24:59 2015] RAID conf printout:
[Tue Feb 10 20:24:59 2015]  --- level:5 rd:5 wd:4
[Tue Feb 10 20:24:59 2015]  disk 0, o:1, dev:sdc1
[Tue Feb 10 20:24:59 2015]  disk 1, o:1, dev:sdd1
[Tue Feb 10 20:24:59 2015]  disk 2, o:1, dev:sda1
[Tue Feb 10 20:24:59 2015]  disk 3, o:1, dev:sdf1
[Tue Feb 10 20:24:59 2015]  disk 4, o:1, dev:sdb1
[Tue Feb 10 20:24:59 2015] md: recovery of RAID array md0
[Tue Feb 10 20:24:59 2015] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[Tue Feb 10 20:24:59 2015] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[Tue Feb 10 20:24:59 2015] md: using 128k window, over a total of 1953511936k.
[Tue Feb 10 20:24:59 2015]  md0: unknown partition table
[Tue Feb 10 20:35:34 2015] perf samples too long (2505 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[Wed Feb 11 01:02:15 2015] ata5.00: exception Emask 0x0 SAct 0x30 SErr 0x0 action 0x0
[Wed Feb 11 01:02:15 2015] ata5.00: irq_stat 0x40000008
[Wed Feb 11 01:02:15 2015] ata5.00: failed command: READ FPDMA QUEUED
[Wed Feb 11 01:02:15 2015] ata5.00: cmd 60/00:20:18:1d:1c/04:00:a4:00:00/40 tag 4 ncq 524288 in
[Wed Feb 11 01:02:15 2015]          res 41/40:00:e8:1d:1c/00:04:a4:00:00/00 Emask 0x409 (media error) <F>
[Wed Feb 11 01:02:15 2015] ata5.00: status: { DRDY ERR }
[Wed Feb 11 01:02:15 2015] ata5.00: error: { UNC }
[Wed Feb 11 01:02:15 2015] ata5.00: configured for UDMA/133
[Wed Feb 11 01:02:15 2015] sd 4:0:0:0: [sdc] Unhandled sense code
[Wed Feb 11 01:02:15 2015] sd 4:0:0:0: [sdc]
[Wed Feb 11 01:02:15 2015] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Feb 11 01:02:15 2015] sd 4:0:0:0: [sdc]
[Wed Feb 11 01:02:15 2015] Sense Key : Medium Error [current] [descriptor]
[Wed Feb 11 01:02:15 2015] Descriptor sense data with sense descriptors (in hex):
[Wed Feb 11 01:02:15 2015]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[Wed Feb 11 01:02:15 2015]         a4 1c 1d e8
[Wed Feb 11 01:02:15 2015] sd 4:0:0:0: [sdc]
[Wed Feb 11 01:02:15 2015] Add. Sense: Unrecovered read error - auto reallocate failed
[Wed Feb 11 01:02:15 2015] sd 4:0:0:0: [sdc] CDB:
[Wed Feb 11 01:02:15 2015] Read(10): 28 00 a4 1c 1d 18 00 04 00 00
[Wed Feb 11 01:02:15 2015] end_request: I/O error, dev sdc, sector 2753306088
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304040 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304048 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304056 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304064 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304072 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304080 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304088 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304096 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304104 on sdc1).
[Wed Feb 11 01:02:15 2015] md/raid:md0: read error not correctable (sector 2753304112 on sdc1).
[Wed Feb 11 01:02:15 2015] ata5: EH complete
[Wed Feb 11 01:02:18 2015] ata5.00: exception Emask 0x0 SAct 0xff80 SErr 0x0 action 0x0
[Wed Feb 11 01:02:18 2015] ata5.00: irq_stat 0x40000008
[Wed Feb 11 01:02:18 2015] ata5.00: failed command: READ FPDMA QUEUED
[Wed Feb 11 01:02:18 2015] ata5.00: cmd 60/80:38:e8:1d:1c/00:00:a4:00:00/40 tag 7 ncq 65536 in
[Wed Feb 11 01:02:18 2015]          res 41/40:80:e8:1d:1c/00:00:a4:00:00/00 Emask 0x409 (media error) <F>
[Wed Feb 11 01:02:18 2015] ata5.00: status: { DRDY ERR }
[Wed Feb 11 01:02:18 2015] ata5.00: error: { UNC }
[Wed Feb 11 01:02:18 2015] ata5.00: configured for UDMA/133
[Wed Feb 11 01:02:18 2015] sd 4:0:0:0: [sdc] Unhandled sense code
[Wed Feb 11 01:02:18 2015] sd 4:0:0:0: [sdc]
[Wed Feb 11 01:02:18 2015] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Wed Feb 11 01:02:18 2015] sd 4:0:0:0: [sdc]
[Wed Feb 11 01:02:18 2015] Sense Key : Medium Error [current] [descriptor]
[Wed Feb 11 01:02:18 2015] Descriptor sense data with sense descriptors (in hex):
[Wed Feb 11 01:02:18 2015]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[Wed Feb 11 01:02:18 2015]         a4 1c 1d e8
[Wed Feb 11 01:02:18 2015] sd 4:0:0:0: [sdc]
[Wed Feb 11 01:02:18 2015] Add. Sense: Unrecovered read error - auto reallocate failed
[Wed Feb 11 01:02:18 2015] sd 4:0:0:0: [sdc] CDB:
[Wed Feb 11 01:02:18 2015] Read(10): 28 00 a4 1c 1d e8 00 00 80 00
[Wed Feb 11 01:02:18 2015] end_request: I/O error, dev sdc, sector 2753306088
[Wed Feb 11 01:02:18 2015] md/raid:md0: Disk failure on sdc1, disabling device.
[Wed Feb 11 01:02:18 2015] md/raid:md0: Operation continuing on 3 devices.
[Wed Feb 11 01:02:18 2015] ata5: EH complete
[Wed Feb 11 01:02:18 2015] md: md0: recovery interrupted.
[Wed Feb 11 01:02:18 2015] RAID conf printout:
[Wed Feb 11 01:02:18 2015]  --- level:5 rd:5 wd:3
[Wed Feb 11 01:02:18 2015]  disk 0, o:0, dev:sdc1
[Wed Feb 11 01:02:18 2015]  disk 1, o:1, dev:sdd1
[Wed Feb 11 01:02:18 2015]  disk 2, o:1, dev:sda1
[Wed Feb 11 01:02:18 2015]  disk 3, o:1, dev:sdf1
[Wed Feb 11 01:02:18 2015]  disk 4, o:1, dev:sdb1
[Wed Feb 11 01:02:18 2015] RAID conf printout:
[Wed Feb 11 01:02:18 2015]  --- level:5 rd:5 wd:3
[Wed Feb 11 01:02:18 2015]  disk 1, o:1, dev:sdd1
[Wed Feb 11 01:02:18 2015]  disk 2, o:1, dev:sda1
[Wed Feb 11 01:02:18 2015]  disk 3, o:1, dev:sdf1
[Wed Feb 11 01:02:18 2015]  disk 4, o:1, dev:sdb1
[Wed Feb 11 01:02:18 2015] RAID conf printout:
[Wed Feb 11 01:02:18 2015]  --- level:5 rd:5 wd:3
[Wed Feb 11 01:02:18 2015]  disk 1, o:1, dev:sdd1
[Wed Feb 11 01:02:18 2015]  disk 2, o:1, dev:sda1
[Wed Feb 11 01:02:18 2015]  disk 3, o:1, dev:sdf1
[Wed Feb 11 01:02:18 2015]  disk 4, o:1, dev:sdb1
[Wed Feb 11 01:02:18 2015] RAID conf printout:
[Wed Feb 11 01:02:18 2015]  --- level:5 rd:5 wd:3
[Wed Feb 11 01:02:18 2015]  disk 1, o:1, dev:sdd1
[Wed Feb 11 01:02:18 2015]  disk 3, o:1, dev:sdf1
[Wed Feb 11 01:02:18 2015]  disk 4, o:1, dev:sdb1
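
The sense data in the log above can be cross-checked by hand. The sketch below turns the four hex bytes `a4 1c 1d e8` into a decimal LBA and compares it with the first sector md reported as not correctable on sdc1. The function name is made up for the example, and reading the difference as the partition's start sector is an inference, since the partition table was not posted.

```shell
#!/bin/sh
# Combine four big-endian hex bytes (as printed in the sense data)
# into a decimal LBA.
lba_from_hex() {
    echo $(( 0x$1$2$3$4 ))
}

disk_lba=$(lba_from_hex a4 1c 1d e8)   # failing sector, whole-disk numbering
md_sector=2753304040                   # first "not correctable" sector on sdc1
echo "disk LBA: $disk_lba"
echo "offset:   $(( disk_lba - md_sector ))"
```

The decoded LBA is 2753306088, matching the `end_request: I/O error, dev sdc` line, and the offset is 2048 sectors. That is consistent with sdc1 starting at sector 2048, so both messages appear to point at the same spot on the platter.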

Thanks again,

Kyle L

On Tue, Feb 10, 2015 at 9:14 PM, Phil Turmel <philip@turmel.org> wrote:
>
> Hi Kyle,
>
> { Convention on kernel.org lists is reply-to-all, trim replies, and
> either bottom post or interleave }
>
> On 02/10/2015 04:50 PM, Kyle Logue wrote:
> > Phil:
> >
> > Thanks for your detailed response. That link does seem to describe my
> > problem and I do understand that desktop grade drives are sub-optimal.
> > It was many years ago when I first set up this array on my home
> > theater pc.  Until now I had no idea about the cron job - I'll make
> > sure to implement that. I am preparing to move to 6 tb disks sometime
> > soon and i'll definitely go enterprise this time.
> >
> > Regarding the drive timeout: I understand that I need to increase it
> > from 30 seconds to something larger (2+ min) but am unaware how to do
> > this. Is it a kernel variable? I'll keep googling but this seems like
> > it's whats going to save me.
> >
> > tl;dr: How do I change the drive timeout?
>
> Put something like this in /etc/rc.local or wherever your distro suggests:
>
> for x in /sys/block/sd[a-f]/device/timeout ; do
>   echo 180 > $x
> done
>
> Where the [a-f] is adjusted to suit your needs, and only for non-raid
> non-scterc drives.
>
> Phil

Thread overview: 9+ messages
2015-02-10  4:20 Wierd: Degrading while recovering raid5 Kyle Logue
2015-02-10  7:35 ` Adam Goryachev
2015-02-10 13:51   ` Phil Turmel
2015-02-10 21:50     ` Kyle Logue
2015-02-11  2:14       ` Phil Turmel
2015-02-11  6:23 Kyle Logue
2015-02-11 14:28 ` Phil Turmel
2015-02-11 22:12   ` Kyle Logue
2015-02-12  0:15     ` Phil Turmel
