* RAID1, changed disk, 2nd has errors ...
@ 2011-08-26 11:46 Stefan G. Weichinger
  2011-08-26 12:01 ` Mathias Burén
  2011-08-26 12:56 ` Robin Hill
  0 siblings, 2 replies; 18+ messages in thread
From: Stefan G. Weichinger @ 2011-08-26 11:46 UTC (permalink / raw)
  To: linux-raid


Please help:

Today I removed a defective hdd out of a RAID1-array and swapped in a
new hdd instead.

Three arrays, to be precise: md[012]

md0 and md1 synced fine, but while md2 was syncing the old sda threw
errors (in sda4):

md/raid1:md2: sda: unrecoverable I/O read error for block 643686144
md: md2: recovery done.

[...]

md/raid1:md2: sda: unrecoverable I/O read error for block 643686272

----

Did the system stop syncing, or does "recovery done" indicate that
md2 was fully recovered BEFORE the system threw sda4 out of the array?

I hope it is the latter!

See:

# mdadm -D /dev/md2
/dev/md2:
        Version : 0.90
  Creation Time : Thu Feb 11 19:40:11 2010
     Raid Level : raid1
     Array Size : 962454080 (917.87 GiB 985.55 GB)
  Used Dev Size : 962454080 (917.87 GiB 985.55 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Fri Aug 26 13:40:55 2011
          State : clean, degraded
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

           UUID : 0ee7bbc7:fc6b0172:d195d856:5f94e963
         Events : 0.1833443

    Number   Major   Minor   RaidDevice State
       0       8        4        0      active sync   /dev/sda4
       1       0        0        1      removed

       2       8       20        -      spare   /dev/sdb4

# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb3[1] sda3[0]
      13679232 blocks [2/2] [UU]

md2 : active raid1 sdb4[2](S) sda4[0]
      962454080 blocks [2/1] [U_]

md0 : active raid1 sdb1[1] sda1[0]
      128384 blocks [2/2] [UU]

unused devices: <none>

----

The system seems to work OK; md2, which is a PV in an LVM volume group, is
there, etc.

I just wonder whether I should somehow re-add sda4 or not touch a thing
until I have a new HDD at hand.

Can/should I somehow test the integrity of md2?

Please help me to relax in this case ...

btw:

Linux version 2.6.36-gentoo-r5
mdadm-3.1.4

Thanks in advance, Stefan!




* Re: RAID1, changed disk, 2nd has errors ...
  2011-08-26 11:46 RAID1, changed disk, 2nd has errors Stefan G. Weichinger
@ 2011-08-26 12:01 ` Mathias Burén
  2011-08-26 12:19   ` Stefan G. Weichinger
  2011-08-26 12:56 ` Robin Hill
  1 sibling, 1 reply; 18+ messages in thread
From: Mathias Burén @ 2011-08-26 12:01 UTC (permalink / raw)
  To: lists; +Cc: linux-raid


Hm,

Could you perhaps post the output of "smartctl -a /dev/sda" (and sdb
for completeness' sake) here? You can find smartctl in the
smartmontools package.

/Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: RAID1, changed disk, 2nd has errors ...
  2011-08-26 12:01 ` Mathias Burén
@ 2011-08-26 12:19   ` Stefan G. Weichinger
  2011-08-26 12:44     ` Stefan G. Weichinger
  2011-08-26 20:00     ` Mathias Burén
  0 siblings, 2 replies; 18+ messages in thread
From: Stefan G. Weichinger @ 2011-08-26 12:19 UTC (permalink / raw)
  To: linux-raid; +Cc: Mathias Burén

On 26.08.2011 14:01, Mathias Burén wrote:

> Could you perhaps post the output of "smartctl -a /dev/sda" (and sdb
> for completeness sake) here? You can find smartctl in the
> smartmontools package.

Sure. sdb is the new HDD installed today (as mentioned).

->

 # smartctl -a /dev/sda
smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.12 family
Device Model:     ST31000528AS
Serial Number:    9VP3BSEV
Firmware Version: CC38
User Capacity:    1.000.204.886.016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Aug 26 14:18:06 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		 ( 600) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 178) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   101   099   006    Pre-fail  Always       -       77880938
  3 Spin_Up_Time            0x0003   097   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       50
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       110698342
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13359
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       25
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   082   082   000    Old_age   Always       -       18
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   060   045    Old_age   Always       -       35 (Min/Max 32/36)
194 Temperature_Celsius     0x0022   035   040   000    Old_age   Always       -       35 (0 15 0 0)
195 Hardware_ECC_Recovered  0x001a   046   024   000    Old_age   Always       -       77880938
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       16896401355883
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2526036334
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2586691393
SMART Error Log Version: 1
ATA Error Count: 18 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 18 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:56.212  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:56.211  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:56.191  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:56.175  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:56.151  READ NATIVE MAX ADDRESS EXT

Error 17 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:53.001  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:53.000  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:52.980  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:52.961  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:52.940  READ NATIVE MAX ADDRESS EXT

Error 16 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:49.790  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:49.789  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:49.749  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:49.739  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:49.719  READ NATIVE MAX ADDRESS EXT

Error 15 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:46.580  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:46.579  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:46.559  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:46.542  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:46.519  READ NATIVE MAX ADDRESS EXT

Error 14 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:43.379  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:43.378  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:43.358  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:43.345  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:43.318  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     13357         -
# 2  Short offline       Completed without error       00%     13333         -
# 3  Short offline       Completed without error       00%     13310         -
# 4  Short offline       Completed without error       00%     13286         -
# 5  Short offline       Completed without error       00%     13261         -
# 6  Short offline       Completed without error       00%     13237         -
# 7  Short offline       Completed without error       00%     13213         -
# 8  Extended offline    Completed without error       00%     13207         -
# 9  Short offline       Completed without error       00%     13189         -
#10  Short offline       Completed without error       00%     13164         -
#11  Short offline       Completed without error       00%     13162         -
#12  Short offline       Completed without error       00%     13138         -
#13  Short offline       Completed without error       00%     13114         -
#14  Short offline       Completed without error       00%     13090         -
#15  Short offline       Completed without error       00%     13066         -
#16  Extended offline    Completed without error       00%     13060         -
#17  Short offline       Completed without error       00%     13042         -
#18  Short offline       Completed without error       00%     13018         -
#19  Short offline       Completed without error       00%     12994         -
#20  Short offline       Completed without error       00%     12970         -
#21  Short offline       Completed without error       00%     12946         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



# smartctl -a /dev/sdb
smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST1000NM0011
Serial Number:    Z1N04CMC
Firmware Version: SN02
User Capacity:    1.000.204.886.016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Aug 26 14:18:35 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		 ( 114) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 155) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x10bd)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   066   066   044    Pre-fail  Always       -       5184768
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       8
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       88000
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       3
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       8
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   064   049   045    Old_age   Always       -       36 (Min/Max 30/37)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       8
194 Temperature_Celsius     0x0022   036   051   000    Old_age   Always       -       36 (0 25 0 0)
195 Hardware_ECC_Recovered  0x001a   102   100   000    Old_age   Always       -       5184768
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



* Re: RAID1, changed disk, 2nd has errors ...
  2011-08-26 12:19   ` Stefan G. Weichinger
@ 2011-08-26 12:44     ` Stefan G. Weichinger
  2011-08-26 20:00     ` Mathias Burén
  1 sibling, 0 replies; 18+ messages in thread
From: Stefan G. Weichinger @ 2011-08-26 12:44 UTC (permalink / raw)
  To: linux-raid; +Cc: Mathias Burén


Some additional log output that gives me hope the array is valid:

# grep md /var/log/messages

Aug 26 11:22:02 horde mdadm[3739]: RebuildFinished event detected on md device /dev/md/1
Aug 26 11:22:02 horde mdadm[3739]: SpareActive event detected on md device /dev/md/1, component device /dev/sdb3
Aug 26 11:22:02 horde mdadm[3739]: RebuildStarted event detected on md device /dev/md/2
Aug 26 12:12:02 horde mdadm[3739]: Rebuild20 event detected on md device /dev/md/2
Aug 26 12:42:53 horde kernel: ata1.00: cmd 25/00:00:b5:73:12/00:04:28:00:00/e0 tag 0 dma 524288 in
Aug 26 12:42:59 horde kernel: ata1.00: cmd 25/00:00:b5:73:12/00:04:28:00:00/e0 tag 0 dma 524288 in
Aug 26 12:43:05 horde kernel: ata1.00: cmd 25/00:00:b5:73:12/00:04:28:00:00/e0 tag 0 dma 524288 in
Aug 26 12:43:09 horde kernel: ata1.00: cmd 25/00:00:b5:73:12/00:04:28:00:00/e0 tag 0 dma 524288 in
Aug 26 12:43:12 horde kernel: ata1.00: cmd 25/00:00:b5:73:12/00:04:28:00:00/e0 tag 0 dma 524288 in
Aug 26 12:43:15 horde kernel: ata1.00: cmd 25/00:00:b5:73:12/00:04:28:00:00/e0 tag 0 dma 524288 in
Aug 26 12:43:19 horde kernel: ata1.00: cmd 25/00:08:fd:73:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:22 horde kernel: ata1.00: cmd 25/00:08:fd:73:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:25 horde kernel: ata1.00: cmd 25/00:08:fd:73:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:28 horde kernel: ata1.00: cmd 25/00:08:fd:73:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:31 horde kernel: ata1.00: cmd 25/00:08:fd:73:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:34 horde kernel: ata1.00: cmd 25/00:08:fd:73:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:34 horde kernel: md/raid1:md2: sda: unrecoverable I/O read error for block 643686144
Aug 26 12:43:34 horde kernel: md: md2: recovery done.
Aug 26 12:43:37 horde kernel: ata1.00: cmd 25/00:08:35:74:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:41 horde kernel: ata1.00: cmd 25/00:08:35:74:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:44 horde kernel: ata1.00: cmd 25/00:08:35:74:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:47 horde kernel: ata1.00: cmd 25/00:08:35:74:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:50 horde kernel: ata1.00: cmd 25/00:08:35:74:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:53 horde kernel: ata1.00: cmd 25/00:08:35:74:12/00:00:28:00:00/e0 tag 0 dma 4096 in
Aug 26 12:43:54 horde kernel: md/raid1:md2: sda: unrecoverable I/O read error for block 643686272
Aug 26 12:43:54 horde mdadm[3739]: RebuildFinished event detected on md device /dev/md/2
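
The "ata1.00: cmd ..." lines in the log above carry the sector address of each failed read. A minimal Python sketch of decoding them follows. It assumes the usual libata field order (cmd/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and a 48-bit command such as READ DMA EXT (0x25); the function name decode_taskfile is only illustrative.

```python
import re

# Assumed field order of the kernel's libata error line:
# cmd/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
TASKFILE = re.compile(
    r"cmd\s+([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})"
)


def decode_taskfile(line):
    """Return (lba, sector_count) for a 48-bit ATA command log line."""
    m = TASKFILE.search(line)
    if m is None:
        raise ValueError("no ATA taskfile found in line")
    _feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in m.group(3).split(":")
    )
    # The "hob" (high order byte) registers hold bits 24..47 of the LBA.
    lba = (
        hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
        | lbah << 16 | lbam << 8 | lbal
    )
    # A count of 0 means 65536 sectors in the ATA spec; not handled in this sketch.
    count = hob_nsect << 8 | nsect
    return lba, count


if __name__ == "__main__":
    line = "ata1.00: cmd 25/00:00:b5:73:12/00:04:28:00:00/e0 tag 0 dma 524288 in"
    # (672297909, 1024): 1024 sectors * 512 bytes = 524288, matching "dma 524288"
    print(decode_taskfile(line))
```

The resulting LBA is relative to the whole disk (sda), not to the partition or the md device, so it cannot be compared directly with the block numbers in the md/raid1 messages.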


* Re: RAID1, changed disk, 2nd has errors ...
  2011-08-26 11:46 RAID1, changed disk, 2nd has errors Stefan G. Weichinger
  2011-08-26 12:01 ` Mathias Burén
@ 2011-08-26 12:56 ` Robin Hill
  2011-08-26 13:51   ` Stefan G. Weichinger
  1 sibling, 1 reply; 18+ messages in thread
From: Robin Hill @ 2011-08-26 12:56 UTC (permalink / raw)
  To: Stefan G. Weichinger; +Cc: linux-raid


On Fri Aug 26, 2011 at 01:46:41PM +0200, Stefan G. Weichinger wrote:

> 
> Please help:
> 
> Today I removed a defective hdd out of a RAID1-array and swapped in a
> new hdd instead.
> 
> 3 arrays, to be true, md[012]
> 
> 0 and 1 synced fine, in the process of syncing md2 the old sda threw
> errors (in sda4):
> 
> md/raid1:md2: sda: unrecoverable I/O read error for block 643686144
> md: md2: recovery done.
> 
> [...]
> 
> md/raid1:md2: sda: unrecoverable I/O read error for block 643686272
> 
> ----
> 
> Did the system stop syncing or is "recovery done" the indication that
> md2 was fully recovered BEFORE the system threw sda4 out of the array??
> 
> I hope for the second!
> 
I think it just indicates that it stopped attempting recovery at this
point.

> # cat /proc/mdstat
> Personalities : [raid1]
> md1 : active raid1 sdb3[1] sda3[0]
>       13679232 blocks [2/2] [UU]
> 
> md2 : active raid1 sdb4[2](S) sda4[0]
>       962454080 blocks [2/1] [U_]
> 
> md0 : active raid1 sdb1[1] sda1[0]
>       128384 blocks [2/2] [UU]
> 
This would indicate that sdb has been reset as a spare, suggesting that
the resync failed so it has left sda alone in the array (as failing it
would destroy the array).
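
The reading of /proc/mdstat described above can be sketched in Python as follows. This is only an illustration, assuming the line layout shown in the quoted output (active arrays, "name[slot](flag)" members, and a trailing "[UU]"-style status); parse_mdstat and degraded are hypothetical helper names, not part of mdadm.

```python
import re

# One array member, e.g. "sdb4[2](S)" or "sda4[0]"
MEMBER = re.compile(r"(\w+)\[(\d+)\](\([A-Z]\))?")


def parse_mdstat(text):
    """Map each active md array to its members (with flags) and status string."""
    arrays = {}
    current = None
    for line in text.splitlines():
        head = re.match(r"^(md\d+) : (\w+) (\w+) (.*)$", line)
        if head:
            current = head.group(1)
            members = {
                name: (flag or "").strip("()")
                for name, _slot, flag in MEMBER.findall(head.group(4))
            }
            arrays[current] = {"members": members, "status": None}
            continue
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if status and current:
            arrays[current]["status"] = status.group(1)
    return arrays


def degraded(arrays):
    """Names of arrays with at least one missing ("_") slot."""
    return sorted(n for n, a in arrays.items() if a["status"] and "_" in a["status"])


if __name__ == "__main__":
    sample = """Personalities : [raid1]
md1 : active raid1 sdb3[1] sda3[0]
      13679232 blocks [2/2] [UU]

md2 : active raid1 sdb4[2](S) sda4[0]
      962454080 blocks [2/1] [U_]

unused devices: <none>
"""
    info = parse_mdstat(sample)
    print(degraded(info))             # arrays with a missing slot
    print(info["md2"]["members"])     # sdb4 carries the "S" (spare) flag
```

In the sample, md2 has one "U" and one "_", and sdb4 carries the (S) flag, which is the combination that points to a failed resync rather than a completed one.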

I'd suggest stopping the array and using ddrescue to clone sda4
to sdb4. That'll copy everything possible, flagging up any read issues.
You'll then need to run a "fsck -f" on sdb4 to clear up any filesystem
damage. You may still be left with damaged/missing files, depending on
where any read errors occurred. How critical this is will depend on what
the filesystem is used for (and whether you have any backup).

If that all works okay, then get sda replaced and give it a thorough
badblocks and SMART test.

I'd also advise setting up regular array checks (echo check >
/sys/block/mdX/md/sync_action) to make sure the disks are checked and
any unreadable blocks repaired/mapped out _before_ they're needed for
recovery.
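
A regular check like the one suggested above could be driven from a small Python script run by cron. The sketch below only writes "check" to the sync_action file when the array is idle; request_check and the sysfs_root parameter are illustrative names introduced here so the function can be tried against a scratch directory.

```python
from pathlib import Path


def request_check(md_name, sysfs_root="/sys/block"):
    """Start a 'check' pass on an md array if it is currently idle.

    Returns True if the check was requested, False if the array was busy
    (for example resyncing or recovering). Needs root on a real system.
    """
    action = Path(sysfs_root) / md_name / "md" / "sync_action"
    if action.read_text().strip() != "idle":
        return False
    action.write_text("check\n")
    return True


if __name__ == "__main__":
    # Hypothetical usage for the array discussed in this thread.
    print(request_check("md2"))
```

Progress of a running check shows up in /proc/mdstat in the same way as a resync.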

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |


* Re: RAID1, changed disk, 2nd has errors ...
  2011-08-26 12:56 ` Robin Hill
@ 2011-08-26 13:51   ` Stefan G. Weichinger
  2011-08-26 14:08     ` Robin Hill
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan G. Weichinger @ 2011-08-26 13:51 UTC (permalink / raw)
  To: linux-raid; +Cc: robin.hill47

On 26.08.2011 14:56, Robin Hill wrote:

> This would indicate that sdb has been reset as a spare, suggesting 
> that the resync failed so it has left sda alone in the array (as 
> failing it would destroy the array).

oh my ...

So the array is somehow split-brain now? Some sectors good here, some
there??

Why is sda4 now flagged as (S)? Is it a spare or not?
I don't fully understand the current state of the array ...

> I'd suggest stopping the array and using ddrescue to clone sda4 to 
> sdb4. That'll copy everything possible, flagging up any read
> issues. You'll then need to run a "fsck -f" on sdb4 to clear up
> any filesystem damage. You may still be left with damaged/missing
> files, depending on where any read errors occurred. How critical
> this is will depend on what the filesystem is used for (and whether
> you have any backup).

I am rather scared to do so ... as I am ~50 km away from the box right
now, and it seems to be working fine so far (though there are currently
no users working with it).

As mentioned /dev/md2 doesn't contain a filesystem itself, but is the
single PV in a LVM-volumegroup.

This group contains 6 logical volumes ...

As far as I understand it might be possible to spot the defective
sectors and the related LV?

I have backups, yes ...

> If that all works okay, then get sda replaced and give it a
> thorough badblocks and SMART test.
> 
> I'd also advise setting up regular array checks (echo check > 
> /sys/block/mdX/md/sync_action) to make sure the disks are checked 
> and any unreadable blocks repaired/mapped out _before_ they're
> needed for recovery.

Would re-adding sda4 and starting such a check be possible?
Or would a re-add damage things?

Should I shut down the box for safety?

I am really feeling unsafe now, and getting another HDD for swapping
will take me at least until Monday.

(I would like to ddrescue to another new disk to keep sdb, just in case)

Thanks, Stefan



* Re: RAID1, changed disk, 2nd has errors ...
  2011-08-26 13:51   ` Stefan G. Weichinger
@ 2011-08-26 14:08     ` Robin Hill
  2011-08-26 15:41       ` Stefan G. Weichinger
  0 siblings, 1 reply; 18+ messages in thread
From: Robin Hill @ 2011-08-26 14:08 UTC (permalink / raw)
  To: Stefan G. Weichinger; +Cc: linux-raid, robin.hill47


On Fri Aug 26, 2011 at 03:51:17PM +0200, Stefan G. Weichinger wrote:

> On 26.08.2011 14:56, Robin Hill wrote:
> 
> > This would indicate that sdb has been reset as a spare, suggesting 
> > that the resync failed so it has left sda alone in the array (as 
> > failing it would destroy the array).
> 
> oh my ...
> 
> So the array is somehow split-brain now? Some sectors good here, some
> there??
> 
> Why is sda4 now flagged as (S)? Is it a spare or not?
> I don't fully understand the current state of the array ...
> 
sda4 is still in the array, with some unreadable sectors. sdb4 is a
spare because the resync failed due to unreadable sectors on sda4. You
cannot add a disk to an array unless the data can all be read (or
recovered if there's still enough redundancy).

> > I'd suggest stopping the array and using ddrescue to clone sda4 to 
> > sdb4. That'll copy everything possible, flagging up any read
> > issues. You'll then need to run a "fsck -f" on sdb4 to clear up
> > any filesystem damage. You may still be left with damaged/missing
> > files, depending on where any read errors occurred. How critical
> > this is will depend on what the filesystem is used for (and whether
> > you have any backup).
> 
> I am rather scared to do so ... as I am ~50kms away from the box now,
> and as it seems to be working fine so far (though there are currently
> no users working with it).
> 
It'll work fine unless something attempts to read from any of the
unreadable sectors on sda4. If these are not used by the filesystem
currently, then you may never run into an issue (as they'll get remapped
if a write error occurs when they do get used).

> As mentioned /dev/md2 doesn't contain a filesystem itself, but is the
> single PV in a LVM-volumegroup.
> 
> This group contains 6 logical volumes ...
> 
> As far as I understand it might be possible to spot the defective
> sectors and the related LV?
> 
A read of the relevant block device (dd if=/dev/xxx of=/dev/null) will
result in read errors for whichever block device contains the bad
sectors. You could probably also map the sectors reported by the kernel
to their position on the disk to tell which LV they fall in.
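
The sector-to-LV mapping can be sketched as below. This is only a rough
illustration under assumptions: it treats the block number in the kernel
message as a 512-byte sector offset from the start of the PV (worth
verifying for your kernel and metadata version), and the pe_start and
extent-size values in the example call are hypothetical placeholders, not
values taken from this system. The function name is made up.

```shell
# Map a sector offset inside the PV to a physical-extent (PE) index.
# All arguments are in 512-byte sectors:
#   $1  sector reported by the kernel (assumed relative to the start of the PV)
#   $2  offset of the first PE (e.g. from: pvs -o pv_name,pe_start --units s)
#   $3  extent size            (e.g. from: vgs -o vg_name,vg_extent_size --units s)
sector_to_pe() {
    if [ "$1" -lt "$2" ]; then
        echo "sector lies before the first extent (LVM metadata area)" >&2
        return 1
    fi
    echo $(( ($1 - $2) / $3 ))
}

# Hypothetical values: pe_start = 384 sectors, extent size = 8192 sectors (4 MiB).
sector_to_pe 643686144 384 8192
```

The resulting PE index can then be compared against the per-LV extent
ranges listed by "pvdisplay --maps" for the PV.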

> I have backups, yes ...
> 
In which case the absolute safest option is just to recreate whatever
arrays, PVs, LVs, etc. on sdb4 and restore the data, ignoring whatever's
on sda4 currently.

> > If that all works okay, then get sda replaced and give it a
> > thorough badblocks and SMART test.
> > 
> > I'd also advise setting up regular array checks (echo check > 
> > /sys/block/mdX/md/sync_action) to make sure the disks are checked 
> > and any unreadable blocks repaired/mapped out _before_ they're
> > needed for recovery.
> 
> re-adding sda4 and starting such a check would be possible?
> Or would a re-add damage things?
> 
You can't add sda4 because it's already in the array.

> Should I shutdown the box for safety?
> 
For absolute safety, yes, though I don't think the risk is too high at
the moment, and I don't think things'll get any worse in the short term.

> I am really feeling unsafe now, and getting another hdd for swapping
> will take me at least until monday.
> 
> (I would like to dd-rescue to another new disk to keep sdb, just in case)
> 
I doubt you'd be able to recover anything useful from sdb4 at the
moment, but that's up to you.


-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: RAID1, changed disk, 2nd has errors ...
  2011-08-26 14:08     ` Robin Hill
@ 2011-08-26 15:41       ` Stefan G. Weichinger
  2011-08-29  7:02         ` Stefan G. Weichinger
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan G. Weichinger @ 2011-08-26 15:41 UTC (permalink / raw)
  To: linux-raid, robin.hill47

On 2011-08-26 16:08, Robin Hill wrote:
> sda4 is still in the array, with some unreadable sectors. sdb4 is a
>  spare because the resync failed due to unreadable sectors on
> sda4. You cannot add a disk to an array unless the data can all be
> read (or recovered if there's still enough redundancy).

Ah, now I got it.
I misinterpreted this:

md2 : active raid1 sdb4[2](S) sda4[0]
      962454080 blocks [2/1] [U_]

I thought [U_] maps to the first line "sdb4 sda4" and somehow read
"sdb4 is UP and sda4 is down"

I could have seen it at

    Number   Major   Minor   RaidDevice State
       0       8        4        0      active sync   /dev/sda4
       1       0        0        1      removed

       2       8       20        -      spare   /dev/sdb4

but you know, panic ;-)
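
For anyone else misreading this: the [U_] field is ordered by RaidDevice
slot (slot 0 first), not by the order in which the members happen to be
listed on the line above it. A minimal sketch for pulling that field out
of mdstat-style text, with a made-up function name:

```shell
# Print the slot-status field (e.g. "[U_]") for one array.
# Reads /proc/mdstat-style text on stdin; $1 is the array name, e.g. md2.
md_member_status() {
    awk -v md="$1" '
        $1 == md && $2 == ":" { found = 1; next }   # the "md2 : active ..." line
        found { print $NF; exit }                   # last field of the next line
    '
}

# Usage on a live system:
#   md_member_status md2 < /proc/mdstat
```

An underscore in position N means RaidDevice slot N has no in-sync member,
which matches the "removed" entry for slot 1 in the mdadm -D output.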

So basically I am where I was before swapping sdb: everything running
on sda, which has some corrupt sectors that may never have been
touched so far.

>> As far as I understand it might be possible to spot the defective
>>  sectors and the related LV?
>> 
> A read of the relevant block device (dd if=/dev/xxx of=/dev/null) 
> will result in read errors for whichever block device contains the 
> bad sectors. You could also probably map the sectors reported by
> the kernel to the position on the disk to tell which LV they fall in.

There is only 350GB out of ~920GB mapped to active LVs. It might be
the case that the corrupt stuff isn't even mapped yet.

I once knew how to figure that out, I will have a closer look.

>> I have backups, yes ...
>> 
> In which case the absolute safest option is just to recreate 
> whatever arrays, PVs, LVs, etc. on sdb4 and restore the data, 
> ignoring whatever's on sda4 currently.

I understand now, yes.

>> re-adding sda4 and starting such a check would be possible? Or 
>> would a re-add damage things?
>> 
> You can't add sda4 because it's already in the array.

Sure, now that I figured out the mentioned misunderstanding.

>> Should I shutdown the box for safety?
>> 
> For absolute safety, yes, though I don't think the risk is too
> high at the moment, and I don't think things'll get any worse in
> the short term.

That sounds good for my weekend! Thanks ...

>> I am really feeling unsafe now, and getting another hdd for 
>> swapping will take me at least until monday.
>> 
>> (I would like to dd-rescue to another new disk to keep sdb, just
>> in case)
>> 
> I doubt you'd be able to recover anything useful from sdb4 at the 
> moment, but that's up to you.

Yep, also clear now.
I'll hold off on the ddrescue stuff anyway.

Thanks for your help!
Stefan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: RAID1, changed disk, 2nd has errors ...
  2011-08-26 12:19   ` Stefan G. Weichinger
  2011-08-26 12:44     ` Stefan G. Weichinger
@ 2011-08-26 20:00     ` Mathias Burén
  2011-08-26 22:12       ` Stefan G. Weichinger
  1 sibling, 1 reply; 18+ messages in thread
From: Mathias Burén @ 2011-08-26 20:00 UTC (permalink / raw)
  To: lists; +Cc: linux-raid

On 26 August 2011 13:19, Stefan G. Weichinger <lists@xunil.at> wrote:
> On 26.08.2011 14:01, Mathias Burén wrote:
>
>> Could you perhaps post the output of "smartctl -a /dev/sda" (and sdb
>> for completeness sake) here? You can find smartctl in the
>> smartmontools package.
>
> sure. sdb is the new hdd from today (as mentioned)
>
> ->
>
(snip)
>
>

FWIW, sda is failing, judging by the uncorrectable sectors and everything else.
If possible I'd mount the HDD (array) read-only, copy the contents
somewhere else, then recreate the array from scratch using your new
HDD and a new HDD to replace sda.

/Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: RAID1, changed disk, 2nd has errors ...
  2011-08-26 20:00     ` Mathias Burén
@ 2011-08-26 22:12       ` Stefan G. Weichinger
  0 siblings, 0 replies; 18+ messages in thread
From: Stefan G. Weichinger @ 2011-08-26 22:12 UTC (permalink / raw)
  To: Mathias Burén; +Cc: linux-raid

On 26.08.2011 22:00, Mathias Burén wrote:

> FWIW, sda is failing, looking at uncorrectable sectors and all else.
> If possible I'd mount the HDD (array) read-only, copy the contents
> somewhere else, then recreate the array from scratch using your new
> HDD and a new HDD to replace sda.

Thanks, Mathias. I will continue working on this on Monday, as soon as I
have another HDD at hand (given the distance etc.)

S


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: RAID1, changed disk, 2nd has errors ...
  2011-08-26 15:41       ` Stefan G. Weichinger
@ 2011-08-29  7:02         ` Stefan G. Weichinger
  2011-08-29  7:45           ` Stefan G. Weichinger
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan G. Weichinger @ 2011-08-29  7:02 UTC (permalink / raw)
  To: lists; +Cc: linux-raid, robin.hill47

On 26.08.2011 17:41, Stefan G. Weichinger wrote:

> Yep, also clear now.
> I wait with that ddrescue-stuff anyway.

Could I somehow make the hdd re-map those 2 sectors?
S


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: RAID1, changed disk, 2nd has errors ...
  2011-08-29  7:02         ` Stefan G. Weichinger
@ 2011-08-29  7:45           ` Stefan G. Weichinger
  2011-08-29  7:51             ` Mathias Burén
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan G. Weichinger @ 2011-08-29  7:45 UTC (permalink / raw)
  To: lists; +Cc: linux-raid, robin.hill47

On 29.08.2011 09:02, Stefan G. Weichinger wrote:
> On 26.08.2011 17:41, Stefan G. Weichinger wrote:
> 
>> Yep, also clear now.
>> I wait with that ddrescue-stuff anyway.
> 
> Could I somehow make the hdd re-map those 2 sectors?

I now followed

http://smartmontools.sourceforge.net/badblockhowto.html#lvm

and as far as I can see the two bad blocks are inside an LVM LV which is
not important at all!

It is a 20 GB LV prepared for something the customer never really used,
so I will move the test data away and remove the LV.

Does this help me get the bad blocks remapped?

Thanks, Stefan




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: RAID1, changed disk, 2nd has errors ...
  2011-08-29  7:45           ` Stefan G. Weichinger
@ 2011-08-29  7:51             ` Mathias Burén
  2011-08-29  8:00               ` Stefan G. Weichinger
  0 siblings, 1 reply; 18+ messages in thread
From: Mathias Burén @ 2011-08-29  7:51 UTC (permalink / raw)
  To: lists; +Cc: linux-raid, robin.hill47

On 29 August 2011 08:45, Stefan G. Weichinger <lists@xunil.at> wrote:
> On 29.08.2011 09:02, Stefan G. Weichinger wrote:
>> On 26.08.2011 17:41, Stefan G. Weichinger wrote:
>>
>>> Yep, also clear now.
>>> I wait with that ddrescue-stuff anyway.
>>
>> Could I somehow make the hdd re-map those 2 sectors?
>
> I now followed
>
> http://smartmontools.sourceforge.net/badblockhowto.html#lvm
>
> and afai see the two bad blocks are inside a LVM-LV which is not
> important at all!
>
> It is a 20 GB LV prepared for something the customer never really used
> so I will mv away the test-data and remove the LV.
>
> Does this somehow help me to be able to maybe remap the bad blocks?
>
> Thanks, Stefan
>
>
>
>

Maybe running badblocks on the sector range (or over the whole HDD,
but in non-destructive read-write mode that takes quite a while longer)
will do the trick.

/M

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: RAID1, changed disk, 2nd has errors ...
  2011-08-29  7:51             ` Mathias Burén
@ 2011-08-29  8:00               ` Stefan G. Weichinger
  2011-08-29  8:25                 ` Stefan G. Weichinger
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan G. Weichinger @ 2011-08-29  8:00 UTC (permalink / raw)
  To: Mathias Burén; +Cc: linux-raid, robin.hill47

On 29.08.2011 09:51, Mathias Burén wrote:

> Maybe running badblocks on the sector range (or over the whole HDD,
> but in non-read-write mode it takes quite a while longer) will do the
> trick.

I currently run "badblocks -n -s /dev/VG01/my_lv" ... we'll see

S

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: RAID1, changed disk, 2nd has errors ...
  2011-08-29  8:00               ` Stefan G. Weichinger
@ 2011-08-29  8:25                 ` Stefan G. Weichinger
  2011-08-29 14:34                   ` (solved) " Stefan G. Weichinger
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan G. Weichinger @ 2011-08-29  8:25 UTC (permalink / raw)
  To: lists; +Cc: Mathias Burén, linux-raid, robin.hill47

On 29.08.2011 10:00, Stefan G. Weichinger wrote:
> On 29.08.2011 09:51, Mathias Burén wrote:
> 
>> Maybe running badblocks on the sector range (or over the whole HDD,
>> but in non-read-write mode it takes quite a while longer) will do the
>> trick.
> 
> I currently run "badblocks -n -s /dev/VG01/my_lv" ... we'll see

Switched over to

dd if=/dev/zero of=/dev/VG01/my_lv bs=4096

This executed without error (wrote ~20GB) and now when I check with:

smartctl -a /dev/sda

I get

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0


Sounds good to me! Right?

So now I could re-add /dev/sdb4 to retry syncing that array, correct?
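
A small sketch for checking those counters from a script rather than by
eye. It assumes the unwrapped ten-column attribute table that
"smartctl -A" prints (raw value in column 10); the function name is
hypothetical.

```shell
# Print the RAW_VALUE column for one SMART attribute.
# Reads smartctl attribute-table text on stdin; $1 is the attribute name.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Usage on a live system (needs root):
#   smartctl -A /dev/sda | smart_raw_value Current_Pending_Sector
```

A non-zero result for Current_Pending_Sector would mean the drive still
has sectors it could not read and has not yet rewritten or reallocated.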

Stefan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: (solved) RAID1, changed disk, 2nd has errors ...
  2011-08-29  8:25                 ` Stefan G. Weichinger
@ 2011-08-29 14:34                   ` Stefan G. Weichinger
  2011-08-29 23:40                     ` Mathias Burén
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan G. Weichinger @ 2011-08-29 14:34 UTC (permalink / raw)
  To: lists; +Cc: Mathias Burén, linux-raid, robin.hill47

On 29.08.2011 10:25, Stefan G. Weichinger wrote:

> I get
> 
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
> 
> 
> Sounds good to me! Right?
> 
> So now I could re-add /dev/sdb4 to retry syncing that array, correct?

Did that.

I failed/removed/re-added /dev/sdb4 and waited for some hours of resyncing.
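
Roughly, the sequence looks like the sketch below. The wrapper function
and the RUN switch are made up for illustration; by default it only
prints the mdadm commands so the device names can be double-checked
before anything is executed.

```shell
# Fail, remove and re-add one member of an md array.
# $1 is the array (e.g. /dev/md2), $2 the member (e.g. /dev/sdb4).
# Dry-run by default: prints the commands. Set RUN=1 to execute (needs root).
readd_member() {
    for _step in --fail --remove --add; do
        if [ "${RUN:-0}" = "1" ]; then
            mdadm "$1" "$_step" "$2" || return 1
        else
            echo "mdadm $1 $_step $2"
        fi
    done
}

# Dry run with the devices from this thread:
readd_member /dev/md2 /dev/sdb4
```

Resync progress can then be followed in /proc/mdstat until the status
field shows [UU] again.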

Now /dev/md2 is in sync again, still with no bad sectors in SMART
(attached, @Mathias ;-))

Thanks to Robin and Mathias for your feedback, it helped me get the
picture and choose the next steps!

For now I will leave the arrays as they are and wait for the second new HDD.
As soon as I have it here I will swap /dev/sdb as well.

(a new server with maybe RAID6 is soon to come there ...)

Thanks, Stefan

----

# smartctl -a /dev/sda
smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.12 family
Device Model:     ST31000528AS
Serial Number:    9VP3BSEV
Firmware Version: CC38
User Capacity:    1.000.204.886.016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Mon Aug 29 16:31:35 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever
					been run.
Total time to complete Offline
data collection: 		 ( 600) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 178) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       134791791
  3 Spin_Up_Time            0x0003   097   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       50
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       111650379
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13433
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       25
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   082   082   000    Old_age   Always       -       18
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       2
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   060   045    Old_age   Always       -       33 (Min/Max 27/36)
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 15 0 0)
195 Hardware_ECC_Recovered  0x001a   048   024   000    Old_age   Always       -       134791791
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       255980050855093
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2678846567
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4015371061

SMART Error Log Version: 1
ATA Error Count: 18 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 18 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:56.212  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:56.211  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:56.191  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:56.175  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:56.151  READ NATIVE MAX ADDRESS EXT

Error 17 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:53.001  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:53.000  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:52.980  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:52.961  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:52.940  READ NATIVE MAX ADDRESS EXT

Error 16 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:49.790  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:49.789  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:49.749  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:49.739  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:49.719  READ NATIVE MAX ADDRESS EXT

Error 15 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:46.580  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:46.579  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:46.559  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:46.542  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:46.519  READ NATIVE MAX ADDRESS EXT

Error 14 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 ff ff ff ef 00      01:28:43.379  READ DMA EXT
  27 00 00 00 00 00 e0 00      01:28:43.378  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      01:28:43.358  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      01:28:43.345  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      01:28:43.318  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     13429         -
# 2  Short offline       Completed without error       00%     13405         -
# 3  Short offline       Completed without error       00%     13381         -
# 4  Extended offline    Completed without error       00%     13375         -
# 5  Short offline       Completed without error       00%     13357         -
# 6  Short offline       Completed without error       00%     13333         -
# 7  Short offline       Completed without error       00%     13310         -
# 8  Short offline       Completed without error       00%     13286         -
# 9  Short offline       Completed without error       00%     13261         -
#10  Short offline       Completed without error       00%     13237         -
#11  Short offline       Completed without error       00%     13213         -
#12  Extended offline    Completed without error       00%     13207         -
#13  Short offline       Completed without error       00%     13189         -
#14  Short offline       Completed without error       00%     13164         -
#15  Short offline       Completed without error       00%     13162         -
#16  Short offline       Completed without error       00%     13138         -
#17  Short offline       Completed without error       00%     13114         -
#18  Short offline       Completed without error       00%     13090         -
#19  Short offline       Completed without error       00%     13066         -
#20  Extended offline    Completed without error       00%     13060         -
#21  Short offline       Completed without error       00%     13042         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: (solved) RAID1, changed disk, 2nd has errors ...
  2011-08-29 14:34                   ` (solved) " Stefan G. Weichinger
@ 2011-08-29 23:40                     ` Mathias Burén
  2011-08-30 12:14                       ` Stefan G. Weichinger
  0 siblings, 1 reply; 18+ messages in thread
From: Mathias Burén @ 2011-08-29 23:40 UTC (permalink / raw)
  To: lists; +Cc: linux-raid, robin.hill47

On 29 August 2011 15:34, Stefan G. Weichinger <lists@xunil.at> wrote:
> On 29.08.2011 10:25, Stefan G. Weichinger wrote:
>
>> I get
>>
>> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
>> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
>>
>>
>> Sounds good to me! Right?
>>
>> So now I could re-add /dev/sdb4 to retry syncing that array, correct?
>
> Did that.
>
> I failed/removed/re-added /dev/sdb4 and waited for some hours of resyncing.
>
> Now /dev/md2 is in sync again, still with no bad sectors in SMART
> (attached, @Mathias ;-))
>
> thanks to Robin and Mathias for your feedback, it helped me to get the
> picture and chose the next steps!
>
> For now I let the arrays as they are and wait for the second new hdd.
> As soon as I have it here I will swap /dev/sdb as well.
>
> (a new server with maybe RAID6 is soon to come there ...)
>
> Thanks, Stefan
>
> ----
>
> # smartctl -a /dev/sda
> smartctl 5.40 2010-10-16 r3189 [i686-pc-linux-gnu] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda 7200.12 family
> Device Model:     ST31000528AS
> Serial Number:    9VP3BSEV
> Firmware Version: CC38
> User Capacity:    1.000.204.886.016 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 4
> Local Time is:    Mon Aug 29 16:31:35 2011 CEST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
>                                        was completed without error.
>                                        Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                        without error or no self-test has ever
>                                        been run.
> Total time to complete Offline
> data collection:                 ( 600) seconds.
> Offline data collection
> capabilities:                    (0x7b) SMART execute Offline immediate.
>                                        Auto Offline data collection on/off support.
>                                        Suspend Offline collection upon new
>                                        command.
>                                        Offline surface scan supported.
>                                        Self-test supported.
>                                        Conveyance Self-test supported.
>                                        Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                        power-saving mode.
>                                        Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                        General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   1) minutes.
> Extended self-test routine
> recommended polling time:        ( 178) minutes.
> Conveyance self-test routine
> recommended polling time:        (   2) minutes.
> SCT capabilities:              (0x103f) SCT Status supported.
>                                        SCT Error Recovery Control supported.
>                                        SCT Feature Control supported.
>                                        SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       134791791
>   3 Spin_Up_Time            0x0003   097   095   000    Pre-fail  Always       -       0
>   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       50
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       111650379
>   9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13433
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       25
> 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
> 184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
> 187 Reported_Uncorrect      0x0032   082   082   000    Old_age   Always       -       18
> 188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       2
> 189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   067   060   045    Old_age   Always       -       33 (Min/Max 27/36)
> 194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 15 0 0)
> 195 Hardware_ECC_Recovered  0x001a   048   024   000    Old_age   Always       -       134791791
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       255980050855093
> 241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2678846567
> 242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       4015371061
>
> SMART Error Log Version: 1
> ATA Error Count: 18 (device log contains only the most recent five errors)
>        CR = Command Register [HEX]
>        FR = Features Register [HEX]
>        SC = Sector Count Register [HEX]
>        SN = Sector Number Register [HEX]
>        CL = Cylinder Low Register [HEX]
>        CH = Cylinder High Register [HEX]
>        DH = Device/Head Register [HEX]
>        DC = Device Command Register [HEX]
>        ER = Error register [HEX]
>        ST = Status register [HEX]
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
>
> Error 18 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
>  When the command that caused the error occurred, the device was active or idle.
>
>  After command completion occurred, registers were:
>  ER ST SC SN CL CH DH
>  -- -- -- -- -- -- --
>  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
>
>  Commands leading to the command that caused the error were:
>  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>  -- -- -- -- -- -- -- --  ----------------  --------------------
>  25 00 08 ff ff ff ef 00      01:28:56.212  READ DMA EXT
>  27 00 00 00 00 00 e0 00      01:28:56.211  READ NATIVE MAX ADDRESS EXT
>  ec 00 00 00 00 00 a0 00      01:28:56.191  IDENTIFY DEVICE
>  ef 03 46 00 00 00 a0 00      01:28:56.175  SET FEATURES [Set transfer mode]
>  27 00 00 00 00 00 e0 00      01:28:56.151  READ NATIVE MAX ADDRESS EXT
>
> Error 17 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
>  When the command that caused the error occurred, the device was active or idle.
>
>  After command completion occurred, registers were:
>  ER ST SC SN CL CH DH
>  -- -- -- -- -- -- --
>  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
>
>  Commands leading to the command that caused the error were:
>  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>  -- -- -- -- -- -- -- --  ----------------  --------------------
>  25 00 08 ff ff ff ef 00      01:28:53.001  READ DMA EXT
>  27 00 00 00 00 00 e0 00      01:28:53.000  READ NATIVE MAX ADDRESS EXT
>  ec 00 00 00 00 00 a0 00      01:28:52.980  IDENTIFY DEVICE
>  ef 03 46 00 00 00 a0 00      01:28:52.961  SET FEATURES [Set transfer mode]
>  27 00 00 00 00 00 e0 00      01:28:52.940  READ NATIVE MAX ADDRESS EXT
>
> Error 16 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
>  When the command that caused the error occurred, the device was active or idle.
>
>  After command completion occurred, registers were:
>  ER ST SC SN CL CH DH
>  -- -- -- -- -- -- --
>  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
>
>  Commands leading to the command that caused the error were:
>  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>  -- -- -- -- -- -- -- --  ----------------  --------------------
>  25 00 08 ff ff ff ef 00      01:28:49.790  READ DMA EXT
>  27 00 00 00 00 00 e0 00      01:28:49.789  READ NATIVE MAX ADDRESS EXT
>  ec 00 00 00 00 00 a0 00      01:28:49.749  IDENTIFY DEVICE
>  ef 03 46 00 00 00 a0 00      01:28:49.739  SET FEATURES [Set transfer mode]
>  27 00 00 00 00 00 e0 00      01:28:49.719  READ NATIVE MAX ADDRESS EXT
>
> Error 15 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
>  When the command that caused the error occurred, the device was active or idle.
>
>  After command completion occurred, registers were:
>  ER ST SC SN CL CH DH
>  -- -- -- -- -- -- --
>  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
>
>  Commands leading to the command that caused the error were:
>  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>  -- -- -- -- -- -- -- --  ----------------  --------------------
>  25 00 08 ff ff ff ef 00      01:28:46.580  READ DMA EXT
>  27 00 00 00 00 00 e0 00      01:28:46.579  READ NATIVE MAX ADDRESS EXT
>  ec 00 00 00 00 00 a0 00      01:28:46.559  IDENTIFY DEVICE
>  ef 03 46 00 00 00 a0 00      01:28:46.542  SET FEATURES [Set transfer mode]
>  27 00 00 00 00 00 e0 00      01:28:46.519  READ NATIVE MAX ADDRESS EXT
>
> Error 14 occurred at disk power-on lifetime: 13357 hours (556 days + 13 hours)
>  When the command that caused the error occurred, the device was active or idle.
>
>  After command completion occurred, registers were:
>  ER ST SC SN CL CH DH
>  -- -- -- -- -- -- --
>  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455
>
>  Commands leading to the command that caused the error were:
>  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>  -- -- -- -- -- -- -- --  ----------------  --------------------
>  25 00 08 ff ff ff ef 00      01:28:43.379  READ DMA EXT
>  27 00 00 00 00 00 e0 00      01:28:43.378  READ NATIVE MAX ADDRESS EXT
>  ec 00 00 00 00 00 a0 00      01:28:43.358  IDENTIFY DEVICE
>  ef 03 46 00 00 00 a0 00      01:28:43.345  SET FEATURES [Set transfer mode]
>  27 00 00 00 00 00 e0 00      01:28:43.318  READ NATIVE MAX ADDRESS EXT
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%     13429         -
> # 2  Short offline       Completed without error       00%     13405         -
> # 3  Short offline       Completed without error       00%     13381         -
> # 4  Extended offline    Completed without error       00%     13375         -
> # 5  Short offline       Completed without error       00%     13357         -
> # 6  Short offline       Completed without error       00%     13333         -
> # 7  Short offline       Completed without error       00%     13310         -
> # 8  Short offline       Completed without error       00%     13286         -
> # 9  Short offline       Completed without error       00%     13261         -
> #10  Short offline       Completed without error       00%     13237         -
> #11  Short offline       Completed without error       00%     13213         -
> #12  Extended offline    Completed without error       00%     13207         -
> #13  Short offline       Completed without error       00%     13189         -
> #14  Short offline       Completed without error       00%     13164         -
> #15  Short offline       Completed without error       00%     13162         -
> #16  Short offline       Completed without error       00%     13138         -
> #17  Short offline       Completed without error       00%     13114         -
> #18  Short offline       Completed without error       00%     13090         -
> #19  Short offline       Completed without error       00%     13066         -
> #20  Extended offline    Completed without error       00%     13060         -
> #21  Short offline       Completed without error       00%     13042         -
>
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>    1        0        0  Not_testing
>    2        0        0  Not_testing
>    3        0        0  Not_testing
>    4        0        0  Not_testing
>    5        0        0  Not_testing
> Selective self-test flags (0x0):
>  After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
>
>
>

Glad you got it working, but your drive looks like it is failing to
me, because of this attribute (raw value 18, matching the 18 logged
ATA errors):

187 Reported_Uncorrect      0x0032   082   082   000    Old_age   Always       -       18

So I'd replace it ASAP. Cheers,

/M
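
For anyone who wants to script this check, here is a minimal Python
sketch. It assumes the attribute rows have been unwrapped to one line
each, as smartctl prints them. The set of watched IDs (5, 187, 197, 198)
and the function name suspicious_attributes are choices made for this
illustration only; smartctl itself defines neither.

```python
import re

# Attribute IDs treated as warning signs when their raw value is non-zero.
# This selection is an assumption of the sketch, not a smartctl rule.
WATCHED_IDS = {5, 187, 197, 198}

# Rows copied from the smartctl output quoted in this thread (unwrapped).
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   082   082   000    Old_age   Always       -       18
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

# ID, name, flag, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)


def suspicious_attributes(text):
    """Return (id, name, raw) for watched attributes with a non-zero raw value."""
    hits = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # skip headers and anything that is not an attribute row
        attr_id = int(m.group("id"))
        raw = int(m.group("raw"))
        if attr_id in WATCHED_IDS and raw > 0:
            hits.append((attr_id, m.group("name"), raw))
    return hits


if __name__ == "__main__":
    for attr_id, name, raw in suspicious_attributes(SAMPLE):
        print(f"{attr_id} {name}: raw value {raw}")
```

On the rows above only attribute 187 is reported; the reallocated and
pending sector counts are still 0 on this drive.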
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: (solved) RAID1, changed disk, 2nd has errors ...
  2011-08-29 23:40                     ` Mathias Burén
@ 2011-08-30 12:14                       ` Stefan G. Weichinger
  0 siblings, 0 replies; 18+ messages in thread
From: Stefan G. Weichinger @ 2011-08-30 12:14 UTC (permalink / raw)
  To: Mathias Burén; +Cc: linux-raid, robin.hill47

On 30.08.2011 01:40, Mathias Burén wrote:

> Glad you got it working, but your drive looks like a failing drive
> to me, because of these:
> 
> 187 Reported_Uncorrect      0x0032   082   082   000    Old_age   Always       -       18
> 
> So I'd replace it ASAP.

As mentioned, I ordered the new disk already.


Thread overview: 18+ messages
2011-08-26 11:46 RAID1, changed disk, 2nd has errors Stefan G. Weichinger
2011-08-26 12:01 ` Mathias Burén
2011-08-26 12:19   ` Stefan G. Weichinger
2011-08-26 12:44     ` Stefan G. Weichinger
2011-08-26 20:00     ` Mathias Burén
2011-08-26 22:12       ` Stefan G. Weichinger
2011-08-26 12:56 ` Robin Hill
2011-08-26 13:51   ` Stefan G. Weichinger
2011-08-26 14:08     ` Robin Hill
2011-08-26 15:41       ` Stefan G. Weichinger
2011-08-29  7:02         ` Stefan G. Weichinger
2011-08-29  7:45           ` Stefan G. Weichinger
2011-08-29  7:51             ` Mathias Burén
2011-08-29  8:00               ` Stefan G. Weichinger
2011-08-29  8:25                 ` Stefan G. Weichinger
2011-08-29 14:34                   ` (solved) " Stefan G. Weichinger
2011-08-29 23:40                     ` Mathias Burén
2011-08-30 12:14                       ` Stefan G. Weichinger
