* Help with two momentarily failed drives out of a 4x3TB Raid 5
@ 2013-03-10 23:48 Javier Marcet
  2013-03-11  0:12 ` Mathias Burén
  0 siblings, 1 reply; 13+ messages in thread
From: Javier Marcet @ 2013-03-10 23:48 UTC (permalink / raw)
  To: linux-raid

Hi,

I have been using my 4x3TB RAID 5 array for the last 8 months
without an issue, but last week I got some recoverable read errors.
Initially I forced an array check and it finished without problems, but
the problem showed up again a day later. I remembered I saw a cable which
I thought I should replace the last time I had to open the server
case, but it was built into the case so I tried not to worry.

At first I tried to reassemble the array after checking all the
connections inside the case and left it overnight. It should have
finished today by noon. Instead I was greeted by a bunch of traces
like this:

[20614.984915] WARNING: at drivers/md/raid5.c:352 get_active_stripe+0x6bc/0x7c0()
[20614.984916] Hardware name: To Be Filled By O.E.M.
[20614.984916] Modules linked in: mt2063 drxk cx25840 cx23885
btcx_risc videobuf_dvb tveeprom cx2341x videobuf_dma_sg r8169
videobuf_core
[20614.984920] Pid: 10125, comm: kworker/u:0 Tainted: G        W
3.7.10-himawari #1
[20614.984920] Call Trace:
[20614.984922]  [<ffffffff810b8eaa>] warn_slowpath_common+0x7a/0xb0
[20614.984923]  [<ffffffff810b8ef5>] warn_slowpath_null+0x15/0x20
[20614.984925]  [<ffffffff8163278c>] get_active_stripe+0x6bc/0x7c0
[20614.984926]  [<ffffffff810e99de>] ? __wake_up+0x4e/0x70
[20614.984928]  [<ffffffff81659ec4>] ? md_wakeup_thread+0x34/0x60
[20614.984929]  [<ffffffff810ddac6>] ? prepare_to_wait+0x56/0x90
[20614.984931]  [<ffffffff816368aa>] make_request+0x1aa/0x6f0
[20614.984932]  [<ffffffff810dd850>] ? finish_wait+0x80/0x80
[20614.984934]  [<ffffffff8165b935>] md_make_request+0x105/0x260
[20614.984935]  [<ffffffff813b0e92>] generic_make_request+0xc2/0x110
[20614.984937]  [<ffffffff81644aea>] bch_generic_make_request_hack+0x9a/0xa0
[20614.984938]  [<ffffffff81644eb3>] bch_generic_make_request+0x43/0x190
[20614.984939]  [<ffffffff816479f8>] write_dirty+0x78/0x120
[20614.984941]  [<ffffffff810d597a>] process_one_work+0x13a/0x4f0
[20614.984942]  [<ffffffff81647980>] ? read_dirty_submit+0xe0/0xe0
[20614.984944]  [<ffffffff810d73c5>] worker_thread+0x165/0x480
[20614.984946]  [<ffffffff810d7260>] ? busy_worker_rebind_fn+0x110/0x110
[20614.984947]  [<ffffffff810dd0cb>] kthread+0xbb/0xc0
[20614.984949]  [<ffffffff810dd010>] ? flush_kthread_worker+0x70/0x70
[20614.984950]  [<ffffffff8188872c>] ret_from_fork+0x7c/0xb0
[20614.984951]  [<ffffffff810dd010>] ? flush_kthread_worker+0x70/0x70
[20614.984952] ---[ end trace d2db072c18819bc0 ]---
[20614.984954] sector=8b909ff8 i=2           (null)           (null)
        (null)           (null) 1
[20614.984955] ------------[ cut here ]------------

Thinking that it could still be a loose cable, I decided to order a
case better suited to hosting the RAID (than the server case, where the
drives share space with cards and cables). Meanwhile I arranged the drives
so that I could use reliable cables for the two with faulty
cables, and tried to assemble the array again.

Initially it refused to, even though I was using mdadm --force. It started
to rebuild after a few seconds, though. To my dismay it ended the same
way. Only this time I went back through the logs and found where the
first backtrace appeared: http://bpaste.net/raw/82819/
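
For reference, the forced assembly I ran looked roughly like the
following; the md device and member partition names are placeholders
here, not necessarily my exact devices:

  mdadm --stop /dev/md0
  mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1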

Here is my raid.status: http://bpaste.net/raw/82820/

I have read all the info in
https://raid.wiki.kernel.org/index.php/RAID_Recovery#Restore_array_by_recreating_.28after_multiple_device_failure.29
and before I lose any chance of copying the data (most of it at least)
by trying to force a complete rebuild, I wanted to ask here first.

I have 4.5 TB used and right now I have the filesystem mounted and I
can use it, yet the kernel is spitting out that same trace over and over
again. I really don't know what would be the best thing to do right
now and would appreciate any help.


--
Javier Marcet <jmarcet@gmail.com>


* Re: Help with two momentarily failed drives out of a 4x3TB Raid 5
  2013-03-10 23:48 Help with two momentarily failed drives out of a 4x3TB Raid 5 Javier Marcet
@ 2013-03-11  0:12 ` Mathias Burén
  2013-03-11  0:33   ` Javier Marcet
  0 siblings, 1 reply; 13+ messages in thread
From: Mathias Burén @ 2013-03-11  0:12 UTC (permalink / raw)
  To: Javier Marcet; +Cc: Linux-RAID

On 10 March 2013 23:48, Javier Marcet <jmarcet@gmail.com> wrote:
> Hi,
>
> I have been using my 4x3TB RAID 5 array for the last 8 months
> without an issue, but last week I got some recoverable read errors.
> Initially I forced an array check and it finished without problems, but
> the problem showed up again a day later. I remembered I saw a cable which
> I thought I should replace the last time I had to open the server
> case, but it was built into the case so I tried not to worry.
>
> At first I tried to reassemble the array after checking all the
> connections inside the case and left it overnight. It should have
> finished today by noon. Instead I was greeted by a bunch of traces
> like this:
>
> [20614.984915] WARNING: at drivers/md/raid5.c:352 get_active_stripe+0x6bc/0x7c0()
> [20614.984916] Hardware name: To Be Filled By O.E.M.
> [20614.984916] Modules linked in: mt2063 drxk cx25840 cx23885
> btcx_risc videobuf_dvb tveeprom cx2341x videobuf_dma_sg r8169
> videobuf_core
> [20614.984920] Pid: 10125, comm: kworker/u:0 Tainted: G        W
> 3.7.10-himawari #1
> [20614.984920] Call Trace:
> [20614.984922]  [<ffffffff810b8eaa>] warn_slowpath_common+0x7a/0xb0
> [20614.984923]  [<ffffffff810b8ef5>] warn_slowpath_null+0x15/0x20
> [20614.984925]  [<ffffffff8163278c>] get_active_stripe+0x6bc/0x7c0
> [20614.984926]  [<ffffffff810e99de>] ? __wake_up+0x4e/0x70
> [20614.984928]  [<ffffffff81659ec4>] ? md_wakeup_thread+0x34/0x60
> [20614.984929]  [<ffffffff810ddac6>] ? prepare_to_wait+0x56/0x90
> [20614.984931]  [<ffffffff816368aa>] make_request+0x1aa/0x6f0
> [20614.984932]  [<ffffffff810dd850>] ? finish_wait+0x80/0x80
> [20614.984934]  [<ffffffff8165b935>] md_make_request+0x105/0x260
> [20614.984935]  [<ffffffff813b0e92>] generic_make_request+0xc2/0x110
> [20614.984937]  [<ffffffff81644aea>] bch_generic_make_request_hack+0x9a/0xa0
> [20614.984938]  [<ffffffff81644eb3>] bch_generic_make_request+0x43/0x190
> [20614.984939]  [<ffffffff816479f8>] write_dirty+0x78/0x120
> [20614.984941]  [<ffffffff810d597a>] process_one_work+0x13a/0x4f0
> [20614.984942]  [<ffffffff81647980>] ? read_dirty_submit+0xe0/0xe0
> [20614.984944]  [<ffffffff810d73c5>] worker_thread+0x165/0x480
> [20614.984946]  [<ffffffff810d7260>] ? busy_worker_rebind_fn+0x110/0x110
> [20614.984947]  [<ffffffff810dd0cb>] kthread+0xbb/0xc0
> [20614.984949]  [<ffffffff810dd010>] ? flush_kthread_worker+0x70/0x70
> [20614.984950]  [<ffffffff8188872c>] ret_from_fork+0x7c/0xb0
> [20614.984951]  [<ffffffff810dd010>] ? flush_kthread_worker+0x70/0x70
> [20614.984952] ---[ end trace d2db072c18819bc0 ]---
> [20614.984954] sector=8b909ff8 i=2           (null)           (null)
>         (null)           (null) 1
> [20614.984955] ------------[ cut here ]------------
>
> Thinking that it could still be a loose cable, I decided to order a
> case better suited to hosting the RAID (than the server case, where the
> drives share space with cards and cables). Meanwhile I arranged the drives
> so that I could use reliable cables for the two with faulty
> cables, and tried to assemble the array again.
>
> Initially it refused to, even though I was using mdadm --force. It started
> to rebuild after a few seconds, though. To my dismay it ended the same
> way. Only this time I went back through the logs and found where the
> first backtrace appeared: http://bpaste.net/raw/82819/
>
> Here is my raid.status: http://bpaste.net/raw/82820/
>
> I have read all the info in
> https://raid.wiki.kernel.org/index.php/RAID_Recovery#Restore_array_by_recreating_.28after_multiple_device_failure.29
> and before I lose any chance of copying the data (most of it at least)
> by trying to force a complete rebuild, I wanted to ask here first.
>
> I have 4.5 TB used and right now I have the filesystem mounted and I
> can use it, yet the kernel is spitting out that same trace over and over
> again. I really don't know what would be the best thing to do right
> now and would appreciate any help.
>
>
> --
> Javier Marcet <jmarcet@gmail.com>

So how are the drives doing? smartctl -a for all HDDs please.

Cheers,
Mathias


* Re: Help with two momentarily failed drives out of a 4x3TB Raid 5
  2013-03-11  0:12 ` Mathias Burén
@ 2013-03-11  0:33   ` Javier Marcet
  2013-03-11  0:41     ` Javier Marcet
  2013-03-11 20:18     ` Chris Murphy
  0 siblings, 2 replies; 13+ messages in thread
From: Javier Marcet @ 2013-03-11  0:33 UTC (permalink / raw)
  To: Mathias Burén; +Cc: Linux-RAID

On Mon, Mar 11, 2013 at 1:12 AM, Mathias Burén <mathias.buren@gmail.com> wrote:

>> Initially it refused to, even though I was using mdadm --force. It started
>> to rebuild after a few seconds, though. To my dismay it ended the same
>> way. Only this time I went back through the logs and found where the
>> first backtrace appeared: http://bpaste.net/raw/82819/
>>
>> Here is my raid.status: http://bpaste.net/raw/82820/
>>
>> I have read all the info in
>> https://raid.wiki.kernel.org/index.php/RAID_Recovery#Restore_array_by_recreating_.28after_multiple_device_failure.29
>> and before I lose any chance of copying the data (most of it at least)
>> by trying to force a complete rebuild, I wanted to ask here first.
>>
>> I have 4.5 TB used and right now I have the filesystem mounted and I
>> can use it, yet the kernel is spitting out that same trace over and over
>> again. I really don't know what would be the best thing to do right
>> now and would appreciate any help.
>
> So how are the drives doing? smartctl -a for all HDDs please.

http://bpaste.net/raw/82828/


-- 
Javier Marcet <jmarcet@gmail.com>


* Re: Help with two momentarily failed drives out of a 4x3TB Raid 5
  2013-03-11  0:33   ` Javier Marcet
@ 2013-03-11  0:41     ` Javier Marcet
  2013-03-11  8:40       ` Mathias Burén
  2013-03-11 20:18     ` Chris Murphy
  1 sibling, 1 reply; 13+ messages in thread
From: Javier Marcet @ 2013-03-11  0:41 UTC (permalink / raw)
  To: Mathias Burén; +Cc: Linux-RAID

On Mon, Mar 11, 2013 at 1:33 AM, Javier Marcet <jmarcet@gmail.com> wrote:

>>> I have 4.5 TB used and right now I have the filesystem mounted and I
>>> can use it, yet the kernel is spitting out that same trace over and over
>>> again. I really don't know what would be the best thing to do right
>>> now and would appreciate any help.
>>
>> So how are the drives doing? smartctl -a for all HDDs please.
>
> http://bpaste.net/raw/82828/

As far as I can see there are only two read errors reported on
/dev/sdb, nothing on the rest.

Also, I finally put the raid completely offline since the data was not
really readable in that state.



-- 
Javier Marcet <jmarcet@gmail.com>


* Re: Help with two momentarily failed drives out of a 4x3TB Raid 5
  2013-03-11  0:41     ` Javier Marcet
@ 2013-03-11  8:40       ` Mathias Burén
  2013-03-11  8:56         ` Javier Marcet
  0 siblings, 1 reply; 13+ messages in thread
From: Mathias Burén @ 2013-03-11  8:40 UTC (permalink / raw)
  To: Javier Marcet; +Cc: Linux-RAID

On 11 March 2013 00:41, Javier Marcet <jmarcet@gmail.com> wrote:
> On Mon, Mar 11, 2013 at 1:33 AM, Javier Marcet <jmarcet@gmail.com> wrote:
>
>>>> I have 4.5 TB used and right now I have the filesystem mounted and I
>>>> can use it, yet the kernel is spitting out that same trace over and over
>>>> again. I really don't know what would be the best thing to do right
>>>> now and would appreciate any help.
>>>
>>> So how are the drives doing? smartctl -a for all HDDs please.
>>
>> http://bpaste.net/raw/82828/
>
> As far as I can see there are only two read errors reported on
> /dev/sdb, nothing on the rest.
>
> Also, I finally put the raid completely offline since the data was not
> really readable in that state.
>
>
>
> --
> Javier Marcet <jmarcet@gmail.com>

Serial Number:    WD-WCAWZ2190021
197 Current_Pending_Sector  0x0032   200   200   000    Old_age
Always       -       3

Not good. I'd run smartctl -t long, wait for completion, check
results, then run badblocks.
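
For example, something along these lines for each disk (sdX is a
placeholder; badblocks in its default mode is read-only, so it won't
touch the data):

  smartctl -t long /dev/sdX      # start the extended self-test
  smartctl -l selftest /dev/sdX  # check the result once it finishes
  badblocks -sv /dev/sdX         # read-only surface scan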

Serial Number:    WD-WCAWZ2200949
197 Current_Pending_Sector  0x0032   200   200   000    Old_age
Always       -       1

As above. Also, I see no self-tests logged?

Cheers,
Mathias


* Re: Help with two momentarily failed drives out of a 4x3TB Raid 5
  2013-03-11  8:40       ` Mathias Burén
@ 2013-03-11  8:56         ` Javier Marcet
  2013-03-11  9:41           ` Roy Sigurd Karlsbakk
  0 siblings, 1 reply; 13+ messages in thread
From: Javier Marcet @ 2013-03-11  8:56 UTC (permalink / raw)
  To: Mathias Burén; +Cc: Linux-RAID

On Mon, Mar 11, 2013 at 9:40 AM, Mathias Burén <mathias.buren@gmail.com> wrote:

>>>>> I have 4.5 TB used and right now I have the filesystem mounted and I
>>>>> can use it, yet the kernel is spitting out that same trace over and over
>>>>> again. I really don't know what would be the best thing to do right
>>>>> now and would appreciate any help.
>>>>
>>>> So how are the drives doing? smartctl -a for all HDDs please.
>>>
>>> http://bpaste.net/raw/82828/
>>
>> As far as I can see there are only two read errors reported on
>> /dev/sdb, nothing on the rest.
>>
>> Also, I finally put the raid completely offline since the data was not
>> really readable in that state.
>>
>>
>>
>> --
>> Javier Marcet <jmarcet@gmail.com>
>
> Serial Number:    WD-WCAWZ2190021
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age
> Always       -       3
>
> Not good. I'd run smartctl -t long, wait for completion, check
> results, then run badblocks.

By 18:00 today I should have the smartctl results.

>
> Serial Number:    WD-WCAWZ2200949
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age
> Always       -       1
>
> As above. Also, I see no self-tests logged?

I hadn't done any test manually but I had smartd running and configured.


-- 
Javier Marcet <jmarcet@gmail.com>


* Re: Help with two momentarily failed drives out of a 4x3TB Raid 5
  2013-03-11  8:56         ` Javier Marcet
@ 2013-03-11  9:41           ` Roy Sigurd Karlsbakk
  2013-03-11 19:03             ` Javier Marcet
  0 siblings, 1 reply; 13+ messages in thread
From: Roy Sigurd Karlsbakk @ 2013-03-11  9:41 UTC (permalink / raw)
  To: Javier Marcet; +Cc: Linux-RAID, Mathias Burén

> > As above. Also, I see no self-tests logged?
> 
> I hadn't done any test manually but I had smartd running and
> configured.

smartd won't run any tests unless you configure it to do so.
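
As an illustration only (the schedule here is arbitrary), a smartd.conf
line like this would run a long self-test on /dev/sda every Sunday at 03:00:

  /dev/sda -a -s L/../../7/03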

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms with xenotypical etymology. In most cases, adequate and relevant synonyms exist in Norwegian.


* Re: Help with two momentarily failed drives out of a 4x3TB Raid 5
  2013-03-11  9:41           ` Roy Sigurd Karlsbakk
@ 2013-03-11 19:03             ` Javier Marcet
  2013-03-11 19:26               ` Mathias Burén
  0 siblings, 1 reply; 13+ messages in thread
From: Javier Marcet @ 2013-03-11 19:03 UTC (permalink / raw)
  To: Roy Sigurd Karlsbakk; +Cc: Linux-RAID, Mathias Burén

On Mon, Mar 11, 2013 at 10:41 AM, Roy Sigurd Karlsbakk
<roy@karlsbakk.net> wrote:

>> > As above. Also, I see no self-tests logged?
>>
>> I hadn't done any test manually but I had smartd running and
>> configured.
>
> smartd won't run any tests unless you configure it to do so.

I completely forgot to check the surfaces when I bought the disks. I
did quite a few tests before I decided to use ext4 but did not check
their health.

I have two disks with 30% of the long test still to run, but the other
two stopped due to errors:

http://bpaste.net/raw/83003/
197 Current_Pending_Sector  0x0032   200   200   000    Old_age
Always       -       18

http://bpaste.net/raw/83004/
197 Current_Pending_Sector  0x0032   200   200   000    Old_age
Always       -       1


I am going to run badblocks on them, but this does not look good.

Could a faulty SATA data cable cause those bad blocks? The disks are 8
months old, so I hope I can get them replaced.



-- 
Javier Marcet <jmarcet@gmail.com>


* Re: Help with two momentarily failed drives out of a 4x3TB Raid 5
  2013-03-11 19:03             ` Javier Marcet
@ 2013-03-11 19:26               ` Mathias Burén
  2013-03-11 19:41                 ` Javier Marcet
  2013-03-12 18:55                 ` Javier Marcet
  0 siblings, 2 replies; 13+ messages in thread
From: Mathias Burén @ 2013-03-11 19:26 UTC (permalink / raw)
  To: Javier Marcet; +Cc: Roy Sigurd Karlsbakk, Linux-RAID

On 11 March 2013 19:03, Javier Marcet <jmarcet@gmail.com> wrote:
> On Mon, Mar 11, 2013 at 10:41 AM, Roy Sigurd Karlsbakk
> <roy@karlsbakk.net> wrote:
>
>>> > As above. Also, I see no self-tests logged?
>>>
>>> I hadn't done any test manually but I had smartd running and
>>> configured.
>>
>> smartd won't run any tests unless you configure it to do so.
>
> I completely forgot to check the surfaces when I bought the disks. I
> did quite a few tests before I decided to use ext4 but did not check
> their health.
>
> I have two disks with 30% of the long test still to run, but the other
> two stopped due to errors:
>
> http://bpaste.net/raw/83003/
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age
> Always       -       18
>
> http://bpaste.net/raw/83004/
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age
> Always       -       1
>
>
> I am going to run badblocks on them, but this does not look good.
>
> Could a faulty SATA data cable cause those bad blocks? The disks are 8
> months old, so I hope I can get them replaced.
>
>
>
> --
> Javier Marcet <jmarcet@gmail.com>

I see read failures in the SMART self-tests; that warrants an RMA. So,
get that done. A faulty SATA cable would not do that; you'd see
CRC errors or other kernel log messages instead.

Mathias


* Re: Help with two momentarily failed drives out of a 4x3TB Raid 5
  2013-03-11 19:26               ` Mathias Burén
@ 2013-03-11 19:41                 ` Javier Marcet
  2013-03-12 18:55                 ` Javier Marcet
  1 sibling, 0 replies; 13+ messages in thread
From: Javier Marcet @ 2013-03-11 19:41 UTC (permalink / raw)
  To: Mathias Burén; +Cc: Roy Sigurd Karlsbakk, Linux-RAID

On Mon, Mar 11, 2013 at 8:26 PM, Mathias Burén <mathias.buren@gmail.com> wrote:

>> I have two disks with 30% of the long test still to run, but the other
>> two stopped due to errors:
>>
>> http://bpaste.net/raw/83003/
>> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age
>> Always       -       18
>>
>> http://bpaste.net/raw/83004/
>> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age
>> Always       -       1
>>
>>
>> I am going to run badblocks on them, but this does not look good.
>>
>> Could a faulty SATA data cable cause those bad blocks? The disks are 8
>> months old, so I hope I can get them replaced.

> I see read failures in the SMART self-tests; that warrants an RMA. So,
> get that done. A faulty SATA cable would not do that; you'd see
> CRC errors or other kernel log messages instead.

All right, tomorrow I will request an RMA.

One of the drives only shows one sector bad. I hope I can still get
some/most of my data.

I know that forcing a reassembly lets me mount the filesystem, although
at a certain point it causes a kernel backtrace and I believe most of the
data is unreadable.

How should I proceed to be able to get most of the data? Will I have
to create a completely new array, or can I somehow fix it by adding new
disks?

Oh and thanks a lot for the help, I'm still a newbie with raids.


-- 
Javier Marcet <jmarcet@gmail.com>


* Re: Help with two momentarily failed drives out of a 4x3TB Raid 5
  2013-03-11  0:33   ` Javier Marcet
  2013-03-11  0:41     ` Javier Marcet
@ 2013-03-11 20:18     ` Chris Murphy
  1 sibling, 0 replies; 13+ messages in thread
From: Chris Murphy @ 2013-03-11 20:18 UTC (permalink / raw)
  To: linux-raid list


On Mar 10, 2013, at 6:33 PM, Javier Marcet <jmarcet@gmail.com> wrote:

> On Mon, Mar 11, 2013 at 1:12 AM, Mathias Burén <mathias.buren@gmail.com> wrote:
>> 
>> So how are the drives doing? smartctl -a for all HDDs please.
> 
> http://bpaste.net/raw/82828/


Two of four drives report bad sectors as Current_Pending_Sector. We need to see the full dmesg for the time when the array collapsed to be sure, but I bet dollars to donuts that disk 1 drops out for some reason (?) and shortly thereafter the other drive experiences ERR UNC for its bad sector, causing the array to collapse.

The first disk ejected is probably not in sync with the array and needs to be rebuilt. The other drive might be slightly out of sync, but it's worth forcing an assemble to find out. And then these bad sectors need to be repaired, which is difficult if too many writes hit the degraded array after the first disk was ejected but before the array collapsed.

The drives are clearly configured incorrectly with respect to their controller and/or the Linux SCSI layer timeout for the block devices, or you wouldn't have pending bad sectors. Configured correctly, bad sectors are remapped in the course of a normally functioning array, as well as by scheduled scrubs.
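
A scrub can be triggered manually with something like the following,
md0 being a placeholder for the array device:

  echo check > /sys/block/md0/md/sync_action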

Ideally the drive SCT ERC is lowered to something like 70 deciseconds. Or, if that's not supported by the drives, then the controller and the block device timeout need to be raised to whatever timeout the drive itself is using:

echo xxx >/sys/block/sdX/device/timeout

xxx is in seconds. So for a 2 minute drive timeout, you'd need that to be at least 120, maybe a few seconds more, to make absolutely certain Linux doesn't time out the block device before the drive itself reports a read error.
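
For example (sdX is a placeholder, and the first command only applies
if the drives support SCT ERC at all):

  smartctl -l scterc,70,70 /dev/sdX          # 7 second read/write ERC
  echo 180 > /sys/block/sdX/device/timeout   # otherwise raise the kernel timeout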


> By 18:00 today I should have the smartctl results.

Next to pointless in that it will stop the testing as soon as it finds the first bad sector. But if you have that LBA you can use dd to zero just that sector. While that corrupts the data in that sector, the data is effectively gone anyway, and it will prevent another read error and allow the rebuild to proceed.
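
For example, if the self-test log reports the failing LBA as N on
/dev/sdX (both placeholders; this assumes 512-byte logical sectors and
destroys whatever was in that sector):

  dd if=/dev/zero of=/dev/sdX bs=512 count=1 seek=N oflag=direct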

> Could a faulty SATA data cable cause those bad blocks?

No. Some bad sectors are normal. Many, or an increasing number, are cause for the drives to be replaced under warranty.

> How should I proceed to be able to get most of the data? Will I have
> to create a completely new array or can I somehow fix it adding new
> disks?

You're better off recreating the array and restoring from backup. Fixing it will be tedious.


Chris Murphy


* Re: Help with two momentarily failed drives out of a 4x3TB Raid 5
  2013-03-11 19:26               ` Mathias Burén
  2013-03-11 19:41                 ` Javier Marcet
@ 2013-03-12 18:55                 ` Javier Marcet
  2013-03-13  4:13                   ` Mikael Abrahamsson
  1 sibling, 1 reply; 13+ messages in thread
From: Javier Marcet @ 2013-03-12 18:55 UTC (permalink / raw)
  To: Mathias Burén; +Cc: Roy Sigurd Karlsbakk, Linux-RAID

On Mon, Mar 11, 2013 at 8:26 PM, Mathias Burén <mathias.buren@gmail.com> wrote:

>>>> > As above. Also, I see no self-tests logged?
>>>>
>>>> I hadn't done any test manually but I had smartd running and
>>>> configured.
>>>
>>> smartd won't run any tests unless you configure it to do so.
>>
>> I completely forgot to check the surfaces when I bought the disks. I
>> did quite a few tests before I decided to use ext4 but did not check
>> their health.
>>
>> I have two disks with 30% of the long test still to run, but the other
>> two stopped due to errors:
>>
>> http://bpaste.net/raw/83003/
>> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age
>> Always       -       18
>>
>> http://bpaste.net/raw/83004/
>> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age
>> Always       -       1
>>
>>
>> I am going to run badblocks on them, but this does not look good.
>>
>> Could a faulty SATA data cable cause those bad blocks? The disks are 8
>> months old, so I hope I can get them replaced.

I have filed the RMAs and ordered a new disk so that I can at least
copy some of the data. Yet I have been thinking I could clone the
drive which is still part of the array and has one bad sector onto the
new one, and try to assemble the RAID with the clone.

Could that work?


-- 
Javier Marcet <jmarcet@gmail.com>


* Re: Help with two momentarily failed drives out of a 4x3TB Raid 5
  2013-03-12 18:55                 ` Javier Marcet
@ 2013-03-13  4:13                   ` Mikael Abrahamsson
  0 siblings, 0 replies; 13+ messages in thread
From: Mikael Abrahamsson @ 2013-03-13  4:13 UTC (permalink / raw)
  To: Javier Marcet; +Cc: Linux-RAID

On Tue, 12 Mar 2013, Javier Marcet wrote:

> I have filed the RMAs and ordered a new disk so that I can at least copy 
> some of the data. Yet I have been thinking I could clone the drive which 
> is still part of the array and has one bad sector onto the new one, and
> try to assemble the RAID with the clone.
>
> Could that work?

Yes, using dd_rescue or gddrescue for this purpose is very common.
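
A typical GNU ddrescue invocation would be something along the lines of
the following (sdX being the failing disk, sdY the new one, and the map
file path just an example):

  ddrescue -f -n /dev/sdX /dev/sdY /root/rescue.map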

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


end of thread

Thread overview: 13+ messages
2013-03-10 23:48 Help with two momentarily failed drives out of a 4x3TB Raid 5 Javier Marcet
2013-03-11  0:12 ` Mathias Burén
2013-03-11  0:33   ` Javier Marcet
2013-03-11  0:41     ` Javier Marcet
2013-03-11  8:40       ` Mathias Burén
2013-03-11  8:56         ` Javier Marcet
2013-03-11  9:41           ` Roy Sigurd Karlsbakk
2013-03-11 19:03             ` Javier Marcet
2013-03-11 19:26               ` Mathias Burén
2013-03-11 19:41                 ` Javier Marcet
2013-03-12 18:55                 ` Javier Marcet
2013-03-13  4:13                   ` Mikael Abrahamsson
2013-03-11 20:18     ` Chris Murphy
