* failed sector detected but disk still active ?
@ 2022-05-13 7:37 Gandalf Corvotempesta
2022-05-13 16:02 ` Piergiorgio Sartor
2022-05-14 23:46 ` Jani Partanen
0 siblings, 2 replies; 16+ messages in thread
From: Gandalf Corvotempesta @ 2022-05-13 7:37 UTC (permalink / raw)
To: Linux RAID Mailing List
How is this possible?
It seems that sdc has some failed sectors, but the disk is still active in the array.
[Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc] Unhandled sense code
[Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc]
[Mon May 2 03:36:24 2022] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc]
[Mon May 2 03:36:24 2022] Sense Key : Medium Error [current]
[Mon May 2 03:36:24 2022] Info fld=0x10565570
[Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc]
[Mon May 2 03:36:24 2022] Add. Sense: Unrecovered read error
[Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc] CDB:
[Mon May 2 03:36:24 2022] Read(10): 28 00 10 56 51 80 00 04 00 00
[Mon May 2 03:36:24 2022] end_request: critical medium error, dev sdc, sector 274093424
[Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc] Unhandled sense code
[Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc]
[Mon May 2 03:36:25 2022] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc]
[Mon May 2 03:36:25 2022] Sense Key : Medium Error [current]
[Mon May 2 03:36:25 2022] Info fld=0x10565584
[Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc]
[Mon May 2 03:36:25 2022] Add. Sense: Unrecovered read error
[Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc] CDB:
[Mon May 2 03:36:25 2022] Read(10): 28 00 10 56 55 80 00 04 00 00
[Mon May 2 03:36:25 2022] end_request: critical medium error, dev sdc, sector 274093444
[Mon May 2 04:06:32 2022] md: md0: data-check done.
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[2] sda1[0] sdb1[1]
292836352 blocks super 1.2 [3/3] [UUU]
bitmap: 3/3 pages [12KB], 65536KB chunk
unused devices: <none>
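
For readers following the log, the two failed commands can be decoded by hand. The sketch below is a minimal Python illustration, assuming the standard READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8) and assuming the "Info fld" value is the first failing LBA. The helper name `decode_read10` is made up for this example. The Info values equal the sector numbers on the `end_request` lines, which suggests 512-byte logical blocks on this disk.

```python
def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Return (start LBA, block count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)             # whitespace between bytes is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) command")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, count


# CDBs and "Info fld" values copied from the log above.
events = [
    ("28 00 10 56 51 80 00 04 00 00", 0x10565570),
    ("28 00 10 56 55 80 00 04 00 00", 0x10565584),
]

for cdb_hex, info in events:
    start, count = decode_read10(cdb_hex)
    print(f"read {count} blocks from LBA {start}; "
          f"first bad LBA {info} (offset {info - start})")
```

Both reads are 1024 blocks long, and the reported bad blocks sit at offsets 1008 and 4 inside them, so the two errors are only 20 sectors apart.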
* Re: failed sector detected but disk still active ?
2022-05-13 7:37 failed sector detected but disk still active ? Gandalf Corvotempesta
@ 2022-05-13 16:02 ` Piergiorgio Sartor
2022-05-14 13:46 ` Wols Lists
2022-05-14 23:46 ` Jani Partanen
1 sibling, 1 reply; 16+ messages in thread
From: Piergiorgio Sartor @ 2022-05-13 16:02 UTC (permalink / raw)
To: Gandalf Corvotempesta; +Cc: Linux RAID Mailing List
On Fri, May 13, 2022 at 09:37:13AM +0200, Gandalf Corvotempesta wrote:
> How this is possible ?
> seems that sdc has some failed sectors but disk is still active in the array
>
> [Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc] Unhandled sense code
> [Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc]
> [Mon May 2 03:36:24 2022] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc]
> [Mon May 2 03:36:24 2022] Sense Key : Medium Error [current]
> [Mon May 2 03:36:24 2022] Info fld=0x10565570
> [Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc]
> [Mon May 2 03:36:24 2022] Add. Sense: Unrecovered read error
> [Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc] CDB:
> [Mon May 2 03:36:24 2022] Read(10): 28 00 10 56 51 80 00 04 00 00
> [Mon May 2 03:36:24 2022] end_request: critical medium error, dev
> sdc, sector 274093424
> [Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc] Unhandled sense code
> [Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc]
> [Mon May 2 03:36:25 2022] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc]
> [Mon May 2 03:36:25 2022] Sense Key : Medium Error [current]
> [Mon May 2 03:36:25 2022] Info fld=0x10565584
> [Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc]
> [Mon May 2 03:36:25 2022] Add. Sense: Unrecovered read error
> [Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc] CDB:
> [Mon May 2 03:36:25 2022] Read(10): 28 00 10 56 55 80 00 04 00 00
> [Mon May 2 03:36:25 2022] end_request: critical medium error, dev
> sdc, sector 274093444
> [Mon May 2 04:06:32 2022] md: md0: data-check done.
The error is reported by the device.
As far as I know, and please someone correct
me if I'm wrong, when a device has an error,
"md" tries to re-write the data, using the
redundancy, and, if no error occurs, it just
continues; there is no reason to kick the device out
of the array.
bye,
pg
>
>
>
> # cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 sdc1[2] sda1[0] sdb1[1]
> 292836352 blocks super 1.2 [3/3] [UUU]
> bitmap: 3/3 pages [12KB], 65536KB chunk
>
> unused devices: <none>
--
piergiorgio
* Re: failed sector detected but disk still active ?
2022-05-13 16:02 ` Piergiorgio Sartor
@ 2022-05-14 13:46 ` Wols Lists
2022-05-15 18:39 ` Pascal Hambourg
2022-05-18 10:51 ` Gandalf Corvotempesta
0 siblings, 2 replies; 16+ messages in thread
From: Wols Lists @ 2022-05-14 13:46 UTC (permalink / raw)
To: Piergiorgio Sartor, Gandalf Corvotempesta; +Cc: Linux RAID Mailing List
On 13/05/2022 17:02, Piergiorgio Sartor wrote:
>> [Mon May 2 03:36:25 2022] Add. Sense: Unrecovered read error
>> [Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc] CDB:
>> [Mon May 2 03:36:25 2022] Read(10): 28 00 10 56 55 80 00 04 00 00
>> [Mon May 2 03:36:25 2022] end_request: critical medium error, dev
>> sdc, sector 274093444
>> [Mon May 2 04:06:32 2022] md: md0: data-check done.
> The error is reported from the device.
>
> As far as I know, and please someone correct
> me if I'm wrong, when a device has an error,
> "md" tries to re-write the data, using the
> redundancy, and, if no error occurs, it just
> continues, no reason to kick the device our
> of the array.
Correct. If the underlying disk returns an error, raid recovery kicks
in. The missing block is calculated, returned to the caller and written
back to the disk.
There's a whole bunch of reasons how/why this can occur. If it's a
transient failure and the re-write succeeds perfectly, everything is
normally hunky-dory.
There could be a problem with the drive, the drive re-locates the dodgy
sector, and everything APPEARS hunky-dory.
Or the rewrite fails, raid assumes the drive is faulty and kicks it out.
That's why you should never use desktop drives unless you know EXACTLY
what you are doing!
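
The three outcomes described above can be sketched as a toy model. This is not md's code: `ToyDisk` and `mirror_read` are hypothetical names, and the model only mirrors the behaviour described in this message (read from redundancy, write the block back, drop the member if the write-back fails).

```python
class ToyDisk:
    """Hypothetical mirror member holding sector -> bytes."""

    def __init__(self, name, data, unreadable=(), write_fails=False):
        self.name = name
        self.data = dict(data)
        self.unreadable = set(unreadable)  # sectors that currently fail on read
        self.write_fails = write_fails     # True: every rewrite is rejected

    def read(self, sector):
        if sector in self.unreadable:
            raise OSError(f"{self.name}: medium error at sector {sector}")
        return self.data[sector]

    def write(self, sector, payload):
        if self.write_fails:
            raise OSError(f"{self.name}: write failed at sector {sector}")
        self.data[sector] = payload
        self.unreadable.discard(sector)    # a successful rewrite clears the bad sector


def mirror_read(disks, sector):
    """Read one sector, repairing members that failed; return (data, kicked names)."""
    failed, payload = [], None
    for disk in disks:
        try:
            payload = disk.read(sector)
            break
        except OSError:
            failed.append(disk)
    if payload is None:
        raise OSError("no mirror could supply the data")
    kicked = []
    for disk in failed:
        try:
            disk.write(sector, payload)    # write the good copy back
        except OSError:
            kicked.append(disk.name)       # rewrite failed: member is dropped
    return payload, kicked


good = ToyDisk("sda1", {100: b"data"})
flaky = ToyDisk("sdc1", {100: b"????"}, unreadable={100})
print(mirror_read([flaky, good], 100))     # rewrite succeeds, nobody is kicked
```

In the first two cases from the message (transient failure, or the drive quietly relocating the sector) the caller sees good data and the member stays in the array, which matches the `[UUU]` in the original report.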
The error message is "critical medium error" - we have a real problem
with the disk I suspect.
FIRST run SMART on the disk and see what that reports. If that's not
happy, REPLACE THE DRIVE PRONTO.
If SMART is happy, run a raid scrub.
Either way, if you haven't replaced the drive, start monitoring SMART.
If disk errors start climbing, that's cause for concern and for replacing
the drive.
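
One way to watch for "errors start climbing" is to compare raw SMART counters between runs. The sketch below is a hedged example that parses the attribute table printed by `smartctl -A` on an ATA drive; SAS/SCSI drives report differently, so this may not apply to every disk. The sample text and its numbers are invented for illustration, and `smart_raw_values` and `worsened` are hypothetical helper names.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

# Invented sample in the column layout that `smartctl -A` prints for ATA drives.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
"""


def smart_raw_values(smartctl_output: str) -> dict[str, int]:
    """Extract the raw values of the watched attributes from `smartctl -A` text."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            values[fields[1]] = int(fields[9])
    return values


def worsened(previous: dict[str, int], current: dict[str, int]) -> list[str]:
    """Names of watched counters that grew since the previous sample."""
    return sorted(name for name, raw in current.items() if raw > previous.get(name, 0))


current = smart_raw_values(SAMPLE)
print(current)
print(worsened({"Current_Pending_Sector": 0, "Offline_Uncorrectable": 0}, current))
```

Storing the previous sample and alerting when `worsened` returns a non-empty list gives a simple trend check in place of a one-off look at the numbers.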
Cheers,
Wol
* Re: failed sector detected but disk still active ?
2022-05-13 7:37 failed sector detected but disk still active ? Gandalf Corvotempesta
2022-05-13 16:02 ` Piergiorgio Sartor
@ 2022-05-14 23:46 ` Jani Partanen
1 sibling, 0 replies; 16+ messages in thread
From: Jani Partanen @ 2022-05-14 23:46 UTC (permalink / raw)
To: Gandalf Corvotempesta, Linux RAID Mailing List
Interesting. Just yesterday I had the same error, but in my case it happened
while I was rebuilding a RAID-5 array. The system was very unhappy and eventually
stopped the rebuild because there were about 10 errors.
I was actually afraid that I had lost that pool, but after a reboot the
rebuild started again. At some point it showed the same errors again,
but only twice this time, and the rebuild finished.
Now I am running a check, and after the check I will run a btrfs scrub, because I
have btrfs on that pool with duplicated metadata.
What kernel version are you running? I'm on 5.17.6-300.fc36.x86_64
I now have 23 pending sectors on that disk, which for me is a pretty strong
indicator that I really need to replace it; going by previous experience,
it will be toast soon. So check the SMART status of your disk.
//JiiPee
Gandalf Corvotempesta wrote on 13/05/2022 at 10.37:
> How this is possible ?
> seems that sdc has some failed sectors but disk is still active in the array
>
> [Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc] Unhandled sense code
> [Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc]
> [Mon May 2 03:36:24 2022] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc]
> [Mon May 2 03:36:24 2022] Sense Key : Medium Error [current]
> [Mon May 2 03:36:24 2022] Info fld=0x10565570
> [Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc]
> [Mon May 2 03:36:24 2022] Add. Sense: Unrecovered read error
> [Mon May 2 03:36:24 2022] sd 0:0:2:0: [sdc] CDB:
> [Mon May 2 03:36:24 2022] Read(10): 28 00 10 56 51 80 00 04 00 00
> [Mon May 2 03:36:24 2022] end_request: critical medium error, dev
> sdc, sector 274093424
> [Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc] Unhandled sense code
> [Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc]
> [Mon May 2 03:36:25 2022] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc]
> [Mon May 2 03:36:25 2022] Sense Key : Medium Error [current]
> [Mon May 2 03:36:25 2022] Info fld=0x10565584
> [Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc]
> [Mon May 2 03:36:25 2022] Add. Sense: Unrecovered read error
> [Mon May 2 03:36:25 2022] sd 0:0:2:0: [sdc] CDB:
> [Mon May 2 03:36:25 2022] Read(10): 28 00 10 56 55 80 00 04 00 00
> [Mon May 2 03:36:25 2022] end_request: critical medium error, dev
> sdc, sector 274093444
> [Mon May 2 04:06:32 2022] md: md0: data-check done.
>
>
>
> # cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 sdc1[2] sda1[0] sdb1[1]
> 292836352 blocks super 1.2 [3/3] [UUU]
> bitmap: 3/3 pages [12KB], 65536KB chunk
>
> unused devices: <none>
* Re: failed sector detected but disk still active ?
2022-05-14 13:46 ` Wols Lists
@ 2022-05-15 18:39 ` Pascal Hambourg
2022-05-15 19:29 ` Wol
2022-05-15 19:47 ` Jani Partanen
2022-05-18 10:51 ` Gandalf Corvotempesta
1 sibling, 2 replies; 16+ messages in thread
From: Pascal Hambourg @ 2022-05-15 18:39 UTC (permalink / raw)
To: Wols Lists; +Cc: Linux RAID Mailing List
On 14/05/2022 at 15:46, Wols Lists wrote:
>
> Or the rewrite fails, raid assumes the drive is faulty and kicks it out.
> That's why you should never use desktop drives unless you know EXACTLY
> what you are doing!
What's wrong with desktop drives ?
* Re: failed sector detected but disk still active ?
2022-05-15 18:39 ` Pascal Hambourg
@ 2022-05-15 19:29 ` Wol
2022-05-16 6:47 ` Pascal Hambourg
2022-05-15 19:47 ` Jani Partanen
1 sibling, 1 reply; 16+ messages in thread
From: Wol @ 2022-05-15 19:29 UTC (permalink / raw)
To: Pascal Hambourg; +Cc: Linux RAID Mailing List
On 15/05/2022 19:39, Pascal Hambourg wrote:
> Le 14/05/2022 à 15:46, Wols Lists a écrit :
>>
>> Or the rewrite fails, raid assumes the drive is faulty and kicks it
>> out. That's why you should never use desktop drives unless you know
>> EXACTLY what you are doing!
>
> What's wrong with desktop drives ?
Once things start going wrong, they go pear-shaped very fast.
https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
tl;dr
Raid/Enterprise drives have something called SCT/ERC. If there's a
problem, the drive will abort the read/write, and return an error.
Consumer drives don't have this. If there's a problem, they can
typically take two minutes to respond. No matter whether the problem is
transient or real, that's a real bummer for whatever wants the data. The
kernel typically gives up waiting after 30secs, tries to talk to the
drive again, and on getting no response whatsoever assumes the disk has
failed. As far as raid is concerned, a faulty, non-responsive disk is
BAD NEWS.
It gets worse. SMR drives can, in the NORMAL course of events, take
about ten minutes to respond!
So basically, Enterprise drives typically take about 7 seconds to sort
out a problem. Consumer drives - the old CMR type - typically take about
2 minutes. New SMR drives can take 10s of minutes. And transient
problems aren't that uncommon. Worse, once things start going wrong, it
can explode very fast.
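
The mismatch described above comes down to comparing two numbers: how long the drive may spend on error recovery and how long the kernel waits. The following Python sketch uses only the figures from this message (about 7 s, about 2 minutes, 10 minutes, and a 30 s kernel timeout). The function names are hypothetical, `mitigation` only builds a command string and runs nothing, and the 180 s value is an assumption picked to exceed the 2-minute consumer-drive figure, not something stated in this thread.

```python
KERNEL_DEFAULT_TIMEOUT_S = 30  # figure quoted in the message above


def outcome(drive_recovery_s: float, kernel_timeout_s: float = KERNEL_DEFAULT_TIMEOUT_S) -> str:
    """Describe what happens when a drive needs `drive_recovery_s` to give up on a sector."""
    if drive_recovery_s < kernel_timeout_s:
        return "drive reports the error in time; md can rewrite the block"
    return "kernel gives up first; the drive risks being treated as failed"


def mitigation(device: str, supports_scterc: bool) -> str:
    """Return the shell command one might run for a whole disk such as 'sdc'."""
    if supports_scterc:
        # Cap error recovery at 7.0 s for reads and writes (units are 0.1 s).
        return f"smartctl -l scterc,70,70 /dev/{device}"
    # Assumed value: make the kernel wait longer than the drive's worst case.
    return f"echo 180 > /sys/block/{device}/device/timeout"


for label, seconds in [("enterprise", 7), ("consumer CMR", 120), ("SMR", 600)]:
    print(f"{label:>12}: {outcome(seconds)}")

print(mitigation("sdc", supports_scterc=True))
print(mitigation("sdc", supports_scterc=False))
```

Neither setting is guaranteed to survive a power cycle, so one would normally reapply it at boot; the wiki page linked above covers the details.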
Cheers,
Wol
* Re: failed sector detected but disk still active ?
2022-05-15 18:39 ` Pascal Hambourg
2022-05-15 19:29 ` Wol
@ 2022-05-15 19:47 ` Jani Partanen
1 sibling, 0 replies; 16+ messages in thread
From: Jani Partanen @ 2022-05-15 19:47 UTC (permalink / raw)
To: Pascal Hambourg, Wols Lists; +Cc: Linux RAID Mailing List
Lots of things. One example is SMR vs CMR. Another is WD Green drives,
which are very bad for RAID: they go to sleep constantly and basically
destroy themselves because of it.
Desktop drives also often lack some features that are not strictly
required but are very handy when doing RAID.
// JiiPee
Pascal Hambourg wrote on 15/05/2022 at 21.39:
> Le 14/05/2022 à 15:46, Wols Lists a écrit :
>>
>> Or the rewrite fails, raid assumes the drive is faulty and kicks it
>> out. That's why you should never use desktop drives unless you know
>> EXACTLY what you are doing!
>
> What's wrong with desktop drives ?
* Re: failed sector detected but disk still active ?
2022-05-15 19:29 ` Wol
@ 2022-05-16 6:47 ` Pascal Hambourg
2022-05-16 7:09 ` Wols Lists
0 siblings, 1 reply; 16+ messages in thread
From: Pascal Hambourg @ 2022-05-16 6:47 UTC (permalink / raw)
To: Wol; +Cc: Linux RAID Mailing List
On 15/05/2022 at 21:29, Wol wrote:
> On 15/05/2022 19:39, Pascal Hambourg wrote:
>> Le 14/05/2022 à 15:46, Wols Lists a écrit :
>>>
>>> Or the rewrite fails, raid assumes the drive is faulty and kicks it
>>> out. That's why you should never use desktop drives unless you know
>>> EXACTLY what you are doing!
>>
>> What's wrong with desktop drives ?
>
> Once things start going wrong, they go pear-shaped very fast.
>
> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
I did not mean "in general", but in relation to "Or the rewrite fails,
raid assumes the drive is faulty and kicks it out".
* Re: failed sector detected but disk still active ?
2022-05-16 6:47 ` Pascal Hambourg
@ 2022-05-16 7:09 ` Wols Lists
2022-05-17 14:57 ` Pascal Hambourg
0 siblings, 1 reply; 16+ messages in thread
From: Wols Lists @ 2022-05-16 7:09 UTC (permalink / raw)
To: Pascal Hambourg; +Cc: Linux RAID Mailing List
On 16/05/2022 07:47, Pascal Hambourg wrote:
> Le 15/05/2022 à 21:29, Wol a écrit :
>> On 15/05/2022 19:39, Pascal Hambourg wrote:
>>> Le 14/05/2022 à 15:46, Wols Lists a écrit :
>>>>
>>>> Or the rewrite fails, raid assumes the drive is faulty and kicks it
>>>> out. That's why you should never use desktop drives unless you know
>>>> EXACTLY what you are doing!
>>>
>>> What's wrong with desktop drives ?
>>
>> Once things start going wrong, they go pear-shaped very fast.
>>
>> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>
> I did not mean "in general", but in relation with "Or the rewrite fails,
> raid assumes the drive is faulty and kicks it out".
Well, the timeout mismatch is directly responsible for a non-faulty
drive being kicked out the array ...
Cheers,
Wol
* Re: failed sector detected but disk still active ?
2022-05-16 7:09 ` Wols Lists
@ 2022-05-17 14:57 ` Pascal Hambourg
2022-05-17 19:00 ` Wol
0 siblings, 1 reply; 16+ messages in thread
From: Pascal Hambourg @ 2022-05-17 14:57 UTC (permalink / raw)
To: Wols Lists; +Cc: Linux RAID Mailing List
On 16/05/2022 at 09:09, Wols Lists wrote:
> On 16/05/2022 07:47, Pascal Hambourg wrote:
>> Le 15/05/2022 à 21:29, Wol a écrit :
>>> On 15/05/2022 19:39, Pascal Hambourg wrote:
>>>> Le 14/05/2022 à 15:46, Wols Lists a écrit :
>>>>>
>>>>> Or the rewrite fails, raid assumes the drive is faulty and kicks it
>>>>> out. That's why you should never use desktop drives unless you know
>>>>> EXACTLY what you are doing!
>>>>
>>>> What's wrong with desktop drives ?
>>>
>>> Once things start going wrong, they go pear-shaped very fast.
>>>
>>> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>>
>> I did not mean "in general", but in relation with "Or the rewrite
>> fails, raid assumes the drive is faulty and kicks it out".
>
> Well, the timeout mismatch is directly responsible for a non-faulty
> drive being kicked out the array ...
Thanks, I overlooked that line :
"the RAID code recomputes the block and tries to write it back to the
disk. The disk is still trying to read the data and fails to respond".
On the other hand, I have seen faulty drives report success on a write to
an unreadable block and then fail an immediate read at the same location.
* Re: failed sector detected but disk still active ?
2022-05-17 14:57 ` Pascal Hambourg
@ 2022-05-17 19:00 ` Wol
2022-05-17 20:02 ` Reindl Harald
2022-05-17 21:27 ` Pascal Hambourg
0 siblings, 2 replies; 16+ messages in thread
From: Wol @ 2022-05-17 19:00 UTC (permalink / raw)
To: Pascal Hambourg; +Cc: Linux RAID Mailing List
On 17/05/2022 15:57, Pascal Hambourg wrote:
> On the other hand, I have seen faulty drives report success on write to
> an unreadable block then failing immediate read at the same location.
That's exactly what I'd expect from a write (as opposed to a
write-and-verify; they're not the same thing) ... :-(
Cheers,
Wol
* Re: failed sector detected but disk still active ?
2022-05-17 19:00 ` Wol
@ 2022-05-17 20:02 ` Reindl Harald
2022-05-17 21:27 ` Pascal Hambourg
1 sibling, 0 replies; 16+ messages in thread
From: Reindl Harald @ 2022-05-17 20:02 UTC (permalink / raw)
To: Wol, Pascal Hambourg; +Cc: Linux RAID Mailing List
On 17.05.22 at 21:00, Wol wrote:
> On 17/05/2022 15:57, Pascal Hambourg wrote:
>> On the other hand, I have seen faulty drives report success on write
>> to an unreadable block then failing immediate read at the same location.
>
> That's exactly what I'd expect from a write (as opposed to a
> write-and-verify, they're not the same thing ... :-(
But shouldn't it be a write-and-verify? I mean, it's error handling at
that point, and you *really* want to be sure in such events.
* Re: failed sector detected but disk still active ?
2022-05-17 19:00 ` Wol
2022-05-17 20:02 ` Reindl Harald
@ 2022-05-17 21:27 ` Pascal Hambourg
2022-05-18 3:43 ` Wols Lists
1 sibling, 1 reply; 16+ messages in thread
From: Pascal Hambourg @ 2022-05-17 21:27 UTC (permalink / raw)
To: Wol; +Cc: Linux RAID Mailing List
On 17/05/2022 at 21:00, Wol wrote:
> On 17/05/2022 15:57, Pascal Hambourg wrote:
>> On the other hand, I have seen faulty drives report success on write
>> to an unreadable block then failing immediate read at the same location.
>
> That's exactly what I'd expect from a write (as opposed to a
> write-and-verify, they're not the same thing ... :-(
Even when the drive knows that a read of this block previously failed?
On the contrary, I would expect a decent drive to verify after the
write and to reallocate the block if the read still fails.
* Re: failed sector detected but disk still active ?
2022-05-17 21:27 ` Pascal Hambourg
@ 2022-05-18 3:43 ` Wols Lists
2022-05-18 10:45 ` Pascal Hambourg
0 siblings, 1 reply; 16+ messages in thread
From: Wols Lists @ 2022-05-18 3:43 UTC (permalink / raw)
To: Pascal Hambourg; +Cc: Linux RAID Mailing List
On 17/05/2022 22:27, Pascal Hambourg wrote:
> Le 17/05/2022 à 21:00, Wol a écrit :
>> On 17/05/2022 15:57, Pascal Hambourg wrote:
>>> On the other hand, I have seen faulty drives report success on write
>>> to an unreadable block then failing immediate read at the same location.
>>
>> That's exactly what I'd expect from a write (as opposed to a
>> write-and-verify, they're not the same thing ... :-(
>
> Even when the drive knows that a read of this block previously failed ?
> On the contrary I would expect from a decent drive to verify after the
> write and to reallocate the block if the read still fails.
The drive knows, or the OS knows? Cheap drives won't remember,
Enterprise drives maybe. Unless the OS explicitly asks for a "write and
verify", it's unlikely to happen, and cheap drives probably won't even
recognise such a request.
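
The difference between a plain write and a host-side write-and-verify can be illustrated with a toy model. `ToyDrive` is a hypothetical stand-in for the behaviour reported earlier in the thread (a write to a damaged sector reports success, and the next read still fails); it does not model any real firmware. On real hardware a read-back might also be served from the drive's cache, so this is only a sketch of the idea.

```python
class ToyDrive:
    """Hypothetical drive: a damaged sector accepts writes but still fails reads."""

    def __init__(self, damaged=()):
        self.sectors = {}
        self.damaged = set(damaged)

    def write(self, sector, payload):
        # Reported as success; the drive checks nothing.
        self.sectors[sector] = payload

    def read(self, sector):
        if sector in self.damaged:
            raise OSError(f"unrecovered read error at sector {sector}")
        return self.sectors[sector]


def write_and_verify(drive, sector, payload):
    """Write, then read back and compare; True only if the data round-trips."""
    drive.write(sector, payload)
    try:
        return drive.read(sector) == payload
    except OSError:
        return False


drive = ToyDrive(damaged={7})
drive.write(7, b"rebuilt")                       # plain write: no error is raised
print(write_and_verify(drive, 7, b"rebuilt"))    # read-back exposes the bad sector
print(write_and_verify(drive, 8, b"rebuilt"))    # healthy sector round-trips
```

The plain write on sector 7 returns without complaint, and only the read-back reveals that the sector is still bad, which is the gap being discussed here.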
Cheers,
Wol
* Re: failed sector detected but disk still active ?
2022-05-18 3:43 ` Wols Lists
@ 2022-05-18 10:45 ` Pascal Hambourg
0 siblings, 0 replies; 16+ messages in thread
From: Pascal Hambourg @ 2022-05-18 10:45 UTC (permalink / raw)
To: Wols Lists; +Cc: Linux RAID Mailing List
On 18/05/2022 at 05:43, Wols Lists wrote:
> On 17/05/2022 22:27, Pascal Hambourg wrote:
>> Le 17/05/2022 à 21:00, Wol a écrit :
>>> On 17/05/2022 15:57, Pascal Hambourg wrote:
>>>> On the other hand, I have seen faulty drives report success on write
>>>> to an unreadable block then failing immediate read at the same
>>>> location.
>>>
>>> That's exactly what I'd expect from a write (as opposed to a
>>> write-and-verify, they're not the same thing ... :-(
>>
>> Even when the drive knows that a read of this block previously failed ?
>> On the contrary I would expect from a decent drive to verify after the
>> write and to reallocate the block if the read still fails.
>
> The drive knows, or the OS knows? Cheap drives won't remember,
The drive should know. It was the one which reported read failure to the
host, ran failed offline self-tests and increased the bad block count in
SMART attributes (Current_Pending_Sector, Offline_Uncorrectable).
Else how would it be able to decrease the bad block count in SMART
attributes after a successful write into a bad block if it doesn't
remember that the block was bad ?
* Re: failed sector detected but disk still active ?
2022-05-14 13:46 ` Wols Lists
2022-05-15 18:39 ` Pascal Hambourg
@ 2022-05-18 10:51 ` Gandalf Corvotempesta
1 sibling, 0 replies; 16+ messages in thread
From: Gandalf Corvotempesta @ 2022-05-18 10:51 UTC (permalink / raw)
To: Wols Lists; +Cc: Piergiorgio Sartor, Linux RAID Mailing List
On Sat 14 May 2022 at 15:46, Wols Lists
<antlists@youngman.org.uk> wrote:
> Correct. If the underlying disk returns an error, raid recovery kicks
> in. The missing block is calculated, returned to the caller and written
> back to the disk.
But in this case I would expect md to log something somewhere,
not total silence.
> The error message is "critical medium error" - we have a real problem
> with the disk I suspect.
>
> FIRST run SMART on the disk and see what that reports. If that's not
> happy, REPLACE THE DRIVE PRONTO.
>
> If SMART is happy, run a raid scrub.
When this happens, I replace drives ASAP; it doesn't matter whether it's
a transient failure or similar.
A working disk, for me, is a disk that NEVER reports any kind of
issue. I usually replace disks even
when there is a single recovered sector.
> And whatever, if you haven't replaced the drive, start monitoring SMART.
> If disk errors start climbing, that's a cause for concern and replacing
> the drive.
All disks are under SMART monitoring with both short and long tests
(weekly), plus a weekly (or monthly? I don't remember) md
consistency check.
Anyway, as our new servers have some free slots (we keep slots free
intentionally), our replacements don't
involve removing the old drive (losing part of the redundancy) and then
adding a new one; we always use a replace:
mdadm /dev/md0 --add /dev/NEW --replace /dev/OLD --with /dev/NEW
It's MUCH safer, but what happens if /dev/OLD fails during
the replacement? Will the rebuild transparently read from the other
drives?
And normally, are reads done from OLD in this case, or from the full array?