* failed sector detected but disk still active ?
@ 2022-05-13  7:37 Gandalf Corvotempesta
  2022-05-13 16:02 ` Piergiorgio Sartor
  2022-05-14 23:46 ` Jani Partanen
  0 siblings, 2 replies; 16+ messages in thread
From: Gandalf Corvotempesta @ 2022-05-13  7:37 UTC (permalink / raw)
  To: Linux RAID Mailing List

How is this possible?
It seems that sdc has some failed sectors, but the disk is still active in the array.

[Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc] Unhandled sense code
[Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc]
[Mon May  2 03:36:24 2022] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc]
[Mon May  2 03:36:24 2022] Sense Key : Medium Error [current]
[Mon May  2 03:36:24 2022] Info fld=0x10565570
[Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc]
[Mon May  2 03:36:24 2022] Add. Sense: Unrecovered read error
[Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc] CDB:
[Mon May  2 03:36:24 2022] Read(10): 28 00 10 56 51 80 00 04 00 00
[Mon May  2 03:36:24 2022] end_request: critical medium error, dev sdc, sector 274093424
[Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc] Unhandled sense code
[Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc]
[Mon May  2 03:36:25 2022] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc]
[Mon May  2 03:36:25 2022] Sense Key : Medium Error [current]
[Mon May  2 03:36:25 2022] Info fld=0x10565584
[Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc]
[Mon May  2 03:36:25 2022] Add. Sense: Unrecovered read error
[Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc] CDB:
[Mon May  2 03:36:25 2022] Read(10): 28 00 10 56 55 80 00 04 00 00
[Mon May  2 03:36:25 2022] end_request: critical medium error, dev sdc, sector 274093444
[Mon May  2 04:06:32 2022] md: md0: data-check done.
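
For reference, the hex values in the log above can be decoded by hand. This is a minimal sketch, assuming a POSIX shell; the function names are made up for illustration. It relies on the standard Read(10) layout, where bytes 2-5 of the CDB hold the starting LBA and bytes 7-8 the transfer length in blocks.

```shell
#!/bin/sh
# Convert the hex "Info fld" value from the sense data to a decimal sector.
info_to_sector() {
    echo "$(( $1 ))"
}

# Decode a Read(10) CDB given as ten space-separated hex bytes.
decode_read10() {
    # Split the string into positional parameters: $1 is byte 0, $3 is byte 2...
    set -- $1
    lba=$(( 0x$3$4$5$6 ))   # bytes 2-5: starting LBA
    len=$(( 0x$8$9 ))       # bytes 7-8: transfer length in blocks
    echo "start=$lba length=$len"
}

info_to_sector 0x10565570                       # prints 274093424
decode_read10 "28 00 10 56 51 80 00 04 00 00"   # prints start=274092416 length=1024
```

So the first failing sector (274093424) lies inside the 1024-block read that started at LBA 274092416, and the "Info fld" value is the same number the kernel prints in the end_request line.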



# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[2] sda1[0] sdb1[1]
      292836352 blocks super 1.2 [3/3] [UUU]
      bitmap: 3/3 pages [12KB], 65536KB chunk

unused devices: <none>
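
The [UUU] flags above are what show that all three members are still active. A small sketch of how that check could be scripted, assuming the usual /proc/mdstat layout shown in this message; mdstat_health is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Read /proc/mdstat-style text on stdin and print one line per array:
# name, member flags, and "ok" or "DEGRADED" (an underscore marks a
# missing or failed member).
mdstat_health() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        match($0, /\[[U_]+\]/) {
            flags = substr($0, RSTART, RLENGTH)
            state = (index(flags, "_") > 0) ? "DEGRADED" : "ok"
            print name, flags, state
        }
    '
}

# Example: mdstat_health < /proc/mdstat
```

For the output quoted above this would print "md0 [UUU] ok".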


* Re: failed sector detected but disk still active ?
  2022-05-13  7:37 failed sector detected but disk still active ? Gandalf Corvotempesta
@ 2022-05-13 16:02 ` Piergiorgio Sartor
  2022-05-14 13:46   ` Wols Lists
  2022-05-14 23:46 ` Jani Partanen
  1 sibling, 1 reply; 16+ messages in thread
From: Piergiorgio Sartor @ 2022-05-13 16:02 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: Linux RAID Mailing List

On Fri, May 13, 2022 at 09:37:13AM +0200, Gandalf Corvotempesta wrote:
> How this is possible ?
> seems that sdc has some failed sectors but disk is still active in the array
> 
> [Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc] Unhandled sense code
> [Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc]
> [Mon May  2 03:36:24 2022] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc]
> [Mon May  2 03:36:24 2022] Sense Key : Medium Error [current]
> [Mon May  2 03:36:24 2022] Info fld=0x10565570
> [Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc]
> [Mon May  2 03:36:24 2022] Add. Sense: Unrecovered read error
> [Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc] CDB:
> [Mon May  2 03:36:24 2022] Read(10): 28 00 10 56 51 80 00 04 00 00
> [Mon May  2 03:36:24 2022] end_request: critical medium error, dev
> sdc, sector 274093424
> [Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc] Unhandled sense code
> [Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc]
> [Mon May  2 03:36:25 2022] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc]
> [Mon May  2 03:36:25 2022] Sense Key : Medium Error [current]
> [Mon May  2 03:36:25 2022] Info fld=0x10565584
> [Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc]
> [Mon May  2 03:36:25 2022] Add. Sense: Unrecovered read error
> [Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc] CDB:
> [Mon May  2 03:36:25 2022] Read(10): 28 00 10 56 55 80 00 04 00 00
> [Mon May  2 03:36:25 2022] end_request: critical medium error, dev
> sdc, sector 274093444
> [Mon May  2 04:06:32 2022] md: md0: data-check done.

The error is reported from the device.

As far as I know, and please someone correct
me if I'm wrong, when a device reports a read error,
"md" tries to rewrite the data using the
redundancy and, if no error occurs, it just
continues; there is no reason to kick the device out
of the array.

bye,

pg

> 
> 
> 
> # cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 sdc1[2] sda1[0] sdb1[1]
>       292836352 blocks super 1.2 [3/3] [UUU]
>       bitmap: 3/3 pages [12KB], 65536KB chunk
> 
> unused devices: <none>

-- 

piergiorgio


* Re: failed sector detected but disk still active ?
  2022-05-13 16:02 ` Piergiorgio Sartor
@ 2022-05-14 13:46   ` Wols Lists
  2022-05-15 18:39     ` Pascal Hambourg
  2022-05-18 10:51     ` Gandalf Corvotempesta
  0 siblings, 2 replies; 16+ messages in thread
From: Wols Lists @ 2022-05-14 13:46 UTC (permalink / raw)
  To: Piergiorgio Sartor, Gandalf Corvotempesta; +Cc: Linux RAID Mailing List

On 13/05/2022 17:02, Piergiorgio Sartor wrote:
>> [Mon May  2 03:36:25 2022] Add. Sense: Unrecovered read error
>> [Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc] CDB:
>> [Mon May  2 03:36:25 2022] Read(10): 28 00 10 56 55 80 00 04 00 00
>> [Mon May  2 03:36:25 2022] end_request: critical medium error, dev
>> sdc, sector 274093444
>> [Mon May  2 04:06:32 2022] md: md0: data-check done.
> The error is reported from the device.
> 
> As far as I know, and please someone correct
> me if I'm wrong, when a device has an error,
> "md" tries to re-write the data, using the
> redundancy, and, if no error occurs, it just
> continues, no reason to kick the device our
> of the array.

Correct. If the underlying disk returns an error, raid recovery kicks 
in. The missing block is calculated, returned to the caller and written 
back to the disk.

There's a whole bunch of reasons how/why this can occur. If it's a 
transient failure and the re-write succeeds perfectly, everything is 
normally hunky-dory.

There could be a problem with the drive, the drive re-locates the dodgy 
sector, and everything APPEARS hunky-dory.

Or the rewrite fails, raid assumes the drive is faulty and kicks it out. 
That's why you should never use desktop drives unless you know EXACTLY 
what you are doing!

The error message is "critical medium error" - we have a real problem 
with the disk I suspect.

FIRST run SMART on the disk and see what that reports. If that's not 
happy, REPLACE THE DRIVE PRONTO.

If SMART is happy, run a raid scrub.

Either way, if you haven't replaced the drive, start monitoring SMART. 
If disk errors start climbing, that's cause for concern and a reason to 
replace the drive.
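
The SMART check and the scrub suggested above could look roughly like this. It is only a sketch: the device and array names are the ones from this thread, and the SYSFS variable is a hypothetical override added so the functions can be tried against a scratch directory. The sysfs files sync_action and mismatch_cnt are the standard md interface.

```shell
#!/bin/sh
# SYSFS is a hypothetical override for dry runs; leave it unset on a real system.
SYSFS=${SYSFS:-/sys}

# Print overall health (-H) and the attribute table (-A) for one disk.
smart_report() {
    smartctl -H -A "$1"
}

# Ask md to start a "check" pass (scrub) over the whole array.
start_scrub() {
    echo check > "$SYSFS/block/$1/md/sync_action"
}

# Show the current sync action and the mismatch count of the last check.
scrub_status() {
    printf 'action=%s mismatches=%s\n' \
        "$(cat "$SYSFS/block/$1/md/sync_action")" \
        "$(cat "$SYSFS/block/$1/md/mismatch_cnt")"
}

# Example (as root):
#   smart_report /dev/sdc
#   start_scrub md0
#   scrub_status md0
```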

Cheers,
Wol


* Re: failed sector detected but disk still active ?
  2022-05-13  7:37 failed sector detected but disk still active ? Gandalf Corvotempesta
  2022-05-13 16:02 ` Piergiorgio Sartor
@ 2022-05-14 23:46 ` Jani Partanen
  1 sibling, 0 replies; 16+ messages in thread
From: Jani Partanen @ 2022-05-14 23:46 UTC (permalink / raw)
  To: Gandalf Corvotempesta, Linux RAID Mailing List

Interesting. Just yesterday I had the same error, but in my case it happened 
while I was rebuilding a raid-5, and the system was very unhappy; it eventually 
stopped the rebuild because there were about 10 errors.
I was actually afraid that I had lost that pool, but when I rebooted, the 
rebuild started again. At some point it showed the same errors again, 
but only twice this time, and the rebuild finished.
Now I am running a check, and after the check I will run a btrfs scrub, because I 
have btrfs on that pool with double metadata.

What kernel version are you running? I'm on 5.17.6-300.fc36.x86_64

I now have 23 pending sectors on that disk, which for me is quite a big 
indicator that I really need to replace it; in my previous experience it will 
be toast soon. So check the SMART status of your disk.


//JiiPee


Gandalf Corvotempesta wrote on 13/05/2022 at 10.37:
> How this is possible ?
> seems that sdc has some failed sectors but disk is still active in the array
>
> [Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc] Unhandled sense code
> [Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc]
> [Mon May  2 03:36:24 2022] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc]
> [Mon May  2 03:36:24 2022] Sense Key : Medium Error [current]
> [Mon May  2 03:36:24 2022] Info fld=0x10565570
> [Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc]
> [Mon May  2 03:36:24 2022] Add. Sense: Unrecovered read error
> [Mon May  2 03:36:24 2022] sd 0:0:2:0: [sdc] CDB:
> [Mon May  2 03:36:24 2022] Read(10): 28 00 10 56 51 80 00 04 00 00
> [Mon May  2 03:36:24 2022] end_request: critical medium error, dev
> sdc, sector 274093424
> [Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc] Unhandled sense code
> [Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc]
> [Mon May  2 03:36:25 2022] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc]
> [Mon May  2 03:36:25 2022] Sense Key : Medium Error [current]
> [Mon May  2 03:36:25 2022] Info fld=0x10565584
> [Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc]
> [Mon May  2 03:36:25 2022] Add. Sense: Unrecovered read error
> [Mon May  2 03:36:25 2022] sd 0:0:2:0: [sdc] CDB:
> [Mon May  2 03:36:25 2022] Read(10): 28 00 10 56 55 80 00 04 00 00
> [Mon May  2 03:36:25 2022] end_request: critical medium error, dev
> sdc, sector 274093444
> [Mon May  2 04:06:32 2022] md: md0: data-check done.
>
>
>
> # cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 sdc1[2] sda1[0] sdb1[1]
>        292836352 blocks super 1.2 [3/3] [UUU]
>        bitmap: 3/3 pages [12KB], 65536KB chunk
>
> unused devices: <none>



* Re: failed sector detected but disk still active ?
  2022-05-14 13:46   ` Wols Lists
@ 2022-05-15 18:39     ` Pascal Hambourg
  2022-05-15 19:29       ` Wol
  2022-05-15 19:47       ` Jani Partanen
  2022-05-18 10:51     ` Gandalf Corvotempesta
  1 sibling, 2 replies; 16+ messages in thread
From: Pascal Hambourg @ 2022-05-15 18:39 UTC (permalink / raw)
  To: Wols Lists; +Cc: Linux RAID Mailing List

On 14/05/2022 at 15:46, Wols Lists wrote:
> 
> Or the rewrite fails, raid assumes the drive is faulty and kicks it out. 
> That's why you should never use desktop drives unless you know EXACTLY 
> what you are doing!

What's wrong with desktop drives ?


* Re: failed sector detected but disk still active ?
  2022-05-15 18:39     ` Pascal Hambourg
@ 2022-05-15 19:29       ` Wol
  2022-05-16  6:47         ` Pascal Hambourg
  2022-05-15 19:47       ` Jani Partanen
  1 sibling, 1 reply; 16+ messages in thread
From: Wol @ 2022-05-15 19:29 UTC (permalink / raw)
  To: Pascal Hambourg; +Cc: Linux RAID Mailing List

On 15/05/2022 19:39, Pascal Hambourg wrote:
> Le 14/05/2022 à 15:46, Wols Lists a écrit :
>>
>> Or the rewrite fails, raid assumes the drive is faulty and kicks it 
>> out. That's why you should never use desktop drives unless you know 
>> EXACTLY what you are doing!
> 
> What's wrong with desktop drives ?

Once things start going wrong, they go pear-shaped very fast.

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

tl;dr

Raid/Enterprise drives have something called SCT/ERC. If there's a 
problem, the drive will abort the read/write, and return an error.

Consumer drives don't have this. If there's a problem, they can 
typically take two minutes to respond. No matter whether the problem is 
transient or real, that's a real bummer for whatever wants the data. The 
kernel typically gives up waiting after 30secs, tries to talk to the 
drive again, and on getting no response whatsoever assumes the disk has 
failed. As far as raid is concerned, a faulty, non-responsive disk is 
BAD NEWS.

It gets worse. SMR drives can, in the NORMAL course of events, take 
about ten minutes to respond!

So basically, Enterprise drives typically take about 7 seconds to sort 
out a problem. Consumer drives - the old CMR type - typically take about 
2 minutes. New SMR drives can take tens of minutes. And transient 
problems aren't that uncommon. Worse, once things start going wrong, it 
can explode very fast.
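
A rough sketch of the usual workaround, along the lines of the script on the Timeout_Mismatch wiki page linked above: try to enable SCT ERC with a 7-second limit (smartctl takes the value in tenths of a second, so 70), and if the drive refuses, raise the kernel's command timeout from its default 30 seconds to 180 instead. The SMARTCTL and SYSFS variables are hypothetical overrides for dry runs, and align_timeouts is a made-up name. The sketch assumes, as the wiki script does, that smartctl exits non-zero when the drive does not accept the setting.

```shell
#!/bin/sh
# Hypothetical overrides so the logic can be tried without real hardware.
SMARTCTL=${SMARTCTL:-smartctl}
SYSFS=${SYSFS:-/sys}

align_timeouts() {
    dev=$1      # bare kernel name such as sdc, not /dev/sdc
    if "$SMARTCTL" -l scterc,70,70 "/dev/$dev" >/dev/null 2>&1; then
        echo "$dev: SCT ERC set to 7.0 s"
    else
        # No usable SCT ERC: let the kernel wait longer than the drive
        # may take for its own internal error recovery.
        echo 180 > "$SYSFS/block/$dev/device/timeout"
        echo "$dev: no SCT ERC, kernel timeout raised to 180 s"
    fi
}

# Example (as root): for d in sda sdb sdc; do align_timeouts "$d"; done
```

Neither setting is normally persistent, so something like this would have to run again at every boot.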

Cheers,
Wol


* Re: failed sector detected but disk still active ?
  2022-05-15 18:39     ` Pascal Hambourg
  2022-05-15 19:29       ` Wol
@ 2022-05-15 19:47       ` Jani Partanen
  1 sibling, 0 replies; 16+ messages in thread
From: Jani Partanen @ 2022-05-15 19:47 UTC (permalink / raw)
  To: Pascal Hambourg, Wols Lists; +Cc: Linux RAID Mailing List

Lots of things. One example is SMR vs CMR. Another example is WD Green drives, 
which are very bad for raid. They like to go to sleep constantly and basically 
destroy themselves because of that.

Desktop drives are also often missing some features which are not 
required, but are very handy when doing raid.


// JiiPee

Pascal Hambourg wrote on 15/05/2022 at 21.39:
> Le 14/05/2022 à 15:46, Wols Lists a écrit :
>>
>> Or the rewrite fails, raid assumes the drive is faulty and kicks it 
>> out. That's why you should never use desktop drives unless you know 
>> EXACTLY what you are doing!
>
> What's wrong with desktop drives ?



* Re: failed sector detected but disk still active ?
  2022-05-15 19:29       ` Wol
@ 2022-05-16  6:47         ` Pascal Hambourg
  2022-05-16  7:09           ` Wols Lists
  0 siblings, 1 reply; 16+ messages in thread
From: Pascal Hambourg @ 2022-05-16  6:47 UTC (permalink / raw)
  To: Wol; +Cc: Linux RAID Mailing List

On 15/05/2022 at 21:29, Wol wrote:
> On 15/05/2022 19:39, Pascal Hambourg wrote:
>> Le 14/05/2022 à 15:46, Wols Lists a écrit :
>>>
>>> Or the rewrite fails, raid assumes the drive is faulty and kicks it 
>>> out. That's why you should never use desktop drives unless you know 
>>> EXACTLY what you are doing!
>>
>> What's wrong with desktop drives ?
> 
> Once things start going wrong, they go pear-shaped very fast.
> 
> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

I did not mean "in general", but in relation to "Or the rewrite fails, 
raid assumes the drive is faulty and kicks it out".


* Re: failed sector detected but disk still active ?
  2022-05-16  6:47         ` Pascal Hambourg
@ 2022-05-16  7:09           ` Wols Lists
  2022-05-17 14:57             ` Pascal Hambourg
  0 siblings, 1 reply; 16+ messages in thread
From: Wols Lists @ 2022-05-16  7:09 UTC (permalink / raw)
  To: Pascal Hambourg; +Cc: Linux RAID Mailing List

On 16/05/2022 07:47, Pascal Hambourg wrote:
> Le 15/05/2022 à 21:29, Wol a écrit :
>> On 15/05/2022 19:39, Pascal Hambourg wrote:
>>> Le 14/05/2022 à 15:46, Wols Lists a écrit :
>>>>
>>>> Or the rewrite fails, raid assumes the drive is faulty and kicks it 
>>>> out. That's why you should never use desktop drives unless you know 
>>>> EXACTLY what you are doing!
>>>
>>> What's wrong with desktop drives ?
>>
>> Once things start going wrong, they go pear-shaped very fast.
>>
>> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
> 
> I did not mean "in general", but in relation with "Or the rewrite fails, 
> raid assumes the drive is faulty and kicks it out".

Well, the timeout mismatch is directly responsible for a non-faulty 
drive being kicked out the array ...

Cheers,
Wol


* Re: failed sector detected but disk still active ?
  2022-05-16  7:09           ` Wols Lists
@ 2022-05-17 14:57             ` Pascal Hambourg
  2022-05-17 19:00               ` Wol
  0 siblings, 1 reply; 16+ messages in thread
From: Pascal Hambourg @ 2022-05-17 14:57 UTC (permalink / raw)
  To: Wols Lists; +Cc: Linux RAID Mailing List

On 16/05/2022 at 09:09, Wols Lists wrote:
> On 16/05/2022 07:47, Pascal Hambourg wrote:
>> Le 15/05/2022 à 21:29, Wol a écrit :
>>> On 15/05/2022 19:39, Pascal Hambourg wrote:
>>>> Le 14/05/2022 à 15:46, Wols Lists a écrit :
>>>>>
>>>>> Or the rewrite fails, raid assumes the drive is faulty and kicks it 
>>>>> out. That's why you should never use desktop drives unless you know 
>>>>> EXACTLY what you are doing!
>>>>
>>>> What's wrong with desktop drives ?
>>>
>>> Once things start going wrong, they go pear-shaped very fast.
>>>
>>> https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>>
>> I did not mean "in general", but in relation with "Or the rewrite 
>> fails, raid assumes the drive is faulty and kicks it out".
> 
> Well, the timeout mismatch is directly responsible for a non-faulty 
> drive being kicked out the array ...

Thanks, I overlooked that line :
"the RAID code recomputes the block and tries to write it back to the 
disk. The disk is still trying to read the data and fails to respond".

On the other hand, I have seen faulty drives report success on a write to 
an unreadable block and then fail an immediate read at the same location.


* Re: failed sector detected but disk still active ?
  2022-05-17 14:57             ` Pascal Hambourg
@ 2022-05-17 19:00               ` Wol
  2022-05-17 20:02                 ` Reindl Harald
  2022-05-17 21:27                 ` Pascal Hambourg
  0 siblings, 2 replies; 16+ messages in thread
From: Wol @ 2022-05-17 19:00 UTC (permalink / raw)
  To: Pascal Hambourg; +Cc: Linux RAID Mailing List

On 17/05/2022 15:57, Pascal Hambourg wrote:
> On the other hand, I have seen faulty drives report success on write to 
> an unreadable block then failing immediate read at the same location.

That's exactly what I'd expect from a write (as opposed to a 
write-and-verify); they're not the same thing ... :-(

Cheers,
Wol


* Re: failed sector detected but disk still active ?
  2022-05-17 19:00               ` Wol
@ 2022-05-17 20:02                 ` Reindl Harald
  2022-05-17 21:27                 ` Pascal Hambourg
  1 sibling, 0 replies; 16+ messages in thread
From: Reindl Harald @ 2022-05-17 20:02 UTC (permalink / raw)
  To: Wol, Pascal Hambourg; +Cc: Linux RAID Mailing List



On 17.05.22 at 21:00, Wol wrote:
> On 17/05/2022 15:57, Pascal Hambourg wrote:
>> On the other hand, I have seen faulty drives report success on write 
>> to an unreadable block then failing immediate read at the same location.
> 
> That's exactly what I'd expect from a write (as opposed to a 
> write-and-verify, they're not the same thing ... :-(

But shouldn't it be a write-and-verify? I mean, it's error handling at 
that point, and you *really* want to be sure in such events.


* Re: failed sector detected but disk still active ?
  2022-05-17 19:00               ` Wol
  2022-05-17 20:02                 ` Reindl Harald
@ 2022-05-17 21:27                 ` Pascal Hambourg
  2022-05-18  3:43                   ` Wols Lists
  1 sibling, 1 reply; 16+ messages in thread
From: Pascal Hambourg @ 2022-05-17 21:27 UTC (permalink / raw)
  To: Wol; +Cc: Linux RAID Mailing List

On 17/05/2022 at 21:00, Wol wrote:
> On 17/05/2022 15:57, Pascal Hambourg wrote:
>> On the other hand, I have seen faulty drives report success on write 
>> to an unreadable block then failing immediate read at the same location.
> 
> That's exactly what I'd expect from a write (as opposed to a 
> write-and-verify, they're not the same thing ... :-(

Even when the drive knows that a read of this block previously failed?
On the contrary, I would expect a decent drive to verify after the 
write and to reallocate the block if the read still fails.


* Re: failed sector detected but disk still active ?
  2022-05-17 21:27                 ` Pascal Hambourg
@ 2022-05-18  3:43                   ` Wols Lists
  2022-05-18 10:45                     ` Pascal Hambourg
  0 siblings, 1 reply; 16+ messages in thread
From: Wols Lists @ 2022-05-18  3:43 UTC (permalink / raw)
  To: Pascal Hambourg; +Cc: Linux RAID Mailing List

On 17/05/2022 22:27, Pascal Hambourg wrote:
> Le 17/05/2022 à 21:00, Wol a écrit :
>> On 17/05/2022 15:57, Pascal Hambourg wrote:
>>> On the other hand, I have seen faulty drives report success on write 
>>> to an unreadable block then failing immediate read at the same location.
>>
>> That's exactly what I'd expect from a write (as opposed to a 
>> write-and-verify, they're not the same thing ... :-(
> 
> Even when the drive knows that a read of this block previously failed ?
> On the contrary I would expect from a decent drive to verify after the 
> write and to reallocate the block if the read still fails.

The drive knows, or the OS knows? Cheap drives won't remember, 
Enterprise drives maybe. Unless the OS explicitly asks for a "write and 
verify", it's unlikely to happen, and cheap drives probably won't even 
recognise such a request.

Cheers,
Wol


* Re: failed sector detected but disk still active ?
  2022-05-18  3:43                   ` Wols Lists
@ 2022-05-18 10:45                     ` Pascal Hambourg
  0 siblings, 0 replies; 16+ messages in thread
From: Pascal Hambourg @ 2022-05-18 10:45 UTC (permalink / raw)
  To: Wols Lists; +Cc: Linux RAID Mailing List

On 18/05/2022 at 05:43, Wols Lists wrote:
> On 17/05/2022 22:27, Pascal Hambourg wrote:
>> Le 17/05/2022 à 21:00, Wol a écrit :
>>> On 17/05/2022 15:57, Pascal Hambourg wrote:
>>>> On the other hand, I have seen faulty drives report success on write 
>>>> to an unreadable block then failing immediate read at the same 
>>>> location.
>>>
>>> That's exactly what I'd expect from a write (as opposed to a 
>>> write-and-verify, they're not the same thing ... :-(
>>
>> Even when the drive knows that a read of this block previously failed ?
>> On the contrary I would expect from a decent drive to verify after the 
>> write and to reallocate the block if the read still fails.
> 
> The drive knows, or the OS knows? Cheap drives won't remember, 

The drive should know. It was the one which reported read failure to the 
host, ran failed offline self-tests and increased the bad block count in 
SMART attributes (Current_Pending_Sector, Offline_Uncorrectable).

Else how would it be able to decrease the bad block count in SMART 
attributes after a successful write into a bad block if it doesn't 
remember that the block was bad ?
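
The counters mentioned above can be pulled out of the attribute table and logged over time, which makes it easy to see whether a rewrite actually brought them back down. A minimal sketch, assuming an ATA drive whose "smartctl -A" output uses the usual ten-column table with the raw value last; bad_block_counts is a hypothetical helper name.

```shell
#!/bin/sh
# Read "smartctl -A" output on stdin and print the raw values of the
# attributes that track bad or suspect blocks.
bad_block_counts() {
    awk '
        $2 == "Reallocated_Sector_Ct"  ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable"  { print $2, $10 }
    '
}

# Example (as root): smartctl -A /dev/sdc | bad_block_counts
```

SAS/SCSI drives report defects differently (grown defect list and error counter logs), so this filter would print nothing for them.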


* Re: failed sector detected but disk still active ?
  2022-05-14 13:46   ` Wols Lists
  2022-05-15 18:39     ` Pascal Hambourg
@ 2022-05-18 10:51     ` Gandalf Corvotempesta
  1 sibling, 0 replies; 16+ messages in thread
From: Gandalf Corvotempesta @ 2022-05-18 10:51 UTC (permalink / raw)
  To: Wols Lists; +Cc: Piergiorgio Sartor, Linux RAID Mailing List

On Sat, 14 May 2022 at 15:46, Wols Lists
<antlists@youngman.org.uk> wrote:
> Correct. If the underlying disk returns an error, raid recovery kicks
> in. The missing block is calculated, returned to the caller and written
> back to the disk.

But in this case I would expect md to log something somewhere,
not total silence.

> The error message is "critical medium error" - we have a real problem
> with the disk I suspect.
>
> FIRST run SMART on the disk and see what that reports. If that's not
> happy, REPLACE THE DRIVE PRONTO.
>
> If SMART is happy, run a raid scrub.

When this happens, I replace the drive ASAP; it doesn't matter if it's
a transient failure or similar.
A working disk, for me, is a disk that NEVER returns any kind of
error. Usually I replace disks even
when there is a single recovered sector.

> And whatever, if you haven't replaced the drive, start monitoring SMART.
> If disk errors start climbing, that's a cause for concern and replacing
> the drive.

All disks are under SMART monitoring with both short and long tests
(weekly), and there is also a weekly (or monthly? I don't remember) md
consistency check.

Anyway, as our new servers have some free slots (we keep free slots
intentionally), our replacements don't
mean removing the old drive (losing part of the redundancy) and then
adding a new one; we always use a replace:
mdadm /dev/md0 --add /dev/NEW --replace /dev/OLD --with /dev/NEW
It's MUCH safer, but what happens if /dev/OLD fails during
the replacement? Will the rebuild be done by reading from the other drives
transparently?
And normally, are reads done from OLD in this case, or from the full array?
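
For illustration, the same hot-replace can be written as two mdadm calls inside a small wrapper. This is only a sketch: replace_member is a made-up name, and MDADM is a hypothetical override (set MDADM=echo to print the commands instead of running them). As far as I read the mdadm man page, a device marked with --replace stays in service while the spare is rebuilt and is only marked faulty once the replacement has finished.

```shell
#!/bin/sh
# Hypothetical override: MDADM=echo prints the commands instead of running them.
MDADM=${MDADM:-mdadm}

# Add the new device as a spare, then ask md to rebuild onto it while the
# old device remains an active member.
replace_member() {
    md=$1 old=$2 new=$3
    "$MDADM" "$md" --add "$new" &&
    "$MDADM" "$md" --replace "$old" --with "$new"
}

# Example (as root): replace_member /dev/md0 /dev/OLD /dev/NEW
# Progress can then be followed with: cat /proc/mdstat
```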
