* SMART detects pending sectors; take offline?
@ 2017-10-07 7:48 Alexander Shenkin
2017-10-07 8:21 ` Carsten Aulbert
0 siblings, 1 reply; 49+ messages in thread
From: Alexander Shenkin @ 2017-10-07 7:48 UTC (permalink / raw)
To: linux-raid
Hi all,
My SMART monitoring has picked up some pending sectors on one of my
RAID0 + RAID5 drives (it's one of the infamous 3TB seagate drives... my
other 3 failed earlier... this is the last of them, that finally has
gone as well...). I've just ordered a replacement (Toshiba P300) that
will arrive tomorrow... but the question is, what to do in the meantime?
Should I take the drive offline? I suspect so, but would like to
double check before taking action. Thanks in advance for any advice.
Here are the errors:
The following warning/error was logged by the smartd daemon:
Device: /dev/sda [SAT], Self-Test Log error count increased from 0 to 1
Device info:
ST3000DM001-9YN166, S/N:Z1F13FBA, WWN:5-000c50-04e444ab1, FW:CC4B, 3.00 TB
------------------
The following warning/error was logged by the smartd daemon:
Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Device info:
ST3000DM001-9YN166, S/N:Z1F13FBA, WWN:5-000c50-04e444ab1, FW:CC4B, 3.00 TB
-----------------
The following warning/error was logged by the smartd daemon:
Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Device info:
ST3000DM001-9YN166, S/N:Z1F13FBA, WWN:5-000c50-04e444ab1, FW:CC4B, 3.00 TB
Thanks,
Allie
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2017-10-07 7:48 SMART detects pending sectors; take offline? Alexander Shenkin
@ 2017-10-07 8:21 ` Carsten Aulbert
2017-10-07 10:05 ` Alexander Shenkin
2017-10-09 20:16 ` Phil Turmel
0 siblings, 2 replies; 49+ messages in thread
From: Carsten Aulbert @ 2017-10-07 8:21 UTC (permalink / raw)
To: Alexander Shenkin, linux-raid
Hi
On 10/07/17 09:48, Alexander Shenkin wrote:
> My SMART monitoring has picked up some pending sectors on one of my
> RAID0 + RAID5 drives (it's one of the infamous 3TB seagate drives... my
> other 3 failed earlier... this is the last of them, that finally has
> gone as well...). I've just ordered a replacement (Toshiba P300) that
> will arrive tomorrow... but the question is, what to do in the meantime?
> Should I take the drive offline? I suspect so, but would like to
> double check before taking action. Thanks in advance for any advice.
Given this is "only" a single sector error I would keep it running as
long as you can physically install the new drive and only then take it
offline.
At least theoretically, it may be possible to force the rewrite of this
sector and use the spare sectors of the disk, but I'm not 100% sure if a
simple md check would already trigger it - usually you need to write
"new" data to defective sectors to force the drive's firmware to use the
spare sectors.
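A minimal sketch of how one could trigger such a pass by hand, assuming an md array named md0 (adjust the name to your setup):

```shell
# Start a "check" pass on /dev/md0 via sysfs (root required).
# md reads every sector; if a member returns a read error and the
# data can be reconstructed from the remaining redundancy, md
# rewrites the bad sector, which prompts the drive firmware to
# remap it to a spare sector.
echo check > /sys/block/md0/md/sync_action

# Watch progress and the mismatch counter afterwards:
cat /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt
```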
But given the replacement disk should arrive soon, I would not act
before that and run with a degraded RAID5 until then.
I'm a bit more worried about the RAID0 here, do you run RAID0 on top of
RAID5 or what is the exact set-up?
cheers
Carsten
* Re: SMART detects pending sectors; take offline?
2017-10-07 8:21 ` Carsten Aulbert
@ 2017-10-07 10:05 ` Alexander Shenkin
2017-10-07 17:29 ` Wols Lists
2017-10-09 20:16 ` Phil Turmel
1 sibling, 1 reply; 49+ messages in thread
From: Alexander Shenkin @ 2017-10-07 10:05 UTC (permalink / raw)
To: Carsten Aulbert, linux-raid
Thanks Carsten,
I was mistaken, it's a RAID1, not RAID0. I have /boot mounted on a
RAID0, and / mounted on RAID5. They're both split across 4 drives.
Appreciate the advice - I'll just keep it running until the drive
arrives tomorrow...
Thanks,
Allie
On 10/7/2017 9:21 AM, Carsten Aulbert wrote:
> Hi
>
> On 10/07/17 09:48, Alexander Shenkin wrote:
>> My SMART monitoring has picked up some pending sectors on one of my
>> RAID0 + RAID5 drives (it's one of the infamous 3TB seagate drives... my
>> other 3 failed earlier... this is the last of them, that finally has
>> gone as well...). I've just ordered a replacement (Toshiba P300) that
>> will arrive tomorrow... but the question is, what to do in the meantime?
>> Should I take the drive offline? I suspect so, but would like to
>> double check before taking action. Thanks in advance for any advice.
>
> Given this is "only" a single sector error I would keep it running as
> long as you can physically install the new drive and only then take it
> offline.
>
> At least theoretically, it may be possible to force the rewrite of this
> sector and use the spare sectors of the disk, but I'm not 100% sure if a
> simple md check would already trigger it - usually you need to write
> "new" data to defective sectors to force the drive's firmware to use the
> spare sectors.
>
> But given the replacement disk should arrive soon, I would not act
> before that and run with a degraded RAID5 until then.
>
> I'm a bit more worried about the RAID0 here, do you run RAID0 on top of
> RAID5 or what is the exact set-up?
>
> cheers
>
> Carsten
>
* Re: SMART detects pending sectors; take offline?
2017-10-07 10:05 ` Alexander Shenkin
@ 2017-10-07 17:29 ` Wols Lists
2017-10-08 9:19 ` Alexander Shenkin
0 siblings, 1 reply; 49+ messages in thread
From: Wols Lists @ 2017-10-07 17:29 UTC (permalink / raw)
To: Alexander Shenkin, Carsten Aulbert, linux-raid
On 07/10/17 11:05, Alexander Shenkin wrote:
> Thanks Carsten,
>
> I was mistaken, it's a RAID1, not RAID0. I have /boot mounted on a
> RAID0, and / mounted on RAID5. They both split across 4 drives.
How big is each partition that makes up /boot? If that's raid0, surely
that's not wise? A single disk failure will render the machine
unbootable. Surely that should be raid1, so you can boot off any disk.
>
> Appreciate the advice - i'll just keep it running until the drive
> arrives tomorrow...
I'd keep it running ...
>
> Thanks,
> Allie
>
> On 10/7/2017 9:21 AM, Carsten Aulbert wrote:
>> Hi
>>
>>
>> Given this is "only" a single sector error I would keep it running as
>> long as you can physically install the new drive and only then take it
>> offline.
>>
>> At least theoretically, it may be possible to force the rewrite of this
>> sector and use the spare sectors of the disk, but I'm not 100% sure if a
>> simple md check would already trigger it - usually you need to write
>> "new" data to defective sectors to force the drive's firmware to use the
>> spare sectors.
>>
How serious is a "pending sector"? I think doing a scrub will fix it.
If it's not serious I'd look at using the extra drive to convert it to
raid6. I doubt the infamous 3TB drives were a "bad batch", but given the
press they got I would have expected Seagate to fix the problem. If
these drives are newer than the ones that got the bad press, they might
be fine.
There's always the argument "do you ditch a disk on the first error, or
do you wait until it's definitely dying". But iirc a "pending sector" is
just one of those things that happens every now and then. If this goes
away with a scrub, and you don't get a batch of new ones, then the drive
is probably fine (until the next *random* problem shows up).
Cheers,
Wol
* Re: SMART detects pending sectors; take offline?
2017-10-07 17:29 ` Wols Lists
@ 2017-10-08 9:19 ` Alexander Shenkin
2017-10-08 9:49 ` Wols Lists
0 siblings, 1 reply; 49+ messages in thread
From: Alexander Shenkin @ 2017-10-08 9:19 UTC (permalink / raw)
To: Wols Lists, Carsten Aulbert, linux-raid
Thanks Wol; that's *2* mistakes I've made; ugh. /boot is on RAID1. I
have no RAID0s in the machine.
These 3TB seagates are old ones, and were sitting in their original
boxes for a few years unused. 3 others have already died, and one
during a rebuild, causing tons of grief. So, random or not, this one is
going in the trash. Seagate should refund me, but they never will.
Thanks,
Allie
On 10/7/2017 6:29 PM, Wols Lists wrote:
> On 07/10/17 11:05, Alexander Shenkin wrote:
>> Thanks Carsten,
>>
>> I was mistaken, it's a RAID1, not RAID0. I have /boot mounted on a
>> RAID0, and / mounted on RAID5. They both split across 4 drives.
>
> How big is each partition that makes up /boot? If that's raid0, surely
> that's not wise? A single disk failure will render the machine
> unbootable. Surely that should be raid1, so you can boot off any disk.
>>
>> Appreciate the advice - i'll just keep it running until the drive
>> arrives tomorrow...
>
> I'd keep it running ...
>>
>> Thanks,
>> Allie
>>
>> On 10/7/2017 9:21 AM, Carsten Aulbert wrote:
>>> Hi
>>>
>
>>>
>>> Given this is "only" a single sector error I would keep it running as
>>> long as you can physically install the new drive and only then take it
>>> offline.
>>>
>>> At least theoretically, it may be possible to force the rewrite of this
>>> sector and use the spare sectors of the disk, but I'm not 100% sure if a
>>> simple md check would already trigger it - usually you need to write
>>> "new" data to defective sectors to force the drive's firmware to use the
>>> spare sectors.
>>>
> How serious is a "pending sector"? I think doing a scrub will fix it.
>
> If it's not serious I'd look at using the extra drive to convert it to
> raid6. I doubt the infamous 3TB drives were a "bad batch", but given the
> press they got I would have expected Seagate to fix the problem. If
> these drives are newer than the ones that got the bad press, they might
> be fine.
>
> There's always the argument "do you ditch a disk on the first error, or
> do you wait until it's definitely dying". But iirc a "pending sector" is
> just one of those things that happens every now and then. If this goes
> away with a scrub, and you don't get a batch of new ones, then the drive
> is probably fine (until the next *random* problem shows up).
>
> Cheers,
> Wol
>
* Re: SMART detects pending sectors; take offline?
2017-10-08 9:19 ` Alexander Shenkin
@ 2017-10-08 9:49 ` Wols Lists
0 siblings, 0 replies; 49+ messages in thread
From: Wols Lists @ 2017-10-08 9:49 UTC (permalink / raw)
To: Alexander Shenkin, linux-raid
On 08/10/17 10:19, Alexander Shenkin wrote:
> Thanks Wol; that's *2* mistakes I've made; ugh. /boot is on RAID1. I
> have no RAID0's in the machine.
:-)
>
> These 3TB seagates are old ones, and were sitting in their original
> boxes for a few years unused. 3 others have already died, and one
> during a rebuild, causing tons of grief. So, random or not, this one is
> going in the trash. Seagate should refund me, but they never will.
>
My machine has two 3TB Barracudas - raid-1. So far (touch wood) they've
been reliable enough.
As soon as I can afford it (£700, money I haven't got :-) I'm going to
build a new machine - lvm/qemu on raid-1, then linux, windows, whatever
on top of that on raid-6.
The problem, of course, is I can't find much info on actually setting up
a machine with a minimal virtual-machine install and then a bunch of VMs
on top. I guess it's all out there, but it's technical documentation, not
guides and howtos. So I guess I'll be documenting it all :-) And maybe
trying to get a linux.org wiki to put it up on :-)
Cheers,
Wol
* Re: SMART detects pending sectors; take offline?
2017-10-07 8:21 ` Carsten Aulbert
2017-10-07 10:05 ` Alexander Shenkin
@ 2017-10-09 20:16 ` Phil Turmel
2017-10-10 9:00 ` Alexander Shenkin
1 sibling, 1 reply; 49+ messages in thread
From: Phil Turmel @ 2017-10-09 20:16 UTC (permalink / raw)
To: Carsten Aulbert, Alexander Shenkin, linux-raid
On 10/07/2017 04:21 AM, Carsten Aulbert wrote:
> Hi
>
> On 10/07/17 09:48, Alexander Shenkin wrote:
>> My SMART monitoring has picked up some pending sectors on one of my
>> RAID0 + RAID5 drives (it's one of the infamous 3TB seagate drives... my
>> other 3 failed earlier... this is the last of them, that finally has
>> gone as well...). I've just ordered a replacement (Toshiba P300) that
>> will arrive tomorrow... but the question is, what to do in the meantime?
>> Should I take the drive offline? I suspect so, but would like to
>> double check before taking action. Thanks in advance for any advice.
>
> Given this is "only" a single sector error I would keep it running as
> long as you can physically install the new drive and only then take it
> offline.
>
> At least theoretically, it may be possible to force the rewrite of this
> sector and use the spare sectors of the disk, but I'm not 100% sure if a
> simple md check would already trigger it - usually you need to write
> "new" data to defective sectors to force the drive's firmware to use the
> spare sectors.
>
> But given the replacement disk should arrive soon, I would not act
> before that and run with a degraded RAID5 until then.
>
> I'm a bit more worried about the RAID0 here, do you run RAID0 on top of
> RAID5 or what is the exact set-up?
So, no regular "check" scrubs. Check scrubs fix pending sectors by
writing back to such sectors when the error is hit, as long as there is
redundancy to obtain the data from and the drive in question actually
returns a read error.
Since this is a desktop drive that is known to not have SCTERC support,
you *must* reset your driver timeouts to 180 seconds for a check scrub
to succeed. You will also have to do so with your P300 drive, as
Toshiba's website says that drive is not NAS optimized.
Please read up on "timeout mismatch" before your array blows up.
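For reference, a hedged sketch of the usual fix (device names are hypothetical; the 7-second SCTERC value and 180-second fallback follow common list advice):

```shell
# For each array member: prefer enabling SCTERC so the drive gives
# up on a bad sector quickly; if the drive doesn't support it,
# raise the kernel's SCSI command timeout above the drive's
# worst-case internal retry time instead.
for dev in sda sdb sdc sdd; do
    if smartctl -l scterc,70,70 /dev/$dev >/dev/null 2>&1; then
        echo "$dev: SCTERC set to 7.0s read/write"
    else
        echo 180 > /sys/block/$dev/device/timeout
        echo "$dev: no SCTERC; driver timeout raised to 180s"
    fi
done
```

Note that this has to run at every boot: SCTERC settings do not persist across power cycles.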
* Re: SMART detects pending sectors; take offline?
2017-10-09 20:16 ` Phil Turmel
@ 2017-10-10 9:00 ` Alexander Shenkin
2017-10-10 9:11 ` Reindl Harald
2017-10-10 9:21 ` Wols Lists
0 siblings, 2 replies; 49+ messages in thread
From: Alexander Shenkin @ 2017-10-10 9:00 UTC (permalink / raw)
To: Phil Turmel, Carsten Aulbert, linux-raid
On 10/9/2017 9:16 PM, Phil Turmel wrote:
> On 10/07/2017 04:21 AM, Carsten Aulbert wrote:
>> Hi
>>
>> On 10/07/17 09:48, Alexander Shenkin wrote:
>>> My SMART monitoring has picked up some pending sectors on one of my
>>> RAID0 + RAID5 drives (it's one of the infamous 3TB seagate drives... my
>>> other 3 failed earlier... this is the last of them, that finally has
>>> gone as well...). I've just ordered a replacement (Toshiba P300) that
>>> will arrive tomorrow... but the question is, what to do in the meantime?
>>> Should I take the drive offline? I suspect so, but would like to
>>> double check before taking action. Thanks in advance for any advice.
>>
>> Given this is "only" a single sector error I would keep it running as
>> long as you can physically install the new drive and only then take it
>> offline.
>>
>> At least theoretically, it may be possible to force the rewrite of this
>> sector and use the spare sectors of the disk, but I'm not 100% sure if a
>> simple md check would already trigger it - usually you need to write
>> "new" data to defective sectors to force the drive's firmware to use the
>> spare sectors.
>>
>> But given the replacement disk should arrive soon, I would not act
>> before that and run with a degraded RAID5 until then.
>>
>> I'm a bit more worried about the RAID0 here, do you run RAID0 on top of
>> RAID5 or what is the exact set-up?
>
> So, no regular "check" scrubs. Check scrubs fix pending sectors by
> writing back to such sectors when the error is hit. As long as there is
> redundancy to obtain the data from, and the drive in question actually
> returns a read error.
Thanks... I know nothing about "check scrubs". Could you point me to a
good resource? I've found
https://raid.wiki.kernel.org/index.php/Scrubbing and
https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives, but it's
hard to tell exactly how the system should be configured in order to run
these regularly. A weekly cron perhaps? And, should it be just check,
or repair? etc... Any help you could offer would be welcome.
Is this something I should run now? I figure it's a bad idea to push an
array that is starting to degrade... haven't had a chance to replace the
drive yet, but will get to it this week. Probably best to start the
scrubbing routines once I have 4 good drives in there I figure...
> Since this is a desktop drive that is known to not have SCTERC support,
> you *must* reset your driver timeouts to 180 seconds for a check scrub
> to succeed. You will also have to do so with your P300 drive, as
> Toshiba's website says that drive is not NAS optimized.
>
> Please read up on "timeout mismatch" before your array blows up.
I have timeouts set on all drives when the system boots, and the same
script turns on the P300s' SCTERC.
* Re: SMART detects pending sectors; take offline?
2017-10-10 9:00 ` Alexander Shenkin
@ 2017-10-10 9:11 ` Reindl Harald
2017-10-10 9:56 ` Alexander Shenkin
2017-10-10 9:21 ` Wols Lists
1 sibling, 1 reply; 49+ messages in thread
From: Reindl Harald @ 2017-10-10 9:11 UTC (permalink / raw)
To: Alexander Shenkin, Phil Turmel, Carsten Aulbert, linux-raid
Am 10.10.2017 um 11:00 schrieb Alexander Shenkin:
> Thanks... I know nothing about "check scrubs". Could you point me to a
> good resource? I've found
> https://raid.wiki.kernel.org/index.php/Scrubbing and
> https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives, but it's
> hard to tell exactly how the system should be configured in order to run
> these regularly. A weekly cron perhaps? And, should it be just check,
> or repair? etc... Any help you could offer would be welcome.
If your distribution doesn't install a cron job for that, you should
blame them, because RAID without regular scrubs is asking for trouble.
[root@srv-rhsoft:~]$ rpm -q --file /etc/cron.d/raid-check
mdadm-4.0-1.fc26.x86_64
[root@srv-rhsoft:~]$ cat /etc/cron.d/raid-check
30 4 * * Mon root /usr/sbin/raid-check
> Is this something I should run now? I figure it's a bad idea to push an
> array that is starting to degrade... haven't had a chance to replace the
> drive yet, but will get to it this week. Probably best to start the
> scrubbing routines once I have 4 good drives in there I figure...
NO - never put any load you can avoid on degraded arrays
* Re: SMART detects pending sectors; take offline?
2017-10-10 9:00 ` Alexander Shenkin
2017-10-10 9:11 ` Reindl Harald
@ 2017-10-10 9:21 ` Wols Lists
1 sibling, 0 replies; 49+ messages in thread
From: Wols Lists @ 2017-10-10 9:21 UTC (permalink / raw)
To: Alexander Shenkin, linux-raid
On 10/10/17 10:00, Alexander Shenkin wrote:
>> Please read up on "timeout mismatch" before your array blows up.
>
> I have timeouts set on all drives when the system boots, and the same
> script turns on the P300s' SCTERC.
You may notice on the raid wiki that I've started putting up smartctl
output from various drives. It would be nice to have the P300 there.
When you get it, any chance you could email me the output of a "smartctl
-x"? I notice on my Barracudas that I get different output depending on
whether I've turned smarts on (it's disabled by default at power-on), so
obviously I'd like it once smart is enabled :-)
It's meant to give people a place to look to (try to) work out which
drive is suitable for them. I've noticed with the drives I've got and
what people have posted, that output is "weird" for some drives...
Cheers,
Wol
* Re: SMART detects pending sectors; take offline?
2017-10-10 9:11 ` Reindl Harald
@ 2017-10-10 9:56 ` Alexander Shenkin
2017-10-10 12:55 ` Phil Turmel
2017-10-10 22:23 ` josh
0 siblings, 2 replies; 49+ messages in thread
From: Alexander Shenkin @ 2017-10-10 9:56 UTC (permalink / raw)
To: Reindl Harald, Phil Turmel, Carsten Aulbert, linux-raid
On 10/10/2017 10:11 AM, Reindl Harald wrote:
>
>
> Am 10.10.2017 um 11:00 schrieb Alexander Shenkin:
>> Thanks... I know nothing about "check scrubs". Could you point me to
>> a good resource? I've found
>> https://raid.wiki.kernel.org/index.php/Scrubbing and
>> https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives, but it's
>> hard to tell exactly how the system should be configured in order to
>> run these regularly. A weekly cron perhaps? And, should it be just
>> check, or repair? etc... Any help you could offer would be welcome.
>
> if your distribution don't install a cronjob for that you should blame
> them because RAID without regular scrub is asking for troubles
>
> [root@srv-rhsoft:~]$ rpm -q --file /etc/cron.d/raid-check
> mdadm-4.0-1.fc26.x86_64
>
> [root@srv-rhsoft:~]$ cat /etc/cron.d/raid-check
> 30 4 * * Mon root /usr/sbin/raid-check
Thanks Reindl. Here's what I have installed (no evidence of raid-check
available on my system):
$ cat /etc/cron.d/mdadm
57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d)
-le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi
>> Is this something I should run now? I figure it's a bad idea to push
>> an array that is starting to degrade... haven't had a chance to
>> replace the drive yet, but will get to it this week. Probably best to
>> start the scrubbing routines once I have 4 good drives in there I
>> figure...
>
> NO - never put any load you can avoid on degraded arrays
thanks, i won't.
* Re: SMART detects pending sectors; take offline?
2017-10-10 9:56 ` Alexander Shenkin
@ 2017-10-10 12:55 ` Phil Turmel
2017-10-11 10:31 ` Alexander Shenkin
2017-10-10 22:23 ` josh
1 sibling, 1 reply; 49+ messages in thread
From: Phil Turmel @ 2017-10-10 12:55 UTC (permalink / raw)
To: Alexander Shenkin, Reindl Harald, Carsten Aulbert, linux-raid
Hi Alex,
On 10/10/2017 05:56 AM, Alexander Shenkin wrote:
> Thanks Reindl. Here's what I have installed (no evidence of raid-check
> available on my system):
>
> $ cat /etc/cron.d/mdadm
> 57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d)
> -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi
>
This is *really* good news. If this has been running once a month as
shown, and you've already noted that you are dealing properly with
timeout mismatch, then your arrays are in fact reasonably scrubbed.
Which means the pending sector found by a smartctl background scan is
likely in a non-array data area. And if not, the next scrub will fix
it. You can run checkarray yourself if you don't want to wait.
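Using the Debian-style script already present on the reporter's system, a manual run might look like this (a sketch; flags as in the cron entry quoted earlier in the thread):

```shell
# Kick off a scrub of all md arrays at idle I/O priority.
sudo /usr/share/mdadm/checkarray --all --idle

# The equivalent low-level form for one array (array name assumed):
# echo check > /sys/block/md2/md/sync_action

# Progress shows up in /proc/mdstat while the check runs:
cat /proc/mdstat
```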
>>> Is this something I should run now? I figure it's a bad idea to push
>>> an array that is starting to degrade... haven't had a chance to
>>> replace the drive yet, but will get to it this week. Probably best
>>> to start the scrubbing routines once I have 4 good drives in there I
>>> figure...
>>
>> NO - never put any load you can avoid on degraded arrays
>
> thanks, i won't.
If I read your OP correctly, your array is *not* degraded -- it just has
a pending URE on one drive. You are still redundant, and your system is
scrubbing once a month. FWIW, I don't replace drives just for pending
sectors -- they are expected occasionally per drive specs. So long as
check scrubs regularly complete, I replace drives when actual
reallocations hit double digits.
You are fine. Your array is fine. Long timeouts can cause application
timeouts and user freak-outs, so your Seagates are less than ideal, but
your system is *fine*.
Consider using the new drive to convert to raid6. If you have other
reasons to stay with raid5, then add it as a spare, then use mdadm's
--replace operation to swap out the drive with the pending sector.
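A sketch of that spare-then-replace sequence with hypothetical device names (sde as the new disk, sda as the one with the pending sector):

```shell
# Copy the partition layout of a healthy member onto the new disk
# (sfdisk form shown for MBR tables; use sgdisk for GPT).
sfdisk -d /dev/sdb | sfdisk /dev/sde

# Add the new partition as a spare, then rebuild onto it while the
# old member stays in service -- full redundancy is kept throughout.
mdadm /dev/md2 --add /dev/sde1
mdadm /dev/md2 --replace /dev/sda1

# Once the copy finishes, md marks sda1 faulty; remove it:
mdadm /dev/md2 --remove /dev/sda1
```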
Phil
* Re: SMART detects pending sectors; take offline?
2017-10-10 9:56 ` Alexander Shenkin
2017-10-10 12:55 ` Phil Turmel
@ 2017-10-10 22:23 ` josh
2017-10-11 6:23 ` Alexander Shenkin
1 sibling, 1 reply; 49+ messages in thread
From: josh @ 2017-10-10 22:23 UTC (permalink / raw)
To: Alexander Shenkin; +Cc: Reindl Harald, Phil Turmel, Carsten Aulbert, linux-raid
Hello Alexander,
On 10 October 2017 at 20:56, Alexander Shenkin <al@shenkin.org> wrote:
> On 10/10/2017 10:11 AM, Reindl Harald wrote:
>>
>>
>>
>> Am 10.10.2017 um 11:00 schrieb Alexander Shenkin:
>>>
>>> Thanks... I know nothing about "check scrubs". Could you point me to a
>>> good resource? I've found https://raid.wiki.kernel.org/index.php/Scrubbing
>>> and https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives, but it's
>>> hard to tell exactly how the system should be configured in order to run
>>> these regularly. A weekly cron perhaps? And, should it be just check, or
>>> repair? etc... Any help you could offer would be welcome.
>>
>>
>> if your distribution don't install a cronjob for that you should blame
>> them because RAID without regular scrub is asking for troubles
>>
>> [root@srv-rhsoft:~]$ rpm -q --file /etc/cron.d/raid-check
>> mdadm-4.0-1.fc26.x86_64
>>
>> [root@srv-rhsoft:~]$ cat /etc/cron.d/raid-check
>> 30 4 * * Mon root /usr/sbin/raid-check
>
>
> Thanks Reindl. Here's what I have installed (no evidence of raid-check
> available on my system):
>
> $ cat /etc/cron.d/mdadm
> 57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le
> 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi
>
This is indicative of a Debian/Ubuntu distribution. The cron entry is
not enough to enable md array checks; you also have to edit
/etc/default/mdadm and set AUTOCHECK=true.
>>> Is this something I should run now? I figure it's a bad idea to push an
>>> array that is starting to degrade... haven't had a chance to replace the
>>> drive yet, but will get to it this week. Probably best to start the
>>> scrubbing routines once I have 4 good drives in there I figure...
>>
>>
>> NO - never put any load you can avoid on degraded arrays
>
>
> thanks, i won't.
* Re: SMART detects pending sectors; take offline?
2017-10-10 22:23 ` josh
@ 2017-10-11 6:23 ` Alexander Shenkin
0 siblings, 0 replies; 49+ messages in thread
From: Alexander Shenkin @ 2017-10-11 6:23 UTC (permalink / raw)
To: josh; +Cc: Reindl Harald, Phil Turmel, Carsten Aulbert, linux-raid
On 10/10/2017 11:23 PM, josh wrote:
> This is indicative of a Debian/Ubuntu distribution. The cron entry is
> not enough to enable md array checks, you have to edit
> /etc/default/mdadm and set AUTOCHECK=true
Thanks Josh. AUTOCHECK is indeed set to true in /etc/default/mdadm.
* Re: SMART detects pending sectors; take offline?
2017-10-10 12:55 ` Phil Turmel
@ 2017-10-11 10:31 ` Alexander Shenkin
2017-10-11 17:10 ` Phil Turmel
0 siblings, 1 reply; 49+ messages in thread
From: Alexander Shenkin @ 2017-10-11 10:31 UTC (permalink / raw)
To: Phil Turmel, Reindl Harald, Carsten Aulbert, linux-raid
On 10/10/2017 1:55 PM, Phil Turmel wrote:
> Hi Alex,
>
> On 10/10/2017 05:56 AM, Alexander Shenkin wrote:
>
>> Thanks Reindl. Here's what I have installed (no evidence of raid-check
>> available on my system):
>>
>> $ cat /etc/cron.d/mdadm
>> 57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d)
>> -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi
>>
>
> This is *really* good news. If this has been running once a month as
> shown, and you've already noted that you are dealing properly with
> timeout mismatch, then your arrays are in fact reasonably scrubbed.
>
> Which means the pending sector found by a smartctl background scan is
> likely in a non-array data area. And if not, the next scrub will fix
> it. You can run checkarray yourself if you don't want to wait.
Thanks Phil. I ran checkarray --all --idle, and it completed fine, with
no Rebuild messages as far as I could see (looked in dmesg &
/var/log/syslog, see below).
[4444093.042246] md: data-check of RAID array md0
[4444093.042252] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[4444093.042254] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[4444093.042262] md: using 128k window, over a total of 1950656k.
[4444093.192032] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
[4444106.854418] md: md0: data-check done.
[4444106.863292] md: data-check of RAID array md2
[4444106.863295] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[4444106.863298] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[4444106.863304] md: using 128k window, over a total of 2920188928k.
[4475376.852520] md: md2: data-check done.
SMART still shows those 8 unreadable sectors. dmesg has a bunch of
related errors, copied below.
>>>> Is this something I should run now? I figure it's a bad idea to push
>>>> an array that is starting to degrade... haven't had a chance to
>>>> replace the drive yet, but will get to it this week. Probably best
>>>> to start the scrubbing routines once I have 4 good drives in there I
>>>> figure...
>>>
>>> NO - never put any load you can avoid on degraded arrays
>>
>> thanks, i won't.
>
> If I read your OP correctly, your array is *not* degraded -- it just has
> a pending URE on one drive. You are still redundant, and your system is
> scrubbing once a month. FWIW, I don't replace drives just for
> pending sectors -- they are expected occasionally per drive specs. So
> long as check scrubs regularly complete, I replace drives when actual
> relocations hit double digits.
So, is there a way to tell if the array successfully "relocated" those 8
sectors? Or, no need to verify it?
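One way to check, as a sketch (standard smartctl attribute names; interpretation per the usual SMART semantics):

```shell
# If a pending sector was rewritten successfully in place,
# Current_Pending_Sector drops back to 0 without Reallocated_Sector_Ct
# moving; if the drive had to remap it, the reallocated count rises.
smartctl -A /dev/sda | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
```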
>
> You are fine. Your array is fine. Long timeouts can cause application
> timeouts and user freak-outs, so your Seagates are less than ideal, but
> your system is *fine*.
>
> Consider using the new drive to convert to raid6. If you have other
> reasons to stay with raid5, then add it as as spare, then use mdadm's
> --replace operation to swap out the drive with the pending sector.
Thanks - I'll look into raid6 conversion if this drive doesn't start
upping its unreadable sector count in the near future...
thanks,
allie
-------------------------------------
[4038193.380403] INFO: task md2_raid5:247 blocked for more than 120 seconds.
[4038193.380473] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu
[4038193.380526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4038193.380591] md2_raid5 D ffff8800af01fbe0 0 247 2 0x00000000
[4038193.380599] ffff8800af01fbe0 ffff8800980f5400 ffff8802229b8e00
ffff8800af020000
[4038193.380604] ffff880222017298 ffffea0002bc0300 ffff880222017018
ffff880222017000
[4038193.380609] ffff8800af01fbf8 ffffffff81808f75 ffff880222017000
ffff8800af01fc40
[4038193.380613] Call Trace:
[4038193.380625] [<ffffffff81808f75>] schedule+0x35/0x80
[4038193.380631] [<ffffffff81681a17>] md_super_wait+0x47/0x80
[4038193.380638] [<ffffffff810bf660>] ? prepare_to_wait_event+0xf0/0xf0
[4038193.380643] [<ffffffff816892d5>] write_page+0x1f5/0x310
[4038193.380648] [<ffffffff81688fe8>] bitmap_update_sb+0x138/0x150
[4038193.380652] [<ffffffff81681d79>] md_update_sb.part.51+0x329/0x800
[4038193.380657] [<ffffffff81682275>] md_update_sb+0x25/0x30
[4038193.380661] [<ffffffff8168292d>] md_check_recovery+0x1dd/0x4a0
[4038193.380670] [<ffffffffc00e6765>] raid5d+0x45/0x740 [raid456]
[4038193.380675] [<ffffffff810e7b08>] ? del_timer_sync+0x48/0x50
[4038193.380680] [<ffffffff8180b86b>] ? schedule_timeout+0x16b/0x2d0
[4038193.380685] [<ffffffff810e75f0>] ? trace_event_raw_event_tick_stop+0xd0/0xd0
[4038193.380691] [<ffffffff8167a957>] md_thread+0x117/0x130
[4038193.380696] [<ffffffff810bf660>] ? prepare_to_wait_event+0xf0/0xf0
[4038193.380702] [<ffffffff8167a840>] ? find_pers+0x70/0x70
[4038193.380707] [<ffffffff8109cd56>] kthread+0xd6/0xf0
[4038193.380711] [<ffffffff8109cc80>] ? kthread_park+0x60/0x60
[4038193.380715] [<ffffffff8180cb8f>] ret_from_fork+0x3f/0x70
[4038193.380719] [<ffffffff8109cc80>] ? kthread_park+0x60/0x60
[4038193.380723] INFO: task jbd2/md2-8:261 blocked for more than 120
seconds.
[4038193.380779] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu
[4038193.380834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[4038193.380898] jbd2/md2-8 D ffff8800af017a30 0 261 2
0x00000000
[4038193.380903] ffff8800af017a30 ffff880225a44600 ffff880225b4d400
ffff8800af018000
[4038193.380908] ffff880222017298 ffff880222017290 ffff88014970c400
0000000000000000
[4038193.380912] ffff8800af017a48 ffffffff81808f75 ffff880222017000
ffff8800af017a98
[4038193.380917] Call Trace:
[4038193.380922] [<ffffffff81808f75>] schedule+0x35/0x80
[4038193.380926] [<ffffffff8167f94d>] md_write_start+0x9d/0x180
[4038193.380931] [<ffffffff810bf660>] ? prepare_to_wait_event+0xf0/0xf0
[4038193.380939] [<ffffffffc00e307a>] make_request+0x7a/0xcc0 [raid456]
[4038193.380944] [<ffffffff8128a6f0>] ? ext4_map_blocks+0x2c0/0x4e0
[4038193.380950] [<ffffffff810bf660>] ? prepare_to_wait_event+0xf0/0xf0
[4038193.380954] [<ffffffff8167b13c>] md_make_request+0xec/0x240
[4038193.380959] [<ffffffff813aee08>] generic_make_request+0xf8/0x2a0
[4038193.380962] [<ffffffff813af027>] submit_bio+0x77/0x150
[4038193.380967] [<ffffffff813a6821>] ? bio_alloc_bioset+0x181/0x2c0
[4038193.380971] [<ffffffff81237ecf>] submit_bh_wbc+0x12f/0x160
[4038193.380975] [<ffffffff81237f32>] submit_bh+0x12/0x20
[4038193.380980] [<ffffffff812db465>]
jbd2_journal_commit_transaction+0x5e5/0x1970
[4038193.380984] [<ffffffff810b315f>] ? update_curr+0xdf/0x170
[4038193.380989] [<ffffffff810e7a9f>] ? try_to_del_timer_sync+0x4f/0x70
[4038193.380994] [<ffffffff812e01eb>] kjournald2+0xbb/0x230
[4038193.380999] [<ffffffff810bf660>] ? prepare_to_wait_event+0xf0/0xf0
[4038193.381004] [<ffffffff812e0130>] ? commit_timeout+0x10/0x10
[4038193.381007] [<ffffffff8109cd56>] kthread+0xd6/0xf0
[4038193.381011] [<ffffffff8109cc80>] ? kthread_park+0x60/0x60
[4038193.381015] [<ffffffff8180cb8f>] ret_from_fork+0x3f/0x70
[4038193.381019] [<ffffffff8109cc80>] ? kthread_park+0x60/0x60
[4038193.381055] INFO: task kworker/u16:1:4795 blocked for more than 120
seconds.
[4038193.381114] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu
[4038193.381167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[4038193.381231] kworker/u16:1 D ffff880003b777b0 0 4795 2
0x00000000
[4038193.381240] Workqueue: writeback wb_workfn (flush-9:2)
[4038193.381243] ffff880003b777b0 ffffffff81e13500 ffff88022023d400
ffff880003b78000
[4038193.381247] ffff880222017298 0000000000000001 ffff8800afaef500
ffff88020e88e570
[4038193.381252] ffff880003b777c8 ffffffff81808f75 ffff880222017000
ffff880003b77818
[4038193.381256] Call Trace:
[4038193.381261] [<ffffffff81808f75>] schedule+0x35/0x80
[4038193.381265] [<ffffffff8167f94d>] md_write_start+0x9d/0x180
[4038193.381271] [<ffffffff810bf660>] ? prepare_to_wait_event+0xf0/0xf0
[4038193.381278] [<ffffffffc00e307a>] make_request+0x7a/0xcc0 [raid456]
[4038193.381282] [<ffffffff813a6844>] ? bio_alloc_bioset+0x1a4/0x2c0
[4038193.381288] [<ffffffff810bf660>] ? prepare_to_wait_event+0xf0/0xf0
[4038193.381292] [<ffffffff8167b13c>] md_make_request+0xec/0x240
[4038193.381296] [<ffffffff810bf184>] ? __wake_up+0x44/0x50
[4038193.381300] [<ffffffff813aee08>] generic_make_request+0xf8/0x2a0
[4038193.381304] [<ffffffff813af027>] submit_bio+0x77/0x150
[4038193.381309] [<ffffffff8129198e>] ext4_io_submit+0x3e/0x60
[4038193.381313] [<ffffffff8128da43>] ext4_writepages+0x553/0xcd0
[4038193.381318] [<ffffffff8180b86b>] ? schedule_timeout+0x16b/0x2d0
[4038193.381323] [<ffffffff81095343>] ? __queue_delayed_work+0x83/0x180
[4038193.381329] [<ffffffff8119186e>] do_writepages+0x1e/0x30
[4038193.381333] [<ffffffff8122e885>] __writeback_single_inode+0x45/0x340
[4038193.381338] [<ffffffff818099a8>] ?
wait_for_completion_io_timeout+0xa8/0x120
[4038193.381343] [<ffffffff8122f0bb>] writeback_sb_inodes+0x26b/0x5c0
[4038193.381347] [<ffffffff8122f496>] __writeback_inodes_wb+0x86/0xc0
[4038193.381351] [<ffffffff8122f722>] wb_writeback+0x252/0x2e0
[4038193.381355] [<ffffffff8122ff12>] wb_workfn+0x2c2/0x3d0
[4038193.381359] [<ffffffff810b3f55>] ? put_prev_entity+0x35/0x670
[4038193.381365] [<ffffffff81096d20>] process_one_work+0x150/0x3f0
[4038193.381370] [<ffffffff8109749a>] worker_thread+0x11a/0x470
[4038193.381375] [<ffffffff81097380>] ? rescuer_thread+0x310/0x310
[4038193.381378] [<ffffffff8109cd56>] kthread+0xd6/0xf0
[4038193.381382] [<ffffffff8109cc80>] ? kthread_park+0x60/0x60
[4038193.381386] [<ffffffff8180cb8f>] ret_from_fork+0x3f/0x70
[4038193.381390] [<ffffffff8109cc80>] ? kthread_park+0x60/0x60
[4038193.381400] INFO: task updatedb.mlocat:16442 blocked for more than
120 seconds.
[4038193.381461] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu
[4038193.381512] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[4038193.381576] updatedb.mlocat D ffff880206e2b968 0 16442 16441
0x00000000
[4038193.381581] ffff880206e2b968 ffff880225a41c00 ffff880049599c00
ffff880206e2c000
[4038193.381586] 0000000000000000 7fffffffffffffff ffff88022efb2ce8
ffffffff818096b0
[4038193.381590] ffff880206e2b980 ffffffff81808f75 ffff88022ec96e00
ffff880206e2ba28
[4038193.381594] Call Trace:
[4038193.381599] [<ffffffff818096b0>] ? bit_wait+0x50/0x50
[4038193.381604] [<ffffffff81808f75>] schedule+0x35/0x80
[4038193.381608] [<ffffffff8180b937>] schedule_timeout+0x237/0x2d0
[4038193.381613] [<ffffffff8159dd7f>] ? scsi_request_fn+0x3f/0x630
[4038193.381618] [<ffffffff813a8121>] ? elv_rb_add+0x61/0x70
[4038193.381624] [<ffffffff810f028c>] ? ktime_get+0x3c/0xb0
[4038193.381628] [<ffffffff818096b0>] ? bit_wait+0x50/0x50
[4038193.381633] [<ffffffff81808556>] io_schedule_timeout+0xa6/0x110
[4038193.381637] [<ffffffff818096cb>] bit_wait_io+0x1b/0x60
[4038193.381642] [<ffffffff81809310>] __wait_on_bit+0x60/0x90
[4038193.381646] [<ffffffff818096b0>] ? bit_wait+0x50/0x50
[4038193.381650] [<ffffffff818093b2>] out_of_line_wait_on_bit+0x72/0x80
[4038193.381655] [<ffffffff810bf6a0>] ? autoremove_wake_function+0x40/0x40
[4038193.381661] [<ffffffff81236922>] __wait_on_buffer+0x32/0x40
[4038193.381665] [<ffffffff812885c3>] __ext4_get_inode_loc+0x1c3/0x3f0
[4038193.381669] [<ffffffff8128b7bf>] ext4_iget+0x8f/0xb80
[4038193.381673] [<ffffffff8128c2e0>] ext4_iget_normal+0x30/0x40
[4038193.381677] [<ffffffff812964b1>] ext4_lookup+0xf1/0x230
[4038193.381682] [<ffffffff8120ac7d>] lookup_real+0x1d/0x50
[4038193.381686] [<ffffffff8120b093>] __lookup_hash+0x33/0x40
[4038193.381690] [<ffffffff8120d967>] walk_component+0x177/0x230
[4038193.381695] [<ffffffff8120e9f0>] path_lookupat+0x60/0x110
[4038193.381699] [<ffffffff8121087c>] filename_lookup+0x9c/0x150
[4038193.381704] [<ffffffff811ded1f>] ? kmem_cache_alloc+0x19f/0x200
[4038193.381708] [<ffffffff812104bf>] ? getname_flags+0x4f/0x1f0
[4038193.381713] [<ffffffff812109e6>] user_path_at_empty+0x36/0x40
[4038193.381719] [<ffffffff81205f93>] vfs_fstatat+0x53/0xa0
[4038193.381724] [<ffffffff81206462>] SYSC_newlstat+0x22/0x40
[4038193.381729] [<ffffffff81216555>] ? SyS_poll+0x65/0xf0
[4038193.381733] [<ffffffff8120667e>] SyS_newlstat+0xe/0x10
[4038193.381737] [<ffffffff8180c7f6>] entry_SYSCALL_64_fastpath+0x16/0x75
[4038193.381741] INFO: task nmbd:16462 blocked for more than 120 seconds.
[4038193.381794] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu
[4038193.381846] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[4038193.381910] nmbd D ffff880098197b38 0 16462 6615
0x00000000
[4038193.381915] ffff880098197b38 ffff880225a43800 ffff88004959d400
ffff880098198000
[4038193.381920] 0000000000000000 7fffffffffffffff ffff88022efba0f8
ffffffff818096b0
[4038193.381924] ffff880098197b50 ffffffff81808f75 ffff88022ed16e00
ffff880098197bf8
[4038193.381928] Call Trace:
[4038193.381933] [<ffffffff818096b0>] ? bit_wait+0x50/0x50
[4038193.381937] [<ffffffff81808f75>] schedule+0x35/0x80
[4038193.381942] [<ffffffff8180b937>] schedule_timeout+0x237/0x2d0
[4038193.381946] [<ffffffff811de8dc>] ? ___slab_alloc+0x1cc/0x470
[4038193.381951] [<ffffffff810f028c>] ? ktime_get+0x3c/0xb0
[4038193.381956] [<ffffffff818096b0>] ? bit_wait+0x50/0x50
[4038193.381960] [<ffffffff81808556>] io_schedule_timeout+0xa6/0x110
[4038193.381965] [<ffffffff818096cb>] bit_wait_io+0x1b/0x60
[4038193.381969] [<ffffffff81809310>] __wait_on_bit+0x60/0x90
[4038193.381974] [<ffffffff818096b0>] ? bit_wait+0x50/0x50
[4038193.381978] [<ffffffff818093b2>] out_of_line_wait_on_bit+0x72/0x80
[4038193.381983] [<ffffffff810bf6a0>] ? autoremove_wake_function+0x40/0x40
[4038193.381988] [<ffffffff812d9943>] do_get_write_access+0x273/0x490
[4038193.381992] [<ffffffff812d9b91>]
jbd2_journal_get_write_access+0x31/0x60
[4038193.381997] [<ffffffff812bdb0b>]
__ext4_journal_get_write_access+0x3b/0x80
[4038193.382001] [<ffffffff81298ba4>] ext4_orphan_add+0xa4/0x260
[4038193.382006] [<ffffffff81299db8>] ext4_unlink+0x338/0x350
[4038193.382010] [<ffffffff8120c93a>] vfs_unlink+0xda/0x190
[4038193.382015] [<ffffffff8137881b>] ? wrap_apparmor_path_unlink+0x1b/0x20
[4038193.382020] [<ffffffff81211127>] do_unlinkat+0x257/0x2a0
[4038193.382025] [<ffffffff81211ae6>] SyS_unlink+0x16/0x20
[4038193.382029] [<ffffffff8180c7f6>] entry_SYSCALL_64_fastpath+0x16/0x75
[4038242.602780] ata3.00: exception Emask 0x40 SAct 0x7fffffff SErr
0x800 action 0x6 frozen
[4038242.602856] ata3: SError: { HostInt }
[4038242.602892] ata3.00: failed command: READ FPDMA QUEUED
[4038242.602943] ata3.00: cmd 60/08:00:d0:a3:3f/00:00:28:01:00/40 tag 0
ncq 4096 in
[4038242.602943] res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.603063] ata3.00: status: { DRDY }
[4038242.603097] ata3.00: failed command: READ FPDMA QUEUED
[4038242.603146] ata3.00: cmd 60/08:08:d8:a3:3f/00:00:28:01:00/40 tag 1
ncq 4096 in
[4038242.603146] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.603266] ata3.00: status: { DRDY }
[4038242.603299] ata3.00: failed command: READ FPDMA QUEUED
[4038242.603348] ata3.00: cmd 60/08:10:e0:a3:3f/00:00:28:01:00/40 tag 2
ncq 4096 in
[4038242.603348] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.603466] ata3.00: status: { DRDY }
[4038242.603500] ata3.00: failed command: READ FPDMA QUEUED
[4038242.603549] ata3.00: cmd 60/08:18:e8:a3:3f/00:00:28:01:00/40 tag 3
ncq 4096 in
[4038242.603549] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.603667] ata3.00: status: { DRDY }
[4038242.603700] ata3.00: failed command: READ FPDMA QUEUED
[4038242.603749] ata3.00: cmd 60/08:20:f8:a3:3f/00:00:28:01:00/40 tag 4
ncq 4096 in
[4038242.603749] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.603867] ata3.00: status: { DRDY }
[4038242.603901] ata3.00: failed command: READ FPDMA QUEUED
[4038242.603950] ata3.00: cmd 60/08:28:f0:a3:3f/00:00:28:01:00/40 tag 5
ncq 4096 in
[4038242.603950] res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask
0x44 (timeout)
[4038242.604070] ata3.00: status: { DRDY }
[4038242.604104] ata3.00: failed command: READ FPDMA QUEUED
[4038242.604153] ata3.00: cmd 60/08:30:08:a3:3f/00:00:28:01:00/40 tag 6
ncq 4096 in
[4038242.604153] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.604271] ata3.00: status: { DRDY }
[4038242.606329] ata3.00: failed command: READ FPDMA QUEUED
[4038242.608395] ata3.00: cmd 60/08:38:10:a3:3f/00:00:28:01:00/40 tag 7
ncq 4096 in
[4038242.608395] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.612513] ata3.00: status: { DRDY }
[4038242.614535] ata3.00: failed command: READ FPDMA QUEUED
[4038242.616547] ata3.00: cmd 60/08:40:18:a3:3f/00:00:28:01:00/40 tag 8
ncq 4096 in
[4038242.616547] res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask
0x44 (timeout)
[4038242.620545] ata3.00: status: { DRDY }
[4038242.622544] ata3.00: failed command: READ FPDMA QUEUED
[4038242.624503] ata3.00: cmd 60/08:48:20:a3:3f/00:00:28:01:00/40 tag 9
ncq 4096 in
[4038242.624503] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.628406] ata3.00: status: { DRDY }
[4038242.630330] ata3.00: failed command: READ FPDMA QUEUED
[4038242.632245] ata3.00: cmd 60/08:50:28:a3:3f/00:00:28:01:00/40 tag 10
ncq 4096 in
[4038242.632245] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.636058] ata3.00: status: { DRDY }
[4038242.637942] ata3.00: failed command: READ FPDMA QUEUED
[4038242.639818] ata3.00: cmd 60/08:58:30:a3:3f/00:00:28:01:00/40 tag 11
ncq 4096 in
[4038242.639818] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.643555] ata3.00: status: { DRDY }
[4038242.645413] ata3.00: failed command: READ FPDMA QUEUED
[4038242.647270] ata3.00: cmd 60/08:60:38:a3:3f/00:00:28:01:00/40 tag 12
ncq 4096 in
[4038242.647270] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.650999] ata3.00: status: { DRDY }
[4038242.652859] ata3.00: failed command: READ FPDMA QUEUED
[4038242.654718] ata3.00: cmd 60/08:68:40:a3:3f/00:00:28:01:00/40 tag 13
ncq 4096 in
[4038242.654718] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.658458] ata3.00: status: { DRDY }
[4038242.660326] ata3.00: failed command: READ FPDMA QUEUED
[4038242.662189] ata3.00: cmd 60/08:70:48:a3:3f/00:00:28:01:00/40 tag 14
ncq 4096 in
[4038242.662189] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.665951] ata3.00: status: { DRDY }
[4038242.667827] ata3.00: failed command: READ FPDMA QUEUED
[4038242.669696] ata3.00: cmd 60/08:78:50:a3:3f/00:00:28:01:00/40 tag 15
ncq 4096 in
[4038242.669696] res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask
0x44 (timeout)
[4038242.673441] ata3.00: status: { DRDY }
[4038242.675319] ata3.00: failed command: READ FPDMA QUEUED
[4038242.677189] ata3.00: cmd 60/08:80:58:a3:3f/00:00:28:01:00/40 tag 16
ncq 4096 in
[4038242.677189] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.680935] ata3.00: status: { DRDY }
[4038242.682816] ata3.00: failed command: READ FPDMA QUEUED
[4038242.684689] ata3.00: cmd 60/08:88:60:a3:3f/00:00:28:01:00/40 tag 17
ncq 4096 in
[4038242.684689] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.688450] ata3.00: status: { DRDY }
[4038242.690328] ata3.00: failed command: READ FPDMA QUEUED
[4038242.692208] ata3.00: cmd 60/08:90:68:a3:3f/00:00:28:01:00/40 tag 18
ncq 4096 in
[4038242.692208] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.695989] ata3.00: status: { DRDY }
[4038242.697872] ata3.00: failed command: READ FPDMA QUEUED
[4038242.699753] ata3.00: cmd 60/08:98:70:a3:3f/00:00:28:01:00/40 tag 19
ncq 4096 in
[4038242.699753] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.703534] ata3.00: status: { DRDY }
[4038242.705417] ata3.00: failed command: READ FPDMA QUEUED
[4038242.707310] ata3.00: cmd 60/08:a0:78:a3:3f/00:00:28:01:00/40 tag 20
ncq 4096 in
[4038242.707310] res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask
0x44 (timeout)
[4038242.711100] ata3.00: status: { DRDY }
[4038242.712986] ata3.00: failed command: READ FPDMA QUEUED
[4038242.714874] ata3.00: cmd 60/08:a8:80:a3:3f/00:00:28:01:00/40 tag 21
ncq 4096 in
[4038242.714874] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.718671] ata3.00: status: { DRDY }
[4038242.720557] ata3.00: failed command: READ FPDMA QUEUED
[4038242.722432] ata3.00: cmd 60/08:b0:88:a3:3f/00:00:28:01:00/40 tag 22
ncq 4096 in
[4038242.722432] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.726213] ata3.00: status: { DRDY }
[4038242.728109] ata3.00: failed command: READ FPDMA QUEUED
[4038242.729996] ata3.00: cmd 60/08:b8:90:a3:3f/00:00:28:01:00/40 tag 23
ncq 4096 in
[4038242.729996] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.733776] ata3.00: status: { DRDY }
[4038242.735680] ata3.00: failed command: READ FPDMA QUEUED
[4038242.737564] ata3.00: cmd 60/08:c0:98:a3:3f/00:00:28:01:00/40 tag 24
ncq 4096 in
[4038242.737564] res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.741366] ata3.00: status: { DRDY }
[4038242.743276] ata3.00: failed command: READ FPDMA QUEUED
[4038242.745171] ata3.00: cmd 60/08:c8:a0:a3:3f/00:00:28:01:00/40 tag 25
ncq 4096 in
[4038242.745171] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.748959] ata3.00: status: { DRDY }
[4038242.750855] ata3.00: failed command: READ FPDMA QUEUED
[4038242.752741] ata3.00: cmd 60/08:d0:a8:a3:3f/00:00:28:01:00/40 tag 26
ncq 4096 in
[4038242.752741] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.756528] ata3.00: status: { DRDY }
[4038242.758417] ata3.00: failed command: READ FPDMA QUEUED
[4038242.760304] ata3.00: cmd 60/08:d8:b0:a3:3f/00:00:28:01:00/40 tag 27
ncq 4096 in
[4038242.760304] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.764094] ata3.00: status: { DRDY }
[4038242.765981] ata3.00: failed command: READ FPDMA QUEUED
[4038242.767867] ata3.00: cmd 60/08:e0:b8:a3:3f/00:00:28:01:00/40 tag 28
ncq 4096 in
[4038242.767867] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.771670] ata3.00: status: { DRDY }
[4038242.773566] ata3.00: failed command: READ FPDMA QUEUED
[4038242.775454] ata3.00: cmd 60/08:e8:c0:a3:3f/00:00:28:01:00/40 tag 29
ncq 4096 in
[4038242.775454] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.779246] ata3.00: status: { DRDY }
[4038242.781133] ata3.00: failed command: READ FPDMA QUEUED
[4038242.783020] ata3.00: cmd 60/08:f0:c8:a3:3f/00:00:28:01:00/40 tag 30
ncq 4096 in
[4038242.783020] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask
0x44 (timeout)
[4038242.786812] ata3.00: status: { DRDY }
[4038242.788703] ata3: hard resetting link
[4038243.278780] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[4038243.329279] ata3.00: configured for UDMA/133
[4038243.329314] sd 2:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[4038243.329320] sd 2:0:0:0: [sda] tag#0 Sense Key : Illegal Request
[current] [descriptor]
[4038243.329326] sd 2:0:0:0: [sda] tag#0 Add. Sense: Unaligned write command
[4038243.329332] sd 2:0:0:0: [sda] tag#0 CDB: Read(16) 88 00 00 00 00 01
28 3f a3 d0 00 00 00 08 00 00
[4038243.329336] blk_update_request: I/O error, dev sda, sector 4970226640
[4038243.331303] sd 2:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[4038243.331309] sd 2:0:0:0: [sda] tag#1 Sense Key : Illegal Request
[current] [descriptor]
[4038243.331314] sd 2:0:0:0: [sda] tag#1 Add. Sense: Unaligned write command
[4038243.331319] sd 2:0:0:0: [sda] tag#1 CDB: Read(16) 88 00 00 00 00 01
28 3f a3 d8 00 00 00 08 00 00
[4038243.331323] blk_update_request: I/O error, dev sda, sector 4970226648
[4038243.333204] sd 2:0:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[4038243.333209] sd 2:0:0:0: [sda] tag#2 Sense Key : Illegal Request
[current] [descriptor]
[4038243.333214] sd 2:0:0:0: [sda] tag#2 Add. Sense: Unaligned write command
[4038243.333218] sd 2:0:0:0: [sda] tag#2 CDB: Read(16) 88 00 00 00 00 01
28 3f a3 e0 00 00 00 08 00 00
[4038243.333221] blk_update_request: I/O error, dev sda, sector 4970226656
[4038243.335072] sd 2:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[4038243.335077] sd 2:0:0:0: [sda] tag#3 Sense Key : Illegal Request
[current] [descriptor]
[4038243.335081] sd 2:0:0:0: [sda] tag#3 Add. Sense: Unaligned write command
[4038243.335086] sd 2:0:0:0: [sda] tag#3 CDB: Read(16) 88 00 00 00 00 01
28 3f a3 e8 00 00 00 08 00 00
[4038243.335089] blk_update_request: I/O error, dev sda, sector 4970226664
[4038243.336908] sd 2:0:0:0: [sda] tag#5 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[4038243.336914] sd 2:0:0:0: [sda] tag#5 Sense Key : Illegal Request
[current] [descriptor]
[4038243.336918] sd 2:0:0:0: [sda] tag#5 Add. Sense: Unaligned write command
[4038243.336923] sd 2:0:0:0: [sda] tag#5 CDB: Read(16) 88 00 00 00 00 01
28 3f a3 f0 00 00 00 08 00 00
[4038243.336926] blk_update_request: I/O error, dev sda, sector 4970226672
[4038243.338720] sd 2:0:0:0: [sda] tag#6 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[4038243.338725] sd 2:0:0:0: [sda] tag#6 Sense Key : Illegal Request
[current] [descriptor]
[4038243.338730] sd 2:0:0:0: [sda] tag#6 Add. Sense: Unaligned write command
[4038243.338734] sd 2:0:0:0: [sda] tag#6 CDB: Read(16) 88 00 00 00 00 01
28 3f a3 08 00 00 00 08 00 00
[4038243.338737] blk_update_request: I/O error, dev sda, sector 4970226440
[4038243.340502] sd 2:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[4038243.340506] sd 2:0:0:0: [sda] tag#7 Sense Key : Illegal Request
[current] [descriptor]
[4038243.340511] sd 2:0:0:0: [sda] tag#7 Add. Sense: Unaligned write command
[4038243.340516] sd 2:0:0:0: [sda] tag#7 CDB: Read(16) 88 00 00 00 00 01
28 3f a3 10 00 00 00 08 00 00
[4038243.340519] blk_update_request: I/O error, dev sda, sector 4970226448
[4038243.342229] sd 2:0:0:0: [sda] tag#8 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[4038243.342234] sd 2:0:0:0: [sda] tag#8 Sense Key : Illegal Request
[current] [descriptor]
[4038243.342239] sd 2:0:0:0: [sda] tag#8 Add. Sense: Unaligned write command
[4038243.342244] sd 2:0:0:0: [sda] tag#8 CDB: Read(16) 88 00 00 00 00 01
28 3f a3 18 00 00 00 08 00 00
[4038243.342247] blk_update_request: I/O error, dev sda, sector 4970226456
[4038243.343950] sd 2:0:0:0: [sda] tag#9 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[4038243.343955] sd 2:0:0:0: [sda] tag#9 Sense Key : Illegal Request
[current] [descriptor]
[4038243.343960] sd 2:0:0:0: [sda] tag#9 Add. Sense: Unaligned write command
[4038243.343965] sd 2:0:0:0: [sda] tag#9 CDB: Read(16) 88 00 00 00 00 01
28 3f a3 20 00 00 00 08 00 00
[4038243.343968] blk_update_request: I/O error, dev sda, sector 4970226464
[4038243.345623] sd 2:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[4038243.345628] sd 2:0:0:0: [sda] tag#10 Sense Key : Illegal Request
[current] [descriptor]
[4038243.345633] sd 2:0:0:0: [sda] tag#10 Add. Sense: Unaligned write
command
[4038243.345637] sd 2:0:0:0: [sda] tag#10 CDB: Read(16) 88 00 00 00 00
01 28 3f a3 28 00 00 00 08 00 00
[4038243.345640] blk_update_request: I/O error, dev sda, sector 4970226472
[4038243.347350] ata3: EH complete
* Re: SMART detects pending sectors; take offline?
2017-10-11 10:31 ` Alexander Shenkin
@ 2017-10-11 17:10 ` Phil Turmel
2017-10-12 9:50 ` Alexander Shenkin
0 siblings, 1 reply; 49+ messages in thread
From: Phil Turmel @ 2017-10-11 17:10 UTC (permalink / raw)
To: Alexander Shenkin, Reindl Harald, Carsten Aulbert, linux-raid
On 10/11/2017 06:31 AM, Alexander Shenkin wrote:
> On 10/10/2017 1:55 PM, Phil Turmel wrote:
>> Which means the pending sector found by a smartctl background scan is
>> likely in a non-array data area. And if not, the next scrub will fix
>> it. You can run checkarray yourself if you don't want to wait.
>
> Thanks Phil. I ran checkarray --all --idle, and it completed fine, with
> no Rebuild messages as far as I could see (looked in dmesg &
> /var/log/syslog, see below).
>
> [4444093.042246] md: data-check of RAID array md0
> [4444093.042252] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> [4444093.042254] md: using maximum available idle IO bandwidth (but not
> more than 200000 KB/sec) for data-check.
> [4444093.042262] md: using 128k window, over a total of 1950656k.
> [4444093.192032] md: delaying data-check of md2 until md0 has finished
> (they share one or more physical units)
> [4444106.854418] md: md0: data-check done.
> [4444106.863292] md: data-check of RAID array md2
> [4444106.863295] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> [4444106.863298] md: using maximum available idle IO bandwidth (but not
> more than 200000 KB/sec) for data-check.
> [4444106.863304] md: using 128k window, over a total of 2920188928k.
> [4475376.852520] md: md2: data-check done.
>
> SMART still shows those 8 unreadable sectors. dmesg has a bunch of
> related errors, copied below.
Uh-oh. Your kernel has a hangcheck timer that is shorter (120 seconds)
than the URE timeout of your crappy Seagate drive (whose driver timeout
is set to 180 seconds). So the writeback that would fix the URE isn't
happening.
You'll need to set your hangcheck timer to 180 seconds, too. I'm not
sure how to do that. (I've never seen this particular combination, but
it would be another black mark on desktop drives in raid arrays.)
Phil
* Re: SMART detects pending sectors; take offline?
2017-10-11 17:10 ` Phil Turmel
@ 2017-10-12 9:50 ` Alexander Shenkin
2017-10-12 11:01 ` Wols Lists
2017-10-12 15:19 ` Kai Stian Olstad
0 siblings, 2 replies; 49+ messages in thread
From: Alexander Shenkin @ 2017-10-12 9:50 UTC (permalink / raw)
To: Phil Turmel, Reindl Harald, Carsten Aulbert, linux-raid
On 10/11/2017 6:10 PM, Phil Turmel wrote:
> On 10/11/2017 06:31 AM, Alexander Shenkin wrote:
>> On 10/10/2017 1:55 PM, Phil Turmel wrote:
>
>>> Which means the pending sector found by a smartctl background scan is
>>> likely in a non-array data area. And if not, the next scrub will fix
>>> it. You can run checkarray yourself if you don't want to wait.
>>
>> Thanks Phil. I ran checkarray --all --idle, and it completed fine, with
>> no Rebuild messages as far as I could see (looked in dmesg &
>> /var/log/syslog, see below).
>>
>> [4444093.042246] md: data-check of RAID array md0
>> [4444093.042252] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
>> [4444093.042254] md: using maximum available idle IO bandwidth (but not
>> more than 200000 KB/sec) for data-check.
>> [4444093.042262] md: using 128k window, over a total of 1950656k.
>> [4444093.192032] md: delaying data-check of md2 until md0 has finished
>> (they share one or more physical units)
>> [4444106.854418] md: md0: data-check done.
>> [4444106.863292] md: data-check of RAID array md2
>> [4444106.863295] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
>> [4444106.863298] md: using maximum available idle IO bandwidth (but not
>> more than 200000 KB/sec) for data-check.
>> [4444106.863304] md: using 128k window, over a total of 2920188928k.
>> [4475376.852520] md: md2: data-check done.
>>
>> SMART still shows those 8 unreadable sectors. dmesg has a bunch of
>> related errors, copied below.
>
> Uh-oh. Your kernel has a hangcheck timer that is shorter (120 seconds)
> than the URE timeout of your crappy Seagate drive (whose driver timeout
> is set to 180 seconds). So the writeback that would fix the URE isn't
> happening.
>
> You'll need to set your hangcheck timer to 180 seconds, too. I'm not
> sure how to do that. (I've never seen this particular combination, but
> it would be another black mark on desktop drives in raid arrays.)
Thanks Phil... Googling around, I haven't found a way to change it
either, but then again, I'm not really sure what to search for.
What about changing my default disk timeout to something less than 120
secs? Say, 100 secs instead of 180?
Seems like this issue should probably make it into the timeout wiki
page, no? Perhaps some instructions on how to query your system's
hangcheck timeout, so you can make sure your drive timeouts are set
to less than that?
Thanks,
Allie
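Both timeouts being discussed are exposed as ordinary files, so querying
them is straightforward; a sketch, where /dev/sda is an example device
(the sysctl and sysfs paths are the standard Linux interfaces, not
something specific to this thread):

```shell
# Inspect the two timeouts discussed in this thread.

hung_task_sysctl="/proc/sys/kernel/hung_task_timeout_secs"  # hung-task warning threshold
drive_timeout_sysfs="/sys/block/sda/device/timeout"         # SCSI command timeout (secs)

for f in "$hung_task_sysctl" "$drive_timeout_sysfs"; do
    if [ -r "$f" ]; then
        printf '%s = %s\n' "$f" "$(cat "$f")"
    else
        printf '%s: not present on this system\n' "$f"
    fi
done
```

Note that kernel.hung_task_timeout_secs only controls when the "blocked
for more than N seconds" warnings are printed; writing a larger value (or
0 to disable) changes the warnings, not the underlying I/O behaviour.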
* Re: SMART detects pending sectors; take offline?
2017-10-12 9:50 ` Alexander Shenkin
@ 2017-10-12 11:01 ` Wols Lists
2017-10-12 13:04 ` Phil Turmel
2017-10-12 15:19 ` Kai Stian Olstad
1 sibling, 1 reply; 49+ messages in thread
From: Wols Lists @ 2017-10-12 11:01 UTC (permalink / raw)
To: Alexander Shenkin, Phil Turmel, Reindl Harald, Carsten Aulbert,
linux-raid
On 12/10/17 10:50, Alexander Shenkin wrote:
> Thanks Phil... Googling around, I haven't found a way to change it
> either, but then again, I'm not really sure what to search for.
>
> What about changing my default disk timeout to something less than 120
> secs? Say, 100 secs instead of 180?
>
> Seems like this issue should probably make it into the timeout wiki
> page, no? Perhaps some instructions on how to query your system's
> hangcheck timeout, and thus making sure that you set your drive timeouts
> to less than that?
Very much so. What is a "hangcheck timeout"?
My wife has kindly bought the basics I need for a new PC for my birthday
(yeah! :-) and I've ordered two Seagate Ironwolfs to go with it, so I
will be setting this up from scratch. Raid, KVM, LVM, the works. So
hangcheck timeouts, documenting on the wiki, all the other bits, the
important thing is I'll have a brand new system I can play with that's
not got anything important on it and if the system (software side only,
of course :-) gets trashed, so what. I can try stuff out without
worrying about putting my live system at risk.
But back to topic. I know we have the disk timeout (on desktop drives,
any random number up to 180secs :-). We have the linux i/o wait timeout
- by default 30 secs. And now we have the hangcheck timeout, whatever
that is ...
Cheers,
Wol
* Re: SMART detects pending sectors; take offline?
2017-10-12 11:01 ` Wols Lists
@ 2017-10-12 13:04 ` Phil Turmel
2017-10-12 13:16 ` Alexander Shenkin
0 siblings, 1 reply; 49+ messages in thread
From: Phil Turmel @ 2017-10-12 13:04 UTC (permalink / raw)
To: Wols Lists, Alexander Shenkin, Reindl Harald, Carsten Aulbert,
linux-raid
On 10/12/2017 07:01 AM, Wols Lists wrote:
> On 12/10/17 10:50, Alexander Shenkin wrote:
>> Thanks Phil... Googling around, I haven't found a way to change it
>> either, but then again, I'm not really sure what to search for.
>>
>> What about changing my default disk timeout to something less than 120
>> secs? Say, 100 secs instead of 180?
Nope. The number has to be longer than the actual longest timeout of
your drive, which we now know is >120. When I first investigated this
phenomenon years ago, I picked 120 for my timeouts. Other reports
reached the list with the need for longer, and the recommendation for
180 was chosen.
If the driver times out, it resets the SATA connection while the drive
is still in la-la land. MD gets the error and tries to write the fixed
sector. The SATA connection is still resetting at that point, and MD
gets a *write* error, which boots that drive out of the array.
>> Seems like this issue should probably make it into the timeout wiki
>> page, no? Perhaps some instructions on how to query your system's
>> hangcheck timeout, so you can make sure your drive timeouts are set
>> to less than that?
>
> Very much so. What is a "hangcheck timeout"?
Maybe compiled into the kernel. I vaguely recall seeing some of this
when I used to read (most of) lkml. Haven't had time for lkml in years.
I'll dig around later if no-one beats me to it.
Phil
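For completeness, the mitigation usually recommended for this timeout
mismatch can be sketched as follows: shorten the drive's internal error
recovery where the drive supports SCT ERC, otherwise raise the driver
timeout above the drive's own. The device name and the 7.0-second ERC
value are illustrative, and the commands are echoed rather than executed:

```shell
# Timeout-mismatch mitigation sketch.  /dev/sda is an example device.

dev=sda
erc_cmd="smartctl -l scterc,70,70 /dev/$dev"   # 7.0 s read/write ERC, if the drive supports SCT ERC
timeout_path="/sys/block/$dev/device/timeout"

echo "$erc_cmd"
echo "echo 180 > $timeout_path"   # fallback for drives that ignore scterc
```

Neither setting survives a power cycle on typical drives, so whichever
applies needs to be reapplied at boot (e.g. from a udev rule or an init
script).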
* Re: SMART detects pending sectors; take offline?
2017-10-12 13:04 ` Phil Turmel
@ 2017-10-12 13:16 ` Alexander Shenkin
2017-10-12 13:21 ` Mark Knecht
0 siblings, 1 reply; 49+ messages in thread
From: Alexander Shenkin @ 2017-10-12 13:16 UTC (permalink / raw)
To: Phil Turmel, Wols Lists, Reindl Harald, Carsten Aulbert, linux-raid
On 10/12/2017 2:04 PM, Phil Turmel wrote:
> On 10/12/2017 07:01 AM, Wols Lists wrote:
>> On 12/10/17 10:50, Alexander Shenkin wrote:
>>> Thanks Phil... Googling around, I haven't found a way to change it
>>> either, but then again, I'm not really sure what to search for.
>>>
>>> What about changing my default disk timeout to something less than 120
>>> secs? Say, 100 secs instead of 180?
>
> Nope. The number has to be longer than the actual longest timeout of
> your drive, which we now know is >120. When I first investigated this
> phenomenon years ago, I picked 120 for my timeouts. Other reports
> reached the list with the need for longer, and the recommendation for
> 180 was chosen.
>
> If the driver times out, it resets the SATA connection while the drive
> is still in la-la land. MD gets the error and tries to write the fixed
> sector. The SATA connection is still resetting at that point, and MD
> gets a *write* error, which boots that drive out of the array.
Thanks Phil. Lots of questions in my head, but all rather newbie-ish
and don't want to bother folks, so I'll just wait till you experts hash
it out and then will follow recommendations...
thanks,
allie
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2017-10-12 13:16 ` Alexander Shenkin
@ 2017-10-12 13:21 ` Mark Knecht
2017-10-12 15:16 ` Edward Kuns
0 siblings, 1 reply; 49+ messages in thread
From: Mark Knecht @ 2017-10-12 13:21 UTC (permalink / raw)
To: Alexander Shenkin
Cc: Phil Turmel, Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID
On Thu, Oct 12, 2017 at 6:16 AM, Alexander Shenkin <al@shenkin.org> wrote:
> On 10/12/2017 2:04 PM, Phil Turmel wrote:
>>
>> On 10/12/2017 07:01 AM, Wols Lists wrote:
>>>
>>> On 12/10/17 10:50, Alexander Shenkin wrote:
>>>>
>>>> Thanks Phil... Googling around, I haven't found a way to change it
>>>> either, but then again, I'm not really sure what to search for.
>>>>
>>>> What about changing my default disk timeout to something less than 120
>>>> secs? Say, 100 secs instead of 180?
>>
>>
>> Nope. The number has to be longer than the actual longest timeout of
>> your drive, which we now know is >120. When I first investigated this
>> phenomenon years ago, I picked 120 for my timeouts. Other reports
>> reached the list with the need for longer, and the recommendation for
>> 180 was chosen.
>>
>> If the driver times out, it resets the SATA connection while the drive
>> is still in la-la land. MD gets the error and tries to write the fixed
>> sector. The SATA connection is still resetting at that point, and MD
>> gets a *write* error, which boots that drive out of the array.
>
>
> Thanks Phil. Lots of questions in my head, but all rather newbie-ish and
> don't want to bother folks, so I'll just wait till you experts hash it out
> and then will follow recommendations...
>
> thanks,
> allie
Not an expert here but on my Gentoo systems all running kernel 4.12.12 I
have the hangcheck timer disabled. Using it does not appear to be a
hard requirement.
- Mark
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2017-10-12 13:21 ` Mark Knecht
@ 2017-10-12 15:16 ` Edward Kuns
2017-10-12 15:52 ` Edward Kuns
0 siblings, 1 reply; 49+ messages in thread
From: Edward Kuns @ 2017-10-12 15:16 UTC (permalink / raw)
To: Mark Knecht
Cc: Alexander Shenkin, Phil Turmel, Wols Lists, Reindl Harald,
Carsten Aulbert, Linux-RAID
All y'all referring to a whole separate kernel module,
hangcheck-timer.ko? If so, it appears that you can set the timeouts
(there is more than one) via kernel parameters. I found this, which
has a long comment at the top explaining what it does:
https://github.com/spotify/linux/blob/master/drivers/char/hangcheck-timer.c
Here's the comment (reformatted):
The hangcheck-timer driver uses the TSC to catch delays that jiffies
does not notice. A timer is set. When the timer fires, it checks
whether it was delayed and if that delay exceeds a given margin of
error. The hangcheck_tick module parameter takes the timer duration in
seconds. The hangcheck_margin parameter defines the margin of error,
in seconds. The defaults are 60 seconds for the timer and 180 seconds
for the margin of error. IOW, a timer is set for 60 seconds. When the
timer fires, the callback checks the actual duration that the timer
waited. If the duration exceeds the allotted time and margin (here 60 +
180, or 240 seconds), the machine is restarted. A healthy machine will
have the duration match the expected timeout very closely.
There are four parameters to this kernel module:
MODULE_PARM_DESC(hangcheck_tick, "Timer delay.");
MODULE_PARM_DESC(hangcheck_margin, "If the hangcheck timer has been
delayed more than hangcheck_margin seconds, the driver will fire.");
MODULE_PARM_DESC(hangcheck_reboot, "If nonzero, the machine will
reboot when the timer margin is exceeded.");
MODULE_PARM_DESC(hangcheck_dump_tasks, "If nonzero, the machine will
dump the system task state when the timer margin is exceeded.");
The first two are times measured in seconds. hangcheck_tick defaults
to 180 seconds and hangcheck_margin defaults to 60 seconds, at least
in the Spotify kernel version I found on github.
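If anyone did want the module loaded with margins compatible with a 180 s drive/driver timeout, something like the following modprobe drop-in would do it. A sketch only: the parameter defaults evidently vary by kernel, and hangcheck_reboot=0 keeps it from restarting the machine when the margin trips.

```
# /etc/modprobe.d/hangcheck-timer.conf -- sketch, values assumed
# tick + margin (60 + 240 = 300 s) comfortably exceeds a 180 s I/O wait
options hangcheck-timer hangcheck_tick=60 hangcheck_margin=240 hangcheck_reboot=0 hangcheck_dump_tasks=1
```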
Eddie
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2017-10-12 9:50 ` Alexander Shenkin
2017-10-12 11:01 ` Wols Lists
@ 2017-10-12 15:19 ` Kai Stian Olstad
1 sibling, 0 replies; 49+ messages in thread
From: Kai Stian Olstad @ 2017-10-12 15:19 UTC (permalink / raw)
To: Alexander Shenkin, Phil Turmel, Reindl Harald, Carsten Aulbert,
linux-raid
On 12. okt. 2017 11:50, Alexander Shenkin wrote:
> On 10/11/2017 6:10 PM, Phil Turmel wrote:
>> You'll need to set your hangcheck timer to 180 seconds, too. I'm not
>> sure how to do that. (I've never seen this particular combination, but
>> it would be another black mark on desktop drives in raid arrays.)
>
> Thanks Phil... Googling around, I haven't found a way to change it
> either, but then again, I'm not really sure what to search for.
Your dmesg did say
[4038193.380526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
I guess 0 disables this feature, or you could raise it instead with
"echo 180 > /proc/sys/kernel/hung_task_timeout_secs" or
"sysctl -w kernel.hung_task_timeout_secs=180".
--
Kai Stian Olstad
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2017-10-12 15:16 ` Edward Kuns
@ 2017-10-12 15:52 ` Edward Kuns
2017-10-15 14:41 ` Alexander Shenkin
2017-12-18 15:51 ` Alexander Shenkin
0 siblings, 2 replies; 49+ messages in thread
From: Edward Kuns @ 2017-10-12 15:52 UTC (permalink / raw)
To: Mark Knecht
Cc: Alexander Shenkin, Phil Turmel, Wols Lists, Reindl Harald,
Carsten Aulbert, Linux-RAID
On Thu, Oct 12, 2017 at 10:16 AM, Edward Kuns <eddie.kuns@gmail.com> wrote:
> All y'all referring to a whole separate kernel module, hangcheck-timer.ko?
Looking back at the original messages:
[4038193.380403] INFO: task md2_raid5:247 blocked for more than 120 seconds.
[4038193.380473] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu
[4038193.380526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
it looks like you're dealing with this part of the kernel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/hung_task.c
The timer is configurable with sysctl and defaults to 120 seconds.
You can check with this command:
$ sudo sysctl kernel.hung_task_timeout_secs
kernel.hung_task_timeout_secs = 120
You can adjust it temporarily (e.g. to make it longer):
$ sudo sysctl -w kernel.hung_task_timeout_secs=150
Or you can adjust it permanently by modifying your sysctl configuration.
It looks like by default it will only warn ten times. After that it
will stop complaining. That is also configurable via sysctl.
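For the permanent route, a drop-in like the following works on most distributions. A sketch: the warning-count knob is kernel.hung_task_warnings, and -1 should mean it never stops reporting.

```
# /etc/sysctl.d/90-hung-task.conf -- sketch
kernel.hung_task_timeout_secs = 180
# -1 = report indefinitely (default is 10 warnings, then silence)
kernel.hung_task_warnings = -1
```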
Eddie
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2017-10-12 15:52 ` Edward Kuns
@ 2017-10-15 14:41 ` Alexander Shenkin
2017-12-18 15:51 ` Alexander Shenkin
1 sibling, 0 replies; 49+ messages in thread
From: Alexander Shenkin @ 2017-10-15 14:41 UTC (permalink / raw)
To: Edward Kuns, Mark Knecht
Cc: Phil Turmel, Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID
Hi all,
Thanks for all the feedback on this issue. I'm wondering if there's any
consensus about what should be done here... Should I push the kernel
timeout to something more than the 180 seconds set in the timeout
script? I'm not clear on all the timers in play here (drive timeout,
kernel timeout, others?), so not sure how the config should be set so
they don't end up conflicting... Hopefully the minds gathered here can
chart the best path forward. Thanks again for all the attention...
hopefully this can help others in the future...
thanks,
allie
On 10/12/2017 4:52 PM, Edward Kuns wrote:
> On Thu, Oct 12, 2017 at 10:16 AM, Edward Kuns <eddie.kuns@gmail.com> wrote:
>> All y'all referring to a whole separate kernel module, hangcheck-timer.ko?
>
> Looking back at the original messages:
>
> [4038193.380403] INFO: task md2_raid5:247 blocked for more than 120 seconds.
> [4038193.380473] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu
> [4038193.380526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
>
> it looks like you're dealing with this part of the kernel:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/hung_task.c
>
> The timer is configurable with sysctl and defaults to 120 seconds.
> You can check with this command:
>
> $ sudo sysctl kernel.hung_task_timeout_secs
> kernel.hung_task_timeout_secs = 120
>
> You can adjust it temporarily (e.g. to make it longer):
>
> $ sudo sysctl -w kernel.hung_task_timeout_secs=150
>
> Or you can adjust it permanently by modifying your sysctl configuration.
>
> It looks like by default it will only warn ten times. After that it
> will stop complaining. That is also configurable via sysctl.
>
> Eddie
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2017-10-12 15:52 ` Edward Kuns
2017-10-15 14:41 ` Alexander Shenkin
@ 2017-12-18 15:51 ` Alexander Shenkin
2017-12-18 16:09 ` Phil Turmel
1 sibling, 1 reply; 49+ messages in thread
From: Alexander Shenkin @ 2017-12-18 15:51 UTC (permalink / raw)
To: Edward Kuns, Mark Knecht
Cc: Phil Turmel, Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID
Hi all,
I'm getting back to this now that I'll have time, apologies for the
delay. So, is the following correct in the case of a read error?
1) System tries to read an unreadable sector
2) Drive timeout reports unreadable based on drive timeout setting.
2a) In this case, mdadm sees the sector is unreadable and rewrites it
elsewhere on that drive.
3) If linux hangcheck timer runs out before the drive timeout, then
linux aborts the read, logs an error, and mdadm isn't given a chance to
rewrite elsewhere based on checksums.
I'm not sure how the linux io timeout fits in here, and how it's
different from the hangcheck timer.
Given all this, it seems to me that I should now set the hangcheck timer
to something greater than drive timeout (180 seconds). Does that sound
right? Otherwise, linux will kill the rewrite again, no?
Thanks,
Allie
On 10/12/2017 4:52 PM, Edward Kuns wrote:
> On Thu, Oct 12, 2017 at 10:16 AM, Edward Kuns <eddie.kuns@gmail.com> wrote:
>> All y'all referring to a whole separate kernel module, hangcheck-timer.ko?
>
> Looking back at the original messages:
>
> [4038193.380403] INFO: task md2_raid5:247 blocked for more than 120 seconds.
> [4038193.380473] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu
> [4038193.380526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
>
> it looks like you're dealing with this part of the kernel:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/hung_task.c
>
> The timer is configurable with sysctl and defaults to 120 seconds.
> You can check with this command:
>
> $ sudo sysctl kernel.hung_task_timeout_secs
> kernel.hung_task_timeout_secs = 120
>
> You can adjust it temporarily (e.g. to make it longer):
>
> $ sudo sysctl -w kernel.hung_task_timeout_secs=150
>
> Or you can adjust it permanently by modifying your sysctl configuration.
>
> It looks like by default it will only warn ten times. After that it
> will stop complaining. That is also configurable via sysctl.
>
> Eddie
>
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2017-12-18 15:51 ` Alexander Shenkin
@ 2017-12-18 16:09 ` Phil Turmel
2017-12-19 10:35 ` Alexander Shenkin
0 siblings, 1 reply; 49+ messages in thread
From: Phil Turmel @ 2017-12-18 16:09 UTC (permalink / raw)
To: Alexander Shenkin, Edward Kuns, Mark Knecht
Cc: Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID
Hi Alexander,
On 12/18/2017 10:51 AM, Alexander Shenkin wrote:
> Hi all,
>
> I'm getting back to this now that I'll have time, apologies for the
> delay. So, is the following correct in the case of a read error?
Not quite.
> 1) System tries to read an unreadable sector
> 2) Drive timeout reports unreadable based on drive timeout setting.
> 2a) In this case, mdadm sees the sector is unreadable and rewrites it
> elsewhere on that drive.
No. MD reconstructs the sector from redundancy (mirror or reverse
parity calc or reverse P+Q syndrome) and writes it back to the *same*
sector. Since the drive firmware reported an error here, it knows to
verify the write as well. If the verification fails, the drive firmware
will relocate the sector in the background, invisible to the upper
layers. As far as MD is concerned, that sector address is fixed either
way. Relocations are handled entirely within the drive. MD does not
perform or track relocations.
> 3) If linux hangcheck timer runs out before the drive timeout, then
> linux aborts the read, logs an error, and mdadm isn't given a chance
> to rewrite elsewhere based on checksums.
No. The hangcheck timer issue described in your forwarded email is
unrelated. And MD doesn't use checksums.
Each drive has a device driver timeout, as you note below, found at
/sys/block/*/device/timeout, that linux's ATA/SCSI stack uses to cut off
non-responsive controller cards and/or drives. If that timer runs out
on a read before the drive reports the read error, the low level
*driver* reports a read error to the MD layer. MD treats it the same as
any other read error, locating or recomputing the sector from redundancy
as above. The difference in this case is that the physical drive isn't
talking to the controller (link reset in progress, typically) and the
corrective rewrite of the sector (to fix or relocate within the drive)
is refused, and that write error causes MD to kick out the drive. And
the pending sector is also left unfixed.
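The driver-side timer in that race is easy to inspect. A harmless read-only sketch (the glob simply matches nothing on systems without such devices):

```shell
# Read-only sketch: print the current driver timeout for every disk
# the SCSI layer knows about; this is the value that must exceed the
# drive's worst-case recovery time.
for t in /sys/block/*/device/timeout; do
    [ -e "$t" ] || continue
    echo "$t = $(cat "$t") s"
done
echo "scan complete"
```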
> Given all this, it seems to me that I should now set the hangcheck
> timer to something greater than drive timeout (180 seconds). Does
> that sound right? Otherwise, linux will kill the rewrite again, no?
In and of itself, waiting on I/O is not a hang. So it should not be
applicable.
Phil
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2017-12-18 16:09 ` Phil Turmel
@ 2017-12-19 10:35 ` Alexander Shenkin
2017-12-19 12:02 ` Phil Turmel
0 siblings, 1 reply; 49+ messages in thread
From: Alexander Shenkin @ 2017-12-19 10:35 UTC (permalink / raw)
To: Phil Turmel, Edward Kuns, Mark Knecht
Cc: Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID
On 12/18/2017 4:09 PM, Phil Turmel wrote:
> Hi Alexander,
>
> On 12/18/2017 10:51 AM, Alexander Shenkin wrote:
>> Hi all,
>>
>> I'm getting back to this now that I'll have time, apologies for the
>> delay. So, is the following correct in the case of a read error?
>
> Not quite.
>
>> 1) System tries to read an unreadable sector
>
>> 2) Drive timeout reports unreadable based on drive timeout setting.
>
>> 2a) In this case, mdadm sees the sector is unreadable and rewrites it
>> elsewhere on that drive.
>
> No. MD reconstructs the sector from redundancy (mirror or reverse
> parity calc or reverse P+Q syndrome) and writes it back to the *same*
> sector. Since the drive firmware reported an error here, it knows to
> verify the write as well. If the verification fails, the drive firmware
> will relocate the sector in the background, invisible to the upper
> layers. As far as MD is concerned, that sector address is fixed either
> way. Relocations are handled entirely within the drive. MD does not
> perform or track relocations.
>
>> 3) If linux hangcheck timer runs out before the drive timeout, then
>> linux aborts the read, logs an error, and mdadm isn't given a chance
>> to rewrite elsewhere based on checksums.
>
> No. The hangcheck timer issue described in your forwarded email is
> unrelated. And MD doesn't use checksums.
>
> Each drive has a device driver timeout, as you note below, found at
> /sys/block/*/device/timeout, that linux's ATA/SCSI stack uses to cut off
> non-responsive controller cards and/or drives. If that timer runs out
> on a read before the drive reports the read error, the low level
> *driver* reports a read error to the MD layer. MD treats it the same as
> any other read error, locating or recomputing the sector from redundancy
> as above. The difference in this case is that the physical drive isn't
> talking to the controller (link reset in progress, typically) and the
> corrective rewrite of the sector (to fix or relocate within the drive)
> is refused, and that write error causes MD to kick out the drive. And
> the pending sector is also left unfixed.
>
>> Given all this, it seems to me that I should now set the hangcheck
>> timer to something greater than drive timeout (180 seconds). Does
>> that sound right? Otherwise, linux will kill the rewrite again, no?
>
> In and of itself, waiting on I/O is not a hang. So it should not be
> applicable.
Ok, so, it's now my understanding that I would normally be ok, having
set the driver timeout to 180 secs (thus giving time for the seagate
drive to report the read error back up to the MD layer before 180 secs
is up). In my case, however, the kernel hangcheck timer is interrupting
the process (md?) that is waiting on the sector read at 120 secs.
Therefore, the writeback doesn't happen.
Thus, I should set the hangcheck to something > 120 (say, 180 secs -
should it be >180 to let the driver timeout first?). Does this sound
correct? Apologies if I'm repeating info from before - just trying to
be sure about what I'm doing before I go ahead and do it.
If that's correct, I'll add the following line in /etc/sysctl.conf:
kernel.hung_task_timeout_secs = 180
I'll make sure the setting has taken, and then I'll run:
sudo /usr/share/mdadm/checkarray --idle --all
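I could also pin the driver timeout itself with a udev rule rather than a boot-time script, so it survives reboots and hot-plugs. A sketch only; the match keys are assumptions and may need adjusting for the actual devices:

```
# /etc/udev/rules.d/60-scsi-timeout.rules -- sketch
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="180"
```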
Thanks,
Allie
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2017-12-19 10:35 ` Alexander Shenkin
@ 2017-12-19 12:02 ` Phil Turmel
2017-12-21 11:28 ` Alexander Shenkin
0 siblings, 1 reply; 49+ messages in thread
From: Phil Turmel @ 2017-12-19 12:02 UTC (permalink / raw)
To: Alexander Shenkin, Edward Kuns, Mark Knecht
Cc: Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID
On 12/19/2017 05:35 AM, Alexander Shenkin wrote:
> Ok, so, it's now my understanding that I would normally be ok, having
> set the driver timeout to 180 secs (thus giving time for the seagate
> drive to report the read error back up to the MD layer before 180 secs
> is up). In my case, however, the kernel hangcheck timer is interrupting
> the process (md?) that is waiting on the sector read at 120 secs.
> Therefore, the writeback doesn't happen.
Yes. I think this behavior is a bug, and you need to work around it.
> Thus, I should set the hangcheck to something > 120 (say, 180 secs -
> should it be >180 to let the driver timeout first?). Does this sound
> correct? Apologies if I'm repeating info from before - just trying to
> be sure about what I'm doing before I go ahead and do it.
>
> If that's correct, I'll add the following line in /etc/sysctl.conf:
>
> kernel.hung_task_timeout_secs = 180
Yes. For your kernel.
> I'll make sure the setting has taken, and then I'll run:
>
> sudo /usr/share/mdadm/checkarray --idle --all
Makes sense. Please report your results for posterity when the scrub is
done.
Phil
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2017-12-19 12:02 ` Phil Turmel
@ 2017-12-21 11:28 ` Alexander Shenkin
2017-12-21 11:38 ` Reindl Harald
0 siblings, 1 reply; 49+ messages in thread
From: Alexander Shenkin @ 2017-12-21 11:28 UTC (permalink / raw)
To: Phil Turmel, Edward Kuns, Mark Knecht
Cc: Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID
[-- Attachment #1: Type: text/plain, Size: 2785 bytes --]
Hi all,
Reporting back after changing the hangcheck timer to 180 secs and
re-running checkarray. I got a number of rebuild events (see syslog
excerpts below and attached), and I see no signs of the hangcheck issue
in dmesg like I did last time.
I'm still getting the SMART OfflineUncorrectableSector and
CurrentPendingSector errors, however. Should those go away if the
rewrites were correctly carried out by the drive? Any thoughts on next
steps to verify everything is ok?
Thanks,
Allie
user@machine:/var/log$ cat syslog | grep Rebuild
Dec 19 12:48:18 machine mdadm[23296]: RebuildStarted event detected on
md device /dev/md/0
Dec 19 12:48:41 machine mdadm[23296]: Rebuild99 event detected on md
device /dev/md/0
Dec 19 12:48:41 machine mdadm[23296]: RebuildStarted event detected on
md device /dev/md/2
Dec 19 12:48:41 machine mdadm[23296]: RebuildFinished event detected on
md device /dev/md/0
Dec 19 14:12:02 machine mdadm[23296]: Rebuild22 event detected on md
device /dev/md/2
Dec 19 15:18:42 machine mdadm[23296]: Rebuild41 event detected on md
device /dev/md/2
Dec 19 16:42:02 machine mdadm[23296]: Rebuild62 event detected on md
device /dev/md/2
Dec 19 18:05:23 machine mdadm[23296]: Rebuild80 event detected on md
device /dev/md/2
Dec 19 20:02:09 machine mdadm[23296]: RebuildFinished event detected on
md device /dev/md/2
On 12/19/2017 12:02 PM, Phil Turmel wrote:
> On 12/19/2017 05:35 AM, Alexander Shenkin wrote:
>
>> Ok, so, it's now my understanding that I would normally be ok, having
>> set the driver timeout to 180 secs (thus giving time for the seagate
>> drive to report the read error back up to the MD layer before 180 secs
>> is up). In my case, however, the kernel hangcheck timer is interrupting
>> the process (md?) that is waiting on the sector read at 120 secs.
>> Therefore, the writeback doesn't happen.
>
> Yes. I think this behavior is a bug, and you need to work around it.
>
>> Thus, I should set the hangcheck to something > 120 (say, 180 secs -
>> should it be >180 to let the driver timeout first?). Does this sound
>> correct? Apologies if I'm repeating info from before - just trying to
>> be sure about what I'm doing before I go ahead and do it.
>>
>> If that's correct, I'll add the following line in /etc/sysctl.conf:
>>
>> kernel.hung_task_timeout_secs = 180
>
> Yes. For your kernel.
>
>> I'll make sure the setting has taken, and then I'll run:
>>
>> sudo /usr/share/mdadm/checkarray --idle --all
>
> Makes sense. Please report your results for posterity when the scrub is
> done.
>
> Phil
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
[-- Attachment #2: syslog --]
[-- Type: text/plain, Size: 17325 bytes --]
Dec 19 12:48:18 machinename mdadm[23296]: RebuildStarted event detected on md device /dev/md/0
Dec 19 12:48:19 machinename kernel: [1057980.859389] md: data-check of RAID array md0
Dec 19 12:48:19 machinename kernel: [1057980.859396] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Dec 19 12:48:19 machinename kernel: [1057980.859399] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Dec 19 12:48:19 machinename kernel: [1057980.859410] md: using 128k window, over a total of 1950656k.
Dec 19 12:48:19 machinename kernel: [1057981.785802] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
Dec 19 12:48:41 machinename kernel: [1058004.462362] md: md0: data-check done.
Dec 19 12:48:41 machinename kernel: [1058004.472905] md: data-check of RAID array md2
Dec 19 12:48:41 machinename kernel: [1058004.472910] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Dec 19 12:48:41 machinename kernel: [1058004.472911] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Dec 19 12:48:41 machinename kernel: [1058004.472917] md: using 128k window, over a total of 2920188928k.
Dec 19 12:48:41 machinename mdadm[23296]: Rebuild99 event detected on md device /dev/md/0
Dec 19 12:48:41 machinename mdadm[23296]: RebuildStarted event detected on md device /dev/md/2
Dec 19 12:48:41 machinename mdadm[23296]: RebuildFinished event detected on md device /dev/md/0
Dec 19 12:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 12:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 12:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 104
Dec 19 12:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 54 to 53
Dec 19 12:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 46 to 47
Dec 19 12:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 12:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 13:09:01 machinename CRON[677]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 13:17:01 machinename CRON[1297]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 19 13:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 13:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 13:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 53 to 50
Dec 19 13:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 47 to 50
Dec 19 13:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 13:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 13:39:01 machinename CRON[2821]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 13:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 13:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 13:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 104 to 119
Dec 19 13:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 50 to 49
Dec 19 13:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 50 to 51
Dec 19 13:55:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 13:55:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 14:09:16 machinename CRON[5303]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 14:12:02 machinename mdadm[23296]: Rebuild22 event detected on md device /dev/md/2
Dec 19 14:17:03 machinename CRON[5784]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 19 14:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 14:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 14:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 119 to 120
Dec 19 14:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 49 to 48
Dec 19 14:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 51 to 52
Dec 19 14:25:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 14:25:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 14:39:01 machinename CRON[7066]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 14:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 14:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 14:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 120 to 105
Dec 19 14:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 14:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 15:09:01 machinename CRON[10105]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 15:17:01 machinename CRON[10868]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 19 15:18:42 machinename mdadm[23296]: Rebuild41 event detected on md device /dev/md/2
Dec 19 15:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 15:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 15:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 105 to 119
Dec 19 15:25:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 15:25:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 15:39:02 machinename CRON[12435]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 15:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 15:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 15:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 48 to 47
Dec 19 15:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 52 to 53
Dec 19 15:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 15:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 16:09:01 machinename CRON[15876]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 16:14:28 machinename dhclient: DHCPREQUEST of 192.168.x.x on eth0 to 192.168.1.1 port 67 (xid=0x294df4f0)
Dec 19 16:14:28 machinename dhclient: DHCPACK of 192.168.x.x from 192.168.1.1
Dec 19 16:14:35 machinename dhclient: bound to 192.168.x.x -- renewal in 37501 seconds.
Dec 19 16:17:02 machinename CRON[16614]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 19 16:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 16:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 16:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 119 to 118
Dec 19 16:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 16:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 16:39:01 machinename CRON[18221]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 16:42:02 machinename mdadm[23296]: Rebuild62 event detected on md device /dev/md/2
Dec 19 16:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 16:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 16:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 118 to 116
Dec 19 16:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 16:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 17:09:02 machinename CRON[21316]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 17:17:03 machinename CRON[21836]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 19 17:25:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 17:25:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 17:25:46 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 108
Dec 19 17:25:47 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 17:25:47 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 17:39:05 machinename CRON[23185]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 17:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 17:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 17:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 108 to 117
Dec 19 17:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 47 to 48
Dec 19 17:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 53 to 52
Dec 19 17:55:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 17:55:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 18:05:23 machinename mdadm[23296]: Rebuild80 event detected on md device /dev/md/2
Dec 19 18:09:01 machinename CRON[25890]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 18:17:01 machinename CRON[26421]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 19 18:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 18:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 18:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 120
Dec 19 18:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 48 to 47
Dec 19 18:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 52 to 53
Dec 19 18:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 18:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 18:39:12 machinename CRON[27738]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 18:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 18:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 18:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 120 to 114
Dec 19 18:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 18:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 19:09:16 machinename CRON[31805]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 19:17:02 machinename CRON[322]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 19 19:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 19:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 19:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 114 to 117
Dec 19 19:25:47 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 19:25:47 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 19:39:01 machinename CRON[1738]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 19:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 19:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 19:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 47 to 48
Dec 19 19:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 53 to 52
Dec 19 19:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 19:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 20:02:08 machinename kernel: [1084008.008663] md: md2: data-check done.
Dec 19 20:02:09 machinename mdadm[23296]: RebuildFinished event detected on md device /dev/md/2
Dec 19 20:09:01 machinename CRON[4780]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime))
Dec 19 20:17:01 machinename CRON[5577]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 19 20:25:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Dec 19 20:25:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Dec 19 20:25:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 119
Dec 19 20:25:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 48 to 50
Dec 19 20:25:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 52 to 50
Dec 19 20:25:44 machinename smartd[2151]: Device: /dev/sda [SAT], previous self-test completed with error (read test element)
Dec 19 20:25:44 machinename smartd[2151]: Device: /dev/sda [SAT], Self-Test Log error count increased from 19 to 20
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2017-12-21 11:28 ` Alexander Shenkin
@ 2017-12-21 11:38 ` Reindl Harald
2017-12-23 3:14 ` Brad Campbell
0 siblings, 1 reply; 49+ messages in thread
From: Reindl Harald @ 2017-12-21 11:38 UTC (permalink / raw)
To: Alexander Shenkin, Phil Turmel, Edward Kuns, Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
Am 21.12.2017 um 12:28 schrieb Alexander Shenkin:
> Hi all,
>
> Reporting back after changing the hangcheck timer to 180 secs and
> re-running checkarray. I got a number of rebuild events (see syslog
> excerpts below and attached), and I see no signs of the hangcheck issue
> in dmesg like I did last time.
>
> I'm still getting the SMART OfflineUncorrectableSector and
> CurrentPendingSector errors, however. Should those go away if the
> rewrites were correctly carried out by the drive? Any thoughts on next
> steps to verify everything is ok?
OfflineUncorrectableSector unlikely can go away
CurrentPendingSector
https://kb.acronis.com/content/9133
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2017-12-21 11:38 ` Reindl Harald
@ 2017-12-23 3:14 ` Brad Campbell
2018-01-03 12:44 ` Alexander Shenkin
0 siblings, 1 reply; 49+ messages in thread
From: Brad Campbell @ 2017-12-23 3:14 UTC (permalink / raw)
To: Reindl Harald, Alexander Shenkin, Phil Turmel, Edward Kuns, Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
On 21/12/17 19:38, Reindl Harald wrote:
>
>
> Am 21.12.2017 um 12:28 schrieb Alexander Shenkin:
>> Hi all,
>>
>> Reporting back after changing the hangcheck timer to 180 secs and
>> re-running checkarray. I got a number of rebuild events (see syslog
>> excerpts below and attached), and I see no signs of the hangcheck
>> issue in dmesg like I did last time.
>>
>> I'm still getting the SMART OfflineUncorrectableSector and
>> CurrentPendingSector errors, however. Should those go away if the
>> rewrites were correctly carried out by the drive? Any thoughts on
>> next steps to verify everything is ok?
>
> OfflineUncorrectableSector unlikely can go away
>
> CurrentPendingSector
> https://kb.acronis.com/content/9133
If they've been re-written (so are no longer pending) then a SMART long
or possibly offline test will make them go away. I use SMART long myself.
Regards,
Brad
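
For reference, the long test Brad describes can be kicked off and checked roughly like this (a dry-run sketch: the commands are only echoed, and /dev/sdX is a placeholder for the real member device):

```shell
#!/bin/sh
DEV=/dev/sdX   # placeholder; substitute the affected drive, e.g. /dev/sda

# An extended (long) self-test runs inside the drive firmware;
# smartctl just queues it and returns immediately.
echo "would run: smartctl -t long $DEV"

# Afterwards the self-test log shows whether the read scan passed;
# sectors that were rewritten (no longer pending) stop producing
# read failures there.
echo "would run: smartctl -l selftest $DEV"
```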
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2017-12-23 3:14 ` Brad Campbell
@ 2018-01-03 12:44 ` Alexander Shenkin
2018-01-03 13:26 ` Brad Campbell
0 siblings, 1 reply; 49+ messages in thread
From: Alexander Shenkin @ 2018-01-03 12:44 UTC (permalink / raw)
To: Brad Campbell, Reindl Harald, Phil Turmel, Edward Kuns, Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
On 12/23/2017 3:14 AM, Brad Campbell wrote:
> On 21/12/17 19:38, Reindl Harald wrote:
>>
>>
>> Am 21.12.2017 um 12:28 schrieb Alexander Shenkin:
>>> Hi all,
>>>
>>> Reporting back after changing the hangcheck timer to 180 secs and
>>> re-running checkarray. I got a number of rebuild events (see syslog
>>> excerpts below and attached), and I see no signs of the hangcheck
>>> issue in dmesg like I did last time.
>>>
>>> I'm still getting the SMART OfflineUncorrectableSector and
>>> CurrentPendingSector errors, however. Should those go away if the
>>> rewrites were correctly carried out by the drive? Any thoughts on
>>> next steps to verify everything is ok?
>>
>> OfflineUncorrectableSector unlikely can go away
>>
>> CurrentPendingSector
>> https://kb.acronis.com/content/9133
>
> If they've been re-written (so are no longer pending) then a SMART long
> or possibly offline test will make them go away. I use SMART long myself.
>
Thanks Brad. I'm running a long test now, but I believe I have the
system set up to run long tests regularly, and the issue hasn't been
fixed. Furthermore, strangely, the reallocated sector count still sits
at 0 (see below). If these errors had been properly handled by the
drive, shouldn't Reallocated_Sector_Ct sit at least at 8?
Thanks,
Allie
user@machine:~$ sudo smartctl -a /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.4.0-97-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST3000DM001-9YN166
Serial Number: Z1F13FBA
LU WWN Device Id: 5 000c50 04e444ab1
Firmware Version: CC4B
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Wed Jan 3 12:30:04 2018 GMT
==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection:
Enabled.
Self-test execution status: ( 249) Self-test routine in progress...
90% of test remaining.
Total time to complete Offline
data collection: ( 592) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection
on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 335) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED  RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -        190170288
  3 Spin_Up_Time            0x0003   093   092   000    Pre-fail  Always       -        0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -        49
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -        0
  7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -        175521338
  9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -        15266
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -        0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -        73
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -        0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -        0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -        0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -        3 3 3
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -        0
190 Airflow_Temperature_Cel 0x0022   049   040   045    Old_age   Always   In_the_past  51 (108 124 54 26 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -        0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -        54
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -        114
194 Temperature_Celsius     0x0022   051   060   000    Old_age   Always       -        51 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -        8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -        8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -        0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -        9657h+43m+05.288s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -        178878793257480
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -        134902761417217
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description     Status                         Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline     Self-test routine in progress  90%        15266            -
# 2  Conveyance offline   Completed: read failure        80%        15160            -
# 3  Extended offline     Completed: read failure        10%        15152            -
# 4  Conveyance offline   Completed: read failure        70%        14992            -
# 5  Extended offline     Completed: read failure        10%        14986            -
# 6  Short offline        Completed: read failure        60%        14913            -
# 7  Conveyance offline   Completed: read failure        70%        14824            -
# 8  Extended offline     Completed: read failure        10%        14818            -
# 9  Conveyance offline   Completed: read failure        80%        14656            -
#10  Extended offline     Completed: read failure        10%        14649            -
#11  Conveyance offline   Completed: read failure        80%        14489            -
#12  Extended offline     Completed: read failure        10%        14482            -
#13  Conveyance offline   Completed: read failure        80%        14321            -
#14  Extended offline     Completed: read failure        10%        14314            -
#15  Conveyance offline   Completed: read failure        80%        14153            -
#16  Extended offline     Completed: read failure        10%        14145            -
#17  Conveyance offline   Completed: read failure        70%        13985            -
#18  Extended offline     Completed: read failure        10%        13977            -
#19  Conveyance offline   Completed: read failure        70%        13817            -
#20  Extended offline     Completed: read failure        10%        13809            -
#21  Conveyance offline   Completed without error        00%        13648            -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2018-01-03 12:44 ` Alexander Shenkin
@ 2018-01-03 13:26 ` Brad Campbell
2018-01-03 13:50 ` Alexander Shenkin
0 siblings, 1 reply; 49+ messages in thread
From: Brad Campbell @ 2018-01-03 13:26 UTC (permalink / raw)
To: Alexander Shenkin, Reindl Harald, Phil Turmel, Edward Kuns, Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
On 03/01/18 20:44, Alexander Shenkin wrote:
> On 12/23/2017 3:14 AM, Brad Campbell wrote:
>> On 21/12/17 19:38, Reindl Harald wrote:
>>>
>>>
>>> Am 21.12.2017 um 12:28 schrieb Alexander Shenkin:
>>>> Hi all,
>>>>
>>>> Reporting back after changing the hangcheck timer to 180 secs and
>>>> re-running checkarray. I got a number of rebuild events (see
>>>> syslog excerpts below and attached), and I see no signs of the
>>>> hangcheck issue in dmesg like I did last time.
>>>>
>>>> I'm still getting the SMART OfflineUncorrectableSector and
>>>> CurrentPendingSector errors, however. Should those go away if the
>>>> rewrites were correctly carried out by the drive? Any thoughts on
>>>> next steps to verify everything is ok?
>>>
>>> OfflineUncorrectableSector unlikely can go away
>>>
>>> CurrentPendingSector
>>> https://kb.acronis.com/content/9133
>>
>> If they've been re-written (so are no longer pending) then a SMART
>> long or possibly offline test will make them go away. I use SMART
>> long myself.
>>
>
> Thanks Brad. I'm running a long test now, but I believe I have the
> system set up to run long tests regularly, and the issue hasn't been
> fixed. Furthermore, strangely, the reallocated sector count still
> sits at 0 (see below). If these errors had been properly handled by
> the drive, shouldn't Reallocated_Sector_Ct sit at least at 8?
Nope. Your pending is still at 8, so you've got bad sectors in an area
of the drive that hasn't been dealt with. What is "interesting" is that
your SMART test results don't list the LBA of the first failure.
Disappointing behaviour on the part of the disk. They are within the 1st
10% of the drive however, so it wouldn't surprise me if they were in an
unused portion of the RAID superblock area.
Regards,
--
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2018-01-03 13:26 ` Brad Campbell
@ 2018-01-03 13:50 ` Alexander Shenkin
2018-01-03 15:53 ` Phil Turmel
0 siblings, 1 reply; 49+ messages in thread
From: Alexander Shenkin @ 2018-01-03 13:50 UTC (permalink / raw)
To: Brad Campbell, Reindl Harald, Phil Turmel, Edward Kuns, Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
On 1/3/2018 1:26 PM, Brad Campbell wrote:
>
>
> On 03/01/18 20:44, Alexander Shenkin wrote:
>> On 12/23/2017 3:14 AM, Brad Campbell wrote:
>>> On 21/12/17 19:38, Reindl Harald wrote:
>>>>
>>>>
>>>> Am 21.12.2017 um 12:28 schrieb Alexander Shenkin:
>>>>> Hi all,
>>>>>
>>>>> Reporting back after changing the hangcheck timer to 180 secs and
>>>>> re-running checkarray. I got a number of rebuild events (see
>>>>> syslog excerpts below and attached), and I see no signs of the
>>>>> hangcheck issue in dmesg like I did last time.
>>>>>
>>>>> I'm still getting the SMART OfflineUncorrectableSector and
>>>>> CurrentPendingSector errors, however. Should those go away if the
>>>>> rewrites were correctly carried out by the drive? Any thoughts on
>>>>> next steps to verify everything is ok?
>>>>
>>>> OfflineUncorrectableSector unlikely can go away
>>>>
>>>> CurrentPendingSector
>>>> https://kb.acronis.com/content/9133
>>>
>>> If they've been re-written (so are no longer pending) then a SMART
>>> long or possibly offline test will make them go away. I use SMART
>>> long myself.
>>>
>>
>> Thanks Brad. I'm running a long test now, but I believe I have the
>> system set up to run long tests regularly, and the issue hasn't been
>> fixed. Furthermore, strangely, the reallocated sector count still
>> sits at 0 (see below). If these errors had been properly handled by
>> the drive, shouldn't Reallocated_Sector_Ct sit at least at 8?
>
> Nope. Your pending is still at 8, so you've got bad sectors in an area
> of the drive that hasn't been dealt with. What is "interesting" is that
> your SMART test results don't list the LBA of the first failure.
> Disappointing behaviour on the part of the disk. They are within the 1st
> 10% of the drive however, so it wouldn't surprise me if they were in an
> unused portion of the RAID superblock area.
Thanks Brad. So, to theoretically get these sectors remapped so I don't
keep getting errors, I would have to somehow try to write to those
sectors. That's tough given that the LBAs aren't reported, as you
mention. Perhaps my best course of action then is to:
1) re-run sudo /usr/share/mdadm/checkarray --idle --all
2) add my previously-purchased drive to convert the RAID5 to RAID6
(using
http://www.ewams.net/?date=2013/05/02&view=Converting_RAID5_to_RAID6_in_mdadm
as a guide)
3) after that, fail and remove /dev/sda from the RAID6
4) write 0's on /dev/sda (dd if=/dev/zero of=/dev/sda bs=1M)
5) re-add /dev/sda to the RAID6
This should get those bad sectors remapped... thoughts?
thanks,
allie
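
As a dry-run sketch of those five steps (nothing is executed, only echoed; the array name /dev/md2, the replacement device name, the final device count, and the backup-file path are all assumptions):

```shell
#!/bin/sh
MD=/dev/md2      # assumed array name
BAD=/dev/sda     # member with the pending sectors
NEW=/dev/sdX     # placeholder for the replacement drive

run() { echo "would run: $*"; }   # dry-run wrapper; remove for real use

run /usr/share/mdadm/checkarray --idle --all    # 1) re-run the check
run mdadm --add "$MD" "$NEW"                    # 2) add the new drive...
# ...and reshape RAID5 -> RAID6 (device count here is a guess):
run mdadm --grow "$MD" --level=6 --raid-devices=4 \
    --backup-file=/root/md-grow.backup
run mdadm "$MD" --fail "$BAD"                   # 3) fail, then remove
run mdadm "$MD" --remove "$BAD"
run dd if=/dev/zero of="$BAD" bs=1M             # 4) overwrite, forcing remaps
run mdadm --add "$MD" "$BAD"                    # 5) re-add to the array
```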
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2018-01-03 13:50 ` Alexander Shenkin
@ 2018-01-03 15:53 ` Phil Turmel
2018-01-03 15:59 ` Alexander Shenkin
0 siblings, 1 reply; 49+ messages in thread
From: Phil Turmel @ 2018-01-03 15:53 UTC (permalink / raw)
To: Alexander Shenkin, Brad Campbell, Reindl Harald, Edward Kuns,
Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
On 01/03/2018 08:50 AM, Alexander Shenkin wrote:
> On 1/3/2018 1:26 PM, Brad Campbell wrote:
>> Nope. Your pending is still at 8, so you've got bad sectors in an area
>> of the drive that hasn't been dealt with. What is "interesting" is
>> that your SMART test results don't list the LBA of the first failure.
>> Disappointing behaviour on the part of the disk. They are within the
>> 1st 10% of the drive however, so it wouldn't surprise me if they were
>> in an unused portion of the RAID superblock area.
>
> Thanks Brad. So, to theoretically get these sectors remapped so I don't
> keep getting errors, I would have to somehow try to write to those
> sectors. That's tough given that the LBA's aren't reported as you
> mention. Perhaps my best course of action then is to:
No, just use dd to read that device -- it'll bail out with read error
when it hits the trouble spot, which will report the affected sector.
Then you can rewrite it with the appropriate seek= value. (Assuming it
really is in an unused part of the member device.)
Phil
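
A sketch of that read scan (commands only echoed; /dev/sdX stands in for the member device, and the resume offset is purely illustrative):

```shell
#!/bin/sh
DEV=/dev/sdX   # placeholder for the member device

# Sequential read; dd aborts at the first unreadable spot and the
# kernel logs "I/O error, dev ..., sector N" (N in 512-byte units).
echo "would run: dd if=$DEV of=/dev/null bs=4096"

# To keep scanning past a bad spot, skip ahead by some number of
# bs-sized blocks (1000000 here is just an example value):
SKIP=1000000
echo "would run: dd if=$DEV of=/dev/null bs=4096 skip=$SKIP"
```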
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2018-01-03 15:53 ` Phil Turmel
@ 2018-01-03 15:59 ` Alexander Shenkin
2018-01-03 16:02 ` Phil Turmel
0 siblings, 1 reply; 49+ messages in thread
From: Alexander Shenkin @ 2018-01-03 15:59 UTC (permalink / raw)
To: Phil Turmel, Brad Campbell, Reindl Harald, Edward Kuns, Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
On 1/3/2018 3:53 PM, Phil Turmel wrote:
> On 01/03/2018 08:50 AM, Alexander Shenkin wrote:
>> On 1/3/2018 1:26 PM, Brad Campbell wrote:
>
>>> Nope. Your pending is still at 8, so you've got bad sectors in an area
>>> of the drive that hasn't been dealt with. What is "interesting" is
>>> that your SMART test results don't list the LBA of the first failure.
>>> Disappointing behaviour on the part of the disk. They are within the
>>> 1st 10% of the drive however, so it wouldn't surprise me if they were
>>> in an unused portion of the RAID superblock area.
>>
>> Thanks Brad. So, to theoretically get these sectors remapped so I don't
>> keep getting errors, I would have to somehow try to write to those
>> sectors. That's tough given that the LBA's aren't reported as you
>> mention. Perhaps my best course of action then is to:
>
> No, just use dd to read that device -- it'll bail out with read error
> when it hits the trouble spot, which will report the affected sector.
> Then you can rewrite it with the appropriate seek= value. (Assuming it
> really is in an unused part of the member device.)
Thanks Phil. So, just: dd if=/dev/sda of=/dev/null bs=512
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2018-01-03 15:59 ` Alexander Shenkin
@ 2018-01-03 16:02 ` Phil Turmel
2018-01-04 10:37 ` Alexander Shenkin
0 siblings, 1 reply; 49+ messages in thread
From: Phil Turmel @ 2018-01-03 16:02 UTC (permalink / raw)
To: Alexander Shenkin, Brad Campbell, Reindl Harald, Edward Kuns,
Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
On 01/03/2018 10:59 AM, Alexander Shenkin wrote:
> On 1/3/2018 3:53 PM, Phil Turmel wrote:
>> On 01/03/2018 08:50 AM, Alexander Shenkin wrote:
>>> On 1/3/2018 1:26 PM, Brad Campbell wrote:
>>
>>>> Nope. Your pending is still at 8, so you've got bad sectors in an area
>>>> of the drive that hasn't been dealt with. What is "interesting" is
>>>> that your SMART test results don't list the LBA of the first failure.
>>>> Disappointing behaviour on the part of the disk. They are within the
>>>> 1st 10% of the drive however, so it wouldn't surprise me if they were
>>>> in an unused portion of the RAID superblock area.
>>>
>>> Thanks Brad. So, to theoretically get these sectors remapped so I don't
>>> keep getting errors, I would have to somehow try to write to those
>>> sectors. That's tough given that the LBA's aren't reported as you
>>> mention. Perhaps my best course of action then is to:
>>
>> No, just use dd to read that device -- it'll bail out with read error
>> when it hits the trouble spot, which will report the affected sector.
>> Then you can rewrite it with the appropriate seek= value. (Assuming it
>> really is in an unused part of the member device.)
>
> Thanks Phil. So, just: dd if=/dev/sda of=/dev/null bs=512
Yup. (-:
Phil
^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline?
2018-01-03 16:02 ` Phil Turmel
@ 2018-01-04 10:37 ` Alexander Shenkin
2018-01-04 12:28 ` Alexander Shenkin
0 siblings, 1 reply; 49+ messages in thread
From: Alexander Shenkin @ 2018-01-04 10:37 UTC (permalink / raw)
To: Phil Turmel, Brad Campbell, Reindl Harald, Edward Kuns, Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
On 1/3/2018 4:02 PM, Phil Turmel wrote:
> On 01/03/2018 10:59 AM, Alexander Shenkin wrote:
>> On 1/3/2018 3:53 PM, Phil Turmel wrote:
>>> On 01/03/2018 08:50 AM, Alexander Shenkin wrote:
>>>> On 1/3/2018 1:26 PM, Brad Campbell wrote:
>>>
>>>>> Nope. Your pending is still at 8, so you've got bad sectors in an area
>>>>> of the drive that hasn't been dealt with. What is "interesting" is
>>>>> that your SMART test results don't list the LBA of the first failure.
>>>>> Disappointing behaviour on the part of the disk. They are within the
>>>>> 1st 10% of the drive however, so it wouldn't surprise me if they were
>>>>> in an unused portion of the RAID superblock area.
>>>>
>>>> Thanks Brad. So, to theoretically get these sectors remapped so I don't
>>>> keep getting errors, I would have to somehow try to write to those
>>>> sectors. That's tough given that the LBA's aren't reported as you
>>>> mention. Perhaps my best course of action then is to:
>>>
>>> No, just use dd to read that device -- it'll bail out with read error
>>> when it hits the trouble spot, which will report the affected sector.
>>> Then you can rewrite it with the appropriate seek= value. (Assuming it
>>> really is in an unused part of the member device.)
>>
So, I got a read error as expected, running (physical sector size of sda
is 4096):
dd if=/dev/sda of=/dev/null bs=4096
Is there some way to tell whether this sector is considered to be in
use? Not sure what the effect of rewriting it might be if it is...
If it's safe, I'd run (dd's seek= counts bs-sized blocks, so the
512-byte sector 5857843312 maps to 4096-byte block 732230414):
dd if=/dev/zero of=/dev/sda seek=732230414 count=1 bs=4096
Perhaps the way to go is to write to it, and then run checkarray again?
Thanks,
Allie
syslog here:
user@machinename:~$ cat /var/log/syslog | grep sda
Jan 4 08:23:30 machinename kernel: [1330854.323854] sd 0:0:0:0: [sda] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 4 08:23:30 machinename kernel: [1330854.323861] sd 0:0:0:0: [sda] tag#16 Sense Key : Medium Error [current] [descriptor]
Jan 4 08:23:30 machinename kernel: [1330854.323867] sd 0:0:0:0: [sda] tag#16 Add. Sense: Unrecovered read error - auto reallocate failed
Jan 4 08:23:30 machinename kernel: [1330854.323873] sd 0:0:0:0: [sda] tag#16 CDB: Read(16) 88 00 00 00 00 01 5d 27 98 08 00 00 01 00 00 00
Jan 4 08:23:30 machinename kernel: [1330854.323877] blk_update_request: I/O error, dev sda, sector 5857843312
Jan 4 08:23:33 machinename kernel: [1330858.108216] sd 0:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 4 08:23:33 machinename kernel: [1330858.108222] sd 0:0:0:0: [sda] tag#3 Sense Key : Medium Error [current] [descriptor]
Jan 4 08:23:33 machinename kernel: [1330858.108228] sd 0:0:0:0: [sda] tag#3 Add. Sense: Unrecovered read error - auto reallocate failed
Jan 4 08:23:33 machinename kernel: [1330858.108235] sd 0:0:0:0: [sda] tag#3 CDB: Read(16) 88 00 00 00 00 01 5d 27 98 70 00 00 00 08 00 00
Jan 4 08:23:33 machinename kernel: [1330858.108239] blk_update_request: I/O error, dev sda, sector 5857843312
Jan 4 08:23:33 machinename kernel: [1330858.108297] Buffer I/O error on dev sda, logical block 732230414, async page read
Jan 4 08:42:07 machinename smartd[2203]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Jan 4 08:42:07 machinename smartd[2203]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Jan 4 08:42:07 machinename smartd[2203]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 111 to 114
Jan 4 08:42:07 machinename smartd[2203]: Device: /dev/sda [SAT], SMART Usage Attribute: 187 Reported_Uncorrect changed from 100 to 98
Jan 4 08:42:07 machinename smartd[2203]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 47 to 49
Jan 4 08:42:07 machinename smartd[2203]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 53 to 51
Jan 4 08:42:07 machinename smartd[2203]: Device: /dev/sda [SAT], ATA error count increased from 0 to 2
Jan 4 08:42:08 machinename smartd[2203]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Jan 4 08:42:08 machinename smartd[2203]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Jan 4 08:42:08 machinename smartd[2203]: Device: /dev/sda [SAT], ATA error count increased from 0 to 2
* Re: SMART detects pending sectors; take offline?
2018-01-04 10:37 ` Alexander Shenkin
@ 2018-01-04 12:28 ` Alexander Shenkin
2018-01-04 13:16 ` Brad Campbell
0 siblings, 1 reply; 49+ messages in thread
From: Alexander Shenkin @ 2018-01-04 12:28 UTC (permalink / raw)
To: Phil Turmel, Brad Campbell, Reindl Harald, Edward Kuns, Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
>>> On 1/3/2018 3:53 PM, Phil Turmel wrote:
>>>> On 01/03/2018 08:50 AM, Alexander Shenkin wrote:
>>>>> On 1/3/2018 1:26 PM, Brad Campbell wrote:
>>>>
>>>>>> Nope. Your pending is still at 8, so you've got bad sectors in an
>>>>>> area
>>>>>> of the drive that hasn't been dealt with. What is "interesting" is
>>>>>> that your SMART test results don't list the LBA of the first failure.
>>>>>> Disappointing behaviour on the part of the disk. They are within the
>>>>>> 1st 10% of the drive however, so it wouldn't surprise me if they were
>>>>>> in an unused portion of the RAID superblock area.
>>>>>
>>>>> Thanks Brad. So, to theoretically get these sectors remapped so I
>>>>> don't
>>>>> keep getting errors, I would have to somehow try to write to those
>>>>> sectors. That's tough given that the LBA's aren't reported as you
>>>>> mention. Perhaps my best course of action then is to:
>>>>
>>>> No, just use dd to read that device -- it'll bail out with read error
>>>> when it hits the trouble spot, which will report the affected sector.
>>>> Then you can rewrite it with the appropriate seek= value. (Assuming it
>>>> really is in an unused part of the member device.)
Ok, an update. Writing with bs=512 had issues: it was failing on the
read portion of the read-modify-write, so the reallocation failed too. I
think this is because the physical sector size is 4096 bytes, and the
drive needs to read the other 7 512-byte logical sectors if it wants to
write just one 512-byte logical sector. So, I ran:
sudo dd if=/dev/zero of=/dev/sda seek=732230414 count=1 bs=4096
This seems to have worked. In syslog, I just now saw:
Jan 4 12:12:07 machinename smartd[2203]: Device: /dev/sda [SAT], No
more Currently unreadable (pending) sectors, warning condition reset
after 90 emails
Jan 4 12:12:07 machinename smartd[2203]: Device: /dev/sda [SAT], No
more Offline uncorrectable sectors, warning condition reset after 90 emails
Jan 4 12:12:07 machinename smartd[2203]: Device: /dev/sda [SAT], SMART
Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 113 to 117
I'm now running a checkarray and will report back final results, and
whether the SMART warnings return. Thanks all for the help, hope this
marks the end of these issues...
Allie
* Re: SMART detects pending sectors; take offline?
2018-01-04 12:28 ` Alexander Shenkin
@ 2018-01-04 13:16 ` Brad Campbell
2018-01-04 13:39 ` Alexander Shenkin
0 siblings, 1 reply; 49+ messages in thread
From: Brad Campbell @ 2018-01-04 13:16 UTC (permalink / raw)
To: Alexander Shenkin, Phil Turmel, Reindl Harald, Edward Kuns, Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
On 04/01/18 20:28, Alexander Shenkin wrote:
>
>
> Ok, an update. Writing with bs=512 had issues, as it was failing on a
> read, and reallocating was failing. I think this is because the
> physical sector size is 4096b, and it needs to read the other 7 512b
> logical sectors if it wants to write just 1 512b logical sector. So,
> I ran:
>
> sudo dd if=/dev/zero of=/dev/sda seek=732230414 count=1 bs=4096
>
> This seems to have worked. In syslog, I just now saw:
>
> Jan 4 12:12:07 machinename smartd[2203]: Device: /dev/sda [SAT], No
> more Currently unreadable (pending) sectors, warning condition reset
> after 90 emails
> Jan 4 12:12:07 machinename smartd[2203]: Device: /dev/sda [SAT], No
> more Offline uncorrectable sectors, warning condition reset after 90
> emails
> Jan 4 12:12:07 machinename smartd[2203]: Device: /dev/sda [SAT],
> SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 113 to 117
>
> I'm now running a checkarray and will report back final results, and
> whether the SMART warnings return. Thanks all for the help, hope this
> marks the end of these issues...
>
> Allie
>
Damn. I saw your initial mail, but I was out and nowhere near a device I
could use to reply sensibly.
You should *really* check by looking at the mdadm --examine output and
calculating the position of the sectors in question to be absolutely
sure the area you just wrote over was not in the active array area. If
it was then you should stop the checkarray *now* and come back for advice.
Sorry I don't have time right now to elaborate.
Regards,
--
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.
* Re: SMART detects pending sectors; take offline?
2018-01-04 13:16 ` Brad Campbell
@ 2018-01-04 13:39 ` Alexander Shenkin
2018-01-05 5:20 ` Brad Campbell
0 siblings, 1 reply; 49+ messages in thread
From: Alexander Shenkin @ 2018-01-04 13:39 UTC (permalink / raw)
To: Brad Campbell, Phil Turmel, Reindl Harald, Edward Kuns, Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
On 1/4/2018 1:16 PM, Brad Campbell wrote:
>
> Damn. I saw your initial mail, but I was out and nowhere near a device I
> could use to reply sensibly.
> You should *really* check by looking at the mdadm --examine output and
> calculating the position of the sectors in question to be absolutely
> sure the area you just wrote over was not in the active array area. If
> it was then you should stop the checkarray *now* and come back for advice.
Thanks Brad, no worries, really appreciate your attention. I stopped
checkarray. It had one rebuild event (Rebuild99) in /dev/md0 (small
RAID1, where /boot is mounted) before I stopped it. Here's the examine
output (not really sure what to do with it, will wait for advice):
user@machinename:~$ sudo mdadm --examine /dev/sd*
/dev/sda:
MBR Magic : aa55
Partition[0] : 4294967295 sectors at 1 (type ee)
/dev/sda1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 437e4abb:c7ac46f1:ef8b2976:94921060
Name : arrayname:0
Creation Time : Mon Dec 7 08:31:31 2015
Raid Level : raid1
Raid Devices : 4
Avail Dev Size : 3901440 (1905.32 MiB 1997.54 MB)
Array Size : 1950656 (1905.26 MiB 1997.47 MB)
Used Dev Size : 3901312 (1905.26 MiB 1997.47 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 9cb1890b:ad675b3b:7517467f:0780ec8e
Update Time : Thu Jan 4 12:00:45 2018
Checksum : 682930c3 - correct
Events : 215
Device Role : Active device 0
Array State : AAAA ('A' == active, '.' == missing)
mdadm: No md superblock detected on /dev/sda2.
/dev/sda3:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : c7303f62:d848d424:269581c8:83a045ec
Name : ubuntu:2
Creation Time : Sun Feb 5 23:39:58 2017
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 5840377856 (2784.91 GiB 2990.27 GB)
Array Size : 8760566784 (8354.73 GiB 8970.82 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 64c8db38:3d0a6895:be2259d8:4c3c3542
Internal Bitmap : 8 sectors from superblock
Update Time : Thu Jan 4 13:34:53 2018
Checksum : d9775efa - correct
Events : 81186
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
Array State : AAAA ('A' == active, '.' == missing)
mdadm: No md superblock detected on /dev/sda4.
/dev/sdb:
MBR Magic : aa55
Partition[0] : 4294967295 sectors at 1 (type ee)
/dev/sdb1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 437e4abb:c7ac46f1:ef8b2976:94921060
Name : arrayname:0
Creation Time : Mon Dec 7 08:31:31 2015
Raid Level : raid1
Raid Devices : 4
Avail Dev Size : 3901440 (1905.32 MiB 1997.54 MB)
Array Size : 1950656 (1905.26 MiB 1997.47 MB)
Used Dev Size : 3901312 (1905.26 MiB 1997.47 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 38999554:d1b0db8d:d8066e72:86865c31
Update Time : Thu Jan 4 12:00:45 2018
Checksum : 995557b1 - correct
Events : 215
Device Role : Active device 2
Array State : AAAA ('A' == active, '.' == missing)
mdadm: No md superblock detected on /dev/sdb2.
/dev/sdb3:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : c7303f62:d848d424:269581c8:83a045ec
Name : ubuntu:2
Creation Time : Sun Feb 5 23:39:58 2017
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 5840377856 (2784.91 GiB 2990.27 GB)
Array Size : 8760566784 (8354.73 GiB 8970.82 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : clean
Device UUID : cf70dad5:0c9ff5f6:ede689f2:ccee2eb0
Internal Bitmap : 8 sectors from superblock
Update Time : Thu Jan 4 13:34:53 2018
Checksum : 602e12e9 - correct
Events : 81186
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 1
Array State : AAAA ('A' == active, '.' == missing)
mdadm: No md superblock detected on /dev/sdb4.
/dev/sdc:
MBR Magic : aa55
Partition[0] : 4294967295 sectors at 1 (type ee)
/dev/sdc1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 437e4abb:c7ac46f1:ef8b2976:94921060
Name : arrayname:0
Creation Time : Mon Dec 7 08:31:31 2015
Raid Level : raid1
Raid Devices : 4
Avail Dev Size : 3901440 (1905.32 MiB 1997.54 MB)
Array Size : 1950656 (1905.26 MiB 1997.47 MB)
Used Dev Size : 3901312 (1905.26 MiB 1997.47 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : f162eae5:19f8926b:f5bb6a2a:8adbbefd
Update Time : Thu Jan 4 12:00:45 2018
Checksum : 8cb8728a - correct
Events : 215
Device Role : Active device 3
Array State : AAAA ('A' == active, '.' == missing)
mdadm: No md superblock detected on /dev/sdc2.
/dev/sdc3:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : c7303f62:d848d424:269581c8:83a045ec
Name : ubuntu:2
Creation Time : Sun Feb 5 23:39:58 2017
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 5840377856 (2784.91 GiB 2990.27 GB)
Array Size : 8760566784 (8354.73 GiB 8970.82 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : clean
Device UUID : f8839952:eaba2e9c:c2c401d4:3e0592a5
Internal Bitmap : 8 sectors from superblock
Update Time : Thu Jan 4 13:34:53 2018
Checksum : 59013634 - correct
Events : 81186
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 2
Array State : AAAA ('A' == active, '.' == missing)
mdadm: No md superblock detected on /dev/sdc4.
/dev/sdd:
MBR Magic : aa55
Partition[0] : 4294967295 sectors at 1 (type ee)
/dev/sdd1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 437e4abb:c7ac46f1:ef8b2976:94921060
Name : arrayname:0
Creation Time : Mon Dec 7 08:31:31 2015
Raid Level : raid1
Raid Devices : 4
Avail Dev Size : 3901440 (1905.32 MiB 1997.54 MB)
Array Size : 1950656 (1905.26 MiB 1997.47 MB)
Used Dev Size : 3901312 (1905.26 MiB 1997.47 MB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 4823f0b8:8e0d3ac3:e312c219:29f76622
Update Time : Thu Jan 4 12:00:45 2018
Checksum : cb64bae5 - correct
Events : 215
Device Role : Active device 1
Array State : AAAA ('A' == active, '.' == missing)
mdadm: No md superblock detected on /dev/sdd2.
/dev/sdd3:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : c7303f62:d848d424:269581c8:83a045ec
Name : ubuntu:2
Creation Time : Sun Feb 5 23:39:58 2017
Raid Level : raid5
Raid Devices : 4
Avail Dev Size : 5840377856 (2784.91 GiB 2990.27 GB)
Array Size : 8760566784 (8354.73 GiB 8970.82 GB)
Data Offset : 262144 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 875a0dbd:965a9986:1b78eb3d:e15fee50
Internal Bitmap : 8 sectors from superblock
Update Time : Thu Jan 4 13:34:53 2018
Checksum : c325ba6d - correct
Events : 81186
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 3
Array State : AAAA ('A' == active, '.' == missing)
mdadm: No md superblock detected on /dev/sdd4.
Here's mdadm --detail if needed:
user@machinename:~$ sudo mdadm --detail /dev/md2
/dev/md2:
Version : 1.2
Creation Time : Sun Feb 5 23:39:58 2017
Raid Level : raid5
Array Size : 8760566784 (8354.73 GiB 8970.82 GB)
Used Dev Size : 2920188928 (2784.91 GiB 2990.27 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Thu Jan 4 13:31:21 2018
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : ubuntu:2
UUID : c7303f62:d848d424:269581c8:83a045ec
Events : 81186
Number Major Minor RaidDevice State
0 8 3 0 active sync /dev/sda3
4 8 19 1 active sync /dev/sdb3
2 8 35 2 active sync /dev/sdc3
5 8 51 3 active sync /dev/sdd3
user@machinename:~$ sudo mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Mon Dec 7 08:31:31 2015
Raid Level : raid1
Array Size : 1950656 (1905.26 MiB 1997.47 MB)
Used Dev Size : 1950656 (1905.26 MiB 1997.47 MB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Update Time : Thu Jan 4 12:00:45 2018
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Name : arrayname:0
UUID : 437e4abb:c7ac46f1:ef8b2976:94921060
Events : 215
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
5 8 49 1 active sync /dev/sdd1
4 8 17 2 active sync /dev/sdb1
2 8 33 3 active sync /dev/sdc1
* Re: SMART detects pending sectors; take offline?
2018-01-04 13:39 ` Alexander Shenkin
@ 2018-01-05 5:20 ` Brad Campbell
2018-01-05 5:25 ` Brad Campbell
0 siblings, 1 reply; 49+ messages in thread
From: Brad Campbell @ 2018-01-05 5:20 UTC (permalink / raw)
To: Alexander Shenkin, Phil Turmel, Reindl Harald, Edward Kuns, Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
On 04/01/18 21:39, Alexander Shenkin wrote:
> Thanks Brad, no worries, really appreciate your attention. I stopped
> checkarray. It had one rebuild event (Rebuild99) in /dev/md0 (small
> RAID1, where /boot is mounted) before I stopped it. Here's the
> examine output (not really sure what to do with it, will wait for
> advice):
Ok, so you have 4 disks with 2 partitions on each.
You re-wrote Sectors 5857843312+7 on the disk.
Without knowing the layout of your partitions it's a bit difficult, but
let's make an assumption and see where it gets us.
You have a partition table. Let's assume the 1st partition starts at sector
2048, as fdisk will often leave that for alignment.
1st partition data offset is 2048 sectors (1M for superblock) and is
3901312 sectors long, so it ends at 3905408 (3901312+2048+2048)
2nd partition data offset is 262144 sectors and is 5840377856 sectors
long, totaling 5840640000 sectors.
Add those two and we get 5844545408 sectors. So if my maths is any good
you wrote a block 13297904 sectors from the end of the data area.
Now the whole point of that was to say if the block you wrote happens to
fall in a parity area, then you are fine. Checkarray will just
re-calculate the parity from the data blocks and re-write it. Your
mismatch count will be 1 at the end of the operation.
If however the block falls in a data area, running checkarray is going
to use that re-written block to re-calculate the parity and it's corrupt
for good.
Now I need someone to re-check my maths, and an fdisk -l /dev/sda from
you to see if I've made any glaring error. My assessment is that block
*did* lay in the data area of the disk.
If I'm right, then the only way I can see to rectify it is to pop sda
out, zero the superblock and re-add it which will rebuild the disk
entirely but that leaves you extremely vulnerable for the entire
process. Of course if there is nothing on the filesystem at that
location, or you are ok with losing a 4k chunk of a file then this is
all moot.
At this point I'd be most glad to be proven incorrect.
Regards,
Brad
--
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.
* Re: SMART detects pending sectors; take offline?
2018-01-05 5:20 ` Brad Campbell
@ 2018-01-05 5:25 ` Brad Campbell
2018-01-05 10:10 ` Alexander Shenkin
0 siblings, 1 reply; 49+ messages in thread
From: Brad Campbell @ 2018-01-05 5:25 UTC (permalink / raw)
To: Alexander Shenkin, Phil Turmel, Reindl Harald, Edward Kuns, Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
On 05/01/18 13:20, Brad Campbell wrote:
> You re-wrote Sectors 5857843312+7 on the disk.
> Add those two and we get 5844545408 sectors. So if my maths is any good
> you wrote a block 13297904 sectors from the end of the data area.
I can't believe I did that. No, you wrote a block ~6M *after* the data
area and you should be fine.
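The corrected arithmetic can be reproduced directly from the sector counts quoted earlier in the thread (a sketch; all values are in 512-byte sectors, using Brad's assumption that the partitions are contiguous):

```shell
# End of the assumed data area: p1 start + p1 md data offset + p1 used size,
# plus the full span of the raid5 member partition (262144 + 5840377856).
data_end=$((2048 + 2048 + 3901312 + 5840640000))   # 5844545408
written=5857843312                                 # first rewritten sector
echo $((written - data_end))   # 13297904 sectors (~6.3 GiB) *past* the data area
```

A positive difference means the rewritten block lies beyond the array's data area, matching the correction above.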
I'm going to go and write a letter of apology to my primary school maths
teacher now.
--
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.
* Re: SMART detects pending sectors; take offline?
2018-01-05 5:25 ` Brad Campbell
@ 2018-01-05 10:10 ` Alexander Shenkin
2018-01-05 10:32 ` Brad Campbell
2018-01-05 13:50 ` Phil Turmel
0 siblings, 2 replies; 49+ messages in thread
From: Alexander Shenkin @ 2018-01-05 10:10 UTC (permalink / raw)
To: Brad Campbell, Phil Turmel, Reindl Harald, Edward Kuns, Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
On 1/5/2018 5:25 AM, Brad Campbell wrote:
> On 05/01/18 13:20, Brad Campbell wrote:
>
>> You re-wrote Sectors 5857843312+7 on the disk.
>
>> Add those two and we get 5844545408 sectors. So if my maths is any good
>> you wrote a block 13297904 sectors from the end of the data area.
>
> I can't believe I did that. No, you wrote a block ~6M *after* the data
> area and you should be fine.
>
> I'm going to go and write a letter of apology to my primary school maths
> teacher now.
>
>
Thanks much, Brad. fdisk & parted output are below. I have swap space
mounted on /dev/sda4, 15,984,640 sectors long, after the partitions used
for raid. I'm not sure where exactly the parity data sits... Looks to
me like this happened in swap space, no? Currently, swapon reports
552,272 kB (= 1,104,544 sectors) in use (I think). If that's
contiguous, then the write should have happened after the used space
(13,297,904 > 1,104,544). But I'm not sure swap is contiguous. In this
case, regardless, I suspect I should just reboot, and then run
checkarray to be safe?
One followup: is parity info stored in a separate area than data info on
the disk? If the write *had* fallen within the raid partition area,
would you indeed be able to tell if it overwrote data vs parity vs both?
Google wouldn't tell me...
Thanks again,
Allie
user@machinename:~$ sudo fdisk -l /dev/sda*
WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util
fdisk doesn't support GPT. Use GNU Parted.
Disk /dev/sda: 3000.6 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders, total 5860533168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Device Boot Start End Blocks Id System
/dev/sda1 1 4294967295 2147483647+ ee GPT
Partition 1 does not start on physical sector boundary.
Disk /dev/sda1: 1998 MB, 1998585856 bytes
255 heads, 63 sectors/track, 242 cylinders, total 3903488 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/sda1 doesn't contain a valid partition table
Disk /dev/sda2: 1 MB, 1048576 bytes
255 heads, 63 sectors/track, 0 cylinders, total 2048 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/sda2 doesn't contain a valid partition table
Disk /dev/sda3: 2990.4 GB, 2990407680000 bytes
255 heads, 63 sectors/track, 363563 cylinders, total 5840640000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/sda3 doesn't contain a valid partition table
Disk /dev/sda4: 8184 MB, 8184135680 bytes
255 heads, 63 sectors/track, 994 cylinders, total 15984640 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000
Disk /dev/sda4 doesn't contain a valid partition table
user@machinename:~$ sudo parted /dev/sda 'unit s print'
Model: ATA ST3000DM001-9YN1 (scsi)
Disk /dev/sda: 5860533168s
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Number Start End Size File system Name
Flags
1 2048s 3905535s 3903488s boot
raid
2 3905536s 3907583s 2048s grubbios
bios_grub
3 3907584s 5844547583s 5840640000s ext4 main
raid
4 5844547584s 5860532223s 15984640s linux-swap(v1) swap
user@machinename:~$ swapon --summary
Filename Type Size Used
Priority
/dev/sda4 partition 7992316 552272 -1
/dev/sdb4 partition 7992316 0 -2
/dev/sdc4 partition 7992316 0 -3
* Re: SMART detects pending sectors; take offline?
2018-01-05 10:10 ` Alexander Shenkin
@ 2018-01-05 10:32 ` Brad Campbell
2018-01-05 13:50 ` Phil Turmel
1 sibling, 0 replies; 49+ messages in thread
From: Brad Campbell @ 2018-01-05 10:32 UTC (permalink / raw)
To: Alexander Shenkin
Cc: Phil Turmel, Reindl Harald, Edward Kuns, Mark Knecht, Wols Lists,
Carsten Aulbert, Linux-RAID
Yep, it's in swap. Check the sector you wrote against the sector ranges listed in the parted print. All good to go.
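That check amounts to testing whether the rewritten sector range falls inside the sda4 bounds from the parted print (a sketch using the values quoted below):

```shell
# sda4 (swap) spans sectors 5844547584..5860532223 per parted; the rewrite
# covered eight 512-byte sectors starting at 5857843312.
written=5857843312
if [ "$written" -ge 5844547584 ] && [ $((written + 7)) -le 5860532223 ]; then
    echo "rewrite landed inside the swap partition"
fi
```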
> On 5 Jan 2018, at 6:10 PM, Alexander Shenkin <al@shenkin.org> wrote:
>
>> On 1/5/2018 5:25 AM, Brad Campbell wrote:
>>> On 05/01/18 13:20, Brad Campbell wrote:
>>> You re-wrote Sectors 5857843312+7 on the disk.
>>> Add those two and we get 5844545408 sectors. So if my maths is any good
>>> you wrote a block 13297904 sectors from the end of the data area.
>> I can't believe I did that. No, you wrote a block ~6M *after* the data area and you should be fine.
>> I'm going to go and write a letter of apology to my primary school maths teacher now.
>
> Thanks much, Brad. fdisk & parted output are below. I have swap space mounted on /dev/sda4, 15,984,640 sectors long, after the partitions used for raid. I'm not sure where exactly the parity data sits... Looks to me like this happened in swap space, no? Currently, swapon reports 552,272 kb (= 1,104,544 sectors) in use (i think). If that's contiguous, then the write should have happened after the used space (13,297,904 > 1,104,544). But I'm not sure swap is contiguous. In this case, regardless, I suspect I should just reboot, and then run checkarray to be safe?
>
> One followup: is parity info stored in a separate area than data info on the disk? If the write *had* fallen within the raid partition area, would you indeed be able to tell if it overwrote data vs parity vs both? Google wouldn't tell me...
>
> Thanks again,
> Allie
* Re: SMART detects pending sectors; take offline?
2018-01-05 10:10 ` Alexander Shenkin
2018-01-05 10:32 ` Brad Campbell
@ 2018-01-05 13:50 ` Phil Turmel
2018-01-05 14:01 ` Alexander Shenkin
2018-01-05 15:59 ` Wols Lists
1 sibling, 2 replies; 49+ messages in thread
From: Phil Turmel @ 2018-01-05 13:50 UTC (permalink / raw)
To: Alexander Shenkin, Brad Campbell, Reindl Harald, Edward Kuns,
Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
Hi Alex,
On 01/05/2018 05:10 AM, Alexander Shenkin wrote:
> On 1/5/2018 5:25 AM, Brad Campbell wrote:
>> I'm going to go and write a letter of apology to my primary school
>> maths teacher now.
> Thanks much, Brad. fdisk & parted output are below. I have swap space
> mounted on /dev/sda4, 15,984,640 sectors long, after the partitions used
> for raid. I'm not sure where exactly the parity data sits... Looks to
> me like this happened in swap space, no? Currently, swapon reports
> 552,272 kb (= 1,104,544 sectors) in use (i think). If that's
> contiguous, then the write should have happened after the used space
> (13,297,904 > 1,104,544). But I'm not sure swap is contiguous. In this
> case, regardless, I suspect I should just reboot, and then run
> checkarray to be safe?
The output of fdisk is invalid on your system, see the warning it
printed. Use gdisk or parted instead. Don't use '*'.
> One followup: is parity info stored in a separate area than data info on
> the disk? If the write *had* fallen within the raid partition area,
> would you indeed be able to tell if it overwrote data vs parity vs both?
> Google wouldn't tell me...
No. Parity is interleaved with data on all devices, chunk by chunk, on
all default raid5/6 layouts. In raid4, the last device is all of the
parity. There are optional layouts for raid5 that do the same, and
variants for raid6 that place various combinations of parity and
syndrome at either end. See the --layout option in the mdadm man page.
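For left-symmetric (the default, and the layout shown in the --examine output in this thread), the parity rotation can be sketched as follows; this is an illustration of the rotation rule, not mdadm output:

```shell
# Left-symmetric raid5 on n devices: stripe s keeps its parity chunk on
# device (n - 1 - s mod n), so parity rotates across all members rather
# than living on one dedicated disk (as it would in raid4).
n=4
for s in 0 1 2 3; do
    echo "stripe $s: parity on device $((n - 1 - s % n))"
done
```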
The non-data area of member devices contains at least the superblock,
and optionally a write-intent bitmap and/or a bad-block list. Most of
the non-data space is reserved for optimizing future --grow operations.
> user@machinename:~$ sudo fdisk -l /dev/sda*
>
> WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util
> fdisk doesn't support GPT. Use GNU Parted.
All of the partition data following this warning is bogus -- it is the
"protective" MBR record.
Phil
* Re: SMART detects pending sectors; take offline?
2018-01-05 13:50 ` Phil Turmel
@ 2018-01-05 14:01 ` Alexander Shenkin
2018-01-05 15:59 ` Wols Lists
1 sibling, 0 replies; 49+ messages in thread
From: Alexander Shenkin @ 2018-01-05 14:01 UTC (permalink / raw)
To: Phil Turmel, Brad Campbell, Reindl Harald, Edward Kuns, Mark Knecht
Cc: Wols Lists, Carsten Aulbert, Linux-RAID
On 1/5/2018 1:50 PM, Phil Turmel wrote:
> No. Parity is interleaved with data on all devices, chunk by chunk, on
> all default raid5/6 layouts.
Thanks Phil. So, I suppose then the final diagnosis of my issue was
that the bad sector occurred in a non-raid portion of the drive, which
is why it was never read from or written to, and hence never corrected.
BTW, my reallocated sector count remains at 0 on that drive, but
whatever - maybe that's because there was no data to reallocate (though
I would think it would physically reallocate some sectors regardless).
Also, I keep getting Rebuild99 events on /dev/md0. But, I think that,
when I have time, I'll add a drive for raid6, fail out sda, write it
with 0's, update its firmware (apparently Seagate released one for
these drives), and re-add it.
Thanks,
allie
* Re: SMART detects pending sectors; take offline?
2018-01-05 13:50 ` Phil Turmel
2018-01-05 14:01 ` Alexander Shenkin
@ 2018-01-05 15:59 ` Wols Lists
1 sibling, 0 replies; 49+ messages in thread
From: Wols Lists @ 2018-01-05 15:59 UTC (permalink / raw)
To: Phil Turmel, Alexander Shenkin, Brad Campbell, Reindl Harald,
Edward Kuns, Mark Knecht
Cc: Carsten Aulbert, Linux-RAID
On 05/01/18 13:50, Phil Turmel wrote:
> The output of fdisk is invalid on your system, see the warning it
> printed. Use gdisk or parted instead. Don't use '*'.
It used to be so simple - fdisk for MBR disks, gdisk et al for GPT.
ashdown src # fdisk /dev/sda
Welcome to fdisk (util-linux 2.26.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Command (m for help): p
Disk /dev/sda: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 47915407-BA7E-4869-8D3E-3CB44F5FDA12
Device Start End Sectors Size Type
/dev/sda1 2048 1050623 1048576 512M BIOS boot
/dev/sda2 1312768 68421631 67108864 32G Linux swap
/dev/sda3 68683776 270010367 201326592 96G Linux filesystem
/dev/sda4 270272512 471599103 201326592 96G Linux filesystem
/dev/sda5 471861248 5860533134 5388671887 2.5T Linux filesystem
Command (m for help): q
ashdown src #
Modern fdisk now supports GPT. So yes, in this case the warning is
correct, but we should bear in mind that people will be using fdisk on
GPT disks, and that's fine now. I don't know when this changed, but I've
only very recently noticed it.
Cheers,
Wol
^ permalink raw reply [flat|nested] 49+ messages in thread
end of thread, other threads:[~2018-01-05 15:59 UTC | newest]
Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-07 7:48 SMART detects pending sectors; take offline? Alexander Shenkin
2017-10-07 8:21 ` Carsten Aulbert
2017-10-07 10:05 ` Alexander Shenkin
2017-10-07 17:29 ` Wols Lists
2017-10-08 9:19 ` Alexander Shenkin
2017-10-08 9:49 ` Wols Lists
2017-10-09 20:16 ` Phil Turmel
2017-10-10 9:00 ` Alexander Shenkin
2017-10-10 9:11 ` Reindl Harald
2017-10-10 9:56 ` Alexander Shenkin
2017-10-10 12:55 ` Phil Turmel
2017-10-11 10:31 ` Alexander Shenkin
2017-10-11 17:10 ` Phil Turmel
2017-10-12 9:50 ` Alexander Shenkin
2017-10-12 11:01 ` Wols Lists
2017-10-12 13:04 ` Phil Turmel
2017-10-12 13:16 ` Alexander Shenkin
2017-10-12 13:21 ` Mark Knecht
2017-10-12 15:16 ` Edward Kuns
2017-10-12 15:52 ` Edward Kuns
2017-10-15 14:41 ` Alexander Shenkin
2017-12-18 15:51 ` Alexander Shenkin
2017-12-18 16:09 ` Phil Turmel
2017-12-19 10:35 ` Alexander Shenkin
2017-12-19 12:02 ` Phil Turmel
2017-12-21 11:28 ` Alexander Shenkin
2017-12-21 11:38 ` Reindl Harald
2017-12-23 3:14 ` Brad Campbell
2018-01-03 12:44 ` Alexander Shenkin
2018-01-03 13:26 ` Brad Campbell
2018-01-03 13:50 ` Alexander Shenkin
2018-01-03 15:53 ` Phil Turmel
2018-01-03 15:59 ` Alexander Shenkin
2018-01-03 16:02 ` Phil Turmel
2018-01-04 10:37 ` Alexander Shenkin
2018-01-04 12:28 ` Alexander Shenkin
2018-01-04 13:16 ` Brad Campbell
2018-01-04 13:39 ` Alexander Shenkin
2018-01-05 5:20 ` Brad Campbell
2018-01-05 5:25 ` Brad Campbell
2018-01-05 10:10 ` Alexander Shenkin
2018-01-05 10:32 ` Brad Campbell
2018-01-05 13:50 ` Phil Turmel
2018-01-05 14:01 ` Alexander Shenkin
2018-01-05 15:59 ` Wols Lists
2017-10-12 15:19 ` Kai Stian Olstad
2017-10-10 22:23 ` josh
2017-10-11 6:23 ` Alexander Shenkin
2017-10-10 9:21 ` Wols Lists