* SMART detects pending sectors; take offline? @ 2017-10-07 7:48 Alexander Shenkin 2017-10-07 8:21 ` Carsten Aulbert 0 siblings, 1 reply; 49+ messages in thread From: Alexander Shenkin @ 2017-10-07 7:48 UTC (permalink / raw) To: linux-raid Hi all, My SMART monitoring has picked up some pending sectors on one of my RAID0 + RAID5 drives (it's one of the infamous 3TB seagate drives... my other 3 failed earlier... this is the last of them, that finally has gone as well...). I've just ordered a replacement (Toshiba P300) that will arrive tomorrow... but the question is, what to do in the meantime? Should I take the drive offline? I suspect so, but would like to double check before taking action. Thanks in advance for any advice. Here are the errors: The following warning/error was logged by the smartd daemon: Device: /dev/sda [SAT], Self-Test Log error count increased from 0 to 1 Device info: ST3000DM001-9YN166, S/N:Z1F13FBA, WWN:5-000c50-04e444ab1, FW:CC4B, 3.00 TB ------------------ The following warning/error was logged by the smartd daemon: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Device info: ST3000DM001-9YN166, S/N:Z1F13FBA, WWN:5-000c50-04e444ab1, FW:CC4B, 3.00 TB ----------------- The following warning/error was logged by the smartd daemon: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Device info: ST3000DM001-9YN166, S/N:Z1F13FBA, WWN:5-000c50-04e444ab1, FW:CC4B, 3.00 TB Thanks, Allie ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-07 7:48 SMART detects pending sectors; take offline? Alexander Shenkin @ 2017-10-07 8:21 ` Carsten Aulbert 2017-10-07 10:05 ` Alexander Shenkin 2017-10-09 20:16 ` Phil Turmel 0 siblings, 2 replies; 49+ messages in thread From: Carsten Aulbert @ 2017-10-07 8:21 UTC (permalink / raw) To: Alexander Shenkin, linux-raid Hi On 10/07/17 09:48, Alexander Shenkin wrote: > My SMART monitoring has picked up some pending sectors on one of my > RAID0 + RAID5 drives (it's one of the infamous 3TB seagate drives... my > other 3 failed earlier... this is the last of them, that finally has > gone as well...). I've just ordered a replacement (Toshiba P300) that > will arrive tomorrow... but the question is, what to do in the meantime? > Should I take the drive offline? I suspect so, but would like to > double check before taking action. Thanks in advance for any advice. Given this is "only" a single sector error I would keep it running as long as you can physically install the new drive and only then take it offline. At least theoretically, it may be possible to force the rewrite of this sector and use the spare sectors of the disk, but I'm not 100% sure if a simple md check would already trigger it - usually you need to write "new" data to defective sectors to force the drive's firmware to use the spare sectors. But given the replacement disk should arrive soon, I would not act before that and run with a degraded RAID5 until then. I'm a bit more worried about the RAID0 here, do you run RAID0 on top of RAID5 or what is the exact set-up? cheers Carsten ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-07 8:21 ` Carsten Aulbert @ 2017-10-07 10:05 ` Alexander Shenkin 2017-10-07 17:29 ` Wols Lists 2017-10-09 20:16 ` Phil Turmel 1 sibling, 1 reply; 49+ messages in thread From: Alexander Shenkin @ 2017-10-07 10:05 UTC (permalink / raw) To: Carsten Aulbert, linux-raid Thanks Carsten, I was mistaken, it's a RAID1, not RAID0. I have /boot mounted on a RAID0, and / mounted on RAID5. They both split across 4 drives. Appreciate the advice - i'll just keep it running until the drive arrives tomorrow... Thanks, Allie On 10/7/2017 9:21 AM, Carsten Aulbert wrote: > Hi > > On 10/07/17 09:48, Alexander Shenkin wrote: >> My SMART monitoring has picked up some pending sectors on one of my >> RAID0 + RAID5 drives (it's one of the infamous 3TB seagate drives... my >> other 3 failed earlier... this is the last of them, that finally has >> gone as well...). I've just ordered a replacement (Toshiba P300) that >> will arrive tomorrow... but the question is, what to do in the meantime? >> Should I take the drive offline? I suspect so, but would like to >> double check before taking action. Thanks in advance for any advice. > > Given this is "only" a single sector error I would keep it running as > long as you can physically install the new drive and only then take it > offline. > > At least theoretically, it may be possible to force the rewrite of this > sector and use the spare sectors of the disk, but I'm not 100% sure if a > simple md check would already trigger it - usually you need to write > "new" data to defective sectors to force the drive's firmware to use the > spare sectors. > > But given the replacement disk should arrive soon, I would not act > before that and run with a degraded RAID5 until then. > > I'm a bit more worried about the RAID0 here, do you run RAID0 on top of > RAID5 or what is the exact set-up? > > cheers > > Carsten > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-07 10:05 ` Alexander Shenkin @ 2017-10-07 17:29 ` Wols Lists 2017-10-08 9:19 ` Alexander Shenkin 0 siblings, 1 reply; 49+ messages in thread From: Wols Lists @ 2017-10-07 17:29 UTC (permalink / raw) To: Alexander Shenkin, Carsten Aulbert, linux-raid On 07/10/17 11:05, Alexander Shenkin wrote: > Thanks Carsten, > > I was mistaken, it's a RAID1, not RAID0. I have /boot mounted on a > RAID0, and / mounted on RAID5. They both split across 4 drives. How big is each partition that makes up /boot? If that's raid0, surely that's not wise? A single disk failure will render the machine unbootable. Surely that should be raid1, so you can boot off any disk. > > Appreciate the advice - i'll just keep it running until the drive > arrives tomorrow... I'd keep it running ... > > Thanks, > Allie > > On 10/7/2017 9:21 AM, Carsten Aulbert wrote: >> Hi >> >> >> Given this is "only" a single sector error I would keep it running as >> long as you can physically install the new drive and only then take it >> offline. >> >> At least theoretically, it may be possible to force the rewrite of this >> sector and use the spare sectors of the disk, but I'm not 100% sure if a >> simple md check would already trigger it - usually you need to write >> "new" data to defective sectors to force the drive's firmware to use the >> spare sectors. >> How serious is a "pending sector"? I think doing a scrub will fix it. If it's not serious I'd look at using the extra drive to convert it to raid6. I doubt the infamous 3TB drives were a "bad batch", but given the press they got I would have expected Seagate to fix the problem. If these drives are newer than the ones that got the bad press, they might be fine. There's always the argument "do you ditch a disk on the first error, or do you wait until it's definitely dying". But iirc a "pending sector" is just one of those things that happens every now and then. 
If this goes away with a scrub, and you don't get a batch of new ones, then the drive is probably fine (until the next *random* problem shows up). Cheers, Wol ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-07 17:29 ` Wols Lists @ 2017-10-08 9:19 ` Alexander Shenkin 2017-10-08 9:49 ` Wols Lists 0 siblings, 1 reply; 49+ messages in thread From: Alexander Shenkin @ 2017-10-08 9:19 UTC (permalink / raw) To: Wols Lists, Carsten Aulbert, linux-raid Thanks Wol; that's *2* mistakes I've made; ugh. /boot is on RAID1. I have no RAID0's in the machine. These 3TB seagates are old ones, and were sitting in their original boxes for a few years unused. 3 others have already died, and one during a rebuild, causing tons of grief. So, random or not, this one is going in the trash. Seagate should refund me, but they never will. Thanks, Allie On 10/7/2017 6:29 PM, Wols Lists wrote: > On 07/10/17 11:05, Alexander Shenkin wrote: >> Thanks Carsten, >> >> I was mistaken, it's a RAID1, not RAID0. I have /boot mounted on a >> RAID0, and / mounted on RAID5. They both split across 4 drives. > > How big is each partition that makes up /boot? If that's raid0, surely > that's not wise? A single disk failure will render the machine > unbootable. Surely that should be raid1, so you can boot off any disk. >> >> Appreciate the advice - i'll just keep it running until the drive >> arrives tomorrow... > > I'd keep it running ... >> >> Thanks, >> Allie >> >> On 10/7/2017 9:21 AM, Carsten Aulbert wrote: >>> Hi >>> > >>> >>> Given this is "only" a single sector error I would keep it running as >>> long as you can physically install the new drive and only then take it >>> offline. >>> >>> At least theoretically, it may be possible to force the rewrite of this >>> sector and use the spare sectors of the disk, but I'm not 100% sure if a >>> simple md check would already trigger it - usually you need to write >>> "new" data to defective sectors to force the drive's firmware to use the >>> spare sectors. >>> > How serious is a "pending sector"? I think doing a scrub will fix it. 
> > If it's not serious I'd look at using the extra drive to convert it to > raid6. I doubt the infamous 3TB drives were a "bad batch", but given the > press they got I would have expected Seagate to fix the problem. If > these drives are newer than the ones that got the bad press, they might > be fine. > > There's always the argument "do you ditch a disk on the first error, or > do you wait until it's definitely dying". But iirc a "pending sector" is > just one of those things that happens every now and then. If this goes > away with a scrub, and you don't get a batch of new ones, then the drive > is probably fine (until the next *random* problem shows up). > > Cheers, > Wol > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-08 9:19 ` Alexander Shenkin @ 2017-10-08 9:49 ` Wols Lists 0 siblings, 0 replies; 49+ messages in thread From: Wols Lists @ 2017-10-08 9:49 UTC (permalink / raw) To: Alexander Shenkin, linux-raid On 08/10/17 10:19, Alexander Shenkin wrote: > Thanks Wol; that's *2* mistakes I've made; ugh. /boot is on RAID1. I > have no RAID0's in the machine. :-) > > These 3TB seagates are old ones, and were sitting in their original > boxes for a few years unused. 3 others have already died, and one > during a rebuild, causing tons of grief. So, random or not, this one is > going in the trash. Seagate should refund me, but they never will. > My machine has two 3TB Barracudas - raid-1. So far (touch wood) they've been reliable enough. As soon as I can afford it (£700, money I haven't got :-) I'm going to build a new machine - lvm/qemu on raid-1, then linux, windows, whatever on top of that on raid-6. The problem, of course, is I can't find much info on actually setting up a machine with a minimal virtual-machine install then a bunch of vm's on top. I guess it's all out there, but it's technical docu, not guides and howtos. So I guess I'll be documenting it all :-) And maybe trying to get a linux.org wiki to put it up on :-) Cheers, Wol ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-07 8:21 ` Carsten Aulbert 2017-10-07 10:05 ` Alexander Shenkin @ 2017-10-09 20:16 ` Phil Turmel 2017-10-10 9:00 ` Alexander Shenkin 1 sibling, 1 reply; 49+ messages in thread From: Phil Turmel @ 2017-10-09 20:16 UTC (permalink / raw) To: Carsten Aulbert, Alexander Shenkin, linux-raid On 10/07/2017 04:21 AM, Carsten Aulbert wrote: > Hi > > On 10/07/17 09:48, Alexander Shenkin wrote: >> My SMART monitoring has picked up some pending sectors on one of my >> RAID0 + RAID5 drives (it's one of the infamous 3TB seagate drives... my >> other 3 failed earlier... this is the last of them, that finally has >> gone as well...). I've just ordered a replacement (Toshiba P300) that >> will arrive tomorrow... but the question is, what to do in the meantime? >> Should I take the drive offline? I suspect so, but would like to >> double check before taking action. Thanks in advance for any advice. > > Given this is "only" a single sector error I would keep it running as > long as you can physically install the new drive and only then take it > offline. > > At least theoretically, it may be possible to force the rewrite of this > sector and use the spare sectors of the disk, but I'm not 100% sure if a > simple md check would already trigger it - usually you need to write > "new" data to defective sectors to force the drive's firmware to use the > spare sectors. > > But given the replacement disk should arrive soon, I would not act > before that and run with a degraded RAID5 until then. > > I'm a bit more worried about the RAID0 here, do you run RAID0 on top of > RAID5 or what is the exact set-up? So, no regular "check" scrubs. Check scrubs fix pending sectors by writing back to such sectors when the error is hit. As long as there is redundancy to obtain the data from, and the drive in question actually returns a read error. 
Since this is a desktop drive that is known to not have SCTERC support, you *must* reset your driver timeouts to 180 seconds for a check scrub to succeed. You will also have to do so with your P300 drive, as Toshiba's website says that drive is not NAS optimized. Please read up on "timeout mismatch" before your array blows up. ^ permalink raw reply [flat|nested] 49+ messages in thread
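The timeout-mismatch fix Phil describes is usually applied from a boot-time script along these lines. This is a sketch, not anything from the thread: the device names are examples, and 70 (7.0 seconds) is the commonly used SCT ERC value, chosen so the drive gives up on a bad sector well before the kernel's default 30-second SCSI timeout.

```shell
#!/bin/sh
# Sketch of a boot-time timeout-mismatch fix (device names are examples).
# Drives that accept SCT ERC get a 7-second error-recovery cap; drives
# that reject it (most desktop models) get a 180-second kernel-side
# timeout instead, so md sees a read error rather than a link reset.
set_erc_or_timeout() {
    dev="$1"
    [ -b "$dev" ] || return 0            # skip devices that aren't present
    if smartctl -l scterc,70,70 "$dev" >/dev/null 2>&1; then
        echo "$dev: SCT ERC capped at 7.0 s"
    else
        # No SCT ERC support: let the kernel wait out the drive's own
        # (much longer) internal retries instead of resetting the link.
        echo 180 > "/sys/block/$(basename "$dev")/device/timeout"
        echo "$dev: kernel timeout raised to 180 s"
    fi
}

for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    set_erc_or_timeout "$d"
done
```

Note the sysfs timeout is not persistent across reboots, which is why this belongs in a boot script (or a udev rule) rather than being run once.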
* Re: SMART detects pending sectors; take offline? 2017-10-09 20:16 ` Phil Turmel @ 2017-10-10 9:00 ` Alexander Shenkin 2017-10-10 9:11 ` Reindl Harald 2017-10-10 9:21 ` Wols Lists 0 siblings, 2 replies; 49+ messages in thread From: Alexander Shenkin @ 2017-10-10 9:00 UTC (permalink / raw) To: Phil Turmel, Carsten Aulbert, linux-raid On 10/9/2017 9:16 PM, Phil Turmel wrote: > On 10/07/2017 04:21 AM, Carsten Aulbert wrote: >> Hi >> >> On 10/07/17 09:48, Alexander Shenkin wrote: >>> My SMART monitoring has picked up some pending sectors on one of my >>> RAID0 + RAID5 drives (it's one of the infamous 3TB seagate drives... my >>> other 3 failed earlier... this is the last of them, that finally has >>> gone as well...). I've just ordered a replacement (Toshiba P300) that >>> will arrive tomorrow... but the question is, what to do in the meantime? >>> Should I take the drive offline? I suspect so, but would like to >>> double check before taking action. Thanks in advance for any advice. >> >> Given this is "only" a single sector error I would keep it running as >> long as you can physically install the new drive and only then take it >> offline. >> >> At least theoretically, it may be possible to force the rewrite of this >> sector and use the spare sectors of the disk, but I'm not 100% sure if a >> simple md check would already trigger it - usually you need to write >> "new" data to defective sectors to force the drive's firmware to use the >> spare sectors. >> >> But given the replacement disk should arrive soon, I would not act >> before that and run with a degraded RAID5 until then. >> >> I'm a bit more worried about the RAID0 here, do you run RAID0 on top of >> RAID5 or what is the exact set-up? > > So, no regular "check" scrubs. Check scrubs fix pending sectors by > writing back to such sectors when the error is hit. As long as there is > redundancy to obtain the data from, and the drive in question actually > returns a read error. Thanks... 
I know nothing about "check scrubs". Could you point me to a good resource? I've found https://raid.wiki.kernel.org/index.php/Scrubbing and https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives, but it's hard to tell exactly how the system should be configured in order to run these regularly. A weekly cron perhaps? And, should it be just check, or repair? etc... Any help you could offer would be welcome. Is this something I should run now? I figure it's a bad idea to push an array that is starting to degrade... haven't had a chance to replace the drive yet, but will get to it this week. Probably best to start the scrubbing routines once I have 4 good drives in there I figure... > Since this is a desktop drive that is known to not have SCTERC support, > you *must* reset your driver timeouts to 180 seconds for a check scrub > to succeed. You will also have to do so with your P300 drive, as > Toshiba's website says that drive is not NAS optimized. > > Please read up on "timeout mismatch" before your array blows up. I have timeouts set on all drives when the system boots, and the same script turns on the P300s' SCTERC. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-10 9:00 ` Alexander Shenkin @ 2017-10-10 9:11 ` Reindl Harald 2017-10-10 9:56 ` Alexander Shenkin 2017-10-10 9:21 ` Wols Lists 1 sibling, 1 reply; 49+ messages in thread From: Reindl Harald @ 2017-10-10 9:11 UTC (permalink / raw) To: Alexander Shenkin, Phil Turmel, Carsten Aulbert, linux-raid Am 10.10.2017 um 11:00 schrieb Alexander Shenkin: > Thanks... I know nothing about "check scrubs". Could you point me to a > good resource? I've found > https://raid.wiki.kernel.org/index.php/Scrubbing and > https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives, but it's > hard to tell exactly how the system should be configured in order to run > these regularly. A weekly cron perhaps? And, should it be just check, > or repair? etc... Any help you could offer would be welcome. if your distribution don't install a cronjob for that you should blame them because RAID without regular scrub is asking for troubles [root@srv-rhsoft:~]$ rpm -q --file /etc/cron.d/raid-check mdadm-4.0-1.fc26.x86_64 [root@srv-rhsoft:~]$ cat /etc/cron.d/raid-check 30 4 * * Mon root /usr/sbin/raid-check > Is this something I should run now? I figure it's a bad idea to push an > array that is starting to degrade... haven't had a chance to replace the > drive yet, but will get to it this week. Probably best to start the > scrubbing routines once I have 4 good drives in there I figure... NO - never put any load you can avoid on degraded arrays ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-10 9:11 ` Reindl Harald @ 2017-10-10 9:56 ` Alexander Shenkin 2017-10-10 12:55 ` Phil Turmel 2017-10-10 22:23 ` josh 0 siblings, 2 replies; 49+ messages in thread From: Alexander Shenkin @ 2017-10-10 9:56 UTC (permalink / raw) To: Reindl Harald, Phil Turmel, Carsten Aulbert, linux-raid On 10/10/2017 10:11 AM, Reindl Harald wrote: > > > Am 10.10.2017 um 11:00 schrieb Alexander Shenkin: >> Thanks... I know nothing about "check scrubs". Could you point me to >> a good resource? I've found >> https://raid.wiki.kernel.org/index.php/Scrubbing and >> https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives, but it's >> hard to tell exactly how the system should be configured in order to >> run these regularly. A weekly cron perhaps? And, should it be just >> check, or repair? etc... Any help you could offer would be welcome. > > if your distribution don't install a cronjob for that you should blame > them because RAID without regular scrub is asking for troubles > > [root@srv-rhsoft:~]$ rpm -q --file /etc/cron.d/raid-check > mdadm-4.0-1.fc26.x86_64 > > [root@srv-rhsoft:~]$ cat /etc/cron.d/raid-check > 30 4 * * Mon root /usr/sbin/raid-check Thanks Reindl. Here's what I have installed (no evidence of raid-check available on my system): $ cat /etc/cron.d/mdadm 57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi >> Is this something I should run now? I figure it's a bad idea to push >> an array that is starting to degrade... haven't had a chance to >> replace the drive yet, but will get to it this week. Probably best to >> start the scrubbing routines once I have 4 good drives in there I >> figure... > > NO - never put any load you can avoid on degraded arrays thanks, i won't. ^ permalink raw reply [flat|nested] 49+ messages in thread
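For reference, the same "check" action that the cron jobs above schedule can be started by hand through md's sysfs interface. A sketch, assuming an array name like md2 (adjust for your system), which also shows where to watch progress and results:

```shell
#!/bin/sh
# Sketch: start a "check" scrub on one array by hand and report on it.
# The array name is an example; substitute your own.
run_check() {
    md="$1"
    node="/sys/block/$md/md/sync_action"
    if [ -w "$node" ]; then
        echo check > "$node"                  # same action the cron jobs request
        cat /proc/mdstat                      # shows progress of the running check
        cat "/sys/block/$md/md/mismatch_cnt"  # non-zero after a check means
                                              # parity disagreed somewhere
    else
        echo "$md: no writable md sysfs node, skipping"
    fi
}

run_check md2
```

Per Reindl's warning above, this should only be done on a non-degraded array.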
* Re: SMART detects pending sectors; take offline? 2017-10-10 9:56 ` Alexander Shenkin @ 2017-10-10 12:55 ` Phil Turmel 2017-10-11 10:31 ` Alexander Shenkin 2017-10-10 22:23 ` josh 1 sibling, 1 reply; 49+ messages in thread From: Phil Turmel @ 2017-10-10 12:55 UTC (permalink / raw) To: Alexander Shenkin, Reindl Harald, Carsten Aulbert, linux-raid Hi Alex, On 10/10/2017 05:56 AM, Alexander Shenkin wrote: > Thanks Reindl. Here's what I have installed (no evidence of raid-check > available on my system): > > $ cat /etc/cron.d/mdadm > 57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) > -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi > This is *really* good news. If this has been running once a month as shown, and you've already noted that you are dealing properly with timeout mismatch, then your arrays are in fact reasonably scrubbed. Which means the pending sector found by a smartctl background scan is likely in a non-array data area. And if not, the next scrub will fix it. You can run checkarray yourself if you don't want to wait. >>> Is this something I should run now? I figure it's a bad idea to push >>> an array that is starting to degrade... haven't had a chance to >>> replace the drive yet, but will get to it this week. Probably best >>> to start the scrubbing routines once I have 4 good drives in there I >>> figure... >> >> NO - never put any load you can avoid on degraded arrays > > thanks, i won't. If I read your OP correctly, your array is *not* degraded -- it just has a pending URE on one drive. You are still redundant, and your system is scrubbing once a month. FWIW, I don't replace drives just for pending sectors -- they are expected occasionally per drive specs. So long as check scrubs regularly complete, I replace drives when actual relocations hit double digits. You are fine. Your array is fine. 
Long timeouts can cause application timeouts and user freak-outs, so your Seagates are less than ideal, but your system is *fine*. Consider using the new drive to convert to raid6. If you have other reasons to stay with raid5, then add it as a spare, then use mdadm's --replace operation to swap out the drive with the pending sector. Phil ^ permalink raw reply [flat|nested] 49+ messages in thread
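Phil's two options look roughly like this with mdadm. This is a sketch only: the array name (/dev/md2) and partition names are invented for illustration, and the DRYRUN guard keeps it from touching anything until you have checked the names against your own system.

```shell
#!/bin/sh
# Sketch of the two options above.  /dev/md2 and the partition names are
# examples, not taken from the thread.  DRYRUN=1 only prints the
# commands; clear it on a real system after double-checking device names.
DRYRUN=${DRYRUN:-1}
run() {
    if [ "$DRYRUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

# Option 1: add the new drive, then reshape the 4-drive RAID5 to RAID6.
run mdadm --add /dev/md2 /dev/sde1
run mdadm --grow /dev/md2 --level=6 --raid-devices=5 \
    --backup-file=/root/md2-grow.backup

# Option 2 (instead of option 1): stay RAID5 and hot-replace the suspect
# drive.  --replace copies while the old drive is still in the array, so
# redundancy is never lost, unlike a --fail/--remove/--add cycle.
#   run mdadm --add /dev/md2 /dev/sde1
#   run mdadm /dev/md2 --replace /dev/sda3 --with /dev/sde1
```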
* Re: SMART detects pending sectors; take offline? 2017-10-10 12:55 ` Phil Turmel @ 2017-10-11 10:31 ` Alexander Shenkin 2017-10-11 17:10 ` Phil Turmel 0 siblings, 1 reply; 49+ messages in thread From: Alexander Shenkin @ 2017-10-11 10:31 UTC (permalink / raw) To: Phil Turmel, Reindl Harald, Carsten Aulbert, linux-raid On 10/10/2017 1:55 PM, Phil Turmel wrote: > Hi Alex, > > On 10/10/2017 05:56 AM, Alexander Shenkin wrote: > >> Thanks Reindl. Here's what I have installed (no evidence of raid-check >> available on my system): >> >> $ cat /etc/cron.d/mdadm >> 57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) >> -le 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi >> > > This is *really* good news. If this has been running once a month as > shown, and you've already noted that you are dealing properly with > timeout mismatch, then your arrays are in fact reasonably scrubbed. > > Which means the pending sector found by a smartctl background scan is > likely in a non-array data area. And if not, the next scrub will fix > it. You can run checkarray yourself if you don't want to wait. Thanks Phil. I ran checkarray --all --idle, and it completed fine, with no Rebuild messages as far as I could see (looked in dmesg & /var/log/syslog, see below). [4444093.042246] md: data-check of RAID array md0 [4444093.042252] md: minimum _guaranteed_ speed: 1000 KB/sec/disk. [4444093.042254] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check. [4444093.042262] md: using 128k window, over a total of 1950656k. [4444093.192032] md: delaying data-check of md2 until md0 has finished (they share one or more physical units) [4444106.854418] md: md0: data-check done. [4444106.863292] md: data-check of RAID array md2 [4444106.863295] md: minimum _guaranteed_ speed: 1000 KB/sec/disk. [4444106.863298] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check. 
[4444106.863304] md: using 128k window, over a total of 2920188928k. [4475376.852520] md: md2: data-check done. SMART still shows those 8 unreadable sectors. dmesg has a bunch of related errors, copied below. >>>> Is this something I should run now? I figure it's a bad idea to push >>>> an array that is starting to degrade... haven't had a chance to >>>> replace the drive yet, but will get to it this week. Probably best >>>> to start the scrubbing routines once I have 4 good drives in there I >>>> figure... >>> >>> NO - never put any load you can avoid on degraded arrays >> >> thanks, i won't. > > If I read your OP correctly, your array is *not* degraded -- it just has > a pending URE on one drive. You are still redundant, and your system is > scrubbing once a month. FWIW, I don't replace drives just for > pending sectors -- they are expected occasionally per drive specs. So > long as check scrubs regularly complete, I replace drives when actual > relocations hit double digits. So, is there a way to tell if the array successfully "relocated" those 8 sectors? Or, no need to verify it? > > You are fine. Your array is fine. Long timeouts can cause application > timeouts and user freak-outs, so your Seagates are less than ideal, but > your system is *fine*. > > Consider using the new drive to convert to raid6. If you have other > reasons to stay with raid5, then add it as a spare, then use mdadm's > --replace operation to swap out the drive with the pending sector. Thanks - I'll look into raid6 conversion if this drive doesn't start upping its unreadable sector counts more in the near future... thanks, allie ------------------------------------- [4038193.380403] INFO: task md2_raid5:247 blocked for more than 120 seconds. [4038193.380473] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu [4038193.380526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4038193.380591] md2_raid5 D ffff8800af01fbe0 0 247 2 0x00000000 [4038193.380599] ffff8800af01fbe0 ffff8800980f5400 ffff8802229b8e00 ffff8800af020000 [4038193.380604] ffff880222017298 ffffea0002bc0300 ffff880222017018 ffff880222017000 [4038193.380609] ffff8800af01fbf8 ffffffff81808f75 ffff880222017000 ffff8800af01fc40 [4038193.380613] Call Trace: [4038193.380625] [<ffffffff81808f75>] schedule+0x35/0x80 [4038193.380631] [<ffffffff81681a17>] md_super_wait+0x47/0x80 [4038193.380638] [<ffffffff810bf660>] ? prepare_to_wait_event+0xf0/0xf0 [4038193.380643] [<ffffffff816892d5>] write_page+0x1f5/0x310 [4038193.380648] [<ffffffff81688fe8>] bitmap_update_sb+0x138/0x150 [4038193.380652] [<ffffffff81681d79>] md_update_sb.part.51+0x329/0x800 [4038193.380657] [<ffffffff81682275>] md_update_sb+0x25/0x30 [4038193.380661] [<ffffffff8168292d>] md_check_recovery+0x1dd/0x4a0 [4038193.380670] [<ffffffffc00e6765>] raid5d+0x45/0x740 [raid456] [4038193.380675] [<ffffffff810e7b08>] ? del_timer_sync+0x48/0x50 [4038193.380680] [<ffffffff8180b86b>] ? schedule_timeout+0x16b/0x2d0 [4038193.380685] [<ffffffff810e75f0>] ? trace_event_raw_event_tick_stop+0xd0/0xd0 [4038193.380691] [<ffffffff8167a957>] md_thread+0x117/0x130 [4038193.380696] [<ffffffff810bf660>] ? prepare_to_wait_event+0xf0/0xf0 [4038193.380702] [<ffffffff8167a840>] ? find_pers+0x70/0x70 [4038193.380707] [<ffffffff8109cd56>] kthread+0xd6/0xf0 [4038193.380711] [<ffffffff8109cc80>] ? kthread_park+0x60/0x60 [4038193.380715] [<ffffffff8180cb8f>] ret_from_fork+0x3f/0x70 [4038193.380719] [<ffffffff8109cc80>] ? kthread_park+0x60/0x60 [4038193.380723] INFO: task jbd2/md2-8:261 blocked for more than 120 seconds. [4038193.380779] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu [4038193.380834] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
[4038193.380898] jbd2/md2-8 D ffff8800af017a30 0 261 2 0x00000000 [4038193.380903] ffff8800af017a30 ffff880225a44600 ffff880225b4d400 ffff8800af018000 [4038193.380908] ffff880222017298 ffff880222017290 ffff88014970c400 0000000000000000 [4038193.380912] ffff8800af017a48 ffffffff81808f75 ffff880222017000 ffff8800af017a98 [4038193.380917] Call Trace: [4038193.380922] [<ffffffff81808f75>] schedule+0x35/0x80 [4038193.380926] [<ffffffff8167f94d>] md_write_start+0x9d/0x180 [4038193.380931] [<ffffffff810bf660>] ? prepare_to_wait_event+0xf0/0xf0 [4038193.380939] [<ffffffffc00e307a>] make_request+0x7a/0xcc0 [raid456] [4038193.380944] [<ffffffff8128a6f0>] ? ext4_map_blocks+0x2c0/0x4e0 [4038193.380950] [<ffffffff810bf660>] ? prepare_to_wait_event+0xf0/0xf0 [4038193.380954] [<ffffffff8167b13c>] md_make_request+0xec/0x240 [4038193.380959] [<ffffffff813aee08>] generic_make_request+0xf8/0x2a0 [4038193.380962] [<ffffffff813af027>] submit_bio+0x77/0x150 [4038193.380967] [<ffffffff813a6821>] ? bio_alloc_bioset+0x181/0x2c0 [4038193.380971] [<ffffffff81237ecf>] submit_bh_wbc+0x12f/0x160 [4038193.380975] [<ffffffff81237f32>] submit_bh+0x12/0x20 [4038193.380980] [<ffffffff812db465>] jbd2_journal_commit_transaction+0x5e5/0x1970 [4038193.380984] [<ffffffff810b315f>] ? update_curr+0xdf/0x170 [4038193.380989] [<ffffffff810e7a9f>] ? try_to_del_timer_sync+0x4f/0x70 [4038193.380994] [<ffffffff812e01eb>] kjournald2+0xbb/0x230 [4038193.380999] [<ffffffff810bf660>] ? prepare_to_wait_event+0xf0/0xf0 [4038193.381004] [<ffffffff812e0130>] ? commit_timeout+0x10/0x10 [4038193.381007] [<ffffffff8109cd56>] kthread+0xd6/0xf0 [4038193.381011] [<ffffffff8109cc80>] ? kthread_park+0x60/0x60 [4038193.381015] [<ffffffff8180cb8f>] ret_from_fork+0x3f/0x70 [4038193.381019] [<ffffffff8109cc80>] ? kthread_park+0x60/0x60 [4038193.381055] INFO: task kworker/u16:1:4795 blocked for more than 120 seconds. 
[4038193.381114] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu
[4038193.381167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4038193.381231] kworker/u16:1 D ffff880003b777b0 0 4795 2 0x00000000
[4038193.381240] Workqueue: writeback wb_workfn (flush-9:2)
[4038193.381243] ffff880003b777b0 ffffffff81e13500 ffff88022023d400 ffff880003b78000
[4038193.381247] ffff880222017298 0000000000000001 ffff8800afaef500 ffff88020e88e570
[4038193.381252] ffff880003b777c8 ffffffff81808f75 ffff880222017000 ffff880003b77818
[4038193.381256] Call Trace:
[4038193.381261] [<ffffffff81808f75>] schedule+0x35/0x80
[4038193.381265] [<ffffffff8167f94d>] md_write_start+0x9d/0x180
[4038193.381271] [<ffffffff810bf660>] ? prepare_to_wait_event+0xf0/0xf0
[4038193.381278] [<ffffffffc00e307a>] make_request+0x7a/0xcc0 [raid456]
[4038193.381282] [<ffffffff813a6844>] ? bio_alloc_bioset+0x1a4/0x2c0
[4038193.381288] [<ffffffff810bf660>] ? prepare_to_wait_event+0xf0/0xf0
[4038193.381292] [<ffffffff8167b13c>] md_make_request+0xec/0x240
[4038193.381296] [<ffffffff810bf184>] ? __wake_up+0x44/0x50
[4038193.381300] [<ffffffff813aee08>] generic_make_request+0xf8/0x2a0
[4038193.381304] [<ffffffff813af027>] submit_bio+0x77/0x150
[4038193.381309] [<ffffffff8129198e>] ext4_io_submit+0x3e/0x60
[4038193.381313] [<ffffffff8128da43>] ext4_writepages+0x553/0xcd0
[4038193.381318] [<ffffffff8180b86b>] ? schedule_timeout+0x16b/0x2d0
[4038193.381323] [<ffffffff81095343>] ? __queue_delayed_work+0x83/0x180
[4038193.381329] [<ffffffff8119186e>] do_writepages+0x1e/0x30
[4038193.381333] [<ffffffff8122e885>] __writeback_single_inode+0x45/0x340
[4038193.381338] [<ffffffff818099a8>] ? wait_for_completion_io_timeout+0xa8/0x120
[4038193.381343] [<ffffffff8122f0bb>] writeback_sb_inodes+0x26b/0x5c0
[4038193.381347] [<ffffffff8122f496>] __writeback_inodes_wb+0x86/0xc0
[4038193.381351] [<ffffffff8122f722>] wb_writeback+0x252/0x2e0
[4038193.381355] [<ffffffff8122ff12>] wb_workfn+0x2c2/0x3d0
[4038193.381359] [<ffffffff810b3f55>] ? put_prev_entity+0x35/0x670
[4038193.381365] [<ffffffff81096d20>] process_one_work+0x150/0x3f0
[4038193.381370] [<ffffffff8109749a>] worker_thread+0x11a/0x470
[4038193.381375] [<ffffffff81097380>] ? rescuer_thread+0x310/0x310
[4038193.381378] [<ffffffff8109cd56>] kthread+0xd6/0xf0
[4038193.381382] [<ffffffff8109cc80>] ? kthread_park+0x60/0x60
[4038193.381386] [<ffffffff8180cb8f>] ret_from_fork+0x3f/0x70
[4038193.381390] [<ffffffff8109cc80>] ? kthread_park+0x60/0x60
[4038193.381400] INFO: task updatedb.mlocat:16442 blocked for more than 120 seconds.
[4038193.381461] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu
[4038193.381512] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4038193.381576] updatedb.mlocat D ffff880206e2b968 0 16442 16441 0x00000000
[4038193.381581] ffff880206e2b968 ffff880225a41c00 ffff880049599c00 ffff880206e2c000
[4038193.381586] 0000000000000000 7fffffffffffffff ffff88022efb2ce8 ffffffff818096b0
[4038193.381590] ffff880206e2b980 ffffffff81808f75 ffff88022ec96e00 ffff880206e2ba28
[4038193.381594] Call Trace:
[4038193.381599] [<ffffffff818096b0>] ? bit_wait+0x50/0x50
[4038193.381604] [<ffffffff81808f75>] schedule+0x35/0x80
[4038193.381608] [<ffffffff8180b937>] schedule_timeout+0x237/0x2d0
[4038193.381613] [<ffffffff8159dd7f>] ? scsi_request_fn+0x3f/0x630
[4038193.381618] [<ffffffff813a8121>] ? elv_rb_add+0x61/0x70
[4038193.381624] [<ffffffff810f028c>] ? ktime_get+0x3c/0xb0
[4038193.381628] [<ffffffff818096b0>] ? bit_wait+0x50/0x50
[4038193.381633] [<ffffffff81808556>] io_schedule_timeout+0xa6/0x110
[4038193.381637] [<ffffffff818096cb>] bit_wait_io+0x1b/0x60
[4038193.381642] [<ffffffff81809310>] __wait_on_bit+0x60/0x90
[4038193.381646] [<ffffffff818096b0>] ? bit_wait+0x50/0x50
[4038193.381650] [<ffffffff818093b2>] out_of_line_wait_on_bit+0x72/0x80
[4038193.381655] [<ffffffff810bf6a0>] ? autoremove_wake_function+0x40/0x40
[4038193.381661] [<ffffffff81236922>] __wait_on_buffer+0x32/0x40
[4038193.381665] [<ffffffff812885c3>] __ext4_get_inode_loc+0x1c3/0x3f0
[4038193.381669] [<ffffffff8128b7bf>] ext4_iget+0x8f/0xb80
[4038193.381673] [<ffffffff8128c2e0>] ext4_iget_normal+0x30/0x40
[4038193.381677] [<ffffffff812964b1>] ext4_lookup+0xf1/0x230
[4038193.381682] [<ffffffff8120ac7d>] lookup_real+0x1d/0x50
[4038193.381686] [<ffffffff8120b093>] __lookup_hash+0x33/0x40
[4038193.381690] [<ffffffff8120d967>] walk_component+0x177/0x230
[4038193.381695] [<ffffffff8120e9f0>] path_lookupat+0x60/0x110
[4038193.381699] [<ffffffff8121087c>] filename_lookup+0x9c/0x150
[4038193.381704] [<ffffffff811ded1f>] ? kmem_cache_alloc+0x19f/0x200
[4038193.381708] [<ffffffff812104bf>] ? getname_flags+0x4f/0x1f0
[4038193.381713] [<ffffffff812109e6>] user_path_at_empty+0x36/0x40
[4038193.381719] [<ffffffff81205f93>] vfs_fstatat+0x53/0xa0
[4038193.381724] [<ffffffff81206462>] SYSC_newlstat+0x22/0x40
[4038193.381729] [<ffffffff81216555>] ? SyS_poll+0x65/0xf0
[4038193.381733] [<ffffffff8120667e>] SyS_newlstat+0xe/0x10
[4038193.381737] [<ffffffff8180c7f6>] entry_SYSCALL_64_fastpath+0x16/0x75
[4038193.381741] INFO: task nmbd:16462 blocked for more than 120 seconds.
[4038193.381794] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu
[4038193.381846] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4038193.381910] nmbd D ffff880098197b38 0 16462 6615 0x00000000
[4038193.381915] ffff880098197b38 ffff880225a43800 ffff88004959d400 ffff880098198000
[4038193.381920] 0000000000000000 7fffffffffffffff ffff88022efba0f8 ffffffff818096b0
[4038193.381924] ffff880098197b50 ffffffff81808f75 ffff88022ed16e00 ffff880098197bf8
[4038193.381928] Call Trace:
[4038193.381933] [<ffffffff818096b0>] ? bit_wait+0x50/0x50
[4038193.381937] [<ffffffff81808f75>] schedule+0x35/0x80
[4038193.381942] [<ffffffff8180b937>] schedule_timeout+0x237/0x2d0
[4038193.381946] [<ffffffff811de8dc>] ? ___slab_alloc+0x1cc/0x470
[4038193.381951] [<ffffffff810f028c>] ? ktime_get+0x3c/0xb0
[4038193.381956] [<ffffffff818096b0>] ? bit_wait+0x50/0x50
[4038193.381960] [<ffffffff81808556>] io_schedule_timeout+0xa6/0x110
[4038193.381965] [<ffffffff818096cb>] bit_wait_io+0x1b/0x60
[4038193.381969] [<ffffffff81809310>] __wait_on_bit+0x60/0x90
[4038193.381974] [<ffffffff818096b0>] ? bit_wait+0x50/0x50
[4038193.381978] [<ffffffff818093b2>] out_of_line_wait_on_bit+0x72/0x80
[4038193.381983] [<ffffffff810bf6a0>] ? autoremove_wake_function+0x40/0x40
[4038193.381988] [<ffffffff812d9943>] do_get_write_access+0x273/0x490
[4038193.381992] [<ffffffff812d9b91>] jbd2_journal_get_write_access+0x31/0x60
[4038193.381997] [<ffffffff812bdb0b>] __ext4_journal_get_write_access+0x3b/0x80
[4038193.382001] [<ffffffff81298ba4>] ext4_orphan_add+0xa4/0x260
[4038193.382006] [<ffffffff81299db8>] ext4_unlink+0x338/0x350
[4038193.382010] [<ffffffff8120c93a>] vfs_unlink+0xda/0x190
[4038193.382015] [<ffffffff8137881b>] ? wrap_apparmor_path_unlink+0x1b/0x20
[4038193.382020] [<ffffffff81211127>] do_unlinkat+0x257/0x2a0
[4038193.382025] [<ffffffff81211ae6>] SyS_unlink+0x16/0x20
[4038193.382029] [<ffffffff8180c7f6>] entry_SYSCALL_64_fastpath+0x16/0x75
[4038242.602780] ata3.00: exception Emask 0x40 SAct 0x7fffffff SErr 0x800 action 0x6 frozen
[4038242.602856] ata3: SError: { HostInt }
[4038242.602892] ata3.00: failed command: READ FPDMA QUEUED
[4038242.602943] ata3.00: cmd 60/08:00:d0:a3:3f/00:00:28:01:00/40 tag 0 ncq 4096 in
[4038242.602943] res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.603063] ata3.00: status: { DRDY }
[4038242.603097] ata3.00: failed command: READ FPDMA QUEUED
[4038242.603146] ata3.00: cmd 60/08:08:d8:a3:3f/00:00:28:01:00/40 tag 1 ncq 4096 in
[4038242.603146] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.603266] ata3.00: status: { DRDY }
[4038242.603299] ata3.00: failed command: READ FPDMA QUEUED
[4038242.603348] ata3.00: cmd 60/08:10:e0:a3:3f/00:00:28:01:00/40 tag 2 ncq 4096 in
[4038242.603348] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.603466] ata3.00: status: { DRDY }
[4038242.603500] ata3.00: failed command: READ FPDMA QUEUED
[4038242.603549] ata3.00: cmd 60/08:18:e8:a3:3f/00:00:28:01:00/40 tag 3 ncq 4096 in
[4038242.603549] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.603667] ata3.00: status: { DRDY }
[4038242.603700] ata3.00: failed command: READ FPDMA QUEUED
[4038242.603749] ata3.00: cmd 60/08:20:f8:a3:3f/00:00:28:01:00/40 tag 4 ncq 4096 in
[4038242.603749] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.603867] ata3.00: status: { DRDY }
[4038242.603901] ata3.00: failed command: READ FPDMA QUEUED
[4038242.603950] ata3.00: cmd 60/08:28:f0:a3:3f/00:00:28:01:00/40 tag 5 ncq 4096 in
[4038242.603950] res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x44 (timeout)
[4038242.604070] ata3.00: status: { DRDY }
[4038242.604104] ata3.00: failed command: READ FPDMA QUEUED
[4038242.604153] ata3.00: cmd 60/08:30:08:a3:3f/00:00:28:01:00/40 tag 6 ncq 4096 in
[4038242.604153] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.604271] ata3.00: status: { DRDY }
[4038242.606329] ata3.00: failed command: READ FPDMA QUEUED
[4038242.608395] ata3.00: cmd 60/08:38:10:a3:3f/00:00:28:01:00/40 tag 7 ncq 4096 in
[4038242.608395] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.612513] ata3.00: status: { DRDY }
[4038242.614535] ata3.00: failed command: READ FPDMA QUEUED
[4038242.616547] ata3.00: cmd 60/08:40:18:a3:3f/00:00:28:01:00/40 tag 8 ncq 4096 in
[4038242.616547] res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x44 (timeout)
[4038242.620545] ata3.00: status: { DRDY }
[4038242.622544] ata3.00: failed command: READ FPDMA QUEUED
[4038242.624503] ata3.00: cmd 60/08:48:20:a3:3f/00:00:28:01:00/40 tag 9 ncq 4096 in
[4038242.624503] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.628406] ata3.00: status: { DRDY }
[4038242.630330] ata3.00: failed command: READ FPDMA QUEUED
[4038242.632245] ata3.00: cmd 60/08:50:28:a3:3f/00:00:28:01:00/40 tag 10 ncq 4096 in
[4038242.632245] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.636058] ata3.00: status: { DRDY }
[4038242.637942] ata3.00: failed command: READ FPDMA QUEUED
[4038242.639818] ata3.00: cmd 60/08:58:30:a3:3f/00:00:28:01:00/40 tag 11 ncq 4096 in
[4038242.639818] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.643555] ata3.00: status: { DRDY }
[4038242.645413] ata3.00: failed command: READ FPDMA QUEUED
[4038242.647270] ata3.00: cmd 60/08:60:38:a3:3f/00:00:28:01:00/40 tag 12 ncq 4096 in
[4038242.647270] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.650999] ata3.00: status: { DRDY }
[4038242.652859] ata3.00: failed command: READ FPDMA QUEUED
[4038242.654718] ata3.00: cmd 60/08:68:40:a3:3f/00:00:28:01:00/40 tag 13 ncq 4096 in
[4038242.654718] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.658458] ata3.00: status: { DRDY }
[4038242.660326] ata3.00: failed command: READ FPDMA QUEUED
[4038242.662189] ata3.00: cmd 60/08:70:48:a3:3f/00:00:28:01:00/40 tag 14 ncq 4096 in
[4038242.662189] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.665951] ata3.00: status: { DRDY }
[4038242.667827] ata3.00: failed command: READ FPDMA QUEUED
[4038242.669696] ata3.00: cmd 60/08:78:50:a3:3f/00:00:28:01:00/40 tag 15 ncq 4096 in
[4038242.669696] res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x44 (timeout)
[4038242.673441] ata3.00: status: { DRDY }
[4038242.675319] ata3.00: failed command: READ FPDMA QUEUED
[4038242.677189] ata3.00: cmd 60/08:80:58:a3:3f/00:00:28:01:00/40 tag 16 ncq 4096 in
[4038242.677189] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.680935] ata3.00: status: { DRDY }
[4038242.682816] ata3.00: failed command: READ FPDMA QUEUED
[4038242.684689] ata3.00: cmd 60/08:88:60:a3:3f/00:00:28:01:00/40 tag 17 ncq 4096 in
[4038242.684689] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.688450] ata3.00: status: { DRDY }
[4038242.690328] ata3.00: failed command: READ FPDMA QUEUED
[4038242.692208] ata3.00: cmd 60/08:90:68:a3:3f/00:00:28:01:00/40 tag 18 ncq 4096 in
[4038242.692208] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.695989] ata3.00: status: { DRDY }
[4038242.697872] ata3.00: failed command: READ FPDMA QUEUED
[4038242.699753] ata3.00: cmd 60/08:98:70:a3:3f/00:00:28:01:00/40 tag 19 ncq 4096 in
[4038242.699753] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.703534] ata3.00: status: { DRDY }
[4038242.705417] ata3.00: failed command: READ FPDMA QUEUED
[4038242.707310] ata3.00: cmd 60/08:a0:78:a3:3f/00:00:28:01:00/40 tag 20 ncq 4096 in
[4038242.707310] res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x44 (timeout)
[4038242.711100] ata3.00: status: { DRDY }
[4038242.712986] ata3.00: failed command: READ FPDMA QUEUED
[4038242.714874] ata3.00: cmd 60/08:a8:80:a3:3f/00:00:28:01:00/40 tag 21 ncq 4096 in
[4038242.714874] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.718671] ata3.00: status: { DRDY }
[4038242.720557] ata3.00: failed command: READ FPDMA QUEUED
[4038242.722432] ata3.00: cmd 60/08:b0:88:a3:3f/00:00:28:01:00/40 tag 22 ncq 4096 in
[4038242.722432] res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.726213] ata3.00: status: { DRDY }
[4038242.728109] ata3.00: failed command: READ FPDMA QUEUED
[4038242.729996] ata3.00: cmd 60/08:b8:90:a3:3f/00:00:28:01:00/40 tag 23 ncq 4096 in
[4038242.729996] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.733776] ata3.00: status: { DRDY }
[4038242.735680] ata3.00: failed command: READ FPDMA QUEUED
[4038242.737564] ata3.00: cmd 60/08:c0:98:a3:3f/00:00:28:01:00/40 tag 24 ncq 4096 in
[4038242.737564] res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.741366] ata3.00: status: { DRDY }
[4038242.743276] ata3.00: failed command: READ FPDMA QUEUED
[4038242.745171] ata3.00: cmd 60/08:c8:a0:a3:3f/00:00:28:01:00/40 tag 25 ncq 4096 in
[4038242.745171] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.748959] ata3.00: status: { DRDY }
[4038242.750855] ata3.00: failed command: READ FPDMA QUEUED
[4038242.752741] ata3.00: cmd 60/08:d0:a8:a3:3f/00:00:28:01:00/40 tag 26 ncq 4096 in
[4038242.752741] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.756528] ata3.00: status: { DRDY }
[4038242.758417] ata3.00: failed command: READ FPDMA QUEUED
[4038242.760304] ata3.00: cmd 60/08:d8:b0:a3:3f/00:00:28:01:00/40 tag 27 ncq 4096 in
[4038242.760304] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.764094] ata3.00: status: { DRDY }
[4038242.765981] ata3.00: failed command: READ FPDMA QUEUED
[4038242.767867] ata3.00: cmd 60/08:e0:b8:a3:3f/00:00:28:01:00/40 tag 28 ncq 4096 in
[4038242.767867] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.771670] ata3.00: status: { DRDY }
[4038242.773566] ata3.00: failed command: READ FPDMA QUEUED
[4038242.775454] ata3.00: cmd 60/08:e8:c0:a3:3f/00:00:28:01:00/40 tag 29 ncq 4096 in
[4038242.775454] res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.779246] ata3.00: status: { DRDY }
[4038242.781133] ata3.00: failed command: READ FPDMA QUEUED
[4038242.783020] ata3.00: cmd 60/08:f0:c8:a3:3f/00:00:28:01:00/40 tag 30 ncq 4096 in
[4038242.783020] res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x44 (timeout)
[4038242.786812] ata3.00: status: { DRDY }
[4038242.788703] ata3: hard resetting link
[4038243.278780] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[4038243.329279] ata3.00: configured for UDMA/133
[4038243.329314] sd 2:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4038243.329320] sd 2:0:0:0: [sda] tag#0 Sense Key : Illegal Request [current] [descriptor]
[4038243.329326] sd 2:0:0:0: [sda] tag#0 Add. Sense: Unaligned write command
[4038243.329332] sd 2:0:0:0: [sda] tag#0 CDB: Read(16) 88 00 00 00 00 01 28 3f a3 d0 00 00 00 08 00 00
[4038243.329336] blk_update_request: I/O error, dev sda, sector 4970226640
[4038243.331303] sd 2:0:0:0: [sda] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4038243.331309] sd 2:0:0:0: [sda] tag#1 Sense Key : Illegal Request [current] [descriptor]
[4038243.331314] sd 2:0:0:0: [sda] tag#1 Add. Sense: Unaligned write command
[4038243.331319] sd 2:0:0:0: [sda] tag#1 CDB: Read(16) 88 00 00 00 00 01 28 3f a3 d8 00 00 00 08 00 00
[4038243.331323] blk_update_request: I/O error, dev sda, sector 4970226648
[4038243.333204] sd 2:0:0:0: [sda] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4038243.333209] sd 2:0:0:0: [sda] tag#2 Sense Key : Illegal Request [current] [descriptor]
[4038243.333214] sd 2:0:0:0: [sda] tag#2 Add. Sense: Unaligned write command
[4038243.333218] sd 2:0:0:0: [sda] tag#2 CDB: Read(16) 88 00 00 00 00 01 28 3f a3 e0 00 00 00 08 00 00
[4038243.333221] blk_update_request: I/O error, dev sda, sector 4970226656
[4038243.335072] sd 2:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4038243.335077] sd 2:0:0:0: [sda] tag#3 Sense Key : Illegal Request [current] [descriptor]
[4038243.335081] sd 2:0:0:0: [sda] tag#3 Add. Sense: Unaligned write command
[4038243.335086] sd 2:0:0:0: [sda] tag#3 CDB: Read(16) 88 00 00 00 00 01 28 3f a3 e8 00 00 00 08 00 00
[4038243.335089] blk_update_request: I/O error, dev sda, sector 4970226664
[4038243.336908] sd 2:0:0:0: [sda] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4038243.336914] sd 2:0:0:0: [sda] tag#5 Sense Key : Illegal Request [current] [descriptor]
[4038243.336918] sd 2:0:0:0: [sda] tag#5 Add. Sense: Unaligned write command
[4038243.336923] sd 2:0:0:0: [sda] tag#5 CDB: Read(16) 88 00 00 00 00 01 28 3f a3 f0 00 00 00 08 00 00
[4038243.336926] blk_update_request: I/O error, dev sda, sector 4970226672
[4038243.338720] sd 2:0:0:0: [sda] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4038243.338725] sd 2:0:0:0: [sda] tag#6 Sense Key : Illegal Request [current] [descriptor]
[4038243.338730] sd 2:0:0:0: [sda] tag#6 Add. Sense: Unaligned write command
[4038243.338734] sd 2:0:0:0: [sda] tag#6 CDB: Read(16) 88 00 00 00 00 01 28 3f a3 08 00 00 00 08 00 00
[4038243.338737] blk_update_request: I/O error, dev sda, sector 4970226440
[4038243.340502] sd 2:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4038243.340506] sd 2:0:0:0: [sda] tag#7 Sense Key : Illegal Request [current] [descriptor]
[4038243.340511] sd 2:0:0:0: [sda] tag#7 Add. Sense: Unaligned write command
[4038243.340516] sd 2:0:0:0: [sda] tag#7 CDB: Read(16) 88 00 00 00 00 01 28 3f a3 10 00 00 00 08 00 00
[4038243.340519] blk_update_request: I/O error, dev sda, sector 4970226448
[4038243.342229] sd 2:0:0:0: [sda] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4038243.342234] sd 2:0:0:0: [sda] tag#8 Sense Key : Illegal Request [current] [descriptor]
[4038243.342239] sd 2:0:0:0: [sda] tag#8 Add. Sense: Unaligned write command
[4038243.342244] sd 2:0:0:0: [sda] tag#8 CDB: Read(16) 88 00 00 00 00 01 28 3f a3 18 00 00 00 08 00 00
[4038243.342247] blk_update_request: I/O error, dev sda, sector 4970226456
[4038243.343950] sd 2:0:0:0: [sda] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4038243.343955] sd 2:0:0:0: [sda] tag#9 Sense Key : Illegal Request [current] [descriptor]
[4038243.343960] sd 2:0:0:0: [sda] tag#9 Add. Sense: Unaligned write command
[4038243.343965] sd 2:0:0:0: [sda] tag#9 CDB: Read(16) 88 00 00 00 00 01 28 3f a3 20 00 00 00 08 00 00
[4038243.343968] blk_update_request: I/O error, dev sda, sector 4970226464
[4038243.345623] sd 2:0:0:0: [sda] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4038243.345628] sd 2:0:0:0: [sda] tag#10 Sense Key : Illegal Request [current] [descriptor]
[4038243.345633] sd 2:0:0:0: [sda] tag#10 Add. Sense: Unaligned write command
[4038243.345637] sd 2:0:0:0: [sda] tag#10 CDB: Read(16) 88 00 00 00 00 01 28 3f a3 28 00 00 00 08 00 00
[4038243.345640] blk_update_request: I/O error, dev sda, sector 4970226472
[4038243.347350] ata3: EH complete
^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-11 10:31 ` Alexander Shenkin @ 2017-10-11 17:10 ` Phil Turmel 2017-10-12 9:50 ` Alexander Shenkin 0 siblings, 1 reply; 49+ messages in thread From: Phil Turmel @ 2017-10-11 17:10 UTC (permalink / raw) To: Alexander Shenkin, Reindl Harald, Carsten Aulbert, linux-raid On 10/11/2017 06:31 AM, Alexander Shenkin wrote: > On 10/10/2017 1:55 PM, Phil Turmel wrote: >> Which means the pending sector found by a smartctl background scan is >> likely in a non-array data area. And if not, the next scrub will fix >> it. You can run checkarray yourself if you don't want to wait. > > Thanks Phil. I ran checkarray --all --idle, and it completed fine, with > no Rebuild messages as far as I could see (looked in dmesg & > /var/log/syslog, see below). > > [4444093.042246] md: data-check of RAID array md0 > [4444093.042252] md: minimum _guaranteed_ speed: 1000 KB/sec/disk. > [4444093.042254] md: using maximum available idle IO bandwidth (but not > more than 200000 KB/sec) for data-check. > [4444093.042262] md: using 128k window, over a total of 1950656k. > [4444093.192032] md: delaying data-check of md2 until md0 has finished > (they share one or more physical units) > [4444106.854418] md: md0: data-check done. > [4444106.863292] md: data-check of RAID array md2 > [4444106.863295] md: minimum _guaranteed_ speed: 1000 KB/sec/disk. > [4444106.863298] md: using maximum available idle IO bandwidth (but not > more than 200000 KB/sec) for data-check. > [4444106.863304] md: using 128k window, over a total of 2920188928k. > [4475376.852520] md: md2: data-check done. > > SMART still shows those 8 unreadable sectors. dmesg has a bunch of > related errors, copied below. Uh-oh. Your kernel has a hangcheck timer that is shorter (120 seconds) than the URE timeout of your crappy Seagate drive (w/ driver times out at 180 seconds). So the writeback that would fix the URE isn't happening. 
You'll need to set your hangcheck timer to 180 seconds, too. I'm not sure how to do that. (I've never seen this particular combination, but it would be another black mark on desktop drives in raid arrays.) Phil ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-11 17:10 ` Phil Turmel @ 2017-10-12 9:50 ` Alexander Shenkin 2017-10-12 11:01 ` Wols Lists 2017-10-12 15:19 ` Kai Stian Olstad 0 siblings, 2 replies; 49+ messages in thread From: Alexander Shenkin @ 2017-10-12 9:50 UTC (permalink / raw) To: Phil Turmel, Reindl Harald, Carsten Aulbert, linux-raid On 10/11/2017 6:10 PM, Phil Turmel wrote: > On 10/11/2017 06:31 AM, Alexander Shenkin wrote: >> On 10/10/2017 1:55 PM, Phil Turmel wrote: > >>> Which means the pending sector found by a smartctl background scan is >>> likely in a non-array data area. And if not, the next scrub will fix >>> it. You can run checkarray yourself if you don't want to wait. >> >> Thanks Phil. I ran checkarray --all --idle, and it completed fine, with >> no Rebuild messages as far as I could see (looked in dmesg & >> /var/log/syslog, see below). >> >> [4444093.042246] md: data-check of RAID array md0 >> [4444093.042252] md: minimum _guaranteed_ speed: 1000 KB/sec/disk. >> [4444093.042254] md: using maximum available idle IO bandwidth (but not >> more than 200000 KB/sec) for data-check. >> [4444093.042262] md: using 128k window, over a total of 1950656k. >> [4444093.192032] md: delaying data-check of md2 until md0 has finished >> (they share one or more physical units) >> [4444106.854418] md: md0: data-check done. >> [4444106.863292] md: data-check of RAID array md2 >> [4444106.863295] md: minimum _guaranteed_ speed: 1000 KB/sec/disk. >> [4444106.863298] md: using maximum available idle IO bandwidth (but not >> more than 200000 KB/sec) for data-check. >> [4444106.863304] md: using 128k window, over a total of 2920188928k. >> [4475376.852520] md: md2: data-check done. >> >> SMART still shows those 8 unreadable sectors. dmesg has a bunch of >> related errors, copied below. > > Uh-oh. Your kernel has a hangcheck timer that is shorter (120 seconds) > than the URE timeout of your crappy Seagate drive (w/ driver times out > at 180 seconds). 
So the writeback that would fix the URE isn't happening. > > You'll need to set your hangcheck timer to 180 seconds, too. I'm not > sure how to do that. (I've never seen this particular combination, but > it would be another black mark on desktop drives in raid arrays.) Thanks Phil... Googling around, I haven't found a way to change it either, but then again, I'm not really sure what to search for. What about changing my default disk timeout to something less than 120 secs? Say, 100 secs instead of 180? Seems like this issue should probably make it into the timeout wiki page, no? Perhaps some instructions on how to query your system's hangcheck timeout, and thus making sure that you set your drive timeouts to less than that? Thanks, Allie ^ permalink raw reply [flat|nested] 49+ messages in thread
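[Editor's note: the usual fix discussed on the wiki's timeout-mismatch page goes in the other direction from Alexander's suggestion, raising the kernel's per-device command timeout rather than lowering it. A persistent sketch as a udev rule; the file name, match pattern, and 180-second value are illustrative assumptions, not from this thread:]

```
# /etc/udev/rules.d/60-raid-timeout.rules -- illustrative sketch.
# Raise the SCSI command timeout to 180 s for SATA disks so the driver
# outlasts a desktop drive's worst-case internal error recovery.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="180"
```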
* Re: SMART detects pending sectors; take offline? 2017-10-12 9:50 ` Alexander Shenkin @ 2017-10-12 11:01 ` Wols Lists 2017-10-12 13:04 ` Phil Turmel 2017-10-12 15:19 ` Kai Stian Olstad 1 sibling, 1 reply; 49+ messages in thread From: Wols Lists @ 2017-10-12 11:01 UTC (permalink / raw) To: Alexander Shenkin, Phil Turmel, Reindl Harald, Carsten Aulbert, linux-raid On 12/10/17 10:50, Alexander Shenkin wrote: > Thanks Phil... Googling around, I haven't found a way to change it > either, but then again, I'm not really sure what to search for. > > What about changing my default disk timeout to something less than 120 > secs? Say, 100 secs instead of 180? > > Seems like this issue should probably make it into the timeout wiki > page, no? Perhaps some instructions on how to query your system's > hangcheck timeout, and thus making sure that you set your drive timeouts > to less than that? Very much so. What is a "hangcheck timeout"? My wife has kindly bought the basics I need for a new PC for my birthday (yeah! :-) and I've ordered two Seagate Ironwolfs to go with it, so I will be setting this up from scratch. Raid, KVM, LVM, the works. So hangcheck timeouts, documenting on the wiki, all the other bits, the important thing is I'll have a brand new system I can play with that's not got anything important on it and if the system (software side only, of course :-) gets trashed, so what. I can try stuff out without worrying about putting my live system at risk. But back to topic. I know we have the disk timeout (on desktop drives, any random number up to 180secs :-). We have the linux i/o wait timeout - by default 30 secs. And now we have the hangcheck timeout, whatever that is ... Cheers, Wol ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-12 11:01 ` Wols Lists @ 2017-10-12 13:04 ` Phil Turmel 2017-10-12 13:16 ` Alexander Shenkin 0 siblings, 1 reply; 49+ messages in thread From: Phil Turmel @ 2017-10-12 13:04 UTC (permalink / raw) To: Wols Lists, Alexander Shenkin, Reindl Harald, Carsten Aulbert, linux-raid On 10/12/2017 07:01 AM, Wols Lists wrote: > On 12/10/17 10:50, Alexander Shenkin wrote: >> Thanks Phil... Googling around, I haven't found a way to change it >> either, but then again, I'm not really sure what to search for. >> >> What about changing my default disk timeout to something less than 120 >> secs? Say, 100 secs instead of 180? Nope. The number has to be longer than the actual longest timeout of your drive, which we now know is >120. When I first investigated this phenomenon years ago, I picked 120 for my timeouts. Other reports reached the list with the need for longer, and the recommendation for 180 was chosen. If the driver times out, it resets the SATA connection while the drive is still in la-la land. MD gets the error and tries to write the fixed sector. The SATA connection is still resetting at that point, and MD gets a *write* error, which boots that drive out of the array. >> Seems like this issue should probably make it into the timeout wiki >> page, no? Perhaps some instructions on how to query your system's >> hangcheck timeout, and thus making sure that you set your drive timeouts >> to less than that? > > Very much so. What is a "hangcheck timeout"? Maybe compiled into the kernel. I vaguely recall seeing some of this when I used to read (most of) lkml. Haven't had time for lkml in years. I'll dig around later if no-one beats me to it. Phil ^ permalink raw reply [flat|nested] 49+ messages in thread
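[Editor's note: the constraint Phil describes is an ordering of three timers that must be strictly increasing: drive recovery time < driver timeout < hung-task warning threshold. A minimal shell sketch with hard-coded illustrative values; the comments note where each number would really be read from on a live system:]

```shell
#!/bin/sh
# Illustrative sanity check of the timeout ordering discussed in this thread.
# The values below are examples, NOT read from real hardware; on a live
# system they would come from:
#   smartctl -l scterc /dev/sdX              (drive's error-recovery limit, if supported)
#   cat /sys/block/sdX/device/timeout        (kernel SCSI command timeout)
#   sysctl -n kernel.hung_task_timeout_secs  (hung-task warning threshold)
drive_recovery_s=120   # desktop drive without SCT ERC may retry for ~2 minutes
driver_timeout_s=180   # must exceed the drive's worst-case recovery time
hung_task_warn_s=240   # ideally longer still, to silence the warnings above

result=ok
[ "$driver_timeout_s" -gt "$drive_recovery_s" ] || result=broken
[ "$hung_task_warn_s" -gt "$driver_timeout_s" ] || result=broken
echo "timeout ordering: $result"
```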
* Re: SMART detects pending sectors; take offline? 2017-10-12 13:04 ` Phil Turmel @ 2017-10-12 13:16 ` Alexander Shenkin 2017-10-12 13:21 ` Mark Knecht 0 siblings, 1 reply; 49+ messages in thread From: Alexander Shenkin @ 2017-10-12 13:16 UTC (permalink / raw) To: Phil Turmel, Wols Lists, Reindl Harald, Carsten Aulbert, linux-raid On 10/12/2017 2:04 PM, Phil Turmel wrote: > On 10/12/2017 07:01 AM, Wols Lists wrote: >> On 12/10/17 10:50, Alexander Shenkin wrote: >>> Thanks Phil... Googling around, I haven't found a way to change it >>> either, but then again, I'm not really sure what to search for. >>> >>> What about changing my default disk timeout to something less than 120 >>> secs? Say, 100 secs instead of 180? > > Nope. The number has to be longer than the actual longest timeout of > your drive, which we now know is >120. When I first investigated this > phenomenon years ago, I picked 120 for my timeouts. Other reports > reached the list with the need for longer, and the recommendation for > 180 was chosen. > > If the driver times out, it resets the SATA connection while the drive > is still in la-la land. MD gets the error and tries to write the fixed > sector. The SATA connection is still resetting at that point, and MD > gets a *write* error, which boots that drive out of the array. Thanks Phil. Lots of questions in my head, but all rather newbie-ish and don't want to bother folks, so I'll just wait till you experts hash it out and then will follow recommendations... thanks, allie ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-12 13:16 ` Alexander Shenkin @ 2017-10-12 13:21 ` Mark Knecht 2017-10-12 15:16 ` Edward Kuns 0 siblings, 1 reply; 49+ messages in thread From: Mark Knecht @ 2017-10-12 13:21 UTC (permalink / raw) To: Alexander Shenkin Cc: Phil Turmel, Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID On Thu, Oct 12, 2017 at 6:16 AM, Alexander Shenkin <al@shenkin.org> wrote: > On 10/12/2017 2:04 PM, Phil Turmel wrote: >> >> On 10/12/2017 07:01 AM, Wols Lists wrote: >>> >>> On 12/10/17 10:50, Alexander Shenkin wrote: >>>> >>>> Thanks Phil... Googling around, I haven't found a way to change it >>>> either, but then again, I'm not really sure what to search for. >>>> >>>> What about changing my default disk timeout to something less than 120 >>>> secs? Say, 100 secs instead of 180? >> >> >> Nope. The number has to be longer than the actual longest timeout of >> your drive, which we now know is >120. When I first investigated this >> phenomenon years ago, I picked 120 for my timeouts. Other reports >> reached the list with the need for longer, and the recommendation for >> 180 was chosen. >> >> If the driver times out, it resets the SATA connection while the drive >> is still in la-la land. MD gets the error and tries to write the fixed >> sector. The SATA connection is still resetting at that point, and MD >> gets a *write* error, which boots that drive out of the array. > > > Thanks Phil. Lots of questions in my head, but all rather newbie-ish and > don't want to bother folks, so I'll just wait till you experts hash it out > and then will follow recommendations... > > thanks, > allie Not an expert here but on my Gentoo systems all running kernel 4.12.12 I have the hangcheck timer disabled. Using it does not appear to be a hard requirement. - Mark ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-12 13:21 ` Mark Knecht @ 2017-10-12 15:16 ` Edward Kuns 2017-10-12 15:52 ` Edward Kuns 0 siblings, 1 reply; 49+ messages in thread From: Edward Kuns @ 2017-10-12 15:16 UTC (permalink / raw) To: Mark Knecht Cc: Alexander Shenkin, Phil Turmel, Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID All y'all referring to a whole separate kernel module, hangcheck-timer.ko? If so, it appears that you can set the timeouts (there is more than one) via kernel parameters. I found this, which has a long comment at the top explaining what it does: https://github.com/spotify/linux/blob/master/drivers/char/hangcheck-timer.c Here's the comment (reformatted): The hangcheck-timer driver uses the TSC to catch delays that jiffies does not notice. A timer is set. When the timer fires, it checks whether it was delayed and if that delay exceeds a given margin of error. The hangcheck_tick module parameter takes the timer duration in seconds. The hangcheck_margin parameter defines the margin of error, in seconds. The defaults are 60 seconds for the timer and 180 seconds for the margin of error. IOW, a timer is set for 60 seconds. When the timer fires, the callback checks the actual duration that the timer waited. If the duration exceeds the alloted time and margin (here 60 + 180, or 240 seconds), the machine is restarted. A healthy machine will have the duration match the expected timeout very closely. There are four parameters to this kernel module: MODULE_PARM_DESC(hangcheck_tick, "Timer delay."); MODULE_PARM_DESC(hangcheck_margin, "If the hangcheck timer has been delayed more than hangcheck_margin seconds, the driver will fire."); MODULE_PARM_DESC(hangcheck_reboot, "If nonzero, the machine will reboot when the timer margin is exceeded."); MODULE_PARM_DESC(hangcheck_dump_tasks, "If nonzero, the machine will dump the system task state when the timer margin is exceeded."); The first two are times measured in seconds. 
hangcheck_tick defaults to 180 seconds and hangcheck_margin defaults to 60 seconds, at least in the Spotify kernel version I found on github. Eddie ^ permalink raw reply [flat|nested] 49+ messages in thread
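[Editor's note: if that optional module were in use, its parameters would normally be set at load time. A hypothetical /etc/modprobe.d fragment; the filename and values are illustrative only, and this module is likely not the source of the warnings in this thread:]

```
# /etc/modprobe.d/hangcheck-timer.conf -- hypothetical example, relevant only
# if the optional hangcheck-timer module is loaded at all.
# hangcheck_reboot=0 keeps it from rebooting the machine when the margin
# is exceeded.
options hangcheck-timer hangcheck_tick=60 hangcheck_margin=180 hangcheck_reboot=0
```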
* Re: SMART detects pending sectors; take offline? 2017-10-12 15:16 ` Edward Kuns @ 2017-10-12 15:52 ` Edward Kuns 2017-10-15 14:41 ` Alexander Shenkin 2017-12-18 15:51 ` Alexander Shenkin 0 siblings, 2 replies; 49+ messages in thread From: Edward Kuns @ 2017-10-12 15:52 UTC (permalink / raw) To: Mark Knecht Cc: Alexander Shenkin, Phil Turmel, Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID On Thu, Oct 12, 2017 at 10:16 AM, Edward Kuns <eddie.kuns@gmail.com> wrote: > All y'all referring to a whole separate kernel module, hangcheck-timer.ko? Looking back at the original messages: [4038193.380403] INFO: task md2_raid5:247 blocked for more than 120 seconds. [4038193.380473] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu [4038193.380526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. it looks like you're dealing with this part of the kernel: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/hung_task.c The timer is configurable with sysctl and defaults to 120 seconds. You can check with this command: $ sudo sysctl kernel.hung_task_timeout_secs kernel.hung_task_timeout_secs = 120 You can adjust it temporarily (e.g. to make it longer): $ sudo sysctl -w kernel.hung_task_timeout_secs=150 Or you can adjust it permanently by modifying your sysctl configuration. It looks like by default it will only warn ten times. After that it will stop complaining. That is also configurable via sysctl. Eddie ^ permalink raw reply [flat|nested] 49+ messages in thread
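[Editor's note: to make Eddie's temporary `sysctl -w` change survive a reboot, the same key can be placed in a sysctl configuration file. A sketch; the filename and the 300-second value are illustrative, chosen to exceed the 180-second driver timeout discussed earlier in the thread:]

```
# /etc/sysctl.d/90-hung-task.conf -- illustrative.  Raise the hung-task
# warning threshold above the 180 s SCSI driver timeout so that long drive
# error recovery no longer triggers these (warning-only) messages.
kernel.hung_task_timeout_secs = 300
```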
* Re: SMART detects pending sectors; take offline? 2017-10-12 15:52 ` Edward Kuns @ 2017-10-15 14:41 ` Alexander Shenkin 2017-12-18 15:51 ` Alexander Shenkin 1 sibling, 0 replies; 49+ messages in thread From: Alexander Shenkin @ 2017-10-15 14:41 UTC (permalink / raw) To: Edward Kuns, Mark Knecht Cc: Phil Turmel, Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID Hi all, Thanks for all the feedback on this issue. I'm wondering if there's any consensus about what should be done here... Should I push the kernel timeout to something more than the 180 seconds set in the timeout script? I'm not clear on all the timers in play here (drive timeout, kernel timeout, others?), so not sure how the config should be set so they don't end up conflicting... Hopefully the minds gathered here can chart the best path forward. Thanks again for all the attention... hopefully this can help others in the future... thanks, allie On 10/12/2017 4:52 PM, Edward Kuns wrote: > On Thu, Oct 12, 2017 at 10:16 AM, Edward Kuns <eddie.kuns@gmail.com> wrote: >> All y'all referring to a whole separate kernel module, hangcheck-timer.ko? > > Looking back at the original messages: > > [4038193.380403] INFO: task md2_raid5:247 blocked for more than 120 seconds. > [4038193.380473] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu > [4038193.380526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > > it looks like you're dealing with this part of the kernel: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/hung_task.c > > The timer is configurable with sysctl and defaults to 120 seconds. > You can check with this command: > > $ sudo sysctl kernel.hung_task_timeout_secs > kernel.hung_task_timeout_secs = 120 > > You can adjust it temporarily (e.g. to make it longer): > > $ sudo sysctl -w kernel.hung_task_timeout_secs=150 > > Or you can adjust it permanently by modifying your sysctl configuration. 
> > It looks like by default it will only warn ten times. After that it > will stop complaining. That is also configurable via sysctl. > > Eddie > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-12 15:52 ` Edward Kuns 2017-10-15 14:41 ` Alexander Shenkin @ 2017-12-18 15:51 ` Alexander Shenkin 2017-12-18 16:09 ` Phil Turmel 1 sibling, 1 reply; 49+ messages in thread From: Alexander Shenkin @ 2017-12-18 15:51 UTC (permalink / raw) To: Edward Kuns, Mark Knecht Cc: Phil Turmel, Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID Hi all, I'm getting back to this now that I'll have time, apologies for the delay. So, is the following correct in the case of a read error? 1) System tries to read an unreadable sector 2) Drive timeout reports unreadable based on drive timeout setting. 2a) In this case, mdadm sees the sector is unreadable and rewrites it elsewhere on that drive. 3) If linux hangcheck timer runs out before the drive timeout, then linux aborts the read, logs an error, and mdadm isn't given a chance to rewrite elsewhere based on checksums. I'm not sure how the linux io timeout fits in here, and how it's different from the hangcheck timer. Given all this, it seems to me that I should now set the hangcheck timer to something greater than drive timeout (180 seconds). Does that sound right? Otherwise, linux will kill the rewrite again, no? Thanks, Allie On 10/12/2017 4:52 PM, Edward Kuns wrote: > On Thu, Oct 12, 2017 at 10:16 AM, Edward Kuns <eddie.kuns@gmail.com> wrote: >> All y'all referring to a whole separate kernel module, hangcheck-timer.ko? > > Looking back at the original messages: > > [4038193.380403] INFO: task md2_raid5:247 blocked for more than 120 seconds. > [4038193.380473] Not tainted 4.4.0-81-generic #104~14.04.1-Ubuntu > [4038193.380526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > > it looks like you're dealing with this part of the kernel: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/hung_task.c > > The timer is configurable with sysctl and defaults to 120 seconds. 
> You can check with this command: > > $ sudo sysctl kernel.hung_task_timeout_secs > kernel.hung_task_timeout_secs = 120 > > You can adjust it temporarily (e.g. to make it longer): > > $ sudo sysctl -w kernel.hung_task_timeout_secs=150 > > Or you can adjust it permanently by modifying your sysctl configuration. > > It looks like by default it will only warn ten times. After that it > will stop complaining. That is also configurable via sysctl. > > Eddie > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-12-18 15:51 ` Alexander Shenkin @ 2017-12-18 16:09 ` Phil Turmel 2017-12-19 10:35 ` Alexander Shenkin 0 siblings, 1 reply; 49+ messages in thread From: Phil Turmel @ 2017-12-18 16:09 UTC (permalink / raw) To: Alexander Shenkin, Edward Kuns, Mark Knecht Cc: Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID Hi Alexander, On 12/18/2017 10:51 AM, Alexander Shenkin wrote: > Hi all, > > I'm getting back to this now that I'll have time, apologies for the > delay. So, is the following correct in the case of a read error? Not quite. > 1) System tries to read an unreadable sector > 2) Drive timeout reports unreadable based on drive timeout setting. > 2a) In this case, mdadm sees the sector is unreadable and rewrites it > elsewhere on that drive. No. MD reconstructs the sector from redundancy (mirror or reverse parity calc or reverse P+Q syndrome) and writes it back to the *same* sector. Since the drive firmware reported an error here, it knows to verify the write as well. If the verification fails, the drive firmware will relocate the sector in the background, invisible to the upper layers. As far as MD is concerned, that sector address is fixed either way. Relocations are handled entirely within the drive. MD does not perform or track relocations. > 3) If linux hangcheck timer runs out before the drive timeout, then > linux aborts the read, logs an error, and mdadm isn't given a chance > to rewrite elsewhere based on checksums. No. The hangcheck timer issue described in your forwarded email is unrelated. And MD doesn't use checksums. Each drive has a device driver timeout, as you note below, found at /sys/block/*/device/timeout, that linux's ATA/SCSI stack uses to cut off non-responsive controller cards and/or drives. If that timer runs out on a read before the drive reports the read error, the low level *driver* reports a read error to the MD layer. 
MD treats it the same as any other read error, locating or recomputing the sector from redundancy as above. The difference in this case is that the physical drive isn't talking to the controller (link reset in progress, typically) and the corrective rewrite of the sector (to fix or relocate within the drive) is refused, and that write error causes MD to kick out the drive. And the pending sector is also left unfixed. > Given all this, it seems to me that I should now set the hangcheck > timer to something greater than drive timeout (180 seconds). Does > that sound right? Otherwise, linux will kill the rewrite again, no? In and of itself, waiting on I/O is not a hang. So it should not be applicable. Phil ^ permalink raw reply [flat|nested] 49+ messages in thread
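The usual way to avoid the kick-out scenario Phil lays out is to make each drive give up before the driver timeout expires, or to stretch the driver timeout for drives that cannot. The following is a sketch only: the device list is illustrative, desktop drives like these Seagates typically reject the SCT ERC command, and on some smartctl versions you may need to parse the output rather than rely on the exit status:

```shell
# For each array member: try to cap the drive's internal error recovery
# at 7 seconds (SCT ERC); if the drive refuses, raise the kernel driver
# timeout instead so the drive's long retries can finish and report.
for dev in /dev/sd[ab]; do              # illustrative device list
  name="${dev##*/}"
  if smartctl -l scterc,70,70 "$dev" > /dev/null 2>&1; then
    echo "$name: SCT ERC capped at 7.0 s"
  else
    echo 180 > "/sys/block/$name/device/timeout"
    echo "$name: driver timeout raised to 180 s"
  fi
done
```

Run as root, and re-apply at every boot: neither SCT ERC settings nor the sysfs timeout persist.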
* Re: SMART detects pending sectors; take offline? 2017-12-18 16:09 ` Phil Turmel @ 2017-12-19 10:35 ` Alexander Shenkin 2017-12-19 12:02 ` Phil Turmel 0 siblings, 1 reply; 49+ messages in thread From: Alexander Shenkin @ 2017-12-19 10:35 UTC (permalink / raw) To: Phil Turmel, Edward Kuns, Mark Knecht Cc: Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID On 12/18/2017 4:09 PM, Phil Turmel wrote: > Hi Alexander, > > On 12/18/2017 10:51 AM, Alexander Shenkin wrote: >> Hi all, >> >> I'm getting back to this now that I'll have time, apologies for the >> delay. So, is the following correct in the case of a read error? > > Not quite. > >> 1) System tries to read an unreadable sector > >> 2) Drive timeout reports unreadable based on drive timeout setting. > >> 2a) In this case, mdadm sees the sector is unreadable and rewrites it >> elsewhere on that drive. > > No. MD reconstructs the sector from redundancy (mirror or reverse > parity calc or reverse P+Q syndrome) and writes it back to the *same* > sector. Since the drive firmware reported an error here, it knows to > verify the write as well. If the verification fails, the drive firmware > will relocate the sector in the background, invisible to the upper > layers. As far as MD is concerned, that sector address is fixed either > way. Relocations are handled entirely within the drive. MD does not > perform or track relocations. > >> 3) If linux hangcheck timer runs out before the drive timeout, then >> linux aborts the read, logs an error, and mdadm isn't given a chance >> to rewrite elsewhere based on checksums. > > No. The hangcheck timer issue described in your forwarded email is > unrelated. And MD doesn't use checksums. > > Each drive has a device driver timeout, as you note below, found at > /sys/block/*/device/timeout, that linux's ATA/SCSI stack uses to cut off > non-responsive controller cards and/or drives. 
If that timer runs out > on a read before the drive reports the read error, the low level > *driver* reports a read error to the MD layer. MD treats it the same as > any other read error, locating or recomputing the sector from redundancy > as above. The difference in this case is that the physical drive isn't > talking to the controller (link reset in progress, typically) and the > corrective rewrite of the sector (to fix or relocate within the drive) > is refused, and that write error causes MD to kick out the drive. And > the pending sector is also left unfixed. > >> Given all this, it seems to me that I should now set the hangcheck >> timer to something greater than drive timeout (180 seconds). Does >> that sound right? Otherwise, linux will kill the rewrite again, no? > > In and of itself, waiting on I/O is not a hang. So it should not be > applicable. Ok, so, it's now my understanding that I would normally be ok, having set the driver timeout to 180 secs (thus giving time for the seagate drive to report the read error back up to the MD layer before 180 secs is up). In my case, however, the kernel hangcheck timer is interrupting the process (md?) that is waiting on the sector read at 120 secs. Therefore, the writeback doesn't happen. Thus, I should set the hangcheck to something > 120 (say, 180 secs - should it be >180 to let the driver timeout first?). Does this sound correct? Apologies if I'm repeating info from before - just trying to be sure about what I'm doing before I go ahead and do it. If that's correct, I'll add the following line in /etc/sysctl.conf: kernel.hung_task_timeout_secs = 180 I'll make sure the setting has taken, and then I'll run: sudo /usr/share/mdadm/checkarray --idle --all Thanks, Allie ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-12-19 10:35 ` Alexander Shenkin @ 2017-12-19 12:02 ` Phil Turmel 2017-12-21 11:28 ` Alexander Shenkin 0 siblings, 1 reply; 49+ messages in thread From: Phil Turmel @ 2017-12-19 12:02 UTC (permalink / raw) To: Alexander Shenkin, Edward Kuns, Mark Knecht Cc: Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID On 12/19/2017 05:35 AM, Alexander Shenkin wrote: > Ok, so, it's now my understanding that I would normally be ok, having > set the driver timeout to 180 secs (thus giving time for the seagate > drive to report the read error back up to the MD layer before 180 secs > is up). In my case, however, the kernel hangcheck timer is interrupting > the process (md?) that is waiting on the sector read at 120 secs. > Therefore, the writeback doesn't happen. Yes. I think this behavior is a bug, and you need to work around it. > Thus, I should set the hangcheck to something > 120 (say, 180 secs - > should it be >180 to let the driver timeout first?). Does this sound > correct? Apologies if I'm repeating info from before - just trying to > be sure about what I'm doing before I go ahead and do it. > > If that's correct, I'll add the following line in /etc/sysctl.conf: > > kernel.hung_task_timeout_secs = 180 Yes. For your kernel. > I'll make sure the setting has taken, and then I'll run: > > sudo /usr/share/mdadm/checkarray --idle --all Makes sense. Please report your results for posterity when the scrub is done. Phil ^ permalink raw reply [flat|nested] 49+ messages in thread
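While the scrub Phil asks about is running, its progress and outcome can be watched read-only; a sketch using the array name from this thread:

```shell
cat /proc/mdstat                      # per-array progress while checking
cat /sys/block/md2/md/sync_action     # "check" during the scrub, "idle" after
cat /sys/block/md2/md/mismatch_cnt    # non-zero means inconsistencies were found
```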
* Re: SMART detects pending sectors; take offline? 2017-12-19 12:02 ` Phil Turmel @ 2017-12-21 11:28 ` Alexander Shenkin 2017-12-21 11:38 ` Reindl Harald 0 siblings, 1 reply; 49+ messages in thread From: Alexander Shenkin @ 2017-12-21 11:28 UTC (permalink / raw) To: Phil Turmel, Edward Kuns, Mark Knecht Cc: Wols Lists, Reindl Harald, Carsten Aulbert, Linux-RAID [-- Attachment #1: Type: text/plain, Size: 2785 bytes --] Hi all, Reporting back after changing the hangcheck timer to 180 secs and re-running checkarray. I got a number of rebuild events (see syslog excerpts below and attached), and I see no signs of the hangcheck issue in dmesg like I did last time. I'm still getting the SMART OfflineUncorrectableSector and CurrentPendingSector errors, however. Should those go away if the rewrites were correctly carried out by the drive? Any thoughts on next steps to verify everything is ok? Thanks, Allie user@machine:/var/log$ cat syslog | grep Rebuild Dec 19 12:48:18 machine mdadm[23296]: RebuildStarted event detected on md device /dev/md/0 Dec 19 12:48:41 machine mdadm[23296]: Rebuild99 event detected on md device /dev/md/0 Dec 19 12:48:41 machine mdadm[23296]: RebuildStarted event detected on md device /dev/md/2 Dec 19 12:48:41 machine mdadm[23296]: RebuildFinished event detected on md device /dev/md/0 Dec 19 14:12:02 machine mdadm[23296]: Rebuild22 event detected on md device /dev/md/2 Dec 19 15:18:42 machine mdadm[23296]: Rebuild41 event detected on md device /dev/md/2 Dec 19 16:42:02 machine mdadm[23296]: Rebuild62 event detected on md device /dev/md/2 Dec 19 18:05:23 machine mdadm[23296]: Rebuild80 event detected on md device /dev/md/2 Dec 19 20:02:09 machine mdadm[23296]: RebuildFinished event detected on md device /dev/md/2 On 12/19/2017 12:02 PM, Phil Turmel wrote: > On 12/19/2017 05:35 AM, Alexander Shenkin wrote: > >> Ok, so, it's now my understanding that I would normally be ok, having >> set the driver timeout to 180 secs (thus giving time for the seagate >> 
drive to report the read error back up to the MD layer before 180 secs >> is up). In my case, however, the kernel hangcheck timer is interrupting >> the process (md?) that is waiting on the sector read at 120 secs. >> Therefore, the writeback doesn't happen. > > Yes. I think this behavior is a bug, and you need to work around it. > >> Thus, I should set the hangcheck to something > 120 (say, 180 secs - >> should it be >180 to let the driver timeout first?). Does this sound >> correct? Apologies if I'm repeating info from before - just trying to >> be sure about what I'm doing before I go ahead and do it. >> >> If that's correct, I'll add the following line in /etc/sysctl.conf: >> >> kernel.hung_task_timeout_secs = 180 > > Yes. For your kernel. > >> I'll make sure the setting has taken, and then I'll run: >> >> sudo /usr/share/mdadm/checkarray --idle --all > > Makes sense. Please report your results for posterity when the scrub is > done. > > Phil > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > [-- Attachment #2: syslog --] [-- Type: text/plain, Size: 17325 bytes --] Dec 19 12:48:18 machinename mdadm[23296]: RebuildStarted event detected on md device /dev/md/0 Dec 19 12:48:19 machinename kernel: [1057980.859389] md: data-check of RAID array md0 Dec 19 12:48:19 machinename kernel: [1057980.859396] md: minimum _guaranteed_ speed: 1000 KB/sec/disk. Dec 19 12:48:19 machinename kernel: [1057980.859399] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check. Dec 19 12:48:19 machinename kernel: [1057980.859410] md: using 128k window, over a total of 1950656k. Dec 19 12:48:19 machinename kernel: [1057981.785802] md: delaying data-check of md2 until md0 has finished (they share one or more physical units) Dec 19 12:48:41 machinename kernel: [1058004.462362] md: md0: data-check done. 
Dec 19 12:48:41 machinename kernel: [1058004.472905] md: data-check of RAID array md2 Dec 19 12:48:41 machinename kernel: [1058004.472910] md: minimum _guaranteed_ speed: 1000 KB/sec/disk. Dec 19 12:48:41 machinename kernel: [1058004.472911] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check. Dec 19 12:48:41 machinename kernel: [1058004.472917] md: using 128k window, over a total of 2920188928k. Dec 19 12:48:41 machinename mdadm[23296]: Rebuild99 event detected on md device /dev/md/0 Dec 19 12:48:41 machinename mdadm[23296]: RebuildStarted event detected on md device /dev/md/2 Dec 19 12:48:41 machinename mdadm[23296]: RebuildFinished event detected on md device /dev/md/0 Dec 19 12:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 12:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 12:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 104 Dec 19 12:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 54 to 53 Dec 19 12:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 46 to 47 Dec 19 12:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 12:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 13:09:01 machinename CRON[677]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 13:17:01 machinename CRON[1297]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly) Dec 19 13:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 13:25:45 
machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 13:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 53 to 50 Dec 19 13:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 47 to 50 Dec 19 13:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 13:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 13:39:01 machinename CRON[2821]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 13:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 13:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 13:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 104 to 119 Dec 19 13:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 50 to 49 Dec 19 13:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 50 to 51 Dec 19 13:55:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 13:55:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 14:09:16 machinename CRON[5303]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 14:12:02 machinename mdadm[23296]: Rebuild22 event detected on md device /dev/md/2 Dec 19 14:17:03 machinename CRON[5784]: (root) CMD ( cd / && run-parts --report 
/etc/cron.hourly) Dec 19 14:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 14:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 14:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 119 to 120 Dec 19 14:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 49 to 48 Dec 19 14:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 51 to 52 Dec 19 14:25:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 14:25:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 14:39:01 machinename CRON[7066]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 14:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 14:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 14:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 120 to 105 Dec 19 14:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 14:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 15:09:01 machinename CRON[10105]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 15:17:01 machinename CRON[10868]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly) Dec 19 15:18:42 machinename mdadm[23296]: Rebuild41 event 
detected on md device /dev/md/2 Dec 19 15:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 15:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 15:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 105 to 119 Dec 19 15:25:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 15:25:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 15:39:02 machinename CRON[12435]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 15:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 15:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 15:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 48 to 47 Dec 19 15:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 52 to 53 Dec 19 15:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 15:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 16:09:01 machinename CRON[15876]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 16:14:28 machinename dhclient: DHCPREQUEST of 192.168.x.x on eth0 to 192.168.1.1 port 67 (xid=0x294df4f0) Dec 19 16:14:28 machinename dhclient: DHCPACK of 192.168.x.x from 192.168.1.1 Dec 19 16:14:35 machinename dhclient: bound to 192.168.x.x -- renewal in 37501 seconds. 
Dec 19 16:17:02 machinename CRON[16614]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly) Dec 19 16:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 16:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 16:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 119 to 118 Dec 19 16:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 16:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 16:39:01 machinename CRON[18221]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 16:42:02 machinename mdadm[23296]: Rebuild62 event detected on md device /dev/md/2 Dec 19 16:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 16:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 16:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 118 to 116 Dec 19 16:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 16:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 17:09:02 machinename CRON[21316]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 17:17:03 machinename CRON[21836]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly) Dec 19 17:25:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 17:25:46 machinename smartd[2151]: Device: /dev/sda 
[SAT], 8 Offline uncorrectable sectors Dec 19 17:25:46 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 108 Dec 19 17:25:47 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 17:25:47 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 17:39:05 machinename CRON[23185]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 17:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 17:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 17:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 108 to 117 Dec 19 17:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 47 to 48 Dec 19 17:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 53 to 52 Dec 19 17:55:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 17:55:46 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 18:05:23 machinename mdadm[23296]: Rebuild80 event detected on md device /dev/md/2 Dec 19 18:09:01 machinename CRON[25890]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 18:17:01 machinename CRON[26421]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly) Dec 19 18:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 18:25:45 machinename smartd[2151]: Device: 
/dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 18:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 120 Dec 19 18:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 48 to 47 Dec 19 18:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 52 to 53 Dec 19 18:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 18:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 18:39:12 machinename CRON[27738]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 18:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 18:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 18:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 120 to 114 Dec 19 18:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 18:55:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 19:09:16 machinename CRON[31805]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 19:17:02 machinename CRON[322]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly) Dec 19 19:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 19:25:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 19:25:45 machinename 
smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 114 to 117 Dec 19 19:25:47 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 19:25:47 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 19:39:01 machinename CRON[1738]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 19:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 19:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 19:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 47 to 48 Dec 19 19:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 53 to 52 Dec 19 19:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 19:55:45 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 20:02:08 machinename kernel: [1084008.008663] md: md2: data-check done. 
Dec 19 20:02:09 machinename mdadm[23296]: RebuildFinished event detected on md device /dev/md/2 Dec 19 20:09:01 machinename CRON[4780]: (root) CMD ( [ -x /usr/lib/php5/maxlifetime ] && [ -x /usr/lib/php5/sessionclean ] && [ -d /var/lib/php5 ] && /usr/lib/php5/sessionclean /var/lib/php5 $(/usr/lib/php5/maxlifetime)) Dec 19 20:17:01 machinename CRON[5577]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly) Dec 19 20:25:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors Dec 19 20:25:44 machinename smartd[2151]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors Dec 19 20:25:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 119 Dec 19 20:25:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 48 to 50 Dec 19 20:25:44 machinename smartd[2151]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 52 to 50 Dec 19 20:25:44 machinename smartd[2151]: Device: /dev/sda [SAT], previous self-test completed with error (read test element) Dec 19 20:25:44 machinename smartd[2151]: Device: /dev/sda [SAT], Self-Test Log error count increased from 19 to 20 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-12-21 11:28 ` Alexander Shenkin @ 2017-12-21 11:38 ` Reindl Harald 2017-12-23 3:14 ` Brad Campbell 0 siblings, 1 reply; 49+ messages in thread From: Reindl Harald @ 2017-12-21 11:38 UTC (permalink / raw) To: Alexander Shenkin, Phil Turmel, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID

On 21.12.2017 12:28, Alexander Shenkin wrote:
> Hi all,
>
> Reporting back after changing the hangcheck timer to 180 secs and
> re-running checkarray.  I got a number of rebuild events (see syslog
> excerpts below and attached), and I see no signs of the hangcheck issue
> in dmesg like I did last time.
>
> I'm still getting the SMART OfflineUncorrectableSector and
> CurrentPendingSector errors, however.  Should those go away if the
> rewrites were correctly carried out by the drive?  Any thoughts on next
> steps to verify everything is ok?

OfflineUncorrectableSector is unlikely to go away on its own.

For CurrentPendingSector, see https://kb.acronis.com/content/9133

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-12-21 11:38 ` Reindl Harald @ 2017-12-23 3:14 ` Brad Campbell 2018-01-03 12:44 ` Alexander Shenkin 0 siblings, 1 reply; 49+ messages in thread From: Brad Campbell @ 2017-12-23 3:14 UTC (permalink / raw) To: Reindl Harald, Alexander Shenkin, Phil Turmel, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID On 21/12/17 19:38, Reindl Harald wrote: > > > Am 21.12.2017 um 12:28 schrieb Alexander Shenkin: >> Hi all, >> >> Reporting back after changing the hangcheck timer to 180 secs and >> re-running checkarray. I got a number of rebuild events (see syslog >> excerpts below and attached), and I see no signs of the hangcheck >> issue in dmesg like I did last time. >> >> I'm still getting the SMART OfflineUncorrectableSector and >> CurrentPendingSector errors, however. Should those go away if the >> rewrites were correctly carried out by the drive? Any thoughts on >> next steps to verify everything is ok? > > OfflineUncorrectableSector unlikely can go away > > CurrentPendingSector > https://kb.acronis.com/content/9133 If they've been re-written (so are no longer pending) then a SMART long or possibly offline test will make them go away. I use SMART long myself. Regards, Brad ^ permalink raw reply [flat|nested] 49+ messages in thread
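[Editor's note: for reference, the self-tests Brad refers to can be started and inspected with smartctl. The device path is just an example, and an extended (long) test on a 3 TB drive typically runs for several hours.]

```shell
# Start an extended (long) self-test; it runs inside the drive's
# firmware in the background, so the command returns immediately.
smartctl -t long /dev/sda

# Alternatively, trigger an offline data-collection pass:
smartctl -t offline /dev/sda

# Once finished, inspect the self-test log and the sector counters:
smartctl -l selftest /dev/sda
smartctl -A /dev/sda | grep -E 'Current_Pending_Sector|Offline_Uncorrectable'
```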
* Re: SMART detects pending sectors; take offline? 2017-12-23 3:14 ` Brad Campbell @ 2018-01-03 12:44 ` Alexander Shenkin 2018-01-03 13:26 ` Brad Campbell 0 siblings, 1 reply; 49+ messages in thread From: Alexander Shenkin @ 2018-01-03 12:44 UTC (permalink / raw) To: Brad Campbell, Reindl Harald, Phil Turmel, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID On 12/23/2017 3:14 AM, Brad Campbell wrote: > On 21/12/17 19:38, Reindl Harald wrote: >> >> >> Am 21.12.2017 um 12:28 schrieb Alexander Shenkin: >>> Hi all, >>> >>> Reporting back after changing the hangcheck timer to 180 secs and >>> re-running checkarray. I got a number of rebuild events (see syslog >>> excerpts below and attached), and I see no signs of the hangcheck >>> issue in dmesg like I did last time. >>> >>> I'm still getting the SMART OfflineUncorrectableSector and >>> CurrentPendingSector errors, however. Should those go away if the >>> rewrites were correctly carried out by the drive? Any thoughts on >>> next steps to verify everything is ok? >> >> OfflineUncorrectableSector unlikely can go away >> >> CurrentPendingSector >> https://kb.acronis.com/content/9133 > > If they've been re-written (so are no longer pending) then a SMART long > or possibly offline test will make them go away. I use SMART long myself. > Thanks Brad. I'm running a long test now, but I believe I have the system set up to run long tests regularly, and the issue hasn't been fixed. Furthermore, strangely, the reallocated sector count still sits at 0 (see below). If these errors had been properly handled by the drive, shouldn't Reallocated_Sector_Ct sit at least at 8? 
Thanks, Allie

user@machine:~$ sudo smartctl -a /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-4.4.0-97-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-9YN166
Serial Number:    Z1F13FBA
LU WWN Device Id: 5 000c50 04e444ab1
Firmware Version: CC4B
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Jan 3 12:30:04 2018 GMT

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/223651en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                 (  592) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 335) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       190170288
  3 Spin_Up_Time            0x0003   093   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       49
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       175521338
  9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       15266
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       73
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       3 3 3
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   049   040   045    Old_age   Always   In_the_past 51 (108 124 54 26 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       54
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       114
194 Temperature_Celsius     0x0022   051   060   000    Old_age   Always       -       51 (0 14 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       9657h+43m+05.288s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       178878793257480
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       134902761417217

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description     Status                                 Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline     Self-test routine in progress          90%        15266            -
# 2  Conveyance offline   Completed: read failure                80%        15160            -
# 3  Extended offline     Completed: read failure                10%        15152            -
# 4  Conveyance offline   Completed: read failure                70%        14992            -
# 5  Extended offline     Completed: read failure                10%        14986            -
# 6  Short offline        Completed: read failure                60%        14913            -
# 7  Conveyance offline   Completed: read failure                70%        14824            -
# 8  Extended offline     Completed: read failure                10%        14818            -
# 9  Conveyance offline   Completed: read failure                80%        14656            -
#10  Extended offline     Completed: read failure                10%        14649            -
#11  Conveyance offline   Completed: read failure                80%        14489            -
#12  Extended offline     Completed: read failure                10%        14482            -
#13  Conveyance offline   Completed: read failure                80%        14321            -
#14  Extended offline     Completed: read failure                10%        14314            -
#15  Conveyance offline   Completed: read failure                80%        14153            -
#16  Extended offline     Completed: read failure                10%        14145            -
#17  Conveyance offline   Completed: read failure                70%        13985            -
#18  Extended offline     Completed: read failure                10%        13977            -
#19  Conveyance offline   Completed: read failure                70%        13817            -
#20  Extended offline     Completed: read failure                10%        13809            -
#21  Conveyance offline   Completed without error                00%        13648            -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2018-01-03 12:44 ` Alexander Shenkin @ 2018-01-03 13:26 ` Brad Campbell 2018-01-03 13:50 ` Alexander Shenkin 0 siblings, 1 reply; 49+ messages in thread From: Brad Campbell @ 2018-01-03 13:26 UTC (permalink / raw) To: Alexander Shenkin, Reindl Harald, Phil Turmel, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID On 03/01/18 20:44, Alexander Shenkin wrote: > On 12/23/2017 3:14 AM, Brad Campbell wrote: >> On 21/12/17 19:38, Reindl Harald wrote: >>> >>> >>> Am 21.12.2017 um 12:28 schrieb Alexander Shenkin: >>>> Hi all, >>>> >>>> Reporting back after changing the hangcheck timer to 180 secs and >>>> re-running checkarray. I got a number of rebuild events (see >>>> syslog excerpts below and attached), and I see no signs of the >>>> hangcheck issue in dmesg like I did last time. >>>> >>>> I'm still getting the SMART OfflineUncorrectableSector and >>>> CurrentPendingSector errors, however. Should those go away if the >>>> rewrites were correctly carried out by the drive? Any thoughts on >>>> next steps to verify everything is ok? >>> >>> OfflineUncorrectableSector unlikely can go away >>> >>> CurrentPendingSector >>> https://kb.acronis.com/content/9133 >> >> If they've been re-written (so are no longer pending) then a SMART >> long or possibly offline test will make them go away. I use SMART >> long myself. >> > > Thanks Brad. I'm running a long test now, but I believe I have the > system set up to run long tests regularly, and the issue hasn't been > fixed. Furthermore, strangely, the reallocated sector count still > sits at 0 (see below). If these errors had been properly handled by > the drive, shouldn't Reallocated_Sector_Ct sit at least at 8? Nope. Your pending is still at 8, so you've got bad sectors in an area of the drive that hasn't been dealt with. What is "interesting" is that your SMART test results don't list the LBA of the first failure. Disappointing behaviour on the part of the disk. 
They are within the 1st 10% of the drive however, so it wouldn't surprise me if they were in an unused portion of the RAID superblock area. Regards, -- Dolphins are so intelligent that within a few weeks they can train Americans to stand at the edge of the pool and throw them fish. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2018-01-03 13:26 ` Brad Campbell @ 2018-01-03 13:50 ` Alexander Shenkin 2018-01-03 15:53 ` Phil Turmel 0 siblings, 1 reply; 49+ messages in thread From: Alexander Shenkin @ 2018-01-03 13:50 UTC (permalink / raw) To: Brad Campbell, Reindl Harald, Phil Turmel, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID On 1/3/2018 1:26 PM, Brad Campbell wrote: > > > On 03/01/18 20:44, Alexander Shenkin wrote: >> On 12/23/2017 3:14 AM, Brad Campbell wrote: >>> On 21/12/17 19:38, Reindl Harald wrote: >>>> >>>> >>>> Am 21.12.2017 um 12:28 schrieb Alexander Shenkin: >>>>> Hi all, >>>>> >>>>> Reporting back after changing the hangcheck timer to 180 secs and >>>>> re-running checkarray. I got a number of rebuild events (see >>>>> syslog excerpts below and attached), and I see no signs of the >>>>> hangcheck issue in dmesg like I did last time. >>>>> >>>>> I'm still getting the SMART OfflineUncorrectableSector and >>>>> CurrentPendingSector errors, however. Should those go away if the >>>>> rewrites were correctly carried out by the drive? Any thoughts on >>>>> next steps to verify everything is ok? >>>> >>>> OfflineUncorrectableSector unlikely can go away >>>> >>>> CurrentPendingSector >>>> https://kb.acronis.com/content/9133 >>> >>> If they've been re-written (so are no longer pending) then a SMART >>> long or possibly offline test will make them go away. I use SMART >>> long myself. >>> >> >> Thanks Brad. I'm running a long test now, but I believe I have the >> system set up to run long tests regularly, and the issue hasn't been >> fixed. Furthermore, strangely, the reallocated sector count still >> sits at 0 (see below). If these errors had been properly handled by >> the drive, shouldn't Reallocated_Sector_Ct sit at least at 8? > > Nope. Your pending is still at 8, so you've got bad sectors in an area > of the drive that hasn't been dealt with. 
> What is "interesting" is that your SMART test results don't list the
> LBA of the first failure.  Disappointing behaviour on the part of the
> disk.  They are within the 1st 10% of the drive however, so it wouldn't
> surprise me if they were in an unused portion of the RAID superblock
> area.

Thanks Brad.  So, to theoretically get these sectors remapped so I don't
keep getting errors, I would have to somehow try to write to those
sectors.  That's tough given that the LBAs aren't reported, as you
mention.  Perhaps my best course of action then is to:

1) re-run sudo /usr/share/mdadm/checkarray --idle --all
2) add my previously-purchased drive to convert the RAID5 to RAID6 (using
   http://www.ewams.net/?date=2013/05/02&view=Converting_RAID5_to_RAID6_in_mdadm
   as a guide)
3) after that, fail and remove /dev/sda from the RAID6
4) write 0's on /dev/sda (dd if=/dev/zero of=/dev/sda bs=1M)
5) re-add /dev/sda to the RAID6

This should get those bad sectors remapped... thoughts?

thanks,
allie

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2018-01-03 13:50 ` Alexander Shenkin @ 2018-01-03 15:53 ` Phil Turmel 2018-01-03 15:59 ` Alexander Shenkin 0 siblings, 1 reply; 49+ messages in thread From: Phil Turmel @ 2018-01-03 15:53 UTC (permalink / raw) To: Alexander Shenkin, Brad Campbell, Reindl Harald, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID On 01/03/2018 08:50 AM, Alexander Shenkin wrote: > On 1/3/2018 1:26 PM, Brad Campbell wrote: >> Nope. Your pending is still at 8, so you've got bad sectors in an area >> of the drive that hasn't been dealt with. What is "interesting" is >> that your SMART test results don't list the LBA of the first failure. >> Disappointing behaviour on the part of the disk. They are within the >> 1st 10% of the drive however, so it wouldn't surprise me if they were >> in an unused portion of the RAID superblock area. > > Thanks Brad. So, to theoretically get these sectors remapped so I don't > keep getting errors, I would have to somehow try to write to those > sectors. That's tough given that the LBA's aren't reported as you > mention. Perhaps my best course of action then is to: No, just use dd to read that device -- it'll bail out with read error when it hits the trouble spot, which will report the affected sector. Then you can rewrite it with the appropriate seek= value. (Assuming it really is in an unused part of the member device.) Phil ^ permalink raw reply [flat|nested] 49+ messages in thread
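[Editor's note: Phil's read-then-rewrite approach can be sketched as a pair of commands. The device name is illustrative, and the write step is destructive: first verify that the reported sector is outside the array's active data area.]

```shell
# Step 1: read the whole member device; dd aborts at the first
# unreadable spot, and the kernel logs its 512-byte LBA:
#   blk_update_request: I/O error, dev sda, sector <N>
dd if=/dev/sda of=/dev/null bs=4096
dmesg | tail -n 20    # note the reported "sector <N>" value

# Step 2 (DESTRUCTIVE, left commented out on purpose): rewrite just
# that spot so the drive's firmware remaps it.  seek= is counted in
# bs-sized units, so with bs=4096 use N/8, not N itself.
# dd if=/dev/zero of=/dev/sda seek=$((N / 8)) count=1 bs=4096
```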
* Re: SMART detects pending sectors; take offline? 2018-01-03 15:53 ` Phil Turmel @ 2018-01-03 15:59 ` Alexander Shenkin 2018-01-03 16:02 ` Phil Turmel 0 siblings, 1 reply; 49+ messages in thread From: Alexander Shenkin @ 2018-01-03 15:59 UTC (permalink / raw) To: Phil Turmel, Brad Campbell, Reindl Harald, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID On 1/3/2018 3:53 PM, Phil Turmel wrote: > On 01/03/2018 08:50 AM, Alexander Shenkin wrote: >> On 1/3/2018 1:26 PM, Brad Campbell wrote: > >>> Nope. Your pending is still at 8, so you've got bad sectors in an area >>> of the drive that hasn't been dealt with. What is "interesting" is >>> that your SMART test results don't list the LBA of the first failure. >>> Disappointing behaviour on the part of the disk. They are within the >>> 1st 10% of the drive however, so it wouldn't surprise me if they were >>> in an unused portion of the RAID superblock area. >> >> Thanks Brad. So, to theoretically get these sectors remapped so I don't >> keep getting errors, I would have to somehow try to write to those >> sectors. That's tough given that the LBA's aren't reported as you >> mention. Perhaps my best course of action then is to: > > No, just use dd to read that device -- it'll bail out with read error > when it hits the trouble spot, which will report the affected sector. > Then you can rewrite it with the appropriate seek= value. (Assuming it > really is in an unused part of the member device.) Thanks Phil. So, just: dd if=/dev/sda of=/dev/null bs=512 ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2018-01-03 15:59 ` Alexander Shenkin @ 2018-01-03 16:02 ` Phil Turmel 2018-01-04 10:37 ` Alexander Shenkin 0 siblings, 1 reply; 49+ messages in thread From: Phil Turmel @ 2018-01-03 16:02 UTC (permalink / raw) To: Alexander Shenkin, Brad Campbell, Reindl Harald, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID On 01/03/2018 10:59 AM, Alexander Shenkin wrote: > On 1/3/2018 3:53 PM, Phil Turmel wrote: >> On 01/03/2018 08:50 AM, Alexander Shenkin wrote: >>> On 1/3/2018 1:26 PM, Brad Campbell wrote: >> >>>> Nope. Your pending is still at 8, so you've got bad sectors in an area >>>> of the drive that hasn't been dealt with. What is "interesting" is >>>> that your SMART test results don't list the LBA of the first failure. >>>> Disappointing behaviour on the part of the disk. They are within the >>>> 1st 10% of the drive however, so it wouldn't surprise me if they were >>>> in an unused portion of the RAID superblock area. >>> >>> Thanks Brad. So, to theoretically get these sectors remapped so I don't >>> keep getting errors, I would have to somehow try to write to those >>> sectors. That's tough given that the LBA's aren't reported as you >>> mention. Perhaps my best course of action then is to: >> >> No, just use dd to read that device -- it'll bail out with read error >> when it hits the trouble spot, which will report the affected sector. >> Then you can rewrite it with the appropriate seek= value. (Assuming it >> really is in an unused part of the member device.) > > Thanks Phil. So, just: dd if=/dev/sda of=/dev/null bs=512 Yup. (-: Phil ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2018-01-03 16:02 ` Phil Turmel @ 2018-01-04 10:37 ` Alexander Shenkin 2018-01-04 12:28 ` Alexander Shenkin 0 siblings, 1 reply; 49+ messages in thread From: Alexander Shenkin @ 2018-01-04 10:37 UTC (permalink / raw) To: Phil Turmel, Brad Campbell, Reindl Harald, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID On 1/3/2018 4:02 PM, Phil Turmel wrote: > On 01/03/2018 10:59 AM, Alexander Shenkin wrote: >> On 1/3/2018 3:53 PM, Phil Turmel wrote: >>> On 01/03/2018 08:50 AM, Alexander Shenkin wrote: >>>> On 1/3/2018 1:26 PM, Brad Campbell wrote: >>> >>>>> Nope. Your pending is still at 8, so you've got bad sectors in an area >>>>> of the drive that hasn't been dealt with. What is "interesting" is >>>>> that your SMART test results don't list the LBA of the first failure. >>>>> Disappointing behaviour on the part of the disk. They are within the >>>>> 1st 10% of the drive however, so it wouldn't surprise me if they were >>>>> in an unused portion of the RAID superblock area. >>>> >>>> Thanks Brad. So, to theoretically get these sectors remapped so I don't >>>> keep getting errors, I would have to somehow try to write to those >>>> sectors. That's tough given that the LBA's aren't reported as you >>>> mention. Perhaps my best course of action then is to: >>> >>> No, just use dd to read that device -- it'll bail out with read error >>> when it hits the trouble spot, which will report the affected sector. >>> Then you can rewrite it with the appropriate seek= value. (Assuming it >>> really is in an unused part of the member device.) >> So, I got a read error as expected, running (physical sector size of sda is 4096): dd if=/dev/sda of=/dev/null bs=4096 Is there some way to tell whether this sector is considered to be in use? Not sure what the effect of rewriting it might be if it is... 
If it's safe, I'd run:

dd if=/dev/zero of=/dev/sda seek=5857843312 count=1 bs=4096

Perhaps the way to go is to write to it, and then run checkarray again?

Thanks,
Allie

syslog here:

user@machinename:~$ cat /var/log/syslog | grep sda
Jan 4 08:23:30 machinename kernel: [1330854.323854] sd 0:0:0:0: [sda] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 4 08:23:30 machinename kernel: [1330854.323861] sd 0:0:0:0: [sda] tag#16 Sense Key : Medium Error [current] [descriptor]
Jan 4 08:23:30 machinename kernel: [1330854.323867] sd 0:0:0:0: [sda] tag#16 Add. Sense: Unrecovered read error - auto reallocate failed
Jan 4 08:23:30 machinename kernel: [1330854.323873] sd 0:0:0:0: [sda] tag#16 CDB: Read(16) 88 00 00 00 00 01 5d 27 98 08 00 00 01 00 00 00
Jan 4 08:23:30 machinename kernel: [1330854.323877] blk_update_request: I/O error, dev sda, sector 5857843312
Jan 4 08:23:33 machinename kernel: [1330858.108216] sd 0:0:0:0: [sda] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 4 08:23:33 machinename kernel: [1330858.108222] sd 0:0:0:0: [sda] tag#3 Sense Key : Medium Error [current] [descriptor]
Jan 4 08:23:33 machinename kernel: [1330858.108228] sd 0:0:0:0: [sda] tag#3 Add. Sense: Unrecovered read error - auto reallocate failed
Jan 4 08:23:33 machinename kernel: [1330858.108235] sd 0:0:0:0: [sda] tag#3 CDB: Read(16) 88 00 00 00 00 01 5d 27 98 70 00 00 00 08 00 00
Jan 4 08:23:33 machinename kernel: [1330858.108239] blk_update_request: I/O error, dev sda, sector 5857843312
Jan 4 08:23:33 machinename kernel: [1330858.108297] Buffer I/O error on dev sda, logical block 732230414, async page read
Jan 4 08:42:07 machinename smartd[2203]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Jan 4 08:42:07 machinename smartd[2203]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Jan 4 08:42:07 machinename smartd[2203]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 111 to 114
Jan 4 08:42:07 machinename smartd[2203]: Device: /dev/sda [SAT], SMART Usage Attribute: 187 Reported_Uncorrect changed from 100 to 98
Jan 4 08:42:07 machinename smartd[2203]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 47 to 49
Jan 4 08:42:07 machinename smartd[2203]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 53 to 51
Jan 4 08:42:07 machinename smartd[2203]: Device: /dev/sda [SAT], ATA error count increased from 0 to 2
Jan 4 08:42:08 machinename smartd[2203]: Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors
Jan 4 08:42:08 machinename smartd[2203]: Device: /dev/sda [SAT], 8 Offline uncorrectable sectors
Jan 4 08:42:08 machinename smartd[2203]: Device: /dev/sda [SAT], ATA error count increased from 0 to 2

^ permalink raw reply [flat|nested] 49+ messages in thread
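[Editor's note on units in the log above: blk_update_request reports the failing position in 512-byte sectors (5857843312), while the buffer-layer line reports a 4096-byte logical block (732230414). Both point at the same spot and differ by a factor of 8. Since dd counts seek= in bs-sized units, the conversion for a bs=4096 write is simply the sector number divided by 8:]

```shell
# The kernel logs the same failure twice, in different units:
#   blk_update_request: I/O error, dev sda, sector 5857843312   (512-byte LBA)
#   Buffer I/O error on dev sda, logical block 732230414        (4096-byte block)
# With bs=4096, dd's seek= must be given in 4096-byte units:
bad_sector=5857843312
seek_4k=$((bad_sector / 8))
echo "$seek_4k"    # prints 732230414, matching the logical block above
```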
* Re: SMART detects pending sectors; take offline? 2018-01-04 10:37 ` Alexander Shenkin @ 2018-01-04 12:28 ` Alexander Shenkin 2018-01-04 13:16 ` Brad Campbell 0 siblings, 1 reply; 49+ messages in thread From: Alexander Shenkin @ 2018-01-04 12:28 UTC (permalink / raw) To: Phil Turmel, Brad Campbell, Reindl Harald, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID >>> On 1/3/2018 3:53 PM, Phil Turmel wrote: >>>> On 01/03/2018 08:50 AM, Alexander Shenkin wrote: >>>>> On 1/3/2018 1:26 PM, Brad Campbell wrote: >>>> >>>>>> Nope. Your pending is still at 8, so you've got bad sectors in an >>>>>> area >>>>>> of the drive that hasn't been dealt with. What is "interesting" is >>>>>> that your SMART test results don't list the LBA of the first failure. >>>>>> Disappointing behaviour on the part of the disk. They are within the >>>>>> 1st 10% of the drive however, so it wouldn't surprise me if they were >>>>>> in an unused portion of the RAID superblock area. >>>>> >>>>> Thanks Brad. So, to theoretically get these sectors remapped so I >>>>> don't >>>>> keep getting errors, I would have to somehow try to write to those >>>>> sectors. That's tough given that the LBA's aren't reported as you >>>>> mention. Perhaps my best course of action then is to: >>>> >>>> No, just use dd to read that device -- it'll bail out with read error >>>> when it hits the trouble spot, which will report the affected sector. >>>> Then you can rewrite it with the appropriate seek= value. (Assuming it >>>> really is in an unused part of the member device.) Ok, an update. Writing with bs=512 had issues, as it was failing on a read, and reallocating was failing. I think this is because the physical sector size is 4096b, and it needs to read the other 7 512b logical sectors if it wants to write just 1 512b logical sector. So, I ran: sudo dd if=/dev/zero of=/dev/sda seek=732230414 count=1 bs=4096 This seems to have worked. 
In syslog, I just now saw:

Jan 4 12:12:07 machinename smartd[2203]: Device: /dev/sda [SAT], No more Currently unreadable (pending) sectors, warning condition reset after 90 emails
Jan 4 12:12:07 machinename smartd[2203]: Device: /dev/sda [SAT], No more Offline uncorrectable sectors, warning condition reset after 90 emails
Jan 4 12:12:07 machinename smartd[2203]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 113 to 117

I'm now running a checkarray and will report back final results, and
whether the SMART warnings return.  Thanks all for the help, hope this
marks the end of this issue...

Allie

^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2018-01-04 12:28 ` Alexander Shenkin @ 2018-01-04 13:16 ` Brad Campbell 2018-01-04 13:39 ` Alexander Shenkin 0 siblings, 1 reply; 49+ messages in thread From: Brad Campbell @ 2018-01-04 13:16 UTC (permalink / raw) To: Alexander Shenkin, Phil Turmel, Reindl Harald, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID On 04/01/18 20:28, Alexander Shenkin wrote: > > > Ok, an update. Writing with bs=512 had issues, as it was failing on a > read, and reallocating was failing. I think this is because the > physical sector size is 4096b, and it needs to read the other 7 512b > logical sectors if it wants to write just 1 512b logical sector. So, > I ran: > > sudo dd if=/dev/zero of=/dev/sda seek=732230414 count=1 bs=4096 > > This seems to have worked. In syslog, I just now saw: > > Jan 4 12:12:07 machinename smartd[2203]: Device: /dev/sda [SAT], No > more Currently unreadable (pending) sectors, warning condition reset > after 90 emails > Jan 4 12:12:07 machinename smartd[2203]: Device: /dev/sda [SAT], No > more Offline uncorrectable sectors, warning condition reset after 90 > emails > Jan 4 12:12:07 machinename smartd[2203]: Device: /dev/sda [SAT], > SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 113 to 117 > > I'm now running a checkarray and will report back final results, and > whether the SMART warnings return. Thanks all for the help, hope this > marks the end of this issues... > > Allie > Damn. I saw your initial mail, but I was out and nowhere near a device I could use to reply sensibly. You should *really* check by looking at the mdadm --examine output and calculating the position of the sectors in question to be absolutely sure the area you just wrote over was not in the active array area. If it was then you should stop the checkarray *now* and come back for advice. Sorry I don't have time right now to elaborate. 
Regards, -- Dolphins are so intelligent that within a few weeks they can train Americans to stand at the edge of the pool and throw them fish. ^ permalink raw reply [flat|nested] 49+ messages in thread
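[Editor's note: the check Brad asks for reduces to shell arithmetic. For a v1.2 superblock, a member's md data area starts at (partition start LBA + Data Offset), both counted in 512-byte sectors; the Data Offset comes from `mdadm --examine` and the partition start from `cat /sys/block/sda/sda3/start` or `fdisk -l`. The part_start value below is hypothetical, and the test is simplified: a sector "inside" the data area should also be checked against the partition's end.]

```shell
# Does the bad 512-byte sector fall inside the active md data area
# of the member partition?
bad_sector=5857843312   # from blk_update_request in the kernel log
part_start=16779264     # HYPOTHETICAL: real value from /sys/block/sda/sda3/start
data_offset=262144      # "Data Offset : 262144 sectors" per mdadm --examine
data_start=$((part_start + data_offset))
if [ "$bad_sector" -ge "$data_start" ]; then
    echo "inside the active array data area - do NOT blindly overwrite"
else
    echo "in the pre-data gap (superblock/bitmap area)"
fi
```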
* Re: SMART detects pending sectors; take offline? 2018-01-04 13:16 ` Brad Campbell @ 2018-01-04 13:39 ` Alexander Shenkin 2018-01-05 5:20 ` Brad Campbell 0 siblings, 1 reply; 49+ messages in thread From: Alexander Shenkin @ 2018-01-04 13:39 UTC (permalink / raw) To: Brad Campbell, Phil Turmel, Reindl Harald, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID

On 1/4/2018 1:16 PM, Brad Campbell wrote:
> Damn. I saw your initial mail, but I was out and nowhere near a device I
> could use to reply sensibly.
> You should *really* check by looking at the mdadm --examine output and
> calculating the position of the sectors in question to be absolutely
> sure the area you just wrote over was not in the active array area. If
> it was then you should stop the checkarray *now* and come back for advice.

Thanks Brad, no worries, really appreciate your attention.  I stopped
checkarray.  It had one rebuild event (Rebuild99) in /dev/md0 (small
RAID1, where /boot is mounted) before I stopped it.

Here's the examine output (not really sure what to do with it, will wait
for advice):

user@machinename:~$ sudo mdadm --examine /dev/sd*
/dev/sda:
   MBR Magic : aa55
Partition[0] : 4294967295 sectors at 1 (type ee)
/dev/sda1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 437e4abb:c7ac46f1:ef8b2976:94921060
           Name : arrayname:0
  Creation Time : Mon Dec 7 08:31:31 2015
     Raid Level : raid1
   Raid Devices : 4

 Avail Dev Size : 3901440 (1905.32 MiB 1997.54 MB)
     Array Size : 1950656 (1905.26 MiB 1997.47 MB)
  Used Dev Size : 3901312 (1905.26 MiB 1997.47 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 9cb1890b:ad675b3b:7517467f:0780ec8e

    Update Time : Thu Jan 4 12:00:45 2018
       Checksum : 682930c3 - correct
         Events : 215

   Device Role : Active device 0
   Array State : AAAA ('A' == active, '.' == missing)
mdadm: No md superblock detected on /dev/sda2.
/dev/sda3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : c7303f62:d848d424:269581c8:83a045ec
           Name : ubuntu:2
  Creation Time : Sun Feb 5 23:39:58 2017
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 5840377856 (2784.91 GiB 2990.27 GB)
     Array Size : 8760566784 (8354.73 GiB 8970.82 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 64c8db38:3d0a6895:be2259d8:4c3c3542

Internal Bitmap : 8 sectors from superblock
    Update Time : Thu Jan 4 13:34:53 2018
       Checksum : d9775efa - correct
         Events : 81186

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAA ('A' == active, '.' == missing)
mdadm: No md superblock detected on /dev/sda4.
/dev/sdb:
   MBR Magic : aa55
Partition[0] : 4294967295 sectors at 1 (type ee)
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 437e4abb:c7ac46f1:ef8b2976:94921060
           Name : arrayname:0
  Creation Time : Mon Dec 7 08:31:31 2015
     Raid Level : raid1
   Raid Devices : 4

 Avail Dev Size : 3901440 (1905.32 MiB 1997.54 MB)
     Array Size : 1950656 (1905.26 MiB 1997.47 MB)
  Used Dev Size : 3901312 (1905.26 MiB 1997.47 MB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 38999554:d1b0db8d:d8066e72:86865c31

    Update Time : Thu Jan 4 12:00:45 2018
       Checksum : 995557b1 - correct
         Events : 215

   Device Role : Active device 2
   Array State : AAAA ('A' == active, '.' == missing)
mdadm: No md superblock detected on /dev/sdb2.
/dev/sdb3: Magic : a92b4efc Version : 1.2 Feature Map : 0x1 Array UUID : c7303f62:d848d424:269581c8:83a045ec Name : ubuntu:2 Creation Time : Sun Feb 5 23:39:58 2017 Raid Level : raid5 Raid Devices : 4 Avail Dev Size : 5840377856 (2784.91 GiB 2990.27 GB) Array Size : 8760566784 (8354.73 GiB 8970.82 GB) Data Offset : 262144 sectors Super Offset : 8 sectors State : clean Device UUID : cf70dad5:0c9ff5f6:ede689f2:ccee2eb0 Internal Bitmap : 8 sectors from superblock Update Time : Thu Jan 4 13:34:53 2018 Checksum : 602e12e9 - correct Events : 81186 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 1 Array State : AAAA ('A' == active, '.' == missing) mdadm: No md superblock detected on /dev/sdb4. /dev/sdc: MBR Magic : aa55 Partition[0] : 4294967295 sectors at 1 (type ee) /dev/sdc1: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 437e4abb:c7ac46f1:ef8b2976:94921060 Name : arrayname:0 Creation Time : Mon Dec 7 08:31:31 2015 Raid Level : raid1 Raid Devices : 4 Avail Dev Size : 3901440 (1905.32 MiB 1997.54 MB) Array Size : 1950656 (1905.26 MiB 1997.47 MB) Used Dev Size : 3901312 (1905.26 MiB 1997.47 MB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : f162eae5:19f8926b:f5bb6a2a:8adbbefd Update Time : Thu Jan 4 12:00:45 2018 Checksum : 8cb8728a - correct Events : 215 Device Role : Active device 3 Array State : AAAA ('A' == active, '.' == missing) mdadm: No md superblock detected on /dev/sdc2. 
/dev/sdc3: Magic : a92b4efc Version : 1.2 Feature Map : 0x1 Array UUID : c7303f62:d848d424:269581c8:83a045ec Name : ubuntu:2 Creation Time : Sun Feb 5 23:39:58 2017 Raid Level : raid5 Raid Devices : 4 Avail Dev Size : 5840377856 (2784.91 GiB 2990.27 GB) Array Size : 8760566784 (8354.73 GiB 8970.82 GB) Data Offset : 262144 sectors Super Offset : 8 sectors State : clean Device UUID : f8839952:eaba2e9c:c2c401d4:3e0592a5 Internal Bitmap : 8 sectors from superblock Update Time : Thu Jan 4 13:34:53 2018 Checksum : 59013634 - correct Events : 81186 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 2 Array State : AAAA ('A' == active, '.' == missing) mdadm: No md superblock detected on /dev/sdc4. /dev/sdd: MBR Magic : aa55 Partition[0] : 4294967295 sectors at 1 (type ee) /dev/sdd1: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 437e4abb:c7ac46f1:ef8b2976:94921060 Name : arrayname:0 Creation Time : Mon Dec 7 08:31:31 2015 Raid Level : raid1 Raid Devices : 4 Avail Dev Size : 3901440 (1905.32 MiB 1997.54 MB) Array Size : 1950656 (1905.26 MiB 1997.47 MB) Used Dev Size : 3901312 (1905.26 MiB 1997.47 MB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : 4823f0b8:8e0d3ac3:e312c219:29f76622 Update Time : Thu Jan 4 12:00:45 2018 Checksum : cb64bae5 - correct Events : 215 Device Role : Active device 1 Array State : AAAA ('A' == active, '.' == missing) mdadm: No md superblock detected on /dev/sdd2. 
/dev/sdd3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : c7303f62:d848d424:269581c8:83a045ec
           Name : ubuntu:2
  Creation Time : Sun Feb  5 23:39:58 2017
     Raid Level : raid5
   Raid Devices : 4
 Avail Dev Size : 5840377856 (2784.91 GiB 2990.27 GB)
     Array Size : 8760566784 (8354.73 GiB 8970.82 GB)
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 875a0dbd:965a9986:1b78eb3d:e15fee50
Internal Bitmap : 8 sectors from superblock
    Update Time : Thu Jan  4 13:34:53 2018
       Checksum : c325ba6d - correct
         Events : 81186
         Layout : left-symmetric
     Chunk Size : 512K
    Device Role : Active device 3
    Array State : AAAA ('A' == active, '.' == missing)
mdadm: No md superblock detected on /dev/sdd4.

Here's mdadm --detail if needed:

user@machinename:~$ sudo mdadm --detail /dev/md2
/dev/md2:
        Version : 1.2
  Creation Time : Sun Feb  5 23:39:58 2017
     Raid Level : raid5
     Array Size : 8760566784 (8354.73 GiB 8970.82 GB)
  Used Dev Size : 2920188928 (2784.91 GiB 2990.27 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Thu Jan  4 13:31:21 2018
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 512K
           Name : ubuntu:2
           UUID : c7303f62:d848d424:269581c8:83a045ec
         Events : 81186

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       4       8       19        1      active sync   /dev/sdb3
       2       8       35        2      active sync   /dev/sdc3
       5       8       51        3      active sync   /dev/sdd3

user@machinename:~$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Mon Dec  7 08:31:31 2015
     Raid Level : raid1
     Array Size : 1950656 (1905.26 MiB 1997.47 MB)
  Used Dev Size : 1950656 (1905.26 MiB 1997.47 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent
    Update Time : Thu Jan  4 12:00:45 2018
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
           Name : arrayname:0
           UUID : 437e4abb:c7ac46f1:ef8b2976:94921060
         Events : 215

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       5       8       49        1      active sync   /dev/sdd1
       4       8       17        2      active sync   /dev/sdb1
       2       8       33        3      active sync   /dev/sdc1

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2018-01-04 13:39 ` Alexander Shenkin @ 2018-01-05 5:20 ` Brad Campbell 2018-01-05 5:25 ` Brad Campbell 0 siblings, 1 reply; 49+ messages in thread From: Brad Campbell @ 2018-01-05 5:20 UTC (permalink / raw) To: Alexander Shenkin, Phil Turmel, Reindl Harald, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID

On 04/01/18 21:39, Alexander Shenkin wrote:
> Thanks Brad, no worries, really appreciate your attention. I stopped
> checkarray. It had one rebuild event (Rebuild99) in /dev/md0 (small
> RAID1, where /boot is mounted) before I stopped it. Here's the
> examine output (not really sure what to do with it, will wait for
> advice):

Ok, so you have 4 disks with 2 md member partitions on each. You re-wrote sectors 5857843312+7 on the disk.

Without knowing the layout of your partitions it's a bit difficult, but let's make an assumption and see where it gets us. You have a partition table. Let's assume the 1st partition starts at sector 2048, as fdisk will often leave that for alignment.

The 1st partition's data offset is 2048 sectors (1M for the superblock) and it is 3901312 sectors long, so it ends at 3905408 (3901312+2048+2048). The 2nd partition's data offset is 262144 sectors and it is 5840377856 sectors long, totaling 5840640000 sectors. Add those two and we get 5844545408 sectors. So if my maths is any good, you wrote a block 13297904 sectors from the end of the data area.

Now the whole point of that was to say: if the block you wrote happens to fall in a parity area, then you are fine. Checkarray will just re-calculate the parity from the data blocks and re-write it. Your mismatch count will be 1 at the end of the operation. If, however, the block falls in a data area, running checkarray is going to use that re-written block to re-calculate the parity, and it's corrupt for good.

Now I need someone to re-check my maths, and an fdisk -l /dev/sda from you to see if I've made any glaring error.
My assessment is that the block *did* lie in the data area of the disk. If I'm right, then the only way I can see to rectify it is to pop sda out, zero the superblock and re-add it, which will rebuild the disk entirely, but that leaves you extremely vulnerable for the entire process.

Of course, if there is nothing on the filesystem at that location, or you are ok with losing a 4k chunk of a file, then this is all moot. At this point I'd be most glad to be proven incorrect.

Regards,
Brad

--
Dolphins are so intelligent that within a few weeks they can train
Americans to stand at the edge of the pool and throw them fish.

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2018-01-05 5:20 ` Brad Campbell @ 2018-01-05 5:25 ` Brad Campbell 2018-01-05 10:10 ` Alexander Shenkin 0 siblings, 1 reply; 49+ messages in thread From: Brad Campbell @ 2018-01-05 5:25 UTC (permalink / raw) To: Alexander Shenkin, Phil Turmel, Reindl Harald, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID On 05/01/18 13:20, Brad Campbell wrote: > You re-wrote Sectors 5857843312+7 on the disk. > Add those two and we get 5844545408 sectors. So if my maths is any good > you wrote a block 13297904 sectors from the end of the data area. I can't believe I did that. No, you wrote a block ~6M *after* the data area and you should be fine. I'm going to go and write a letter of apology to my primary school maths teacher now. -- Dolphins are so intelligent that within a few weeks they can train Americans to stand at the edge of the pool and throw them fish. ^ permalink raw reply [flat|nested] 49+ messages in thread
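Brad's arithmetic can be re-checked mechanically. A quick shell sketch (the sector figures are copied from the mdadm --examine output earlier in the thread, and it adopts Brad's assumption that the first partition starts at sector 2048):

```shell
# Brad's model of the data area: end of partition 1's md data, plus
# partition 2's (md data offset + device size). All figures are 512-byte sectors.
p1_end=$(( 2048 + 2048 + 3901312 ))   # partition start + md data offset + used size
p2_span=$(( 262144 + 5840377856 ))    # md data offset + avail dev size
data_end=$(( p1_end + p2_span ))
written=5857843312                    # first sector of the rewritten 4k block

echo "data area ends at sector $data_end"
echo "rewritten sector is $(( written - data_end )) sectors past that end"
```

This prints a data-area end of 5844545408 and a difference of 13297904 sectors, i.e. the rewritten block lands well *after* the data area, matching Brad's correction above.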
* Re: SMART detects pending sectors; take offline? 2018-01-05 5:25 ` Brad Campbell @ 2018-01-05 10:10 ` Alexander Shenkin 2018-01-05 10:32 ` Brad Campbell 2018-01-05 13:50 ` Phil Turmel 0 siblings, 2 replies; 49+ messages in thread From: Alexander Shenkin @ 2018-01-05 10:10 UTC (permalink / raw) To: Brad Campbell, Phil Turmel, Reindl Harald, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID

On 1/5/2018 5:25 AM, Brad Campbell wrote:
> On 05/01/18 13:20, Brad Campbell wrote:
>
>> You re-wrote Sectors 5857843312+7 on the disk.
>
>> Add those two and we get 5844545408 sectors. So if my maths is any good
>> you wrote a block 13297904 sectors from the end of the data area.
>
> I can't believe I did that. No, you wrote a block ~6M *after* the data
> area and you should be fine.
>
> I'm going to go and write a letter of apology to my primary school maths
> teacher now.

Thanks much, Brad. fdisk & parted output are below. I have swap space mounted on /dev/sda4, 15,984,640 sectors long, after the partitions used for raid. I'm not sure where exactly the parity data sits... Looks to me like this happened in swap space, no? Currently, swapon reports 552,272 kb (= 1,104,544 sectors) in use (I think). If that's contiguous, then the write should have happened after the used space (13,297,904 > 1,104,544). But I'm not sure swap is contiguous. In this case, regardless, I suspect I should just reboot, and then run checkarray to be safe?

One followup: is parity info stored in a separate area from data on the disk? If the write *had* fallen within the raid partition area, would you indeed be able to tell if it overwrote data vs parity vs both? Google wouldn't tell me...

Thanks again,
Allie

user@machinename:~$ sudo fdisk -l /dev/sda*

WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util fdisk doesn't support GPT. Use GNU Parted.
Disk /dev/sda: 3000.6 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders, total 5860533168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1  4294967295  2147483647+  ee  GPT
Partition 1 does not start on physical sector boundary.

Disk /dev/sda1: 1998 MB, 1998585856 bytes
255 heads, 63 sectors/track, 242 cylinders, total 3903488 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/sda1 doesn't contain a valid partition table

Disk /dev/sda2: 1 MB, 1048576 bytes
255 heads, 63 sectors/track, 0 cylinders, total 2048 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/sda2 doesn't contain a valid partition table

Disk /dev/sda3: 2990.4 GB, 2990407680000 bytes
255 heads, 63 sectors/track, 363563 cylinders, total 5840640000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/sda3 doesn't contain a valid partition table

Disk /dev/sda4: 8184 MB, 8184135680 bytes
255 heads, 63 sectors/track, 994 cylinders, total 15984640 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

Disk /dev/sda4 doesn't contain a valid partition table

user@machinename:~$ sudo parted /dev/sda 'unit s print'
Model: ATA ST3000DM001-9YN1 (scsi)
Disk /dev/sda: 5860533168s
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start        End          Size         File system     Name      Flags
 1      2048s        3905535s     3903488s                     boot      raid
 2      3905536s     3907583s     2048s                        grubbios  bios_grub
 3      3907584s     5844547583s  5840640000s  ext4            main      raid
 4      5844547584s  5860532223s  15984640s    linux-swap(v1)  swap

user@machinename:~$ swapon --summary
Filename    Type       Size     Used    Priority
/dev/sda4   partition  7992316  552272  -1
/dev/sdb4   partition  7992316  0       -2
/dev/sdc4   partition  7992316  0       -3

^ permalink raw reply	[flat|nested] 49+ messages in thread
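With the parted table in hand, the containment question is a single range comparison. A sketch (partition boundaries taken from the parted output above):

```shell
sector=5857843312       # first sector of the rewritten 4k block
swap_start=5844547584   # partition 4 (swap) start, from parted
swap_end=5860532223     # partition 4 end, from parted

# A sector belongs to the partition whose [start, end] range contains it.
if [ "$sector" -ge "$swap_start" ] && [ "$sector" -le "$swap_end" ]; then
    echo "sector $sector falls inside the swap partition"
else
    echo "sector $sector is NOT in swap - investigate further"
fi
```

Here it prints that the sector falls inside the swap partition, which is what Brad confirms in the next message.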
* Re: SMART detects pending sectors; take offline? 2018-01-05 10:10 ` Alexander Shenkin @ 2018-01-05 10:32 ` Brad Campbell 2018-01-05 13:50 ` Phil Turmel 1 sibling, 0 replies; 49+ messages in thread From: Brad Campbell @ 2018-01-05 10:32 UTC (permalink / raw) To: Alexander Shenkin Cc: Phil Turmel, Reindl Harald, Edward Kuns, Mark Knecht, Wols Lists, Carsten Aulbert, Linux-RAID Yep, it's in swap. Check the sector you wrote against the sector ranges listed in the parted print. All good to go. > On 5 Jan 2018, at 6:10 PM, Alexander Shenkin <al@shenkin.org> wrote: > >> On 1/5/2018 5:25 AM, Brad Campbell wrote: >>> On 05/01/18 13:20, Brad Campbell wrote: >>> You re-wrote Sectors 5857843312+7 on the disk. >>> Add those two and we get 5844545408 sectors. So if my maths is any good >>> you wrote a block 13297904 sectors from the end of the data area. >> I can't believe I did that. No, you wrote a block ~6M *after* the data area and you should be fine. >> I'm going to go and write a letter of apology to my primary school maths teacher now. > > Thanks much, Brad. fdisk & parted output are below. I have swap space mounted on /dev/sda4, 15,984,640 sectors long, after the partitions used for raid. I'm not sure where exactly the parity data sits... Looks to me like this happened in swap space, no? Currently, swapon reports 552,272 kb (= 1,104,544 sectors) in use (i think). If that's contiguous, then the write should have happened after the used space (13,297,904 > 1,104,544). But I'm not sure swap is contiguous. In this case, regardless, I suspect I should just reboot, and then run checkarray to be safe? > > One followup: is parity info stored in a separate area than data info on the disk? If the write *had* fallen within the raid partition area, would you indeed be able to tell if it overwrote data vs parity vs both? Google wouldn't tell me... 
>
> Thanks again,
> Allie
>
> [full fdisk -l and parted output snipped; quoted in the previous message]

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2018-01-05 10:10 ` Alexander Shenkin 2018-01-05 10:32 ` Brad Campbell @ 2018-01-05 13:50 ` Phil Turmel 2018-01-05 14:01 ` Alexander Shenkin 2018-01-05 15:59 ` Wols Lists 1 sibling, 2 replies; 49+ messages in thread From: Phil Turmel @ 2018-01-05 13:50 UTC (permalink / raw) To: Alexander Shenkin, Brad Campbell, Reindl Harald, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID Hi Alex, On 01/05/2018 05:10 AM, Alexander Shenkin wrote: > On 1/5/2018 5:25 AM, Brad Campbell wrote: >> I'm going to go and write a letter of apology to my primary school >> maths teacher now. > Thanks much, Brad. fdisk & parted output are below. I have swap space > mounted on /dev/sda4, 15,984,640 sectors long, after the partitions used > for raid. I'm not sure where exactly the parity data sits... Looks to > me like this happened in swap space, no? Currently, swapon reports > 552,272 kb (= 1,104,544 sectors) in use (i think). If that's > contiguous, then the write should have happened after the used space > (13,297,904 > 1,104,544). But I'm not sure swap is contiguous. In this > case, regardless, I suspect I should just reboot, and then run > checkarray to be safe? The output of fdisk is invalid on your system, see the warning it printed. Use gdisk or parted instead. Don't use '*'. > One followup: is parity info stored in a separate area than data info on > the disk? If the write *had* fallen within the raid partition area, > would you indeed be able to tell if it overwrote data vs parity vs both? > Google wouldn't tell me... No. Parity is interleaved with data on all devices, chunk by chunk, on all default raid5/6 layouts. In raid4, the last device is all of the parity. There are optional layouts for raid5 that do the same, and variants for raid6 that place various combinations of parity and syndrome at either end. See the --layout option in the mdadm man page. 
The non-data area of member devices contains at least the superblock, and optionally a write-intent bitmap and/or a bad-block list. Most of the non-data space is reserved for optimizing future --grow operations. > user@machinename:~$ sudo fdisk -l /dev/sda* > > WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util > fdisk doesn't support GPT. Use GNU Parted. All of the partition data following this warning is bogus -- it is the "protective" MBR record. Phil ^ permalink raw reply [flat|nested] 49+ messages in thread
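Phil's description of the default left-symmetric layout can be made concrete. A small sketch (this is not mdadm's code, just the standard left-symmetric parity rotation, shown for a 4-disk raid5):

```shell
# In the left-symmetric layout, the parity chunk for stripe s on an
# n-disk raid5 sits on device (n - 1 - s mod n): it starts on the last
# device and rotates backwards one device per stripe.
parity_disk() {
    stripe=$1
    ndisks=$2
    echo $(( ndisks - 1 - stripe % ndisks ))
}

for s in 0 1 2 3 4; do
    echo "stripe $s: parity on device $(parity_disk "$s" 4)"
done
```

This prints parity on devices 3, 2, 1, 0, 3 for stripes 0 through 4, illustrating why parity is interleaved with data on all members rather than confined to one region of the disk.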
* Re: SMART detects pending sectors; take offline? 2018-01-05 13:50 ` Phil Turmel @ 2018-01-05 14:01 ` Alexander Shenkin 2018-01-05 15:59 ` Wols Lists 1 sibling, 0 replies; 49+ messages in thread From: Alexander Shenkin @ 2018-01-05 14:01 UTC (permalink / raw) To: Phil Turmel, Brad Campbell, Reindl Harald, Edward Kuns, Mark Knecht Cc: Wols Lists, Carsten Aulbert, Linux-RAID

On 1/5/2018 1:50 PM, Phil Turmel wrote:
> No. Parity is interleaved with data on all devices, chunk by chunk, on
> all default raid5/6 layouts.

Thanks Phil. So, I suppose then the final diagnosis of my issue was that the bad sector occurred in a non-raid portion of the drive, which is why it was never read from or written to, and hence never corrected.

BTW, my reallocated sector count remains at 0 on that drive, but whatever - maybe that's because there was no data to reallocate (though I would think it would physically reallocate some sectors regardless). Also, I keep getting Rebuild99 events on /dev/md0. But, I think that, when I have time, I'll add a drive for raid6, fail out sda, write it with 0's, update its firmware (apparently Seagate released one for these drives), and re-add it.

Thanks,
Allie

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2018-01-05 13:50 ` Phil Turmel 2018-01-05 14:01 ` Alexander Shenkin @ 2018-01-05 15:59 ` Wols Lists 1 sibling, 0 replies; 49+ messages in thread From: Wols Lists @ 2018-01-05 15:59 UTC (permalink / raw) To: Phil Turmel, Alexander Shenkin, Brad Campbell, Reindl Harald, Edward Kuns, Mark Knecht Cc: Carsten Aulbert, Linux-RAID

On 05/01/18 13:50, Phil Turmel wrote:
> The output of fdisk is invalid on your system, see the warning it
> printed. Use gdisk or parted instead. Don't use '*'.

It used to be so simple - fdisk for MBR disks, gdisk et al for GPT.

ashdown src # fdisk /dev/sda

Welcome to fdisk (util-linux 2.26.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Command (m for help): p
Disk /dev/sda: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 47915407-BA7E-4869-8D3E-3CB44F5FDA12

Device          Start        End    Sectors  Size Type
/dev/sda1        2048    1050623    1048576  512M BIOS boot
/dev/sda2     1312768   68421631   67108864   32G Linux swap
/dev/sda3    68683776  270010367  201326592   96G Linux filesystem
/dev/sda4   270272512  471599103  201326592   96G Linux filesystem
/dev/sda5   471861248 5860533134 5388671887  2.5T Linux filesystem

Command (m for help): q
ashdown src #

Modern fdisk now supports GPT. So yes, in this case the warning is correct, but we need to watch out that people will be using fdisk, and it's okay. Dunno when this happened, but I've only very recently noticed it.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-12 9:50 ` Alexander Shenkin 2017-10-12 11:01 ` Wols Lists @ 2017-10-12 15:19 ` Kai Stian Olstad 1 sibling, 0 replies; 49+ messages in thread From: Kai Stian Olstad @ 2017-10-12 15:19 UTC (permalink / raw) To: Alexander Shenkin, Phil Turmel, Reindl Harald, Carsten Aulbert, linux-raid

On 12. okt. 2017 11:50, Alexander Shenkin wrote:
> On 10/11/2017 6:10 PM, Phil Turmel wrote:
>> You'll need to set your hangcheck timer to 180 seconds, too. I'm not
>> sure how to do that. (I've never seen this particular combination, but
>> it would be another black mark on desktop drives in raid arrays.)
>
> Thanks Phil... Googling around, I haven't found a way to change it
> either, but then again, I'm not really sure what to search for.

Your dmesg did say:

[4038193.380526] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Writing 0 disables the warning entirely; to raise the timeout to 180 seconds instead, use "echo 180 > /proc/sys/kernel/hung_task_timeout_secs" or "sysctl -w kernel.hung_task_timeout_secs=180".

--
Kai Stian Olstad

^ permalink raw reply	[flat|nested] 49+ messages in thread
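For completeness, the runtime change above does not survive a reboot. A sketch of both forms (run as root; the file name 90-hangcheck.conf is an arbitrary choice, any file under /etc/sysctl.d/ read at boot will do):

```shell
# one-off, takes effect immediately:
sysctl -w kernel.hung_task_timeout_secs=180

# persist across reboots via sysctl.d:
echo 'kernel.hung_task_timeout_secs = 180' > /etc/sysctl.d/90-hangcheck.conf
```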
* Re: SMART detects pending sectors; take offline? 2017-10-10 9:56 ` Alexander Shenkin 2017-10-10 12:55 ` Phil Turmel @ 2017-10-10 22:23 ` josh 2017-10-11 6:23 ` Alexander Shenkin 1 sibling, 1 reply; 49+ messages in thread From: josh @ 2017-10-10 22:23 UTC (permalink / raw) To: Alexander Shenkin; +Cc: Reindl Harald, Phil Turmel, Carsten Aulbert, linux-raid Hello Alexander, On 10 October 2017 at 20:56, Alexander Shenkin <al@shenkin.org> wrote: > On 10/10/2017 10:11 AM, Reindl Harald wrote: >> >> >> >> Am 10.10.2017 um 11:00 schrieb Alexander Shenkin: >>> >>> Thanks... I know nothing about "check scrubs". Could you point me to a >>> good resource? I've found https://raid.wiki.kernel.org/index.php/Scrubbing >>> and https://raid.wiki.kernel.org/index.php/Scrubbing_the_drives, but it's >>> hard to tell exactly how the system should be configured in order to run >>> these regularly. A weekly cron perhaps? And, should it be just check, or >>> repair? etc... Any help you could offer would be welcome. >> >> >> if your distribution don't install a cronjob for that you should blame >> them because RAID without regular scrub is asking for troubles >> >> [root@srv-rhsoft:~]$ rpm -q --file /etc/cron.d/raid-check >> mdadm-4.0-1.fc26.x86_64 >> >> [root@srv-rhsoft:~]$ cat /etc/cron.d/raid-check >> 30 4 * * Mon root /usr/sbin/raid-check > > > Thanks Reindl. Here's what I have installed (no evidence of raid-check > available on my system): > > $ cat /etc/cron.d/mdadm > 57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ] && [ $(date +\%d) -le > 7 ]; then /usr/share/mdadm/checkarray --cron --all --idle --quiet; fi > This is indicative of a Debian/Ubuntu distribution. The cron entry is not enough to enable md array checks, you have to edit /etc/default/mdadm and set AUTOCHECK=true >>> Is this something I should run now? I figure it's a bad idea to push an >>> array that is starting to degrade... haven't had a chance to replace the >>> drive yet, but will get to it this week. 
Probably best to start the >>> scrubbing routines once I have 4 good drives in there I figure... >> >> >> NO - never put any load you can avoid on degraded arrays > > > thanks, i won't. > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-10 22:23 ` josh @ 2017-10-11 6:23 ` Alexander Shenkin 0 siblings, 0 replies; 49+ messages in thread From: Alexander Shenkin @ 2017-10-11 6:23 UTC (permalink / raw) To: josh; +Cc: Reindl Harald, Phil Turmel, Carsten Aulbert, linux-raid On 10/10/2017 11:23 PM, josh wrote: > This is indicative of a Debian/Ubuntu distribution. The cron entry is > not enough to enable md array checks, you have to edit > /etc/default/mdadm and set AUTOCHECK=true Thanks Josh. AUTOCHECK is indeed set to true in /etc/default/mdadm. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: SMART detects pending sectors; take offline? 2017-10-10 9:00 ` Alexander Shenkin 2017-10-10 9:11 ` Reindl Harald @ 2017-10-10 9:21 ` Wols Lists 1 sibling, 0 replies; 49+ messages in thread From: Wols Lists @ 2017-10-10 9:21 UTC (permalink / raw) To: Alexander Shenkin, linux-raid On 10/10/17 10:00, Alexander Shenkin wrote: >> Please read up on "timeout mismatch" before your array blows up. > > I have timeouts set on all drives when the system boots, and the same > script turns on the P300s' SCTERC. You may notice on the raid wiki that I've started putting up smartctl output from various drives. It would be nice to have the P300 there. When you get it, any chance you could email me the output of a "smartctl -x"? I notice on my Barracudas that I get different output depending on whether I've turned smarts on (it's disabled by default at power-on), so obviously I'd like it once smart is enabled :-) It's meant to give people a place to look to (try to) work out which drive is suitable for them. I've noticed with the drives I've got and what people have posted, that output is "weird" for some drives... Cheers, Wol ^ permalink raw reply [flat|nested] 49+ messages in thread
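The boot-time script Alexander mentions usually looks something like the common raid-wiki recipe. A sketch, not his actual script (the device list and the 70 = 7.0-second ERC value are assumptions; run as root):

```shell
#!/bin/sh
# For each array member: try to enable SCT ERC at 7.0 seconds so the drive
# gives up on a bad sector before the kernel does. If the drive rejects
# the command (common on desktop models), raise the driver timeout instead.
for disk in sda sdb sdc sdd; do
    if smartctl -q errorsonly -l scterc,70,70 /dev/"$disk"; then
        echo "$disk: SCT ERC set to 7.0s"
    else
        echo 180 > /sys/block/"$disk"/device/timeout
        echo "$disk: no SCT ERC support, driver timeout raised to 180s"
    fi
done
```

This mirrors the timeout-mismatch advice given earlier in the thread: either the drive errors out quickly, or the kernel is told to wait long enough for the drive's own retries.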
end of thread, other threads:[~2018-01-05 15:59 UTC | newest] Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-10-07 7:48 SMART detects pending sectors; take offline? Alexander Shenkin 2017-10-07 8:21 ` Carsten Aulbert 2017-10-07 10:05 ` Alexander Shenkin 2017-10-07 17:29 ` Wols Lists 2017-10-08 9:19 ` Alexander Shenkin 2017-10-08 9:49 ` Wols Lists 2017-10-09 20:16 ` Phil Turmel 2017-10-10 9:00 ` Alexander Shenkin 2017-10-10 9:11 ` Reindl Harald 2017-10-10 9:56 ` Alexander Shenkin 2017-10-10 12:55 ` Phil Turmel 2017-10-11 10:31 ` Alexander Shenkin 2017-10-11 17:10 ` Phil Turmel 2017-10-12 9:50 ` Alexander Shenkin 2017-10-12 11:01 ` Wols Lists 2017-10-12 13:04 ` Phil Turmel 2017-10-12 13:16 ` Alexander Shenkin 2017-10-12 13:21 ` Mark Knecht 2017-10-12 15:16 ` Edward Kuns 2017-10-12 15:52 ` Edward Kuns 2017-10-15 14:41 ` Alexander Shenkin 2017-12-18 15:51 ` Alexander Shenkin 2017-12-18 16:09 ` Phil Turmel 2017-12-19 10:35 ` Alexander Shenkin 2017-12-19 12:02 ` Phil Turmel 2017-12-21 11:28 ` Alexander Shenkin 2017-12-21 11:38 ` Reindl Harald 2017-12-23 3:14 ` Brad Campbell 2018-01-03 12:44 ` Alexander Shenkin 2018-01-03 13:26 ` Brad Campbell 2018-01-03 13:50 ` Alexander Shenkin 2018-01-03 15:53 ` Phil Turmel 2018-01-03 15:59 ` Alexander Shenkin 2018-01-03 16:02 ` Phil Turmel 2018-01-04 10:37 ` Alexander Shenkin 2018-01-04 12:28 ` Alexander Shenkin 2018-01-04 13:16 ` Brad Campbell 2018-01-04 13:39 ` Alexander Shenkin 2018-01-05 5:20 ` Brad Campbell 2018-01-05 5:25 ` Brad Campbell 2018-01-05 10:10 ` Alexander Shenkin 2018-01-05 10:32 ` Brad Campbell 2018-01-05 13:50 ` Phil Turmel 2018-01-05 14:01 ` Alexander Shenkin 2018-01-05 15:59 ` Wols Lists 2017-10-12 15:19 ` Kai Stian Olstad 2017-10-10 22:23 ` josh 2017-10-11 6:23 ` Alexander Shenkin 2017-10-10 9:21 ` Wols Lists