Hi all,

Reporting back after raising the hangcheck timer
(kernel.hung_task_timeout_secs) to 180 secs and re-running checkarray.
I got a number of rebuild events (see syslog excerpts below and
attached), and I no longer see the hangcheck issue in dmesg that I saw
last time.
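
For anyone repeating this, here is a minimal sketch of the two checks.
It assumes the kernel's hung-task warning uses its usual "blocked for
more than N seconds" wording; count_hung_task_warnings is just an
illustrative helper name, not an existing tool.

```shell
# Count hung-task warnings in kernel log text supplied on stdin.
# Assumption: each warning contains "blocked for more than N seconds".
count_hung_task_warnings() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so mask the exit status to keep the helper usable under `set -e`.
    grep -c 'blocked for more than [0-9][0-9]* seconds' || true
}

# Usage (left commented out so the sketch can be sourced safely):
#   sysctl -n kernel.hung_task_timeout_secs   # the value currently in effect
#   dmesg | count_hung_task_warnings          # 0 means no warnings logged
```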

I'm still getting the SMART OfflineUncorrectableSector and 
CurrentPendingSector errors, however.  Should those go away if the 
rewrites were correctly carried out by the drive?  Any thoughts on next 
steps to verify everything is ok?
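
To see whether those counters move, one option is to record the raw
values before and after each scrub.  A minimal sketch, assuming
smartctl's usual attribute table layout (attribute name in column 2,
raw value in column 10); pending_counts is a hypothetical helper name
and /dev/sdX is a placeholder for the real drive:

```shell
# Print "name raw_value" for the two attributes of interest,
# reading `smartctl -A` output from stdin.
# Assumption: attribute name is field 2 and the raw value is field 10.
pending_counts() {
    awk '$2 == "Current_Pending_Sector" || $2 == "Offline_Uncorrectable" {
        print $2, $10
    }'
}

# Usage (commented out; needs smartmontools and a real device):
#   sudo smartctl -A /dev/sdX | pending_counts
```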

Thanks,
Allie


user@machine:/var/log$ cat syslog | grep Rebuild
Dec 19 12:48:18 machine mdadm[23296]: RebuildStarted event detected on md device /dev/md/0
Dec 19 12:48:41 machine mdadm[23296]: Rebuild99 event detected on md device /dev/md/0
Dec 19 12:48:41 machine mdadm[23296]: RebuildStarted event detected on md device /dev/md/2
Dec 19 12:48:41 machine mdadm[23296]: RebuildFinished event detected on md device /dev/md/0
Dec 19 14:12:02 machine mdadm[23296]: Rebuild22 event detected on md device /dev/md/2
Dec 19 15:18:42 machine mdadm[23296]: Rebuild41 event detected on md device /dev/md/2
Dec 19 16:42:02 machine mdadm[23296]: Rebuild62 event detected on md device /dev/md/2
Dec 19 18:05:23 machine mdadm[23296]: Rebuild80 event detected on md device /dev/md/2
Dec 19 20:02:09 machine mdadm[23296]: RebuildFinished event detected on md device /dev/md/2
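
To boil an excerpt like the one above down to just the start and finish
of each scrub, a small awk filter is enough.  This is a sketch only: it
assumes the on-disk syslog format, with each event on a single line and
the device path as the last field, and summarize_rebuilds is a made-up
helper name.

```shell
# Print "device event time" for each RebuildStarted/RebuildFinished
# line read from stdin.
# Assumed fields: $3 = time, $6 = event name, last field = md device.
summarize_rebuilds() {
    awk '/mdadm\[[0-9]+\]: Rebuild(Started|Finished) event/ {
        event = $6
        sub(/^Rebuild/, "", event)   # "RebuildStarted" becomes "Started"
        print $NF, event, $3
    }'
}

# Usage (commented out; the log path may differ on other systems):
#   summarize_rebuilds < /var/log/syslog
```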


On 12/19/2017 12:02 PM, Phil Turmel wrote:
> On 12/19/2017 05:35 AM, Alexander Shenkin wrote:
> 
>> Ok, so, it's now my understanding that I would normally be ok, having
>> set the driver timeout to 180 secs (thus giving time for the seagate
>> drive to report the read error back up to the MD layer before 180 secs
>> is up).  In my case, however, the kernel hangcheck timer is interrupting
>> the process (md?) that is waiting on the sector read at 120 secs.
>> Therefore, the writeback doesn't happen.
> 
> Yes.  I think this behavior is a bug, and you need to work around it.
> 
>> Thus, I should set the hangcheck to something > 120 (say, 180 secs -
>> should it be >180 to let the driver timeout first?).  Does this sound
>> correct?  Apologies if I'm repeating info from before - just trying to
>> be sure about what I'm doing before I go ahead and do it.
>>
>> If that's correct, I'll add the following line in /etc/sysctl.conf:
>>
>> kernel.hung_task_timeout_secs = 180
> 
> Yes.  For your kernel.
> 
>> I'll make sure the setting has taken, and then I'll run:
>>
>> sudo /usr/share/mdadm/checkarray --idle --all
> 
> Makes sense.  Please report your results for posterity when the scrub is
> done.
> 
> Phil