On Wed, 18 Feb 2015 15:04:53 +0000 (UTC) Chris <email.bug@arcor.de> wrote:

> 
> Hello,
> 
> by adapting what I could find, I compiled the following short snippet now.
> 
> Could list members please look at this novice code and suggest a way to 
> determine the containing disk device $HDD_DEV from the partition/disk,
> before I dare to test this.
> 
> 
> 
> In udev-md-raid-assembly.rules, below LABEL="md_inc" (section only handling
> all md supported devices) add:
> 
> # fix timeouts for redundant raids, if possible
> IMPORT{program}="BINDIR/mdadm --examine --export $tempnode"
> TEST="/usr/sbin/smartctl", ENV{MD_LEVEL}=="raid[1-9]*",
> RUN+="BINDIR/mdadm-erc-timout-fix.sh $tempnode"

It might make sense to have two rules, one for partitions and one for whole
disks (matched on ENV{DEVTYPE}).  Then use $parent to get the containing disk
for a partition, and $devnode to get the device node for a whole disk.
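
As a rough, untested sketch of what I mean (a config fragment, not a drop-in
replacement): the IMPORT line is kept from your version, and the split on
DEVTYPE is the only new part.  I have written TEST with "==" since it is a
match key, and used backslashes so that each rule is one logical line.  As far
as I recall $parent expands to the bare node name (e.g. "sda") while $devnode
is the full path (e.g. "/dev/sda"), so the script should cope with both.

```
# fix timeouts for redundant raids, if possible
IMPORT{program}="BINDIR/mdadm --examine --export $tempnode"

# member is a partition: hand the parent disk to the script
ENV{DEVTYPE}=="partition", TEST=="/usr/sbin/smartctl", \
  ENV{MD_LEVEL}=="raid[1-9]*", \
  RUN+="BINDIR/mdadm-erc-timout-fix.sh $parent"

# member is a whole disk: hand the disk itself to the script
ENV{DEVTYPE}=="disk", TEST=="/usr/sbin/smartctl", \
  ENV{MD_LEVEL}=="raid[1-9]*", \
  RUN+="BINDIR/mdadm-erc-timout-fix.sh $devnode"
```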

> 
> And in a new mdadm-erc-timout-fix.sh file implement:
> 
>   #! /bin/sh
> 
>   HDD_DEV= $1 somehow stripping off the trailing numbers?
> 
>   if smartctl -l scterc ${HDD_DEV} | grep -q Disabled ; then
>     /usr/sbin/smartctl -l scterc,70,70 ${HDD_DEV}
>   else
>     if ! smartctl -l scterc ${HDD_DEV} | grep -q seconds ; then
>       echo 180 >/sys/block/${HDD_DEV}/device/timeout
>     fi
>   fi

You should be consistent: either use /usr/sbin/smartctl everywhere, or set
$PATH explicitly and use plain smartctl everywhere.
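
Something along these lines might do, as a minimal sketch only (not tested
against real drives).  The logic is the same as in your script.  SMARTCTL and
SYSFS_BLOCK are names I made up so that the paths are set in exactly one
place and can be overridden; fix_erc_timeout is likewise just an illustrative
name.  It assumes the caller passes the disk, either as "sda" or as
"/dev/sda", so no stripping of partition numbers is needed here.

```shell
#!/bin/sh
# Sketch: SMARTCTL and SYSFS_BLOCK are hypothetical override knobs,
# not something mdadm or udev defines.
SMARTCTL="${SMARTCTL:-/usr/sbin/smartctl}"
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

fix_erc_timeout() {
    # Accept either a bare name ("sda") or a full path ("/dev/sda").
    name="${1##*/}"
    dev="/dev/$name"

    # Query the current SCT error recovery control settings once.
    erc=$("$SMARTCTL" -l scterc "$dev" 2>/dev/null)

    case "$erc" in
    *Disabled*)
        # ERC is supported but off: limit recovery to 7.0 seconds.
        "$SMARTCTL" -l scterc,70,70 "$dev" >/dev/null
        ;;
    *seconds*)
        # ERC is already enabled: leave the drive alone.
        ;;
    *)
        # No usable ERC: raise the kernel command timeout instead.
        echo 180 > "$SYSFS_BLOCK/$name/device/timeout"
        ;;
    esac
}

# Only act when a device argument was given.
if [ "$#" -gt 0 ]; then
    fix_erc_timeout "$1"
fi
```

The udev rule would then call it as "mdadm-erc-timout-fix.sh sda" or
"mdadm-erc-timout-fix.sh /dev/sda".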

> 
> Correct execution during boot would seem to require that distro
> package managers hook smartctl and the script into the initramfs
> generation.
> 
> Regards,
> Chris

One problem with this approach is that it assumes circumstances don't change.
If you have a working RAID1, then limiting the timeout on both devices makes
sense.  If you have a degraded RAID1 with only one device left then you
really want the drive to try as hard as it can to get the data.
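
To illustrate the distinction, here is a hedged sketch of how a script could
decide which way to go for a given array.  It reads the md "degraded"
attribute in sysfs; erc_policy and SYSFS_BLOCK are names invented for this
example, and treating any degradation as "be patient" is a simplification
(a RAID6 missing one device still has redundancy).  Something would still
have to re-run it whenever the array state changes, which this does not
address.

```shell
#!/bin/sh
# Sketch: SYSFS_BLOCK is a hypothetical override for the sysfs location.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

# Print "limit" if array $1 (e.g. md0) is not degraded, else "patient".
erc_policy() {
    # If the attribute cannot be read, assume the worst and be patient.
    degraded=$(cat "$SYSFS_BLOCK/$1/md/degraded" 2>/dev/null) || degraded=1
    if [ "$degraded" = "0" ]; then
        echo limit      # redundancy intact: short timeouts make sense
    else
        echo patient    # let the drive try as hard as it can
    fi
}
```

With "limit" one would apply "smartctl -l scterc,70,70" to the members; with
"patient" one would leave ERC off (scterc,0,0) and keep a long kernel
timeout.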

There is a "FAILFAST" mechanism in the kernel which allows the filesystem or
md etc. to indicate that it wants accesses to "fail fast", which presumably
means using a smaller timeout.
I would rather md used this flag where appropriate, and the device
responded to it by using suitable timeouts.

The problem is that FAILFAST isn't documented usefully and it is very hard to
figure out what exactly (if anything) it does.

But until that is resolved, a fix like this is probably a good idea.

NeilBrown