* re-add POLICY
@ 2015-02-14 21:59 Chris
  2015-02-15 19:03 ` re-add POLICY: conflict detection? Chris
  2015-02-16  3:28 ` re-add POLICY NeilBrown
  0 siblings, 2 replies; 24+ messages in thread
From: Chris @ 2015-02-14 21:59 UTC (permalink / raw)
  To: linux-raid


Hi all,

I'd like mdadm to automatically attempt to re-sync raid members after they
were temporarily removed from the system.

I would have thought "POLICY domain=default action=re-add" should allow this,
and found a prior post that also seemed to want/test that behaviour.
But as I understand the answer given there
http://permalink.gmane.org/gmane.linux.raid/47516
mdadm is expected to exit with an error (not re-add) upon plugging the
device back in?

with:
mdadm: can only add /dev/loop2 to /dev/md0 as a spare, and force-spare is
not set.
mdadm: failed to add /dev/loop2 to existing array /dev/md0: Invalid argument.

For one, I don't understand what the error message is trying to tell me about
an invalid argument that was never supplied to --incremental?

But more importantly, how can previously disconnected devices (marked failed
with a non-future event count) get re-synced automatically when they are
plugged in again?
(avoiding manual mdadm /dev/mdX --add /dev/sdYZ hassle)
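
For reference, the pieces involved boil down to something like this (the
config file location and devices are just examples):

  # /etc/mdadm/mdadm.conf
  POLICY domain=default action=re-add

  # simulating the hot-plug by hand
  mdadm --incremental /dev/loop2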

Cheers,
Chris



* Re: re-add POLICY: conflict detection?
  2015-02-14 21:59 re-add POLICY Chris
@ 2015-02-15 19:03 ` Chris
  2015-02-16  3:28 ` re-add POLICY NeilBrown
  1 sibling, 0 replies; 24+ messages in thread
From: Chris @ 2015-02-15 19:03 UTC (permalink / raw)
  To: linux-raid


thinking about the "invalid argument" message...

with "action=re-add":
# mdadm --incremental /dev/loop2

 mdadm: can only add /dev/loop2 to /dev/md0 as a spare, and force-spare is
 not set.
 mdadm: failed to add /dev/loop2 to existing array /dev/md0: Invalid argument.


My guess is that mdadm may not be adding back the failed disk, because it is
unsure whether it may have run separately, and may have newer data on it?

I thought it may be possible to clearly distinguish between clean re-adds
and conflicts, by doing something like this:

* If a member fails (or is missing when starting degraded) write this info
into some failed_at_event_count field belonging to the failed member in the
superblock of every remaining raid member device in the array.

Now, if an array part that got unplugged reappears and still has the event
count that matches the failed_at_event_count that was recorded in the
superblocks of the still running disks, and the reappearing part's
superblock has no failed_at_event_count values for any member of the running
array, the reappearing part is ok to be automatically re-synced.

But if the reappearing disk claims a member of the already running array has
failed, or it reappeared with a different event count than its
failed_at_event_count field in the superblocks of the running array says, a
conflict has arisen and a sync may only be done with manual --force.

Cheers,
Chris








* Re: re-add POLICY
  2015-02-14 21:59 re-add POLICY Chris
  2015-02-15 19:03 ` re-add POLICY: conflict detection? Chris
@ 2015-02-16  3:28 ` NeilBrown
  2015-02-16 12:23   ` Chris
  1 sibling, 1 reply; 24+ messages in thread
From: NeilBrown @ 2015-02-16  3:28 UTC (permalink / raw)
  To: Chris; +Cc: linux-raid


On Sat, 14 Feb 2015 21:59:34 +0000 (UTC) Chris <email.bug@arcor.de> wrote:

> 
> Hi all,
> 
> I'd like mdadm to automatically attempt to re-sync raid members after they
> were temporarily removed from the system.
> 
> I would have thought "POLICY domain=default action=re-add" should allow this,
> and found a prior post that also seemed to want/test that behaviour.
> But as I understand the answer given there
> http://permalink.gmane.org/gmane.linux.raid/47516
> mdadm is expected to exit with an error (not re-add) upon plugging the
> device back in?
> 
> with:
> mdadm: can only add /dev/loop2 to /dev/md0 as a spare, and force-spare is
> not set.
> mdadm: failed to add /dev/loop2 to existing array /dev/md0: Invalid argument.
> 
> For one, I don't understand what the error message is trying to tell me about
> an invalid argument that was never supplied to --incremental?
> 
> But more importantly, how can previously disconnected devices (marked failed
> with a non-future event count) get re-synced automatically when they are
> plugged in again?
> (avoiding manual mdadm /dev/mdX --add /dev/sdYZ hassle)
> 

Does your array have a write-intent bitmap configured?
If it does, then "POLICY action=re-add" really should work.

If it doesn't, then maybe you need "POLICY action=spare".

This isn't the default, because depending on exactly how/why the device
failed, it may not be safe to treat it as a spare.

If the above does not help, please report:
 - kernel version
 - mdadm version
 - "mdadm --examine" output of at least one good drive and one failed drive.

NeilBrown



* Re: re-add POLICY
  2015-02-16  3:28 ` re-add POLICY NeilBrown
@ 2015-02-16 12:23   ` Chris
  2015-02-16 13:17     ` Phil Turmel
  2015-02-17 15:09     ` re-add POLICY Chris
  0 siblings, 2 replies; 24+ messages in thread
From: Chris @ 2015-02-16 12:23 UTC (permalink / raw)
  To: linux-raid

NeilBrown <neilb <at> suse.de> writes:
 
> Does your array have a write-intent bitmap configured?
> If it does, then "POLICY action=re-add" really should work.

Thank you for your insight. You are correct, the array has no write-intent
bitmap.
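
(If I do add one later, I assume it would be something along the lines of

  mdadm --grow --bitmap=internal /dev/mdX

but for now I'd like to understand the non-bitmap case.)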

 
> If it doesn't, then maybe you need "POLICY action=spare".

OK, I will test this when the notebook is back in the house.



Actually, the man page had kind of kept me from trying this, because it
mentions the condition "if the device is bare", and I didn't want an arbitrary
bare disk, partition, or free space to be automatically added, but just to
trigger an automatic attempt with raid members that got pulled and are safe to
re-sync (e.g. after the occasional bad block error that gets remapped by
the hard drive's firmware).

[man page: spare works] "as  above  and  additionally:  if  the device is
bare it can become a spare if there is any array that it is a candidate for
based on domains and metadata."

Also, I wouldn't want a temporarily removed raid member to be added as a spare
to some other array. Only have them added (re-synced even if no bitmap
re-add is possible) to the array they belong to according to their superblock.

 
> This isn't the default, because depending on exactly how/why the device
> failed, it may not be safe to treat it as a spare.

OK, I can imagine detecting the corner cases may require some intelligent
error logging.

What I am looking for is a safe re-sync configuration option between
bitmap-based re-add and treating a device as an arbitrary spare drive.


Practically, this could be something like an additional action=re-sync
option in between re-add/spare, or having the "re-add" action also do
(non-bitmap) full re-syncs, if the device is in a clean state.


Might recording the fail event count in the remaining superblocks, as
described in
http://permalink.gmane.org/gmane.linux.raid/48077
help to detect the clean state?

Kind Regards,
Chris









* Re: re-add POLICY
  2015-02-16 12:23   ` Chris
@ 2015-02-16 13:17     ` Phil Turmel
  2015-02-16 16:15       ` desktop disk's error recovery timeouts (was: re-add POLICY) Chris
  2015-02-17 15:09     ` re-add POLICY Chris
  1 sibling, 1 reply; 24+ messages in thread
From: Phil Turmel @ 2015-02-16 13:17 UTC (permalink / raw)
  To: Chris, linux-raid

Hi Chris,

On 02/16/2015 07:23 AM, Chris wrote:
> .... with raid members that got pulled and are safe to
> re-sync (e.g. after the occasional bad block error that gets remapped by
> the hard drive's firmware)

This should not be part of your concern here, as MD will handle
occasional UREs by reconstructing them and rewriting them on the fly,
-- without failing the device.  If devices are failing after read
errors, you have a different problem.  (Hint:  look at recent threads
for "timeout mismatch".)

Phil



* desktop disk's error recovery timeouts (was: re-add POLICY)
  2015-02-16 13:17     ` Phil Turmel
@ 2015-02-16 16:15       ` Chris
  2015-02-16 17:19         ` desktop disk's error recovery timeouts Phil Turmel
  0 siblings, 1 reply; 24+ messages in thread
From: Chris @ 2015-02-16 16:15 UTC (permalink / raw)
  To: linux-raid

Phil Turmel <philip <at> turmel.org> writes:

> On 02/16/2015 07:23 AM, Chris wrote:
> > .... with raid members that got pulled and are safe to
> > re-sync (e.g. after the occasional bad block error that gets remapped by
> > the hard drive's firmware)
> 
> This should not be part of your concern here, as MD will handle
> occasional UREs by reconstructing them and rewriting them on the fly,


Phil, thank you for dropping in with this hint. It very likely applies to
the disks in the docking station. I searched the mailing list; most hits
just said to search for the keywords, though. ;-)

To understand the issue, I think
https://en.wikipedia.org/wiki/Error_recovery_control
was good.

It would be good if this configuration information could be available there
or at https://raid.wiki.kernel.org

Cheers,
Chris

----

I compiled some snippets from your messages that could serve as a basis for
correction/completion by someone knowledgeable:



The default linux controller timeout is 30 seconds.  Drives
that spend longer than the timeout in recovery will be reset.  If they
don't respond to the reset (because they're busy in recovery) when the
raid tries to write the correct data back to them, they will be kicked
out of the array.

You *must* set ERC shorter than the
timeout, or set the driver timeout longer than the drive's worst-case
recovery time.  The defaults for desktop drives are *not* suitable for
linux software raid.

I strongly encourage you to run "smartctl -l scterc /dev/sdX" for each
of your drives.  For any drive that warns that it doesn't support SCT
ERC, set the controller device timeout to 180 like so:

echo 180 >/sys/block/sdX/device/timeout

If the report says read or write ERC is disabled, run "smartctl -l
scterc,70,70 /dev/sdX" to set it to 7.0 seconds.

You then set up a boot-time script to do these adjustments at every restart,
and make sure you are performing regular scrub runs to ...?
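
A boot-time snippet along those lines might look like this (my own untested
sketch; it assumes plain /dev/sdX whole-disk devices):

  #!/bin/sh
  # run once at every boot, e.g. from rc.local
  for disk in /dev/sd[a-z]; do
      erc=$(smartctl -l scterc "$disk")
      if echo "$erc" | grep -q Disabled; then
          smartctl -l scterc,70,70 "$disk"     # enable 7.0 second ERC
      elif ! echo "$erc" | grep -q seconds; then
          # no SCT ERC support at all: raise the driver timeout instead
          echo 180 > /sys/block/"$(basename "$disk")"/device/timeout
      fi
  done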


You might not want that kind of long device timeout, but then you shouldn't
use desktop drives in md RAID.

Anyone using desktop drives which don't support SCT ERC in md RAID is 
liable to see long timeouts on the simplest bad sector, and they 
probably prefer to keep the drive in the array AND have the sector 
rewritten after reconstruction than have the drive failed out of the array.






* Re: desktop disk's error recovery timeouts
  2015-02-16 16:15       ` desktop disk's error recovery timeouts (was: re-add POLICY) Chris
@ 2015-02-16 17:19         ` Phil Turmel
  2015-02-16 17:48           ` What are mdadm maintainers to do? (was: desktop disk's error recovery timeouts) Chris
  0 siblings, 1 reply; 24+ messages in thread
From: Phil Turmel @ 2015-02-16 17:19 UTC (permalink / raw)
  To: Chris, linux-raid

On 02/16/2015 11:15 AM, Chris wrote:

> Phil, thank you for dropping in with this hint. It very likely applies to
> the disks in the docking station. I searched the mailing list; most hits
> just said to search for the keywords, though. ;-)

I don't always have time to explain. :-(

> To understand the issue, I think
> https://en.wikipedia.org/wiki/Error_recovery_control
> was good.

Good starting points in the archives:

http://marc.info/?l=linux-raid&m=135811522817345&w=1
http://marc.info/?l=linux-raid&m=133761065622164&w=2
http://marc.info/?l=linux-raid&m=135863964624202&w=2
http://marc.info/?l=linux-raid&m=139050322510249&w=2

There's useful info in each entire thread, though.

Phil


* What are mdadm maintainers to do? (was: desktop disk's error recovery timeouts)
  2015-02-16 17:19         ` desktop disk's error recovery timeouts Phil Turmel
@ 2015-02-16 17:48           ` Chris
  2015-02-16 19:44             ` What are mdadm maintainers to do? Phil Turmel
  2015-02-16 23:49             ` What are mdadm maintainers to do? (was: desktop disk's error recovery timeouts) NeilBrown
  0 siblings, 2 replies; 24+ messages in thread
From: Chris @ 2015-02-16 17:48 UTC (permalink / raw)
  To: linux-raid


Thank you for the additional information, it calls for action.


OK, calling for a solution to stop desktop drives from causing data loss and
affecting the mdadm reputation:


I gather that mdadm could ship with one additional udev rule that calls a
script to check/set scterc, or falls back to increasing the system timeout.

Phil, you mentioned having posted such a script, could you prepare it for
addition to the mdadm package?


Would maintainers be ok with adding such a udev rule and script to the package?


Kind Regards,
Chris






* Re: What are mdadm maintainers to do?
  2015-02-16 17:48           ` What are mdadm maintainers to do? (was: desktop disk's error recovery timeouts) Chris
@ 2015-02-16 19:44             ` Phil Turmel
  2015-02-16 23:49             ` What are mdadm maintainers to do? (was: desktop disk's error recovery timeouts) NeilBrown
  1 sibling, 0 replies; 24+ messages in thread
From: Phil Turmel @ 2015-02-16 19:44 UTC (permalink / raw)
  To: Chris, linux-raid

On 02/16/2015 12:48 PM, Chris wrote:
> 
> Thank you for the additional information, it calls for action.
> 
> 
> OK, calling for a solution to stop desktop drives from causing data loss and
> affecting the mdadm reputation:
> 
> 
> I gather that mdadm could ship with one additional udev rule that calls a
> script to check/set scterc, or falls back to increasing the system timeout.
> 
> Phil, you mentioned having posted such a script, could you prepare it for
> addition to the mdadm package?

No, I've posted snippets for users to customize in their own rc.local or
distro equivalent.  I vaguely recall posting a generic script for some
common cases, but I've personally converted to raid-rated drives
everywhere in the past couple years.

Somebody else will have to tackle this.

> Would maintainers be ok with adding such a udev rule and script to the package?

Not my call, but keep in mind that this will add a dependency on
smartmontools or whatever means is used to access/write to scterc.

Phil


* Re: What are mdadm maintainers to do? (was: desktop disk's error recovery timeouts)
  2015-02-16 17:48           ` What are mdadm maintainers to do? (was: desktop disk's error recovery timeouts) Chris
  2015-02-16 19:44             ` What are mdadm maintainers to do? Phil Turmel
@ 2015-02-16 23:49             ` NeilBrown
  2015-02-17  7:52               ` What are mdadm maintainers to do? (error recovery redundancy/data loss) Chris
  2015-02-18 15:04               ` help with the little script (erc timeout fix) Chris
  1 sibling, 2 replies; 24+ messages in thread
From: NeilBrown @ 2015-02-16 23:49 UTC (permalink / raw)
  To: Chris; +Cc: linux-raid


On Mon, 16 Feb 2015 17:48:50 +0000 (UTC) Chris <email.bug@arcor.de> wrote:

> 
> Thank you for the additional information, it calls for action.
> 
> 
> OK, calling for a solution to stop desktop drives from causing data loss and
> affecting the mdadm reputation:
> 
> 
> I gather that mdadm could ship with one additional udev rule that calls a
> script to check/set scterc, or falls back to increasing the system timeout.
> 
> Phil, you mentioned having posted such a script, could you prepare it for
> addition to the mdadm package?
> 
> 
> Would maintainers be ok with adding such a udev rule and script to the package?


"maintainers" ? Plural?  That would be nice.
Unfortunately there is just the one singular me....

There are certainly other contributors who 
 - answer questions on the list 
 - provide bug reports
 - provide bits of code

and I am very thankful to them.  But I haven't found a likely co-maintainer
yet :-(


I'm certainly happy to consider any concrete proposal.  The more concrete,
the better.

NeilBrown



* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
  2015-02-16 23:49             ` What are mdadm maintainers to do? (was: desktop disk's error recovery timeouts) NeilBrown
@ 2015-02-17  7:52               ` Chris
  2015-02-17  8:48                 ` Mikael Abrahamsson
  2015-02-17 19:33                 ` Chris Murphy
  2015-02-18 15:04               ` help with the little script (erc timeout fix) Chris
  1 sibling, 2 replies; 24+ messages in thread
From: Chris @ 2015-02-17  7:52 UTC (permalink / raw)
  To: linux-raid

NeilBrown <neilb <at> suse.de> writes:

> "maintainers" ? Plural?  That would be nice.
> Unfortunately there is just the one singular me....


Yes, as Weedy said, I also referred to distro package maintainers.

If we can come up here with a udev rule and a script to call, then upstream
(you) could include this, and distro maintainers could make smartmontools
(smartctl) a suggested or recommended dependency of the mdadm package.


I certainly have not understood the whole topic yet;
what I gathered so far is that the script should do something like
the following, and I found an existing implementation below.

Everybody, please answer with improved versions if you can.


if smartctl tool is available
  if scterc is disabled
    /usr/sbin/smartctl -l scterc,70,70 ${DEVNAME}
  else
    if scterc is not available
      echo 180 >/sys/block/${DEVNAME}/device/timeout



Found an older implementation that "seems to work fine":

http://article.gmane.org/gmane.linux.raid/44566
>
> contents of udev rule:
> ACTION=="add", SUBSYSTEM=="block", KERNEL=="[sh]d[a-z]",
RUN+="/usr/local/bin/settimeout"
>
>
> contents of /usr/local/bin/settimeout:
> #!/bin/bash
> 
> [ "${ACTION}" == "add" ] && {
>         /usr/sbin/smartctl -l scterc,70,70 ${DEVNAME} || echo 180 >
/sys/${DEVPATH}/device/timeout
> }
> 
> I guess, what is missing, is to connect the HDDs
> with a specific "mdadm" event, instead of running
> for each HDD.
> I'm not sure if this is already possible, since
> some "udev" rules for "md" are already existing.


Let's get this disaster prevention into mdadm, even if just as an important
reference experience for solving the more general kernel timeout mismatch
problem ("symptom of a more generic issue",
http://article.gmane.org/gmane.linux.raid/44557).




* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
  2015-02-17  7:52               ` What are mdadm maintainers to do? (error recovery redundancy/data loss) Chris
@ 2015-02-17  8:48                 ` Mikael Abrahamsson
  2015-02-17 10:37                   ` Chris
  2015-02-17 19:33                 ` Chris Murphy
  1 sibling, 1 reply; 24+ messages in thread
From: Mikael Abrahamsson @ 2015-02-17  8:48 UTC (permalink / raw)
  To: Chris; +Cc: linux-raid

On Tue, 17 Feb 2015, Chris wrote:

> Everybody, please answer with improved versions if you can.
>
> if smartctl tool is available
>  if scterc is disabled
>    /usr/sbin/smartctl -l scterc,70,70 ${DEVNAME}
>  else
>    if scterc is not available
>      echo 180 >/sys/block/${DEVNAME}/device/timeout
>
> Found an older implementation that "seems to work fine":

Hi,

Generally I like this idea and agree it would be a good one,
but if I was running raid0 or linear, I might not want scterc to be
enabled.

Also, what would the harm be to always bump the timeout to 180 seconds? 
Yes, drives would take longer to be kicked out in case of errors, but if 
we're confident in scterc working, wouldn't we want to turn down the 
timeout to 10-15 seconds then?

Personally I turn on scterc if available and turn up the timeout to 180 
seconds, always, regardless what drives I'm running. I'd rather wait 
longer for a drive to be considered dead, than to have drives being kicked 
due to some hiccup in the system (controller or drive reset) that might 
rectify itself.

So I would suggest turning on scterc and turning up the timeout to 180 
seconds as soon as mdadm is installed. This is the best tradeoff I can 
come up with between stability and fast drive-dead-detection time.

Here on the list I see people all the time coming in with multiple drives
kicked due to controller resets and other intermittent flukes; I never see
people coming in complaining that it took 30 seconds to detect a drive
error. I doubt there'd be much complaint for 180 seconds. If someone needs
faster detect times, then my opinion is that they are in the category of
users who can be expected to tune this value to their application. 180
seconds works best for the "larger crowd" using mdadm.

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
  2015-02-17  8:48                 ` Mikael Abrahamsson
@ 2015-02-17 10:37                   ` Chris
  0 siblings, 0 replies; 24+ messages in thread
From: Chris @ 2015-02-17 10:37 UTC (permalink / raw)
  To: linux-raid

Mikael Abrahamsson <swmike <at> swm.pp.se> writes:

> if I was running raid0 or linear, I might not want scterc to be 
> enabled.

Good point.

> Also, what would the harm be to always bump the timeout to 180 seconds?

I don't know why the driver authors chose that Linux default,
but here is the to-do with both your points incorporated:


if the appearing device is an md member device (mdadm --examine?)

  if smartctl tool is available
    if scterc is disabled in the containing ${HDD_DEV} AND the added device is not raid0/linear
      /usr/sbin/smartctl -l scterc,70,70 ${HDD_DEV}

  echo 180 >/sys/block/${HDD_DEV}/device/timeout
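
Or, as a rough shell sketch of the same (untested; the ${HDD_DEV} derivation
is naive and assumes sdXN-style partition names):

  #!/bin/sh
  # $1 = the newly appearing device, e.g. /dev/sdc1
  DEV="$1"
  HDD_DEV=$(basename "$DEV" | sed 's/[0-9]*$//')   # containing disk: sdc1 -> sdc

  LEVEL=$(mdadm --examine --export "$DEV" 2>/dev/null | sed -n 's/^MD_LEVEL=//p')
  [ -n "$LEVEL" ] || exit 0                        # not an md member device

  if command -v smartctl >/dev/null 2>&1; then
      case "$LEVEL" in
          raid0|linear) ;;                         # non-redundant: leave ERC alone
          *) smartctl -l scterc "/dev/$HDD_DEV" | grep -q Disabled &&
                 smartctl -l scterc,70,70 "/dev/$HDD_DEV" ;;
      esac
  fi

  echo 180 > "/sys/block/$HDD_DEV/device/timeout"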



 





* Re: re-add POLICY
  2015-02-16 12:23   ` Chris
  2015-02-16 13:17     ` Phil Turmel
@ 2015-02-17 15:09     ` Chris
  2015-02-22 13:23       ` Chris
  1 sibling, 1 reply; 24+ messages in thread
From: Chris @ 2015-02-17 15:09 UTC (permalink / raw)
  To: linux-raid


> NeilBrown <neilb <at> suse.de> writes:
> 
> > If it doesn't, then maybe you need "POLICY action=spare".
> 
> OK, I will test this when the notebook is back in the house.


I could test it on another system.

Without adding a bitmap, it required configuring
  POLICY domain=default action=spare
and calling
  mdadm --udev-rules

but then, after removing and inserting sdc again, only two out of six md
partitions got synced.

To see if there is something wrong, I then added the sdc1 md0 member
manually, and it synced without failure.

So I can't tell why the other partitions did not sync automatically.
Some of the unsynced partition types are 83 (md0 member), but others
are FD (md7 member) like the automatically synced ones.

linux 3.2.0
mdadm v3.2.5

md7 : active raid1 sdc6[3] sda8[2]
      14327680 blocks super 1.2 [3/2] [UU_]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md3 : active raid1 sdc8[4] sda10[3]
      307011392 blocks super 1.2 [3/2] [UU_]
      bitmap: 3/3 pages [12KB], 65536KB chunk

md6 : active raid1 sda7[2]
      8695680 blocks super 1.2 [3/1] [_U_]
      
md1 : active raid1 sda6[3](W) sdb2[1]
      19513216 blocks super 1.2 [4/2] [_UU_]
      
md2 : active raid1 sda9[3](W) sdb3[0]
      97590144 blocks super 1.2 [4/2] [U_U_]
      
md0 : active raid1 sdc1[4] sda5[2](W) sdb1[1]
      340672 blocks super 1.2 [4/3] [UUU_]




A partition that did not sync automatically:
/dev/sdc7:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 7a5847cd:be0e8510:8e170bf5:5d40143f
           Name : name:2  (local to host name)
  Creation Time : Sun Dec  2 21:40:58 2012
     Raid Level : raid1
   Raid Devices : 4

 Avail Dev Size : 195187135 (93.07 GiB 99.94 GB)
     Array Size : 97590144 (93.07 GiB 99.93 GB)
  Used Dev Size : 195180288 (93.07 GiB 99.93 GB)
    Data Offset : 131072 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : b1a97d12:965e3d08:059acefb:6ac5b7e3

    Update Time : Wed Dec  3 11:23:26 2014
       Checksum : ac0ce511 - correct
         Events : 382479


   Device Role : Active device 1
   Array State : AAA. ('A' == active, '.' == missing)



And a corresponding partition that is part of the running array:
/dev/sda9:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 7a5847cd:be0e8510:8e170bf5:5d40143f
           Name : name:2  (local to host name)
  Creation Time : Sun Dec  2 21:40:58 2012
     Raid Level : raid1
   Raid Devices : 4

 Avail Dev Size : 195182592 (93.07 GiB 99.93 GB)
     Array Size : 97590144 (93.07 GiB 99.93 GB)
  Used Dev Size : 195180288 (93.07 GiB 99.93 GB)
    Data Offset : 131072 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 4c191282:80769896:378abe34:aeb01b8d

          Flags : write-mostly
    Update Time : Mon Feb 16 17:41:54 2015
       Checksum : 8cb4794c - correct
         Events : 384989


   Device Role : Active device 2
   Array State : A.A. ('A' == active, '.' == missing)



BTW looking at this data now, it seems to me the superblocks almost support
the clean re-sync / conflict detection I was trying to explain.

a) The removed device 1 does not claim that a member
   in the running array (0 and 2) has failed (AAA.)

b) The Events count of device 1 is lower than in the running array.

c) The running array/superblock does not seem to keep
   a record of the Events count at which device 1 failed, for additional
   security that it has not been started separately.

But b) and c) may not even be necessary, as starting device 1 separately
would make device 1 claim that 0 and 2 have failed, right?
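
To make that concrete, a naive check along those lines could be scripted with
nothing but what --examine already prints (untested sketch; the two devices
are just the examples from above):

  #!/bin/sh
  NEW=/dev/sdc7        # reappearing part
  RUNNING=/dev/sda9    # member of the running array

  field() { mdadm --examine "$2" | sed -n "s/^ *$1 : //p"; }

  new_state=$(field "Array State" "$NEW" | cut -d' ' -f1)       # e.g. AAA.
  run_state=$(field "Array State" "$RUNNING" | cut -d' ' -f1)   # e.g. A.A.

  conflict=0

  # a) every slot the running array marks active ('A') must also be 'A'
  #    in the reappearing part's view of the array
  i=1
  while [ "$i" -le "${#run_state}" ]; do
      r=$(printf '%s' "$run_state" | cut -c"$i")
      n=$(printf '%s' "$new_state" | cut -c"$i")
      [ "$r" = A ] && [ "$n" != A ] && conflict=1
      i=$((i + 1))
  done

  # b) the reappearing part's event count must not be ahead of the running array
  [ "$(field Events "$NEW")" -gt "$(field Events "$RUNNING")" ] && conflict=1

  if [ "$conflict" -eq 0 ]; then
      echo "$NEW looks safe to re-sync automatically"
  else
      echo "$NEW conflicts with the running array; manual --force needed"
  fi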


Regards,
Chris







* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
  2015-02-17  7:52               ` What are mdadm maintainers to do? (error recovery redundancy/data loss) Chris
  2015-02-17  8:48                 ` Mikael Abrahamsson
@ 2015-02-17 19:33                 ` Chris Murphy
  2015-02-17 22:47                   ` Adam Goryachev
  2015-02-17 23:33                   ` Chris
  1 sibling, 2 replies; 24+ messages in thread
From: Chris Murphy @ 2015-02-17 19:33 UTC (permalink / raw)
  To: linux-raid

It's not just mdadm. It likewise affects Btrfs, ZFS, and LVM.

Also, there's a lack of granularity with linux command timer and SCT
ERC applying only to the entire block device, not partitions. So
there's a problem for mixed use cases. For example, two drives, each
with two partitions. sda1 and sdb1 are raid0, and sda2 and sdb2 are
raid1. What's the proper configuration for SCT ERC and the SCSI
command timer?

*shrug* I don't think the automatic udev configuration idea is fail
safe. It sounds too easy for it to automatically cause a
misconfiguration. And it also doesn't at all solve the problem that
there's next to no error reporting to user space. smartd does, but
it's narrow in scope and entirely defers to the hard drive's
self-assessment. There's all sorts of problems that aren't in the
domain of SMART that get reported in dmesg, but there's no method for
gnome-shell or KDE or any DE to show it, or even to send an email to a sysadmin, as
an early warning. Instead, all too often it's "WTF XFS just corrupted
itself!" meanwhile the real problem has been happening for a week,
dmesg/journal is full of errors indicating the nature of those
problems, but nothing bothered to inform a human being until the file
system face planted.


Chris Murphy


* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
  2015-02-17 19:33                 ` Chris Murphy
@ 2015-02-17 22:47                   ` Adam Goryachev
  2015-02-18  1:02                     ` Chris Murphy
  2015-02-17 23:33                   ` Chris
  1 sibling, 1 reply; 24+ messages in thread
From: Adam Goryachev @ 2015-02-17 22:47 UTC (permalink / raw)
  To: Chris Murphy, linux-raid

On 18/02/15 06:33, Chris Murphy wrote:
> It's not just mdadm. It likewise affects Btrfs, ZFS, and LVM.
>
> Also, there's a lack of granularity with linux command timer and SCT
> ERC applying only to the entire block device, not partitions. So
> there's a problem for mixed use cases. For example, two drives, each
> with two partitions. sda1 and sdb1 are raid0, and sda2 and sdb2 are
> raid1. What's the proper configuration for SCT ERC and the SCSI
> command timer?

Umm, actually I don't know enough to disagree, but I'll ask some
questions which probably show both the assumptions I've made, and might
help others understand the issue better.

If we enable SCT ERC on every drive that supports it, and we are using 
the drive (only) in a RAID0/linear array then what is the downside? As I 
understand it, the drive will no longer try for > 120sec to recover the 
data stored in the "bad" sector, and instead return an unreadable error 
message in a short amount of time (well below 30 seconds) which means 
the driver will be able to return a read error to the application (or FS 
or MD) and the system as a whole will carry on. If we didn't enable SCT 
ERC, then the entire drive would vanish, (because the timeout wasn't 
changed for the driver) and the current read and every future read/write 
will all fail, and the system will probably crash (well, depending on 
the application, FS layout, etc).

So, IMHO, it seems that by default, every SCT ERC capable drive should 
have this enabled by default. As a part of error recovery (ie, crap that 
really important data stored on those few unreadable sectors) the user 
could manually disable SCT ERC and re-attempt to request the data from 
the drive (eg, during dd_rescue or similar).


Secondly, changing the timeout for those drives that don't support SCT 
ERC, again, it is fairly similar to above, we get the error from the 
drive before the timeout, except we will avoid the only possible 
downside above (failing to read a very unlikely but possible to read 
sector). Again, we will avoid dropping the entire drive, even if all 
operations on this drive will stop for a longer period of time, it is 
probably better than stopping permanently.

So, IMHO, every non SCT ERC capable drive should have the timeout 
extended to 120s/180s or whatever the appropriate time is that (most) 
drives will respond within. Leaving only the most extremely brain dead 
drives which we simply ridicule on the list and anywhere and everywhere 
possible to ensure nobody will ever buy them (or the manufacturer will 
fix the problems).


Of course, it's quite possible I've totally oversimplified this, and don't
understand the other repercussions?

> *shrug* I don't think the automatic udev configuration idea is fail
> safe. It sounds too easy for it to automatically cause a
> misconfiguration. And it also doesn't at all solve the problem that
> there's next to no error reporting to user space. smartd does, but
> it's narrow in scope and entirely defers to the hard drive's
> self-assessment. There's all sorts of problems that aren't in the
> domain of SMART that get reported in dmesg, but there's no method for
> gnome-shell or KDE or any DE to show it, or even to send an email to a sysadmin, as
> an early warning. Instead, all too often it's "WTF XFS just corrupted
> itself!" meanwhile the real problem has been happening for a week,
> dmesg/journal is full of errors indicating the nature of those
> problems, but nothing bothered to inform a human being until the file
> system face planted.

Just because the solution doesn't solve the entire problem, it does 
solve a part of the problem, so IMHO, better to solve this part of the 
problem, and then discuss/try to find a solution to the rest of the 
problem. Unless you have a suggestion which can solve both parts of the 
problem? I suppose that a "good" sysadmin should install some sort of 
log monitoring software which will alert them to issues, whether that is 
via some desktop application/popup or email or something else. The 
problem is that most of these issues come from "home" users who will 
never setup anything like "log file monitoring" or raid scrubs, or 
anything else, so if we do decide upon a generic solution that will work 
for almost everybody, then we will still need to rely on the distro 
maintainers to implement the solution.

PS, I suppose this is one of the "hide the gory details that nobody 
understands" balancing with "provide the information to the user so they 
can do something about it". One more generic consideration would be to 
have the kernel identify which messages are purely informational/debug 
and which are errors. Normal syslog has support for many different 
levels, but AFAIK, all kernel messages end up in the same basket.

e.g., plugging in and removing a USB drive generated the following log
entries as seen from "dmesg":
[614977.802828] usb 3-3: new high-speed USB device number 5 using xhci_hcd
[614977.822724] usb 3-3: New USB device found, idVendor=0951, idProduct=1665
[614977.822729] usb 3-3: New USB device strings: Mfr=1, Product=2, 
SerialNumber=3
[614977.822732] usb 3-3: Product: DataTraveler 2.0
[614977.822735] usb 3-3: Manufacturer: Kingston
[614977.822737] usb 3-3: SerialNumber: 60A44C413CCBFE40AB4FFB3E
[614977.822899] usb 3-3: ep 0x81 - rounding interval to 128 microframes, 
ep desc says 255 microframes
[614977.822905] usb 3-3: ep 0x2 - rounding interval to 128 microframes, 
ep desc says 255 microframes
[614977.836547] usb-storage 3-3:1.0: USB Mass Storage device detected
[614977.836734] scsi6 : usb-storage 3-3:1.0
[614977.836819] usbcore: registered new interface driver usb-storage
[614978.854080] scsi 6:0:0:0: Direct-Access     Kingston DataTraveler 
2.0 1.00 PQ: 0 ANSI: 4
[614978.854493] sd 6:0:0:0: Attached scsi generic sg2 type 0
[614978.854658] sd 6:0:0:0: [sdb] 15131636 512-byte logical blocks: 
(7.74 GB/7.21 GiB)
[614978.854884] sd 6:0:0:0: [sdb] Write Protect is off
[614978.854888] sd 6:0:0:0: [sdb] Mode Sense: 45 00 00 00
[614978.855085] sd 6:0:0:0: [sdb] Write cache: disabled, read cache: 
enabled, doesn't support DPO or FUA
[614978.860015]  sdb: sdb1
[614978.860864] sd 6:0:0:0: [sdb] Attached SCSI removable disk
[614979.061474] FAT-fs (sdb1): Volume was not properly unmounted. Some 
data may be corrupt. Please run fsck.
[615347.862058] usb 3-3: reset high-speed USB device number 5 using xhci_hcd
[615347.862111] usb 3-3: Device not responding to set address.
[615348.065856] usb 3-3: Device not responding to set address.
[615348.269944] usb 3-3: device not accepting address 5, error -71
[615348.326429] usb 3-3: USB disconnect, device number 5
[615348.334730] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called 
with disabled ep ffff88011b1b2600
[615348.334744] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called 
with disabled ep ffff88011b1b2640


Of the above, I would suggest most of that is "info" while the following 
lines might be warnings:
[614979.061474] FAT-fs (sdb1): Volume was not properly unmounted. Some 
data may be corrupt. Please run fsck.
These might be error or critical:
[615347.862058] usb 3-3: reset high-speed USB device number 5 using xhci_hcd
[615347.862111] usb 3-3: Device not responding to set address.
[615348.065856] usb 3-3: Device not responding to set address.
[615348.269944] usb 3-3: device not accepting address 5, error -71

Of course, this will rely on every driver maintainer to make a decision 
on just how important each line that they log may be.

Just my thoughts, hopefully it will be useful.

Regards,
Adam

-- 
Adam Goryachev Website Managers www.websitemanagers.com.au


* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
  2015-02-17 19:33                 ` Chris Murphy
  2015-02-17 22:47                   ` Adam Goryachev
@ 2015-02-17 23:33                   ` Chris
  1 sibling, 0 replies; 24+ messages in thread
From: Chris @ 2015-02-17 23:33 UTC (permalink / raw)
  To: linux-raid

Chris Murphy writes:

> 
> It's not just mdadm. It likewise affects Btrfs, ZFS, and LVM.

Do they have their own timeouts, or rely on the kernel?
Maybe the kernel could read the SCTERC value from the drives (in lieu of
some better retry timeout information) and set the controller timeout a little
greater than that, or very large if SCTERC is disabled/not available.


> sda1 and sdb1 are raid0, and sda2 and sdb2 are
> raid1. What's the proper configuration for SCT ERC and the SCSI
> command timer?

guessing...

For SCTERC-disabled drives:
A compromise may be to stay with the Linux default controller timeout (it's
30s) and set the drive's SCTERC below 30s (maybe 27s), to avoid losing
redundancy and risking data loss *AND* allow more of the available time for ERC.

For longer error correcting attempts (and just as long i/o controller
blocking!) the controller timeout could be set to 180s, and SCTERC to 175s?

BUT: If I chose to use a raid0 alongside a redundant raid, I already
explicitly decided to take all the data loss the hardware throws at me. So I
don't think it makes much of a difference if ERC times out after <30 secs or
180s; it's just more or fewer errors belonging to me.


For SCTERC enabled drives:
30s and 7s seems ok?
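
(For concreteness: smartctl takes the ERC limit in deciseconds, so a value
like the 27s from above would be set with something along the lines of

  smartctl -l scterc,270,270 /dev/sdX    # 27.0 s read/write ERC limit

assuming the drive accepts a limit that high.)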



 
> *shrug* I don't think the automatic udev configuration idea is fail
> safe. It sounds too easy for it to automatically cause a
> misconfiguration.


A matching timeout configuration prevents the unavoidable unrecoverable
read errors from taking down the redundancy and causing a high risk of data
loss during rebuild.

It does fix a misconfiguration; however, it could possibly set SCTERC just below
the (30s) controller timeout, to reduce the impact of SCTERC (e.g. make use of
the small chance of error correction succeeding a couple of seconds later),
given the longer SCTERC timeout does not lead to subsequent read error timeouts
piling up.


> And it also doesn't at all solve the problem that
> there's next to no error reporting to user space.

That is correct, but rather not related to the importance of fixing the timeout
mismatch and reducing the risk, is it? The settings do solve unnecessary loss
of redundancy on read errors that are sure to occur, unnecessary resyncing,
and the high risk of data loss during all that.





* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
  2015-02-17 22:47                   ` Adam Goryachev
@ 2015-02-18  1:02                     ` Chris Murphy
  2015-02-18 11:04                       ` Chris
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Murphy @ 2015-02-18  1:02 UTC (permalink / raw)
  To: linux-raid

On Tue, Feb 17, 2015 at 3:47 PM, Adam Goryachev
<mailinglists@websitemanagers.com.au> wrote:

> If we enable SCT ERC on every drive that supports it, and we are using the
> drive (only) in a RAID0/linear array then what is the downside?

Unnecessary data loss.


> As I
> understand it, the drive will no longer try for > 120sec to recover the data
> stored in the "bad" sector, and instead return an unreadable error message
> in a short amount of time (well below 30 seconds) which means the driver
> will be able to return a read error to the application (or FS or MD) and the
> system as a whole will carry on.

Not necessarily, it depends what's in that sector. If it's user data,
this means a sector (or possibly more) of data loss. If it's file
system metadata it means progressive file system corruption.

Configuring the drive to give up too soon is completely inappropriate
for single, raid0 or linear configurations.

Arguably the drive should have already recovered this data. If a
longer recovery can recover, then why isn't the drive writing the data
back to that sector so that next time it isn't so ambiguous that it
requires long recovery? I can't answer that question. In some cases
that appears to happen; in other cases it doesn't. But the followup is
that there really ought to be some way for user space to get access to
these kinds of errors rather than them accumulating until disaster
strikes.

The contra argument to that is, it's still cheaper to buy the proper
use case specified drive.


>If we didn't enable SCT ERC, then the
> entire drive would vanish, (because the timeout wasn't changed for the
> driver) and the current read and every future read/write will all fail, and
> the system will probably crash (well, depending on the application, FS
> layout, etc).

Umm no. If SCT ERC remains a high value or disable, while also
increasing the kernel command timer, the drive has a longer chance to
recover. That's the appropriate configuration for single, linear, and
raid0.


>
> So, IMHO, it seems that by default, every SCT ERC capable drive should have
> this enabled by default. As a part of error recovery (ie, crap that really
> important data stored on those few unreadable sectors) the user could
> manually disable SCT ERC and re-attempt to request the data from the drive
> (eg, during dd_rescue or similar).

If you do this for single, linear, or raid0 it will increase the
incident of data loss that would otherwise not occur if deep/long
recovery times were available.

Before changing these settings, there should be some better
understanding of what the manufacturer defined recovery times in the
real world actually are, and whether or not these long recoveries are
helpful. Presumably they'd say they are helpful, but I think we need
facts to contradict their position before second guessing the default
settings. And we have such facts to do exactly that when it comes to
raid1, 5, 6 with such drives which is why the recommendation is to
change SCT ERC if supported.



> Secondly, changing the timeout for those drives that don't support SCT ERC,
> again, it is fairly similar to above, we get the error from the drive before
> the timeout, except we will avoid the only possible downside above (failing
> to read a very unlikely but possible to read sector). Again, we will avoid
> dropping the entire drive, even if all operations on this drive will stop
> for a longer period of time, it is probably better than stopping
> permanently.

Not by default. You can't assume any drive hang is due to bad sectors
that merely need a longer recovery time. It could be some other error
condition, in which case doing a 120 or 180 second *by default* delay
means no error messages at all for upwards of 3 minutes.

And in any case the proper place to change the default kernel command
timer value is in the kernel, not with a udev rule.

I don't know if a udev rule can say "If the drive exclusively uses md,
lvm, btrfs, zfs raid1, 4+ or nested of those, and if the drive does
not support configurable SCT ERC, then change the kernel command timer
for those devices to ~120 seconds" then that might be a plausible
solution to use consumer drives the manufacturer rather explicitly
proscribes from use in raid...

But the contra argument to that is, why should anyone do this work for
(sorry) basically cheap users who don't want to buy the proper drive
for the specific use case? There are limited resources for this work.
And in fact the problem has a work around, if not a solution.

What we still don't have is something that reports any such problems
to user space.

-- 
Chris Murphy


* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
  2015-02-18  1:02                     ` Chris Murphy
@ 2015-02-18 11:04                       ` Chris
  2015-02-19  6:12                         ` Chris Murphy
  0 siblings, 1 reply; 24+ messages in thread
From: Chris @ 2015-02-18 11:04 UTC (permalink / raw)
  To: linux-raid

>

Hello all,

the discussion about SCTERC boils down to letting the drive attempt ERC a
little more or less. For any given disk, experience seems to tell that the
slight difference is that if ERC is allowed longer you may see the first
unrecoverable errors (UREs) just a little (maybe only a month) later.

UREs are inevitable. Thus, if I run a filesystem on just a single drive it
will get corrupted at some point, nothing to do about it.

Wait, except..., use a redundant raid! And here it makes a lot of a
difference that the drive's ERC actually terminates before the controller
timeout, to not lose all your redundancy again and be at high risk of UREs
showing up during the re-sync.

So for a proper comparison we need to look at the difference it makes in the
usage scenarios (error delay vs. losing redundant error resilience + URE
triggering), not at the single recoverable/unrecoverable error incidence. It
looks to me that it makes a lot of a difference to redundant raids and no
qualitative difference to single disk filesystems.

And we need to keep in mind that single disk filesystems do also depend on
the disk to stop grinding away with ERC attempts before the controller
timeout. Otherwise a disk reset may make the system clear buffers and lose
open files? Without prolonging the Linux default controller timeout, SCTERC
can prevent that where supported.



> in any case the proper place to change the default kernel command
> timer value is in the kernel, not with a udev rule.

Right. And as you write, increasing the controller timeout has clear downsides.

Noting as well: as long as the proposed script (a temporary safety measure)
maximizes the controller timeout to compensate for disks that don't support
SCTERC, this would even fix the timeout mismatch for single disk filesystems.
(Letting the controller wait until the disk finally succeeds or fails its
recovery attempts.)

So the proposed script actually provides a case that brings benefit for
raid0 setups as well (as long as the Linux default is not adaptive to the
disk parameters), but increasing the controller timeout in all cases would
introduce long and unreported i/o blocking into all redundant setups.


> I don't know if a udev rule can say "If the drive exclusively uses md,
> lvm, btrfs, zfs raid1, 4+ or nested of those, and if the drive does
> not support configurable SCT ERC, then change the kernel command timer
> for those devices to ~120 seconds" then that might be a plausible
> solution to use consumer drives the manufacturer rather explicitly
> proscribes from use in raid...

The script called by the udev rule could do that, but can be kept as simple
as proposed, and can set SCTERC regardless, because setting SCTERC below the
controller timeout makes a qualitative difference in running the redundant
arrays and a marginal difference in running non-redundant filesystems. (And
nevertheless, set a long controller timeout for devices that don't support SCTERC.)



After all, it looks like quite a simple change is appropriate:

In udev-md-raid-assembly.rules, below LABEL="md_inc" (only handling all md
supported devices) add one rule:

# fix timeouts for redundant raids, if possible
TEST=="/usr/sbin/smartctl", ENV{MD_LEVEL}=="raid[1-9]*",
RUN+="/usr/bin/mdadm-erc-timeout-fix"


And in a new /usr/bin/mdadm-erc-timeout-fix file implement:

  if smartctl -l scterc ${HDD_DEV} returns "Disabled" 
    /usr/sbin/smartctl -l scterc,70,70 ${HDD_DEV}
  else
    if smartctl -l scterc ${HDD_DEV} does not return "seconds"
      echo 180 >/sys/block/${HDD_DEV}/device/timeout


Regards,
Chris




* help with the little script (erc timeout fix)
  2015-02-16 23:49             ` What are mdadm maintainers to do? (was: desktop disk's error recovery timeouts) NeilBrown
  2015-02-17  7:52               ` What are mdadm maintainers to do? (error recovery redundancy/data loss) Chris
@ 2015-02-18 15:04               ` Chris
  2015-02-18 21:25                 ` NeilBrown
  1 sibling, 1 reply; 24+ messages in thread
From: Chris @ 2015-02-18 15:04 UTC (permalink / raw)
  To: linux-raid


Hello,

by adapting what I could find, I compiled the following short snippet now.

Could list members please look at this novice code and suggest a way to 
determine the containing disk device $HDD_DEV from the partition/disk,
before I dare to test this.



In udev-md-raid-assembly.rules, below LABEL="md_inc" (section only handling
all md supported devices) add:

# fix timeouts for redundant raids, if possible
IMPORT{program}="BINDIR/mdadm --examine --export $tempnode"
TEST=="/usr/sbin/smartctl", ENV{MD_LEVEL}=="raid[1-9]*",
RUN+="BINDIR/mdadm-erc-timeout-fix.sh $tempnode"

And in a new mdadm-erc-timeout-fix.sh file implement:

  #! /bin/sh

  HDD_DEV= $1 somehow stripping off the trailing numbers?

  if smartctl -l scterc ${HDD_DEV} | grep -q Disabled ; then
    /usr/sbin/smartctl -l scterc,70,70 ${HDD_DEV}
  else
    if ! smartctl -l scterc ${HDD_DEV} | grep -q seconds ; then
      echo 180 >/sys/block/${HDD_DEV}/device/timeout
    fi
  fi

Correct execution during boot would seem to require that distro
package managers hook smartctl and the script into the initramfs
generation.

Regards,
Chris



* Re: help with the little script (erc timeout fix)
  2015-02-18 15:04               ` help with the little script (erc timeout fix) Chris
@ 2015-02-18 21:25                 ` NeilBrown
  0 siblings, 0 replies; 24+ messages in thread
From: NeilBrown @ 2015-02-18 21:25 UTC (permalink / raw)
  To: Chris; +Cc: linux-raid


On Wed, 18 Feb 2015 15:04:53 +0000 (UTC) Chris <email.bug@arcor.de> wrote:

> 
> Hello,
> 
> by adapting what I could find, I compiled the following short snippet now.
> 
> Could list members please look at this novice code and suggest a way to 
> determine the containing disk device $HDD_DEV from the partition/disk,
> before I dare to test this.
> 
> 
> 
> In udev-md-raid-assembly.rules, below LABEL="md_inc" (section only handling
> all md supported devices) add:
> 
> # fix timeouts for redundant raids, if possible
> IMPORT{program}="BINDIR/mdadm --examine --export $tempnode"
> TEST=="/usr/sbin/smartctl", ENV{MD_LEVEL}=="raid[1-9]*",
> RUN+="BINDIR/mdadm-erc-timeout-fix.sh $tempnode"

It might make sense to have 2 rules, one for partitions and one for disks
(based on ENV{DEVTYPE}).  Then use $parent to get the device from the
partition, and  $devnode to get the device of the disk.
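
Roughly something like this (untested, reusing the names from your draft;
depending on the udev version, $parent may or may not need the /dev/ prefix
added):

  # one rule for whole disks ...
  ENV{DEVTYPE}=="disk", ENV{MD_LEVEL}=="raid[1-9]*", TEST=="/usr/sbin/smartctl", RUN+="BINDIR/mdadm-erc-timeout-fix.sh $devnode"
  # ... and one for partitions, passing the containing disk instead
  ENV{DEVTYPE}=="partition", ENV{MD_LEVEL}=="raid[1-9]*", TEST=="/usr/sbin/smartctl", RUN+="BINDIR/mdadm-erc-timeout-fix.sh /dev/$parent"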

> 
> And in a new mdadm-erc-timeout-fix.sh file implement:
> 
>   #! /bin/sh
> 
>   HDD_DEV= $1 somehow stripping off the trailing numbers?
> 
>   if smartctl -l scterc ${HDD_DEV} | grep -q Disabled ; then
>     /usr/sbin/smartctl -l scterc,70,70 ${HDD_DEV}
>   else
>     if ! smartctl -l scterc ${HDD_DEV} | grep -q seconds ; then
>       echo 180 >/sys/block/${HDD_DEV}/device/timeout
>     fi
>   fi

You should be consistent and use /usr/sbin/smartctl everywhere, or explicitly
set $PATH and just use smartctl  everywhere.

> 
> Correct execution during boot would seem to require that distro
> package managers hook smartctl and the script into the initramfs
> generation.
> 
> Regards,
> Chris

One problem with this approach is that it assumes circumstances don't change.
If you have a working RAID1, then limiting the timeout on both devices makes
sense.  If you have a degraded RAID1 with only one device left then you
really want the drive to try as hard as it can to get the data.

There is a "FAILFAST" mechanism in the kernel which allows the filesystem to
md etc to indicate that it wants accesses to "fail fast", which presumably
means to use a smaller timeout.
I would rather md used this flag where appropriate, and for the device to
respond to it by using suitable timeouts.

The problem is that FAILFAST isn't documented usefully and it is very hard to
figure out what exactly (if anything) it does.

But until that is resolved, a fix like this is probably a good idea.

NeilBrown



* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
  2015-02-18 11:04                       ` Chris
@ 2015-02-19  6:12                         ` Chris Murphy
  2015-02-20  5:12                           ` Roger Heflin
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Murphy @ 2015-02-19  6:12 UTC (permalink / raw)
  Cc: linux-raid

On Wed, Feb 18, 2015 at 4:04 AM, Chris <email.bug@arcor.de> wrote:
>>
>
> Hello all,
>
> the discussion about SCTERC boils down to letting the drive attempt ERC a
> little more or less. For any given disk, experience seems to tell that the
> slight difference is that if ERC is allowed longer you may see the first
> unrecoverable errors (UREs) just a little (maybe only a month) later.




>
> UREs are inevitable. Thus, if I run a filesystem on just a single drive it
> will get corrupted at some point, nothing to do about it.

On a single randomly selected drive, I disagree. In aggregate, that's
true: eventually it will happen, you just won't know which drive or
when it'll happen. I have a number of 5+ year old drives that have
never reported a URE. Meanwhile another drive has so many bad sectors
I only keep it around for abusive purposes.



>
> Wait, except..., use a redundant raid! And here it makes a lot of a
> difference that the drive's ERC actually terminates before the controller
> timeout, to not lose all your redundancy again and be at high risk of UREs
> showing up during the re-sync.
>
> So for a proper comparison we need to look at the difference it makes in the
> usage scenarios (error delay vs. losing redundant error resilience + URE
> triggering), not at the single recoverable/unrecoverable error incidence. It
> looks to me that it makes a lot of a difference to redundant raids and no
> qualitative difference to single disk filesystems.
>
> And we need to keep in mind that single disk filesystems do also depend on
> the disk to stop grinding away with ERC attempts before the controller
> timeout. Otherwise a disk reset may make the system clear buffers and lose
> open files? Without prolonging the Linux default controller timeout, SCTERC
> can prevent that where supported.

To get to one size fits all, where SCT ERC is disabled (consumer
drive), and the kernel command timer is increased accordingly, we
still need the delay reportable to user space. You can't have a by
default 2-3 minute showstopper without an explanation so that the user
can tune this back to 30 seconds or get rid of the drive or some other
mitigation. Otherwise this is a 2-3 minute silent failure. I know a
huge number of users who would assume this is a crash and force power
off the system.

The option where SCT ERC is configurable, you could also do this one
size fits all by setting this to say 50-70 deciseconds, and for read
failures to cause  recovery if raid1+ is used, or cause a read retry
if it's single, raid0, or linear. In other words, control the retries
in software for these drives.




>> I don't know if a udev rule can say "If the drive exclusively uses md,
>> lvm, btrfs, zfs raid1, 4+ or nested of those, and if the drive does
>> not support configurable SCT ERC, then change the kernel command timer
>> for those devices to ~120 seconds" then that might be a plausible
>> solution to use consumer drives the manufacturer rather explicitly
>> proscribes from use in raid...
>
> The script called by the udev rule could do that, but can be kept as simple
> as proposed, and can set SCTERC regardless, because setting SCTERC below the
> controller timeout makes a qualitative difference in running the redundant
> arrays and a marginal difference in running non-redundant filesystems. (And
> nevertheless, set a long controller timeout for devices that don't support SCTERC.)

I can't agree at all, lacking facts, that this change is marginal for
non-redundant configurations. I've seen no data on how common long
recovery incidents are, or how much more common data loss would be if
long recovery were prevented.

The mere fact they exist suggests they're necessary. It may very well
be that the ECC code or hardware used is so slow that it really does
take so unbelievably long (really 30 seconds is an eternity, and a
minute seems outrageous, and 2-3 minutes seems wholly ridiculous as in
worthy of brutal unrelenting ridicule); but that doesn't even matter
even if it is true, that's the behavior of the ECC whether we like it
or not, we can't just willy nilly turn these things off without
understanding the consequences. Just saying it's marginal doesn't make
it true.

So if SCT ERC is short, now you have to have a mitigation for the
possibly higher number of UREs this will result in, in the form of
kernel-instigated read retries on read failure. And in fact, this
assumption may be false: the retries the drive does internally might
be completely different from the kernel doing another read. The way
data is encoded on the drive these days bears no resemblance to
discrete 1s and 0s.

And you also need a reliable opt-out for SSDs. Their failures seem
rather different.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: What are mdadm maintainers to do? (error recovery redundancy/data loss)
  2015-02-19  6:12                         ` Chris Murphy
@ 2015-02-20  5:12                           ` Roger Heflin
  0 siblings, 0 replies; 24+ messages in thread
From: Roger Heflin @ 2015-02-20  5:12 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Linux RAID

On Thu, Feb 19, 2015 at 12:12 AM, Chris Murphy <lists@colorremedies.com> wrote:
> On Wed, Feb 18, 2015 at 4:04 AM, Chris <email.bug@arcor.de> wrote:
>>>
>>
>> Hello all,
>>
>
> On a single, randomly selected drive, I disagree. In aggregate, that's
> true; eventually it will happen, you just won't know which drive or
> when it'll happen. I have a number of 5+ year old drives that have
> never reported a URE. Meanwhile another drive has so many bad sectors
> I only keep it around for abusive purposes.

And I have seen the same. Not all will fail, even of a given type.

It also appears that, if one were really worried, running smartctl -t long often
(daily or weekly) can result in the disk finding and re-writing or relocating
the bad sector. I have a disk that started giving me trouble, and its bad
block count has risen a few times without an OS-level error during the
-t long test.
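
A minimal way to automate that, assuming smartmontools is installed (the
schedule and the device names below are only placeholders):

  # /etc/cron.d/smart-longtest (hypothetical file)
  # run a long SMART self-test on each array member every Sunday at 03:00
  0 3 * * 0  root  /usr/sbin/smartctl -t long /dev/sda
  0 3 * * 0  root  /usr/sbin/smartctl -t long /dev/sdb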

>
>
>

>
> To get to one-size-fits-all, where SCT ERC is disabled (consumer
> drive) and the kernel command timer is increased accordingly, we
> still need the delay to be reportable to user space. You can't have a
> 2-3 minute showstopper by default without an explanation, so that the
> user can tune this back to 30 seconds, get rid of the drive, or apply
> some other mitigation. Otherwise this is a 2-3 minute silent failure.
> I know a huge number of users who would assume this is a crash and
> force power off the system.
>
> Where SCT ERC is configurable, you could also do this
> one-size-fits-all by setting it to, say, 50-70 deciseconds, and have
> read failures cause recovery if raid1+ is used, or a read retry if
> it's single, raid0, or linear. In other words, control the retries
> in software for these drives.

This gets more interesting. From what I can tell with my drives (Reds and
Seagate video drives), some allow ERC to be set only to 7 seconds or higher,
and some allow lower values. I have been setting mine lower when the drive
allows it, since I have raid6 and expect to be able to get the data from the
other disks. This minimum of 7 vs. a lower minimum may be a further
distinction between the Green (none), Red (7), and Seagate VX (1.0 allowed).
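
A quick way to see what a given drive accepts (just a sketch; /dev/sdX is a
placeholder and the values are in deciseconds):

  # show the current SCT ERC read/write limits, if the drive supports them at all
  smartctl -l scterc /dev/sdX
  # try to set 1.0 second; drives with a 7-second floor may refuse or clamp this
  smartctl -l scterc,10,10 /dev/sdX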

Mine has video recordings... when the video pauses I count how long.
I almost always appear to see the full 7 seconds, so I suspect that if the
drive does not recover in a short time, it is unlikely to recover at all.
Given the data corruption issue without raid, the vendors may have the
thought that they cannot really do anything else but retry in the no-raid
case.
>
>
>

> I can't agree at all, lacking facts, that this change is marginal for
> non-redundant configurations. I've seen no data on how common long
> recovery incidents are, or on how much more common data loss would be
> if long recovery were prevented.
>
> The mere fact that they exist suggests they're necessary. It may very
> well be that the ECC code or hardware used is so slow that it really
> does take that unbelievably long (really, 30 seconds is an eternity, a
> minute seems outrageous, and 2-3 minutes seems wholly ridiculous, as
> in worthy of brutal unrelenting ridicule). But that doesn't matter
> even if it is true: that's the behavior of the ECC whether we like it
> or not, and we can't just willy-nilly turn these things off without
> understanding the consequences. Just saying it's marginal doesn't make
> it true.
>
> So if SCT ERC is short, now you have to have a mitigation for the
> possibly higher number of UREs this will result in, in the form of
> kernel-instigated read retries on read failure. And in fact, this
> assumption may be false: the retries the drive does internally might
> be completely different from the kernel doing another read. The way
> data is encoded on the drive these days bears no resemblance to
> discrete 1s and 0s.

Given that the drive likely has some ability to adjust the read levels for 0
and 1, I can see the disk's retries possibly playing some games like that to
try to get a better answer. It is worth noting that 7 seconds does mean around
70 retries of the read (the data comes under the head 70 times). I doubt the
ECC is so slow that it takes more than 10-20 ms to work through the more
extreme failures. So I am betting on the retries being what recovers the data.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: re-add POLICY
  2015-02-17 15:09     ` re-add POLICY Chris
@ 2015-02-22 13:23       ` Chris
  0 siblings, 0 replies; 24+ messages in thread
From: Chris @ 2015-02-22 13:23 UTC (permalink / raw)
  To: linux-raid


Hello,

I just noticed that I somehow overlooked that md3 and md7 on that old Ubuntu
system *did* have a write-intent bitmap.

So in my tests action="spare" does not seem to allow automatic re-sync of
arrays without a bitmap.
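
In case it helps anyone in the same situation: a write-intent bitmap can be
added to an existing, running array (a sketch, with /dev/mdX standing in for
the real array):

  mdadm --grow --bitmap=internal /dev/mdX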

To quote the man page again on "spare":
"if the device is bare it can become a spare if there is any array that it
is a candidate for based on domains and metadata."

I am frankly not sure I fully understand that. A bare device has no
superblock, so does mdadm only look at which array fits onto the device?
Since the partitions on the removed disk contain superblocks, they are not
bare; may that be why action=spare does not apply, so that an automatic
re-sync would either require a new action="re-sync" or have to be done by
"re-add" as well?

Regards,
Chris




^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-02-22 13:23 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-14 21:59 re-add POLICY Chris
2015-02-15 19:03 ` re-add POLICY: conflict detection? Chris
2015-02-16  3:28 ` re-add POLICY NeilBrown
2015-02-16 12:23   ` Chris
2015-02-16 13:17     ` Phil Turmel
2015-02-16 16:15       ` desktop disk's error recovery timouts (was: re-add POLICY) Chris
2015-02-16 17:19         ` desktop disk's error recovery timouts Phil Turmel
2015-02-16 17:48           ` What are mdadm maintainers to do? (was: desktop disk's error recovery timeouts) Chris
2015-02-16 19:44             ` What are mdadm maintainers to do? Phil Turmel
2015-02-16 23:49             ` What are mdadm maintainers to do? (was: desktop disk's error recovery timeouts) NeilBrown
2015-02-17  7:52               ` What are mdadm maintainers to do? (error recovery redundancy/data loss) Chris
2015-02-17  8:48                 ` Mikael Abrahamsson
2015-02-17 10:37                   ` Chris
2015-02-17 19:33                 ` Chris Murphy
2015-02-17 22:47                   ` Adam Goryachev
2015-02-18  1:02                     ` Chris Murphy
2015-02-18 11:04                       ` Chris
2015-02-19  6:12                         ` Chris Murphy
2015-02-20  5:12                           ` Roger Heflin
2015-02-17 23:33                   ` Chris
2015-02-18 15:04               ` help with the little script (erc timout fix) Chris
2015-02-18 21:25                 ` NeilBrown
2015-02-17 15:09     ` re-add POLICY Chris
2015-02-22 13:23       ` Chris
