From: Chris Murphy <lists@colorremedies.com>
To: Andrey Zhunev <a-j@a-j.ru>
Cc: Chris Murphy <lists@colorremedies.com>,
	xfs list <linux-xfs@vger.kernel.org>
Subject: Re: Need help to recover root filesystem after a power supply issue
Date: Wed, 10 Jul 2019 09:45:28 -0600
Message-ID: <CAJCQCtSpkAS086zSDCfB1jMQXZuacfE-SfyqQ2td4Ven4GwAzg@mail.gmail.com>
In-Reply-To: <1373677058.20190710182851@a-j.ru>

On Wed, Jul 10, 2019 at 9:29 AM Andrey Zhunev <a-j@a-j.ru> wrote:
>
> Well, this machine is always online (24/7, with a UPS backup power).
> Yesterday we found it switched OFF, without any signs of life. Trying
> to switch it on, the PSU made a humming noise and the machine didn't
> even try to start. So we replaced the PSU. After that, the machine
> powered on - but refused to boot... Something tells me these two
> failures are likely related...

Most likely the drive is dying, and the spin down from the power failure
and subsequent spin up have increased the rate of degradation, and
that's why they seem related.

What do you get for:

# smartctl -x /dev/sda



>
>
>
> # smartctl -l scterc /dev/sda
> smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-957.el7.x86_64] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
>
> SCT Error Recovery Control:
>            Read:     70 (7.0 seconds)
>           Write:     70 (7.0 seconds)

Good news. This can be raised by a ton and maybe you'll recover the
bad sectors. You need to do two things. You might have to iterate some
of this because I don't know what the max SCT ERC value is for this
make/model drive. Consumer drives can have really high values, upwards
of three minutes, which is ridiculous but off topic. I'd like to think
60 seconds would be enough and also below whatever cap the drive
firmware has. Also, I've had drive firmware crash when issuing
multiple SCT ERC changes - so if the drive starts doing new crazy
things, we're not going to know if it's a firmware bug or more likely
if the drive is continuing to degrade.

I would shoot for a 90 second SCT ERC for reads, and hopefully that's
long enough and also isn't above the max value for this make/model.

# smartctl -l scterc,900,100 /dev/sda

And next, raise the kernel's command timer into the stratosphere so
that it won't get mad and do a link reset if the drive takes a long
time to recover.

# echo 180 > /sys/block/sda/device/timeout
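
The two settings use different units: SCT ERC is given in deciseconds and the kernel command timer in seconds, and the kernel value needs to be the larger of the two. A minimal sketch of that check follows. The helper names and the DRY_RUN switch are hypothetical, not anything from smartmontools or the kernel, and by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: helper names and the DRY_RUN switch are hypothetical.

# Convert an SCT ERC value (deciseconds) to whole seconds, rounding up.
erc_to_seconds() {
    echo $(( ($1 + 9) / 10 ))
}

# Succeed only if the kernel command timer (seconds) is longer than the
# drive's ERC window (deciseconds), so the kernel does not give up first.
timeout_covers_erc() {
    erc_ds=$1
    kernel_s=$2
    [ "$kernel_s" -gt "$(erc_to_seconds "$erc_ds")" ]
}

# Print the commands by default; set DRY_RUN=0 to run them (as root).
apply_settings() {
    dev=$1; read_ds=$2; write_ds=$3; kernel_s=$4
    if ! timeout_covers_erc "$read_ds" "$kernel_s"; then
        echo "kernel timeout ${kernel_s}s does not exceed read ERC" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "smartctl -l scterc,${read_ds},${write_ds} /dev/${dev}"
        echo "echo ${kernel_s} > /sys/block/${dev}/device/timeout"
    else
        smartctl -l "scterc,${read_ds},${write_ds}" "/dev/${dev}" &&
            echo "$kernel_s" > "/sys/block/${dev}/device/timeout"
    fi
}

apply_settings sda 900 100 180
```

With the values above (900 deciseconds for reads, 180 seconds for the kernel) the check passes and the two commands are printed unchanged.
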

In this configuration, it's possible every single read command for a
(marginally) bad sector will take 90 seconds. So if you have a bunch
of these, an fsck might take hours. So that's not necessarily how I
would do it. Best to see the smartctl -x to have some idea how many
bad sectors there might be.
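
To put rough numbers on "might take hours", here is a small sketch of the worst case, assuming every read of a bad sector waits out the full 90 second ERC window exactly once. The helper name is hypothetical and the sector counts are made-up inputs, not values from this drive.

```shell
#!/bin/sh
# Sketch only: worst_case_seconds is a hypothetical helper, and the
# sector counts in the loop are illustrative, not measured.

# Upper bound on time spent waiting: one full ERC window (deciseconds)
# per bad sector, assuming each bad sector is read exactly once.
worst_case_seconds() {
    bad_sectors=$1
    erc_ds=$2
    echo $(( bad_sectors * erc_ds / 10 ))
}

for n in 2 100 1000; do
    s=$(worst_case_seconds "$n" 900)
    echo "$n bad sectors: up to ${s}s (about $(( s / 60 )) min)"
done
```

Retries by the filesystem or block layer would push the real figure higher, so treat this as a floor on the worst case rather than a prediction.
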



>
> #
>
> This is a WD RED series drive, WD30EFRX.

Yeah this is a NAS drive, and this low 70 decisecond value is meant
for RAID. It's a suboptimal value if you're using it for a boot drive.
But deal with that later after recovery.

> Jul 10 11:48:05 mgmt kernel: blk_update_request: I/O error, dev sda, sector 54439176

> Jul 10 11:59:03 mgmt kernel: blk_update_request: I/O error, dev sda, sector 176473048
> Jul 10 11:59:05 mgmt kernel: blk_update_request: I/O error, dev sda, sector 176473048

So at least two bad sectors and they aren't anywhere near each other.
The smartctl -x command might give us an idea how bad the drive is.
Anyway, these drives have decent warranties, but they're going to want
the drive returned to them. So if there's anything sensitive on it and
it's not encrypted, you'll want it still working long enough to wipe
it.
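
For a sense of how far apart the two reported sectors are, a quick sketch of the arithmetic, assuming the sector numbers in the kernel log are in 512-byte units (the helper names are hypothetical):

```shell
#!/bin/sh
# Sketch only: assumes the kernel log counts sectors in 512-byte units;
# helper names are hypothetical.

SECTOR_BYTES=512

# Byte offset on the device where a given sector starts.
sector_to_bytes() {
    echo $(( $1 * SECTOR_BYTES ))
}

# Distance between two sectors in whole GiB (rounded down).
sector_gap_gib() {
    a=$1; b=$2
    if [ "$a" -gt "$b" ]; then gap=$(( a - b )); else gap=$(( b - a )); fi
    echo $(( gap * SECTOR_BYTES / 1024 / 1024 / 1024 ))
}

echo "sector 54439176 starts at byte $(sector_to_bytes 54439176)"
echo "sector 176473048 starts at byte $(sector_to_bytes 176473048)"
echo "gap: about $(sector_gap_gib 54439176 176473048) GiB"
```

Under that assumption the two errors sit roughly 58 GiB apart on the 3 TB drive.
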



-- 
Chris Murphy
